The Vital Importance Of Data For AI: Interview With Rick McFarland, Chief Data Officer


There is no doubt about the importance of data in powering today’s most advanced applications, especially artificial intelligence and machine learning applications, which depend on good-quality, relevant data. Indeed, best-practice AI methodologies, the CPMAI methodology in particular, require an understanding of the data needed for an AI project and the preparation of that data for use.

Reinforcing the importance of the role of data in AI and advanced analytic systems, Rick McFarland, Chief Data Officer – LexisNexis Legal & Professional shares his insights at an upcoming Data for AI virtual event on August 5. In an interview with Forbes, he shares some perspectives on the role data takes for AI projects at his organization.

Q: What are some innovative ways you’re leveraging advanced data analytics to benefit LexisNexis?

Rick McFarland: As the Chief Data Officer at LexisNexis Legal & Professional, I think one of the innovative and important ways we leverage advanced data analytics is by increasing the utilization and the quality of our data. For example, most of our data is text-based (e.g., legal briefs, pleadings, motions, case law, dockets, and other documents). Dealing with this type of textual data in its raw state is challenging, complex, and nuanced. That’s why many data scientists and analysts steer clear of it. However, by transforming this data into embedding vectors using advanced NLP methods like BERT, we create an entirely new data asset that can be leveraged by data scientists and analysts across the organization to create models and features for our products. Think of the embedding vector transformation as a conversion of the text data into numerical data. For example, every word and sentence can be represented by a numerical vector; think of a row of numbers in a spreadsheet. These numerical vectors are much easier to work with mathematically and can be used to develop classification models, entity extraction models, question answering – the list goes on and on.
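To make the text-to-vector idea concrete, here is a minimal Python sketch. It does not use BERT or any LexisNexis system: the hashed word-count scheme, the 16-number vector length, and the function names `embed` and `cosine_similarity` are all assumptions chosen for illustration. A learned embedding captures meaning and context; this toy version only shows how text becomes a fixed-length "row of numbers" that can be compared mathematically.

```python
import hashlib
import math
import re

DIM = 16  # hypothetical vector length; learned embeddings use hundreds of dimensions


def embed(text, dim=DIM):
    """Turn text into a fixed-length, unit-length vector of hashed word counts.

    This is a toy stand-in for a learned embedding such as BERT.
    """
    vec = [0.0] * dim
    for word in re.findall(r"[a-z]+", text.lower()):
        # A stable hash assigns every word to one slot of the vector.
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # Normalize so that vectors of different-length texts are comparable.
    return [v / norm for v in vec] if norm else vec


def cosine_similarity(a, b):
    """Dot product of two vectors; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical example strings, not real documents.
first = embed("motion to dismiss the complaint")
second = embed("motion to compel discovery")
score = cosine_similarity(first, second)  # higher means more word overlap
```

Once every document is a vector like `first` or `second`, the same numbers can feed a classifier, a search index, or a clustering step without further text handling.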

Q: How do you identify which problem area(s) to start with for your data analytics and cognitive technology projects?

Rick McFarland: In the professional-grade AI space that we work in, e.g., Law and Medicine, we have a unique challenge that developers of consumer-grade AI don’t necessarily have. In our space, the educational distance between the developer and the customer is wide. For example, a lawyer (our customer) has gone to school for many years, including postgraduate work, to learn a very advanced and specific skill and, one would argue, has even mastered a new language (if you’ve ever read a legal document, you know what I am talking about!). Similarly, Data Scientists have spent a long time honing their skills, often have a PhD in computer science, and have learned a programming language or two. With these two professions at different ends of the educational spectrum, communication and business understanding between the two are often difficult. Because of this, the data science and development teams at LN partner very closely with our Product Team. Our Product Team, some of whom are lawyers, stays closely connected to our customers and performs frequent surveys and focus groups to understand their needs. They are also experts at working with developers and data scientists. These three job families (product, developer, and data scientist/analyst) make up our standard team. We rely on this close connection to keep us on top of the main problem areas of our clients.

Q: What are some of the unique opportunities you have when it comes to data and AI?

Rick McFarland: AI is only as good as the data used to train and feed it. And LexisNexis has one of the largest repositories of legal data in the world. But, as any data scientist will tell you, having raw data is only half the battle. What makes our data special is that, since the invention of the computer, we have had thousands of lawyers on staff enriching, summarizing, identifying entities, mapping citations, etc., on this massive corpus. Therefore, we also have the other half of the data scientists’ requirement: we have petabytes of training data. We have all the raw building materials for AI development. Our data scientists have almost unlimited opportunity to create AI products and features. We can take nearly any “AI idea” from concept to POC in a matter of weeks!

Q: Can you share some of the challenges when it comes to AI and ML adoption?

Rick McFarland: In the professional world (e.g., Legal, Medical, and Scientific), the bar for AI and ML is very high, so the adoption rate is much lower than in the consumer market. In these professions, where people’s lives or freedoms are at stake, being wrong has significant consequences. For these professionals to rely on an AI application, it must perform well and be consistently accurate. With one wrong answer, trust in that tool is breached – recovery is long and perhaps not even possible. For LexisNexis to release any AI to the professional marketplace, it must meet that high bar.

We do something unique in the professional-grade space, something consumer-grade technology providers probably don’t do. Most AI developers are familiar with the “holdout” sample method to evaluate the quality of a model – the holdout data is randomly selected from the training data, hidden from the developers, not used as part of the training process, and used to evaluate the final model. In the professional-grade process, we also maintain another holdout data set, which we refer to as the Platinum Data. In our Q&A process, this is a set of questions that the current model answers correctly. In fact, these answers are rated as the “best” by our SMEs and customers. With every new model we release, we re-check the answers generated by our system on the Platinum questions. If the answers provided are not the same (or are worse), we don’t release the model into production. As you might guess, with each new release this Platinum Data grows. We are not only providing correct answers – we are ensuring that we are providing consistency, which is very important in the Legal industry where court cases can sometimes go on for months or even years.
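The Platinum Data check described above behaves like a regression gate, and a minimal Python sketch can show the shape of it. This is not LexisNexis code: the function names `changed_platinum_answers` and `can_release`, the dictionary layout, and the use of exact string equality to decide whether two answers are "the same" are all assumptions made for illustration. A real process would likely need expert judgment to compare answers.

```python
def changed_platinum_answers(candidate_answer, platinum):
    """Return the Platinum questions whose answer differs under the candidate model.

    candidate_answer: function mapping a question string to an answer string.
    platinum: dict mapping each question to its approved "best" answer.
    """
    changed = []
    for question, approved in platinum.items():
        # Assumption: "the same" means an exact string match.
        if candidate_answer(question) != approved:
            changed.append(question)
    return changed


def can_release(candidate_answer, platinum):
    """A candidate model passes only if every Platinum answer is unchanged."""
    return not changed_platinum_answers(candidate_answer, platinum)


# Placeholder questions and answers, purely hypothetical.
platinum = {"question 1": "answer 1", "question 2": "answer 2"}

# Two stand-in "models", each just a lookup table for this sketch.
consistent_model = {"question 1": "answer 1", "question 2": "answer 2"}.get
drifting_model = {"question 1": "answer 1", "question 2": "something else"}.get

release_consistent = can_release(consistent_model, platinum)  # True
release_drifting = can_release(drifting_model, platinum)      # False
```

Newly approved question and answer pairs would be added to `platinum` after each release, so the gate gets stricter over time, mirroring how the Platinum Data grows.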

Q: How do analytics, automation, and AI work together at LexisNexis?

Rick McFarland: At LexisNexis, how these three functions operate can be visualized as a three-way Venn Diagram. There are situations when Analytics can be applied independently, or in combination with AI, or with Automation, or with both. The same is true of each of the three. We have team leaders running each of these functions, each with deep skills in their area, and they also frequently work together on projects. For example, it’s not uncommon for the Automation team to partner with the AI team to develop a chatbot for our Customer Support team to help reduce the need for human support. Similarly, the Analytics team will work with the AI team to develop a cross-sell model that could then be deployed on an Analytics dashboard to help make recommendations to support our Sales Teams.

Q: How are you navigating privacy, trust, and security concerns around the use of your data?

Rick McFarland: Most of our data is from publicly available sources (e.g., case law and dockets), and we tap into more than 50,000 sources. In addition, we digitize and standardize it to make it easily readable and searchable. Attorneys use the data to serve their clients, and we view our role as making the Law accessible and readable. This supports our global vision of making the world a more just place by advancing the Rule of Law.

Q: What are you doing to develop a data-literate and AI-ready workforce?

Rick McFarland: One of the things we are especially proud of at LexisNexis is our Advanced Technical Curriculum. We learned early on when we started hiring Data Scientists that most are well trained at consumer-grade AI development. That’s what is taught in schools and training courses because there’s a lot of data and examples out there.  What Data Scientist hasn’t made at least one model from Twitter data today? So, we created the ATC as a sort of “grad school” for Data Scientists to learn how to develop professional-grade AI.  

This curriculum, as in most universities, has a syllabus of classes with levels ranging from 100-level to 400-level. Our 100-level classes are accessible to all, and we focus these on the platforms and tools that are essential to AI development. We have AWS and Azure training because we learned that most data scientists come through the door with different levels of knowledge of these core platforms, and we needed our people to learn how to use them while understanding the professional-grade requirements set by our clients. Our 200-level classes focus on domain-specific and core languages like Python. Our 300-level courses are designed to focus on AI methods. Since we primarily deal with Legal and Textual (non-structured) data, we offer classes in Text Processing, Machine Learning, Natural Language Processing (NLP), and contextualized embeddings (e.g., BERT). Our 400-level courses are where we pull it all together and focus on the application of these advanced AI building blocks to develop AI features and products. So, for example, these courses teach how to build professional-grade chatbots, recommender systems, learning-to-rank models, etc. Once our internal students have completed their Advanced Tech Curriculum Education, they will earn their LexisNexis Professional-Grade Data Science Certification and be on their way to developing the next Legal AI tool.


Q: What AI technologies are you most looking forward to in the coming years?

Rick McFarland: What I am really looking forward to is the true Star Trek-like “voice intelligence” that can answer any question we ask. It will understand colloquialisms and idioms.  It will easily shift from answering generic, everyday questions to answering complex scientific and professional questions. It also will understand the context of questions – it would be able to distinguish between me talking about Coach “the person” or Coach the “brand.” The true leap will come when this voice-intelligence anticipates (or even predicts) what my ultimate end goal is based on my question and context and gives me some additional information that I may not have asked for to help me achieve my goal. That will be truly exciting.

Hear more from Rick McFarland at the upcoming Data for AI virtual event on August 5.