What business leaders need to know about the new focus on “data-centric AI”
Beyond all the debate and concern about the potential and perils of artificial intelligence (AI), another urgent conversation is taking place in the machine-learning community: How to make AI more “data-centric.”
Unlike the AI advances to date, which have largely been driven by data scientists, programmers and systems engineers, the push to re-center AI around the quality of underlying data requires the involvement of every business leader hoping to leverage AI for productivity gains, risk mitigation, product development, business planning and other benefits.
Speaking at a recent MIT conference, AI pioneer Andrew Ng, founder of the Google Brain research lab and Coursera, called for a move toward “data-centric AI,” which he described as “the discipline of systematically engineering the data needed to build a successful AI system.”
The problem, Ng said, is that companies in industries ranging from health care and biotech to manufacturing are not ready to embrace the promise of AI technology because their datasets are not yet labeled, organized and consistent enough for machine-learning (ML) systems to be effective.
You might ask: “Isn’t all AI data-centric?” The possibly surprising, and unfortunate, answer is no. Most AI development today is “model-centric”: the goal is to produce the best model for a given dataset. Data-centric AI inverts that goal: to produce the best dataset for training a given ML model.
For much of its history, AI has been model-centric, focused on things like model architecture, tuning the “hyperparameters” that control the machine-learning process and improving techniques to train computers to recognize patterns. The data powering the models was viewed as a kind of static ground truth or a variable outside practitioners’ and researchers’ control.
Ng and others now argue the focus needs to shift to getting the smaller amounts of data that enterprises have into proper shape to make it truly useful for AI applications. That means companies that want to make the leap into AI must do what others have done: hire or contract with data scientists to work closely with their own subject-matter experts.
Data labeling, augmentation and “distribution drift”
In a traditional model-centric approach, the data scientist writes computer code that trains AI models based on underlying data, iterating until the model achieves the best results on separate test data. Data-centric AI shifts the focus to data quality, data augmentation and data in deployment. Taking a data-centric approach may involve iteratively finding and correcting label errors, imputing missing values in the most sensible ways possible and selecting the most informative examples to train the model on.
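As a minimal sketch of one of these steps, consider label-error detection. The dataset below is entirely hypothetical — support tickets labeled independently by two annotators — and the rule is deliberately simple: keep examples whose annotators agree, and flag disagreements for re-labeling before training.

```python
from collections import Counter

# Hypothetical mini-dataset: support tickets, each labeled by two annotators.
# Conflicting labels for the same example are a common source of label error.
tickets = [
    {"text": "card was charged twice", "labels": ["billing", "billing"]},
    {"text": "app crashes on login", "labels": ["bug", "billing"]},  # annotators disagree
    {"text": "how do I reset my password", "labels": ["account", "account"]},
]

def resolve_labels(example):
    """Keep examples whose annotators agree; flag the rest for review."""
    counts = Counter(example["labels"])
    label, votes = counts.most_common(1)[0]
    if votes == len(example["labels"]):
        return label  # unanimous: safe to train on
    return None       # conflicting: send back for re-labeling

clean = [(t["text"], resolve_labels(t)) for t in tickets]
flagged = [text for text, label in clean if label is None]
print(flagged)  # → ['app crashes on login']
```

In practice teams use more sophisticated techniques (majority vote across many annotators, or model-based confidence scores), but the principle is the same: surface inconsistent labels systematically rather than trusting the raw dataset.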
Properly labeling, or annotating, the data is critical for optimization. Too often the data used to train AI models is labeled in inconsistent or ambiguous ways, confusing AI systems. Mislabeling inflates the cost and lowers the value of AI deployments.
Data augmentation is another reason why clean, well-structured data is critical – especially for companies that lack large datasets. Data augmentation is a set of techniques that artificially increase the amount of data available to train models, by modifying existing examples or synthetically generating new ones. But beware: inaccuracies or other issues in the source data are replicated and amplified in the synthetic copies, inflating the cost and lowering the value of the model. Therefore, it is crucial to use data augmentation techniques in ways that improve the overall quality of data and enhance the accuracy of AI models.
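A toy sketch of what augmentation can look like for tabular data — the dataset, noise level and copy count here are all illustrative assumptions, not a recommended recipe:

```python
import random

random.seed(0)  # reproducible jitter for this illustration

# Hypothetical quality-control dataset: (temperature_C, pressure_kPa) -> outcome
samples = [((72.0, 101.3), "pass"), ((95.5, 99.8), "fail")]

def augment(features, label, n_copies=3, noise=0.01):
    """Create synthetic copies by adding small relative noise to each feature.
    The label is carried over unchanged -- valid only if perturbations this
    small cannot plausibly flip the outcome."""
    copies = []
    for _ in range(n_copies):
        jittered = tuple(x * (1 + random.uniform(-noise, noise)) for x in features)
        copies.append((jittered, label))
    return copies

augmented = []
for features, label in samples:
    augmented.append((features, label))  # keep the original example
    augmented.extend(augment(features, label))

print(len(augmented))  # 2 originals + 6 synthetic copies = 8
```

Note the warning from the paragraph above in miniature: if an original example is mislabeled, augmentation manufactures three more mislabeled copies of it.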
“Distribution drift” is a third challenge business leaders should understand as they prepare their companies’ data for machine learning. Models trained on historical datasets can become obsolete and produce inaccurate results. The problem of distribution drift is particularly prevalent in real-world applications, where data sources are often subject to change due to factors such as shifts in consumer behavior or changes in external conditions.
For example: If a company wants to model how much customer churn to anticipate in the coming year, it’s critical to understand the biases and limitations of the past churn data going into the model:
- Does the time period truly represent a typical business cycle, or were there one-time events that skew the data?
- How are you defining and measuring “churn”: Is it based on a change in usage of a product, a cancellation or a decline in repeat purchasing?
Understanding your business goals and success measures is a critical first step toward knowing what kind of data to include in a model.
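Making the definition measurable usually means writing it down as a rule. The snippet below assumes, purely for illustration, a purchase-based definition — no purchase in the last 90 days — with a made-up customer table; the window is exactly the kind of choice business stakeholders and data scientists should set together:

```python
from datetime import date, timedelta

# Illustrative operational definition: a customer has churned if they made
# no purchase in the last 90 days. The window is an assumption to be tuned.
CHURN_WINDOW = timedelta(days=90)

# Hypothetical data: each customer's most recent purchase date.
last_purchase = {
    "alice": date(2024, 6, 1),
    "bob": date(2023, 11, 15),
}

def is_churned(customer, as_of):
    """Apply the churn definition as of a given reference date."""
    return as_of - last_purchase[customer] > CHURN_WINDOW

as_of = date(2024, 7, 1)
churned = sorted(c for c in last_purchase if is_churned(c, as_of))
print(churned)  # → ['bob']
```

An explicit rule like this makes the dataset's labeling consistent and auditable — every "churned" label in the training data can be traced back to the same definition.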
Addressing distribution drift also requires continuous monitoring of the performance of machine learning models and adapting them to new conditions. Failure to do so can lead to costly errors and reduced trust in the accuracy of AI-powered systems.
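A simple illustration of such monitoring follows. This is a deliberately crude mean-shift check standing in for real drift statistics (e.g., a Kolmogorov–Smirnov test or population stability index), and the feature values and alarm threshold are assumptions:

```python
import statistics

def mean_shift_alarm(train_values, live_values, threshold=2.0):
    """Flag drift when the live feature mean moves more than `threshold`
    training standard deviations away from the training mean."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    shift = abs(statistics.mean(live_values) - mu) / sigma
    return shift > threshold

# Hypothetical feature: average order value at training time vs. in production.
train = [20.0, 22.0, 19.5, 21.0, 20.5]
live = [35.0, 36.5, 34.0, 37.0, 35.5]

print(mean_shift_alarm(train, live))  # → True: the live data has drifted
```

A check like this would run on a schedule against production data; when it fires, the model is retrained or the data pipeline investigated, rather than letting the stale model silently degrade.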
Putting high-quality data at the center of the enterprise
Taking a data-centric approach to AI is the path toward overcoming these limitations. It can also help organizations break down silos across teams. By focusing on data quality, stakeholders, including domain and subject-matter experts across the organization, can collaborate with data engineers, data scientists and machine learning engineers to ensure that models are trained on the best data possible. This collaboration also encourages organizations to develop comprehensive data governance strategies that ensure the quality of the data they collect and use.
With the right data management strategies in place, organizations can leverage the power of AI to drive value and create competitive advantages.
Ultimately, data-centric AI can help businesses unlock the potential of their data and create AI- and data-driven solutions that are more accurate, reliable and cost-effective. By focusing on data quality and data-centric AI methods, businesses can make AI-powered solutions a reality and take advantage of the massive opportunities in the rapidly evolving AI landscape.