The Importance of Data for AI

Thomas Kuhn, an American physicist and self-described “practicing historian of science”, emphasized the critical role of “paradigms” in The Structure of Scientific Revolutions, his seminal work on the history of science. He characterizes paradigms as “universally recognized scientific achievements that for a time provide model problems and solutions to a community of practitioners.”

We are witnessing a major “paradigm shift” (in Kuhn’s sense) in what constitutes an artificial intelligence (AI) system and how such a system can be produced. The “deep learning” revolution started more than a decade ago (see ImageNet Classification with Deep Convolutional Neural Networks) and continues unabated. Major progress has been reported in application areas such as agriculture, biology, climate change, healthcare, scientific discovery, and many more.

Three major elements fostered the emergence of this paradigm and its continued growth. The first is readily available, cost-effective, and highly scalable computing infrastructure. The second is the set of major advances in algorithms for training deep neural networks, together with user-friendly, easily accessible machine learning libraries that made these algorithms available to everyone as solution components, whether for small experiments or for larger learning architectures in commercial use. The third is the availability of high-quality datasets.

It is important to realize the significant role of data in this AI revolution; after all, the revolution started with ImageNet, a large-scale hierarchical image database announced in 2009. The success of AI systems, measured in multiple dimensions (e.g., ability to generalize, accuracy, scope relevance), depends crucially on the quality of the data, not just their size. The data used to train an AI system are all that the system knows about the world, and they are what it learns from to perform its tasks. Moreover, the supply of these datasets should not be taken for granted. For example, a recent study raises the question of whether the stock of high-quality language data will be exhausted soon; its authors estimate that this is likely to happen before 2026. For low-quality language data and for image data, they project that exhaustion may happen between 2030 and 2050 and between 2030 and 2060, respectively.

This situation places the “data element” of the AI revolution at the heart of all developments in specific application areas. In fact, Andrew Ng believes that:

“… at AI’s current level of sophistication, the bottleneck for many applications is getting the right data to feed to the software. We’ve heard about the benefits of big data, but we now know that for many applications, it is more fruitful to focus on making sure we have good data”

Consequently, several data-related functions will be crucial for information managers in the years to come: knowing where to find the data most appropriate for one’s work (whether internally within one’s organization or externally from a third-party vendor or public source), being able to discover these data within the context of one’s business workflow, being able to assess the quality of a given dataset, and many others.