Enterprise Data Science: What It Is and Why It Matters

The typical enterprise is inundated with data. From scientific literature to technical reports, from laboratory notebooks to news reports, from structured databases to domain-specific lexicons and ontologies, the volume of relevant data is increasing year over year. Much of this data is siloed and uncategorized, which makes it inaccessible. Even when it is accessible, it is often not indexed, which makes it hard to discover even for those authorized to use it. Curating this data requires a sophisticated approach, which can be costly, but the untapped business value that a proper program will expose is far greater. Regardless of industry, the ability to systematically exploit digital assets will prove to be the single most important differentiator between leaders and laggards over the next decade.

The Digital Transformation Journey Called “Enterprise Data Science”

For that reason, many organizations are exploring a digital transformation journey that I call enterprise data science. Enterprise data science goes beyond big data initiatives. It takes advantage of recent advances in machine learning algorithms and cloud computing infrastructure to extract all possible knowledge (i.e., actionable information) from the digital assets of an enterprise, and uses that knowledge as a driver for change and value creation throughout the organization. The term “enterprise” distinguishes this strategic approach from the field known as “data science”, which is currently limited to applying machine learning and statistics on a case-by-case basis rather than taking a holistic approach that aims to maximize the value of digital assets across the enterprise.

There are many benefits to this approach, ranging from operational efficiencies to identifying new opportunities. Enterprise data science can accelerate knowledge discovery and facilitate its diffusion across the enterprise. That, in turn, can lead to substantial value creation.

Almost ten years ago, Kirit Pandit and I wrote the book “Spend Analysis: The Window into Strategic Sourcing”. Therein, we described how a systematic approach to analyzing expenses can “cut millions of dollars in costs, not on a one-time basis, but annually” and can “keep raising the bar for the procurement and operations teams to squeeze out additional savings year-over-year.” The concept of enterprise data science is a generalization of that approach, accelerated and enhanced by the tremendous advances in cloud computing and machine learning over the past ten years.

Here is how enterprise data science and the approach referenced above relate. The main challenge in spend analysis is the consolidation of expense records into a single system in a way that one can discover “the truth, the whole truth, and nothing but the truth.” The truth, of course, being the precise amount of expenditure, for a given commodity, for a given supplier, for a given time, and so on. The solution to that problem is essentially the combination of the following four modules:

A data definition and loading (DDL) module
A data enrichment (DE) module
A knowledge base (KB) module
An analytics module

Many enterprises that pursue enterprise data science today face a supply chain problem. Instead of a supply chain of things, we now face a supply chain of information. Instead of a multitude of ERP systems and other sources, enterprises have a multitude of information sources. Instead of expense records that can be incomplete, inaccurate, or imprecise, we have data of many types that can suffer from the same undesirable characteristics. Instead of General Ledger or Commodity hierarchies that can differ between departments or territories, we may have other reference structures that differ depending on the nature of their source (e.g., domain-specific ontologies). From an information perspective, there is a level of abstraction at which the two problems are equivalent, and therefore similar solution strategies can be pursued.

For an enterprise data science program, your DDL module would be a cloud-based information acquisition module responsible for connecting to your data sources and loading your data.
Your DE module would be a general data processing framework that would allow you to analyze, synthesize, enrich, and even summarize your data. The term “enrichment” as used here goes beyond data augmentation, and includes elements of data quality such as structure, consistency, and derivative data reference schemes.
The KB module would hold the knowledge distilled from all your data: knowledge that has been integrated and evaluated, and that can be used for analysis and to fuel the creation of value in many areas across the enterprise.
The analytics module would be responsible for providing various interfaces between the work accomplished by the program and every other activity that the enterprise performs. The spectrum of interfaces that can be provided to consumers of knowledge ranges from raw data access to question-answering systems and data science notebooks.
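
To make the four-module idea concrete, here is a minimal sketch in Python that uses the spend analysis case as the running example. Everything in it is an assumption made for illustration: the Record fields, the function names (load, enrich, build_knowledge_base, spend_by_supplier), the alias table, and the in-memory dictionaries that stand in for real data sources. It is not the design from the book or from any particular product, only one simple way the modules could hand data to each other.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One expense line in a common shape (hypothetical fields)."""
    source: str
    supplier: str
    commodity: str
    amount: float


def load(sources):
    """DDL module: read rows from every source into one record type."""
    records = []
    for name, rows in sources.items():
        for row in rows:
            records.append(
                Record(name, row["supplier"], row["commodity"], float(row["amount"]))
            )
    return records


def enrich(records, aliases):
    """DE module: make supplier and commodity names consistent.

    `aliases` maps a lower-cased raw supplier name to a canonical name;
    names that are not in the table are kept, with whitespace trimmed.
    """
    enriched = []
    for r in records:
        raw = r.supplier.strip()
        supplier = aliases.get(raw.lower(), raw)
        commodity = r.commodity.strip().lower()
        enriched.append(Record(r.source, supplier, commodity, r.amount))
    return enriched


def build_knowledge_base(records):
    """KB module: consolidate spend per (supplier, commodity) pair."""
    kb = defaultdict(float)
    for r in records:
        kb[(r.supplier, r.commodity)] += r.amount
    return dict(kb)


def spend_by_supplier(kb):
    """Analytics module: one possible view over the knowledge base."""
    totals = defaultdict(float)
    for (supplier, _commodity), amount in kb.items():
        totals[supplier] += amount
    return dict(totals)


if __name__ == "__main__":
    # Two made-up "ERP systems" that spell the same supplier differently.
    sources = {
        "erp_a": [{"supplier": "ACME Corp.", "commodity": "Paper", "amount": "100.0"}],
        "erp_b": [{"supplier": "acme corporation", "commodity": "paper ", "amount": 50}],
    }
    aliases = {"acme corp.": "Acme", "acme corporation": "Acme"}
    kb = build_knowledge_base(enrich(load(sources), aliases))
    print(spend_by_supplier(kb))
```

The same chain applies to the general case if the expense rows are replaced by documents or other data types and the alias table by a richer reference structure such as an ontology. In this sketch the enrichment step is what lets two differently spelled supplier names add up to one consolidated total.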

Companies that embrace enterprise data science as a strategic approach to the sourcing of information will be handsomely rewarded and will lead their sectors.
