Enterprise Data Science: What It Is and Why It Matters

Written by Haralambos Marmanis, CTO & Vice President, Engineering and Product Development, CCC

The typical enterprise today is inundated with data. From scientific literature to technical reports, from laboratory notebooks to news reports, from structured databases to domain-specific lexicons and ontologies, the list is long, and the volume of data that’s considered relevant to businesses is increasing quickly year-over-year. Much of that data is siloed and uncategorized, making it inaccessible. Even when data is accessible, it is often not indexed, which makes it hard to discover even with the appropriate access authorization. Aggregating and organizing data is costly and typically involves multi-year initiatives that are interdepartmental in nature and require program-level coordination.

Given the challenges at hand, why is it important to undergo such an expensive and failure-prone transformation?

The ocean of digital assets available to the modern enterprise contains an enormous amount of untapped business value. In fact, the ability to systematically exploit these digital assets could prove to be the single most important differentiator between leaders and laggards over the next decade in every industry.


For that reason, many organizations are exploring a digital transformation journey that I call enterprise data science. Enterprise data science goes beyond big data initiatives: it takes advantage of recent advances in machine learning algorithms and cloud computing infrastructure to extract all possible knowledge (i.e., actionable information) from the digital information assets of an enterprise and to use that knowledge as a driver for change and value creation throughout the organization. The term “enterprise” distinguishes this strategic approach from the field known as “data science,” which is currently limited to applying machine learning and statistics on a case-by-case basis rather than taking a holistic approach that aims to maximize the value of digital assets across the enterprise.

There are many benefits to this approach, ranging from creating operational efficiencies to identifying new opportunities. Enterprise data science can accelerate knowledge discovery and facilitate its diffusion across the enterprise. That, in turn, can lead to substantial value creation.


Almost ten years ago, Kirit Pandit and I wrote the book “Spend Analysis: The Window into Strategic Sourcing”. Therein, we described how a systematic approach for analyzing expenses can enable a business to “cut millions of dollars in costs, not on a one-time basis, but annually — and to keep raising the bar for the procurement and operations teams to squeeze out additional savings year-over-year.” The concept of enterprise data science is a generalization of that approach, accelerated and enhanced by the tremendous advances that have taken place in cloud computing and machine learning over the past ten years.

Here is how enterprise data science and the approach referenced above relate. In spend analysis, the main challenge is the consolidation of records into a single system in a way that enables a user of that system to discover “the truth, the whole truth, and nothing but the truth.” The “truth” in that case is the precise amount of expenditure for a given commodity, for a given supplier, for a given time, and so on. The solution to that problem is essentially the combination of the following four modules:

1. A data definition and loading (DDL) module.

2. A data enrichment (DE) module.

3. A knowledge base (KB) module.

4. An analytics module.
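
To make the division of labor concrete, here is a minimal sketch of the four modules applied to a toy spend analysis. Everything in it is hypothetical and chosen for illustration only: the two source systems (ERP_A, ERP_B), their field names, the supplier aliases, the commodity mapping, and the function names load, enrich, and spend_by. It is not the implementation described in the book, only one way the pieces could fit together.

```python
from collections import defaultdict

# Hypothetical raw records from two source systems with different field names.
ERP_A = [
    {"vendor": "ACME Corp.", "item": "laptops", "amount": "1200.00"},
    {"vendor": "Globex", "item": "paper", "amount": "300.50"},
]
ERP_B = [
    {"supplier_name": "acme corporation", "desc": "Laptops", "total": 800.0},
]

# KB module: hypothetical reference data shared by the other modules.
KB = {
    "supplier_aliases": {
        "acme corp.": "ACME",
        "acme corporation": "ACME",
        "globex": "Globex",
    },
    "commodity_of_item": {"laptops": "IT Hardware", "paper": "Office Supplies"},
}


def load(erp_a, erp_b):
    """DDL module: map each source's fields onto one common record schema."""
    records = [
        {"supplier": r["vendor"], "item": r["item"], "amount": float(r["amount"])}
        for r in erp_a
    ]
    records += [
        {"supplier": r["supplier_name"], "item": r["desc"], "amount": float(r["total"])}
        for r in erp_b
    ]
    return records


def enrich(records, kb):
    """DE module: normalize supplier names and attach a commodity to each record."""
    enriched = []
    for r in records:
        supplier = kb["supplier_aliases"].get(r["supplier"].strip().lower(), r["supplier"])
        commodity = kb["commodity_of_item"].get(r["item"].strip().lower(), "Unclassified")
        enriched.append({**r, "supplier": supplier, "commodity": commodity})
    return enriched


def spend_by(records, *keys):
    """Analytics module: total spend grouped by the given record fields."""
    totals = defaultdict(float)
    for r in records:
        totals[tuple(r[k] for k in keys)] += r["amount"]
    return dict(totals)


records = enrich(load(ERP_A, ERP_B), KB)
report = spend_by(records, "supplier", "commodity")
```

In this sketch the two ACME records, which arrive under different names from different systems, are consolidated into a single supplier before any totals are computed. That consolidation step is what lets the analytics module report one figure per supplier and commodity rather than one per spelling.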

Many enterprises that pursue enterprise data science today face a supply chain problem. Instead of a supply chain of things, they face a supply chain of information. Instead of a multitude of ERP systems and other sources, they have a multitude of information sources. Instead of transaction records that can be incomplete, inaccurate, or imprecise, they have data of many types that can suffer from the same undesirable characteristics. Instead of General Ledger or Commodity hierarchies that can differ between departments or territories, they may have other reference structures that differ depending on the nature of their source (e.g., domain-specific ontologies).

From an information perspective, there is a level of abstraction where the problems are equivalent and therefore similar solution strategies can be pursued.

• For an enterprise data science program, your DDL module would be a cloud computing service appropriate for connecting to your data sources and loading your data.

• Your DE module would be a general data processing framework that allows you to enrich your data. The term “enrichment” is used here in the same way as in the book: beyond data augmentation, it includes elements of data quality such as structure, consistency, and derivative data reference schemes.

• The KB module would be the knowledge distilled from all your data, which can be used for analysis and can fuel the creation of value in many areas across the enterprise.

Companies that embrace enterprise data science as a strategic approach to the sourcing of information will be handsomely rewarded and become leaders in their sectors.