Enterprise data science is a multidisciplinary approach to data analysis that transforms large volumes and varieties of complex data into hypothesis-generating knowledge representations such as graphs. Of particular interest are strategies to extract and integrate relevant features and relationships from unstructured, unstandardized bulk data, such as draft and published texts.

In the life sciences, this requires integrating expertise from several disciplines: molecular biology and organic chemistry, statistics and machine learning, and computer science and systems engineering.

Life science companies accumulate vast amounts of textual data: laboratory protocols and observations, experiment outcomes, grant and patent applications, conference abstracts, and pre- and post-publication research articles. Additionally, companies collect and curate structured textual data in the form of dictionaries and ontologies. These data vary widely in quality and are often stored in different formats across multiple systems. Taking advantage of this information requires a robust, scalable, and adaptable pipeline of automatic and semi-automatic approaches that:

  • collect and pool unstructured as well as structured data;
  • ensure data coherency through validation and filtering, standardization, abstraction, and integration of the inputs;
  • transform and merge unstructured inputs into structured features and define relationships between them;
  • iteratively improve the process and its output through cross-validation and the addition of new data sources.
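
The steps above can be illustrated with a deliberately small Python sketch. It is not a description of any particular production system: the function names, the toy dictionary, and the sample texts are all hypothetical, and simple dictionary matching with co-occurrence counting stands in for the far more capable extraction methods a real pipeline would use.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical curated dictionary: surface form -> canonical term.
DICTIONARY = {
    "tp53": "TP53",
    "p53": "TP53",
    "aspirin": "aspirin",
    "acetylsalicylic acid": "aspirin",
    "cox-1": "PTGS1",
}

def collect(*sources):
    """Pool documents from several sources into one list."""
    return [doc for source in sources for doc in source]

def validate(docs):
    """Drop non-text and empty records, normalize whitespace, de-duplicate."""
    seen, clean = set(), []
    for doc in docs:
        if not isinstance(doc, str):
            continue
        text = " ".join(doc.split())
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            clean.append(text)
    return clean

def extract_features(text, dictionary=DICTIONARY):
    """Return the canonical terms whose surface forms occur in the text."""
    lowered = text.lower()
    found = set()
    for surface, canonical in dictionary.items():
        # Lookarounds keep "p53" from matching inside "tp53".
        if re.search(r"(?<!\w)" + re.escape(surface) + r"(?!\w)", lowered):
            found.add(canonical)
    return found

def build_graph(docs):
    """Count how often two canonical terms co-occur in the same document."""
    edges = Counter()
    for text in docs:
        terms = sorted(extract_features(text))
        for a, b in combinations(terms, 2):
            edges[(a, b)] += 1
    return edges

# Toy inputs standing in for two separate internal systems.
lab_notes = ["Aspirin inhibits COX-1.", "   ", None]
abstracts = ["Acetylsalicylic acid  and p53 signalling", "aspirin inhibits cox-1."]

docs = validate(collect(lab_notes, abstracts))
edges = build_graph(docs)
for (a, b), weight in sorted(edges.items()):
    print(a, "--", b, "weight", weight)
```

Each function corresponds to one bullet: pooling, coherency checks, and the transformation of free text into features and relationships. The last bullet, iterative improvement, would amount to passing additional sources to `collect` or extending the dictionary and rebuilding the graph.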

Enterprise data science pipelines require the application of advanced analytical methods to large datasets in real time, and thus rely on scalable computational platforms. Although recent advances in machine learning, cloud computing, and big data have made these approaches a reality, building such systems requires deep knowledge of the data sources and practical knowledge of machine learning algorithms. Much current work is directed at making enterprise data science less opaque and more practically available to users of all levels.


RightFind Navigate breaks down inefficiencies caused by information silos by bringing together scientific literature, global life science patents, preprints, information on clinical trials, drugs, and projects from publicly available data sources with licensed content and internal proprietary data in a single intuitive interface. Learn more here.

Author: Anna Lyubetskaya

Anna Lyubetskaya formerly served as CCC's data scientist.