New research suggests the submission rate to scholarly journals increased exponentially in the early months of 2020. With the amount of available information continuously growing, R&D-intensive companies increasingly turn to text and data mining of full-text scientific literature – both at scale and in the context of discrete projects – to extract information and power their knowledge supply chain. Naturally, these initiatives vary in scope and requirements from company to company, or even from project to project.

A knowledge manager developing a workflow for consuming full-text content should weigh several factors to arrive at one that suits their organization’s needs. Here are five.

1) End-to-End Workflow

As a knowledge manager, it is essential to understand your company’s expected end-to-end workflow for text mining full-text literature. It can be useful to map the anticipated inputs and outputs at each phase of the workflow, and to clarify expected timelines and business criticality. This applies both to any backend data processing pipeline and to the dependent end-user workflows. By looking at this workflow as one continuous stream, a knowledge manager can ensure that adjustments upstream do not break processes downstream.

2) Corpus Parameters

The parameters for defining a full-text corpus of scientific literature will vary depending on the organization’s end-to-end workflow. For example, the dimensions of a corpus used for text mining in specific projects – such as a pharmacovigilance workflow – will differ from those used within broader initiatives to process scientific information at scale, apply machine learning or artificial intelligence capabilities, or construct knowledge graph representations. In narrower use cases, specific queries may rely on keywords or subject-related metadata (such as Medical Subject Headings, or MeSH, and other indexing aids) that will pull relevant content based on the project specifications. The broader the use case, the less likely an organization is to be able to pre-filter for specific topics; in these cases, broader criteria such as publication date ranges or journal lists need to be applied instead. Based on the end-to-end workflow envisioned, knowledge managers can help their stakeholders by identifying key questions that will define the approach to creating a useful corpus, such as:

  • What are the desired outputs from processing full-text content?
  • Is there an ongoing need for new literature, or will a backfile of historical research suffice?
  • What is the expected outcome of the text and data mining effort?
  • What journals, timelines, and fields of research are most relevant to the project?
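Once these questions are answered, the resulting corpus parameters can be written down explicitly. The following is a minimal Python sketch, not a reference to any particular product or API: the `Article` and `CorpusSpec` classes, their field names, and the sample records are all hypothetical and serve only to illustrate how a narrow, topic-based definition differs from a broad, date-based one.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Set


@dataclass
class Article:
    """Hypothetical article record; the field names are illustrative only."""
    title: str
    journal: str
    published: date
    mesh_terms: Set[str] = field(default_factory=set)


@dataclass
class CorpusSpec:
    """Corpus parameters. A value of None means 'do not filter on this'."""
    journals: Optional[Set[str]] = None
    mesh_terms: Optional[Set[str]] = None
    published_from: Optional[date] = None
    published_to: Optional[date] = None

    def matches(self, article: Article) -> bool:
        if self.journals is not None and article.journal not in self.journals:
            return False
        # A topic filter matches when the article carries at least one listed term.
        if self.mesh_terms is not None and not (self.mesh_terms & article.mesh_terms):
            return False
        if self.published_from is not None and article.published < self.published_from:
            return False
        if self.published_to is not None and article.published > self.published_to:
            return False
        return True


def build_corpus(articles: List[Article], spec: CorpusSpec) -> List[Article]:
    """Return the articles that satisfy every parameter set on the spec."""
    return [a for a in articles if spec.matches(a)]


# Made-up sample records for illustration.
articles = [
    Article("Case report A", "Journal X", date(2020, 2, 10),
            {"Drug-Related Side Effects and Adverse Reactions"}),
    Article("Methods paper B", "Journal Y", date(2020, 3, 5), {"Algorithms"}),
    Article("Older review C", "Journal X", date(2015, 6, 1), {"Algorithms"}),
]

# Narrow, project-style corpus: pre-filtered by subject metadata.
narrow = CorpusSpec(mesh_terms={"Drug-Related Side Effects and Adverse Reactions"})

# Broad, at-scale corpus: no topic filter, only a publication window.
broad = CorpusSpec(published_from=date(2020, 1, 1), published_to=date(2020, 12, 31))

narrow_corpus = build_corpus(articles, narrow)
broad_corpus = build_corpus(articles, broad)
```

Keeping the parameters in one small structure makes it easier to review them with stakeholders and to see how a change in scope alters the resulting corpus.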

3) Volume

As described above, there are different approaches to defining a corpus of scientific literature. Those parameters will naturally affect the volume of content in question. Looking back at the two use cases mentioned above, the organization that is text mining for specific projects may consume only a few, dozens, or hundreds of articles at a given time for interrogation. Larger-scale initiatives, with broader corpus parameters, may result in the processing of hundreds of thousands, or even millions, of articles. Beyond the project’s needs at the present moment, it is also important to consider how the project will be maintained over time and what content it is likely to need in the future. Predicting the amount of content needed for text mining going forward is an important exercise, but it can also be a challenge. One method that helps with this estimation is to analyze current or backfile needs, then use that figure to forecast future needs. This calculation can help a knowledge manager choose the appropriate method for consuming the content and predict costs.
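The backfile-based estimate described above can be sketched in a few lines of Python. This is one simple way to do it, assuming that volume keeps growing at the average year-over-year rate seen in the backfile; the function names, the sample counts, and the per-article cost are all hypothetical, and a real forecast may call for a different model.

```python
from typing import Dict


def forecast_annual_volume(backfile_counts: Dict[int, int], years_ahead: int) -> Dict[int, int]:
    """Project yearly article counts from the backfile's average yearly growth."""
    if len(backfile_counts) < 2:
        raise ValueError("need at least two years of backfile counts")
    if any(count <= 0 for count in backfile_counts.values()):
        raise ValueError("backfile counts must be positive")

    years = sorted(backfile_counts)
    # Growth ratio between each pair of consecutive years in the backfile.
    ratios = [backfile_counts[b] / backfile_counts[a] for a, b in zip(years, years[1:])]
    growth = sum(ratios) / len(ratios)

    last_year = years[-1]
    volume = float(backfile_counts[last_year])
    forecast = {}
    for step in range(1, years_ahead + 1):
        volume *= growth
        forecast[last_year + step] = round(volume)
    return forecast


def estimate_cost(forecast: Dict[int, int], cost_per_article: float) -> float:
    """Multiply the projected article count by an assumed per-article cost."""
    return sum(forecast.values()) * cost_per_article


# Made-up backfile: number of in-scope articles per publication year.
backfile = {2018: 1000, 2019: 1100, 2020: 1210}
forecast = forecast_annual_volume(backfile, years_ahead=2)
projected_cost = estimate_cost(forecast, cost_per_article=0.5)  # hypothetical rate
```

Even a rough projection like this gives a starting point for comparing consumption methods and budgeting, and it can be refined as real usage data accumulates.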

4) Timeliness

For some organizations, timeliness is essential to supporting their text mining use case. As a knowledge manager, it’s important to understand how soon after publication stakeholders expect an article to yield outputs from a text and data mining processing pipeline. Delays can affect business commitments, service level agreements, and expectations. They may be introduced by the lag between an article’s publication and its presence in a data feed or its accessibility through a public API, by any batch or asynchronous processing rules, and so on.
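To make these delays concrete, it can help to add them up for a single article. The Python sketch below assumes a hypothetical pipeline with a fixed feed lag, one daily batch run at a set hour, and a fixed processing time; the function names and all of the figures are illustrative assumptions, not measurements from any real feed or API.

```python
from datetime import datetime, timedelta


def next_batch_start(available_at: datetime, batch_hour: int) -> datetime:
    """Return the first daily batch run at or after the article becomes available."""
    run = available_at.replace(hour=batch_hour, minute=0, second=0, microsecond=0)
    if run < available_at:
        run += timedelta(days=1)
    return run


def expected_output_time(published_at: datetime, feed_lag: timedelta,
                         batch_hour: int, processing_time: timedelta) -> datetime:
    """Estimate when pipeline outputs for an article should be ready."""
    available_at = published_at + feed_lag  # publication to presence in the feed
    return next_batch_start(available_at, batch_hour) + processing_time


def meets_sla(published_at: datetime, output_at: datetime, sla: timedelta) -> bool:
    """Check the total publication-to-output delay against an agreed limit."""
    return output_at - published_at <= sla


# Made-up figures: 2-day feed lag, nightly batch at 02:00, 3 hours of processing.
published = datetime(2020, 4, 1, 9, 0)
output_at = expected_output_time(
    published,
    feed_lag=timedelta(days=2),
    batch_hour=2,
    processing_time=timedelta(hours=3),
)
total_delay = output_at - published
within_three_days = meets_sla(published, output_at, sla=timedelta(days=3))
```

Here the article misses the batch run on the day it arrives in the feed and waits for the next one, which shows how a batch schedule can add most of a day on top of the feed lag.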

5) Licensing

Published scientific literature is a precious asset and an important investment for an organization. Based on the expected end-to-end workflow and corpus parameters, the knowledge manager should determine whether content needs can be satisfied through existing subscriptions and licenses, or whether these should be augmented through extended licensing or through transactional purchases. Any expected transactional steps should likewise be factored into the workflow, and their effect on timeliness requirements measured.


Considering these five factors will help a knowledge manager uncover the optimum workflow for their organization to consume full-text content. Text mining, for many, is a new and exciting technology that can drastically improve research and innovation within an organization. By understanding and analyzing these five considerations, the potentially daunting task of developing a new workflow to support text mining will become significantly more manageable.

Author: Garrett Dintaman

Garrett Dintaman is CCC’s Associate Product Manager and Product Owner for RightFind XML for Mining. He focuses on client needs and use cases related to text and data mining, data processing pipelines, and analytics. In his free time, Garrett enjoys following college basketball and spending time in nature.