R&D organizations today make significant investments in scientific literature. Sophisticated knowledge management teams recognize the importance of research data across the enterprise and increasingly augment traditional journal subscription packages with data feeds and the appropriate rights to exploit them in a data pipeline using AI and machine learning techniques.

Once these feeds are ingested, project teams, internal data lakes, and applications put the data to uses ranging from early-phase R&D to competitive intelligence, M&A, and licensing, and on to post-market surveillance and pharmacovigilance.

But these groups often face challenges in getting the most return on the organization’s investment. Here are a few examples of challenges you may face when normalizing full-text XML data, with tips to overcome them:

Delivery method integration

A data provider may deliver materials via SFTP, API, AWS S3 bucket, or some other option, each of which requires proper scheduling of data transfer jobs. Ideally, these jobs need minimal manual intervention while still giving you oversight of data feeds that have missed scheduled deliveries or produced anomalies (such as an unusual volume of data). The farther upstream such anomalies can be noticed and acted on, the better.

Tip: Take a baseline of your data feeds and then regularly calculate variance against this baseline to detect potential underlying changes. Some of these changes may turn out to be perfectly explainable, such as a change in the ownership of a journal that results in its disappearance from a longstanding delivery. In other cases, the comparison may turn up discrepancies that are true mistakes, giving you the opportunity to rectify them.
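As a rough illustration of the baseline comparison, here is a minimal Python sketch. It assumes you already record a per-feed count of delivered records for each scheduled delivery; the function name `flag_anomalies`, the feed names, and the z-score threshold of 3 are hypothetical choices for this example, not part of any particular tool.

```python
from statistics import mean, stdev


def flag_anomalies(baseline_counts, latest_counts, z_threshold=3.0):
    """Flag feeds whose latest delivery deviates sharply from their baseline.

    baseline_counts: {feed_name: [record counts from past deliveries]}
    latest_counts:   {feed_name: record count from the newest delivery}
    Returns {feed_name: (latest_count, z_score)} for flagged feeds only.
    """
    flagged = {}
    for feed, history in baseline_counts.items():
        mu = mean(history)
        sigma = stdev(history) if len(history) > 1 else 0.0
        # A feed absent from the latest delivery is treated as zero records.
        latest = latest_counts.get(feed, 0)
        if sigma == 0:
            # No historical variation: any difference at all is unusual.
            z = float("inf") if latest != mu else 0.0
        else:
            z = abs(latest - mu) / sigma
        if z > z_threshold:
            flagged[feed] = (latest, z)
    return flagged


# Hypothetical example data: journal_b is missing from the latest delivery.
baseline = {
    "journal_a": [100, 104, 98, 102, 96],
    "journal_b": [50, 52, 48, 51, 49],
}
latest = {"journal_a": 101}
flagged = flag_anomalies(baseline, latest)
print(sorted(flagged))
```

A z-score is only one way to express variance against a baseline; a simple percentage change or a fixed minimum count may suit feeds with short or irregular delivery histories better.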

Data parsing

Across data providers, and even within a single provider of data, there can and will be format variations. Changes over time, and from imprint to imprint in the case of published journals that may have changed ownership, require attention at the parsing stage. In an example project from experience at CCC, ingesting full-text data across more than 50 STM publishers meant not only accounting for more than 10 stated XML formats (including NLM, JATS, and proprietary formats), but also addressing varying levels of adherence to these stated formats.

Tip: Discuss the potential for variation with your data provider(s) in advance, probing for differences across time and across the provider’s lines of business.
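One way to ground that conversation is to profile a sample of delivered files before writing any parsers. The Python sketch below tallies each document by its root element and, where present, a `dtd-version` attribute on that root (NLM and JATS articles commonly carry one, though that is an assumption to verify against your own feeds). The function name `profile_formats` and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def profile_formats(xml_documents):
    """Count documents by (root element name, dtd-version attribute).

    Documents that fail to parse are counted under ("unparseable", None),
    which gives a first signal of how closely a feed adheres to well-formed XML.
    """
    profile = Counter()
    for doc in xml_documents:
        try:
            root = ET.fromstring(doc)
        except ET.ParseError:
            profile[("unparseable", None)] += 1
            continue
        # Strip any "{namespace}" prefix ElementTree adds to the tag name.
        tag = root.tag.rsplit("}", 1)[-1]
        profile[(tag, root.get("dtd-version"))] += 1
    return profile


# Hypothetical sample: two article variants, one proprietary root, one broken file.
sample = [
    '<article dtd-version="1.2"><front/></article>',
    '<article dtd-version="3.0"/>',
    "<doc><head/></doc>",
    "<article>",
]
profile = profile_formats(sample)
for key, count in profile.most_common():
    print(key, count)
```

A profile like this does not validate a file against its stated format; it only shows how many distinct shapes are arriving, which helps decide where schema validation or provider-specific handling is worth the effort.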

Desired user experience

Someone, somewhere downstream in the data pipeline, will do something with these data. How do they intend to interact with the data, and does that require further processing to satisfy their needs?

Examples to consider: 

  • If the user needs to interact with tabular data or figures, are these provided in a consistent manner and properly extracted from the data?  
  • If the data feed(s) are to be merged with other aggregate feeds of data from repositories like MEDLINE/PubMed, how will the pipeline identify and manage duplicate records, ensuring the appropriate metadata survives for the end user’s needs?  
  • Does longer textual information need to be enriched or annotated using vocabularies or ontologies to support consistent search/discovery, analytics, or knowledge graph applications? 
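To make the duplicate-record question above concrete, here is a minimal Python sketch that merges feeds on a normalized DOI. It assumes each record is a dictionary with an optional `doi` key, and that earlier feeds take priority while later feeds only fill in missing fields; the field names, the function names, and the example records are all hypothetical. Real pipelines often need fallback matching (for example, on PMID or title) for records without a DOI.

```python
def normalize_doi(doi):
    """Lowercase a DOI and strip common resolver or 'doi:' prefixes."""
    doi = doi.strip().lower()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def _is_empty(value):
    return value is None or value == "" or value == []


def merge_feeds(*feeds):
    """Merge record lists by DOI; earlier feeds win, later feeds fill gaps.

    Records without a DOI cannot be matched here and are passed through as-is.
    """
    merged = {}
    no_doi = []
    for feed in feeds:
        for record in feed:
            doi = record.get("doi")
            if not doi:
                no_doi.append(dict(record))
                continue
            target = merged.setdefault(normalize_doi(doi), {})
            for field, value in record.items():
                # Keep the first non-empty value seen for each field.
                if _is_empty(target.get(field)) and not _is_empty(value):
                    target[field] = value
    return list(merged.values()) + no_doi


# Hypothetical records: a publisher full-text feed and an aggregate metadata feed.
publisher_feed = [{"doi": "10.1000/ABC", "title": "Full-text title", "body": "text"}]
aggregate_feed = [
    {"doi": "https://doi.org/10.1000/abc", "title": "Index title", "pmid": "123"},
    {"doi": "10.1000/xyz", "pmid": "456"},
]
records = merge_feeds(publisher_feed, aggregate_feed)
print(len(records))
```

The priority order is the design choice that decides which metadata survives: here the publisher's title is kept and the aggregate feed contributes only the identifier the publisher record lacked.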

Tip: Establish a clear set of requirements for data parsing, linking those to the business benefits your stakeholders expect downstream. With this set of requirements, you can then also prioritize work and set phases of scope. For example: tabular data and figures might be difficult to extract in a first phase, and could be set aside as you refine your approach. 

A single point of access for article content in normalized XML format  

Insights that can be found only in the full text of scientific articles undoubtedly enrich AI, machine learning, and data visualization projects. With RightFind XML, organizations can choose from a variety of flexible models to access normalized, full-text scientific literature in XML format, licensed for commercial text and data mining.  

Author: Mike Iarrobino

Mike Iarrobino is CCC's product manager for the content and rights workflow solutions RightFind® XML for Mining and RightFind Music. He has previously managed marketing technology and content discovery products at FreshAddress, Inc., and HCPro, Inc. He speaks at webinars and conferences on the topics of content discovery and data management, and loves to get into conversations about the nature of free will.