Natural Language Processing: Standing on the shoulders of giants

Many of us know the joys and sorrows of research. Weeks, months and years can pass in developing hypotheses, working in the lab or clinic, analyzing results, sometimes going back to square one, then writing the paper and, finally, seeing it published and in print. The intent is that your research is shared, discussed and re-used, so that others can build on it, “standing on the shoulders of giants,” as Isaac Newton famously said.

Traditionally, getting information out of written papers for re-use has been a manual process: individuals read, review and extract the key facts from tens or hundreds of papers by hand, in order to summarize the most up-to-date research in a field or understand the landscape of information around a particular research topic. Over the past few decades, AI tools such as Natural Language Processing (NLP) have evolved that can hugely speed up and improve this data extraction. NLP solutions enable researchers to access information from huge volumes of scientific abstracts and literature, by developing strategies and rules that either drill deep into the literature for hidden nuggets or, more broadly, plough the landscape for the desired information.

To give you a couple of examples, I’ll share two use cases, both published recently, that use the Linguamatics NLP platform across published literature, enabling researchers to benefit from years of previous research.

Eli Lilly use NLP to analyze relationships between publication volume and drug development success

At Eli Lilly, researchers wanted to understand the correlations between the level of scientific research for a particular drug and the success of that drug in the regulatory approval process. It’s well known that R&D productivity is a continuing challenge for pharma, and that pharma organizations are relying more and more on external sources of innovation such as biotech and academic research.  Liu and colleagues used NLP to get a broad landscape of information relating to drug approvals or failures from the top 13 pharma companies. Specifically, they assessed the publications prior to approval for 113 drugs, and prior to clinical failure for 39 drugs (in Phase 3) across a decade (2006-2016).

In order to gain a comprehensive picture, they used Linguamatics NLP-based text mining to search millions of MEDLINE abstracts for research papers published before the approval (or failure) of the selected small molecule (new molecular entity, NME) or biologic drugs. They benefited from the ability of Linguamatics NLP to provide rich ontologies and thus ensure good recall for the drug names, organizations and other data features (including synonyms, spelling mistakes, etc.), with normalization to one primary term for each concept. This rigorous systematic search provided the substrate for analysis: 6781 papers for the 113 approved drugs, and 524 for the 39 failed drugs.
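To make the idea of normalization concrete, here is a minimal Python sketch of mapping synonyms and simple misspellings onto one primary term. It is only an illustration of the general technique, not the Linguamatics implementation or the method used in the paper: the toy lexicon, the `normalize` function and the similarity cutoff are all assumptions made for this example, and a real ontology would be far larger and more carefully curated.

```python
import difflib

# Hypothetical toy lexicon (an assumption for illustration only):
# each preferred term maps to a list of known synonyms.
LEXICON = {
    "trastuzumab": ["herceptin"],
    "adalimumab": ["humira"],
}

# Flatten to a lookup from every known surface form to its preferred term.
SURFACE_TO_PREFERRED = {}
for preferred, synonyms in LEXICON.items():
    SURFACE_TO_PREFERRED[preferred] = preferred
    for synonym in synonyms:
        SURFACE_TO_PREFERRED[synonym] = preferred


def normalize(mention, cutoff=0.85):
    """Return the preferred term for a mention, or None if nothing matches."""
    key = mention.strip().lower()
    # Exact match on a known name or synonym.
    if key in SURFACE_TO_PREFERRED:
        return SURFACE_TO_PREFERRED[key]
    # Fall back to approximate string matching to catch simple misspellings.
    close = difflib.get_close_matches(
        key, list(SURFACE_TO_PREFERRED), n=1, cutoff=cutoff
    )
    return SURFACE_TO_PREFERRED[close[0]] if close else None
```

With this sketch, a brand name such as "Herceptin" and a misspelling such as "trastuzumb" both resolve to "trastuzumab", so papers using any of these forms would be counted against the same drug. The cutoff is a trade-off: lowering it catches more misspellings but risks merging distinct names.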

Their paper described a number of interesting findings:

  • The volume of research papers differs drastically between approval and failure: on average, 60 papers preceding approval vs. 13 prior to failure
  • Academic institutes contributed the majority of papers (79% for biologics, 76% for NMEs) compared to the volume from the top 13 pharma (10% and 13%, respectively)
  • Average time from first publication to approval was ~6.5 years, with no significant difference between biologics and NMEs
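The per-drug averages in the first finding can be checked against the paper totals quoted earlier. The short Python sketch below assumes the averages are simple means (total papers divided by number of drugs), which the text does not state explicitly; the variable names are chosen for this example.

```python
# Totals quoted in the text: papers found and drugs assessed in each group.
approved_papers, approved_drugs = 6781, 113
failed_papers, failed_drugs = 524, 39

# Assumption: "on average" means a simple mean of papers per drug.
mean_approved = approved_papers / approved_drugs
mean_failed = failed_papers / failed_drugs
ratio = mean_approved / mean_failed

print(f"approved: {mean_approved:.1f} papers per drug")  # approved: 60.0 papers per drug
print(f"failed:   {mean_failed:.1f} papers per drug")    # failed:   13.4 papers per drug
print(f"ratio:    {ratio:.1f}x")                         # ratio:    4.5x
```

Under that assumption the totals reproduce the reported figures of roughly 60 and 13 papers per drug, a difference of about four and a half times.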

Liu et al. conclude that increased investment in the external innovation environment, with complementary efforts from top pharma, academics, and biotech companies, is likely to manifest in a more effective discovery process and approval of innovative medications. And, taking a step back, it’s clear that text mining made the creation of the substrate for analysis in this study (the re-use of prior research) much more effective, and far quicker, than old-fashioned manual review and annotation.

Georgetown University Medical Center uses NLP for immediate patient care decisions

A second recent paper highlights the timeliness that using NLP-powered text mining can bring.  Hartmann and Van Keuren are using text mining to drill deep into literature for hidden nuggets that can inform key clinical decisions in real time on patient wards. Jonathan Hartmann is Director of Clinical Integration Services & Data Discovery at Dahlgren Memorial Library, Georgetown University Medical Center (GUMC), and has talked at Linguamatics seminars about the use of NLP for clinical support. The ability to text mine MEDLINE abstracts and full-text journal articles on patient rounds has allowed GUMC’s clinical informationists to quickly search large amounts of medical literature that would otherwise be unsearchable, thus allowing them to provide additional answers to physicians’ clinical questions.

In 2016 Jonathan gave a presentation on using Linguamatics NLP to improve patient care, describing the evolution of the technology and approach and detailing several real-life patient cases where rapid, effective text mining influenced clinical decisions and patient care. Examples he presented from the Pediatric Intensive Care Unit included:

  • A critically ill child with type 1 diabetes, autoimmune thyroiditis, and coeliac disease. The physician “wanted to know right there and then about the importance of the co-existence of these disorders”. A search with NLP found two relevant articles from Italy showing that type 1 diabetes is often linked to autoimmune disorders such as coeliac disease and thyroiditis, and the information helped the physician determine appropriate treatment.

Jonathan said that Linguamatics NLP enhances the range of information that can be accessed and provides information that wasn’t available before, especially when little research has been published on a topic, such as rare or complex cases with many variables.

NLP helps clinicians and researchers “stand on the shoulders”

These two very different papers demonstrate the value of using an artificial intelligence (AI) technology such as natural language processing to gain rapid, effective value from the treasure trove of research contained in scientific abstracts and full-text papers. Whether digging deep for very focused information, or extracting a broad set of knowledge from hundreds or thousands of papers, text mining can provide significant benefits: it enables a more efficient approach to finding key information, meaning that researchers spend less time on search and more time on understanding and action.


Author: Jane Reed

Jane Reed joined Linguamatics in March 2014, as the head of life science strategy. She is responsible for developing the strategic vision for Linguamatics’ growing product portfolio and business development in the life science domain. Jane has extensive experience in life sciences informatics. She worked for more than 15 years in vendor companies supplying data products, data integration and analysis and consultancy to pharma and biotech, with roles at Instem, BioWisdom, Incyte, and Hexagen. Before moving into the life science industry, Jane worked in academia with post-docs in genetics and genomics.