WHAT RESEARCHERS MAY BE MISSING WHEN MINING ABSTRACTS
Text mining allows for the rapid review and analysis of large volumes of biomedical literature, giving life science companies valuable insights to drive R&D and inform business decisions. For example, the results of mining projects can provide a greater understanding of the biology underlying specific diseases and how those diseases respond to certain drugs, and can support the target discovery process.
Given the easy accessibility of article abstracts through such databases as PubMed, many researchers use this summary information to identify a collection of articles (or “corpus”) for use in text mining rather than taking steps to obtain the full text of the articles. While abstracts provide some valuable pieces of information, there are limitations in using abstracts (e.g., there is no “materials and methods” section) that can affect the quality of text mining results when compared to the results of mining a corpus of full-text content. Here are some of the key advantages to mining full-text articles versus abstracts.
MORE FACTS AND RELATIONSHIPS
Full-text articles include more named entities, and more relationships between those entities, than abstracts. A study published in the Journal of Biomedical Informatics found that full-text articles contain far more connections between biological entities than abstracts do: only 7.84% of the scientific claims made in the full-text articles were found in their abstracts.1
Identifying relationships is also a challenge when mining only abstracts. A relationship is a connection between at least two named entities; for example, the assertion that the gene BCL-2 is an independent predictor of breast cancer. Publisher Elsevier conducted a study that mined abstracts in PubMed for relevant information about drugs and proteins that affect the progression of fibromyalgia. Mining the abstracts yielded 31 relationships, and running the same search across the full-text articles yielded an additional 53.2
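To put the fibromyalgia figures in proportion, the short Python sketch below works out what share of all relationships was visible from abstracts alone. It assumes the 53 full-text relationships are entirely additional to the 31 found in abstracts, as the study summary states; the function name abstract_coverage is illustrative only and does not come from the study.

```python
def abstract_coverage(found_in_abstracts: int, additional_in_full_text: int):
    """Return (total relationships, share of that total found in abstracts).

    Assumes the full-text count is strictly additional to the abstract count,
    i.e. the two sets of relationships do not overlap.
    """
    total = found_in_abstracts + additional_in_full_text
    return total, found_in_abstracts / total


# Figures quoted in the text for the fibromyalgia study.
total, share = abstract_coverage(31, 53)
print(f"Total relationships: {total}")
print(f"Found by mining abstracts alone: {share:.1%}")
print(f"Found only in full text: {1 - share:.1%}")
```

Under that assumption, abstracts surfaced 31 of 84 relationships, or roughly 37%, so close to two thirds of the relationships in this example were reachable only through the full text.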
Mining full-text articles rather than abstracts gives researchers a more thorough understanding of the pathology of a disease, how it responds to certain drugs, and the interactions of the genes and proteins involved in that process, all of which helps them identify new drug targets.
ACCESS TO SECONDARY STUDY FINDINGS AND ADVERSE EVENT DATA
While authors often include their most important findings in the abstract, other discoveries and observations are frequently found only in the full-text article. Because abstracts are short and concise, they often exclude or underrepresent mutation data and results that are considered less relevant to, or outside the scope of, the main idea of the publication. There is also the problem of timeliness in reporting subsequent study findings in abstracts. Following the initial publication of a new discovery in a particular journal, the research is often repeated and published in other publications. But there is a significant delay between when those findings are published in the full article and when that information appears in an abstract. In fact, it can take one to two years for a study finding to get into the abstract of a subsequent article.3
Negative study results are also often missing from abstracts. According to a study published in BMC Medical Research Methodology, “abstracts published in high impact factor medical journals underreport harm even when the articles provide information in the main body of the article.”4 This missing information can reduce the value of abstracts as the “raw material” to mine, especially in pharmacovigilance use cases or when researchers want to make connections that haven’t yet been a major focus of the literature.
While text mining article abstracts provides some value, there are limitations as to what can be discovered through that process. Researchers need access to the full text of the articles to ensure they don’t miss vital data and undiscovered assertions that can lead to new discoveries.
1 Catherine Blake. “Beyond genes, proteins, and abstracts: Identifying scientific claims from full-text biomedical articles.” Journal of Biomedical Informatics Volume 43, Issue 2, April 2010, Pages 173–189
2 Elsevier (2015) Harnessing the Power of Content – Extracting value from scientific literature: the power of mining full-text articles for pathway analysis. Available at www.elsevier.com/__data/assets/pdf_file/0016/83005/R_D-Solutions_Harnessing-Power-of-Content_DIGITAL.pdf
4 Enrique Bernal-Delgado and Elliot S Fisher. “Abstracts in high profile journals often fail to report harm.” BMC Medical Research Methodology (2008); 8:14