Popular Available Data Sources and How Researchers Use Them: Exploring PubMed

“Life moves pretty fast.”

Ferris Bueller made that pronouncement popular back in 1986. But today, in the world of R&D-intensive sciences, data is what’s moving quickly. It is abundant, ever increasing, essential, and constantly updating all around us. Identifying information in a timely fashion is a critical need, and there are many curated resources out there. How are people using them, how could they be improved, and what can we learn from their similarities and differences? In this series, we’ll look at some popular databases and strategies for how to use them effectively.

What is PubMed?

PubMed, from the National Library of Medicine (NLM), is a well-known and beloved search engine for biomedical literature. PubMed has been available free to the public since 1997 and currently includes over 30 million citations, of which 20 million include abstracts and/or links to full text. PubMed receives about 2.5 million users per day.

The popularity of PubMed allows for the collection of vast amounts of usage data, and for the intelligent use of that data to improve the search experience.

Back in 2009, researchers from the National Center for Biotechnology Information (NCBI) analyzed the PubMed search logs. Here are some of their more interesting findings:

80% of retrievals resulted from PubMed searches; the other 20% were redirected from other websites (e.g., Google, Wikipedia).
The 3 most common types of searches were author name, gene/protein, and disease/condition, with author name searches being the most frequent.
PubMed users spend more time examining content and are more likely to interact with additional queries than visitors to more general web search engines.

Applying Machine Learning and Natural Language Processing Techniques to PubMed

To improve the search experience and the success of user sessions, PubMed set out to apply machine learning and natural language processing techniques using PubMed citations, abstracts, and aggregated search logs. For example, the system generates query suggestions by mining anonymized queries that appear with high frequency and that lead to successful retrieval. When this feature was first introduced to a small set of random PubMed users, the average click-through rate was 8%. Even better, there were on average 20% more clicks in the search results for sessions where suggested queries were used. Now the suggestions are provided in an autocomplete box displayed as the user types.
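The log-mining idea described above can be illustrated with a minimal sketch. This is not PubMed's actual implementation: the log format (a query string paired with a flag for whether the search led to a click), the min_count threshold, and the function names build_suggestions and suggest are all assumptions made for illustration.

```python
from collections import Counter

def build_suggestions(log, min_count=2):
    """Keep queries that led to a click and occur at least min_count times.

    `log` is a hypothetical list of (query, clicked) pairs.
    """
    counts = Counter(q.strip().lower() for q, clicked in log if clicked)
    return {q: n for q, n in counts.items() if n >= min_count}

def suggest(prefix, suggestions, limit=5):
    """Return stored queries starting with `prefix`, most frequent first."""
    prefix = prefix.strip().lower()
    matches = [(q, n) for q, n in suggestions.items() if q.startswith(prefix)]
    # Sort by descending frequency, then alphabetically for stable output.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:limit]]

# Invented sample log, for illustration only.
log = [
    ("breast cancer", True),
    ("breast cancer", True),
    ("Breast Cancer ", True),
    ("breast cancer screening", True),
    ("breast cancer screening", True),
    ("breast cancr", False),   # no click, so never suggested
    ("brca1", True),           # clicked, but too rare to suggest
]
suggestions = build_suggestions(log)
print(suggest("breast", suggestions))
```

Filtering on both frequency and a successful click keeps one-off queries and dead-end misspellings out of the suggestion list.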

Query Expansion

Query expansion is also used to add terms or replace common misspellings. Expansion is tightly controlled so that it increases recall without sacrificing precision. A common expansion is for abbreviations, which are heavily used in disease, gene, and technique queries. PubMed also normalizes British and American spellings (e.g., a search for “gastroesophageal reflux” also retrieves “gastro-oesophageal reflux”). Whereas general search engines use personalization to favor regional variations, assuming an American user wants American consumer health information, PubMed users want the most recent research regardless of whether it was reported by a British or American team.
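As a rough sketch of how dictionary-based expansion can work, the example below rewrites each query term into a group of alternatives joined by OR. The two lookup tables and the expand_query function are hypothetical and tiny; PubMed's real term mappings are far larger and are not reproduced here.

```python
# Hypothetical lookup tables, invented for illustration.
ABBREVIATIONS = {
    "gerd": ["gastroesophageal reflux disease"],
    "pcr": ["polymerase chain reaction"],
}
SPELLING_VARIANTS = {
    "gastroesophageal": ["gastro-oesophageal"],
    "tumor": ["tumour"],
}

def expand_query(query):
    """Expand each term with known abbreviations and spelling variants."""
    groups = []
    for term in query.lower().split():
        alternatives = (
            [term]
            + ABBREVIATIONS.get(term, [])
            + SPELLING_VARIANTS.get(term, [])
        )
        # Quote multi-word alternatives so they are treated as phrases.
        quoted = [f'"{a}"' if " " in a else a for a in alternatives]
        if len(quoted) > 1:
            groups.append("(" + " OR ".join(quoted) + ")")
        else:
            groups.append(quoted[0])
    return " AND ".join(groups)

print(expand_query("gastroesophageal reflux"))
print(expand_query("GERD treatment"))
```

Because alternatives are only added from a fixed table, recall grows for known variants while unrelated terms are left untouched, which is one simple way to limit the loss of precision.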

Relevance Ranking

PubMed has recently migrated to relevance-ranked (best-matching) result sorting for information searches. The ranking algorithm looks at factors such as weighted term frequencies, date recency, and number of citations. In addition, user activity is aggregated, so that the most-clicked articles increase in relevance. When a title is clicked, the PubMed detail view provides both Similar Articles and Cited By articles. Once users start browsing related articles, they continue on this path more than 40% of the time. Browsing additional related articles is more common than selecting another record from the result list or revising the query.
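To make the combination of ranking factors concrete, here is a minimal scoring sketch. It only illustrates the general idea of mixing term frequency, recency, citations, and clicks; the Article fields, the weights, and the functions score and rank are assumptions, not PubMed's actual formula.

```python
import math
from dataclasses import dataclass

@dataclass
class Article:
    title: str
    abstract: str
    year: int
    citations: int
    clicks: int

def score(article, query_terms, current_year=2020):
    """Combine the factors with invented weights; 0.0 means no term match."""
    title_words = article.title.lower().split()
    abstract_words = article.abstract.lower().split()
    # Weighted term frequency: a title hit counts three times an abstract hit.
    term_score = sum(
        3.0 * title_words.count(t) + 1.0 * abstract_words.count(t)
        for t in query_terms
    )
    if term_score == 0:
        return 0.0
    recency = 1.0 / (1 + max(0, current_year - article.year))
    # Logarithms damp very large citation and click counts.
    popularity = 0.5 * math.log1p(article.citations) + 0.5 * math.log1p(article.clicks)
    return term_score + 2.0 * recency + popularity

def rank(articles, query):
    """Return matching articles, best score first."""
    terms = query.lower().split()
    scored = [(score(a, terms), a) for a in articles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [a for s, a in scored if s > 0]

# Invented records, for illustration only.
articles = [
    Article("Stroke rehabilitation outcomes", "a cohort study", 2005, 10, 50),
    Article("Diabetes management", "insulin therapy", 2020, 100, 500),
    Article("Aspirin and stroke prevention", "aspirin reduces stroke risk", 2019, 10, 50),
]
for article in rank(articles, "aspirin stroke"):
    print(article.title)
```

In this sketch an article with no matching terms is dropped no matter how popular it is, so clicks and citations can only reorder results that already match the query.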

Author Name Disambiguation

Almost half of PubMed queries are for known items, either a particular citation or an author name. For these users, PubMed has an author name disambiguation feature that attempts to put the most likely matches at the top of the list. There are also algorithms optimized for parsing citations to increase the precision of these queries.
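One simple way to picture "most likely matches first" is to normalize names to a last name plus initials and then order candidates by how closely they match. The sketch below does only that. The candidate records, the use of paper counts as a tie-breaker, and the functions name_key and rank_authors are all hypothetical, and PubMed's disambiguation feature is considerably more sophisticated.

```python
def name_key(name):
    """Reduce 'Smith, John A' or 'Smith JA' to ('smith', 'ja').

    Assumes the comma-free form is 'Last Initials', as in PubMed citations.
    """
    name = name.strip()
    if "," in name:
        last, given = [part.strip() for part in name.split(",", 1)]
        initials = "".join(word[0] for word in given.split())
    else:
        parts = name.split()
        last, initials = parts[0], "".join(parts[1:])
    return last.lower(), initials.lower()

def rank_authors(query, candidates):
    """Order candidates: exact initials first, then partial, then by paper count."""
    q_last, q_init = name_key(query)
    scored = []
    for cand in candidates:
        c_last, c_init = name_key(cand["name"])
        if c_last != q_last:
            continue
        if c_init == q_init:
            match = 2  # same last name and same initials
        elif c_init.startswith(q_init) or q_init.startswith(c_init):
            match = 1  # initials are compatible but not identical
        else:
            continue
        scored.append((match, cand["papers"], cand["name"]))
    scored.sort(key=lambda item: (-item[0], -item[1], item[2]))
    return [name for _, _, name in scored]

# Invented author records, for illustration only.
candidates = [
    {"name": "Smith JA", "papers": 40},
    {"name": "Smith J", "papers": 12},
    {"name": "Smith, Jane", "papers": 90},
    {"name": "Smith RB", "papers": 300},
    {"name": "Smyth J", "papers": 5},
]
print(rank_authors("Smith J", candidates))
```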

PubMed, with its extensive curated collection of citations and related full text plus access to global user logs, is a phenomenal aggregated data set. In the future the possibilities for expanded mining of the free text, exploring relationships which identify synonyms or decrease ambiguity, will only increase the value of this resource.

Through this series, we have looked at a range of information needs. The goals and strategies of those involved in clinical trials, potential drug-drug interaction research, clinical care, or systematic reviews vary widely. There are multiple resources, both public and private, designed to provide intelligent access to the most timely and relevant data. Life is moving fast. Biomedical sciences are committed to identifying and refining tools and best practices to advance our knowledge, in the 21st century and beyond.

