Whether working in business development, human resources, medical affairs, clinical affairs, or competitive intelligence teams, people need to identify researchers and key opinion leaders (KOLs) in different therapeutic areas. The following are some of the reasons for this need: 

  • to speak on the organization's behalf about its products at conferences and educational events
  • to collaborate and form research partnerships
  • to identify candidates for hiring
  • to endorse products 

One way to identify researchers is from large corpora of scientific literature. But without the right tools, this can be challenging.

Here are some of the challenges our clients come across when working with bibliographic data:

Challenge 1: Entity Resolution and Author Disambiguation

How can you easily know who is who and what is what? If you look at a dataset like LitCovid, or even more broadly, PubMed, the data does not provide clear ways to distinguish one author from another. According to our own research, under 3% of author records in PubMed provide a standard author identifier, such as ORCID. Even fewer records (<1%) provide any sort of standardized institutional ID.
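To make the problem concrete, here is a minimal sketch in Python. The records, field names, and the placeholder ORCID value are invented for illustration and do not reflect the PubMed schema or CCC's actual pipeline. It measures how many records carry an identifier and shows how a simple name key lumps different name spellings into one group of candidates that still has to be resolved with other evidence.

```python
from collections import defaultdict

# Hypothetical author records. Field names and values are assumptions
# made up for this example; the ORCID is a placeholder, not a real ID.
records = [
    {"name": "Wei Zhang", "affiliation": "Institute One", "orcid": "0000-0000-0000-0000"},
    {"name": "W. Zhang", "affiliation": "Institute Two", "orcid": None},
    {"name": "Zhang, Wei", "affiliation": "Institute One", "orcid": None},
    {"name": "Maria Lopez", "affiliation": "Institute Two", "orcid": None},
]


def identifier_coverage(records, field):
    """Return the fraction of records with a non-empty value for `field`."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get(field)) / len(records)


def name_key(name):
    """Reduce a name to 'lastname initial', e.g. 'Zhang, Wei' -> 'zhang w'."""
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
    else:
        parts = name.split()
        if not parts:
            return ""
        first, last = " ".join(parts[:-1]), parts[-1]
    return f"{last} {first[:1]}".lower().strip()


def candidate_groups(records):
    """Group records that share a name key. A group is only a set of
    candidates: its members may be one person or several."""
    groups = defaultdict(list)
    for record in records:
        groups[name_key(record["name"])].append(record)
    return dict(groups)


print(f"ORCID coverage: {identifier_coverage(records, 'orcid'):.0%}")
for key, members in candidate_groups(records).items():
    print(key, "->", [m["affiliation"] for m in members])
```

In this toy data, three records share the key "zhang w" but list two different affiliations, and only one carries an identifier. Names alone cannot say whether that is one researcher or several.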

Challenge 2: No Clear Entry Point

Large datasets that focus on a single domain can be daunting. For example, the National Library of Medicine's LitCovid corpus has added around 2,000 articles per week since May 2020. A dataset of this volume and growth often offers no clear point of entry. How does one even go about answering their own questions?

Challenge 3: Manual, Word-of-Mouth Processes

Getting answers from large datasets requires the ability to systematically explore the data and pursue lines of inquiry through various connections. One of the problems we hear from customers today is that their current effort to find researchers who fit certain research profiles is highly manual and often a matter of chance (word of mouth, people they know).

Challenge 4: Answers without Explanations

Advanced analytical techniques can help to provide answers, but sophisticated analytical solutions also risk presenting answers without explanations.

Where Knowledge Graphs Play a Role 

The challenges above gave CCC the impetus to take some experimental work we were doing in-house on data pipelines and build a COVID author graph. In April 2020, we released a prototype of a knowledge graph of authors who specialize in COVID and related fields of study. Today, it is the basis for CCC Expert View.

Knowledge graphs allow for the statistical calculation of accuracy and precision, so you can know how good, or how bad, your answers are. Graphs allow for systematic, nimble, and nuanced exploration of relationships and facilitate entry into the data from specific questions, like:

  • Who are the top authors in a given domain area?  
  • Who is working with whom?
  • Which topics are being researched at a specific institution? 
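
As a rough illustration of how questions like these map onto graph-style data, here is a small Python sketch. The papers, author names, institutions, and topics are invented, and the plain-dictionary approach is an assumption chosen for brevity. It is not how CCC Expert View is implemented.

```python
from collections import Counter

# Hypothetical publication records. Every name, institution, and topic
# here is made up for illustration.
papers = [
    {"topic": "vaccines",
     "authors": [("A. Rivera", "Institute One"), ("B. Chen", "Institute Two")]},
    {"topic": "vaccines",
     "authors": [("A. Rivera", "Institute One"), ("C. Okafor", "Institute One")]},
    {"topic": "diagnostics",
     "authors": [("B. Chen", "Institute Two"), ("C. Okafor", "Institute One")]},
    {"topic": "vaccines",
     "authors": [("A. Rivera", "Institute One")]},
]


def top_authors(papers, topic, n=3):
    """Who are the top authors in a topic, ranked by paper count?"""
    counts = Counter(
        name
        for paper in papers if paper["topic"] == topic
        for name, _institution in paper["authors"]
    )
    return counts.most_common(n)


def coauthors(papers, author):
    """Who is working with whom? Count shared papers per co-author."""
    counts = Counter()
    for paper in papers:
        names = [name for name, _institution in paper["authors"]]
        if author in names:
            counts.update(name for name in names if name != author)
    return dict(counts)


def topics_at(papers, institution):
    """Which topics are researched at an institution? Count each paper
    once if at least one of its authors is affiliated there."""
    counts = Counter()
    for paper in papers:
        if any(inst == institution for _name, inst in paper["authors"]):
            counts[paper["topic"]] += 1
    return dict(counts)


print(top_authors(papers, "vaccines", n=1))
print(coauthors(papers, "A. Rivera"))
print(topics_at(papers, "Institute One"))
```

Because every count is built from individual papers and authorships, each answer can be traced back to the records behind it.
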

To keep learning, check out:

  • “The Data Quality Imperative” from CTO Babis Marmanis. He discusses the impact data quality has on knowledge production, with examples from our experiences working with raw bibliographic metadata for the CCC COVID Author Graph.


Author: Stephen Howe

Stephen has spent his career working at the intersection of publishing, education, and technology, holding positions in sales, sales management, production, project management, digital publishing, digital editorial, and product management. Trained in the liberal arts tradition, Stephen holds a BA and MA in philosophy, an MBA in management, and a Masters in Analytics. Stephen currently works as the Senior Product Manager - Analytics at CCC and serves on the advisory board at Brandeis University for the Masters in Strategic Analytics program.