The Data Quality Imperative

In a previous blog post, entitled “Knowledge Graphs as Belief System Encapsulations”, I presented the view that “knowledge must necessarily be associated with a degree of confidence that expresses the strength of our conviction about the accuracy of the information.”

In this follow-up post, I want to discuss the impact of data quality on the production of knowledge. Our decision making is increasingly based on information that relies on data analytics, and these analytics, in turn, are the result of processing some “raw data” (I use the term to mark the data that one starts with when doing an analysis). Even when the analytics rely on derivative data, the points I make here remain valid, since any further processing has the potential to amplify existing data quality issues.

I will focus on the impact that the quality of raw data can have on knowledge products such as a knowledge graph. In particular, I will share our experience in building a specific knowledge graph — the CCC COVID Author Graph — and provide some details of the data quality issues that we encountered while working with the bibliographic (raw) metadata.

CCC COVID-19 Author Graph

Throughout 2020, as the COVID-19 pandemic spread, the global research community went into high gear to study the disease in the hope of containing its effects on those infected and eliminating its spread through vaccination. These efforts increased research output and strained the peer-review process, because the number of qualified researchers needed to review the new manuscripts doubled or tripled in a matter of weeks.

The rapid growth can be put in context if we look at the historical production of articles in that area. In 2018 there were 817 articles related to coronavirus and in 2019 there were 906, whereas in 2020 and 2021 there were 72,177 and 93,538, respectively. These numbers are the results of running a simple query for “coronavirus” in PubMed; the exact numbers will differ if you reformulate the query, but the point remains that there has been a two-orders-of-magnitude increase in the number of published articles. At the onset of the pandemic, the average production was about 2,000 articles per week, and it hovers slightly over 1,000 articles per week at the time of this writing. From the perspective of a publisher overseeing peer-reviewed journals, that is a tremendous number of new manuscripts to vet, edit, and publish.

In response to this need, CCC created an author graph to facilitate the discovery of people who are working in this field. We selected a set of published scientific articles in virology, with special attention to coronaviruses, and we used the associated bibliographic (raw) metadata to extract the authors, the titles of their articles, the journals in which they were published, the appropriate Medical Subject Headings (MeSH) terms, and their citations.

The idea was to build a graph of the above-mentioned information so that someone could explore it through a visualization tool, which is a great way to interact with the data and quickly identify qualified experts in any field of interest. We built a visualization tool so that everyone could benefit, since it doesn’t require any special skills or knowledge of a query language.

It is important to note that we created a knowledge graph from the published literature, not merely a graph. Every knowledge graph is a graph but not every graph is a knowledge graph. In the following, I elaborate on that distinction and focus on the aspects of data quality that have an impact on the knowledge that is captured in the graph.

Knowledge Graphs, Knowledge Systems, and Data Quality

When we are talking about a knowledge graph we are talking about the product or output of a knowledge system. A knowledge system takes raw data as its input and, through a series of transforms, produces “actionable information”, which is our definition of knowledge.

Suppose that we could quantify how much knowledge we could produce from a given set of data (S); let’s call that amount Kmax(S). A good measure of success for a knowledge system would be the ratio of the knowledge actually produced (let’s denote that with K(S)) over the maximum amount of knowledge that is theoretically possible to produce, i.e. the ratio K(S) / Kmax(S). That ratio can be used as a criterion for determining whether the data is “good enough”, since above a threshold value the effort expended for a given (percentage) “gain” in the ratio grows exponentially.

Intuitively, we also know that given a certain input (the raw data) the amount of work needed to produce a given amount of knowledge increases as the quality of input data decreases. The better our data are, the greater the amount of knowledge that can be produced for a given effort.

Managing knowledge is an iterative process that is based on continuous improvement. At a high-level, all knowledge systems implement some version of the Deming cycle (a.k.a. the PDCA cycle) because the nature of knowledge is dynamic and the preservation of quality (for its data) requires feedback loop mechanisms.

Let us consider the problem that we were trying to solve with the COVID author graph. In that context, knowledge (i.e., “actionable information”) is information that can support the decision of selecting a specific researcher as the most qualified candidate for conducting the peer review of a given manuscript; our prototype system did not address availability or other matters of logistics. One can define “most qualified” as one pleases, and that point is important because it implies that there isn’t one knowledge graph to rule them all, a fact that has much greater generality than our simple example may suggest. What matters most is that, no matter which definition one selects, the confidence in supporting the decision can be quantified.

Data Quality Issues

When building the COVID Author Graph, we used data sourced from MEDLINE, a bibliographic citation repository for articles published in medicine and the life sciences produced by the US National Library of Medicine. We selected 184,187 scientific articles from the MEDLINE corpus by searching for terms related to coronaviruses; if you are curious about the specifics of the query, we looked for the following terms: ‘upper respiratory infection’, ‘COVID-19’, ‘novel coronavirus’, ‘SARS-CoV-2’, ‘SARS’, ‘MERS-CoV’, and ‘severe acute respiratory syndrome’.

MEDLINE is a widely used source of data on over 30 million articles. However, there are several data quality issues in MEDLINE, which surface quickly when one attempts to create a knowledge graph from it. Let’s see what some of those issues are.

Issues concerning standard identifiers

The use of standard identifiers is an excellent way to ensure the correct identification of entities in the data. For example, since every article in MEDLINE has a PMID (that is an identifier produced by the creators of MEDLINE), we have a reliable way to correctly and repeatedly identify a journal article in MEDLINE. Alas, that is not the case with all identifiers.

To begin with, other standard identifiers were sparse in the data. Out of over 1.1 million author instances in the COVID dataset, around 10% had an ORCID attributed to the author, and a number of these ORCIDs were invalid. In some cases, all authors of the paper were assigned the same ORCID, which clearly cannot be true for a “unique identifier”. Moreover, it should be noted that fewer than 2% of authors have an ORCID across all of MEDLINE. ISNI and GRID identifiers were even less prevalent, with 2884 links to ISNI identifiers and 3276 links to GRID identifiers out of 112 million authors. ISNI and GRID each represented only roughly 700 institutions.

In short, unique identifiers are not prevalent in MEDLINE, and where they do exist they do not guarantee high-quality data.
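
The invalid and shared ORCIDs described above are the kind of problem that can be caught mechanically. The following is a minimal Python sketch, not the code behind the graph: it checks the ORCID check digit (ORCID uses the ISO 7064 MOD 11-2 scheme) and reports any ORCID attached to more than one author of the same paper. The function names and the list-of-strings input are assumptions made for illustration.

```python
import re
from collections import Counter


def orcid_checksum_ok(orcid: str) -> bool:
    """Return True if the ORCID has 16 characters and a valid check digit."""
    chars = orcid.replace("-", "").upper()
    # 15 digits followed by a check character that is a digit or 'X'.
    if not re.fullmatch(r"[0-9]{15}[0-9X]", chars):
        return False
    total = 0
    for ch in chars[:15]:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return chars[15] == expected


def shared_orcids(author_orcids):
    """Return the ORCIDs assigned to more than one author of a single paper.

    `author_orcids` holds one entry per author; None or '' means no ORCID.
    """
    counts = Counter(o for o in author_orcids if o)
    return {o for o, n in counts.items() if n > 1}
```

A passing checksum only shows that the identifier is well formed; it does not show that the ORCID belongs to the author it is attached to, which is why the second check is still needed.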

Issues concerning authors and affiliations

For an author graph, the author identity is an essential piece of information. We needed to extract the authors of the published articles from the bibliographic metadata reliably and with a measure of confidence as to whether the information about an author in the graph is accurate. Do the names “R S Baric”, “Ralph Baric”, “Ralph S Baric”, “Ralph A Baric”, and “Ralph Steven Baric” all refer to the same author?
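
To make the question concrete, here is a small Python sketch of one possible first-pass test: two names are treated as possibly the same author when the last names match and no given-name token contradicts another (an initial is compatible with any name it starts). This is an illustrative heuristic under simplifying assumptions (the last name is the final token, names are Western-ordered), not a description of how the graph disambiguates authors.

```python
def name_tokens(full_name):
    """Split 'Ralph S Baric' into (['ralph', 's'], 'baric')."""
    parts = full_name.replace(".", " ").lower().split()
    if not parts:
        return [], ""
    return parts[:-1], parts[-1]


def tokens_compatible(a, b):
    """Two given-name tokens agree if one is a prefix of the other."""
    return a.startswith(b) or b.startswith(a)


def possibly_same_author(name_a, name_b):
    """Return False only when the two names clearly contradict each other."""
    given_a, last_a = name_tokens(name_a)
    given_b, last_b = name_tokens(name_b)
    if last_a != last_b:
        return False
    # zip stops at the shorter list, so a missing middle name is not a conflict.
    return all(tokens_compatible(x, y) for x, y in zip(given_a, given_b))
```

Under this heuristic, “R S Baric”, “Ralph Baric”, “Ralph S Baric”, and “Ralph Steven Baric” are mutually compatible, while “Ralph A Baric” conflicts with every variant that carries an “S” middle initial. Compatibility is only a necessary condition: two different people can share a compatible name, so a result of True should feed a confidence score rather than an automatic merge.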

Since we are building a knowledge graph of authors, we have spent a lot of our time examining the quality of the author data that we receive. For example, if we look at the length of the author last name for the COVID dataset, we see the following distribution:

The minimum length is 1 character
The first quartile (25%) is 4 characters
The median is 6 characters
The third quartile (75%) is 8 characters
The maximum length is 322 characters

Apart from the maximum value of 322, these values fall into an expected range for an author’s last name. That maximum value, incidentally, is the result of a “Collective Name” or “Working Group Name” incorrectly placed in the last name field. More on that later. If we look at the whole of MEDLINE, the distribution is as follows:

The minimum length is 1 character
The first quartile (25%) is 6 characters
The median is 8 characters
The third quartile (75%) is 79 characters
The maximum length is 322 characters

At the third quartile and above, we can see the impact of Collective Names mixed in with the authors’ last names. Moreover, if we look at just the length of an author’s last name in 2020 for the COVID dataset, the number of names with a length of 1 character and those with a length >30 characters exploded. In just that metric, we can clearly see the impact (on the data quality) of rushing out so many COVID-19 publications in 2020.
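
A five-number summary like the ones above is cheap to compute and makes a useful recurring check. The sketch below uses Python’s standard `statistics` module; the function names are illustrative, the 30-character cut-off simply mirrors the threshold mentioned above, and the exact quartile values depend on the interpolation method chosen (here, `method="inclusive"`).

```python
import statistics


def length_summary(last_names):
    """Return min, quartiles, and max of the last-name lengths (needs >= 2 names)."""
    lengths = sorted(len(name) for name in last_names)
    q1, median, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "min": lengths[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": lengths[-1],
    }


def suspicious_lengths(last_names, max_len=30):
    """Return the names that are a single character or longer than max_len."""
    return [name for name in last_names if len(name) <= 1 or len(name) > max_len]
```

Tracking these two outputs per publication year is one way to surface a sudden change in quality, such as the 2020 jump described above.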

Another perspective that shows the impact of Collective Names is the distribution of the number of words within an author’s last name. The maximum in both the COVID dataset and the whole of MEDLINE is 41 words; the two diverge at the third quartile (75%), which is 1 word in the COVID dataset but 12 words in the whole MEDLINE dataset. In practice, the longest author names have about 5 words, whereas Collective / Working Group Names tend to start from 2 words.

Another challenge with Collective Names / Working Groups is that authors who are also part of working groups can end up being listed twice, or more times, in the author list. For example, “Frédéric Choulet” appears 12 times in the author metadata for PMID 30115783.
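
Both symptoms lend themselves to simple screening. The Python sketch below flags last-name values with more words than a genuine last name is likely to have (the default of 5 comes from the observation above; values between 2 and 5 words remain ambiguous and need other evidence) and counts names repeated within one record’s author list. The function names and the plain list-of-strings input are assumptions for illustration.

```python
from collections import Counter


def likely_collective(last_name, max_words=5):
    """Flag a last-name value with more words than real last names tend to have."""
    return len(last_name.split()) > max_words


def repeated_authors(author_names):
    """Return {normalized name: count} for names listed more than once in a record."""
    # Lowercase and collapse whitespace so trivial variants count as the same name.
    counts = Counter(" ".join(name.lower().split()) for name in author_names)
    return {name: n for name, n in counts.items() if n > 1}
```

A repeated name is a signal to review, not proof of duplication, since two distinct people on the same paper can share a name.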

Finally, data quality issues also appear with author affiliations. Author affiliations may be concatenated and duplicated across all the authors, rendering correct attribution impossible. When we do have an author affiliation, it is not always useful. The most common affiliation in MEDLINE is the institution-less ‘Department of Psychology.’ (including the period), which appears over 8000 times, while both ‘.’ and ‘,.’ make the top five.
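
A frequency count over the raw affiliation strings is enough to surface values like these. The Python sketch below is illustrative only (the function names are made up for this example): it lists the most common affiliation strings and marks those that contain no letters or digits at all.

```python
from collections import Counter


def has_content(affiliation):
    """Return True if the string contains at least one letter or digit."""
    return any(ch.isalnum() for ch in affiliation)


def affiliation_report(affiliations, top=5):
    """Return (text, count, has_content) for the most frequent affiliation strings."""
    counts = Counter(a.strip() for a in affiliations)
    return [(text, n, has_content(text)) for text, n in counts.most_common(top)]
```

This catches the punctuation-only values; an institution-less department name passes the check and needs a richer rule, for example matching against a list of known institutions.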

Issues concerning publication dates

Another area where we observe data quality issues is publication dates. The oldest publication year in all of MEDLINE is the year 0001, which is clearly wrong, while the next oldest publication year is 1041; Gutenberg was not alive yet, let alone the subject matter of the paper. It turns out that, in the latter example, the number 1041 is actually the page number for the article, misplaced in the date field! Months are meant to be a three-character string in MEDLINE but can also be numeric, and this leads to other problems. These types of issues with publication dates are not uncommon in our experience, nor are they unique to MEDLINE. In other datasets that we work with, we have seen publication dates far in the future (e.g. 2104) that are clearly digit transpositions of the correct publication date (e.g. 2014).
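
Range checks and a tolerant month parser catch most of the cases described above. The following Python sketch is a minimal illustration; the function names are hypothetical, and the `earliest=1800` default is an arbitrary assumption to be replaced with whatever lower bound suits the corpus.

```python
from datetime import date

# English three-letter abbreviations, as used for the textual month form.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def normalize_month(raw):
    """Return 1-12 for inputs like 'Mar', 'mar', '3' or '03'; None otherwise."""
    text = raw.strip().lower()
    if text.isascii() and text.isdigit():
        value = int(text)
        return value if 1 <= value <= 12 else None
    return MONTHS.get(text)


def plausible_year(year, earliest=1800, latest=None):
    """Return True if the year lies in [earliest, latest]; latest defaults to next year."""
    if latest is None:
        latest = date.today().year + 1
    return earliest <= year <= latest
```

With these defaults, the years 0001, 1041, and 2104 are all rejected, while a transposed but plausible year (say, 2041 entered as 2014) would still pass, so range checks complement rather than replace cross-checks against other fields.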

Schema-level issues

Processing issues can also arise from the design of the PubMed XML DTD. For instance, the 2019 DTD permits multiple ReferenceList elements. Consequently, we observed that a large set of records used one ReferenceList containing many Reference elements, whereas another set of records had many ReferenceList elements each containing one Reference. The schema also omits a key relationship, namely, the relationship between an Investigator and a Collective, which is referenced in the documentation:

“For records containing more than one collective/corporate group author, InvestigatorList does not indicate to which group author each personal name belongs.”

It then explains the NLM’s reasoning for repeating names:

“In this context, the names are entered in the order that they are published; the same name listed multiple times is repeated because NLM can not make assumptions as to whether those names are the same person.”
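
On the ReferenceList point above, one way to make downstream code indifferent to the two layouts is to ignore the grouping and collect every Reference element in document order. The Python sketch below uses the standard `xml.etree.ElementTree` module; only the ReferenceList and Reference element names come from the text above, while the `Record` wrapper and the bare-text references in the usage example are simplified, hypothetical stand-ins for real PubMed records.

```python
import xml.etree.ElementTree as ET


def extract_references(record_xml):
    """Return the text of every Reference element, however they are grouped."""
    root = ET.fromstring(record_xml)
    # iter() walks the whole tree, so one list of many references and
    # many lists of one reference each yield the same result.
    return ["".join(ref.itertext()).strip() for ref in root.iter("Reference")]


# Hypothetical, simplified records showing the two layouts.
ONE_LIST = (
    "<Record><ReferenceList>"
    "<Reference>Ref A</Reference><Reference>Ref B</Reference>"
    "</ReferenceList></Record>"
)
MANY_LISTS = (
    "<Record>"
    "<ReferenceList><Reference>Ref A</Reference></ReferenceList>"
    "<ReferenceList><Reference>Ref B</Reference></ReferenceList>"
    "</Record>"
)
```

Calling `extract_references` on either sample returns the same two references in the same order.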

Summary

When my team set out to build the COVID Author Graph, our aim was not to build just a graph of authors from the COVID-related scientific literature. Our aim was to build a knowledge graph of authors, in the sense that we wanted to quantify our confidence in the underlying quality of the data and sort out the problem cited above in NLM’s reasoning for repeated names. We wanted to disambiguate authors and their affiliations as well as we possibly could. As noted earlier, managing knowledge is an iterative process of continuous improvement, and preserving the quality of the data requires feedback loops. The outcome of such systematic data engineering can be a variety of data products, where the degree of confidence for every piece of information can be quantified and our knowledge can be extended through inference. The Bayesians amongst us can see where all this is going!

This post has emphasized that even the simplest data sets are fraught with problems that impede the production of knowledge. Thus, every knowledge system must define and measure data quality systematically. The better the quality of our data, the greater the amount of knowledge that can be produced for a given effort. My colleagues and I have published more details on the quality of MEDLINE data in a new paper you can read now.