The Data Quality Imperative: The Critical Importance of Data Quality

This is a summary of a pre-print article my team has written on this topic (with supporting data and references); it is accessible at Europe PMC.

In “Knowledge Graphs as Belief System Encapsulations,” I presented the view that “knowledge must necessarily be associated with a degree of confidence that expresses the strength of our conviction about the accuracy of the information.”

In this blog post, I will share our experience in building a knowledge graph of researchers who have published articles related to coronaviruses, and I will describe some of the data quality issues that we encountered while working with the bibliographic metadata.

Data Quality Issues Encountered in Building the CCC COVID-19 Author Graph

In order to facilitate the discovery of experts who are working in the COVID-19 field, we selected a set of published scientific articles in virology with special attention to coronaviruses and we used the associated bibliographic metadata to extract the authors, their affiliations, the titles of their articles, the journals in which they were published, the appropriate medical subject headings (MeSH) terms, and their citations.

The idea was to build a graph of the above-mentioned information so that someone could explore it through a visualization tool, since that is a great way to interact with the data and quickly identify qualified experts in any field of interest. We built such a visualization tool so that everyone could benefit from that knowledge; the tool does not require any special skills, and intuition suffices to explore the graph.

Knowledge Graphs, Knowledge Systems, and Data Quality

A knowledge graph is the product of a knowledge system. A knowledge system takes a set of data as its input and, through a series of data processing steps, extracts as much of the “actionable information” in the data as is possible or desirable; the term “actionable information” is our working definition of “knowledge.”

Suppose that we could quantify the maximum amount of knowledge that one could produce from a given set of data (S); let us call that amount Kmax(S). A good measure of success for a knowledge system would be the ratio of the knowledge actually produced (let us denote that with K(S)) over the maximum amount of knowledge that is theoretically possible to produce, i.e. the ratio K(S) / Kmax(S). That ratio can be used as a criterion for determining whether the data is “good enough” since, above a threshold value, the effort expended for a given (percentage) “gain” in the ratio will grow exponentially fast. This is a fancy way of saying that, for a given input, returns diminish: beyond a certain level of effort, each additional unit of effort yields less additional knowledge. Now, the catch is that we do not know that theoretical maximum value for any non-trivial input. Nevertheless, sensible (domain-specific) rules can be created to estimate that value.

Another empirical fact is that the amount of work needed to produce a given amount of knowledge increases as the quality of the input data decreases. The better our data are, the greater the amount of knowledge that can be produced for a given effort. This is important when you select the data that you want to work with. Sometimes you may not have a choice, but at other times you may be selecting the data that you will use from a list of vendors. The higher the quality of your input data, the less work you will have to do for a given amount of knowledge that you want to produce.

With these general remarks in mind, let us consider the problem that we were trying to solve with the COVID-19 author graph. In that context, “knowledge” is the information that can support the decision of selecting a specific researcher as the most qualified candidate for conducting the peer-review of a given manuscript; our prototype system did not address the issue of availability or any other matters of logistics. One can define what “most qualified” means as he or she pleases, and that point is important since it automatically implies that there is not one knowledge graph ‘to rule them all.’ What is most important here is that no matter what definition one selects, the confidence in supporting the decision can be quantified.

Data Quality Issues

When building the COVID Author Graph, we used data sourced from MEDLINE, a bibliographic citation repository for articles published in medicine and the life sciences, produced by the US National Library of Medicine. We initially selected 184,187 scientific articles from the MEDLINE corpus by searching for terms related to coronaviruses; if you are curious about the specifics of the query, we looked for the following terms: ‘upper respiratory infection’, ‘COVID-19’, ‘novel coronavirus’, ‘SARS-CoV-2’, ‘SARS’, ‘MERS-CoV’, and ‘severe acute respiratory syndrome’.

MEDLINE is a widely used, public source of data on over 30 million articles. However, there are several known data quality issues in MEDLINE, which surface quickly when one attempts to create a knowledge graph from it. Let us see what some of those issues are.

Issues concerning standard identifiers

The use of standard identifiers is an excellent way to ensure the correct identification of entities in the data. For example, since every article in MEDLINE has a PMID (that is an identifier produced by the creators of MEDLINE), we have a reliable way to identify a journal article correctly and repeatedly in MEDLINE.

To begin with, other standard identifiers were sparse in the data. Out of over 1.1 million author instances in the COVID dataset, around 10% had an ORCID attributed to the author, and a number of these ORCIDs were invalid. In some cases, all authors of a paper were assigned the same ORCID, which clearly cannot be true for a “unique identifier.” Moreover, it should be noted that fewer than 3% of authors in all of MEDLINE have an ORCID. ISNI and GRID identifiers were even less prevalent, with 2884 links to ISNI identifiers and 3276 links to GRID identifiers out of 1.12 million authors in the COVID dataset. Only roughly 700 institutions were represented by ISNI and by GRID, respectively.
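Two of the ORCID problems described above can be checked mechanically. The sketch below is illustrative and is not the code we used: it assumes ORCIDs arrive in the bare 19-character form (for example 0000-0002-1825-0097), and the function names are hypothetical. The check character follows the ISO 7064 MOD 11-2 scheme that ORCID documents for its identifiers.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Assumed input layout: four dash-separated blocks, last character a digit or X.
ORCID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def orcid_check_char(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """True if the string has the ORCID layout and a matching check character."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_char(digits[:15]) == digits[15]


def shared_orcids(author_orcids: Iterable[Optional[str]]) -> set:
    """Return ORCIDs attached to more than one author of the same article."""
    counts = Counter(o for o in author_orcids if o)
    return {o for o, n in counts.items() if n > 1}
```

A passing checksum only shows that the identifier is well formed; it does not show that it belongs to the author it is attached to, which is why the second check looks for the same ORCID repeated within one author list.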

Further issues concerning authors and affiliations

For an author graph, the author identity is an essential piece of information. We needed to extract the authors of the published articles from the bibliographic metadata reliably and with a measure of confidence as to whether the information about an author in the graph is accurate. Do the names “R S Baric,” “Ralph Baric,” “Ralph S Baric,” “Ralph A Baric,” and “Ralph Steven Baric” all refer to the same author?
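To make the question concrete, here is a minimal sketch of a name-compatibility test. It is a deliberately naive heuristic, not our disambiguation method: it assumes the surname is the last whitespace-separated token and treats an initial as compatible with any given name that starts with it. The function names are hypothetical.

```python
def split_name(full_name: str):
    """Split 'Ralph S Baric' into (['ralph', 's'], 'baric').

    Assumption: the surname is the last token; periods after initials are ignored.
    """
    tokens = full_name.replace(".", " ").casefold().split()
    if not tokens:
        raise ValueError("empty name")
    return tokens[:-1], tokens[-1]


def names_compatible(a: str, b: str) -> bool:
    """True if the two name strings could plausibly denote the same person."""
    given_a, last_a = split_name(a)
    given_b, last_b = split_name(b)
    if last_a != last_b:
        return False
    # zip stops at the shorter list, so a missing middle name is not a conflict.
    return all(x.startswith(y) or y.startswith(x) for x, y in zip(given_a, given_b))
```

Under this heuristic, “R S Baric,” “Ralph Baric,” “Ralph S Baric,” and “Ralph Steven Baric” are mutually compatible, while “Ralph A Baric” conflicts with the variants that carry the middle initial S. Compatibility is only weak evidence: two different people can share a compatible name, so a real system would combine it with affiliations, co-authors, and subject matter before assigning a confidence.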

Moreover, if we look at just the length of authors’ last names in the COVID dataset, the number of names with a length of 1 character and of those with a length >30 characters exploded in 2020. In that metric alone, we can clearly see the impact on data quality of rushing out so many COVID-19 publications in 2020.
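A metric like this is cheap to compute. The following sketch counts suspicious last names per publication year; the thresholds mirror the ones mentioned above (1 character, more than 30 characters), while the function names and the (year, last_name) record layout are assumptions made for illustration.

```python
from collections import Counter


def suspicious_last_name(last_name: str, min_len: int = 2, max_len: int = 30) -> bool:
    """Flag last names shorter than min_len or longer than max_len characters."""
    n = len(last_name.strip())
    return n < min_len or n > max_len


def count_suspicious_by_year(records):
    """Count flagged last names per year; records are (year, last_name) pairs."""
    counts = Counter()
    for year, last_name in records:
        if suspicious_last_name(last_name):
            counts[year] += 1
    return counts
```

A flag here means "worth a second look", not "wrong": very short surnames do exist, so the useful signal is a sudden change in the yearly counts rather than any single name.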

Collective Names / Working Groups pose a further challenge: authors who are also part of a working group can end up being listed twice or more in the author list. E.g., “Frédéric Choulet” appears 12 times in the author metadata for PMID 30115783.
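Repeated names within a single author list are easy to surface, even if deciding what to do about them is not. The sketch below is an illustration only; the function name is hypothetical, and matching is done on the case-folded name with whitespace collapsed.

```python
from collections import Counter


def repeated_names(author_names):
    """Return {normalized name: count} for names listed more than once."""
    counts = Counter(" ".join(name.split()).casefold() for name in author_names)
    return {name: n for name, n in counts.items() if n > 1}
```

The result is a list of candidates for review. Identical strings within one record are likely the same person, but that remains an inference to which a confidence should be attached, not a fact in the metadata.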

Finally, data quality issues also appear with author affiliations. Author affiliations may be concatenated and duplicated across all the authors, rendering correct attribution impossible. Even when we do have an author affiliation, it is not always useful. Believe it or not, the most common affiliation in MEDLINE is the institution-less ‘Department of Psychology.’, including the period! It appears over 8,000 times, and both ‘.’ and ‘,.’ make the top five.
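Affiliation strings like these can be filtered out with a simple rule. The sketch below is a rough heuristic written for this post, not a complete classifier: it treats an affiliation as uninformative if it contains no letters or digits, or if it is a single "Department of ..." segment with no institution following it. The function name is hypothetical.

```python
import re


def is_uninformative_affiliation(text: str) -> bool:
    """Flag affiliations that are punctuation only or name a department with no institution."""
    cleaned = text.strip()
    if not any(ch.isalnum() for ch in cleaned):
        return True  # covers '', '.', ',.' and similar debris
    # Split on commas and semicolons; a lone "Department of X" names no institution.
    segments = [s.strip() for s in re.split(r"[,;]", cleaned.rstrip(".")) if s.strip()]
    return len(segments) == 1 and segments[0].casefold().startswith("department of")
```

With this rule, ‘Department of Psychology.’, ‘.’ and ‘,.’ are all flagged, while a string that names a department followed by a university is kept.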

Issues concerning publication dates

Another area where we observe data quality issues is publication dates. The oldest publication year in all of MEDLINE is the year 0001, which is clearly wrong, while the next oldest publication year is 1041; Gutenberg was not yet alive then, let alone the subject matter of the paper. It turns out that, in the latter example, the number 1041 is actually the page number of the article, misplaced in the date field! Months are meant to be a three-character string in MEDLINE but can also be numeric, and this leads to other problems. These types of issues with publication dates are not uncommon in our experience, nor are they unique to MEDLINE. In other datasets that we work with, we have seen publication dates far out in the future (e.g. 2104) that are clearly transpositions of the digits of the correct publication date (e.g. 2014).
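Checks for these date problems are simple to write down. The sketch below is illustrative: the function names are hypothetical, the default lower bound of 1500 is an arbitrary assumption, and the upper bound is passed in explicitly so that the check does not depend on the day it is run.

```python
from typing import Optional

MONTH_ABBREVIATIONS = ("jan", "feb", "mar", "apr", "may", "jun",
                       "jul", "aug", "sep", "oct", "nov", "dec")


def normalize_month(value: str) -> Optional[int]:
    """Map 'Jan' or '01' to 1; return None for anything unrecognized."""
    text = value.strip().casefold()
    if text.isdecimal():
        month = int(text)
        return month if 1 <= month <= 12 else None
    if text in MONTH_ABBREVIATIONS:
        return MONTH_ABBREVIATIONS.index(text) + 1
    return None


def plausible_year(year: int, latest: int, earliest: int = 1500) -> bool:
    """True if the year falls inside the accepted range (bounds are assumptions)."""
    return earliest <= year <= latest
```

With latest set to 2021, the years 0001, 1041 and 2104 are all rejected and 2014 is accepted. Rejecting a value is only the first step; recovering the intended date (for example, recognizing 2104 as a transposition of 2014) requires further evidence from the record.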

Schema-level issues

The team also identified issues at the schema level, which I am not going to get into in detail here. (For additional details, see my LinkedIn post, which references problematic elements of the 2019 DTD.) Briefly, the schema omits a key relationship, namely the relationship between an Investigator and a Collective, a gap that the documentation acknowledges:

“For records containing more than one collective/corporate group author, InvestigatorList does not indicate to which group author each personal name belongs.”

It then explains the NLM’s reasoning for repeating names:

“In this context, the names are entered in the order that they are published; the same name listed multiple times is repeated because NLM cannot make assumptions as to whether those names are the same person.”

Summary

When my team set out to build the COVID Author Graph, we did not want to build a “quick-and-dirty” graph of authors from the COVID-related scientific literature. Our aim was to build a knowledge graph of authors, in the sense that we wanted to quantify our confidence in the underlying quality of the data and sort out the litany of known problems, many of which I mentioned above. We wanted to disambiguate authors and their affiliations as well as we possibly could. Producing and maintaining knowledge is an iterative process that is based on continuous improvement. At a high level, all knowledge systems implement some version of the Deming cycle (a.k.a. the PDCA cycle) because the nature of knowledge is dynamic, and the preservation of quality (for its data) requires feedback loop mechanisms. The outcome of such systematic data engineering can be a variety of data products, where the degree of confidence for every piece of information can be quantified and its provenance be reported. Such data can then lead to more knowledge through inference — the Bayesians amongst us can see where all this is going!

This post has emphasized that even the simplest data sets are fraught with problems that impede the production of knowledge. Thus, every knowledge system must define and measure data quality systematically. The better the quality of our data, the greater the amount of knowledge that can be produced for a given effort.