Recently, my colleague, Laura Cox, Senior Director, Publishing Industry Data, spoke on a panel at NISO Plus 2023 to discuss our experiences and the challenges around metadata fitness. In this presentation, Laura and the other panel members described how metadata is all around us and has become one of the core tools in scholarly publishing workflows. As scholarly communication is experiencing a collective push to open access publishing and open science, the role for quality metadata across the entire ecosystem is becoming indispensable. Despite its growing importance and prominence, metadata today is not where it needs to be. Simply put, metadata needs to be more robust, uniform, and comprehensive in all parts of the research lifecycle.
At CCC, data is at the heart of everything we do and build, whether that is our own internal master data management systems, data products such as the Ringgold Identify database, or applications of advanced data analytics such as with knowledge graphs. Our goal is to always be improving the quality of the data we use and provide. However, in order to improve the fitness of data, one must understand that data. And in order to understand something we must first examine it. In the remainder of this blog post, I am going to describe some of the approaches we use to understand the quality of our data, specifically our Ringgold organization data.
Why Data Quality?
Data quality is an expression of a dataset's usefulness and value. We often describe the things we build as knowledge systems: systems that take data as input and, through a series of data processing steps, extract as much of the 'actionable information' in the data as possible. In knowledge systems, the amount of knowledge we can generate from information, and the effort involved to generate it, is directly related to the quality of the data. You can think of data quality as akin to FAIR. Whereas FAIR describes the degree to which data is findable, accessible, interoperable, and reusable, data quality is a set of principles we can apply to our data to understand its potential value. Imagine if every dataset came with its own certified statement of data quality. Ours does.
Data Quality Dimensions
When we examine the quality of our data, we define quality across different dimensions. Each dimension provides a different perspective into the data, and in aggregate they give us a much better understanding of the overall quality of that data. While I have seen some people describe as many as thirteen different dimensions, today I will focus on six.
- Completeness. This dimension evaluates the content of the information actually available with respect to the maximum content possible. For example, if I posit that every record describing an organization should have ten core fields (name, place, URL, type, status, etc.), and if I have 10 records describing organizations, then my data needs to have values in all 100 fields to be considered 100% complete.
- Consistency. This dimension evaluates coherence to a standard or referential conformity between fields. A good example of consistency is identifiers with a specified syntax and a calculated checksum (e.g., ISSN, ISBN, ORCID, ISNI). If an ISNI value I have in my data doesn't match the prescribed ISNI format or pass its checksum calculation, I can say that data is inconsistent.
- Currency. This dimension evaluates the timeliness of the data. The definition of currency, or timeliness, depends on the domain context for the data. If the data is from a repository describing bibliographic metadata, I may look at the difference between when an article was published and when a record for that article was created in the database to determine its timeliness. If the data describes organizations, I may look at the percentage of records that were updated in a specific timeframe.
- Redundancy. This dimension evaluates the extent to which unique entities are duplicated across records. Duplicate records are the bane of any dataset and coming up with a precise definition of a duplicate is not always as easy as it would seem. One way to look for duplicates is to evaluate PIDs (persistent identifiers) in your dataset. For example, do you have PIDs that are supposed to be unique used to describe more than one record?
- Accuracy. This dimension evaluates the extent to which a record describes a real entity out in the world. Accuracy is paired with precision. While accuracy is straightforward to describe in theory (does a record's data match the values of that same entity out in the world?), measuring it requires having a trusted reference dataset with which to compare your own dataset.
- Reliability. This dimension is a combined measure of completeness, consistency, and accuracy. One way to view reliability is to see it as an evaluation across ‘voting’ experts. How does the database’s data compare with the data provided by other, trusted datasets?
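Two of these dimensions lend themselves to a short sketch. The consistency check below implements ISO 7064 MOD 11-2, the checksum scheme shared by ISNI and ORCID; the completeness check counts filled core fields against the maximum possible, as in the example above. The field list and dictionary record shape are illustrative assumptions, not Ringgold's actual schema.

```python
def mod_11_2_check_digit(digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for a 15-digit string
    (the scheme used by both ISNI and ORCID identifiers)."""
    total = 0
    for d in digits:
        total = (total + int(d)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_isni(isni: str) -> bool:
    """Consistency: 16 characters after normalization, first 15 numeric,
    and the final character matches the computed checksum."""
    isni = isni.replace(" ", "").replace("-", "")
    if len(isni) != 16 or not isni[:15].isdigit():
        return False
    return mod_11_2_check_digit(isni[:15]) == isni[15]

# Illustrative subset of "core fields" for an organization record;
# a real profile would enumerate all ten.
CORE_FIELDS = ["name", "place", "url", "type", "status"]

def completeness(records: list) -> float:
    """Completeness: filled core fields divided by maximum possible fields."""
    possible = len(records) * len(CORE_FIELDS)
    filled = sum(1 for r in records for f in CORE_FIELDS if r.get(f))
    return filled / possible if possible else 0.0
```

Because ORCID uses the same checksum, the same `is_valid_isni`-style validation logic can be reused across identifier types by swapping the length and syntax rules.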
Measuring Data Quality
Now that we have a better understanding of how to define and categorize data quality into different dimensions, we can turn our attention to measuring our own data. In the case of Ringgold Identify, we have defined over 50 different data quality metrics across these six dimensions. For example:
- Is the name complete for each organization?
- Does the ISNI value pass the check-sum validation?
- Do only consortia have members?
- Are all our FunderID values unique?
We run these metrics every time we produce a new version of the dataset. Not only do the results improve our own understanding of our data, but they are also actionable. We use these results to guide our operations team, helping them focus their maintenance and curation tasks for the overall dataset. Most importantly, since we measure our data quality with every version of the dataset, we have a view of our data's quality over time and can see how all of our improvements, whether business, operational, or engineering, are contributing to the changes in the underlying data quality and its value potential.
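A per-version metric run like the one described above could be sketched as a small registry of named checks evaluated against each release. The registry pattern, metric names, record shape, and `funder_id` field are illustrative assumptions, not CCC's actual pipeline.

```python
# Hypothetical metric registry: each metric is a function from a list of
# records to a score in [0, 1], registered under a stable name so results
# can be compared across dataset versions.
METRICS = {}

def metric(name):
    def register(fn):
        METRICS[name] = fn
        return fn
    return register

@metric("unique_funder_ids")
def unique_funder_ids(records):
    """Redundancy: fraction of FunderID values that are unique.
    1.0 means no duplicated identifiers."""
    ids = [r["funder_id"] for r in records if r.get("funder_id")]
    return len(set(ids)) / len(ids) if ids else 1.0

def run_metrics(records, version):
    """Evaluate every registered metric and tag the results with the
    dataset version, so scores can be tracked over time."""
    return {"version": version,
            **{name: fn(records) for name, fn in METRICS.items()}}
```

Storing each version's result dictionary gives exactly the longitudinal view described above: plotting a metric's score against the version axis shows whether curation work is moving the needle.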