Musings of a COVID-19 Data Analyst

As a data architect, I help assemble CCC’s Links to COVID-19 Data Resources. This project began in April as an extension of our efforts to amplify how publishers were making COVID information available in an open-to-read format. When I first began searching for and vetting dashboards that attempted to tell a story about the ongoing COVID-19 pandemic, I noticed that many of them relied on one or two common sets of data. The data set provided by Johns Hopkins University in the United States was particularly popular, and is readily available to researchers on GitHub. Another popular data set was provided by the European Centre for Disease Prevention and Control, which makes its data available in a variety of formats. There are good reasons why many practitioners have focused on a small set of data sources. Indeed, in a case like the current pandemic, there are often few sources of data available. Yet there are some challenges that researchers must overcome when using the same set of data that most other investigators choose.

Because many of the dashboards use the same data, it was not surprising that they presented very similar pictures of the pandemic. The information was often shown in the same way too, with minor variations in color, chart type, and other secondary features. It probably comes as no surprise to researchers that when everyone uses the same set of data, the story to be told is pretty much the same, no matter who is creating the visualization. After all, we are limited by the volume, variety, and types of data available for analysis. If the set of data doesn’t contain date information, for example, it would be impossible to show trends over time. Likewise, if all the data in a set is categorical, or descriptive, only a limited range of summary statistics is available.
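
To make that last point concrete, here is a minimal sketch of what a purely categorical field can support. The field and its values are invented for illustration and do not come from any of the data sets mentioned here. Counts, shares, and the most common value are available; a mean or a trend line is not.

```python
from collections import Counter

# Hypothetical categorical field; these values are invented for illustration.
outcomes = ["recovered", "hospitalized", "recovered", "unknown", "recovered"]

# Frequency counts are the basic summary a categorical field supports.
counts = Counter(outcomes)

# The mode (most common category) and how often it occurs.
mode, mode_count = counts.most_common(1)[0]

# Each category's share of all records.
shares = {label: n / len(outcomes) for label, n in counts.items()}

print(mode, mode_count)
print(shares)
```

Anything beyond this, such as an average or a change over time, would need a numeric or date field that this hypothetical set does not have.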

However, there are sometimes good reasons for researchers to use a single or small set of data sources, particularly if the data comes from a party with a reputation for good data governance. Often that data is vetted and corrected according to basic data hygiene practices. The data is usually updated frequently. In some cases the data is open data, which makes it possible for a larger community to help with vetting and data cleanup. One provider of data and dashboards that I have recently encountered, outbreak.info, encourages such community involvement by soliciting GitHub issues, whether for errors, new sources of data, or feature requests. An organization like Johns Hopkins University has important reasons to ensure the quality of data made available to other parties, the chief being the University’s reputation.

Given a single data set, or a small number of related sets, how can a dashboard designer differentiate their presentation, with the aim of providing a more detailed or nuanced picture? One method is to augment or enrich the original data set with relevant data from other sources that does not exactly overlap the primary set. Supplementary data can provide context and facets that the original data set alone cannot. For example, a researcher may combine demographic data with information about where the data was collected. Combining data in this way requires good data governance so that the quality of the data, particularly of the fields used to match against other data sets, is reliable. During a time like the present, when the exigencies of making data available as quickly as possible might seem to override concerns about data quality, governance around data is actually more important than ever. Data quality consists of multiple dimensions. Timeliness is certainly one important dimension, but so are accuracy, cleanliness, and completeness.
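
As a rough sketch of this kind of enrichment, the example below joins case counts to population figures on a shared region code. Everything in it is an assumption made for illustration: the region codes, the numbers, and the helper names normalize_key and enrich are hypothetical. The point is that the matching field has to be cleaned before the join, and that rows that fail to match should be reported, not silently dropped.

```python
# Illustrative only: region codes and numbers are invented for this sketch.
cases = [
    {"region": "A01", "cases": 120},
    {"region": "a02 ", "cases": 45},  # inconsistent case and stray whitespace
    {"region": "Z99", "cases": 7},    # no match in the supplementary set
]
population = {"A01": 10000, "A02": 5000}


def normalize_key(raw):
    """Standardize a join key so trivially different spellings still match."""
    return raw.strip().upper()


def enrich(case_rows, population_by_region):
    """Attach population and cases per 1,000 residents; collect unmatched rows."""
    enriched, unmatched = [], []
    for row in case_rows:
        key = normalize_key(row["region"])
        pop = population_by_region.get(key)
        if pop is None:
            # Keep the row for review so the gap in coverage stays visible.
            unmatched.append(row)
            continue
        enriched.append({
            "region": key,
            "cases": row["cases"],
            "population": pop,
            "cases_per_1000": 1000 * row["cases"] / pop,
        })
    return enriched, unmatched


enriched, unmatched = enrich(cases, population)
print(enriched)
print("unmatched:", unmatched)
```

Returning the unmatched rows separately is a deliberate choice in this sketch: a join that quietly loses records would undermine the completeness dimension described above.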

Beyond making it harder to create a compelling presentation, a disadvantage of using a single set of data, or a small set of data sources, to investigate any issue is that the effect of bias or poor data collection practices is magnified. For example, does the data properly account for all strata within a given population? We all remember the postmortems of the presidential election of 2016, when polling organizations realized that the data they collected underrepresented certain demographic groups. Was the study that collected the data properly designed? It’s vital to ask and find answers to such questions before using any set of data, but particularly so when only one set of data is available to analyze.

At a time like the current pandemic, when timeliness appears to outweigh all other data quality dimensions, data governance has never been more important. It does no one any good if the first set of data available for study is also full of garbage characters, inaccurate dates, and similar problems. Relying on one or a few sets of data on a topic can work well if the sources are reliable and accountable. However, when that is not the case, it can be a recipe for misinformation.
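
A basic hygiene check for the problems just mentioned might look something like the sketch below. It assumes each record is a dictionary with hypothetical "location" and "date" fields and that dates are ISO-formatted strings; the function name check_record and the date window are likewise my own choices for illustration.

```python
from datetime import date


def check_record(record, earliest, latest):
    """Return a list of data quality problems found in one record."""
    problems = []

    name = record.get("location", "")
    if not name.strip():
        problems.append("missing location")
    elif "\ufffd" in name or not name.isprintable():
        # U+FFFD is the replacement character left behind by failed decoding;
        # non-printable characters usually point to the same kind of damage.
        problems.append("garbage characters in location")

    try:
        reported = date.fromisoformat(record.get("date", ""))
    except (TypeError, ValueError):
        problems.append("unparseable date")
    else:
        if not earliest <= reported <= latest:
            problems.append("date out of range")

    return problems


# Hypothetical records, invented for illustration.
window = (date(2019, 12, 1), date(2020, 12, 31))
print(check_record({"location": "Boston", "date": "2020-04-15"}, *window))
print(check_record({"location": "Bost\ufffdn", "date": "2020-02-30"}, *window))
print(check_record({"location": "Boston", "date": "2031-01-01"}, *window))
```

Checks like these catch only the mechanical problems. They cannot tell you whether a well-formed date or place name is actually correct, which is where a reliable and accountable source matters.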
