As the power of computing and access to information have increased, data, and the knowledge derived from that data, have become increasingly important drivers of organizational decision-making. In almost every position in today’s world, decisions are made by compiling and analyzing data and implementing strategies based on the findings. In my world, there isn’t a day that goes by when I am not doing something with data. The scholarly publishing community is certainly no exception. Publishers rely on data in a variety of ways, from modeling OA institutional agreements, to examining publishing trends in order to increase diversity and inclusivity in authorship, to shortening the time between the submission and publication of research.
Understand the Question First
When given a large data set, the temptation is almost always to be drawn into the data and to start exploring, but in practice I have found that exploring the data should not be your first step. The first step for any analyst is to understand the question that is being asked or the problem that needs to be solved. Data can only be understood in that problem context, and I won’t know what data I will need, or what problems that data might have, until I know what I am trying to do with it. As you explore the data, you begin to discover new questions, which let you return to your stakeholders, find out more information, and gain a deeper understanding of what is being asked. Analytics always starts with a question, and it is worth spending time defining that question before any data analysis.
Finding Value in Cleaning Data
There is a common saying that cleaning data takes 80%-90% of a data scientist’s or analyst’s time. There is truth to this statement, but I think that its real meaning is often overlooked. This statement is often spoken with a certain amount of derision, as if time is being wasted by having to clean the data. However, I find that the time spent cleaning data is where value is added. I equate data cleaning to data quality, ensuring the data is fit for purpose and a fair representation of the real-world entities and events it depicts. It is in data cleaning that you learn just how well your data represents the problem you seek to solve.
Data Analysis Doesn’t Stop
It is important to recognize that data cleaning is not entirely an ‘up-front’ process. Your data will shape-shift over time. You always need to pay attention to how your data is changing so that your models and analyses don’t become invalid. Just as the data will shift, so may the problem you are attempting to solve. Iterative understanding and frequent communication with stakeholders are integral to the data process, to make sure that you are answering the questions you’ve been presented with and that those questions are the ones that matter to your users.
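As one illustration of what paying attention to changing data can look like, here is a minimal sketch in Python that compares a new batch of records against an earlier baseline. Everything in it is an assumption made for the example: the field name days_to_publication, the sample records, and the two thresholds are hypothetical, and a real check would be shaped by the question you are trying to answer.

```python
from statistics import mean

def profile(rows, field):
    """Summarize one numeric field: share of missing values and mean of the rest."""
    values = [row.get(field) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "missing_rate": 1 - len(present) / len(values) if values else 0.0,
        "mean": mean(present) if present else None,
    }

def drift_flags(baseline, current, field,
                max_missing_increase=0.05, max_mean_change=0.20):
    """Return warnings when the current batch departs from the baseline.

    The thresholds are illustrative defaults, not recommended values.
    """
    old, new = profile(baseline, field), profile(current, field)
    flags = []
    # More missing values than before suggests an upstream collection problem.
    if new["missing_rate"] - old["missing_rate"] > max_missing_increase:
        flags.append("missing rate rose")
    # A large relative change in the mean suggests the data itself has shifted.
    if old["mean"] and new["mean"] is not None:
        if abs(new["mean"] - old["mean"]) / abs(old["mean"]) > max_mean_change:
            flags.append("mean shifted")
    return flags

# Hypothetical records: an earlier batch and a more recent one.
baseline = [{"days_to_publication": d} for d in (90, 110, 100, 100)]
current = [{"days_to_publication": d} for d in (150, None, 130, 140)]

print(drift_flags(baseline, current, "days_to_publication"))
```

A flag from a check like this is a prompt to go back to the data and to your stakeholders, not a conclusion in itself.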
Take Action with Your Data
There are plenty of well-known examples from the history of statistics and analytics that illustrate the power of data to reveal previously unseen trends, from Florence Nightingale’s work during the Crimean War, which showed that more British soldiers were dying from contagious diseases than in battle, to W.E.B. Du Bois’s vivid infographics, which helped to explain systemic racism. I come from the world of product management, and I have a very simple yet powerful model for explaining how to build products: we start with human experiences, we find words for them, and then we build systems around them.
As Senior Product Manager, Analytics, I work with the foundational engineering teams to define and build the data processing pipelines and analytics capabilities needed to support our data operations, internal reporting and intelligence, and external analytics and data product needs. If you stopped by my office at any given time, I would probably be writing queries to answer a question, looking at our data to learn how we can do things better, or having a conversation about data models and technology.
In my work at CCC, we use data to drive what we do. We use it when we are analyzing problems and trying to find their root causes. We also use data proactively, to find potential problems to fix or new areas of interest that can provide value to our partners. And above all, we use data to learn how we are doing and how we can improve.