At CCC, we talk a lot about “affiliation disambiguation.” It’s a mouthful, but it’s a struggle publishers know very well: the challenge of untangling authors and institutions in their customer records. Inaccurate and incomplete affiliation data disrupts operations across the research lifecycle, with negative impacts on researchers, institutions, funders, and publishers.
Fortunately, there are solutions and best practices to help publishers tackle this problem. I spoke with Shannon Reville and Stephen Howe, product management leaders at CCC, to break down the concept of affiliation disambiguation, why it matters in scholarly publishing, and what publishers can do to improve data quality.
In Part 1 of this blog series, we discuss the challenges of affiliation disambiguation in scholarly publishing.
What is affiliation disambiguation, and why does it matter in scholarly publishing?
Affiliation disambiguation is the process of linking references to institutions encountered throughout the publishing workflow to the actual institutions they name. Researchers are typically affiliated with one or more universities or research organizations, and their work may also be associated with a research funder under a grant. How these institutions are referenced can be inconsistent, and removing uncertainty about the relationships between a researcher or author, their research, and their related institution(s) is at the heart of the disambiguation question. From research integrity to open science, accurately identifying these relationships is critical to removing friction from business processes and restoring trust in science. This is particularly true for scholarly publishers, who need a clear sense of which institutions their authors are affiliated with so they can create sustainable deals, support transparent, data-driven communication with institutional partners, and assess the impact of global governmental mandates on their publishing programs.
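To make the matching problem concrete, here is a minimal sketch in Python. The registry, identifiers, and similarity threshold are all hypothetical; production systems match against full PID databases such as Ringgold or ROR, with far richer logic than simple string similarity.

```python
from difflib import SequenceMatcher

# Hypothetical mini-registry: in practice this would be a full PID
# database such as Ringgold or ROR, not a hand-built dictionary.
REGISTRY = {
    "ringgold:0001": "Massachusetts Institute of Technology",
    "ringgold:0002": "University of Michigan",
    "ringgold:0003": "Michigan State University",
}

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def disambiguate(raw_affiliation: str, threshold: float = 0.6):
    """Return (pid, name, score) for the best-matching registry entry,
    or None if nothing clears the similarity threshold."""
    raw = normalize(raw_affiliation)
    best = max(
        ((pid, name, SequenceMatcher(None, raw, normalize(name)).ratio())
         for pid, name in REGISTRY.items()),
        key=lambda candidate: candidate[2],
    )
    return best if best[2] >= threshold else None

# Free-text entries like these are what submission systems actually receive.
for entry in ["Univ. of Michigan, Ann Arbor", "MIT", "Dept of Physics, M.I.T."]:
    print(entry, "->", disambiguate(entry))
```

Note that abbreviations like “MIT” defeat naive string similarity entirely, which is one reason capturing a persistent identifier at the moment of submission beats cleaning up free text after the fact.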
How does affiliation data typically get captured during the publishing process?
In our work with dozens of the world’s leading scholarly publishing organizations, we’ve seen a variety of approaches to capturing author affiliation information, demonstrating significant effort to improve the reliability of this information. Unfortunately, there remain gaps that undermine the overall goal. We can boil the approaches down to three:
- Unstandardized collection: During the manuscript submission process, editors across journals give authors the option to enter their affiliation as free text. Due to the free-form nature of this data entry, the affiliation data is often incomplete, filled with abbreviations and misspellings, and very difficult to use when it comes time to structure an OA deal or report publishing activity to an institution.
- Optional standardized collection: During the submission process or another pre-acceptance stage, publishers ask authors to search for and select their affiliation using a database of organization persistent identifiers (PIDs), like Ringgold. Publishers may have made a strategic investment in this data, or they’ve gained access to it through a publishing systems vendor whose platform integrates it. Leveraging PIDs and standardized, descriptive metadata means data is much cleaner when provided, but it is only as current as the time of entry and does not help with affiliation data collected through other means or pulled from historical systems.
- Validated collection: Some publishers have implemented a manual check as part of their publication lifecycle, often just prior to acceptance, to ensure all manuscripts have a Ringgold or other affiliation identifier assigned. This is not fully automated, as there are always edge cases, such as an author who is not associated with any organization. But this manual intervention in the publication lifecycle results in the cleanest data we see. Some publishers go so far as to ensure the affiliation selected aligns with the email domain for that author, a check sketched below.
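That last check lends itself to partial automation. Here is a minimal sketch, assuming a hypothetical hard-coded domain table; a real implementation would pull institutional email domains from the registry record and route mismatches to a manual review queue rather than rejecting them outright.

```python
# Hypothetical domain table: real implementations would pull institutional
# email domains from a PID registry record rather than hard-coding them.
INSTITUTION_DOMAINS = {
    "ringgold:0002": {"umich.edu", "med.umich.edu"},
}

def email_matches_affiliation(email: str, institution_id: str) -> bool:
    """Check whether an author's email domain belongs to the institution
    they selected; mismatches go to a manual review queue."""
    domain = email.rsplit("@", 1)[-1].lower()
    known = INSTITUTION_DOMAINS.get(institution_id, set())
    # Accept exact matches and subdomains (e.g. physics.umich.edu).
    return any(domain == d or domain.endswith("." + d) for d in known)

print(email_matches_affiliation("author@umich.edu", "ringgold:0002"))  # True
print(email_matches_affiliation("author@gmail.com", "ringgold:0002"))  # False: flag for review
```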
What are the most common sources of affiliation ambiguity or inconsistency?
Broadly speaking, the main source of data inconsistencies is that metadata such as author affiliation is captured at multiple steps and by multiple systems throughout the publishing workflow. Even within a single part of the workflow, such as submission, there can be inconsistencies in how data is handled.
Publishers typically have access to customizations in the submission workflow, like requiring a field to be completed rather than leaving it optional, or adding help text explaining why affiliation info is important (it could help you publish OA at no cost!). However, we see time and again that these settings go unused.
In some cases, publishers can also surface article processing charge (APC) or OA funding information in the submission workflow. This can alert authors that they’re missing potential funding opportunities due to incomplete or inaccurate data. The rub? We hear that editorial teams are siloed from OA teams, or that journal submission settings have been in place for so long that newer staff lack deep familiarity with their current state, so these simple but effective changes are overlooked.
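Concretely, the settings described here usually reduce to a handful of per-journal toggles. The sketch below is purely illustrative; submission platforms differ, and these field names are invented for the example.

```python
# Invented field names for illustration; actual submission platforms
# expose equivalent settings under their own configuration schemes.
SUBMISSION_FORM_SETTINGS = {
    "affiliation": {
        "required": True,        # block submission without an affiliation
        "use_pid_lookup": True,  # search a PID registry instead of free text
        "help_text": (
            "Selecting your institution lets us check whether your article "
            "qualifies to publish open access at no cost to you."
        ),
        "show_apc_funding_info": True,  # surface APC/OA funding eligibility
    },
}
```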
When decent affiliation information is provided, OA workflow systems like RightsLink for Scientific Communications can automate agreement operations, institutional reporting, and more. OA modeling and analysis systems like OA Intelligence can enrich that data to give publishers a well-rounded picture of their entire program. We therefore encourage our publishers to address the state of their most upstream author affiliation steps, make the most of the tools at their disposal, and commit to collecting affiliation information correctly from the very first submitted manuscript.
What are the main challenges publishers face when trying to ensure clean and accurate affiliation data?
There seem to be four core issues among most publishers who are really struggling:
- Due to resource constraints and/or a lack of shared understanding across editorial, sales, operations, and the rest of the organization, they haven’t yet made data quality a strategic focus.
- Their growth is outpacing their infrastructure, limiting their ability to scale. They may have a robust program, dynamic publishing agreements, and many institutional clients, but the complexity of their programs and the broader landscape requires investment in advanced workflow solutions and high-quality data services.
- More cross-team collaboration is required to make sure editors are implementing submission steps that serve the big picture, OA and sales teams get what they need to operationalize and model deals, and the executive suite can assess how much they should worry about today’s news.
- Publishers need agreement on which set of organizations they are going to standardize on. It is one thing to make affiliation data mandatory and to capture it consistently; it is another to link the received data to a standard registry such as Ringgold or ISNI.
Why is it so difficult to standardize affiliations, even within the same institution or across journals?
Journal submission implementations are often bespoke, so for some publishers, updating their journal submission process to improve how this metadata is collected could mean updating hundreds of journal sites.
Publishers also need to think about their institutional relationships in terms of levels and exceptions. Often publishers strike deals with a university, but those deals do not cover the university’s hospitals, certain labs, sister schools, or departments. So an author simply identifying a university system is not sufficient for operations, reporting, and modeling renewals. Publishers need to ask authors for the right level of detail about their affiliations to scale their program and facilitate automation, and the PIDs they implement need to exist at the “right” level, as the sketch below illustrates. Authors can also legitimately have multiple affiliations, though perhaps only one of those is appropriate to a specific workflow.
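A minimal sketch of why the level matters, using made-up identifiers: when an author selects a system-level PID whose descendants differ in deal coverage, the affiliation is too coarse to operationalize. Real parent/child links come from the PID registry itself; both Ringgold and ROR model these relationships.

```python
from dataclasses import dataclass, field

# Made-up identifiers and hierarchy for illustration only.
@dataclass
class Org:
    pid: str
    name: str
    covered_by_deal: bool
    children: list["Org"] = field(default_factory=list)

hospital = Org("pid:103", "Example University Hospital", covered_by_deal=False)
campus = Org("pid:102", "Example University, Main Campus", covered_by_deal=True)
system = Org("pid:101", "Example University System", covered_by_deal=False,
             children=[campus, hospital])

def descendants(org: Org):
    """Yield an organization and everything beneath it."""
    yield org
    for child in org.children:
        yield from descendants(child)

def deal_status(org: Org) -> str:
    """If coverage differs anywhere under the selected PID, the affiliation
    is too coarse to operationalize and a more specific choice is needed."""
    flags = {o.covered_by_deal for o in descendants(org)}
    if len(flags) > 1:
        return "ambiguous: ask the author for a more specific affiliation"
    return "covered" if flags.pop() else "not covered"

print(deal_status(campus))  # covered
print(deal_status(system))  # ambiguous: ask the author for a more specific affiliation
```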
Lastly, there are always edge cases: authors who have no affiliation, who have multiple affiliations, or who work for an organization we’ve never heard of. In our market research and discussions with customers, publishers seem to agree that some degree of manual checks and balances will always be required to achieve even a near-perfect picture of their authors’ affiliations.
As this conversation highlights, collecting unstandardized organization data results in a web of intertwined problems for publishers. In Part 2 of this series, we will discuss solutions and best practices for untangling affiliations.