Top 3 Challenges When Using Scientific Articles with Internal AI & Machine Learning Projects

In today’s R&D organizations, corporate librarians and the data scientists they support are looking to acquire as much AI-ready content as they can for their internal AI workflows, including thousands, and sometimes millions, of scientific articles.

Negotiating with a wide range of rightsholders for those articles is time-consuming, and expecting consistency in how content is delivered is often unrealistic. Yet access to high-quality scientific literature is critical for building AI and machine learning models that can improve accuracy and drive efficiency.

How can R&D teams build a collection of AI-ready scientific literature? Here are some challenges our clients have come across.

1. Article abstracts may not provide enough insights

While article abstracts provide some value, they offer a limited view of the underlying research in terms of both the volume and types of data available. Many researchers build their AI datasets using scientific article abstracts because they are easily accessible through databases such as PubMed. However, by relying solely on abstracts, researchers can miss detailed and sometimes hidden information, including descriptions of methods and protocols, tabular data, and complete study results.

Internal AI and machine learning projects can be significantly enriched with insights that are only found in the full text of scientific articles. Access to full-text content authorized for internal AI use, including detailed methodologies and complete results, helps ensure researchers do not miss vital data, discoveries, or scientific assertions that can improve model accuracy and outcomes.

2. Data quality issues when working with PDF-based scientific articles

When companies rely on journal subscriptions, scientific articles are often available only as PDFs. Researchers and data scientists must then spend time converting PDFs into XML (Extensible Markup Language), the preferred machine-readable format for AI and machine learning projects.

XML enables computers to parse and interpret content more effectively, but converting PDFs into XML requires additional tools and manual effort. This process is inefficient and can introduce errors, including loss of tables and figures, conflation of document sections into a single “blob” of text, and the introduction of bad characters and non-words. These issues create a real risk of missing or corrupted data.
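As a rough illustration of how a team might screen converted text for these problems, the Python sketch below flags two of them: bad characters and text that has collapsed into a single blob. The function name, the thresholds, and the heuristics are all assumptions made for this example, not part of any particular conversion tool.

```python
import re
import unicodedata


def conversion_warnings(text, max_bad_ratio=0.01, max_paragraph_chars=5000):
    """Return warning labels for text extracted from a PDF.

    Hypothetical helper: both thresholds are illustrative defaults,
    not values recommended by any tool or publisher.
    """
    if not text:
        return ["empty text"]

    warnings = []

    # Count Unicode replacement characters and control characters
    # other than ordinary whitespace (newline, carriage return, tab).
    bad = sum(
        1
        for ch in text
        if ch == "\ufffd"
        or (unicodedata.category(ch) == "Cc" and ch not in "\n\r\t")
    )
    if bad / len(text) > max_bad_ratio:
        warnings.append("bad characters")

    # Split on blank lines; one very long paragraph suggests that
    # document sections were conflated into a single run of text.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    if any(len(p) > max_paragraph_chars for p in paragraphs):
        warnings.append("possible text blob")

    return warnings
```

A check like this only catches symptoms. It cannot recover a table or figure that was lost during conversion.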

Original source XML from publishers, especially when it is normalized to a standard schema like JATS, typically delivers higher-quality results and is generally more reliable than converted PDFs. The old axiom applies here: bad data in, bad data out.
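To show why structured XML is easier to work with, here is a minimal Python sketch that reads a JATS-style article and keeps each section's paragraphs under its heading. The sample document and the extract_sections function are made up for this example. Real publisher XML is richer (nested sections, tables, figures, references) and would need more handling than shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified article using common JATS element names.
SAMPLE_JATS = """<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>A Hypothetical Study</article-title>
      </title-group>
      <abstract><p>Short summary.</p></abstract>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Methods</title>
      <p>Samples were prepared in triplicate.</p>
    </sec>
    <sec>
      <title>Results</title>
      <p>Yield improved.</p>
      <p>See the table.</p>
    </sec>
  </body>
</article>"""


def extract_sections(jats_xml):
    """Return (article title, {section heading: [paragraph text, ...]})."""
    root = ET.fromstring(jats_xml)

    title_el = root.find(".//article-title")
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""

    sections = {}
    # Only top-level sections of the body; nested <sec> elements are ignored here.
    for sec in root.findall("./body/sec"):
        heading_el = sec.find("title")
        heading = (
            "".join(heading_el.itertext()).strip()
            if heading_el is not None
            else "(untitled)"
        )
        sections[heading] = ["".join(p.itertext()).strip() for p in sec.findall("p")]

    return title, sections


title, sections = extract_sections(SAMPLE_JATS)
print(title)
for heading, paragraphs in sections.items():
    print(heading, len(paragraphs))
```

Because the section boundaries are explicit in the markup, the methods and results stay separate instead of merging into one block of text.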

3. Inconsistent licensing terms and fees

AI and machine learning projects vary widely in scale. Some require only a few dozen scientific articles, while others need to process hundreds of thousands or even millions.

To achieve meaningful results, many projects depend on access to a broad range of full-text scientific content. This can require organizations to work directly with multiple rightsholders and publishers, which can lead to inconsistent licensing terms, varying fee structures, and increased administrative overhead. Voluntary collective licensing options can help streamline this process by providing access across publishers and reducing the burden of individual negotiations. It is also important to involve the person or department responsible for managing subscriptions, typically a knowledge or information manager, as they often have insight into existing agreements and relationships that can help simplify licensing.

How can CCC help with AI-ready scientific content?

CCC helps organizations simplify access to scientific content for internal AI and machine learning workflows and tools through centralized delivery and streamlined licensing across publishers.

RightFind XML is part of CCC’s RightFind Suite, a set of integrated and flexible software solutions that help R&D teams search, discover, access, share, and analyze scientific content to generate data-driven insights and gain a competitive edge.