Fueling Your AI & Machine Learning Projects with XML Content

It is (perhaps) a truth universally acknowledged that a piece of knowledge possessed within a scholarly text must be in want of being discovered. So Jane Austen would never have written. It is, however, increasingly true that organizations globally are in search of that elusive element: knowledge.

Knowledge formed by connecting ideas and concepts 
Knowledge derived from the spaces left between articles and journals 
Knowledge that can power real change, or deliver deep understanding to your industry 
Knowledge that propels companies to make new discoveries and maintain competitive advantage

CCC supports many industries by making it simple to access scholarly content that powers their research and development and helps deliver innovation. According to a survey conducted in 2021 by Gradient Flow, “78 percent of respondents from healthcare organizations at a mature stage of AI adoption and 80 percent of respondents from organizations in an early stage stated that structured data … were used in their AI applications.” We are seeing similar trends among our corporate customers, who want to feed scholarly articles into machines to support their research.

Challenges

Text mining has been around for some years now, yet there are still considerable challenges to making effective use of the approach. 

Go wide, go deep

To increase confidence in what we learn from a dataset, we know from basic statistics and data analysis that we need larger datasets – and more reliable data. The more data we have, the more confident we can be in the knowledge we derive from it. However, for many, access to that breadth of data is not simple and requires an investment of time and effort to manage (see below). 
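As a rough illustration of that statistical point, the sketch below uses the standard normal approximation for the margin of error of a proportion. The scenario is purely hypothetical: we assume 30 percent of mined articles mention some finding of interest, and ask how tightly we can pin that figure down at different corpus sizes. The function name and the numbers are ours, chosen only for illustration.

```python
import math

def margin_of_error(p_hat: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion seen in n samples.

    Uses the normal approximation z * sqrt(p(1-p)/n); z=1.96 corresponds
    to a 95% confidence level.
    """
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

# Hypothetical corpus sizes: the margin shrinks with the square root of n.
for n in (100, 1_000, 10_000):
    print(f"{n:>6} articles: +/- {margin_of_error(0.3, n):.4f}")
```

Under these assumptions, mining 100 times more articles narrows the margin of error by a factor of 10, which is one simple way to see why breadth of content matters.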

And what of FOMO (fear of missing out)? This is no longer the province of social media junkies, but increasingly concerns researchers: what if I am missing something because I don’t have enough data to mine? What if that piece of information would change the way we view the output? What if…? This is FOMO that might have serious consequences in the life science space. To get the best results, and to reduce FOMO with consequences, researchers need access to a breadth of content.

Too much to do

A typical research organization has to reach out to many separate publishers to negotiate the rights to access, download, store, and mine scientific articles for its analyses. For the knowledge managers, this means multiple ways of paying for the content. Multiple ways of accessing the content. Handling multiple formats for the content. Managing multiple sets of rights for the content. 

In short, a never-ending list of things to manage and questions to investigate: 

Can I use this XML from this publisher in the same way as that XML from that publisher? 
What restrictions are there on how I use the content? 
Can I store the content? 
What can I do with the output of the mining exercise? 
What format is the content in? 
Do I have to write another ingestion script to get my mining tool to accept the XML?
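
To make the ingestion question concrete, here is a minimal sketch of what such a script tends to involve. Both XML layouts are invented for illustration (the first loosely resembles a JATS-style article, the second is entirely made up), and `LAYOUTS` and `normalize` are hypothetical names, not part of any CCC or publisher tooling. The point is simply that every new layout needs its own mapping before a mining tool can treat the articles alike.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-publisher layouts: where the title and paragraphs live.
LAYOUTS = {
    "publisher_a": {"title": ".//article-title", "paragraphs": ".//body//p"},
    "publisher_b": {"title": "./title", "paragraphs": "./text/para"},
}

def normalize(xml_text: str, publisher: str) -> dict:
    """Map one publisher-specific XML article onto a common record."""
    layout = LAYOUTS[publisher]
    root = ET.fromstring(xml_text)
    title_el = root.find(layout["title"])
    # itertext() flattens inline markup such as <italic> into plain text.
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""
    paragraphs = [
        "".join(p.itertext()).strip() for p in root.findall(layout["paragraphs"])
    ]
    return {"publisher": publisher, "title": title, "paragraphs": paragraphs}

# Two invented samples carrying the same article in different structures.
SAMPLE_A = """<article><front><article-meta><title-group>
<article-title>Kinase inhibitors in <italic>vivo</italic></article-title>
</title-group></article-meta></front>
<body><sec><p>First paragraph.</p><p>Second paragraph.</p></sec></body></article>"""

SAMPLE_B = """<doc><title>Kinase inhibitors in vivo</title>
<text><para>First paragraph.</para><para>Second paragraph.</para></text></doc>"""

record_a = normalize(SAMPLE_A, "publisher_a")
record_b = normalize(SAMPLE_B, "publisher_b")
print(record_a["title"] == record_b["title"])
print(record_a["paragraphs"] == record_b["paragraphs"])
```

Even in this toy case, each additional publisher means another entry in the mapping and another round of testing, which is the overhead the questions above are pointing at.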

This takes up a significant chunk of time and, as noted above, can often mean that breadth of content is compromised, potentially compromising the output. To support their research, organizations need to be able to access a lot of content easily, without having to manage multiple relationships that distract them. 

DIY?

As an alternative to sourcing XML, some organizations will see conversion of existing content to XML as a way forward. “I’ve subscribed to the journal,” they think, “therefore, I can include that article in my AI or machine learning project.” 

Conversion of PDFs to XML, or HTML, is fraught with difficulties. Not only must you have specialized tools to do this, but all too often the output is littered with mistaken characters, or ‘blobs’ of text that are not easily searchable. In an effort to get breadth (see above), organizations can find that they have compromised quality and therefore negatively impacted the output of their mining activities.

But, more worryingly, subscribing to a journal often does not give you the rights to use machines to mine it. In many cases, these rights are not automatically granted and need to be requested from the publisher separately and for extra fees. Companies that skip this additional step of requesting and negotiating rights leave themselves open to copyright infringement risk. As we’ve seen, that’s part of the long to-do list for knowledge managers.

To deliver high-quality output, organizations need to be sure that their input – the data – is high quality and compliant with copyright and licensing terms.

Flexibility is key

To address the challenges we hear about, we believe that offering flexible, normalized XML along with a uniform set of usage rights gives companies the best chance of overcoming them. Crucially, this allows those companies to use mining to power artificial intelligence and machine learning. This in turn helps deliver positive outcomes in many fields, not least healthcare and life sciences.

CCC has continuously developed its RightFind XML offering since its inception in 2016, and we now offer even more flexible ways to incorporate scientific articles from more than 50 publishers into AI and machine-learning initiatives.

RightFind XML’s search interface is flexible and usable on a project-by-project basis, allowing you to search across the whole database CCC manages and refine searches to get to the content you most need for your current projects. 
RightFind XML’s newer data feed approach offers companies the option to get larger amounts of data – journal articles going back many years – as a bulk data download with ongoing updates. This helps you get the breadth as well as the depth to power multiple projects and longer-term research needs. 
RightFind XML can also provide Open Access content enhanced by our team so that it can be ingested into your knowledge extraction tools more easily.

CCC has a mission to advance copyright, accelerate knowledge and power innovation. RightFind XML is a key element in supporting our customers to do the same. Let’s start the conversation and find out if CCC can help you get to the science faster. 
