It’s no secret that information available in the digital ecosystem is rapidly expanding. Three million papers are published in scholarly journals annually, and that’s just one high-value content type for R&D-intensive organizations that may also need quick and easy synthesis of patent, clinical, and other types of content.

In the past decade, R&D-intensive companies have increasingly relied on text mining to glean important insights from vast amounts of published material. When we think about text mining, we’re really thinking about processing larger amounts of data. After all, there are natural limits to the qualitative analysis an individual researcher in a specific business function can do when dealing with an abundance of information.

In past years, we observed more of a project-based approach to text and data mining. There would be a specific ‘ask’ sponsored by some area of the business, and text mining would be applied to deliver a particular result, answer, or response through machine analysis. Although a text mining application or tool may have been used, it was ad hoc for the particular project, and its use often waned after project completion.

As time has gone on and much more data is being produced, there’s a broadening set of applications for text and data mining – including those where text mining is ‘baked in’ to an end-user information experience, and where text mining is applied as part of an ongoing data processing pipeline. Several factors are worth weighing as you look at the options for implementing text mining in an organization.

Based on the trends we’ve seen among clients, we’d recommend considering the following:

User Experience

The business problem and the user for whom that problem is being addressed are key inputs to whether and how text mining can be applied. For example, a day-to-day end user – such as a research scientist – will likely not want to engage with a heavy-duty text mining tool. However, even a simple end-user search and discovery tool can benefit from text mining under the hood, which can provide the user with intuitive auto-suggestion, synonym- and class-based searching, and other assistive capabilities. Conversely, a data scientist or text mining expert may need more granular control, or may seek programmatic methods, such as APIs, to interact with a text mining tool.
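To make the idea of text mining under the hood more concrete, here is a minimal Python sketch of synonym- and class-based search with prefix auto-suggestion. It is an illustration only, not how any particular product works: the tiny `SYNONYMS` and `CLASSES` vocabularies and the function names `expand_query`, `search`, and `suggest` are all hypothetical, and a real system would draw on curated ontologies rather than hand-written dictionaries.

```python
import re
from typing import Dict, List, Set

# Hypothetical, hand-made vocabulary for illustration only.
# A production system would load these from a curated ontology.
SYNONYMS: Dict[str, Set[str]] = {
    "aspirin": {"acetylsalicylic acid"},
}
CLASSES: Dict[str, Set[str]] = {
    "nsaid": {"aspirin", "ibuprofen"},
}


def expand_query(term: str) -> Set[str]:
    """Expand a search term with its class members and their synonyms."""
    term = term.strip().lower()
    expanded = {term}
    # Class-based expansion: a class name pulls in its member terms.
    expanded |= CLASSES.get(term, set())
    # Synonym-based expansion: add synonyms for every term collected so far.
    for known in list(expanded):
        expanded |= SYNONYMS.get(known, set())
    return expanded


def search(query: str, documents: List[str]) -> List[str]:
    """Return documents that mention the query or any expanded term."""
    patterns = [
        re.compile(r"\b" + re.escape(t) + r"\b") for t in expand_query(query)
    ]
    return [
        doc for doc in documents
        if any(p.search(doc.lower()) for p in patterns)
    ]


def suggest(prefix: str, limit: int = 5) -> List[str]:
    """Suggest known vocabulary terms that start with the typed prefix."""
    vocabulary = set(SYNONYMS) | set(CLASSES)
    for group in list(SYNONYMS.values()) + list(CLASSES.values()):
        vocabulary |= group
    prefix = prefix.strip().lower()
    return [t for t in sorted(vocabulary) if t.startswith(prefix)][:limit]
```

With this sketch, a user who searches for the class name "NSAID" would also retrieve a document that only mentions "acetylsalicylic acid", without ever needing to know that the expansion happened.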

Data Source and Dimensions of the Data

Which content or data you select, and in which formats, also depends on the business objective you are addressing. The dimensions of the data that bear on whether text mining is a good fit are volume, frequency, type, and format. For example, the volume of scientific literature is high, and the content is readily available in digital format with the appropriate rights to text mine. However, even if volume is low, inefficiencies in a manual process that can be addressed through automated machine analysis – such as expert review for pharmacovigilance – may point to a process that can benefit from text mining.

We know people across the organization, from early research to drug safety to competitive intelligence, are dealing with the proliferation of content and content formats. The main goal, then, is to leverage a lot of content in various formats without it becoming overwhelming. At CCC we’ve developed integrated solutions that make it simple to license, access, semantically enrich, and index full-text XML articles from a wide range of scientific publishers. Learn more about RightFind XML here.


Author: Mike Iarrobino

Mike Iarrobino is CCC's product manager for content and rights workflow solutions, RightFind® XML for Mining and RightFind Music. He has previously managed marketing technology and content discovery products at FreshAddress, Inc., and HCPro, Inc. He speaks at webinars and conferences on the topics of content discovery and data management, and loves to get into conversations about the nature of free will.