If you find text mining to be a confusing concept, you’re not alone. The process of deriving high-quality information from text materials using software requires background knowledge before you can fully grasp how it works, and how it can benefit your team’s research and discovery efforts.

If you’re ready to beef up your text mining knowledge, here’s your crash course in the basics:

First, the definition: What is text mining?

Text mining is a process that derives high-quality information from text materials using software. It is used to extract assertions, facts and relationships from unstructured text (e.g., scholarly articles, internal documents, and more), and identify patterns or relations between items that would otherwise be difficult to discern.

Text analytics and semantic search are two concepts that are closely related to text mining.

Why? Text mining enables R&D teams to systematically and efficiently examine content to answer questions that ultimately guide business decisions and resource investments. The alternative would be to curate thousands of pieces of content (or more) manually. At today’s fast pace, this is unfeasible, unrealistic, and error-prone.

How? Text mining tools use software that applies natural language processing (NLP) algorithms to read and analyze text. There are two basic steps:

  1. The first step is identifying the entities an organization is interested in. In a biomedical setting, these might include genes, cell lines, proteins, small molecules, cellular processes, drugs, or diseases.
  2. The next step is analyzing the sentences in which those key entities appear, to determine how they are related. A relationship is a connection between at least two named entities; for example, the assertion that the gene BCL-2 is an independent predictor of breast cancer.
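To make the two steps concrete, here is a minimal sketch in Python. It is not how any particular text mining product works: the entity dictionary, the naive sentence splitter, and the function names are all assumptions made for illustration, and real tools rely on trained NLP models rather than simple term matching. The sketch finds known entities in each sentence (step 1), then pairs up entities that share a sentence as candidate relationships (step 2).

```python
import re
from itertools import combinations

# Hypothetical entity dictionary (an assumption for this sketch):
# surface term -> entity type.
ENTITY_TYPES = {
    "BCL-2": "gene",
    "breast cancer": "disease",
    "tamoxifen": "drug",
}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def find_entities(sentence):
    """Step 1: return the (term, type) pairs whose term occurs in the sentence."""
    found = []
    for term, entity_type in ENTITY_TYPES.items():
        # Match the term only when it is not embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, sentence, re.IGNORECASE):
            found.append((term, entity_type))
    return found

def candidate_relations(text):
    """Step 2: pair up entities that appear together in the same sentence."""
    relations = []
    for sentence in split_sentences(text):
        for (a, _), (b, _) in combinations(find_entities(sentence), 2):
            relations.append((a, b, sentence))
    return relations

sample = (
    "BCL-2 is an independent predictor of breast cancer. "
    "Tamoxifen was mentioned in a separate report. "
    "The weather was mild."
)
relations = candidate_relations(sample)
for a, b, sentence in relations:
    print(f"{a} <-> {b}: {sentence}")
```

Co-occurrence in a sentence only suggests that two entities may be related. Deciding what kind of relationship the sentence asserts is the harder part, and it is where the NLP analysis described above comes in.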

Text mining can uncover relationships that might not have been found otherwise, unlocking previously hidden information to help researchers:

  • Identify and develop new hypotheses
  • Attain knowledge and improve understanding
  • Discover links between diseases and existing drugs to find new therapeutic uses
  • Detect potential safety issues early

The results of these types of projects can provide a greater understanding of the underlying biology behind specific diseases, show how they respond to certain drugs and support the target discovery process.

What Format Is Used in Text Mining Software?

XML is the preferred format used in text mining software. XML is a markup language that encodes documents in a form computer programs can read easily, so that they can parse or display the content appropriately.
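As a rough illustration of why structured markup helps, the sketch below uses Python's standard `xml.etree.ElementTree` module to pull paragraph text out of a small XML document, keeping track of the section each paragraph came from. The element names (`article`, `sec`, `heading`, `p`) and the sample content are assumptions invented for this example, not a real publisher schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup; element names are illustrative only.
ARTICLE_XML = """
<article>
  <title>Example article</title>
  <body>
    <sec>
      <heading>Results</heading>
      <p>BCL-2 is an independent predictor of <i>breast cancer</i>.</p>
    </sec>
    <sec>
      <heading>Acknowledgements</heading>
      <p>The authors thank their colleagues.</p>
    </sec>
  </body>
</article>
"""

def extract_paragraphs(xml_text):
    """Return (section heading, paragraph text) pairs from the document."""
    root = ET.fromstring(xml_text.strip())
    paragraphs = []
    for sec in root.iter("sec"):
        heading = sec.findtext("heading", default="")
        for p in sec.findall("p"):
            # itertext() also collects text inside inline tags such as <i>.
            paragraphs.append((heading, "".join(p.itertext()).strip()))
    return paragraphs

paragraphs = extract_paragraphs(ARTICLE_XML)
for heading, text in paragraphs:
    print(f"[{heading}] {text}")
```

Because the section boundaries are explicit in the markup, a program can choose to mine the "Results" text and skip the acknowledgements, which is much harder to do reliably with unstructured plain text.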


How Does Text Mining Differ from a Web Search?

A typical web search may look like text mining, but there are stark differences. Search is the retrieval of documents or other results based on certain search terms. Search engines such as Google, Yahoo or Bing are commonly used to conduct these types of searches, and your organization may also use an enterprise search solution. The output is typically a hyperlink to text or information residing elsewhere, along with a small amount of text that describes what is found at the other end of the link. The purpose is to find the entire existing work so that its content can be used.

In text mining, the researcher looks to analyze text. The goal is to extract useful information, not solely to find, link to, and retrieve documents that contain specific facts. Unlike with search, the output of text mining varies depending on how the researcher wishes to apply the results.

Search functionality helps users find the specific document(s) they are looking for, whereas text mining goes well beyond search to find particular facts and assertions in the literature in order to derive new value.
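The difference in output can be sketched in a few lines of Python. Everything here is a toy built on assumptions: the two sample documents, the `search` and `mine` function names, and the single hand-written pattern are hypothetical, and real text mining software uses far richer language analysis than one regular expression. The point is only the shape of the results: search returns pointers to documents, while mining returns structured facts.

```python
import re

# Hypothetical mini-corpus for illustration.
DOCUMENTS = {
    "doc1": "BCL-2 is an independent predictor of breast cancer.",
    "doc2": "Our offices are closed on public holidays.",
}

def search(query, documents):
    """Search: return the ids of documents that contain the query string."""
    return [
        doc_id
        for doc_id, text in documents.items()
        if query.lower() in text.lower()
    ]

# One hand-written pattern of the form "X is a/an (...) predictor of Y".
PREDICTOR_PATTERN = re.compile(
    r"(?P<subject>[\w-]+) is an? (?:\w+ )?predictor of (?P<object>[\w ]+)"
)

def mine(documents):
    """Mining: return structured facts extracted from the document text."""
    facts = []
    for doc_id, text in documents.items():
        for match in PREDICTOR_PATTERN.finditer(text):
            facts.append({
                "subject": match.group("subject"),
                "relation": "predictor_of",
                "object": match.group("object").strip(),
                "source": doc_id,
            })
    return facts

hits = search("breast cancer", DOCUMENTS)
facts = mine(DOCUMENTS)
print("search result:", hits)
print("mined facts:", facts)
```

With search, the reader still has to open the matching document and read it. With mining, the extracted facts can be counted, filtered, or combined across thousands of documents.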


Author: Mike Iarrobino

Mike Iarrobino is CCC's product manager for content and rights workflow solutions RightFind® XML for Mining and RightFind Music. He has previously managed marketing technology and content discovery products at FreshAddress, Inc., and HCPro, Inc. He speaks at webinars and conferences on the topics of content discovery and data management, and loves to get into conversations about the nature of free will.