Machine Learning: Understanding the Difference Between Unstructured/Structured Data

We know machine learning has the potential to transform the workflows of pharma and biotech organizations looking to turn content into smart data, improve patient safety and accelerate drug development. But for those of us who aren’t scientists and don’t work with machine learning on a regular basis, the concept can be confusing. What is machine learning, and how does it fit into our everyday processes?

In CCC’s Beyond the Book podcast, we spoke with Lee Harland, SciBite’s founder, about the role of humans in big data.

Listen to the podcast below, or check out our summary:

As a first step in the machine learning process, we need to assess our two data types: structured and unstructured. In the world of machine learning, unstructured data is not only critical, but also the more challenging piece of the puzzle.

Structured Data – Think of a Spreadsheet

When thinking about structured data, envision a spreadsheet. When a person looks at a spreadsheet that’s full of numbers or other data, he or she is typically able to understand the significance of the measurements by reading the data in the chart.

Computers, generally, can understand this data, too.
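To make that concrete, here is a minimal Python sketch of a computer working with spreadsheet-style data. The column names and values are invented for illustration and do not come from the podcast; the point is only that once data sits in named columns, a program can compute with it directly.

```python
import csv
import io

# Hypothetical spreadsheet contents, invented for illustration only.
SHEET = """compound,dose_mg,response
A,10,0.42
B,20,0.58
C,40,0.71
"""


def mean_column(csv_text: str, column: str) -> float:
    """Average the numeric values found in one named column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    values = [float(row[column]) for row in rows]
    return sum(values) / len(values)


if __name__ == "__main__":
    # The header row tells the program which number means what.
    print(mean_column(SHEET, "dose_mg"))
```

Because every value sits under a header, the program never has to guess what a number represents. That is the property a plain text document lacks.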

Unstructured Data – Think of a Text Document

For computers, understanding a text document is far more difficult than understanding a spreadsheet. Instead of being able to conceptualize what a word means, computers see strings of letters. This is an example of data that is unstructured.

“If a computer sees the letters M-O-U-S-E, it doesn’t know it means mouse, and it doesn’t know if that’s referring to an animal, to a rodent, or if it relates to any other document that mentions other types of rodent,” Lee explained.

At SciBite, scientists take this unstructured data and turn it into more structured information. When unstructured text data is presented in a structured way, the goal is for computers to be able to understand:

Aha! This document is about a mouse, a rodent!
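As a rough illustration of that idea, the Python sketch below tags words in a sentence using a tiny hand-made vocabulary. It is a toy under stated assumptions, not SciBite’s actual method: the `VOCAB` table, the `tag_entities` function and the sample sentence are all hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical mini-vocabulary: word as written -> (preferred label, category).
VOCAB = {
    "mouse": ("mouse", "rodent"),
    "mice": ("mouse", "rodent"),
    "rat": ("rat", "rodent"),
    "rats": ("rat", "rodent"),
}


def tag_entities(text: str) -> List[Dict[str, object]]:
    """Return one structured record for each vocabulary word found in text."""
    found = []
    for match in re.finditer(r"[A-Za-z]+", text):
        key = match.group().lower()
        if key in VOCAB:
            label, category = VOCAB[key]
            found.append({
                "text": match.group(),    # the letters as they appear
                "label": label,           # normalized name for the concept
                "category": category,     # the broader class it belongs to
                "start": match.start(),   # character offset in the document
            })
    return found


if __name__ == "__main__":
    doc = "The mouse was compared with two rats."
    for entity in tag_entities(doc):
        print(entity)
```

Each record ties the raw letters to a label and a category, so a document that mentions “mice” can be linked to one that mentions “rats” through the shared category. A simple lookup like this cannot tell an animal from a computer mouse, which is part of why the real problem is hard.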

Once the computer can understand this, it opens up the possibilities for “exciting stuff” that couldn’t be done with raw documents. (Here are a few examples of the “exciting stuff” machine learning is helping the industry accomplish.)

Don’t Forget Data Quality

When analyzing structured text, data quality is critical to the performance of machine learning algorithms. At SciBite, the mission is to solve what Lee describes as the “garbage in/garbage out problem.”

Today, there are several different approaches to taking raw documents and throwing them into machine learning algorithms. While this is a valid way forward, data quality will be better if you’re working with structured data.

“If you’re putting lots and lots of random data into machine learning [algorithms], it’s good, but it may not be that good,” Lee said. “Whereas, if you can go a little bit further and pretreat your data so that it’s a bit more structured, a bit more organized, and then feed that to these algorithms, we’ve seen time and time again with our customers that these algorithms start performing much better.”

Simply put: The quality of what you get out is directly related to the quality of what you put in.

Ready to Learn More? Check out:

White Paper: The Evolution & Importance of Biomedical Ontologies
Blog: Semantic Search vs. Keyword Search
Blog: What is Text Mining? And How is it Different from a Web Search?

RightFind® Insight, powered by the SciBite Platform, brings the power of semantic enrichment to the search and reading experience to turn information into knowledge and accelerate new discoveries. Learn more here.