Linked Knowledge: A Data-Driven Approach to Life Sciences


In any knowledge area, content of interest can be found across many different data sources. We can think of a data source as a “bucket” of information items, where each item belongs to a particular category or type, according to the nature of the information it contains – journal articles, patents, trials/studies, chemical substances, etc. A bucket might contain one or several data categories and potentially thousands of items, which are often semantically related to but disconnected from items in other buckets. That is why we usually speak of distinct data sources as isolated “information silos.”

Linking Siloed Information with Searches

Metasearch engines allow us to search uniformly across sources of different kinds. Researchers build a single query, and the application takes care of retrieving results from all of those sources. In this way, metasearch engines bridge the gap between silos: a single search retrieves content that shares text terms, no matter which silo it belongs to. This is one possible way of linking items from different silos.

However, these “links” are sometimes weak and inaccurate, since sharing a text term does not necessarily mean that two documents have much in common or are both relevant to the query. A single word can have different meanings in different contexts, so this approach can produce ambiguous results. For example, ‘ALS’ can refer to the gene ‘acid-labile subunit’ or to the disease ‘amyotrophic lateral sclerosis’. Conversely, a strictly lexical retrieval approach will omit items that use different words (synonyms) to refer to the same concept. In other words, linking items by their literal terms means that unrelated items can appear related because they share an ambiguous word, while semantically related items can appear unrelated simply because they name the same concepts or topics with different words.
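Both failure modes can be reproduced with a few lines of Python. The sketch below is only an illustration: the three items, their ids and the keyword_search helper are invented for this example and do not describe how any particular metasearch engine works.

```python
import re

# Toy corpus: three items from hypothetical silos (all text invented for illustration).
DOCS = {
    "article-1": "Riluzole slows progression of amyotrophic lateral sclerosis.",
    "patent-7": "Assay measuring ALS, the acid-labile subunit of the IGF complex.",
    "trial-3": "Patients with ALS received riluzole for 12 months.",
}


def keyword_search(term, docs):
    """Return the ids of documents that contain `term` as a whole word (case-insensitive)."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return sorted(doc_id for doc_id, text in docs.items() if pattern.search(text))


if __name__ == "__main__":
    # Ambiguity: the abbreviation matches a disease item and a gene item alike.
    print(keyword_search("ALS", DOCS))
    # Synonymy: the full disease name misses the trial that only uses the abbreviation.
    print(keyword_search("amyotrophic lateral sclerosis", DOCS))
```

The first query links the patent to the trial even though one is about a gene and the other about a disease, while the second query fails to link the article to the trial even though both are about the same disease.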

Semantics to the Rescue

Named entity recognition (NER) engines are systems that detect semantic entities in unstructured text. Because they are specialized in particular knowledge areas, they can apply domain-specific heuristics to overcome the ambiguity and synonymy problems just described. In essence, these systems enrich the content with semantic tags so we can see, at a glance, the concrete topics covered by a document.

For a researcher, having such powerful help while reviewing content is incredibly valuable. But once again automation makes the difference: those semantic tags are useful not only for humans but also for advanced engines like RightFind Navigate, because semantic or entity-based searches can be much more powerful and accurate than text-based searches. Semantic tags improve not only how we reach content but also how we link it. By taking advantage of topic co-occurrence, that is, the same tagged topics appearing together in items from different sources, we cross the borders between data silos.
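To make the idea concrete, here is a deliberately tiny dictionary-based tagger followed by an entity-based search. It is a sketch under strong assumptions: the lexicon, the concept ids, the context cues and the function names are all hypothetical, and production NER engines (including whatever RightFind Navigate uses) rely on far richer vocabularies and heuristics.

```python
import re

# Hypothetical mini-lexicon: surface form -> candidate concept ids.
LEXICON = {
    "amyotrophic lateral sclerosis": ["DISEASE:ALS"],
    "acid-labile subunit": ["GENE:IGFALS"],
    "als": ["DISEASE:ALS", "GENE:IGFALS"],  # ambiguous abbreviation
    "riluzole": ["DRUG:riluzole"],
}

# Hypothetical context cues used to resolve ambiguous surface forms.
CONTEXT_CUES = {
    "DISEASE:ALS": {"patients", "disease", "sclerosis", "progression"},
    "GENE:IGFALS": {"gene", "subunit", "igf", "expression"},
}

DOCS = {
    "article-1": "Riluzole slows progression of amyotrophic lateral sclerosis.",
    "patent-7": "Assay measuring ALS, the acid-labile subunit of the IGF complex.",
    "trial-3": "Patients with ALS received riluzole for 12 months.",
}


def tag_entities(text):
    """Return the set of concept ids detected in `text`."""
    lowered = text.lower()
    words = set(re.findall(r"[a-z0-9]+", lowered))
    tags = set()
    for surface, candidates in LEXICON.items():
        if not re.search(r"\b" + re.escape(surface) + r"\b", lowered):
            continue
        if len(candidates) == 1:
            tags.add(candidates[0])
            continue
        # Ambiguous form: keep the candidate with the most context cues in the text.
        best = max(candidates, key=lambda c: len(CONTEXT_CUES[c] & words))
        if CONTEXT_CUES[best] & words:  # leave the form untagged when no cue is present
            tags.add(best)
    return tags


def entity_search(concept, docs):
    """Return the ids of documents tagged with `concept`, whatever wording they use."""
    return sorted(doc_id for doc_id, text in docs.items() if concept in tag_entities(text))


if __name__ == "__main__":
    for doc_id, text in DOCS.items():
        print(doc_id, sorted(tag_entities(text)))
    print(entity_search("DISEASE:ALS", DOCS))
```

Searching by concept id now returns the article and the trial together and leaves out the gene patent, which is the behavior a purely text-based search could not deliver on the same three items.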

Refinement by Searching, Discovery by Navigation

At some stages of research, there may already be enough context to choose new search terms well and refine the results. However, searching is sometimes not enough. When a step forward is needed, the only way is to think outside the box and include factors that have not been on our radar. But how can we include in a search the terms needed to retrieve related content if those terms are still unknown?

At that point, discovery becomes essential. We need our information system, as RightFind Navigate does, to extract from our searches correlated topics that are not evident and to suggest them. Linked topics naturally build a knowledge graph, and new content can be discovered by navigating through it, without searching explicitly.
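A minimal sketch of such a graph, assuming the items have already been tagged with topics, could look like the following. The tagged items, the topic names and both helper functions are hypothetical; the edge weight is simply the number of items in which two topics appear together, which is one of many possible choices.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical output of a tagging step: item id -> set of topics found in it.
TAGGED_ITEMS = {
    "article-1": {"riluzole", "amyotrophic lateral sclerosis", "glutamate"},
    "patent-7": {"riluzole", "glutamate"},
    "trial-3": {"amyotrophic lateral sclerosis", "riluzole", "SOD1"},
    "article-9": {"SOD1", "amyotrophic lateral sclerosis"},
}


def build_cooccurrence_graph(tagged_items):
    """Build an undirected graph whose edge weights count the items shared by two topics."""
    graph = defaultdict(Counter)
    for topics in tagged_items.values():
        for a, b in combinations(sorted(topics), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def suggest_topics(graph, topic, limit=3):
    """Return up to `limit` neighbours of `topic`, strongest links first (ties by name)."""
    neighbours = graph.get(topic, {})
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


if __name__ == "__main__":
    graph = build_cooccurrence_graph(TAGGED_ITEMS)
    # Topics the researcher never typed, surfaced by navigating from a known one.
    print(suggest_topics(graph, "amyotrophic lateral sclerosis"))
```

Starting from a single known topic, the neighbours in the graph are candidate terms the researcher did not have to know in advance, and each of them can be the starting point of the next navigation step.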

Answering Questions

Leveraging topic correlation is an interesting path to extract knowledge from dynamic content. In particular, co-occurrence links are able to detect relations between topics in literature without any manual curation or previous knowledge base. They are even able to find unknown relations in new documents which have not yet been captured by ontologies.

A step further would be to detect strong co-occurrence links and rank them higher. For instance, a link between two topics that appear in the same sentence, joined by a verb, and that recurs in sentences from several sources should be considered a strong link, for example “Drug X (other terms in the same sentence) Disease Y”.
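One simple way to approximate this ranking is to count, for each pair of topics, the number of distinct sources in which both appear within a single sentence. The sketch below makes several simplifying assumptions: the texts and topic list are invented, sentences are split with a naive regular expression, topics are matched as plain substrings, and the "joined by a verb" condition is not checked because that would require a part-of-speech tagger.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical topic list and source texts, invented for illustration.
TOPICS = ["riluzole", "amyotrophic lateral sclerosis", "glutamate"]

SOURCES = {
    "article-1": "Riluzole slows amyotrophic lateral sclerosis. Glutamate was also measured.",
    "trial-3": "Riluzole extended survival in amyotrophic lateral sclerosis. No other drug was tested.",
    "patent-7": "Riluzole is described here. Glutamate release is inhibited.",
}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_link_support(sources, topics):
    """Map each topic pair to the set of sources where both occur in the same sentence."""
    support = defaultdict(set)
    for source_id, text in sources.items():
        for sentence in split_sentences(text):
            lowered = sentence.lower()
            present = sorted(t for t in topics if t in lowered)
            for pair in combinations(present, 2):
                support[pair].add(source_id)
    return support


def strong_links(support, min_sources=2):
    """Keep only the pairs backed by at least `min_sources` distinct sources."""
    return sorted(pair for pair, srcs in support.items() if len(srcs) >= min_sources)


if __name__ == "__main__":
    support = sentence_link_support(SOURCES, TOPICS)
    print(strong_links(support))
```

In this toy data, riluzole and glutamate co-occur in two documents but never in the same sentence, so they do not qualify as a strong link, whereas riluzole and the disease share a sentence in two independent sources and do.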

Moreover, a system could detect directed and semantic links between topics; for example, “Drug X is indicated for Disease Y” adds semantic knowledge on top of “Drug X (other terms) Disease Y”. These semantic predications, expressed as subject-predicate-object triples (patterns detected by systems like SemRep), are reliable and meaningful links which might enable a system to provide answers to more complex questions based on a directed graph.
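As a rough illustration of how a directed graph of triples supports question answering, the sketch below indexes a handful of hand-written triples and answers two kinds of questions. Everything here is assumed for the example: the triples, the predicate names and the helper functions are hypothetical and do not reproduce the output format or vocabulary of SemRep or any other extraction system.

```python
from collections import defaultdict

# Hypothetical, hand-written triples: (subject, predicate, object).
TRIPLES = [
    ("Drug X", "is_indicated_for", "Disease Y"),
    ("Drug X", "inhibits", "Protein P"),
    ("Protein P", "is_associated_with", "Disease Z"),
    ("Drug W", "is_indicated_for", "Disease Y"),
]


def build_index(triples):
    """Index the directed graph by subject (outgoing edges) and by object (incoming edges)."""
    by_subject = defaultdict(list)
    by_object = defaultdict(list)
    for subject, predicate, obj in triples:
        by_subject[subject].append((predicate, obj))
        by_object[obj].append((subject, predicate))
    return by_subject, by_object


def subjects_of(by_object, predicate, obj):
    """Answer 'which subjects have `predicate` pointing at `obj`?'."""
    return sorted(s for s, p in by_object.get(obj, []) if p == predicate)


def two_hop_paths(by_subject, start):
    """Follow two directed edges from `start` to surface indirect relations."""
    paths = []
    for first_predicate, middle in by_subject.get(start, []):
        for second_predicate, end in by_subject.get(middle, []):
            paths.append((start, first_predicate, middle, second_predicate, end))
    return paths


if __name__ == "__main__":
    by_subject, by_object = build_index(TRIPLES)
    # "Which drugs are indicated for Disease Y?"
    print(subjects_of(by_object, "is_indicated_for", "Disease Y"))
    # "What might Drug X be indirectly related to?"
    print(two_hop_paths(by_subject, "Drug X"))
```

Because the edges carry a direction and a meaning, the first question is answered by a simple lookup, and the second follows a chain of predicates that a plain co-occurrence link could not express.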

Automation is No Longer a Luxury

Not only are the amount and complexity of life sciences content growing exponentially, but so is the pressure on R&D teams to deliver better solutions faster and at lower cost. Therefore, data-dependent research processes increasingly rely on software systems that automate much of the analysis work that researchers used to tackle manually.

Author: Christian Marmolejo

Christian Marmolejo is a Technical Lead at Copyright Clearance Center. A computer engineer and AWS Associate Solutions Architect, Marmolejo previously worked in network and cloud automation and security before moving into information retrieval. Outside of the office, he is a nature enthusiast and loves being in the wild and diving.