If you are starting out on your FAIR (Findable, Accessible, Interoperable, and Reusable) data journey and are exploring better ways to search and analyze your data, then you have undoubtedly come across ontologies, as these are prerequisites to FAIR data. An ontology is a representation of knowledge within a given domain and provides a consensus view that is agreed upon by subject matter experts in the field. It describes the key concepts and how these are related to one another.
The heterogeneity and complexity of language in the life sciences have necessitated the development of a broad range of ontologies covering concepts such as genes, drugs, indications and adverse events. Key to data interoperability is the use of public identifiers within ontologies that ensure that individuals working in different organizations refer to the same concept when searching for content or recording experimental findings.
While ontologies provide a consensus view of knowledge in a given domain, aligning unstructured or semi-structured data, such as scientific literature or internal reports, to ontologies in order to create structured and harmonized data is not easy. The acceleration of data generation and the broadening scope of sources also make it challenging to maintain a comprehensive and up-to-date consensus. SciBite was built around lowering the barriers to implementing ontologies to structure data, and providing the expertise and tooling to do this effectively. SciBite’s ontology team collectively has over 100 years of experience in developing ontologies and using these to structure data. This expertise translates to accelerated implementation of ontologies for customers and partners, paving the way to robust search, analytics, and AI/ML.
SciBite’s VOCabs are integrated into CCC’s RightFind Navigate software, fueling search capabilities by enabling users to locate relevant data beyond just the specific words searched. This streamlines the research process by contextualizing information with the use of biomedical vocabularies and semantic search capabilities, driving faster discovery.
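To make the idea of searching beyond the specific words typed more concrete, here is a minimal Python sketch of concept-based retrieval. It is an illustration only, not how RightFind Navigate or SciBite's software is implemented: the synonym table, the `EX:` identifiers, the sample documents and the function names are all hypothetical, and matching is done by naive substring lookup.

```python
from collections import defaultdict

# Hypothetical synonym table: surface form -> concept identifier.
# The EX: identifiers are placeholders, not real ontology IDs.
SYNONYMS = {
    "heart attack": "EX:0002",
    "myocardial infarction": "EX:0002",
    "aspirin": "EX:0003",
    "acetylsalicylic acid": "EX:0003",
}

# Invented sample documents for illustration.
DOCUMENTS = {
    "doc1": "Aspirin reduced the risk of myocardial infarction.",
    "doc2": "Patients recovering from a heart attack were enrolled.",
    "doc3": "Acetylsalicylic acid dosing in healthy volunteers.",
}


def concepts_in(text):
    """Return the concept identifiers whose synonyms occur in the text."""
    lowered = text.lower()
    return {cid for term, cid in SYNONYMS.items() if term in lowered}


def build_index(documents):
    """Index documents by concept identifier rather than by literal word."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for cid in concepts_in(text):
            index[cid].add(doc_id)
    return index


def semantic_search(query, index):
    """Map the query to concepts, then return every document tagged with them."""
    results = set()
    for cid in concepts_in(query):
        results |= index.get(cid, set())
    return sorted(results)
```

In this sketch, a query for "heart attack" also retrieves the document that only mentions "myocardial infarction", because both phrases resolve to the same identifier. A plain keyword match would miss it.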
Ontology vs VOCab?
What is needed to take a public ontology and apply it to your unstructured data? SciBite goes from a public ontology to structured data aligned to that ontology via a process called Named Entity Recognition, or NER. NER relies on holding an ontology in memory, scanning through a document and identifying concepts such as drugs or indications. Once a concept is identified, a public identifier from the ontology is assigned, along with additional information contained within that ontology, such as synonyms, mappings to other ontologies and links to public databases.
Public ontologies are a rich and valuable resource but are often not ideally suited to NER, for a number of reasons. Firstly, the complexity and depth in terms of coverage and synonym support vary greatly between ontologies, with some ontologies, such as MedDRA, being very large and broad but completely lacking synonyms. Most public ontologies lack synonyms, because synonyms are needed mainly for NER, which is not the primary use case of public ontologies. Secondly, the context in which an entity occurs in a document is critical to the accuracy of entity recognition. For example, hedgehog could be a species or a gene, and the surrounding sentence is critical to tagging the concept accurately and therefore to the accuracy of downstream search or analytics.
SciBite’s NER-optimised version of a public ontology is named a VOCab, and SciBite’s ontology team undertake a number of automated and manual steps to go from public ontology to VOCab:
1. QC and validation of the underlying public ontology
Public ontologies are worked on by the community and are constantly evolving. As such, there are different versions of these public ontologies and they sometimes contain errors or inconsistencies. SciBite manages the versioning of public ontologies, performs QC and validation on them, corrects errors when they are identified, and feeds these back so they can be fixed in the source ontology.
2. Subsetting public ontologies into functional units that align to use cases
Some public ontologies can be very large, covering a broad range of concepts which may overlap with other ontologies. If these were implemented in their entirety, they would be computationally intensive and would cover concepts that are better served by more specialised ontologies. This is where SciBite’s expertise and awareness of public ontologies come into play: SciBite can advise customers and partners on which VOCabs to use for a particular use case.
3. Semi-automated synonym expansion
SciBite automatically adds a large number of additional synonyms to the source ontologies via rules, for example to support British vs American spelling, pluralization, differences in word ordering and common misspellings.
4. Manual, subject matter expert curation
A key differentiator for SciBite is our expertise in the life sciences. SciBite’s ontology team manually curate our VOCabs, adding synonyms above and beyond what is contained within the public ontologies. In addition, rules are created to consider the context in which an entity occurs, improving accuracy in the recognition process.
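The steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not SciBite's implementation: the vocabulary entries, the `EX:` identifiers, the single spelling rule, the context cue words and all function names are hypothetical. It shows rule-based synonym expansion (spelling and plural variants), dictionary lookup over a document, and a simple context rule for the ambiguous term "hedgehog".

```python
import re

# Hypothetical seed vocabulary: identifier -> synonyms (placeholder IDs).
SEED_VOCAB = {
    "EX:0001": ["tumor", "neoplasm"],
    "EX:0002": ["heart attack", "myocardial infarction"],
}

# Toy American -> British spelling rule; a real rule set would be far larger.
SPELLING_RULES = [("tumor", "tumour")]

# Ambiguous term -> candidate identifiers with context cue words (hypothetical).
CONTEXT_RULES = {
    "hedgehog": {
        "EX:GENE_HH": {"gene", "signaling", "pathway", "expression"},
        "EX:SPECIES_HEDGEHOG": {"animal", "habitat", "nocturnal", "species"},
    }
}


def expand(term):
    """Return the term plus its spelling and naive plural variants."""
    variants = {term}
    for american, british in SPELLING_RULES:
        if american in term:
            variants.add(term.replace(american, british))
    variants |= {v + "s" for v in variants if not v.endswith("s")}
    return variants


def build_lookup(vocab):
    """Map every expanded, lower-cased synonym to its concept identifier."""
    lookup = {}
    for concept_id, synonyms in vocab.items():
        for synonym in synonyms:
            for variant in expand(synonym):
                lookup[variant.lower()] = concept_id
    return lookup


def tag(text, lookup):
    """Return (matched text, identifier) pairs found in the text, in order."""
    # Longest terms first so multi-word synonyms win over shorter ones.
    terms = sorted(list(lookup) + list(CONTEXT_RULES), key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE
    )
    words = set(re.findall(r"[a-z]+", text.lower()))
    hits = []
    for match in pattern.finditer(text):
        surface = match.group(1)
        key = surface.lower()
        if key in lookup:
            hits.append((surface, lookup[key]))
            continue
        # Ambiguous term: choose the candidate whose cue words overlap the
        # text most, and skip the mention if no cue word is present.
        scores = {cid: len(cues & words) for cid, cues in CONTEXT_RULES[key].items()}
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            hits.append((surface, best))
    return hits
```

Calling `tag(text, build_lookup(SEED_VOCAB))` on a sentence that mentions "Tumours" and the "Hedgehog signaling pathway" would tag the British plural with the tumor identifier and resolve "Hedgehog" to the gene candidate. A sentence about a nocturnal animal would resolve to the species candidate instead. Leaving an ambiguous mention untagged when no cue word is present is a design choice of this sketch that favors precision over recall.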
VOCab packs
SciBite currently offers around 100 VOCabs, which are based on public ontologies, leveraging the unique identifiers but optimized for aligning unstructured data via named entity recognition (NER). These VOCabs are grouped together into complementary packs to support particular types of use cases. The VOCab packs SciBite currently offers are Core, Biopharma, Clinical, Genotype/Phenotype, Business Intelligence, Clinical Data Interchange Standards Consortium (CDISC), Allotrope, Chemistry Manufacturing and Controls (CMC), and Agrochemical.
What use cases does the Core VOCab pack support?
When customers think about using SciBite VOCabs for the first time, the logical starting point is the core VOCab pack. This is the largest of the SciBite VOCab packs and is also the broadest in terms of coverage of biomedical concepts. The core VOCab pack covers biomedical concepts that would support scientific literature review, including biological concepts such as anatomy, enzymes, proteins, genes, microRNA, and sex.
Related to human biology, the core VOCab pack also covers disease and pathogenesis, including indications, drugs, mechanisms of action, microbial species, and metabolites. To support the description of experimental procedures, the core VOCab pack covers concepts such as biochemicals, laboratory procedures, measures, and species in the case of experimental systems.
CCC’s RightFind Navigate incorporates over 20 million synonyms from SciBite’s biomedical vocabularies, and semantically enriches search results across a variety of connected data sources including PubMed, Europe PMC, NIH Clinical Trials, FDA@Drugs and over 40 additional 3rd party licensed data sources. With vocabularies applied in real-time, RightFind Navigate with Semantic Search provides a flexible, scalable, open ecosystem designed to maximize organizations’ return on their content and data investments.
One example of a use case that the core VOCab pack could support is the review of scientific literature to identify potential biomarkers associated with a particular indication. RightFind Navigate with Semantic Search consolidates diverse content types and leverages the VOCab pack to provide focused exploration and confidence without the need for extensive data science expertise.
Searching across content sources in RightFind Navigate, such as the RightFind catalog of scientific literature, Medline, or Elsevier’s ScienceDirect, using the SciBite Core VOCab pack would allow researchers to identify biomarkers (genes, proteins or miRNAs) associated with an indication or subtypes of an indication. This would allow researchers to identify which patients may respond to a particular therapeutic, supporting the design of a clinical trial or study. This information could also be used to assess the potential market size of a new therapeutic. A related potential use case would be the identification of biomarkers or pathways associated with the treatment itself, helping researchers to identify potential off-target effects and which pathways are being modulated.
The core VOCab pack could also be used to review the scientific literature for competitive intelligence, looking for therapies that target the same proteins or pathways or that may have a common mechanism of action. Reviewing an aggregated search of published scientific literature, patents, and clinical trials in RightFind Navigate would help provide intelligence on what is already on the market and what may be coming to market in the future, with the added value of SciBite VOCabs to contextualize search results for competitive intelligence workflows.
Early-stage researchers can leverage the core VOCab pack in RightFind Navigate during the drug discovery stage to identify assays or techniques that have been used to successfully evaluate similar compounds, saving the valuable time it would take to develop a new assay from scratch. SciBite ontologies, in conjunction with RightFind Navigate, can help early researchers get to the right information faster, and reduce the fear of missing critical data. Such searches could also help identify potential experts or key opinion leaders within the field. If the core VOCab pack is also used to search across internal documentation such as ELN data or SharePoint drives, it could be used to identify internal expertise on a particular assay or technique and even avoid duplication of data that already exists within the organization. And, for any of these activities occurring on a regular basis, RightFind Navigate alerts can be used to continuously monitor for new content, so researchers do not need to manually run the same searches over and over.
In conclusion
SciBite’s core VOCab pack integrated into RightFind Navigate provides a great entry point for adding ontologies and semantic search across a broad range of biological concepts. The extensive synonym support and contextualization mean that search accuracy and retrieval are dramatically improved compared to keyword search. The core VOCab pack can also be combined with more specialized VOCabs to support a broader range of use cases. We will cover some of the more specialized VOCabs in subsequent blog posts.