In recent years, the number of biomedical publications available in the literature has grown enormously, resulting in a huge source of new knowledge. However, the biomedical data behind this knowledge is typically hidden within unstructured text. Researchers have to undertake the time-consuming process of manually curating published articles to obtain a comprehensive overview of the state of the art on any topic. Moreover, the overwhelming amount of available scientific information makes it infeasible to manually track all new relevant discoveries, even on specialized topics. It also makes it harder for researchers to find connections between publications, patterns in the data, or correlations between facts.

As an example, we have seen how the COVID-19 pandemic disrupted science in 2020. According to Nature, around 4% of the world’s total research output was devoted to coronavirus in 2020. As of May 12th 2021, according to the Dimensions database, there have been a total of 427,347 publications related to COVID from 25,119 organizations in 197 countries. The entire world has faced a situation that required an urgent answer from science, and this has highlighted the need for tools that help scientists perform effective research.

Finding Content vs. Finding Answers 

The collection of the facts related to COVID-19 has been the key to fighting it; scientists needed to understand how SARS-CoV-2 spreads, how long the incubation period is, what the risk factors are and so on.

In summary, scientists needed answers to specific questions.

Biomedical literature search engines have traditionally been the starting point for a scientist beginning a new research activity. These search engines have remarkable capabilities, but they are still limited because they must cover multiple domains simultaneously, whereas retrieving results by contextual meaning rather than by keyword matching requires domain-specific content and vocabularies. For specialized topics, users often have to guess the keywords needed to find the answers they seek. Even if a user chooses the best keywords for the query and the desired answer appears in one of the first results, the user still has to dig through the content to find it. For instance, a user of a traditional biomedical literature search engine who is interested in the “list of COVID-19 symptoms” would get a long list of links to publications where a detailed answer can probably be found, but would not get a direct answer. Search engines that are based solely on an inverted index cannot make knowledge inferences or answer natural language questions.
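To make the limitation concrete, here is a minimal sketch of keyword search over an inverted index. The three documents, the tokenizer, and the function names are assumptions made up for illustration; production search engines add ranking, stemming, and much more.

```python
from collections import defaultdict

# Hypothetical mini-corpus; the texts are invented for illustration only.
DOCS = {
    "doc1": "Fever and cough are common COVID-19 symptoms.",
    "doc2": "The incubation period of SARS-CoV-2 varies between patients.",
    "doc3": "Fatigue is among the symptoms reported for COVID-19.",
}


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    return [tok.strip(".,;:!?\"'()") for tok in text.lower().split()]


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            if term:
                index[term].add(doc_id)
    return index


def keyword_search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = [t for t in tokenize(query) if t]
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


INDEX = build_inverted_index(DOCS)
hits = keyword_search(INDEX, "COVID-19 symptoms")
no_hits = keyword_search(INDEX, "list of COVID-19 symptoms")
```

In this sketch, `hits` contains the ids of the two documents that mention both terms, while `no_hits` is empty because no document contains the word “list”. Either way, the index returns document ids, never the list of symptoms itself.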

Question Answering (QA) systems represent an evolution of search engines, especially in domains where search intention is not navigational, and users expect to be able to express questions in natural language and obtain answers rather than search results.

Compared with information retrieval systems that typically return a list of relevant documents for the users to read, QA systems provide answers to users’ questions in a more straightforward and intuitive way. Continuing with our previous example, given the natural language query “list of COVID-19 symptoms”, a trained QA system would be able to collect the whole list of symptoms from different publications and present it to the user as a result.

In general terms, traditional Content-Based QA systems go beyond search engines by extracting facts from unstructured content, understanding a user’s question, finding the relevant context in the publications, and summarizing a possible answer:

  1. Relevant publication retrieval: by understanding the nature and structure of the user’s question, a QA system builds detailed queries to retrieve relevant publications where the answer to the question might be found.
  2. Context Candidates: the next stage highlights evidential fragments within the text that may provide answers to the question.
  3. Answer Ranking: the list of possible answers generated in the previous stage is ranked based on certain criteria.
  4. Final Answer Generation: finally, the system composes an answer in natural language. Sometimes the answer can be found explicitly in the content; otherwise it needs to be composed from the collected evidence, which requires abstractive and extractive summarization techniques.

In this way, given the question “What are the symptoms of COVID-19?”, the QA system will transform the question into a query to retrieve relevant paragraphs containing the answer or part of it. For example, it can collect fragments like “symptoms such as fever or cough, and signs such as oxygen saturation are the first and most readily available diagnostic information”. The system interprets and summarizes all retrieved fragments to compile a final and precise answer: “fever, fatigue, cough, shortness of breath among others”.
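The four stages can be sketched as a toy pipeline. This is only an illustration under strong assumptions: the passages (apart from the fragment quoted above), the stopword list, the fixed symptom vocabulary, and all function names are hypothetical, and a simple term lookup stands in for the ranking criteria and summarization techniques a real system would use.

```python
import re

# Hypothetical corpus: the first passage reuses the fragment quoted in the
# text; the other two are invented for illustration.
PASSAGES = [
    "Symptoms such as fever or cough, and signs such as oxygen saturation "
    "are the first and most readily available diagnostic information.",
    "The incubation period describes the time between exposure and onset.",
    "Reported symptoms include fatigue and shortness of breath.",
]

# Assumed vocabulary and stopwords; a real system would not hard-code these.
SYMPTOM_TERMS = ["fever", "fatigue", "cough", "shortness of breath"]
STOPWORDS = {"what", "are", "the", "of", "is", "a", "an", "how"}


def question_to_query(question):
    """Stage 1a: reduce the question to content-bearing keywords."""
    tokens = re.findall(r"[a-z0-9-]+", question.lower())
    return [t for t in tokens if t not in STOPWORDS]


def retrieve(query_terms, passages):
    """Stage 1b: keep passages that mention at least one query term."""
    return [p for p in passages if any(t in p.lower() for t in query_terms)]


def candidate_fragments(passages):
    """Stage 2: split passages into sentences as evidence candidates."""
    fragments = []
    for passage in passages:
        sentences = re.split(r"(?<=[.!?])\s+", passage)
        fragments.extend(s.strip() for s in sentences if s.strip())
    return fragments


def rank(fragments):
    """Stage 3: order candidates by the number of symptom terms they contain."""
    scored = [(sum(t in f.lower() for t in SYMPTOM_TERMS), f) for f in fragments]
    ordered = sorted(scored, key=lambda pair: -pair[0])
    return [f for score, f in ordered if score > 0]


def compose_answer(ranked):
    """Stage 4: merge the evidence into one short extractive answer."""
    found = [t for t in SYMPTOM_TERMS if any(t in f.lower() for f in ranked)]
    return ", ".join(found)


def answer(question):
    """Run the four stages in sequence."""
    passages = retrieve(question_to_query(question), PASSAGES)
    return compose_answer(rank(candidate_fragments(passages)))


result = answer("What are the symptoms of COVID-19?")
```

The design point is the separation of stages: retrieval narrows the corpus, candidate extraction and ranking isolate the evidence, and only the last step produces text for the user. Each stage could be replaced by a learned component without changing the others.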

The Era of Transfer Learning 

State-of-the-art QA systems are based on “transformers”. One of the main limitations of traditional QA systems is that they don’t usually perform well across multiple domains: generalized systems tend to have poor accuracy when applied to specific domains of knowledge, which makes modifications and fine-tuning necessary. The concept of transfer learning was first successfully applied in the computer vision field, where huge general-purpose pre-trained models, which required lengthy training with vast compute resources, were fine-tuned to learn new capabilities for a new scenario.

In recent years, pretrained transformer models have enabled a significant reduction in the effort required to perform several NLP tasks. During pretraining, the model learns an inner representation of natural language that can then be used to extract features useful for downstream tasks. The advantage of this approach is that a model trained on a large and sufficiently general dataset can serve as a generic starting point for more specific tasks.

One of the most famous pre-trained NLP models is BERT, a deep learning model developed and published by Google in 2018. BERT is a language representation model that has been applied to various NLP problems. Language models like BERT have proven to be highly effective in NLP tasks because they capture different facets of language, such as hierarchical relations and high-level semantic and syntactic meaning. During the fine-tuning phase, the representations learned by BERT are used as a starting point for problem-specific training. Following this strategy, researchers have trained BERT-based models to solve QA tasks using large training datasets like SQuAD, a collection of questions posed by crowdworkers on Wikipedia articles, where the answer to each question is a span of text from the corresponding passage. SQuAD is a valuable resource for training machine learning models to answer questions. Training a QA model on the SQuAD dataset gives the model the ability to supply short answers to general-knowledge questions like “What is Question Answering?”. These models perform well on questions and answers that are comparable in vocabulary, grammar, and structure to Wikipedia (or to general text media such as that found on the web), that is, content similar to what the model was originally trained on.

When these pre-trained models (like those trained on SQuAD) are applied to a more specialized domain that uses specific technical vocabularies and a different writing style, the accuracy of the answers is expected to suffer. This is where fine-tuning through transfer learning plays a role. If fine-tuning a pre-trained language model like BERT on the SQuAD dataset allowed the model to learn the task of question answering, then applying transfer learning a second time (fine-tuning on a specialized domain dataset) should give the model some knowledge of the specialized domain. This specialized dataset is usually composed of a collection of question/answer pairs annotated by experts on the target content.
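As a rough illustration of what one expert-annotated record might look like, the sketch below stores a passage, a question, the answer span, and its character offset, loosely following the SQuAD convention of answers that appear verbatim in the context. The class, the helper functions, and the sample sentence are hypothetical and are not taken from any real dataset or library.

```python
from dataclasses import dataclass


@dataclass
class QAExample:
    """One annotated example, loosely modeled on SQuAD-style records."""
    context: str       # passage taken from the target content
    question: str      # natural language question written by the annotator
    answer_text: str   # answer span copied verbatim from the context
    answer_start: int  # character offset of the span within the context


def make_example(context, question, answer_text):
    """Build an example, locating the answer span inside the context."""
    start = context.find(answer_text)
    if start < 0:
        raise ValueError("answer must appear verbatim in the context")
    return QAExample(context, question, answer_text, start)


def is_consistent(example):
    """Check that the stored offset really points at the answer text."""
    end = example.answer_start + len(example.answer_text)
    return example.context[example.answer_start:end] == example.answer_text


# Invented sample for illustration only.
example = make_example(
    context="Common symptoms of COVID-19 include fever, fatigue and cough.",
    question="What are the symptoms of COVID-19?",
    answer_text="fever, fatigue and cough",
)
```

Validating offsets up front matters because extractive QA models are trained to predict where the answer starts and ends in the passage; a record whose offset does not match its text would teach the model the wrong span.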


The information explosion is impacting every aspect of our lives and our businesses, including biomedical research. Our capacity to deal with the enormous (and still growing) volume of scientific information will determine how quickly we can answer current and future research questions. In this context, we envision that adopting transformer models pretrained on large corpora will result in very effective language models for domain-specific systems. In the near future, we should expect them to be applied in more and more NLP solutions.


Author: Rafa Haro

Rafa Haro works as Search Architect at CCC. He holds a graduate degree in Computer Science Engineering from the University of Seville and an MSc in Linguistic Technologies and Data Mining from UNED. Haro has more than 15 years of experience in the IT industry. He is an enthusiast of anything related to Semantic Technologies, Machine Learning, Natural Language Processing and Text Mining. Haro enjoys cycling, music, and spending time with his family and friends.