In a recent paper, researchers from Korea University, Seoul, have proposed BioBERT, a pre-trained biomedical language representation model for biomedical text mining.
Emphasizing the lack of training data in biomedical fields required for training large deep neural network models, they propose a domain-specific language representation model pre-trained on large-scale biomedical corpora.
BioBERT is, in fact, a large pre-trained model that can be used in a transfer learning setting for biomedical text mining. According to the authors, BioBERT can be applied successfully to a wide variety of tasks with minimal task-specific modifications.
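To give a feel for what "minimal task-specific modifications" means in practice, here is a minimal sketch (not the authors' code) of loading one BioBERT encoder under three different task heads with the Hugging Face Transformers library. The checkpoint name is an assumption; substitute whichever released BioBERT weights you actually use.

```python
# A hypothetical sketch: the same pre-trained BioBERT encoder can back
# different downstream heads with only the output layer changing.
from transformers import (
    AutoModelForTokenClassification,     # biomedical named entity recognition
    AutoModelForSequenceClassification,  # relation extraction as sentence classification
    AutoModelForQuestionAnswering,       # extractive biomedical question answering
)

checkpoint = "dmis-lab/biobert-base-cased-v1.1"  # assumed checkpoint name

ner_model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=3)
re_model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
qa_model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)
```

Each head is newly initialized on top of the shared encoder weights, which is exactly the kind of small, task-specific addition the authors describe.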
They show that BioBERT significantly outperforms previous state-of-the-art models on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.51% absolute improvement), biomedical relation extraction (3.49% absolute improvement), and biomedical question answering (9.61% absolute improvement).
The success of the proposed BioBERT model comes from the fact that it has been pre-trained on large corpora of both general-domain and biomedical text. The researchers took an interesting approach that incorporates different combinations of general and biomedical corpora.
First, they initialized BioBERT with BERT weights pre-trained on English Wikipedia (2.5 billion words) and BooksCorpus (0.8 billion words). Then, they continued pre-training on PubMed abstracts (4.5 billion words) and PMC full-text articles (13.5 billion words). Finally, the pre-trained model was fine-tuned on various biomedical text mining tasks such as named entity recognition, relation extraction, and question answering.
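As a rough illustration of that final fine-tuning step, the sketch below (again, not the authors' TensorFlow code) runs one gradient step of biomedical NER framed as token classification, using PyTorch and the Hugging Face Transformers library. The checkpoint name, label set, and toy example are all assumptions made for illustration.

```python
# A minimal, hypothetical fine-tuning sketch for BioBERT on biomedical NER.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "dmis-lab/biobert-base-cased-v1.1"  # assumed checkpoint name
labels = ["O", "B-Disease", "I-Disease"]         # example BIO tag set

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

# One toy training example: word-level BIO labels for a disease mention.
words = ["Familial", "hemiplegic", "migraine", "is", "rare", "."]
word_labels = [1, 2, 2, 0, 0, 0]  # B-Disease, I-Disease, I-Disease, O, O, O

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Align word-level labels to subword tokens; special tokens get -100,
# which the loss function ignores.
aligned = [-100 if wid is None else word_labels[wid] for wid in enc.word_ids(batch_index=0)]
labels_tensor = torch.tensor([aligned])

# A single optimization step, just to show the shape of the fine-tuning loop.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
out = model(**enc, labels=labels_tensor)
out.loss.backward()
optimizer.step()
```

In a real setup this loop would run over a full labeled dataset (e.g. a biomedical NER corpus) for a few epochs, but the structure stays the same: encode, compute the task loss against aligned labels, and update all weights end to end.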
The full paper can be read here. A TensorFlow implementation of the BioBERT model is provided on GitHub.