Overview

BLURB is a comprehensive benchmark for biomedical NLP, comprising 13 biomedical NLP datasets across 6 tasks (see the table below). Our aim is to facilitate research in biomedical natural language processing, with a particular focus on language model pretraining, and to help accelerate progress toward general-purpose biomedical NLP applications. The table below lists the datasets that make up BLURB, along with their train/dev/test sizes and evaluation metrics.
Dataset         Task                       Train     Dev    Test   Evaluation Metrics
BC5-chem        NER                         5203    5347    5385   F1 entity-level
BC5-disease     NER                         4182    4244    4424   F1 entity-level
NCBI-disease    NER                         5134     787     960   F1 entity-level
BC2GM           NER                        15197    3061    6325   F1 entity-level
JNLPBA          NER                        46750    4551    8662   F1 entity-level
EBM PICO        PICO                      339167   85321   16364   Macro F1 word-level
ChemProt        Relation Extraction        18035   11268   15745   Micro F1
DDI             Relation Extraction        25296    2496    5716   Micro F1
GAD             Relation Extraction         4261     535     534   Micro F1
BIOSSES         Sentence Similarity           64      16      20   Pearson
HoC             Document Classification     1295     186     371   Average Micro F1
PubMedQA        Question Answering           450      50     500   Accuracy
BioASQ          Question Answering           670      75     140   Accuracy
Datasets used in the BLURB biomedical NLP benchmark. We list the numbers of instances in train, dev, and test (e.g., entity mentions in NER and PICO elements in evidence-based medical information extraction).
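
For the NER datasets, "F1 entity-level" means a predicted mention counts as correct only if it exactly matches a gold mention. A minimal sketch of that computation, assuming mentions are represented as (start, end, type) spans (the official BLURB evaluation scripts may differ in details):

    def entity_f1(gold_spans, pred_spans):
        """Entity-level F1: a prediction is a true positive only if its
        (start, end, type) triple exactly matches a gold annotation."""
        gold, pred = set(gold_spans), set(pred_spans)
        tp = len(gold & pred)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    gold = [(0, 4, "GENE"), (10, 18, "GENE")]
    pred = [(0, 4, "GENE"), (11, 18, "GENE")]  # second span off by one: FP + FN
    print(entity_f1(gold, pred))  # 0.5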

NER

BioCreative II Gene Mention Recognition (BC2GM)
Description from the authors:
The BioCreative II GM task builds on the similar task from BioCreative I [3]. The training corpus for the current task consists mainly of the training and testing corpora (text collections) from the previous task, and the testing corpus for the current task consists of an additional 5,000 sentences that were held 'in reserve' from the previous task.
...
In the current corpus, tokenization is not provided; instead participants are asked to identify a gene mention in a sentence by giving its start and end characters. As before, the training set consists of a set of sentences, and for each sentence a set of gene mentions (GENE annotations).
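
Because mentions are given as character offsets rather than tokens, a common preprocessing step is to project the span annotations onto tokens as BIO tags. A rough sketch, assuming plain character offsets over the raw sentence (the official release uses its own offset convention, so treat this as illustrative):

    def spans_to_bio(sentence, spans):
        """Convert (start, end) character spans to per-token BIO tags.

        Tokens come from whitespace splitting; a token is tagged B/I
        if its character range falls inside an annotated span.
        """
        tags, pos = [], 0
        for token in sentence.split():
            start = sentence.index(token, pos)
            end = start + len(token)
            pos = end
            label = "O"
            for s, e in spans:
                if start >= s and end <= e:
                    label = "B-GENE" if start == s else "I-GENE"
                    break
            tags.append(label)
        return tags

    sentence = "BRCA1 mutations increase cancer risk"
    print(spans_to_bio(sentence, [(0, 5)]))  # ['B-GENE', 'O', 'O', 'O', 'O']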


BC5CDR Drug/Chemical (BC5-Chem)
Description from the authors:
The corpus consists of three separate sets of articles with diseases, chemicals and their relations annotated. The training (500 articles) and development (500 articles) sets were released to task participants in advance to support text-mining method development. The test set (500 articles) was used for final system performance evaluation.


BC5CDR Disease (BC5-Disease)
Description from the authors:
The corpus consists of three separate sets of articles with diseases, chemicals and their relations annotated. The training (500 articles) and development (500 articles) sets were released to task participants in advance to support text-mining method development. The test set (500 articles) was used for final system performance evaluation.


JNLPBA
Description from the authors:
The BioNLP / JNLPBA Shared Task 2004 involves the identification and classification of technical terms referring to concepts of interest to biologists in the domain of molecular biology. The task was organized by GENIA Project based on the annotations of the GENIA Term corpus (version 3.02).


NCBI Disease
Description from the authors:
[T]he NCBI disease corpus contains 6,892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier. We were able to link 91% of the mentions to a single disease concept, while the rest are described as a combination of concepts.

PICO

EBM PICO
Description from the authors:
EBM-NLP comprises ~5,000 medical abstracts describing clinical trials, multiply annotated in detail with respect to characteristics of the underlying trial Populations (e.g., diabetics), Interventions (insulin), Comparators (placebo) and Outcomes (blood glucose levels). Collectively, these key informational pieces are referred to as PICO elements; they form the basis for well-formed clinical questions (Huang et al., 2006).
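
In BLURB, EBM PICO is scored with word-level macro F1 over the PICO classes. A small sketch using scikit-learn; the label inventory shown here is illustrative, not the official tag set:

    from sklearn.metrics import f1_score

    # One label per word; macro F1 averages the per-class F1 scores
    # over the PICO classes ("O" words are excluded via `labels`).
    gold = ["POP", "POP", "O", "INT", "O", "OUT"]
    pred = ["POP", "O",   "O", "INT", "O", "OUT"]
    print(f1_score(gold, pred, labels=["POP", "INT", "OUT"], average="macro"))  # ~0.889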


Relation Extraction

ChemProt
Description from the authors:
ChemProt [is] a disease chemical biology database, which is based on a compilation of multiple chemical–protein annotation resources, as well as disease-associated protein–protein interactions (PPIs). We assembled more than 700,000 unique chemicals with biological annotation for 30,578 proteins. We gathered over 2 million chemical–protein interactions, which were integrated in a quality-scored human PPI network of 428,429 interactions.
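
In BLURB, ChemProt is cast as relation classification over candidate chemical–protein pairs, scored with micro F1. A hedged sketch under the common convention of excluding the negative "no relation" class (the CPR label names follow BioCreative VI, but treat the exact setup as an assumption):

    from sklearn.metrics import f1_score

    # Micro F1 over the positive relation classes only; candidate pairs
    # with no relation ("false") do not count toward the score.
    positive_labels = ["CPR:3", "CPR:4", "CPR:5", "CPR:6", "CPR:9"]
    gold = ["CPR:3", "false", "CPR:4", "CPR:9", "false"]
    pred = ["CPR:3", "CPR:4", "CPR:4", "false", "false"]
    print(f1_score(gold, pred, labels=positive_labels, average="micro"))  # ~0.667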


Drug-Drug Interaction (DDI)
Description from the authors:
[This task] focuses on extraction of drug-drug interactions. This shared task is designed to address the extraction of DDIs as a whole, but divided into two subtasks to allow separate evaluation of the performance for different aspects of the problem. The shared task includes two challenges:
  • Task 9.1: Recognition and classification of drug names.
  • Task 9.2: Extraction of drug-drug interactions.


Gene-Disease Associations (GAD)
Description from the authors:
GAD is an archive of published genetic association studies that provides a comprehensive, public, web-based repository of molecular, clinical and study parameters for >5,000 human genetic association studies at this time. This approach will allow the systematic analysis of complex common human genetic disease in the context of modern high-throughput assay systems and current annotated molecular nomenclature.

Sentence Similarity

BIOSSES
Description from the authors:
The dataset comprises 100 sentence pairs, in which each sentence was selected from the TAC (Text Analysis Conference) Biomedical Summarization Track Training Dataset containing articles from the biomedical domain. The TAC dataset consists of 20 reference articles, each with 12 to 20 citing articles. We selected the BIOSSES sentence pairs from citing sentences, i.e. sentences that have a citation to a reference article, instead of choosing random sentence pairs, the majority of which would be unrelated.
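
BIOSSES is scored by Pearson correlation between predicted similarity scores and the gold annotations (a 0-4 similarity scale). A minimal sketch with SciPy; the numbers are made up for illustration:

    from scipy.stats import pearsonr

    # Gold similarity ratings vs. model predictions (illustrative values).
    gold = [4.0, 2.6, 0.8, 3.2, 1.0]
    pred = [3.7, 2.9, 1.1, 3.5, 0.6]
    r, _ = pearsonr(gold, pred)
    print(f"Pearson r = {r:.3f}")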

Document Classification

HoC (Hallmarks of Cancer)
Description from the authors:
The Hallmarks of Cancer (HoC) corpus consists of 1852 PubMed publication abstracts manually annotated by experts according to the Hallmarks of Cancer taxonomy. The taxonomy consists of 37 classes in a hierarchy. Zero or more class labels are assigned to each sentence in the corpus.
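
Since zero or more labels can apply, HoC is a multi-label task. One plausible reading of the "Average Micro F1" metric is to compute F1 per document over binary label-indicator vectors and average across documents; treat this aggregation, and the 10-way top-level label set, as assumptions rather than the official protocol:

    from sklearn.metrics import f1_score

    # Binary indicator vectors over the 10 top-level hallmark classes
    # (the full taxonomy has 37 classes; values here are illustrative).
    gold = [[1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
            [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]]
    pred = [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]]

    # Score each document separately, then average across documents.
    scores = [f1_score(g, p) for g, p in zip(gold, pred)]
    print(sum(scores) / len(scores))  # ~0.833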


Question Answering (QA)

BioASQ
Description from the authors:
Task 7b will use benchmark datasets containing training and test biomedical questions, in English, along with gold standard (reference) answers. The participants will have to respond to each test question with relevant concepts (from designated terminologies and ontologies), relevant articles (in English, from designated article repositories), relevant snippets (from the relevant articles), relevant RDF triples (from designated ontologies), exact answers (e.g., named entities in the case of factoid questions) and 'ideal' answers (English paragraph-sized summaries). 2,747 training questions (that were used as dry-run or test questions in previous years) are already available, along with their gold standard answers (relevant concepts, articles, snippets, exact answers, summaries).

PubMedQA
Description from the authors:
We introduce PubMedQA, a novel biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts. PubMedQA has 1k expert-annotated, 61.2k unlabeled and 211.3k artificially generated QA instances. Each PubMedQA instance is composed of (1) a question which is either an existing research article title or derived from one, (2) a context which is the corresponding abstract without its conclusion, (3) a long answer, which is the conclusion of the abstract and, presumably, answers the research question, and (4) a yes/no/maybe answer which summarizes the conclusion.
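
The four components map naturally onto a small record type. A sketch of one way to represent an instance (field names are assumptions, not the official schema):

    from dataclasses import dataclass

    @dataclass
    class PubMedQAInstance:
        """One PubMedQA example (field names are illustrative)."""
        question: str     # an article title, or a question derived from one
        context: str      # the abstract, minus its conclusion
        long_answer: str  # the conclusion of the abstract
        answer: str       # "yes", "no", or "maybe"

    ex = PubMedQAInstance(
        question="Do preoperative statins reduce atrial fibrillation after "
                 "coronary artery bypass grafting?",
        context="...abstract text without the conclusion...",
        long_answer="...the abstract's conclusion...",
        answer="yes",  # gold label shown here is illustrative
    )
    print(ex.answer)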