Overview
BLURB is a comprehensive benchmark for biomedical NLP, comprising 13 datasets across 6 tasks (see the table below). Our aim is to facilitate investigations of biomedical natural language processing, with a specific focus on language model pretraining, and to help accelerate progress in universal biomedical NLP applications. The table below compares the datasets comprising BLURB with the various datasets used in previous biomedical and clinical BERT language models.

Dataset | Task | Train | Dev | Test | Evaluation Metrics |
---|---|---|---|---|---|
BC5-chem | NER | 5203 | 5347 | 5385 | F1 entity-level |
BC5-disease | NER | 4182 | 4244 | 4424 | F1 entity-level |
NCBI-disease | NER | 5134 | 787 | 960 | F1 entity-level |
BC2GM | NER | 15197 | 3061 | 6325 | F1 entity-level |
JNLPBA | NER | 46750 | 4551 | 8662 | F1 entity-level |
EBM PICO | PICO | 339167 | 85321 | 16364 | Macro F1 word-level |
ChemProt | Relation Extraction | 18035 | 11268 | 15745 | Micro F1 |
DDI | Relation Extraction | 25296 | 2496 | 5716 | Micro F1 |
GAD | Relation Extraction | 4261 | 535 | 534 | Micro F1 |
BIOSSES | Sentence Similarity | 64 | 16 | 20 | Pearson |
HoC | Document Classification | 1295 | 186 | 371 | Average Micro F1 |
PubMedQA | Question Answering | 450 | 50 | 500 | Accuracy |
BioASQ | Question Answering | 670 | 75 | 140 | Accuracy |
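
The "F1 entity-level" metric listed for the NER datasets can be illustrated with a minimal sketch. It assumes BIO-tagged token sequences and exact-match scoring of (start, end, type) spans; the function names are hypothetical, and the official BLURB evaluation may differ in detail.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int, str]  # (start_token, end_token_exclusive, entity_type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence.

    Assumption: an I- tag that does not continue an open entity of the
    same type is ignored rather than treated as the start of an entity.
    """
    spans: Set[Span] = set()
    start: Optional[int] = None
    etype: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            spans.add((start, i, etype))
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        else:
            start, etype = None, None
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Entity-level F1: a predicted span counts only on exact match."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


gold = [["B-Chemical", "I-Chemical", "O", "B-Chemical"]]
pred = [["B-Chemical", "O", "O", "B-Chemical"]]  # first span has a wrong boundary
print(entity_f1(gold, pred))
```

Because scoring is per entity rather than per token, the partially overlapping first span earns no credit here.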
NER
BioCreative II Gene Mention Recognition (BC2GM)
Description from the authors:
The BioCreative II GM task builds on the similar task from BioCreative I [3].
The training corpus for the current task consists mainly of the training and testing corpora (text
collections) from the previous task, and the testing corpus for the current task consists of an
additional 5,000 sentences that were held 'in reserve' from the previous task.
...
In the current corpus, tokenization is not provided; instead participants are asked to identify a gene mention in a sentence by giving its start and end characters. As before, the training set consists of a set of sentences, and for each sentence a set of gene mentions (GENE annotations).
Links:
Dataset Home
Related Papers:
Overview of BioCreative II gene mention recognition
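The BC2GM description above says that gene mentions are identified by start and end characters rather than by tokens. A minimal sketch of that representation follows. It assumes plain inclusive character offsets into the raw sentence; the released annotation files may define offsets differently (for example, by not counting whitespace), so check the corpus documentation. The class and function names, and the example sentence, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneMention:
    start: int  # index of the first character of the mention (inclusive)
    end: int    # index of the last character of the mention (inclusive)


def mention_text(sentence: str, mention: GeneMention) -> str:
    """Return the surface string covered by a mention."""
    return sentence[mention.start : mention.end + 1]


def find_mention(sentence: str, surface: str) -> GeneMention:
    """Locate the first occurrence of `surface` and return its offsets."""
    start = sentence.find(surface)
    if start < 0:
        raise ValueError(f"{surface!r} not found in sentence")
    return GeneMention(start, start + len(surface) - 1)


sentence = "Mutations in BRCA1 increase cancer risk."
mention = find_mention(sentence, "BRCA1")
print(mention, mention_text(sentence, mention))
```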
BC5CDR Drug/Chemical (BC5-Chem)
Description from the authors:
The corpus consists of three separate sets of articles with diseases, chemicals and their
relations annotated. The training (500 articles) and development (500 articles) sets were
released to task participants in advance to support text-mining method development. The
test set (500 articles) was used for final system performance evaluation.
Links:
Dataset Home
Related Papers:
BioCreative V CDR task corpus: a resource for chemical disease relation extraction
BC5CDR Disease (BC5-Disease)
Description from the authors:
The corpus consists of three separate sets of articles with diseases, chemicals and their
relations annotated. The training (500 articles) and development (500 articles) sets were
released to task participants in advance to support text-mining method development. The
test set (500 articles) was used for final system performance evaluation.
Links:
Dataset Home
Related Papers:
BioCreative V CDR task corpus: a resource for chemical disease relation extraction
JNLPBA
Description from the authors:
The BioNLP / JNLPBA Shared Task 2004 involves the identification and classification of
technical terms referring to concepts of interest to biologists in the domain of
molecular biology. The task was organized by GENIA Project based on the annotations of
the GENIA Term corpus (version 3.02).
Links:
Dataset Home
Related Papers:
Introduction to the Bio-entity Recognition Task at JNLPBA
NCBI Disease
Description from the authors:
[T]he NCBI disease corpus contains 6,892 disease mentions, which are mapped to 790 unique
disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM
identifier. We were able to link 91% of the mentions to a single disease concept, while the
rest are described as a combination of concepts.
Links:
Dataset Home
Related Papers:
NCBI Disease Corpus: A Resource for Disease Name Recognition and Concept Normalization
PICO
EBM PICO
Description from the authors:
EBM-NLP comprises ~5,000 medical abstracts describing clinical trials, multiply annotated
in detail with respect to characteristics of the underlying trial Populations (e.g.,
diabetics), Interventions (insulin), Comparators (placebo) and Outcomes (blood glucose
levels). Collectively, these key informational pieces are referred to as PICO elements;
they form the basis for well-formed clinical questions
(Huang et al., 2006).
Links:
Dataset Home
Related Papers:
A Corpus with Multi-Level Annotations of Patients, Interventions and Outcomes to Support Language Processing for Medical Literature
Relation Extraction
ChemProt
Description from the authors:
ChemProt [is] a disease chemical biology database, which is based on a compilation of
multiple chemical–protein annotation resources, as well as disease-associated
protein–protein interactions (PPIs). We assembled more than 700 000 unique chemicals
with biological annotation for 30 578 proteins. We gathered over 2-million
chemical–protein interactions, which were integrated in a quality scored human PPI
network of 428 429 interactions.
Links:
Dataset Home
Related Papers:
ChemProt: a disease chemical biology database
Drug-Drug Interaction (DDI)
Description from the authors:
[This task] focuses on extraction of drug-drug interactions. This shared task is designed
to address the extraction of DDIs as a whole, but divided into two subtasks to allow
separate evaluation of the performance for different aspects of the problem. The shared
task includes two challenges:
- Task 9.1: Recognition and classification of drug names.
- Task 9.2: Extraction of drug-drug interactions.
Links:
Dataset Home
Related Papers:
The DDI corpus: An annotated corpus with pharmacological substances and drug–drug interactions
Gene-Disease Associations (GAD)
Description from the authors:
GAD is an archive of published genetic association studies that provides a comprehensive,
public, web-based repository of molecular, clinical and study parameters for >5,000 human
genetic association studies at this time. This approach will allow the systematic
analysis of complex common human genetic disease in the context of modern high-throughput
assay systems and current annotated molecular nomenclature.
Links:
Dataset Home
Related Papers:
The Genetic Association Database
Sentence Similarity
BIOSSES
Description from the authors:
The dataset comprises 100 sentence pairs, in which each sentence was selected from the
TAC (Text Analysis Conference) Biomedical Summarization Track Training Dataset
containing articles from the biomedical domain. TAC dataset consists of 20 articles
(reference articles) and citing articles that vary from 12 to 20 for each of the
reference articles. We selected the BIOSSES sentence pairs from citing sentences, i.e.
sentences that have a citation to a reference article, instead of choosing random
sentence pairs, majority of which would be unrelated.
Links:
Dataset Home
Related Papers:
BIOSSES: a semantic sentence similarity estimation system for the biomedical domain
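BIOSSES is scored with Pearson correlation between predicted and annotated similarity scores. The sketch below computes that coefficient in plain Python; the function name and the example scores are hypothetical, and a library routine such as `scipy.stats.pearsonr` would normally be used instead.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up similarity scores for five sentence pairs.
annotated = [0.0, 1.0, 2.0, 3.0, 4.0]
predicted = [0.5, 1.0, 2.5, 2.5, 3.5]
print(round(pearson(annotated, predicted), 3))
```

With only 20 test pairs in the split listed in the table, a single badly scored pair can move this coefficient noticeably.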
Document Classification
HoC (Hallmarks of Cancer)
Description from the authors:
The Hallmarks of Cancer (HoC) corpus consists of 1852 PubMed publication abstracts manually
annotated by experts according to the Hallmarks of Cancer taxonomy. The taxonomy consists of
37 classes in a hierarchy. Zero or more class labels are assigned to each sentence in the
corpus.
Links:
Dataset Home
Related Papers:
Automatic semantic classification of scientific literature according to the hallmarks of cancer
Cancer Hallmarks Analytics Tool (CHAT): a text mining approach to organize and evaluate scientific literature on cancer
Cancer hallmark text classification using convolutional neural networks
Initializing neural networks for hierarchical multi-label text classification
Question Answering (QA)
BioASQ
Description from the authors:
Task 7b will use benchmark datasets containing training and test biomedical questions,
in English, along with gold standard (reference) answers. The participants will have to
respond to each test question with relevant concepts (from designated terminologies and
ontologies), relevant articles (in English, from designated article repositories),
relevant snippets (from the relevant articles), relevant RDF triples (from designated
ontologies), exact answers (e.g., named entities in the case of factoid questions) and
'ideal' answers (English paragraph-sized summaries). 2747 training questions (that were
used as dry-run or test questions in previous year) are already available, along with
their gold standard answers (relevant concepts, articles, snippets, exact answers, summaries).
Links:
Dataset Home
Related Papers:
Automatic semantic classification of scientific literature according to the hallmarks of cancer
PubMedQA
Description from the authors:
We introduce PubMedQA, a novel biomedical question answering (QA) dataset collected from
PubMed abstracts. The task of PubMedQA is to answer research questions with yes/no/maybe
(e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass
grafting?) using the corresponding abstracts. PubMedQA has 1k expert-annotated, 61.2k
unlabeled and 211.3k artificially generated QA instances. Each PubMedQA instance is
composed of (1) a question which is either an existing research article title or derived
from one, (2) a context which is the corresponding abstract without its conclusion, (3)
a long answer, which is the conclusion of the abstract and, presumably, answers the
research question, and (4) a yes/no/maybe answer which summarizes the conclusion.
Links:
Dataset Home
Related Papers:
PubMedQA: A Dataset for Biomedical Research Question Answering
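The four-part PubMedQA instance described above, together with the accuracy metric from the overview table, can be sketched as follows. The class and field names are hypothetical and do not reflect the dataset's actual file schema; the context, long answer, and label values are placeholders, not taken from the dataset.

```python
from dataclasses import dataclass
from typing import Sequence

LABELS = ("yes", "no", "maybe")


@dataclass
class PubMedQAInstance:
    question: str     # (1) research article title or a question derived from it
    context: str      # (2) abstract without its conclusion
    long_answer: str  # (3) conclusion of the abstract
    label: str        # (4) yes/no/maybe summary of the conclusion

    def __post_init__(self) -> None:
        if self.label not in LABELS:
            raise ValueError(f"label must be one of {LABELS}")


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of instances whose predicted label equals the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


example = PubMedQAInstance(
    question=(
        "Do preoperative statins reduce atrial fibrillation "
        "after coronary artery bypass grafting?"
    ),
    context="<abstract text without its conclusion>",  # placeholder
    long_answer="<conclusion of the abstract>",        # placeholder
    label="yes",                                       # illustrative only
)
print(accuracy(["yes", "no", "maybe", "yes"], ["yes", "no", "yes", "yes"]))
```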