
# nlp-recipes

Natural Language Processing Best Practices & Examples

## Dataset

This submodule includes helper functions for downloading datasets and formatting them appropriately, as well as utilities for splitting data for training and testing (see the splitting sketch after the loading example below).

### Data Loading

Dataloaders are provided for several datasets. For example, the `snli` module loads the SNLI dataset into a pandas DataFrame, with an option to limit the number of rows loaded so that algorithms can be tested quickly and performance benchmarks evaluated. Most datasets ship with train, dev, and test splits that can be selected at load time, for example:

```python
from utils_nlp.dataset.snli import load_pandas_df

DATA_FOLDER = "./data"  # local directory where the dataset is cached (illustrative value)

# Download (if necessary) and load the first 1000 rows of the SNLI train split
df = load_pandas_df(DATA_FOLDER, file_split="train", nrows=1000)
```
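
The snippet above loads a single predefined split. When you need your own train/dev/test partition, a loaded DataFrame can be divided with scikit-learn's `train_test_split`; this is a minimal sketch using scikit-learn rather than any `utils_nlp` helper, and the 70/15/15 proportions are purely illustrative:

```python
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows, then divide that holdout evenly into dev and test.
# train_test_split comes from scikit-learn, not from utils_nlp.
train_df, holdout_df = train_test_split(df, test_size=0.3, random_state=42)
dev_df, test_df = train_test_split(holdout_df, test_size=0.5, random_state=42)

print(len(train_df), len(dev_df), len(test_df))  # 700 150 150 when nrows=1000
```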

### Dataset List

|Dataset|Dataloader script|
|-------|-----------------|
|Microsoft Research Paraphrase Corpus|msrpc.py|
|The Multi-Genre NLI (MultiNLI) Corpus|multinli.py|
|The Stanford Natural Language Inference (SNLI) Corpus|snli.py|
|Wikigold NER|wikigold.py|
|The Cross-Lingual NLI (XNLI) Corpus|xnli.py|
|The STSbenchmark dataset|stsbenchmark.py|
|The Stanford Question Answering Dataset (SQuAD)|squad.py|
|CNN/Daily Mail (CNN/DM) Dataset|cnndm.py|
|Preprocessed CNN/Daily Mail (CNN/DM) Dataset for Extractive Summarization|cnndm.py|
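
The scripts above generally follow the same pattern as the `snli` example. As a hedged sketch, assuming the other modules also expose a `load_pandas_df` helper (argument names such as `file_split` may differ per dataset, so check each script before use):

```python
# Sketch only: assumes multinli.py exposes load_pandas_df with a
# signature similar to snli's; verify against the script itself.
from utils_nlp.dataset.multinli import load_pandas_df as load_multinli

multinli_df = load_multinli(DATA_FOLDER, file_split="train")
print(multinli_df.shape)
```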

### Dataset References

Please see Dataset References for notices and information regarding the datasets used.