msmarco

TREC 2021 Deep Learning Track Guidelines

Please read: Data refresh

Our main focus in 2021 is to get started on using a new, larger, cleaner corpus, which unifies the passage and document datasets. The new document dataset has been available since early July 2021, and the passage dataset was released in mid July 2021.

This leaves participants less than a month before the submission deadline of August 9th. We hope the community can come together and submit runs by: 1) submitting standard approaches and baselines from TREC 2020, to see how these perform on the new datasets; 2) implementing newer approaches that you are working on now or have developed since TREC 2020; and 3) trying hybrid approaches enabled by the new document and passage corpus with passage-document mapping. For example, a passage task run could start with full ranking on the document corpus, then identify candidate passages from the top documents to feed into passage reranking.

Previous edition

Timetable

Registration

To participate in TREC please pre-register at the following website: https://ir.nist.gov/trecsubmit.open/application.html

Introduction

The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).

Certain machine learning methods, such as those based on deep learning, are known to require very large datasets for training. The lack of such large-scale datasets has been a limitation for developing these methods for common information retrieval tasks, such as document ranking. The Deep Learning Track, organized in 2019 and 2020, aimed at providing large-scale datasets to TREC and creating a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks.

In 2021, the track will continue to have the same tasks (document ranking and passage ranking) and goals. Similar to the previous year, one of the main goals of the track in 2021 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?

Deep Learning Track Tasks

The Deep Learning Track has two tasks: Passage ranking and document ranking; and two subtasks in each case: full ranking and reranking. You can submit up to three runs for each of the subtasks.

Each task uses a large human-generated set of training labels from the MS MARCO dataset. The two tasks use the same test queries. They also use the same form of training data, typically with one positive training document/passage per training query. In the case of passage ranking, there is a direct human label that says the passage can be used to answer the query, whereas for training the document ranking task we infer document-level labels from the passage-level labels.

For both tasks, the participants are encouraged to study the efficacy of transfer learning methods. Our current training labels (from MS MARCO) are generated differently than the test labels (generated by NIST), although some labels from past years (mapped to the new corpus) may also be available. Participants can (and are encouraged to) also use external corpora for large scale language model pretraining, or adapt algorithms built for one task of the track (e.g. passage ranking) to the other task (e.g. document ranking). This allows participants to study a variety of transfer learning strategies.

Below the two tasks are described in more detail.

Document Ranking Task

The first task focuses on document ranking. We have two subtasks related to this: Full ranking and top-100 reranking.

In the full ranking (retrieval) subtask, you are expected to rank documents based on their relevance to the question, where documents can be retrieved from the full document collection provided. You can submit up to 100 documents for this task. It models a scenario where you are building an end-to-end retrieval system.

In the reranking subtask, we provide you with an initial ranking of 100 documents from a simple IR system, and you are expected to rerank the documents in terms of their relevance to the question. This is a very common real-world scenario, since many end-to-end systems are implemented as retrieval followed by top-k reranking. The reranking subtask allows participants to focus on reranking only, without needing to implement an end-to-end system. It also makes those reranking runs more comparable, because they all start from the same set of 100 candidates.

Passage Ranking Task

Similar to the document ranking task, the passage ranking task also has full ranking and reranking subtasks.

In the full ranking (retrieval) subtask, given a question, you are expected to rank passages from the full collection in terms of their likelihood of containing an answer to the question. You can submit up to 100 passages for this end-to-end retrieval task.

In the top-100 reranking subtask, we provide you with an initial ranking of 100 passages and you are expected to rerank these passages based on their likelihood of containing an answer to the question. In this subtask, we can compare different reranking methods based on the same initial set of 100 candidates, with the same rationale as described for the document reranking subtask.

Datasets

Since the main asset in MS MARCO is the training data, and we do not have any new training data, the main purpose of this data release is to make the document/passage data larger, cleaner and more realistic. Some notes:

Downloading the datasets

To download large files more quickly and reliably use AzCopy (see instructions).

azcopy copy https://msmarco.blob.core.windows.net/msmarcoranking/msmarco_v2_doc.tar msmarco_v2_doc.tar

We also saw a suggestion for speeding up downloads without azcopy:

wget --header "X-Ms-Version: 2019-12-12" https://msmarco.blob.core.windows.net/msmarcoranking/msmarco_v2_doc.tar

Document ranking dataset

Type Filename File size Num Records Format
Corpus msmarco_v2_doc.tar 32.3 GB 11,959,635 tar of 60 gzipped jsonl files
Train docv2_train_queries.tsv 12.9 MB 322,196 tsv: qid, query
Train docv2_train_top100.txt.gz 404.5 MB 32,218,809 TREC submission: qid, “Q0”, docid, rank, score, runstring
Train docv2_train_qrels.tsv 11.9 MB 331,956 TREC qrels format
Dev 1 docv2_dev_queries.tsv 187.5 KB 4,552 tsv: qid, query
Dev 1 docv2_dev_top100.txt.gz 5.6 MB 455,200 TREC submission: qid, “Q0”, docid, rank, score, runstring
Dev 1 docv2_dev_qrels.tsv 173.4 KB 4,702 TREC qrels format
Dev 2 docv2_dev2_queries.tsv 205.0 KB 5,000 tsv: qid, query
Dev 2 docv2_dev2_top100.txt.gz 6.1 MB 500,000 TREC submission: qid, “Q0”, docid, rank, score, runstring
Dev 2 docv2_dev2_qrels.tsv 190.9 KB 5,178 TREC qrels format
Validation 1 (TREC test 2019) msmarco-test2019-queries.tsv.gz 4.2 KB 200 tsv: qid, query
Validation 1 (TREC test 2019)   KB   TREC submission: qid, “Q0”, docid, rank, score, runstring
Validation 1 (TREC test 2019) docv2_trec2019_qrels.txt.gz 105 KB 13,940 qid, “Q0”, docid, rating
Validation 2 (TREC test 2020) msmarco-test2020-queries.tsv.gz 8.2 KB 200 tsv: qid, query
Validation 2 (TREC test 2020)   KB   TREC submission: qid, “Q0”, docid, rank, score, runstring
Validation 2 (TREC test 2020) docv2_trec2020_qrels.txt.gz 60.9 KB 7,942 qid, “Q0”, docid, rating
Test (TREC test 2021) 2021_queries.tsv 24.0 KB 477 tsv: qid, query
Test (TREC test 2021) 2021_document_top100.txt.gz 603.7 KB 47,700 TREC submission: qid, “Q0”, docid, rank, score, runstring
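The provided top-100 files are gzipped TREC-format runs, so a reranking run can start by parsing them into per-query candidate lists. A minimal sketch (the function name `load_top100` is our own, not an official helper):

```python
import gzip
from collections import defaultdict

def load_top100(path):
    """Parse a gzipped TREC-format run file into {qid: [docid, ...]},
    preserving rank order. Lines look like:
    qid Q0 docid rank score runstring"""
    candidates = defaultdict(list)
    with gzip.open(path, 'rt', encoding='utf8') as fh:
        for line in fh:
            qid, _q0, docid, _rank, _score, _run = line.split()
            candidates[qid].append(docid)
    return dict(candidates)
```

The file lists candidates in rank order, so appending line by line preserves the ranking without needing to sort by the rank column.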

The document corpus is in jsonl format. Each document has:

If you unzip the corpus, you can quickly access a document using:

import json

def get_document(document_id):
    # Document IDs look like 'msmarco_doc_31_726131': the third field is the
    # bundle (file) number and the fourth is a byte offset into that file.
    (string1, string2, bundlenum, position) = document_id.split('_')
    assert string1 == 'msmarco' and string2 == 'doc'

    with open(f'./msmarco_v2_doc/msmarco_doc_{bundlenum}', 'rt', encoding='utf8') as in_fh:
        # Seek straight to the document's byte offset and read one jsonl line.
        in_fh.seek(int(position))
        json_string = in_fh.readline()
        document = json.loads(json_string)
        assert document['docid'] == document_id
        return document

document = get_document('msmarco_doc_31_726131')
print(document.keys())

Producing output:

dict_keys(['url', 'title', 'headings', 'body', 'docid'])

Passage ranking dataset

Type Filename File size Num Records Format
Corpus msmarco_v2_passage.tar 20.3 GB 138,364,198 tar of 70 gzipped jsonl files
Train passv2_train_queries.tsv 11.1 MB 277,144 tsv: qid, query
Train passv2_train_top100.txt.gz 324.9 MB 27,713,673 TREC submission: qid, “Q0”, docid, rank, score, runstring
Train passv2_train_qrels.tsv 11.1 MB 287,889 TREC qrels format
Dev 1 passv2_dev_queries.tsv 160.7 KB 3,903 tsv: qid, query
Dev 1 passv2_dev_top100.txt.gz 4.7 MB 390,300 TREC submission: qid, “Q0”, docid, rank, score, runstring
Dev 1 passv2_dev_qrels.tsv 161.2 KB 4,074 TREC qrels format
Dev 2 passv2_dev2_queries.tsv 175.4 KB 4,281 tsv: qid, query
Dev 2 passv2_dev2_top100.txt.gz 5.1 MB 428,100 TREC submission: qid, “Q0”, docid, rank, score, runstring
Dev 2 passv2_dev2_qrels.tsv 177.4 KB 4,456 TREC qrels format
Test (TREC test 2021) 2021_queries.tsv 24.0 KB 477 tsv: qid, query
Test (TREC test 2021) 2021_passage_top100.txt.gz 590.4 KB 47,700 TREC submission: qid, “Q0”, docid, rank, score, runstring

The passage corpus is also in jsonl format. Each passage has:

The passage corpus can be accessed using the passage id, by adapting the python code listed for the document ID case above.
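For example, assuming passage IDs follow the analogous `msmarco_passage_<bundle>_<offset>` pattern and each jsonl record carries a `pid` field (both assumptions based on the document-side format above), the adapted lookup might look like:

```python
import json

def get_passage(passage_id):
    # Assumed: passage IDs mirror document IDs, e.g. 'msmarco_passage_18_2721588',
    # with the third field naming the bundle file and the fourth a byte offset.
    (string1, string2, bundlenum, position) = passage_id.split('_')
    assert string1 == 'msmarco' and string2 == 'passage'

    with open(f'./msmarco_v2_passage/msmarco_passage_{bundlenum}', 'rt',
              encoding='utf8') as in_fh:
        in_fh.seek(int(position))
        passage = json.loads(in_fh.readline())
        assert passage['pid'] == passage_id  # assumed field name
        return passage
```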

Passage “spans” use byte offsets, but the document text is in UTF-8, so to extract a span (x, y) from the body text you need to use:

doc_json['body'].encode()[x:y].decode()
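As a toy illustration of why the offsets are bytes rather than characters (any string with a multi-byte UTF-8 character shows the difference):

```python
body = 'café menu'   # 'é' takes two bytes in UTF-8
raw = body.encode()  # b'caf\xc3\xa9 menu'
# The span covering 'café' is (0, 5) in bytes, not (0, 4) in characters:
assert raw[0:5].decode() == 'café'
assert len('café') == 4 and len('café'.encode()) == 5
```

Slicing `body` directly with byte offsets would cut the string in the wrong place whenever a multi-byte character precedes the span, which is why the encode/slice/decode pattern above is needed.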

Use of external information

You are generally allowed to use external information while developing your runs. When you submit your runs, please fill in a form listing what resources you used. This could include an external corpus such as Wikipedia or a pretrained model (e.g. word embeddings, BERT). It could also include the provided document ranking training data, as well as other data such as the passage ranking task labels, external labels, or pretrained models. This will allow us to analyze the runs and break them down into types.

IMPORTANT NOTE: We are now dealing with multiple versions of MS MARCO ranking data, and all the other MS MARCO tasks as well. This new data release changes what is available and usable. Participants should be careful about using those datasets and must adhere to the following guidelines:

Submission, evaluation and judging

We will follow the format used by most TREC submissions, which is repeated below. White space is used to separate columns. The width of the columns is not important, but it is important to have exactly six columns per line with at least one space between the columns.

1 Q0 pid1    1 2.73 runid1
1 Q0 pid2    2 2.71 runid1
1 Q0 pid3    3 2.61 runid1
1 Q0 pid4    4 2.05 runid1
1 Q0 pid5    5 1.89 runid1

where:
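As a sketch, a run file in this six-column format can be produced from per-query scored candidates like so (the helper name `write_run` and the input layout are our own choices, not an official API):

```python
def write_run(results, runid, path):
    """Write {qid: [(docid, score), ...]} as a six-column TREC run file.
    Candidates are sorted by descending score and ranks start at 1."""
    with open(path, 'wt', encoding='utf8') as out:
        for qid, scored in results.items():
            ranked = sorted(scored, key=lambda x: -x[1])
            for rank, (docid, score) in enumerate(ranked, start=1):
                out.write(f'{qid} Q0 {docid} {rank} {score} {runid}\n')
```

Sorting before writing guarantees that the rank column is consistent with the score column, a common source of run-file validation errors.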

As the official evaluation set, we provide a set of test queries, a subset of which will be judged by NIST assessors. For this purpose, NIST will use depth pooling and construct separate pools for the passage ranking and document ranking tasks. Passages/documents in these pools will then be labelled by NIST assessors using multi-graded judgments, allowing us to measure NDCG. The same test queries are used for passage retrieval and document retrieval.

Besides our main evaluation using the NIST labels and NDCG, we also have sparse labels for the test queries, which already exist as part of the MS MARCO dataset. More information regarding how these sparse labels were obtained can be found at https://arxiv.org/abs/1611.09268. This allows us to calculate a secondary metric, Mean Reciprocal Rank (MRR). For the full ranking setting, we also compute NCG to evaluate the performance of the candidate generation stage.
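As a reference for the secondary metric, a minimal MRR computation over a run and a set of sparse labels might look like this (the data layout below is our own assumption for illustration, not an official format):

```python
def mean_reciprocal_rank(run, qrels, cutoff=100):
    """MRR: the reciprocal rank of the first relevant item within the top
    `cutoff` results, averaged over all queries in `qrels`.
    run:   {qid: [docid, ...]} in rank order
    qrels: {qid: set of relevant docids}"""
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, docid in enumerate(run.get(qid, [])[:cutoff], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(qrels)
```

Queries with no relevant item in the top `cutoff` contribute zero, which is the usual convention for MRR over sparse labels.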

The main type of TREC submission is automatic, which means there was no manual intervention in running the test queries. This means you should not adjust your runs, rewrite the query, retrain your model, or make any other sorts of manual adjustments after you see the test queries. The ideal case is that you only look at the test queries to check that they ran properly (i.e., no bugs), and then you submit your automatic runs. However, if you want to have a human in the loop for your run, or do anything else that uses the test queries to adjust your model or ranking, you can mark your run as manual. Manual runs are interesting, and we may learn a lot, but these are distinct from our main scenario, which is a system that responds to unseen queries automatically.

Additional resources

We are sharing the following additional resources which we hope will be useful for the community.

Dataset Filename File size Num Records Format
Segmented document collection msmarco_v2_doc_segmented.tar 25.4 GB 124,131,414 tar
Augmented passage collection msmarco_v2_passage_augmented.tar 20.0 GB 138,364,198 tar

MD5Sum

To check your downloads, compare to our md5sum data:

md5sum filename
f2eead4b192683ae5fbd66f4d3f08b96 docv2_dev2_qrels.tsv
f000319f1893a7acdd60fdcae0703b95 docv2_dev2_queries.tsv
e03b5404e9027569c1aa794b1408d8a5 docv2_dev2_top100.txt.gz
aad92d731892ccb0cf9c4c2e37e0f0f1 docv2_dev_qrels.tsv
b05dc19f1d2b8ad729f189328a685aa1 docv2_dev_queries.tsv
4dd27d511748bede545cd7ae3fc92bf4 docv2_dev_top100.txt.gz
2f788d031c2ca29c4c482167fa5966de docv2_train_qrels.tsv
7821d8bef3971e12780a80a89a3e5cbd docv2_train_queries.tsv
b4d5915172d5f54bd23c31e966c114de docv2_train_top100.txt.gz
eea90100409a254fdb157b8e4e349deb msmarco_v2_doc.tar
05946bac48a8ffee62e160213eab3fda msmarco_v2_passage.tar
8ed8577fa459d34b59cf69b4daa2baeb passv2_dev2_qrels.tsv
565b84dfa7ccd2f4251fa2debea5947a passv2_dev2_queries.tsv
da532bf26169a3a2074fae774471cc9f passv2_dev2_top100.txt.gz
10f9263260d206d8fb8f13864aea123a passv2_dev_qrels.tsv
0fa4c6d64a653142ade9fc61d7484239 passv2_dev_queries.tsv
fee817a3ee273be8623379e5d3108c0b passv2_dev_top100.txt.gz
a2e37e9a9c7ca13d6e38be0512a52017 passv2_train_qrels.tsv
1835f44e6792c51aa98eed722a8dcc11 passv2_train_queries.tsv
7cd731ed984fccb2396f11a284cea800 passv2_train_top100.txt.gz
f18c3a75eb3426efeb6040dca3e885dc msmarco_v2_doc_segmented.tar
69acf3962608b614dbaaeb10282b2ab8 msmarco_v2_passage_augmented.tar
0bc85e3f2a6f798b91e18f0cd4a6bc6b 2021_document_top100.txt.gz
e2be2d307da26d1a3f76eb95507672a3 2021_passage_top100.txt.gz
46d863434dda18300f5af33ee29c4b28 2021_queries.tsv
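One way to check the downloads is to stream each file through `hashlib` and compare against the table above; a minimal sketch (the helper name `md5_of` is our own):

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, streaming in 1 MB chunks
    so multi-GB corpus files do not need to fit in memory."""
    h = hashlib.md5()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()
```

On Unix-like systems the stock `md5sum filename` command gives the same digest.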

Previous years of TREC DL

Coordinators

Terms and Conditions

The MS MARCO and ORCAS datasets are intended for non-commercial research purposes only, to promote advancement in the field of artificial intelligence and related areas, and are made available free of charge without extending any license or other intellectual property rights. The datasets are provided “as is” without warranty, and usage of the data has risks since we may not own the underlying rights in the documents. We are not liable for any damages related to use of the dataset. Feedback is voluntarily given and can be used as we see fit. By using any of these datasets you are automatically agreeing to abide by these terms and conditions. Upon violation of any of these terms, your rights to use the dataset will end automatically.

Please contact us at ms-marco@microsoft.com if you own any of the documents made available but do not want them in this dataset. We will remove the data accordingly. If you have questions about use of the dataset or any research outputs in your products or services, we encourage you to undertake your own independent legal review. For other questions, please feel free to contact us.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Microsoft and any contributors grant you a license to the Microsoft documentation and other content in this repository under the Creative Commons Attribution 4.0 International Public License, see the LICENSE file, and grant you a license to any code in the repository under the MIT License, see the LICENSE-CODE file.

Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries. The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks. Microsoft’s general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653.

Privacy information can be found at https://privacy.microsoft.com/en-us/.

Microsoft and any contributors reserve all other rights, whether under their respective copyrights, patents, or trademarks, whether by implication, estoppel or otherwise.