This year we are celebrating the 5th anniversary and the final edition of the TREC Deep Learning Track! 🙏🏽
Overview paper: https://trec.nist.gov/pubs/trec32/papers/Overview_deep.pdf
To participate in TREC please pre-register at the following website: https://ir.nist.gov/trecsubmit.open/application.html
The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).
Certain machine learning methods, such as those based on deep learning, are known to require very large datasets for training. The lack of such large-scale datasets has been a limitation for developing these methods for common information retrieval tasks, such as document ranking. The Deep Learning Track, organized in previous years, aimed at providing large-scale datasets to TREC and creating a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks.
Similar to the previous years, one of the main goals of the track in 2023 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?
The Deep Learning Track in 2023 will continue to have the passage ranking and document ranking tasks, with two subtasks in each case: full ranking and reranking. You can submit up to three official runs for each of the subtasks; these will be evaluated by NIST and also used during the pooling/judging process. Participating groups are allowed to submit up to five additional runs for evaluation that will not be included in the pooling/judging process. If you do submit more than three runs for a subtask, please select Unpooled as the judgment priority for the additional runs during submission. If a participating group submits runs such that the preference ordering for judging is not clear, NIST will unilaterally select which runs to judge. Similar to last year but unlike earlier years, the primary focus of the track will be on the passage ranking task, while keeping the document ranking task as a secondary task. The details of the definition and evaluation of the document ranking task can be found later on this page.
Each task uses a large human-generated set of training labels, from the MS MARCO dataset. The two tasks use the same test queries. They also use the same form of training data with usually one positive training document/passage per training query. In the case of passage ranking, there is a direct human label that says the passage can be used to answer the query, whereas for training the document ranking task we infer document-level labels from the passage-level labels.
For both tasks, the participants are encouraged to study the efficacy of transfer learning methods. Our current training labels (from MS MARCO) are generated differently than the test labels (generated by NIST), although some labels from past years (mapped to the new corpus) may also be available. Participants can (and are encouraged to) also use external corpora for large scale language model pretraining, or adapt algorithms built for one task of the track (e.g. passage ranking) to the other task (e.g. document ranking). This allows participants to study a variety of transfer learning strategies. Below the two tasks are described in more detail.
This year our queries include “synthetic queries”, which are not intended for use in the official NIST evaluation. We will evaluate the submitted runs using the synthetic data, to find out how well synthetic eval can match up with official eval. The goal is to develop a synthetic eval that can serve as a leading indicator (“dev set”) for official results. The official evaluation will not use synthetic queries or qrels.
The primary focus of the Deep Learning Track this year is again on the passage ranking task. We have two subtasks related to this task: full ranking and top-100 reranking.
In the context of the full ranking (retrieval) subtask, given a question, you are expected to rank passages from the full collection in terms of their likelihood of containing an answer to the question. This subtask models a scenario where you are building an end-to-end retrieval system for retrieving passages. You can submit up to 100 passages per query for this end-to-end retrieval task.
In the reranking subtask, we provide you with an initial ranking of 100 passages from a simple IR system, and you are expected to rerank the passages based on their likelihood of containing an answer to the question. This is a very common real-world scenario, since many end-to-end systems are implemented as retrieval followed by top-k reranking. The reranking subtask allows participants to focus on reranking only, without needing to implement an end-to-end system. It also makes those reranking runs more comparable, because they all start from the same set of 100 candidates.
While the passage ranking task is the primary focus of the Deep Learning Track again this year, the track will continue to have the document ranking task. Like last year, the document ranking task is defined and evaluated differently compared to previous years. In the previous years, the expectation in the document ranking task was to rank documents based on their relevance to the question. However, like last year, this year the expectation is to rank documents based on their likelihood of containing a passage relevant to the question.
Similar to the passage ranking task, the document ranking task also has full ranking and reranking subtasks. In the full ranking (retrieval) subtask, documents can be retrieved from the full document collection provided. You can submit up to 100 documents per query for this task.
In the context of the top-100 reranking subtask, we provide you with an initial ranking of 100 documents, and you are expected to rerank these documents in terms of their likelihood of containing a passage relevant to the question. In this subtask, we can compare different reranking methods based on the same initial set of 100 candidates, with the same rationale as described for the passage reranking subtask.
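To make the reranking setup concrete, here is a minimal sketch (not a required or official implementation): it reads a provided gzipped top-100 file, rescores each candidate with a stand-in `score_fn` (your reranking model), and writes a reranked run. The function and variable names are hypothetical.

```python
import gzip
from collections import defaultdict

def rerank_top100(top100_gz_path, queries, score_fn, run_id, out_path):
    """Rerank the provided top-100 candidates per query and write a TREC run.

    `queries` maps qid -> query text; `score_fn(query_text, docid)` is a
    stand-in for your reranking model and returns a float.
    """
    # Read the provided ranking: qid, "Q0", docid, rank, score, runstring.
    candidates = defaultdict(list)
    with gzip.open(top100_gz_path, 'rt', encoding='utf8') as fh:
        for line in fh:
            qid, _q0, docid, _rank, _score, _run = line.split()
            candidates[qid].append(docid)

    # Rescore, sort by descending score, and write the reranked run.
    with open(out_path, 'wt', encoding='utf8') as out:
        for qid, docids in candidates.items():
            scored = sorted(((score_fn(queries[qid], d), d) for d in docids),
                            reverse=True)
            for rank, (score, docid) in enumerate(scored, start=1):
                out.write(f'{qid} Q0 {docid} {rank} {score:.4f} {run_id}\n')
```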
This year we will be leveraging the same datasets as last year’s track.
To download large files more quickly and reliably use AzCopy (see instructions).
```
azcopy copy https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco_v2_doc.tar msmarco_v2_doc.tar
```
We also saw a suggestion for speeding up downloads without azcopy:
```
wget --header "X-Ms-Version: 2019-12-12" https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco_v2_doc.tar
```
Type | Filename | File size | Num Records | Format |
---|---|---|---|---|
Corpus | msmarco_v2_passage.tar | 20.3 GB | 138,364,198 | tar of 70 gzipped jsonl files |
Train | passv2_train_queries.tsv | 11.1 MB | 277,144 | tsv: qid, query |
Train | passv2_train_top100.txt.gz | 324.9 MB | 27,713,673 | TREC submission: qid, “Q0”, docid, rank, score, runstring |
Train | passv2_train_qrels.tsv | 11.1 MB | 287,889 | TREC qrels format |
Dev 1 | passv2_dev_queries.tsv | 160.7 KB | 3,903 | tsv: qid, query |
Dev 1 | passv2_dev_top100.txt.gz | 4.7 MB | 390,300 | TREC submission: qid, “Q0”, docid, rank, score, runstring |
Dev 1 | passv2_dev_qrels.tsv | 161.2 KB | 4,074 | TREC qrels format |
Dev 2 | passv2_dev2_queries.tsv | 175.4 KB | 4,281 | tsv: qid, query |
Dev 2 | passv2_dev2_top100.txt.gz | 5.1 MB | 428,100 | TREC submission: qid, “Q0”, docid, rank, score, runstring |
Dev 2 | passv2_dev2_qrels.tsv | 177.4 KB | 4,456 | TREC qrels format |
Validation 1 (TREC test 2021) | 2021_queries.tsv | 24.0 KB | 477 | tsv: qid, query |
Validation 1 (TREC test 2021) | 2021_passage_top100.txt.gz | 590.4 KB | 47,700 | TREC submission: qid, “Q0”, docid, rank, score, runstring |
Validation 1 (TREC test 2021) | 2021.qrels.pass.final.txt | 424 KB | 10,828 | qid, “Q0”, docid, rating |
Validation 2 (TREC test 2022) | 2022_queries.tsv | 21.0 KB | 500 | tsv: qid, query |
Validation 2 (TREC test 2022) | 2022_passage_top100.txt.gz | 615.3 KB | 50,000 | TREC submission: qid, “Q0”, docid, rank, score, runstring |
Validation 2 (TREC test 2022) | 2022.qrels.pass.withDupes.txt | 15 MB | 386,416 | qid, “Q0”, docid, rating |
Test (TREC test 2023) | 2023_queries.tsv | 37.2 KB | 700 | tsv: qid, query |
Test (TREC test 2023) | 2023_passage_top100.txt.gz | 868.1 KB | 70,000 | TREC submission: qid, “Q0”, docid, rank, score, runstring |
Test (TREC test 2023) | 2023.qrels.pass.withDupes.txt | 891 KB | 22,327 | qid, “Q0”, docid, rating |
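For reference, here is a minimal loading sketch for the query and qrels files above, assuming the stated formats (tab-separated qid/query pairs, and whitespace-separated qrels with the relevance grade in the last column); adapt it as needed for your setup.

```python
from collections import defaultdict

def load_queries(path):
    # Queries are tab-separated: qid <TAB> query.
    with open(path, 'rt', encoding='utf8') as fh:
        return dict(line.rstrip('\n').split('\t', 1) for line in fh)

def load_qrels(path):
    # Qrels lines are whitespace-separated; the second column is unused here
    # and the last column is the relevance grade.
    qrels = defaultdict(dict)
    with open(path, 'rt', encoding='utf8') as fh:
        for line in fh:
            qid, _, docid, rel = line.split()
            qrels[qid][docid] = int(rel)
    return qrels
```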
The passage corpus is also in jsonl format. Each passage has:

* A passage ID such as msmarco_passage_41_45753370, which indicates that the passage is in the file msmarco_v2_passage/msmarco_passage_41 at position 45753370.
* Spans such as (17789,17900),(17901,18096), indicating where the passage text comes from in the body of the source document.
* The docid of the source document, such as msmarco_doc_35_1343131017.

The passage corpus can be accessed using the passage ID, by adapting the python code listed for the document ID case below.
Passage “spans” use byte offsets, but the document text is in UTF-8, so to extract a span (x,y) from the body text you need to use:

```python
doc_json['body'].encode()[x:y].decode()
```
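As a concrete illustration, here is a sketch of passage access that adapts the document-access code shown later on this page, together with a helper for extracting a span; the `pid` field name is an assumption based on the examples above, so check it against your copy of the corpus.

```python
import json

def get_passage(passage_id):
    # A passage ID like msmarco_passage_41_45753370 names the bundle file and
    # gives the byte offset of the passage's JSON line within that file.
    (string1, string2, bundlenum, position) = passage_id.split('_')
    assert string1 == 'msmarco' and string2 == 'passage'

    with open(f'./msmarco_v2_passage/msmarco_passage_{bundlenum}', 'rt',
              encoding='utf8') as in_fh:
        in_fh.seek(int(position))
        passage = json.loads(in_fh.readline())
    assert passage['pid'] == passage_id  # 'pid' field name is assumed here
    return passage

def extract_span(doc_json, x, y):
    # Spans are byte offsets into the UTF-8 encoded document body.
    return doc_json['body'].encode()[x:y].decode()
```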
Type | Filename | File size | Num Records | Format |
---|---|---|---|---|
Corpus | msmarco_v2_doc.tar | 32.3 GB | 11,959,635 | tar of 60 gzipped jsonl files |
Train | docv2_train_queries.tsv | 12.9 MB | 322,196 | tsv: qid, query |
Train | docv2_train_top100.txt.gz | 404.5 MB | 32,218,809 | TREC submission: qid, “Q0”, docid, rank, score, runstring |
Train | docv2_train_qrels.tsv | 11.9 MB | 331,956 | TREC qrels format |
Dev 1 | docv2_dev_queries.tsv | 187.5 KB | 4,552 | tsv: qid, query |
Dev 1 | docv2_dev_top100.txt.gz | 5.6 MB | 455,200 | TREC submission: qid, “Q0”, docid, rank, score, runstring |
Dev 1 | docv2_dev_qrels.tsv | 173.4 KB | 4,702 | TREC qrels format |
Dev 2 | docv2_dev2_queries.tsv | 205.0 KB | 5,000 | tsv: qid, query |
Dev 2 | docv2_dev2_top100.txt.gz | 6.1 MB | 500,000 | TREC submission: qid, “Q0”, docid, rank, score, runstring |
Dev 2 | docv2_dev2_qrels.tsv | 190.9 KB | 5,178 | TREC qrels format |
Validation 1 (TREC test 2019) | msmarco-test2019-queries.tsv.gz | 4.2 KB | 200 | tsv: qid, query |
Validation 1 (TREC test 2019) | (currently not available) |  |  | TREC submission: qid, “Q0”, docid, rank, score, runstring |
Validation 1 (TREC test 2019) | docv2_trec2019_qrels.txt.gz | 105 KB | 13,940 | qid, “Q0”, docid, rating |
Validation 2 (TREC test 2020) | msmarco-test2020-queries.tsv.gz | 8.2 KB | 200 | tsv: qid, query |
Validation 2 (TREC test 2020) | (currently not available) |  |  | TREC submission: qid, “Q0”, docid, rank, score, runstring |
Validation 2 (TREC test 2020) | docv2_trec2020_qrels.txt.gz | 60.9 KB | 7,942 | qid, “Q0”, docid, rating |
Validation 3 (TREC test 2021) | 2021_queries.tsv | 24.0 KB | 477 | tsv: qid, query |
Validation 3 (TREC test 2021) | 2021_document_top100.txt.gz | 603.7 KB | 47,700 | TREC submission: qid, “Q0”, docid, rank, score, runstring |
Validation 3 (TREC test 2021) | 2021.qrels.docs.final.txt | 468 KB | 13,058 | qid, “Q0”, docid, rating |
Validation 4 (TREC test 2022) | 2022_queries.tsv | 21.0 KB | 500 | tsv: qid, query |
Validation 4 (TREC test 2022) | 2022_document_top100.txt.gz | 627.7 KB | 50,000 | TREC submission: qid, “Q0”, docid, rank, score, runstring |
Validation 4 (TREC test 2022) | 2022.qrels.docs.inferred.txt | 13.1 MB | 369,638 | qid, “Q0”, docid, rating |
Test (TREC test 2023) | 2023_queries.tsv | 37.2 KB | 700 | tsv: qid, query |
Test (TREC test 2023) | 2023_document_top100.txt.gz | 881.0 KB | 70,000 | TREC submission: qid, “Q0”, docid, rank, score, runstring |
Test (TREC test 2023) | 2023.qrels.docs.withDupes.txt | 659 KB | 18,034 | qid, “Q0”, docid, rating |
The document corpus is in jsonl format. Each document has:

* A docid such as msmarco_doc_31_726131, which indicates that the document is in the file msmarco_v2_doc/msmarco_doc_31 at position 726131.
* The fields url, title, headings and body (see the example output below).
If you unzip the corpus, you can quickly access a document using:
```python
import json

def get_document(document_id):
    # A docid like msmarco_doc_31_726131 names the bundle file and gives the
    # byte offset of the document's JSON line within that file.
    (string1, string2, bundlenum, position) = document_id.split('_')
    assert string1 == 'msmarco' and string2 == 'doc'

    with open(f'./msmarco_v2_doc/msmarco_doc_{bundlenum}', 'rt', encoding='utf8') as in_fh:
        in_fh.seek(int(position))
        json_string = in_fh.readline()
        document = json.loads(json_string)
        assert document['docid'] == document_id
        return document

document = get_document('msmarco_doc_31_726131')
print(document.keys())
```
Producing output:
```
dict_keys(['url', 'title', 'headings', 'body', 'docid'])
```
You are generally allowed to use external information while developing your runs. When you submit your runs, please fill in a form listing what resources you used. This could include an external corpus such as Wikipedia or a pretrained model (e.g. word embeddings, BERT). This could also include the provided set of document ranking training data, but also optionally other data such as the passage ranking task labels or external labels or pretrained models. This will allow us to analyze the runs and break them down into types.
IMPORTANT NOTE: We are now dealing with multiple versions of MS MARCO ranking data, and all the other MS MARCO tasks as well. This new data release changes what is available and usable. Participants should be careful about using those datasets and must adhere to the following guidelines:
Please submit your runs at: https://ir.nist.gov/trecsubmit/deep.html
We will follow a format similar to the one used by most TREC submissions, which is shown below. White space is used to separate columns. The width of the columns in the format is not important, but it is important to have exactly six columns per line with at least one space between the columns.
```
1 Q0 pid1 1 2.73 runid1
1 Q0 pid2 2 2.71 runid1
1 Q0 pid3 3 2.61 runid1
1 Q0 pid4 4 2.05 runid1
1 Q0 pid5 5 1.89 runid1
```
where:

* the first column is the query ID (qid),
* the second column is always “Q0”,
* the third column is the passage ID (for the passage ranking task) or the document ID (for the document ranking task),
* the fourth column is the rank of the result,
* the fifth column is the score that produced the ranking,
* the sixth column is the run ID (runstring) identifying your submission.
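As a sanity check before submitting, the following is a minimal sketch (not an official validator) that verifies each line has exactly six whitespace-separated columns and that no query has more than 100 results; the function name is hypothetical.

```python
from collections import Counter

def check_run(path, max_per_query=100):
    per_query = Counter()
    with open(path, 'rt', encoding='utf8') as fh:
        for lineno, line in enumerate(fh, start=1):
            cols = line.split()
            assert len(cols) == 6, f'line {lineno}: expected 6 columns, got {len(cols)}'
            qid = cols[0]
            per_query[qid] += 1
            assert per_query[qid] <= max_per_query, \
                f'line {lineno}: more than {max_per_query} results for query {qid}'
    return dict(per_query)
```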
As the official evaluation set, we provide a set of test queries, a subset of which will be used for the final evaluation. The same test queries are used for passage retrieval and document retrieval. Unlike the previous years, different approaches will be used for constructing test collections for the passage ranking and document ranking tasks.
The approach used for test collection construction for the passage retrieval task will be the same as in previous years: NIST will use depth pooling to construct pools for the queries in the final test set. Passages in these pools will then be labelled by NIST assessors using multi-graded judgments, allowing us to measure NDCG.
Evaluation for the document ranking task will be done using the labels inferred from the passage ranking task; no additional judgments will be collected from NIST for this task. This is aligned with this year’s definition of the document ranking task, which focuses on ranking documents based on their likelihood of containing a relevant passage.
The main type of TREC submission is automatic, which means there was no manual intervention in running the test queries. This means you should not adjust your runs, rewrite the query, retrain your model, or make any other sorts of manual adjustments after you see the test queries. The ideal case is that you only look at the test queries to check that they ran properly (i.e., no bugs), then you submit your automatic runs. However, if you want to have a human in the loop for your run, or do anything else that uses the test queries to adjust your model or ranking, you can mark your run as manual. Manual runs are interesting, and we may learn a lot, but these are distinct from our main scenario, which is a system that responds to unseen queries automatically.
We are sharing the following additional resources which we hope will be useful for the community.
Dataset | Filename | File size | Num Records | Format |
---|---|---|---|---|
Segmented document collection | msmarco_v2_doc_segmented.tar | 25.4 GB | 124,131,414 | tar |
Augmented passage collection | msmarco_v2_passage_augmented.tar | 20.0 GB | 138,364,198 | tar |
The MS MARCO and ORCAS datasets are intended for non-commercial research purposes only to promote advancement in the field of artificial intelligence and related areas, and are made available free of charge without extending any license or other intellectual property rights. The datasets are provided “as is” without warranty, and usage of the data has risks since we may not own the underlying rights in the documents. We are not liable for any damages related to use of the dataset. Feedback is voluntarily given and can be used as we see fit. By using any of these datasets you are automatically agreeing to abide by these terms and conditions. Upon violation of any of these terms, your rights to use the dataset will end automatically.
Please contact us at ms-marco@microsoft.com if you own any of the documents made available but do not want them in this dataset. We will remove the data accordingly. If you have questions about use of the dataset or any research outputs in your products or services, we encourage you to undertake your own independent legal review. For other questions, please feel free to contact us.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Microsoft and any contributors grant you a license to the Microsoft documentation and other content in this repository under the Creative Commons Attribution 4.0 International Public License, see the LICENSE file, and grant you a license to any code in the repository under the MIT License, see the LICENSE-CODE file.
Microsoft licenses the MS MARCO Mark “as-is” and makes no express or implied representations or warranties regarding non-infringement. You must remove all uses of the Mark immediately upon request from Microsoft.
Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries. The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks. Microsoft’s general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653.
Privacy information can be found at https://privacy.microsoft.com/en-us/.
Microsoft and any contributors reserve all other rights, whether under their respective copyrights, patents, or trademarks, whether by implication, estoppel or otherwise.