
Starting with a paper released at NIPS 2016, MS MARCO is a collection of datasets focused on deep learning in search.

The first dataset was a question answering dataset featuring 100,000 real Bing questions and a human-generated answer for each. Since then we have released a 1,000,000-question dataset, a natural language generation dataset, a passage ranking dataset, a keyphrase extraction dataset, a crawling dataset, and a conversational search dataset.

The NLGEN and QnA leaderboards will close on 10/23/2020; see Dataset Retirement for details. If you would like to evaluate a model, please submit before then.

Terms and Conditions

The MS MARCO datasets are intended for non-commercial research purposes only, to promote advancement in the field of artificial intelligence and related areas, and are made available free of charge without extending any license or other intellectual property rights. The datasets are provided “as is” without warranty, and usage of the data carries risks since we may not own the underlying rights in the documents. We are not liable for any damages related to use of the datasets. Feedback is voluntarily given and can be used as we see fit. Upon violation of any of these terms, your rights to use the datasets will end automatically.



Please contact us at ms-marco@microsoft.com if you own any of the documents made available but do not want them in this dataset. We will remove the data accordingly. If you have questions about use of the dataset or any research outputs in your products or services, we encourage you to undertake your own independent legal review. For other questions, please feel free to contact us.

Document Retrieval: RETIRED (08/11/2020-01/01/2023)

Based on the questions in the Question Answering dataset and the documents that answered them, a document ranking task was formulated. There are 3.2 million documents, and the goal is to rank them by their relevance to each query.

Relevance labels are derived from which passages were marked as containing the answer in the QnA dataset, making this one of the largest relevance datasets ever.

This dataset is the focus of the 2019 and 2020 TREC Deep Learning Tracks and has been used as a teaching aid for the ACM SIGIR/SIGKDD AFIRM Summer School on Machine Learning for Data Mining and Search.

In 2020 we released a set of cleaned and formatted clicks for all documents in the collection. This collection of 20 million clicks is called ORCAS.

Tasks

  1. Document Re-Ranking: Given the top 100 candidate documents as retrieved by BM25, re-rank them by relevance (a minimal re-ranking sketch follows this list).
  2. Document Full Ranking: Given the corpus of 3.2M documents, generate the top 100 candidate documents sorted by relevance.
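As a rough illustration of the re-ranking setup, here is a minimal sketch that reads BM25 candidates, re-scores each query-document pair with a stand-in scorer, and writes a re-ranked run file. The file names, the qid/docid/score column layout, and score_pair() are illustrative assumptions, not the official format; see the MSMARCO Document Ranking Github repo below for the real data formats and baselines.

```python
# Minimal document re-ranking sketch (hypothetical file layouts).
from collections import defaultdict

def score_pair(query_id: str, doc_id: str, bm25_score: float) -> float:
    # Stand-in scorer: a real system would score the (query text,
    # document text) pair with a trained model; here we simply pass
    # through the BM25 score.
    return bm25_score

candidates = defaultdict(list)            # qid -> [(docid, bm25_score)]
with open("top100.bm25.tsv") as f:        # assumed: qid \t docid \t score
    for line in f:
        qid, docid, score = line.rstrip("\n").split("\t")
        candidates[qid].append((docid, float(score)))

with open("reranked.run.tsv", "w") as out:
    for qid, docs in candidates.items():
        reranked = sorted(docs, key=lambda d: score_pair(qid, d[0], d[1]),
                          reverse=True)
        for rank, (docid, _) in enumerate(reranked, start=1):
            out.write(f"{qid}\t{docid}\t{rank}\n")
```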

Relevant Links

MSMARCO Document Ranking Github
NIST Judgments for TREC 2019 Deep Learning Track
Overview of the TREC 2019 Deep Learning Track Paper
ORCAS Dataset
TREC 2020 Deep Learning
TREC 2019 Deep Learning
Link to full leaderboard

Passage Retrieval: RETIRED (10/26/2018-01/01/2023)

Based on the passages and questions in the Question Answering dataset, a passage ranking task was formulated. There are 8.8 million passages, and the goal is to rank them by their relevance to each query.

Relevance labels are derived from which passages were marked as containing the answer in the QnA dataset, making this one of the largest relevance datasets ever.

This dataset is the focus of the 2019 and 2020 TREC Deep Learning Tracks and has been used as a teaching aid for the ACM SIGIR/SIGKDD AFIRM Summer School on Machine Learning for Data Mining and Search.

In 2020 we released a set of cleaned and formatted clicks for all documents in the collection. This collection of 20 million clicks is called ORCAS.

Tasks

  1. Passage Re-Ranking: Given the top 1000 candidate passages as retrieved by BM25, re-rank them by relevance (an MRR@10 evaluation sketch follows this list).
  2. Passage Full Ranking: Given the corpus of 8.8M passages, generate the top 1000 candidate passages sorted by relevance.
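The passage ranking leaderboard is scored primarily by MRR@10, i.e. the mean over queries of the reciprocal rank of the first relevant passage within the top 10. Below is a minimal sketch under assumed two-column qrels (qid, pid) and three-column run (qid, pid, rank) layouts; for real submissions use the official evaluation script in the MSMARCO Passage Ranking Github repo.

```python
# MRR@10 sketch (hypothetical file layouts).
from collections import defaultdict

def mrr_at_k(qrels: dict, run: dict, k: int = 10) -> float:
    total = 0.0
    for qid, relevant in qrels.items():
        ranked = [pid for _, pid in sorted(run.get(qid, []))][:k]
        for rank, pid in enumerate(ranked, start=1):
            if pid in relevant:
                total += 1.0 / rank
                break
    return total / len(qrels) if qrels else 0.0

qrels = defaultdict(set)                  # qid -> {relevant pids}
with open("qrels.dev.tsv") as f:          # assumed: qid \t pid
    for line in f:
        qid, pid = line.rstrip("\n").split("\t")
        qrels[qid].add(pid)

run = defaultdict(list)                   # qid -> [(rank, pid)]
with open("run.dev.tsv") as f:            # assumed: qid \t pid \t rank
    for line in f:
        qid, pid, rank = line.rstrip("\n").split("\t")
        run[qid].append((int(rank), pid))

print(f"MRR@10: {mrr_at_k(qrels, run):.4f}")
```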

Relevant Links

NIST Judgments for TREC 2019 Deep Learning Track
Overview of the TREC 2019 Deep Learning Track Paper
ORCAS Dataset
MSMARCO Passage Ranking Github
TREC 2020 Deep Learning
TREC 2019 Deep Learning

Dataset Download links

Link to full leaderboard

KeyPhrase Extraction: RETIRED (10/18/2019-10/30/2020)

Keyphrase extraction on open-domain documents is an up-and-coming area that can be used for many NLP tasks such as document ranking and topic clustering. To enable the research community to build performant keyphrase extraction systems, we have built OpenKP, a human-annotated collection of keyphrases on a wide variety of documents.

The dataset features 148,124 real-world web documents along with human annotations indicating the 1-3 most relevant keyphrases for each. More information about the dataset and our initial experiments can be found in the paper Open Domain Web Keyphrase Extraction Beyond Language Modeling, which was an oral presentation at EMNLP-IJCNLP 2019. It is part of the MS MARCO dataset family, and research projects like this power the core document understanding pipeline that Bing uses.

Tasks

  1. Given a document, produce the 3 most salient keyphrases (an F1@k scoring sketch follows below).
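For intuition on the F1 @1/@3/@5 columns in the leaderboard below, here is a minimal F1@k sketch. Lowercased exact-match between predicted and gold keyphrases is an assumption made for illustration; the official matching rules are defined by the evaluation script in the KeyPhrase Extraction Github repo.

```python
# F1@k sketch for keyphrase extraction (exact match is an assumption).
def f1_at_k(predicted: list, gold: set, k: int) -> float:
    topk = [p.lower() for p in predicted[:k]]
    gold = {g.lower() for g in gold}
    correct = len(set(topk) & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(topk)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

predicted = ["keyphrase extraction", "language modeling", "bing"]
gold = {"keyphrase extraction", "open domain documents"}
print(f1_at_k(predicted, gold, k=3))      # P=1/3, R=1/2 -> F1=0.4
```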

Relevant Links

KeyPhrase Extraction Github Repo
Paper

Dataset Download links

KeyPhrase Extraction (10/18/2019), ranked by F1@3 on Eval

Rank Model Submission Date F1 @1,@3,@5
1 ETC-large anonymous May 31st, 2020 0.393, 0.420, 0.360
2 RoBERTa-JointKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.364, 0.391, 0.338
3 RoBERTa-RankKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.361, 0.390, 0.337
4 SpanBERT-JointKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.359, 0.385, 0.335
5 RoBERTa-TagKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.356, 0.381, 0.332
6 SpanBERT-RankKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.355, 0.380, 0.331
7 BERT-JointKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.349, 0.376, 0.325
8 SpanBERT-TagKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.351, 0.374, 0.325
9 BERT-RankKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.342, 0.374, 0.325
10 RoBERTa-ChunkKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.355, 0.373, 0.324
11 SpanBERT-ChunkKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.348, 0.372, 0.324
12 BERT-TagKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.343, 0.364, 0.318
13 BERT (Base) Sequence Tagging Baseline Si Sun (Tsinghua University), Chenyan Xiong (MSR AI), Zhiyuan Liu (Tsinghua University) [Code] November 5th, 2019 0.321, 0.361, 0.314
14 BERT-ChunkKPE (base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.340, 0.355, 0.311
15 SpanBERT-SpanKPE (base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.329, 0.351, 0.304
16 RoBERTa-SpanKPE (base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.330, 0.350, 0.305
17 LLbeBack Rodrigo Nogueira (Epistemic AI), Jimmy Lin (University of Waterloo) November 19th, 2019 0.349, 0.341, 0.246
18 BERT-SpanKPE (base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.317, 0.332, 0.289
19 Baseline finetuned on Bing Queries MSMARCO Team [Xiong, et al. '19] October 19th, 2019 0.267, 0.292, 0.209
20 Baseline MSMARCO Team [Xiong, et al. '19] October 19th, 2019 0.244, 0.277, 0.198

Question Answering and Natural Language Generation: RETIRED (12/01/2016-10/30/2020)

The original focus of MS MARCO was to provide a corpus for training and testing systems that, given a real user query, provide the most likely candidate answer in language that is natural and conversational.

This data comes in three tasks/forms: the original QnA dataset (v1.1), Question Answering (v2.1), and Natural Language Generation (v2.1). The original question answering dataset featured 100,000 examples and was released in 2016. Its leaderboard is now closed, but the data is available below.

The current competitive tasks are Question Answering and Natural Language Generation. Question Answering features over 1,000,000 queries and is much like the original QnA dataset but bigger and with higher quality. The Natural Language Generation dataset features 180,000 examples and builds upon the QnA dataset to deliver answers that could be spoken by a smart speaker.

Tasks

  1. QnA (v1.1, now closed): Given a query and 10 candidate passages, select the most relevant one and use it to answer the question.
  2. QnA (v2.1): Given a query and 10 candidate passages, select the most relevant one and use it to answer the question.
  3. NLGEN (v2.1): Given a query and 10 candidate passages, select the most relevant one and use it to answer the question. Provide your answer in a way in which it could be read from a smart speaker and make sense without any additional context (a Rouge-L scoring sketch follows this list).
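Both leaderboards below report Rouge-L and Bleu-1 against the human reference answers. As a rough guide to the Rouge-L column, here is a simplified single-reference, LCS-based sketch; official numbers come from the evaluation scripts in the Question Answering Github repo.

```python
# Simplified single-reference ROUGE-L: an F-measure over the longest
# common subsequence (LCS) of candidate and reference tokens.
def lcs_len(a: list, b: list) -> int:
    # Classic dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l("rainbows appear when light refracts in water droplets",
              "a rainbow appears when light refracts through water droplets"))
```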

Relevant Links

Paper
Question Answering Github Repo

Dataset Download links

Question Answering Task: RETIRED (03/01/2018-10/30/2020) Leaderboard

Rank Model Submission Date Rouge-L Bleu-1
1 Multi-doc Enriched BERT Ming Yan of Alibaba Damo NLP June 20th, 2019 0.540 0.565
2 Human Performance April 23rd, 2018 0.539 0.485
3 BERT Encoded T-Net Y. Zhang, C. Wang, X.L. Chen August 5th, 2019 0.526 0.539
4 Selector+Combine-Content-Generator QA Model Shengjie Qian of Caiyun xiaoyi AI and BUPT March 19th, 2019 0.525 0.544
5 LM+Generator Alibaba Damo NLP November 25th, 2019 0.522 0.516
6 Masque Q&A Style NTT Media Intelligence Laboratories [Nishida et al. '19] January 3rd, 2019 0.522 0.437
7 Deep Cascade QA Ming Yan of Alibaba Damo NLP [Yan et al. '18] December 12th, 2018 0.520 0.546
8 Unnamed anonymous December 9th, 2019 0.518 0.507
9 PALM Alibaba Damo NLP December 9th, 2019 0.518 0.507
10 VNET Baidu NLP [Wang et al. '18] November 8th, 2018 0.516 0.543
11 LNET S.L. Liu of NEUKG April 8th, 2020 0.514 0.553
12 MultiLM QnA Model anonymous December 2nd, 2019 0.514 0.498
13 LNET S.L. Liu of NEUKG March 23rd, 2020 0.506 0.542
14 BERT Encoded T-NET Y. Zhang, C. Wang, X.L. Chen July 12th, 2019 0.506 0.525
15 MultiLM QnA Model anonymous December 5th, 2019 0.499 0.430
16 BERT+ Multi-Pointer-Generator Tongjun Li of the ColorfulClouds Tech and BUPT June 11th, 2019 0.498 0.525
17 Selector+Combine-Content-Generator NL Model Shengjie Qian of Caiyun xiaoyi AI and BUPT March 11th, 2019 0.496 0.535
18 REAG Anonymous March 27th, 2020 0.495 0.500
19 CompLM Alibaba Damo NLP December 2nd, 2019 0.495 0.516
20 LM+Generator anonymous November 21st, 2019 0.494 0.529
21 PALM Alibaba Damo NLP December 9th, 2019 0.492 0.510
22 anonymous anonymous December 16th, 2019 0.492 0.499
23 LNET S.L. Liu of the NEUKG Nov 19th, 2019 0.491 0.530
24 BERT+ Multi-Pointer-Generator Tongjun Li of the ColorfulClouds Tech and BUPT May 21st, 2019 0.491 0.520
25 MUSST-NLG Anonymous May 15th, 2020 0.490 0.516
26 CompLM Alibaba Damo NLP December 3rd, 2019 0.490 0.502
27 Masque NLGEN Style NTT Media Intelligence Laboratories [Nishida et al. '19] January 3rd, 2019 0.489 0.488
28 roberta_T_tlcd_18k Anonymous May 14th, 2020 0.483 0.516
29 Communicating BERT Xuan Liang of RIDLL from the University of Technology Sydney October 4th, 2019 0.483 0.506
30 MDC-Generator Ssk-nlp April 23rd, 2020 0.482 0.516
31 MultiLM NLGen Model anonymous December 2nd, 2019 0.482 0.514
32 LM+Generator anonymous November 19th, 2019 0.478 0.481
33 MultiLM NLGen Model anonymous December 5th, 2019 0.475 0.479
34 BERT + Transfer anonymous October 16th, 2019 0.474 0.499
35 Bert Based Multi-task ZhangY & WangC June 26th, 2019 0.471 0.512
36 T-RoBERTa-wf-BERTbaseA-120k Anonymous February 13th, 2020 0.471 0.483
37 BERT-SS-K1-100k Anonymous January 26th, 2020 0.470 0.493
38 T-RoBERTa-wf-BERTbaseA-80k Anonymous February 21st, 2020 0.468 0.500
39 Multi-passage QA Model SudaNLP October 21st, 2020 0.466 0.508
40 BERT-SS-K1-100k Anonymous February 2nd, 2020 0.464 0.485
41 BERT-RGLM Anonymous April 22nd, 2020 0.457 0.479
42 REAG Anonymous May 28th, 2020 0.456 0.449
43 SNET + CES2S Bo Shao of SYSU University July 24th, 2018 0.450 0.464
44 ranking+nlg anonymous October 9th, 2019 0.449 0.468
45 ranker-reader RCZoo of UCAS May 15th, 2019 0.441 0.371
46 Extraction-net zlsh80826 October 20th, 2018 0.437 0.444
47 SNET JY Zhao August 30th, 2018 0.436 0.463
48 BIDAF+ELMo+SofterMax Wang Changbao November 16th, 2018 0.436 0.459
49 ranking+nlg anonymous August 12th, 2019 0.434 0.411
50 DNET QA Geeks August 1st, 2018 0.432 0.479
51 T-RoBERTa-wf-BERTbaseA-120k Anonymous February 13th, 2020 0.431 0.424
52 KIGN-QA Chenliang Li April 22nd, 2019 0.429 0.404
53 MaRCo-da-GAAMA IBM Research AI Multilingual NLP Group April 7th, 2020 0.426 0.462
54 Reader-Writer Microsoft Business Applications Group AI Research September 16th, 2018 0.421 0.436
55 Masque2 (single / NLG Style) NTT Media Intelligence Laboratories October 22nd, 2020 0.419 0.469
56 BERT+Multi-Loss S.L. Liu of NEUKG November 4th, 2019 0.413 0.422
57 REAG (based on PALM) anonymous June 1st, 2020 0.410 0.430
58 RGLM anonymous May 5th, 2020 0.406 0.455
59 SNET+seq2seq Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS June 1st, 2018 0.398 0.423
60 SSK3+BERTBaseAnswerGenerator anonymous Jan 21st, 2020 0.391 0.413
61 MP-MRC BERT H.Y. Zhang Aug 27th, 2020 0.389 0.410
62 MP-MRC BERT-base H.Y. Zhang Sep 4th, 2020 0.388 0.411
63 MUSST anonymous March 31st, 2020 0.376 0.405
64 Anonymous anonymous October 12th, 2020 0.359 0.409
65 fj-net(single) yzm nlp group August 3rd, 2020 0.343 0.409
66 MNet-Base(Single) NLGEN fuii of iDW July 8th, 2020 0.337 0.405
67 fj-reader(single) yzm nlp group July 28th, 2020 0.336 0.404
68 Generation with latent retrieval per answer anonymous May 11th, 2020 0.335 0.290
69 MDCG-Base ssk-nlp June 8th, 2020 0.334 0.398
70 MUSST-NLG Anonymous June 2nd, 2020 0.334 0.388
71 MDCC-Base ssk-nlp June 10th, 2020 0.333 0.400
72 Generation with latent retrieval Baseline 2 anonymous May 11th, 2020 0.331 0.307
73 MDCC ssk-nlp June 10th, 2020 0.328 0.391
74 Generation with latent retrieval Baseline 1 anonymous May 11th, 2020 0.305 0.275
75 MultiTask+DataAug+Unlikelihood UvA June 3rd, 2020 0.300 0.332
76 MUSST-QA Anonymous June 1st, 2020 0.298 0.354
77 lightNLP+BiDAF Enliple AI February 1st, 2019 0.298 0.156
78 Pretrained seq2seq model BDEG September 10th, 2020 0.290 0.331
79 roberta_T_tlx_90k Anonymous July 29th, 2020 0.286 0.327
80 BIDAF+seq2seq Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS May 29th, 2018 0.276 0.288
81 BiDaF Baseline (Implemented by MSMARCO Team) Allen Institute for AI & University of Washington [Seo et al. '16] April 23rd, 2018 0.240 0.106
82 TrioNLP + BiDAF Trio.AI of the CCNU September 23rd, 2018 0.205 0.232
83 BiDAF + LSTM Meefly January 15th, 2019 0.153 0.120

Natural Language Generation Task: RETIRED (03/01/2018-10/30/2020)

Rank Model Submission Date Rouge-L Bleu-1
1 Human Performance April 23rd, 2018 0.632 0.530
2 PALM Alibaba Damo NLP December 16th, 2019 0.498 0.499
3 REAG Anonymous March 27th, 2020 0.498 0.497
4 Masque NLGEN Style NTT Media Intelligence Laboratories [Nishida et al. '19] January 3rd, 2019 0.496 0.501
5 CompLM Alibaba Damo NLP December 3rd, 2019 0.496 0.489
6 PALM Alibaba Damo NLP December 9th, 2019 0.496 0.484
7 BERT+ Multi-Pointer-Generator Tongjun Li of the ColorfulClouds Tech and BUPT June 11th, 2019 0.495 0.476
8 CompLM Alibaba Damo NLP November 19th, 2019 0.495 0.470
9 CompLM Alibaba Damo NLP December 2nd, 2019 0.493 0.475
10 BERT+ Multi-Pointer-Generator Tongjun Li of the ColorfulClouds Tech and BUPT May 21st, 2019 0.491 0.474
11 CompLM Alibaba Damo NLP November 19th, 2019 0.488 0.485
12 roberta_T_tlcd_18k Anonymous May 14th, 2020 0.487 0.468
13 BERT+ Multi-Pointer-Generator Tongjun Li of the ColorfulClouds Tech and BUPT March 26th, 2019 0.487 0.465
14 Selector+Combine-Content-Generator NLGEN Model Shengjie Qian of Caiyun xiaoyi AI and BUPT March 11th, 2019 0.487 0.449
15 VNET Baidu NLP [Wang et al. '18] November 8th, 2018 0.484 0.468
16 BERT+ Multi-Pointer-Generator (Single) Tongjun Li of the ColorfulClouds Tech and BUPT March 19th, 2019 0.484 0.459
17 Communicating BERT Xuan Liang of RIDLL from the University of Technology Sydney October 4th, 2019 0.483 0.472
18 MultiLM NLGen Model anonymous December 2nd, 2019 0.483 0.461
19 ranking+nlg anonymous October 9th, 2019 0.481 0.468
20 MUSST-NLG Anonymous May 15th, 2020 0.480 0.458
21 MultiLM NLGen Model anonymous December 5th, 2019 0.478 0.481
22 BERT-RGLM Anonymous April 22nd, 2020 0.470 0.452
23 BERT-SS-K1-100k anonymous January 26th, 2020 0.470 0.437
24 MDC-Generator Ssk-nlp April 23rd, 2020 0.466 0.446
25 BERT-SS-K1-100k anonymous February 2nd, 2020 0.465 0.427
26 T-RoBERTa-wf-BERTbaseA-120k Anonymous February 17th, 2020 0.464 0.420
27 T-RoBERTa-wf-BERTbaseA-80k Anonymous February 21st, 2020 0.463 0.438
28 ranking+nlg anonymous October 9th, 2019 0.462 0.451
29 PM-MUG-1 anonymous May 20th, 2020 0.453 0.441
30 PM-MUG-2 anonymous May 20th, 2020 0.452 0.449
31 SNET + CES2S Bo Shao of SYSU University July 24th, 2018 0.450 0.406
32 MaRCo-da-GAAMA IBM Research AI Multilingual NLP Group April 7th, 2020 0.448 0.402
33 REAG (based on PALM) anonymous June 1st, 2020 0.447 0.444
34 Masque2 (single / NLG Style) NTT Media Intelligence Laboratories October 22nd, 2020 0.445 0.423
35 KIGN-QA Chenliang Li April 22nd, 2019 0.441 0.462
36 Reader-Writer Microsoft Business Applications Group AI Research September 16th, 2018 0.439 0.426
37 ranking+nlg anonymous August 12th, 2019 0.439 0.411
38 RGLM anonymous May 5th, 2020 0.435 0.413
39 T-RoBERTa-wf-BERTbaseA-120k Anonymous February 13th, 2020 0.427 0.364
40 ConZNet Samsung Research [Indurthi et al. '18] July 16th, 2018 0.421 0.386
41 Anonymous Anonymous November 21st, 2019 0.412 0.410
42 Bayes QA Bin Bi of Alibaba NLP June 14th, 2018 0.411 0.435
43 Generation with latent retrieval per answer anonymous May 11th, 2020 0.408 0.442
44 Generation with latent retrieval Baseline 2 anonymous May 11th, 2020 0.401 0.415
45 SNET+seq2seq Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS June 1st, 2018 0.401 0.375
46 MUSST anonymous March 31, 2020 0.392 0.359
47 SSK3+BERTBaseAnswerGenerator Anonymous Jan 21st, 2020 0.384 0.356
48 Generation with latent retrieval Baseline 1 anonymous May 11th, 2020 0.382 0.416
49 BPG-NET Zhijie Sang of the Center for Intelligence Science and Technology Research(CIST) of the Beijing University of Posts and Telecommunications (BUPT) August 1st, 2018 0.382 0.347
50 GUM anonymous from anonymous September 4th, 2019 0.375 0.438
51 MDCC-Base ssk-nlp June 10th, 2020 0.358 0.362
52 MDCG-Base ssk-nlp June 8th, 2020 0.358 0.359
53 fj-net(single) yzm nlp group August 3rd, 2020 0.353 0.363
54 Deep Cascade QA Ming Yan of Alibaba Damo NLP October 25th, 2018 0.351 0.374
55 MNet-Base(Single) NLGEN fuii of iDW July 8th, 2020 0.350 0.354
56 fj-reader(single) yzm nlp group July 28th, 2020 0.350 0.350
57 MDCC ssk-nlp June 10th, 2020 0.349 0.350
58 MUSST-NLG Anonymous June 2nd, 2020 0.340 0.358
59 AE + ReRanking + Bert Based Multi-task ZhangY & WangC July 12th, 2019 0.331 0.376
60 BERT Encoded T-Net Y. Zhang, C. Wang, X.L. Chen August 5th, 2019 0.329 0.373
61 MultiTask+DataAug+Unlikelihood UvA June 3rd, 2020 0.327 0.347
62 Multi-doc Enriched BERT Ming Yan of Alibaba Damo NLP June 20th, 2019 0.325 0.377
63 BIDAF+seq2seq Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS May 29th, 2018 0.322 0.283
64 BERT Encoded T-Net Y. Zhang, C. Wang, X.L. Chen July 12th, 2019 0.320 0.361
65 Unnamed Anonymous December 9th, 2019 0.318 0.384
66 roberta_T_tlx_90k Anonymous July 29th, 2020 0.303 0.298
67 Pretrained seq2seq model BDEG September 10th, 2020 0.302 0.294
68 LM+Generator anonymous November 25th, 2019 0.299 0.372
69 LNET S.L. Liu of NEUKG April 8th, 2020 0.294 0.352
70 LNET S.L. Liu of NEUKG March 23rd, 2020 0.293 0.347
71 Masque Q&A Style NTT Media Intelligence Laboratories [Nishida et al. '19] January 3rd, 2019 0.285 0.399
72 Bert Based Multi-task ZhangY & WangC June 26th, 2019 0.284 0.349
73 Selector+Combine-Content-Generator QA Model Shengjie Qian of Caiyun xiaoyi AI and BUPT March 11th, 2019 0.281 0.337
74 DNET QA Geeks August 1st, 2018 0.275 0.332
75 ranker-reader RCZoo of UCAS May 15th, 2019 0.271 0.382
76 BIDAF+ELMo+SofterMax Wang Changbao November 16th, 2018 0.268 0.346
77 BERT+Multi-Loss S.L. Liu of NEUKG November 4th, 2019 0.266 0.422
78 LNET S.L. Liu of the NEUKG Nov 19th, 2019 0.266 0.339
79 MultiLM QnA Model anonymous December 2nd, 2019 0.266 0.340
80 MultiLM NLGen Model anonymous December 5th, 2019 0.257 0.360
81 REAG Anonymous May 28th, 2020 0.247 0.328
82 Multi-passage QA Model SudaNLP October 21st, 2020 0.247 0.323
83 Extraction-net zlsh80826 August 14th, 2018 0.247 0.321
84 SNET JY Zhao May 29th, 2018 0.247 0.308
85 MP-MRC BERT H.Y. Zhang Aug 27th, 2020 0.211 0.258
86 MP-MRC BERT-base H.Y. Zhang Sep 4th, 2020 0.211 0.258
87 lightNLP+BiDAF Enliple AI February 1st, 2019 0.210 0.108
88 Anonymous anonymous October 12th, 2020 0.195 0.280
89 MUSST-QA Anonymous June 1st, 2020 0.187 0.285
90 BiDaF Baseline (Implemented by MSMARCO Team) Allen Institute for AI & University of Washington [Seo et al. '16] April 23rd, 2018 0.169 0.093
91 TrioNLP + BiDAF Trio.AI of the CCNU September 23rd, 2018 0.142 0.160
92 BiDAF + LSTM Meefly January 15th, 2019 0.119 0.173

MS MARCO V1: RETIRED (12/01/2016-03/31/2018)

Rank Model Submission Date Rouge-L Bleu-1
1 MARS YUANFUDAO research NLP March 26th, 2018 0.497 0.480
2 Human Performance December 2016 0.470 0.460
3 V-Net Baidu NLP [Wang et al '18] February 15th, 2018 0.462 0.445
4 S-Net Microsoft AI and Research [Tan et al. '17] June 2017 0.452 0.438
5 R-Net Microsoft AI and Research [Wei et al. '16] May 2017 0.429 0.422
6 HieAttnNet Akaitsuki March 26th, 2018 0.423 0.448
7 BiAttentionFlow+ ShanghaiTech University GeekPie_HPC team March 11th, 2018 0.415 0.381
8 ReasoNet Microsoft AI and Research [Shen et al. '16] April 28th, 2017 0.388 0.399
9 Prediction Singapore Management University [Wang et al. '16] March 2017 0.373 0.407
10 FastQA_Ext DFKI German Research Center for AI [Weissenborn et al. '17] March 2017 0.337 0.339
11 FastQA DFKI German Research Center for AI [Weissenborn et al. '17] March 2017 0.321 0.340
12 Flypaper Model ZhengZhou University March 14th, 2018 0.317 0.342
13 DCNMarcoNet Flying Riddlers @ Carnegie Mellon University March 31st, 2018 0.313 0.238
14 BiDaF Baseline for V2 (Implemented by MSMARCO Team) [Seo et al. '16] April 23rd, 2018 0.268 0.129
15 ReasoNet Baseline trained on SQuAD, Microsoft AI & Research [Shen et al. '16] December 2016 0.192 0.148

Usefulness Data (Released 02/02/2020)

Data associated with the WebConf 2020 paper Leading Conversational Search by Suggesting Useful Questions.

Relevant Links

Usefulness Github Repo

Dataset Download links

Conversational Search (Released 04/23/2019)

Truly conversational search is the next logical step in the journey to generate intelligent and useful AI. To understand what this may mean, researchers have voiced a continuing desire to study how people currently converse with search engines. As a result, we have released a large corpus of anonymized user search sessions.

We hope the community can use this corpus to explore what conversations with search engines look like.

Relevant Links

Conversational Search Github Repo

Dataset Download links

Optimal Crawling (Released 04/23/2019)

The dataset used for Optimal Freshness Crawl Under Politeness Constraints and Staying up to Date with Online Content Changes Using Reinforcement Learning for Scheduling, both of which focus on producing an optimal crawling schedule for a search engine given the ever-changing nature of the internet.

There is currently no public task associated with this dataset.

Relevant Links

Paper
Paper 2
Optimal Crawl Github Repo

Dataset Download links

Message to the NLP Community and MS MARCO Community

TL;DR: We are closing the MS MARCO QnA and NLGEN leaderboards. Last submissions: 10/23/2020.

Dear NLP community and Question Answering enthusiasts,
When we released MS MARCO v2 back in March of 2018, we did not expect how much love this dataset would receive from the community. Needless to say, we have been humbled not only by the number of submissions to the leaderboard but also by all the remarkable research that incorporated this dataset as part of its benchmarking efforts. While we originally envisioned that this dataset would be useful to the NLP and QnA communities, we were again humbled by how the dataset was adopted and evolved by the IR community for document and passage retrieval tasks.

However, as you may have guessed, maintaining a public resource like the MS MARCO leaderboard takes significant time and effort, and we are grateful to the small but dedicated team of volunteers that maintains this website. As we look to the future, we believe that, given the small size of this team and its limited resources, it is time to refocus our energy on the scenarios where MS MARCO can provide the most value to the research community. Toward that goal, we have made the hard (but, we believe, right) decision to retire the Question Answering and Natural Language Generation leaderboards. Neither task has made large leaps in quality in the last year, and we want to refocus our efforts on the document and passage retrieval tasks, where engagement with the research community is actively growing.

As a result, the last day for any submissions to the MS MARCO Question Answering, Natural Language Generation, and KeyPhrase Extraction leaderboards is October 23, 2020. Submissions to the document and passage retrieval leaderboards will continue as usual. We will continue to host all the datasets (including those for the tasks being retired), as we believe they can still serve as valuable resources for future research.

We want to thank all the participants once again for their submissions and support of MS MARCO, and we hope to see the community around the IR tasks continue to grow. We are always listening for feedback, so please continue to send us your suggestions and requests.

Sincerely,
The MS MARCO Team

Changes To Dataset

10.23.2020: Task Retirement

1. Retired QnA V2 Task
2. Retired NLGEN V2 Task
3. Retired OpenKP Task

08.11.2020: New Task

1. Released Document Ranking task and 3 baselines.

07.30.2020: New Data

1. Released ORCAS Click data

02.11.2020: New Data

1. Released Usefulness Data

10.22.2019: New Datasets

1. Released OpenKP Keyphrase Extraction dataset!
2. Released Optimal Crawling dataset!

05.06.2019: Fixed Encoding Issues with Ranking Dataset

1. Updated various encoding issues in ranking dataset.

04.23.2019: We have released a conversational search dataset

1. Brand new conversational search Dataset

10.26.2018: We have released a new ranking dataset based on the v2.1 dataset

1. Brand new Ranking Dataset
2. Basic Baseline and evaluation function

04.23.2018: We have released an update to the dataset. V2.1 includes the following:
1. Over 1 million queries
2. ~182k Well-Formed Answers
3. Query type is now included for every query.
4. Bias in the evaluation set fixed (a small portion of answers in the V2.0 evaluation set could be found in the v1.1 set and the v2.0 well-formed sets; these have been removed from eval and added to train).
5. Utilities and readme now available.

03.01.2018: We have released an update to the dataset. V2.0 includes the following:
1. ~900,000 unique queries
2. ~160k Well-Formed Answers

01.30.2017: We have released an update to the dataset! V1.1 contains the following:
1. Improvements to the dataset and evaluation scripts

12.01.2016: We have released our dataset! V1.0 contains the following:
1. 100,000 unique query answer pairs

MS MARCO Submission Instructions

Once you have built a model that meets your expectations on evaluation with the dev set, you can submit your test results to get an official evaluation on the test set. To ensure the integrity of the official test results, we do not release the correct answers for the test set to the public.

To submit your model for official evaluation on the test set for the document ranking task, follow the instructions here.

To submit your model for official evaluation on the test set for other tasks, follow the steps below:

  1. Generate your candidate output for the dev set (a sketch of this step for the QnA task follows the email checklist below).
  2. Run the official evaluation methodologies found in the task-specific git repo and verify your systems are running as expected.
  3. Generate your candidate output for the eval/test set and submit the following information by emailing us.

Your email should include

  1. Candidate evaluation file
  2. Candidate dev file
  3. Individual/Team Name: Name of the individual or the team to appear in the leaderboard [Required]
  4. Individual/Team Institution: Name of the institution of the individual or the team to appear in the leaderboard [Optional]
  5. Model code: Training code for the model [Recommended]
  6. Model information: Name of the model/technique to appear in the leaderboard [Required]
  7. Paper Information: Name, Citation, URL of the paper if model is from a published work to appear in the leaderboard [Optional]

To avoid "p-hacking," we discourage frequent submissions from the same group within a short period of time.

About MS MARCO

Email us

Microsoft Machine Reading Comprehension (MS MARCO) is a collection of large-scale datasets for deep learning related to search. In MS MARCO, all questions are sampled from real, anonymized user queries. The context passages, from which the answers in the dataset are derived, are extracted from real web documents using the most advanced version of the Bing search engine. The answers to the queries are human generated, written where the annotators could summarize an answer.

Current Team