
Starting with a paper released at NIPS 2016, MS MARCO is a collection of datasets focused on deep learning in search.

The first dataset was a question answering dataset featuring 100,000 real Bing questions and human-generated answers. Since then we have released a 1,000,000-question dataset, a natural language generation dataset, a passage ranking dataset, a keyphrase extraction dataset, a crawling dataset, and a conversational search dataset.

The NLGEN and QnA leaderboards will close on 10/23/2020; see Dataset Retirement for details. If you would like to evaluate a model, please submit before then.

Terms and Conditions

The MS MARCO datasets are intended for non-commercial research purposes only, to promote advancement in the field of artificial intelligence and related areas, and are made available free of charge without extending any license or other intellectual property rights. The datasets are provided “as is” without warranty, and use of the data carries risk since we may not own the underlying rights in the documents. We are not liable for any damages related to use of the dataset. Feedback is voluntarily given and can be used as we see fit. Upon violation of any of these terms, your rights to use the dataset will end automatically.



Please contact us at ms-marco@microsoft.com if you own any of the documents made available but do not want them in this dataset. We will remove the data accordingly. If you have questions about use of the dataset or any research outputs in your products or services, we encourage you to undertake your own independent legal review. For other questions, please feel free to contact us.

Document Retrieval (08/11/2020-Present)

Based on the questions in the Question Answering dataset and the documents that answered them, a document ranking task was formulated. There are 3.2 million documents, and the goal is to rank them by relevance.

Relevance labels are derived from which passages were marked as containing the answer in the QnA dataset, making this one of the largest relevance datasets ever released.

This dataset is the focus of the 2020 and 2019 TREC Deep Learning Track and has been used as a teaching aid for ACM SIGIR/SIGKDD AFIRM Summer School on Machine Learning for Data Mining and Search.

In 2020 we released a set of cleaned and formatted clicks for all documents in the collection. This collection of 20 million clicks is called ORCAS.

Tasks

  1. Document Re-Ranking: Given the candidate top 100 documents as retrieved by BM25, re-rank the documents by relevance.
  2. Document Full Ranking: Given a corpus of 3.2m documents, generate a candidate top 100 documents sorted by relevance.
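Submissions for both tasks are scored by MRR@100, the metric used in the leaderboard below. A minimal sketch of the computation, assuming simple in-memory dicts (a run mapping query id to a ranked list of doc ids, and qrels mapping query id to the set of relevant doc ids) rather than the official TREC-style files:

```python
def mrr_at_k(run, qrels, k=100):
    """Mean Reciprocal Rank: for each query, take 1/rank of the first
    relevant document within the top k (0 if none), then average."""
    total = 0.0
    for qid, ranked_docs in run.items():
        relevant = qrels.get(qid, set())
        for rank, docid in enumerate(ranked_docs[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break
    return total / len(run)

# toy example: first relevant doc at rank 2 for q1, rank 1 for q2
run = {"q1": ["d3", "d7", "d1"], "q2": ["d9", "d2"]}
qrels = {"q1": {"d7"}, "q2": {"d9"}}
print(mrr_at_k(run, qrels))  # (1/2 + 1) / 2 = 0.75
```

The official evaluation scripts in the MSMARCO GitHub repos handle the real file formats; this only illustrates the metric itself.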

Relevant Links

MSMARCO Document Ranking Github NIST Judgments for TREC 2019 Deep Learning Track Overview of the TREC 2019 deep learning track Paper ORCAS Dataset TREC 2020 Deep Learning TREC 2019 Deep Learning

Dataset Download links

Document Ranking Leaderboard (August 11th, 2020-Present), ranked by MRR@100 on Eval

Rank Model Ranking Style Submission Date MRR@100 On Eval
BERT-m1 base + classic IR + doc2query Leonid Boytsov - Bosch Center for AI - [code] Full Ranking October 20th, 2020 0.396
BERT-m1 base + classic IR + doc2query Leonid Boytsov - Bosch Center for AI - [code] Full Ranking October 12th, 2020 0.390
ColBERT MaxP end-to-end Omar Khattab, Christopher Potts, and Matei Zaharia - Stanford University Full Ranking October 6th, 2020 0.384
HDCT top100 + BERT-base FirstP (single) Luyu Gao and Zhuyun Dai - LTI, CMU Full Ranking September 9th, 2020 0.382
BERT-m1 base + classic IR + doc2query Leonid Boytsov - Bosch Center for AI - [code] Full Ranking October 1st, 2020 0.380
ANCE + BM25 BERT Base FirstP OpenMatch - THU-MSR - [code] Full Ranking October 16th, 2020 0.380
Expando-Mono-Duo-T5 FirstP+MaxP Ronak Pradeep, Ruizhou Xu, Rodrigo Nogueira, and Jimmy Lin - University of Waterloo Full Ranking September 19th, 2020 0.378
Expando-Mono-Duo-T5 Ronak Pradeep, Ruizhou Xu, Rodrigo Nogueira, and Jimmy Lin - University of Waterloo Full Ranking September 7th, 2020 0.370
BERT based re-ranking on top of classic IR model Leonid Boytsov - Bosch Center for AI - [code] Full Ranking August 27th, 2020 0.368
ColBERT-long end-to-end v0 Omar Khattab, Christopher Potts, and Matei Zaharia - Stanford University Full Ranking September 24th, 2020 0.366
Expando-Mono-T5 Ronak Pradeep, Ruizhou Xu, Rodrigo Nogueira, and Jimmy Lin - University of Waterloo Full Ranking September 7th, 2020 0.362
ANCE BERT Base FirstP OpenMatch - THU-MSR - [code] Full Ranking October 9th, 2020 0.362
ANCE BERT Base FirstP Fusion OpenMatch - THU-MSR - [code] Full Ranking September 23rd, 2020 0.357
BERT-base Rerank Document First Passage Luyu Gao and Zhuyun Dai - LTI, CMU ReRanking August 31st, 2020 0.338
ANCE FirstP OpenMatch - THU-MSR - [code] - [Xiong et Al. '20] Full Ranking September 23rd, 2020 0.334
BERT Base FirstP + Document Expansion OpenMatch - THU-MSR - [code] ReRanking September 8th, 2020 0.329
BERT Base FirstP early stop OpenMatch - THU-MSR - [code] ReRanking September 18th, 2020 0.329
BERT Base FirstP OpenMatch - THU-MSR - [code] ReRanking September 8th, 2020 0.328
BERT Base FirstP early stop + url OpenMatch - THU-MSR - [code] ReRanking September 18th, 2020 0.326
BERT Base FirstP Baseline OpenMatch - THU-MSR - [code] ReRanking August 28th, 2020 0.317
ME-hybrid Google Research Full Ranking September 24th, 2020 0.310
[Official Baseline] Conformer-Kernel with QTI (NDRM1) Bhaskar Mitra (Microsoft), Sebastian Hofstätter (TU Wien), Hamed Zamani (Microsoft), and Nick Craswell (Microsoft) - [code] - [Mitra et Al. '20] ReRanking August 11th, 2020 0.307
Longformer reranking Ivan Sekulic (Università della Svizzera italiana), Amir Soleimani (University of Amsterdam), Mohammad Aliannejadi (University of Amsterdam), and Fabio Crestani (Università della Svizzera italiana) - [code] ReRanking August 31st, 2020 0.305
[Official Baseline] Conformer-Kernel with QTI (NDRM3) Bhaskar Mitra (Microsoft), Sebastian Hofstätter (TU Wien), Hamed Zamani (Microsoft), and Nick Craswell (Microsoft) - [code] - [Mitra et Al. '20] ReRanking August 11th, 2020 0.293
uogTrDPHCBmp: PyTerrier DPH reranked by ColBERT MaxPassage Craig Macdonald - University of Glasgow Terrier Team Full Ranking September 3rd, 2020 0.291
Anserini's BM25 + T5QLM Max passage Shengyao Zhuang, Hang Li, and Guido Zuccon - IELab, The University of Queensland Full Ranking September 10th, 2020 0.290
DE-hybrid Google Research Full Ranking September 24th, 2020 0.287
Anserini's BM25 baseline, doc2query segmented index Ronak Pradeep, Ruizhou Xu, Rodrigo Nogueira, and Jimmy Lin - University of Waterloo - [code] Full Ranking August 30th, 2020 0.284
Non-neural fusion of BM25 (multi-field) and IBM Model 1 scores Leonid Boytsov - Bosch Center for AI - [code] Full Ranking August 20th, 2020 0.256
Anserini's BM25 baseline, doc2query index Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin - University of Waterloo - [code] Full Ranking August 16th, 2020 0.251
PyTerrier framework + DPH Divergence from Randomness model Craig Macdonald - University of Glasgow Terrier Team - [code] - [Macdonald and Tonellotto '20] Full Ranking August 31st, 2020 0.230
PyTerrier framework + DPH + Bo1 QE Craig Macdonald - University of Glasgow Terrier Team - [code] - [Macdonald and Tonellotto '20] Full Ranking September 1st, 2020 0.219
Anserini's BM25 Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin - University of Waterloo - [code] Full Ranking August 16th, 2020 0.201
[Official Baseline] IndriQueryLikelihood MSMARCO Team - [code] Full Ranking August 11th, 2020 0.192
Anserini's BM25 + RM3 baseline, doc2query index Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin - University of Waterloo - [code] Full Ranking August 16th, 2020 0.161
Anserini's BM25 + RM3 baseline Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin - University of Waterloo - [code] Full Ranking August 16th, 2020 0.139

Passage Retrieval (10/26/2018-Present)

Based on the passages and questions in the Question Answering dataset, a passage ranking task was formulated. There are 8.8 million passages, and the goal is to rank them by relevance.

Relevance labels are derived from which passages were marked as containing the answer in the QnA dataset, making this one of the largest relevance datasets ever released.

This dataset is the focus of the 2020 and 2019 TREC Deep Learning Track and has been used as a teaching aid for ACM SIGIR/SIGKDD AFIRM Summer School on Machine Learning for Data Mining and Search.

In 2020 we released a set of cleaned and formatted clicks for all documents in the collection. This collection of 20 million clicks is called ORCAS.

Tasks

  1. Passage Re-Ranking: Given the candidate top 1000 passages as retrieved by BM25, re-rank the passages by relevance.
  2. Passage Full Ranking: Given a corpus of 8.8m passages, generate a candidate top 1000 passages sorted by relevance.
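The distinction between the two tasks is that re-ranking only reorders a fixed candidate list, while full ranking must also produce the candidates from the whole corpus. A minimal sketch of the re-ranking step, using a trivial term-overlap scorer as a stand-in for the BM25 or neural scorers on the leaderboard:

```python
def rerank(query, candidates, score_fn):
    """Reorder a fixed candidate passage list by descending score.
    The candidate set itself never changes (re-ranking, not retrieval)."""
    return sorted(candidates, key=lambda p: score_fn(query, p), reverse=True)

def overlap_score(query, passage):
    # stand-in scorer: count passage tokens that are query terms
    q_terms = set(query.lower().split())
    return sum(1 for t in passage.lower().split() if t in q_terms)

candidates = [
    "the cat sat",
    "deep learning for search",
    "bm25 ranking for search queries",
]
print(rerank("search ranking", candidates, overlap_score))
# the passage with the most query-term overlap comes first
```

A real submission would swap `overlap_score` for a learned model scoring each (query, passage) pair, but the pipeline shape is the same.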

Relevant Links

NIST Judgments for TREC 2019 Deep Learning Track Overview of the TREC 2019 deep learning track Paper ORCAS Dataset MSMARCO Passage Ranking Github TREC 2020 Deep Learning TREC 2019 Deep Learning

Dataset Download links

Passage Ranking Leaderboard (10/26/2018-Present), ranked by MRR@10 on Eval

Rank Model Ranking Style Submission Date MRR@10 On Eval MRR@10 On Dev
RocketQA + ERNIE Baidu NLP - [Qu et al.] Full Ranking September 18th, 2020 0.426 0.439
UED-Large Anonymous Full Ranking August 12th, 2020 0.424 0.436
DR-BERT X.W. S of Meituan-Dianping NLP-KG Group Full Ranking May 20th, 2020 0.419 0.420
expando-mono-duo-T5 Ronak Pradeep, Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin - University of Waterloo Full Ranking May 19th, 2020 0.408 0.420
DeepCT + TF-Ranking Ensemble of BERT, ROBERTA and ELECTRA (1) Shuguang Han, (2) Zhuyun Dai, (1) Xuanhui Wang, (1) Michael Bendersky and (1) Marc Najork - 1) Google Research, (2) Carnegie Mellon - Paper and Code Full Ranking June 2nd, 2020 0.407 0.421
UED Anonymous Full Ranking May 5th, 2020 0.405 0.414
UED-Large Anonymous Full Ranking August 11th, 2020 0.405
TABLE Model X.W. S of Meituan-Dianping NLP-KG Group Full Ranking May 11th, 2020 0.401 0.412
TABLE Model X.W. S of Meituan-Dianping NLP-KG Group Full Ranking January 21st, 2020 0.400 0.401
TABLE Model X.W. S of Meituan-Dianping NLP-KG Group Full Ranking May 8th, 2020 0.400 0.401
DeepCT Retrieval + TF-Ranking BERT Ensemble 1) Shuguang Han, (2) Zhuyun Dai, (1) Xuanhui Wang, (1) Michael Bendersky and (1) Marc Najork - (1) Google Research, (2) Carnegie Mellon University - Paper [Han, et al. '20]Code Full Ranking April 10th, 2020 0.395 0.405
DeepCT + Bart Binsheng Liu - RMIT University Full Ranking May 6th, 2020 0.394 0.408
Enriched BERT base + AOA index + CAS Ming Yan of Alibaba Damo NLP Full Ranking August 20th, 2019 0.393 0.408
TF-Ranking Ensemble of BERT, ROBERTA and ELECTRA (1) Shuguang Han, (2) Zhuyun Dai, (1) Xuanhui Wang, (1) Michael Bendersky and (1) Marc Najork - 1) Google Research, (2) Carnegie Mellon - Paper and Code ReRanking June 2nd, 2020 0.391 0.405
BM25 + Bert-C sookienlane Full Ranking February 21st,2019 0.388 0.394
W-Index retrieval + BERT-F re-rank Zhuyun Dai of Carnegie Mellon University Full Ranking September 12th,2019 0.388 0.394
BM25 + monoT5-3B Ronak Pradeep, Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin of University of Waterloo Full Ranking October 2nd,2020 0.388 0.398
Enriched BERT base + AOA index V1 Ming Yan of Alibaba Damo NLP Full Ranking May 13th, 2019 0.383 0.397
BERTter pretraining (1)Rodrigo Nogueira, (2)Wei Yang, (3)Jimmy Lin, (4)Kyunghyun Cho - New York University(1,4), University of Waterloo(2,3), Facebook AI Research(4) Full Ranking May 21st, 2019 0.383 0.395
R-Index and R-BERT X.W. S Full Ranking Jan 14th,2020 0.382 0.429
Enriched BERT base + AOA index V2 Ming Yan of Alibaba Damo NLP Full Ranking May 13th, 2019 0.380 0.389
BM25 + monoBERT + duoBERT + TCP (1)Rodrigo Nogueira, (2)Wei Yang, (3)Jimmy Lin, (4)Kyunghyun Cho - New York University(1,4), University of Waterloo(2,3), Facebook AI Research(4) [Nogueira, et al. '19] and Code Full Ranking June 26th, 2019 0.379 0.390
BM25 + Electra Large OpenMatch - THU-MSR - [Code] ReRanking August 13th, 2020 0.376 0.388
BERT^2 (1)Rodrigo Nogueira, (2)Wei Yang, (3)Jimmy Lin, (4)Kyunghyun Cho - New York University(1,4), University of Waterloo(2,3), Facebook AI Research(4) Full Ranking May 13th, 2019 0.375 0.386
TF-Ranking + BERT(Ensemble of pointwise, pairwise and listwise losses)TF-Ranking team (Shuguang Han, Xuanhui Wang, Michael Bendersky and Marc Najork) of Google Research - Paper [Han, et al. '20] and [Code] ReRanking March 30th, 2020 0.375 0.388
BM25 + Roberta Large OpenMatch - THU-MSR - [Code] ReRanking August 13th, 2020 0.375 0.386
Enriched BERT base + AOA index Ming Yan of Alibaba Damo NLP Full Ranking May 6th, 2019 0.373 0.387
BM25 + monoBERT + duoBERT (1)Rodrigo Nogueira, (2)Wei Yang, (3)Jimmy Lin, (4)Kyunghyun Cho - New York University(1,4), University of Waterloo(2,3), Facebook AI Research(4) [Nogueira, et al. '19] Full Ranking June 26th, 2019 0.370 0.382
ReinforcedQGen+BERTRank Rajarshee Mitra of Microsoft STCI Full Ranking August 5th, 2019 0.369 -
BERTter Indexing (1)Rodrigo Nogueira, (2)Wei Yang, (3)Jimmy Lin, (4)Kyunghyun Cho - New York University(1,4), University of Waterloo(2,3), Facebook AI Research(4) [Nogueira et al. '19] and [Code] Full Ranking April 8th, 2019 0.368 0.375
Enriched BERT base + AOA index Ming Yan of Alibaba Damo NLP ReRanking May 6th, 2019 0.368 0.373
ELECTRA-Large Jheng-Hong Yang(1), Sheng-Chieh Lin(1), Rodrigo Nogueira(2), Jimmy Lin(2) - Academia Sinica(1), University of Waterloo(2) ReRanking March 23rd, 2020 0.367 0.376
TF-Ranking + BERT(Softmax Loss, List size 6, 200k steps)TF-Ranking team (Shuguang Han, Xuanhui Wang, Michael Bendersky and Marc Najork) of Google Research - Paper [Han, et al. '20] and [Code] Re Ranking March 16th, 2020 0.366 0.378
BM25 + monoBERT (1)Rodrigo Nogueira, (2)Wei Yang, (3)Jimmy Lin, (4)Kyunghyun Cho - New York University(1,4), University of Waterloo(2,3), Facebook AI Research(4) [Nogueira, et al. '19] Full Ranking June 26th, 2019 0.365 0.372
BERT base + attention ranking anonymous ReRanking August 26th, 2019 0.364 0.377
BERT + Small Training Rodrigo Nogueira(1) and Kyunghyun Cho(2) - New York University(1,2), Facebook AI Research(2) [Nogueira, et al. '19] and [Code] ReRanking January 7th, 2019 0.359 0.365
SAN + BERT base Yu Wang, Xiaodong Liu, Jianfeng Gao - Deep Learning Group, Microsoft Research AI [Xiaodong, et al. '18] ReRanking January 22nd, 2019 0.359 0.370
BERT + Projected Matching Yifan Qiao(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Lui(4) - Tsinghua University(1,3,4), Microsoft Research(2) [ Qiao et al. '19] ReRanking February 7th,2019 0.356 -
BERT base + L2R Ming Yan of Alibaba Damo NLP ReRanking March 16th,2019 0.356 0.364
LBERT-base anonymous ReRanking March 1st, 2019 0.349 -
BERT-Base Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Min Zhang, Shaoping Ma of the Information Retrieval Group, Tsinghua University ReRanking April 8th, 2020 0.349 0.358
BERT base + attention ranking anonymous ReRanking March 1st, 2019 0.347 0.317
BERT + Small Training Xue-He Wang, Chia-Hung Yuan, Bing-Han Chiang, Dong-Ze Wu, Lu-Dan Ruan, Shan-Hung Wu of National Tsing Hua University ReRanking June 20th, 2019 0.347 0.361
BERT-base +ranking loss + horovod Milk&Cereal ReRanking May 6th, 2019 0.346 0.352
BERT-base fine-tune ICT-NLU ReRanking May 23rd, 2019 0.346 0.349
BERT, Roberta, Electra, Anserini, DeepCT retrieval models (ensembled) Leonid Pugachev, DeepPavlov- Moscow Institute of Physics and Technology ReRanking July 20th, 2020 0.346 0.394
BM25 + BERT Base OpenMatch - THU-MSR - [Code] ReRanking August 7th, 2020 0.345 0.349
BERT base + attention ranking anonymous ReRanking March 11th, 2019 0.344 -
BM25 + Electra Base OpenMatch - THU-MSR - [Code] ReRanking August 7th, 2020 0.344 0.352
BERT base + attention ranking anonymous ReRanking March 4th, 2019 0.343 -
Bert-base + hinge ranking loss Milk&Cereral ReRanking April 24th, 2019 0.342 0.345
BERT + L2R ICT-NLU ReRanking June 11th, 2019 0.342 0.348
BERT-LLR Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Min Zhang, Shaoping Ma of the Information Retrieval Group, Tsinghua University ReRanking April 6th, 2020 0.342 0.352
BERT-RI Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Min Zhang, Shaoping Ma of the Information Retrieval Group, Tsinghua University ReRanking April 7th, 2020 0.340 0.352
BERT+ENA Di Zhao, Hui Fang, UD Infolab ReRanking August 11th, 2019 0.339 -
BERT Base + Highway + Cross Entropy Loss + Axioms Di Zhao, Hui Fang, UD Infolab ReRanking August 9th, 2019 0.336 0.340
BERT Base + Highway+Cross Entropy Loss + Axioms Di Zhao, Hui Fang, UD Infolab ReRanking August 11th, 2019 0.336 -
BERT Base OpenMatch of THU-MSR - Code ReRanking July 28th, 2020 0.336 -
ME-Hybrid Google Research Full Ranking August 18th, 2020 0.336 0.343
BERT base + attention ranking anonymous ReRanking March 2nd, 2019 0.335 -
BERT Base Finetuned 400k steps Chaitanya Sai Alaparthi of IIIT-Hyderabad ReRanking February 13th, 2020 0.335 -
BERT + CNN Chia-Hung Yuan, Bing-Han Chiang, Xue-He Wang, Dong-Ze Wu, Lu-Dan Ruan, Shan-Hung Wu of National Tsing Hua University ReRanking June 15th, 2019 0.333 0.346
BERT + Multilayer Interaction Yifan Qiao(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Lui(4) - Tsinghua University(1,3,4), Microsoft Research(2) [ Qiao et al. '19] ReRanking February 19th,2019 0.329 0.311
BERT base + ranking Yifan Qiao(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Lui(4) - Tsinghua University(1,3,4), Microsoft Research(2) [ Qiao et al. '19] ReRanking February 8th, 2019 0.326 0.316
BERT Base + Highway + Ranking Loss Di Zhao, Hui Fang, UD Infolab ReRanking August 9th, 2019 0.323 -
ME-BERT Google Research Full Ranking August 10th, 2020 0.323 0.334
RDBERT-Embedding Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Min Zhang, Shaoping Ma - Information Retrieval Group, Tsinghua University ReRanking May 13th, 2020 0.313 0.320
FastText + Conv-KNRM (Ensemble) Sebastian Hofstätter (1), Navid Rekabsaz (2), Carsten Eickhoff (3), and Allan Hanbury (1) - TU Wien(1), Idiap Research Institute(2), Brown University(3) [ Hofstätter et al. '19] and [Code] ReRanking May 8th, 2019 0.309 0.318
biLSTM + Co-attention on n-grams + query-based scorer Chaitanya Sai Alaparthi-IIIT-Hyderabad ReRanking June 16th,2020 0.309 0.319
DE-Hybrid Google Research Full Ranking September 18th, 2020 0.306 0.309
BiLSTM + Co-attention on n-grams Chaitanya Sai Alaparthi of IIIT-Hyderabad ReRanking May 14th, 2020 0.299 0.310
n-gram co-attention Yon ReRanking July 23rd, 2020 0.299 0.303
DE-BERT Google Research Full Ranking July 31st, 2020 0.295 0.302
RepBERT Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Min Zhang, Shaoping Ma - IR Group, Tsinghua University - [Zhan et al '20] and Code Full Ranking June 23rd, 2020 0.294 0.304
BiLSTM + Co-Attention + self attention based document scorer Chaitanya Sai Alaparthi - IIIT-Hyderabad - [Alaparthi et al '19] ReRanking April 29th, 2020 0.291 0.298
docTTTTTquery + T5QLM IELab - The University of Queensland Full Ranking September 3rd, 2020 0.289 0.300
BiLSTM + CoAttention Chaitanya Sai Alaparthi - IIIT-Hyderabad [Alaparthi et al '19] ReRanking April 13th, 2020 0.286 0.288
IRNet (Deep CNN/IR Hybrid Network) Dave DeBarr, Navendu Jain, Robert Sim, Justin Wang, Nirupama Chandrasekaran – Microsoft ReRanking January 2nd, 2019 0.281 0.278
FastText + Conv-KNRM (Single) Sebastian Hofstätter (1), Navid Rekabsaz (2), Carsten Eickhoff (3), and Allan Hanbury (1) - TU Wien(1), Idiap Research Institute(2), Brown University(3) [ Hofstätter et al. '19] and [Code] ReRanking May 8th, 2019 0.277 0.290
docTTTTTquery Rodrigo Nogueira (Epistemic AI), Jimmy Lin (University of Waterloo) [Paper] and [Code] Full Ranking November 27th, 2019 0.272 0.277
Neural Kernel Match IR (Conv-KNRM) (Ensembled)(1)Yifan Qiao, (2)Chenyan Xiong, (3)Zhenghao Liu, (4)Zhiyuan Liu-Tsinghua University(1, 3, 4); Microsoft Research AI(2) [Dai et al. '18] ReRanking November 28th, 2018 0.271 0.290
Axiom-Regularized Conv-KNRM Corby Rosset, Bhaskar Mitra, Chenyan Xiong, Nick Craswell, Xia Song, Saurabh Tiwary - Microsoft AI & Research[Rosset et al. '19] ReRanking February 19, 2019 0.263 0.262
α-SVS NTT Media Intelligence Laboratories Full Ranking June 29th, 2020 0.262 0.259
Encoder-Decoder model with attention + multi loss Youngjin Jang ReRanking June 3rd, 2020 0.261 0.273
R3D anonymous ReRanking March 4th, 2020 0.260 0.243
BERT, Roberta, Electra, Anserini, DeepCT retrieval models (ensembled) Leonid Pugachev, DeepPavlov- Moscow Institute of Physics and Technology ReRanking June 23rd, 2020 0.259 0.263
[Official Baseline] Duet V2 (Ensembled) Bhaskar Mitra, Fernando Diaz, Nick Craswell - Microsoft AI & Research [Mitra et al. '19] and [Code] ReRanking February 19, 2019 0.253 0.252
Duet with query term independence assumption (Single) Bhaskar Mitra (1, 2), Corby Rosset (1), David Hawking (1), Nick Craswell (1), Fernando Diaz (1), and Emine Yilmaz (2) of (1) Microsoft & (2) UCL Paper ReRanking March 14th, 2019 0.252 0.254
Neural Kernel Match IR (Conv-KNRM) (Single)(1)Yifan Qiao, (2)Chenyan Xiong, (3)Zhenghao Liu, (4)Zhiyuan Liu-Tsinghua University(1, 3, 4); Microsoft Research AI(2) [Dai et al. '18] ReRanking February 19, 2019 0.247 0.247
BM25 (Anserini) + ALBERT Bi-encoder for First-stage Ranking Marco Wrzalik of the LAVIS Group at RheinMain University of Applied Sciences Full Ranking April 24th, 2020 0.247 0.249
[Official Baseline] Duet V2 (Single) Bhaskar Mitra, Fernando Diaz, Nick Craswell - Microsoft AI & Research [Mitra et al. '19s] and [Code] ReRanking February 20, 2019 0.245 0.243
DW Index + BM25 anonymous Full Ranking April 29th, 2019 0.239 0.243
BERT Base + Highway + Cross Entropy Loss + Axioms Di Zhao, Hui Fang, UD Infolab ReRanking August 5th, 2019 0.223 0.340
BERT Base + Highway + Ranking Loss Di Zhao, Hui Fang, UD Infolab ReRanking August 5th, 2019 0.222 0.340
BM25 (Anserini) + doc2query (1)Rodrigo Nogueira, (2)Wei Yang, (3)Jimmy Lin, (4)Kyunghyun Cho - New York University(1,4), University of Waterloo(2,3), Facebook AI Research(4) [Nogueira et al. '19] and [Code] Full Ranking April 10th, 2019 0.218 0.215
Neural Kernel Match IR (Conv-KNRM) (Ensembled)(1)Yifan Qiao, (2)Chenyan Xiong, (3)Zhenghao Liu, (4)Zhiyuan Liu-Tsinghua University(1, 3, 4); Microsoft Research AI(2) [Dai et al. '18] ReRanking November 26th, 2018 0.199 0.199
Neural Kernel Match IR (KNRM) ((1)Yifan Qiao, (2)Chenyan Xiong, (3)Zhenghao Liu, (4)Zhiyuan Liu-Tsinghua University(1, 3, 4); Microsoft Research AI(2) [ Xiong et al. '17] ReRanking December 10th, 2018 0.198 0.218
Feature-based LeToR: simple-feature based RankSVM(1)Yifan Qiao, (2)Chenyan Xiong, (3)Zhenghao Liu, (4)Zhiyuan Liu-Tsinghua University(1, 3, 4); Microsoft Research AI(2) ReRanking December 10th, 2018 0.191 0.195
BM25 (Lucene8, tuned) (1)Rodrigo Nogueira, (2)Wei Yang, (3)Jimmy Lin, (4)Kyunghyun Cho - New York University(1,4), University of Waterloo(2,3), Facebook AI Research(4) [Nogueira, et al. '19] Full Ranking June 26th, 2019 0.190 0.187
BM25 (Anserini) (1)Rodrigo Nogueira, (2)Wei Yang, (3)Jimmy Lin, (4)Kyunghyun Cho - New York University(1,4), University of Waterloo(2,3), Facebook AI Research(4) [Nogueira et al. '19] and [Code] Full Ranking April 10th, 2019 0.186 0.184
Unnamed Hongyin Zhu ReRanking June 26th, 2019 0.174 -
[Official Baseline] BM25 Stephen E. Robertson; Steve Walker; Susan Jones; Micheline Hancock-Beaulieu & Mike Gatford (Implemented by MSMARCO Team) [ Robertson et al. '94] Full Ranking November 1st, 2018 0.165 0.167
FastMatch Anonymous ReRanking August 17th, 2020 0.154 0.329
BERT Representation Yifan Qiao(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4) - Tsinghua University(1,3,4), Microsoft Research(2) [Qiao et al. '19] ReRanking February 19th, 2019 0.015 0.043

KeyPhrase Extraction: RETIRED (10/18/2019-10/30/2020)

Keyphrase extraction on open domain documents is an emerging area that can be used for many NLP tasks like document ranking, topic clustering, etc. To enable the research community to build performant keyphrase extraction systems, we have built OpenKP, a human-annotated collection of keyphrases on a wide variety of documents.

The dataset features 148,124 real-world web documents along with a human annotation indicating the 1-3 most relevant keyphrases. More information about the dataset and our initial experiments can be found in the paper Open Domain Web Keyphrase Extraction Beyond Language Modeling, which was an oral presentation at EMNLP-IJCNLP 2019. It is part of the MSMARCO dataset family, and research projects like this power the core document understanding pipeline that Bing uses.

Tasks

  1. Given a document, produce the top 3 most salient keyphrases.
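The leaderboard below scores each predicted list against the human-annotated keyphrases with F1 at cutoffs 1, 3, and 5. A minimal sketch of F1@k for a single document, assuming exact-match comparison after lowercasing (the official scorer may apply additional normalization):

```python
def f1_at_k(predicted, gold, k):
    """F1@k for one document: precision and recall of the top-k
    predicted keyphrases against the gold set, exact lowercase match."""
    top_k = [p.lower() for p in predicted[:k]]
    gold_set = {g.lower() for g in gold}
    hits = sum(1 for p in top_k if p in gold_set)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(gold_set)
    return 2 * precision * recall / (precision + recall)

predicted = ["deep learning", "search", "msmarco", "bing"]
gold = ["deep learning", "msmarco"]
print(f1_at_k(predicted, gold, 3))  # precision 2/3, recall 1 -> F1 = 0.8
```

The leaderboard numbers average this per-document score over the evaluation set.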

Relevant Links

KeyPhrase Extraction Github Repo Paper

Dataset Download links

KeyPhrase Extraction (10/18/2019), ranked by F1@3 on Eval

Rank Model Submission Date F1 @1,@3,@5
ETC-large anonymous May 31st, 2020 0.393, 0.420, 0.360
RoBERTa-JointKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.364, 0.391, 0.338
RoBERTa-RankKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.361, 0.390, 0.337
SpanBERT-JointKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.359, 0.385, 0.335
RoBERTa-TagKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.356, 0.381, 0.332
SpanBERT-RankKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.355, 0.380, 0.331
BERT-JointKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.349, 0.376, 0.325
SpanBERT-TagKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.351, 0.374, 0.325
BERT-RankKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.342, 0.374, 0.325
RoBERTa-ChunkKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.355, 0.373, 0.324
SpanBERT-ChunkKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.348, 0.372, 0.324
BERT-TagKPE (Base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.343, 0.364, 0.318
BERT (Base) Sequence Tagging Baseline Si Sun (Tsinghua University), Chenyan Xiong (MSR AI), Zhiyuan Liu (Tsinghua University) [Code] November 5th, 2019 0.321, 0.361, 0.314
BERT-ChunkKPE (base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.340, 0.355, 0.311
SpanBERT-SpanKPE (base)Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.329, 0.351, 0.304
RoBERTa-SpanKPE (base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.330, 0.350, 0.305
LLbeBack Rodrigo Nogueira (Epistemic AI), Jimmy Lin (University of Waterloo) November 19th, 2019 0.349, 0.341, 0.246
BERT-SpanKPE (base) Si Sun(1), Chenyan Xiong(2), Zhenghao Liu(3), Zhiyuan Liu(4), Jie Bao(5) - Tsinghua University(1,3,4,5), MSR AI(2)- [Sun et al '20] and [Code] February 6th, 2020 0.317, 0.332, 0.289
Baseline finetuned on Bing Queries MSMARCO Team [Xiong, et al. '19] October 19th, 2019 0.267, 0.292, 0.209
Baseline MSMARCO Team [Xiong, et al. '19] October 19th, 2019 0.244, 0.277, 0.198

Question Answering and Natural Language Generation: RETIRED (12/01/2016-10/30/2020)

The original focus of MS MARCO was to provide a corpus for training and testing systems that, given a real user query, provide the most likely candidate answer in language that is natural and conversational.

This data comes in three tasks/forms: the original QnA dataset (v1.1), Question Answering (v2.1), and Natural Language Generation (v2.1). The original question answering dataset featured 100,000 examples and was released in 2016. Its leaderboard is now closed, but the data is available below.

The current competitive tasks are Question Answering and Natural Language Generation. Question Answering features over 1,000,000 queries and is much like the original QnA dataset but bigger and with higher quality. The Natural Language Generation dataset features 180,000 examples and builds upon the QnA dataset to deliver answers that could be spoken by a smart speaker.

Tasks

  1. QnA (v1.1, now closed): Given a query and 10 candidate passages, select the most relevant one and use it to answer the question.
  2. QnA (v2.1): Given a query and 10 candidate passages, select the most relevant one and use it to answer the question.
  3. NLGEN (v2.1): Given a query and 10 candidate passages, select the most relevant one and use it to answer the question. Provide the answer in a way that could be read from a smart speaker and make sense without any additional context.
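The generated answers are scored against the human reference answers with Rouge-L and Bleu-1, the two columns in the leaderboard below. Rouge-L rests on the longest common subsequence (LCS) between candidate and reference. A minimal sketch of an LCS-based Rouge-L F-measure over whitespace tokens (the official evaluation scripts use their own tokenization and Bleu implementation):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    """Rouge-L F-measure: harmonic mean of LCS-based precision and recall."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

# the 4-token LCS "the capital of france" gives precision = recall = 4/6
print(rouge_l("the capital of france is paris", "paris is the capital of france"))
```

Because Rouge-L uses a subsequence rather than contiguous n-grams, it rewards answers that preserve word order without requiring exact phrasing, which suits the conversational NLGEN task.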

Relevant Links

Paper Question Answering Github Repo

Dataset Download links

Question Answering Task: RETIRED (03/01/2018-10/30/2020) Leaderboard

Rank Model Submission Date Rouge-L Bleu-1
Multi-doc Enriched BERT Ming Yan of Alibaba Damo NLP June 20th, 2019 0.540 0.565
Human Performance April 23rd, 2018 0.539 0.485
BERT Encoded T-Net Y. Zhang, C. Wang, X.L. Chen August 5th, 2019 0.526 0.539
Selector+Combine-Content-Generator QA Model Shengjie Qian of Caiyun xiaoyi AI and BUPT March 19th, 2019 0.525 0.544
LM+Generator Alibaba Damo NLP November 25th,2019 0.522 0.516
Masque Q&A Style NTT Media Intelligence Laboratories [Nishida et al. '19] January 3rd, 2019 0.522 0.437
Deep Cascade QA Ming Yan of Alibaba Damo NLP [Yan et al. '18] December 12th, 2018 0.520 0.546
Unnamed anonymous December 9th,2019 0.518 0.507
PALM Alibaba Damo NLP December 9th,2019 0.518 0.507
VNET Baidu NLP [Wang et al. '18] November 8th, 2018 0.516 0.543
LNET S.L. Liu of NEUKG April 8th, 2020 0.514 0.553
MultiLM QnA Model anonymous December 2nd, 2019 0.514 0.498
LNETS.L. Liu of NEUKG March 23rd,2020 0.506 0.542
BERT Encoded T-NET Y. Zhang, C. Wang, X.L. Chen July 12th, 2019 0.506 0.525
MultiLM QnA Model anonymous December 5th, 2019 0.499 0.430
BERT+ Multi-Pointer-Generator Tongjun Li of the ColorfulClouds Tech and BUPT June 11th, 2019 0.498 0.525
Selector+Combine-Content-Generator NL Model Shengjie Qian of Caiyun xiaoyi AI and BUPT March 11th, 2019 0.496 0.535
REAG Anonymous March 27th, 2020 0.495 0.500
CompLM Alibaba Damo NLP December 2nd, 2019 0.495 0.516
LM+Generator anonymous November 21st,2019 0.494 0.529
PALM Alibaba Damo NLP December 9th,2019 0.492 0.510
anonymous anonymous December 16th,2019 0.492 0.499
LNET S.L. Liu of the NEUKG Nov 19th, 2019 0.491 0.530
BERT+ Multi-Pointer-Generator Tongjun Li of the ColorfulClouds Tech and BUPT May 21st, 2019 0.491 0.520
MUSST-NLG Anonymous May 15th, 2020 0.490 0.516
CompLM Alibaba Damo NLP December 3rd, 2019 0.490 0.502
Masque NLGEN Style NTT Media Intelligence Laboratories [Nishida et al. '19] January 3rd, 2019 0.489 0.488
roberta_T_tlcd_18k Anonymous May 14th, 2020 0.483 0.516
Communicating BERT Xuan Liang of RIDLL from the University of Technology Sydney October 4th, 2019 0.483 0.506
MDC-Generator Ssk-nlp April 23rd, 2020 0.482 0.516
MultiLM NLGen Model anonymous December 2nd, 2019 0.482 0.514
LM+Generator anonymous November 19th, 2019 0.478 0.481
MultiLM NLGen Model anonymous December 5th, 2019 0.475 0.479
BERT + Transfer anonymous October 16th, 2019 0.474 0.499
Bert Based Multi-task ZhangY & WangC June 26th, 2019 0.471 0.512
T-RoBERTa-wf-BERTbaseA-120k Anonymous February 13th, 2020 0.471 0.483
BERT-SS-K1-100k Anonymous January 26th, 2020 0.470 0.493
T-RoBERTa-wf-BERTbaseA-80k Anonymous February 21st, 2020 0.468 0.500
Multi-passage QA Model SudaNLP October 21st, 2020 0.466 0.508
BERT-SS-K1-100k Anonymous February 2nd, 2020 0.464 0.485
BERT-RGLM Anonymous April 22nd, 2020 0.457 0.479
REAG Anonymous May 28th, 2020 0.456 0.449
SNET + CES2S Bo Shao of SYSU University July 24th, 2018 0.450 0.464
ranking+nlg anonymous October 9th, 2019 0.449 0.468
ranker-reader RCZoo of UCAS May 15th, 2019 0.441 0.371
Extraction-net zlsh80826 October 20th, 2018 0.437 0.444
SNET JY Zhao August 30th, 2018 0.436 0.463
BIDAF+ELMo+SofterMax Wang Changbao November 16th, 2018 0.436 0.459
ranking+nlg anonymous August 12th, 2019 0.434 0.411
DNET QA Geeks August 1st, 2018 0.432 0.479
T-RoBERTa-wf-BERTbaseA-120k Anonymous February 13th, 2020 0.431 0.424
KIGN-QA Chenliang Li April 22nd, 2019 0.429 0.404
MaRCo-da-GAAMA IBM Research AI Multilingual NLP Group April 7th, 2020 0.426 0.462
Reader-Writer Microsoft Business Applications Group AI Research September 16th, 2018 0.421 0.436
Masque2 (single / NLG Style) NTT Media Intelligence Laboratories October 22nd, 2020 0.419 0.469
BERT+Multi-Loss S.L. Liu of NEUKG November 4th, 2019 0.413 0.422
REAG (based on PALM) anonymous June 1st, 2020 0.410 0.430
RGLM anonymous May 5th, 2020 0.406 0.455
SNET+seq2seq Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS June 1st, 2018 0.398 0.423
SSK3+BERTBaseAnswerGenerator anonymous Jan 21st, 2020 0.391 0.413
MP-MRC BERT H.Y. Zhang Aug 27th, 2020 0.389 0.410
MP-MRC BERT-base H.Y. Zhang Sep 4th, 2020 0.388 0.411
MUSST anonymous March 31st, 2020 0.376 0.405
Anonymous anonymous October 12th, 2020 0.359 0.409
fj-net(single) yzm nlp group August 3rd, 2020 0.343 0.409
MNet-Base(Single) NLGEN fuii of iDW July 8th, 2020 0.337 0.405
fj-reader(single) yzm nlp group July 28th, 2020 0.336 0.404
Generation with latent retrieval per answer anonymous May 11th, 2020 0.335 0.290
MDCG-Base ssk-nlp June 8th, 2020 0.334 0.398
MUSST-NLG Anonymous June 2nd, 2020 0.334 0.388
MDCC-Base ssk-nlp June 10th, 2020 0.333 0.400
Generation with latent retrieval Baseline 2 anonymous May 11th, 2020 0.331 0.307
MDCC ssk-nlp June 10th, 2020 0.328 0.391
Generation with latent retrieval Baseline 1 anonymous May 11th, 2020 0.305 0.275
MultiTask+DataAug+Unlikelihood UvA June 3rd, 2020 0.300 0.332
MUSST-QA Anonymous June 1st, 2020 0.298 0.354
lightNLP+BiDAF Enliple AI February 1st, 2019 0.298 0.156
Pretrained seq2seq model BDEG September 10th, 2020 0.290 0.331
roberta_T_tlx_90k Anonymous July 29th, 2020 0.286 0.327
BIDAF+seq2seq Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS May 29th, 2018 0.276 0.288
BiDaF Baseline (Implemented By MSMARCO Team) Allen Institute for AI & University of Washington [Seo et al. '16] April 23rd, 2018 0.240 0.106
TrioNLP + BiDAF Trio.AI of the CCNU September 23rd, 2018 0.205 0.232
BiDAF + LSTM Meefly January 15th, 2019 0.153 0.120

Natural Language Generation Task: RETIRED (03/01/2018-10/30/2020)

Rank Model Submission Date Rouge-L Bleu-1
Human Performance April 23rd, 2018 0.632 0.530
PALM Alibaba Damo NLP December 16th, 2019 0.498 0.499
REAG Anonymous March 27th, 2020 0.498 0.497
Masque NLGEN Style NTT Media Intelligence Laboratories [Nishida et al. '19] January 3rd, 2019 0.496 0.501
CompLM Alibaba Damo NLP December 3rd, 2019 0.496 0.489
PALM Alibaba Damo NLP December 9th, 2019 0.496 0.484
BERT+ Multi-Pointer-Generator Tongjun Li of the ColorfulClouds Tech and BUPT June 11th, 2019 0.495 0.476
CompLM Alibaba Damo NLP November 19th, 2019 0.495 0.470
CompLM Alibaba Damo NLP December 2nd, 2019 0.493 0.475
BERT+ Multi-Pointer-Generator Tongjun Li of the ColorfulClouds Tech and BUPT May 21st, 2019 0.491 0.474
CompLM Alibaba Damo NLP November 19th, 2019 0.488 0.485
roberta_T_tlcd_18k Anonymous May 14th, 2020 0.487 0.468
BERT+ Multi-Pointer-Generator Tongjun Li of the ColorfulClouds Tech and BUPT March 26th, 2019 0.487 0.465
Selector+Combine-Content-Generator NLGEN Model Shengjie Qian of Caiyun xiaoyi AI and BUPT March 11th, 2019 0.487 0.449
VNET Baidu NLP [Wang et al. '18] November 8th, 2018 0.484 0.468
BERT+ Multi-Pointer-Generator (Single) Tongjun Li of the ColorfulClouds Tech and BUPT March 19th, 2019 0.484 0.459
Communicating BERT Xuan Liang of RIDLL from the University of Technology Sydney October 4th, 2019 0.483 0.472
MultiLM NLGen Model anonymous December 2nd, 2019 0.483 0.461
ranking+nlg anonymous October 9th, 2019 0.481 0.468
MUSST-NLG Anonymous May 15th, 2020 0.480 0.458
MultiLM NLGen Model anonymous December 5th, 2019 0.478 0.481
BERT-RGLM Anonymous April 22nd, 2020 0.470 0.452
BERT-SS-K1-100k anonymous January 26th, 2020 0.470 0.437
MDC-Generator Ssk-nlp April 23rd, 2020 0.466 0.446
BERT-SS-K1-100k anonymous February 2nd, 2020 0.465 0.427
T-RoBERTa-wf-BERTbaseA-120k Anonymous February 17th, 2020 0.464 0.420
T-RoBERTa-wf-BERTbaseA-80k Anonymous February 21st, 2020 0.463 0.438
ranking+nlg anonymous October 9th, 2019 0.462 0.451
PM-MUG-1 anonymous May 20th, 2020 0.453 0.441
PM-MUG-2 anonymous May 20th, 2020 0.452 0.449
SNET + CES2S Bo Shao of SYSU University July 24th, 2018 0.450 0.406
MaRCo-da-GAAMA IBM Research AI Multilingual NLP Group April 7th, 2020 0.448 0.402
REAG (based on PALM) anonymous June 1st, 2020 0.447 0.444
Masque2 (single / NLG Style) NTT Media Intelligence Laboratories October 22nd, 2020 0.445 0.423
KIGN-QA Chenliang Li April 22nd, 2019 0.441 0.462
Reader-Writer Microsoft Business Applications Group AI Research September 16th, 2018 0.439 0.426
ranking+nlg anonymous August 12th, 2019 0.439 0.411
RGLM anonymous May 5th, 2020 0.435 0.413
T-RoBERTa-wf-BERTbaseA-120k Anonymous February 13th, 2020 0.427 0.364
ConZNet Samsung Research [Indurthi et al. '18] July 16th, 2018 0.421 0.386
Anonymous Anonymous November 21st, 2019 0.412 0.410
Bayes QA Bin Bi of Alibaba NLP June 14th, 2018 0.411 0.435
Generation with latent retrieval per answer anonymous May 11th, 2020 0.408 0.442
Generation with latent retrieval Baseline 2 anonymous May 11th, 2020 0.401 0.415
SNET+seq2seq Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS June 1st, 2018 0.401 0.375
MUSST anonymous March 31, 2020 0.392 0.359
SSK3+BERTBaseAnswerGenerator Anonymous Jan 21st, 2020 0.384 0.356
Generation with latent retrieval Baseline 1 anonymous May 11th, 2020 0.382 0.416
BPG-NET Zhijie Sang of the Center for Intelligence Science and Technology Research(CIST) of the Beijing University of Posts and Telecommunications (BUPT) August 1st, 2018 0.382 0.347
GUM anonymous from anonymous September 4th, 2019 0.375 0.438
MDCC-Base ssk-nlp June 10th, 2020 0.358 0.362
MDCG-Base ssk-nlp June 8th, 2020 0.358 0.359
fj-net(single) yzm nlp group August 3rd, 2020 0.353 0.363
Deep Cascade QA Ming Yan of Alibaba Damo NLP October 25th, 2018 0.351 0.374
MNet-Base(Single) NLGEN fuii of iDW July 8th, 2020 0.350 0.354
fj-reader(single) yzm nlp group July 28th, 2020 0.350 0.350
MDCC ssk-nlp June 10th, 2020 0.349 0.350
MUSST-NLG Anonymous June 2nd, 2020 0.340 0.358
AE + ReRanking + Bert Based Multi-task ZhangY & WangC July 12th, 2019 0.331 0.376
BERT Encoded T-Net Y. Zhang, C. Wang, X.L. Chen August 5th, 2019 0.329 0.373
MultiTask+DataAug+Unlikelihood UvA June 3rd, 2020 0.327 0.347
Multi-doc Enriched BERT Ming Yan of Alibaba Damo NLP June 20th, 2019 0.325 0.377
BIDAF+seq2seq Yihan Ni of the CAS Key Lab of Web Data Science and Technology, ICT, CAS May 29th, 2018 0.322 0.283
BERT Encoded T-Net Y. Zhang, C. Wang, X.L. Chen July 12th, 2019 0.320 0.361
Unnamed Anonymous December 9th, 2019 0.318 0.384
roberta_T_tlx_90k Anonymous July 29th, 2020 0.303 0.298
Pretrained seq2seq model BDEG September 10th, 2020 0.302 0.294
LM+Generator anonymous November 25th, 2019 0.299 0.372
LNET S.L. Liu of NEUKG April 8th, 2020 0.294 0.352
LNET S.L. Liu of NEUKG March 23rd, 2020 0.293 0.347
Masque Q&A Style NTT Media Intelligence Laboratories [Nishida et al. '19] January 3rd, 2019 0.285 0.399
Bert Based Multi-task ZhangY & WangC June 26th, 2019 0.284 0.349
Selector+Combine-Content-Generator QA Model Shengjie Qian of Caiyun xiaoyi AI and BUPT March 11th, 2019 0.281 0.337
DNET QA Geeks August 1st, 2018 0.275 0.332
ranker-reader RCZoo of UCAS May 15th, 2019 0.271 0.382
BIDAF+ELMo+SofterMax Wang Changbao November 16th, 2018 0.268 0.346
BERT+Multi-Loss S.L. Liu of NEUKG November 4th, 2019 0.266 0.422
LNET S.L. Liu of the NEUKG Nov 19th, 2019 0.266 0.339
MultiLM QnA Model anonymous December 2nd, 2019 0.266 0.340
MultiLM NLGen Model anonymous December 5th, 2019 0.257 0.360
REAG Anonymous May 28th, 2020 0.247 0.328
Multi-passage QA Model SudaNLP October 21st, 2020 0.247 0.323
Extraction-net zlsh80826 August 14th, 2018 0.247 0.321
SNET JY Zhao May 29th, 2018 0.247 0.308
MP-MRC BERT H.Y. Zhang Aug 27th, 2020 0.211 0.258
MP-MRC BERT-base H.Y. Zhang Sep 4th, 2020 0.211 0.258
lightNLP+BiDAF Enliple AI February 1st, 2019 0.210 0.108
Anonymous anonymous October 12th, 2020 0.195 0.280
MUSST-QA Anonymous June 1st, 2020 0.187 0.285
BiDaF Baseline (Implemented By MSMARCO Team) Allen Institute for AI & University of Washington [Seo et al. '16] April 23rd, 2018 0.169 0.093
TrioNLP + BiDAF Trio.AI of the CCNU September 23rd, 2018 0.142 0.160
BiDAF + LSTM Meefly January 15th, 2019 0.119 0.173

MS MARCO V1: RETIRED (12/01/2016-03/31/2018)

Rank Model Submission Date Rouge-L Bleu-1
MARS YUANFUDAO research NLP March 26th, 2018 0.497 0.480
Human Performance December 2016 0.470 0.460
V-Net Baidu NLP [Wang et al '18] February 15th, 2018 0.462 0.445
S-Net Microsoft AI and Research [Tan et al. '17] June 2017 0.452 0.438
R-Net Microsoft AI and Research [Wei et al. '16] May 2017 0.429 0.422
HieAttnNet Akaitsuki March 26th, 2018 0.423 0.448
BiAttentionFlow+ ShanghaiTech University GeekPie_HPC team March 11th, 2018 0.415 0.381
ReasoNet Microsoft AI and Research [Shen et al. '16] April 28th, 2017 0.388 0.399
Prediction Singapore Management University [Wang et al. '16] March 2017 0.373 0.407
FastQA_Ext DFKI German Research Center for AI [Weissenborn et al. '17] March 2017 0.337 0.339
FastQA DFKI German Research Center for AI [Weissenborn et al. '17] March 2017 0.321 0.340
Flypaper Model ZhengZhou University March 14th, 2018 0.317 0.342
DCNMarcoNet Flying Riddlers @ Carnegie Mellon University March 31st, 2018 0.313 0.238
BiDaF Baseline for V2 (Implemented By MSMARCO Team) [Seo et al. '16] April 23rd, 2018 0.268 0.129
ReasoNet Baseline trained on SQuAD, Microsoft AI & Research [Shen et al. '16] December 2016 0.192 0.148

Usefulness Data(Released 02/02/2020)

Data associated with the WebConf 2020 paper Leading Conversational Search by Suggesting Useful Questions.

Relevant Links

Usefulness Github Repo

Dataset Download links

Conversational Search(Released 04/23/2019)

Truly conversational search is the next logical step in the journey toward intelligent and useful AI. To make progress on it, researchers have long voiced a desire to study how people currently converse with search engines. To that end, we have released a large corpus of anonymized user search sessions.

We hope the community can use this corpus to explore what conversations with search engines look like.

Relevant Links

Conversational Search Github Repo

Dataset Download links

Optimal Crawling(Released 04/23/2019)

The dataset used for Optimal Freshness Crawl Under Politeness Constraints and Staying up to Date with Online Content Changes Using Reinforcement Learning for Scheduling, both of which focus on producing an optimal crawling schedule for a search engine given the constantly changing nature of the internet.

There is currently no public task associated with this dataset.

Relevant Links

Paper Paper 2 Optimal Crawl Github Repo

Dataset Download links

Message to NLP community and MSMARCO Community

TL;DR: We are closing the MS MARCO QnA and NLGEN leaderboards. Last submissions 10/23/2020.

Dear NLP community and Question Answering enthusiasts,
When we released MS MARCO v2 back in March of 2018, we did not expect how much love this dataset would receive from the community. Needless to say, we have been humbled not only by the number of submissions to the leaderboard but also by all the remarkable research that incorporated this dataset as part of its benchmarking efforts. While we originally envisioned that this dataset would be useful to the NLP and QnA communities, we were again humbled by how the dataset was adopted and evolved by the IR community for document and passage retrieval tasks. However, as you may have guessed, maintaining a public resource like the MS MARCO leaderboard takes significant time and effort, and we are grateful to the small but dedicated team of volunteers who maintain this website. Looking to the future, we believe that, given the size of this team and its limited resources, it is time to refocus our energy on the scenarios where MS MARCO can provide the most value to the research community. Toward that goal, we have made the hard (but, we believe, right) decision to retire the Question Answering and Natural Language Generation leaderboards. Neither task has made large leaps in quality in the last year, and we want to refocus our efforts on the document and passage retrieval tasks, where engagement with the research community is actively growing. As a result, the last day for submissions to the MS MARCO Question Answering, Natural Language Generation, and KeyPhrase Extraction leaderboards is October 23, 2020. Submissions to the document and passage retrieval leaderboards will continue as usual. We will continue to host all the datasets (including those for the tasks being retired), as we believe they can still serve as valuable resources for future research.
We want to thank all the participants again for their submissions and support of MS MARCO, and we hope to see the community around the IR tasks continue to grow. We are always listening for feedback, so please continue to send us your suggestions and requests.

Sincerely,
The MS MARCO Team

Changes To Dataset

10.23.2020: Task Retirement

1. Retired QnA V2 Task
2. Retired NLGEN V2 Task
3. Retired OpenKP Task

08.11.2020: New Task

1. Released Document Ranking task and 3 baselines.

07.30.2020: New Data

1. Released ORCAS Click data

02.11.2020: New Data

1. Released Usefulness Data

10.22.2019: New Datasets

1. Released OpenKP Keyphrase Extraction dataset!
2. Released Optimal Crawling Dataset!

05.06.2019: Fixed encoding issues with Ranking Dataset

1. Fixed various encoding issues in the ranking dataset.

04.23.2019: We have released a conversational search dataset

1. Brand new conversational search Dataset

10.26.2018: We have released a new ranking dataset based on the v2.1 dataset

1. Brand new Ranking Dataset
2. Basic Baseline and evaluation function

04.23.2018: We have released an update to the dataset. V2.1 includes the following:
1. Over 1 million queries
2. ~182k Well Formed Answers
3. Query type is now included for every query.
4. Bias in the evaluation set fixed (a small portion of answers in the V2.0 evaluation set could be found in the v1.1 set and the v2.0 well-formed sets; these have been removed from eval and added to train).
5. Utilities and Readme now available.

03.01.2018: We have released an update to the dataset. V2.0 includes the following:
1. ~900,000 unique queries
2. ~160k Well Formed Answers

01.30.2017: We have released an update to the dataset! V1.1 contains the following:
1. Improvements to dataset and evaluation scripts

12.01.2016: We have released our dataset! V1.0 contains the following:
1. 100,000 unique query answer pairs

MS MARCO Submission Instructions

Once you have built a model that meets your expectations in evaluation on the dev set, you can submit your test results for official evaluation on the test set. To preserve the integrity of official test results, we do not release the correct answers for the test set to the public. To submit your model for official evaluation on the test set, follow the steps below:

  1. Generate your candidate output for the dev set.
  2. Run the official evaluation methodology found in the task-specific git repo and verify your system runs as expected.
  3. Generate your candidate output for the eval/test set and submit the following information by emailing us.

Your email should include

  1. Candidate evaluation file
  2. Candidate dev file
  3. Individual/Team Name: Name of the individual or the team to appear in the leaderboard [Required]
  4. Individual/Team Institution: Name of the institution of the individual or the team to appear in the leaderboard [Optional]
  5. Model code: Training code for the model [Recommended]
  6. Model information: Name of the model/technique to appear in the leaderboard [Required]
  7. Paper Information: Name, Citation, URL of the paper if model is from a published work to appear in the leaderboard [Optional]

To avoid "p-hacking", we discourage too many submissions from the same group within a short period of time.
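Before submitting, it can help to sanity-check your dev-set output against the leaderboard metrics, ROUGE-L and BLEU-1. The sketch below is a simplified, illustrative implementation (whitespace tokenization, single reference, no brevity penalty or smoothing) and is not the official evaluation script from the task repos:

```python
from collections import Counter

def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists (DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F-measure over whitespace tokens (beta = 1 for simplicity)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def bleu_1(candidate, reference):
    """Clipped unigram precision; brevity penalty omitted in this sketch."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(n, r[w]) for w, n in c.items())
    return overlap / max(1, sum(c.values()))
```

For submissions, always run the official evaluation code from the task-specific git repo; scores from simplified reimplementations like this one can differ from the leaderboard numbers.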

About MS MARCO

Email us

Microsoft Machine Reading Comprehension (MS MARCO) is a collection of large-scale datasets for deep learning related to search. In MS MARCO, all questions are sampled from real, anonymized user queries. The context passages, from which answers in the dataset are derived, are extracted from real web documents using the most advanced version of the Bing search engine. The answers are human generated: judges wrote an answer whenever they could summarize one from the passages.

Current Team