A Dataset for Open-Domain Varying-hop Question Answering

What is BeerQA?

BeerQA is an open-domain question answering dataset that features questions requiring information from one or more Wikipedia documents to answer, which presents a more realistic challenge for open-domain question answering. BeerQA is constructed based on the Stanford Question Answering Dataset (SQuAD) and the HotpotQA dataset, by a team of NLP researchers at JD AI Research, Samsung Research, and Stanford University.

For more details about BeerQA, please refer to our EMNLP 2021 paper, "Answering Open-Domain Questions of Varying Reasoning Steps from Text".

Getting started

BeerQA is distributed under a CC BY-SA 4.0 License. The training and development sets can be downloaded below.

A more comprehensive guide to data download, preprocessing, baseline model training, and evaluation is included in our GitHub repository, linked below.

Once you have built your model, you can evaluate its performance with the evaluation script we provide below by running python <path_to_evaluation_script> <path_to_gold> <path_to_prediction>.
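As a rough illustration of what answer exact match (EM) and F1 measure for a single question, the sketch below follows the common SQuAD-style convention. It is not the official evaluation script: the normalization steps (lowercasing, removing punctuation and English articles) and the function names normalize_answer, exact_match, and f1_score are assumptions made for this example.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: this mirrors the usual SQuAD-style normalization, which may
    differ in detail from what the official script does.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, exact_match("The Eiffel Tower", "eiffel tower!") gives 1.0 under this normalization, while f1_score("Eiffel Tower in Paris", "Eiffel Tower") gives partial credit because only two of the four predicted tokens match.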

To submit your models and evaluate them on the official test sets, please read our submission guide hosted on Codalab.

We also release the processed Wikipedia used to create BeerQA (also under a CC BY-SA 4.0 License). It serves as the corpus for the fullwiki setting in our evaluation and, we hope, as a standalone resource for future research involving processed Wikipedia text. The documentation for this corpus is linked below.


If you use BeerQA in your research, please cite our paper with the following BibTeX entry:

  title={Answering Open-Domain Questions of Varying Reasoning Steps from Text},
  author = {Qi, Peng and Lee, Haejun and Sido, Oghenetegiri "TG" and Manning, Christopher D.},
  booktitle = {Empirical Methods for Natural Language Processing ({EMNLP})},
  year = {2021}
The BeerQA benchmark consists of question answering data from three sources: SQuAD, HotpotQA, and newly annotated questions that require at least three supporting Wikipedia documents to answer. The newly annotated questions test zero-shot generalization. We evaluate systems on answer exact match (EM) and F1 on each of these subsets, and use the macro average across the three subsets as the main evaluation metric.
Date: Nov 1, 2021
Model: Baseline Model: IRRR (single model)
Affiliation: JD AI Research, Samsung Research, & Stanford University
Reference: (Qi, Lee, Sido, and Manning, EMNLP 2021)

Subset              EM      F1
SQuAD Open          61.06   67.87
HotpotQA            58.12   69.29
3+ Hop Challenge    34.15   40.72
Macro Avg           51.11   59.30
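The macro average described above is the unweighted mean of the per-subset scores, so every subset counts equally regardless of how many questions it contains. The sketch below recomputes it from the baseline row, assuming each pair of numbers is (EM, F1) for SQuAD Open, HotpotQA, and the 3+ Hop Challenge in that order. The function name macro_average is hypothetical and not part of the official tooling.

```python
from statistics import mean


def macro_average(scores_by_subset):
    """Average EM and F1 across subsets, weighting every subset equally.

    scores_by_subset maps a subset name to an (EM, F1) pair.
    """
    macro_em = mean(em for em, _ in scores_by_subset.values())
    macro_f1 = mean(f1 for _, f1 in scores_by_subset.values())
    return macro_em, macro_f1


# Per-subset scores of the IRRR baseline, read off the leaderboard row.
# Assumption: the numbers are ordered as EM then F1 for each subset.
baseline = {
    "SQuAD Open": (61.06, 67.87),
    "HotpotQA": (58.12, 69.29),
    "3+ Hop Challenge": (34.15, 40.72),
}

macro_em, macro_f1 = macro_average(baseline)
print(f"Macro EM: {macro_em:.2f}, Macro F1: {macro_f1:.2f}")
```

Recomputing from the rounded per-subset scores reproduces the listed macro EM of 51.11 and gives a macro F1 of about 59.29. The listed 59.30 presumably comes from averaging the unrounded per-subset scores before rounding.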