A Dataset for Open-Domain Varying-hop Question Answering

Preprocessed Wikipedia for BeerQA

To build BeerQA, we downloaded the August 1st, 2020 dump of the English Wikipedia from Wikimedia, and preprocessed it with (our own fork of) WikiExtractor to extract plain text with hyperlinks, followed by Stanford CoreNLP (version 3.8.0) for tokenization and sentence segmentation.

The processed Wikipedia can be downloaded here (BZip2 format, 24,164,294,600 bytes, MD5¹ d7b50ec812c164681b4b2a7f5dfde898, distributed under a CC BY-SA 4.0 License). When decompressed², this results in a directory named enwiki-20200801-pages-articles-tokenized with numerous subdirectories, each containing a few .bz2 files (it is strongly suggested that you read these files programmatically, e.g. through the bz2 package in Python, instead of decompressing them, which would take up a lot of disk space). Each of these .bz2 files contains a few processed Wikipedia pages in JSON format. Each JSON object is on its own line, and has the following format:

    "id": 12,
    "url": "",
    "title": "Anarchism",
    "text": ["Anarchism", "\n\nAnarchism is a political philosophy and movement that rejects all involuntary ...],
    "offsets": [[[0, 9]], [[11, 20], [21, 23], [24, 25], [26, 35], [36, 46], [47, 50], [51, 59], [60, 64], [65, 72], [73, 76], ...],
    "text_with_links": ["Anarchism", "\n\nAnarchism is a <a href=\"political%20philosophy\">political philosophy</a> and <a href=\"Political%20movement\">movement</a> that rejects ...],
    "offsets_with_links": [[[0, 9]], [[11, 20], [21, 23], [24, 25], [26, 59], [59, 68], [69, 79], [79, 83], [84, 87], [88, 119], [119, 127], [127, 131], [132, 136], [137, 144], ...]

Here, id is the unique numerical ID of the article, url is the URL of the actual Wikipedia article, and title is the title of that article, which also serves as its textual identifier (case-insensitive).

text is a list of strings representing the plain text of the original Wikipedia article, where each string corresponds to one sentence. offsets parallels text: for each sentence, it contains a list of [start, end] character offsets, one per token, where the offsets are 0-based indices computed with respect to the beginning of the document.

We also provide versions of the text and offsets fields that preserve hyperlinks, under the names text_with_links and offsets_with_links, respectively. Note that hyperlinks are preserved as HTML tags directly in these sentences, and each hyperlink points to its target page by that page's textual identifier (in the href attribute), escaped with standard URL encoding for special characters (e.g. a space becomes %20, as can be seen in the example). All sentences in a Wikipedia document can be joined without separators to recover the original document (Python example: "".join(sentences)).
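The relationship between the fields can be sketched on a toy page that mirrors the example above (the sentences and offsets below are truncated to the first few tokens, purely for illustration):

```python
import urllib.parse

# Toy "text" and "offsets" values mirroring the example page above,
# truncated to the first tokens of the first two sentences.
sentences = ["Anarchism", "\n\nAnarchism is ..."]
offsets = [[[0, 9]], [[11, 20], [21, 23]]]

# Sentences join without separators to recover the document text.
document = "".join(sentences)

# Offsets are 0-based and document-relative, so each [start, end] pair
# slices a token directly out of the joined document.
tokens = [document[start:end]
          for sent_offsets in offsets
          for start, end in sent_offsets]
# tokens == ["Anarchism", "Anarchism", "is"]

# Hyperlink targets in text_with_links use standard URL encoding;
# decode them with urllib.parse.unquote.
target = urllib.parse.unquote("political%20philosophy")
# target == "political philosophy"
```

The same slicing works for text_with_links and offsets_with_links, where the offsets additionally cover the HTML tag characters.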

Related Links

  • The original Wikipedia dump used to map the questions and contexts in BeerQA (BZip2 format, 18,590,867,942 bytes, MD5 0ba34122315b0d2ee318cfa1d8c77a7a, distributed under the original CC BY-SA 3.0 Unported License from Wikipedia);
  • Our own fork of Attardi's WikiExtractor, where we fixed an issue in expanding convert templates so that text spans such as "123 km" are rendered properly (instead of being dropped altogether).

  1. To compute the MD5 checksum on a Unix system, run md5sum <bz2_file_you_downloaded>.

  2. On a Unix system, this can be done with tar -xjvf <bz2_file_you_downloaded>.