v2.5.0
QuAC Dataset
Load QuAC in Python fast with one line of code. 14K information-seeking QA conversations. Stream QuAC Dataset while training models in PyTorch & TensorFlow.

What is QuAC Dataset?

QuAC (Question Answering in Context) is a dataset of 14K information-seeking QA conversations (100K questions in total). Each conversation is a dialogue between two people: a student asks questions, and a teacher answers with short snippets of a Wikipedia text. QuAC presents challenges not found in existing machine comprehension datasets: its questions are often open-ended, unanswerable, or only meaningful in the context of the dialogue.

Download QuAC Dataset in Python

Instead of downloading the QuAC dataset manually, you can load it in Python with one line of code using our open-source package Hub.

Load QuAC Dataset Training Subset in Python

import hub
ds = hub.load("hub://activeloop/quac-train")

Load QuAC Dataset Validation Subset in Python

import hub
ds = hub.load("hub://activeloop/quac-val")
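
The two paths above differ only in the split suffix, so a small helper can select the split by name. This is a minimal sketch: `quac_path` and `load_quac` are hypothetical helper names, not part of Hub, and `load_quac` assumes the `hub` package is installed and the network is reachable.

```python
# Hypothetical helpers (not part of Hub) for choosing a QuAC split by name.
QUAC_SPLITS = {"train": "quac-train", "val": "quac-val"}


def quac_path(split: str) -> str:
    """Return the Hub path for a QuAC split ("train" or "val")."""
    if split not in QUAC_SPLITS:
        raise ValueError(
            f"unknown split {split!r}; expected one of {sorted(QUAC_SPLITS)}"
        )
    return f"hub://activeloop/{QUAC_SPLITS[split]}"


def load_quac(split: str):
    """Load a QuAC split with hub.load (needs hub installed and network access)."""
    import hub  # imported lazily so quac_path works without hub installed

    return hub.load(quac_path(split))


print(quac_path("train"))  # hub://activeloop/quac-train
print(quac_path("val"))    # hub://activeloop/quac-val
```

Calling `load_quac("val")` would then be equivalent to the validation snippet above.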

QuAC Dataset Structure

QuAC Data Fields

For the training set
  • id: tensor containing the id of the dialogue.
  • context: tensor containing the Wikipedia text.
  • followup_label: tensor containing a list of follow-up actions.
  • yesorno_answer: tensor containing the yes/no label for each answer in the dialogue: y represents yes, n represents no, and x represents neither.
  • question: tensor containing the questions in the dialogue.
  • answer_text: tensor containing the answers to the questions.
  • answer_start: tensor containing the starting offsets of the answers.
  • original_ans_text: tensor containing the original answers given by the teacher in the dialogue.
  • original_ans_start: tensor containing the starting offsets of the original answers.
For the validation set
  • id: tensor containing the id of the dialogue.
  • context: tensor containing the Wikipedia text.
  • followup_label: tensor containing a list of follow-up actions.
  • yesorno_answer: tensor containing the yes/no label for each answer in the dialogue: y represents yes, n represents no, and x represents neither.
  • question: tensor containing the questions in the dialogue.
  • answer_text: tensor containing the answers to the questions.
  • answer_start: tensor containing the starting offsets of the answers.
  • original_ans_text: tensor containing the original answers given by the teacher in the dialogue.
  • original_ans_start: tensor containing the starting offsets of the original answers.
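
To make the relationship between context, answer_start, and answer_text concrete, here is a minimal sketch on a made-up record. It assumes answer_start is a character offset into context (the field list only says "starting offsets"), and the sample text is invented for illustration, not taken from QuAC. The y/n/x mapping follows the field description.

```python
# Toy record using the dataset's field names; the values are invented.
context = "Ada Lovelace wrote notes on the Analytical Engine in 1843."
record = {
    "context": context,
    "question": "When did she write the notes?",
    "answer_text": "in 1843",
    # Assumption: offsets are character positions within the context.
    "answer_start": context.index("in 1843"),
    "yesorno_answer": "x",
}

# Mapping described in the field list: y = yes, n = no, x = neither.
YESNO_LABELS = {"y": "yes", "n": "no", "x": "neither"}


def answer_span(rec: dict) -> str:
    """Slice the answer out of the context using its starting offset."""
    start = rec["answer_start"]
    end = start + len(rec["answer_text"])
    return rec["context"][start:end]


def span_matches(rec: dict) -> bool:
    """Check that the offset really points at the stored answer text."""
    return answer_span(rec) == rec["answer_text"]


print(answer_span(record))                     # in 1843
print(span_matches(record))                    # True
print(YESNO_LABELS[record["yesorno_answer"]])  # neither
```

A check like `span_matches` can serve as a quick sanity test on real samples before training an extractive model, provided the offset assumption holds for the loaded data.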

QuAC Data Splits

  • The QuAC dataset training set comprises 83,568 questions, 11,567 dialogs and 6,843 unique sections.
  • The QuAC dataset validation set comprises 7,354 questions, 1,000 dialogs and 1,000 unique sections.
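
As a quick worked calculation, the split sizes above give the average dialog length in each split. The sketch only uses the counts listed in this section; the dictionary layout and function name are illustrative.

```python
# Counts copied from the split description in this section.
splits = {
    "train": {"questions": 83_568, "dialogs": 11_567, "sections": 6_843},
    "val": {"questions": 7_354, "dialogs": 1_000, "sections": 1_000},
}


def questions_per_dialog(stats: dict) -> float:
    """Average number of questions asked in one dialog."""
    return stats["questions"] / stats["dialogs"]


for name, stats in splits.items():
    print(f"{name}: {questions_per_dialog(stats):.2f} questions per dialog")
# train: 7.22 questions per dialog
# val: 7.35 questions per dialog
```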

How to use QuAC Dataset with PyTorch and TensorFlow in Python

Train a model on QuAC dataset with PyTorch in Python

Let's use Hub's built-in PyTorch one-line dataloader to connect the data to the compute:
dataloader = ds.pytorch(num_workers=0, batch_size=4, shuffle=False)
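
Because QuAC's fields are text, batches may need a custom collate step before they reach a tokenizer. The sketch below is an illustration under stated assumptions, not Hub's documented behavior: it assumes each sample arrives as a dict-like object keyed by tensor name, and `to_text` and `collate_text` are hypothetical helpers. It is written in plain Python so it can be tried without Hub or PyTorch installed.

```python
def to_text(value) -> str:
    """Best-effort conversion of a field value to a Python string."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    if isinstance(value, str):
        return value
    # Array-like values (e.g. a one-element array of strings) expose tolist().
    if hasattr(value, "tolist"):
        value = value.tolist()
    if isinstance(value, (list, tuple)) and len(value) == 1:
        return to_text(value[0])
    return str(value)


def collate_text(samples, fields=("question", "context")) -> dict:
    """Group a list of samples into one batch: a dict of lists of strings."""
    return {field: [to_text(s[field]) for s in samples] for field in fields}


# Invented samples standing in for what a loader might yield.
samples = [
    {"question": "Who wrote it?", "context": ["Some section text."]},
    {"question": b"When?", "context": "Another section."},
]
batch = collate_text(samples)
print(batch["question"])  # ['Who wrote it?', 'When?']
print(batch["context"])   # ['Some section text.', 'Another section.']
```

If your Hub version's `ds.pytorch` accepts a `collate_fn` argument, a function like `collate_text` could be passed there; check the signature for your installed version first.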

Train a model on QuAC dataset with TensorFlow in Python

dataloader = ds.tensorflow()

QuAC Dataset Creation

Data Collection and Normalization Information
The data was collected on Amazon Mechanical Turk. The task was limited to workers in English-speaking countries who had completed more than 1,000 HITs with an acceptance rate of at least 95%. Workers were paid according to the number of turns they completed in the dialog with their partner, which encouraged long conversations. Dialogs with fewer than three QA pairs were discarded. To ensure quality, a qualification task was created, and workers could report their partners for various problems.
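
The numeric criteria in this paragraph can be read as two simple filters. The sketch below only illustrates those thresholds; the function and constant names are hypothetical, it is not the curators' actual pipeline, and it omits the English-speaking-country restriction.

```python
# Thresholds taken from the collection description; names are illustrative.
MIN_HITS = 1000        # workers needed more than this many completed HITs
MIN_ACCEPTANCE = 0.95  # and an acceptance rate of at least 95%
MIN_QA_PAIRS = 3       # dialogs with fewer QA pairs were discarded


def worker_is_eligible(completed_hits: int, acceptance_rate: float) -> bool:
    """Apply the HIT-count and acceptance-rate requirements."""
    return completed_hits > MIN_HITS and acceptance_rate >= MIN_ACCEPTANCE


def keep_dialog(qa_pairs: list) -> bool:
    """Keep only dialogs with at least three question-answer pairs."""
    return len(qa_pairs) >= MIN_QA_PAIRS


print(worker_is_eligible(1500, 0.97))           # True
print(worker_is_eligible(1000, 0.99))           # False
print(keep_dialog([("q1", "a1"), ("q2", "a2")]))  # False
```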

Additional Information about QuAC Dataset

QuAC Dataset Curators

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, Luke Zettlemoyer

QuAC Dataset Licensing Information

CC BY-SA 4.0 Licence

QuAC Dataset Citation Information

@misc{choi2018quac,
  title={QuAC : Question Answering in Context},
  author={Eunsol Choi and He He and Mohit Iyyer and Mark Yatskar and Wen-tau Yih and Yejin Choi and Percy Liang and Luke Zettlemoyer},
  year={2018},
  eprint={1808.07036},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}