QuAC Dataset

What is QuAC Dataset?

QuAC (Question Answering in Context) is a dataset for question answering in context, containing 14K information-seeking QA dialogs (100K questions in total). Each dialog takes place between two people: a student who asks questions and a teacher who answers them by providing short excerpts from a text. QuAC presents challenges not found in existing machine comprehension datasets: its questions are often open-ended, unanswerable, or only meaningful within the context of the dialog.

Download QuAC Dataset in Python

Instead of downloading the QuAC dataset manually, you can load it in Python with one line of code using the open-source Deep Lake library.

Load QuAC Dataset Training Subset in Python

				
import deeplake
# Load the QuAC training subset from the Activeloop hub
ds = deeplake.load("hub://activeloop/quac-train")

Load QuAC Dataset Validation Subset in Python

				
import deeplake
# Load the QuAC validation subset from the Activeloop hub
ds = deeplake.load("hub://activeloop/quac-val")

QuAC Dataset Structure

QuAC Data Fields

For the training set

id: tensor containing the id of the dialogue.
context: tensor containing the Wikipedia text that the dialogue is about.
followup_label: tensor containing the list of follow-up actions.
yesorno_answer: tensor indicating whether the answer is a yes or a no: y represents yes, n represents no, and x represents neither.
question: tensor containing the questions in the dialogue.
answer_text: tensor containing the answers to the questions.
answer_start: tensor containing the starting offsets of the answers.
original_ans_text: tensor containing the original answers given by the teacher in the dialogue.
original_ans_start: tensor containing the starting offsets of the original answers.

For the validation set

id: tensor containing the id of the dialogue.
context: tensor containing the Wikipedia text that the dialogue is about.
followup_label: tensor containing the list of follow-up actions.
yesorno_answer: tensor indicating whether the answer is a yes or a no: y represents yes, n represents no, and x represents neither.
question: tensor containing the questions in the dialogue.
answer_text: tensor containing the answers to the questions.
answer_start: tensor containing the starting offsets of the answers.
original_ans_text: tensor containing the original answers given by the teacher in the dialogue.
original_ans_start: tensor containing the starting offsets of the original answers.
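
To make the relationship between the text and offset fields concrete, here is a minimal sketch built on a hand-written record rather than real QuAC data. It assumes that answer_start is a character offset into context, so that slicing the context at that offset recovers answer_text. The record, the helper name answer_span, and the wording of the labels in YES_NO_LABELS are illustrative assumptions and are not part of the Deep Lake API.

```python
# Hypothetical record, written by hand for illustration (not real QuAC data).
record = {
    "context": "Ada Lovelace wrote notes on the Analytical Engine in 1843.",
    "question": "What did she write about?",
    "answer_text": "the Analytical Engine",
    "answer_start": 28,  # assumed to be a character offset into "context"
    "yesorno_answer": "x",
}

# Meaning of the yesorno_answer codes described in the field list above.
YES_NO_LABELS = {"y": "yes", "n": "no", "x": "neither"}


def answer_span(context, answer_start, answer_text):
    """Return the slice of the context selected by the offset and answer length."""
    return context[answer_start:answer_start + len(answer_text)]


span = answer_span(record["context"], record["answer_start"], record["answer_text"])
print(span == record["answer_text"])            # the offset points at the answer
print(YES_NO_LABELS[record["yesorno_answer"]])  # human-readable yes/no label
```

A check like this is a cheap sanity test to run over a few samples after loading, since an offset that does not line up with its answer text usually signals a decoding or tokenization mismatch.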

QuAC Data Splits

The QuAC training set comprises 83,568 questions, 11,567 dialogs, and 6,843 unique sections.
The QuAC validation set comprises 7,354 questions, 1,000 dialogs, and 1,000 unique sections.
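
As a quick worked example, the split sizes above imply the average dialog length in each subset. The snippet below only does arithmetic on the numbers quoted in this section; the names splits and questions_per_dialog are illustrative.

```python
# Split sizes as quoted in this section.
splits = {
    "train": {"questions": 83_568, "dialogs": 11_567},
    "validation": {"questions": 7_354, "dialogs": 1_000},
}


def questions_per_dialog(split):
    """Average number of questions asked in one dialog of the given split."""
    return split["questions"] / split["dialogs"]


for name, split in splits.items():
    print(f"{name}: {questions_per_dialog(split):.2f} questions per dialog")
# train: 7.22 questions per dialog
# validation: 7.35 questions per dialog
```

Both subsets average a little over seven questions per dialog, which is useful to keep in mind when choosing how much dialog history a model needs to see.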

How to use QuAC Dataset with PyTorch and TensorFlow in Python

Train a model on QuAC dataset with PyTorch in Python

Use Deep Lake's built-in PyTorch dataloader to connect the data to your compute in one line:

				
# Batches of 4 samples, no shuffling, data loaded in the main process
dataloader = ds.pytorch(num_workers=0, batch_size=4, shuffle=False)

Train a model on QuAC dataset with TensorFlow in Python

				
dataloader = ds.tensorflow()

QuAC Dataset Creation

Data Collection and Normalization Information

The data was collected on Amazon Mechanical Turk. The task was limited to workers in English-speaking countries who had completed more than 1,000 HITs with an acceptance rate of at least 95%. Workers were paid according to the number of turns in the dialog, which encouraged long conversations with their partners, and dialogs with fewer than three QA pairs were discarded. To ensure quality, workers had to pass a qualification task and could report their partners for various problems.
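
The numeric thresholds in this paragraph can be read as two simple filters, one on workers and one on dialogs. The sketch below is only an illustration of those stated rules, not the authors' actual collection code: the Worker class, its field names, and both helper functions are hypothetical, and the English-speaking-country restriction is left out.

```python
from dataclasses import dataclass

MIN_HITS = 1000        # workers needed more than this many completed HITs
MIN_ACCEPTANCE = 0.95  # and an acceptance rate of at least 95%
MIN_QA_PAIRS = 3       # dialogs with fewer QA pairs were discarded


@dataclass
class Worker:
    """Hypothetical summary of a crowd worker's track record."""
    completed_hits: int
    acceptance_rate: float  # fraction between 0 and 1


def is_eligible(worker):
    """Apply the HIT-count and acceptance-rate thresholds."""
    return worker.completed_hits > MIN_HITS and worker.acceptance_rate >= MIN_ACCEPTANCE


def keep_dialog(qa_pairs):
    """Keep a dialog only if it has at least MIN_QA_PAIRS question-answer pairs."""
    return len(qa_pairs) >= MIN_QA_PAIRS


print(is_eligible(Worker(completed_hits=2500, acceptance_rate=0.97)))  # True
print(keep_dialog([("q1", "a1"), ("q2", "a2")]))                       # False
```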

Additional Information about QuAC Dataset

QuAC Dataset Description

Homepage: https://quac.ai/
Repository: N/A
Paper: Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, Luke Zettlemoyer: QuAC : Question Answering in Context
Point of Contact: http://yann.lecun.com/, [email protected], [email protected], [email protected], [email protected]

QuAC Dataset Curators

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, Luke Zettlemoyer

QuAC Dataset Licensing Information

CC BY-SA 4.0 Licence

QuAC Dataset Citation Information

				
@misc{choi2018quac,
  title={QuAC : Question Answering in Context},
  author={Eunsol Choi and He He and Mohit Iyyer and Mark Yatskar and Wen-tau Yih and Yejin Choi and Percy Liang and Luke Zettlemoyer},
  year={2018},
  eprint={1808.07036},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}