Google Released Two New Natural Language Dialog Datasets

Google AI announced that it is releasing two new natural language dialog datasets: Coached Conversational Preference Elicitation (CCPE) and Taskmaster-1.

According to Google researchers, the datasets were motivated by the lack of quality training data for digital assistants. They argue that today’s digital assistants have not reached human-level understanding, mostly because of a lack of understanding of how humans express themselves.

To overcome this problem, the researchers used a platform based on the Wizard-of-Oz experiment, which pairs people in spoken conversations where the user believes they are interacting with a computer (or, in this case, a digital assistant).

The first dataset, Coached Conversational Preference Elicitation (CCPE), contains dialog data on people’s movie preferences. It consists of 502 dialogs with 12,000 annotated utterances between a user (person) and an assistant. The data was collected so that the assistant poses questions specifically designed to minimize bias in the terminology the user employs. This means that the descriptions obtained from the user need to be rich while expressing their preferences concisely.

The second dataset, Taskmaster-1, contains conversational data in spoken and written form based on six defined tasks: ordering pizza, creating auto repair appointments, setting up rides for hire, ordering movie tickets, ordering coffee drinks, and making restaurant reservations.

The researchers also employed a simple labeling technique to annotate the dialog data and obtain some ground-truth data. The dataset contains 13,215 dialogs: 7,708 written self-dialogs and 5,507 two-person spoken dialogs.
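As a quick sanity check on the Taskmaster-1 figures, the short Python sketch below recomputes the total and the share of each collection method from the two counts reported above. The dictionary keys and the function name are illustrative only and are not part of the dataset or of any Google tooling.

```python
# Dialog counts as reported in the article (labels are illustrative).
REPORTED_COUNTS = {
    "written self-dialogs": 7708,
    "two-person spoken dialogs": 5507,
}


def summarize(counts):
    """Return the total and each category's percentage share (one decimal)."""
    total = sum(counts.values())
    shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
    return total, shares


total, shares = summarize(REPORTED_COUNTS)
print(total)   # 13215, matching the reported dataset size
print(shares)
```

The two parts add up to the reported 13,215 dialogs, with written self-dialogs making up a little under three fifths of the collection.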

Both datasets are open-sourced and available under a Creative Commons license. The Taskmaster-1 dataset can be downloaded from here, while CCPE is available here.