SemEval-2021 Task 12: Learning with Disagreements

  • Alexandra Nnemamaka Uma (Creator)
  • Tommaso Fornaciari (Creator)
  • Anca Dumitrache (Creator)
  • Tristan Miller (Creator)
  • Jon Chamberlain (Creator)
  • Barbara Plank (Creator)
  • Edwin D. Simpson (Creator)
  • Massimo Poesio (Creator)



This repository contains the Post-Evaluation data for SemEval-2021 Task 12: Learning with Disagreements, a shared task on learning to classify with datasets containing disagreements.

The aim of this shared task is to provide a unified testing framework for learning from disagreements, using the best-known datasets that contain information about annotator disagreements for interpreting language and classifying images:

    1. LabelMe-IC: Image Classification using a subset of the LabelMe dataset (Russell et al., 2008), a widely used, community-created image classification dataset in which each image is assigned to one of 8 categories: highway, inside city, tall building, street, forest, coast, mountain, open country. Rodrigues and Pereira (2017) collected crowd labels for these images using Amazon Mechanical Turk (AMT).

    2. CIFAR10-IC: Image Classification using a subset of the CIFAR-10 dataset, which consists of colour images in 10 categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Crowdsourced labels for this subset were collected by Peterson et al. (2019).

    3. PDIS: Information Status Classification using the Phrase Detectives corpus (Poesio et al., 2019). The task involves identifying the information status of a noun phrase: whether that noun phrase refers to new information or to old information.

    4. Gimpel-POS: Part-of-Speech tagging of Twitter posts using the Gimpel dataset (Gimpel et al., 2011). Plank et al. (2014b) mapped the Gimpel tags to the universal tag set (Petrov et al., 2011), used the mapped tags as gold labels, and collected crowdsourced labels.

    5. Humour: ranking one-line texts using pairwise funniness judgements (Simpson et al., 2019). Crowdworkers annotated pairs of puns to indicate which of the two is funnier. A gold-standard ranking was produced from a large number of redundant annotations; the goal is to infer that ranking from a reduced number of crowdsourced judgements.
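One simple way to turn pairwise judgements like those in the Humour dataset into a ranking is to score each text by its empirical win rate. The sketch below uses hypothetical item names and is only a baseline for illustration; the reference rankings for the shared task were produced with a more sophisticated preference-learning model.

```python
from collections import defaultdict

def rank_from_pairwise(judgements):
    """Rank items by empirical win rate over pairwise judgements.

    judgements: list of (winner, loser) tuples, one per crowd judgement.
    Returns items sorted from highest to lowest win rate.
    """
    wins = defaultdict(int)    # times each item was judged funnier
    totals = defaultdict(int)  # times each item appeared in a pair
    for winner, loser in judgements:
        wins[winner] += 1
        totals[winner] += 1
        totals[loser] += 1
    return sorted(totals, key=lambda item: wins[item] / totals[item], reverse=True)

# Hypothetical judgements: each tuple means the first pun was judged funnier.
judgements = [
    ("pun_a", "pun_b"),
    ("pun_a", "pun_c"),
    ("pun_b", "pun_c"),
    ("pun_c", "pun_b"),
]
ranking = rank_from_pairwise(judgements)
# "pun_a" wins both of its comparisons, so it ranks first.
```

A win-rate baseline ignores which opponents an item faced; models such as Bradley-Terry account for opponent strength, which matters when pairs are sampled unevenly.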

The files contained in this data collection are as follows:

  - The base models provided for the shared task.
  - The training and development data used during the Practice Phase of the competition.
  - The test data used during the Evaluation Phase of the competition.
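When training on these data, the multiple crowd labels per item can be kept as a probability distribution (a soft label) rather than collapsed to a single hard label, which is the central idea behind learning with disagreements. A minimal sketch, assuming a simple list-of-labels input; the category names below are illustrative and do not reflect the task's actual file schema:

```python
from collections import Counter

def soft_label(crowd_labels, classes):
    """Turn one item's crowd labels into a probability distribution over classes."""
    counts = Counter(crowd_labels)
    total = len(crowd_labels)
    return [counts[c] / total for c in classes]

classes = ["coast", "forest", "highway"]
# e.g. five annotators disagree about one LabelMe image
dist = soft_label(["coast", "coast", "forest", "coast", "highway"], classes)
# dist == [0.6, 0.2, 0.2]
```

A classifier can then be trained with cross-entropy against these distributions, so that items annotators disagree on contribute a softer training signal than items with unanimous labels.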

Details of the format of each dataset for each task can be found on CodaLab.
Date made available: 23 Jul 2021
