Anca Dumitrache

CrowdTruth at HCOMP 2018

The CrowdTruth team is preparing to attend the sixth AAAI Conference on Human Computation and Crowdsourcing (HCOMP), taking place in Zurich, Switzerland, July 5-8. We are happy to announce we will be presenting two papers in the main track:

  • Capturing Ambiguity in Crowdsourcing Frame Disambiguation (Anca Dumitrache, Lora Aroyo, Chris Welty):
    FrameNet is a computational linguistics resource composed of semantic frames, high-level concepts that represent the meanings of words. We present an approach to gathering frame disambiguation annotations in sentences using crowdsourcing with multiple workers per sentence, in order to capture inter-annotator disagreement. We perform an experiment over a set of 433 sentences annotated with frames from the FrameNet corpus, and show that the aggregated crowd annotations achieve an F1 score greater than 0.67 as compared to expert linguists. We highlight cases where the crowd annotation was correct even though the expert disagreed, arguing for the need to have multiple annotators per sentence. Most importantly, we examine cases in which crowd workers could not agree, and demonstrate that these cases exhibit ambiguity, either in the sentence, the frame, or the task itself. We argue that collapsing such cases to a single, discrete truth value (i.e., correct or incorrect) is inappropriate, as it creates arbitrary targets for machine learning. (A minimal sketch of this kind of disagreement-aware aggregation follows this list.)

  • A Study of Narrative Creation by Means of Crowds and Niches (Oana Inel, Sabrina Sauer, Lora Aroyo):
    Online video constitutes the largest, continuously growing portion of Web content. Web users drive this growth by massively sharing their personal stories on social media platforms as compilations of their daily visual memories, or with animated GIFs and memes based on existing video material. Therefore, it is crucial to gain an understanding of the semantics of video stories, i.e., what they capture and how. The remix of visual content is also a powerful way of understanding the implicit aspects of storytelling, as well as the essential parts of audio-visual (AV) material. In this paper we take a digital hermeneutics approach to understand which visual attributes and semantics drive the creation of narratives. We present insights from a nichesourcing study in which humanities scholars remix keyframes and video fragments into micro-narratives, i.e., (sequences of) GIFs. To support narrative creation by humanities scholars, specific video annotations are needed, e.g., (1) annotations that consider literal and abstract connotations of video material, and (2) annotations that are coarse-grained, i.e., focusing on keyframes and video fragments as opposed to full-length videos. The main findings of the study are used to facilitate the creation of narratives in the digital humanities exploratory search tool DIVE+.
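
The aggregation step in the first paper can be illustrated with a minimal Python sketch: each candidate frame is scored by the cosine between the aggregated sentence vector and the unit vector for that frame, so that disagreement shows up as a graded score rather than being collapsed into a single correct/incorrect label. The worker data, frame names, and function below are invented for illustration, and the sketch omits the worker-quality weighting used by the full CrowdTruth metrics.

```python
from collections import Counter
from math import sqrt

# Invented crowd data: the frames each worker selected for the same
# target word in one sentence (multiple workers per sentence).
worker_annotations = {
    "worker_1": ["Motion", "Travel"],
    "worker_2": ["Motion"],
    "worker_3": ["Travel"],
    "worker_4": ["Motion", "Travel"],
    "worker_5": ["Self_motion"],
}

def sentence_frame_scores(annotations):
    """Score each candidate frame by the cosine between the aggregated
    sentence vector (the summed worker selections) and the unit vector
    for that frame. Uniformly low scores signal an ambiguous sentence."""
    sentence_vector = Counter()
    for frames in annotations.values():
        sentence_vector.update(frames)
    norm = sqrt(sum(count * count for count in sentence_vector.values()))
    return {frame: count / norm for frame, count in sentence_vector.items()}

print(sentence_frame_scores(worker_annotations))
# roughly {'Motion': 0.69, 'Travel': 0.69, 'Self_motion': 0.23} -- no frame
# "wins" outright, which is exactly the ambiguity the paper argues should
# not be collapsed into a binary label.
```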

We will also appear at the co-located Collective Intelligence event, where we will be discussing our paper False Positive and Cross-relation Signals in Distant Supervision Data (Anca Dumitrache, Lora Aroyo, Chris Welty), previously published at AKBC 2017:

Distant supervision (DS) is a well-established method for relation extraction from text, based on the assumption that when a knowledge base contains a relation between a term pair, sentences that contain that pair are likely to express the relation. In this paper, we use the results of a crowdsourcing relation extraction task to identify two problems with DS data quality: the widely varying degree of false positives across different relations, and the observed causal connection between relations that are not considered by the DS method. The crowdsourcing data aggregation is performed using the ambiguity-aware CrowdTruth metrics, which capture and interpret inter-annotator disagreement. We also present preliminary results of using the crowd to enhance DS training data for a relation classification model, without requiring the crowd to annotate the entire set.
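
To make the distant supervision assumption concrete, here is a small, self-contained sketch that labels every sentence mentioning both terms of a knowledge-base triple with that triple's relation. The knowledge base, sentences, and relation names are invented for illustration and are not data from the paper.

```python
# A toy knowledge base of (term, relation, term) triples -- invented data.
knowledge_base = [
    ("aspirin", "treats", "headache"),
    ("ibuprofen", "causes", "stomach pain"),
]

sentences = [
    "Many patients take aspirin when they have a headache.",
    "Aspirin did nothing for her headache, so she saw a doctor.",  # false positive
    "Ibuprofen can cause stomach pain if taken on an empty stomach.",
]

def distant_supervision_labels(sentences, knowledge_base):
    """Label a sentence with a relation whenever it mentions both terms of a
    knowledge-base triple, regardless of what the sentence actually asserts."""
    labeled = []
    for sentence in sentences:
        text = sentence.lower()
        for term1, relation, term2 in knowledge_base:
            if term1 in text and term2 in text:
                labeled.append((sentence, term1, relation, term2))
    return labeled

for sentence, term1, relation, term2 in distant_supervision_labels(sentences, knowledge_base):
    print(f"({term1}, {relation}, {term2}) <- {sentence}")
```

The second sentence is labeled with treats even though it denies the relation; identifying such false positives with crowd annotations is what the paper reports on.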

If you are attending HCOMP 2018, we hope you will stop by our presentations!

Figure Eight & CrowdTruth Dataset on Medical Relation Extraction

As part of an initiative to highlight highly curated datasets collected through crowdsourcing, Figure Eight (the AI and crowdsourcing platform formerly known as CrowdFlower) has teamed up with CrowdTruth to showcase our work on medical relation extraction from sentences. Both the dataset and the task templates have been made available, and the task template can now be re-used directly in any Figure Eight account. For more information, read the post on the Figure Eight website, as well as our papers.

Relation Extraction at Collective Intelligence 2017

We are happy to announce that our project exploring relation extraction from natural language has 2 extended abstracts accepted at the Collective Intelligence conference this summer! Here are the papers:

  • Crowdsourcing Ambiguity-Aware Ground Truth: we apply the CrowdTruth methodology to collect data over a set of diverse tasks: medical relation extraction, Twitter event identification, news event extraction and sound interpretation. We prove that capturing disagreement is essential for acquiring a high-quality ground truth. We achieve this by comparing the quality of data aggregated with the CrowdTruth metrics against majority vote, a method that enforces consensus among annotators (a minimal sketch contrasting the two approaches follows this list). By applying our analysis over a set of diverse tasks we show that, even though ambiguity manifests differently depending on the task, our theory of inter-annotator disagreement as a property of ambiguity is generalizable.
  • Disagreement in Crowdsourcing and Active Learning for Better Distant Supervision Quality: we present ongoing work on combining active learning with the CrowdTruth methodology to further improve the quality of DS training data. We report the results of a crowdsourcing experiment run on 2,500 sentences from the open domain. We show that modeling disagreement can be used to identify interesting types of errors caused by ambiguity in the TAC-KBP knowledge base, and we discuss how an active learning approach can incorporate these observations to utilize the crowd more efficiently.
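
As mentioned above, here is a minimal sketch contrasting majority vote, which collapses worker votes into a single binary label, with a simple CrowdTruth-style score that keeps the degree of agreement. The vote counts and the threshold are invented, and the real CrowdTruth metrics additionally weight workers by their quality.

```python
def majority_vote(votes, threshold=0.5):
    """Collapse binary worker votes into a single yes/no label."""
    return sum(votes) / len(votes) > threshold

def crowd_score(votes):
    """Keep the fraction of workers who annotated the relation as a graded
    score (the full CrowdTruth metrics also weight workers by quality)."""
    return sum(votes) / len(votes)

# Invented votes from 10 workers for two sentence-relation pairs.
clear_sentence = [True] * 9 + [False]          # workers largely agree
ambiguous_sentence = [True] * 5 + [False] * 5  # workers split evenly

for name, votes in [("clear", clear_sentence), ("ambiguous", ambiguous_sentence)]:
    print(name, majority_vote(votes), crowd_score(votes))
# clear     True  0.9
# ambiguous False 0.5  <- majority vote discards the fact that half of the
#                        workers did see the relation in this sentence.
```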

CrowdTruth @ Watson Experience MeetUp

CrowdTruth made an appearance at the Watson Experience MeetUp last week. Together with Zoltán Szlávik, my colleague from IBM, I talked about the pervasive myths that still influence how we collect annotations from humans. While time and money constraints certainly influence data quality, the common core of these issues is how quality itself is defined, and what value ambiguous data holds. The slides of the talk were based on this paper.

Thank you to Jibes for organizing, Rabobank Utrecht for hosting us, and especially to Loes Brouwers and Tessa van der Eems for setting all of this up!

2 Papers Accepted at ISWC 2015 Workshops

We are happy to announce that two CrowdTruth papers have been accepted at the workshops of the 14th International Semantic Web Conference (ISWC 2015). Both of them present some exciting results from our work with medical relation extraction.

The first one, Achieving Expert-Level Annotation Quality with CrowdTruth: the Case of Medical Relation Extraction, will appear in the Biomedical Data Mining, Modeling, and Semantic Integration (BDM2I) workshop. Download it here, or read the abstract below:

The lack of annotated datasets for training and benchmarking is one of the main challenges of Clinical Natural Language Processing. In addition, current methods for collecting annotation attempt to minimize disagreement between annotators, and therefore fail to model the ambiguity inherent in language. We propose the CrowdTruth method for collecting medical ground truth through crowdsourcing, based on the observation that disagreement between annotators can be used to capture ambiguity in text. In this work, we report on using this method to build a ground truth for medical relation extraction, and how it performed in training a classification model. Our results show that, with appropriate processing, the crowd performs just as well as medical experts in terms of the quality and efficacy of annotations. Furthermore, we show that the general practice of employing a small number of annotators for collecting ground truth is faulty, and that more annotators per sentence are needed to get the highest quality annotations.

The second one, CrowdTruth Measures for Language Ambiguity: the Case of Medical Relation Extraction, will appear in the Linked Data for Information Extraction (LD4IE) workshop. Download it here, or read the abstract below:

A widespread use of linked data for information extraction is distant supervision, in which relation tuples from a data source are found in sentences in a text corpus, and those sentences are treated as training data for relation extraction systems. Distant supervision is a cheap way to acquire training data, but that data can be quite noisy, which limits the performance of a system trained with it. Human annotators can be used to clean the data, but in some domains, such as medical NLP, it is widely believed that only medical experts can do this reliably. We have been investigating the use of crowdsourcing as an affordable alternative to using experts to clean noisy data, and have found that with the proper analysis, crowds can rival and even out-perform the precision and recall of experts, at a much lower cost. We have further found that the crowd, by virtue of its diversity, can help us find evidence of ambiguous sentences that are difficult to classify, and we have hypothesized that such sentences are likely just as difficult for machines to classify. In this paper we outline CrowdTruth, a previously presented method for scoring ambiguous sentences that suggests that existing modes of truth are inadequate, and we present for the first time a set of weighted metrics for evaluating the performance of experts, the crowd, and a trained classifier in light of ambiguity. We show that our theory of truth and our metrics are a more powerful way to evaluate NLP performance over traditional unweighted metrics like precision and recall, because they allow us to account for the rather obvious fact that some sentences express the target relations more clearly than others.
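
To give a feel for the weighted evaluation described above, here is a simplified sketch of precision and recall in which each sentence contributes in proportion to a graded ground-truth score (for instance, a CrowdTruth sentence-relation score) rather than counting as a hard positive or negative. The exact metric definitions in the paper may differ; the soft true/false positive formulation and the numbers below are assumptions made for illustration.

```python
def weighted_precision_recall(examples):
    """examples: (predicted, score) pairs, where predicted is the system's
    binary decision and score in [0, 1] is a graded ground truth such as a
    CrowdTruth sentence-relation score. A sentence predicted positive
    contributes score to the true positives and (1 - score) to the false
    positives; a sentence predicted negative contributes score to the false
    negatives. Mistakes on clear sentences are therefore penalized more
    heavily than decisions on ambiguous ones."""
    tp = sum(score for predicted, score in examples if predicted)
    fp = sum(1 - score for predicted, score in examples if predicted)
    fn = sum(score for predicted, score in examples if not predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented predictions: the positive prediction on the ambiguous sentence
# (score 0.5) costs far less than the one on the clearly negative sentence
# (score 0.1).
examples = [(True, 0.9), (True, 0.5), (True, 0.1), (False, 0.8)]
print(weighted_precision_recall(examples))  # (0.5, ~0.65)
```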


Crowdsourcing Disagreement for Collecting Semantic Annotation

This paper proposes an approach to gathering semantic annotations that rejects the notion that human interpretation can have a single ground truth, and is instead based on the observation that disagreement between annotators can signal ambiguity both in the input text and in how the annotation task has been designed. The purpose of this research is to investigate whether disagreement-aware crowdsourcing is a scalable approach to gathering semantic annotation across various tasks and domains. We propose a methodology for answering this question that involves, for each task and domain: defining the crowdsourcing setup, collecting experimental data, and evaluating both the setup and the results. We present initial results for the task of medical relation extraction, and propose an evaluation plan for crowdsourcing semantic annotation for several tasks and domains.
