Welcome to the CrowdTruth blog!

The CrowdTruth Framework implements an approach to machine-human computing for collecting annotation data on text, images and videos. The approach is focussed specifically on collecting gold standard data for training and evaluation of cognitive computing systems. The original framework was inspired by the IBM Watson project for providing improved (multi-perspective) gold standard (medical) text annotation data for the training and evaluation of various IBM Watson components, such as Medical Relation Extraction, Medical Factor Extraction and Question-Answer passage alignment.

The CrowdTruth framework supports the composition of CrowdTruth gathering workflows, where a sequence of micro-annotation tasks can be configured and sent out to a number of crowdsourcing platforms (e.g. Figure Eight and Amazon Mechanical Turk) and applications (e.g. the expert annotation game Dr. Detective). The framework has a special focus on micro-tasks for knowledge extraction in medical text (e.g. medical documents from various sources, such as Wikipedia articles or patient case reports). The main steps in the CrowdTruth workflow are: (1) exploring and processing the input data, (2) collecting annotation data, and (3) applying disagreement analytics to the results. These steps are realised in an automatic end-to-end workflow that supports continuous collection of high-quality gold standard data, with a feedback loop to all steps of the process. Have a look at our presentations and papers for more details on the research.
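To make the three steps concrete, here is a minimal, self-contained Python sketch of such a pipeline. It is not the CrowdTruth implementation or its API; the function and field names (build_media_units, collect_judgments, annotation_scores) are hypothetical placeholders, and the disagreement step is reduced to simple per-unit annotation fractions.

```python
# Hypothetical sketch of the three-step workflow described above.
# NOT the CrowdTruth-core API; all names are invented placeholders.
from collections import Counter

# (1) Explore & process input data: split source documents into media units
# (e.g. sentences) that can be annotated in a micro-task.
def build_media_units(documents):
    return [{"unit_id": f"{doc_id}-{i}", "text": sentence}
            for doc_id, text in documents.items()
            for i, sentence in enumerate(text.split(". "))]

# (2) Collect annotation data: in a real deployment the units would be sent to
# a crowdsourcing platform (e.g. Figure Eight or Mechanical Turk); here we just
# accept a list of (unit_id, worker_id, annotation) judgments.
def collect_judgments(units, platform_results):
    judgments = {unit["unit_id"]: [] for unit in units}
    for unit_id, worker_id, annotation in platform_results:
        judgments[unit_id].append((worker_id, annotation))
    return judgments

# (3) Disagreement analytics: score each annotation by the fraction of workers
# on the unit that selected it, keeping the full distribution instead of
# collapsing to a single majority label.
def annotation_scores(judgments):
    scores = {}
    for unit_id, worker_annotations in judgments.items():
        if not worker_annotations:
            scores[unit_id] = {}
            continue
        counts = Counter(annotation for _, annotation in worker_annotations)
        total = sum(counts.values())
        scores[unit_id] = {ann: n / total for ann, n in counts.items()}
    return scores

if __name__ == "__main__":
    docs = {"doc1": "Aspirin treats headaches. Aspirin may cause nausea."}
    units = build_media_units(docs)
    results = [(units[0]["unit_id"], "w1", "TREATS"),
               (units[0]["unit_id"], "w2", "TREATS"),
               (units[0]["unit_id"], "w3", "CAUSES")]
    print(annotation_scores(collect_judgments(units, results)))
```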

Studying Topical Relevance with Evidence-based Crowdsourcing

Our paper “Studying Topical Relevance with Evidence-based Crowdsourcing” (Oana Inel, Giannis Haralabopoulos, Dan Li, Christophe Van Gysel, Zoltán Szlávik, Elena Simperl, Evangelos Kanoulas and Lora Aroyo) has been accepted as a full paper at the International Conference on Information and Knowledge Management (CIKM), 2018. The paper will be presented on the 25th of October at CIKM 2018, in Turin, Italy.

Abstract:
Information Retrieval systems rely on large test collections to measure their effectiveness in retrieving relevant documents. While the demand is high, the task of creating such test collections is laborious, due both to the large amounts of data that need to be annotated and to the intrinsic subjectivity of the task itself. In this paper we study topical relevance from a user perspective by addressing the problems of subjectivity and ambiguity. We compare our approach and results with the established TREC annotation guidelines and results. The comparison is based on a series of crowdsourcing pilots experimenting with variables such as relevance scale, document granularity, annotation template and the number of workers. Our results show a correlation between relevance assessment accuracy and smaller document granularity, i.e., aggregating relevance at the paragraph level yields better relevance accuracy than assessment done at the level of the full document. As expected, our results also show that collecting binary relevance judgments results in higher accuracy than the ternary scale used in the TREC annotation guidelines. Finally, the crowdsourced annotation tasks provided a more accurate document relevance ranking than a single assessor's relevance label. This work resulted in a reliable test collection around the TREC Common Core track.
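The abstract does not spell out the aggregation procedure, so the following is only a toy Python illustration of how binary paragraph-level judgments might be rolled up into a document-level relevance score; the variable names and the max roll-up rule are assumptions for illustration, not the paper's method.

```python
# Toy illustration (not the paper's actual procedure) of aggregating binary
# paragraph-level relevance judgments into a document-level relevance score.
from collections import defaultdict

# Judgments as (doc_id, paragraph_id, worker_id, relevant), relevant in {0, 1}.
judgments = [
    ("d1", "p1", "w1", 1), ("d1", "p1", "w2", 1), ("d1", "p1", "w3", 0),
    ("d1", "p2", "w1", 0), ("d1", "p2", "w2", 0), ("d1", "p2", "w3", 0),
]

def paragraph_scores(judgments):
    votes = defaultdict(list)
    for doc_id, par_id, _, relevant in judgments:
        votes[(doc_id, par_id)].append(relevant)
    # Score = fraction of workers who judged the paragraph relevant.
    return {key: sum(v) / len(v) for key, v in votes.items()}

def document_scores(par_scores):
    # Roll-up rule assumed here: a document is as relevant as its most
    # relevant paragraph (one of several possible choices).
    docs = defaultdict(float)
    for (doc_id, _), score in par_scores.items():
        docs[doc_id] = max(docs[doc_id], score)
    return dict(docs)

print(document_scores(paragraph_scores(judgments)))  # {'d1': 0.666...}
```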

HCOMP18 – Work-in-Progress Follow-up

In July, we presented a work-in-progress paper at the sixth AAAI Conference on Human Computation and Crowdsourcing (HCOMP). In this paper we took a digital hermeneutics approach to understand which visual attributes and semantics drive the creation of narratives. We present insights from a nichesourcing study in which humanities scholars remix keyframes and video fragments into micro-narratives, i.e., (sequences of) GIFs. To support narrative creation by humanities scholars, specific video annotations are needed, e.g., (1) annotations that consider literal and abstract connotations of video material, and (2) annotations that are coarse-grained, i.e., focusing on keyframes and video fragments as opposed to full-length videos. The main findings of the study are used to facilitate the creation of narratives in the digital humanities exploratory search tool DIVE+. In previous DIVE+ crowdsourcing experiments, we used the CrowdTruth metrics and methodology to gain a better understanding of events!

Our presentation started with a one-minute pitch (check the slide above) and continued with a poster and demo session.

You can also check out our poster:

CrowdTruth at HCOMP 2018

The CrowdTruth team is preparing to attend the sixth AAAI Conference on Human Computation and Crowdsourcing (HCOMP), taking place in Zurich, Switzerland, July 5-8. We are happy to announce we will be presenting two papers in the main track:

  • Capturing Ambiguity in Crowdsourcing Frame Disambiguation (Anca Dumitrache, Lora Aroyo, Chris Welty):
  • FrameNet is a computational linguistics resource composed of semantic frames, high-level concepts that represent the meanings of words. We present a crowdsourcing approach to gathering frame disambiguation annotations in sentences, using multiple workers per sentence to capture inter-annotator disagreement. We perform an experiment over a set of 433 sentences annotated with frames from the FrameNet corpus, and show that the aggregated crowd annotations achieve an F1 score greater than 0.67 as compared to expert linguists. We highlight cases where the crowd annotation was correct even though the expert was in disagreement, arguing for the need to have multiple annotators per sentence. Most importantly, we examine cases in which crowd workers could not agree, and demonstrate that these cases exhibit ambiguity, either in the sentence, the frame, or the task itself, and argue that collapsing such cases to a single, discrete truth value (i.e. correct or incorrect) is inappropriate, creating arbitrary targets for machine learning.

  • A Study of Narrative Creation by Means of Crowds and Niches (Oana Inel, Sabrina Sauer, Lora Aroyo):
  • Online video constitutes the largest, continuously growing portion of Web content. Web users drive this growth by massively sharing their personal stories on social media platforms as compilations of their daily visual memories, or as animated GIFs and memes based on existing video material. Therefore, it is crucial to gain an understanding of the semantics of video stories, i.e., what they capture and how. The remix of visual content is also a powerful way of understanding the implicit aspects of storytelling, as well as the essential parts of audio-visual (AV) material. In this paper we take a digital hermeneutics approach to understand which visual attributes and semantics drive the creation of narratives. We present insights from a nichesourcing study in which humanities scholars remix keyframes and video fragments into micro-narratives, i.e., (sequences of) GIFs. To support narrative creation by humanities scholars, specific video annotations are needed, e.g., (1) annotations that consider literal and abstract connotations of video material, and (2) annotations that are coarse-grained, i.e., focusing on keyframes and video fragments as opposed to full-length videos. The main findings of the study are used to facilitate the creation of narratives in the digital humanities exploratory search tool DIVE+.

We will also appear at the co-located Collective Intelligence event, where we will be discussing our paper False Positive and Cross-relation Signals in Distant Supervision Data (Anca Dumitrache, Lora Aroyo, Chris Welty), previously published at AKBC 2017:

Distant supervision (DS) is a well-established method for relation extraction from text, based on the assumption that when a knowledge base contains a relation between a term pair, sentences that contain that pair are likely to express the relation. In this paper, we use the results of a crowdsourced relation extraction task to identify two problems with DS data quality: the widely varying degree of false positives across different relations, and the observed causal connection between relations that are not considered by the DS method. The crowdsourcing data aggregation is performed using ambiguity-aware CrowdTruth metrics, which capture and interpret inter-annotator disagreement. We also present preliminary results of using the crowd to enhance DS training data for a relation classification model, without requiring the crowd to annotate the entire set.
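As a rough illustration of the ambiguity-aware idea behind the CrowdTruth metrics, the sketch below builds per-worker annotation vectors, derives unit-annotation scores, and measures worker-unit agreement with cosine similarity. It is a simplified approximation only: the published metrics additionally weight workers and annotations by iteratively computed quality scores, which is omitted here, and the relation labels are made up for the example.

```python
# Simplified sketch of the ambiguity-aware idea behind the CrowdTruth metrics:
# each worker's judgment on a unit is a binary vector over candidate relations,
# the unit vector is the sum of worker vectors, and agreement is measured as
# cosine similarity. The published metrics also weight workers and annotations
# by iteratively computed quality scores; that step is omitted here.
import math

RELATIONS = ["TREATS", "CAUSES", "PREVENTS", "NONE"]  # example label set

def vectorize(selected):
    return [1.0 if r in selected else 0.0 for r in RELATIONS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Worker judgments for one sentence (a "media unit").
worker_vectors = {
    "w1": vectorize({"TREATS"}),
    "w2": vectorize({"TREATS", "PREVENTS"}),
    "w3": vectorize({"CAUSES"}),
}

# Unit vector: element-wise sum of the worker vectors.
unit_vector = [sum(vals) for vals in zip(*worker_vectors.values())]

# Unit-annotation scores: how strongly each relation is expressed in the unit.
total = sum(unit_vector)
unit_annotation_score = {r: v / total for r, v in zip(RELATIONS, unit_vector)}

# Worker-unit agreement: cosine similarity between a worker's vector and the
# unit vector built from the *other* workers (a rough proxy for worker quality).
def worker_agreement(worker_id):
    others = [v for w, v in worker_vectors.items() if w != worker_id]
    rest = [sum(vals) for vals in zip(*others)]
    return cosine(worker_vectors[worker_id], rest)

print(unit_annotation_score)
print({w: round(worker_agreement(w), 2) for w in worker_vectors})
```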

If you are attending HCOMP 2018, we hope you will stop by our presentations!

Figure Eight & CrowdTruth Dataset on Medical Relation Extraction

As part of an initiative to highlight highly curated datasets collected using crowdsourcing, Figure Eight (the AI and crowdsourcing platform formerly known as CrowdFlower) has teamed up with CrowdTruth to feature our work on medical relation extraction from sentences. Both the dataset and the task templates have been made available, and the task template can now be re-used directly in any Figure Eight account. For more information, read the post on the Figure Eight website, as well as our papers:

Watson Innovation Course – Invited Lecture by Ken Barker, IBM Watson US

This week, the Watson Innovation course starts: a collaboration between the Vrije Universiteit, the University of Amsterdam and the IBM Netherlands Centre for Advanced Studies (CAS). The course offers a unique opportunity to learn about IBM Watson, cognitive computing and the meaning of such artificial intelligence systems in a real-world, big-data context. Students from the Computer Science and Economics faculties join their complementary efforts and creativity in cross-disciplinary teams to explore the business and innovation potential of such technologies.


This year, on the 13th of November, Ken Barker from IBM Watson US will give an invited lecture. Here is the abstract of his lecture, entitled “Question Answering Post-Watson”:

There is a long, rich history of Natural Language Processing and Question Answering research at IBM. This research achieved a significant milestone when the autonomous Question Answering system called “Watson” competed head-to-head with human trivia experts on the American television show, “Jeopardy!” Since that event, both Watson and QA/NLP research have barreled forward at IBM, though not always in the same direction.

In this talk, I will give a brief, biased history of Question Answering research and Watson at IBM, before and after the Jeopardy! challenge. But most of the talk will be a more technical presentation of our path of QA research “post-Watson”. The discussion will be in three parts: 1) continuing research on traditional Question Answering technology beyond Jeopardy!; 2) work on transferring QA technology to Medicine and Healthcare; and 3) recent research into exploratory, collaborative Question Answering against scientific literature.


Ken Barker Bio:

Ken Barker heads the Natural Language Analytics Department in the Learning Health Systems Organization at IBM Research AI. His current research examines the weaknesses of existing information gathering tools and applies Natural Language Processing to collaborative, exploratory question answering against scientific literature. Before joining IBM in 2011, he was a Research Faculty Member at the University of Texas at Austin, serving as Investigator on DARPA’s Rapid Knowledge Formation and Machine Reading Projects, as well as on Vulcan’s Digital Aristotle Project to build intelligent scientific textbooks. He was also an Assistant Professor of Computer Science at the University of Ottawa. His research there focused on Natural Language Semantics and Semi-Automatic Interpretation of Text.