Welcome to the CrowdTruth blog!

The CrowdTruth Framework implements an approach to machine-human computing for collecting annotation data on text, images and videos. The approach is focussed specifically on collecting gold standard data for training and evaluation of cognitive computing systems. The original framework was inspired by the IBM Watson project for providing improved (multi-perspective) gold standard (medical) text annotation data for the training and evaluation of various IBM Watson components, such as Medical Relation Extraction, Medical Factor Extraction and Question-Answer passage alignment.

The CrowdTruth framework supports the composition of CrowdTruth gathering workflows, where a sequence of micro-annotation tasks can be configured and sent out to a number of crowdsourcing platforms (e.g. CrowdFlower and Amazon Mechanical Turk) and applications (e.g. the expert annotation game Dr. Detective). The CrowdTruth framework has a special focus on micro-tasks for knowledge extraction in medical text (e.g. medical documents from various sources, such as Wikipedia articles or patient case reports). The main steps involved in the CrowdTruth workflow are: (1) exploring and processing input data, (2) collecting annotation data, and (3) applying disagreement analytics to the results. These steps are realised in an automatic end-to-end workflow that can support a continuous collection of high-quality gold standard data, with a feedback loop to all steps of the process. Have a look at our presentations and papers for more details on the research.
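To give a feel for step (3), here is a minimal sketch of one simple way to quantify disagreement on a single input unit. It is an illustration under stated assumptions, not the framework's actual API: the function names, the example labels and the choice of a cosine-based score are all ours. Each worker's judgment is treated as a set of chosen annotations, the judgments are summed into one vector per unit, and each annotation is scored by the cosine between that vector and the annotation's own axis.

```python
from collections import Counter
from math import sqrt


def unit_vector(judgments):
    """Count, per annotation, how many workers selected it for one unit."""
    counts = Counter()
    for worker_labels in judgments:
        # A worker counts at most once per annotation.
        counts.update(set(worker_labels))
    return counts


def annotation_scores(judgments):
    """Cosine between the unit vector and each annotation's one-hot vector.

    A score close to 1 means the crowd converged on that annotation;
    several mid-range scores suggest the unit is ambiguous.
    """
    counts = unit_vector(judgments)
    norm = sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {label: c / norm for label, c in counts.items()}


# Hypothetical judgments from four workers on one sentence.
judgments = [["treats"], ["treats"], ["treats", "causes"], ["causes"]]
scores = annotation_scores(judgments)
for label, score in sorted(scores.items()):
    print(f"{label}: {score:.3f}")
```

In this sketch, disagreement is kept as a signal rather than discarded by majority vote: both annotations receive a score, and the gap between them indicates how clear the unit is.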

DIVE+ Presentation at Cross Media Café

On the 7th of March, the DIVE+ project will be presented at Cross Media Café: Uit het Lab. DIVE+ is the result of a truly interdisciplinary collaboration between computer scientists, humanities scholars, cultural heritage professionals and interaction designers. In this project, we use the CrowdTruth methodology and framework to crowdsource events for the news broadcasts from The Netherlands Institute for Sound and Vision (NISV) that are published under open licenses on the OpenImages platform. As part of the digital humanities effort, DIVE+ is also integrated, alongside other media studies research tools, in the CLARIAH (Common Lab Research Infrastructure for the Arts and Humanities) research infrastructure, which aims to support media studies researchers and scholars by providing access to digital data and tools. We are developing this project together with the eScience Center, which is also funding DIVE+.

Paper Accepted for the ESWC 2017 Research Track

Our paper “Harnessing Diversity in Crowds and Machines for Better NER Performance” (Oana Inel and Lora Aroyo) has been accepted for the ESWC 2017 Research Track. The paper is to be published together with the proceedings of the conference.

Over the last few years, information extraction tools have gained great popularity and brought significant improvements in extracting meaning from structured or unstructured data. For example, named entity recognition (NER) tools identify types such as people, organizations or places in text. However, despite their high F1 performance, NER tools are still prone to brittleness due to their highly specialized and constrained input and training data. Thus, each tool is able to extract only a subset of the named entities (NE) mentioned in a given text. In order to improve NE coverage, we propose a hybrid approach, where we first aggregate the output of various NER tools and then validate and extend it through crowdsourcing. The results from our experiments show that this approach performs significantly better than the individual state-of-the-art tools (including existing tools that already integrate individual outputs). Furthermore, we show that the crowd is quite effective in identifying mistakes, inconsistencies and ambiguities in currently used ground truth, and that it offers a promising approach to gathering ground truth annotations for NER that capture a multitude of opinions.
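The aggregation step of such a hybrid approach can be sketched as follows. This is a minimal illustration and not the implementation from the paper: the tool names, the example sentence and the rule "send every span that not all tools agree on to the crowd" are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_ner(outputs):
    """Merge per-tool outputs into {(start, end, surface): set of tool names}."""
    merged = defaultdict(set)
    for tool, spans in outputs.items():
        for span in spans:
            merged[span].add(tool)
    return dict(merged)


def needs_crowd_validation(merged, n_tools):
    """Spans found by only some of the tools (assumed selection rule)."""
    return sorted(span for span, tools in merged.items() if len(tools) < n_tools)


# Hypothetical outputs of three NER tools for the sentence
# "Barack Obama visited Amsterdam." as (start, end, surface form).
outputs = {
    "tool_a": [(0, 12, "Barack Obama"), (21, 30, "Amsterdam")],
    "tool_b": [(7, 12, "Obama"), (21, 30, "Amsterdam")],
    "tool_c": [(0, 12, "Barack Obama"), (21, 30, "Amsterdam")],
}

merged = aggregate_ner(outputs)
to_validate = needs_crowd_validation(merged, n_tools=len(outputs))
print(to_validate)
```

Here the union of all tool outputs covers more candidate entities than any single tool, and the overlapping candidates "Barack Obama" and "Obama" are exactly the kind of ambiguity that a crowd task could then resolve.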

Watson Innovation Course wins ICT project of the year in education


Yesterday at the Computable Awards, the Vrije Universiteit, the University of Amsterdam and IBM won the prize for “ICT project of the year in education” with the Watson Innovation Course. Furthermore, the project was the highest rated of all nominees across all prize categories. The course is currently running for the second time, with an improved setup and new state-of-the-art tools for the students.

The course is run by Lora Aroyo, Anca Dumitrache, Benjamin Timmermans and Oana Inel from the VU, and Robert-Jan Sips and Zoltan Szlavik from IBM. In the course, the students were challenged by Amsterdam Marketing to address the increasing overcrowding by tourists in the city center of Amsterdam. The city is culturally rich, with many places to visit, yet most visitors cluster around a limited set of popular locations. The students came up with ideas to motivate visitors to spread out across the city and to provide them with relevant information for their visit.


Interested in working with IBM and ABN AMRO on an exciting innovation project?

Are you interested in working with IBM and ABN AMRO on an exciting innovation project?

Although the mortgage application process and the regulations surrounding it are clearly mapped, institutionalised and supported by automated systems, the process of orientation in the housing market is not. When looking for an appropriate place to live or open a business, clients of the bank are confronted with questions and considerations such as the location, the image, facilities and safety of the neighbourhood, energy labels, average price levels and future development plans, all surrounding one of the largest decisions of their life: the purchase of a house. At present, it is unclear to the bank which steps customers take in the orientation process and with which extra services, answers or information the bank could support them.

Within this 3-month project, IBM, VU and UvA will join forces and use the IBM Open Innovation approach to come up with a new data-driven service concept for one of the leading banks in the Netherlands. The team will consist of IBM consultants (design / business strategy), an IBM programmer, 2 IBM researchers and researchers from VU Amsterdam and UvA.

To strengthen our team, we are looking for 3 students for a 3-month internship at IBM, potentially followed by an MSc thesis project deepening or continuing their work, in the following disciplines:

(1) Business/Service Innovation: Students with an entrepreneurial / service innovation background, looking to gain experience in the development of a real-life business case. The student should ideally have experience with focus groups and qualitative interviews, to help gain initial insights into the house orientation process.

(2) Crowdsourcing: Using crowdsourcing and the social web to get a clear(er) picture of the demands, questions and uncertainties surrounding the purchase of a house. This project will complement the work done by student (1) with quantitative results and a larger scope.

(3) Open Data / Information Retrieval: Finding (open) datasets and retrieving data sources that can provide insights into the questions identified by the work of students (1) and (2).

If you are interested in an internship with IBM / ABN AMRO within the context of this project, please contact Lora Aroyo via lora.aroyo@vu.nl with a short motivation explaining why you would like to work on it, your CV and your availability in the coming 3-4 months.

Digging into Military Memoirs

On the 8th and 9th of September, the workshop “Digging into Military Memoirs” took place at the Royal Netherlands Institute of Southeast Asian and Caribbean Studies in Leiden. The workshop, organized by Stef Scagliola, was a great opportunity to get in close contact with researchers and historians working in various fields, such as interviews, oral history and cross-media analysis. During the workshop, the participants experimented with digital technologies on the basis of a corpus of 700 documents published about the veterans in Indonesia.

The aim of the workshop was to explain the possibilities of Digital Humanities tools and methods to a group of around 20 historians. The workshop was divided into four sessions (Data Visualization, Open Linked Data, Text Mining and Crowdsourcing), and each session consisted of a short presentation and hands-on assignments to be performed individually or in groups. The main goal of each session was to inform the researchers about the most appropriate tools and applications to use at each stage of their research, in order to generate insights for their work faster and more efficiently.

The crowdsourcing session was developed and presented together with Liliana Melgar. We divided the session into two parts. The first part was meant to be followed as an example: Liliana provided brief explanations of the current state of the art in crowdsourcing approaches in Digital Humanities and other fields. In the second part, the historians were able to experiment with different examples of crowdsourcing tasks and further develop a project idea (based on their own interests) for which crowdsourcing would be a good candidate.