On 7 March, the DIVE+ project will be presented at Cross Media Café: Uit het Lab. DIVE+ is the result of a truly interdisciplinary collaboration between computer scientists, humanities scholars, cultural heritage professionals and interaction designers. In this project, we use the CrowdTruth methodology and framework to crowdsource events for the news broadcasts from The Netherlands Institute for Sound and Vision (NISV) that are published under open licenses on the OpenImages platform. As part of the digital humanities effort, DIVE+ is also integrated, alongside other media studies research tools, into the CLARIAH (Common Lab Research Infrastructure for the Arts and Humanities) research infrastructure, which aims to support media studies researchers and scholars by providing access to digital data and tools. We are developing this project together with the eScience Center, which also funds the DIVE+ project.
Our paper “Harnessing Diversity in Crowds and Machines for Better NER Performance” (Oana Inel and Lora Aroyo) has been accepted at the ESWC 2017 Research Track. The paper will be published in the conference proceedings.
In recent years, information extraction tools have gained great popularity and brought significant improvements in extracting meaning from structured and unstructured data. For example, named entity recognition (NER) tools identify types such as people, organizations or places in text. However, despite their high F1 performance, NER tools are still prone to brittleness due to their highly specialized and constrained input and training data. Thus, each tool is able to extract only a subset of the named entities (NEs) mentioned in a given text. To improve NE coverage, we propose a hybrid approach in which we first aggregate the output of various NER tools and then validate and extend it through crowdsourcing. The results of our experiments show that this approach performs significantly better than the individual state-of-the-art tools (including existing tools that already integrate individual outputs). Furthermore, we show that the crowd is quite effective in (1) identifying mistakes, inconsistencies and ambiguities in currently used ground truth, and (2) providing a promising approach to gather ground truth annotations for NER that capture a multitude of opinions.
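The aggregation step of the hybrid approach can be illustrated with a small sketch. This is a hypothetical simplification, not the paper's actual pipeline or the CrowdTruth API: it takes the entity spans returned by several NER tools, keeps their union for better coverage, and separates the spans the tools disagree on, which are the natural candidates for crowd validation.

```python
# Toy sketch of aggregating multiple NER tool outputs (hypothetical,
# not the actual implementation): the union maximizes NE coverage,
# and disputed spans are routed to crowd validation.

def merge_ner_outputs(outputs):
    """Union the (start, end, type) annotations from multiple NER tools.

    outputs: dict mapping tool name -> set of (start, end, entity_type).
    Returns (agreed, disputed): spans every tool found vs. the rest.
    """
    all_spans = set().union(*outputs.values())
    agreed = set.intersection(*outputs.values())
    disputed = all_spans - agreed
    return agreed, disputed

# Two imaginary tools that agree on one span and disagree on two:
tool_a = {(0, 5, "PER"), (10, 18, "ORG")}
tool_b = {(0, 5, "PER"), (25, 31, "LOC")}
agreed, disputed = merge_ner_outputs({"tool_a": tool_a, "tool_b": tool_b})
# agreed holds (0, 5, "PER"); the two disputed spans would be sent
# to the crowd for validation and possible extension.
```

The point of keeping the union rather than the intersection is exactly the coverage argument from the abstract: each tool misses entities the others find, so discarding disagreements would reproduce the brittleness of the individual tools.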
On 8 and 9 September, the workshop “Digging into Military Memoirs” took place at the Royal Netherlands Institute of Southeast Asian and Caribbean Studies in Leiden. The workshop, organized by Stef Scagliola, was a great opportunity to get in close contact with researchers and historians working in fields such as interviews, oral history and cross-media analysis, among others. During the workshop, the participants experimented with digital technologies on the basis of a corpus of 700 documents published about the veterans in Indonesia.
The aim of the workshop was to introduce a group of around 20 historians to the possibilities of Digital Humanities tools and methods. The workshop was divided into four sessions (Data Visualization, Linked Open Data, Text Mining and Crowdsourcing), each consisting of a short presentation and hands-on assignments to be performed individually or in groups. The main goal of each session was to inform the researchers about the most appropriate tools and applications to use at each stage of their research, in order to generate insights for their work faster and more efficiently.
The crowdsourcing session was developed and presented together with Liliana Melgar. We divided the session into two parts. In the first part, Liliana gave a brief overview of the current state of the art in crowdsourcing approaches in Digital Humanities and other fields. In the second part, the historians were able to experiment with different examples of crowdsourcing tasks and to develop a project idea (based on their own interests) for which crowdsourcing would be a good fit.
During the ESWC 2016 PhD Symposium, I presented my doctoral consortium paper, entitled “Machine-Crowd Annotation Workflow for Event Understanding across Collections and Domains”.
People need context to process the massive amount of information available online. Context is often provided by a specific event taking place. The multitude of data streams mentioning events produces an enormous amount of information redundancy and many perspectives. This poses challenges both to humans, i.e., to reduce the information overload and consume the meaningful information, and to machines, i.e., to generate a concise overview of the events. For machines to generate such overviews, they need to be taught to understand events. The goal of this research project is to investigate whether combining machine output with crowd perspectives boosts the event understanding of state-of-the-art natural language processing tools and improves their event detection. To answer this question, we propose an end-to-end research methodology covering machine processing, the definition of experimental data and setup, the gathering of event semantics and the evaluation of results. We present preliminary results indicating that crowdsourcing is a reliable approach for (1) linking events and their related entities in cultural heritage collections and (2) identifying salient event features (i.e., relevant mentions and sentiments) in online data. We provide an evaluation plan for the overall research methodology of crowdsourcing event semantics across modalities and domains.
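The idea of gathering event semantics from crowd perspectives, rather than collapsing them into a single majority answer, can be sketched as follows. This is a toy illustration under simplifying assumptions, not the actual CrowdTruth implementation: each worker marks which event mentions they see in a sentence, and each mention is scored by the fraction of workers who selected it, so minority perspectives survive as low-scoring annotations instead of being discarded.

```python
# Toy sketch (hypothetical, not the CrowdTruth framework itself) of
# scoring crowd-annotated event mentions while keeping all perspectives.
from collections import Counter

def mention_scores(judgments):
    """judgments: list of sets, one per worker, of selected mentions.

    Returns a dict mapping each mention to the fraction of workers
    who selected it (1.0 = unanimous, lower = minority perspective).
    """
    counts = Counter(m for worker in judgments for m in worker)
    n_workers = len(judgments)
    return {mention: count / n_workers for mention, count in counts.items()}

# Three imaginary workers annotating the same sentence:
judgments = [{"election", "protest"}, {"election"}, {"election", "speech"}]
scores = mention_scores(judgments)
# "election" is unanimous (score 1.0); "protest" and "speech" each keep
# a score of 1/3, preserving the minority views rather than dropping them.
```

A downstream evaluation can then threshold these scores, or compare the full score distribution against machine output, which is how disagreement-aware crowd data supports the evaluation plan mentioned above.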
On Monday, 9 March 2015, Oana presented the workflow and the experiments for event extraction conducted with the CrowdTruth platform. More specifically, the talk covered what we have learned from the crowd and the benefits that crowdsourcing can bring to solving this problem.