Oana Inel

A Concentric-based Approach to Represent Topics in Tweets and News

[This post is based on the BSc. Thesis of Enya Nieland and the BSc. Thesis of Quinten van Langen (Information Science Track)]

The Web is a rich source of information that presents events, facts and their evolution across time. People mainly follow events through news articles or through social media, such as Twitter. The main goal of the two bachelor projects was to see whether topics in news articles or tweets can be represented in a concentric model, where the main concepts describing the topic are placed in a “core” and the less relevant concepts are placed in a “crust”. To answer this question, Enya and Quinten built on the research conducted by José Luis Redondo García et al. in the paper “The Concentric Nature of News Semantic Snapshots”.

Enya focused on the tweets dataset, and her results show that the approach presented in the aforementioned paper does not work well for tweets: the model had a precision of only 0.56. After inspecting the data, Enya concluded that the large amount of redundant information found in tweets makes them difficult to summarise and makes it hard to identify the most relevant concepts. After applying stemming and lemmatisation techniques, data cleaning and similarity scores together with various relevance thresholds, she improved the precision to 0.97.
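The post does not spell out the exact cleaning pipeline, but the idea can be illustrated with a minimal sketch: normalise the tweets, crudely stem the tokens, and then drop near-duplicates above a similarity threshold. The toy suffix stemmer, the Jaccard similarity and the 0.6 threshold below are all illustrative assumptions, not Enya's actual setup:

```python
def normalise(tweet):
    """Lowercase, drop URLs/mentions/hashtags, and crudely stem each token
    (an illustrative stand-in for proper stemming/lemmatisation)."""
    tokens = set()
    for tok in tweet.lower().split():
        if tok.startswith(("http", "@", "#")):
            continue  # Twitter artefacts carry little topical content
        for suffix in ("ing", "ed", "s"):  # toy suffix stripping
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        tokens.add(tok)
    return tokens

def jaccard(a, b):
    """Similarity score between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def deduplicate(tweets, threshold=0.6):
    """Keep a tweet only if it is not too similar to one already kept."""
    kept, kept_tokens = [], []
    for t in tweets:
        toks = normalise(t)
        if all(jaccard(toks, k) < threshold for k in kept_tokens):
            kept.append(t)
            kept_tokens.append(toks)
    return kept
```

With this kind of filter, “Big fire downtown http://x.co” and “big fires downtown #news” collapse into one tweet, which is exactly the redundancy that made the original precision so low.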

Quinten focused on topics published in news articles. When applying the method described in the reference article, Quinten concluded that relevant entities in news articles can indeed be identified. However, his focus was also to identify the most relevant events mentioned when talking about a topic. In addition, he calculated a term frequency–inverse document frequency (TF-IDF) score and an event-relation score (based on temporal relations and event-related concepts) for each topic. Combined, these scores determine the new relevance score of the entities mentioned in a news article. These changes improved the ranking of the events, but did not improve the ranking of the other concepts, such as places or actors.
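The post does not give the exact formula for combining the two scores; a natural reading is a weighted sum, sketched below. The `alpha` weight and the externally supplied `event_score` input are assumptions for illustration only:

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Standard TF-IDF: frequency of the term in the document, discounted
    by how many documents in the corpus mention the term at all."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + df)) + 1  # smoothed idf variant
    return tf * idf

def relevance(term, doc_tokens, corpus, event_score, alpha=0.5):
    """Hypothetical combination: weighted sum of the TF-IDF score and an
    event-relation score (how strongly the term ties to the topic's
    temporal relations and event-related concepts)."""
    return alpha * tf_idf(term, doc_tokens, corpus) + (1 - alpha) * event_score
```

Under such a scheme, a term that is distinctive for one document and strongly event-related outranks a term that merely appears everywhere in the corpus.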

Below you can find the final presentations that the students gave about their work:

A Concentric-based Approach to Represent News Topics in Tweets
Enya Nieland, June 21st 2017

The Relevance of Events in News Articles
Quinten van Langen, June 21st 2017

ESWC 2017 – Trip Report

Between the 28th of May and the 1st of June 2017, the 14th Extended Semantic Web Conference took place in Portorož, Slovenia. As part of the CrowdTruth team and project, Oana Inel presented her paper, written together with Lora Aroyo, on the first day of the conference. More about the paper can be found in a previous post. On the last day of the conference, Lora was the keynote speaker.

The Semantic Web group at the Vrije Universiteit Amsterdam had other great presentations. During the Scientometrics Workshop Al Idrissou talked about the SMS platform that links and enriches data for studying science. During the poster and demo session people were invited to check SPARQL2Git: Transparent SPARQL and Linked Data API Curation via Git by Albert Meroño-Peñuela and Rinke Hoekstra. Furthermore, the Semantic Web group had a candidate paper for the 7-year impact award “OWL reasoning with WebPIE: calculating the closure of 100 billion triples”, by Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen and Henri Bal.

Keynotes

I’ll start with a couple of words about the keynotes, which this year covered a wide range of areas, domains and subjects. In the first keynote presentation at ESWC 2017, on Tuesday, Kevin Crosby, from RavenPack, stressed the importance of data as a factor in decision making for financial markets. In his talk, entitled “Bringing semantic intelligence to financial markets”, he focused on the current issues around data analytics in decision making: the lack of skills and expertise, the quality and completeness of data, and the timeliness of data. The most striking point, however, is that although we live in the age of data, only around 29% of decisions in the financial markets are made based on data.

The second keynote speaker was John Sheridan, the digital director of The National Archives in the UK. While giving a nice overview of British history, he talked about how semantic technologies are used to preserve history at The National Archives, in a talk entitled “Semantic Web technologies for Digital Archives”. Nowadays, semantic technologies are widely used to make cultural heritage collections publicly available online. However, people still struggle to search and browse through archives without the context of the data. As a take-home message, we need to work towards second-generation digital archives that measure risk, provide trust evidence, redefine context, embrace uncertainty, and enable use and access.

On the last day of the conference, Lora Aroyo gave her keynote presentation, “Disrupting the Semantic Comfort Zone”. Lora started her keynote by looking back at the history of the Semantic Web and AI, and at how her own journey embraced the changes along the way. One thing was clear: humans have always been at the centre, and they continue to be. The second part of the presentation introduced the underlying idea of the CrowdTruth project. As a final note, I’ll leave you with the following question from Lora: “Will the next AI winter be the winter of human intelligence or not?”

NLP & ML Tracks

Federico Bianchi presented, during the ML track, an approach that uses active learning to rank semantic associations. The problem is well known: there is an information overload in contextual KB exploration, and even for small amounts of text there is a lot of data to be considered. In order to determine which semantic associations are most interesting to users, “Actively Learning to Rank Semantic Associations for Personalized Contextual Exploration of Knowledge Graphs” defines a ranking function based on a serendipity heuristic, i.e., relevance and unexpectedness.
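The shape of such a heuristic can be pictured with a toy ranking function; the convex combination below is an illustrative guess at how relevance and unexpectedness might be traded off, not the paper's actual learned model:

```python
def serendipity(relevance, unexpectedness, beta=0.5):
    """Toy serendipity heuristic: a convex combination of how relevant an
    association is to the user's context and how surprising it is."""
    return beta * relevance + (1 - beta) * unexpectedness

def rank(associations, beta=0.5):
    """Rank semantic associations, highest serendipity first.
    `associations` maps an association id to (relevance, unexpectedness)."""
    return sorted(associations,
                  key=lambda a: serendipity(*associations[a], beta),
                  reverse=True)
```

Setting `beta` to 1.0 degenerates into plain relevance ranking, which is exactly what the serendipity heuristic tries to move away from.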

The paper “All that Glitters Is Not Gold – Rule-Based Curation of Reference Datasets for Named Entity Recognition and Entity Linking” by Kunal Jha, Michael Röder and Axel-Cyrille Ngonga Ngomo draws attention to the current gold standards and makes claims similar to the ones we presented in our paper: the gold standards do not share a common set of rules for annotating named entities, they are not thoroughly checked, and they are not refined and updated to newer versions. Hence the need for EAGLET, a benchmark curation tool for named entities!

Using semantic annotations to provide better access to scientific publications is a subject that has lately caught the attention of many researchers. Sepideh Mesbah, a PhD student at Delft University of Technology, presented “Semantic Annotation of Data Processing Pipelines in Scientific Publications”, a paper that proposes an approach and workflow for extracting semantically rich metadata from scientific publications, by classifying their content and extracting the named entities (objectives, datasets, methods, software, results).

Jose G. Moreno presented the paper “Combining Word and Entity Embeddings for Entity Linking”, which introduces a natural idea for entity linking: using a combination of entity and word embeddings. The authors’ claims are the following: you shall know a word by the company it keeps, and you shall know an entity by the company it keeps in a KB; word context is captured by alignment, word/entity context by concatenation.
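To make the “company it keeps” intuition concrete, here is a toy sketch of disambiguation in a shared vector space: the candidate entity whose embedding is closest to the averaged context word embeddings wins. The tiny 2-dimensional vectors and entity names are invented for illustration and are unrelated to the paper's actual trained embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def context_vector(context_words, word_emb):
    """Average the word embeddings of the mention's context."""
    dims = len(next(iter(word_emb.values())))
    vec = [0.0] * dims
    known = [w for w in context_words if w in word_emb]
    for w in known:
        vec = [x + y for x, y in zip(vec, word_emb[w])]
    n = len(known) or 1
    return [x / n for x in vec]

def link(mention_context, candidates, word_emb, entity_emb):
    """Pick the candidate entity whose embedding is closest to the context
    (a toy stand-in for the paper's combined word/entity space)."""
    ctx = context_vector(mention_context, word_emb)
    return max(candidates, key=lambda e: cosine(ctx, entity_emb[e]))
```

The point of training words and entities in one space is precisely that a comparison like this becomes meaningful: a mention of “Apple” surrounded by music vocabulary lands near the band, not the fruit.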

Social Media Track

The Social Media track started with a presentation by Hassan Saif – “A Semantic Graph-based Approach for Radicalisation Detection on Social Media”. The approach presented in the paper uses a semantic graph representation to discover patterns among pro- and anti-ISIS users on social media. Overall, pro-ISIS users tend to discuss religion, historical events and ethnicity, while anti-ISIS users focus more on politics, geographical locations and intervention against ISIS. The second presentation – “Crowdsourced Affinity: A Matter of Fact or Experience” by Chun Lu – took us into a different domain: a travel destination recommendation scenario based on user-entity affinity, i.e., the likelihood of a user to be attracted by an entity (book, film, artist) or to perform an action (click, purchase, like, share). The main finding of the paper was that, in general, a knowledge graph helps to assess the affinity more accurately, while a folksonomy helps to increase its diversity and novelty. The Social Media track had two papers nominated for best student research paper – the aforementioned paper and “Linked Data Notifications”, presented by Sarven Capadisli, Amy Guy, Christoph Lange, Sören Auer, Andrei Sambra and Tim Berners-Lee. The latter was also the winner!

In-Use and Industrial Track

Social media was highly relevant in the In-Use track as well. The Swiss Armed Forces are developing a Social Media Analysis system aiming to detect events such as natural disasters and terrorist activity by performing semantic tweet analysis. If you want to know more, you can read the paper “ArmaTweet: Detecting Events by Semantic Tweet Analysis”. This track also had nominations for best in-use paper. The winning paper in this category was “smartAPI: Towards a More Intelligent Network of Web APIs”, presented by Amrapali Zaveri.

Open Knowledge Extraction Challenge

During the Open Knowledge Extraction challenge, Raphaël Troncy presented the participating system ADEL – an adaptable entity extraction and linking framework – which was also the challenge's winning entry. The ADEL framework can be adapted to a variety of generic or specific entity types that need to be extracted, as well as to different knowledge bases for disambiguation, such as DBpedia and MusicBrainz. Overall, this self-configurable system tries to solve a difficult problem of current NER tools, i.e., the fact that they are tailored only for specific data, scenarios and applications.

Workshops

On Monday, during the second day of workshops, I attended two workshops: the 3rd International Workshop on Semantic Web for Scientific Heritage, SW4SH 2017, and Semantic Deep Learning, SemDeep-17, now at its first edition. During the SW4SH 2017 workshop, Francesco Beretta gave a detailed keynote, entitled “Collaboratively Producing Interoperable Ontologies and Semantically Annotated Corpora”, in which he presented a couple of projects for the digital humanities (symogih.org and the corpus analysis environment TXM, among others) and showed how linked (open) data, ontologies, automated tools for natural language processing and semantics are finding their place in the daily projects of humanities scholars. However, all these tools, approaches and technologies are not fully embraced yet, as humanities scholars are seldom content with precision values of 90% and feel the urge to manually tweak the data until it looks perfect.

During SemDeep-17, Sergio Oramas presented the paper “ELMDist: A vector space model with words and MusicBrainz entities”. The article notes that it is still unclear how NLP and semantic technologies can contribute to Music Information Retrieval areas such as music and artist recommendation and similarity. The approach presented uses NLP to disambiguate the entities in musical texts and then runs the word2vec algorithm over this sense-level space. Overall, the results are promising, suggesting that textual descriptions can be used to improve Music Information Retrieval. The last paper of the workshop, “On Semantics and Deep Learning for Event Detection in Crisis Situations”, was presented by Hassan Saif. As the title suggests, the paper tackles the problem of event detection in crisis situations on social media, using Dual-CNN, a semantically enhanced deep learning model. Although the model is successful in identifying the existence of events and their types, its performance drops significantly when identifying event-related information such as the number of people affected or the total damages.

Harnessing Diversity in Crowds and Machines for Better NER Performance

Today I presented, in the Research Track of ESWC 2017, my work entitled “Harnessing Diversity in Crowds and Machines for Better NER Performance”. Below you can find the abstract of the paper and the slides I used during the presentation.


Abstract:

Over the last years, information extraction tools have gained great popularity and brought significant performance improvements in extracting meaning from structured or unstructured data. For example, named entity recognition (NER) tools identify types such as people, organizations or places in text. However, despite their high F1 performance, NER tools are still prone to brittleness due to their highly specialized and constrained input and training data. Thus, each tool is able to extract only a subset of the named entities (NE) mentioned in a given text. In order to improve NE Coverage, we propose a hybrid approach, where we first aggregate the output of various NER tools and then validate and extend it through crowdsourcing. The results from our experiments show that this approach performs significantly better than the individual state-of-the-art tools (including existing tools that already integrate individual outputs). Furthermore, we show that the crowd is quite effective in (1) identifying mistakes, inconsistencies and ambiguities in currently used ground truth, as well as in (2) gathering ground truth annotations for NER that capture a multitude of opinions.
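The aggregation step of this hybrid approach can be pictured as a simple vote over the (span, type) pairs the tools return. The majority threshold below is an illustrative simplification of the paper's actual aggregation, and the split into accepted versus crowd-validated pairs is an assumption for the sketch:

```python
from collections import Counter

def aggregate(tool_outputs, min_votes=2):
    """Aggregate NER spans from several tools: keep any (span, type) pair
    that at least `min_votes` tools agree on; send the rest to the crowd
    for validation. Returns (accepted, to_validate)."""
    votes = Counter()
    for spans in tool_outputs:
        votes.update(set(spans))  # each tool votes at most once per pair
    accepted = {s for s, n in votes.items() if n >= min_votes}
    to_validate = {s for s, n in votes.items() if n < min_votes}
    return accepted, to_validate
```

The interesting cases are exactly the disagreements: pairs on which the tools split (e.g. the same span typed as PER by one tool and ORG by another) end up in the crowd's queue, which is where the diversity of crowd opinions pays off.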

IBM Ph.D. Fellowship 2017-2018

Oana Inel has received the IBM Ph.D. Fellowship for the second time. Her research focuses on data enrichment with events and event-related entities, combining the power of machines with the potential of the crowd to identify their relevant dimension, granularity and perspective. She performs her research and experiments in the context of the CrowdTruth project, a collaboration with the IBM Benelux Centre for Advanced Studies.

DIVE+ Presentation at Cross Media Café

On the 7th of March, the DIVE+ project will be presented at the Cross Media Café: Uit het Lab. DIVE+ is the result of a true interdisciplinary collaboration between computer scientists, humanities scholars, cultural heritage professionals and interaction designers. In this project, we use the CrowdTruth methodology and framework to crowdsource events for the news broadcasts from The Netherlands Institute for Sound and Vision (NISV) that are published under open licenses on the OpenImages platform. As part of the digital humanities effort, DIVE+ is also integrated, next to other media studies research tools, in the CLARIAH (Common Lab Research Infrastructure for the Arts and Humanities) research infrastructure, which aims to support media studies researchers and scholars by providing access to digital data and tools. We develop this project together with the eScience Center, which is also funding DIVE+.

Paper Accepted for the ESWC 2017 Research Track

Our paper “Harnessing Diversity in Crowds and Machines for Better NER Performance” (Oana Inel and Lora Aroyo) has been accepted for the ESWC 2017 Research Track. The paper will be published in the proceedings of the conference.

Abstract
Over the last years, information extraction tools have gained great popularity and brought significant performance improvements in extracting meaning from structured or unstructured data. For example, named entity recognition (NER) tools identify types such as people, organizations or places in text. However, despite their high F1 performance, NER tools are still prone to brittleness due to their highly specialized and constrained input and training data. Thus, each tool is able to extract only a subset of the named entities (NE) mentioned in a given text. In order to improve NE Coverage, we propose a hybrid approach, where we first aggregate the output of various NER tools and then validate and extend it through crowdsourcing. The results from our experiments show that this approach performs significantly better than the individual state-of-the-art tools (including existing tools that already integrate individual outputs). Furthermore, we show that the crowd is quite effective in (1) identifying mistakes, inconsistencies and ambiguities in currently used ground truth, as well as in (2) gathering ground truth annotations for NER that capture a multitude of opinions.

Digging into Military Memoirs

On the 8th and 9th of September, the workshop “Digging into Military Memoirs” took place at the Royal Netherlands Institute of Southeast Asian and Caribbean Studies in Leiden. The workshop, organized by Stef Scagliola, was a great opportunity to get in close contact with researchers and historians working in various fields, such as interviews, oral history and cross-media analysis, among others. During the workshop, the participants experimented with digital technologies on the basis of a corpus of 700 documents published about the veterans in Indonesia.

The aim of the workshop was to show a group of around 20 historians the possibilities of Digital Humanities tools and methods. The workshop was divided into four sessions (Data Visualization, Linked Open Data, Text Mining and Crowdsourcing), each composed of a short presentation and hands-on assignments to be performed individually or in groups. The main goal of each session was to inform the researchers about the most appropriate tools and applications to use at each stage of their research, in order to generate faster and more efficient insights for their work.

The crowdsourcing session was developed and presented together with Liliana Melgar. We divided the session into two parts. In the first part, Liliana provided brief explanations of the current state of the art in crowdsourcing approaches in Digital Humanities and other fields. In the second part, the historians were able to experiment with different examples of crowdsourcing tasks and to develop a project idea (based on their own interests) for which crowdsourcing would be a good candidate.

Machine-Crowd Annotation Workflow for Event Understanding across Collections and Domains

During the ESWC 2016 PhD Symposium, I presented my doctoral consortium paper, which is entitled “Machine-Crowd Annotation Workflow for Event Understanding across Collections and Domains”.

People need context to process the massive amount of information online. Context is often expressed by a specific event taking place. The multitude of data streams that mention events provides an inconceivable amount of information redundancy and perspectives. This poses challenges to both humans, i.e., to reduce the information overload and consume the meaningful information, and machines, i.e., to generate a concise overview of the events. For machines to generate such overviews, they need to be taught to understand events. The goal of this research project is to investigate whether combining machine output with crowd perspectives boosts the event understanding of state-of-the-art natural language processing tools and improves their event detection. To answer this question, we propose an end-to-end research methodology for machine processing, defining experimental data and setup, gathering event semantics, and evaluating results. We present preliminary results that indicate crowdsourcing as a reliable approach for (1) linking events and their related entities in cultural heritage collections and (2) identifying salient event features (i.e., relevant mentions and sentiments) for online data. We provide an evaluation plan for the overall research methodology of crowdsourcing event semantics across modalities and domains.