• User Centric Data Science: Users, Systems, Data
• Data Science: A new journal at IOS Press
• CrowdTruth: gold standard data for cognitive computing systems
Our Group


Users, Systems, Data

User Centric Data Science is about how users interact with systems and consume and produce data.


Our Team

We are an international and diverse team of researchers.


VU Amsterdam

We are part of the Department of Computer Science at VU University Amsterdam.

Current Projects


Crowds and Machines for Modeling and Discovering Controversy



Innovative access to heritage objects from heterogeneous online collections


CrowdTruth

The framework for crowdsourcing ground truth data



Extracting relations between people and events




ABC-Kb Network Institute project kickoff

Our Academy Assistant project ABC-Kb, "A Knowledge base supporting the Assessment of language impairment in Bilingual Children", has started.


DIVE+ in Europeana Insight

This month’s edition of Europeana Insight features articles from this year’s LODLAM Challenge finalists, including the winner, DIVE+. The online article “DIVE+: EXPLORING INTEGRATED LINKED MEDIA” discusses the DIVE+ user studies, data enrichment, exploratory interface, and impact on the cultural heritage domain.


Data Science: a new journal

Data Science is an interdisciplinary journal that addresses how data has become a crucial factor in a large number and variety of scientific fields. The journal covers aspects of scientific data across the whole range, from data creation, mining, discovery, curation, modeling, processing, and management to analysis, prediction, visualization, user interaction, communication, sharing, and re-use.


Latest Publications

These are our most recent publications.
Publication and archival of scientific results is still commonly considered the responsibility of classical publishing companies. Classical forms of publishing, however, which center around printed narrative articles, no longer seem well-suited in the digital age. In particular, there are currently no efficient, reliable, and agreed-upon methods for publishing scientific datasets, which have become increasingly important for science. In this article, we propose to design scientific data publishing as a web-based bottom-up process, without top-down control of central authorities such as publishing companies. Based on a novel combination of existing concepts and technologies, we present a server network to decentrally store and archive data in the form of nanopublications, an RDF-based format to represent scientific data. We show how this approach allows researchers to publish, retrieve, verify, and recombine datasets of nanopublications in a reliable and trustworthy manner, and we argue that this architecture could be used as a low-level data publication layer to serve the Semantic Web in general. Our evaluation of the current network shows that this system is efficient and reliable.
Cognitive computing systems require human-labeled data for evaluation, and often for training. The standard practice used in gathering this data minimizes disagreement between annotators, and we have found that this results in data that fails to account for the ambiguity inherent in language. We have proposed the CrowdTruth method for collecting ground truth through crowdsourcing, which reconsiders the role of people in machine learning based on the observation that disagreement between annotators provides a useful signal for phenomena such as ambiguity in the text. We report on using this method to build an annotated data set for medical relation extraction for the cause and treat relations, and on how this data performed in a supervised training experiment. We demonstrate that by modeling ambiguity, labeled data gathered from crowd workers can (1) reach the level of quality of domain experts for this task while reducing the cost, and (2) provide better training data at scale than distant supervision. We further propose and validate new weighted measures for precision, recall, and F-measure that account for ambiguity in both human and machine performance on this task.