• User Centric Data Science: Users, Systems, Data

  • Data Science: A new journal at IOS Press

  • CrowdTruth: gold standard data
    for cognitive computing systems


Machine-to-machine communication in rural conditions: Realizing KasadakaNet

In his Master's project, Fahad Ali researched the use of Wi-Fi sneakernets for machine-to-machine communication, to enable information sharing between geographically distributed devices. He developed a Raspberry Pi-based device called the Wifi-donkey that can be mounted on a vehicle and facilitates information exchange with nearby devices, using the built-in Wi-Fi card of the Raspberry Pi 3. The solution is based on PirateBox, an offline file-sharing and communications system built with free software, and uses off-the-shelf Linux software components and configuration settings to discover and connect to nearby Kasadaka devices over Wi-Fi.


DIVE+ collection enrichment paper wins best paper award at MTSR2017

The DIVE+ collection enrichment paper won the best paper award at the 11th Metadata and Semantics Research Conference (MTSR 2017) in Tallinn, Estonia.


Final Release of the Big Data Europe Integrator Platform (BDI)

On Thursday 16 November, the BDE Consortium launched the third version of the Big Data Europe Integrator Platform. The technical team has set out what the Big Data Europe Integrator Platform can do, how it does it, and how you can use it to derive more value from your data.



A better understanding of the understudied areas of architectural history.



Crowds and Machines for Modeling and Discovering Controversy



Innovative access to heritage objects from heterogeneous online collections


CrowdTruth

The framework for crowdsourcing ground truth data


Users, Systems, Data

User Centric Data Science is about how users interact with systems and consume and produce data.


Our Team

We are an international and diverse team of researchers.


VU Amsterdam

We are part of the Department of Computer Science at VU University Amsterdam.


Spotlight Publications

These are some selected publications.

Publication and archival of scientific results is still commonly considered the responsibility of classical publishing companies. Classical forms of publishing, however, which center around printed narrative articles, no longer seem well-suited to the digital age. In particular, there are currently no efficient, reliable, and agreed-upon methods for publishing scientific datasets, which have become increasingly important for science. In this article, we propose to design scientific data publishing as a web-based bottom-up process, without top-down control by central authorities such as publishing companies. Based on a novel combination of existing concepts and technologies, we present a server network to decentrally store and archive data in the form of nanopublications, an RDF-based format to represent scientific data. We show how this approach allows researchers to publish, retrieve, verify, and recombine datasets of nanopublications in a reliable and trustworthy manner, and we argue that this architecture could be used as a low-level data publication layer to serve the Semantic Web in general. Our evaluation of the current network shows that this system is efficient and reliable.
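
As a rough illustration of the "verify" step mentioned in this abstract, the sketch below derives an identifier from a hash of the content, so that a copy retrieved from any server can be checked against the identifier it was requested by. The function names `make_identifier` and `verify`, the example base URI, and the identifier layout are hypothetical; this is not the identifier format used by the nanopublication server network, only a minimal sketch of content-based verification using SHA-256 from the Python standard library.

```python
import base64
import hashlib


def make_identifier(base_uri: str, content: bytes) -> str:
    """Append a URL-safe SHA-256 digest of the content to a base URI.

    The layout (base URI + digest) is an assumption for illustration only.
    """
    digest = hashlib.sha256(content).digest()
    suffix = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return f"{base_uri}{suffix}"


def verify(identifier: str, base_uri: str, content: bytes) -> bool:
    """Return True if the content matches the hash embedded in the identifier."""
    return identifier == make_identifier(base_uri, content)


if __name__ == "__main__":
    base = "http://example.org/np/"  # hypothetical base URI
    data = b"<subject> <predicate> <object> ."
    ident = make_identifier(base, data)
    print(verify(ident, base, data))                # unmodified copy
    print(verify(ident, base, data + b" tampered"))  # altered copy
```

Because the identifier depends only on the content, a client in this sketch does not need to trust the server that delivered the data: any modification changes the hash and the check fails.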

Cognitive computing systems require human-labeled data for evaluation, and often for training. The standard practice used in gathering this data minimizes disagreement between annotators, and we have found that this results in data that fails to account for the ambiguity inherent in language. We have proposed the CrowdTruth method for collecting ground truth through crowdsourcing, which reconsiders the role of people in machine learning based on the observation that disagreement between annotators provides a useful signal for phenomena such as ambiguity in the text. We report on using this method to build an annotated data set for medical relation extraction for the cause and treat relations, and on how this data performed in a supervised training experiment. We demonstrate that by modeling ambiguity, labeled data gathered from crowd workers can (1) reach the level of quality of domain experts for this task while reducing the cost, and (2) provide better training data at scale than distant supervision. We further propose and validate new weighted measures for precision, recall, and F-measure that account for ambiguity in both human and machine performance on this task.
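
To make the idea of ambiguity-aware evaluation more concrete, here is a minimal sketch of weighted precision, recall, and F-measure in which each sentence contributes in proportion to how clearly the crowd agreed on it. The weighting function `clarity_weight` and the function `weighted_prf` are hypothetical and chosen for illustration; they are not the definitions from the paper, which should be consulted for the actual measures.

```python
from typing import Sequence, Tuple


def clarity_weight(votes_for: int, total_votes: int) -> float:
    """Hypothetical weight in [0, 1]: 1 when annotators are unanimous
    (for or against the relation), 0 when they are evenly split."""
    return abs(2 * votes_for / total_votes - 1)


def weighted_prf(
    weights: Sequence[float],
    gold: Sequence[bool],
    predicted: Sequence[bool],
) -> Tuple[float, float, float]:
    """Precision, recall and F1 where each example counts by its weight
    instead of counting as 1."""
    tp = sum(w for w, g, p in zip(weights, gold, predicted) if g and p)
    fp = sum(w for w, g, p in zip(weights, gold, predicted) if p and not g)
    fn = sum(w for w, g, p in zip(weights, gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    # Three sentences: (workers who chose the relation, total workers)
    weights = [clarity_weight(15, 15), clarity_weight(12, 16), clarity_weight(0, 15)]
    gold = [True, True, False]
    predicted = [True, False, True]
    print(weighted_prf(weights, gold, predicted))
```

In this sketch a mistake on an ambiguous sentence costs less than a mistake on a sentence where the annotators were unanimous, which is one simple way to let disagreement inform the evaluation.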