• User Centric Data Science: Users, Systems, Data

• Data Science: A new journal at IOS Press

• CrowdTruth: gold standard data for cognitive computing systems

Our Group

 

Users, Systems, Data

User Centric Data Science is about how users interact with systems and consume and produce data.

 

Our Team

We are an international and diverse team of researchers.

 

VU Amsterdam

We are part of the Department of Computer Science at VU University Amsterdam.

Projects

ControCurator

Crowds and Machines for Modeling and Discovering Controversy


DIVE

Innovative access to heritage objects from heterogeneous online collections


CrowdTruth

The framework for crowdsourcing ground truth data


BiographyNet

Extracting relations between people and events


News

 

Final Release of the Big Data Europe Integrator Platform (BDI)

On Thursday 16 November, the BDE Consortium launched the third version of the Big Data Europe Integrator Platform. The technical team has set out what the platform can do, how it does it, and how you can use it to derive more value from your data.

 

Dancing and Semantics

This post describes the MSc theses of Ana-Liza Tjon-a-Pauw and Josien Jansen on the topic of Dancing and Semantics. They investigated (1) how we can model and represent dance in a sensible manner so that computers can make sense of choreographies, and (2) how we can communicate those choreographies to the dancers.

 

ABC-Kb Network Institute project kickoff

Our Academy Assistant project ABC-Kb, "A Knowledge base supporting the Assessment of language impairment in Bilingual Children", has started.

People

Latest Publications

These are our most recent publications.
Publication and archival of scientific results is still commonly considered the responsibility of classical publishing companies. Classical forms of publishing, however, which center around printed narrative articles, no longer seem well-suited in the digital age. In particular, there are currently no efficient, reliable, and agreed-upon methods for publishing scientific datasets, which have become increasingly important for science. In this article, we propose to design scientific data publishing as a web-based bottom-up process, without top-down control of central authorities such as publishing companies. Based on a novel combination of existing concepts and technologies, we present a server network to decentrally store and archive data in the form of nanopublications, an RDF-based format to represent scientific data. We show how this approach allows researchers to publish, retrieve, verify, and recombine datasets of nanopublications in a reliable and trustworthy manner, and we argue that this architecture could be used as a low-level data publication layer to serve the Semantic Web in general. Our evaluation of the current network shows that this system is efficient and reliable.
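
A nanopublication packages a single assertion together with its provenance and publication metadata as named RDF graphs. Purely as an illustrative sketch (the URIs and the example claim below are invented, not taken from the article), this is how such a structure could be assembled with the rdflib Python library:

```python
# Illustrative sketch only: assembling a minimal nanopublication with rdflib.
# All URIs and the example assertion are invented for demonstration.
from rdflib import Dataset, Namespace, URIRef, Literal
from rdflib.namespace import RDF, XSD

NP = Namespace("http://www.nanopub.org/nschema#")
PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/np1#")

ds = Dataset()
ds.bind("np", NP)
ds.bind("prov", PROV)

head = ds.graph(EX.head)
assertion = ds.graph(EX.assertion)
provenance = ds.graph(EX.provenance)
pubinfo = ds.graph(EX.pubinfo)

# Head graph: ties the three parts together into one nanopublication.
head.add((EX.nanopub, RDF.type, NP.Nanopublication))
head.add((EX.nanopub, NP.hasAssertion, EX.assertion))
head.add((EX.nanopub, NP.hasProvenance, EX.provenance))
head.add((EX.nanopub, NP.hasPublicationInfo, EX.pubinfo))

# Assertion graph: the (invented) scientific claim itself.
assertion.add((EX.geneX, EX.isAssociatedWith, EX.diseaseY))

# Provenance graph: where the assertion came from.
provenance.add((EX.assertion, PROV.wasDerivedFrom,
                URIRef("http://example.org/some-study")))

# Publication info graph: metadata about the nanopublication itself.
pubinfo.add((EX.nanopub, PROV.generatedAtTime,
             Literal("2017-11-16T00:00:00", datatype=XSD.dateTime)))

# Serialize all four named graphs as TriG.
print(ds.serialize(format="trig"))
```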
Cognitive computing systems require human-labeled data for evaluation, and often for training. The standard practice used in gathering this data minimizes disagreement between annotators, and we have found that this results in data that fails to account for the ambiguity inherent in language. We have proposed the CrowdTruth method for collecting ground truth through crowdsourcing, which reconsiders the role of people in machine learning based on the observation that disagreement between annotators provides a useful signal for phenomena such as ambiguity in the text. We report on using this method to build an annotated data set for medical relation extraction for the cause and treat relations, and how this data performed in a supervised training experiment. We demonstrate that by modeling ambiguity, labeled data gathered from crowd workers can (1) reach the level of quality of domain experts for this task while reducing the cost, and (2) provide better training data at scale than distant supervision. We further propose and validate new weighted measures for precision, recall, and F-measure that account for ambiguity in both human and machine performance on this task.
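
The weighted measures themselves are defined in the publication; purely as an illustration of the idea, and with invented example data and a simplified scoring scheme, the sketch below weights each example by a crowd-derived score in [0, 1] rather than a hard 0/1 label when computing precision, recall, and F-measure:

```python
# Rough illustration (not the paper's exact definitions): precision, recall,
# and F-measure where each example contributes its crowd score in [0, 1],
# expressing how clearly the relation holds, instead of a hard 0/1 label.

def weighted_prf(predictions, crowd_scores):
    """predictions: dict mapping example id -> True/False (machine output).
    crowd_scores: dict mapping example id -> float in [0, 1] (crowd signal)."""
    tp = sum(crowd_scores[x] for x, pos in predictions.items() if pos)
    fp = sum(1.0 - crowd_scores[x] for x, pos in predictions.items() if pos)
    fn = sum(crowd_scores[x] for x, pos in predictions.items() if not pos)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical example: three candidate "cause" sentences with crowd scores.
preds = {"s1": True, "s2": True, "s3": False}
scores = {"s1": 0.9, "s2": 0.4, "s3": 0.8}
print(weighted_prf(preds, scores))
```

An ambiguous sentence (score near 0.5) then counts only partially toward true positives, so neither the machine nor the annotators are fully penalized for disagreeing on it.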