• User Centric Data Science: Users, Systems, Data



  • Data Science: A new journal at IOS Press

  • CrowdTruth: gold standard data for cognitive computing systems

 

UCDS lab continues to work on User-Centric Data Science with many new faces.

With Jacco van Ossenbruggen installed as our new group leader, we have expanded the group significantly with new assistant professors, postdoctoral researchers and PhD students. We now participate in multiple ICAI labs, including the Civic AI Lab and the Cultural AI Lab. We also continue to collaborate in national and international research projects such as Interconnect, CLARIAH, Hybrid Intelligence and many others. For more information, visit the People and Project pages or contact us.

 

Can a Transformer Assist in Scientific Writing?

The Semantic Web community has produced a large body of literature that is becoming increasingly difficult to manage, browse, and use. Recent work on attention-based, sequence-to-sequence Transformer neural architectures has produced language models that generate surprisingly convincing synthetic conditional text samples. In this demonstration, we retrain the GPT-2 architecture on the complete corpus of proceedings of the International Semantic Web Conference from 2002 to 2019. We use user-provided sentences to conditionally sample paper snippets, thereby illustrating cases where this model can help address challenges in scientific paper writing, such as navigating extensive literature, explaining core Semantic Web concepts, providing definitions, and even inspiring new research ideas. Links: full paper, demo poster.

 

Machine-to-machine communication in rural conditions: Realizing KasadakaNet

In his Master's project, Fahad Ali investigated Wi-Fi sneakernets for machine-to-machine communication, enabling information sharing between geographically distributed devices. He developed a Raspberry Pi-based device called the Wifi-donkey that can be mounted on a vehicle and exchanges information with nearby devices using the built-in Wi-Fi card of the Raspberry Pi 3. The solution is based on PirateBox, an offline file-sharing and communications system built with free software, and uses off-the-shelf Linux software components and configuration settings to discover and connect to nearby Kasadaka devices over Wi-Fi.
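The store-and-forward idea behind a sneakernet can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the Wifi-donkey implementation: the `Device` and `Mule` classes, the message format, and the device names are all hypothetical, and the real system relies on PirateBox and Linux networking components rather than in-memory lists.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical message format: (destination device name, payload).
Message = Tuple[str, str]


class Device:
    """A stationary device that queues outgoing messages until a mule passes by."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.outbox: List[Message] = []
        self.inbox: List[str] = []


class Mule:
    """A vehicle-mounted carrier that physically moves messages between devices."""

    def __init__(self) -> None:
        # Payloads in transit, grouped by destination device name.
        self.cargo: Dict[str, List[str]] = defaultdict(list)

    def visit(self, device: Device) -> None:
        # Deliver everything addressed to this device.
        device.inbox.extend(self.cargo.pop(device.name, []))
        # Pick up everything the device wants to send elsewhere.
        for destination, payload in device.outbox:
            self.cargo[destination].append(payload)
        device.outbox.clear()


village_a = Device("village-a")
village_b = Device("village-b")
village_a.outbox.append(("village-b", "rainfall report"))

mule = Mule()
mule.visit(village_a)  # picks up the report
mule.visit(village_b)  # delivers it on arrival
```

In this sketch, delivery latency is determined entirely by the route and schedule of the vehicle, which is the defining trade-off of a sneakernet.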

Projects

Interconnect

Interoperable solutions connecting smart homes, buildings and grids


Odissei

The national research infrastructure for the social sciences in the Netherlands.


CLARIAH

Tools, data, standards, workflows and learning resources for academic research in the Humanities and Social Sciences


Civic AI Lab

Advancing Society through Inclusive AI Technology

 

Users, Systems, Data

User Centric Data Science is about how users interact with systems and consume and produce data.

 

Our Team

We are an international and diverse team of researchers.

 

VU Amsterdam

We are part of the Department of Computer Science at VU University Amsterdam.

People

Spotlight Publications

These are some selected publications.
Publication and archival of scientific results is still commonly considered the responsibility of classical publishing companies. Classical forms of publishing, however, which center around printed narrative articles, no longer seem well-suited in the digital age. In particular, there are currently no efficient, reliable, and agreed-upon methods for publishing scientific datasets, which have become increasingly important for science. In this article, we propose to design scientific data publishing as a web-based bottom-up process, without top-down control of central authorities such as publishing companies. Based on a novel combination of existing concepts and technologies, we present a server network to decentrally store and archive data in the form of nanopublications, an RDF-based format to represent scientific data. We show how this approach allows researchers to publish, retrieve, verify, and recombine datasets of nanopublications in a reliable and trustworthy manner, and we argue that this architecture could be used as a low-level data publication layer to serve the Semantic Web in general. Our evaluation of the current network shows that this system is efficient and reliable.
Cognitive computing systems require human-labeled data for evaluation, and often for training. The standard practice for gathering this data minimizes disagreement between annotators, and we have found that this results in data that fails to account for the ambiguity inherent in language. We have proposed the CrowdTruth method for collecting ground truth through crowdsourcing, which reconsiders the role of people in machine learning based on the observation that disagreement between annotators provides a useful signal for phenomena such as ambiguity in the text. We report on using this method to build an annotated data set for medical relation extraction for the cause and treat relations, and on how this data performed in a supervised training experiment. We demonstrate that by modeling ambiguity, labeled data gathered from crowd workers can (1) reach the level of quality of domain experts for this task while reducing the cost, and (2) provide better training data at scale than distant supervision. We further propose and validate new weighted measures for precision, recall, and F-measure that account for ambiguity in both human and machine performance on this task.
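To give a feel for what ambiguity-aware evaluation measures can look like, the sketch below computes precision, recall, and F1 against soft crowd scores instead of binary gold labels. It is an illustrative assumption, not the weighted measures defined in the paper: the function name `weighted_prf`, the interpretation of a crowd score as the fraction of workers who selected a relation, and the example values are all hypothetical.

```python
from typing import Sequence, Tuple


def weighted_prf(
    crowd_scores: Sequence[float], predictions: Sequence[bool]
) -> Tuple[float, float, float]:
    """Soft precision, recall and F1 against crowd scores in [0, 1].

    Assumption: each score is the fraction of crowd workers who said the
    relation holds, so an ambiguous item (score 0.5) counts as half a
    positive and half a negative instead of being forced to one label.
    """
    if len(crowd_scores) != len(predictions):
        raise ValueError("crowd_scores and predictions must have equal length")

    tp = fp = fn = 0.0
    for score, predicted in zip(crowd_scores, predictions):
        if predicted:
            tp += score        # agreement mass behind a positive prediction
            fp += 1.0 - score  # disagreement mass behind a positive prediction
        else:
            fn += score        # agreement mass the system missed

    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Hypothetical example: four sentences, two of them ambiguous.
scores = [1.0, 0.5, 0.0, 0.5]
predicted = [True, True, False, False]
precision, recall, f1 = weighted_prf(scores, predicted)
```

With binary scores (only 0.0 and 1.0) this reduces to the usual precision, recall, and F1, so the soft version can be read as a generalization that penalizes errors on clear cases more than errors on ambiguous ones.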