Category: Blogs

  • Blogpost on FAIR implementation profiles

    Together with Angelica Maineri, UCDS’ own Shuai Wang wrote an article on the ODISSEI website on FAIR Implementation Profiles in the social sciences. A FAIR Implementation Profile (FIP) is a collection of resources, services, and technologies adopted by a community to implement the FAIR principles. In their article “FAIR yes, but how? FAIR Implementation Profiles in the Social Sciences”, Shuai and Angelica describe what FIPs are, why they should be used, and how a FIP can be initiated.

  • What Latour can teach us about AI and its moral implications

    Last week, the renowned French philosopher, sociologist, and anthropologist Bruno Latour (1947-2022) passed away at the age of 75. Latour is considered one of the most influential thinkers of modern science. His Actor-Network Theory (ANT) and mediation theory provide an alternative perspective to the famous subject-object dichotomy, a dominant paradigm in science originating with Kant.

    In view of the current critical ethical issues with AI systems pervading our societies, reviewing Latour’s ANT provides invaluable insights into the human network that can create or mitigate the threats of AI.

    Read more about it in the blog Mirthe Dankloff (PhD Candidate) wrote for the Civic AI Lab.

  • Welcome to the New Era of Scientific Publishing

    — by Tobias Kuhn, 4 April 2022, original blog post

    I believe we have made the first steps venturing into a new era of scientific publishing. Let me explain.

    Science is nowadays communicated in a digital manner through the internet. We essentially have a kind of “scientific knowledge cloud”, where researchers with the help of publishers upload their latest findings in the form of scientific articles, and where everybody who is interested can access and retrieve these findings. (This is in fact only true for articles that are published as Open Access, but that is not the point here.)

    In a sense, this scientific knowledge cloud has been a big success, but it also has serious limitations. Imagine somebody wanting to build an application that feeds on all these findings, for example to help researchers learn about interesting new developments or to summarize scientific consensus for laypeople. Concretely, a developer might face a task like this one:

    Retrieve all genes that have been found to play a role in a part of the respiratory system in Covid-19 patients. Only include results from randomized controlled trials published in the last seven days.

    To a naive developer without experience in how scientific knowledge is communicated, this might sound quite easy. One would just have to find the right API, translate the task description into a query, and possibly do some post-processing and filtering. But everybody who knows a bit about how science is communicated immediately realizes that this will take a much bigger effort.

    Why Text Mining is not the Solution

    The problem is that scientific results are represented and published in plain text, and not in a structured format that software can understand. So, in order to access the scientific findings and their contexts, one has to apply text mining first. Unfortunately, despite all the impressive progress with Deep Learning and related techniques in the past few years, text mining is not good enough, and probably never will be.

    To illustrate the point, we can look at the results of the seventh BioCreative workshop held in November 2021, where world-leading research teams competed in extracting entities and relations from scientific texts. Just to detect the type of a relation between a given drug and a given gene out of 13 given relation types, the best system achieved an F-score of 0.7973.

    But that is just the relation type. To get the full relation out, we first have to know the entity on the left-hand side (subject) and right-hand side (object) of the relation. We can look at a different task of the BioCreative workshop to get a feeling of how well extracting these subjects and objects works. That task focused on extracting chemicals, which is done in a two-stage process: first the entities are recognized in the text, with the best system achieving an F-score of 0.8672, and then the recognized chemicals are linked to the corresponding formal identifiers, with the best F-score being 0.8136.

    A very rough back-of-the-envelope calculation can give us an estimate of the quality of mining such entire relations:
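The calculation itself appears as an image in the original post; as a sketch, assuming the stage F-scores simply multiply (a simplification, since F-scores are not true probabilities), it works out as follows:

```python
# Best BioCreative F-scores for each stage of relation mining:
ner_f = 0.8672       # entity recognition
linking_f = 0.8136   # linking entities to formal identifiers
reltype_f = 0.7973   # classifying the relation type

# Both the subject and the object must each be recognized and linked
# correctly, and then the relation type must be classified correctly.
overall = (ner_f * linking_f) ** 2 * reltype_f
print(round(overall, 2))  # → 0.4
```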

    An overall F-score of 0.40, as resulting from this calculation, roughly means that around 60% of retrieved relations are wrong and 60% of existing relations are not retrieved. This is clearly not even close to good enough for most types of possible applications. And mind you, these numbers come from the best-performing systems when world-leading research groups compete and probably put months of effort into optimizing their systems for the specific problem. Moreover, we are talking here only about the simplest possible kind of relations, of the form subject-relation-object.

    This seems to point to a deeper problem. Text mining is just a work-around, and the real problem is elsewhere. As Barend Mons rhetorically asked, “Why bury it first and then mine it again?” Instead of seeing text mining as the ultimate solution, we should just stop burying our knowledge in the first place.

    In the work I will explain below, we wanted to find out how we can practically publish findings without burying them.

    Representing Scientific Findings in Logic

    As a first step to experiment with such a new way of publishing, we needed to find a general way to represent high-level scientific findings in some sort of formal logic. Even though such findings are arguably the most important outcome of science, there was no prior work on practically mapping (most of) these findings to formal logic across domains. To better understand the logical structure of such high-level scientific findings (e.g. that a gene tends to have a certain effect on the course of a given disease), we selected a random sample of 75 research articles from Semantic Scholar.

    Studying the high-level findings from these random articles, we managed to elicit a widespread logical pattern, which we then turned into the super-pattern model. It consists of five slots to be filled and translates directly to a logical formula. Our paper [1] explains the details; here I give just one example. The finding of the article entitled “P‐glycoprotein expression in rat brain endothelial cells: evidence for regulation by transient oxidative stress” can be expressed with the super-pattern as follows:

    • Context class: rat brain endothelial cell
    • Subject class: transient oxidative stress
    • Qualifier: generally
    • Relation: affects
    • Object class: Pgp expression

    The three class slots take any class identifier (here shown by class name for simplicity), whereas the qualifier and the relation come from closed lists of a few possible values. Informally, the example above means that whenever there is an instance of “transient oxidative stress” in the context of an instance of “rat brain endothelial cell”, then generally (defined as in at least 90% of the cases) it has a relation of type “affects” to an instance of “Pgp expression” in the same context. Formally, it corresponds to a logical formula in a slightly non-standard notation using conditional probability.
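The formula itself is shown as an image in the original post; based on the informal reading above, it is along these lines (notation approximate and mine, not copied from the paper):

\[
\Pr\big(\exists y:\ \mathit{affects}(x,y) \wedge \mathit{inst}(y,\textit{Pgp expression}) \wedge \mathit{ctx}(y)=c \;\big|\; \mathit{inst}(x,\textit{transient oxidative stress}) \wedge \mathit{ctx}(x)=c \wedge \mathit{inst}(c,\textit{rat brain endothelial cell})\big) \;\geq\; 0.9
\]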

    Once we discovered this pattern, we tried to formalize all 75 high-level findings.

    68 of the 75 findings could be represented with the super-pattern. The remaining 7 findings are in fact easier to represent, as they can be captured with a simple subject-relation-object structure. (These latter simpler structures are the ones that existing text-mining approaches struggle with, and it is safe to assume they would perform even worse on the more complicated super-pattern statements.)

    So, it seems that we have found a logical pattern that allows us to represent most high-level scientific findings from different disciplines.
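As a sketch of how the five super-pattern slots could be held in a data structure (field names are mine; the qualifier and relation lists below are only the values appearing in the special issue's table, not necessarily the full closed lists of the model):

```python
from dataclasses import dataclass
from typing import Optional

# Qualifier and relation values observed in the special issue's table.
QUALIFIERS = {"always", "generally", "can generally", "mostly",
              "frequently", "sometimes", "never"}
RELATIONS = {"affects", "inhibits", "enables", "increases",
             "contributes to", "is same as", "is caused by", "co-occurs with"}

@dataclass(frozen=True)
class SuperPattern:
    """One high-level finding, expressed as the five super-pattern slots."""
    subject_class: str
    qualifier: str                        # from a closed list
    relation: str                         # from a closed list
    object_class: str
    context_class: Optional[str] = None   # some findings have no context class

    def __post_init__(self) -> None:
        assert self.qualifier in QUALIFIERS, self.qualifier
        assert self.relation in RELATIONS, self.relation

# The example finding on P-glycoprotein expression from above:
finding = SuperPattern(
    context_class="rat brain endothelial cell",
    subject_class="transient oxidative stress",
    qualifier="generally",
    relation="affects",
    object_class="Pgp expression",
)
```

In the real model the class slots hold formal class identifiers (e.g. ontology IRIs); plain labels are used here only for readability.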

    Stepping into the new Era of Scientific Publishing

    Next, we wanted to make a serious practical step into the new era where findings are machine-interpretable from the start. We designed this as a kind of field study with a special issue of machine-interpretable papers at an existing journal. We wanted these papers to look like regular papers to those who want to look at them in that way, but they should also come with representations in formal logic for anyone or anything that knows how to deal with that. For that special issue, we chose the journal Data Science, of which I am an editor-in-chief.

    We also had to make a practical concession though: While the whole setup could be used to publish novel findings, we restricted ourselves to findings from existing publications. For that, we introduced the concept of a formalization paper whose novel contribution is the formalization of an existing finding. So, authors of a formalization paper take credit for the formalization of the finding, but not for the finding itself.

    To represent these findings and thereby the formalization papers, we used the nanopublication format and the Nanobench tool. Researchers who contributed to this special issue filled in a Nanobench form to express and submit their formalizations.

    Apart from the actual formalization in the assertion part of the nanopublication, this also involved specifying the provenance of that formalization and further metadata. The provenance in this case states that the assertion was derived by a formalization activity taking an existing paper as input. The form provides auto-completing drop-down menus and maps the result to RDF behind the scenes. I won’t go into the details of defining new classes and reviewing (both also done with nanopublications and Nanobench), but you can have a look at the preprint of our paper [2] for that.
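The three-part layout described above (assertion, provenance, publication metadata) can be sketched as a toy structure in plain Python; real nanopublications use RDF named graphs (e.g. in TriG syntax), and the predicate names here are illustrative, not the actual nanopub/PROV vocabulary:

```python
# Hypothetical nanopublication IRI prefix (illustrative only).
NP = "https://example.org/np1#"

nanopub = {
    NP + "assertion": [
        # the formalization itself: the five super-pattern slots
        (NP + "finding", "context-class", "rat brain endothelial cell"),
        (NP + "finding", "subject-class", "transient oxidative stress"),
        (NP + "finding", "qualifier", "generally"),
        (NP + "finding", "relation", "affects"),
        (NP + "finding", "object-class", "Pgp expression"),
    ],
    NP + "provenance": [
        # the assertion was derived by a formalization activity
        # that took an existing paper as its input
        (NP + "assertion", "was-generated-by", NP + "formalization"),
        (NP + "formalization", "used-as-input", "<DOI of the original paper>"),
    ],
    NP + "pubinfo": [
        # metadata about the nanopublication itself
        (NP + "nanopub", "attributed-to", "<ORCID of the formalization authors>"),
    ],
}
```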

    We ended up with 15 formalization papers in our special issue, as summarized by this table:

    | # | Authors | Context class | Subject class | Qualifier | Relation | Object class |
    | 1 | Joslin | early human adipogenesis | regulatory element within the first intron of FTO | generally | affects | expression of genes IRX3 and IRX5 |
    | 2 | Nichols | human motor neuron | TAR DNA binding protein | can generally | contributes to | transcription of stmn2 |
    | 3 | Mietchen | dejellied fertilizable stage VI Xenopus laevis oocyte | strong static magnetic field | generally | affects | cell cortex |
    | 4 | Ehrhart, Evelo | (no context class) | genes associated with CAKUT | sometimes | is same as | targets of vitamin A |
    | 5 | Patrinos | patient undergoing PCI | pharmacogenomics guided clopidogrel therapy | generally | enables | cost-effective treatment |
    | 6 | Martorana | human | smoothened signaling pathway | mostly | affects | astrocyte development |
    | 7 | Mietchen, Penev, Dimitrova | biodiversity data | license with non-commercial clause | generally | inhibits | data reuse |
    | 8 | Dimitrova | release of OpenBiodiv knowledge graph | triple in OpenBiodiv knowledge graph | generally | is same as | semantic triple extracted from biodiversity literature |
    | 9 | Brauer | UNC13A | TAR DNA binding protein | generally | inhibits | inclusion of cryptic exon |
    | 10 | Dumontier | data set | adherence to the FAIR guiding principles | can generally | enables | automated discovery |
    | 11 | Queralt Rosinach | human | NGLY1 deficiency | always | is caused by | dysfunction of ERAD pathway |
    | 12 | Usbeck | social group | relative neocortex size | never | affects | social group size |
    | 13 | Baine | ECM-bound cancer cell | glycocalyx bulk | generally | increases | integrin clustering |
    | 14 | Grouès, Vega Moreno, Satagopam | human | STX1B mutation | frequently | co-occurs with | epilepsy |
    | 15 | de Boer | digital humanities research | usage of Linked Data Scopes | can generally | contributes to | transparency |

    Each row of this table corresponds to a formalization and can be written out as a logic formula. In this form, these papers look quite different from what we are used to. But if you go to the official page for the special issue on the publisher’s website, they look like regular papers. Besides the regular button to download the paper as a PDF, there is also a link that points to the nanopublication representation, thereby connecting the two ways of looking at the paper.

    So, for the first time, software can reliably interpret the main high-level findings of scientific publications. This special issue is just a small first step, but it could prove to be the first step into a new era of scientific publishing. The practical immediate consequences might seem limited, but I think the longer-term potential of making scientific knowledge accessible to interpretation by machines is hard to overstate.

    References

    1. Cristina-Iulia Bucur, Tobias Kuhn, Davide Ceolin, Jacco van Ossenbruggen. Expressing high-level scientific claims with formal semantics. In Proceedings of K-CAP ’21. ACM, 2021.
    2. Cristina-Iulia Bucur, Tobias Kuhn, Davide Ceolin, Jacco van Ossenbruggen. Nanopublication-Based Semantic Publishing and Reviewing: A Field Study with Formalization Papers. arXiv 2203.01608, 2022.

    License

    This text is available under the CC BY 4.0 license. The images were created with Excalidraw, using several of its libraries of visual elements, and are available under the MIT license.

  • Making scholarly communication both human and machine-interpretable

    This blog post was written by Cristina-Iulia Bucur and cross-posted at the IOS Press Labs website.

    We currently disseminate, share, and evaluate scientific findings following publishing paradigms in which the only difference from the methods of more than 300 years ago is the medium: we have moved from printed articles to a digital format, but still use almost the same natural-language narrative of long, coarse-grained texts with complicated structures. These are optimized for human readers, not for automated means of organization and access. Additionally, peer review is the main method of quality assessment, but peer reviews are rarely published and have their own complicated structure, with no accessible links to the respective articles. With the increasing number of articles being published, it is difficult for researchers to stay up to date in their specific fields, unless we find a way to involve machines in this process as well. So, how can we make use of current technologies to change these old paradigms of publishing and make the process more transparent and machine-interpretable?

    In order to address these problems and to better align scientific publishing with the principles of the Web and linked data, my research proposes an approach to use nanopublications – in the form of a fine-grained unifying model – to represent in a semantic way the elements of publications, their assessments, as well as the involved processes, actors, and provenance in general. This research is a result of the collaboration between Vrije Universiteit Amsterdam, IOS Press, and the Netherlands Institute for Sound and Vision for the Linkflows project. The purpose of this project is to make scientific contributions on the Web, e.g. articles, reviews, blog posts, multimedia objects, datasets, individual data entries, annotations, discussions, etc., better valorized and efficiently assessed in a way that allows for their automated interlinking, quality evaluation, and inclusion in scientific workflows. My involvement in the Linkflows project is in collaboration with Tobias Kuhn, Davide Ceolin, Jacco van Ossenbruggen, Stephanie Delbecque, Maarten Fröhlich, and Johan Oomen.

    Semantic publishing

    One concept that first comes to mind as a proposed solution to make scientific publishing machine-interpretable is “semantic publishing.” This is not a new concept, as its roots are tightly coupled to the notion of the Web, with Tim Berners-Lee mentioning that the semantic web “will likely profoundly change the very nature of how scientific knowledge is produced and shared, in ways that we can now barely imagine” [1]. Although semantic publishing is not tied exclusively to the Web, its progress was highly influenced by the rise of the “semantic web.” In the beginning, it referred mainly to publishing information on the Web in the form of documents that additionally contain structured annotations, that is, extra information parsable by machines in the form of semantic markup (with markup languages like RDFa, for example). This allowed published information on the Web to be machine-interpretable, to the limited extent the markup languages allowed. A next step was to use semantic web languages like RDF and OWL to publish information in the form of data objects, together with a specific detailed representation called an ontology that is able to represent the domain of the data in a formal (thus machine-interpretable) way. Information published in such structured ways provides not only a “semantic” context through the metadata that describes the information, but also a way for machines to understand the structure and even the meaning of the published information [2,3].

    In this way, semantic publishing would allow for “automated discovery, enables its linking to semantically related articles, provides access to data within the article in actionable form, or facilitates integration of data between papers” [3]. However, despite all the advancements in semantic web technologies in the past years, semantic publishing is not yet “genuine” [4]: the current scientific publishing paradigm has not changed much, as we are still writing long articles in natural language that do not carry formal semantics from the start that machines could process and interpret in an automated manner. So, with scientific publishing often stuck in formats optimized for print, such as PDF, we are not using the advances available to us through semantic web and linked data technologies.

    Transforming scholarly articles into small, interlinked snippets of data

    In our approach, we argue for a new system of scientific publishing built from smaller, fine-grained, formal (machine-interpretable) representations of scientific knowledge that are linked together in a large, web-like knowledge network, such that these publications need not be tied to a journal or a traditional publication and can be publication entities by themselves [5,6]. Moreover, semantic technologies make it possible to decompose traditional science articles into constituent machine-readable parts that are linked not only with one another, but also to other related fine-grained parts of knowledge on the Web following the linked data principles. So, we designed a model for a more granular and semantic publishing paradigm, which can be used for scientific articles as well as reviews.

    Figure 1: The scientific publishing process at a granular level.

    An example based on this is shown in Figure 1. Here, we see how the scientific publishing process can look at a more finely-grained level, with a network-like representation and with the recording of formal semantics for the different elements. Instead of treating large bulks of text as monolithic units, we propose to represent them as small snippets – such as paragraphs, or a title in this example – that have formal semantics attached and can be treated as independent publication units. They can link to other such units and therefore form a larger entity – such as a full paper or review – by forming a complex network of links. With that approach, we can ensure that the provenance of each snippet of information is accurately tracked together with its creation time and author, and therefore allow for more flexible and more efficient publishing than the current paradigm.
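A minimal sketch of this idea, with paragraphs and titles as independent, linked publication units (all identifiers, field names, and link labels here are hypothetical):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Each snippet is an independent publication unit with its own
# identifier, author, creation time, and typed links to other units.
@dataclass
class Snippet:
    uri: str
    kind: str      # e.g. "title", "paragraph", "review-comment"
    text: str
    author: str
    created: str   # ISO 8601 timestamp
    links: List[Tuple[str, str]] = field(default_factory=list)  # (relation, target URI)

title = Snippet("https://example.org/unit/1", "title",
                "A Study of X", "Alice", "2021-11-25T10:00:00Z",
                links=[("is-part-of", "https://example.org/article/1")])
paragraph = Snippet("https://example.org/unit/2", "paragraph",
                    "We show that ...", "Alice", "2021-11-25T10:05:00Z",
                    links=[("is-part-of", "https://example.org/article/1")])

# The "article" is then just the set of units sharing such links,
# each with its own trackable author and creation time, rather than
# one monolithic document.
units_of_article = [u for u in (title, paragraph)
                    if ("is-part-of", "https://example.org/article/1") in u.links]
```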

    Reimagining the peer-review process

    A process like peer reviewing can then be broken down into small snippets and thereby take the specialization of reviewers and the detailed context of their review comments into account, and these review comments can formally and precisely link to exactly the parts of the paper they address. These small snippets of text can be represented as nodes in a network and can be linked with one another with semantically-annotated connections, thus forming distributed and semantically annotated networks of contributions.

    The individual review comments are semantically modeled with respect to the specific part of the paper that they target – whether they are about syntax or content, whether they raise a positive or negative point, whether they are a suggestion or compulsory, and what their impact on the quality of the paper is – according to our previously proposed model [7]. Each article, paragraph, and review comment thereby forms a single node in a network and is identified by a dereferenceable URI (uniform resource identifier). All this is very different from how scientific communication happens nowadays, with its long, coarse-grained, non-machine-interpretable natural-language texts – that is, large non-semantic documents.
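The dimensions described above lend themselves to a structured representation, which is what makes automated editor-side aggregation straightforward. A minimal sketch (field names and value sets are mine, following the Linkflows model only loosely):

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class ReviewComment:
    target_uri: str   # URI of the paragraph the comment addresses
    aspect: str       # "syntax" or "content"
    positivity: str   # "positive", "negative", or "neutral"
    action: str       # "suggestion" or "compulsory"
    impact: int       # 1 (low) .. 5 (high) effect on paper quality

comments = [
    ReviewComment("https://example.org/unit/2", "content", "negative", "compulsory", 4),
    ReviewComment("https://example.org/unit/2", "syntax", "negative", "suggestion", 1),
    ReviewComment("https://example.org/unit/3", "content", "positive", "suggestion", 2),
]

# Example editor-side aggregation: open compulsory content-level
# points, counted per targeted paragraph.
pending = Counter(c.target_uri for c in comments
                  if c.action == "compulsory" and c.aspect == "content")
```

Because every comment is a structured node linked to the exact paragraph it addresses, queries like this replace manually re-reading free-text review reports.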

    Practical applications for journal editors

    This formal networked approach allows us, among other things, to provide general, user-friendly, and powerful aggregations, for example for journal editors, that can assist them in their meta-reviewing tasks. Shown in Figure 2 are two screenshots from a prototype system [8]. We can see in a more quantitative way the set of review comments and their types, represented in different colors; the checkboxes in the legend can be used to filter the review comments of a given category. To see the content of the review comments in a certain dimension, it is sufficient to click the figure and then select any bar in the chart to discover the related information.

    Figure 2: Linkflows and nanopublications prototype demo.

    The screenshot shown in Figure 3 aggregates all the finer-grained dimensions of the review comments at the level of sections in an article. Again, each cell in the table can be clicked in the prototype; selecting one specific dimension of the review comments shows the comments from each reviewer underneath the table in the interface.

    Figure 3: Finer-grained dimensions of review comments.

    We emphasize that such a detailed view of the reviews and their content is only one of many possibilities when it comes to the graphical representation of articles and their reviews. Having all the information stored in a granular, formally represented network allows for incredible flexibility with regard to the interpretation, visualization, and level of detail that can be shown, all in an automated manner. In a traditional publishing setting, such a level of detail is not yet possible due to the unstructured nature of the current procedures.

    What’s next: Real-case application in the publishing process


    In the future, we would like to extend this fine-grained formal approach to the content of scientific articles, mainly to see if the main scientific claims expressed in them could be represented in a formal way, making scientific knowledge accessible not only to human readers but also to automated systems. Moreover, it would be interesting to apply this new way of fine-grained semantic publishing in a real publishing environment, covering the entire scientific publishing process: from how publications are written, to how they are submitted, to the peer-review process, and even the publication itself.

    Background information


    To find out more about our approach, you can check out the references below, which include a live demo [8].

    References

    1. “Publishing on the semantic web” by Tim Berners-Lee and James Hendler, Nature, Vol. 410, pp. 1023–1024 (2001); link: nature.com/articles/35074206 (last accessed: 25 November 2021).
    2. “Semantic publishing: The coming revolution in scientific journal publishing” by David Shotton, Learned Publishing, Vol. 22, Iss. 2, pp. 85–94. (2009); link: onlinelibrary.wiley.com/doi/abs/10.1087/2009202 (last accessed: 25 November 2021).
    3. “Ceci n’est pas un hamburger: Modelling and representing the scholarly article” by S. Pettifer, P. McDermott, J. Marsh, D. Thorne, A. Villeger, and T.K. Attwood, Learned Publishing, Vol. 24, Iss. 3, pp. 207–220 (2011); link: onlinelibrary.wiley.com/doi/abs/10.1087/20110309 (last accessed: 25 November 2021).
    4. “Genuine semantic publishing” by Tobias Kuhn and Michel Dumontier, Data Science, Vol. 1, Iss. 1/2, pp. 139–154 (2017); link: content.iospress.com/articles/data-science/ds010 (last accessed: 25 November 2021).
    5. “A Unified Nanopublication Model for Effective and User-Friendly Access to the Elements of Scientific Publishing” by Cristina-Iulia Bucur, Tobias Kuhn, and Davide Ceolin, in: Knowledge Engineering and Knowledge Management, C. Maria Keet and Michel Dumontier (Eds), EKAW 2020 Proceedings, Lecture Notes in Computer Science, Vol. 12387 (2020; Springer); link: link.springer.com/chapter/10.1007%2F978-3-030-61244-3_7 (last accessed: 25 November 2021).
    6. “Linkflows: Enabling a web of linked semantic publishing work-flows” by Cristina-Iulia Bucur, in: The Semantic Web: ESWC 2018 Satellite Events, A.Gangemi, et al. (Eds), ESWC 2018, Lecture Notes in Computer Science, Vol. 11155 (2018; Springer); link: link.springer.com/chapter/10.1007%2F978-3-319-98192-5_45 (last accessed: 30 November 2021).
    7. “Peer Reviewing Revisited: Assessing Research with Interlinked Semantic Comments” by Cristina-Iulia Bucur, Tobias Kuhn, and Davide Ceolin, in: K-CAP ’19: Proc. 10th Int. Conf. Knowledge Capture, M. Kejriwal, P. A. Szekely, and R. Troncy (Eds.), pp. 179–187 (2019; ACM); link: dl.acm.org/doi/proceedings/10.1145/3360901 (last accessed: 25 November 2021).
    8. “Linkflows and nanopublications prototype demo” by Cristina-Iulia Bucur, et al. (2020); link: linkflows.nanopubs.lod.labs.vu.nl (last accessed: 25 November 2021).
  • How to account for automated decision systems in the public domain?

    Automated decision-making affects governmental decision-making processes in terms of accountability, explainability, and democratic power. For instance, deciding on acceptable error rates reflects important value judgments that can have far-reaching ethical impacts for citizens.

    Error analysis is an important determinant for the design and deployment choices in algorithms. Public authorities, therefore, need to balance the risks and benefits to protect their citizens by making error analysis transparent and understandable.

    Read more about it in the blog Mirthe Dankloff (PhD Candidate) wrote for the Civic AI Lab (02-09-2021): https://www.civic-ai.nl/post/how-to-account-for-automated-decision-systems-in-the-public-domain