This project is funded by the National Science Foundation under award SES-1834597.

Overview

Changes in the meaning of information as it passes through cyberspace can mislead those who access it. This project will develop a new dataset and algorithms to identify and categorize medical information according to whether it remains true to its original meaning or undergoes distortion. Instead of imposing an external true/false label on this information, the project examines the series of changes within the news coverage itself that gradually lead to a deviation from the original medical claims. Identifying important differences between original medical articles and news stories is a challenging, high-risk, high-reward venture. Broader impacts of this work include benefits to the research community, through novel contributions to understanding temporal changes in natural language information, as well as social benefits in the form of improved informational tools such as question-answering systems. For the medical domain in particular, understanding temporal distortions and deviations from actual medical findings can reduce occurrences of harmful health choices, for instance, by embedding the research outcomes in news, social media, or search engines.

This project will develop a large dataset of medical scientific publications and record how their characteristics change over time across news coverage, by designing and developing discrete time-series representations of entities and their attributes and relations. This dataset will provide the basis for designing and implementing machine learning tasks that exploit stylometric features in natural language, in conjunction with temporal distributions, to identify and categorize such changes. The research will go beyond current approaches, which are limited to true/false classification of individual articles, and will hence be able to identify and analyze information change in narratives, including semantic changes and nuances, or selective emphasis of related information. The research entails unsupervised and semi-supervised machine learning approaches with bootstrapping, and explores two tasks: a binary labeling task to distinguish distorted pieces of information from those that are faithful to the scientific finding, and a multi-label categorization task to learn the type of semantic change occurring over time.
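To make the idea of a discrete time-series representation concrete, the following is a minimal sketch and not the project's actual data model. It assumes that each time step (the original publication, then successive news stories) is reduced to a set of (entity, attribute, value) triples, and it measures deviation from the source with Jaccard distance. The `Snapshot` class, the `drift_series` function, the choice of Jaccard distance, and the toy data are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One claim as reported at a discrete time step (hypothetical schema)."""
    step: int          # 0 = original publication, 1, 2, ... = later news stories
    facts: frozenset   # (entity, attribute, value) triples stated at this step


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Return 1 - |a & b| / |a | b|, or 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def drift_series(snapshots):
    """Distance of every snapshot from the earliest one, ordered by step."""
    ordered = sorted(snapshots, key=lambda s: s.step)
    source = ordered[0].facts
    return [(s.step, jaccard_distance(source, s.facts)) for s in ordered]


# Invented toy data: a study result and two successive news retellings.
paper = Snapshot(0, frozenset({
    ("drug_x", "effect", "reduces risk"),
    ("drug_x", "population", "mice"),
    ("drug_x", "certainty", "preliminary"),
}))
story_1 = Snapshot(1, frozenset({
    ("drug_x", "effect", "reduces risk"),
    ("drug_x", "population", "mice"),
}))
story_2 = Snapshot(2, frozenset({
    ("drug_x", "effect", "cures"),
    ("drug_x", "population", "humans"),
}))

series = drift_series([story_2, paper, story_1])
for step, distance in series:
    print(step, round(distance, 3))
```

In this toy series the first retelling only drops a caveat, while the second changes both the effect and the population, so its distance from the source is larger. A set-overlap measure like this captures omissions and substitutions but not subtler shifts in nuance or emphasis, which is where the stylometric and temporal features described above would come in.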

As a first step in this direction, we have focused on identifying what information is worth verifying, and have developed a hybrid method comprising heuristics and supervised learning to identify such check-worthy information [Zuo, Karakas, and Banerjee; 2018]. Our approach achieved state-of-the-art detection performance, as measured by several metrics. An expanded version of this work was invited to the CLEF 2019 conference [Zuo, Karakas, and Banerjee; 2019].
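As a rough illustration of how heuristics and supervised learning can be combined, the sketch below applies rule-based filters first and then a small learned scorer. It is not the published system: the specific rules (drop questions and very short sentences), the smoothed log-odds scorer in the style of Naive Bayes, the names `passes_heuristics`, `TinyNaiveBayes`, and `is_check_worthy`, and the training sentences are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def passes_heuristics(sentence):
    """Hypothetical rules: reject questions and sentences under 4 tokens."""
    return not sentence.strip().endswith("?") and len(tokenize(sentence)) >= 4


class TinyNaiveBayes:
    """Token log-odds scorer with add-one smoothing (class priors omitted)."""

    def fit(self, sentences, labels):
        self.counts = {True: Counter(), False: Counter()}
        for sentence, label in zip(sentences, labels):
            self.counts[label].update(tokenize(sentence))
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def score(self, sentence):
        """Sum of per-token log-odds; tokens unseen in training are skipped."""
        v = len(self.vocab)
        n_pos = sum(self.counts[True].values())
        n_neg = sum(self.counts[False].values())
        total = 0.0
        for tok in tokenize(sentence):
            if tok not in self.vocab:
                continue
            p_pos = (self.counts[True][tok] + 1) / (n_pos + v)
            p_neg = (self.counts[False][tok] + 1) / (n_neg + v)
            total += math.log(p_pos / p_neg)
        return total


def is_check_worthy(sentence, model):
    """Hybrid decision: heuristics act as a gate before the learned score."""
    return passes_heuristics(sentence) and model.score(sentence) > 0.0


# Invented toy training set; True marks a check-worthy sentence.
train = [
    ("unemployment fell to 4 percent last year", True),
    ("the drug reduced risk by 30 percent", True),
    ("thank you all for coming tonight", False),
    ("i think this is a great country", False),
]
model = TinyNaiveBayes().fit([s for s, _ in train], [y for _, y in train])

print(is_check_worthy("the vaccine reduced infections by 50 percent", model))
print(is_check_worthy("thank you for a great evening tonight", model))
print(is_check_worthy("is this true?", model))
```

Using the heuristics as a gate keeps the learned model from scoring inputs it was never meant to handle. Other combinations, such as feeding the heuristic outputs to the classifier as features, are equally plausible readings of "hybrid".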

Team

This project is being led by Dr. Ritwik Banerjee at the Department of Computer Science, Stony Brook University. The team comprises

Publications

Chaoyuan Zuo, Ayla Karakas, and Ritwik Banerjee.
A Hybrid Recognition System for Check-worthy Claims Using Heuristics and Supervised Learning.
In Working Notes of CLEF 2018 – Conference and Labs of the Evaluation Forum, CLEF 2018 – Vol. 2125. CEUR-WS, 2018.

Chaoyuan Zuo, Ayla Karakas, and Ritwik Banerjee.
To Check or not to Check: Syntax, Semantics, and Context in the Language of Check-worthy Claims.
Invited Paper to "Best of the Labs" Track.
In Crestani et al. (Eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the 10th International Conference of the CLEF Association, CLEF 2019 – LNCS Vol. 11696. Springer, 2019.