Language Understanding for Social Good

My research interest lies in creating intelligent systems based on human language. Currently, I focus in two areas that bring together machine learning and natural language processing (NLP): biomedical knowledge discovery for better healthcare, and language use + society. The way we use language varies a lot, depending on the content (the "what"), the domain, the audience, and of course, the speaker/writer (the "how" and "who"). It also varies based on the intent (the "why"). Understanding these phenomena can help us gain significant understanding of how and why we communicate the way we do, and what we can learn from it. Intelligent systems built using a deeper understanding of human language can be employed for immense social good. The language used in medical research, for instance, is highly specialized as it is meant to be understood by other researchers in that field; but a system capable of understanding it, and extracting useful information from it, can help healthcare practitioners and patients. Quotidian language use, on the other hand, is dictated by various aspects of individual and collective human behavior. An intelligent system can be refined to provide better help to individuals and/or social groups by interpreting language, and inferring such behavioral traits from it. Our research touches these, and many other, aspects of information use and its implications on society. We love these multifaceted problems and enthusiastically work toward developing elegant streamlined solutions with practical and immensely beneficial social applications.

Research Overview

Biomedical Knowledge Discovery

The biomedical domain is teeming with natural language information, but this language is highly specialized. Our research problems here consist of building systems that can (a) understand scientific literature, clinical notes, and other healthcare-related information (which is usually an ad-hoc composition of structured and unstructured data), and (b) make smart inferences from the combined understanding of all this information. Here is a brief overview of what we have achieved in this direction:

Relation Inference from Biomedical Research Literature

Due to the sheer volume of new research taking place, it is nearly impossible for practicing physicians to keep up with the latest findings. This calls for knowledge discovery from research literature in an automated manner. Even though the problem is extremely difficult because of the complex and highly-specialized language, an AI-driven solution could lead to automatically keeping all medical databases updated with state-of-the-art knowledge. Such databases can be combined with clinical notes to improve diagnostics and patient care. Arguably, the most important knowledge in clinical practice is the understanding of the relation between a drug and a disease or symptom. This can be broadly understood in terms of whether a drug is beneficial or harmful for a patient. As part of my doctoral research, I designed a global inference system modeled as an integer linear programming optimization problem based on drug-drug similarities. It was shown that even if the system has no prior knowledge of newly studied drug classes, it can identify these drugs as being potentially beneficial. For example, for type-2 diabetes patients, our system identified the class "sodium glucose co-transporter 2" as a potentially beneficial drug, in spite of no drug from this category being known to the system a priori. The key novelty in this work was that the similarity between medical entities was computed based on the pharmacological actions of these entities [Banerjee 2015].

Precision Healthcare Informatics

A problem in healthcare is that in spite of the recent focus on precision medicine, much of the relevant data is not patient-specific, and thus, corroborating relevant information and discarding the rest remains the manual endeavor of clinicians. This is a rather complex problem, with several aspects to it — laboratory tests, prescription drugs, diet, etc. We developed AI-driven systems that can distill patient-specific information from large amounts of natural language data as well as structured databases. This has led to automatic recommendation of the most relevant laboratory tests for a patient, depending on the precise circumstances [Banerjee et al. 2014], and personalized identification of adverse drug reactions and attribution of patient's symptoms to their drug regimen [Banerjee et al. 2015]. Our proposal to advance this line of research was selected by the Targeted Research Opportunity (TRO) program of the Stony Brook School of Medicine. This internal research grant is an attempt to bring precision healthcare to radiology, by employing statistical NLP to analyze complex phenomena narrated in radiology reports.

Language Use & Society

Language use is inherent in any form of social communication. And with that comes many important challenges, espcially with the advent of social media. A healthy society, one might suppose, indulges in healthy communication. Language use in society usually depends on the user's social context and intent. Understanding these two aspects can shed light on important large-scale social phenomena that affect us all.

Trustworthy Information & Communication

Language is a medium of conveying information, but unfortunately, it is also often used to deceive. We have explored the detection of such language in Online reviews, and discovered that the stylometric aspects of language play an important role in exposing the deceptive intent of writers [Feng, Banerjee, and Choi; 2012a]. We have also carried out experiments on the process of creating such language by investigating the differences in how people type when writing truthful as opposed to deceptive texts, and revealed interesting parallels between typing patterns and speech patterns when people lie [Banerjee et al. 2012]. On a related note, we also investigated stylometric aspects of language to identify the traits of individual writers, and therefore, were able to develop an algorithm to identify the authors even in highly formal writing [Feng, Banerjee, and Choi; 2012b]. A similar effort was also directed toward identifying authorship changes in multi-author documents [Zuo, Zhao, and Banerjee; 2019], but using neural networks instead of explicit stylometric features.

The initial phase of research on deceptive information was conducted on Online reviews about hotels and restaurants. A far more threatening social problem, however, is healthcare-related misinformation. This is a major area of our current research, and we are excited to carry out this project with support from the National Science Foundation (NSF). As a first step in this direction, we worked on identifying what information is worth checking for credibility [Zuo, Karakas, and Banerjee; 2018, Zuo, Karakas, and Banerjee; 2019]. Please feel free to take a look at our page on tracking semantic change in medical information.

A rather different problem pertaining to securing trustworthy communicaton is to maintain the ability to communicate — which is susceptible to adverserial attacks, especially in today's Internet-driven society. The technical strategies required to thwart such attacks is a not our area of research. However, we observed that a lot of information about such incidents is usually available as human conversations. Our collaboration with researchers in computer networks was able to identify and analyze the nature of such breakdown in cyber communication [Banerjee et al. 2015].