|
Chaoyuan Zuo, Narayan Acharya, and Ritwik Banerjee.
Querying Across Genres to Retrieve Research that Supports Medical Claims made in News
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (to appear). ACL, 2020.
@inproceedings{zuo2020querying,
author={Zuo, Chaoyuan and Acharya, Narayan and Banerjee, Ritwik},
title={{Querying Across Genres to Retrieve Research that Supports Medical Claims made in News}},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing},
publisher = {Association for Computational Linguistics},
year = {2020}
}
|
|
Chaoyuan Zuo, Yu Zhao, and Ritwik Banerjee.
Style Change Detection with Feed-forward Neural Networks
In Working Notes of CLEF 2019 – Conference and Labs of the Evaluation Forum, CLEF 2019 – Vol. 2380.
CEUR-WS, 2019.
@inproceedings{zuo2019style,
author={Zuo, Chaoyuan and Zhao, Yu and Banerjee, Ritwik},
title={{Style Change Detection with Feed-forward Neural Networks}},
booktitle = {CLEF 2019 Working Notes. Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum},
series = {{CEUR} Workshop Proceedings},
publisher = {CEUR-WS.org},
editor = {Cappellato, Linda and Ferro, Nicola and Losada, David E. and Müller, Henning},
address = {Lugano, Switzerland},
month = {September},
year = {2019}
}
Most previous authorship attribution studies focus on datasets of documents (or parts of documents) with labeled authorship. This scenario, however, does not apply to documents written by more than one author. Detecting authorship switches within multi-author documents has proven to be a challenging task in previous PAN tasks. A simplified version of the style change task was therefore organized at PAN 2019, which aims at identifying the number of authors in a given document. To this end, we present a system consisting of two modules: one for distinguishing single-author documents from multi-author documents, and the other for determining the exact number of authors in the multi-author documents.
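As an illustration only, here is a minimal sketch of such a two-module pipeline using scikit-learn feed-forward networks (MLPs); the random features and author counts below are toy stand-ins, not the paper's actual document representations:

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 10))            # toy document feature vectors
y = rng.integers(1, 5, size=200)     # toy author counts (1..4)

# Module 1: single-author vs. multi-author.
m1 = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000).fit(X, y == 1)
# Module 2: exact author count, trained on multi-author documents only.
m2 = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000).fit(X[y > 1], y[y > 1])

def predict_author_count(x):
    # Route through module 1 first; defer to module 2 for multi-author documents.
    return 1 if m1.predict([x])[0] else int(m2.predict([x])[0])

print(predict_author_count(X[0]))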
|
|
Chaoyuan Zuo, Ayla Karakas, and Ritwik Banerjee.
To Check or not to Check: Syntax, Semantics, and Context in the Language of Check-worthy Claims
Invited Paper in the "Best of the Labs" Track
In Crestani et al. (Eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the 10th International Conference of the CLEF Association, CLEF 2019 – LNCS Vol. 11696. Springer, 2019.
@inproceedings{zuo2019check,
author = {Zuo, Chaoyuan and Karakas, Ayla and Banerjee, Ritwik},
title = {{To Check or not to Check: Syntax, Semantics, and Context in the Language of Check-worthy Claims}},
editor = {Crestani, Fabio and Braschler, Martin and Savoy, Jacques and Rauber, Andreas and Müller, Henning
and Losada, David E. and Bürki, Gundula H. and Cappellato, Linda and Ferro, Nicola},
booktitle = {Experimental IR Meets Multilinguality, Multimodality, and Interaction – Proceedings of the
10th International Conference of the CLEF Association},
publisher = {Springer International Publishing},
year = {2019},
series = {Lecture Notes in Computer Science},
volume = {11696}
}
As the spread of information has received a tremendous boost from the pervasive use of social media, so has the spread of misinformation. The sheer volume of data has rendered traditional expert-driven manual fact-checking largely infeasible. As a result, computational linguistics and data-driven algorithms have been explored in recent years. Despite this progress, identifying and prioritizing what needs to be checked has received little attention. Given that expert-driven manual intervention is likely to remain an important component of fact-checking, especially in specific domains (e.g., politics, environmental science), this identification and prioritization is critical. A successful algorithmic ranking of "check-worthy" claims can help an expert-in-the-loop fact-checking system, thereby reducing the expert's workload while still tackling the most salient bits of misinformation. In this work, we explore how linguistic syntax, semantics, and the contextual meaning of words play a role in determining the check-worthiness of claims. Our preliminary experiments used explicit stylometric features and simple word embeddings on the English-language dataset in the Check-worthiness task of the CLEF-2018 Fact-Checking Lab, where our primary solution outperformed the other systems in terms of mean average precision, R-precision, reciprocal rank, and precision at k for multiple values of k. Here, we present an extension of this approach with more sophisticated word embeddings and report further improvements in this task.
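As a quick reference for the ranking metrics named above, here is a minimal Python sketch of precision at k and average precision; the relevance labels are a toy example, not CLEF data:

from typing import List

def precision_at_k(relevance: List[int], k: int) -> float:
    # Fraction of the top-k ranked claims that are relevant (1) vs. not (0).
    return sum(relevance[:k]) / k

def average_precision(relevance: List[int]) -> float:
    # Mean of precision@k over every rank k holding a relevant claim;
    # averaging this over queries yields mean average precision (MAP).
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

ranking = [1, 0, 1, 1, 0, 0]          # toy check-worthiness labels by rank
print(precision_at_k(ranking, 3))     # 0.666...
print(average_precision(ranking))     # (1/1 + 2/3 + 3/4) / 3 ≈ 0.806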
|
|
Chaoyuan Zuo, Ayla Karakas, and Ritwik Banerjee.
A Hybrid Recognition System for Check-worthy Claims Using Heuristics and Supervised Learning
In Working Notes of CLEF 2018 – Conference and Labs of the Evaluation Forum, CLEF 2018 – Vol. 2125.
CEUR-WS, 2018.
@inproceedings{zuo2018hybrid,
author={Zuo, Chaoyuan and Karakas, Ayla and Banerjee, Ritwik},
title={{A Hybrid Recognition System for Check-worthy Claims Using Heuristics and Supervised Learning}},
booktitle = {CLEF 2018 Working Notes. Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum},
series = {{CEUR} Workshop Proceedings},
publisher = {CEUR-WS.org},
editor = {Cappellato, Linda and Ferro, Nicola and Nie, Jian-Yun and Soulier, Laure},
address = {Avignon, France},
month = {September},
year = {2018}
}
In recent years, the speed at which information disseminates has received an alarming boost from the pervasive usage of social media. To the detriment of political and social stability, this has also made it easier to quickly spread false claims. Due to the sheer volume of information, manual fact-checking seems infeasible, and as a result, computational approaches have recently been explored for automated fact-checking. In spite of the recent advancements in this direction, the critical step of recognizing and prioritizing statements worth fact-checking has received little attention. In this paper, we propose a hybrid approach that combines simple heuristics with supervised machine learning to identify claims made in political debates and speeches, and provide a mechanism to rank them in terms of their "check-worthiness". The viability of our method is demonstrated by evaluations on the English-language dataset of the Check-worthiness task of the CLEF-2018 Fact Checking Lab.
|
|
Ian R. Whiteside, I. V. Ramakrishnan, Ritwik Banerjee, Vasudev Balasubramanian, Basava R. Kanaparthi, and Matthew A. Barish.
Development and Evaluation of Natural Language Processing Software to Produce a Summary of Inpatient Radiographic Findings Identified for Follow-Up
In Annual Meeting of the Radiological Society of North America (abstract). RSNA, 2016.
|
|
Ritwik Banerjee, I. V. Ramakrishnan, Mark C. Henry, and Matthew Perciavalle.
Patient-centered Identification, Attribution, and Ranking of Adverse Drug Events
Best Technical Paper Award
In 2015 IEEE International Conference on Healthcare Informatics, pp. 18 – 27. IEEE, 2015.
@inproceedings{banerjee2015_ichi,
author={Banerjee, Ritwik and Ramakrishnan, I. V. and Henry, Mark and Perciavalle, Matthew},
booktitle={{2015 International Conference on Healthcare Informatics (ICHI)}},
title={{Patient Centered Identification, Attribution, and Ranking of Adverse Drug Events}},
year={2015},
pages={18--27},
month={Oct},
organization={IEEE}
}
Adverse drug events (ADEs) trigger a high number of hospital emergency room (ER) visits. Information about ADEs is often available in online drug databases in the form of narrative texts, and serves as the physician's primary reference point for ADE attribution and diagnosis. Manually reviewing these narratives, however, is an error-prone and time-consuming process, especially due to the prevalence of polypharmacy. Given the heavy volume of traffic in ERs, healthcare providers often either skip this step or, at best, perform it perfunctorily. This causes ADEs to be missed or misdiagnosed, often leading to extensive and unnecessary testing and treatment, including hospitalization. In this paper, we present a system that automates the detection of ADEs and provides a list of suspect drugs, ranked by their likelihood of causing the patient's complaints and symptoms. The input data, i.e., medications and complaints, are obtained from triage notes that often contain descriptive language. Our application utilizes heterogeneous information sources (including drug databases) to refine and transform these descriptions, as well as the online database narratives, using a natural language processing (NLP) pipeline. We then employ ranking measures to establish correspondence between the complaints and the medications. Our preliminary evaluation based on actual ER cases demonstrates that this system achieves high precision and recall.
|
|
Ritwik Banerjee, Abbas Razaghpanah, Luis Chiang, Akassh Mishra, Vyas Sekar, Yejin Choi, and Phillipa Gill.
Internet Outages, the Eyewitness Accounts: Analysis of the Outages Mailing List
In Proceedings of the 16th International Conference on Passive and Active
Network Measurement, PAM 2015 – Vol. 8995, pp. 206 – 217. Springer, 2015.
@inproceedings{banerjee2015pam,
author={Banerjee, Ritwik and Razaghpanah, Abbas and Chiang, Luis and Mishra, Akassh and Sekar, Vyas and Choi, Yejin and Gill, Phillipa},
title={{Internet Outages, the Eyewitness Accounts: Analysis of the Outages Mailing List}},
booktitle={{Passive and Active Measurement: 16th International Conference}},
year={2015},
volume={8995},
pages={206--217},
organization={Springer}
}
Understanding network reliability and outages is critical to the "health" of the Internet infrastructure. Unfortunately, our ability to analyze Internet outages has been hampered by the lack of access to public information from key players. In this paper, we leverage a somewhat unconventional dataset to analyze Internet reliability – the outages mailing list. The mailing list is an avenue for network operators to share information and insights about widespread outages. Using this unique dataset, we perform a first-of-its-kind longitudinal analysis of Internet outages from 2006 to 2013 using text mining and natural language processing techniques. We observe several interesting aspects of Internet outages: a large number of application and mobility issues that impact users, a rise in content and mobile issues, and discussions of large-scale DDoS attacks in recent years.
|
|
Ritwik Banerjee.
Knowledge Extraction from Diverse Biomedical Corpora with Applications in Healthcare: Bridging the Translational Gap
PhD Dissertation. Stony Brook University, 2015.
@phdthesis{banerjee2015_dissertation,
title={{Knowledge Extraction from Diverse Biomedical Corpora with Applications in Healthcare: Bridging the Translational Gap}},
author={Banerjee, Ritwik},
year={2015},
school={Stony Brook University},
address={Stony Brook, NY},
month={December},
}
A wealth of knowledge in the biomedical domain is available in unstructured or semi-structured data repositories as natural language narratives. Much of this knowledge can provide immediate and tangible benefits in patient welfare and the healthcare industry. Extracting relevant knowledge from these natural language sources and providing it as structured information suitable for immediate real-time consumption in clinical settings is, however, a manual process restricted to human domain experts. As a result, it is expensive and time-consuming. A very real consequence of this is that the journey made by medical "knowledge nuggets" from research publications to patient-care settings like hospitals often takes several years. Even so, the knowledge is still presented to clinicians in natural language – unsuitable for machine consumption, and an impediment to the pace of work often demanded of clinicians (e.g., in emergency rooms).

Automatic extraction of this knowledge is a challenging task. Biomedical research literature is replete with language constructs that are highly specific not just to the domain, but to its internal sub-domains. The linguistic semantics used in discussions of, say, diabetes are very different from the semantics used to discuss diseases like malaria that are caused by external agents. Moreover, being research literature, authors typically write for readers with a fair amount of encyclopaedic domain knowledge. Consequently, important information can often be gleaned only by identifying causal relations that are implicit. Standard information extraction methods that depend on identifying causality in text usually require explicit discourse connectives like "because", "since", etc. Additionally, they manage to extract only those relations that are expressed within the span of a single sentence.

This dissertation presents a novel methodology to learn relations from biomedical natural language that is able to infer relations where (a) the relation is implicit, and (b) the related entities do not co-occur within the span of a single sentence. We show that our technique outperforms a sentence-level supervised classification approach. Further, as a human-in-the-loop (HITL) model, it is capable of augmenting biomedical knowledge bases quickly and accurately. Finally, we contribute two novel applications that demonstrate the use of such relational knowledge in providing real-time clinical decision support.
|
|
Ritwik Banerjee, Song Feng, Jun S. Kang, and Yejin Choi.
Keystroke Patterns as Prosody in Digital Writings: A Case Study with Deceptive Reviews and Essays
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1469 – 1473. ACL, 2014.
@inproceedings{banerjee2014_emnlp,
author={Banerjee, Ritwik and Feng, Song and Kang, Jun Seok and Choi, Yejin},
title={{Keystroke Patterns as Prosody in Digital Writings: A Case Study with Deceptive Reviews and Essays}},
booktitle={{Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing}},
year={2014},
pages={1469--1473},
organization={Association for Computational Linguistics}
}
In this paper, we explore the use of keyboard strokes as a means to access the real-time writing process of online authors, analogously to prosody in speech analysis, in the context of deception detection. We show that differences in keystroke patterns, such as editing maneuvers and the duration of pauses, can help distinguish between truthful and deceptive writing. Empirical results show that incorporating keystroke-based features leads to improved performance in deception detection in two different domains: online reviews and essays.
The dataset contains truthful and deceptive writings from two domains: business reviews, and essays on two topics of social interest – 'gun control' and 'gay marriage'. The data is available for download as compressed tar.bz2 files:
— Restaurant Reviews
— Gun Control
— Gay Marriage
The uncompressed dataset consists of files with tab-separated values. The key log data is found in the last column, titled ReviewMeta. This field has a list of KeyUp, KeyDown and MouseUp event logs. Note that the first event timestamp is not always zero. The event logs have the following formats:
[timestamp] KeyUp/KeyDown [javascript keycode]
[timestamp] MouseUp [begin-index] [end-index]
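A minimal sketch of parsing these event records in Python; the whitespace-separated field layout follows the formats above, but how consecutive events are delimited inside the ReviewMeta column is an assumption to verify against the downloaded files:

from typing import NamedTuple, Optional

class Event(NamedTuple):
    timestamp: int          # event timestamp; the first one is not always zero
    kind: str               # "KeyUp", "KeyDown", or "MouseUp"
    keycode: Optional[int]  # JavaScript keycode, for key events
    begin: Optional[int]    # selection indices, for MouseUp events
    end: Optional[int]

def parse_event(record: str) -> Event:
    fields = record.split()
    ts, kind = int(fields[0]), fields[1]
    if kind in ("KeyUp", "KeyDown"):
        return Event(ts, kind, int(fields[2]), None, None)
    if kind == "MouseUp":
        return Event(ts, kind, None, int(fields[2]), int(fields[3]))
    raise ValueError(f"unrecognized event: {record!r}")

print(parse_event("1832 KeyDown 72"))    # keycode 72 is the 'H' key
print(parse_event("5120 MouseUp 4 17"))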
|
|
Ritwik Banerjee, Yejin Choi, Gaurav Piyush, Ameya Naik, and I. V. Ramakrishnan.
Automated Suggestion of Laboratory Tests for Identifying Likelihood of Adverse Drug Events
In 2014 IEEE International Conference on Healthcare Informatics, pp. 170 – 175. IEEE, 2014.
@inproceedings{banerjee2014_ichi,
author={Banerjee, Ritwik and Choi, Yejin and Piyush, Gaurav and Naik, Ameya and Ramakrishnan, I. V.},
title={{Automated Suggestion of Laboratory Tests for Identifying Likelihood of Adverse Drug Events}},
booktitle={{2014 International Conference on Healthcare Informatics}},
year={2014},
month={September},
pages={170--175},
organization={IEEE}
}
Adverse drug events (ADEs) caused by the use, misuse, or sudden discontinuation of medications trigger hospital emergency room visits. Information about a wide range of drugs and associated ADEs is provided in online drug databases in the form of narrative texts. Even though some ADEs can be detected by observable symptoms, several others can only be confirmed by laboratory tests. In this paper, we present a system that provides automated suggestions of tests to identify the likelihood of ADEs. Given a patient's medications and an optional list of signs and symptoms, our system automatically produces the laboratory tests needed to confirm possible ADEs associated with these drugs. The basis of our application is to map clinical symptoms to medical problems and laboratory tests. Toward that end, we use template-based extraction and shallow parsing techniques from natural language processing to extract information from narrative texts. We employ relevance ranking measures to establish correspondence between the tests and ADEs. Our evaluation, based on a sample set of 40 drugs, shows that this system achieves relatively high sensitivity.
|
|
Song Feng, Ritwik Banerjee, and Yejin Choi.
Characterizing Stylistic Elements in Syntactic Structure
In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1522 – 1533. ACL, 2012.
@inproceedings{feng2012_emnlp,
title={{Characterizing Stylistic Elements in Syntactic Structure}},
author={Feng, Song and Banerjee, Ritwik and Choi, Yejin},
booktitle={{Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning}},
pages={1522--1533},
year={2012},
organization={Association for Computational Linguistics}
}
Many of the writing styles recognized in rhetorical and composition theories involve deep syntactic elements. However, most previous research in computational stylometric analysis has relied on shallow lexico-syntactic patterns. Some very recent work has shown that PCFG models can detect distributional differences in syntactic styles, but without offering much insight into exactly what constitutes a salient stylistic element in the sentence structure characterizing each authorship. In this paper, we present a comprehensive exploration of syntactic elements in writing styles, with particular emphasis on interpretable characterization of stylistic elements. We present analytic insights with respect to the authorship attribution task in two different domains.
|
|
Song Feng, Ritwik Banerjee, and Yejin Choi.
Syntactic Stylometry for Deception Detection
In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers – Vol. 2, pp. 171 – 175. ACL, 2012.
@inproceedings{feng2012_acl,
title={{Syntactic Stylometry for Deception Detection}},
author={Feng, Song and Banerjee, Ritwik and Choi, Yejin},
booktitle={Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2},
pages={171--175},
year={2012},
organization={Association for Computational Linguistics}
}
Most previous studies in computerized deception detection have relied only on shallow lexico-syntactic patterns. This paper investigates syntactic stylometry for deception detection, adding a somewhat unconventional angle to prior literature. Over four different datasets spanning the product review and essay domains, we demonstrate that features derived from Context-Free Grammar (CFG) parse trees consistently improve detection performance over several baselines based only on shallow lexico-syntactic features. Our results improve on the best published result on the hotel review data (Ott et al., 2011), reaching 91.2% accuracy – a 14% error reduction.
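To make the feature design concrete, here is a minimal sketch of extracting unlexicalized CFG production-rule features from a parse tree with NLTK; the bracketed parse is hand-written for illustration, whereas the paper obtains parses from a statistical parser:

from collections import Counter
from nltk import Tree

# A toy constituency parse of a review-like sentence.
parse = Tree.fromstring(
    "(S (NP (PRP We)) (VP (VBD stayed) (PP (IN at) (NP (DT the) (NN hotel)))))"
)

# Each internal rewrite rule (e.g., "PP -> IN NP") becomes a countable feature;
# lexical rules such as "NN -> 'hotel'" are excluded here.
features = Counter(
    str(prod) for prod in parse.productions() if prod.is_nonlexical()
)
print(features)

Counting such rules per document yields a feature vector that any standard classifier can consume alongside shallow lexical features.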
|