Jun Seok Kang
I am a PhD candidate in the Department of Computer Science at Stony Brook University.

The main focus of my research is social media analytics. To analyze the subtle, nuanced sentiment of words that are prevalent in social media, I work on constructing a connotation lexicon by formulating a graph model of words over large-scale data. I have also worked on deception detection by analyzing the characteristics of keystroke patterns. The key insight is that the style of writing, together with keystroke patterns, can help detect the intent of authors, even when that intent is deceptive. In addition, I have worked on predicting restaurant hygiene inspection results from online reviews, in the hope of helping local governments dispatch their inspectors more efficiently; this project is an example of applying NLP techniques for social good.
I am currently working on neural language generation, focusing on controlling the style and content of the generated text.

My advisers are Yejin Choi and Niranjan Balasubramanian.

- Email:
- CV
Post-Modifier Generation

A post-modifier is a short phrase that comes after an entity in a sentence to describe the entity in detail. Post-modifiers appear frequently in news articles. For example, in the sentence below, "the MIT professor and antiwar activist" is the post-modifier of "Noam Chomsky".

Noam Chomsky, the MIT professor and antiwar activist, said Dr. Melman helped mobilize what once was weak and scattered resistance to war and other military operations.

We formulate the post-modifier generation task as a data-to-text generation problem, where the data consists of the context (a sentence missing its post-modifier) and a set of known facts about the target entity. The text to be generated is a post-modifier that is relevant to the rest of the information conveyed in the sentence.
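Schematically, an input/output pair for this task might look like the following; the field names and fact schema here are illustrative choices for this sketch, not the dataset's actual format.

```python
# One hypothetical input/output pair for post-modifier generation.
# Input: a sentence with the post-modifier slot blanked out, the target
# entity, and a set of known facts about that entity.
task_input = {
    "context": "Noam Chomsky, ___, said Dr. Melman helped mobilize "
               "what once was weak and scattered resistance to war.",
    "entity": "Noam Chomsky",
    "facts": [
        ("occupation", "linguist"),
        ("employer", "Massachusetts Institute of Technology"),
        ("movement", "anti-war"),
    ],
}

# Output: a post-modifier consistent with both the context and the facts.
target_output = "the MIT professor and antiwar activist"
```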

[Paper@NAACL 2019 (arXiv version)] [Project Page]

Learning Connotation of Words

A considerable amount of work on sentiment analysis has focused on learning the explicit sentiment of words and documents. To capture subtler shades of sentiment, this project aims to learn the subtle, nuanced connotation of words, even that of seemingly objective words such as "intelligence", "human", and "cheesecake".

We construct a graph of words encoding diverse linguistic insights (such as semantic prosody, distributional similarity, and semantic parallelism of coordination) and induce a connotation lexicon from it.
[Paper@ACL 2013] [Project Page]

We enhance our graph of words by adding word senses, which increases its coverage and encodes lexical relations as additional information. From this extended graph, we construct a connotation lexicon, ConnotationWordNet, using loopy belief propagation as the lexicon induction algorithm.
[Paper@ACL 2014] [Project Page]
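As a rough illustration of the induction step, here is a minimal sketch of loopy belief propagation over a tiny toy word graph; the words, edge weights, and seed priors are all hypothetical, and the model in the paper is far richer than this.

```python
from collections import defaultdict

LABELS = ("positive", "negative")

# Hypothetical toy graph: edges connect words expected to share
# connotation (e.g. via coordination or distributional similarity),
# weighted by how strongly they should agree.
edge_weights = {
    ("enjoy", "cheesecake"): 0.9,
    ("cheesecake", "dessert"): 0.8,
    ("suffer", "loss"): 0.9,
}

neighbors = defaultdict(list)
for (u, v), w in edge_weights.items():
    neighbors[u].append((v, w))
    neighbors[v].append((u, w))
nodes = list(neighbors)

# Seed polarities for a few connotation-bearing verbs; everything else
# starts uniform.
priors = {n: {"positive": 0.5, "negative": 0.5} for n in nodes}
priors["enjoy"] = {"positive": 0.9, "negative": 0.1}
priors["suffer"] = {"positive": 0.1, "negative": 0.9}

def compatibility(lu, lv, w):
    """Edge potential: endpoints prefer to agree, in proportion to w."""
    return w if lu == lv else 1.0 - w

# Initialize all directed messages to uniform.
messages = {(u, v): {l: 0.5 for l in LABELS}
            for u in nodes for v, _ in neighbors[u]}

for _ in range(20):  # message-passing sweeps
    updated = {}
    for u in nodes:
        for v, w in neighbors[u]:
            msg = {}
            for lv in LABELS:
                total = 0.0
                for lu in LABELS:
                    incoming = priors[u][lu]
                    for n, _ in neighbors[u]:
                        if n != v:  # exclude the message we are sending to
                            incoming *= messages[(n, u)][lu]
                    total += compatibility(lu, lv, w) * incoming
                msg[lv] = total
            z = sum(msg.values())
            updated[(u, v)] = {l: m / z for l, m in msg.items()}
    messages = updated

def belief(word):
    """Approximate marginal connotation distribution for a word."""
    b = {}
    for l in LABELS:
        p = priors[word][l]
        for n, _ in neighbors[word]:
            p *= messages[(n, word)][l]
        b[l] = p
    z = sum(b.values())
    return {l: p / z for l, p in b.items()}

print(belief("cheesecake"))  # skewed positive via the seed word "enjoy"
```

With no prior of its own, "cheesecake" picks up a positive connotation purely from its graph neighborhood, which is the intuition behind lexicon induction on this kind of graph.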

Predicting Hygiene Status of Restaurants using Online Reviews

Many counties and cities, such as NYC and LA, require restaurants to post their inspection grades, which helps people decide where to eat.

However, health departments often have limited resources for dispatching inspectors. (Among the Seattle restaurants listed on Yelp.com between 2006 and 2013, more than 50% had no inspection record!)

In this project, we predict the hygiene status of restaurants from online reviews, in the hope of helping local governments dispatch their inspectors more efficiently.
[Paper@EMNLP 2013] [Project Page]
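To give the general flavor of predicting an inspection outcome from review text (not the features or model used in the paper), here is a toy Naive Bayes sketch over made-up labeled reviews.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training reviews; label 1 = failed inspection.
train = [
    ("the food was fresh and the place spotless", 0),
    ("clean tables and friendly staff", 0),
    ("saw a roach near the dirty kitchen", 1),
    ("gross smell and sticky dirty floors", 1),
]

class_counts = Counter()
word_counts = defaultdict(Counter)
vocab = set()
for text, y in train:
    class_counts[y] += 1
    for w in text.split():
        word_counts[y][w] += 1
        vocab.add(w)

def log_posterior(text, y):
    # log P(y) + sum of log P(word | y), with add-one smoothing
    lp = math.log(class_counts[y] / len(train))
    total = sum(word_counts[y].values())
    for w in text.split():
        lp += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
    return lp

def predict(text):
    return max((0, 1), key=lambda y: log_posterior(text, y))

print(predict("the kitchen looked dirty"))  # 1 under this toy model
```

A real system would of course use far richer review features and much more data, but the pipeline shape (text in, hygiene label out) is the same.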

* This project is a collaboration with Mike Luca at Harvard Business School.

Detecting Deception Using Keystroke Patterns

Many studies on deception detection focus on insights that can be drawn from existing, already-written texts.

In this project, however, we turn our attention to the keystroke patterns of truthful and deceptive writers and explore ways to use them to detect the writers' deceptive intent.
[Paper@EMNLP 2014] [Project Page]
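As an illustration of the kind of timing signal a keystroke log provides, here is a small sketch that computes dwell times (how long a key is held), flight times (gaps between presses), and long pauses; the event format, threshold, and feature names are assumptions made for this example, not the paper's actual feature set.

```python
# Hypothetical keystroke log: (key, action, time in milliseconds).
events = [
    ("t", "down", 0),   ("t", "up", 80),
    ("h", "down", 150), ("h", "up", 220),
    ("e", "down", 900), ("e", "up", 960),  # long pause before "e"
]

def keystroke_features(events, pause_threshold_ms=500):
    """Summarize a key-event stream as simple timing features."""
    down_time = {}
    dwells, presses = [], []
    for key, action, t in events:
        if action == "down":
            down_time[key] = t
            presses.append(t)
        elif action == "up" and key in down_time:
            dwells.append(t - down_time.pop(key))
    flights = [b - a for a, b in zip(presses, presses[1:])]
    return {
        "mean_dwell_ms": sum(dwells) / len(dwells),
        "mean_flight_ms": sum(flights) / len(flights),
        "long_pauses": sum(f > pause_threshold_ms for f in flights),
    }

print(keystroke_features(events))
```

Features of this flavor, combined with the text itself, are the kind of signal such a study can feed into a classifier.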

Learning Semantic Composition Patterns of Sentences; Revising Sentences

In this ongoing project, we first learn semantic composition patterns of sentences from documents in the same domain using mixed-integer programming, and then automatically revise sentences using the learned semantic patterns and their statistics.