Youngseo Son

I earned my PhD in Computer Science, with a main focus on Natural Language Processing (NLP) and Machine Learning (ML), from Stony Brook University in December 2020.
I am very grateful to have had H. Andrew Schwartz as my advisor.
My key research focus is Natural Language Processing (NLP) for social media analysis, language modeling, information extraction, and data analysis. I collaborate with psychologists and computational linguists on human-centered language modeling to improve accuracy on NLP tasks ranging from traditional ones (e.g., sentiment analysis) to novel ones such as discourse style analysis for psychological assessment and well-being measurement. I especially focus on discourse relation parsing to extract key information for targeted tasks, such as the opinions or reasons behind the sentiment of a review or a political stance, and on finding correlations between discourse styles and human variables such as personality.



Research Scientist

  • Designed comment quality metrics and developed NLP models to recommend relevant comments and encourage high-quality conversations (e.g., relevance, sentiment, and tone) (link).
  • Fine-tuned and deployed AI integrity models to mitigate hate speech on the platform (e.g., sexual, harassment and bullying) (link).
  • Conducted data science experiments and statistical analysis to develop ML models using state-of-the-art model architectures and optimize them for comment ranking across different products (e.g., Facebook public posts, Reels, live videos) (link).
FEB 2021 - JAN 2023

ML PhD Software Engineer Intern

  • I am very grateful for the experience of working with Sarang Metkar as my intern manager on the Search Integrity team.
  • Developed graph-based ML and NLP models to improve search results.
  • Conducted data analysis and data mining to improve search models.
Summer 2020

Data Scientist PhD Intern

Data Sciences and Analytics Group at Pacific Northwest National Laboratory
  • I am very grateful for the experience working with Svitlana Volkova as my PI and Maria Glenski as my mentor.
  • DARPA SocialSim Project
  • Developed SocialSim modules to analyze information / graph evolution and the cross-platform spread of misinformation / disinformation on social media (e.g., Twitter, Reddit, GitHub)
  • Collaborated with Prasha Shrestha to detect coordinated efforts and to analyze trends and spread mechanisms of cryptocurrencies on social media
Summer 2019

Research Scientist Intern

World Well-Being Project (WWBP), the University of Pennsylvania
  • Developed a discourse relation parser for social media to capture counterfactual thinking from tweets.
  • The NLP pipeline is a joint model combining a rule-based model (regex with Tweet Brown Clusters) and a statistical model (Linear SVM with discourse unit extraction).
  • Led the project with the help of Prof. Lyle Ungar and Anneke Buffone of the WWBP team.
Summer 2016

Teaching Assistant

Stony Brook University
  • Graduate Courses: Big Data Analytics (Fall 2016, Fall 2017)
  • Undergraduate Courses: Senior Software Engineering, Computer Science III, Advanced Game Programming, and Computer Music
Spring 2014 - Fall 2017

Software Engineer Intern

Dassault Systemes
  • Development of the Product Lifecycle Management (PLM) Web Application (ENOVIA)
  • Updated PLM chart display and data visualization functions together with senior software engineers from PLM Development and members of the pre-sales team
December 2012 - February 2013

Information Technology Specialist (25B)

the ROK Army & the US Army
  • Worked with the 2nd Infantry Division, 8th Army of the United States
  • Managed networks, computers, peripheral devices, and the online portal of the battalion
August 2010 - May 2012


Stony Brook University Computer Science PhD Awarded

DEC 2020

Invited Talks at INFORMS 2019

Two Session Talks for NLP Applications for Decision Support and Social Media Mining
OCT 20-23, 2019

Selected Publications

Discourse Relation Embeddings: Representing the Relations between Discourse Segments in Social Media

Youngseo Son, Vasudha Varadarajan, H. Andrew Schwartz

Large-scale weakly supervised multitask learning for parsing discourse relation embeddings from a new domain (social media) without labels. Bidirectional hierarchical LSTMs with word-level attention.



Predicting Adolescent Depression and Anxiety from Multi-Wave Longitudinal Data using Machine Learning

Mariah T. Hawes, H. Andrew Schwartz, Youngseo Son, Daniel N. Klein

This study leveraged machine learning to evaluate the contribution of information from multiple developmental stages to prospective prediction of depression and anxiety in mid-adolescence. We used canonical correlation analysis (CCA). The feature set included several important risk factors spanning psychopathology, temperament/personality, family environment, life stress, interpersonal relationships, neurocognitive, hormonal, and neural functioning, and parental psychopathology and personality.

Psychological Medicine

Suicide Risk Assessment with Multi-level Dual-Context Language and BERT

Matthew Matero, Akash Idnani, Youngseo Son, Salvatore Giorgi, Huy Vu, Mohammadzaman Zamani, Parth Limbachiya, Sharath Chandra Guntuku, H. Andrew Schwartz

Ranked No. 1 in predicting Reddit users’ suicide risk levels from their SuicideWatch and Non-SuicideWatch posts (Task B). Developed user-factor-adapted RNN models with post-level attention using BERT and psychology language model representations of Reddit posts.

NAACL 2019 CLPsych

The Language of Well-Being: Tracking Fluctuations in Emotion Experience through Everyday Speech

Jessie Sun, H. Andrew Schwartz, Youngseo Son, Margaret Kern, Simine Vazire

LDA Topic modeling to capture momentary emotions from language (validated by the replication in the second year). Exploration over Linguistic Inquiry and Word Count (LIWC) categories and open-vocabulary models for the correlation analysis between language and momentary emotion.

Journal of Personality and Social Psychology

Causal Explanation Analysis on Social Media

Youngseo Son, Nipun Bayas, H. Andrew Schwartz

The NLP pipeline of the joint model of the causality classifier (Linear SVM) and the causal explanation identifier (Bidirectional LSTM). The application of the pipeline to downstream tasks (Facebook Demographic Analysis and Yelp Review Sentiment Cause Detection)

[Code] [Pretrained Models]

EMNLP 2018

Human Centered NLP with User-Factor Adaptation

Veronica E. Lynn, Youngseo Son, Vivek Kulkarni, Niranjan Balasubramanian, H. Andrew Schwartz

Feature adaptation of NLP models using human variables (age, gender, and personality) for downstream tasks (POS Tagging, PP-Attachment, Sentiment, Sarcasm, Stance)


EMNLP 2017

Recognizing Counterfactual Thinking in Social Media Texts

Youngseo Son, Anneke Buffone, Anthony Janocko, Allegra Larche, Joseph Raso, Kevin Zembroski, H. Andrew Schwartz, Lyle Ungar

The NLP pipeline of the joint model of the rule-based model (regular expression capable of capturing social-media-specific variations of discourse connectives with Tweet Brown Clusters) and the statistical model (Linear SVM)


ACL 2017


Stony Brook University

Doctor of Philosophy
Computer Science

GPA: 3.89/4.00

August 2015 - December 2020

Stony Brook University

Bachelor of Science
Summa Cum Laude
Departmental Honors
Computer Science

GPA: 3.93/4.00

August 2013 - May 2015

Ajou University

Bachelor of Engineering
Computer Engineering

GPA: 4.36/4.50

March 2009 - August 2013

Recent Projects

9/11 World Trade Center Project

Project in collaboration with Stony Brook WTC Wellness Program and Stony Brook Medicine
  • Analyzing interviews of people who were at the scene of 9/11 WTC Attack.
  • Correlating linguistic features of the subjects with their mental/physical health.
  • Using LDA topic clustering, discourse relation parsing, sentiment/emotion lexicons.
August 2017 - Present

The Language of Well-Being Project

Project in collaboration with the University of California, Davis and the University of Melbourne
  • Correlating linguistic features of people’s everyday language with the changes of their emotions.
  • Conducting LDA topic clustering over the transcripts of the participants’ daily speech for the emotion analysis.
  • Using N-gram, Linguistic Inquiry and Word Count (LIWC), sentiment/emotion lexicons.
July 2017 - Present

Awards & Scholarship