Wei Xu

[phonetic pronunciation: way shoo]

Assistant Professor
College of Computing
Georgia Institute of Technology

I am a faculty member of the School of Interactive Computing, Machine Learning Center, and NSF AI CARING Institute at Georgia Tech. My research lies at the intersection of machine learning, natural language processing, and social media. I focus on designing algorithms for learning semantics from large data for natural language understanding, and for natural language generation, in particular with stylistic variations. I recently received the NSF CRII Award, NSF CAREER Award, Criteo Faculty Research Award, CrowdFlower AI for Everyone Award, Best Paper Award at COLING'18, as well as research funds from DARPA and IARPA. I was a postdoctoral researcher at the University of Pennsylvania. I received my PhD in Computer Science from New York University, and my MS and BS from Tsinghua University.

I'm recruiting 1-2 new PhD students every year (apply to the PhD program and list me as a potential advisor). I also advise undergraduate and MS students (who have sufficient time and motivation) for research theses.
What's New
Research Highlights

Natural Language Generation / Stylistics

Many text-to-text generation problems can be thought of as sentential paraphrasing or monolingual machine translation. These problems face an exponential search space larger than that of bilingual translation, but a much smaller optimal solution space due to specific task requirements. I am interested in a variety of generation problems, including style transfer [COLING'12] and stylistics in general (e.g., historic ↔ modern, non-standard ↔ standard [BUCC'13], feminine ↔ masculine [AAAI'16]). Our latest work focuses on controllable text generation [NAACL'21]. My work uncovered multiple serious problems in previous research (from 2010 to 2014) on text simplification [TACL'15], designed a new tunable metric, SARI [TACL'16], which is effective both for evaluation and as a learning objective for training (now added to TensorFlow by the Google AI group), optimized syntax-based machine translation models [TACL'16], created pairwise neural ranking models for lexical simplification [EMNLP'18], and studied document-level simplification [AAAI'20]. Our newest Transformer-based model initialized with BERT is the current state-of-the-art for automatic text simplification [ACL'20a].

Natural Language Understanding / Semantics

My approach to natural language understanding is learning and modeling paraphrases on a much larger scale and with a much broader range than previous work, essentially by developing more robust machine learning models and leveraging social media data. These paraphrases can enable natural language systems to handle errors (e.g., “everytime” ↔ “every time”), lexical variations (e.g., “oscar nom’d doc” ↔ “Oscar-nominated documentary”), rare words (e.g., “NetsBulls series” ↔ “Nets and Bulls games”), and language shifts (e.g., “is bananas” ↔ “is great”). We designed a series of unsupervised and supervised learning approaches for paraphrase identification in social media data (also applicable to question/answer pairs for QA systems), ranging from neural network models [COLING'18][NAACL'18a] to multi-instance learning [TACL'14][EMNLP'16], and crowdsourced large-scale datasets [SemEval'15][EMNLP'17].

Noisy User-generated Data / Social Media

For AI to truly understand human language and help people (e.g., instructing a robot), we ought to study the language people actually use in their daily lives (e.g., posting on social media), in addition to the formally written texts that are well supported by existing NLP software. I thus focus on specially designed learning algorithms, and on the data for training them, to develop tools that process and analyze noisy user-generated data. I have worked extensively with Twitter data [EMNLP'19][EMNLP'17][EMNLP'16][TACL'14], given its importance and large-scale coverage. Social media also contains very diverse language for studying stylistics and semantics, carrying information that is important for both people’s everyday lives and national security. In the past three years, with my students, I have expanded my scope to cover a wider range of user-generated data, including biology lab protocols [NAACL'18b], GitHub, and StackOverflow [ACL'20b].

Current Offering:
Previous Offerings:


Group outings in 2021 with social distancing.

Current Students:
    Mounica Maddela (PhD student; text generation, ranking model)
    Chao Jiang (PhD student; semantics, structured model)
    Yao Dou (PhD student; text generation, evaluation)
    Tarek Naous (PhD student; multilingual NLP, social media)
    Duong Minh Le (PhD student; dialog, controllable text generation -- co-advisor: Alan Ritter)
    Yang Chen (PhD student; information extraction, transfer learning -- co-advisor: Alan Ritter)
    Junmo Kang (PhD student; language model efficiency -- co-advisor: Alan Ritter)
    Jonathan Zheng (Undergrad, autumn 2020 -- ; robustness, social media EMNLP'22)
    David Heineman (Undergrad, winter 2020 -- ; generation evaluation)
    Michael Ryan (Undergrad, winter 2020 -- ; text simplification)
    Vishnu Suresh (Undergrad, autumn 2021 -- ; medical NLP)
    Marcus Ma (Undergrad, spring 2022 -- ; stylistics)
    Alexandra Soong (Undergrad, spring 2022 -- )
    Vishnesh Jayanthi (Undergrad, summer 2022 -- )
    Rachel Choi (Undergrad, summer 2022 -- )
    Ian Ligon (Undergrad, summer 2022 -- )
    Anton Lavrouk (Undergrad, autumn 2022 -- )
    Vinayak Athavale (Undergrad, autumn 2022 -- )
    Govind Ramesh (Undergrad, winter 2022 -- )
    Grace Kim (Undergrad, spring 2023 -- )
    Yimeng Jiang (Undergrad, spring 2023 -- )

PhD Thesis Committee:
    Sarah Wiegreffe (PhD @GaTech, 2022; interpretability/explainable AI - advisor: Mark Riedl)
    Yuval Pinter (PhD @GaTech, 2021; interpretability/semantics/morphology - advisor: Jacob Eisenstein)
    Sanqiang Zhao (PhD @UPitt, 2021; text simplification - advisor: Daqing He)
    Shi Zong (PhD @OSU, 2020; computational social science - advisor: Alan Ritter)
    Kai Cao (PhD @NYU, 2017; information extraction - advisor: Ralph Grishman)
    Maria Pershina (PhD @NYU, 2014; information extraction ACL'14 - advisor: Ralph Grishman)

Former Student Advisees:
    Wuwei Lan (PhD @OSU, 2021; semantics ACL'21 EMNLP'20'17 COLING'18 NAACL'18 → researcher at Amazon)
    Jeniya Tabassum (PhD @OSU, 2021; social media ACL'20 EMNLP'16 - co-advisor: Alan Ritter → lecturer at OSU)
    Chaitanya Kulkarni (PhD @OSU; biology protocols NAACL'18b - advisor: Raghu Machiraju)
    Yang Zhong (MS @OSU, 2021; stylistics EMNLP-F'21 AAAI'20 → PhD at UPitt)
    Mingkun Gao (MS @UPenn; crowdsourcing/MT NAACL'15 - advisor: Chris Callison-Burch → PhD at UIUC)
    Siyu Qiu (MS @UPenn; semantics EMNLP'17 → Hulu)
    Piyush Ghai (MS @OSU; semantics → Amazon)
    Daniel Joongwon Kim (Undergrad @UPenn; generation EMNLP'21 - advisor: Chris Callison-Burch → PhD at UW)
    Jim Chen (Undergrad @UPenn; crowdsourcing HCOMP'14 TACL'16 - advisor: Chris Callison-Burch → PhD at UW)
    Ray Lei (Undergrad @UPenn; crowdsourcing HCOMP'14 → Microsoft)
    Wenchao Du (Undergrad @UWaterloo; dialog AAAI'17 SAP - advisor: Pascal Poupart → MS at CMU LTI)
    Sydney Lee (Undergrad @OSU; linguistic annotation WNUT'20 → Capital One)
    Sam Stevens (Undergrad @OSU; scientific writing → PhD at OSU)
    Ema Goh (Undergrad; text simplification)
    Dylan Small (Undergrad; generation)
    Mohamed Ghanem (Undergrad; multilingual NLP)
    Brian Seeds (Undergrad; user interface)
    Daniel Szoke (Undergrad; offensive language)
    Sarah Flanagan (Undergrad; linguistic annotation)
    Andrew Duffy (Undergrad; linguistic annotation)
    Kenneth Koepcke (Undergrad; linguistic annotation)
    Renliang Sun (Graduate intern, Peking Univ., summer/fall 2022; training BART/RoBERTa variations)
    Srushti Nandu (Undergrad intern, summer 2021 - spring 2022; social media)
    Panya Bhinder (High school intern, summer 2020)
    Solomon Wood (High school intern, spring 2020)
    Muji Lai (High school intern, spring 2022)


I am a NAACL executive board member; a senior area chair for EMNLP 2022 (generation), NAACL 2022 (machine learning for NLP) and 2021 (generation), and ACL 2020 (generation); an area chair for ACL 2023 (semantics), EMNLP 2021 (computational social science), EMNLP 2020 (generation), AAAI 2020 (NLP), ACL 2019 (semantics), NAACL 2019 (generation), EMNLP 2018 (social media), COLING 2018 (semantics), and EMNLP 2016 (generation); a workshop chair for ACL 2017; and the publicity chair for EMNLP 2019 and for NAACL 2018 and 2016. I also created a new undergraduate course on Social Media and Text Analytics.


When I have spare time, I enjoy visiting art museums, hiking, biking, and snowboarding.

In 2017, I wrote a biography of my PhD advisor Ralph Grishman, along with some early history of Information Extraction research.

I also made a list of the best dressed NLP researchers in 2016/17, 2015 and 2014.