My research lies at the intersection of machine learning, natural language processing, and social media. I focus on designing algorithms that learn semantics from large data for natural language understanding, and on natural language generation, particularly with stylistic variations. I recently received the NSF CRII Award, Criteo Faculty Research Award, CrowdFlower AI for Everyone Award, and Best Paper Award at COLING'18, as well as research funds from DARPA. I was a postdoctoral researcher at the University of Pennsylvania. I received my PhD in Computer Science from New York University, and my MS and BS from Tsinghua University.
I am a senior area chair for NAACL 2021 and ACL 2020 (generation); an area chair for EMNLP 2020 (generation), AAAI 2020 (NLP), ACL 2019 (semantics), NAACL 2019 (generation), EMNLP 2018 (social media), COLING 2018 (semantics), and EMNLP 2016 (generation); a workshop chair for ACL 2017; and the publicity chair for EMNLP 2019, NAACL 2018, and NAACL 2016. I also created a new course on Social Media and Text Analytics.
I'm recruiting two new PhD students this year. Possible areas of research include language generation, robust NLP, information extraction, and interactive machine learning.
Sep 11 - talk at Emory University, CS Department Seminar, "Understanding & Generating Human Language"
Oct 15 - talk at USC/ISI NL Seminar (video), "Natural Language Understanding for Noisy Text"
Oct 27 - talk at University of Pittsburgh, NLP Seminar, "Automatic Text Simplification"
Oct 29 - talk at University of Sheffield, NLP Seminar, "Natural Language Understanding for Noisy Text"
Oct 30 - talk at Google, "Natural Language Generation towards Social Good"
Nov 4 - talk at University of Delaware, ECE Department Seminar, "Understanding & Generating Human Language"
Nov 13 - talk at CMU LTI Colloquium (video), "Importance of Data and Linguistics in Neural Language Generation"
Nov 19 - organizing EMNLP Workshop on Noisy User-generated Text
Currently serving as a senior area chair for NAACL 2021 (Language Generation track).
Nov 2020, my PhD student Jeniya Tabassum successfully defended her PhD thesis (co-advisor Alan Ritter).
Oct 2020, paper accepted at EMNLP 2020 on GigaBERT for zero-shot transfer learning from English to Arabic.
April 2020, three long papers accepted to ACL 2020! We are releasing (1) a new high-quality dataset and Transformer-based model for text simplification; (2) fine-grained named entity and code recognition for StackOverflow; (3) a unified span-based neural network framework and benchmark leaderboard for 10+ NLP tasks.
Many text-to-text generation problems can be thought of as sentential paraphrasing or monolingual machine translation. Monolingual translation faces an exponential search space larger than that of bilingual translation, but a much smaller optimal solution space due to specific task requirements. I am interested in a variety of generation problems, including text simplification, style transfer, paraphrase generation, and error correction. My work uncovered multiple serious problems in previous research (from 2010 to 2014) on text simplification [TACL'15], designed a new tunable metric, SARI [TACL'16], which is effective both for evaluation and as a learning objective for training (since added to TensorFlow by the Google AI group), optimized syntax-based machine translation models [TACL'16], created pairwise neural ranking models for lexical simplification [EMNLP'18], and studied document-level simplification [AAAI'20]. Our newest Transformer-based model initialized with BERT is the current state-of-the-art for automatic text simplification [ACL'20a]. I am also interested in text generation for style transfer [COLING'12] and stylistics in general (e.g., historic ↔ modern, non-standard ↔ standard [BUCC'13], feminine ↔ masculine [AAAI'16]).
Natural Language Understanding / Semantics
My approach to natural language understanding is to learn and model paraphrases on a much larger scale and with a much broader range than previous work, essentially by developing more robust machine learning models and leveraging social media data. These paraphrases can enable natural language systems to handle errors (e.g., “everytime” ↔ “every time”), lexical variations (e.g., “oscar nom’d doc” ↔ “Oscar-nominated documentary”), rare words (e.g., “NetsBulls series” ↔ “Nets and Bulls games”), and language shifts (e.g., “is bananas” ↔ “is great”). We designed a series of unsupervised and supervised learning approaches for paraphrase identification in social media data (also applicable to question/answer pairs for QA systems), ranging from neural network models [COLING'18][NAACL'18a] to multi-instance learning [TACL'14][EMNLP'16], and crowdsourced large-scale datasets [SemEval'15][EMNLP'17].
Noisy User-generated Data / Social Media
For AI to truly understand human language and help people (e.g., instructing a robot), we ought to study the language people actually use in their daily lives (e.g., posting on social media), in addition to the formally written texts that are well supported by existing NLP software. I thus focus on specially designed learning algorithms, and on the data for training them, to develop tools that process and analyze noisy user-generated data. I have worked extensively with Twitter data [EMNLP'19][EMNLP'17][EMNLP'16][TACL'14], given its importance and large-scale coverage. Social media also contains very diverse language for studying stylistics and semantics, carrying information that is important for both people’s everyday lives and national security. In the past three years, with my students, I have expanded my scope to cover a wider range of user-generated data, including biology lab protocols [NAACL'18b], GitHub, and StackOverflow [ACL'20b].
Importance of Data and Linguistics in Neural Language Generation [video recording]
Nov 2020, Carnegie Mellon University (LTI Colloquium)
Natural Language Understanding for Noisy Text [video recording]
Oct 2020, University of Sheffield (NLP Seminar)
Oct 2020, USC Information Sciences Institute (NL Seminar)
Natural Language Generation towards Social Good
Oct 2020, Google
Automatic Text Simplification [slides]
Oct 2020, University of Pittsburgh, Pittsburgh, PA (NLP Seminar)
Understanding and Generating Human Language
Nov 2020, University of Delaware, Newark, DE (ECE Department Seminar)
Sep 2020, Emory University, Atlanta, GA (CS Department Seminar)
Feb 2020, University of Maryland, College Park, MD
Jan 2020, University of Massachusetts, Amherst, MA
Dec 2019, Georgia Institute of Technology, Atlanta, GA
Learning for Unlimited Human Language
Dec 2018, Peking University, Beijing, China
Learning Large-scale Paraphrases for Natural Language Understanding and Generation
Jun 2018, Midwest Machine Learning Symposium, Chicago, IL
May 2018, Facebook, Menlo Park, CA
May 2018, Twitter, San Francisco, CA
May 2018, Stanford Research Institute, Menlo Park, CA
Nov 2017, IBM Thomas J. Watson Research Center, New York
How AI Understand Language?
Mar 2018, Women in Analytics Conference (Main-stage Panel)
Can Paraphrase be a Ultimate Solution for NLU and NLG?
July 2017, Google Research, New York, NY
Paraphrase ≈ Monolingual Translation
Aug 2016, Amazon, Berlin, Germany
Multiple-instance Learning from Unlimited Text
Dec 2016, Microsoft Research Asia, Beijing, China
Sep 2016, University of Delaware, Newark, DE
May 2016, University of Edinburgh, Edinburgh, United Kingdom
Apr 2016, Ohio State University, Columbus, OH
Apr 2016, University of North Carolina, Chapel Hill, NC
Mar 2016, Arizona State University, Tempe, AZ
Mar 2016, Vanderbilt University, Nashville, TN
Mar 2016, Imperial College London, London, United Kingdom
Mar 2016, University of Waterloo, Waterloo, ON, Canada (CS Seminar)
Feb 2016, Indiana University, Bloomington, IN (Computer Science Colloquium Series)
Feb 2016, Washington University, St. Louis, MO (Computer Science & Engineering Colloquia Series)
Feb 2016, Simon Fraser University, Vancouver, BC, Canada
Feb 2016, University of Alberta, Edmonton, AB, Canada (Special Lecture)
Feb 2016, Yale University, New Haven, CT (CS Talk)
Oct 2015, University of Maryland, College Park, MD (CLIP Colloquium)
Oct 2015, Ohio State University, Columbus, OH (Clippers Seminar)
Large-scale Paraphrase Acquisition from Twitter
May 2015, DARPA DEFT PI Meeting, Boulder, CO
Learning and Generating Paraphrases from Twitter and Beyond
Apr 2015, Carnegie Mellon University, Pittsburgh, PA
Apr 2015, Columbia University, New York, NY (NLP Talk)
Feb 2015, Johns Hopkins University, Baltimore, MD (CLIP Colloquium)
Paraphrases in Twitter [slides]
Feb 2015, Twitter, San Francisco, CA
Modeling Lexically Divergent Paraphrases in Twitter (and Shakespeare!) [poster]
Mar 2015, The City University of New York, New York, NY (NLP Seminar)
Feb 2015, IBM Research - Almaden, San Jose, CA
Feb 2015, UC Berkeley, Berkeley, CA
Feb 2015, UT Austin, Austin, TX (Forum for Artificial Intelligence)
Dec 2014, Yahoo! Research, New York, NY
Nov 2014, Carnegie Mellon University, Pittsburgh, PA (CL+NLP Lunch Seminar)
Aug 2014, Microsoft Research, Redmond, WA (Visiting Speaker Series)
Incremental Information Extraction
Apr 2012, Stanford Research Institute, Palo Alto, CA
May 2011, IARPA's KDD PI Meeting, San Diego, CA
Information Extraction Research
Jan 2011, University of Washington, Seattle, WA
Event-based Summarization
Nov 2009, Thomson Reuters, Eagan, MN
Mar 2007, France Telecom, Beijing, China
Miscellaneous
When I have spare time, I enjoy visiting art museums, hiking, biking, and snowboarding.