My research lies at the intersection of machine learning, natural language processing, and social media. I am particularly interested in designing learning algorithms for gleaning semantic and structured knowledge from massive social media and web data. My work enables deeper analysis of text meaning and better natural language generation. I received my PhD in Computer Science from New York University, and my MS and BS from Tsinghua University. Prior to joining OSU, I was a postdoc at the University of Pennsylvania. I am a workshop chair for ACL 2017, an area chair for EMNLP 2016, and the publicity chair for NAACL 2016 and 2018. I also wrote a Twitter API tutorial.
I am looking to recruit one or two new PhD students each year. My group also regularly has a few research positions for top undergraduate and masters students. Here is a note for prospective students.
Jul 16 - 18, Seattle for Microsoft Research Faculty Summit
Jul 28 - 29, NYC for Google Research
Jul 30 - Aug 4, Vancouver, Canada for ACL 2017
Sep 7-11, Copenhagen, Denmark for EMNLP, organizing the 3rd Workshop on Noisy User-generated Text
September 2017, serving as the publicity chair for NAACL 2018. Consider submitting a paper!
July 2017, invited to give a talk at Google Research NLU workshop
Social Media and Text Analytics - CSE 5539 (Fall 2017, Fall 2016)
Social media provides a massive amount of valuable information and shows how language is actually used by a wide variety of people. This course covers several important machine learning algorithms and core natural language processing techniques for obtaining and processing Twitter data.
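As a small taste of the kind of preprocessing involved in working with Twitter data, here is a minimal rule-based tweet tokenizer (an illustrative sketch of my own, not actual course material): it keeps @-mentions, #hashtags, and URLs intact instead of splitting them apart the way a generic word tokenizer would.

```python
import re

# Order matters: URL, mention, and hashtag patterns are tried before the
# generic word pattern, so they are matched as single tokens.
TOKEN_PATTERN = re.compile(
    r"https?://\S+"      # URLs
    r"|@\w+"             # @-mentions
    r"|#\w+"             # hashtags
    r"|\w+(?:'\w+)?"     # words, with an optional apostrophe part (e.g. "don't")
    r"|[^\w\s]"          # any remaining punctuation, one symbol at a time
)

def tokenize_tweet(text):
    """Return a list of tokens from raw tweet text."""
    return TOKEN_PATTERN.findall(text)

print(tokenize_tweet("@user loving the #Oscars doc! http://t.co/abc"))
# ['@user', 'loving', 'the', '#Oscars', 'doc', '!', 'http://t.co/abc']
```

A production tokenizer would also handle emoticons, elongated words ("soooo"), and retweet markers; the sketch above only shows the basic idea.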
Speech and Language Processing - CSE 5525 (Spring 2017)
Fundamentals of natural language processing, automatic speech recognition, and speech synthesis; lab projects concentrating on building systems to process written and/or spoken language.
My past advisees have all published one or more papers with me:
Jim Chen (Undergraduate, UPenn → PhD, University of Washington)
Bin Fu (Undergraduate, Tsinghua → PhD, CMU → Google NYC)
Mingkun Gao (Masters, UPenn → PhD, UIUC)
Ray Lei (Undergraduate/Masters, UPenn → Microsoft Redmond)
Maria Pershina (PhD, NYU → Goldman Sachs NYC)
Siyu Qiu (Masters, UPenn → Hulu LA)
Natural Language Understanding / Semantics
We design machine learning algorithms to extract semantic or structured knowledge from large volumes of data. We have a series of works on learning web-scale paraphrases from Twitter that enable natural language systems to handle errors (e.g. “everytime” ↔ “every time”), lexical variations (e.g. “oscar nom’d doc” ↔ “Oscar-nominated documentary”), rare words (e.g. “NetsBulls series” ↔ “Nets and Bulls games”), and language shifts (e.g. “is bananas” ↔ “is great”) [BUCC2013][SemEval2015]. Such lexically divergent paraphrases are difficult to capture with conventional similarity-based approaches. We design multi-instance learning models [TACL2014], which jointly infer latent word-sentence relations [EMNLP2016] and relax the reliance on human annotation, as well as neural network models for sentence pair modeling [EMNLP2017].
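A toy illustration (my own, not taken from the papers) of why conventional similarity-based approaches struggle here: at the word level, a lexically divergent pair such as “oscar nom’d doc” ↔ “oscar-nominated documentary” shares no tokens at all, while character n-gram overlap at least recovers some signal.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def word_similarity(s1, s2):
    """Overlap of whole lowercased words."""
    return jaccard(s1.lower().split(), s2.lower().split())

def char_ngram_similarity(s1, s2, n=3):
    """Overlap of lowercased character n-grams (trigrams by default)."""
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    return jaccard(grams(s1.lower()), grams(s2.lower()))

pair = ("oscar nom'd doc", "oscar-nominated documentary")
print(word_similarity(*pair))        # 0.0 -- not a single shared word
print(char_ngram_similarity(*pair))  # > 0.2 -- character n-grams still overlap
```

Even character n-grams miss pairs like “is bananas” ↔ “is great”, which is why our models instead exploit shared context (e.g. which sentences and topics the words appear in) rather than surface form alone.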
Natural Language Generation / Stylistics
Many text-to-text generation problems can be thought of as sentential paraphrasing or monolingual machine translation. These tasks face an even larger search space than bilingual translation, but a much smaller space of acceptable outputs due to task-specific requirements. I advocate a statistical text-to-text framework built on top of statistical machine translation (SMT) technology. My recent work uncovered multiple serious problems in text simplification research [TACL2015] between 2010 and 2014, and set a new state of the art by designing novel objective functions for optimizing syntax-based SMT and by overgenerating with large-scale paraphrases [TACL2016]. I am also very interested in paraphrases across language styles (e.g. historic ↔ modern [COLING2012], erroneous ↔ well-edited [BUCC2013], feminine ↔ masculine [AAAI2016]).
Can Paraphrase Be an Ultimate Solution for NLU and NLG?
July 2017, Google Research, New York, NY
Paraphrase ≈ Monolingual Translation
Aug 2016, Amazon.com, Berlin, Germany
Multiple-instance Learning from Unlimited Text
Dec 2016, Microsoft Research Asia, Beijing, China
Sep 2016, University of Delaware, Newark, DE
May 2016, University of Edinburgh, Edinburgh, United Kingdom
Apr 2016, Ohio State University, Columbus, OH
Apr 2016, University of North Carolina, Chapel Hill, NC
Mar 2016, Arizona State University, Tempe, AZ
Mar 2016, Vanderbilt University, Nashville, TN
Mar 2016, Imperial College London, London, United Kingdom
Mar 2016, University of Waterloo, Waterloo, ON, Canada (CS Seminar)
Feb 2016, Indiana University, Bloomington, IN (Computer Science Colloquium Series)
Feb 2016, Washington University, St. Louis, MO (Computer Science & Engineering Colloquia Series)
Feb 2016, Simon Fraser University, Vancouver, BC, Canada
Feb 2016, University of Alberta, Edmonton, AB, Canada (Special Lecture)
Feb 2016, Yale University, New Haven, CT (CS Talk)
Oct 2015, University of Maryland, College Park, MD (CLIP Colloquium)
Oct 2015, Ohio State University, Columbus, OH (Clippers Seminar)
Large-scale Paraphrase Acquisition from Twitter
May 2015, DARPA DEFT PI Meeting, Boulder, CO
Learning and Generating Paraphrases from Twitter and Beyond [poster]
Apr 2015, Carnegie Mellon University, Pittsburgh, PA
Apr 2015, Columbia University, New York, NY (NLP Talk)
Feb 2015, Johns Hopkins University, Baltimore, MD (CLIP Colloquium)
Paraphrases in Twitter [slides]
Feb 2015, Twitter.com, San Francisco, CA
Modeling Lexically Divergent Paraphrases in Twitter (and Shakespeare!) [poster]
Mar 2015, The City University of New York, New York, NY (NLP Seminar)
Feb 2015, IBM Research - Almaden, San Jose, CA
Feb 2015, UC Berkeley, Berkeley, CA
Feb 2015, UT Austin, Austin, TX (Forum for Artificial Intelligence)
Dec 2014, Yahoo! Research, New York, NY
Nov 2014, Carnegie Mellon University, Pittsburgh, PA (CL+NLP Lunch Seminar)
Aug 2014, Microsoft Research, Redmond, WA (Visiting Speaker Series)
Incremental Information Extraction
Apr 2012, Stanford Research Institute, Palo Alto, CA
May 2011, IARPA's KDD PI Meeting, San Diego, CA
Information Extraction Research
Jan 2011, University of Washington, Seattle, WA
Nov 2009, Thomson Reuters, Eagan, MN
Mar 2007, France Telecom, Beijing, China
When I have spare time, I enjoy traveling, swimming, and snowboarding.