Wei Xu     

[phonetic pronunciation: way shoo]

Assistant Professor
Department of Computer Science and Engineering
The Ohio State University
   495 Dreese Lab (2015 Neil Ave, Columbus, OH 43210)

My research lies at the intersection of machine learning, natural language processing, and social media. I am particularly interested in designing learning algorithms for gleaning semantic and structured knowledge from massive social media and web data. My work enables deeper analysis of text meaning and better natural language generation. I received my PhD in Computer Science from New York University, and my MS and BS from Tsinghua University. Prior to joining OSU, I was a postdoc at the University of Pennsylvania. I am a workshop chair for ACL 2017, an area chair for EMNLP 2016, and the publicity chair for NAACL 2016. I also wrote the Twitter API tutorial.

I am looking to recruit one or two new PhD students each year. My group also regularly has a few research positions for top undergraduate and masters students. Here is a note to prospective students.
What's New
  Jul 16 - 18, Seattle for Microsoft Research Faculty Summit
  Jul 28 - 29, NYC for Google Research
  Jul 30 - Aug 4, Vancouver, Canada for ACL 2017
  Sep 7 - 11, Copenhagen, Denmark for EMNLP 2017, organizing the 3rd Workshop on Noisy User-generated Text
Talk on Twitter Paraphrase @ NAACL 2015

Talk on Text Simplification @ EMNLP 2015
Social Media and Text Analytics - CSE 5539 (Fall 2017, Fall 2016)
Social media provides a massive amount of valuable information and shows us how language is actually used by lots of people. This course covers several important machine learning algorithms and the core natural language processing techniques for obtaining and processing Twitter data.
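Since the course covers techniques for processing Twitter data, here is a minimal illustrative sketch (not actual course material) of one such core step: a Twitter-aware tokenizer. Plain whitespace or punctuation splitting breaks @-mentions, #hashtags, and URLs apart, so those patterns are matched first. All names here are hypothetical.

```python
import re

# Match Twitter-specific tokens before falling back to ordinary words
# and punctuation. Alternation order matters: URLs must come first so
# "http://..." is not split into pieces.
TOKEN_RE = re.compile(
    r"https?://\S+"      # URLs
    r"|@\w+"             # @-mentions
    r"|#\w+"             # hashtags
    r"|\w+(?:'\w+)?"     # words, with an optional apostrophe (e.g. don't)
    r"|[^\w\s]"          # any remaining single punctuation mark
)

def tokenize(tweet: str) -> list[str]:
    """Split a tweet into tokens, keeping mentions, hashtags, and URLs intact."""
    return TOKEN_RE.findall(tweet)
```

For example, `tokenize("ikr @NASA that's #space http://t.co/abc")` yields `['ikr', '@NASA', "that's", '#space', 'http://t.co/abc']`, whereas naive punctuation-based splitting would shred the mention, hashtag, and URL.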

Speech and Language Processing - CSE 5525 (Spring 2017)
Fundamentals of natural language processing, automatic speech recognition and speech synthesis; lab projects concentrating on building systems to process written and/or spoken language.

Artificial Intelligence II: Advanced Techniques - CSE 5522 (Spring 2018 - coming soon)

Current Students
    Wuwei Lan (PhD student)
    Pravar Mahajan (RA - Masters student)

My past advisees have all published one or more papers with me:
    Jim Chen (Undergraduate UPenn → PhD University of Washington)
    Bin Fu (Undergraduate Tsinghua → PhD CMU → Google NYC)
    Mingkun Gao (Masters UPenn → PhD UIUC)
    Ray Lei (Undergraduate/Masters UPenn → Microsoft)
    Maria Pershina (PhD NYU)
    Siyu Qiu (Masters UPenn → Hulu.com)

Research Highlights

Natural Language Understanding and Semantics

I build probabilistic graphical models to extract semantic or structured knowledge from large volumes of data. I designed the first successful models for extracting paraphrases from Twitter that scale up to billions of sentences. These web-scale paraphrases enable natural language systems to handle errors (e.g. “everytime” ↔ “every time”), lexical variations (e.g. “oscar nom’d doc” ↔ “Oscar-nominated documentary”), rare words (e.g. “NetsBulls series” ↔ “Nets and Bulls games”), and language shifts (e.g. “is bananas” ↔ “is great”) [BUCC2013] [SemEval2015] [EMNLP2017]. Such lexically divergent paraphrases are difficult to capture with conventional similarity-based approaches. I invented the multi-instance learning paraphrase (MultiP) model [TACL2014], which jointly infers latent word-sentence relations and relaxes the reliance on human annotation. It is a conditional random field model with latent variables [EMNLP2016] [ACL2014] [ACL2013], and the current state of the art, outperforming deep learning and latent space methods.
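The weakness of similarity-based approaches on lexically divergent paraphrases can be illustrated with a toy word-overlap measure (a hypothetical sketch for intuition only, not MultiP or any model from the papers above):

```python
def jaccard(s1: str, s2: str) -> float:
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b)

# A true Twitter paraphrase pair with no shared surface words scores zero,
# while a non-paraphrase pair sharing function words scores much higher.
jaccard("oscar nom'd doc", "oscar-nominated documentary")  # 0.0
jaccard("the movie was great", "the movie was terrible")   # 0.6
```

Any method that ranks candidate pairs by surface overlap inherits this failure mode, which is what motivates modeling latent word-level correspondences instead.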

Natural Language Generation

Many text-to-text generation problems can be viewed as sentential paraphrasing or monolingual machine translation. These tasks face an exponential search space larger than that of bilingual translation, but a much smaller optimal solution space due to task-specific requirements. I advocate a statistical text-to-text framework built on top of statistical machine translation (SMT) technology. My recent work uncovered multiple serious problems in text simplification research [TACL2015] between 2010 and 2014, and set a new state of the art by designing novel objective functions for optimizing syntax-based SMT and by overgenerating with large-scale paraphrases [TACL2016]. I am also very interested in paraphrases across language styles (e.g. historic ↔ modern [COLING2012], erroneous ↔ well-edited [BUCC2013], feminine ↔ masculine [AAAI2016]).

Professional Service
Workshop Chair:   ACL (2017)
Area Chair:   EMNLP (2016)
Publicity Chair:   NAACL (2016)
Organizer:
     - ACL 2015, COLING 2016, EMNLP 2017 Workshop on Noisy User-generated Text (W-NUT)
     - SemEval 2015 shared-task: Paraphrases and Semantic Similarity in Twitter (PIT)
     - 2016 Mid-Atlantic Student Colloquium on Speech, Language and Learning
Program Committee:
     ACL (2017, 2015, 2014, 2013), NAACL (2015), EMNLP (2017, 2016, 2015, 2014), COLING (2016, 2014)
     WWW (2016, 2015), AAAI (2016, 2015, 2012), KDD (2015)
Journal Reviewer:
     Transactions of the Association for Computational Linguistics (TACL)
     Journal of Artificial Intelligence Research (JAIR)

Collaborators
I am a big believer in collaboration and have been happy to work and co-author with:
    Chris Callison-Burch (UPenn)
    Colin Cherry (National Research Council Canada)
    Bill Dolan (Microsoft Research)
    Yangfeng Ji (Gatech → U of Washington)
    Raphael Hoffmann (U of Washington → AI2 Incubator)
    Wenjie Li (Hong Kong Polytechnic University)
    Adam Meyers (NYU)
    Courtney Napoles (JHU)
    Daniel Preoţiuc-Pietro (UPenn → Bloomberg)
    Alan Ritter (U of Washington → Ohio State U)
    Joel Tetreault (ETS → Yahoo! Research → Grammarly)
    Lyle Ungar (UPenn)
    Luke Zettlemoyer (U of Washington)
    Le Zhao (CMU → Google)
    and many others ...


When I have spare time, I enjoy traveling, swimming and snowboarding.

I also made a list of the best dressed NLP researchers (2015) and (2014).