Wei Xu     

[phonetic pronunciation: way shoo]

Assistant Professor
Department of Computer Science and Engineering
The Ohio State University
   495 Dreese Lab (2015 Neil Ave, Columbus, OH 43210)

My research lies at the intersection of machine learning, natural language processing, and social media. I am particularly interested in designing learning algorithms for gleaning semantic and structured knowledge from massive social media and web data. My work enables deeper analysis of text meaning and better natural language generation. I received my PhD in Computer Science from New York University, and my MS and BS from Tsinghua University. Prior to joining OSU, I was a postdoctoral researcher at the University of Pennsylvania. I am a workshop chair for ACL 2017, an area chair for EMNLP 2016, and the publicity chair for NAACL 2016 and 2018. I also wrote the Twitter API tutorial and designed a new undergraduate/graduate course on Social Media and Text Analytics.

I am looking to recruit one or two new PhD students each year. My group also regularly has a few research positions for undergraduate and masters students. Here is a note to prospective students.
What's New
  March 15, Columbus - speaking at Women in Analytics Conference
  June 1-5, New Orleans - NAACL 2018
  June 6, Chicago - invited talk at Midwest Machine Learning Symposium
Talk on Twitter Paraphrase @ NAACL 2015

Talk on Text Simplification @ EMNLP 2015

Teaching
CSE 5522 Artificial Intelligence II: Advanced Techniques (Spring 2018 - current)
CSE 5539 Social Media and Text Analytics (Fall 2017, Fall 2016)
CSE 5525 Speech and Language Processing (Spring 2017)

Current Students:
    Wuwei Lan (PhD student, 2016 -- ; semantics/deep learning EMNLP'17 NAACL'18a)
    Mounica Maddela (PhD student, 2017 -- ; stylistics)
    Shuaichen Chang (PhD student, 2017 -- )

Past Students:
    Jim Chen (Undergraduate @UPenn; crowdsourcing HCOMP'14 TACL'16 - now PhD University of Washington)
    Wenchao Du (Undergraduate @UWaterloo; dialog AAAI'17 SAP - now Master CMU LTI)
    Mingkun Gao (Masters student @UPenn; machine translation NAACL'15 - now PhD UIUC)
    Piyush Ghai (Masters student @OSU; semantics)
    Chaitanya Kulkarni (PhD student @OSU; robotic instructions NAACL'18b)
    Ray Lei (Undergraduate @UPenn; crowdsourcing HCOMP'14 - now Microsoft Redmond)
    Pravar Mahajan (Masters student @OSU; social media)
    Jeniya Tabassum (PhD student @OSU; information extraction EMNLP'16)
    Maria Pershina (PhD student @NYU; information extraction ACL'14 - now Goldman Sachs NYC)
    Siyu Qiu (Masters student @UPenn; semantics EMNLP'17 - now Hulu LA)

Research Highlights

Natural Language Understanding / Semantics

We design machine learning algorithms to extract semantic or structured knowledge from large volumes of data. We have a line of work on learning web-scale paraphrases from Twitter that can enable natural language systems to handle errors (e.g. “everytime” ↔ “every time”), lexical variations (e.g. “oscar nom’d doc” ↔ “Oscar-nominated documentary”), rare words (e.g. “NetsBulls series” ↔ “Nets and Bulls games”), and language shifts (e.g. “is bananas” ↔ “is great”) [BUCC'13] [SemEval'15]. Such lexically divergent paraphrases are difficult to capture with conventional similarity-based approaches. We construct large-scale data [EMNLP'17] and design neural network models for sentence pair modeling [NAACL'18a] as well as multi-instance learning models [TACL'14] [EMNLP'16], which jointly infer latent word-sentence relations.

Natural Language Generation / Stylistics

Many text-to-text generation problems can be thought of as sentential paraphrasing or monolingual machine translation. These problems face an exponential search space larger than that of bilingual translation, but a much smaller optimal solution space due to specific task requirements. I advocate for a text-to-text generation framework that builds on top of machine translation technologies. My recent work uncovered multiple serious problems in text simplification research between 2010 and 2014 [TACL'15], and set a new state of the art by designing novel objective functions for optimizing syntax-based SMT and overgenerating with large-scale paraphrases [TACL'16]. I am also very interested in paraphrases of different language styles (e.g. historic ↔ modern [COLING'12], non-standard ↔ standard [BUCC'13], feminine ↔ masculine [AAAI'16]).

Professional Service
Workshop Chair:   ACL (2017)
Area Chair:   COLING (2018), EMNLP (2016)
Publicity Chair:   NAACL (2018, 2016)
Organizer:
     - Workshop on Noisy User-generated Text (W-NUT) at ACL 2015, COLING 2016, EMNLP 2017 & 2018
     - SemEval 2015 shared-task: Paraphrases and Semantic Similarity in Twitter
     - 2016 Mid-Atlantic Student Colloquium on Speech, Language and Learning
Program Committee:
     ACL (2018, 2017, 2015, 2014, 2013), NAACL (2018, 2015), EMNLP (2017, 2016, 2015, 2014), COLING (2016, 2014)
     WWW (2016, 2015), AAAI (2016, 2015, 2012), KDD (2015)
Journal Reviewer:
     Transactions of the Association for Computational Linguistics (TACL)
     Journal of Artificial Intelligence Research (JAIR)

When I have spare time, I enjoy traveling, swimming and snowboarding.

I also made a list of the best dressed NLP researchers (2016/17), (2015) and (2014).