A long-standing goal in Natural Language Processing (NLP) is the development of robust language technology applicable across the world’s languages. Until this goal is met, there will be limited global access to important applications such as Machine Translation or Information Retrieval. The main research challenge in multilingual NLP is to mitigate the serious bottleneck concerning the lack of annotated data for the majority of the world’s languages. Although we can approach this challenge via transfer of knowledge from resource-rich to resource-lean languages or via creation of models by joint learning from several languages, we are still far from accurate and efficient models applicable to any language of the world.
One of the main problems is the huge diversity of human languages. While languages can share universal features at a deep level, at the surface level their structures and categories vary significantly. This compromises the performance of language-agnostic NLP algorithms when applied on a large scale: their design, training, and hyperparameter tuning suffer from language-specific biases. For instance, the perplexities of word-level language models suffer particularly on languages with rich morphology because this information is disregarded. Moreover, the variation in syntactic trees, unless reduced, hinders the performance of structured encoders when applied cross-lingually on Natural Language Inference.
One highly promising solution to cross-lingual variation lies in linguistic typology. Linguistic typology provides a systematic, empirical comparison of the world’s languages with respect to a variety of linguistic properties. Work on NLP has shown that typological information documented in publicly available databases can provide a rich source of guidance for the choice of data, features, and algorithm design in multilingual NLP. One well-known example is work that integrates typological information into multilingual models in the form of “selective sharing”, an approach that ties language-specific model parameters according to the typological features of each language.
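As a rough illustration of the parameter-tying idea behind selective sharing, the following minimal Python sketch keeps one parameter vector per (feature, value) pair, so that languages with the same typological value reuse the same parameters. It is not the implementation of the work mentioned above: the class `SelectiveSharing`, the `TYPOLOGY` table, and the feature names are hypothetical and chosen only for illustration.

```python
# Hypothetical typological table (illustrative word-order features only).
TYPOLOGY = {
    "english": {"verb_object": "VO", "adposition": "prepositions"},
    "spanish": {"verb_object": "VO", "adposition": "prepositions"},
    "japanese": {"verb_object": "OV", "adposition": "postpositions"},
    "turkish": {"verb_object": "OV", "adposition": "postpositions"},
}


class SelectiveSharing:
    """Toy parameter store: one vector per (feature, value) pair.

    Assumption: real models would hold learned weights; plain lists of
    floats stand in for them here.
    """

    def __init__(self, typology, dim=4):
        self.typology = typology
        self.dim = dim
        self.params = {}  # maps (feature, value) -> parameter vector

    def parameters_for(self, language):
        """Return the parameter vectors used by `language`, keyed by feature."""
        vectors = {}
        for feature, value in self.typology[language].items():
            key = (feature, value)
            # Create the vector on first use; later languages with the
            # same feature value receive the very same object.
            if key not in self.params:
                self.params[key] = [0.0] * self.dim
            vectors[feature] = self.params[key]
        return vectors


model = SelectiveSharing(TYPOLOGY)
english = model.parameters_for("english")
spanish = model.parameters_for("spanish")
japanese = model.parameters_for("japanese")

# An update made through English is visible to Spanish (same value, "VO"),
# but not to Japanese (different value, "OV").
english["verb_object"][0] += 1.0
```

In this sketch the sharing is all-or-nothing per feature value. A softer variant could instead condition parameters on a continuous typological vector, which is closer in spirit to the probabilistic models discussed in the next paragraph.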
Although many such approaches have been proposed, typological information has not yet been fully exploited in NLP. Most experiments have been limited to a small number of typological features (mostly related to word order) and tasks (mostly dependency parsing), while many others could be explored. One potential reason for this could be the limited awareness or understanding of typology among the NLP community. However, existing typological databases are also lacking in coverage, and their interpretable, absolute, discrete features cannot be integrated straightforwardly into state-of-the-art NLP algorithms, which are opaque, contextual, and probabilistic. A possible solution to this is the automatic induction of typological information from data. For example, expressions of tense can be extracted from a multi-parallel corpus by starting from a pivot and finding equivalents in other languages based on distributional methods. Although this research thread has been flourishing recently, automatically induced typological information has not yet been integrated into NLP algorithms or multilingual NLP on a large scale, which is a promising line for future research.
Our TyP-NLP workshop will be the first dedicated venue for typology-related research and its integration in multilingual NLP. Long overdue, the workshop is specifically aimed at raising awareness of linguistic typology and its potential in supporting and widening the global reach of multilingual NLP. It will foster research and discussion on open problems, not only within the active community working on cross- and multilingual NLP but also inviting input from leading researchers in linguistic typology. The workshop will provide focussed discussion on a range of topics, including (but not limited to) the following:
The workshop will feature several invited speakers from the fields of (multilingual) NLP and linguistic typology (see the list below), focusing on the themes mentioned above. We will also host a panel to bring in different perspectives on the problems shared by the two disciplines. Finally, we will issue a call for abstract submissions, and the accepted abstracts will be presented at the workshop, providing new insights and ideas. We plan to make the short abstracts non-archival, so as not to discourage researchers who prefer to publish in main conference proceedings, and at the same time to ensure that interesting, exciting, and thought-provoking research is presented at the workshop. In particular, we will solicit 2-page or 4-page abstracts of already published work or work in progress.
In general, we believe that this interdisciplinary workshop will be a great opportunity to encourage research in a timely area which has not received such dedicated attention before but which is of interest to the large and diverse community of researchers working on multilingual NLP. We expect this workshop ultimately to lead to key methodology for improving the global reach of language technology.
Emily M. Bender’s primary research interests are in multilingual grammar engineering, the study of variation, both within and across languages, and the relationship between linguistics and computational linguistics. She is the LSA’s delegate to the ACL. Her 2013 book Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax aims to present linguistic concepts in a manner accessible to NLP practitioners.
Jason Eisner works on machine learning, combinatorial algorithms, probabilistic models of linguistic structure, and declarative specification of knowledge and algorithms. His work addresses the question, “How can we appropriately formalize linguistic structure and discover it automatically?”
Balthasar Bickel aims at understanding the diversity of human language with rigorously tested causal models, i.e. at answering the question what’s where why in language. What structures are there, and how exactly do they vary? Engaged in both linguistic fieldwork and statistical modeling, he focuses on explaining universally consistent biases in the diachrony of grammar properties, biases that are independent of local historical events.
Sabine Stoll investigates how children can cope with the incredible variation exhibited in the approximately 6000–7000 languages spoken around the world. Her main focus is the interplay of innate biological factors (such as the capacity for pattern recognition and imitation) with idiosyncratic and culturally determined factors (such as the type and quantity of input). Her approach is radically empirical, based first and foremost on the quantitative analysis of large corpora that record how children learn diverse languages.
Isabelle Augenstein has been a tenure-track assistant professor at the University of Copenhagen, Department of Computer Science, since July 2017. She is affiliated with the Copenhagen NLP group and the Machine Learning Section, and works in the general areas of Statistical Natural Language Processing and Machine Learning. Her main research interests are weakly supervised and low-resource learning, with applications including information extraction, machine reading, and fact checking.