NLP Beyond English
5000+ languages are spoken around the world, but NLP research has mostly focused on English. This post outlines why you should work on languages other than English.
Natural language processing (NLP) research predominantly focuses on developing methods that work well for English, despite the many benefits of working on other languages. These benefits range from an outsized societal impact to modelling a wealth of linguistic features, to avoiding overfitting to a single language, as well as interesting challenges for machine learning (ML).
The societal perspective
Technology cannot be accessible if it is only available for English speakers with a standard accent.
What language you speak determines your access to information, education, and even human connections. Even though we think of the Internet as open to everyone, there is a digital language divide between dominant languages (mostly from the Western world) and others. Only a few hundred languages are represented on the web and speakers of minority languages are severely limited in the information available to them.
A continuing lack of technological inclusion will not only exacerbate the language divide but may also drive speakers of unsupported languages and dialects towards high-resource languages with better technological support, further endangering those language varieties. To ensure that non-English speakers are not left behind, and at the same time to offset the existing imbalance and lower language and literacy barriers, we need to apply our models to non-English languages.
The linguistic perspective
Even though we claim to be interested in developing general language understanding methods, our methods are typically applied to only a single language: English.
English and the small set of other high-resource languages are in many ways not representative of the world’s other languages. Many resource-rich languages belong to the Indo-European language family, are spoken mostly in the Western world, and are morphologically poor, i.e. information is mostly expressed syntactically, e.g. via a fixed word order and using multiple separate words rather than through variation at the word level.
Working on languages beyond English may also help us gain new knowledge about the relationships between the languages of the world. Conversely, it can help us reveal what linguistic features our models are able to capture. Specifically, you could use your knowledge of a particular language to probe aspects that differ from English such as the use of diacritics, extensive compounding, inflection, derivation, reduplication, agglutination, fusion, etc.
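As a toy illustration of what probing can look like, the sketch below trains a simple linear classifier to recover a linguistic feature from word representations. Everything here is hypothetical: the 3-dimensional "embeddings" and the singular/plural labels are made up for the example, whereas a real probe would use representations taken from a trained NLP model.

```python
# Minimal sketch of a probing ("diagnostic") classifier: train a simple
# linear model to predict a linguistic feature from fixed word vectors.
# The vectors and singular/plural labels below are hypothetical toy data.

# Hypothetical (vector, is_plural) pairs standing in for real embeddings.
data = [
    ([0.9, 0.1, 0.2], 0), ([0.8, 0.2, 0.1], 0), ([0.7, 0.0, 0.3], 0),
    ([0.1, 0.9, 0.8], 1), ([0.2, 0.8, 0.9], 1), ([0.0, 0.7, 0.7], 1),
]

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Perceptron probe: if it separates the classes, the feature is
# linearly recoverable from the representations.
w, b = [0.0, 0.0, 0.0], 0.0
for _ in range(20):
    for x, y in data:
        error = y - predict(w, b, x)
        for i in range(len(w)):
            w[i] += error * x[i]
        b += error

accuracy = sum(predict(w, b, x) == y for x, y in data) / len(data)
print(accuracy)  # the toy data is linearly separable, so the probe fits it
```

The same recipe, applied to real model representations and features such as case, number, or reduplication, is one way to ask which aspects of a language a model actually captures.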
The machine learning perspective
We encode assumptions into the architectures of our models that are based on the data we intend to apply them to. Even though we intend our models to be general, many of their inductive biases are specific to English and languages similar to it.
The lack of any explicitly encoded information in a model does not mean that it is truly language-agnostic. A classic example is the n-gram language model, which performs significantly worse for languages with elaborate morphology and relatively free word order. Recent models have repeatedly matched human-level performance on increasingly difficult benchmarks; that is, they have done so in English, using labelled datasets with thousands of examples and unlabelled data with millions of examples. In the process, as a community we have overfit to the characteristics and conditions of English-language data. In particular, by focusing on high-resource languages, we have prioritized methods that work well only when large amounts of labelled and unlabelled data are available.
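One way to see why count-based models such as n-gram language models struggle with rich morphology is to compare how many distinct word types the same meanings produce. The sketch below uses hypothetical word lists: the suffixes are loosely inspired by Turkish but are illustrative, not a real paradigm.

```python
from itertools import product

# Toy sketch (hypothetical word lists): a morphologically rich language
# packs into one word what English spreads over several words, so the
# same set of meanings produces many more distinct word types.

# English expresses case-like distinctions with separate words.
english = ["house", "houses", "in", "from", "to", "the"]

# Illustrative agglutinative forms: stem "ev" ("house") combined with
# number and case suffixes (loosely Turkish-like, not a real paradigm).
stem = "ev"
number = ["", "ler"]           # singular, plural
case = ["", "de", "den", "e"]  # bare, locative, ablative, dative
forms = [stem + n + c for n, c in product(number, case)]

print(len(set(english)), len(set(forms)))

# A single stem already yields 8 distinct types. A count-based n-gram
# model treats every unseen type as out-of-vocabulary, so the data
# needed to observe all forms grows rapidly with morphological richness.
```

With many stems and longer suffix chains, the number of possible types explodes, which is why models that rely on exact word counts degrade on such languages.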
In contrast, most current methods break down when applied to the data-scarce conditions that are common for most of the world’s languages. Even recent advances in pre-trained language models, which dramatically reduce the sample complexity of downstream tasks, require massive amounts of clean, unlabelled data, which is not available for most of the world’s languages. Doing well with little data is thus an ideal setting for testing the limitations of current models, and evaluation on low-resource languages is arguably their most impactful real-world application.
The cultural and normative perspective
The data our models are trained on not only reveals the characteristics of the specific language but also sheds light on cultural norms and common sense knowledge.
However, such common sense knowledge may differ between cultures. For instance, the notion of ‘free’ and ‘non-free’ goods varies across cultures: ‘free’ goods are those that anyone can use without seeking permission, such as salt in a restaurant.
Consequently, an agent that was only exposed to English data originating mainly in the Western world may be able to have a reasonable conversation with speakers from Western countries, but conversing with someone from a different culture may lead to pragmatic failures.
Beyond cultural norms and common sense knowledge, the data we train a model on also reflects the values of the underlying society. As NLP researchers and practitioners, we have to ask ourselves whether we want our NLP systems to exclusively share the values of a specific country or language community.
While this decision might be less important for current systems that mostly deal with simple tasks such as text classification, it will become more important as systems become more intelligent and need to deal with complex decision-making tasks.
The cognitive perspective
Human children can acquire any natural language and their language understanding ability is remarkably consistent across all kinds of languages. In order to achieve human-level language understanding, our models should be able to show the same level of consistency across languages from different language families and typologies.
Our models should ultimately be able to learn abstractions that are not specific to the structure of any language but that can generalize to languages with different properties.