Language detection in microblogs


About

The set of languages a user understands is an attribute of the user profile which we derive from the user's posts on Twitter.

Different languages provide us with different amounts of information about the user. For instance, we learn little about a user who tweets exclusively in English, because English is spoken across much of the world. At the opposite end of the spectrum, we can infer a lot more about a user who regularly tweets in Irish Gaelic, as this language is spoken in only a very small part of the world.

The most common approach to language detection is based on the frequencies of characters and character combinations. For instance, very common two-letter combinations in English are "th", "he", "an" and "in". In German, on the other hand, the most common combinations include "er", "en", "ch" and "de" [2]. For each language that we want to detect, a frequency-based profile of letter combinations is generated from large text corpora such as Wikipedia, which is available in 285 different languages [3]. Once all language profiles have been created, we compute the two-letter frequency profile of a given piece of text (such as a tweet) and assign it the language whose profile is most similar.
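The profile-matching idea described above can be sketched in Python. This is a minimal illustration and not the method used by the wrapped library [1]: the function names, the two tiny toy corpora, and the choice of cosine similarity as the similarity measure are all assumptions made for this example. Real profiles would be built from large corpora such as Wikipedia.

```python
import math
from collections import Counter


def bigram_profile(text):
    """Return relative frequencies of adjacent letter pairs within words."""
    counts = Counter()
    for word in text.lower().split():
        # Keep letters only, so punctuation and digits do not form bigrams.
        letters = "".join(ch for ch in word if ch.isalpha())
        counts.update(letters[i:i + 2] for i in range(len(letters) - 1))
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles (assumed measure)."""
    dot = sum(value * q.get(bigram, 0.0) for bigram, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def detect_language(text, profiles):
    """Pick the language whose profile is most similar to the text's profile."""
    text_profile = bigram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(text_profile, profiles[lang]))


# Toy corpora for illustration only; far too small for real use.
TOY_CORPORA = {
    "en": ("the weather in the north is then and there rather nice and "
           "the other thing is that he and she are in the garden"),
    "de": ("der kleine hund und die katze gehen durch den garten und "
           "ich mache noch schnell einen kuchen denn er schmeckt"),
}
PROFILES = {lang: bigram_profile(text) for lang, text in TOY_CORPORA.items()}

print(detect_language("the other brother is in the north", PROFILES))
print(detect_language("ich mache den kuchen durch den garten", PROFILES))
```

With such small corpora the result is only reliable for text that resembles the training sentences. Short tweets in closely related languages would need much larger profiles and possibly longer character combinations.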

References:

[1] Language detection library.
[2] Decrypted Secrets: Methods and Maxims of Cryptology, by Friedrich Ludwig Bauer, Springer Verlag, 2010, page 307.
[3] List of all available Wikipedias.
[4] Overview of ISO 639-1 language codes.

Example scenarios

Within the ImREAL project, we rely on languages as one (of potentially several) indicators of a learner's present and/or past locations as well as of the learner's intercultural awareness.

One potential application of our language-based profiling can be found in the Buddy use case: here, university students in Erlangen, Germany, receive intercultural awareness training before they are paired with foreign (exchange) students in order to help them settle into university life in Germany. Depending on the languages a learner tweets in, we can infer something about the learner's familiarity with particular cultures. This knowledge is used by the simulator to adapt the training modules (e.g. by excluding training material about a culture the user is predicted to be familiar with).

Services

We created two web services:

TwitterLanguageDetection: The service collects the user's 200 most recent tweets and determines the language of each one. The output lists all observed languages together with their frequencies. Existing language identification libraries already detect languages with high accuracy. For this reason, we wrap an existing open-source Java library [1] and expose it as a Twitter-based language detection web service.
- Barack Obama on Twitter
- Volkskrant on Twitter (a Dutch newspaper)
- Le Monde on Twitter (a French newspaper)
CulturalCompetencyLng: Based on the number of languages detected in a user's tweets, a cultural competency score of 1.0, 2.0 or 3.0 is returned. Only languages that occur in at least 5 tweets are considered: 1 detected language yields a score of 1, 2-3 detected languages yield a score of 2, and any other count yields a score of 3.
- Shakira on Twitter
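The scoring rule of CulturalCompetencyLng can be sketched as follows. This is a hedged illustration rather than the service's actual code: the function names and the input format (one ISO 639-1 code per tweet, as a language detector might return) are assumptions. The description does not say what happens when no language reaches the 5-tweet threshold, so the sketch assumes the lowest score in that case.

```python
from collections import Counter

MIN_TWEETS_PER_LANGUAGE = 5  # threshold stated in the service description


def language_frequencies(tweet_languages):
    """Count how many tweets were detected in each language."""
    return Counter(tweet_languages)


def cultural_competency_score(tweet_languages, min_tweets=MIN_TWEETS_PER_LANGUAGE):
    """Map the number of sufficiently frequent languages to 1.0, 2.0 or 3.0."""
    counts = language_frequencies(tweet_languages)
    qualifying = sum(1 for n in counts.values() if n >= min_tweets)
    if qualifying <= 1:
        # Assumption: zero qualifying languages is treated like one.
        return 1.0
    if qualifying <= 3:
        return 2.0
    return 3.0


# Hypothetical detector output for 200 tweets.
detected = ["en"] * 150 + ["es"] * 40 + ["pt"] * 6 + ["fr"] * 4
print(language_frequencies(detected))
print(cultural_competency_score(detected))
```

In this hypothetical input, French appears in only 4 tweets and is ignored, so three languages qualify.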
