This is yet another Java Locale post caused by me spending too much time wrestling with i18n and l10n in Java. It follows on from an earlier post about the large number of new languages supported in Java 9.
This article is based on a small test scribble I wrote to list all the available countries and languages, displaying their names in each of the available languages.
For example, in UK English we call the language our nearest continental neighbours speak “French”, but for some strange reason they choose to call it “Français”. When it comes to the name of their country, we both choose to call it “France” (although obviously they pronounce it wrongly ;-) ). However, their large neighbour has a different opinion and calls the country “Frankreich” (well at least that’s what Java 8 thinks).
This kind of thing happens across the world due to all sorts of reasons to do with, essentially, people being complicated. In earlier versions of Java, the developers took a fairly common approach to all this complexity and either ignored it, or made a half-hearted attempt to address it, making sure to get the bits near them fairly correct and not worrying too much about all the stuff on the other side of the planet.
But with Java 9 they decided to start making a serious attempt to handle i18n and all that language-related awkwardness.
The code used as the source of this article is here but basically it’s:
- get a Unicode-capable output stream
- get a sorted list of all the available locales
- get a sorted list of all the available countries
- for every locale
  - print the locale.getDisplayName() in English and its own language
  - for every country
    - print the country.getDisplayCountry() in the current locale
- print the count of locales and countries
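The steps above can be sketched roughly like this. It is a minimal reconstruction rather than the actual test scribble: the class name `LocaleLister`, the helper names, and the choice to sort locales by language tag are all my assumptions, and the output format simply mimics the lines quoted in this post. It sticks to Java 8 APIs so it can be compiled once and run on later JVMs.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Locale;

public class LocaleLister {

    // All locales this JVM knows about, sorted by language tag (assumed
    // ordering) so that the output of different JVMs can be diffed.
    static Locale[] sortedLocales() {
        Locale[] locales = Locale.getAvailableLocales();
        Arrays.sort(locales, Comparator.comparing(Locale::toLanguageTag));
        return locales;
    }

    // All ISO 3166 two-letter country codes, sorted alphabetically.
    static String[] sortedCountries() {
        String[] countries = Locale.getISOCountries();
        Arrays.sort(countries);
        return countries;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Force UTF-8 so non-Latin names survive whatever the platform default is.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        Locale[] locales = sortedLocales();
        String[] countries = sortedCountries();

        for (Locale locale : locales) {
            // Locale name in English, then in the locale's own language.
            out.println("Language: " + locale.getDisplayName(Locale.ENGLISH)
                    + " (" + locale.getDisplayName(locale) + ")");
            for (String code : countries) {
                // A Locale with only a country part stands in for the country.
                Locale country = new Locale("", code);
                out.println("Country Code: " + code
                        + ", Country Name:" + country.getDisplayCountry(locale));
            }
        }
        out.println("Locale count: " + locales.length);
        out.println("Country count: " + countries.length);
    }
}
```

Redirecting the output of each JVM to its own file gives you something you can feed straight into a diff tool.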
The raw results are here:
I compiled that with OpenJDK’s Java 8 version, then ran it in OpenJDK’s JVMs for Java 8, 9 and 10. I took the outputs from each and diffed them to see what has changed between Java versions.
Note that these results are from the OpenJDK implementations of the JVM, running on a Debian PC. Different JVMs for different OSs on different hardware may have different locale information. Do your own tests if you need to know for sure!
Those result files are fairly large, and to be honest pretty boring too, but there are a few interesting things to be aware of if you’re dealing with localised country names in Java.
1. My word, Java 9 added a lot of Locales!
- Java 8 Locale count: 160
- Java 9 Locale count: 736
This is just repeating what the earlier article said. It explains why the results files are so much larger for Java 9 and 10 than for Java 8.
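If you want to check the count on your own JVM, something like the following should do. The class and method names are made up for illustration, and the number you get will depend on your JVM vendor, version and configuration, so don't expect it to match mine exactly.

```java
import java.util.Locale;

public class LocaleCount {

    // Number of locales the running JVM reports as available.
    static int availableLocaleCount() {
        return Locale.getAvailableLocales().length;
    }

    public static void main(String[] args) {
        System.out.println("Java " + System.getProperty("java.version")
                + " Locale count: " + availableLocaleCount());
    }
}
```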
2. Java 9 did a lot of translation work
In Java 8, the vast majority of language-country combinations end up showing the country name in English; maybe because no one asked, or no one knew what translation to use, or they just didn’t get round to it.
In Java 9, they’ve tried to get translations for the majority of combinations, using the character set of the appropriate language. This is a good thing, but be aware you’ll be needing a larger number of fonts if your code needs to localise country names into multiple languages.
Just to pick an early example from those results:
Java 8:
Language: Arabic (العربية)
Country Code: AD, Country Name:Andorra

Java 9:
Language: Arabic (العربية)
Country Code: AD, Country Name:أندورا
In English we have a bad habit of anglicising names. We haven’t really cared if you call your country “Россия”, we called it “Russia”. Similarly, in Java 8, if you want that country’s name in Arabic, you get “Russia”. However, in Java 9 you get “روسيا”.
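To try a single lookup like this yourself, here is a small sketch. The `countryName` helper and the class name are hypothetical; the only real API calls are `Locale.forLanguageTag` and `Locale.getDisplayCountry`. I've deliberately left out expected output for the Arabic lines, because what you see depends on which Java version you run it on.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.util.Locale;

public class CountryNameDemo {

    // Name of the given ISO country code as written in the given language.
    static String countryName(String countryCode, String languageTag) {
        Locale country = new Locale("", countryCode);
        return country.getDisplayCountry(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // UTF-8 output so the Arabic text is not mangled by the platform default.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("RU in Arabic: " + countryName("RU", "ar"));
        out.println("AD in Arabic: " + countryName("AD", "ar"));
        out.println("FR in German: " + countryName("FR", "de"));
    }
}
```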
3. Surprising untranslated country names
In Java 9 there are a few languages that stand out as still largely using English naming. For example, Assamese mostly has the same country names as English, except for a few countries, and I can’t tell the reasoning for which country names are localised and which are anglicised (e.g. Bangladesh is next door to Assam but its name is in English, while Germany is in Assamese despite being half a planet away).
We can’t assume that countries near each other that speak differing languages will have translations of each other’s names in each language.
4. Doing half a job in Java 8
In addition to the load of languages that pretty much had no translations for any country names in Java 8, there are a few languages where there was a surprising sort of half attempt made.
For example, in the Thai language section of the Java 8 result set (starts about line 36650) we can see a fair number of the countries are translated - but maybe 10-15% are still anglicised. That’s going to cause annoyance if you’re trying to use the country names in Thai; it’s pot-luck whether you’ll get a good one or not.
In the Java 9 result set we can see that every single one is translated now - even the new and slightly odd outlying country cases.
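Rather than eyeballing thousands of lines, you can get a rough count of the anglicised names for a language with a sketch like the one below. The class and method names are my own invention, and the check is only a heuristic: it counts any country whose localised name is identical to the English one, so it will over-count for languages that legitimately share a spelling with English (“France” in French, say).

```java
import java.util.Locale;

public class UntranslatedCountries {

    // Counts countries whose display name in the given language is
    // character-for-character the same as the English display name.
    static int countAnglicised(Locale language) {
        int count = 0;
        for (String code : Locale.getISOCountries()) {
            Locale country = new Locale("", code);
            String english = country.getDisplayCountry(Locale.ENGLISH);
            String localised = country.getDisplayCountry(language);
            if (localised.equals(english)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Locale thai = Locale.forLanguageTag("th");
        System.out.println("Country names identical to English in Thai: "
                + countAnglicised(thai) + " of " + Locale.getISOCountries().length);
    }
}
```

For a non-Latin-script language like Thai, any match is a fairly strong hint that the name was never translated.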
5. Mix and matching character sets
This is a general gotcha across any kind of localised wording of names from other languages. The character set for one language may not include all the characters needed to display the names used for a country which speaks a different language.
That means if you’re using language A to display the name of a country where they speak language B, you may need characters from the alphabet of both A and B.
Just to pick an odd example, from the Java 8 result set:
Language: Ewe (Eʋegbe)
Country Code: AX, Country Name:Åland ƒudomekpo nutome
That fancy ‘A’ is from the Swedish name for the islands and that fancy ‘f’ is from the Ghanaian language.
This isn’t just a gotcha for naming countries, but more generally for “borrowed” words. For example English occasionally borrows extra characters from French for various “continental” words and Japanese borrows quite a lot of characters from Chinese.
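One way to see what a given name demands of your fonts is to list the Unicode blocks its characters come from. The sketch below does that with `Character.UnicodeBlock.of`; the class and method names are hypothetical, and knowing the block is only a hint about font coverage, not a guarantee. The Ewe example is written with `\u` escapes so the source file encoding doesn't matter.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class BlocksUsed {

    // Names of the Unicode blocks used by the non-whitespace characters
    // of the text, in order of first appearance.
    static Set<String> unicodeBlocks(String text) {
        Set<String> blocks = new LinkedHashSet<>();
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .forEach(cp -> blocks.add(String.valueOf(Character.UnicodeBlock.of(cp))));
        return blocks;
    }

    public static void main(String[] args) {
        // "Åland ƒudomekpo nutome": \u00C5 is the fancy 'A', \u0192 the fancy 'f'.
        String name = "\u00C5land \u0192udomekpo nutome";
        System.out.println(unicodeBlocks(name));
        // prints [LATIN_1_SUPPLEMENT, BASIC_LATIN, LATIN_EXTENDED_B]
    }
}
```

So even this all-Latin-looking name pulls characters from three different blocks.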
By the way, if you’re seeing blocks or question marks for any of the fancy characters on this page - update your browser, it may be feeling inadequate in the font department ;-)
6. No guessing! The rules aren’t simple
There’s a temptation when you start to look at translated words which you know in other languages to start extrapolating from the words you know and the translations you’re seeing. For example, look at the Asu language in the Java 9 result set (starts about line 8000).
Country Code: AT, Country Name:Authtria
Country Code: AU, Country Name:Authtralia
Country Code: BW, Country Name:Botthwana
Country Code: ES, Country Name:Hithpania
Clearly, they don’t have the ‘S’ sound/symbol, so they’re using a ‘th’ instead. We can think of it like they have a lisp. Can’t we?
Country Code: BQ, Country Name:Bonaire, Sint Eustatius and Saba
Country Code: CX, Country Name:Christmas Island
Country Code: GS, Country Name:South Georgia And The South Sandwich Islands
Country Code: GW, Country Name:Ginebisau
Yeah, those are quite a collection of ‘s’ symbols and sounds. And in case you’re thinking those are just untranslated country names, GW in English is written “Guinea-Bissau”, so “Ginebisau” is the Asu translation.
Just as another example of guessing just not working, take a look at the Yoruba language section in the Java 9 result set (starts about line 178250). Almost every country name starts with “Orílẹ́ède” (e.g. Britain is “Orílẹ́ède Omobabirin”).
Apart from the ones that don’t:
Country Code: AQ, Country Name:Antarctica
Country Code: IM, Country Name:Isle Of Man
Country Code: UM, Country Name:United States Minor Outlying Islands
Country Code: US, Country Name:Orílẹ́ède Orilẹede Amerika
Yes, “United States” gets a translation on its own, but not as part of another country name.
So guessing at the rules just isn’t going to work. Look up the words you need to know with some vaguely authoritative source (like ISO, or a native speaker).
7. Java 9 to Java 10 - ho hum
In contrast to the masses of work and changes done between Java 8 and Java 9, there were only 2 changes between Java 9 and Java 10 that my test scribble picked up.
These were the 2 new language variants added:
Language: Serbian (Kosovo)
Language: Chinese (Macau SAR China)
Each comes with its own set of translations for the 249 countries Java knows about.