Let's look at languages
Posted on 2024-05-28
We've finally implemented a language detector in GeoSeer! This allows us to detect the language of the metadata itself. Needless to say, the first thing we did was take a look at the stats!
Caveats
As ever there are some caveats:
- This was done automatically using a model of 97 languages.
- Most metadata is very short snippets of text.
- Many datasets and services don't have metadata (looking at you, data custodians!)
- Some mix languages in their metadata, usually the native language plus English. E.g. "Prefecture/محافظة, Province/المحافظة" gets picked up as Arabic.
- We filtered out standardised strings (typically English), e.g. "This is an OGC Compliant WFS".
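To make the last caveat concrete, here is a minimal sketch of how standardised strings might be stripped before text is handed to a detector. It is not GeoSeer's actual code: the `BOILERPLATE` list, the `MIN_WORDS` threshold, and the `prepare_text` function are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical list of standardised strings; the only example given in
# this post is the one below.
BOILERPLATE = ["This is an OGC Compliant WFS"]

# Assumed cut-off: below this many words we give up and report "No Language".
MIN_WORDS = 10


def prepare_text(metadata: str) -> Optional[str]:
    """Strip standardised strings; return None if too little text is left."""
    text = metadata
    for phrase in BOILERPLATE:
        text = re.sub(re.escape(phrase), " ", text, flags=re.IGNORECASE)
    text = " ".join(text.split())  # collapse leftover whitespace
    if len(text.split()) < MIN_WORDS:
        return None
    return text
```

Anything that comes back as `None` would be counted under "No Language"; everything else would go to whichever detector is in use.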
Services
Let's start by looking at services. The following table shows the number of services with metadata records in a given language.
Language | Number of Services | Percent of all Services |
---|---|---|
No Language | 122,683 | 25.34% |
German | 313,684 | 64.79% |
English | 20,398 | 4.21% |
French | 11,447 | 2.36% |
Spanish | 5,113 | 1.06% |
Polish | 2,972 | 0.61% |
Dutch | 2,259 | 0.47% |
Italian | 1,549 | 0.32% |
Portuguese | 1,095 | 0.23% |
Finnish | 609 | 0.13% |
Czech | 494 | 0.1% |
Catalan | 369 | - |
Swedish | 251 | - |
Slovak | 247 | - |
Croatian | 183 | - |
Estonian | 155 | - |
Danish | 138 | - |
Galician | 76 | - |
Icelandic | 62 | - |
Norwegian | 50 | - |
Latvian | 48 | - |
Slovenian | 35 | - |
Hungarian | 35 | - |
Chinese | 32 | - |
Thai | 27 | - |
Norwegian Nynorsk | 24 | - |
Greek | 21 | - |
Luxembourgish | 19 | - |
Basque | 11 | - |
Japanese | 10 | - |
Romanian | 7 | - |
Latin | 7 | - |
Norwegian Bokmål | 4 | - |
Welsh | 4 | - |
Bulgarian | 3 | - |
Occitan | 2 | - |
Lithuanian | 2 | - |
Macedonian | 1 | - |
Irish | 1 | - |
Faroese | 1 | - |
Subtotal | 484,128 | - |
There are a truly astonishing number of services in German. In fact, it's reasonable to say there are basically just two types of service: German ones, and those that don't have enough metadata to detect a language. Between them those two conditions account for slightly over 90% of all services.
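For anyone wanting to check the arithmetic, the percentages can be reproduced directly from the counts in the table. This is just a sketch; the function and variable names are ours for illustration.

```python
# Counts copied from the services table above.
SERVICES = {"No Language": 122_683, "German": 313_684}
TOTAL_SERVICES = 484_128  # the table's subtotal


def percent_of_total(count: int, total: int) -> float:
    """Share of the total, as a percentage rounded to two decimals."""
    return round(100 * count / total, 2)


german = percent_of_total(SERVICES["German"], TOTAL_SERVICES)
no_language = percent_of_total(SERVICES["No Language"], TOTAL_SERVICES)
combined = percent_of_total(sum(SERVICES.values()), TOTAL_SERVICES)
```

Here `german` comes out at 64.79, `no_language` at 25.34, and `combined` at 90.13, matching the "slightly over 90%" above.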
We also see our first mistakes here. The Luxembourgish and Faroese language services are mostly hosted on German domains, with a couple of Czech ones and a Belgian one thrown in. So chances are they're false positives.
And bad news: the ancient Romans probably don't have any spatial web services either. The Latin services seem to be a mixture of explicitly Hungarian ones ("Hungarian Biogeographical regions view service") and ones hosted on Slovenian, Slovak, and Czech domains. Linguistically interesting, but probably not time-travelling data.
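The domain check described above was done by eye, but it could be roughly automated. The sketch below flags a detection as suspicious when the service's top-level domain isn't one where that language would be expected. The `PLAUSIBLE_TLDS` mapping and the `looks_suspicious` function are hypothetical, and the mapping is deliberately tiny.

```python
from urllib.parse import urlparse

# Hypothetical mapping from a detected language to the country-code TLDs
# where finding it would be unsurprising. Illustrative only, not exhaustive.
PLAUSIBLE_TLDS = {
    "Luxembourgish": {"lu"},
    "Faroese": {"fo"},
    "German": {"de", "at", "ch"},
}


def looks_suspicious(language: str, service_url: str) -> bool:
    """True if the language is detected on a TLD we wouldn't expect it on."""
    host = urlparse(service_url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    plausible = PLAUSIBLE_TLDS.get(language)
    if plausible is None:
        return False  # no expectation recorded, so no opinion
    return tld not in plausible
```

With this, a "Faroese" service on a `.de` domain would be flagged for a second look, while a German one on `.at` would not. Generic domains like `.com` or `.org` would need separate handling.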
These false positives similarly hold true for the dataset data, but by no means are all of the low numbers above wrong. We really do have Turkish, Chinese (all via Taiwan), Thai and Japanese services, among others. Though these are dwarfed by the Indo-European languages.
Datasets
Next up, the number of datasets with metadata in a given language. These are non-distinct datasets, so if the same dataset is available via WMS and WFS, it may be counted twice (metadata can be different).
Language | Number of Datasets | Percent of all Datasets |
---|---|---|
No Language | 1,797,983 | 44.75% |
German | 1,396,847 | 34.76% |
English | 290,636 | 7.23% |
Portuguese | 110,387 | 2.75% |
French | 105,949 | 2.64% |
Spanish | 73,467 | 1.83% |
Dutch | 68,135 | 1.7% |
Italian | 47,914 | 1.19% |
Czech | 24,781 | 0.62% |
Polish | 19,053 | 0.47% |
Finnish | 17,859 | 0.44% |
Catalan | 10,609 | 0.26% |
Swedish | 10,390 | 0.26% |
Japanese | 9,549 | 0.24% |
Danish | 8,362 | 0.21% |
Estonian | 4,317 | 0.11% |
Greek | 4,047 | 0.1% |
Slovak | 2,586 | - |
Icelandic | 1,791 | - |
Croatian | 1,785 | - |
Chinese | 1,766 | - |
Hungarian | 1,570 | - |
Russian | 1,047 | - |
Slovenian | 964 | - |
Latvian | 753 | - |
Norwegian | 666 | - |
Basque | 654 | - |
Thai | 635 | - |
Romanian | 629 | - |
Korean | 625 | - |
Bulgarian | 312 | - |
Galician | 256 | - |
Luxembourgish | 187 | - |
Afrikaans | 174 | - |
Lithuanian | 141 | - |
Malagasy | 134 | - |
Latin | 114 | - |
Welsh | 111 | - |
Occitan | 108 | - |
Walloon | 106 | - |
Norwegian Nynorsk | 101 | - |
Aragonese | 89 | - |
Maltese | 62 | - |
Irish | 57 | - |
Swahili | 51 | - |
Albanian | 48 | - |
Esperanto | 46 | - |
Javanese | 39 | - |
Filipino | 33 | - |
Kinyarwanda | 31 | - |
Norwegian Bokmål | 27 | - |
Bosnian | 27 | - |
Macedonian | 26 | - |
Xhosa | 23 | - |
Breton | 21 | - |
Hebrew | 19 | - |
Indonesian | 18 | - |
Kurdish | 17 | - |
Volapük | 16 | - |
Quechua | 16 | - |
Vietnamese | 9 | - |
Faroese | 9 | - |
Haitian Creole | 6 | - |
Ukrainian | 4 | - |
Turkish | 4 | - |
Azerbaijani | 4 | - |
Malay | 3 | - |
Kyrgyz | 2 | - |
Georgian | 2 | - |
Arabic | 2 | - |
Subtotal | 4,018,211 | - |
While German is still clearly the most common detected language for layer metadata, its lead has come down from vast (64.79% of services) to merely huge (34.76% of datasets).
It's also disappointing to see that nearly 45% of datasets don't have enough metadata to make a language determination. Given the tool we're using is happy to take a guess at 10 words or so, that says something about the quality of metadata we're looking at.
Different Strategies
Let's finish with one table that highlights the differing strategies that can be seen between individual countries. The table below shows the number of datasets per service for each language.
Language | # Datasets per Service | # Services | # Datasets |
---|---|---|---|
No Language Detected | 14.66 | 122,683 | 1,797,983 |
Japanese | 954.9 | 10 | 9,549 |
Greek | 192.71 | 21 | 4,047 |
Portuguese | 100.81 | 1,095 | 110,387 |
Danish | 60.59 | 138 | 8,362 |
Czech | 50.16 | 494 | 24,781 |
Swedish | 41.39 | 251 | 10,390 |
Italian | 30.93 | 1,549 | 47,914 |
Dutch | 30.16 | 2,259 | 68,135 |
Finnish | 29.33 | 609 | 17,859 |
Catalan | 28.75 | 369 | 10,609 |
Estonian | 27.85 | 155 | 4,317 |
Slovenian | 27.54 | 35 | 964 |
Thai | 23.52 | 27 | 635 |
Spanish | 14.37 | 5,113 | 73,467 |
English | 14.25 | 20,398 | 290,636 |
French | 9.26 | 11,447 | 105,949 |
Polish | 6.41 | 2,972 | 19,053 |
German | 4.45 | 313,684 | 1,396,847 |
Here we see why there are so many German services: it's evident they're using a service-heavy architecture in their OGC deployments. As a prime example of this, the domain with the most services is German (geodienste.komm.one), hosting no fewer than 184,358 services. Second place, with a "mere" 29,071 services, is also German.
At the other end of the spectrum, we have Japan, with an average of roughly 955 datasets per service. It's clear that Japan's geospatial strategy is to be highly centralised. Greece, Denmark, and the Czech Republic are all similarly focused on being centralised.
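The datasets-per-service figure is simply the dataset count divided by the service count for each language. A quick sketch using a few rows from the tables (the names here are ours for illustration):

```python
# (services, datasets) pairs copied from the tables above.
COUNTS = {
    "Japanese": (10, 9_549),
    "Greek": (21, 4_047),
    "German": (313_684, 1_396_847),
}


def datasets_per_service(services: int, datasets: int) -> float:
    """Average number of datasets offered by each service."""
    return round(datasets / services, 2)


ratios = {
    language: datasets_per_service(services, datasets)
    for language, (services, datasets) in COUNTS.items()
}
```

This gives 954.9 for Japanese, 192.71 for Greek, and 4.45 for German, as in the table. Bear in mind that a ratio built on only 10 services, as with Japanese, is far more sensitive to a single large service than one built on hundreds of thousands.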
The above uses languages as a proxy for country, which holds true for the above languages (though the German ones could also be Swiss or Austrian, but in this case we don't believe they are).
Portuguese is another matter. We can't make the same claim for Portugal because most of the Portuguese datasets are actually coming from .br domains, at a ratio of about 7 Brazilian datasets for 1 dataset from Portugal. This also makes it hard to draw country-level conclusions about other widely spread languages (English, French, Spanish in particular).
Conclusion
In the above, we've used language as a proxy for country. In the past, we've done similar investigations using country-level domains, and seen similar conclusions (not posted to this blog). So it's nice to see things verified using a completely separate mechanism. In summary:
- People and organisations remain terrible at creating metadata.
- Germany has a very service-focused architecture, probably a consequence of governmental policy.
- Japan and Greece are very centralised in their geospatial services.
- Language detection is an interesting if difficult problem, especially on small samples of text.