The GeoSeer Blog


Let's look at languages

Posted on 2024-05-28

We've finally implemented a language detector in GeoSeer! This allows us to detect the language of the metadata itself. Needless to say, the first thing we did was take a look at the stats!

Caveats

As ever there are some caveats:

  • This was done automatically using a model of 97 languages.
  • Most metadata is very short snippets of text.
  • Many datasets and services don't have metadata (looking at you, data custodians!)
  • Some mix languages in their metadata; usually native plus English. E.g.: Prefecture/محافظة, Province/المحافظة (gets picked up as Arabic)
  • We filtered out standardised strings (typically English). E.g.: "This is an OGC Compliant WFS"
Our own experimenting showed the model is really good even on small text strings, which is really important here. However, when you're analysing millions of records with the above limitations, even a very low false-positive rate can become a lot of errors. More on them in a bit.
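
To make the caveats above concrete, here is a minimal sketch of the kind of pre-filtering they describe: strip standardised strings, then only hand text to the detector if enough of it is left. This is not GeoSeer's actual code. The pattern list, the `MIN_WORDS` threshold, and the `detector` callable are all assumptions made for illustration; the post doesn't name the library or the exact rules used.

```python
import re

# Hypothetical list of standardised strings to strip before detection.
# The real filter list is not given in this post.
BOILERPLATE_PATTERNS = [
    re.compile(r"this is an ogc compliant w[mfc]t?s", re.IGNORECASE),
]

# Assumed minimum number of words needed before we trust a guess.
MIN_WORDS = 3


def prepare_for_detection(text):
    """Strip standardised strings; return None if too little text remains."""
    if not text:
        return None
    cleaned = text
    for pattern in BOILERPLATE_PATTERNS:
        cleaned = pattern.sub(" ", cleaned)
    cleaned = " ".join(cleaned.split())  # collapse whitespace
    if len(cleaned.split()) < MIN_WORDS:
        return None
    return cleaned


def detect_language(text, detector):
    """Run `detector` (any callable returning a language code) on cleaned text."""
    cleaned = prepare_for_detection(text)
    if cleaned is None:
        return "No Language"
    return detector(cleaned)
```

With a filter like this, a record whose only metadata is "This is an OGC Compliant WFS" ends up in the "No Language" bucket rather than being counted as English.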

Services

Let's start by looking at services. The following table shows the number of services with metadata records in a given language.

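| Language | Number of Services | Percent of all Services |
|---|---|---|
| No Language | 122,683 | 25.34% |
| German | 313,684 | 64.79% |
| English | 20,398 | 4.21% |
| French | 11,447 | 2.36% |
| Spanish | 5,113 | 1.06% |
| Polish | 2,972 | 0.61% |
| Dutch | 2,259 | 0.47% |
| Italian | 1,549 | 0.32% |
| Portuguese | 1,095 | 0.23% |
| Finnish | 609 | 0.13% |
| Czech | 494 | 0.1% |
| Catalan | 369 | - |
| Swedish | 251 | - |
| Slovak | 247 | - |
| Croatian | 183 | - |
| Estonian | 155 | - |
| Danish | 138 | - |
| Galician | 76 | - |
| Icelandic | 62 | - |
| Norwegian | 50 | - |
| Latvian | 48 | - |
| Slovenian | 35 | - |
| Hungarian | 35 | - |
| Chinese | 32 | - |
| Thai | 27 | - |
| Norwegian Nynorsk | 24 | - |
| Greek | 21 | - |
| Luxembourgish | 19 | - |
| Basque | 11 | - |
| Japanese | 10 | - |
| Romanian | 7 | - |
| Latin | 7 | - |
| Norwegian Bokmål | 4 | - |
| Welsh | 4 | - |
| Bulgarian | 3 | - |
| Occitan | 2 | - |
| Lithuanian | 2 | - |
| Macedonian | 1 | - |
| Irish | 1 | - |
| Faroese | 1 | - |
| Subtotal | 484,128 | - |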

There are a truly astonishing number of services in German. In fact, it's reasonable to say there are basically just two types of service: German ones, and those that don't have enough metadata to detect a language. Between them those two conditions account for slightly over 90% of all services.
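
As a quick check of that "slightly over 90%" figure, the arithmetic can be reproduced from the counts in the table above. This is just a small illustrative snippet; the helper name `share` is ours.

```python
# Counts taken from the services table above.
total_services = 484_128
no_language = 122_683
german = 313_684


def share(count, total):
    """Percentage of `total` represented by `count`, to two decimal places."""
    return round(100 * count / total, 2)


german_share = share(german, total_services)
no_language_share = share(no_language, total_services)
combined_share = share(german + no_language, total_services)

print(german_share, no_language_share, combined_share)
```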

We also see our first mistakes here. The Luxembourgish and Faroese language services are mostly hosted on German domains, with a couple of Czech ones and a Belgian one thrown in. So chances are they're false positives.

And bad news: the ancient Romans probably don't have any spatial web services either: The Latin services seem to be a mixture of explicitly Hungarian ("Hungarian Biogeographical regions view service"), as well as Slovenian, Slovakian, and Czech domains. Linguistically interesting, but probably not time-travelling data.

Similar false positives show up in the dataset figures, but by no means are all of the low numbers above wrong. We really do have Turkish, Chinese (all via Taiwan), Thai and Japanese services, among others, though these are dwarfed by the Indo-European languages.

Datasets

Next up, the number of datasets with metadata in a given language. These are non-distinct datasets, so if the same dataset is available via WMS and WFS, it may be counted twice (metadata can be different).

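| Language | Number of Datasets | Percent of all Datasets |
|---|---|---|
| No Language | 1,797,983 | 44.75% |
| German | 1,396,847 | 34.76% |
| English | 290,636 | 7.23% |
| Portuguese | 110,387 | 2.75% |
| French | 105,949 | 2.64% |
| Spanish | 73,467 | 1.83% |
| Dutch | 68,135 | 1.7% |
| Italian | 47,914 | 1.19% |
| Czech | 24,781 | 0.62% |
| Polish | 19,053 | 0.47% |
| Finnish | 17,859 | 0.44% |
| Catalan | 10,609 | 0.26% |
| Swedish | 10,390 | 0.26% |
| Japanese | 9,549 | 0.24% |
| Danish | 8,362 | 0.21% |
| Estonian | 4,317 | 0.11% |
| Greek | 4,047 | 0.1% |
| Slovak | 2,586 | - |
| Icelandic | 1,791 | - |
| Croatian | 1,785 | - |
| Chinese | 1,766 | - |
| Hungarian | 1,570 | - |
| Russian | 1,047 | - |
| Slovenian | 964 | - |
| Latvian | 753 | - |
| Norwegian | 666 | - |
| Basque | 654 | - |
| Thai | 635 | - |
| Romanian | 629 | - |
| Korean | 625 | - |
| Bulgarian | 312 | - |
| Galician | 256 | - |
| Luxembourgish | 187 | - |
| Afrikaans | 174 | - |
| Lithuanian | 141 | - |
| Malagasy | 134 | - |
| Latin | 114 | - |
| Welsh | 111 | - |
| Occitan | 108 | - |
| Walloon | 106 | - |
| Norwegian Nynorsk | 101 | - |
| Aragonese | 89 | - |
| Maltese | 62 | - |
| Irish | 57 | - |
| Swahili | 51 | - |
| Albanian | 48 | - |
| Esperanto | 46 | - |
| Javanese | 39 | - |
| Filipino | 33 | - |
| Kinyarwanda | 31 | - |
| Norwegian Bokmål | 27 | - |
| Bosnian | 27 | - |
| Macedonian | 26 | - |
| Xhosa | 23 | - |
| Breton | 21 | - |
| Hebrew | 19 | - |
| Indonesian | 18 | - |
| Kurdish | 17 | - |
| Volapük | 16 | - |
| Quechua | 16 | - |
| Vietnamese | 9 | - |
| Faroese | 9 | - |
| Haitian Creole | 6 | - |
| Ukrainian | 4 | - |
| Turkish | 4 | - |
| Azerbaijani | 4 | - |
| Malay | 3 | - |
| Kyrgyz | 2 | - |
| Georgian | 2 | - |
| Arabic | 2 | - |
| Subtotal | 4,018,211 | - |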

While German layer metadata is clearly the most numerous, the difference has come down from vast, at 64.79% of services, to merely huge, at 34.76% of datasets.

It's also disappointing to see that almost 45% of datasets don't have enough metadata to make a language determination. Given the tool we're using is happy to take a guess at 10 words or so, that says something about the quality of metadata we're looking at.

Different Strategies

Let's finish with one table that highlights the differing strategies that can be seen between individual countries. The table below shows the number of datasets per service that exist for each language.

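| Language | # Datasets per Service | # Services | # Datasets |
|---|---|---|---|
| No Language Detected | 14.66 | 122,683 | 1,797,983 |
| Japanese | 954.9 | 10 | 9,549 |
| Greek | 192.71 | 21 | 4,047 |
| Portuguese | 100.81 | 1,095 | 110,387 |
| Danish | 60.59 | 138 | 8,362 |
| Czech | 50.16 | 494 | 24,781 |
| Swedish | 41.39 | 251 | 10,390 |
| Italian | 30.93 | 1,549 | 47,914 |
| Dutch | 30.16 | 2,259 | 68,135 |
| Finnish | 29.33 | 609 | 17,859 |
| Catalan | 28.75 | 369 | 10,609 |
| Estonian | 27.85 | 155 | 4,317 |
| Slovenian | 27.54 | 35 | 964 |
| Thai | 23.52 | 27 | 635 |
| Spanish | 14.37 | 5,113 | 73,467 |
| English | 14.25 | 20,398 | 290,636 |
| French | 9.26 | 11,447 | 105,949 |
| Polish | 6.41 | 2,972 | 19,053 |
| German | 4.45 | 313,684 | 1,396,847 |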

Here we see why there are so many German services: it's evident they're using a service-heavy architecture in their OGC deployments. As a prime example of this, the domain with the most services is German (geodienste.komm.one), hosting no fewer than 184,358 services. Second place, with a "mere" 29,071 services, is also German.

At the other end of the spectrum, we have Japan, with an average of almost 955 datasets per service. It's clear that Japan's geospatial strategy is to be highly centralised. Greece, Denmark, and the Czech Republic are all similarly focused on being centralised.
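
The datasets-per-service column is simply the dataset count divided by the service count for each language. A small sketch using a few rows from the table above (the variable names are ours):

```python
# (services, datasets) per language, copied from the table above.
counts = {
    "Japanese": (10, 9_549),
    "Greek": (21, 4_047),
    "Portuguese": (1_095, 110_387),
    "German": (313_684, 1_396_847),
}

# Datasets per service, rounded to two decimal places as in the table.
ratios = {
    language: round(datasets / services, 2)
    for language, (services, datasets) in counts.items()
}

# Most centralised (many datasets per service) first.
ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)

for language, ratio in ranked:
    print(f"{language}: {ratio}")
```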

The above uses languages as a proxy for country, which holds true for the above languages (though the German ones could also be Swiss or Austrian, but in this case we don't believe they are).
Portuguese is another matter. We can't make the same claim for Portugal because most of the Portuguese datasets are actually coming from .br domains, at a ratio of about 7 Brazilian datasets for 1 dataset from Portugal. This also makes it hard to draw country-level conclusions about other widely spread languages (English, French, Spanish in particular).
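
A split like the Brazil/Portugal one can be approximated by grouping records on the country-code top-level domain of the service URL. The following is only a sketch of that idea, not the query we actually ran; the example URLs are made up, and treating any two-letter TLD as a country code is a simplifying assumption.

```python
from collections import Counter
from urllib.parse import urlparse


def country_tld(url):
    """Return the two-letter TLD of a URL's host, or None if it has none."""
    host = urlparse(url).hostname or ""
    last_label = host.rsplit(".", 1)[-1].lower()
    if len(last_label) == 2 and last_label.isalpha():
        return last_label
    return None


# Hypothetical service URLs for datasets detected as Portuguese.
urls = [
    "https://geo.example.br/wms",
    "https://mapas.example.gov.br/wfs",
    "https://sig.example.pt/wms",
    "https://data.example.com/wms",  # generic TLD: no country can be inferred
]

per_country = Counter()
for url in urls:
    tld = country_tld(url)
    if tld is not None:
        per_country[tld] += 1

print(per_country)
```

Generic domains (.com, .org and so on) simply drop out of the count, which is one reason this approach only gives a rough picture.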

Conclusion

In the above, we've used language as a proxy for country. In the past, we've done similar investigations using country-level domains, and seen similar conclusions (not posted to this blog). So it's nice to see things verified using a completely separate mechanism. In summary:

  • People and organisations remain terrible at creating metadata.
  • Germany has a very service-focused architecture, probably a consequence of governmental policy.
  • Japan and Greece are very centralised in their geospatial services.
  • Language detection is an interesting if difficult problem, especially on small samples of text.

Plotting Dataset Extents

Posted on 2019-10-31

Back at the start of September we released some historical statistics and, almost as an afterthought, mentioned the new extents plots. In this post we explore those dataset extents plots in more detail.

Extent Plots; What Are They?

Put simply, every dataset that GeoSeer indexes follows various standards which say that the dataset should declare a rectangular bounding box representing its spatial extent, i.e. the area the dataset covers. So if it's a dataset covering Germany, it should declare a box covering Germany, but because Germany isn't a rectangle, the box will overlap with surrounding countries to various degrees, including all of tiny Luxembourg.

What we've done is take all of those extent boxes and stack them up, overlapping them to create what we call extent plots, which show how many datasets cover a given area. As you can imagine there are a lot of caveats to this process (as well as to the dataset extents themselves, like the Luxembourg example above); these are covered in detail on the datasets extent plots page.

What do they show?

One important caveat to remember is that the plots only show dataset extents that are entirely within the plot extent. The exception to this is the Global plot where we're also excluding the 191,052 global datasets (about 10% of all datasets) as they add nothing to the plot. It's interesting to note that ~10% of all datasets are global though.
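
That selection rule can be sketched as a simple containment test on `(min_lon, min_lat, max_lon, max_lat)` tuples. This is an illustration rather than our production code: the function names, the 1-degree tolerance used to decide what counts as "global", and the rough example coordinates are all assumptions.

```python
GLOBAL_EXTENT = (-180.0, -90.0, 180.0, 90.0)


def is_within(bbox, extent):
    """True if bbox lies entirely inside extent (both are lon/lat boxes)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    e_min_lon, e_min_lat, e_max_lon, e_max_lat = extent
    return (min_lon >= e_min_lon and min_lat >= e_min_lat
            and max_lon <= e_max_lon and max_lat <= e_max_lat)


def is_global(bbox, tol=1.0):
    """True if bbox covers (almost) the whole world; tol is an assumed margin."""
    return all(abs(a - b) <= tol for a, b in zip(bbox, GLOBAL_EXTENT))


def select_for_plot(bboxes, extent):
    """Keep boxes entirely within extent; for the global plot, drop global boxes."""
    keep = [b for b in bboxes if is_within(b, extent)]
    if extent == GLOBAL_EXTENT:
        keep = [b for b in keep if not is_global(b)]
    return keep


# Rough, illustrative coordinates only.
uk_plot = (-11.0, 49.0, 3.0, 61.0)
boxes = [
    (-5.5, 51.3, -2.6, 53.5),      # roughly Wales
    (-25.0, 34.0, 45.0, 72.0),     # roughly Europe
    (-180.0, -90.0, 180.0, 90.0),  # global
]

print(select_for_plot(boxes, uk_plot))
print(select_for_plot(boxes, GLOBAL_EXTENT))
```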

So the plots show where in the world there are lots of spatial datasets available via WMS (Web Map Service), WFS (Web Feature Service), WCS (Web Coverage Service), and/or WMTS (Web Map Tile Service). There are two colour schemes; we're mostly using the spectral one here because it brings out more detail, even if it's not very aesthetically pleasing.

[Figure: Global extent plot]

Starting with the global plot (above) it's obvious that the EU's INSPIRE directive has had a considerable effect, particularly in central Europe. The USA and Brazil also have considerable coverage.

Looking closer at West Europe (below), it becomes apparent that the rectangular nature of the extents is causing lots of overlaps in the region of the Netherlands/Belgium/Germany tripoint (apparently called the Vaalserberg), hence the very high values there. That said, there are still a lot of datasets covering the region.

[Figure: West Europe]

Zooming even closer to one specific country such as the UK (below), it's possible to see a lot of nuance that's swamped out at the smaller scales. It's now clear that Wales has excellent national coverage compared to England (which is mostly just the South East, not even London), and certainly Scotland (just Glasgow and Edinburgh).

[Figure: UK Extent plot]



Bad data

Do you notice anything odd about this plot of Africa (apart from it being green)?

[Figure: Africa plot with bad data]

Yep, there's a very large number of datasets covering a tiny area off the west coast of Africa. Why? As you've probably guessed it's because 6,527 datasets are wrongly declaring their extent as being entirely around 0, 0 (lat/lon). This floods out Africa which unfortunately doesn't have many datasets in the first place. So we filter these bad datasets out of the plot extents to get the below. Now we can see that east Africa has a respectable number of datasets covering it, as do the Canary Islands.

[Figure: Africa plot with good data]
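
A filter for these bad extents can be as simple as checking whether all four corners of the box sit next to 0, 0. A minimal sketch follows; the function name and the 0.01-degree tolerance are assumptions for illustration, not the exact rule we apply.

```python
def is_null_island(bbox, tol=0.01):
    """True if the whole (min_lon, min_lat, max_lon, max_lat) box hugs 0, 0.

    tol is an assumed threshold in degrees.
    """
    return all(abs(value) <= tol for value in bbox)


bboxes = [
    (0.0, 0.0, 0.0, 0.0),        # degenerate box at the origin: bad
    (-0.001, -0.001, 0.001, 0.001),  # tiny box around the origin: bad
    (33.9, -4.7, 41.9, 5.5),     # a plausible east African extent: kept
]

good = [b for b in bboxes if not is_null_island(b)]
print(good)
```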



Technical; How We Make Them

We use Python to create these plots by reading in all of the WGS84 (coordinate system) bounding boxes from the GeoSeer index, stacking them together with NumPy as a two-dimensional array, and then plotting them out via Matplotlib. NumPy does the magic of summing the extents together remarkably quickly, so we can rebuild them every month with the updates.
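
To give a flavour of the stacking step, here is a minimal sketch, assuming a global grid with a fixed cell size in degrees and boxes given as `(min_lon, min_lat, max_lon, max_lat)`. The function name, grid resolution, and example boxes are ours; the real pipeline differs in its details.

```python
import numpy as np


def stack_extents(bboxes, cell_deg=1.0):
    """Count how many bounding boxes cover each cell of a global lon/lat grid.

    Row 0 is the northernmost row, so the array can be shown directly as an image.
    """
    n_rows = int(round(180 / cell_deg))
    n_cols = int(round(360 / cell_deg))
    counts = np.zeros((n_rows, n_cols), dtype=np.int32)

    for min_lon, min_lat, max_lon, max_lat in bboxes:
        # Convert degrees to grid indices, rounding outwards to whole cells.
        col0 = int(np.floor((min_lon + 180) / cell_deg))
        col1 = int(np.ceil((max_lon + 180) / cell_deg))
        row0 = int(np.floor((90 - max_lat) / cell_deg))
        row1 = int(np.ceil((90 - min_lat) / cell_deg))

        # Clip to the grid so out-of-range extents can't wrap around.
        col0, col1 = max(col0, 0), min(col1, n_cols)
        row0, row1 = max(row0, 0), min(row1, n_rows)

        # A slice assignment increments every covered cell in one go.
        counts[row0:row1, col0:col1] += 1

    return counts


# Two overlapping example boxes (made-up coordinates).
counts = stack_extents([(0, 0, 10, 10), (5, 5, 15, 15)])
print(counts.max(), counts.sum())
```

The resulting array can then be passed to a plotting function such as Matplotlib's `imshow` with a colour map of your choice.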

Closing Remarks

There are other interesting insights that can be gleaned from these; take a look at the Datasets Extent Plots page for more. This is a good case study of the sort of cool stuff you can do with Python and GeoSeer Datasets, for example if you have a research itch you need to scratch.

Because we like to share, the plots are available for use under the CC-BY 4.0 license, which means you can do anything you want with them but please link back to GeoSeer.

Finally, if there's any particular area you think would make an interesting plot, let us know and we'll take a look.

Blog content licensed as: CC-BY-SA 4.0