The GeoSeer Blog

All pages with the tag: Analysis

Let's look at languages

Posted on 2024-05-28

We've finally added a language detector to GeoSeer! This allows us to detect the language of the metadata itself. Needless to say, the first thing we did was take a look at the stats!

Caveats

As ever there are some caveats:

  • This was done automatically using a model that covers 97 languages.
  • Most metadata is made up of very short snippets of text.
  • Many datasets and services don't have metadata at all (looking at you, data custodians!)
  • Some mix languages in their metadata, usually the native language plus English, e.g. Prefecture/محافظة, Province/المحافظة (which gets picked up as Arabic).
  • We filtered out standardised strings (typically English), e.g. "This is an OGC Compliant WFS".

Our own experimenting showed the model is really good even on small text strings, which is really important here. However, when you're analysing millions of records with the above limitations, even a very low false-positive rate can add up to a lot of errors; for example, a 0.1% error rate across the roughly four million dataset records below would still mean thousands of mislabelled entries. More on those in a bit.
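
For the curious, the detection loop looks roughly like the sketch below. The off-the-shelf langdetect library stands in for our actual 97-language model, and the boilerplate list and sample records are purely illustrative:

```python
# Rough sketch only: langdetect stands in for the actual 97-language model,
# and the boilerplate list is illustrative rather than exhaustive.
from collections import Counter

from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect deterministic between runs

# Standardised (typically English) strings filtered out before detection.
BOILERPLATE = {"This is an OGC Compliant WFS"}

def detect_language(metadata_text):
    """Return a language code, or None if there isn't enough usable text."""
    text = (metadata_text or "").strip()
    if not text or text in BOILERPLATE:
        return None          # ends up in the "No Language" bucket
    try:
        return detect(text)  # e.g. "de", "en", "fr"
    except LangDetectException:
        return None          # too short / no usable features

# Tally languages over a pile of (made-up) metadata records.
records = ["Digitales Geländemodell", "Réseau hydrographique", ""]
print(Counter(detect_language(r) for r in records))
```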

Services

Let's start by looking at services. The following table shows the number of services with metadata records in a given language.

Language | Number of Services | Percent of all Services
No Language | 122,683 | 25.34%
German | 313,684 | 64.79%
English | 20,398 | 4.21%
French | 11,447 | 2.36%
Spanish | 5,113 | 1.06%
Polish | 2,972 | 0.61%
Dutch | 2,259 | 0.47%
Italian | 1,549 | 0.32%
Portuguese | 1,095 | 0.23%
Finnish | 609 | 0.13%
Czech | 494 | 0.1%
Catalan | 369 | -
Swedish | 251 | -
Slovak | 247 | -
Croatian | 183 | -
Estonian | 155 | -
Danish | 138 | -
Galician | 76 | -
Icelandic | 62 | -
Norwegian | 50 | -
Latvian | 48 | -
Slovenian | 35 | -
Hungarian | 35 | -
Chinese | 32 | -
Thai | 27 | -
Norwegian Nynorsk | 24 | -
Greek | 21 | -
Luxembourgish | 19 | -
Basque | 11 | -
Japanese | 10 | -
Romanian | 7 | -
Latin | 7 | -
Norwegian Bokmål | 4 | -
Welsh | 4 | -
Bulgarian | 3 | -
Occitan | 2 | -
Lithuanian | 2 | -
Macedonian | 1 | -
Irish | 1 | -
Faroese | 1 | -
Subtotal | 484,128 | -

There are a truly astonishing number of services in German. In fact, it's reasonable to say there are basically just two types of service: German ones, and those that don't have enough metadata to detect a language. Between them those two conditions account for slightly over 90% of all services.

We also see our first mistakes here. The Luxembourgish and Faroese language services are mostly hosted on German domains, with a couple of Czech ones and a Belgian one thrown in. So chances are they're false positives.

And bad news: the ancient Romans probably don't have any spatial web services either. The Latin services seem to be a mixture of explicitly Hungarian content ("Hungarian Biogeographical regions view service"), along with Slovenian, Slovak, and Czech domains. Linguistically interesting, but probably not time-travelling data.

These false positives similarly hold true for the dataset data, but by no means are all of the low numbers above wrong. We really do have Turkish, Chinese (all via Taiwan), Thai, and Japanese services and datasets, among others. Though these are dwarfed by the Indo-European languages.

Datasets

Next up, the number of datasets with metadata in a given language. These are non-distinct datasets, so if the same dataset is available via WMS and WFS, it may be counted twice (metadata can be different).

Language | Number of Datasets | Percent of all Datasets
No Language | 1,797,983 | 44.75%
German | 1,396,847 | 34.76%
English | 290,636 | 7.23%
Portuguese | 110,387 | 2.75%
French | 105,949 | 2.64%
Spanish | 73,467 | 1.83%
Dutch | 68,135 | 1.7%
Italian | 47,914 | 1.19%
Czech | 24,781 | 0.62%
Polish | 19,053 | 0.47%
Finnish | 17,859 | 0.44%
Catalan | 10,609 | 0.26%
Swedish | 10,390 | 0.26%
Japanese | 9,549 | 0.24%
Danish | 8,362 | 0.21%
Estonian | 4,317 | 0.11%
Greek | 4,047 | 0.1%
Slovak | 2,586 | -
Icelandic | 1,791 | -
Croatian | 1,785 | -
Chinese | 1,766 | -
Hungarian | 1,570 | -
Russian | 1,047 | -
Slovenian | 964 | -
Latvian | 753 | -
Norwegian | 666 | -
Basque | 654 | -
Thai | 635 | -
Romanian | 629 | -
Korean | 625 | -
Bulgarian | 312 | -
Galician | 256 | -
Luxembourgish | 187 | -
Afrikaans | 174 | -
Lithuanian | 141 | -
Malagasy | 134 | -
Latin | 114 | -
Welsh | 111 | -
Occitan | 108 | -
Walloon | 106 | -
Norwegian Nynorsk | 101 | -
Aragonese | 89 | -
Maltese | 62 | -
Irish | 57 | -
Swahili | 51 | -
Albanian | 48 | -
Esperanto | 46 | -
Javanese | 39 | -
Filipino | 33 | -
Kinyarwanda | 31 | -
Norwegian Bokmål | 27 | -
Bosnian | 27 | -
Macedonian | 26 | -
Xhosa | 23 | -
Breton | 21 | -
Hebrew | 19 | -
Indonesian | 18 | -
Kurdish | 17 | -
Volapük | 16 | -
Quechua | 16 | -
Vietnamese | 9 | -
Faroese | 9 | -
Haitian Creole | 6 | -
Ukrainian | 4 | -
Turkish | 4 | -
Azerbaijani | 4 | -
Malay | 3 | -
Kyrgyz | 2 | -
Georgian | 2 | -
Arabic | 2 | -
Subtotal | 4,018,211 | -

While German layer metadata is clearly still the most numerous, its dominance has come down from vast (64.79% of services) to merely huge (34.76% of datasets).

It's also disappointing to see that nearly 45% of datasets don't have enough metadata to make a language determination. Given that the tool we're using is happy to take a guess at around 10 words, that says something about the quality of the metadata we're looking at.

Different Strategies

Let's finish with one table that highlights the differing strategies that can be seen between individual countries. The table below shows the average number of datasets per service for each language.

Language | # Datasets per Service | # Services | # Datasets
No Language Detected | 14.66 | 122,683 | 1,797,983
Japanese | 954.9 | 10 | 9,549
Greek | 192.71 | 21 | 4,047
Portuguese | 100.81 | 1,095 | 110,387
Danish | 60.59 | 138 | 8,362
Czech | 50.16 | 494 | 24,781
Swedish | 41.39 | 251 | 10,390
Italian | 30.93 | 1,549 | 47,914
Dutch | 30.16 | 2,259 | 68,135
Finnish | 29.33 | 609 | 17,859
Catalan | 28.75 | 369 | 10,609
Estonian | 27.85 | 155 | 4,317
Slovenian | 27.54 | 35 | 964
Thai | 23.52 | 27 | 635
Spanish | 14.37 | 5,113 | 73,467
English | 14.25 | 20,398 | 290,636
French | 9.26 | 11,447 | 105,949
Polish | 6.41 | 2,972 | 19,053
German | 4.45 | 313,684 | 1,396,847
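
For clarity, the "# Datasets per Service" column is simply the dataset count divided by the service count from the two tables above, for example:

```python
# Numbers copied from the services and datasets tables above.
services = {"German": 313_684, "Japanese": 10, "Greek": 21}
datasets = {"German": 1_396_847, "Japanese": 9_549, "Greek": 4_047}

for language in services:
    ratio = datasets[language] / services[language]
    print(f"{language}: {ratio:.2f} datasets per service")
# German: 4.45, Japanese: 954.90, Greek: 192.71
```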

Here we see why there are so many German services: it's evident they're using a service-heavy architecture in their OGC deployments. As a prime example, the domain with the most services is German (geodienste.komm.one), hosting no fewer than 184,358 services. Second place, with a "mere" 29,071 services, is also German.

At the other end of the spectrum we have Japan, with an average of 954.9 datasets per service. It's clear that Japan's geospatial strategy is highly centralised. Greece, Denmark, and the Czech Republic are all similarly centralised.

The above uses language as a proxy for country, which holds reasonably well for the languages listed (the German services could in principle be Swiss or Austrian, but in this case we don't believe they are).

Portuguese is another matter. We can't make the same claim for Portugal, because most of the Portuguese datasets actually come from .br domains, at a ratio of about 7 Brazilian datasets for every 1 from Portugal. This also makes it hard to draw country-level conclusions about other widely spread languages (English, French, and Spanish in particular).

Conclusion

In the above, we've used language as a proxy for country. In the past, we've done similar investigations using country-level domains and reached similar conclusions (not posted to this blog), so it's nice to see things verified using a completely separate mechanism. In summary:

  • People and organisations remain terrible at creating metadata.
  • Germany has a very service-focused architecture, probably a consequence of governmental policy.
  • Japan and Greece are very centralised in their geospatial services.
  • Language detection is an interesting if difficult problem, especially on small samples of text.

What's the most deployed geospatial server software?

Posted on 2020-06-04

One of the things we've been meaning to do for a long time is investigate which geospatial server software is most prevalent for serving up all these services GeoSeer has in its index. After all, what's the point of having the world's largest index of geospatial web services at your fingertips (shameless plug!) if you're not going to use it to answer interesting questions?

The answer is...

... not 42, that's a different question. The one word answer is: ArcGIS. But as ever with these things, there's a much more nuanced story to tell. For example, the software that hosts the most datasets is easily GeoServer. The question we're answering is: What's the most deployed software out there for serving up publicly accessible geospatial data via WMS, WFS, WCS, and WMTS services? While that may read like a lot of caveats, this isn't a tabloid newspaper! Here are the results in one big table.

Note: Deployment = at least one instance of this software, grouped by domain (e.g. geoserver.example.com and geoserver2.example.com are two deployments); Service = a single service, the thing you get when you copy/paste a WMS/WFS/WCS/WMTS URL into your GIS.
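
In code terms, the distinction is roughly the sketch below: every capabilities URL is a service, and services sharing a domain count as a single deployment (the URLs here are made up for illustration).

```python
# Rough illustration of Deployment vs Service: group service URLs by domain.
from collections import defaultdict
from urllib.parse import urlparse

# Made-up example endpoints; each URL is one "service".
service_urls = [
    "https://geoserver.example.com/geoserver/wms?SERVICE=WMS&REQUEST=GetCapabilities",
    "https://geoserver.example.com/geoserver/wfs?SERVICE=WFS&REQUEST=GetCapabilities",
    "https://geoserver2.example.com/geoserver/wms?SERVICE=WMS&REQUEST=GetCapabilities",
]

deployments = defaultdict(list)
for url in service_urls:
    deployments[urlparse(url).hostname].append(url)

print(len(deployments))                                 # 2 deployments
print(sum(len(urls) for urls in deployments.values()))  # 3 services
```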


Software | # Deployments | # Services | # Datasets
ArcGIS Server | 2,755 | 72,054 | 517,169
Cardogis | 13 | 714 | 2,869
Cubeserv | 7 | 36 | 1,141
deegree | 45 | 3,043 | 15,062
Erdas | 16 | 52 | 797
Ewmapa | 17 | 17 | 189
Extensis | 2 | 2 | 123
Geognosis | 6 | 11 | 256
Geomedia | 26 | 533 | 11,078
GeoServer | 964 | 22,673 | 963,603
GeoWebCache | 49 | 98 | 42,128
MapBender | 4 | 30,997 | 31,060
MapCache | 14 | 23 | 7,495
MapGuide | 4 | 5 | 258
MapProxy | 15 | 65 | 610
MapServer | 544 | 57,606 | 389,709
QGIS Server | 60 | 613 | 11,924
Tekla | 22 | 22 | 461
THREDDS | 43 | 26,976 | 51,345
UNSURE | 17 | 439 | 1,395
UNKNOWN | 507 | 12,470 | 178,995

Type of geospatial server software and its count of deployments, as well as the number of hosted services and datasets provided by it.
UNSURE means it could be one of several things. UNKNOWN means no idea at all. Linked software is Open Source.

The proprietary world

The first thing that jumps out is that ArcGIS has a huge number of deployments at 2,755, which is 53.7% of them. In reality there are a lot more ArcGIS servers out there (at least ~4,900 in our index), but here we're only counting the ones that are serving WMS/WFS/WMTS/WCS. The rest are likely only serving via ESRI standards.

The next obvious thing on the proprietary side is that, outside of ArcGIS, the rest aren't even "also-rans", totalling just 2.12% of the deployments and sitting behind only 0.75% of the datasets served. Barely a rounding error! It's likely there are a few more pieces of proprietary software hiding in the UNKNOWN grouping, but probably not enough to make a real difference.

The power of Open Source

Open Source has a much healthier ecosystem, with MapServer and GeoServer having very large deployment counts, and niche servers like THREDDS (oceanic data community) and GeoWebCache (caching server) also serving up a lot of data.

If you group the proprietary/open source servers together, things become even more interesting:


Software Type | # Deployments | # Services | # Datasets
Proprietary | 2,864 (55.83%) | 73,441 (32.15%) | 534,083 (23.97%)
Open Source | 1,742 (33.96%) | 142,099 (62.2%) | 1,513,194 (67.93%)
UNSURE/UNKNOWN | 507 (9.88%) | 12,470 (5.46%) | 178,995 (8.04%)

The total number of Deployments, Services, and Datasets grouped by whether the software is Open Source or proprietary.
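
If you want to reproduce this grouping yourself, it's just a sum over the per-software table followed by percentages of the overall totals. A minimal sketch (with only three example rows, so the percentages won't match the full table) looks like this:

```python
# Sketch of the grouping: only three example rows are included here, so the
# resulting percentages won't match the full table above.
per_software = {
    # name:           (licence,       deployments, services, datasets)
    "ArcGIS Server":  ("Proprietary",       2_755,   72_054,  517_169),
    "GeoServer":      ("Open Source",         964,   22_673,  963_603),
    "MapServer":      ("Open Source",         544,   57_606,  389_709),
}

totals = {}
for licence, depl, svc, ds in per_software.values():
    d, s, n = totals.get(licence, (0, 0, 0))
    totals[licence] = (d + depl, s + svc, n + ds)

# Overall totals come from the "Totals (All)" row in the next table.
ALL_DEPLOYMENTS, ALL_SERVICES, ALL_DATASETS = 5_130, 228_449, 2_227_667
for licence, (depl, svc, ds) in totals.items():
    print(f"{licence}: {depl / ALL_DEPLOYMENTS:.2%} of deployments, "
          f"{svc / ALL_SERVICES:.2%} of services, {ds / ALL_DATASETS:.2%} of datasets")
```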

[Graph version of the above table: percentages of deployments/services/datasets served by proprietary, Open Source, and unknown software.]

Looking at the above it rapidly becomes clear that while there may be a lot of ArcGIS deployments, they're not sharing much data compared to the Open Source installs. It seems reasonable to conclude that ESRI are very good at selling their software to cities/counties/local provinces, who then use it to comply with "Open Data" edicts, but when it comes time to roll out an SDI, Open Source is where it's at. In fact, Open Source solutions are behind at least two thirds of the world's OGC-served datasets!

Deployment patterns

One final data table, this one breaking down some of the software a bit further and including the average number of services and datasets per deployment.


Software | # Deployments | # Services | Avg Services/Deployment | # Datasets | Avg Datasets/Deployment
ArcGIS | 2,755 (53.7%) | 72,054 (31.54%) | 26.15 | 517,169 (23.22%) | 187.72
GeoServer | 964 (18.79%) | 22,673 (9.92%) | 23.52 | 963,603 (43.26%) | 999.59
MapServer | 544 (10.6%) | 57,606 (25.22%) | 105.89 | 389,709 (17.49%) | 716.38
THREDDS | 43 (0.84%) | 26,976 (11.81%) | 627.35 | 51,345 (2.3%) | 1194.07
Totals (All) | 5,130 | 228,449 | 44.53 | 2,227,667 | 434.24

Subset of the data (popular data servers only). Includes the average (mean) number of services and datasets per deployment.

This further reinforces the point that ArcGIS deployments don't have many datasets on them compared to the Open Source variants. It also shows how differently servers structure themselves; MapServer has a lot of services per deployment, and THREDDS has a huge number. THREDDS also carries this over to a very high number of datasets per deployment, which explains why, with so few deployments, it still serves more datasets than all of the non-ArcGIS proprietary systems combined.

How it was done

That's the end of the stats, but for those interested in how it was done, read on. (It's like a bonus blog post!)

Fingerprinting

The short version is that most servers return unique components in their responses (which are XML documents) that allow us to fingerprint them. For example: a unique XML namespace; a comment that explicitly says what it is: <!-- MapServer version ... --> (hmm, what could that be?); a particular combination of supported formats; or even the path component of the URL to the service: https://psl.noaa.gov/thredds/wms/Datasets/NARR/Monthlies/monolevel/wspd.10m.mon.mean.nc.

We can also rely on lazy administrators who have left defaults in place: for example, default service titles ("MapGuide WMS Server") and abstracts, or the ridiculously long, 5,000+ item list of default projections that GeoServer supports, which 1 in 6 GeoServer administrators hasn't culled.
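
Put together, a heavily simplified fingerprint check looks something like the sketch below. Only markers mentioned above are included, and the patterns are illustrative rather than our actual rule set:

```python
# Heavily simplified: the real fingerprints are more numerous and more careful.
import re

FINGERPRINTS = [
    # (software, pattern applied to the capabilities XML and the service URL)
    ("MapServer", re.compile(r"<!--\s*MapServer version")),   # tell-tale XML comment
    ("THREDDS",   re.compile(r"/thredds/", re.IGNORECASE)),   # path component of the URL
    ("MapGuide",  re.compile(r"MapGuide WMS Server")),        # default service title
    # Unique namespaces, format combinations, etc. would be added similarly.
]

def fingerprint(service_url, capabilities_xml):
    """Return every software whose marker appears in the URL or response."""
    haystack = service_url + "\n" + capabilities_xml
    return [name for name, pattern in FINGERPRINTS if pattern.search(haystack)]

print(fingerprint("https://example.com/cgi-bin/mapserv?map=foo",
                  "<WMS_Capabilities><!-- MapServer version 7.6.2 --></WMS_Capabilities>"))
# ['MapServer']
```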

False Negatives over False Positives

Using these various fingerprints we can then assign a server-score to the response depending on which factors it meets. We leant towards false negatives, meaning if we weren't sure it was unique, we wouldn't use it as a fingerprint. This is evidenced by the low number of "UNSURE" results, the majority of which are some flavour of MapBender impersonating deegree.
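
As a rough illustration of what that scoring looks like (the weights and threshold here are invented for the sketch, not our actual values):

```python
# Illustrative only: weights and threshold are invented for this sketch.
WEIGHTS = {"xml_comment": 3, "default_title": 2, "url_path": 2, "format_combo": 1}
THRESHOLD = 3  # leaning towards false negatives: weak evidence stays UNKNOWN

def classify(matches):
    """matches maps candidate software -> list of fingerprint kinds that hit."""
    scores = {sw: sum(WEIGHTS[kind] for kind in kinds) for sw, kinds in matches.items()}
    confident = [sw for sw, score in scores.items() if score >= THRESHOLD]
    if not confident:
        return "UNKNOWN"  # nothing scored highly enough
    if len(confident) > 1:
        return "UNSURE"   # could plausibly be more than one thing
    return confident[0]

print(classify({"MapServer": ["xml_comment", "url_path"]}))   # MapServer
print(classify({"deegree": ["format_combo"]}))                # UNKNOWN
print(classify({"deegree": ["format_combo", "default_title"],
                "MapBender": ["xml_comment"]}))               # UNSURE
```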

Limitations

It's important with this sort of thing to point out the limitations of the methodology, and the caveats it comes with, like we do with the extents plots.

Fingerprinting does have its limits. For example, GeoWebCache is integrated into GeoServer, so stand-alone GeoWebCaches may be under-counted. Similarly, the proxy servers (MapProxy, GeoWebCache, and MapCache) are by definition only caches for the actual rendering software sitting behind them, which could be anything. As such, the numbers for the caches should be treated with a grain of salt: they may be underestimated because the caches are often invisible, and even when they are visible, we have no way of knowing what's behind them.

Confidence Levels

For some pieces of software we're very confident we've identified all of the deployments in our index, because they have clear fingerprints that server administrators are extremely unlikely to change (custom namespaces, obnoxious hard-coded software-licensing details, etc.). The table below shows how confident we are that we found all of that software within our index. High confidence means we're pretty sure we found it all; low confidence means there could be more deployments hiding in the "UNKNOWN" and/or "UNSURE" groups.



Software | Confidence
ArcGIS Server | High
Cardogis | High
Cubeserv | High
deegree | Low
Erdas | High
Ewmapa | Low
Extensis | Low
Geognosis | High
Geomedia | High
GeoServer | Medium
GeoWebCache | Low
MapBender | Low
MapCache | Low
MapGuide | Low
MapProxy | Low
MapServer | High
QGIS Server | High
Tekla | Medium
THREDDS | Medium

Table showing how confident we are that we found all deployments of a specific type of software within our index.

General Notes and Caveats

  • There can be many software installations behind one "deployment".
  • Some domains have multiple different pieces of software behind them; this is why the number of deployments is higher than the number of "hosts" on the stats page.
  • Results are based on a snapshot of global geospatial services from mid-May 2020.
  • Excludes servers that only have "meaningless" data/services, and demo/test servers. Only includes servers that actually serve data.
  • The GeoSeer index, while the largest we know of, doesn't cover all public services, but it certainly covers enough that this should be an accurate representation.
  • This only covers public-facing, freely accessible services (i.e. the sort GeoSeer indexes). There will be many more deployments of all of this software that only point at internal corporate networks.
As ever, let us know if you have any thoughts/feedback/comments etc.

Blog content licensed as: CC-BY-SA 4.0