The GeoSeer Blog

All pages with the tag: Statistics

Let's look at languages

Posted on 2024-05-28

We've finally implemented a language detector in GeoSeer! This allows us to detect the language of the metadata itself. Needless to say, the first thing we did was take a look at the stats!

Caveats

As ever there are some caveats:

  • This was done automatically using a model of 97 languages.
  • Most metadata is very short snippets of text.
  • Many datasets and services don't have metadata (looking at you, data custodians!)
  • Some mix languages in their metadata, usually the native language plus English. E.g.: Prefecture/محافظة, Province/المحافظة (gets picked up as Arabic)
  • We filtered out standardised strings (typically English). E.g.: "This is an OGC Compliant WFS"
Our own experimenting showed the model is really good even on small text strings, which is really important here. However, when you're analysing millions of records with the above limitations, even a very low false-positive rate can become a lot of errors. More on them in a bit.
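
To put that last point into numbers, here is a back-of-the-envelope sketch. The 0.1% error rate is purely an assumed figure for illustration, not a measured property of the model; the record count is the dataset subtotal reported further down this post.

```python
# Rough estimate of how many records a small error rate mislabels.
# ASSUMPTION: the 0.1% rate is illustrative, not a measured figure.
total_records = 4_018_211      # dataset subtotal reported in this post
assumed_error_rate = 0.001     # hypothetical 0.1% misclassification rate

expected_errors = round(total_records * assumed_error_rate)
print(f"{expected_errors:,} records mislabelled")
```

Even at that optimistic rate, thousands of records would end up in the wrong language bucket, which is why the small counts in the tables below deserve some scepticism.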

Services

Let's start by looking at services. The following table shows the number of services with metadata records in a given language.

Language | Number of Services | Percent of all Services
No Language | 122,683 | 25.34%
German | 313,684 | 64.79%
English | 20,398 | 4.21%
French | 11,447 | 2.36%
Spanish | 5,113 | 1.06%
Polish | 2,972 | 0.61%
Dutch | 2,259 | 0.47%
Italian | 1,549 | 0.32%
Portuguese | 1,095 | 0.23%
Finnish | 609 | 0.13%
Czech | 494 | 0.1%
Catalan | 369 | -
Swedish | 251 | -
Slovak | 247 | -
Croatian | 183 | -
Estonian | 155 | -
Danish | 138 | -
Galician | 76 | -
Icelandic | 62 | -
Norwegian | 50 | -
Latvian | 48 | -
Slovenian | 35 | -
Hungarian | 35 | -
Chinese | 32 | -
Thai | 27 | -
Norwegian Nynorsk | 24 | -
Greek | 21 | -
Luxembourgish | 19 | -
Basque | 11 | -
Japanese | 10 | -
Romanian | 7 | -
Latin | 7 | -
Norwegian Bokmål | 4 | -
Welsh | 4 | -
Bulgarian | 3 | -
Occitan | 2 | -
Lithuanian | 2 | -
Macedonian | 1 | -
Irish | 1 | -
Faroese | 1 | -
Subtotal | 484,128 | -

There are a truly astonishing number of services in German. In fact, it's reasonable to say there are basically just two types of service: German ones, and those that don't have enough metadata to detect a language. Between them those two conditions account for slightly over 90% of all services.

We also see our first mistakes here. The Luxembourgish and Faroese language services are mostly hosted on German domains, with a couple of Czech ones and a Belgian one thrown in. So chances are they're false positives.

And bad news: the ancient Romans probably don't have any spatial web services either. The Latin services seem to be a mixture of explicitly Hungarian ones ("Hungarian Biogeographical regions view service") and services on Slovenian, Slovakian, and Czech domains. Linguistically interesting, but probably not time-travelling data.

These false positives similarly hold true for the dataset data, but by no means are all of the low numbers above wrong. We really do have Turkish, Chinese (all via Taiwan), Thai and Japanese services, among others. Though these are dwarfed by the Indo-European languages.

Datasets

Next up, the number of datasets with metadata in a given language. These are non-distinct datasets, so if the same dataset is available via WMS and WFS, it may be counted twice (metadata can be different).

Language | Number of Datasets | Percent of all Datasets
No Language | 1,797,983 | 44.75%
German | 1,396,847 | 34.76%
English | 290,636 | 7.23%
Portuguese | 110,387 | 2.75%
French | 105,949 | 2.64%
Spanish | 73,467 | 1.83%
Dutch | 68,135 | 1.7%
Italian | 47,914 | 1.19%
Czech | 24,781 | 0.62%
Polish | 19,053 | 0.47%
Finnish | 17,859 | 0.44%
Catalan | 10,609 | 0.26%
Swedish | 10,390 | 0.26%
Japanese | 9,549 | 0.24%
Danish | 8,362 | 0.21%
Estonian | 4,317 | 0.11%
Greek | 4,047 | 0.1%
Slovak | 2,586 | -
Icelandic | 1,791 | -
Croatian | 1,785 | -
Chinese | 1,766 | -
Hungarian | 1,570 | -
Russian | 1,047 | -
Slovenian | 964 | -
Latvian | 753 | -
Norwegian | 666 | -
Basque | 654 | -
Thai | 635 | -
Romanian | 629 | -
Korean | 625 | -
Bulgarian | 312 | -
Galician | 256 | -
Luxembourgish | 187 | -
Afrikaans | 174 | -
Lithuanian | 141 | -
Malagasy | 134 | -
Latin | 114 | -
Welsh | 111 | -
Occitan | 108 | -
Walloon | 106 | -
Norwegian Nynorsk | 101 | -
Aragonese | 89 | -
Maltese | 62 | -
Irish | 57 | -
Swahili | 51 | -
Albanian | 48 | -
Esperanto | 46 | -
Javanese | 39 | -
Filipino | 33 | -
Kinyarwanda | 31 | -
Norwegian Bokmål | 27 | -
Bosnian | 27 | -
Macedonian | 26 | -
Xhosa | 23 | -
Breton | 21 | -
Hebrew | 19 | -
Indonesian | 18 | -
Kurdish | 17 | -
Volapük | 16 | -
Quechua | 16 | -
Vietnamese | 9 | -
Faroese | 9 | -
Haitian Creole | 6 | -
Ukrainian | 4 | -
Turkish | 4 | -
Azerbaijani | 4 | -
Malay | 3 | -
Kyrgyz | 2 | -
Georgian | 2 | -
Arabic | 2 | -
Subtotal | 4,018,211 | -

While German layer metadata is clearly the most numerous, the difference has come down from vast, at 64.79% of services, to merely huge, at 34.76% of datasets.

It's also disappointing to see that almost 45% of datasets don't have enough metadata to make a language determination. Given the tool we're using is happy to take a guess at 10 words or so, that says something about the quality of metadata we're looking at.

Different Strategies

Let's finish with one table that highlights the differing strategies that can be seen between individual countries. The table below shows the number of datasets per service that exist for each language.

Language | # Datasets per Service | # Services | # Datasets
No Language Detected | 14.66 | 122,683 | 1,797,983
Japanese | 954.9 | 10 | 9,549
Greek | 192.71 | 21 | 4,047
Portuguese | 100.81 | 1,095 | 110,387
Danish | 60.59 | 138 | 8,362
Czech | 50.16 | 494 | 24,781
Swedish | 41.39 | 251 | 10,390
Italian | 30.93 | 1,549 | 47,914
Dutch | 30.16 | 2,259 | 68,135
Finnish | 29.33 | 609 | 17,859
Catalan | 28.75 | 369 | 10,609
Estonian | 27.85 | 155 | 4,317
Slovenian | 27.54 | 35 | 964
Thai | 23.52 | 27 | 635
Spanish | 14.37 | 5,113 | 73,467
English | 14.25 | 20,398 | 290,636
French | 9.26 | 11,447 | 105,949
Polish | 6.41 | 2,972 | 19,053
German | 4.45 | 313,684 | 1,396,847

Here we see why there are so many German services: It's evident they're using a service-heavy architecture in their OGC deployments. As a prime example of this, the domain with the most services is German (geodienste.komm.one) hosting no less than 184,358 services. Second place with a "mere" 29,071 services is also German.

At the other end of the spectrum, we have Japan, with an average of about 955 datasets per service. It's clear that Japan's geospatial strategy is to be highly centralised. Greece, Denmark, and the Czech Republic are all similarly focused on being centralised.
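
The ratio column is simply datasets divided by services. A minimal sketch that recomputes it for a few of the rows in the table above (the function name is our own, chosen for illustration):

```python
# (services, datasets) per language, copied from the table above.
counts = {
    "Japanese": (10, 9_549),
    "Greek": (21, 4_047),
    "English": (20_398, 290_636),
    "German": (313_684, 1_396_847),
}


def datasets_per_service(services: int, datasets: int) -> float:
    """Mean number of datasets hosted by each service."""
    return round(datasets / services, 2)


ratios = {lang: datasets_per_service(s, d) for lang, (s, d) in counts.items()}
# Print the most centralised languages first.
for lang, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:<10}{ratio:>8.2f}")
```

Bear in mind that a ratio built on only 10 or 21 services is sensitive to a single large server, so the top of this ranking is less robust than the bottom.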

The above uses languages as a proxy for country, which holds true for the above languages (though the German ones could also be Swiss or Austrian, but in this case we don't believe they are).
Portuguese is another matter. We can't make the same claim for Portugal because most of the Portuguese datasets are actually coming from .br domains, at a ratio of about 7 Brazilian datasets for 1 dataset from Portugal. This also makes it hard to draw country-level conclusions about other widely spread languages (English, French, Spanish in particular).

Conclusion

In the above, we've used language as a proxy for country. In the past, we've done similar investigations using country-level domains, and seen similar conclusions (not posted to this blog). So it's nice to see things verified using a completely separate mechanism. In summary:

  • People and organisations remain terrible at creating metadata.
  • Germany has a very service-focused architecture, probably a consequence of governmental policy.
  • Japan and Greece are very centralised in their geospatial services.
  • Language detection is an interesting if difficult problem, especially on small samples of text.

What's the most deployed geospatial server software?

Posted on 2020-06-04

One of the things we've been meaning to do for a long time is investigate which geospatial server software is most prevalent for serving up all these services GeoSeer has in its index. After all, what's the point of having the world's largest index of geospatial web services at your fingertips (shameless plug!) if you're not going to use it to answer interesting questions?

The answer is...

... not 42, that's a different question. The one word answer is: ArcGIS. But as ever with these things, there's a much more nuanced story to tell. For example, the software that hosts the most datasets is easily GeoServer. The question we're answering is: What's the most deployed software out there for serving up publicly accessible geospatial data via WMS, WFS, WCS, and WMTS services? While that may read like a lot of caveats, this isn't a tabloid newspaper! Here are the results in one big table.

Note: Deployment = At least one instance of this software, grouped by domain (i.e. geoserver.example.com, and geoserver2.example.com are two deployments); Service = A single service, the thing you get when you copy/paste a WMS/WFS/WCS/WMTS URL into your GIS.
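
One way to read those definitions in code: a deployment is a distinct (hostname, software) pair, while every URL counts as a service. The URLs and software labels below are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical (service URL, detected software) pairs, for illustration only.
services = [
    ("https://geoserver.example.com/wms", "GeoServer"),
    ("https://geoserver.example.com/wfs", "GeoServer"),
    ("https://geoserver2.example.com/wms", "GeoServer"),
    ("https://geoserver2.example.com/mapserv", "MapServer"),
]

# A deployment is a distinct (hostname, software) pair.
deployments = {(urlsplit(url).hostname, software) for url, software in services}
deployments_per_software = Counter(software for _, software in deployments)
services_per_software = Counter(software for _, software in services)

print(deployments_per_software)
print(services_per_software)
```

Note that geoserver2.example.com contributes two deployments here because two different pieces of software answer on it.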


Software | # Deployments | # Services | # Datasets
ArcGIS Server | 2,755 | 72,054 | 517,169
Cardogis | 13 | 714 | 2,869
Cubeserv | 7 | 36 | 1,141
deegree | 45 | 3,043 | 15,062
Erdas | 16 | 52 | 797
Ewmapa | 17 | 17 | 189
Extensis | 2 | 2 | 123
Geognosis | 6 | 11 | 256
Geomedia | 26 | 533 | 11,078
GeoServer | 964 | 22,673 | 963,603
GeoWebCache | 49 | 98 | 42,128
MapBender | 4 | 30,997 | 31,060
MapCache | 14 | 23 | 7,495
MapGuide | 4 | 5 | 258
MapProxy | 15 | 65 | 610
MapServer | 544 | 57,606 | 389,709
QGIS Server | 60 | 613 | 11,924
Tekla | 22 | 22 | 461
THREDDS | 43 | 26,976 | 51,345
UNSURE | 17 | 439 | 1,395
UNKNOWN | 507 | 12,470 | 178,995
Type of geospatial server software and its count of deployments, as well as the number of hosted services and datasets provided by it.
UNSURE means it could be one of several things. UNKNOWN means no idea at all. Linked software is Open Source.

The proprietary world

The first thing that jumps out is that ArcGIS has a huge number of deployments at 2,755; that's 53.7% of them. In reality, there are actually a lot more ArcGIS servers out there (at least ~4,900 in our index), but here we're only counting the ones that are serving WMS/WFS/WMTS/WCS. The rest are likely only serving via ESRI standards.

The next obvious thing with regard to proprietary software is that, outside of ArcGIS, the rest aren't even "also rans": together they total just 2.12% of the deployments and are behind only 0.75% of the datasets served. Barely a rounding error! It's likely there are a few more pieces of proprietary software in the UNKNOWN grouping, but probably not enough to make a real difference.

The power of Open Source

Open Source has a much healthier ecosystem, with MapServer and GeoServer having very large deployment counts, and niche servers like THREDDS (oceanic data community) and GeoWebCache (caching server) also serving up a lot of data.

If you group the proprietary/open source servers together, things become even more interesting:


Software Type | # Deployments | # Services | # Datasets
Proprietary | 2,864 (55.83%) | 73,441 (32.15%) | 534,083 (23.97%)
OpenSource | 1,742 (33.96%) | 142,099 (62.2%) | 1,513,194 (67.93%)
UNSURE/UNKNOWN | 507 (9.88%) | 12,470 (5.46%) | 178,995 (8.04%)
The total number of Deployments, Services, and Datasets grouped by whether the software is Open Source or proprietary.

[Image: Graph showing percentages of deployments/services/datasets that are served by proprietary/open source/unknown software]
Graph version of the above table.

Looking at the above it rapidly becomes clear that while there may be a lot of ArcGIS deployments, they're not sharing much data as compared to the Open Source installs. It seems reasonable to conclude that ESRI are very good at selling their software to cities/counties/local provinces, who then use it to comply with "Open Data" edicts, but when it comes time to roll out an SDI, Open Source is where it's at. In fact, Open Source solutions are behind at least two thirds of the world's OGC served datasets!

Deployment patterns

One final data table, this one breaking down some of the software a bit further and including the average number of services per deployment and datasets per deployment.


Software Type | # Deployments | # Services | Avg Services/Deployment | # Datasets | Avg Datasets/Deployment
Popular Data Servers
ArcGIS | 2,755 (53.7%) | 72,054 (31.54%) | 26.15 | 517,169 (23.22%) | 187.72
GeoServer | 964 (18.79%) | 22,673 (9.92%) | 23.52 | 963,603 (43.26%) | 999.59
MapServer | 544 (10.6%) | 57,606 (25.22%) | 105.89 | 389,709 (17.49%) | 716.38
THREDDS | 43 (0.84%) | 26,976 (11.81%) | 627.35 | 51,345 (2.3%) | 1194.07
Totals (All) | 5,130 | 228,449 | 44.53 | 2,227,667 | 434.24
Subset of data. Includes the average (mean) number of services and datasets per deployment.

This further reinforces the point that ArcGIS deployments don't have many datasets on them as compared to the Open Source variants. It also shows how different servers structure themselves; MapServer has a lot of services per deployment, and THREDDS has a huge number. THREDDS then carries this over to a very high number of datasets per deployment as well, explaining why with such a low number of deployments it still serves more datasets than all of the non-ArcGIS proprietary systems combined.
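
For anyone wanting to check the averages, they are plain means: services (or datasets) divided by deployments. A small sketch using the counts from the table above:

```python
# (deployments, services, datasets) copied from the table above.
rows = {
    "ArcGIS": (2_755, 72_054, 517_169),
    "GeoServer": (964, 22_673, 963_603),
    "MapServer": (544, 57_606, 389_709),
    "THREDDS": (43, 26_976, 51_345),
}

# Mean services and mean datasets per deployment, rounded like the table.
averages = {
    name: (round(services / deployments, 2), round(datasets / deployments, 2))
    for name, (deployments, services, datasets) in rows.items()
}

for name, (avg_services, avg_datasets) in averages.items():
    print(f"{name:<10}{avg_services:>9.2f}{avg_datasets:>10.2f}")
```

Because these are means, a handful of very large deployments can pull them up considerably; the table doesn't tell us how skewed each distribution is.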

How it was done

That's the end of the stats, but for those interested in how it was done, read on. (It's like a bonus blog post!)

Fingerprinting

The short version is that most servers return unique components in their responses (which are XML documents) that allow us to fingerprint them. For example: a unique XML namespace; a comment that explicitly says what it is: <!-- MapServer version ... --> (hmm, what could that be?); a certain combination of supported formats; and even the path component of the URL to the service: https://psl.noaa.gov/thredds/wms/Datasets/NARR/Monthlies/monolevel/wspd.10m.mon.mean.nc.

We can also rely on lazy administrators who have left defaults in place: for example, default service titles ("MapGuide WMS Server") and abstracts, or a ridiculously long list of 5,000+ default projections that the server supports and that 1 in 6 GeoServer administrators haven't culled.

False Negatives over False Positives

Using these various fingerprints we can then assign a server-score to the response depending on which factors it meets. We leant towards false negatives, meaning if we weren't sure it was unique, we wouldn't use it as a fingerprint. This is evidenced by the low number of "UNSURE" results, the majority of which are some flavour of MapBender impersonating deegree.
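
To make the idea concrete, here is a minimal sketch of score-based fingerprinting. The three patterns come from the examples mentioned in this post, but the weights, the threshold, and the tie-breaking rule are assumptions for illustration; this is not GeoSeer's actual rule set.

```python
import re

# ASSUMPTION: illustrative fingerprints only, loosely based on the examples
# in this post; the real rules and weights are not published.
FINGERPRINTS = {
    "MapServer": [(re.compile(r"<!--\s*MapServer version"), 3)],
    "MapGuide": [(re.compile(r"<Title>MapGuide WMS Server</Title>"), 2)],
    "THREDDS": [(re.compile(r"/thredds/wms/"), 2)],
}
MIN_SCORE = 2  # hypothetical threshold below which we refuse to guess


def identify(url: str, body: str) -> str:
    """Return the best-scoring software name, or UNSURE / UNKNOWN."""
    text = url + "\n" + body
    scores = {
        name: sum(weight for pattern, weight in rules if pattern.search(text))
        for name, rules in FINGERPRINTS.items()
    }
    best = max(scores.values())
    if best < MIN_SCORE:
        return "UNKNOWN"  # nothing matched strongly enough
    winners = [name for name, score in scores.items() if score == best]
    # A tie means it could be one of several things, so we don't pick one.
    return winners[0] if len(winners) == 1 else "UNSURE"


print(identify("https://example.com/ows", "<!-- MapServer version 7.0 -->"))
```

Refusing to choose on a tie is what leaning towards false negatives looks like in practice: an ambiguous response is reported as UNSURE instead of being credited to the wrong software.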

Limitations

It's important with this sort of thing to point out the limitations of the methodology, and the caveats it comes with, like we do with the extents plots.

Fingerprinting does have its limits. For example, GeoWebCache is integrated into GeoServer, so stand-alone GeoWebCaches may be under-counted. Similarly, the proxy servers (MapProxy, GeoWebCache, and MapCache) are by definition only caches for actual renderers sitting behind them, and that rendering software could be anything. As such, the numbers for the caches should certainly be treated with a grain of salt; they may be underestimated because caches are often invisible. It also means that when they're not invisible, we have no way of knowing what's behind them.

Confidence Levels

For some pieces of software, we're very confident we've managed to identify all of the deployments in our index, because they have clear fingerprints that server administrators are extremely unlikely to change (custom namespaces, obnoxious hard-coded software-licensing details, etc). The table below shows how confident we are that we found all of that software within our index. High confidence means we're pretty sure we found it all; low confidence means there could be more deployments in the "UNKNOWN" and/or "UNSURE" groups.



Software | Confidence
ArcGIS Server | High
Cardogis | High
Cubeserv | High
deegree | Low
Erdas | High
Ewmapa | Low
Extensis | Low
Geognosis | High
Geomedia | High
GeoServer | Medium
GeoWebCache | Low
MapBender | Low
MapCache | Low
MapGuide | Low
MapProxy | Low
MapServer | High
QGIS Server | High
Tekla | Medium
THREDDS | Medium
Table showing how confident we are that we found all of the deployments of a specific type of software within our index.

General Notes and Caveats
  • There can be many software installations behind one "deployment".
  • Some domains have multiple different pieces of software behind them; this is why the number of deployments is higher than the number of "hosts" on the stats page.
  • Results are based on a snapshot of global geospatial services from mid May 2020.
  • Excludes servers that only have "meaningless" data/services, and demo/test servers. Only includes servers that actually serve data.
  • The GeoSeer index, while the largest we know of, doesn't cover all public services. But there's certainly enough that this should be an accurate representation.
  • This only covers public-facing, freely accessible services (i.e. the sort GeoSeer indexes). There will be many more deployments of all of this software that only point at internal corporate networks.
As ever, let us know if you have any thoughts/feedback/comments etc.


Data portals - not always the solution

Posted on 2020-01-10

There are lots of Open Data portals out there, many promising a wealth of data, be it spatial or otherwise.

We want to ask: how effective is this strategy of deploying lots of data portals? GeoSeer uses many of these portals as its seeds so we think we're in a good place to investigate them.

[Image: data.gov.uk data portal]
Data.gov.uk front end

How many are there?

The first problem is the sheer number of them. Let's start with CKAN, "the world’s leading Open Source data portal platform". It offers an API which GeoSeer can use to harvest geospatial web services from portals that use this software. It seems to be the software of choice for many government-based data portals and, as it stands right now, GeoSeer is aware of 191 working CKAN portals.
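
As a rough sketch of what harvesting from a CKAN portal can look like: CKAN's Action API exposes a package_search endpoint that pages through datasets, and each dataset lists resources with a declared format and URL. The portal address and sample response below are hypothetical, and this is an illustration of the general approach, not GeoSeer's harvester.

```python
import json
from urllib.parse import urlencode

OGC_FORMATS = {"WMS", "WFS", "WCS", "WMTS"}


def search_url(portal: str, start: int = 0, rows: int = 100) -> str:
    """Build a CKAN package_search URL for one page of results."""
    query = urlencode({"start": start, "rows": rows})
    return f"{portal.rstrip('/')}/api/3/action/package_search?{query}"


def ogc_resource_urls(response_text: str) -> list[str]:
    """Pull out resource URLs whose declared format is an OGC service type."""
    packages = json.loads(response_text)["result"]["results"]
    return [
        resource["url"]
        for package in packages
        for resource in package.get("resources", [])
        if (resource.get("format") or "").upper() in OGC_FORMATS
    ]


# Hand-written sample response so the sketch runs without network access.
sample = json.dumps({"success": True, "result": {"count": 1, "results": [
    {"name": "rivers", "resources": [
        {"format": "wms", "url": "https://maps.example.com/wms"},
        {"format": "CSV", "url": "https://data.example.com/rivers.csv"},
    ]},
]}})

print(search_url("https://ckan.example.com/"))
print(ogc_resource_urls(sample))
```

Relying on the declared format is fragile, since portals label resources inconsistently, so a real harvester would likely also need to inspect the URLs themselves.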

The other big data "portal" isn't really a piece of software but a standard: CSW - Catalogue Service for the Web. Basically it's a standard for serving metadata via XML, and most deployments we're aware of are either GeoNetwork or PyCSW; there's also ESRI's geoportal software but almost no-one seems to use that, and then whatever the European Commission rolled themselves for the INSPIRE CSW, which is different again. GeoSeer has 325 working CSW services in its index (and 47 non-working). Note that some of these may be hosted by the same organisation but be sharing metadata for different types of data, e.g. example.com/csw/ and example.com/csw/inspire.

Lots of portals == Good?

If you haven't done the maths in your head, that's 516 working data portals. There's actually some overlap between these groups; some CKAN portals also have CSW backends, but the general point stands.

Great, loads of data portals; that's great, right? Well, not quite. What if you want to actually find data? That is the ostensible purpose of all of these things, isn't it? Well, now as an end user you've got 516 data portals to search through... And of course those are just the data portals that support those two APIs; there are many bespoke data portals without nice APIs that GeoSeer doesn't crawl (for example, Belgium, or the DKAN software).

Let's say you want some data for a location in Colorado, USA. Do you use the local data portals, such as Denver Opendata, or go to the state level (Colorado's portal, which is actually what's behind Denver's portal), or the national data.gov, or maybe domain-specific ones like NETL's, NOAA's, NASA's, the USGS's, etc.? And that's ignoring the fact many of those organisations have multiple portals! You can see how this gets difficult fast.

And mostly unpopulated

The data portals themselves, at least the national ones, usually boast many datasets, but how many of the spatial web services that are out there are actually in these portals? GeoSeer has the largest index of these services that we know of (by a large margin), so we thought we'd compare them. Here's a table, then we'll break it down.


Metric | GeoSeer | CKAN | CKAN+ | CSW | CSW+
Number of Hosts | 5,002 | 259 (5.18%) | 299 (5.98%) | 461 (9.22%) | 536 (10.72%)
Number of Services | 215,699 | 7,783 (3.61%) | 20,785 (9.64%) | 7,971 (3.7%) | 18,186 (8.43%)
Number of Datasets | 2,044,644 | 65,959 (3.23%) | 144,259 (7.06%) | 130,065 (6.36%) | 234,528 (11.47%)

Only includes hosts/services/datasets that were live in December 2019.
Note: CKAN numbers *exclude* the USA's data.gov CKAN portal because its API is different from the other 190 out there.

So, what does this show?

The GeoSeer column shows how many of these things GeoSeer has in its index today. The CSW and CKAN columns show how many of these things each portal type has in it. So between 190 CKAN portals they only point to 65,959 spatial web service datasets across all of them, which is 3.23% of the number that are actually out there. And the 325 CSW services point to 130,065 such datasets between them.

There are also two column headers that end in a +. These columns represent the addition of some of GeoSeer's secret sauce. Basically we know and understand the spatial web standards and the software behind them, and using this knowledge we make intelligent guesses at what else might exist on a server. These two columns therefore represent how many things there actually are on the services that the CKAN/CSW portals point to, or at least a minimum number.

But what does all that mean?

Well, there are several things we can say with surety:

  • As a general rule, when a spatial web service is put online, only around half of the datasets/services on that box are actually put onto data portals. The "+" numbers are a minimum, so the real share is almost certainly lower.
  • Even if you did search all 515 CSW and CKAN Open Data portals (excluding CKAN USA), you'd only be searching a tiny fraction of the spatial web services out there: on the order of less than 8%!
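
Both comparisons can be recomputed from the dataset row of the table above. In this sketch the variable names are our own; the figures are the published ones.

```python
# Dataset counts copied from the table above.
geoseer_total = 2_044_644
listed = {"CKAN": 65_959, "CSW": 130_065}       # what the portals point to
on_server = {"CKAN": 144_259, "CSW": 234_528}   # the "+" columns

# For each portal type: (% of the GeoSeer index, % of what its servers host).
shares = {
    portal: (
        round(100 * listed[portal] / geoseer_total, 2),
        round(100 * listed[portal] / on_server[portal], 1),
    )
    for portal in listed
}

for portal, (of_index, of_server) in shares.items():
    print(f"{portal}: {of_index}% of the index, {of_server}% of what its servers host")
```

For datasets, the listed share of what the servers host comes out at roughly half for both portal types, and since the "+" columns are a lower bound on what is really there, the true share can only be smaller.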

Wrapping up

The above only covers False Negatives - things that should be in the data portals but are not. There's also the issue of False Positives: things the portals say exist but don't; with any luck we'll get around to analysing them in the future.

What's the solution to this? We don't claim to have answers, but if your organisation is considering rolling its own data portal ask yourself - is it worth it? For the considerable costs you're going to incur what value are you going to add that putting your data in the national database won't add? And if you run a national portal, make it easier to search, host and maintain data for local communities.

In the meantime, you can freely use GeoSeer to search across all those portals, and many more, easily and quickly. And you can integrate it into your own webgis/projects using the API or GeoSeer Licensed.


Plotting Dataset Extents

Posted on 2019-10-31

Back at the start of September we released some historical statistics and, almost as an afterthought, mentioned the new extents plots. In this post we explore those dataset extents plots in more detail.

Extent Plots; What Are They?

Put simply, every dataset that GeoSeer indexes follows various standards which say that the datasets should declare a rectangular bounding box which represents its spatial extent, i.e., where does the dataset cover? So if it's a dataset covering Germany, it should declare a box covering Germany, but because Germany isn't a rectangle, the box will overlap with surrounding countries to various degrees, including all of tiny Luxembourg.

What we've done is take all of those extent boxes and stack them up, overlapping them to create what we call extent plots, which show how many datasets cover a given area. As you can imagine, there are a lot of caveats to this process (as well as to the dataset extents themselves, like the Luxembourg example above); these are covered in detail on the datasets extent plots page.

What do they show?

One important caveat to remember is that the plots only show dataset extents that are entirely within the plot extent. The exception to this is the Global plot where we're also excluding the 191,052 global datasets (about 10% of all datasets) as they add nothing to the plot. It's interesting to note that ~10% of all datasets are global though.

So the plots show where in the world there are lots of spatial datasets available via WMS (Web Map Service), WFS (Web Feature Service), WCS (Web Coverage Service), and/or WMTS (Web Map Tile Service). There are two colour schemes; we're mostly using the spectral one here because it brings out more detail, even if it's not very aesthetically pleasing.

[Image: Global extent plot]

Starting with the global plot (above) it's obvious that the EU's INSPIRE directive has had a considerable effect, particularly in central Europe. The USA and Brazil also have considerable coverage.

Looking closer at West Europe (below), it becomes apparent that the rectangular nature of the extents is causing lots of overlaps in the region of the Netherlands/Belgium/Germany tripoint (apparently called the Vaalserberg), hence the very high values there. That said, there are still a lot of datasets covering the region.

[Image: West Europe]

Zooming even closer to one specific country such as the UK (below), it's possible to see a lot of nuance that's swamped out at the smaller scales. It's now clear that Wales has excellent national coverage compared to England (which is mostly just the South East, not even London), and certainly Scotland (just Glasgow and Edinburgh).

[Image: UK Extent plot]



Bad data

Do you notice anything odd about this plot of Africa (apart from it being green)?

[Image: Africa plot with bad data]

Yep, there's a very large number of datasets covering a tiny area off the west coast of Africa. Why? As you've probably guessed, it's because 6,527 datasets are wrongly declaring their extent as being entirely around 0, 0 (lat/lon). This floods out Africa, which unfortunately doesn't have many datasets in the first place. So we filter these bad datasets out of the plot extents to get the below. Now we can see that east Africa has a respectable number of datasets covering it, as do the Canary Islands.

[Image: Africa plot with good data]
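
A filter for those bad extents could be as simple as the sketch below. The tolerance value and the function name are assumptions for illustration; the post doesn't say exactly how "entirely around 0, 0" is decided.

```python
# Bounding boxes as (min_lon, min_lat, max_lon, max_lat) in WGS84 degrees.
# ASSUMPTION: the 0.01 degree tolerance is illustrative, not the real rule.
TOLERANCE = 0.01


def is_null_island(bbox: tuple[float, float, float, float]) -> bool:
    """True when every edge of the box sits at (or very near) 0, 0."""
    return all(abs(value) <= TOLERANCE for value in bbox)


boxes = [
    (0.0, 0.0, 0.0, 0.0),            # unset extent
    (-0.001, -0.001, 0.001, 0.001),  # effectively a point at 0, 0
    (33.9, -4.7, 41.9, 5.0),         # a hypothetical extent in east Africa
]
usable = [bbox for bbox in boxes if not is_null_island(bbox)]
print(usable)
```

A box that merely contains 0, 0, such as a global dataset, is not caught by this check, because its edges are far from zero.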



Technical; How We Make Them

We use Python to create these plots by reading in all of the WGS84 (coordinate system) bounding boxes from the GeoSeer index, stacking them together with NumPy as a two-dimensional array, and then plotting them out via Matplotlib. NumPy does the magic of summing the extents together remarkably quickly, so we can rebuild them every month with the updates.
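
A minimal sketch of the stacking step, assuming a simple global lon/lat grid. The one-degree cell size, the function name, and the two sample boxes are made up for illustration, and the real pipeline may well differ.

```python
import numpy as np

CELL = 1.0  # grid resolution in degrees (assumed; coarse to keep the demo small)


def stack_extents(boxes, cell=CELL):
    """Count how many bounding boxes cover each cell of a global lon/lat grid."""
    rows, cols = int(180 / cell), int(360 / cell)
    grid = np.zeros((rows, cols), dtype=np.int32)
    for min_lon, min_lat, max_lon, max_lat in boxes:
        # Convert degrees to grid indices; row 0 is the southernmost row.
        c0 = int(np.floor((min_lon + 180) / cell))
        c1 = int(np.ceil((max_lon + 180) / cell))
        r0 = int(np.floor((min_lat + 90) / cell))
        r1 = int(np.ceil((max_lat + 90) / cell))
        grid[r0:r1, c0:c1] += 1  # one slice update adds the whole box at once
    return grid


# Two hypothetical overlapping boxes: (min_lon, min_lat, max_lon, max_lat).
grid = stack_extents([(0, 0, 10, 10), (5, 5, 15, 15)])
print(grid.max(), grid.sum())
```

The resulting array can be handed to Matplotlib's imshow (with origin="lower", since row 0 is the south edge) to get a heat map of dataset density.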

Closing Remarks

There are other interesting insights that can be gleaned from these; take a look at the Datasets Extent Plots page for more. This is a good case study of the sort of cool stuff you can do with Python and GeoSeer Datasets, for example if you have a research itch you need to scratch.

Because we like to share, the plots are available for use under the CC-BY 4.0 license, which means you can do anything you want with them but please link back to GeoSeer.

Finally, if there's any particular area you think would make an interesting plot, let us know and we'll take a look.


New Historical Statistics and Extent Plots

Posted on 2019-09-02

The GeoSeer stats page went live just shy of a year ago and we've been meaning to update it with more stats ever since. Today we've done just that, with a few new stats, and a lot of cool plots.

The first statistic is the simplest: the number of countries that are hosting OGC services. A country for our purposes is simply defined as having a unique ccTLD (the last part of a domain: .pl, .us, .br, .au, etc.). At the time of writing this blog post, it's 87 of the 244 defined ccTLDs. (Note this does include .eu for the European Union, which most people wouldn't actually consider a country.)
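
Counting countries this way boils down to pulling the last label off each hostname and keeping the two-letter ones. A small sketch, with hypothetical URLs and a function name of our own choosing:

```python
from typing import Optional
from urllib.parse import urlsplit


def cctld(url: str) -> Optional[str]:
    """Return the two-letter country-code TLD of a URL's host, if it has one."""
    host = urlsplit(url).hostname or ""
    last_label = host.rsplit(".", 1)[-1]
    # Latin-script ccTLDs are exactly two letters; generic TLDs are longer.
    if len(last_label) == 2 and last_label.isascii() and last_label.isalpha():
        return "." + last_label
    return None


urls = [
    "https://geoportal.example.pl/wms",
    "https://maps.example.br/wfs",
    "http://data.example.pl:8080/wcs",
    "https://services.example.com/wmts",
]
countries = {tld for tld in map(cctld, urls) if tld}
print(sorted(countries))
```

Services on generic domains such as .com or .org simply don't count towards any country under this definition.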

Historical Data

GeoSeer has been live for almost 18 months now, and we've been crawling the WWW for OGC services for even longer. This means we have a trove of historical data about services, and the new stats expose some of that. If you look at the stats page now, you'll see the General Stats section has been tweaked slightly.

As well as continuing to show stats about the current state of OGC services "Now", we've added an extra column for "Ever" which shows the total numbers that we've ever found since we started doing this. Then with a little maths we show the percentage of the things we've ever found that are still alive now.

The Ephemeral Nature of Public Data
Datasets

The single most glaring statistic from this historical data is that we've found a total of 4,949,124 datasets since we started crawling, but only 1,865,660 are live and active in our index right now. Or put another way, just 37.7% of the datasets hosted by OGC services that were publicly available at some point in the past 18 months are still online!

Services

And while that's the most stand-out statistic, the others also show how transient the OGC services that host these datasets are. Over the course of the past ~18 months we've found 291,779 different services, yet only 71.83% of them were online and responding on our last crawl.

Hosts

The final statistic of note here is the number of hosts. These are the domain names themselves, and different subdomains are counted as different hosts (so www.example.com is different from ogc.example.com). Even these have experienced considerable churn over what is a relatively short period of time, with only 85.5% of hosts remaining online. We should point out that we ignore the scheme (that's the http:// or https://) and ignore the port when we consider if something is a "host", so if a host changes from insecure to secure (and quite a few do), it won't make a difference to this statistic.
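
In code, that definition of a host amounts to keeping only the hostname of each URL. A sketch using Python's standard library, with made-up URLs:

```python
from urllib.parse import urlsplit


def host_key(url: str) -> str:
    """Reduce a service URL to its host, ignoring scheme, port, and path."""
    return urlsplit(url).hostname or ""


urls = [
    "http://ogc.example.com/wms",
    "https://ogc.example.com:8443/wfs",
    "https://www.example.com/wms",
]
hosts = {host_key(url) for url in urls}
print(sorted(hosts))
```

The first two URLs collapse into a single host despite the different scheme and port, while the www subdomain stays separate.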

Thoughts

All of this change makes it harder for users to rely on this data even if they can find it. Especially for things like scientific research which relies on repeatability, including the ability for other scientists to go back and take a second look at the original data; a difficult thing to do when the datasets/services/hosts have gone offline.

This also highlights the importance of keeping data portals current. Link rot is a real thing and data curators need to ensure they maintain their portals otherwise the portals are worse than useless (because they're wasting everyone's time with bad links).

Extent Plots

The other part of this statistics update is a collection of extent map plots that show what parts of the world have datasets. We're going to do a separate blog post about them in the future.


A Midyear Update

Posted on 2019-08-07

The GeoSeer index of OGC Services continues to grow, now standing tantalisingly close to 200,000 services: there are currently 197,911 from over four thousand four hundred different hosts. And of course this only includes active services; the index is kept in an "evergreen" state consisting only of services that actually worked when we last queried them. There are many more services that are intermittent but these aren't useful to you so don't feature in the index.

On adventures we go

As well as continuing to hone and expand the service, we've also been participating in some community events. In June we participated in the OGC's API Hackathon in London, part of the process for developing the next generation of OGC spatial standards. They're at an early phase - with API Features being the furthest along - and we participated with the aim of making sure that discoverability was kept in mind during their development. After all, there's no point developing cutting edge standards if no-one can find implementations.

Then we went to Italy to the European Commission's Joint Research Centre (JRC) - home of INSPIRE - to present and participate in a workshop about service discovery and search engines with regards to INSPIRE services. We met some of the people behind a few of the portals we harvest from, and exchanged thoughts on how services and data can be made more discoverable.

Statistics - SRS

We received a user enquiry as to which Spatial Reference System (SRS) was most common in OGC services, so we did a quick check and wanted to share the top results with everyone, because who doesn't like stats? Note that there are lots of caveats that we won't go into here; we're sharing these as-is. It doesn't come as a surprise that EPSG:4326 (aka WGS84) and Web Mercator are the most common.

SRS Code | Name | Number of datasets | Notes
EPSG:4326 | WGS84 | 892,331 | Standard Latitude-Longitude
CRS:84 | WGS84 | 514,924 | Longitude-Latitude swapped version of WGS84
EPSG:3857 | Web Mercator | 394,736 | De facto web mapping projection
EPSG:900913 | Web Mercator | 259,519 | Deprecated code for Web Mercator
EPSG:4258 | ETRS89 | 166,684 | Europe
EPSG:25832 | ETRS89 / UTM zone 32N | 133,604 | Europe between 6°E and 12°E
EPSG:102100 | Web Mercator | 102,823 | ArcGIS Online version of EPSG:3857

The datasets define 1,318 different SRSs; above are just the ones with more than 100,000 datasets. We're always open to doing some stats analysis, just ask.

Licensing?

Finally, we've started investigating making the database available to third parties via licensing. If you're interested, let us know. Watch this space.


One Million Layers, and a Stats Page

Posted on 2018-09-27

GeoSeer has now hit the one million distinct spatial layers milestone in its index. That's a staggering amount of spatial data, and all of it is freely accessible via OGC standards, and of course, also easily searchable with GeoSeer. This actually represents over 1.7 million publicly available WMS, WFS, WCS, and WMTS layers - see this previous blog post for a discussion on why this number is even higher. This represents data from over 100,000 OGC services.

We've been steadily increasing the number of layers in our index since launch through a combination of things: our ongoing efforts to expand where we collect data from, improvements to the GeoSeerBot (we feed it lots of veggies!), and ever more layers being added to services we already index.

How many more layers and services are there out there? We don't know; but we plan on doing a blog post about the number of services, so keep an eye out. And we're going to keep trying to find more.

What was that about a Stats Page?

That's right, because we're big data nerds (see what we did there?), we've also created a page that's got a high-level breakdown of statistics for what's in our index. You can find the new stats page here. We don't claim to have a complete index of all public OGC services, but we're fairly certain it's a large chunk of the ones that are out there, so this is a fairly representative sample of what's available on the internet.

The stats page will be updated about once a month and should always approximately represent what's in our index. In the future we plan on adding further and more detailed statistics including a breakdown of what middleware is used to run these services, so keep an eye out for it.

Need more stats? Ask away!

If there's any particular statistic you're interested in that's not on there, let us know and we'll consider adding it. Or if we don't think others will find it interesting (how many people really want to know that the mean number of layers per endpoint is, at the time of writing, 12.99, or that the median and mode are both 2, the minimum is 1, and the maximum is 4,629?), we'll tell you directly; we try to be nice like that. So ask away.
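A mean of 12.99 alongside a median and mode of 2 points to a heavily skewed distribution: most endpoints serve a couple of layers, while a few serve thousands. The sketch below shows the same effect using Python's standard `statistics` module. The `layers_per_endpoint` values are invented for illustration and are not taken from the index.

```python
import statistics

# Hypothetical layers-per-endpoint counts, invented for illustration only.
# One large endpoint is enough to pull the mean far above the median.
layers_per_endpoint = [1, 1, 2, 2, 2, 2, 3, 4, 5, 120]

mean = statistics.mean(layers_per_endpoint)
median = statistics.median(layers_per_endpoint)
mode = statistics.mode(layers_per_endpoint)

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"min={min(layers_per_endpoint)} max={max(layers_per_endpoint)}")
```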


So, how many OGC layers are there?

Posted on 2018-07-18

Updated: 2018-09-27 with numbers for September 2018 which also reflects improvements to how we group things together.

One of the questions we come across quite often is the deceptively simple "How many layers are there?" At the time of writing our front page says "over 1 million distinct... layers", so that's the answer, right? Well, not quite, and why is that "distinct" in there anyway? There are actually quite a few potential answers, so let's go through them.
Note: All numbers in this post are correct at the time of writing but will certainly change within a few weeks as we continue to index more services.

That's a lot of layers
Let's start with the largest number: 1,773,337 layers. This is also the simplest number - it's the total number of layers that we find in all of the unique capabilities documents that we download ("capabilities documents" are what map servers use to tell the world what layers they have and what features they support). This is the easiest number to give, and the one most commonly given. It is "correct" in that there really are over 1.7 million layers out there across various service endpoints, but as you'll see from the other numbers, there are a few problems with using it.
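For readers who haven't looked inside a capabilities document, here is a minimal sketch of pulling layer names out of one using Python's standard library. The sample XML is a trimmed, made-up WMS 1.3.0 document. Counting only layers that carry a `Name` element is an assumption of this sketch, not a description of how GeoSeerBot counts.

```python
import xml.etree.ElementTree as ET

# A trimmed, made-up WMS 1.3.0 capabilities document (illustration only).
SAMPLE_CAPABILITIES = """
<WMS_Capabilities xmlns="http://www.opengis.net/wms" version="1.3.0">
  <Capability>
    <Layer>
      <Title>Example server</Title>
      <Layer><Name>roads</Name><Title>Road network</Title></Layer>
      <Layer><Name>rivers</Name><Title>Rivers</Title></Layer>
    </Layer>
  </Capability>
</WMS_Capabilities>
""".strip()

WMS_NS = {"wms": "http://www.opengis.net/wms"}


def named_layers(capabilities_xml):
    """Return the Name of every Layer element that has one."""
    root = ET.fromstring(capabilities_xml)
    names = []
    for layer in root.iterfind(".//wms:Layer", WMS_NS):
        # Only look at the layer's own Name child, not its descendants'.
        name = layer.find("wms:Name", WMS_NS)
        if name is not None and name.text:
            names.append(name.text.strip())
    return names


print(named_layers(SAMPLE_CAPABILITIES))
```

The outer `Layer` here is only a grouping container with no `Name`, so the sketch skips it.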
Meaningless layers
We do a lot of work to try and weed out "meaningless" layers from our index. This isn't a reflection on the data inside the layers, but on the metadata in the capabilities document. For instance there's no point us indexing a layer that has a name of "1" and no other information; for all we know these layers may have great data behind them, but if there isn't even a meaningful name our users will never be able to find those layers, so we simply remove them to stop them cluttering up the results.
It's at this stage that we also remove layers that are pre-installed defaults, like the TOPP/Tasmania data that comes with GeoServer.
In total all this filtering gets rid of over 47,000 layers, leaving us with around 1,720,000 layers.
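The kind of filtering described above could be sketched roughly as follows. The rules are hypothetical stand-ins, not GeoSeer's actual ones: the `DEFAULT_PREFIXES` value and the "at least three letters" threshold are assumptions chosen only to make the example concrete.

```python
# Hypothetical prefix for pre-installed demo layers (an assumption).
DEFAULT_PREFIXES = ("topp:tasmania_",)


def has_words(text):
    """Treat text as meaningful if it contains at least three letters."""
    return sum(ch.isalpha() for ch in text) >= 3


def is_meaningful(layer):
    """Keep a layer only if some metadata field could plausibly be searched."""
    name = (layer.get("name") or "").strip()
    title = (layer.get("title") or "").strip()
    abstract = (layer.get("abstract") or "").strip()
    if name.lower().startswith(DEFAULT_PREFIXES):
        return False  # pre-installed default data
    return any(has_words(text) for text in (name, title, abstract))


layers = [
    {"name": "1"},
    {"name": "topp:tasmania_roads", "title": "Tasmania roads"},
    {"name": "flood_zones", "title": "Flood risk zones"},
]
kept = [layer["name"] for layer in layers if is_meaningful(layer)]
print(kept)
```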
Many endpoints and the same layer
It turns out a lot of those layers are duplicates; there are many services out there which have lots of different endpoints (the URLs you use to access them) that all serve the same layer(s). In fact, there is one single layer that is served by over 2,400 endpoints on the same host-domain (we group services by host-domains as part of the de-duplication process). That's an exception, but there are over 580 layers that are duplicated over a hundred times on the same host-domain, and in total we identify over 717,000 duplicate layers. We don't get rid of them entirely - you may have noticed in the results that we list multiple capabilities URLs for some layers - but we don't count them as separate layers. Once we get rid of all of those, we're down to 1,055,836 layers. It's also quite surprising to see that roughly 40% of the layers out there are duplicates.
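As a rough illustration of grouping by host-domain, the sketch below collects the endpoints that serve what looks like the same layer on the same host. Treating the URL's hostname as the "host-domain" and using name plus title as the duplicate key are assumptions made for this example; the real de-duplication may compare more than that.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_domain(url):
    """Hostname of an endpoint URL (our stand-in for 'host-domain')."""
    return (urlsplit(url).hostname or "").lower()


def group_duplicates(layers):
    """Map (host, name, title) to the list of endpoints serving that layer."""
    groups = defaultdict(list)
    for layer in layers:
        key = (host_domain(layer["endpoint"]), layer["name"], layer["title"])
        groups[key].append(layer["endpoint"])
    return groups


# Made-up records for illustration.
records = [
    {"endpoint": "https://maps.example.org/a/wms", "name": "roads", "title": "Road network"},
    {"endpoint": "https://maps.example.org/b/wms", "name": "roads", "title": "Road network"},
    {"endpoint": "https://maps.example.org/a/wms", "name": "rivers", "title": "Rivers"},
    {"endpoint": "https://data.example.net/wms", "name": "roads", "title": "Road network"},
]

groups = group_duplicates(records)
duplicates = len(records) - len(groups)
print(f"{len(groups)} distinct layers, {duplicates} duplicate")
```

Each group keeps its full list of endpoints, which mirrors the way the results list several capabilities URLs for one layer.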
Different service types
The final component is - what happens if a layer is served up from the same server as both WMS and WFS? Or WMS/WCS/WMTS, etc? For our purposes we try and group them together and treat them as a single layer, but as you've likely noticed in the results, we do flag that a layer is available as multiple service types. There are surprisingly few of these: only 10,752 layers are used across service types. This is where we get our final, front-page number of 1,045,059 layers.
So which is it?

As you've probably gathered by now, there isn't a "right" number. We choose to use the lowest number because it's most honest for our purposes; when you search GeoSeer you're searching 1,045,059 distinct spatial layers. It's of no help to you if you get the same layer 127 times in the results because that's how many endpoints host it. Yet across all servers and endpoints, GeoSeer is searching what represents 1,773,337 separate publicly accessible layers.
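To put the two headline numbers side by side, here is a small worked calculation using only the figures quoted in this post; the variable names are ours.

```python
# Figures quoted in this post (September 2018 update).
total_layers = 1_773_337      # every layer in every capabilities document
after_dedup = 1_055_836       # after filtering and de-duplication
distinct_layers = 1_045_059   # after grouping across service types

removed = total_layers - distinct_layers
distinct_share = distinct_layers / total_layers

print(f"Removed or merged along the way: {removed:,}")
print(f"Distinct share of all layers: {distinct_share:.1%}")
```

In other words, a little under 59% of the publicly advertised layers end up as distinct entries in the search index.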

Blog content licensed as: CC-BY-SA 4.0