The GeoSeer Blog

What's the most deployed geospatial server software?

Posted on 2020-06-04

One of the things we've been meaning to do for a long time is investigate which geospatial server software is most prevalent for serving up all these services GeoSeer has in its index. After all, what's the point of having the world's largest index of geospatial web services at your fingertips (shameless plug!) if you're not going to use it to answer interesting questions?

The answer is...

... not 42, that's a different question. The one-word answer is: ArcGIS. But as ever with these things, there's a much more nuanced story to tell. For example, the software that hosts the most datasets is easily GeoServer. The question we're answering is: what's the most deployed software out there for serving up publicly accessible geospatial data via WMS, WFS, WCS, and WMTS services? While that may read like a lot of caveats, this isn't a tabloid newspaper! Here are the results in one big table.

Note: Deployment = At least one instance of this software, grouped by domain (i.e. geoserver.example.com, and geoserver2.example.com are two deployments); Service = A single service, the thing you get when you copy/paste a WMS/WFS/WCS/WMTS URL into your GIS.


| Software | # Deployments | # Services | # Datasets |
| --- | --- | --- | --- |
| ArcGIS Server | 2,755 | 72,054 | 517,169 |
| Cardogis | 13 | 714 | 2,869 |
| Cubeserv | 7 | 36 | 1,141 |
| deegree* | 45 | 3,043 | 15,062 |
| Erdas | 16 | 52 | 797 |
| Ewmapa | 17 | 17 | 189 |
| Extensis | 2 | 2 | 123 |
| Geognosis | 6 | 11 | 256 |
| Geomedia | 26 | 533 | 11,078 |
| GeoServer* | 964 | 22,673 | 963,603 |
| GeoWebCache* | 49 | 98 | 42,128 |
| MapBender* | 4 | 30,997 | 31,060 |
| MapCache* | 14 | 23 | 7,495 |
| MapGuide* | 4 | 5 | 258 |
| MapProxy* | 15 | 65 | 610 |
| MapServer* | 544 | 57,606 | 389,709 |
| QGIS Server* | 60 | 613 | 11,924 |
| Tekla | 22 | 22 | 461 |
| THREDDS* | 43 | 26,976 | 51,345 |
| UNSURE | 17 | 439 | 1,395 |
| UNKNOWN | 507 | 12,470 | 178,995 |

Type of geospatial server software and its count of deployments, as well as the number of hosted services and datasets provided by it.
UNSURE means it could be one of several things. UNKNOWN means no idea at all. Software marked with * is Open Source.

The proprietary world

The first thing that jumps out is that ArcGIS has a huge number of deployments: 2,755, which is 53.7% of them. In reality, there are a lot more ArcGIS servers out there (at least ~4,900 in our index), but here we're only counting the ones that are serving WMS/WFS/WMTS/WCS. The rest are likely only serving via ESRI standards.

The next obvious thing with regard to proprietary software is that outside of ArcGIS, the rest aren't even "also rans": together they total just 2.12% of the deployments and are behind only 0.75% of the datasets served. Barely a rounding error! It's likely there are a few more pieces of proprietary software in the UNKNOWN grouping, but probably not enough to make a real difference.

The power of Open Source

Open Source has a much healthier ecosystem, with MapServer and GeoServer having very large deployment counts, and niche servers like THREDDS (oceanic data community) and GeoWebCache (caching server) also serving up a lot of data.

If you group the proprietary/open source servers together, things become even more interesting:


| Software Type | # Deployments | # Services | # Datasets |
| --- | --- | --- | --- |
| Proprietary | 2,864 (55.83%) | 73,441 (32.15%) | 534,083 (23.97%) |
| Open Source | 1,742 (33.96%) | 142,099 (62.2%) | 1,513,194 (67.93%) |
| UNSURE/UNKNOWN | 507 (9.88%) | 12,470 (5.46%) | 178,995 (8.04%) |

The total number of Deployments, Services, and Datasets grouped by whether the software is Open Source or proprietary.

Figure: Graph version of the above table, showing the percentages of deployments/services/datasets served by proprietary, Open Source, and unknown software.
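For anyone who wants to check the grouping, here is a minimal sketch of how the per-software rows in the first table roll up into the two named groups. The figures are copied from that table; which software counts as Open Source is an assumption inferred from the published totals (this split reproduces them exactly), and the UNSURE/UNKNOWN rows are left out.

```python
from collections import defaultdict
from typing import Dict, Tuple

# software: (deployments, services, datasets, group), copied from the first table.
# The group labels are an assumption inferred from the published group totals.
ROWS: Dict[str, Tuple[int, int, int, str]] = {
    "ArcGIS Server": (2755, 72054, 517169, "Proprietary"),
    "Cardogis": (13, 714, 2869, "Proprietary"),
    "Cubeserv": (7, 36, 1141, "Proprietary"),
    "deegree": (45, 3043, 15062, "Open Source"),
    "Erdas": (16, 52, 797, "Proprietary"),
    "Ewmapa": (17, 17, 189, "Proprietary"),
    "Extensis": (2, 2, 123, "Proprietary"),
    "Geognosis": (6, 11, 256, "Proprietary"),
    "Geomedia": (26, 533, 11078, "Proprietary"),
    "GeoServer": (964, 22673, 963603, "Open Source"),
    "GeoWebCache": (49, 98, 42128, "Open Source"),
    "MapBender": (4, 30997, 31060, "Open Source"),
    "MapCache": (14, 23, 7495, "Open Source"),
    "MapGuide": (4, 5, 258, "Open Source"),
    "MapProxy": (15, 65, 610, "Open Source"),
    "MapServer": (544, 57606, 389709, "Open Source"),
    "QGIS Server": (60, 613, 11924, "Open Source"),
    "Tekla": (22, 22, 461, "Proprietary"),
    "THREDDS": (43, 26976, 51345, "Open Source"),
}


def group_totals(rows: Dict[str, Tuple[int, int, int, str]]) -> Dict[str, Tuple[int, int, int]]:
    """Sum deployments, services and datasets for each group."""
    totals = defaultdict(lambda: [0, 0, 0])
    for deployments, services, datasets, group in rows.values():
        totals[group][0] += deployments
        totals[group][1] += services
        totals[group][2] += datasets
    return {group: tuple(values) for group, values in totals.items()}


if __name__ == "__main__":
    for group, (deployments, services, datasets) in group_totals(ROWS).items():
        print(f"{group}: {deployments:,} deployments, {services:,} services, {datasets:,} datasets")
```

The sums come out at the Proprietary and Open Source figures shown in the grouped table.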

Looking at the above it rapidly becomes clear that while there may be a lot of ArcGIS deployments, they're not sharing much data as compared to the Open Source installs. It seems reasonable to conclude that ESRI are very good at selling their software to cities/counties/local provinces, who then use it to comply with "Open Data" edicts, but when it comes time to roll out an SDI, Open Source is where it's at. In fact, Open Source solutions are behind at least two thirds of the world's OGC served datasets!

Deployment patterns

One final data table: this one breaks down some of the software a bit further, this time including the average number of services per deployment and datasets per deployment.


| Software Type | # Deployments | # Services | Avg Services/Deployment | # Datasets | Avg Datasets/Deployment |
| --- | --- | --- | --- | --- | --- |
| Popular Data Servers | | | | | |
| ArcGIS | 2,755 (53.7%) | 72,054 (31.54%) | 26.15 | 517,169 (23.22%) | 187.72 |
| GeoServer | 964 (18.79%) | 22,673 (9.92%) | 23.52 | 963,603 (43.26%) | 999.59 |
| MapServer | 544 (10.6%) | 57,606 (25.22%) | 105.89 | 389,709 (17.49%) | 716.38 |
| THREDDS | 43 (0.84%) | 26,976 (11.81%) | 627.35 | 51,345 (2.3%) | 1194.07 |
| Totals (All) | 5,130 | 228,449 | 44.53 | 2,227,667 | 434.24 |

Subset of data. Includes the average (mean) number of services and datasets per deployment.

This further reinforces the point that ArcGIS deployments don't have many datasets on them as compared to the Open Source variants. It also shows how different servers structure themselves; MapServer has a lot of services per deployment, and THREDDS has a huge number. THREDDS then carries this over to a very high number of datasets per deployment as well, explaining why with such a low number of deployments it still serves more datasets than all of the non-ArcGIS proprietary systems combined.
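The averages and percentages in the table above are plain divisions. As a small sketch (the function names are ours, the numbers are the ones published in the table), they can be reproduced like this:

```python
from typing import Dict, Tuple

# software: (deployments, services, datasets), copied from the table above.
POPULAR: Dict[str, Tuple[int, int, int]] = {
    "ArcGIS": (2755, 72054, 517169),
    "GeoServer": (964, 22673, 963603),
    "MapServer": (544, 57606, 389709),
    "THREDDS": (43, 26976, 51345),
}
TOTALS = (5130, 228449, 2227667)  # the "Totals (All)" row


def per_deployment(deployments: int, services: int, datasets: int) -> Tuple[float, float]:
    """Mean services and mean datasets per deployment, rounded to 2 decimals."""
    return round(services / deployments, 2), round(datasets / deployments, 2)


def share(count: int, total: int) -> float:
    """Percentage of the overall total, rounded to 2 decimals."""
    return round(100 * count / total, 2)


if __name__ == "__main__":
    for name, row in POPULAR.items():
        avg_services, avg_datasets = per_deployment(*row)
        print(f"{name}: {avg_services} services/deployment, {avg_datasets} datasets/deployment, "
              f"{share(row[0], TOTALS[0])}% of deployments")
```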

How it was done

That's the end of the stats, but for those interested in how it was done, read on. (It's like a bonus blog post!)

Fingerprinting

The short version is that most servers return unique components in their responses (which are XML documents) that allow us to fingerprint them. For example: a unique XML namespace; a comment that explicitly says what it is: <!-- MapServer version ... --> (hmm, what could that be?); a certain combination of supported formats; and even the path component of the URL to the service: https://psl.noaa.gov/thredds/wms/Datasets/NARR/Monthlies/monolevel/wspd.10m.mon.mean.nc.

We can also rely on lazy administrators who have left defaults in place. For example, default service titles ("MapGuide WMS Server") and abstracts, or a ridiculously long, 5000+ item list of default projections that the server supports, which 1 in 6 GeoServer administrators haven't culled.
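To make that a bit more concrete, here is a minimal sketch of what checking a few of these fingerprints could look like. It is not the code GeoSeer runs: it only uses the three examples mentioned above (the MapServer comment, the "thredds" URL path, and the default MapGuide title), and the function name and matching rules are our own simplifications.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Illustrative fingerprints only, limited to the examples given in this post.
MAPSERVER_COMMENT = re.compile(r"<!--\s*MapServer version", re.IGNORECASE)
DEFAULT_TITLES = {"MapGuide WMS Server": "MapGuide"}


def fingerprint(url: str, capabilities_xml: str) -> Optional[str]:
    """Return a software name if one of the known fingerprints matches, else None."""
    # 1. A comment in the XML that explicitly names the software.
    if MAPSERVER_COMMENT.search(capabilities_xml):
        return "MapServer"

    # 2. A tell-tale component in the path of the service URL.
    segments = [segment.lower() for segment in urlparse(url).path.split("/") if segment]
    if "thredds" in segments:
        return "THREDDS"

    # 3. A default service title that the administrator never changed.
    for title, software in DEFAULT_TITLES.items():
        if f"<Title>{title}</Title>" in capabilities_xml:
            return software

    return None
```

A real fingerprinter would parse the XML properly and look at namespaces and supported formats as well; plain string matching just keeps the sketch short.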

False Negatives over False Positives

Using these various fingerprints we can then assign a server-score to the response depending on which factors it meets. We leant towards false negatives, meaning if we weren't sure it was unique, we wouldn't use it as a fingerprint. This is evidenced by the low number of "UNSURE" results, the majority of which are some flavour of MapBender impersonating deegree.
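As a rough illustration of the scoring idea, assuming made-up signal names, weights, and threshold (none of these are GeoSeer's real values, and "ServerA"/"ServerB" are placeholders rather than real software), a classifier that prefers false negatives could look like this:

```python
from typing import Dict, Set

# Hypothetical threshold: below this we would rather say UNKNOWN than guess.
MIN_SCORE = 3

# Hypothetical weights per signal for two placeholder pieces of software.
SIGNAL_WEIGHTS: Dict[str, Dict[str, int]] = {
    "ServerA": {"unique_namespace": 3, "default_title": 1},
    "ServerB": {"identifying_comment": 3, "url_path": 1},
}


def classify(matched_signals: Set[str]) -> str:
    """Score every candidate and only name one if exactly one is convincing."""
    scores = {
        software: sum(weight for signal, weight in signals.items() if signal in matched_signals)
        for software, signals in SIGNAL_WEIGHTS.items()
    }
    confident = [software for software, score in scores.items() if score >= MIN_SCORE]
    if len(confident) == 1:
        return confident[0]
    if len(confident) > 1:
        return "UNSURE"   # could be one of several things
    return "UNKNOWN"      # no idea at all
```

With these made-up weights a weak signal on its own, such as a default title, is not enough to name the software, which is the "false negatives over false positives" trade-off in miniature.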

Limitations

It's important with this sort of thing to point out the limitations of the methodology, and the caveats it comes with, like we do with the extents plots.

Fingerprinting does have its limits. For example, GeoWebCache is integrated into GeoServer, so stand-alone GeoWebCaches may be under-counted. Similarly, the proxy servers (MapProxy, GeoWebCache, and MapCache) are by definition only caches for the actual renderers sitting behind them, and that rendering software could be anything. As such, the numbers for the caches should certainly be treated with a grain of salt; they may be underestimated because caches are often invisible. This also means that when they're not invisible, we have no way of knowing what's behind them.

Confidence Levels

For some pieces of software we're very confident that we've identified all of the deployments in our index, because they have clear fingerprints that server administrators are extremely unlikely to change (custom namespaces, obnoxious hard-coded software-licensing details, etc). The below table shows how confident we are that we found all of that software within our index. High confidence means we're pretty sure we found it all; low confidence means there could be more deployments in the "UNKNOWN" and/or "UNSURE" groups.



| Software | Confidence |
| --- | --- |
| ArcGIS Server | High |
| Cardogis | High |
| Cubeserv | High |
| deegree | Low |
| Erdas | High |
| Ewmapa | Low |
| Extensis | Low |
| Geognosis | High |
| Geomedia | High |
| GeoServer | Medium |
| GeoWebCache | Low |
| MapBender | Low |
| MapCache | Low |
| MapGuide | Low |
| MapProxy | Low |
| MapServer | High |
| QGIS Server | High |
| Tekla | Medium |
| THREDDS | Medium |

Table showing how confident we are that we found all of the deployments of a specific type of software within our index.

General Notes and Caveats
  • There can be many software installations behind one "deployment".
  • Some domains have multiple different pieces of software behind them; this is why the number of deployments is higher than the number of "hosts" on the stats page.
  • Results are based on a snapshot of global geospatial services from mid-May 2020.
  • Excludes servers that only have "meaningless" data/services, and demo/test servers. Only includes servers that actually serve data.
  • The GeoSeer index, while the largest we know of, doesn't cover all public services. But there's certainly enough that this should be an accurate representation.
  • This only covers public-facing, freely accessible services (i.e. the sort GeoSeer indexes). There will be many more deployments of all of this software that only point at internal corporate networks.
As ever, let us know if you have any thoughts/feedback/comments etc.


Data portals - not always the solution

Posted on 2020-01-10

There are lots of Open Data portals out there, many promising a wealth of data, be it spatial or otherwise.

We want to ask: how effective is this strategy of deploying lots of data portals? GeoSeer uses many of these portals as its seeds so we think we're in a good place to investigate them.

Figure: The data.gov.uk data portal front end.

How many are there?

The first problem is the sheer number of them. Let's start with CKAN; CKAN is "the world’s leading Open Source data portal platform" and it offers an API which GeoSeer can use to harvest geospatial web services from portals that use this software. It seems to be the software of choice for many government-based data portals and, as it stands right now, GeoSeer is aware of 191 working CKAN portals.
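For the curious, here is a rough sketch of what harvesting from a CKAN portal could look like. It is not GeoSeer's actual harvester. It assumes the portal exposes CKAN's standard Action API `package_search` endpoint and that resources carry a `format` value such as "WMS"; the portal URL and the helper names are made up for illustration.

```python
from typing import List
from urllib.parse import urlencode

# Resource formats we treat as geospatial web services (an assumption: portals
# fill in "format" as free text, so real harvesting needs looser matching).
SERVICE_FORMATS = {"WMS", "WFS", "WCS", "WMTS"}


def package_search_url(portal: str, start: int = 0, rows: int = 100) -> str:
    """Build a paged package_search URL for a CKAN portal."""
    query = urlencode({"start": start, "rows": rows})
    return f"{portal.rstrip('/')}/api/3/action/package_search?{query}"


def service_urls(response: dict) -> List[str]:
    """Pull service URLs out of a decoded package_search JSON response."""
    urls = []
    for package in response.get("result", {}).get("results", []):
        for resource in package.get("resources", []):
            resource_format = (resource.get("format") or "").upper()
            if resource_format in SERVICE_FORMATS and resource.get("url"):
                urls.append(resource["url"])
    return urls
```

You would fetch `package_search_url("https://ckan.example.org")` with any HTTP client, decode the JSON, pass it to `service_urls`, and keep increasing `start` until no more results come back.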

The other big data "portal" isn't really a piece of software but a standard: CSW - Catalog Services for the Web. Basically it's a standard for serving metadata via XML, and most deployments we're aware of are either GeoNetwork or PyCSW; there's also ESRI's geoportal software, but almost no-one seems to use that, and then whatever the European Commission rolled themselves for the INSPIRE CSW, which is different again. GeoSeer has 325 working CSW services in its index (and 47 non-working). Note that some of these may be hosted by the same organisation but share metadata for different types of data, e.g. example.com/csw/ and example.com/csw/inspire.

Lots of portals == Good?

If you haven't done the maths in your head, that's 516 working data portals. There's actually some overlap between these groups; some CKAN portals also have CSW backends, but the general point stands.

Great, loads of data portals; that's great, right? Well, not quite. What if you want to actually find data? That is the ostensible purpose of all of these things, isn't it? Well, now as an end user you've got 516 data portals to search through... And of course those are just the data portals that support those two APIs; there are many bespoke data portals without nice APIs that GeoSeer doesn't crawl (for example, Belgium, or the DKAN software).

Let's say you want some data for a location in Colorado, USA. Do you use the local data portals, such as Denver Opendata, or go to the state level (Colorado's portal, which is actually what's behind Denver's portal), or the national data.gov, or maybe domain-specific ones like NETL's, NOAA's, NASA's, the USGS's, etc.? And that's ignoring the fact that many of those organisations have multiple portals! You can see how this gets difficult fast.

And mostly unpopulated

The data portals themselves, at least the national ones, usually boast many datasets, but how many of the spatial web services that are out there are actually in these portals? GeoSeer has the largest index of these services that we know of (by a large margin), so we thought we'd compare them. Here's a table, then we'll break it down.


| | GeoSeer | CKAN | CKAN+ | CSW | CSW+ |
| --- | --- | --- | --- | --- | --- |
| Number of Hosts | 5,002 | 259 (5.18%) | 299 (5.98%) | 461 (9.22%) | 536 (10.72%) |
| Number of Services | 215,699 | 7,783 (3.61%) | 20,785 (9.64%) | 7,971 (3.7%) | 18,186 (8.43%) |
| Number of Datasets | 2,044,644 | 65,959 (3.23%) | 144,259 (7.06%) | 130,065 (6.36%) | 234,528 (11.47%) |

Only includes hosts/services/datasets that were live in December 2019.
Note: CKAN numbers *exclude* the USA's data.gov CKAN portal because its API is different from the other 190 out there.

So, what does this show?
The GeoSeer column shows how many of these things GeoSeer has in its index today. The CSW and CKAN columns show how many of these things each portal type has in it. So between 190 CKAN portals they only point to 65,959 spatial web service datasets across all of them, which is 3.23% of the number that are actually out there. And the 325 CSW services point to 130,065 such datasets between them.

There are also two column headers that end in a +. These columns represent the addition of some of GeoSeer's secret sauce. Basically we know and understand the spatial web standards and the software behind them, and using this knowledge we make intelligent guesses at what else might exist on a server. These two columns therefore represent how many things there actually are on the services that the CKAN/CSW portals point to, or at least a minimum number.

But what does all that mean?

Well, there are several things we can say with surety:

  • As a general rule, if a spatial web service is put online, only around half of the datasets/services that are online on that box are actually put onto data portals, at best. Because the + columns are minimums, the real share is almost certainly lower.
  • Even if you did search all 515 CSW and CKAN Open Data portals (excluding CKAN USA), you'd only be searching a tiny fraction of the spatial web services out there. On the order of less than 8%!
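The percentages behind those two points can be recomputed from the table above. This is just a small sketch of the arithmetic; the dictionary layout and function names are ours, the figures are the published ones.

```python
# Figures copied from the comparison table above (live in December 2019).
GEOSEER = {"hosts": 5002, "services": 215699, "datasets": 2044644}
PORTALS = {
    "CKAN": {"hosts": 259, "services": 7783, "datasets": 65959},
    "CKAN+": {"hosts": 299, "services": 20785, "datasets": 144259},
    "CSW": {"hosts": 461, "services": 7971, "datasets": 130065},
    "CSW+": {"hosts": 536, "services": 18186, "datasets": 234528},
}


def pct(part: int, whole: int) -> float:
    """Percentage rounded to 2 decimals."""
    return round(100 * part / whole, 2)


def coverage(column: str, metric: str) -> float:
    """How much of GeoSeer's index a column covers, e.g. coverage('CKAN', 'datasets')."""
    return pct(PORTALS[column][metric], GEOSEER[metric])


def listed_share(portal: str, metric: str) -> float:
    """Of what exists on the servers a portal points to (the + column), how much is listed."""
    return pct(PORTALS[portal][metric], PORTALS[portal + "+"][metric])


if __name__ == "__main__":
    for portal in ("CKAN", "CSW"):
        for metric in ("services", "datasets"):
            print(f"{portal} {metric}: {coverage(portal, metric)}% of the index, "
                  f"{listed_share(portal, metric)}% of what is on those servers")
```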

Wrapping up

The above only covers False Negatives - things that should be in the data portals but are not. There's also the issue of False Positives: things the portals say exist but don't; with any luck we'll get around to analysing them in the future.

What's the solution to this? We don't claim to have answers, but if your organisation is considering rolling its own data portal ask yourself - is it worth it? For the considerable costs you're going to incur what value are you going to add that putting your data in the national database won't add? And if you run a national portal, make it easier to search, host and maintain data for local communities.

In the meantime, you can freely use GeoSeer to search across all those portals and many more, easily and quickly. And you can integrate it into your own webgis/projects using the API or GeoSeer Licensed.


GeoSeer Licensed Products Released

Posted on 2019-11-27

We hinted at it in August and now it's here; today we're releasing a licensed version of the database that sits behind GeoSeer, creatively called: GeoSeer Licensed.

GeoSeer Licensed content allows organisations and businesses to host and integrate their own local copy of GeoSeer's industry leading database of spatial web services directly into their own applications, products, or services. We figure this can improve end-user workflows, make discovery of third-party data and services much easier, and help organisations realise some of the vast economic benefits Open Data presents - estimated at €75 billion across the EU in 2020 alone. Also, you can build cool things with it.

This nicely complements the GeoSeer API which was released back in April. Where the API allows organisations to easily build GeoSeer's search into web applications by making calls to our servers (like the GeoSeer WebGIS demo does), GeoSeer Licensed allows organisations to host the database locally, or release it in a product, meaning any sort of application can be built around it, not just search.

The Value Add

Let's be honest, there's nothing stopping you from building your own spider, finding 550+ data portals to use as seeds, scraping half a million web-pages a month, and building your own database of geospatial web services. So why use GeoSeer Licensed?

  • Industry Leading Database - At the time of writing we're not aware of any database of geospatial web services anywhere near as large as this (and we've looked).
  • Current - We run regular crawls to make sure everything in GeoSeer Licensed is current. We provide monthly updates so you always have the latest data.
  • Pre-Cleaned - despite these being standards, everyone likes to do things differently. We've pre-cleaned the fields and tried to standardise them to make them consistent. For example we've found no less than 55 ways to say "No license fees" across 12 languages and turned that into a simple: "No".
  • XML Free - it's 2019 and fewer people want to deal with the hassle of namespaces, esoteric data models, and the other complexities XML brings. We've extracted the data from the XML documents and put it into a database, with some JSON sprinkled in where necessary.
  • Spatial Extents - For GeoSeer Datasets we include the extent bounding boxes in WGS84, along with scale-appropriate textual representations of the locations, potentially down to county level (like you see in GeoSeer Search).
  • Quick start - Because we've done all the hard work, written up documentation about what each field means (so you don't have to read the standards), and packaged it in an SQLite database, it's super easy to get started with. Simply open your favourite database admin tool (it probably supports SQLite), and get querying with good old fashioned SQL.
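To give a flavour of the "good old fashioned SQL" part, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (`services`, `service_type`, `host`, `url`) are hypothetical stand-ins, not the real GeoSeer Licensed schema (see the product documentation for that), and the demo builds a tiny in-memory database so the example runs on its own.

```python
import sqlite3
from typing import List, Tuple


def services_by_type(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Count services per type. 'services' and 'service_type' are hypothetical names."""
    cursor = conn.execute(
        "SELECT service_type, COUNT(*) AS n "
        "FROM services "
        "GROUP BY service_type "
        "ORDER BY n DESC, service_type"
    )
    return cursor.fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory stand-in so the query above can be tried out."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE services (url TEXT, service_type TEXT, host TEXT)")
    conn.executemany(
        "INSERT INTO services VALUES (?, ?, ?)",
        [
            ("https://maps.example.org/wms", "WMS", "maps.example.org"),
            ("https://geo.example.com/wms", "WMS", "geo.example.com"),
            ("https://geo.example.com/wfs", "WFS", "geo.example.com"),
        ],
    )
    return conn


if __name__ == "__main__":
    print(services_by_type(demo_connection()))
```

With the real product you would point `sqlite3.connect()` at the database file you were supplied with and swap in the documented table and column names.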

The Products

GeoSeer Services is a database with all of the current geospatial web services that GeoSeer knows about in it, as well as information about their endpoints, hosts, and more. At the time of writing it has over 215,000 services in it from across 4,930 different hosts. This information is good for investigating who is hosting services, what sorts of services exist, INSPIRE deployment patterns/conformity, etc.

GeoSeer Datasets builds on GeoSeer Services, including not only all of the service information, but all of the dataset metadata as well. This includes: dataset extent bounding boxes, dataset keywords, declared projections, scale-appropriate textual location, metadata urls, and more. GeoSeer Datasets is well suited to building search engines (surprise!), GIS, web-GIS, academic research, and much more.

Both products are available with a number of different license types, from a research license through to commercial licenses. We can also provide subsets of the database if you don't want everything.

We like to think we've built a great search engine around this data, so now it's your turn - what can you build with it? Find out more about GeoSeer licensing.



Blog content licensed as: CC-BY-SA 4.0