The GeoSeer Blog

All pages with the tag: CKAN

Data portals - not always the solution

Posted on 2020-01-10

There are lots of Open Data portals out there, many promising a wealth of data, be it spatial or otherwise.

We want to ask: how effective is this strategy of deploying lots of data portals? GeoSeer uses many of these portals as its seeds so we think we're in a good place to investigate them.

data.gov.uk data portal
Data.gov.uk front end

How many are there?

The first problem is the sheer number of them. Lets start with CKAN; CKAN is "the world’s leading Open Source data portal platform" and it offers an API which GeoSeer can use to harvest geospatial web services from portals that use this software. It seems to be the software of choice for many government based data portals and as it stands right now, GeoSeer is aware of 191 working CKAN portals.

The other big data "portal" isn't really a piece of software but a standard: CSW - Catalog Services for the Web. Basically it's a standard for serving metadata via XML, and most deployments we're aware of are either GeoNetwork or PyCSW; there's also ESRI's geoportal software but almost no-one seems to use that, and then whatever the European Commission rolled themselves for the INSPIRE CSW which is different again. GeoSeer has 325 working CSW services in its index (and 47 non-working). Note that some of these may be hosted by the same organisation but be sharing metadata for different types of data. I.e. example.com/csw/ and example.com/csw/inspire.

Lots of portals == Good?

If you haven't done the maths in your head, that's 516 working data portals. There's actually some overlap between these groups; some CKAN portals also have CSW backends, but the general point stands.

Great, loads of data portals, that is great right? Well, not quite, you see, what if you want to actually find data? That is the ostensible purpose of all of these things isn't it? Well now as an end user you've got 516 data portals to search through... And of course those are just the data portals that support those two API's, there are many bespoke data portals that don't have nice API's that GeoSeer doesn't crawl (for example, Belgium, or the DKAN software).

Lets say you want some data for a location in Colorado, USA. Do you use the local data portals, such as Denver Opendata, or go to the state level (Colorado's portal (which is actually what's behind Denver's portal)), or the national data.gov or maybe domain specific ones like NETL's, NOAA's, NASA's, the USGS's, etc.? And that's ignoring the fact many of those organisations have multiple portals! You can see how this gets difficult fast.

And mostly unpopulated

The dataportals themselves, at least the national ones, usually boast many datasets, but how many of the spatial web services that are out there are actually in these portals? GeoSeer has the largest index of these services that we know of (by a large margin), so we thought we'd compare them. Here's a table, then we'll break it down.


GeoSeerCKANCKAN+CSWCSW+
Number of Hosts5,002259
(5.18%)
299
(5.98%)
461
(9.22%)
536
(10.72%)
Number of Services215,6997,783
(3.61%)
20,785
(9.64%)
7,971
(3.7%)
18,186
(8.43%)
Number of Datasets2,044,64465,959
(3.23%)
144,259
(7.06%)
130,065
(6.36%)
234,528
(11.47%)

Only includes hosts/services/datasets that were live in December 2019.
Note: CKAN numbers *exclude* the USA' data.gov CKAN portal because their API is different from the other 190 out there.

So, what does this show?
The GeoSeer column shows how many of these things GeoSeer has in its index today. The CSW and CKAN columns show how many of these things each portal type has in it. So between 190 CKAN portals they only point to 65,959 spatial web service datasets across all of them, which is 3.23% of the number that are actually out there. And the 325 CSW services point to 130,065 such datasets between them.

There are also two column headers that end in a +. These columns represent the addition of some of GeoSeer's secret sauce. Basically we know and understand the spatial web standards and the software behind them, and using this knowledge we make intelligent guesses at what else might exist on a server. These two columns therefore represent how many things there actually are on the services that the CKAN/CSW portals point to, or at least a minimum number.

But what does all that mean?

Well, there are several things we can say with surety:

  • As a general rule, if a spatial web service is put online, at most only half of datasets/services that are online on that box are actually put onto data portals. It may be (and almost certainly is) less than half.
  • That even if you did search 515 CSW and CKAN Open Data portals (excludes CKAN USA), you'd only be searching a tiny small fraction of the spatial web services out there. On the order of less than 8%!

Wrapping up

The above only covers False Negatives - things that should be in the data portals but are not. There's also the issue of False Positives: things the portals say exist but don't; with any luck we'll get around to analysing them in the future.

What's the solution to this? We don't claim to have answers, but if your organisation is considering rolling its own data portal ask yourself - is it worth it? For the considerable costs you're going to incur what value are you going to add that putting your data in the national database won't add? And if you run a national portal, make it easier to search, host and maintain data for local communities.

In the mean time, you freely can use GeoSeer to search across all those portals and many more easily and quickly. And you can integrate it into your own webgis/projects using the API or GeoSeer Licensed.