The GeoSeer Blog

All pages with the tag: GeoSeer

Data portals - not always the solution

Posted on 2020-01-10

There are lots of Open Data portals out there, many promising a wealth of data, be it spatial or otherwise.

We want to ask: how effective is this strategy of deploying lots of data portals? GeoSeer uses many of these portals as its seeds so we think we're in a good place to investigate them.

[Figure: Data.gov.uk front end]

How many are there?

The first problem is the sheer number of them. Let's start with CKAN. CKAN is "the world’s leading Open Source data portal platform", and it offers an API which GeoSeer can use to harvest geospatial web services from portals that use this software. It seems to be the software of choice for many government-based data portals, and as it stands right now, GeoSeer is aware of 191 working CKAN portals.
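As a rough illustration of what harvesting from a CKAN portal involves, here is a minimal sketch that picks out web service resources from a response shaped like the one CKAN's `package_search` action returns. The sample response is invented, and the set of format labels treated as services is an assumption for this sketch; real portals label formats inconsistently, and this is not GeoSeer's harvester.

```python
# Format labels treated as OGC web services (an assumption for this sketch).
SERVICE_FORMATS = {"WMS", "WFS", "WCS", "WMTS"}


def service_resources(package_search_response):
    """Return (dataset name, format, url) for resources that look like OGC services."""
    found = []
    for package in package_search_response["result"]["results"]:
        for resource in package.get("resources", []):
            fmt = (resource.get("format") or "").strip().upper()
            if fmt in SERVICE_FORMATS:
                found.append((package["name"], fmt, resource["url"]))
    return found


# Invented sample in the shape returned by /api/3/action/package_search.
sample = {
    "success": True,
    "result": {
        "count": 2,
        "results": [
            {"name": "flood-zones", "resources": [
                {"format": "wms", "url": "https://example.com/ows?service=WMS"},
                {"format": "CSV", "url": "https://example.com/flood.csv"},
            ]},
            {"name": "budget-2019", "resources": [
                {"format": "PDF", "url": "https://example.com/budget.pdf"},
            ]},
        ],
    },
}

print(service_resources(sample))
```

Only the WMS resource of the first dataset survives the filter; the CSV and PDF resources are ignored.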

The other big data "portal" isn't really a piece of software but a standard: CSW, the Catalogue Service for the Web. Basically it's a standard for serving metadata via XML, and most deployments we're aware of are either GeoNetwork or PyCSW. There's also ESRI's geoportal software, but almost no-one seems to use that, and then whatever the European Commission rolled themselves for the INSPIRE CSW, which is different again. GeoSeer has 325 working CSW services in its index (and 47 non-working). Note that some of these may be hosted by the same organisation but share metadata for different types of data, e.g. example.com/csw/ and example.com/csw/inspire.
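To give a feel for what talking to a CSW looks like, here is a small sketch that builds CSW 2.0.2 key-value-pair request URLs. The endpoint is the placeholder from the paragraph above, the helper name `csw_url` is made up for illustration, and individual servers may differ in which requests and parameters they accept over GET.

```python
from urllib.parse import urlencode


def csw_url(endpoint, request, **extra):
    """Build a CSW 2.0.2 key-value-pair request URL for the given endpoint."""
    params = {"service": "CSW", "version": "2.0.2", "request": request}
    params.update(extra)
    return endpoint + "?" + urlencode(params)


# Placeholder endpoint, not a real service.
endpoint = "https://example.com/csw"

# Ask the catalogue what it supports.
print(csw_url(endpoint, "GetCapabilities"))

# Ask for the first ten summary records.
print(csw_url(endpoint, "GetRecords", typeNames="csw:Record",
              resultType="results", elementSetName="summary",
              startPosition=1, maxRecords=10))
```

The responses are XML documents, which a harvester then has to parse and page through.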

Lots of portals == Good?

If you haven't done the maths in your head, that's 516 working data portals. There's actually some overlap between these groups; some CKAN portals also have CSW backends, but the general point stands.

Great, loads of data portals; that's great, right? Well, not quite. What if you want to actually find data? That is the ostensible purpose of all of these things, isn't it? Well, now as an end user you've got 516 data portals to search through... And of course those are just the data portals that support those two APIs; there are many bespoke data portals without nice APIs that GeoSeer doesn't crawl (for example, Belgium, or the DKAN software).

Let's say you want some data for a location in Colorado, USA. Do you use the local data portals, such as Denver Opendata, or go to the state level (Colorado's portal, which is actually what's behind Denver's portal), or the national data.gov, or maybe domain-specific ones like NETL's, NOAA's, NASA's, the USGS's, etc.? And that's ignoring the fact that many of those organisations have multiple portals! You can see how this gets difficult fast.

And mostly unpopulated

The data portals themselves, at least the national ones, usually boast many datasets, but how many of the spatial web services that are out there are actually in these portals? GeoSeer has the largest index of these services that we know of (by a large margin), so we thought we'd compare them. Here's a table, then we'll break it down.


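|                    | GeoSeer   | CKAN           | CKAN+           | CSW             | CSW+             |
|--------------------|-----------|----------------|-----------------|-----------------|------------------|
| Number of Hosts    | 5,002     | 259 (5.18%)    | 299 (5.98%)     | 461 (9.22%)     | 536 (10.72%)     |
| Number of Services | 215,699   | 7,783 (3.61%)  | 20,785 (9.64%)  | 7,971 (3.7%)    | 18,186 (8.43%)   |
| Number of Datasets | 2,044,644 | 65,959 (3.23%) | 144,259 (7.06%) | 130,065 (6.36%) | 234,528 (11.47%) |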

Only includes hosts/services/datasets that were live in December 2019.
Note: CKAN numbers *exclude* the USA's data.gov CKAN portal because its API is different from the other 190 out there.

So, what does this show?
The GeoSeer column shows how many of these things GeoSeer has in its index today. The CSW and CKAN columns show how many of these things each portal type has in it. So between 190 CKAN portals they only point to 65,959 spatial web service datasets across all of them, which is 3.23% of the number that are actually out there. And the 325 CSW services point to 130,065 such datasets between them.
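The percentages are simply each portal type's count divided by the GeoSeer count. A minimal sketch of that arithmetic for the dataset row, using the figures from the table above (the `share` helper is just for illustration):

```python
def share(count, total):
    """Return count as a percentage of total."""
    return 100 * count / total


# Dataset figures from the table above (live in December 2019).
geoseer_datasets = 2_044_644
portal_datasets = {"CKAN": 65_959, "CSW": 130_065}

for portal, count in portal_datasets.items():
    print(f"{portal}: {count:,} datasets = {share(count, geoseer_datasets):.2f}% of the GeoSeer index")
```

This prints 3.23% for CKAN and 6.36% for CSW, matching the table.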

There are also two column headers that end in a +. These columns represent the addition of some of GeoSeer's secret sauce. Basically we know and understand the spatial web standards and the software behind them, and using this knowledge we make intelligent guesses at what else might exist on a server. These two columns therefore represent how many things there actually are on the services that the CKAN/CSW portals point to, or at least a minimum number.

But what does all that mean?

Well, there are several things we can say with surety:

  • As a general rule, if a spatial web service is put online, at most only half of the datasets/services that are online on that box are actually put onto data portals. It may be (and almost certainly is) less than half.
  • Even if you did search all 515 CSW and CKAN Open Data portals (excluding CKAN USA), you'd only be searching a tiny fraction of the spatial web services out there. On the order of less than 8%!

Wrapping up

The above only covers False Negatives - things that should be in the data portals but are not. There's also the issue of False Positives: things the portals say exist but don't; with any luck we'll get around to analysing them in the future.

What's the solution to this? We don't claim to have answers, but if your organisation is considering rolling its own data portal ask yourself - is it worth it? For the considerable costs you're going to incur what value are you going to add that putting your data in the national database won't add? And if you run a national portal, make it easier to search, host and maintain data for local communities.

In the meantime, you can freely use GeoSeer to search across all those portals and many more, easily and quickly. And you can integrate it into your own webgis/projects using the API or GeoSeer Licensed.


GeoSeer Licensed Products Released

Posted on 2019-11-27

We hinted at it in August and now it's here; today we're releasing a licensed version of the database that sits behind GeoSeer, creatively called: GeoSeer Licensed.

GeoSeer Licensed content allows organisations and businesses to host and integrate their own local copy of GeoSeer's industry leading database of spatial web services directly into their own applications, products, or services. We figure this can improve end-user workflows, make discovery of third-party data and services much easier, and help organisations realise some of the vast economic benefits Open Data presents - estimated at €75 billion across the EU in 2020 alone. Also, you can build cool things with it.

This nicely complements the GeoSeer API which was released back in April. Where the API allows organisations to easily build GeoSeer's search into web applications by making calls to our servers (like the GeoSeer WebGIS demo does), GeoSeer Licensed allows organisations to host the database locally, or release it in a product, meaning any sort of application can be built around it, not just search.

The Value Add

Let's be honest, there's nothing stopping you from building your own spider, finding 550+ data portals to use as seeds, scraping half a million web pages a month, and building your own database of geospatial web services. So why use GeoSeer Licensed?

  • Industry Leading Database - At the time of writing we're not aware of any database of geospatial web services anywhere near as large as this (and we've looked).
  • Current - We run regular crawls to make sure everything in GeoSeer Licensed is current. We provide monthly updates so you always have the latest data.
  • Pre-Cleaned - Despite these being standards, everyone likes to do things differently. We've pre-cleaned the fields and tried to standardise them to make them consistent. For example we've found no fewer than 55 ways to say "No license fees" across 12 languages and turned that into a simple: "No".
  • XML Free - It's 2019 and fewer people want to deal with the hassle of namespaces, esoteric data models, and the other complexities XML brings. We've extracted the data from the XML documents and put it into a database, with some JSON sprinkled in where necessary.
  • Spatial Extents - For GeoSeer Datasets we include the extent bounding boxes in WGS84, along with scale-appropriate textual representations of the locations, potentially down to county level (like you see in GeoSeer Search).
  • Quick start - Because we've done all the hard work, written up documentation about what each field means (so you don't have to read the standards), and packaged it in an SQLite database, it's super easy to get started with. Simply open your favourite database admin tool (it probably supports SQLite), and get querying with good old fashioned SQL.
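To show what "get querying with good old fashioned SQL" can look like from code, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`services`, `service_type`, `fees`, and so on) and the rows are hypothetical stand-ins, not the documented GeoSeer Licensed schema, and an in-memory database is used in place of the shipped file.

```python
import sqlite3

# In-memory stand-in; the real product ships as an SQLite file.
# Table and column names here are hypothetical, not the documented schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE services (url TEXT, service_type TEXT, host TEXT, fees TEXT)")
conn.executemany(
    "INSERT INTO services VALUES (?, ?, ?, ?)",
    [
        ("https://a.example/wms", "WMS", "a.example", "No"),
        ("https://a.example/wfs", "WFS", "a.example", "No"),
        ("https://b.example/wms", "WMS", "b.example", "Yes"),
    ],
)

# Count fee-free services per service type.
rows = conn.execute(
    "SELECT service_type, COUNT(*) FROM services "
    "WHERE fees = 'No' GROUP BY service_type ORDER BY service_type"
).fetchall()
print(rows)
```

With a real file you would pass its path to `sqlite3.connect` and consult the product documentation for the actual table and field names.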

The Products

GeoSeer Services is a database with all of the current geospatial web services that GeoSeer knows about in it, as well as information about their endpoints, hosts, and more. At the time of writing it has over 215,000 services in it from across 4,930 different hosts. This information is good for investigating who is hosting services, what sorts of services exist, INSPIRE deployment patterns/conformity, etc.

GeoSeer Datasets builds on GeoSeer Services, including not only all of the service information, but all of the dataset metadata as well. This includes: dataset extent bounding boxes, dataset keywords, declared projections, scale-appropriate textual location, metadata urls, and more. GeoSeer Datasets is well suited to building search engines (surprise!), GIS, web-GIS, academic research, and much more.

Both products are available with a number of different license types, from a research license through to commercial licenses. We can also provide subsets of the database if you don't want everything.

We like to think we've built a great search engine around this data, so now it's your turn - what can you build with it? Find out more about GeoSeer licensing.


A Midyear Update

Posted on 2019-08-07

The GeoSeer index of OGC Services continues to grow, now standing tantalisingly close to 200,000 services: there are currently 197,911 from over four thousand four hundred different hosts. And of course this only includes active services; the index is kept in an "evergreen" state consisting only of services that actually worked when we last queried them. There are many more services that are intermittent but these aren't useful to you so don't feature in the index.

On adventures we go

As well as continuing to hone and expand the service, we've also been participating in some community events. In June we participated in the OGC's API Hackathon in London, part of the process for developing the next generation of OGC spatial standards. They're at an early phase - with API Features being the furthest along - and we participated with the aim of making sure that discoverability was kept in mind during their development. After all, there's no point developing cutting edge standards if no-one can find implementations.

Then we went to Italy to the European Commission's Joint Research Centre (JRC) - home of INSPIRE - to present and participate in a workshop about service discovery and search engines with regards to INSPIRE services. We met some of the people behind a few of the portals we harvest from, and exchanged thoughts on how services and data can be made more discoverable.

Statistics - SRS

We received a user enquiry as to which Spatial Reference System (SRS) was most common in OGC services, so we did a quick check and wanted to share the top results with everyone, because who doesn't like stats? Note that there are lots of caveats that we won't go into here; we're sharing these as-is. It doesn't come as a surprise that EPSG:4326 (aka WGS84) and Web Mercator are the most common.

| SRS Code    | Name                  | Number of datasets | Notes                                       |
|-------------|-----------------------|--------------------|---------------------------------------------|
| EPSG:4326   | WGS84                 | 892,331            | Standard Latitude-Longitude                 |
| CRS:84      | WGS84                 | 514,924            | Longitude-Latitude swapped version of WGS84 |
| EPSG:3857   | Web Mercator          | 394,736            | De facto web mapping projection             |
| EPSG:900913 | Web Mercator          | 259,519            | Deprecated code for Web Mercator            |
| EPSG:4258   | ETRS89                | 166,684            | Europe                                      |
| EPSG:25832  | ETRS89 / UTM zone 32N | 133,604            | Europe between 6°E and 12°E                 |
| EPSG:102100 | Web Mercator          | 102,823            | ArcGIS Online version of EPSG:3857          |
The datasets define 1,318 different SRSs; above are just the ones with more than 100,000 datasets. We're always open to doing some stats analysis, just ask.
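Three of the codes above are different names for Web Mercator, which matters if you want to compare or filter by projection. A minimal sketch of folding the aliases into one canonical code, based only on the Notes column above (the function name is made up for illustration). Bear in mind that a single dataset may declare more than one of these codes, so the merged figure counts declarations and not necessarily distinct datasets.

```python
# Alias codes for Web Mercator, per the Notes column above.
WEB_MERCATOR_ALIASES = {"EPSG:900913", "EPSG:102100"}


def canonical_srs(code):
    """Map legacy or vendor Web Mercator codes to EPSG:3857; leave others as-is."""
    code = code.strip().upper()
    return "EPSG:3857" if code in WEB_MERCATOR_ALIASES else code


# Declaration counts from the table above.
counts = {"EPSG:3857": 394_736, "EPSG:900913": 259_519, "EPSG:102100": 102_823}

merged = {}
for code, n in counts.items():
    key = canonical_srs(code)
    merged[key] = merged.get(key, 0) + n

print(merged)
```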

Licensing?

Finally, we've started investigating making the database available to third parties via licensing. If you're interested, let us know. Watch this space.


GeoSeer API Goes Live

Posted on 2019-04-09

We've hinted at it in previous blog posts, but now it's time for the big reveal: the GeoSeer API is live!

Designed to allow you to integrate the power of GeoSeer's search into your business's Web GIS or other application, the API allows your users to easily and seamlessly search for datasets without having to leave their normal tooling. There's an entire page with information about it here.

As well as including all the features you're used to in the web-search, the API also includes some cool new features:
  • Bounding Box Search - Search for datasets that are within, disjoint from, or intersecting a given bounding box, while also using a search term. Ideal for searching for layers that overlap the user's current viewing area.
  • Lat/Lon Search - Easily find datasets that intersect a specific point. Your user selects a location and now they can find data that intersect it. Simple.
  • Service Type filter - Only find datasets that are of the OGC service type(s) that you're interested in. Does your application only support WMS and WFS for instance? Then filter results to only search those service types.
  • Service Search - The GeoSeer web search only allows users to search datasets/layers, but the API also allows searching by service. Readily find services hosted by anyone from local government through to globe-spanning organisations like the World Food Programme, and everyone in between.
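For readers unfamiliar with the spatial terms in the list above, here is a minimal sketch of what "within", "disjoint", and "intersecting" mean for two bounding boxes, plus the point test behind a lat/lon search. It only illustrates the geometry; it is not the API's implementation or its parameter names, the coordinates are rough illustrative values, and it ignores boxes that cross the antimeridian.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


def intersects(a, b):
    """True if the two boxes share any area or edge."""
    return (a.min_lon <= b.max_lon and b.min_lon <= a.max_lon
            and a.min_lat <= b.max_lat and b.min_lat <= a.max_lat)


def within(a, b):
    """True if box a lies entirely inside box b."""
    return (b.min_lon <= a.min_lon and a.max_lon <= b.max_lon
            and b.min_lat <= a.min_lat and a.max_lat <= b.max_lat)


def disjoint(a, b):
    """True if the two boxes do not touch at all."""
    return not intersects(a, b)


def contains_point(box, lon, lat):
    """True if the point falls inside the box (edges included)."""
    return box.min_lon <= lon <= box.max_lon and box.min_lat <= lat <= box.max_lat


# Rough illustrative extents.
view = BBox(-105.1, 39.6, -104.6, 39.9)        # a map view around Denver
colorado = BBox(-109.06, 36.99, -102.04, 41.0)
uk = BBox(-8.6, 49.9, 1.8, 60.9)

print(within(view, colorado), intersects(view, colorado), disjoint(view, uk))
print(contains_point(colorado, -104.99, 39.74))
```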

We've created the snazzy GeoSeer API WebGIS that demonstrates the API in action, giving you a feeling for what you can do with it and how it could integrate with your own application(s).

The API has several plans to cover various needs, and the Enterprise plan allows for considerable customisation so you can get exactly what you need. So take a look and find out more about the API.


One Year Old and Better Than Ever

Posted on 2019-03-03

Today GeoSeer celebrates its first birthday, having originally gone live on 2018-03-03, so we want to look back at the year and take a glimpse into the future.

This year we...

It has been a busy year for us. We added CSW harvesting in May, a stats page for our fellow big-data nerds in September, and a new look, along with an API beta in January. This blog was itself created in April, and got its own RSS feed back in January.

Ever more data

We've also done a lot of general work to try and increase the index size, but not at the cost of spurious results. When we went live a year ago, our index had (using our current methodology) 836,917 distinct layers from 89,825 services. Today we boast 1,229,623 distinct layers (about 47% more) from 167,882 services (about 87% more).

During our latest crawl, our index size jumped to well over 3 million layers. "Jackpot" we thought! But upon further investigation (because we're always suspicious of anything anomalous - you have to be if you want to develop something good) we discovered it was from a single host that claimed to have 2.1 million layers across about 5000 services. Deeper investigation showed that it seems that it's the same 500 layers shared thousands of times. So we removed them all from the index and only keep one of each layer to ensure the best possible results for you, our users.

What's next?

At this point it's becoming apparent that we're hitting the point of diminishing returns. We don't think there are many more readily discoverable OGC services and layers out there. We currently scrape over 300 data portals, plus many other data sources to try and find every service we can, but we can only find services which are publicly advertised somewhere, hence the "readily discoverable". But we're not giving up yet, and we have some ideas for several more scraping methodologies to further enhance the index.

And of course we'll also be releasing the API shortly. Watch this space!


A New Look for a New Year

Posted on 2019-01-09

We thought we'd welcome in 2019 with a slight update to the look of the site to improve usability. In particular, the GeoSeer website should now be much better behaved on mobile devices. There's also now more consistency in page navigation to help you find where you're going, and we've tweaked the search results page to better expose meaningful information, including the service's url.

The changes are not just cosmetic; we've also improved the search functionality to try and provide better results for multi-term queries. So searches for things like tree preservation orders will now preferentially try and find results where the words are next to each other without your having to put quotes ("") around them. We've also done a fair amount of work on the location assignment service (the bit that decides what area of the planet a bounding box covers) so you should be getting better results there too.
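To illustrate the idea of preferring results where the query words sit next to each other, here is a toy scoring sketch. It is an assumption-laden illustration only: the function name and the 0/1/2 scale are made up, and it says nothing about how GeoSeer's ranking actually works.

```python
def proximity_score(query, text):
    """Score 0 if a term is missing, 1 if all terms appear, 2 if they appear adjacent and in order."""
    terms = query.lower().split()
    words = text.lower().split()
    if not terms or not all(term in words for term in terms):
        return 0
    n = len(terms)
    # Slide a window over the text looking for the exact term sequence.
    adjacent = any(words[i:i + n] == terms for i in range(len(words) - n + 1))
    return 2 if adjacent else 1


query = "tree preservation orders"
titles = [
    "Orders for preservation of a tree",
    "Street trees",
    "Tree Preservation Orders 2018",
]

# Highest score first, so the adjacent match ranks at the top.
ranked = sorted(titles, key=lambda title: proximity_score(query, title), reverse=True)
print(ranked)
```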

And as if all that wasn't enough, there are a couple of new features - this blog now has an RSS feed so you can better keep up with our posts.
But we've been keeping the best until last - we now have an API! It's still in beta for the next few weeks but if you're interested in using it, do let us know. There's an entire page with information about the API on it, and we'll be doing a blog post about it when we launch it.


One Million Layers, and a Stats Page

Posted on 2018-09-27

GeoSeer has now hit the one million distinct spatial layers milestone in its index. That's a staggering amount of spatial data, and all of it is freely accessible via OGC standards, and of course, also easily searchable with GeoSeer. This actually represents over 1.7 million publicly available WMS, WFS, WCS, and WMTS layers - see this previous blog post for a discussion on why this number is even higher. This represents data from over 100,000 OGC services.

We've been gradually increasing the number of layers in our index consistently since launch as a result of a combination of things: our ongoing efforts to expand where we collect data from, improvements to the GeoSeerBot (we feed it lots of veggies!), and ever more layers being added to services we already index.

How many more layers and services are there out there? We don't know; but we plan on doing a blog post about the number of services, so keep an eye out. And we're going to keep trying to find more.

What was that about a Stats Page?

That's right, because we're big data nerds (see what we did there?), we've also created a page that's got a high-level breakdown of statistics for what's in our index. You can find the new stats page here. We don't claim to have a complete index of all public OGC services, but we're fairly certain it's a large chunk of the ones that are out there, so this is a fairly representative sample of what's available on the internet.

The stats page will be updated about once a month and should always approximately represent what's in our index. In the future we plan on adding further and more detailed statistics including a breakdown of what middleware is used to run these services, so keep an eye out for it.

Need more stats? Ask away!

If there's any particular statistic you're interested in that's not on there, let us know and we'll consider adding it. Or if we don't think others will find it interesting, we'll tell you directly; we try to be nice like that. (How many people really want to know that the mean number of Layers per Endpoint is, at the time of writing, 12.99? Or that the median and mode are both 2, the minimum is 1, and the maximum is 4,629?) So ask away.


So, how many OGC layers are there?

Posted on 2018-07-18

Updated: 2018-09-27 with numbers for September 2018 which also reflects improvements to how we group things together.

One of the questions we come across quite often is the deceptively simple "How many layers are there?" At the time of writing our front page says "over 1 million distinct... layers", so that's the answer, right? Well, not quite, and why is that "distinct" in there anyway? There are actually quite a few potential answers, so let's go through them.
Note: All numbers in this post are correct at the time of writing but will certainly change within a few weeks as we continue to index more services.

That's a lot of layers

Let's start with the largest number: 1,773,337 layers. This is also the simplest number - it's the total number of layers that we find in all of the unique capabilities documents that we download ("capabilities documents" are what map servers use to tell the world what layers they have and what features they support). This is the easiest number to give, and the one most commonly given. It is "correct" in that there really are over 1.7 million layers out there across various service endpoints, but as you'll see from the other numbers, there are a few problems with using it.

Meaningless layers

We do a lot of work to try and weed out "meaningless" layers from our index. This isn't a reflection on the data inside the layers, but on the metadata in the capabilities document. For instance there's no point us indexing a layer that has a name of "1" and no other information; for all we know these layers may have great data behind them, but if there isn't even a meaningful name our users will never be able to find those layers, so we simply remove them to stop them cluttering up the results.

It's at this stage that we also remove layers that are pre-installed defaults, like the TOPP/Tasmania data that comes with GeoServer.

In total all this filtering gets rid of over 47,000 layers, leaving us with around 1,720,000 layers.

Many endpoints and the same layer

It turns out a lot of those layers are duplicates; there are many services out there which have lots of different endpoints (the URL you use to access it) that all serve the same layer(s). In fact, there is one single layer that is served by over 2,400 endpoints on the same host-domain (we group services by host-domains as part of the de-duplication process). That's an exception, but there are over 580 layers that are duplicated over a hundred times on the same host-domain, and in total we identify over 717,000 duplicate layers. We don't get rid of them entirely - you may have noticed in the results that we list multiple capabilities URLs for some layers - but we don't count them as separate layers. Once we get rid of all of those, we're down to 1,055,836 layers. It's also quite surprising to see that roughly 40% of the layers out there are duplicates.

Different service types

The final component is - what happens if a layer is served up from the same server as both WMS and WFS? Or WMS/WCS/WMTS, etc.? For our purposes we try and group them together and treat them as a single layer, but as you've likely noticed in the results, we do flag that a layer is available as multiple service types. There are surprisingly few of these: only 10,752 layers are used across service types. This is where we get our final, front-page number of 1,045,059 layers.

So which is it?

As you've probably gathered by now, there isn't a "right" number. We choose to use the lowest number because it's most honest for our purposes; when you search GeoSeer you're searching 1,045,059 distinct spatial layers. It's of no help to you if you get the same layer 127 times in the results because that's how many endpoints host it. Yet across all servers and endpoints, GeoSeer is searching what represents 1,773,337 separate publicly accessible layers.
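The counting steps described in this post can be sketched in a few lines. This is a simplified illustration under stated assumptions, not GeoSeer's actual pipeline: the "meaningless" rule (a one-character name and no title), the use of the URL hostname as a stand-in for "host-domain", the record fields, and the sample data are all made up for the example.

```python
from urllib.parse import urlsplit


def count_layers(layers):
    """Return (total, meaningful, per-service-type, distinct) layer counts."""
    # Step 1: drop layers whose metadata gives users nothing to search on.
    meaningful = [layer for layer in layers
                  if len(layer["name"].strip()) > 1 or layer.get("title")]
    # Step 2: collapse the same layer served from many endpoints on one host.
    per_service_type = {
        (urlsplit(layer["endpoint"]).hostname, layer["name"], layer["service"])
        for layer in meaningful
    }
    # Step 3: treat one layer offered as WMS, WFS, etc. as a single layer.
    distinct = {(host, name) for host, name, _ in per_service_type}
    return len(layers), len(meaningful), len(per_service_type), len(distinct)


# Invented sample: one real layer on three endpoints, plus one junk layer.
layers = [
    {"name": "rivers", "title": "Rivers", "service": "WMS",
     "endpoint": "https://maps.example.org/a/wms"},
    {"name": "rivers", "title": "Rivers", "service": "WMS",
     "endpoint": "https://maps.example.org/b/wms"},
    {"name": "rivers", "title": "Rivers", "service": "WFS",
     "endpoint": "https://maps.example.org/a/wfs"},
    {"name": "1", "title": "", "service": "WMS",
     "endpoint": "https://maps.example.org/a/wms"},
]

print(count_layers(layers))
```

Four raw layers shrink to three after the junk filter, two after merging endpoints, and one distinct layer once service types are grouped.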


GeoSeer Update: CSWs, Search Scoring, and Guatemala

Posted on 2018-05-16

Another month and another update. This month's update comprises two main components - scraping CSW services, and improved results scoring. Plus as a bonus, many more layers for Guatemala!

CSW services

The most notable thing we've done this update is include over 60 CSW services in our crawl. This didn't add as many services as we hoped, in large part because we already had most of them.
We learnt the hard way that despite being a standard, CSW services are highly temperamental and software specific. Both GeoNetwork and PyCSW (the two most-deployed as far as we can see) have numerous bugs and idiosyncrasies that make getting their data very painful, even though both are CSW 2.0.2 "compliant".

Guatemala

We've also manually added about 9 new services for Guatemala, taking the number of layers that are searchable for that country from 95 to 800! A big thanks to Raul Calderon for bringing those services to light.

As a result of this update, and re-crawling all of our already-known services, the number of searchable layers has increased by about 10% to over 790,000 distinct layers. This is despite further improving the quality of the "remove junk layers" filter and removing over 10,000 more poorly-documented layers.

Improved Search

Finally, and possibly most importantly, we've done some work to improve the quality of the results. We now rate the quality of the metadata for each individual layer and use that as part of the search result scoring. You should hopefully see better quality results for any given search now.

Feedback is always welcome and if you have any thoughts or suggestions on the search quality, or services you think we should be indexing, please do contact us.


GeoSeer's First Big Update: Over 250,000 New Layers

Posted on 2018-04-27

You may have noticed the number of layers that GeoSeer now has in its index has jumped dramatically. Previously we had about 450,000 layers; now we have around 715,000 layers. That's over a quarter of a million more layers! And that's after we've improved the junk filter to get rid of a lot of the spurious test layers (it's unlikely anyone actually wants to see the GeoServer test layers, for instance), and layers with no names/titles.

These extra layers are a result of a whole bunch of work to improve the GeoSeerBot (the thing that goes crawling around the internet trying to find data). We now search many more data sources, and we're also now scraping numerous HTML pages. We haven't yet started scraping CSW services; that's our next goal.

We've also done some work to resolve a few behind-the-scenes niggles. For example previously we kind of didn't have the country of Chile in our spatial data (ooops!), and so no layers were being assigned to Chile.

Blog content licensed as: CC-BY-SA 4.0