All pages with the tag: Data

One Million Layers, and a Stats Page

Posted on 2018-09-27

GeoSeer has now hit the milestone of one million distinct spatial layers in its index. That's a staggering amount of spatial data, all of it freely accessible via OGC standards and, of course, easily searchable with GeoSeer. The figure actually represents over 1.7 million publicly available WMS, WFS, WCS, and WMTS layers (see this previous blog post for a discussion of why this number is even higher), drawn from over 100,000 OGC services.

We've been steadily increasing the number of layers in our index since launch through a combination of things: our ongoing efforts to expand where we collect data from, improvements to the GeoSeerBot (we feed it lots of veggies!), and ever more layers being added to services we already index.

How many more layers and services are there out there? We don't know; but we plan on doing a blog post about the number of services, so keep an eye out. And we're going to keep trying to find more.

What was that about a Stats Page?

That's right, because we're big data nerds (see what we did there?), we've also created a page that's got a high-level breakdown of statistics for what's in our index. You can find the new stats page here. We don't claim to have a complete index of all public OGC services, but we're fairly certain it's a large chunk of the ones that are out there, so this is a fairly representative sample of what's available on the internet.

The stats page will be updated about once a month and should always approximately represent what's in our index. In the future we plan on adding further and more detailed statistics including a breakdown of what middleware is used to run these services, so keep an eye out for it.

Need more stats? Ask away!

If there's a particular statistic you're interested in that's not on there, let us know and we'll consider adding it. If we don't think others will find it interesting, we'll tell you directly instead; we try to be nice like that. (How many people really want to know that, at the time of writing, the mean number of Layers per Endpoint is 12.99? Or that the median and mode are both 2, the minimum is 1, and the maximum is 4,629?) So ask away.
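For anyone wondering how a mean of 12.99 can sit alongside a median and mode of 2, here is a small sketch using Python's standard `statistics` module. The `layers_per_endpoint` list is invented purely for illustration and is not GeoSeer data; it only shows how a handful of very large endpoints pull the mean far above the median.

```python
from statistics import mean, median, mode

# Invented sample: most endpoints serve a couple of layers, one serves many.
layers_per_endpoint = [1, 2, 2, 2, 3, 5, 40]

summary = {
    "mean": mean(layers_per_endpoint),      # pulled upwards by the outlier
    "median": median(layers_per_endpoint),  # the middle value
    "mode": mode(layers_per_endpoint),      # the most common value
    "min": min(layers_per_endpoint),
    "max": max(layers_per_endpoint),
}

for label, value in summary.items():
    print(f"{label}: {value}")
```

In this toy sample the median and mode are both 2 while the mean is close to 8, which is the same shape of distribution as the real figures quoted above.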


So, how many OGC layers are there?

Posted on 2018-07-18

Updated: 2018-09-27 with numbers for September 2018 which also reflects improvements to how we group things together.

One of the questions we come across quite often is the deceptively simple "How many layers are there?" At the time of writing our front page says "over 1 million distinct... layers", so that's the answer, right? Well, not quite, and why is that "distinct" in there anyway? There are actually quite a few potential answers, so let's go through them.
Note: All numbers in this post are correct at the time of writing but will certainly change within a few weeks as we continue to index more services.

That's a lot of layers
Let's start with the largest number: 1,773,337 layers. This is also the simplest number - it's the total number of layers that we find in all of the unique capabilities documents that we download ("capabilities documents" are what map servers use to tell the world what layers they have and what features they support). This is the easiest number to give, and the one most commonly given. It is "correct" in that there really are over 1.7 million layers out there across various service endpoints, but as you'll see from the other numbers, there are a few problems with using it.
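To make "layers in a capabilities document" concrete, here is a minimal sketch that lists the named layers in a WMS 1.3.0 capabilities document using Python's standard library. The sample XML and the function name `named_layers` are made up for illustration, and counting only layers that carry a `Name` element is an assumption of this sketch, not a description of how the GeoSeerBot counts.

```python
import xml.etree.ElementTree as ET

# XML namespace used by WMS 1.3.0 capabilities documents.
WMS_NS = {"wms": "http://www.opengis.net/wms"}

# Invented, heavily trimmed capabilities document for illustration only.
SAMPLE_CAPABILITIES = """
<WMS_Capabilities xmlns="http://www.opengis.net/wms" version="1.3.0">
  <Capability>
    <Layer>
      <Title>Root group</Title>
      <Layer><Name>roads</Name><Title>Road network</Title></Layer>
      <Layer><Name>rivers</Name><Title>Rivers</Title></Layer>
    </Layer>
  </Capability>
</WMS_Capabilities>
""".strip()


def named_layers(capabilities_xml: str) -> list[str]:
    """Return the Name of every Layer element that has one."""
    root = ET.fromstring(capabilities_xml)
    names = []
    for layer in root.iter("{http://www.opengis.net/wms}Layer"):
        # Only a direct child Name counts; group layers often have none.
        name = layer.find("wms:Name", WMS_NS)
        if name is not None and name.text:
            names.append(name.text.strip())
    return names


print(named_layers(SAMPLE_CAPABILITIES))
```

Summing a count like this over every unique capabilities document is the kind of tally that produces the headline total.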
Meaningless layers
We do a lot of work to try and weed out "meaningless" layers from our index. This isn't a reflection on the data inside the layers, but on the metadata in the capabilities document. For instance, there's no point in us indexing a layer that has a name of "1" and no other information; for all we know these layers may have great data behind them, but if there isn't even a meaningful name our users will never be able to find those layers, so we simply remove them to stop them cluttering up the results.
It's at this stage that we also remove layers that are pre-installed defaults, like the TOPP/Tasmania data that comes with GeoServer.
In total all this filtering gets rid of over 47,000 layers, leaving us with around 1,720,000 layers.
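As a rough illustration of this kind of filtering, the sketch below drops layers whose only metadata is a bare number, along with a short list of GeoServer sample layers. The rules, the `DEFAULT_LAYER_NAMES` set, and the function `is_meaningful` are hypothetical stand-ins; the real junk filter is certainly more involved.

```python
# Hypothetical deny-list: a few of the sample layers shipped with GeoServer.
DEFAULT_LAYER_NAMES = {
    "topp:states",
    "topp:tasmania_roads",
    "topp:tasmania_cities",
}


def is_meaningful(layer: dict) -> bool:
    """Keep a layer only if it has some descriptive text and is not a default."""
    name = (layer.get("name") or "").strip()
    title = (layer.get("title") or "").strip()
    abstract = (layer.get("abstract") or "").strip()

    if name.lower() in DEFAULT_LAYER_NAMES:
        return False

    # Text that is empty or purely numeric (e.g. "1") gives searchers nothing.
    descriptive = [t for t in (name, title, abstract) if t and not t.isdigit()]
    return bool(descriptive)


layers = [
    {"name": "1"},
    {"name": "topp:tasmania_roads", "title": "Tasmania roads"},
    {"name": "lu_2018", "title": "Land use 2018"},
]
kept = [layer["name"] for layer in layers if is_meaningful(layer)]
print(kept)
```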
Many endpoints and the same layer
It turns out a lot of those layers are duplicates; there are many services out there which have lots of different endpoints (the URL you use to access it) that all serve the same layer(s). In fact, there is one single layer that is served by over 2,400 endpoints on the same host-domain (we group services by host-domains as part of the de-duplication process). That's an exception, but there are over 580 layers that are duplicated over a hundred times on the same host-domain, and in total we identify over 717,000 duplicate layers. We don't get rid of them entirely - you may have noticed in the results that we list multiple capabilities URLs for some layers - but we don't count them as separate layers. Once we get rid of all of those, we're down to 1,055,836 layers. It's also quite surprising to see that about half of the layers out there are duplicates.
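The grouping idea can be sketched as follows. This is only an illustration under stated assumptions: it uses the URL's hostname as a stand-in for the "host-domain", and treats two layers as the same when their name and title match. The function `group_duplicates` and the example URLs are invented.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_duplicates(records):
    """Group (endpoint_url, layer_name, layer_title) records per host.

    Returns a dict mapping (host, name, title) to the list of endpoint URLs
    serving that layer, so duplicates are kept as extra URLs, not extra layers.
    """
    groups = defaultdict(list)
    for url, name, title in records:
        host = urlsplit(url).hostname or ""
        groups[(host, name, title)].append(url)
    return dict(groups)


records = [
    ("https://maps.example.org/a/wms", "roads", "Road network"),
    ("https://maps.example.org/b/wms", "roads", "Road network"),
    ("https://other.example.net/wms", "roads", "Road network"),
]
groups = group_duplicates(records)
print(len(records), "raw layers ->", len(groups), "distinct layers")
```

Keeping the list of URLs per group mirrors the behaviour described above, where one search result can list several capabilities URLs.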
Different service types
The final component is - what happens if a layer is served up from the same server as both WMS and WFS? Or WMS/WCS/WMTS, etc? For our purposes we try and group them together and treat them as a single layer, but as you've likely noticed in the results, we do flag that a layer is available as multiple service types. There are surprisingly few of these: only 10,752 layers are used across service types. This is where we get our final, front-page number of 1,045,059 layers.
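A minimal sketch of that last grouping step is shown below. The matching key (host plus layer name) and the function `merge_service_types` are assumptions made for illustration; the actual matching rules are not described in this post.

```python
def merge_service_types(records):
    """Collapse (host, layer_name, service_type) records into one entry per layer.

    Returns a dict mapping (host, layer_name) to the set of service types
    through which that layer is available.
    """
    merged = {}
    for host, name, service in records:
        merged.setdefault((host, name), set()).add(service)
    return merged


records = [
    ("maps.example.org", "roads", "WMS"),
    ("maps.example.org", "roads", "WFS"),
    ("maps.example.org", "roads", "WCS"),
    ("maps.example.org", "rivers", "WMS"),
]
merged = merge_service_types(records)
multi_type = [key for key, services in merged.items() if len(services) > 1]
print(len(records), "records ->", len(merged), "layers;", len(multi_type), "multi-type")
```

Note that a layer offered through three service types collapses three records into one, so the drop in the count can be larger than the number of multi-type layers.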
So which is it?

As you've probably gathered by now, there isn't a "right" number. We choose to use the lowest number because it's most honest for our purposes; when you search GeoSeer you're searching 1,045,059 distinct spatial layers. It's of no help to you if you get the same layer 127 times in the results because that's how many endpoints host it. Yet across all servers and endpoints, GeoSeer is searching what represents 1,773,337 separate publicly accessible layers.


GeoSeer Update: CSWs, Search Scoring, and Guatemala

Posted on 2018-05-16

Another month and another update. This month's update comprises two main components - scraping CSW services, and improved results scoring. Plus as a bonus, many more layers for Guatemala!

CSW services

The most notable thing we've done in this update is add over 60 CSW services to our crawl. This didn't add as many services as we had hoped, in large part because we already had most of them.
We learnt the hard way that despite being a standard, CSW services are highly temperamental and software specific. Both GeoNetwork and PyCSW (the two most-deployed as far as we can see) have numerous bugs and idiosyncrasies that make getting their data very painful, even though both are CSW 2.0.2 "compliant".
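For readers who haven't met CSW, here is a small sketch of what paging through a catalogue with a CSW 2.0.2 GetRecords request in key-value (GET) form can look like. It only builds the URLs and makes no network calls. The endpoint `https://catalogue.example.org/csw` and the helper `get_records_url` are hypothetical, and, as noted above, real servers differ in which parameters they accept.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def get_records_url(endpoint: str, start: int = 1, page_size: int = 50) -> str:
    """Build a CSW 2.0.2 GetRecords URL for one page of results."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "elementSetName": "full",
        "resultType": "results",
        "startPosition": start,   # CSW record positions start at 1
        "maxRecords": page_size,
    }
    return f"{endpoint}?{urlencode(params)}"


endpoint = "https://catalogue.example.org/csw"  # hypothetical catalogue
first_page = get_records_url(endpoint)
second_page = get_records_url(endpoint, start=51)

# Show the parameters that change between the two pages.
for url in (first_page, second_page):
    query = parse_qs(urlsplit(url).query)
    print(query["startPosition"], query["maxRecords"])
```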

Guatemala

We've also manually added about 9 new services for Guatemala, taking the number of layers that are searchable for that country from 95 to 800! A big thanks to Raul Calderon for bringing those services to light.

As a result of this update, and re-crawling all of our already-known services, the number of searchable layers has increased by about 10% to over 790,000 distinct layers. This is despite further improving the quality of the "remove junk layers" filter and removing over 10,000 more poorly-documented layers.

Improved Search

Finally, and possibly most importantly, we've done some work to improve the quality of the results. We now rate the quality of the metadata for each individual layer and use that as part of the search result scoring. You should hopefully see better quality results for any given search now.
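To give a feel for the idea, the sketch below blends a text-relevance score with a simple metadata-quality score. Everything in it is an assumption for illustration: the fields checked, the 0.3 weight, and the function names are invented and are not GeoSeer's actual scoring.

```python
# Hypothetical metadata fields that count towards quality.
FIELDS = ("title", "abstract", "keywords")


def metadata_quality(layer: dict) -> float:
    """Fraction of the chosen metadata fields that are filled in (0.0 to 1.0)."""
    filled = sum(1 for field in FIELDS if layer.get(field))
    return filled / len(FIELDS)


def combined_score(text_relevance: float, layer: dict, weight: float = 0.3) -> float:
    """Weighted blend of text relevance and metadata quality."""
    return (1 - weight) * text_relevance + weight * metadata_quality(layer)


well_documented = {
    "title": "Road network",
    "abstract": "National road centrelines.",
    "keywords": ["roads", "transport"],
}
sparse = {"title": "Road network"}

# Same text relevance, different metadata quality.
rich_score = combined_score(0.8, well_documented)
sparse_score = combined_score(0.8, sparse)
print(rich_score > sparse_score)
```

With equal text relevance, the better-documented layer ranks higher, which is the behaviour described above.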

Feedback is always welcome and if you have any thoughts or suggestions on the search quality, or services you think we should be indexing, please do contact us.


GeoSeer's First Big Update: Over 250,000 New Layers

Posted on 2018-04-27

You may have noticed the number of layers that GeoSeer now has in its index has jumped dramatically. Previously we had about 450,000 layers; now we have around 715,000. That's over a quarter of a million more layers! And that's after we've improved the junk filter to get rid of a lot of the spurious test layers (it's unlikely anyone actually wants to see the GeoServer test layers, for instance) and layers with no names/titles.

These extra layers are the result of a whole bunch of work to improve the GeoSeerBot (the thing that goes crawling around the internet trying to find data). We now search many more data sources, and we're also now scraping numerous HTML pages. We haven't yet started scraping CSW services; that's our next goal.

We've also done some work to resolve a few behind-the-scenes niggles. For example, previously we kind of didn't have the country of Chile in our spatial data (ooops!), and so no layers were being assigned to Chile.