The GeoSeer Blog

All pages with the tag: Update

A Midyear Update

Posted on 2019-08-07

The GeoSeer index of OGC Services continues to grow, now standing tantalisingly close to 200,000 services: there are currently 197,911 from over four thousand four hundred different hosts. And of course this only includes active services; the index is kept in an "evergreen" state consisting only of services that actually worked when we last queried them. There are many more services that are intermittent but these aren't useful to you so don't feature in the index.

On adventures we go

As well as continuing to hone and expand the service, we've also been participating in some community events. In June we participated in the OGC's API Hackathon in London, part of the process for developing the next generation of OGC spatial standards. They're at an early phase - with API Features being the furthest along - and we participated with the aim of making sure that discoverability was kept in mind during their development. After all, there's no point developing cutting edge standards if no-one can find implementations.

Then we went to Italy to the European Commission's Joint Research Centre (JRC) - home of INSPIRE - to present and participate in a workshop about service discovery and search engines with regards to INSPIRE services. We met some of the people behind a few of the portals we harvest from, and exchanged thoughts on how services and data can be made more discoverable.

Statistics - SRS

We received a user enquiry as to which Spatial Reference System (SRS) was most common in OGC services, so we did a quick check and wanted to share the top results with everyone because who doesn't like stats. Note that there are lots of caveats that we won't go into here, we're sharing these as-is. It doesn't come as a surprise that EPSG:4326 aka WGS84, and Web Mercator are the most common.

SRS CodeNameNumber of datasetsNotes
EPSG:4326WGS84892,331Standard Latitude-Longitude
CRS:84WGS84514,924Longitude-Latitude swapped version of WGS84
EPSG:3857Web Mercator394,736De facto web mapping projection
EPSG:900913Web Mercator259,519Deprecated code for Web Mercator
EPSG:4258ETRS89166,684Europe
EPSG:25832ETRS89 / UTM zone 32N133,604Europe between 6°E and 12°E
EPSG:102100Web Mercator102,823ArcGIS Online version of EPSG:3857

The datasets define 1,318 different SRS'; above are just the ones with more than 100,000 datasets. We're always open to doing some stats analysis, just ask.

Licensing?

Finally, we've started investigating making the database available to third parties via licensing. If you're interested, let us know. Watch this space.


One Year Old and Better Than Ever

Posted on 2019-03-03

Today GeoSeer celebrates its first birthday, having originally gone live on 2018-03-03, so we want to look back at the year and take a glimpse into the future.

This year we...

It has been a busy year for us. We added CSW harvesting in May, a stats page for our fellow big-data nerds in September, and a new look, along with an API beta in January. This blog was itself created in April, and got its own RSS feed back in January.

Ever more data

We've also done a lot of general work to try and increase the index size, but not at the cost of spurious results. When we went live a year ago, our index had (using our current methodology) 836,917 distinct layers from 89,825 services. Today we boast 1,229,623 distinct layers (46% more) from 167,882 services (89% more).

During our latest crawl, our index size jumped to well over 3 million layers. "Jackpot" we thought! But upon further investigation (because we're always suspicious of anything anomalous - you have to be if you want to develop something good) we discovered it was from a single host that claimed to have 2.1 million layers across about 5000 services. Deeper investigation showed that it seems that it's the same 500 layers shared thousands of times. So we removed them all from the index and only keep one of each layer to ensure the best possible results for you, our users.

What's next?

At this point it's becoming apparent that we're hitting the point of diminishing returns. We don't think there are many more readily discoverable OGC services and layers out there. We currently scrape over 300 data portals, plus many other data sources to try and find every service we can, but we can only find services which are publicly advertised somewhere, hence the "readily discoverable". But we're not giving up yet, and we have some ideas for several more scraping methodologies to further enhance the index.

And of course we'll also be releasing the API shortly. Watch this space!


A New Look for a New Year

Posted on 2019-01-09

We thought we'd welcome in 2019 with a slight update to the look of the site to improve usability. In particular, the GeoSeer website should now be much better behaved on mobile devices. There's also now more consistency in page navigation to help you find where you're going, and we've tweaked the search results page to better expose meaningful information, including the service's url.

The changes are not just cosmetic, we've also improved the search functionality to try and provide better results for multi-term queries. So searches for things like tree preservation orders will now preferentially try and find results where the words are next to each other without your having to put quotes ("") around it. We've also done a fair amount of work to the location assignment service (the bit that decides what area of the planet a bounding box covers) so you should be getting better results there too.

And as if all that wasn't enough, there are a couple of new features - this blog now has an RSS feed so you can better keep up with our posts.
But we've been keeping the best until last - we now have an API! It's still in beta for the next few weeks but if you're interested in using it, do let us know. There's an entire page with information about the API on it, and we'll be doing a blog post about it when we launch it.


One Million Layers, and a Stats Page

Posted on 2018-09-27

GeoSeer has now hit the one million distinct spatial layers milestone in its index. That's a staggering amount of spatial data, and all of it is freely accessible via OGC standards, and of course, also easily searchable with GeoSeer. This actually represents over 1.7 million publicly available WMS, WFS, WCS, and WMTS layers - see this previous blog post for a discussion on why this number is even higher. This represents data from over 100,000 OGC services.

We've been gradually increasing the number of layers in our index consistently since launch as a result of a combination of things: our ongoing efforts to expand where we collect data from, improvements to the GeoSeerBot (we feed it lots of veggies!), and ever more layers being added to services we already index.

How many more layers and services are there out there? We don't know; but we plan on doing a blog post about the number of services, so keep an eye out. And we're going to keep trying to find more.

What was that about a Stats Page?

That's right, because we're big data nerds (see what we did there?), we've also created a page that's got a high-level breakdown of statistics for what's in our index. You can find the new stats page here. We don't claim to have a complete index of all public OGC services, but we're fairly certain it's a large chunk of the ones that are out there, so this is a fairly representative sample of what's available on the internet.

The stats page will be updated about once a month and should always approximately represent what's in our index. In the future we plan on adding further and more detailed statistics including a breakdown of what middleware is used to run these services, so keep an eye out for it.

Need more stats? Ask away!

If there's any particular statistic you're interested in that's not on there, let us know and we'll consider adding it. Or if we don't think others will find it interesting (how many people really want to know that the average (mean) number of Layers per Endpoint is (at the time of writing) 12.99? Or that the median and mode are both 2, the minimum is 1, and the maximum is 4,629), we'll tell you directly, we try to be nice like that. So ask away.


GeoSeer Update: CSWs, Search Scoring, and Guatemala

Posted on 2018-05-16

Another month and another update. This month's update comprises two main components - scraping CSW services, and improved results scoring. Plus as a bonus, many more layers for Guatemala!

CSW services

The most notable thing we've done this update is include over 60 CSW services into our crawl. This didn't add as many services as we hoped, in large part because we already have most of them.
We learnt the hard way that despite being a standard, CSW services are highly temperamental and software specific. Both GeoNetwork and PyCSW (the two most-deployed as far as we can see) have numerous bugs and idiosyncrasies that make getting their data very painful, even though both are CSW 2.0.2 "compliant".

Guatemala

We've also manually added about 9 new services for Guatemala, taking the number of layers that are searchable for that country from 95 to 800! A big thanks to Raul Calderon for bringing those services to light.

As a result of this update, and re-crawling all of our already-known services, the number of searchable layers has increased by about 10% to over 790,000 distinct layers. This is despite further improving the quality of the "remove junk layers" filter and removing over 10,000 more poorly-documented layers.

Improved Search

Finally, and possibly most importantly, we've done some work to improve the quality of the results. We now rate the quality of the metadata for each individual layer and use that as part of the search result scoring. You should hopefully see better quality results for any given search now.

Feedback is always welcome and if you have any thoughts or suggestions on the search quality, or services you think we should be indexing, please do contact us.


GeoSeer's First Big Update: Over 250,000 New Layers

Posted on 2018-04-27

You may have noticed the number of layers that GeoSeer now has in its index has jumped dramatically. Previously we had about 450,000 layers, now we have around 715,000 layers, that's over quarter of a million more layers! And that's after we've improved the junk filter to get rid of a lot of the spurious test layers (it's unlikely anyone actually wants to see the GeoServer test layers for instance), and layers with no names/titles.

These extra layers are a result of a whole bunch of work to improve the GeoSeerBot (the thing that goes crawling around the internet trying to find data). We now search many more data sources, and we're also now scraping numerous HTML pages. We haven't yet started scraping CSW services, that's our next goal.

We've also done some work to resolve a few behind-the-scenes niggles. For example previously we kind of didn't have the country of Chile in our spatial data (ooops!), and so no layers were being assigned to Chile.

Blog content licensed as: CC-BY-SA 4.0