The GeoSeer Blog

All pages with the tag: History

New Historical Statistics and Extent Plots

Posted on 2019-09-02

The GeoSeer stats page went live just shy of a year ago and we've been meaning to update it with more stats ever since. Today we've done just that, with a few new stats, and a lot of cool plots.

The first statistic is the simplest: the number of countries that are hosting OGC services. For our purposes, a country is simply defined as having a unique ccTLD (the last part of a domain: .pl, .us, .br, .au, etc.). At the time of writing this blog post, it's 87 of the 244 defined ccTLDs. (Note that this does include .eu for the European Union, which most people wouldn't actually consider a country.)
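
To make the counting rule concrete, here is a minimal sketch in Python. It is not our actual crawler code: the function names `cctld_of` and `count_countries` are made up for illustration, and it assumes that any two-letter ASCII top-level domain is a country code (internationalised ccTLDs are ignored).

```python
from urllib.parse import urlsplit


def cctld_of(url):
    """Return the two-letter ccTLD of a URL's host, or None if it has none.

    Assumption: every two-letter ASCII TLD is treated as a country code.
    URLs without a scheme have no parsed hostname and also give None.
    """
    host = urlsplit(url).hostname
    if not host:
        return None
    # Take the last dot-separated label of the host name.
    last_label = host.rstrip(".").rsplit(".", 1)[-1].lower()
    if len(last_label) == 2 and last_label.isascii() and last_label.isalpha():
        return last_label
    return None


def count_countries(urls):
    """Count the distinct ccTLDs seen across a collection of service URLs."""
    return len({cc for cc in map(cctld_of, urls) if cc})
```

Hosts under generic TLDs such as .com or .org, and bare IP addresses, are simply skipped by this sketch.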

Historical Data

GeoSeer has been live for almost 18 months now, and we've been crawling the WWW for OGC services for even longer. This means we have a trove of historical data about services, and the new stats expose some of that. If you look at the stats page now, you'll see the General Stats section has been tweaked slightly.

As well as continuing to show stats about the current state of OGC services "Now", we've added an extra column for "Ever" which shows the total numbers that we've ever found since we started doing this. Then with a little maths we show the percentage of the things we've ever found that are still alive now.

The Ephemeral Nature of Public Data
Datasets

The single most glaring statistic from this historical data is that we've found a total of 4,949,124 datasets since we started crawling, but only 1,865,660 are live and active in our index right now. Or put another way, just 37.7% of the datasets hosted by OGC services that were publicly available at some point in the past 18 months are still online!
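
The "little maths" behind that percentage is just the "Now" count divided by the "Ever" count. A tiny Python sketch, where `percent_still_live` is an illustrative name rather than anything from our codebase:

```python
def percent_still_live(now, ever):
    """Percentage of everything ever found that is still live now."""
    return 100.0 * now / ever


# Dataset counts quoted in this post.
print(f"{percent_still_live(1_865_660, 4_949_124):.1f}%")  # 37.7%
```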

Services

And while that's the most stand-out statistic, the others also show how transient the OGC services that host these datasets are. Over the course of the past ~18 months we've found 291,779 different services, yet only 71.83% of them were online and responding on our last crawl.

Hosts

The final statistic of note here is the number of hosts. These are the domain names themselves, and different subdomains are counted as different hosts (so www.example.com is different from ogc.example.com). Even these have experienced considerable churn over what is a relatively short period of time, with only 85.5% of hosts remaining online. We should point out that we ignore the scheme (that's the http:// or https://) and ignore the port when we consider if something is a "host", so if a host changes from insecure to secure (and quite a few do), it won't make a difference to this statistic.
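
As a rough illustration of that rule, the sketch below reduces a service URL to a "host" by dropping the scheme and the port. It uses Python's standard `urllib.parse`; the function name `host_key` is hypothetical and this is not necessarily how our pipeline does it.

```python
from urllib.parse import urlsplit


def host_key(url):
    """Return the host name of a URL, ignoring scheme, port and path.

    urlsplit().hostname is lower-cased and excludes the port, so
    http://ogc.example.com:8080/ and https://ogc.example.com/ compare equal,
    while different subdomains stay distinct.
    """
    return urlsplit(url).hostname


urls = [
    "http://ogc.example.com:8080/wms",
    "https://ogc.example.com/wfs",
    "https://www.example.com/wms",
]
hosts = {host_key(u) for u in urls}
print(sorted(hosts))  # ['ogc.example.com', 'www.example.com']
```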

Thoughts

All of this change makes it harder for users to rely on this data even if they can find it. This is especially true for things like scientific research, which relies on repeatability, including the ability of other scientists to go back and take a second look at the original data. That's a difficult thing to do when the datasets, services, or hosts have gone offline.

This also highlights the importance of keeping data portals current. Link rot is a real thing, and data curators need to maintain their portals; otherwise the portals are worse than useless (because they're wasting everyone's time with bad links).

Extent Plots

The other part of this statistics update is a collection of extent map plots that show what parts of the world have datasets. We're going to do a separate blog post about them in the future.


One Year Old and Better Than Ever

Posted on 2019-03-03

Today GeoSeer celebrates its first birthday, having originally gone live on 2018-03-03, so we want to look back at the year and take a glimpse into the future.

This year we...

It has been a busy year for us. We added CSW harvesting in May, a stats page for our fellow big-data nerds in September, and a new look, along with an API beta in January. This blog was itself created in April, and got its own RSS feed back in January.

Ever more data

We've also done a lot of general work to try and increase the index size, but not at the cost of spurious results. When we went live a year ago, our index had (using our current methodology) 836,917 distinct layers from 89,825 services. Today we boast 1,229,623 distinct layers (47% more) from 167,882 services (87% more).

During our latest crawl, our index size jumped to well over 3 million layers. "Jackpot!" we thought. But upon further investigation (because we're always suspicious of anything anomalous - you have to be if you want to develop something good) we discovered it was from a single host that claimed to have 2.1 million layers across about 5000 services. Deeper investigation showed that it appears to be the same 500 layers shared thousands of times. So we removed them all from the index and kept only one of each layer to ensure the best possible results for you, our users.
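
For the curious, "keep one of each layer" can be sketched as a simple de-duplication pass. This is only an illustration under stated assumptions, not our actual method: the record fields (`host`, `name`, `title`, `bbox`) and the idea of treating layers with identical values for them as "the same layer" are hypothetical.

```python
def dedupe_layers(layers):
    """Keep the first occurrence of each layer, dropping later repeats.

    Assumption: two layers are "the same" when they share host, name,
    title and bounding box. Real-world matching may need more fields.
    """
    seen = set()
    unique = []
    for layer in layers:
        key = (layer["host"], layer["name"], layer["title"], tuple(layer["bbox"]))
        if key not in seen:
            seen.add(key)
            unique.append(layer)
    return unique


# The same layer advertised by two services on one host, plus a distinct one.
layers = [
    {"host": "maps.example.org", "service": "/a/wms", "name": "roads",
     "title": "Roads", "bbox": [0, 0, 10, 10]},
    {"host": "maps.example.org", "service": "/b/wms", "name": "roads",
     "title": "Roads", "bbox": [0, 0, 10, 10]},
    {"host": "maps.example.org", "service": "/a/wms", "name": "rivers",
     "title": "Rivers", "bbox": [0, 0, 10, 10]},
]
print(len(dedupe_layers(layers)))  # 2
```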

What's next?

It's becoming apparent that we're hitting the point of diminishing returns. We don't think there are many more readily discoverable OGC services and layers out there. We currently scrape over 300 data portals, plus many other data sources, to try and find every service we can, but we can only find services that are publicly advertised somewhere, hence the "readily discoverable". But we're not giving up yet, and we have some ideas for several more scraping methodologies to further enhance the index.

And of course we'll also be releasing the API shortly. Watch this space!

Blog content licensed as: CC-BY-SA 4.0