Today GeoSeer celebrates its first birthday, having originally gone live on 2018-03-03, so we want to look back at the year and take a glimpse into the future.
This year we...
It has been a busy year for us. We added CSW harvesting in May, a stats page for our fellow big-data nerds in September, and both a new look and an API beta in January. This blog was itself created in April, and got its own RSS feed in January.
Ever more data
We've also done a lot of general work to grow the index, but never at the cost of spurious results. When we went live a year ago, our index had (using our current methodology) 836,917 distinct layers from 89,825 services. Today we boast 1,229,623 distinct layers (47% more) from 167,882 services (87% more).
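For anyone who wants to check the growth figures, here is a small sketch of the arithmetic. It uses only the four counts quoted above; the function name `percent_increase` is ours, made up for illustration.

```python
def percent_increase(old: int, new: int) -> float:
    """Return the growth from `old` to `new` as a percentage of `old`."""
    return (new - old) / old * 100


# Counts quoted in this post (launch day vs. today).
layers_then, layers_now = 836_917, 1_229_623
services_then, services_now = 89_825, 167_882

layer_growth = percent_increase(layers_then, layers_now)
service_growth = percent_increase(services_then, services_now)

print(f"layers:   {layer_growth:.1f}% more")
print(f"services: {service_growth:.1f}% more")
```

Rounded to whole numbers, that works out to roughly 47% more layers and 87% more services.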
During our latest crawl, our index size jumped to well over 3 million layers. "Jackpot" we thought! But we're always suspicious of anything anomalous (you have to be if you want to develop something good), so we investigated further and discovered that the jump came from a single host claiming to have 2.1 million layers across about 5000 services. Digging deeper, it appears to be the same 500 layers shared thousands of times. So we removed the duplicates from the index and kept only one copy of each layer, to ensure the best possible results for you, our users.
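To illustrate the kind of clean-up described above, here is a minimal sketch of deduplicating layers that are advertised by many services. It is not our production code: the record fields (`service_url`, `name`, `title`, `abstract`, `bbox`) and the choice of which fields make two layers "the same" are assumptions made purely for the example.

```python
import hashlib
from typing import Iterable


def layer_fingerprint(layer: dict) -> str:
    """Hash the descriptive fields of a layer, ignoring which service serves it."""
    # Hypothetical choice of fields; a real index may compare different ones.
    parts = (layer["name"], layer["title"], layer["abstract"], repr(layer["bbox"]))
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def deduplicate(layers: Iterable[dict]) -> list[dict]:
    """Keep the first occurrence of each distinct layer, preserving order."""
    seen: set[str] = set()
    unique: list[dict] = []
    for layer in layers:
        fingerprint = layer_fingerprint(layer)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(layer)
    return unique


# Toy data: 4 services on one host, each advertising the same 3 layers.
records = [
    {
        "service_url": f"https://example.org/ows/{service}",
        "name": f"layer_{n}",
        "title": f"Layer {n}",
        "abstract": "",
        "bbox": (-180.0, -90.0, 180.0, 90.0),
    }
    for service in range(4)
    for n in range(3)
]

unique = deduplicate(records)
print(len(records), "records ->", len(unique), "distinct layers")
```

Keeping the first occurrence is the simplest policy; which copy of a duplicated layer is worth keeping is a judgment call that this sketch does not try to make.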
It's becoming apparent that we're hitting the point of diminishing returns. We don't think there are many more readily discoverable OGC services and layers out there. We currently scrape over 300 data portals, plus many other data sources, to try and find every service we can, but we can only find services that are publicly advertised somewhere, hence the "readily discoverable". But we're not giving up yet, and we have ideas for several more scraping methodologies to further enhance the index.
And of course we'll also be releasing the API shortly. Watch this space!