Notes

More Cache Gems (to check out)

https://github.com/gurgeous/httpdisk
https://github.com/DannyBen/webcache - hassle-free caching for HTTP download
https://github.com/DannyBen/lightly - a file cache for performing heavy tasks, lightly
https://github.com/dannguyen/active_scraper
https://github.com/vcr/vcr

More Web Crawler Gems (to check out)

https://github.com/gurgeous/sinew - a ruby DSL for structured web crawling, with a robust caching system

Web Crawler / Spider / Scraper Names

Gopher ? => Webgo e.g. Webgo.get - Why? Why not?

Well-known crawlers (and user agent strings):

Googlebot by Google
Bingbot by Microsoft
Slurp by Yahoo!
??
More: http://www.robotstxt.org/db.html

Web Crawler Config / Settings

User-agent: *
Crawl-Delay: 20
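
A rough sketch of how a crawler might read that setting, assuming the robots.txt text is already in a string. The `crawl_delay` helper and the "webgo" agent name are made up for illustration and come from no gem listed above. Note that Crawl-delay is a non-standard robots.txt extension, so not every crawler honors it.

```ruby
# Minimal sketch: find the Crawl-delay that applies to a user agent.
# Not a full robots.txt parser; Allow/Disallow rules are ignored.
def crawl_delay(robots_txt, user_agent = "*")
  delays   = {}     # user agent (downcased) => delay in seconds
  agents   = []     # user agents of the group currently being read
  in_rules = false  # true once the current group has seen a rule line

  robots_txt.each_line do |line|
    line = line.sub(/#.*/, "").strip   # drop comments and whitespace
    next if line.empty?

    field, value = line.split(":", 2).map(&:strip)
    next if value.nil?                 # skip lines without a colon

    case field.downcase                # field names are case-insensitive
    when "user-agent"
      if in_rules                      # User-agent after rules starts a new group
        agents   = []
        in_rules = false
      end
      agents << value.downcase
    when "crawl-delay"
      in_rules = true
      # keep the first valid number seen per agent; invalid values stay nil
      agents.each { |a| delays[a] ||= Float(value, exception: false) }
    else
      in_rules = true
    end
  end

  # exact agent match first, then fall back to the wildcard group
  delays[user_agent.downcase] || delays["*"]
end

robots = "User-agent: *\nCrawl-Delay: 20\n"
delay  = crawl_delay(robots, "webgo")  # "webgo" is a hypothetical agent name
# a polite crawler would then call sleep(delay) between requests, if delay is set
```

Matching is by exact agent name only; a real crawler would probably also want substring matching against its product token.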

Resources