Robert Dempsey has written a code-packed article for Amazon Web Services’ “Developer Connection” site called “Using Amazon S3, EC2, SQS, Lucene and Ruby for Web Spidering.” It’s a bit of an epic and covers using several Amazon Web Services together (namely the S3 storage system, the EC2 “Elastic Compute Cloud”, and the Simple Queue Service), with Ruby acting as the glue that holds them all together. This could be of great interest to anyone who wants to put together large-scale crawlers using on-demand hardware and services.
As an aside, I’m collecting noteworthy Ruby-related Amazon / S3 / EC2 articles and links for a future “list post,” so if you have any recommendations, leave a comment.