Using Amazon’s Web Services for Spidering the Web

By Peter Cooper / February 13, 2008

Robert Dempsey has written a code-packed article for Amazon Web Services' "Developer Connection" site called Using Amazon S3, EC2, SQS, Lucene and Ruby for Web Spidering. It's a bit of an epic and covers using a multitude of Amazon Web Services together (namely the S3 storage system, the EC2 "Elastic Compute Cloud", and the Simple Queue Service), with Ruby acting as the glue that holds them all together. This could be of great interest to anyone who wants to put together large-scale crawlers using on-demand hardware and services.

As an aside, I'm collecting Ruby-related Amazon / S3 / EC2 articles and links for a future "list post," so if you have any recommendations, leave a comment. Thanks!

Comments

John says:
February 13, 2008 at 1:35 pm
There's also the "rufus-sqs" gem that leverages Amazon's SQS REST interface:

gem install -y rufus-sqs
http://rufus.rubyforge.org/rufus-sqs/

Jason says:
February 14, 2008 at 3:05 am
My coworker wrote a new plugin for managing Rails on EC2 / S3. It works pretty well.

http://rubyforge.org/projects/rubber/

Markus says:
February 14, 2008 at 7:29 am
Some links you might want to include in your list post:
http://rubyworks-ec2.rubyforge.org/
http://ec2onrails.rubyforge.org/

Thorsten says:
February 14, 2008 at 7:51 am
Not an article, but we recently released version 1.5 of the RightAws gem, which provides Ruby interfaces for EC2, S3, SQS, and now also SimpleDB. It offers persistent connections, support for >2GB objects, error retries, XML parsing with libxml, and more goodies. See http://rubyforge.org/projects/rightaws/

We also have a good number of EC2-related articles on our blog: http://info.rightscale.com/blog