Easy Web Spidering in Ruby with Anemone
Anemone is a free, multi-threaded Ruby web spider framework from Chris Kite, which is useful for collecting information about websites. With Anemone you can write tasks to generate some interesting statistics on a site just by giving it the URL.
Its only dependency is Nokogiri (an HTML and XML parser), so you just need to install the gem to get started. Anemone's simple syntax lets you, among other things, tell it which pages to include (based on regular expressions) and define callbacks to run against the pages it crawls.
This example taken from Anemone's homepage prints out the URL of every page on a site:
require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end
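As mentioned above, you can also restrict the crawl with regular expressions and attach callbacks to matching pages. The sketch below is based on the skip_links_like and on_pages_like methods described in Anemone's RDoc; the URL patterns are made up for illustration, so check the documentation for your version before relying on it:

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  # Don't follow links into hypothetical admin or logout URLs
  anemone.skip_links_like(/\/admin\//, /\/logout/)

  # Only run this callback on pages whose URL matches the pattern
  anemone.on_pages_like(/\/articles\/\d+/) do |page|
    # page.doc is the parsed Nokogiri document for the page
    title_node = page.doc.at('title') if page.doc
    puts "#{page.url} -- #{title_node ? title_node.text.strip : '(no title)'}"
  end
end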
The bin folder in the project contains some more in-depth examples, including tasks for counting the number of unique pages on a site, counting the pages at a certain depth in a site, or listing the URLs encountered. There's also a combined task that wraps up a few of these, intended to be run as a daily cron job.
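To get a feel for what those tasks do, here's a rough hand-rolled equivalent of the unique page count, using only the on_every_page callback from the example above (the real tasks in bin/ are more thorough, so treat this as an illustrative sketch):

require 'anemone'
require 'set'

urls = Set.new

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    # Collect each crawled URL; the Set de-duplicates for us
    urls << page.url.to_s
  end
end

puts "#{urls.size} unique pages found"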
You can install Anemone as a gem or grab the source from GitHub, of course, and there's fairly comprehensive RDoc documentation available in the source or online.
July 3, 2009 at 2:45 pm
Cool, but has anyone actually got it to work?
With the given example, I just got the homepage back; no crawl.
Searching further, I came across spidr, which seems to do the same job with much the same syntax; it just fails too.
So for the time spent, I'm disappointed…
July 3, 2009 at 5:35 pm
Hi Ben. I gave some of the examples a try, and they worked for me.
July 3, 2009 at 11:45 pm
Just for your information: there's also a dependency on facets it seems.
July 3, 2009 at 11:54 pm
Hmm, I get only one link when trying the example like this:
Anemone.crawl("http://www.rubyinside.com") { |a| a.on_every_page{|p| puts p.url} }
=> http://www.rubyinside.com
July 4, 2009 at 11:53 am
Soleone: try adding a slash after the URL, like: "http://www.rubyinside.com/"
July 4, 2009 at 12:03 pm
Try that last example with a trailing slash on the URL. Not sure why, but this seems to make a difference. :)
July 6, 2009 at 12:59 pm
Okay, the new version (0.0.6) doesn't have the trailing slash problem anymore, nice!
July 9, 2009 at 6:47 pm
I like it. It's almost too simple to be true.
July 10, 2009 at 7:36 pm
Nice.
If all you're looking for is to take a mirror of a site you can simply do:
wget -m http://www.rubyinside.com/
If you just want to spider all your links to make sure nothing is broken:
wget --spider http://www.rubyinside.com/
But if you want to do anything more useful, this looks like a pretty simple approach. Will have to give it a look.