Easy Web Spidering in Ruby with Anemone
Anemone is a free, multi-threaded Ruby web spider framework from Chris Kite, which is useful for collecting information about websites. With Anemone you can write tasks to generate some interesting statistics on a site just by giving it the URL.
Its only dependency is Nokogiri (an HTML and XML parser), so you just need to install the gem to get started. Anemone's simple syntax lets you, among other things, tell it which pages to include (based on regular expressions) and define callbacks to run against the pages it crawls.
This example taken from Anemone's homepage prints out the URL of every page on a site:
require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end
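As mentioned above, you can also restrict the crawl with regular expressions and attach callbacks to matching pages. The sketch below is based on the skip_links_like and on_pages_like methods described in Anemone's RDoc; the URL patterns are made up for illustration, so check the documentation for your version before relying on it:

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  # Don't follow links into hypothetical admin or logout URLs
  anemone.skip_links_like(/\/admin\//, /\/logout/)

  # Only run this callback on pages whose URL matches the pattern
  anemone.on_pages_like(/\/articles\/\d+/) do |page|
    # page.doc is the parsed Nokogiri document for the page
    title_node = page.doc.at('title') if page.doc
    puts "#{page.url} -- #{title_node ? title_node.text.strip : '(no title)'}"
  end
end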
The bin folder in the project contains some more in-depth examples, including tasks for counting the number of unique pages on a site, counting the pages at a certain depth in a site, or listing the URLs encountered. There's also a combined task that wraps up a few of these, intended to be run as a daily cron job.
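To get a feel for what those tasks do, here's a rough hand-rolled equivalent of the unique page count, using only the on_every_page callback from the example above (the real tasks in bin/ are more thorough, so treat this as an illustrative sketch):

require 'anemone'
require 'set'

urls = Set.new

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    # Collect each crawled URL; the Set de-duplicates for us
    urls << page.url.to_s
  end
end

puts "#{urls.size} unique pages found"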
You can install Anemone as a gem or grab the source from GitHub, of course, and there's fairly comprehensive RDoc documentation available in the source or online.
July 3, 2009 at 2:45 pm
Cool, but has anyone actually got it to work?
With the given example, I just got the homepage back; no crawl.
Searching further, I came across spidr, which seems to do the same job with much the same syntax; it just fails too.
So for the time spent, I'm disappointed…
July 3, 2009 at 5:35 pm
Hi Ben. I gave some of the examples a try, and they worked for me.
July 3, 2009 at 11:45 pm
Just for your information: there's also a dependency on facets it seems.
July 3, 2009 at 11:54 pm
Hmm, I get only one link when trying the example like this:
Anemone.crawl("http://www.rubyinside.com") { |a| a.on_every_page{|p| puts p.url} }
=> http://www.rubyinside.com
July 4, 2009 at 11:53 am
Soleone: try adding a slash after the URL, like: "http://www.rubyinside.com/"
July 4, 2009 at 12:03 pm
Try that last example with a trailing slash on the URL. Not sure why, but this seems to make a difference. :)
July 6, 2009 at 12:59 pm
Okay, the new version (0.0.6) doesn't have the trailing slash problem anymore, nice!
July 9, 2009 at 6:47 pm
I like it. It's almost too simple to be true.
July 10, 2009 at 7:36 pm
Nice.
If all you're looking for is to take a mirror of a site you can simply do:
wget -m http://www.rubyinside.com/
If you just want to spider all your links to make sure nothing is broken:
wget --spider http://www.rubyinside.com/
But if you want to do anything more useful, this looks like a pretty simple approach. Will have to give it a look.