scRUBYt – Hot, New Ruby Web-Scraping Toolkit Released
For the past few months Peter Szinek has been giving me lots of tasty tidbits about his forthcoming scRUBYt Web-scraping toolkit, and now it's finally been released to the public! Peter describes scRUBYt as "WWW::Mechanize and Hpricot on Steroids" and this description is pretty bang on.
As well as providing a simple DSL for performing Web actions (clicking links, submitting forms, etc.), one of scRUBYt's most impressive features is that you can provide it with 'example' data from which it will extrapolate a search pattern and then find any other similar data within the same page. This is demonstrated perfectly by Peter's basic example:
    ebay_data = Scrubyt::Extractor.define do
      fetch 'http://www.ebay.com/'
      fill_textfield 'satitle', 'ipod'
      submit
      click_link 'Apple iPod'

      record do
        item_name 'APPLE NEW IPOD MINI 6GB MP3 PLAYER SILVER'
        price '$71.99'
      end
      next_page 'Next >', :limit => 5
    end
This code goes to ebay.com, looks for iPods, and then extracts all records using a dummy one as an example. It then proceeds through up to 5 more pages of records, returning them all as an XML dataset.
If this all floats your boat, there's a lot to explore. Start off with the official site and Peter's comprehensive announcement. Peter also has a lengthy tutorial available which makes good reading.
February 7, 2007 at 12:34 pm
This is rather impressive. The end result could just as well be done with Curl, but this way, it is a lot clearer to understand in the source. On the downside, this script will stop working when pricing changes or the item does not show on the first page any more.
February 9, 2007 at 2:23 pm
Tom,
This is simply not true :-)
As Peter also pointed out, this is just a dummy example. The system learns from it how to extract similar records, and the learned rules are then extracted. Those rules do not depend on the original example, so they will keep working until the page *structure* changes; at that point you must provide fresh examples to learn new rules. Working on automating this, btw.