Ruby gets a stylish HTML scraper – scrAPI
The indefatigable Assaf Arkin has done it again by developing a new Ruby HTML scraping toolkit, scrAPI. Peter Szinek recently wrote a popular article about scraping from Ruby using Manic Miner, RubyfulSoup, REXML, and WWW::Mechanize, but none of these is as immediately useful as scrAPI. Why?
scrAPI lets you scrape from HTML using CSS selectors. For example, here's Assaf's example, which defines scraper objects for pulling auctions out of an eBay page:
    ebay_auction = Scraper.define do
      process "h3.ens>a", :description=>:text, :url=>"@href"
      process "td.ebcPr>span", :price=>:text
      process "div.ebPicture >a>img", :image=>"@src"
      result :description, :url, :price, :image
    end

    ebay = Scraper.define do
      array :auctions
      process "table.ebItemlist tr.single", :auctions => ebay_auction
      result :auctions
    end
Now that the objects are set up and ready to scrape, you can put them into action like so:
    auctions = ebay.scrape(html)
    # No. of auctions found
    puts auctions.size
    # First auction:
    auction = auctions[0]
    puts auction.description
    puts auction.url
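If you'd like to see the whole path from source HTML to scraped result without installing anything, here's a rough sketch of the same "declare the fields, get back an object" pattern. It is not scrAPI: it uses REXML and XPath from Ruby's standard library in place of CSS selectors, and the HTML fragment, the `Auction` struct, the `RULES` table, and `scrape_auctions` are all made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical, well-formed stand-in for a saved search results page.
HTML = <<~'PAGE'
  <table class="items">
    <tr class="single">
      <td><h3><a href="http://example.com/1">iPod nano 2GB</a></h3></td>
      <td class="price"><span>$99.00</span></td>
    </tr>
    <tr class="single">
      <td><h3><a href="http://example.com/2">iPod nano 4GB</a></h3></td>
      <td class="price"><span>$149.00</span></td>
    </tr>
  </table>
PAGE

# One result object per auction row.
Auction = Struct.new(:description, :url, :price)

# Field name => XPath, relative to a single row (assumed markup, see above).
RULES = {
  description: 'td/h3/a/text()',
  url:         'td/h3/a/@href',
  price:       "td[@class='price']/span/text()"
}

def scrape_auctions(html)
  doc = REXML::Document.new(html)
  REXML::XPath.match(doc, "//tr[@class='single']").map do |row|
    # Apply every rule to the row; text nodes and attributes both respond to #value.
    values = RULES.values.map do |path|
      node = REXML::XPath.first(row, path)
      node && node.value
    end
    Auction.new(*values)
  end
end

auctions = scrape_auctions(HTML)
puts auctions.size            # 2
puts auctions[0].description  # iPod nano 2GB
puts auctions[0].url          # http://example.com/1
```

The point of the sketch is the shape rather than the library: the rules live in one place, and the caller gets back plain objects instead of walking the document by hand.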
Simple example with serious power. Go get scrAPI and play.
July 12, 2006 at 4:17 pm
I'm not up to speed on page scraping. How does this compare to _why's Hpricot?
July 12, 2006 at 6:22 pm
Hpricot lets you pull certain elements from a page programmatically, whereas this kinda bundles that sort of functionality into a reusable pattern. So rather than 'get this, then get this', this is 'get each of these things and return them to me in a solid lump'.
August 3, 2006 at 5:19 pm
That looks really interesting. Do you think you could post an example with the original HTML as well, so that we can see the whole path from original document, to scrAPI code, to the final output?
It looks like it might be a much more elegant solution for those of us who are looking to build databases of information from other sites and need an easier way to do that.
thanks!
August 5, 2006 at 12:36 am
Michael,
The original HTML for this example is an eBay page with search results. For the demo I did, I just searched for "iPod nano", saved the page and ran this code on the saved page.