Ruby legend why the lucky stiff has developed a new HTML parser called Hpricot. It’s easy to install and use, and it parses HTML in a liberal fashion. It does, however, require a compiler to install (as it’s written in C), so it should be okay on Linux and Mac OS X, though not necessarily on Windows (yet).
Here’s some demo code:
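require 'hpricot'
# Hpricot.parse takes the HTML itself, not a filename, so read the file first
doc = Hpricot.parse(File.read("index.html"))
# Find every link (<a>) inside a paragraph (<p>) and print its attributes
(doc/:p/:a).each do |link|
  p link.attributes
end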
Hpricot is a good alternative if you’re finding RubyfulSoup too slow (though RubyfulSoup is certainly worth a try!).