Nokogiri: A Faster, Better HTML and XML Parser for Ruby (than Hpricot)
Yesterday, Aaron Patterson (@tenderlove) and Mike Dalessio released Nokogiri (Github repository), a new HTML and XML parser for Ruby. It "parses and searches XML/HTML faster than Hpricot" (Hpricot being the current de facto Ruby HTML parser) and boasts XPath support, CSS3 selector support (a big deal, because CSS3 selectors are mega powerful) and the ability to be used as a "drop in" replacement for Hpricot.
On an Hpricot vs Nokogiri benchmark, Nokogiri clocked in at 7 times faster at initially loading an XML document, 5 times faster at searching for content based on an XPath, and 1.62 times faster at searching for content via a CSS-based search. These are impressive results, since Hpricot was previously considered to be quite speedy itself. (Update - November 3, 2008: WHY FIGHTS BACK! HPRICOT IN PERFORMANCE BUSTING SHOCKER!!)
The code examples provided on the introduction post give you the basic idea, and the library can be installed using gem install nokogiri (though this didn't work for me on OS X - further instructions below).
Installing on OS X
Note! Developer Aaron Patterson responded to the issues below in an update to the library. Now doing a regular gem install of Nokogiri should work fine. The information below remains in place solely for historical / reference purposes.
Upon trying sudo gem install nokogiri, I encountered multiple problems on OS X. Perhaps it'll work first time for you, but if not, here are some pointers. (Bear in mind, I run the default Ruby that comes with OS X - no special configurations. If you're running Ruby from DarwinPorts, etc, the following might not work at all.)
Trying to install the gem failed after "checking for racc... no". I assumed it was trying to download and install racc by the following line, but it's not. You need to download and install racc yourself. The latest tarball for that is at http://i.loveruby.net/archive/racc/racc-1.4.5-all.tar.gz - download this, and open a Terminal. Continue along these lines:
tar xzvf racc-1.4.5-all.tar.gz
cd racc-1.4.5-all
sudo ruby setup.rb config
sudo ruby setup.rb setup
sudo ruby setup.rb install
Trying to install the gem at this point still won't work, as for some reason the racc executable has ended up in /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin rather than /usr/bin proper. My solution to that was to add that directory to my path in ~/.bash_profile - but you might prefer to symbolically link it. Your choice. If you have no ~/.bash_profile and you're following these instructions blindly, just put this in ~/.bash_profile:
PATH=$PATH:/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin
export PATH
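Once the directory is on your PATH (open a new Terminal so that ~/.bash_profile is re-read), one way to confirm that racc can actually be found is a small Ruby check along these lines. This is only a sketch: the helper name find_executable_on_path is hypothetical, and it relies on nothing beyond Ruby's standard library.

```ruby
# Return the full path of the first executable called `name` found in the
# directories listed in `path`, or nil if there is none.
# `find_executable_on_path` is a hypothetical helper, not part of any gem.
def find_executable_on_path(name, path = ENV['PATH'].to_s)
  path.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

location = find_executable_on_path('racc')
puts(location ? "racc found at #{location}" : "racc is not on your PATH yet")
```

If it reports that racc is missing, the PATH change probably hasn't been picked up by the current shell.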
Next, something called "frex" is also missing. This is more easily installed with gem:
sudo gem install aaronp-frex -s http://gems.github.com
Once this is done, then nokogiri should finally install with gem:
sudo gem install nokogiri
Run up irb and give require 'nokogiri' a try to make sure.
Please leave any corrections, suggestions, or cries for help in the comments. Thanks!
October 31, 2008 at 6:07 pm
Indeed it won't install racc for you, the error message (rather cryptically) is telling you to install it manually.
checking for racc... no
need racc, get the tarball from http://i.loveruby.net/archive/racc/racc-1.4.5-all.tar.gz
October 31, 2008 at 6:24 pm
Yeah, I saw that. I was assuming the following compilation error was it trying to install Racc. Usually if a gem has dependencies that fail, that's the end of it.. but in this case it carries on and tries to compile itself without all the dependencies being satisfied - which is rather odd.
October 31, 2008 at 6:24 pm
Kudos to AP and Mike for finally releasing this. Merb is going to be making use of it for speedier test helpers and more compliant CSS3.
Between nokogiri and webrat, Merb tests in 1.0 are going to be a world better.
October 31, 2008 at 7:39 pm
Looks good, but Hpricot runs on JRuby!
October 31, 2008 at 7:42 pm
If the gem doesn't install with a vanilla OS X, please let me know. It is a bug, and I will fix it.
I'm not down with making the installation so complex. Not to mention, I consider the code on github to be unstable.
October 31, 2008 at 7:46 pm
It installs, once the dependencies are resolved.
I believe my OS X and Ruby install to be reasonably vanilla. I do have a stackload of gems installed, but I'm running the regular OS X supplied Ruby and RubyGems otherwise. I'll give it a whirl on my newish MacBook Pro that I don't really use for Ruby dev..
October 31, 2008 at 7:51 pm
On the MBP now - getting a different error on here.
..
Building native extensions. This could take a while...
ERROR: Error installing nokogiri:
ERROR: Failed to build gem native extension.
rake RUBYARCHDIR=/Library/Ruby/Gems/1.8/gems/nokogiri-1.0.1/lib RUBYLIBDIR=/Library/Ruby/Gems/1.8/gems/nokogiri-1.0.1/lib
(in /Library/Ruby/Gems/1.8/gems/nokogiri-1.0.1)
rake aborted!
undefined method `add_development_dependency' for #
/Library/Ruby/Gems/1.8/gems/nokogiri-1.0.1/rakefile:19:in `new'
..
I think this error is probably because the default version of RubyGems on OS X is still 1.0.1, whereas on my other machine I'm running 1.2.0.
I've just run the update for RubyGems, and it's now at 1.3.1. gem install nokogiri now gives me the same error as it did on the other machine:
..
checking for racc... no
need racc, get the tarball from http://i.loveruby.net/archive/racc/racc-1.4.5-all.tar.gz
*** extconf.rb failed ***
..
So - yeah - it's just dependencies. Once Racc and Frex are installed, it should be fine.
The only way it could be more seamless is if Racc was gemified and included as a gem dependency.. and if Frex was also a gem dependency, so that gem would install them both automatically.
October 31, 2008 at 7:55 pm
Thanks Peter. Actually, neither of those should be dependencies. They are build time dependencies and not runtime dependencies. I've found the problem and @jbarnette is fixing it.
October 31, 2008 at 8:02 pm
Okay. A new gem is pushed. Once the gem index refreshes, you should be able to install version 1.0.2 without any dependencies.
October 31, 2008 at 8:27 pm
An additional error happened when installing the gem with MacPorts Ruby 1.8.6:
Building native extensions. This could take a while...
ERROR: Error installing nokogiri:
ERROR: Failed to build gem native extension.
rake RUBYARCHDIR=/opt/local/lib/ruby/gems/1.8/gems/nokogiri-1.0.1/lib RUBYLIBDIR=/opt/local/lib/ruby/gems/1.8/gems/nokogiri-1.0.1/lib
rake aborted!
no such file to load -- hoe
/opt/local/lib/ruby/gems/1.8/gems/nokogiri-1.0.1/rakefile:4
(See full trace by running task with --trace)
(in /opt/local/lib/ruby/gems/1.8/gems/nokogiri-1.0.1)
This was easily fixed by doing:
sudo gem install hoe
October 31, 2008 at 8:30 pm
Also for the Newbies (like me) out there, when you run up irb to test the install, before requiring nokogiri do 'require "rubygems"'. This one ALWAYS trips me up :-)
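That check can also be wrapped up as a tiny script, sketched below. The helper name nokogiri_available? is made up for illustration; the only things assumed about Nokogiri itself are that it loads with require 'nokogiri' and exposes Nokogiri::VERSION.

```ruby
# Try to load Nokogiri the way you would in irb and report what happened.
# `nokogiri_available?` is a hypothetical helper name.
def nokogiri_available?
  require 'rubygems' # needed on Ruby 1.8 so that installed gems can be found
  require 'nokogiri'
  true
rescue LoadError
  false
end

if nokogiri_available?
  puts "Nokogiri #{Nokogiri::VERSION} loaded"
else
  puts "Nokogiri could not be loaded; check the gem installation"
end
```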
October 31, 2008 at 8:52 pm
FYI: Tried a simple gem install nokogiri at 4:50 EDT on All Hallows Eve, installed and runs flawlessly on my Mac Book Pro running stock Ruby 1.8.
October 31, 2008 at 8:55 pm
Awesome, Kevin. I've added a note to the post to indicate that my instructions are now obsolete.
November 1, 2008 at 12:40 pm
I should point out that I had to upgrade hoe to version 1.8.2 before the nokogiri gem installation proceeded.
November 2, 2008 at 1:52 am
I am on windows xp, got this error
D:\Documents and Settings\dzhang2>gem install nokogiri
Bulk updating Gem source index for: http://gems.rubyforge.org/
Building native extensions. This could take a while...
ERROR: Error installing nokogiri:
ERROR: Failed to build gem native extension.
rake RUBYARCHDIR=c:/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2/lib RUBYLIBDIR=c:
/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2/lib
(in c:/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2)
rake aborted!
couldn't find HOME environment -- expanding `~/.hoerc'
c:/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2/rakefile:20:in `new'
(See full trace by running task with --trace)
Gem files will remain installed in c:/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2
for inspection.
Results logged to c:/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2/gem_make.out
*********************************************
anybody got it installed on windows?
thanks
November 3, 2008 at 11:14 am
What's the right way to use Nokogiri::XML with namespaces?
Is it necessary to register namespaces as libxml-rb does?
http://libxml.rubyforge.org/rdoc/classes/LibXML/XML/XPath.html
bye
November 3, 2008 at 5:14 pm
Hi Aphe,
For handling XML namespaces, Aaron and I tried to make it a little simpler than the libxml-style namespace registration.
You should be able to make a query like:
xml = Nokogiri::XML.parse(...)
tires = xml.xpath('//bike:tire', {'bike' => 'http://schwinn.com/'})
more generally, the xpath() method takes an optional second argument which is a hash of namespace-alias => URL.
You can take a look at some of the test cases for more details. We're working on more complete documentation!
November 3, 2008 at 5:21 pm
@Dong,
You should be able to avoid that (common) hoe error message by setting a phony HOME environment variable.
Try running:
set HOME=foo
before installing!
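For anyone wondering why HOME matters: the error comes from expanding a path that starts with ~, which Ruby resolves using the HOME environment variable. The sketch below only illustrates that behaviour; it is not hoe's actual code, the helper name expand_with_home is hypothetical, and the /tmp/foo value assumes a Unix-style system where HOME has to be an absolute path.

```ruby
# Expand `path` with HOME temporarily set to `home`, restoring the previous
# value afterwards. `expand_with_home` is a hypothetical helper used only
# to show how "~" depends on HOME.
def expand_with_home(path, home)
  previous = ENV['HOME']
  ENV['HOME'] = home
  File.expand_path(path)
ensure
  ENV['HOME'] = previous
end

puts expand_with_home('~/.hoerc', '/tmp/foo')
```

Whatever value HOME holds becomes the prefix of the expanded path, which is why even a placeholder value is enough to get past the error.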
November 4, 2008 at 2:15 am
Mike
thanks for the tip. now, I am getting a different error
D:\Documents and Settings\dzhang2>gem install nokogiri
Bulk updating Gem source index for: http://gems.rubyforge.org/
Building native extensions. This could take a while...
ERROR: Error installing nokogiri:
ERROR: Failed to build gem native extension.
rake RUBYARCHDIR=c:/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2/lib RUBYLIBDIR=c:
/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2/lib
(in c:/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2)
rake aborted!
undefined method `add_development_dependency' for #
c:/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2/rakefile:20:in `new'
(See full trace by running task with --trace)
Gem files will remain installed in c:/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2
for inspection.
Results logged to c:/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2/gem_make.out
thanks
Dong
November 4, 2008 at 9:20 pm
Dong:
It looks like your rubygems is not up to date. I had the same problem, so I downloaded the latest version of rubygems from rubyforge and installed it.
You can try:
gem update --system
I have Ubuntu, so that doesn't work for me.
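To see whether your RubyGems is recent enough before retrying, a quick check like the following may help. It is a sketch under one assumption: 1.2.0 is used as the threshold only because that is the version reported earlier in this thread as getting past the add_development_dependency error, while 1.0.1 did not.

```ruby
require 'rubygems'

# Assumed threshold, taken from the versions mentioned in this thread.
MINIMUM_RUBYGEMS = Gem::Version.new('1.2.0')

installed = Gem::Version.new(Gem::VERSION)

# The rake failure was "undefined method `add_development_dependency'",
# so also check directly whether gem specifications respond to it.
supports_dev_deps = Gem::Specification.method_defined?(:add_development_dependency)

puts "RubyGems #{installed}"
puts "add_development_dependency available: #{supports_dev_deps}"
if installed < MINIMUM_RUBYGEMS || !supports_dev_deps
  puts "Consider updating RubyGems before installing nokogiri"
end
```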
November 5, 2008 at 6:01 pm
Oh dear, your info is out of date already... hpricot is now faster...
http://hackety.org/2008/11/03/hpricotStrikesBack.html
November 5, 2008 at 6:16 pm
Mike
that is it! after updating my rubygems, the install went through.
thanks
Dong
November 5, 2008 at 6:21 pm
sorry, previous message should be to Jamie.
Jamie, appreciate your help.
November 9, 2008 at 1:39 am
Nokogiri certainly is not better at "it just installs!". On a clean ubuntu 8.10 install hpricot installs just fine, while nokogiri installs with loads of issues, see above, and mine is different:
$ sudo gem install nokogiri
Building native extensions. This could take a while...
ERROR: Error installing nokogiri:
ERROR: Failed to build gem native extension.
rake RUBYARCHDIR=/usr/lib/ruby/gems/1.8/gems/nokogiri-1.0.3/lib RUBYLIBDIR=/usr/lib/ruby/gems/1.8/gems/nokogiri-1.0.3/lib
(in /usr/lib/ruby/gems/1.8/gems/nokogiri-1.0.3)
/usr/lib/ruby/gems/1.8/gems/rake-0.8.3/lib/rake/gempackagetask.rb:13:Warning: Gem::manage_gems is deprecated and will be removed on or after March 2009.
checking for xmlParseDoc() in -lxml2... no
checking for xsltParseStylesheetDoc() in -lxslt... no
checking for libxml/xmlversion.h in /usr/include/libxml2,/usr/include/libxml2... no
need libxml
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of
necessary libraries and/or headers.
Installed libxml and libxml2, still getting same error.
Nokogiri: big fail !
November 9, 2008 at 4:02 pm
Plans for JRuby support? Looks like this will cause problems for Webrat, who just switched to Nokogiri.
November 14, 2008 at 6:30 pm
Lawrence - Try to install the libxml-dev and libxml2-dev packages. That way the header files are available.
checking for libxml/xmlversion.h in /usr/include/libxml2,/usr/include/libxml2... no
Seems to point to a missing header.
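A quick way to check for the header yourself is sketched below. The helper name dirs_with_header is hypothetical; the header name and the /usr/include/libxml2 directory are taken from the extconf.rb output quoted above, and other systems may keep the file elsewhere.

```ruby
# Return the directories from `dirs` that contain the file `header`.
# `dirs_with_header` is a hypothetical helper, not part of mkmf or nokogiri.
def dirs_with_header(header, dirs)
  dirs.select { |dir| File.file?(File.join(dir, header)) }
end

found = dirs_with_header('libxml/xmlversion.h', ['/usr/include/libxml2'])

if found.empty?
  puts "libxml/xmlversion.h not found; the development package is probably missing"
else
  puts "Header found in: #{found.join(', ')}"
end
```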