libxml-ruby 1.x Released: Ruby Now A Serious XML Player
Try this:
gem install libxml-ruby
And prepare to be shocked as you see libxml-ruby 1.1.1 installed (at the time of writing). Yes, libxml-ruby has, seven years on, made it to version 1! It's a long history involving a lot of people, initially Sean Chittenden, then Trans Onoma, Ross Bamford, Dan Janowski, and now Charlie Savage who has pushed the library to its 1.1.1 state.
libxml-ruby is a set of bindings to libxml2, a ridiculously fast and complete XML parser written in C. It passes all 1800 OASIS XML Test Suite tests, it's fast (it leans on the C library for the heavy lifting), and it's finally reliable. It's Ruby 1.9.1 compatible too, has great documentation (I love the theme they've used!), a clean API, and strong test coverage. In short, it lets us consign REXML to the trash can of Ruby history.
Now, if you're perfectly happy with abstractions like Hpricot or Nokogiri (which also uses libxml2 but handles its bindings separately), libxml-ruby might not be of much immediate use to you. If performance is key though, check it out. libxml-ruby puts an incredibly powerful and fast "true" XML parsing library at your fingertips, with all of the pains and pleasures that entails.
Congratulations to everyone involved for this significant Ruby milestone. We're in the powerful XML club now.
March 12, 2009 at 1:41 am
This is good news. Particularly because I think it's the only Ruby XML library to implement bindings for the libxml validations.
March 12, 2009 at 2:54 am
The theme is Hanna, by Mislav Marohnić, and worthy of its own post: http://github.com/mislav/hanna/
March 12, 2009 at 3:21 am
Been using this for a while now. So much better than REXML, and since development picked up again, rock solid and easy to use to boot. Highly recommend taking a look if you haven't already.
March 12, 2009 at 4:08 am
Pity. I thought if all the XML libraries sucked that XML itself might die. No such luck apparently.
March 12, 2009 at 4:14 am
@Daniel: Ha ha! Yes, there's very much a "necessary evil" about it all, isn't there?
March 12, 2009 at 5:26 am
I had a layout that parsed a huge XML file. With REXML it took over 9 seconds to get all the information I needed. Libxml (.8?) handled the job in less than half a second. I was amazed at how much faster it handled the XPath queries. Finally going 1.x is a big deal IMO.
March 12, 2009 at 3:06 pm
Did you know Rails 2.3 RC2 will have LibXML support? Thanks to some cool guy who made Rails bindings for it :)
http://rails.lighthouseapp.com/projects/8994/tickets/2084-alternative-xml-parsers-support-in-activesupport-for-activeresource
March 12, 2009 at 4:47 pm
Are you sure it's actually faster than Nokogiri? I haven't seen any data to suggest this.
Hopefully in the future they'll be able to restrain themselves from including breaking changes in point-releases; that has blown away my trust in this gem.
March 12, 2009 at 4:56 pm
@Assaf - Yes, Hanna is a great little theme.
@Bart - Nice, thanks for the heads up.
@Phil - The breakage was unavoidable to get libxml-ruby into decent shape. We put in a lot of effort to maintain backwards compatibility (see all the deprecation warnings), but there are a couple of places it wasn't possible (for example, encodings). Anyway, now that the API has been revamped, it will be stable.
March 12, 2009 at 5:08 pm
I didn't actually say libxml-ruby was faster than Nokogiri, just that Nokogiri provides more of an abstraction. The following sentence about performance is really stressing that libxml-ruby is "at the metal" XML type parsing, not that Nokogiri is monumentally slower. I believe libxml-ruby should be faster than Nokogiri on the tasks that libxml-ruby is well suited for, since Nokogiri provides many abstractions.
March 12, 2009 at 9:45 pm
The first time this dropped on my machine, it dropped a broken ruby interpreter in the bin directory of the gem. Not sure why that happened, but it did, and it broke everything. Luckily I was able to figure out what happened, but damn that was wild.
Nokogiri and Hpricot appear to be a bit faster in my informal tests. All in all, I'd say a nice upgrade.
March 12, 2009 at 11:13 pm
Anyone want to put together a reasonable set of benchmarks for all of these libraries? It'd be very interesting. I'd do it, but my history with statistics is very, very bad. I get shouted at for being inaccurate :)
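As a starting point for that kind of comparison, here is a minimal sketch of a harness that times "parse a document, then run one XPath query" for whichever of REXML, libxml-ruby, and Nokogiri happen to be installed. The sample document shape, the sample_xml and run_benchmarks helpers, and the //item query are all hypothetical choices made for illustration, and a single synthetic workload like this says very little about real-world performance.

```ruby
require 'benchmark'

# Builds a synthetic catalog document. The shape is an assumption made for
# this sketch; it is not taken from any benchmark mentioned in the thread.
def sample_xml(items)
  body = (1..items).map { |i| "<item id=\"#{i}\"><name>n#{i}</name></item>" }.join
  "<catalog>#{body}</catalog>"
end

# Each candidate parses the string and returns how many <item> nodes an
# XPath query finds. A library is only registered if it can be loaded.
CANDIDATES = {}

begin
  require 'rexml/document'
  CANDIDATES['rexml'] = lambda do |xml|
    REXML::XPath.match(REXML::Document.new(xml), '//item').size
  end
rescue LoadError
  # REXML not available in this Ruby; skip it.
end

begin
  require 'libxml'
  CANDIDATES['libxml-ruby'] = lambda do |xml|
    LibXML::XML::Document.string(xml).find('//item').to_a.size
  end
rescue LoadError
  # libxml-ruby gem not installed; skip it.
end

begin
  require 'nokogiri'
  CANDIDATES['nokogiri'] = lambda do |xml|
    Nokogiri::XML(xml).xpath('//item').size
  end
rescue LoadError
  # nokogiri gem not installed; skip it.
end

# Runs every candidate `runs` times against the same document and records
# the match count from the last run plus the total wall-clock time.
def run_benchmarks(candidates, xml, runs = 5)
  candidates.each_with_object({}) do |(name, parse), results|
    count = nil
    seconds = Benchmark.realtime { runs.times { count = parse.call(xml) } }
    results[name] = { matches: count, seconds: seconds }
  end
end

run_benchmarks(CANDIDATES, sample_xml(2_000)).each do |name, result|
  puts format('%-12s %6d matches  %.4fs', name, result[:matches], result[:seconds])
end
```

Checking that every library reports the same match count before comparing times is a cheap guard against benchmarking two different amounts of work. A fairer comparison would also separate parse time from query time and try several document sizes.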
March 13, 2009 at 1:51 am
Since libxml-ruby and Nokogiri both wrap libxml2, their performance should be equivalent. libxml-ruby might be faster on searching since it uses straight XPath, while Nokogiri uses CSS selectors which it translates to XPath via Ruby code. But I'd imagine that overhead is negligible.
The libraries are more similar than different, but from my view:
* libxml-ruby exposes a greater breadth and depth of libxml2's functionality (encodings, much better error support, validations, parser contexts, etc.).
* Nokogiri offers CSS selectors and follows Hpricot's API so it can be a drop-in replacement. It also wraps different types of DOM objects as different types of Ruby objects (Element, Comment, CData, etc.).
And Aaron can jump in on the Nokogiri side if I've missed something.
March 13, 2009 at 2:53 am
CSS to XPath conversions are cached. The conversion is very fast, and you only pay the conversion price once per selector. Also, don't forget Nokogiri does XPath queries too.
I'm curious about the encoding and error support. I don't believe your statement to be true. Which parser contexts are you talking about? I don't expose a DOM context because I don't believe that to be necessary. I do expose Reader, PushParser, and SAX parsing contexts. You're right about DTD validations at this point though.
I believe libxml-ruby does not expose a Push Parser, does it?
Someone should make a checklist, or something. There are a billion features in libxml2.
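To illustrate the "pay the conversion price once" point about cached CSS to XPath conversions, here is a toy sketch of a memoized selector translator. It is not Nokogiri's implementation: the SelectorCache class, its xpath_for method, and the misses counter are hypothetical names, and the translator only understands bare tag names joined by descendant (space) and child (>) combinators.

```ruby
# Toy CSS-to-XPath translator with a per-selector cache.
# Hypothetical illustration only; real translators handle far more of CSS.
class SelectorCache
  attr_reader :misses

  def initialize
    @cache = {}
    @misses = 0
  end

  # Returns the XPath for a CSS selector, translating it only the first
  # time that exact selector string is seen.
  def xpath_for(css)
    @cache.fetch(css) do
      @misses += 1
      @cache[css] = translate(css)
    end
  end

  private

  # Supports tag names separated by whitespace (descendant, "//")
  # or ">" (direct child, "/"). Anything else is out of scope here.
  def translate(css)
    axis = '//'
    css.scan(/>|[^\s>]+/).each_with_object(+'') do |token, xpath|
      if token == '>'
        axis = '/'
      else
        xpath << axis << token
        axis = '//'
      end
    end
  end
end

cache = SelectorCache.new
puts cache.xpath_for('div > p a')
puts cache.xpath_for('div > p a') # second call is served from the cache
puts "translations performed: #{cache.misses}"
```

With a cache like this, repeated searches using the same selector cost one hash lookup, which is why the translation overhead tends to wash out in any loop that reuses selectors.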
March 13, 2009 at 2:58 am
@peter, @phil
http://rubyforge.org/pipermail/libxml-devel/2008-November/001239.html
Granted, we need a more comprehensive set.
March 13, 2009 at 3:56 am
Hey Aaron,
The parser contexts are xmlParserCtxtPtr and htmlParserCtxtPtr, which give low-level access to each parse run (most of the time they are overkill, but every so often very helpful). They do provide a nice unified internal API though.
Validation - DTD, RelaxNG, XML Schema
Full namespace support
Encodings - I went back and aligned the bindings to use libxml2's encoding constants consistently across all APIs (libxml2 is a bit inconsistent, I assume because of the age of the API).
Errors - I was looking at ruby_xml_error.c. Looks like you've gone down a similar path in xml_syntax_error.c (just checked the Nokogiri code again).
As for push parsing, I didn't know it existed, so libxml-ruby doesn't expose that. Have you run into a good use case for it?
March 13, 2009 at 4:10 am
@Charlie
Can you be more specific about namespace support? Nokogiri handles namespaces as well.
Ya, the encoding API is definitely not consistent in libxml2, and it shows in some of the Nokogiri bindings. But we do support it.
Yes, push parsing is crucial for dealing with never-ending documents like XMPP streams.