The Split is Not Enough: Unicode Whitespace Shenanigans for Rubyists
That code is legal Ruby! If you ran it, you'd see 8. How? There's a tale to tell..
The String with the Golden Space
I was on IRC in #nwrug enjoying festive cheer with fellow Northern Rubyists when ysr23
presented a curious problem.
He was using a Twitter library that returned a tweet, "@twellyme film", in a string called reply. The problem was that despite calling reply.split, the string refused to split on whitespace. Yet if he did "@twellyme film".split in IRB, that was fine.
International man of mystery Will Jessop suggested checking $; (it's a special global variable that defines the default separator for String#split). It was OK.
In an attempt to look smarter than I am, I suggested reply.method(:split).source_location to see if the String class had been monkey-patched by something annoying. Nope. (Though this is a handy trick if you do want to detect if anyone's tampered with something.)
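For anyone who wants to repeat those two sanity checks, here is a minimal sketch. It assumes a stock MRI interpreter, where String#split is implemented in C; the variable names are just for illustration.

```ruby
# $; is nil unless something has overridden the default split separator.
default_separator = $;

# source_location returns nil for methods implemented in C (as String#split
# is in MRI) and [file, line] for methods defined, or redefined, in Ruby.
location = "".method(:split).source_location

puts "default separator: #{default_separator.inspect}"
puts "String#split defined at: #{location.inspect}"
```

If a library had monkey-patched String#split in Ruby, the second line would print a file name and line number instead of nil.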
Someone suggested Mr. Ysr23 show us reply.codepoints.to_a:
# reply.codepoints.to_a => [64, 116, 119, 101, 108, 108, 121, 109, 101, 160, 102, 105, 108, 109]
Something leapt out at me. Where was good old 32!? Instead of trusty old ASCII 32 (space) stood 160, a number alien to my ASCII-trained 1980s-model brain.
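The whole mystery can be reproduced without Twitter. This is a small sketch in which the reply string is a hand-built stand-in for the tweet, with the no-break space written as the escape "\u00A0" so it stays visible.

```ruby
# "\u00A0" is U+00A0 NO-BREAK SPACE, codepoint 160 (assumed stand-in for the tweet).
reply = "@twellyme\u00A0film"
plain = "@twellyme film"

p reply.split       # a single element: there is no ASCII whitespace to split on
p plain.split       # two elements, split on the ordinary space (codepoint 32)
p reply.codepoints  # 160 sits where 32 would normally be
```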
From Google with Love
To the Google-copter!
Aha! Non-breaking space. That's why split was being as useful as a chocolate teapot.
After an intense 23 seconds of discussion, we settled on a temporary solution for Mr. Ysr23 who, by this time, was busy cursing Twitter and all who sailed upon her:
reply.gsub(/[[:space:]]/, ' ').split
The solution is simple. Use the Unicode-aware POSIX bracket expression [[:space:]] to match Unicode's idea of what whitespace is and convert all matches into vanilla ASCII whitespace. reply.split(/[[:space:]]+/) is another more direct option - we just didn't think of it at the time.
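Here are both options side by side, as a sketch against a hand-built stand-in for the tweet. The \s comparison is my addition: in Ruby, \s matches only ASCII whitespace, so it is no help here.

```ruby
# Assumed stand-in for the tweet, with U+00A0 written as an escape.
reply = "@twellyme\u00A0film"

# Option 1: normalise every Unicode whitespace character to a plain space, then split.
normalized = reply.gsub(/[[:space:]]/, ' ').split

# Option 2: split directly on runs of Unicode whitespace.
direct = reply.split(/[[:space:]]+/)

# For comparison: \s covers only ASCII whitespace, so nothing gets split.
ascii_only = reply.split(/\s+/)

p normalized
p direct
p ascii_only
```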
Quantum of Spaces
Solving an interesting but trivial issue wasn't where I wanted to end my day. I'd re-discovered an insidious piece of Unicode chicanery and was in the mood for shenanigans!
Further Googling taught me you can type non-breaking spaces directly on OS X with Option+Space. (You can do the homework for your own platform.)
I also knew Ruby 1.9 and beyond would let you use Unicode characters as identifiers if you let Ruby know about the source's encoding with a magic comment, so it was time for shenanigans to begin!
My first experiment was to try and use non-breaking spaces in variable names.
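A literal non-breaking space is invisible on the page, so this sketch (not the original listing) builds the source as a string with "\u00A0" escapes and hands it to eval. The name foo<nbsp>bar is hypothetical.

```ruby
# The identifier below is "foo", a no-break space, then "bar": one variable, not two.
source = "foo\u00A0bar = 3\nfoo\u00A0bar + 1"

# Ruby's lexer accepts any non-ASCII character as part of an identifier,
# so this parses as an assignment followed by an addition.
result = eval(source)
puts result
```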
Cool! So what about variable names and method names?
What about without any regular printable characters in the identifiers at all?
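Here is a sketch of the same trick with identifiers made of nothing but non-breaking spaces. Again this is a reconstruction rather than the original listing, built with escapes and eval so the characters can be seen; the choice of 3 and 5 is mine, picked to sum to the 8 mentioned at the top.

```ruby
nbsp = "\u00A0"
one  = nbsp       # an identifier consisting of a single no-break space
two  = nbsp * 2   # an identifier consisting of two no-break spaces

# Ordinary ASCII spaces still separate the tokens; the no-break spaces are the names.
source = "#{one} = 3\n#{two} = 5\n#{one} + #{two}"

total = eval(source)
puts total
```

Pasted into a file with literal non-breaking spaces instead of escapes, the same program looks like a few stray operators floating in blank space.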
And so we're back to where we started. A hideous outcome from a trivial weekend on IRC. But fun, nonetheless. Stick it in your "wow, nice, but totally useless" brain box.
A Warning
Please don't use this in production code or the Ruby gods will come and haunt you in your sleep. But.. if you want to throw some non-breaking spaces into your next pair programming session, conference talk, or job interview, just to see if anyone's paying attention, I'll be laughing with you. (And if you're a C# developer too, Andy Pike tells me that C# supports these shenanigans too.)
P.S. My Ruby 2.0 Walkthrough Kickstarter only has about 12 hours to go! Check it out if Ruby 2.0 is on your radar or you want a handy way to get up to speed when it drops in February 2013.
November 26, 2012 at 4:43 pm
alternate title "Go home whitespace, you're drunk"
November 26, 2012 at 8:49 pm
That's... awesome! Unicode is hard.
November 26, 2012 at 8:53 pm
Your last example is mind-bending, but an even more insidious usage is to put a non-breaking space at the end of a method name. Now the method "doesn't exist", "but I can see it, it's defined RIGHT THERE!". Muwahahahaha!
I'm not sure if this will show up correctly on account of the UTF-8 characters, but if you're using Vim, a setting like this can prevent you from being punk'd like this:
set listchars=nbsp:☠,trail:⋅,tab:▸\ ,eol:¬,extends:❯,precedes:❮
With that, `:set list!` will show non-breaking spaces as a skull and crossbones (among other settings for invisibles).
November 26, 2012 at 10:00 pm
Nathan: Ha, yes! And thanks for the tip. There's a similar setting for Sublime Text 2 users to show this stuff up too. I'll let people Google for it but the clue is
"draw_white_space": "all"
in the user settings.
November 26, 2012 at 11:04 pm
Inspired by your post, I've written a program that encodes/decodes any string into a map of 16 different UTF-8 whitespace codepoints, and then made it produce Ruby code which decodes and evals such encoded files. It's very, very wrong.
https://gist.github.com/4151267
Now I'm going to go take a shower in bleach.
November 27, 2012 at 5:48 pm
I think taelor wins the comment thread on this one.
November 28, 2012 at 12:17 am
Interestingly, Java's and Scala's split methods also don't treat nbsp as whitespace, and require a similar regex if you want them to. However, even though Java and Scala do support Unicode identifiers (without any comment magic), nbsp is not a valid identifier character, because both languages are very strict about exactly which Unicode character categories are allowed in identifiers. Unicode puts each character in a category. For example, ä is in the category letter/lowercase, so it's valid in an identifier, but non-breaking space has a category of separator/space, so it's not. In Java, symbols are not valid in identifiers, so ☃ is not valid, since its category is symbol/other. But in Scala, symbols are allowed, as long as the identifier contains only symbols. So that makes for the following interesting Scala code:
val ☃ = 9700; val ☀ = 30
println((☃ + ☀).asInstanceOf[Char])
December 12, 2012 at 9:44 am
If I'm not mistaken it's not so much that C# supports these shenanigans but rather that Bogard used a font that rendered a non-visible character as a space.