WhatLanguage: Ruby Library To Detect The Language Of A Text
WhatLanguage is a library by Peter Cooper (disclaimer: yes, that's me) that makes it quick and easy to determine what language a supplied text is written in. It's pretty accurate on anything from a short sentence up to several paragraphs in all of the languages supplied with the library (Dutch, English, Farsi, Russian, French, German, Portuguese, Spanish, Pinyin) and adding languages of your own choosing isn't difficult.
The library works by checking for the presence of words with bloom filters built from dictionaries based on each source language. We've covered bloom filters on Ruby Inside before, but essentially they're probabilistic data structures based on hashing a large set of content. Bloom filters are ideal in situations where you want to check set membership and the threat of false positives is acceptable in return for significant memory savings (and a 250KB bloom filter is a lot nicer to deal with than a 14MB+ dictionary).
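To make the membership-check idea concrete, here's a minimal bloom filter in plain Ruby. This is only an illustrative sketch, not WhatLanguage's internal implementation; the sizes and hash scheme are arbitrary choices.

require 'digest'

# Minimal bloom filter sketch (illustrative only -- not WhatLanguage's
# internal code). Each word sets a handful of bits derived from its hash;
# lookups can report false positives, but never false negatives.
class TinyBloomFilter
  def initialize(bits = 1_000_000, hashes = 3)
    @bits   = Array.new(bits, false)
    @hashes = hashes
  end

  def add(word)
    positions(word).each { |i| @bits[i] = true }
  end

  def include?(word)
    positions(word).all? { |i| @bits[i] }
  end

  private

  # Carve several bit positions out of one SHA1 digest of the word.
  def positions(word)
    hex = Digest::SHA1.hexdigest(word.downcase)
    (0...@hashes).map { |n| hex[n * 8, 8].to_i(16) % @bits.size }
  end
end

french = TinyBloomFilter.new
%w[je suis un homme].each { |w| french.add(w) }
french.include?("homme")  # => true
french.include?("zebra")  # => false (almost certainly)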
WhatLanguage is available from GitHub (and can be installed as a gem from there with gem install peterc-whatlanguage) or from RubyForge with the simpler gem install whatlanguage. Once installed, usage is simple:
require 'whatlanguage'

"Je suis un homme".language
# => :french

# OR...

wl = WhatLanguage.new(:all)
wl.language("Je suis un homme")
# => :french

wl.process_text("this is a test of whatlanguage's great language detection features")
# => {:german=>1, :dutch=>3, :portuguese=>3, :english=>7, :russian=>1, :farsi=>1, :spanish=>3, :french=>2}
I wrote the library initially a year ago but have only just made it available for public use, so if there are unforeseen bugs to fix or things that really need to be added, fork it on GitHub and get playing.
August 22, 2008 at 6:51 pm
Wonderful tool! I am definitely going to implement this very soon!
Who will make a little Ajax script that, based on an observer, loads the right dictionary for spell checking?
August 22, 2008 at 8:09 pm
I've installed the gem through GitHub and tried it in irb, but I'm getting nil instead of a language name!
Could you guess where the problem might be?
August 22, 2008 at 10:21 pm
Very cool.
Do you know if there's something similar for detecting programming languages? That would be useful for choosing syntax highlighting for example.
August 22, 2008 at 11:43 pm
@Ryan We plan on open sourcing the one we use on GitHub soon. It's based on file extension and shebang, along with some custom mappings - pretty simple, but works rather well.
August 23, 2008 at 12:57 am
Detecting whole words is inefficient. You can use n-grams with maybe a few hundred KB of training text (uncompressed).
You should be able to detect a language using only the first dozen or so letters if you do it right.
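To give a feel for it, pulling character trigrams out of a string takes only a few lines of Ruby. This is purely an illustrative sketch, unrelated to WhatLanguage's code.

# Illustrative only: pull character trigrams out of a piece of text.
def trigrams(text)
  text.downcase.scan(/[[:alpha:] ]/).each_cons(3).map(&:join)
end

trigrams("Je suis")
# => ["je ", "e s", " su", "sui", "uis"]

counts = Hash.new(0)
trigrams("Je suis un homme").each { |g| counts[g] += 1 }
counts.sort_by { |_, n| -n }.first(3)  # the three most frequent trigrams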
I found this paper online: http://www.sfs.uni-tuebingen.de/iscl/Theses/kranig.pdf
Otherwise I recommend reading Manning and Schütze.
Cheers,
Jesse
August 23, 2008 at 8:28 am
Detecting whole words isn't inefficient in a general sense when using a bloom filter (of only 250KB size). It might be less efficient than a particularly clever solution - of which there are many - but it's nowhere near the inefficiency of direct dictionary comparison.
This library also does not compare the whole text if deemed unnecessary. If a particular language has been determined by a wide margin within a certain number of words, processing stops.
I would encourage anyone who has ideas on making a more efficient version to give it a go though. These are, fortunately, reasonably simple libraries to develop and share due to the lack of API complexity required.
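In rough terms, the early-exit idea looks something like the following. This is only a sketch to illustrate the behaviour described above, not WhatLanguage's actual source, and the margin and check interval are invented values.

# Sketch of the early-exit idea only -- not the library's real code;
# the margin and check interval here are made-up values.
def guess_language(words, filters, margin = 5, check_every = 10)
  scores = Hash.new(0)

  words.each_with_index do |word, i|
    filters.each { |lang, filter| scores[lang] += 1 if filter.include?(word) }

    # Stop early once one language leads by a comfortable margin.
    if (i + 1) % check_every == 0
      best, runner_up = scores.values.sort.reverse
      if runner_up && best - runner_up >= margin
        return scores.max_by { |_, count| count }.first
      end
    end
  end

  best_lang, _ = scores.max_by { |_, count| count }
  best_lang
end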
August 23, 2008 at 12:12 pm
More info about that N-gram technique:
http://tnlessone.wordpress.com/2007/05/13/how-to-detect-which-language-a-text-is-written-in-or-when-science-meets-human/
August 23, 2008 at 3:58 pm
Peter,
Sorry, my inner coder/pedant came out.
What I'm trying to say is that using whole words isn't really optimal. Why? You suffer from a sparsity problem.
The probability of seeing even a common word is very small. This means that you need more training data to get equal coverage.
Actually, it's worse than that. If you take the top 100 3-grams in each language you will almost certainly see them all in even a moderately short text. What's more, the probability that some of those 3-grams appear even at the word level is very good.
Here's a question to ask: how many 3-grams do you need to see before you're, say, 98% certain of the language?
Conversely, the probability that a given common word will appear in a text is much smaller and you're much more likely to run into a word you've never seen before. That unknown word, however, will probably contain an n-gram you've seen before.
You also run into morphological artifacts. For example, can WhatLanguage handle pluralization? Handling that using only full words for identification requires either a combinatorially-sized dictionary or some way of encoding the rules of morphology in the program.
August 23, 2008 at 6:33 pm
I see - you mean more an ineffectiveness than an inefficiency.
I don't disagree with your line of thought - but am surprised that there doesn't appear to be a popular library for this (perhaps there is one, just poorly named or referenced) if it could be that simple.
I will attempt a few experiments in this area. I am very keen for something like WhatLanguage to be "trainable" from a corpus rather than based on calculated but opaque rulesets. Hopefully it is possible to do this and still get a good set of n-grams. If I can rig up a set of tests, I can then run WhatLanguage against this other technique and see what the difference in accuracy and speed is.
Now that some people appear to be using WhatLanguage, it would be an interesting (but major) change to its underbelly, but the API could remain intact.
August 23, 2008 at 6:50 pm
Peter,
n-gram based classifiers are trainable. You build an n-gram profile for each language using something like Good-Turing. You can limit it to the top X n-grams for each language (where X = 50, 100, whatever).
Then, given a piece of text, T, you can calculate P(Language(T) = English | T) by comparing n-gram distributions.
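A rough sketch of that approach, using a simple rank-distance comparison of top-trigram profiles rather than Good-Turing smoothing; the training file names are just placeholders, and the whole thing is illustrative rather than a finished classifier.

# Rough sketch: rank-based comparison of trigram profiles (one simple way
# to compare n-gram distributions; the training file names are placeholders).
def trigrams(text)
  text.downcase.scan(/[[:alpha:] ]/).each_cons(3).map(&:join)
end

def profile(text, top = 100)
  counts = Hash.new(0)
  trigrams(text).each { |g| counts[g] += 1 }
  counts.sort_by { |_, n| -n }.first(top).map(&:first)
end

# Lower is better; trigrams unknown to a language profile get a fixed penalty.
def distance(text_profile, language_profile)
  penalty = language_profile.size
  text_profile.each_with_index.map { |gram, rank|
    lang_rank = language_profile.index(gram)
    lang_rank ? (lang_rank - rank).abs : penalty
  }.reduce(0, :+)
end

profiles = {
  english: profile(File.read("english_sample.txt")),
  french:  profile(File.read("french_sample.txt"))
}

sample = profile("Je suis un homme")
profiles.min_by { |_, prof| distance(sample, prof) }.first  # => :french, one hopes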
August 24, 2008 at 6:15 pm
Seems good enough to me, will be used to hide Chinese comments from German users and vice versa ;)
August 24, 2008 at 9:41 pm
A Windows user has reported significant issues with WhatLanguage. Please post if you are a Windows user with issues or if it's working 100% for you.
August 24, 2008 at 9:42 pm
Alex: What were you trying? If it cannot work out the language clearly, you will get nil. This is most common on very short pieces.
August 26, 2008 at 8:27 am
Once again, your timing couldn't be better. It'll be going into an app of mine this week :)
September 2, 2008 at 8:44 am
You might want to have a look at: http://odur.let.rug.nl/~vannoord/TextCat/Demo/textcat.html