SourceClassifier: Identifying Programming Languages Quickly
If you're developing a snippets or pastie-type system, or another form of CMS where source code might be stored, it'd be incredibly useful to automatically detect which language a submitted snippet is written in so that you can style it appropriately.
Chris Lowis' SourceClassifier library (also available as a GitHub repo) does just that, using a Bayesian classifier trained on source code from the Alioth Shootouts. Out of the box it supports C, Java, JavaScript, Perl, Python and Ruby, but you can train it to recognize others (CSS and HTML seem like notable omissions to me).
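To make the idea concrete, here is a minimal sketch of a naive Bayes source classifier in Python. It is not SourceClassifier's actual code or API (the gem itself is Ruby); the NaiveBayes class, the tokenize helper and the three tiny training snippets are all hypothetical, standing in for a real corpus such as the Shootout programs.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical tokenizer: identifiers/keywords, plus single punctuation marks.
# Digits and whitespace are ignored.
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(source):
    return TOKEN_RE.findall(source)


class NaiveBayes:
    """Toy multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # language -> token frequencies
        self.doc_counts = Counter()               # language -> training samples seen
        self.vocab = set()                        # every token seen in training

    def train(self, language, source):
        tokens = tokenize(source)
        self.token_counts[language].update(tokens)
        self.doc_counts[language] += 1
        self.vocab.update(tokens)

    def scores(self, source):
        tokens = tokenize(source)
        total_docs = sum(self.doc_counts.values())
        result = {}
        for language, counts in self.token_counts.items():
            total_tokens = sum(counts.values())
            # Log prior, then one smoothed log likelihood per token.
            score = math.log(self.doc_counts[language] / total_docs)
            for token in tokens:
                score += math.log(
                    (counts[token] + 1) / (total_tokens + len(self.vocab))
                )
            result[language] = score
        return result

    def classify(self, source):
        scores = self.scores(source)
        return max(scores, key=scores.get)


# Made-up training data: one tiny snippet per language.
clf = NaiveBayes()
clf.train("Ruby", 'def greet(name)\n  puts "Hello, #{name}"\nend\n')
clf.train("Python", "def greet(name):\n    print('Hello, ' + name)\n")
clf.train("C", '#include <stdio.h>\nint main(void) { printf("hi\\n"); return 0; }\n')

print(clf.classify('puts "done"\nend\n'))
```

With a single training snippet per language the result hinges on a handful of tokens, which is why a real classifier needs a much larger corpus.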
January 5, 2009 at 4:49 pm
CSS and HTML are omitted because they aren't programming languages.
January 5, 2009 at 4:50 pm
Strictly true, but for many of the places you'd use something like SourceClassifier - like a pastie site - CSS and HTML are considered "languages" (or at least have their own formatting conventions) in a pragmatic sense.
January 5, 2009 at 5:07 pm
Thanks for the mention, Peter! Good point about HTML and CSS: they certainly fit the use case for the gem, even if they're not, as pointed out, programming languages.
I think csszengarden is a good corpus for CSS files, and I'll try to find a suitable one for HTML. Maybe I'll stick to semantically correct/valid XHTML for simplicity! Good suggestion, I'll add these to v0.3 of the gem.
January 5, 2009 at 5:34 pm
Seems odd that PHP isn't on the list since it is one of the most ubiquitous web languages out there. Great resource though, keep it up.
January 5, 2009 at 5:57 pm
A Bayesian classifier can be cool enough for languages whose syntax is more or less identical (Lisp varieties or, say, Python and Ruby), but is the real-world performance really trivial in this case?
I'm pretty sure a rule-based pre-filter would do a pretty good job of sorting out the main syntax families, with the survivors then handed to a Bayesian approach.
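A rough sketch of what such a pre-filter might look like, in Python. The two families, the regex rules and the candidate_languages function are assumptions made up for illustration; nothing here comes from SourceClassifier. The returned shortlist is what a Bayesian classifier would then choose from.

```python
import re

# Hypothetical grouping of the six out-of-the-box languages into two families.
FAMILIES = {
    "curly": ["C", "Java", "JavaScript", "Perl"],
    "block": ["Python", "Ruby"],
}
ALL_LANGUAGES = [lang for langs in FAMILIES.values() for lang in langs]

# Hypothetical rules: each regex match counts as one vote for a family.
RULES = [
    # Lines ending in ';', '{' or '}' suggest a curly-brace language.
    ("curly", re.compile(r"[;{}][ \t]*$", re.M)),
    # Lines starting with these keywords suggest Python or Ruby.
    ("block", re.compile(r"^[ \t]*(?:def|elif|end|unless)\b", re.M)),
]


def candidate_languages(source):
    """Return the languages the second-stage classifier should consider."""
    votes = {family: len(pattern.findall(source)) for family, pattern in RULES}
    ranked = sorted(votes, key=votes.get, reverse=True)
    # On a tie (including no votes at all) fall back to every language.
    if votes[ranked[0]] == votes[ranked[1]]:
        return ALL_LANGUAGES
    return FAMILIES[ranked[0]]


print(candidate_languages("def greet(name)\n  puts name\nend\n"))
```

Rules this crude misfire easily (a Ruby one-liner ending in a closing brace would vote "curly"), so falling back to the full list on a tie keeps the pre-filter from ruling out the right answer too eagerly.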
January 6, 2009 at 12:33 am
Needs to be added to gist.github.com stat! Looking forward to it
January 6, 2009 at 7:09 pm
Thanks for all your comments. I've updated the gem to recognise PHP and CSS (trained on examples from csszengarden.com). I'm still looking for a suitable corpus of valid (X)HTML.
@railsjedi - that'd be great!
@Apostlion - do you mean training performance? Once trained, the training file can be kept in memory. Perhaps it'd be useful to run some benchmarks on recognition performance; I'll add it to the TODO list. Rule-based filters for this task do also exist, but the classifier approach needs only a small amount of code and is trivial to extend to new languages. I think both approaches have their merits.
January 15, 2009 at 11:06 am
@Chris -
I wasn't exactly referring to performance, more like training material bias. Say (for a trivial example) that the Shootout Python scripts were mostly written by a structured programming fan, and the Ruby scripts were mostly written by a functional programming fan.
Then the Bayesian classifier may assume that, say, an if...else construct is a Python giveaway while lambda is a Ruby construct, even though both are obviously present in both languages, and downplay the ‘real’ differences, such as the ubiquitous end keywords in Ruby scripts.
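The effect is easy to reproduce with made-up numbers. The Python sketch below computes a smoothed log-likelihood ratio per token from two deliberately skewed, entirely hypothetical token lists (not real Shootout data); log_ratio is an illustrative helper, not part of any library. A positive value pulls a snippet toward Ruby, a negative one toward Python.

```python
import math
from collections import Counter

# Hypothetical, deliberately biased training tokens.
ruby_tokens = ["lambda", "x", "end", "lambda", "y", "end"]
python_tokens = ["if", "x", "else", "y", "if", "x", "else", "y"]


def log_ratio(token, ruby_tokens, python_tokens):
    """log(P(token | Ruby) / P(token | Python)) with add-one smoothing."""
    ruby, python = Counter(ruby_tokens), Counter(python_tokens)
    vocab_size = len(set(ruby) | set(python))
    p_ruby = (ruby[token] + 1) / (len(ruby_tokens) + vocab_size)
    p_python = (python[token] + 1) / (len(python_tokens) + vocab_size)
    return math.log(p_ruby / p_python)


for token in ("lambda", "end", "if"):
    print(token, round(log_ratio(token, ruby_tokens, python_tokens), 3))
```

On this skewed data lambda carries exactly as much "Ruby" weight as end, and if counts as evidence for Python, although lambda and if are valid in both languages. Only a more balanced corpus would let end stand out as the real giveaway.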