My JavaScript book is out! Don't miss the opportunity to upgrade your beginner or average dev skills.

Saturday, February 14, 2009

Levenshtein algorithm revisited: 2.5 times faster

First of all one consideration:

JavaScript is SLOW!


Chrome, Safari, Opera, FireFox (Internet Explorer obviously is the only one 2 to 8 times slower!)
It does not matter which tracemonkey or V8 you are using, JS is far away to be fast as C#, C++, C are!

About Levenshtein


If you would like to create a project like BJSpell with a reasonable suggestion, the levenshtein algorithm will trill at some point: it calculates the distance between 1 string to another one. It means that from the word "Gofy" to the word "Goofy", this algo will return 1 character to change replace, or add, before Gofy will become Goofy.

Why Levenshtein


To obtain a suggestion that makes sense, we could use some known algorithm to recognize at least words close to the mistyped one. So far, my suggestion in BJSpell was something truly rude and often useless, but as far as I know there is NO WAY to implement something more clever, using every possible best performances practice.

How to speed up the algorithm


Every "bloody" implementation of this algorithm uses 1 more useless loop to pre-assign values to the multidimensional Array, plus a boring if else instead of ternary operator.
Removing those loops I switched performances from 1.680 seconds to 560, over 100000 computations using C#. The trick is simple: set the Array when you need it!
Here is the code I wrote for JavaScript first and C# after:

// JavaScript
var levenshtein = function(min, split){
// Levenshtein Algorithm Revisited - WebReflection
try{split=!("0")[0]}catch(i){split=true};
return function(a, b){
if(a == b)return 0;
if(!a.length || !b.length)return b.length || a.length;
if(split){a = a.split("");b = b.split("")};
var len1 = a.length + 1,
len2 = b.length + 1,
I = 0,
i = 0,
d = [[0]],
c, j, J;
while(++i < len2)
d[0][i] = i;
i = 0;
while(++i < len1){
J = j = 0;
c = a[I];
d[i] = [i];
while(++j < len2){
d[i][j] = min(d[I][j] + 1, d[i][J] + 1, d[I][J] + (c != b[J]));
++J;
};
++I;
};
return d[len1 - 1][len2 - 1];
}
}(Math.min, false);


// C#
public int levenshtein(string a, string b){
// Levenshtein Algorithm Revisited - WebReflection
if (a == b)
return 0;
if (a.Length == 0 || b.Length == 0)
return a.Length == 0 ? b.Length : a.Length;
int len1 = a.Length + 1,
len2 = b.Length + 1,
I = 0,
i = 0,
c, j, J;
int[,] d = new int[len1, len2];
while(i < len2)
d[0, i] = i++;
i = 0;
while(++i < len1){
J = j = 0;
c = a[I];
d[i, 0] = i;
while(++j < len2){
d[i, j] = Math.Min(Math.Min(d[I, j] + 1, d[i, J] + 1), d[I, J] + (c == b[J] ? 0 : 1));
++J;
};
++I;
};
return d[len1 - 1, len2 - 1];
}


As Summary


Levenshtein algorithm is still too slow if performed via JavaScript but for small dictionaries (up to 5000 words instead of 70000 or more as I tested with the entire en_US vocabulary) is a perfect choice for suggestions.

Have fun with algorithms ... next one? Who knows, probably the soundex :D

21 comments:

Brett said...

What license is the code under at your site? Any particular license? MIT ok? ;)

I'm asking because besides your other code being useful, we have an implementation at http://kevin.vanzonneveld.net/techblog/article/javascript_equivalent_for_phps_levenshtein/ , but although I haven't read the article or tested, I presume yours is faster.

If you do decide to do soundex, someone already submitted such a version if it helps for comparison.

Maybe you might get some ideas at http://trac.phpjs.org/projects/phpjs/browser/trunk/_unported :)

I know we can also improve performance, etc. for some of our existing functions as well.

(Some of that will be solved by allowing the namespaced version of the library to move static variables out of the individual functions into the top of the namespace, but we also want the functions to be independent (and simple) for those who don't want the full package.)

Since you often seem to come up with really slick optimizations, you might be interested in our new asort() and other array sort functions, which per the PHP.JS policy, makes the code work with objects (as associative arrays) as well (despite the lack of order guarantee in JS). In order to work PHP-style with such "arrays" (by reference), we really have no choice but to delete and then re-add elements as we do sorting, and in doing so, I introduced bubble sort (for which I know there are some optimizations for). Anyhow, just thought you might be intrigued by the issues it raises.

It's fun to look at the PHP source code as well in the process of doing this.

Andrea Giammarchi said...

Brett, I know that project and I contributed since the beginning. I wrote a similar project years before that one inside the overbyte Editor (Settings: enable jhp)

PHP.JS implementation as every other uses 2 completely useless extra loop. I think everybody "copy and pasted" the code from wikipedia, which is there just to explain the logic, but nobody though about those 2 unnecessary loops.

My license here is Mit style, specially for this function where you do not need to be Einstein to both understand or re-create :D

In jhp there was asort as well, if I am not wrong, but that project disappeared the day I realized that instead of PHP into JS, I would like to have JS into PHP (which will became almost possible with 5.3 and runkit + operator extension when those will be compilable under 5.3)

Best Regards

Andrea Giammarchi said...

P.S. I know you guys have problems with the namespace, I created the define php like function with true constants for both IE (via VBScript) and others (via cont) and you decided to remove it because it created global scope constants (exactly as PHP does ...)

Brett said...

I'm not clear what you're speaking of. I've only been really working with the project for maybe one or two months. There is a define() function already at http://kevin.vanzonneveld.net/techblog/article/javascript_equivalent_for_phps_define/ though I haven't checked carefully for what your function may do differently...

Andrea Giammarchi said...

Brett, that is NOT a define function. Check my link, which is the one they used at the beginning ;-)

Brett said...

Yes, that is definitely not a define function. :) Well, actually, maybe the author was simply trying to make it work in all browsers. I have to say, your implementation is truly clever. It looks like Kevin just forgot to add it (see here, as he's very good at giving credit, and your name is not listed with that function, though it is for array_map()). I can point it out to him again.

Andrea Giammarchi said...

Brett, I told you they where using mine before but they removed it for namespace problems (my define defines global constants as PHP define does ... too similar for those guys :D)
They perfectly know my solution, specially the author of PHP.JS

Thanks anyway for your suggestion. Regards

Brett said...

Ok, I just submitted a closure-free version of your define() and levenshtein() to Kevin. I think they should work with his compiler now, but we'll see. In any case, I tested and your levenshtein() is still faster than our current implementation.

Brett said...

Oh, btw, what can we also adapt the functions at JHP under MIT if we find them useful?

Brett said...

I think I posted another reply here earlier but don't see it here now. Anyways, wanted to let you know we've used your code at http://kevin.vanzonneveld.net/techblog/article/javascript_equivalent_for_phps_levenshtein/ and http://kevin.vanzonneveld.net/techblog/article/javascript_equivalent_for_phps_define/ . I removed the closure since Kevin's compiler can't work with them, but levenshtein is still faster than the other implementations.

Andrea Giammarchi said...

Brett, please remove that try catch from the function to speed up it.

The closure problem was probably about missed brackets.

var levenshtein = (function(){
var split = false;
try{('0')[0]}catch(e){split=true};
return function(a, b){
... etc
};
})();

Regards

Brett said...

Hi,

I know how to keep it as a closure like you had it before, but I had to change it to work without a closure since his compiler won't work with functions defined as such. We can have internal closures, but his site currently won't support the main functions themselves being closures.

As far as the try-catch, if I remove that, does that mean I can remove the block later for "split" (if (split)...)? Do some browsers not support treating strings by index?

Andrea Giammarchi said...

Brett, a JavaScript compiler with closures problems sounds ridiculous to me ... anyway, remove that try catch from the function.

try {
("0")[0];
} catch(i) {
levenshtein.split = true;
};

then in the function ...
if(levenshtein.split){a = a.split("");b = b.split("")};
add the min as var before the loop
var min = Math.min
and you'll have almost same performances.
Regards (and please tell him to change compiler ...)

Brett said...

Ok, so levenshtein.split is turned into a configuration option then? But I'm still unclear why--do some browsers not support accessing strings by index? I've never heard of that...

Our function already declares Math.min at the top of the function (safer practice with JS and it is his policy to indicate what the nature of the variables are early on).

I can tell him about the compiler, but he's pretty busy with the new site. Also, I think his compiler may be parsing the text to work with his site's PHP, not actually as a functioning JavaScript parser, but I could be wrong. Anyways, independent closure functions would be nice (though when we compile all functions into a namespace, my hope is to extract out redundant and static information from the individual functions for sharing common variables (especially large ones) among ALL PHP.JS functions; but fixing the independent versions still would be nice).

Btw, I've blogged another few entries about your Relator class at my blog if you were interested (skim down a little to find the Relator posts)...

best wishes, Brett

Andrea Giammarchi said...

>> do some browsers not support accessing strings by index?
have you never heard about a crap browser installed by default with a blue e as logo called Internet Explorer? :D

Andrea Giammarchi said...

Sorry Brett, the try catch is useless for IE, this is the correct one:
try{split=!("0")[0]}catch(i){split=true};

in IE 5.5+ is undefined, in some browser (IE5 or 4?) it could generate an error ... so one try catch to understand if the browser is compatible or not, that's it. But only once, not for every function call ;)

Brett said...

> have you never heard about a crap browser installed by default with a blue e as logo called Internet Explorer? :D

I try to ignore it if I can... :) As long as they follow the rules, I don't mind, but no one should be above the law of standards! :) So, I tend not to bother programming to them, though I know I should be aware of the issues (you can see I haven't been around that long).

Thanks for the explanation and fix--I get it now and made a patch, though for now we have to live with it being called each time (I passed on your suggestion (as it was mine too)--hopefully he may be able to allow functions defined as closures after he gets the new site fixed.).

Best wishes, Brett

Andi Kalsch said...

Make it more memory friendly - if you save the whole matrix you will loose a lot of memory: http://gist.github.com/527658

Anonymous said...

Since this site is the top hit when you search for "javascript levenshtein", I'd like to comment that this algorithm is actually very slow. Take a look at this comparison where I've picked three algorithms from the top Google hits: http://jsperf.com/levenshtein-algorithms

Compared to Wikibooks' algorithm this is slower in Chrome and Firefox but slightly faster in IE 9. phpjs.net's algorithm beats this one across the board, being between 40% to 70% faster depending on the browser.

The lesson: You can't know if an algorithm is faster before you measure! :)

Andrea Giammarchi said...

The Lesson: in February 2009 this was the fastest out of same search you did today on Google

I took it from first link of the list after a Google search, I improved it, I posted it ... time spent: few minutes, your contribute to this post: update with better solution

win, win? ;)

Andrea Giammarchi said...

second lesson of the day:

indeed, the fastest *now* in jsperf has in comments:
revised by: Andrea Giammarchi (http://webreflection.blogspot.com)

you see, you landed here, somebody else as well, you now can blame me it's not that fast ... somebody else improved it!