Web Reflection: Levenshtein algorithm revisited: 2.5 times faster

My JavaScript book is out! Don't miss the opportunity to upgrade your beginner or average dev skills.

Saturday, February 14, 2009

Levenshtein algorithm revisited: 2.5 times faster

First of all one consideration:

JavaScript is SLOW!

Chrome, Safari, Opera, FireFox (Internet Explorer obviously is the only one 2 to 8 times slower!)
It does not matter which tracemonkey or V8 you are using, JS is far away to be fast as C#, C++, C are!

About Levenshtein

If you would like to create a project like BJSpell with a reasonable suggestion, the levenshtein algorithm will trill at some point: it calculates the distance between 1 string to another one. It means that from the word "Gofy" to the word "Goofy", this algo will return 1 character to change replace, or add, before Gofy will become Goofy.

Why Levenshtein

To obtain a suggestion that makes sense, we could use some known algorithm to recognize at least words close to the mistyped one. So far, my suggestion in BJSpell was something truly rude and often useless, but as far as I know there is NO WAY to implement something more clever, using every possible best performances practice.

How to speed up the algorithm

Every "bloody" implementation of this algorithm uses 1 more useless loop to pre-assign values to the multidimensional Array, plus a boring if else instead of ternary operator.
Removing those loops I switched performances from 1.680 seconds to 560, over 100000 computations using C#. The trick is simple: set the Array when you need it!
Here is the code I wrote for JavaScript first and C# after:


// JavaScript
var levenshtein = function(min, split){
    // Levenshtein Algorithm Revisited - WebReflection
    try{split=!("0")[0]}catch(i){split=true};
    return function(a, b){
        if(a == b)return 0;
        if(!a.length || !b.length)return b.length || a.length;
        if(split){a = a.split("");b = b.split("")};
        var len1 = a.length + 1,
            len2 = b.length + 1,
            I = 0,
            i = 0,
            d = [[0]],
            c, j, J;
        while(++i < len2)
            d[0][i] = i;
        i = 0;
        while(++i < len1){
            J = j = 0;
            c = a[I];
            d[i] = [i];
            while(++j < len2){
                d[i][j] = min(d[I][j] + 1, d[i][J] + 1, d[I][J] + (c != b[J]));
                ++J;
            };
            ++I;
        };
        return d[len1 - 1][len2 - 1];
    }
}(Math.min, false);


// C#
public int levenshtein(string a, string b){
    // Levenshtein Algorithm Revisited - WebReflection
    if (a == b)
        return 0;
    if (a.Length == 0 || b.Length == 0)
        return a.Length == 0 ? b.Length : a.Length;
    int len1 = a.Length + 1,
        len2 = b.Length + 1,
        I = 0,
        i = 0,
        c, j, J;
    int[,] d = new int[len1, len2];
    while(i < len2)
        d[0, i] = i++;
    i = 0;
    while(++i < len1){
        J = j = 0;
        c = a[I];
        d[i, 0] = i;
        while(++j < len2){
            d[i, j] = Math.Min(Math.Min(d[I, j] + 1, d[i, J] + 1), d[I, J] + (c == b[J] ? 0 : 1));
            ++J;
        };
        ++I;
    };
    return d[len1 - 1, len2 - 1];
}

As Summary

Levenshtein algorithm is still too slow if performed via JavaScript but for small dictionaries (up to 5000 words instead of 70000 or more as I tested with the entire en_US vocabulary) is a perfect choice for suggestions.

Have fun with algorithms ... next one? Who knows, probably the soundex :D

21 comments:

Brett15 February, 2009 16:38
What license is the code under at your site? Any particular license? MIT ok? ;)

I'm asking because besides your other code being useful, we have an implementation at http://kevin.vanzonneveld.net/techblog/article/javascript_equivalent_for_phps_levenshtein/ , but although I haven't read the article or tested, I presume yours is faster.

If you do decide to do soundex, someone already submitted such a version if it helps for comparison.

Maybe you might get some ideas at http://trac.phpjs.org/projects/phpjs/browser/trunk/_unported :)

I know we can also improve performance, etc. for some of our existing functions as well.

(Some of that will be solved by allowing the namespaced version of the library to move static variables out of the individual functions into the top of the namespace, but we also want the functions to be independent (and simple) for those who don't want the full package.)

Since you often seem to come up with really slick optimizations, you might be interested in our new asort() and other array sort functions, which per the PHP.JS policy, makes the code work with objects (as associative arrays) as well (despite the lack of order guarantee in JS). In order to work PHP-style with such "arrays" (by reference), we really have no choice but to delete and then re-add elements as we do sorting, and in doing so, I introduced bubble sort (for which I know there are some optimizations for). Anyhow, just thought you might be intrigued by the issues it raises.

It's fun to look at the PHP source code as well in the process of doing this.
ReplyDelete
Replies
Andrea Giammarchi15 February, 2009 17:05
Brett, I know that project and I contributed since the beginning. I wrote a similar project years before that one inside the overbyte Editor (Settings: enable jhp)

PHP.JS implementation as every other uses 2 completely useless extra loop. I think everybody "copy and pasted" the code from wikipedia, which is there just to explain the logic, but nobody though about those 2 unnecessary loops.

My license here is Mit style, specially for this function where you do not need to be Einstein to both understand or re-create :D

In jhp there was asort as well, if I am not wrong, but that project disappeared the day I realized that instead of PHP into JS, I would like to have JS into PHP (which will became almost possible with 5.3 and runkit + operator extension when those will be compilable under 5.3)

Best Regards
ReplyDelete
Replies
Andrea Giammarchi15 February, 2009 17:09
P.S. I know you guys have problems with the namespace, I created the define php like function with true constants for both IE (via VBScript) and others (via cont) and you decided to remove it because it created global scope constants (exactly as PHP does ...)
ReplyDelete
Replies
Brett17 February, 2009 12:13
I'm not clear what you're speaking of. I've only been really working with the project for maybe one or two months. There is a define() function already at http://kevin.vanzonneveld.net/techblog/article/javascript_equivalent_for_phps_define/ though I haven't checked carefully for what your function may do differently...
ReplyDelete
Replies
Andrea Giammarchi17 February, 2009 13:11
Brett, that is NOT a define function. Check my link, which is the one they used at the beginning ;-)
ReplyDelete
Replies
Brett17 February, 2009 17:03
Yes, that is definitely not a define function. :) Well, actually, maybe the author was simply trying to make it work in all browsers. I have to say, your implementation is truly clever. It looks like Kevin just forgot to add it (see here, as he's very good at giving credit, and your name is not listed with that function, though it is for array_map()). I can point it out to him again.
ReplyDelete
Replies
Andrea Giammarchi17 February, 2009 17:09
Brett, I told you they where using mine before but they removed it for namespace problems (my define defines global constants as PHP define does ... too similar for those guys :D)
They perfectly know my solution, specially the author of PHP.JS

Thanks anyway for your suggestion. Regards
ReplyDelete
Replies
Brett18 February, 2009 03:48
Ok, I just submitted a closure-free version of your define() and levenshtein() to Kevin. I think they should work with his compiler now, but we'll see. In any case, I tested and your levenshtein() is still faster than our current implementation.
ReplyDelete
Replies
Brett18 February, 2009 03:59
Oh, btw, what can we also adapt the functions at JHP under MIT if we find them useful?
ReplyDelete
Replies
Brett16 March, 2009 11:12
I think I posted another reply here earlier but don't see it here now. Anyways, wanted to let you know we've used your code at http://kevin.vanzonneveld.net/techblog/article/javascript_equivalent_for_phps_levenshtein/ and http://kevin.vanzonneveld.net/techblog/article/javascript_equivalent_for_phps_define/ . I removed the closure since Kevin's compiler can't work with them, but levenshtein is still faster than the other implementations.
ReplyDelete
Replies
Andrea Giammarchi16 March, 2009 11:34
Brett, please remove that try catch from the function to speed up it.

The closure problem was probably about missed brackets.

var levenshtein = (function(){
var split = false;
try{('0')[0]}catch(e){split=true};
return function(a, b){
... etc
};
})();

Regards
ReplyDelete
Replies
Brett16 March, 2009 22:31
Hi,

I know how to keep it as a closure like you had it before, but I had to change it to work without a closure since his compiler won't work with functions defined as such. We can have internal closures, but his site currently won't support the main functions themselves being closures.

As far as the try-catch, if I remove that, does that mean I can remove the block later for "split" (if (split)...)? Do some browsers not support treating strings by index?
ReplyDelete
Replies
Andrea Giammarchi17 March, 2009 11:15
Brett, a JavaScript compiler with closures problems sounds ridiculous to me ... anyway, remove that try catch from the function.

try {
("0")[0];
} catch(i) {
levenshtein.split = true;
};

then in the function ...
if(levenshtein.split){a = a.split("");b = b.split("")};
add the min as var before the loop
var min = Math.min
and you'll have almost same performances.
Regards (and please tell him to change compiler ...)
ReplyDelete
Replies
Brett19 March, 2009 03:45
Ok, so levenshtein.split is turned into a configuration option then? But I'm still unclear why--do some browsers not support accessing strings by index? I've never heard of that...

Our function already declares Math.min at the top of the function (safer practice with JS and it is his policy to indicate what the nature of the variables are early on).

I can tell him about the compiler, but he's pretty busy with the new site. Also, I think his compiler may be parsing the text to work with his site's PHP, not actually as a functioning JavaScript parser, but I could be wrong. Anyways, independent closure functions would be nice (though when we compile all functions into a namespace, my hope is to extract out redundant and static information from the individual functions for sharing common variables (especially large ones) among ALL PHP.JS functions; but fixing the independent versions still would be nice).

Btw, I've blogged another few entries about your Relator class at my blog if you were interested (skim down a little to find the Relator posts)...

best wishes, Brett
ReplyDelete
Replies
Andrea Giammarchi19 March, 2009 10:23
>> do some browsers not support accessing strings by index?
have you never heard about a crap browser installed by default with a blue e as logo called Internet Explorer? :D
ReplyDelete
Replies
Andrea Giammarchi19 March, 2009 10:48
Sorry Brett, the try catch is useless for IE, this is the correct one:
try{split=!("0")[0]}catch(i){split=true};

in IE 5.5+ is undefined, in some browser (IE5 or 4?) it could generate an error ... so one try catch to understand if the browser is compatible or not, that's it. But only once, not for every function call ;)
ReplyDelete
Replies
Brett19 March, 2009 15:48
> have you never heard about a crap browser installed by default with a blue e as logo called Internet Explorer? :D

I try to ignore it if I can... :) As long as they follow the rules, I don't mind, but no one should be above the law of standards! :) So, I tend not to bother programming to them, though I know I should be aware of the issues (you can see I haven't been around that long).

Thanks for the explanation and fix--I get it now and made a patch, though for now we have to live with it being called each time (I passed on your suggestion (as it was mine too)--hopefully he may be able to allow functions defined as closures after he gets the new site fixed.).

Best wishes, Brett
ReplyDelete
Replies
Andi Kalsch17 August, 2010 09:32
Make it more memory friendly - if you save the whole matrix you will loose a lot of memory: http://gist.github.com/527658
ReplyDelete
Replies
Anonymous16 August, 2011 17:55
Since this site is the top hit when you search for "javascript levenshtein", I'd like to comment that this algorithm is actually very slow. Take a look at this comparison where I've picked three algorithms from the top Google hits: http://jsperf.com/levenshtein-algorithms

Compared to Wikibooks' algorithm this is slower in Chrome and Firefox but slightly faster in IE 9. phpjs.net's algorithm beats this one across the board, being between 40% to 70% faster depending on the browser.

The lesson: You can't know if an algorithm is faster before you measure! :)
ReplyDelete
Replies
Andrea Giammarchi16 August, 2011 18:12
The Lesson: in February 2009 this was the fastest out of same search you did today on Google

I took it from first link of the list after a Google search, I improved it, I posted it ... time spent: few minutes, your contribute to this post: update with better solution

win, win? ;)
ReplyDelete
Replies
Andrea Giammarchi16 August, 2011 18:14
second lesson of the day:

indeed, the fastest *now* in jsperf has in comments:
revised by: Andrea Giammarchi (http://webreflection.blogspot.com)

you see, you landed here, somebody else as well, you now can blame me it's not that fast ... somebody else improved it!
ReplyDelete
Replies

Add comment

Note: Only a member of this blog may post a comment.