Web Reflection: November 2009

Sunday, November 29, 2009

XML To HTML Snippet

This John Resig post about nodeName demonstrates once again how untrustworthy edge cases are in JavaScript.
I must agree 100% with @jdalton: frameworks or selector libraries should not be concerned about these cases.

First of all, there is no universal solution, so whatever effort we make will slow libraries down and still won't be perfect. Why bother, then?

Secondly, I cannot even understand why on earth somebody could need to adopt XML nodes in that way.
Agreed that importNode or adoptNode should not be that buggy, but at the same time I have always used XSL(T) to inject XML into HTML and I have never had problems.

Different Worlds

In an XML document a tag is just a tag. It does not matter which name we chose or which JavaScript event we attached: XML is simply a data protocol, or transporter, and nothing else.
A link, a div, a head, the html node itself: none of them means anything different in XML, so again: why do we need to import in that way?
In Internet Explorer we have the xml property which, for HTML represented inside XML docs, is the fastest and simplest way to move that node and what it contains into an element via innerHTML.
Moreover, namespaces are a problem: we cannot easily represent them in an HTML document so, in any case, we need to be sure about the represented data.
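A minimal sketch of that xml plus innerHTML approach, under a few assumptions: serializeNode and injectXML are hypothetical names, and the XMLSerializer branch is only the usual fallback for browsers without the IE xml property. It makes sense only when the XML fragment contains HTML-compatible markup.

```javascript
// Hypothetical helper: serialize an XML node to markup.
// `node.xml` is the Internet Explorer property mentioned above;
// XMLSerializer is the standard alternative in other browsers.
function serializeNode(node) {
    if (typeof node.xml === "string") {
        return node.xml;
    }
    return new XMLSerializer().serializeToString(node);
}

// Hypothetical helper: inject the serialized node into an HTML element.
function injectXML(target, node) {
    target.innerHTML = serializeNode(node);
    return target;
}
```

Namespaced nodes and XML-only constructs are not handled here, which is exactly why we need to be sure about the represented data.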

A Better Import

Rather than ask every selector library to handle these edge cases we could simply adopt a better import strategy.
This XML to HTML transformer is already a valid alternative, but we can write something a bit better or more suitable for common cases.
As an example, a truly common case in this kind of transformation is a CDATA section inside a script node.
With CDATA we can put almost whatever we want inside the node, but 99.9% of the time what we need is JavaScript code rather than a comment.
Another thing to consider is that if we need to import or adopt an XML node, 99.9% of the time we need to import its full content and not an empty node (otherwise we should ask ourselves why we are representing data like that, no?).
I bet the deep variable will basically be true by default, so here is my alternative proposal, which should be a bit faster, avoiding an implicit boolean cast for each attribute or node, and which takes into account what I said a few lines ago:


function XML2HTML(xml){
    // WebReflection Suggestion - MIT Style License
    for(var
        nodeName = xml.nodeName.toUpperCase(),
        html = document.createElement(nodeName),
        attributes = xml.attributes || [],
        i = 0, length = attributes.length,
        tmp;
        i < length; ++i
    )
        html.setAttribute((tmp = attributes[i]).name, tmp.value)
    ;
    for(var
        childNodes = xml.childNodes,
        i = 0, length = childNodes.length;
        i < length; ++i
    ){
        switch((tmp = childNodes[i]).nodeType){
            case 1:
                html.appendChild(XML2HTML(tmp));
                break;
            case 3:
                html.appendChild(document.createTextNode(tmp.nodeValue));
                break;
            case 4:
            case 8:
                // assuming .text works in every browser
                nodeName === "SCRIPT" ?
                    html.text = tmp.nodeValue :
                    html.appendChild(document.createComment(tmp.nodeValue))
                ;
                break;
        }
    };
    return html
};

I have tried to post the above snippet on John's post as well, but for some reason it is still not there (maybe it is waiting to be approved).
We can test the above snippet via this piece of code:


var data = '<section data="123"><script><![CDATA[\nalert(123)\n]]></script><option>value</option></section>';
try{
var xml = new ActiveXObject("Microsoft.XMLDOM");
xml.loadXML(data);
}catch(e){
var xml = new DOMParser().parseFromString(data, "text/xml");
}
var section = xml.getElementsByTagName("section")[0];
onload = function(){
    document.body.appendChild(XML2HTML(section));
    alert(document.body.innerHTML);
};

We can use the latest snippet to test the other function as well, and as soon as I can I will try to compare the solutions and provide some benchmarks.

Monday, November 23, 2009

On element.dataset And data-* Attribute

Why on earth? I mean, why should we put a data-whatever attribute into our layout? Where is the good old MVC? How can you consider data-* more semantic? More semantic than what? Why would we like to kill truly semantic pages and graceful enhancement? Why do we need redundant JavaScript info inside node attributes when we cannot accept a script tag in the middle of the page? Why, in the era where performance matters, would we like to let users download stuff that they will probably never use?

What Am I Talking About

We can find the "magic" data attribute description in the W3C Semantics, structure, and APIs for HTML documents page. These days there is a page that is going around even too much ... finally somebody realized that this data-thingy is nothing different from what we could have used for ages: XML.

Why We Are Doing Wrong

If we are enthusiastic about a custom attribute able to carry whatever information, we should ask ourselves why on earth we are not simply using XML plus XSL to transform nodes with custom attributes. Each node can be easily and quickly transformed, on the client side as well and in a cross-browser way, via one or more cached XSLT stylesheets able to create whatever we need at runtime.
Do we need to sort the music file list via the data-length attribute? No problem at all: we can reorder every node we need via DOM and put the XML fragment, transformed via XSLT, into a single HTML node in that page. Moreover, we can use modular XSL to transform branches or specific cases ... but that is too clean and professional, isn't it?

Data Used Via Selectors

Let's say we have 3 songs with the same length, taking the example from W3 page, OK?


<ol>
  <li data-length="2m11s">Beyond The Sea</li>
  <li data-length="2m11s">Beside The Sea</li>
  <li data-length="2m11s">Be The Sea</li>
</ol>

First of all that example is quite hilarious: to make a sort as fast as possible, that kind of representation is not ideal, is it?
It does not matter. The point is that as soon as this stuff is implemented, all jQuery users will start to think so semantically that everything will become a:


$("[data-whatever=whatevervalue]").each(...)

Seriously ... if we want to make the web the worst semantic place ever, just put this kind of weapon in the hands of medium-skilled developers, and in a few months we'll be screaming HTML5 Failed!
Who will care anymore about the appropriate element when totally dirty layouts, useless for non JS users, can be easily validated by the W3?

The Namespace War Has Started Already

I can imagine lots of new devs starting to use data-* as if it were their own attribute, ignoring the conflict problems we've always had with the global namespace. Move this problem into the DOM, and the Cactus Jam is ready to eat.
On the other hand, how much of a best practice will it be to have a DOM loaded with nodes carrying multiple data attributes?


<div
   data-jquery-pluginname-good-stuff="$(this).whatever()"
   data-dojo-loadsync="thisNodeFile"
   data-prototype-valueof="Object.prototype.valueOf=this"
/>

... I mean ... seriously, is this the HTML5 we are talking about? An empty element there just to make the anti semantic magic happen?

Dozens Of Different Ways

First of all querySelectorAll.
The document method we were all waiting for is finally 90% here and, rather than use semantic and logical selectors to retrieve what we need when we need it, we prefer to make the DOM that dirty? Are we truly such monkeys that we cannot spot a list of songs and order them by their length, which 99.9% of the time will be part of that song info and accordingly present in the DOM as context?
Where are classes? Where are external resources, totally ignored by those users whose aim is simply to surf quickly, or to crawl info? data-whatever?

Why We Don't Need data-*

First of all, classes. If we have a list of songs, the class songlist on the single outer UL or OL node is everything we need to retrieve that list and order it by title, duration, or anything else present in that list, since in whatever grid we have ever used, we order by columns where we know the column content.
How can a user even decide to order by length if this length is not displayed?
How can he order by something not even shown?
It's like me in a shop asking to order a list of sound systems by palm trees ... I think they'd look at me like a mad person and they would be right, don't you agree?
So, the semantic part of this attribute does not exist. The example shown in the W3 draft is ridiculous for the reason I have already given. If the song length info is already in the DOM and properly shown, we don't need redundant info for every user: we just need a good sort function for that list of songs and nothing else.


$.fn.sort = function(){
    // proof of concept, not even tested
    // written directly here in the blogger textarea
    // don't try at home but please try to understand the logic
    var li = Array.prototype.slice.call(this);
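    li.sort(function(a, b){
        // keep the expression on the same line as return:
        // a line break right after return would return undefined
        return $(a).find(".length").text() <
               $(b).find(".length").text() ?
            1 : -1;
    });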
    return $(li).each(function(i, li){
        li.parentNode.appendChild(li);
    });
};
$(".songs li").sort();

Is the above proof of concept that different from a $(".songs [data-length]").sort() ? It's even shorter!
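One caveat of the proof of concept: it compares the displayed text as strings, so "10m00s" would sort before "2m11s". A minimal sketch of a numeric comparison, assuming the displayed length always follows the "2m11s" format of the W3 example (toSeconds and byLength are hypothetical names):

```javascript
// Hypothetical helper: convert a displayed length such as "2m11s" to seconds.
// Returns NaN when the text does not match the assumed "<min>m<sec>s" format.
function toSeconds(text) {
    var match = /^\s*(\d+)m(\d+)s\s*$/.exec(text);
    return match ? match[1] * 60 + +match[2] : NaN;
}

// Comparator for Array.prototype.sort: shortest song first.
function byLength(a, b) {
    return toSeconds(a) - toSeconds(b);
}

var lengths = ["10m00s", "2m11s", "3m21s"];
lengths.sort(byLength); // ["2m11s", "3m21s", "10m00s"]
```

Inside the jQuery sort, the comparator would simply receive the text of the .length element of each list item instead of a plain string.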

Map The DOM If Necessary

If we are struggling that much with this "so much missed data-*" we can still use the class to map common info ... how?


<ol class="songs">
  <li class="map-0">Beyond The Sea</li>
  <li class="map-0">Beside The Sea</li>
  <li class="map-1">Be The Sea</li>
</ol>

If we need to attach properties to a DOM node we can still use the class attribute.
It is semantic, classifying the node as a mapped one; not redundant, letting us provide the same kind of info once for different nodes; and lightweight, since the data description is provided by the mapped object rather than by the markup.
In a few words, and always via JavaScript, we can attach a file with this content:


var map=[{
    length:"3m21s",
    artist:"Nature"
},{
    length:"2m59s",
    artist:"You"
}];

For each node, when necessary, we could simply retrieve the associated map in this way:


function nodeInfo(node){
    return map[
        (
            /(?:^|\s)map-(\d+)(?:\s|$)/.exec(node.className) ||
            [0,-1]
        )[1]
    ];
};

And that's it. Each mapped node will return an object, or undefined, when the nodeInfo function is called. Again, all this stuff is a proof of concept, but can we agree that data-* is just the wrong answer to a JavaScript problem that should be solved everywhere except in the HTML?

Tuesday, November 17, 2009

195 Chars To Help Lazy Loading

Update
removed named function expression

Update
I wrote events letters in the wrong place, please use the latest script

We have talked many times about performance, and not only here.
One common technique to speed up a generic online page is lazy loading.
By lazy loading I mean:

runtime dependencies resolution via script injection, preferably including all dependencies we need in one shot rather than script after script
Google comments style, where code is evaluated when required but the size is still there
namespaced strings, an example I have explained via code in Ajaxian
whatever else, because sometimes we need this or that library only in certain conditions
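The first item, script injection, can be sketched as follows. This is only an illustration under assumptions: loadScript is a hypothetical name, the document is passed in as a parameter, and the onreadystatechange branch is the usual workaround for old Internet Explorer versions that do not fire onload on script nodes.

```javascript
// Hypothetical helper: inject a script and call back once it has loaded.
// `doc` is passed in (normally `document`) so the sketch has no hidden globals.
function loadScript(doc, src, callback) {
    var script = doc.createElement("script"),
        parent = doc.getElementsByTagName("head")[0] || doc.documentElement,
        done = false;
    script.onload = script.onreadystatechange = function () {
        var state = script.readyState;
        // old IE reports "loaded" or "complete", other browsers fire onload
        if (!done && (!state || state === "loaded" || state === "complete")) {
            done = true;
            script.onload = script.onreadystatechange = null;
            if (callback) callback(script);
        }
    };
    script.src = src;
    parent.insertBefore(script, parent.firstChild);
    return script;
}
```

A call such as loadScript(document, "plugin.js", init) would run init once the (hypothetical) plugin.js file is available, which is exactly the moment a library relying on a ready event may have already missed.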

What Is The Problem

If we would like to widely adopt this lazy load pattern, we should be aware of a "little detail": Firefox < 3.6 does not support document.readyState, a historically valid Internet Explorer feature adopted in HTML5 and present in basically every other browser.
We can find the readyState description and the related bug on the Mozilla domain.
Reading the suggestions, I found the long diatribe about this quite pointless:

Should it be "loading" or "interactive" before it will switch to "complete" ?

IMHO, who cares, as long as we can rely on the final state: complete

Affected Libraries

I have no idea about other libraries, but jQuery for sure! The event.js file shows at line 857 this check for the bindReady event:


// Catch cases where $(document).ready() is called after the
// browser event has already occurred.
if ( document.readyState === "complete" ) {
    return jQuery.ready();
}

The code goes on with common DOMContentLoaded checks and emulations.
The problem is that with such a library, where basically everything starts with a ready event:


$(function(){
    // the ready event we are theoretically
    // sure will be fired even if the 
    // code has been loaded after
});

every Firefox < 3.6 user will miss that code, plug-in, extension, whatever.
I have already informed the jQuery dev ML about this problem, but obviously they already know. John Resig told me that there is no guarantee the ready event will be fired if the page has already been loaded.
Fair enough, I can perfectly understand John's point, which is: not all jQuery supported browsers support document.readyState.
As far as I am concerned, even if this is a good reason to avoid some obtrusive code, we would all expect consistency from a framework, so if something works even in IE I can't even think about Firefox having problems.

The Solution

This missing FF feature could affect different libraries, not only jQuery.
We, as developers, could help every library author by adding 195 uncompressed bytes, even fewer once deflated, as the very first standalone piece of code in our page:


// WebReflection Solution
(function(h,a,c,k){if(h[a]==null&&h[c]){h[a]="loading";h[c](k,c=function(){h[a]="complete";h.removeEventListener(k,c,!1)},!1)}})(document,"readyState","addEventListener","DOMContentLoaded");

// NOTE: in IE h[a]==null is never true, so the fix is skipped there

Since Firefox is usually updated automatically, all we need to do once we are sure people are surfing with version 3.6 or greater is simply remove the little line of code above.

Explained Solution


// verify that document.readyState is undefined
// verify that document.addEventListener is there
// these two conditions are basically telling us
// we are using Firefox < 3.6
if(document.readyState == null && document.addEventListener){
    // on DOMContentLoaded event, supported since ages
    document.addEventListener("DOMContentLoaded", function DOMContentLoaded(){
        // remove the listener itself
        document.removeEventListener("DOMContentLoaded", DOMContentLoaded, false);
        // assign readyState as complete
        document.readyState = "complete";
    }, false);
    // set readyState = loading or interactive
    // it does not really matter for this purpose
    document.readyState = "loading";
}

Conclusion

Since this is a browser problem and not directly library related, it probably does not make sense for libraries to include this fix just for the few months until the next FF release. At the same time we can guarantee for Firefox users, and only if the library does not sniff the browser via this read-only property, that lazy loaded stuff a la $(rockNroll) will work in our favorite browser too. Alternatively, we could simply rely, in our own code, on a readyState "complete" check to decide what to do (this is the real reason I have investigated this problem further, but that is another story, coming soon).
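The readyState "complete" check mentioned above could look like the following sketch. whenComplete is a hypothetical name, the document and window are passed in as parameters, and the code assumes addEventListener is available (old IE versions would need attachEvent instead).

```javascript
// Hypothetical helper: run `callback` now if the document is already
// complete, otherwise wait for the window "load" event.
// `doc` and `win` are parameters (normally `document` and `window`).
function whenComplete(doc, win, callback) {
    if (doc.readyState === "complete") {
        callback();
        return true;
    }
    win.addEventListener("load", function onLoad() {
        win.removeEventListener("load", onLoad, false);
        callback();
    }, false);
    return false;
}
```

Without the fix, Firefox < 3.6 would always take the second branch, so lazy loaded code injected after the load event would wait forever.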

Sunday, November 15, 2009

Why Samsung N510

An unusual topic for this blog, but this toy turned into a wonderful surprise and maybe somebody would like to know how come a developer decides to buy a netbook, rather than the latest ultra speedy tech on the go ...

Decide What You Need

It was about 5 years ago or more that I bought another elegant toy: the VAIO vgns2hp, with an extra 512Mb of RAM in order to reach the mythical Gig of RAM, which did not come for free. I spent a fortune at that time but it was for a good reason: in 2009 I was still able to perform every task I had to perform ... except gaming.
Please note I am talking about a daily "back home from work" computer and not necessarily about development environments, where compilation time, for example, has to be fast or we are going to spend our life compiling and nothing else ...
So, being a performance maniac, I have always thought dated hardware is the best proof that we are doing well. Especially Web speaking, it does not matter if we have the 16 core 128 bit monster, 'cause people do not change hardware as frequently as we think. As an example, if we put IE6 on a Quad Core Opteron I bet it will be faster than every browser I have used so far with my "humiliating hardware".
The funny part of web development is testing (I know ... I know ...) and we cannot recognize performance problems, bad practices, or missed optimizations if we test on a powerful platform. Ideally we should use iPhone, Android, or Palm to test our web applications! A bit extreme, I agree, but one thing is sure: if that little device has been able to render and execute the page code in a reasonable time, I don't need a degree to understand that everybody else at home or in the office will be able to experience good overall performance.
Does anybody remember the option: simulate 33.6Kbps connection?
In Flash world what we call Ajax has always been there (since version 5 and loadVariables, switched to LoadVars in version MX ... plus XMLSockets).
That option's aim was to better understand the bandwidth situation at that time. The IDE was able to simulate low bandwidth connections, giving us the real feeling (or the real frustration). Nowadays, I do not know of tools able to simulate older hardware configurations, nor of virtual machines able to run browsers in slow mode ... but it does not matter, I don't need them, 'cause my hardware is a good compromise between the past and the present ... the future? Well, it's far away and unpredictable!
In summary, if the VAIO were still there, I would have used it for another N months because I have never felt it was not good enough for what I do on the go, or at home: experiments, projects, surfing, developing, music listening, some movies ... all good ... but!

Decide What You Miss

In 4 new hardware generations I have missed different things, such as:

long battery life, mine died a year ago but even before I could not be productive on the go due to an hour and a half of maximum battery
LED based display, I have always been envious of these new amazing, brilliant, shiny displays people were showing off around ... and I need to read and write 80% of the time, and my eyes got tired even if the VAIO had a truly good LCD display
recently, WebGL and O3D support, that 32Mb mobile Radeon has always been a pain in the ass with 3D stuff, and O3D, like WebGL, could not even work properly
720p and 1080p HD support, a connection with the present and the future that few mobile devices can handle properly
money, come on, Christmas is coming, and I was not planning at all to spend 1000 pounds or more on myself since I could have done well with my VAIO for another 6 months or more ...

Be Happy With Your Choice

Too many times I have bought something without being absolutely sure about the product ... especially in an emergency (without a keyboard, a monitor, and a flat connection I feel like Gollum without his ring) it can happen that we buy something knowing something better is coming soon.
Let's think about a kid waiting for his first console ever for his birthday, knowing that the PS3 has been out for two months, when the rich uncle arrives with a PS2 as a present ... that feeling! Even when I chose the VAIO I knew Intel was planning to release new kickass mobile dual core processors, but at that time I absolutely needed a laptop and I could not wait another 2 months ... but hey, this is technology: the second you pay for the latest gadget ever, somebody has already tweeted that the same company released the new one, and for half the price!
In a few words, if we decide to buy tech, especially IT related tech, we pay 1000% more than the value of the tech if it's new, and just the right price for the old one ... but if we spend a reasonable amount of money on something that will match 100% of our expectations, without being greedy about having the latest state of the art technology, we will appreciate more what we have, and this has been the case so far: I am happy!

Be Updated

I used to be a Quake 3 Arena killer and, during my career, a hardware and software technician ... in those days there was nothing I did not know about the latest technologies. I was talking about SLI and multi core as the computation solution only thanks to my Voodoo 2 card, and it was 1998 or even before. I was a hardware maniac, overclocker, BIOS hacker, etc etc ... but now I am a completely different person ... I mean, I don't care at all about hardware, I have switched my IT interests, but I still like to have a quick view of the tech panorama. We don't need that much time. Sites such as Tom's Hardware have always been there, and it is worth having a quick look, especially if we are thinking of buying some new stuff (and it is since the O3D problems that I started to think ... maybe I should change this VAIO ... ).
Being updated is also the key for the previous point: be happy! If we buy something and we don't know that it is old, we'll feel like idiots the second we meet our geek friend who spent less money on the newer model ... won't we?

Define Your Product

This operation is not that difficult to perform. All we need to do is to define a price range, the rest will come easily 'cause in the same range we often have similar products, so we don't need to decide between a dishwasher and a lighter, we need to decide between similar products, considering different variables.
I have used one of the first Samsung successes in the netbook market, the NC10, a truly exemplary piece of art at the time of its debut. I was stunned by its compactness and speed ... it was performing better than my VAIO and it was half the size with 8 times more battery life ... are you kidding me? That was the day I decided my next "portable friend" would be a netbook. I don't get why people would like to spend more to use 1/16 of their hardware, and I don't use heavy software at all ...
That NC10 was amazing, but a few months later ASUS came out with another piece of art: the Eee PC Seashell 1101HA, a masterpiece of design and features.
I was going straight to buy the latter toy but something blocked me: the Tottenham Court Road escalator had been completely covered with the new Samsung X-Series advertisement ... an artistic way to do ads, and I was trapped by that ad like an idiot. The models were cool and tiny, and I thought those were new netbooks from Samsung ... and I was wrong. As soon as I found a Samsung shop I went in to ask about these new netbooks and they corrected me saying: Sir, these are powerful laptops for these prices ... OK, reasonable: a powerful laptop for 499 pounds, elegant and rich in features, instantly became part of my new range of products.
But it is only thanks to that ad that I could spot the N510 ... and that shiny nVidia logo: ion
All I knew was that nVidia and Intel had had a few arguments about this ion solution ... and you know what I did? The simplest thing to do: I said thanks, I went outside the shop, and I went back to the chapter: Be Updated
Surfing with my Android I discovered that nVidia had finally arrived in netbooks and laptops, and that this ion promises 10 times faster performance and a full HD stream decoder. The price is, obviously, that if we use ion 100% of the time the battery life will be shorter, but this N510 has a maximum declared battery life of 7 hours! On average, the battery life should be about 3 or 4 hours even if I am watching a movie or surfing the latest port of Compiz to WebGL (AFAIK it does not exist yet).
The price: less than the latest new entry, the X-Series laptop, and a bit more than the Seashell, but with an Atom N280 CPU, slightly faster than the Z version present in the Seashell, and the newest nVidia technology inside, theoretically able to run new Operating Systems as well (Windows 7 - Kubuntu or Ubuntu) ... Deal!

Conclusion

Christmas is close, and I thought that some guidelines about shopping could be nice. The IT sector is extremely complex, and we can rarely find the best product because every product seems to miss something. The Seashell had a slower CPU for no reason and no hardware acceleration, but it has the best battery life and a stunning design, plus it's ASUS, a company that to me is historically famous for its speed prone motherboards and always in the performance field. The X-Series was new and cool but it lacked the Seashell's dimensions and battery life, and it does not come with nVidia ion. Intel graphics chips are truly good, but none of those netbooks has the power ion has.
The Samsung N510 has everything to be truly used as a daily on the go device. The keyboard is good, the hardware is excellent, the price reasonable (I spent 390 pounds) and it is future proof thanks to this pioneering match between the Atom N280 and nVidia ion. Any side effects? The OS is XP and not Seven as shown on the website, and I spent a long time removing unnecessary crapware, but now it boots fast and it works like a charm. I think Seven could run without problems by adding 1 Gb of RAM, and there are already specific drivers and software on the product page. I would give this Samsung N510 5 stars out of 5 'cause I am convinced this is the best netbook you can find on the market right now: demanding nothing but giving all you expect from a netbook and even more, good stuff!

Inline Downloads with Data URLs

A quick post about another silly idea ...
With Data URLs we can incorporate images in layout or CSS.
The schema is really simple:


data:[<mediatype>][;base64],<data>

Since we need to specify a mediatype we could play around creating something unexpected ;-)
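As a minimal sketch of the schema, a data URL can be assembled by hand. toDataURL is a hypothetical name, the "text/plain" default is an assumption of this sketch, and the Buffer branch is only a fallback for environments without the browser's btoa function. Note that btoa only accepts Latin-1 text.

```javascript
// Hypothetical helper: build a base64 data URL for Latin-1 text.
function toDataURL(text, mediatype) {
    var base64 = typeof btoa === "function" ?
        btoa(text) :
        Buffer.from(text, "binary").toString("base64");
    return "data:" + (mediatype || "text/plain") + ";base64," + base64;
}

toDataURL("Hello World :-)", "application/octet-stream");
// "data:application/octet-stream;base64,SGVsbG8gV29ybGQgOi0p"
```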

My Silly Idea

If we select something in a web page we can perform different actions via right click. So far so good ... but one thing we are missing, at least in Firefox, is a "Save As" option. If we want to bring a piece of code, text, or something else into another piece of software or an editor, we need to select, right click, copy, open or find the editor, right click, paste.
Ultra skilled developers do fine with ctrl+c and ctrl+v, but there are still 3 operations to do: copy, find the destination, paste
What about making it possible to simply save that part and go on reading or surfing, so as not to interrupt our reading too much, and review that piece of text or code later?

Firefox Inline Download


function download(text){
    // Yet Another Silly WebReflection Experiment
    var iframe = document.createElement("iframe");
    iframe.src = "data:application/octet-stream;base64," + btoa(text);
    iframe.style.position = "absolute";
    iframe.style.top = "-10000px";
    document.documentElement.insertBefore(
        iframe,
        document.documentElement.firstChild
    );
};

download("Hello World :-)");

If we have configured Firefox to ask us where to save files, we can even choose the name. Since the inline data protocol is that simple, unfortunately I could not find a way to name the file from code. The concept in any case is simple: we could create a bookmark or a link able to save the selected text, if any.
In this case we select, and with a click we can directly organize the content in a named file, or we can open it with the editor, which will probably be the first option in the Open With question.

Side Effects

Well, the first one is that I am not even sure whether this could be considered a security problem, and I am testing in Firefox 3.6 beta 2 (so I am not even sure this is possible with other versions).
We cannot remove the iframe until the user has saved the content, and never before the save dialog is shown, otherwise a generic onload will remove the content before Firefox can understand what to do.
On the other hand, since the content will be inline, the "Open With" should always work 'cause Firefox, which is a clever browser, saves the file and eventually removes it, even if we cancel the operation.

Conclusion

Nobody will probably ever use the above snippet, but I thought it was an interesting and new way to use a technique meant for totally different purposes (OK, I have to admit the only excuse I have is my new netbook and I had to test some code after Notepad++ installation :D)

Thursday, November 12, 2009

How To Map Your Code

Most of the tools we use on a daily basis to develop applications, with any kind of programming language, need to understand our code in order to help us while we are writing, reading, or creating code.
We open an editor, we read highlighted code, we like suggestions when we write a dot and we use compilers, minifiers, compressors, obfuscators, converters, documentation generators ... but have we ever thought about the program behind our own program? Have we ever thought that highlights, suggestions, minifications, and syntax analyzers do not come for free, and that these are the most primordial programs we have ever used, often without even realizing we are using them?
This post is about general suggestions, techniques, and practices, used to map code. In this specific case, we will analyze step by step the logic behind a code mapper using the most used programming language in the world: JavaScript.

Map VS Tokens

First of all we need to understand what we need. A Map could be considered a generic list of coordinates able to tell us "what is where" while a tokenizer is usually an extremely detailed map with extra info, where everything is deeply analyzed and hopefully never casual, as a map could be.
Every programming language has an interpreter, and usually an interpreter is able to analyze the syntax and create tokens to use at runtime or to compile the language into another one (byte/machine code). As a metaphorical example, the world is an open source application and we, nature and animals included, are tokens, so everything is part of a global application.
So far, services such as Google Maps are applications able to map the world. Google Maps does not (yet) care about our role in the system, it simply tells us what/who is where. Silly metaphors apart, since global token analysis is much harder to perform, this post will talk about the simple way: the generic Map. If interested, and again via JavaScript, we could have a look at Narcissus, a good old Mozilla project able to analyze JavaScript via JavaScript itself.

Time To Start Thinking About The Problem

OK, everything seems so easy for human eyes ... we take a string, we recognize what is where, and that's it! Easy! Especially with a highlighter able to make code reading a pleasure, isn't it?
Well, we'll see that things are not as simple as we think, and that is why historical projects such as Scintilla are still under development.

What We Are Interested About

As I have said, JavaScript will be the reference language, but problems and techniques are similar for every other language.
The first requirement is to understand what we are looking for. In JavaScript's case, we can split the language into these main cases:

strings, we cannot touch them!
literal regular expressions, untouchable as well
comments, we may consider to get rid of them
E4X, damn cool XML as is, again untouchable
everything else, considered code

We could split the above categories into sub categories:

single quoted strings
double quoted strings
single line comments
multiline comments

So far, we have just decided what will be our map about, and nothing else. Now it's time to think how to implement this map.

The Basic Problem: Who Comes First

Somebody could think: dude, what's the fuss man, just search via RegExp and that's it ... and no, it's not.
This is a classic case where everything could go wrong, including your favorite JavaScript editor, and yet it is perfectly valid syntax:


// ... code

var theWeirdCase = /"['//*"]'/;

// other code ...

Uhm, check that out: a regular expression with a string "['//*" inside, plus another intersected string '//*"]', plus a single line comment //, plus the beginning of a multiline one /* ... well done! Surrounded by code, that case, reproducible with every combination of a string with a regexp inside, a comment with a string inside, etc etc, could screw up all our parsing plans, couldn't it?
So rule number one: understand which part of the map is more relevant, where in this case relevance is defined by who comes first.
If we have whatever inside a string, whatever IS inside a string.
If we have whatever inside a literal RegExp, whatever IS inside a regexp.
The same goes for E4X or comments, single or multi line: whatever is inside will be part of that comment. Does anybody agree?
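To see why ordering matters, here is a minimal sketch that runs naive, independent searches over the weird case above. The `source` variable and the simplistic double quoted string pattern are assumptions made only for illustration, they are not part of JSMap:

```javascript
// The weird case stored as plain text (hypothetical `source` variable).
var source = "var theWeirdCase = /\"['//*\"]'/;";

// Naive, independent searches: each one ignores what surrounds its match.
var regexpStart  = source.indexOf("/");        // where the literal RegExp really starts
var commentStart = source.indexOf("//");       // a "comment" found inside the RegExp
var naiveString  = source.match(/"[^"]*"/)[0]; // a "string" found inside the RegExp

// Both false positives start after the RegExp does, so the RegExp comes
// first and everything up to its closing slash belongs to it.
var regexpComesFirst = regexpStart < commentStart &&
                       regexpStart < source.indexOf(naiveString);
```

Each search alone reports something that looks legitimate; only by comparing start positions can we tell that the comment and the string are part of the regexp.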

Char By Char VS Regular Expressions

OK, here we are, let's define the best strategy to analyze code, right? Char By Char could be the easy way to analyze code: it's simple to implement, we feel like we have control over every single char, we can spot who comes first without problems or errors, can't we?
Regular Expressions are simply abstract and clever char by char parsers. We delegate the tedious code to the RegExp engine, whose aim is to understand the expression and give us the match, if any. Regular Expressions are somehow similar, conceptually speaking, to SQL: we type Mayan and Egyptian characters and magically we obtain the result, without caring at all about lower level layers.
I can still hear the dude's voice saying:
mate, I am not a noob, if RegExp are char by char parsers, my char by char parser will be faster for sure.
This time the dude is not completely wrong but, as usual, it depends.
If we are programming with C, C++, or heyholetsGo!, without considering programming languages better suited for this purpose such as Caml or OCaml, Regular Expressions will require a library with so much generic purpose overhead that maybe we can do better and faster for the specific case.
On the other hand, all we know about programming is that less is better, and if we can trust extremely well tested and famous regular expression engines (PCRE), why on earth should we waste time writing a parser for a specific case whose aim will be the same as a RegExp's?
And why on earth do we think our check/test/verify implementation will be more stable than a short, quick, and dirty RegExp?

JavaScript Is Not As Fast As C Is

Thanks to competition and new strategies to manage dynamic JS code, performance is often better than even PHP, Ruby, or Python (not Iron or C-), but JS cannot compete with lower level languages compiled directly into machine code and without a dynamic nature.
Rather than spend 3 hours explaining why char by char is not always worth it with JavaScript, here I am with a test anybody can try, with every browser.


// commented to avoid problems with the highlighter used
// in this blog, an old char by char parser ;-)

// PLEASE REMOVE THE NEXT COMMENT CHARS "//" TO TEST

// var theWeirdCase = /"['//*"]'/;
var aLongString = new Array(150001).join(".") + theWeirdCase;
var i = 0;
var skip = false;
var time = [new Date];
var position = [(function(){
    var length = aLongString.length;
    var mlc = aLongString.indexOf("/*");
    var slc = aLongString.indexOf("//");
    var sqt = aLongString.indexOf("'");
    var dqt = aLongString.indexOf('"');
    var rex = aLongString.indexOf("/");
    if(mlc < 0)mlc = length;
    if(slc < 0)slc = length;
    if(sqt < 0)sqt = length;
    if(dqt < 0)dqt = length;
    if(rex < 0)rex = length;
    var position = Math.min(mlc, slc, sqt, dqt, rex);
    if(position === length)position = -1;
    return position;
})()];
time[i] = new Date - time[i];
time[++i] = new Date;
position[i] = (function(){
    for(var
        position = -1, c = 0, length = aLongString.length;
        c < length; ++c
    ){
        switch(aLongString.charAt(c)){
            case "'":position=c;c=length;break;
            case '"':position=c;c=length;break;
            case "/":
                if(c < length - 1){
                    switch(aLongString.charAt(c + 1)){
                        case "*":position=c;c=length;break;
                        case "/":position=c;c=length;break;
                        default:position=c;c=length;break;
                    }
                }
                break;
        };
    };
    return position;
})();
time[i] = new Date - time[i];
try{aLongString[0]}catch(e){skip=true};
if(!skip && aLongString[0] === "."){
    time[++i] = new Date;
    position[i] = (function(){
        for(var
            position = -1, c = 0, length = aLongString.length;
            c < length; ++c
        ){
            switch(aLongString[c]){
                case "'":position=c;c=length;break;
                case '"':position=c;c=length;break;
                case "/":
                    if(c < length - 1){
                        switch(aLongString.charAt(c + 1)){
                            case "*":position=c;c=length;break;
                            case "/":position=c;c=length;break;
                            default:position=c;c=length;break;
                        }
                    }
                    break;
            };
        };
        return position;
    })();
    time[i] = new Date - time[i];
};
time[++i] = new Date;
position[i] = aLongString.search(/\/\*|\/\/|'|"|\//);
time[i] = new Date - time[i];

alert([
    time.join("\n"),
    position.join("\n")
].join("\n"));

The above benchmark tries to replicate common techniques to find a single piece of the full map, in this case over about 150Kb of fake JavaScript code.

How To Read The Benchmark

dude ... the bench confirms that ... shut up! The bench shows that Regular Expressions are the fastest way to parse code via JavaScript. It does not matter if the RegExp score is not the fastest one in the above case; what matters is that with a single Regular Expression we can grab 0, 1, or every match in the code.
If we consider that, over 150Kb of code to highlight, the shown time will increase for each mapped part of that code, while the regular expression could be just one, we can easily see that:

the RegExp is able to find and validate at the same time what we are looking for, while the above benchmark misses all the manual validations we need to do to understand if the case, first found char apart, is exactly the one we were looking for
with a single regular expression we save a large number of characters and the code is easier to maintain
for each manually parsed char we need to perform a manual check for the exact case, and the total amount of manual checks increases with the number of chars and the code complexity
with a single regexp we don't care that much about code size because the engine, hopefully written in C, should be fast enough to parse a large amount of data

We Still Need Good Regular Expressions

This is the most critical part: if we use bad regular expressions we could have a lot of false positives and we could mess up the map. The reason Regular Expressions are not always well considered is that they can be hard to read, write, or understand, and people could put "chars around" without improving the regexp at all, adding more false positives instead. I do not pretend to give you the best regular expressions ever for all cases, but I have been using RegExps since 2000 and hopefully I know a cheeky bit about them.


JSMap.parser = [
    // WebReflection Suggestion
    {
        test:/\/\/[^\1]*?(\r\n|\r|\n)/g,
        type:JSMap.COMMENT_SL,
        place:function(value, a){
            return value.charAt(2) === "@" ? value : a;
        }
    },{
        test:/\/\*[^\1]*?(\*\/)/g,
        type:JSMap.COMMENT_ML,
        place:function(value){
            return value.charAt(value.length - 3) === "@" ? value : JSMap.parser[0].place(value, "");
        }
    },{
        test:/(["'])(?:(?=(\\?))\2.)*?\1/g,
        type:JSMap.STRING
    },{
        test:/\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g,
        type:JSMap.REGEXP
    },{
        // Note: experimental, not fully tested/supported
        test:/<>[^\1]*?(<\/>)|<(\w+|\{\w+\})(?:\s*\/|[^\>]*?>.*?<\/\2\s*)>/g,
        type:JSMap.E4X
    }
];

The above collection contains all the kinds of things we would like to map for JavaScript.
Some objects contain a place method whose aim is to avoid, if necessary, the match replacement. In this case I have considered Internet Explorer conditional comments and nothing else, but there are other cases where a comment should not be removed (e.g. /*! my license */ )
Each object contains a map type, 'cause we would like to know what we have found there, wouldn't we?


JSMap.CODE      = 1;
JSMap.COMMENT_SL= 2;
JSMap.COMMENT_ML= 4;
JSMap.STRING    = 8;
JSMap.REGEXP    = 16;
JSMap.E4X       = 32;
JSMap.ALL       = 63;
JSMap.DEFAULT   = ~JSMap.E4X;

Since we are creating a customizable map, we would like to choose what we want to find and what we do not. We can decide using some bitwise operations.
As we can see, the default one excludes the E4X case: it's not common, since it has not been implemented in every browser yet, plus its regular expression, for this post and example, is not perfect.
To exclude something all we need to do is to use the ~ char, while if we want to pick just a few things we can always use the or | :


JSMap.COMMENT_SL | JSMap.COMMENT_ML

The above example will look for single and multi line comments. Maybe we just want to understand if there is some conditional comment?
In any case, bear in mind that who comes first is still the main problem, so if we restrict the search we could find false positives (e.g. comments inside strings or regexps)
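As a quick illustration of how these masks combine, here is a small stand-alone sketch. The `Flags` object and the `isRequested` helper are hypothetical names introduced only for this example; they mirror the JSMap constants and the `parser.type & type` check rather than reuse them:

```javascript
// Hypothetical stand-alone copy of the JSMap constants shown above.
var Flags = {
    CODE: 1, COMMENT_SL: 2, COMMENT_ML: 4,
    STRING: 8, REGEXP: 16, E4X: 32, ALL: 63
};
Flags.DEFAULT = ~Flags.E4X; // every bit set except the E4X one

// A parser is used only when its type bit is present in the requested mask.
function isRequested(parserType, mask) {
    return (parserType & mask) !== 0;
}

var onlyComments = Flags.COMMENT_SL | Flags.COMMENT_ML; // 2 | 4 === 6
var noRegExp = Flags.ALL & ~Flags.REGEXP;               // 63 without bit 16 === 47
```

With these values `isRequested(Flags.E4X, Flags.DEFAULT)` is false, while `isRequested(Flags.STRING, Flags.DEFAULT)` is true.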

How To Logically Proceed With The Mapper

OK, we have already reduced code size by 1/3rd thanks to good regular expressions. Now what's next?
The problem is still the same: Who Comes First!
For each performed Regular Expression we should store results somewhere. In this case we will have 4 searches performed via RegExp over the entire code, rather than a check for each possibly matched character, but false positives could always be there.


function JSMap(CODE, type){
    if(!type)
        type = JSMap.DEFAULT
    ;
    for(var
        Map = [],
        length = JSMap.parser.length, i = 0,
        a, b, exec, parser, value;
        i < length; ++i
    ){
        parser = JSMap.parser[i];
        if(parser.type & type){
            while(exec = parser.test.exec(CODE)){
                value = exec[0];
                Map.push({start:exec.index, end:exec.index + value.length, value:value, type:parser.type});
            };
        };
    };
    // ... the rest of the code

OK, looping over the list of RegExps we have collected every match.
Each match has been stored as an object whose properties are:

where the match starts; start and end info can be reused for other purposes as well
where the match ends
the match itself, which could be reused alone or simply replaced in the current CODE
the match type, which tells what we have found

Start and end are valid for every match and compatible with substring.
If for some reason we change a single value and its length changes, we can easily synchronize the Map by adding the difference between the new length and the old one to every other mapped object after the current one.
These properties could be superfluous but we want control for any kind of occasion.
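The synchronization just described could look like the following sketch. `replaceValue` is a hypothetical helper, not part of the JSMap source, and the sample map is invented for the example:

```javascript
// Replace the value of one mapped entry and shift every following entry
// by the difference between the new length and the old one.
function replaceValue(map, index, newValue) {
    var entry = map[index];
    var delta = newValue.length - entry.value.length;
    entry.value = newValue;
    entry.end += delta;
    for (var i = index + 1; i < map.length; ++i) {
        map[i].start += delta;
        map[i].end += delta;
    }
    return delta;
}

// Invented sample: code, a single line comment, then code again.
var sampleMap = [
    {start: 0,  end: 10, value: "var a = 1;", type: 1},
    {start: 10, end: 20, value: "// comment", type: 2},
    {start: 20, end: 30, value: "var b = 2;", type: 1}
];

// Strip the comment: the last entry now starts where the comment started.
var delta = replaceValue(sampleMap, 1, "");
```

After the call, joining every `value` gives the new source and each `start`/`end` pair is still valid for substring over it.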

An Ordered Map Without False Positives Collisions

The simplest way to reorder the map is a native Array.sort operation, perfect to understand who comes first!


Map.sort(function(a,b){
    // ascending by start, 0 when two matches share the same start
    return a.start - b.start
});

The start point is the sort key and thanks to Regular Expressions we will rarely have the same start point twice. If we have one, we need to rethink the RegExp because it is not good enough since, for example, a comment cannot be a regexp and vice versa.
The native sort operation is hopefully fast enough to still guarantee better performance. All we need to do now is to add, if necessary, the code.

Cleaning And Adding Surrounding Code

Once we have mapped all we were looking for, we can consider CODE everything before, in between, or after our matches.


// type could NOT include CODE
// and if code starts with a comment, there
// is nothing to do ...
if(type && 0 < (i = Map[0].start))
    Map.unshift({start:0, end:i, value:CODE.substring(0, i), type:JSMap.CODE})
;
// every other Mapped case will be ordered by start
// Accordingly, the first start we will encounter
// will be the valid one ... the first match
// will be considered valid by default (var a)
for(length = Map.length, i = 1, a = Map[0]; i < length; ++i){
    b = Map[i];
    switch(true){
        // if there is a gap between the
        // precedent a match and b one
        // there MUST be code between
        case a.end < b.start:
            // let's add it if we need
            // continue otherwise
            if(type){
                Map.splice(i, 0, {start:a.end, end:b.start, value:CODE.substring(a.end, b.start), type:JSMap.CODE});
                ++length;
                ++i;
            };
        // if there is no gap or code
        // has been inserted
        // we can go on with the for loop
        // considering the current b
        // valid, assigned for this reason to a
        case a.end === b.start:
            a = b;
            break;
        // otherwise the next match starts
        // before a.end, inside the previous
        // match, so it overlaps it:
        // we have a false positive
        default:
            // remove this match in any case
            Map.splice(i, 1);
            --length;
            --i;
            break;
    };
};
// if the last match ends before the string.length
// there must be another piece of code to add
if(type && (i = Map[i - 1].end) < CODE.length)
    Map.push({start:i, end:CODE.length, value:CODE.substring(i, CODE.length), type:JSMap.CODE})
;

Finally, if there is no match but JSMap.CODE is part of the search, we can consider the entire string a piece of code:


Map.push({start:0, end:CODE.length, value:CODE, type:JSMap.CODE})
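The collision rule can be isolated in a few lines. The following is a simplified sketch rather than the JSMap source: `dropOverlaps` is a hypothetical helper, the string type names replace the numeric constants for readability, and the sample offsets are only illustrative of what the weird case could produce:

```javascript
// Keep a match only if it starts at or after the end of the last kept one.
// Whatever starts inside a previous match is a false positive.
function dropOverlaps(matches) {
    var sorted = matches.slice().sort(function (a, b) {
        return a.start - b.start;
    });
    var result = [], last = null;
    for (var i = 0; i < sorted.length; ++i) {
        if (last === null || last.end <= sorted[i].start) {
            last = sorted[i];
            result.push(last);
        }
    }
    return result;
}

// Illustrative matches, deliberately unordered as separate searches find them.
var rawMatches = [
    {start: 23, end: 31, type: "comment"}, // the "//" inside the regexp
    {start: 20, end: 27, type: "string"},  // the "string" inside the regexp
    {start: 19, end: 30, type: "regexp"},  // the real literal RegExp
    {start: 40, end: 45, type: "string"}   // a real string, later in the code
];
var cleanMatches = dropOverlaps(rawMatches);
```

Only the regexp and the later string survive, which is the same outcome the switch above reaches with its default case.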

How To Use A Map

Almost the end of this tedious post. Once we have created a map of our code, we can reuse matches in any way we prefer.
This is a "paste and go" example that will let us test whatever code we want:


/**
 * JSMap test case
 */
onload = function(){
  document.body.appendChild(
    document.createElement("textarea")
  ).onchange=function(){
    var time    = new Date,
        Map     = JSMap(this.value/*, JSMap.CODE|JSMap.COMMENT_ML|JSMap.COMMENT_SL|JSMap.REGEXP*/)
    ;
    time = new Date - time;
    if(!Map.map)Map.map=function(fn){for(var i = 0, length = this.length; i < length; ++i)this[i]=fn(this[i]);return this};
    document.body.innerHTML = 'Mapped in ' + (time/1000) + ' seconds.<pre>' + Map.map(function(o){
        var value = o.value.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
        switch(o.type){
            case JSMap.STRING:
                value = '<span style="color:#66F">' + value + '</span>';
                break;
            case JSMap.COMMENT_SL:
                value = '<span style="color:#999">' + value + '</span>';
                break;
            case JSMap.COMMENT_ML:
                value = '<span style="color:#F66">' + value + '</span>';
                break;
            case JSMap.REGEXP:
                value = '<span style="color:#F00">' + value + '</span>';
                break;
            case JSMap.E4X:
                value = '<span style="color:#00F">' + value + '</span>';
                break;
            default:
                // eventually we can parse
                // the code with numbers
                // spaces, keywords, methods
                // and whatever we need
                // that would be another Map
                // specific for the language
                // or simply a replacer via RegExp
                break;
        };
        return value;
    }).join("") + '</pre>';
  };
};

Save it in an HTML page, paste some valid code into the textarea, disabling the spell check if the source is massive, and click somewhere outside the area.

Conclusion

Even if the case is JavaScript and a JavaScript map, with this post we can hopefully better understand the problems behind generic code parsing.
What is missing in this post is a code parser able to highlight or understand every piece of code present in the map.
If we loop over the returned map it is possible to understand which one is code and what is next. If a variable is going to be assigned a regexp or a string, for example, the value will be found in the next object present in the map.
I did not consider numbers 'cause these are simple to parse in a code portion, while they could slow down every other operation since they could be easily spotted inside strings, comments, regexps, or E4X syntax.
Considerations, techniques, and benchmarks are relative to this case but generally valid for any kind of purpose. CSS selectors, HTML, and other cases need to consider when Regular Expressions are worth it (e.g. when a char by char parser is not that fast due to the programming language used) and when a simple indexOf could be the best solution ever. I hope you enjoyed this post, I surely did writing it since the topic is not that frequent and techniques are extremely different (and trust me, I have shown at most 30% of what we could find out there).
Oooops, I almost forgot the JSMap source code!

Wednesday, November 11, 2009

Literal Regular Expression Safe Regular Expression

... sorry for the redundant title but that's exactly what this post is about ... after yesterday's explanation about the problem, the logic, and the solution to grab valid strings inside JS code, here I am with the literal RegExp able to grab literal RegExps in generic JavaScript code.

Why Not Just Add A "/" To The Strings RegExp

One comment gave me the hint to write this second post about RegExps. Even though spare time is scarce these days, the answer is simple, but not obvious!
Differences between strings and literal regular expressions are basically these:

there must be at least one char, or the parser will consider the literal RegExp an inline comment //
the slash does NOT necessarily need to be escaped. If we have a slash inside a range such as [a/b], it won't break the RegExp: the slash will be considered just one valid char in that range
there could be one or more chars after the closing slash, where i (ignore case), g (match all), and m (multi line) can be present one or more times

The latter point is not truly a problem since the same syntax after a string will break the code in any case:


function igm();
var a = "string"igm();

But still, we need to understand the first couple of points.

The RegExp Safe Regular Expression


// WebReflection Solution
/\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g

Since yesterday somebody pointed me to another solution after 10 seconds, I bet this will happen again, but so far I have tested the above little monster enough to say that it should work without problems. Obviously this is true only if the code is valid, otherwise we don't even need to waste our time trying to parse it.
As an example, yesterday somebody told me: look, it does not work with this


a = \"string"

Well, now consider that an escaped char could be everywhere in the code but again, these regular expressions are not code sanitizers, and in any case such code is improbable since:


// tell me what do you expect and WHY!
a = "string\"
b = \"other"
\"
" c = what?!"

So any kind of weird combination won't work, but if the regular expression is valid, escaped or not escaped, the preceding solution should work like a charm.

Explanation

I won't go step by step through the entire RegExp this time: things are the same as described in my previous post, so please read there if you want to know more. The emulated look-behind pattern has been included in this regexp to skip groups of possible ranges present in the regexp. When a range is encountered, starting with the char "[", it is skipped till its end. If there is no end, theoretically the literal RegExp is broken and the code won't execute. The same strategy is used for the other case, where no [ is encountered: if there is a char followed by a slash, we go on as described in the other post. In this way we should be sure that, whatever happens, we'll find the end of the RegExp, flag chars included. I did not spend too much time ensuring consistency for these flags since "/whatever/ii" will be part of inconsistent code, which is a syntax parser problem, and not mine.

Test Cases


//comment <-- should not be matched at all
var a = /a/;
var b = /\//i;
var c = /[/]/;

I bet there are hundreds of RegExps or minifiers out there that fail with the last one, since even different editors have problems trying to understand what is going on.
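These cases can also be checked without a browser. The snippet below is only a sketch: it applies the pattern from this post to the test cases, stored as text in a hypothetical `testSource` variable, and collects every full match:

```javascript
// The literal RegExp pattern from this post.
var reLiteral = /\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g;

// The test cases above, one per line, as plain text
// (backslashes are doubled because this is a JavaScript string).
var testSource = [
    "//comment <-- should not be matched at all",
    "var a = /a/;",
    "var b = /\\//i;",
    "var c = /[/]/;"
].join("\n");

// String.prototype.match with a global RegExp returns every full match.
var found = testSource.match(reLiteral);
// found holds the three literals: /a/ then /\//i then /[/]/
```

The comment line produces no match here because it contains no further slash and the dot does not cross the line break.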

The Test Case

Same code I posted yesterday, except the alert will show all arguments. I know I have used a replace without using its result, which is a bad practice, but that was good enough for test purposes:


onload = function(){
  document.body.appendChild(
    document.createElement("textarea")
  ).onchange=function(){
    this.value.replace(
      // WebReflection Solution Test
      /\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g,
      function(){
        alert([].slice.call(arguments).join("\n"));
      }
    );
  };
};

Please let me know if you find a better solution or whatever gotcha via the test case, considering that arguments[0] should be exactly the matched RegExp, thanks.

P.S. about the inline comment, it's not worth it to avoid that case for two reasons: we can always test that match.charAt(1) !== "/" plus the problem is still: who comes first? If we have a string inside a regexp or vice-versa there is no way to exclude these cases in a single, reasonable, RegExp. As I have said, as soon as I'll find some time, I will explain how to create a proper function able to manage all JavaScript cases, stay tuned!
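The charAt check mentioned in the P.S. can be sketched as follows. The sample text and the variable names are assumptions for the example; the point is only that a comment containing a slash is matched by the pattern and then discarded by the check:

```javascript
// Same literal RegExp pattern as above.
var literalRe = /\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g;

// Hypothetical sample: a comment containing a slash, then a real RegExp.
var sampleCode = "// see a/b for details\nvar r = /x/g;";

// Every candidate found by the pattern, false positives included.
var candidates = sampleCode.match(literalRe);

// A match whose second char is a slash is an inline comment, not a RegExp.
var literals = [];
for (var i = 0; i < candidates.length; ++i) {
    if (candidates[i].charAt(1) !== "/") {
        literals.push(candidates[i]);
    }
}
```

Here the candidates are `// see a/` and `/x/g`, and only the second one passes the check. This does not solve the who comes first problem, it only removes one kind of false positive.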

Tuesday, November 10, 2009

String Escape Safe Regular Expression

I should have probably investigated more but apparently I did it ... the most problematic case I've encountered so far with JavaScript RegExp seems to be solved!

Update

Indeed, I should have investigated ... I just like to find solutions on my own. I am not surprised somebody had already investigated this classic parsing problem.
Steve talked about it a year ago, working around the missing lookbehind feature I talk about in this post.
His post has much more details than mine (and many more edits as well).
The good part I am happy about is that both Steve and I came out with basically the same solution, but his one is definitely more compact:


// Steven Levithan's compact solution
/(["'])(?:(?=(\\?))\2.)*?\1/g

The assumption of the above regexp is that if there is an escape char followed by another char, the latter cannot be taken as the initial single or double quote, since the closing quote is outside the second, uncaptured, part and comes after a non greedy operation.
If the second condition, \2, matches nothing, the dot "." will pass the current char, no escape found, performing the char by char parsing I have described in my solution.
The dot is my [^\\], while the double escape is represented by "\2.", as is the escape plus whatever else that is not the end of the string, the equivalent of my [\\(?=\1)]\1
I don't want to edit this post lots of times, and I'll leave it as is to let you understand the problem, the logic, and the solution.
The only thing I would like to check is performance, since my less compact solution should be theoretically faster for common strings where the escape char is not present. Steve's one will try to look for the escape, will assign the possibly missing match, and will pass whatever char comes after, if any, considering there is a "break" outside, and all these operations are repeated for whatever length, in what is still a char by char operation.
Whatever the result, we know we have at least two alternatives, and both mine and Steve's should be cross browser.
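A quick way to compare the two alternatives is to run both over the same text. The sketch below uses the compact pattern quoted above and my own solution, the one explained later in this post, over a hypothetical `sample` with escaped quotes and escaped backslashes. It checks only that the full matches agree, it says nothing about speed:

```javascript
// Both patterns exactly as quoted in this post.
var compactRe = /(["'])(?:(?=(\\?))\2.)*?\1/g;
var verboseRe = /(['"])((?:[^\\]|\\{2}|[\\(?=\1)]\1|[\\(?!\1)])*?)\1/g;

// Hypothetical sample: two statements whose strings contain escaped quotes
// and escaped backslashes (written as JavaScript strings, so every
// backslash of the sample is doubled here).
var sample = [
    'var code1 = "this is some \\"test\\"\\\\";',
    'var code2 = "and this is \\"anot\\\\her\\" one!";'
].join("\n");

var compactMatches = sample.match(compactRe);
var verboseMatches = sample.match(verboseRe);

// true when both patterns grabbed exactly the same strings
var sameResult = compactMatches.join("\n") === verboseMatches.join("\n");
```

On this sample both patterns return the two complete strings, quotes included.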

A Bit Of History

In all these years of programming with different languages, I have created dunno how many code parsers. WebReflection itself is using one of these parsers to highlight my sources. My good old PHP Comments Remover (2005 though ...) used another code parser. The MyMin project used another one as well ... in a few words, in my programming history I don't know how many times I have had to deal with sources. The strategy I have always adopted, especially for JavaScript, is the char by char parser. The reason is simple: I have never created or found a good regular expression able to treat this case:


var code1 = "this is some \"test\"\\";
var code2 = "and this is \"anot\\her\" one!";

The above code, managed as a string, will become a string like:
"var code1 = \"this is some \\\"test\\\"\\\\";
var code2 = \"and this is \\\"anot\\her\\\" one!\";"
And if you know Regular Expressions, you know why this case is not that simple to manage, don't you?
Well, right now I was forking a project with a massive usage of Regular Expressions for CSS selectors and I could not avoid noticing the classic wrong match to manage strings:


/['"]([^'"]*?)['"]/g

The above match is almost nonsense. If we have a string such as "told'ya!" that RegExp will match "told', leaving ya!" out of the game. To make it a bit better the classic procedure is this one:


/(['"])([^\1]*?)\1/g

With the above RegExp we are looking for a single or double quote char and we are searching for the next one, being sure that if the first match is a single quote the string will finish with a single one, and vice versa. There is still a problem: if we have the first matched quote and an escaped one of the same kind in the middle of the string, that regular expression will truncate the string at the escaped one, giving us an untrustworthy result.
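Both behaviors are easy to reproduce. This is a small sketch with two hypothetical samples. One detail worth knowing: in the engines I am aware of, \1 inside square brackets is not treated as a backreference, so [^\1] in practice accepts almost any char and it is only the lazy quantifier plus the final \1 that closes the match:

```javascript
var anyQuoteRe = /['"]([^'"]*?)['"]/g;  // first pattern above
var sameQuoteRe = /(['"])([^\1]*?)\1/g; // second pattern above

var mixed = "\"told'ya!\"";             // the text "told'ya!" with its double quotes
var escaped = "\"say \\\"hi\\\" now\""; // the text "say \"hi\" now"

var mixedAny = mixed.match(anyQuoteRe);       // stops at the single quote
var mixedSame = mixed.match(sameQuoteRe);     // closes on the matching quote
var escapedSame = escaped.match(sameQuoteRe); // cut at the escaped quote
```

The first pattern returns `"told'`, the second one returns the whole `"told'ya!"`, but with the escaped sample the second pattern yields two broken pieces, `"say \"` and `" now"`, instead of one string.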

Why It Is More Difficult Via JavaScript

Regular Expressions in JavaScript miss at least one of the most common features in the PCRE world: the look-behind assertion!
Fortunately, we have helpful backreferences, in some cases able to slow down the match, but often the only or best way we have to create more clever matches!

The String Escape Safe Regular Expression


// WebReflection Solution
/(['"])((?:[^\\]|\\{2}|[\\(?=\1)]\1|[\\(?!\1)])*?)\1/g

I am not sure the above little monster is the best RegExp you can find for this problem with JavaScript features. What I am sure about is that I have done dozens of tests and the results seem to be perfect: Hooray!!!
If you are not familiar with RegExp, please let me try to explain what's going on there:


/
  // look for a single or a double quote char
  // this will be referenced as \1 in the rest of the regexp
  // in order to completely ignore the other one
  (['"])

  // the second match is performed over the string
  // that could be empty, or it could contain
  // any character included the first match, if escaped
  (

    // the second match will be a char by char parser
    // the only character we are worried about
    // is the one able to escape the first match
    (
      ?: // we are not interested about next capture
      // since the only scary char is the escape
      // but it is not necessary present
      // (let's say is less present than any other)
      // speed up the RegExp validating every char
      // but the escape ... these are all good!
      [^\\]
      |
      // if we encounter an escape char and this
      // is escaping itself we can skip 2 chars
      \\{2}
      |
      // alternatively, we could have
      // an escaped match (current one: single or double)
      // in this case we want to be sure that the escape
      // is for the matched char and not just an escape
      [\\(?=\1)]\1
      |
      // we need to validate whatever else has been
      // escaped as well so if the escape char is
      // NOT followed by the initial match or
      // another escape char it's ok
      // and we go on with next char
      [\\(?!\1)]

    // precedent cases should be performed for each
    // encountered char but these cannot be greedy
    // otherwise we risk to wrap the full string
    // var a = "a", b = "b";
    // 'a", b = "b' <-- greedy!
    )*?
  )

  // to make precedent assumptions valid
  // we need to be sure the string terminates
  // with the initial matched char
  \1
/g

That's pretty much it: if we use the match method, replace, or exec, matched[1] or RegExp.$1 will be the char used to encapsulate the string, single or double quote, while matched[2] or RegExp.$2 will contain the string itself.

In Any Case It Is Still Not Perfect

If we consider JavaScript literal regular expressions, the same stuff used to solve the problem, we'll have another problem.


var re = /ooo"yeah/;
var s = "no way";

In the above example there will be some problem since the double quote inside the regular expression will be matched like a charm by my suggestion.
This is the reason we still need char by char parsers but hey ... I was trying to parse some selector and the usage of @test="case", which apparently is not even standard, so bear in mind we cannot use this RegExp unless the code does not contain literal regexps.
What is the trap here? That, char by char apart, it's quite impossible to decide who comes first: "the slash or the quote"?
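The trap can be reproduced with the two lines above. This is just a sketch where the example is stored as text in a hypothetical `trapSource` variable and parsed with my string solution:

```javascript
// The string pattern from this post.
var stringRe = /(['"])((?:[^\\]|\\{2}|[\\(?=\1)]\1|[\\(?!\1)])*?)\1/g;

// The example above as plain text: a literal RegExp containing a double
// quote, followed by a real string.
var trapSource = [
    "var re = /ooo\"yeah/;",
    "var s = \"no way\";"
].join("\n");

// The first quote found is the one inside the RegExp, so the "string"
// runs from there to the opening quote of the real one.
var trapped = trapSource.match(stringRe);
```

The only match starts inside the regexp, crosses the line break, and swallows the opening quote of "no way", so the real string is never reported.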

Quick And Dirty Solution Tester

With this code it should be simple to copy and paste some valid source and read, match after match, what is OK and what is not:


onload = function(){
  document.body.appendChild(
    document.createElement("textarea")
  ).onchange=function(){
    this.value.replace(
      // WebReflection Solution Test
      /(['"])((?:[^\\]|\\{2}|[\\(?=\1)]\1|[\\(?!\1)])*?)\1/g,
      function(){
        alert([arguments[1], arguments[2]].join("\n"));
      }
    );
  };
};

Please share whatever problem you find with this Regular Expression, or suggest a better or faster approach to solve this problem with the same test cases, thanks.

Monday, November 09, 2009

Google Closure ? I'm Not Impressed

We all know that Google is a synonym of performance, simplicity, and again performance. A search engine that uses a truncated body, for markup that is not W3C valid, should be the most byte and performance obsessed in the current web era, shouldn't it?
Well, Google Closure Tools has been a negative surprise, at least this is what I can tell about it after a first quick review.

Closure Compiler

I have been wondering for ages what kind of tool Big G is using to produce their scripts and finally we got the answer: Closure Compiler ... ooooh yeah!
Packer, YUIC, it does not matter: when Google needs something, it creates something. This is almost intrinsic, as developers, in our DNA: do we spot some interesting concept? We rewrite it from scratch pretending we are doing it better!
This is not the case, or better, something could go terribly wrong!


// ==ClosureCompiler==
// @compilation_level ADVANCED_OPTIMIZATIONS
// @output_file_name default.js
// ==/ClosureCompiler==

(function(){
   "use strict";
   this.myLib = this.myLib || {};
}).call(this);

myLib.test = 123;

The produced output:


(function(){this.a=this.a||{}}).call(this);myLib.test=123;

And 2 warnings:


JSC_USELESS_CODE: Suspicious code. Is there a missing '+' on the previous line? at line 2 character 4
"use strict";
^
JSC_USED_GLOBAL_THIS: dangerous use of the global this object at line 3 character 4
this.myLib = this.myLib || {};
^

Excuse Me?
The sophisticated compiler treats the "use strict" ES5 activation statement as useless code, and flags as dangerous the fact that we are passing the global object as the this reference of the closure. It hardly matters: as the produced code shows, the result is a broken library, thanks to its lost name, magically transformed into "a", while the last line still looks for myLib.
Advanced Optimization can fail both to give us the right suggestions and to fix or optimize the code: in this case, for example, this.a won't perform any faster than the original this.myLib.
The Advanced Optimization parameter could also be dangerous for lazily loaded libraries.
We need to be extremely careful with this option; otherwise, rather than a Compiler, we will be dealing with a "code messer" where hard debugging automatically becomes the hardest ever.
Read Carefully This Page if you are planning to use this option, because behind the "best ratio" and "removed dead code" promises we could find massive surprises in the middle of the application.

In summary, SIMPLE_OPTIMIZATIONS is so far the recommended compilation_level directive, but it won't offer a very different ratio compared with the output produced by YUI Compressor or Dean's Packer (NO base62). ADVANCED_OPTIMIZATIONS could be tried on single stand-alone files, hoping these won't break the global namespace via renamed variables.
In this case a JavaScript closure, the real one, is an absolute must!
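A minimal sketch of that "real closure" idea, assuming the compiler's documented rule that quoted property names are never renamed under ADVANCED_OPTIMIZATIONS. The root variable and the Function("return this") trick to reach the global object are my own choices for illustration, not something taken from the Closure documentation:

```javascript
// Everything private lives inside the closure, where the compiler
// is free to rename it. Only the quoted names are meant to survive.
(function (root) {
  // "lib" is a local alias: renaming it cannot break anything outside
  var lib = root["myLib"] || {};

  // quoted property: assumed to be left untouched by the compiler
  lib["test"] = 123;

  // publish the namespace under its real, quoted name
  root["myLib"] = lib;
}(Function("return this")()));
```

With the public surface written as strings, the final myLib.test access from outside code should keep working, while whatever is declared with var inside the closure can still be shortened.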

Closure Library

This is another part, not strictly related to the Compiler but apparently able to work with it. The Closure Library is a wide namespace loaded with core features and a cross-browser User Interface. I have to admit this library is a massive piece of work, but the techniques used to make it happen are often hilarious.
First of all, this is the first time I have read protected variables named with an underscore at the end, rather than as the first character:


// normal code
function Constructor(){
    this._protected = [];
};

// Closure Library Creativity
function Constructor(){
    this.protected_ = [];
};

Why On Earth? Python style apart, where the underscore has a concrete meaning, the technique of using an underscore as the first character has a reason to exist.


function copyOnlyPublic(o){
    var $o = {}, k;
    for(k in o){
        if(k.charAt() !== "_")
            $o[k] = o[k]
        ;
    };
    return $o;
};


var myC = new Constructor;
var $myC = copyOnlyPublic(myC);

No Way! To make the style "creative", the charAt call, whose 0 argument is optional, needs to become:


if(k.charAt(k.length - 1) !== "_")

Is this what we would expect from the performance king? I don't think so.
Is this at least faster to read for human eyes? Not that either!
Gotchas are everywhere in the library ... the most redundant stuff I've ever seen is the array namespace!


goog.array.indexOf = function(arr, obj, opt_fromIndex) {
  if (arr.indexOf) {
    return arr.indexOf(obj, opt_fromIndex);
  }
  if (Array.indexOf) {
    return Array.indexOf(arr, obj, opt_fromIndex);
  }

  var fromIndex = opt_fromIndex == null ?
      0 : (opt_fromIndex < 0 ?
           Math.max(0, arr.length + opt_fromIndex) : opt_fromIndex);
  for (var i = fromIndex; i < arr.length; i++) {
    if (i in arr && arr[i] === obj)
      return i;
  }
  return -1;
};

OMG, I can't believe performance matters only for a missing body tag in the layout ... it cannot be real, can it?
What do we have there? 1 to 2 checks, possibly failing, for each call (IE), and all of it just to emulate the native Array.indexOf, which is present in basically every browser except truly old ones or Internet Explorer?


goog.array.indexOf = Array.indexOf || function(arr, obj, opt_fromIndex) {
  var fromIndex = opt_fromIndex == null ?
      0 : (opt_fromIndex < 0 ?
           Math.max(0, arr.length + opt_fromIndex) : opt_fromIndex);
  for (var i = fromIndex; i < arr.length; i++) {
    if (i in arr && arr[i] === obj)
      return i;
  }
  return -1;
};

Array.indexOf is used as a fallback in case, for some unknown reason (and browser...), an Array has no indexOf but Array.indexOf is present ... well, if we can trust that case, is there any valid reason to create a performance gap like that for almost every method?
forEach, lastIndexOf, every emulated JavaScript 1.6 or greater method contains redundant checks performed on each call ... where is the performance maniac here?
The feeling is that this library has been created by some Java guy, probably extremely skilled with Java, but definitely not that clever with JavaScript programming style. Google, if you need skilled JS developers, there are hundreds of us out there. What I mean is that it does not matter whether a minifier is able to remove dead code, because dead code should not be there at all, should it?


// WebReflection Suggestion
goog.array.clone = (function(slice){
try{slice.call(document.childNodes)}catch(e){
    slice = function(){
        var rv = [];
        if(this instanceof Object)
            // suitable for arguments
            // and every other ArrayLike instance
            return rv.slice.call(this)
        ;
        for (var i = 0, len = this.length; i < len; ++i)
            rv[i] = this[i]
        ;
        return rv;
    };
};
return function(arr){
    return slice.call(arr);
};
})(Array.prototype.slice);

The snippet suggested above is based on Features Detection, a technique apparently discarded completely in the files of this library that I have read.

Features Detection Cons

performed at runtime, a few milliseconds before the library or function is ready to use

Features Detection Pros

performed once, and never again for the entire session
best performance, browser independent and browser focused at the same time
being usually based on the most recent standards, features detection could cost a bit more only for deprecated, obsolete, or truly old browsers
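To make the "performed once" point concrete, here is a minimal sketch of an indexOf helper that decides which implementation to use at definition time rather than on every call. The standalone indexOf name is hypothetical and this is not Closure Library code; the fallback loop simply mirrors the one quoted earlier:

```javascript
var indexOf = (function (proto) {
  // the check happens here, once, when the helper is defined
  if (typeof proto.indexOf === "function") {
    var nativeIndexOf = proto.indexOf;
    // native path: no further checks per call
    return function (arr, obj, fromIndex) {
      return nativeIndexOf.call(arr, obj, fromIndex);
    };
  }
  // fallback for engines without Array.prototype.indexOf
  return function (arr, obj, fromIndex) {
    var len = arr.length,
        i = fromIndex == null ? 0 :
            (fromIndex < 0 ? Math.max(0, len + fromIndex) : fromIndex);
    for (; i < len; i++) {
      // "i in arr" skips holes in sparse arrays, as the native method does
      if (i in arr && arr[i] === obj) return i;
    }
    return -1;
  };
}(Array.prototype));
```

Whichever branch is taken, every later call goes straight to the chosen function, so old browsers pay for the loop and modern ones pay only for one extra function call.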

Moreover!

If the problem is the millisecond wasted performing a feature detection, we can always fall back to lazy feature detection.
I agree that web performance is more about download time and round trips, but if Google has a monster engine like V8, do we agree that better JavaScript practices could make even V8 faster?
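For those who have never met lazy feature detection, this is a rough sketch of the pattern: the function runs the detection on its first call only, then replaces itself with the right implementation. The toArray name is hypothetical, and the typeof document guard is only there so the sketch can also run outside a browser:

```javascript
var toArray = function (list) {
  var slice = Array.prototype.slice;
  try {
    // detection, executed on the first call only:
    // some old browsers throw when slice meets a DOM collection
    if (typeof document !== "undefined") {
      slice.call(document.childNodes);
    }
    // slice is safe: from now on toArray is just this function
    toArray = function (list) {
      return slice.call(list);
    };
  } catch (e) {
    // slice is not safe with host objects: use a plain loop instead
    toArray = function (list) {
      for (var rv = [], i = 0, len = list.length; i < len; ++i) {
        rv[i] = list[i];
      }
      return rv;
    };
  }
  // serve the very first call through the implementation just chosen
  return toArray(list);
};
```

Nothing is spent at load time, and if toArray is never called the detection never runs at all.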

Lack Of Creativity

Even though this library uses some weird practices, the most logical and common techniques to speed up execution and reduce code are often not considered. As an example, this is just one method we can find in the crypt file:


goog.crypt.byteArrayToString = function(array) {
  var output = [];
  for (var i = 0; i < array.length; i++) {
    output[i] = String.fromCharCode(array[i]);
  }
  return output.join('');
};

Oh Really? So now a JavaScript developer should create a function which accepts an array in order to create another array, populate it in a loop with one global method call per item, and perform a join at the end? That's weird, I thought that method could have been written this way:


// WebReflection Suggestion
goog.crypt.byteArrayToString = function(array) {
  return String.fromCharCode.apply(null, array);
};

... but maybe it's just me, such a noob that I cannot spot the difference, except for better performance with less code ... please enlighten me!
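One difference worth keeping in mind: apply passes every item of the array as a separate argument, and engines cap how many arguments a single call can take, so a very large byte array could throw. A hedged sketch of a middle way follows, assuming an arbitrary batch size of 8192 (my own guess at a safe value, not a documented limit) and a standalone byteArrayToString name used only for illustration:

```javascript
function byteArrayToString(array) {
  var chunk = 8192, // assumed safe batch size, not a documented limit
      output = [];
  // one fromCharCode call per batch instead of one per byte
  for (var i = 0, len = array.length; i < len; i += chunk) {
    output.push(String.fromCharCode.apply(null, array.slice(i, i + chunk)));
  }
  return output.join("");
}
```

For small arrays this is a single apply call plus a join, so the "less code, better performance" argument still holds for the common case.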

Closure Library, The Good Part

It is massive, and for basically each file there is a test case, so it is robust. For sure, as we have seen before, this is not the fastest general-purpose library we could find on the net, which is what I was expecting from Big G, but hey ... we cannot blame the excellent work done so far, can we?

Closure Templates

Well, here I am almost speechless ... I mean, the number 1 search engine creating monster pages via JavaScript templates? I do hope those files will be executed anywhere but in a web page at runtime, 'cause with HTML5 incoming I feel horrified thinking about a new web loaded with document.write and nothing else. Where are the semantics? Where is the logic applied before, and not during, execution? Where are the best practices? Looking at the examples, even the V8CGI project is not able to run that code ... what is that, exactly? And WHY???
The only thing I do like there is the definition file, something simple, clever, easy to parse, exactly the opposite of its implementation via JavaScript: please avoid it!

Closure Conclusion

No, this is not another piece of the puzzle, just the end of this post. I've probably been too aggressive, and it's only thanks to Google's decision that I can write a post like this: Open Source is the key, or your libraries, operating systems, whatever, will always be behind, 'cause the developers able to help you are not only in your team.
I was expecting excellent ideas, new killer techniques unknown to everybody else; what I have found is "yet another toolkit". I am always up for contributing, so if interested put my name in the committers list. I already have lots of stuff to optimize, because a Compiler, whatever it does, cannot create better code, it can only try to make the existing code a bit better and shorter: never delegate skills to a machine until machines are able to take decisions for us!