My JavaScript book is out! Don't miss the opportunity to upgrade your beginner or average dev skills.

Thursday, November 12, 2009

How To Map Your Code

Most of the tools we use on daily basis to develop applications with any kind of programming language need to understand our code in order to help us while we are writing, reading, or creating code.
We open an editor, we read highlighted code, we like suggestions when we write a dot and we use compilers, minifiers, compressors, obfuscators, converters, documentation generators ... but have we ever though about the program behind our same program? Have we never thought that highlights, suggestions, minifications, and syntax analyzer do not come for free and these are primordial programs we've ever used often without even realizing we are using it?
This post is about general suggestions, techniques, and practices, used to map code. In this specific case, we will analyze step by step the logic behind a code mapper using the most used programming language in the world: JavaScript.

Map VS Tokens

First of all we need to understand what we need. A Map could be considered a generic list of coordinates able to tell us "what is where" while a tokenizer is usually an extremely detailed map with extra info, where everything is deeply analyzed and hopefully never casual, as a map could be.
Every programming language has an interpreter, and usually an interpreter is able to analyze the syntax and create tokens to use runtime or to compile the language into another one (byte/machine code). As metaphoric example, the world is an open source application and we, nature and animals included, are tokens, then everything is part of a global application.
So far, services such Google Map, are applications able to map the world. Google Map does not (yet) care about our role in the system, it simply tells us what/who is where. Silly metaphors a part, being global tokens analysis much harder to perform, this post will talk about the simple way: the generic Map, but if interested, and again via JavaScript, we could have a look into Narcissus, a good old Mozilla Project able to analyze JavaScript via JavaScript itself.

Time To Start Thinking About The Problem

OK, everything seems so easy for human eyes ... we take a string, we recognize what is where, and that's it! Easy! Specially with an highlighter able to make code lecture a pleasure, isn't it?
Well, we'll see that things are not that simple as we think, and that is why historical projects such Scintilla are still under development.

What We Are Interested About

As I have said, JavaScript will be the reference language but problems and techniques are similar for every other.
The first requirement to understand is what as what are we looking for. In JavaScript case, we can split the language in these main cases:
  • strings, we cannot touch them!
  • literal regular expressions, untouchable as well
  • comments, we may consider to get rid of them
  • E4X, damn cool XML as is, again untouchable
  • everything else, considered code

For above categories, we could slit them in sub categories:
  • single quoted strings
  • double quoted strings
  • single line comments
  • multiline comments

So far, we have just decided what will be our map about, and nothing else. Now it's time to think how to implement this map.

The Basic Problem: Who Comes First

Somebody could think:dude, what's the fuss man, just search via RegExp and that's it... and no it's not.
This is a classic case where everything could go wrong, included your favorite JavaScript editor, but still perfectly valid syntax:

// ... code

var theWeirdCase = /"['//*"]'/;

// other code ...

Uhm, check that out, a regular expression with a string "['//*" inside plus another intersected string '//*"]' plus a single line comment // plus the beginning of a multiline one /* ... well done, surrounded by code, that case, reproducible with every combination with a string and a regexp inside, a comment with a string inside, etc etc, could fuck up all our parsing plans, isn't it?
So rule number one: understand what part of the map is more relevant, where in this case, relevance is defined by who comes first.
If we have whatever inside a string, whatever IS inside a string.
If we have whatever inside a literal RegExp, whatever IS inside a regexp.
Same is for E4X or comments, single or multi line, whatever will be part of that comment. Does anybody agree?

Char By Char VS Regular Expressions

OK, here we are, let's define the best strategy to analyze code, right? Char By Char could be the easy way to analyze code: it's simple to implement, we feel like we have control over every single char, we can spot who comes first without problems or errors, isn't it?
Regular Expressions are simply abstracts and clever char by char parsers. We delegate tedious code to the RegExp engine which aim is to understand the expression, and gives us the match, if any. Regular Expressions are somehow similar, conceptual speaking, to SQL: we type Maya and Egyptians characters and magically we obtain the result, without caring at all about lower level layers.
I can still spot the dude ... guy voice saying:
mate, I am not a noob, if RegExp are char by char parsers, my char by char parser will be faster for sure.
This time the dude guy is not completely wrong but as usual, it depends.
If we are programming with C, C++, or heyholetsGo!, without considering better suited programing languages for this purpose such Caml or OCaml, Regular Expressions will require a library with such overload for generic purpose that maybe we can do better and faster for the specific case.
On the other hand, all we know about programming is that less is better and if we can trust extremely tested and famous regular expression engines (PCRE) why on earth we should waste time writing whatever specific case parser which aim will be the same of a RegExp?
And why on earth we think our check/test/verify implementation will be more stable than a short, quick, and dirty RegExp?

JavaScript Is Not As Fast As C Is

Thanks to competitors and new strategies to manage JS dynamic code, performances are often faster even than PHP, Ruby, Python (not Iron or C-) but JS cannot compete with lower level languages compiled directly into machine code and without dynamic nature.
Rather than explain in 3 hours why char by char is not always worth it with JavaScript, here I am with a test anybody can try and with every browser.

// commented to avoid problems with highlight used
// n this blog, an old char by char parser ;-)

// PLEASE REMOVE NEXT COOMMENT CHARS "//" TO TEST

// var theWeirdCase = /"['//*"]'/;
var aLongString = new Array(150001).join(".") + theWeirdCase;
var i = 0;
var skip = false;
var time = [new Date];
var position = [(function(){
var length = aLongString.length;
var mlc = aLongString.indexOf("/*");
var slc = aLongString.indexOf("//");
var sqt = aLongString.indexOf("'");
var dqt = aLongString.indexOf('"');
var rex = aLongString.indexOf("/");
if(mlc < 0)mlc = length;
if(slc < 0)slc = length;
if(sqt < 0)sqt = length;
if(dqt < 0)dqt = length;
if(rex < 0)rex = length;
var position = Math.min(mlc, slc, sqt, dqt, rex);
if(position === length)position = -1;
return position;
})()];
time[i] = new Date - time[i];
time[++i] = new Date;
position[i] = (function(){
for(var
position = -1, c = 0, length = aLongString.length;
c < length; ++c
){
switch(aLongString.charAt(c)){
case "'":position=c;c=length;break;
case '"':position=c;c=length;break;
case "/":
if(c < length - 1){
switch(aLongString.charAt(c + 1)){
case "*":position=c;c=length;break;
case "/":position=c;c=length;break;
default:position=c;c=length;break;
}
}
break;
};
};
return position;
})();
time[i] = new Date - time[i];
try{aLongString[0]}catch(e){skip=true};
if(!skip && aLongString[0] === "."){
time[++i] = new Date;
position[i] = (function(){
for(var
position = -1, c = 0, length = aLongString.length;
c < length; ++c
){
switch(aLongString[c]){
case "'":position=c;c=length;break;
case '"':position=c;c=length;break;
case "/":
if(c < length - 1){
switch(aLongString.charAt(c + 1)){
case "*":position=c;c=length;break;
case "/":position=c;c=length;break;
default:position=c;c=length;break;
}
}
break;
};
};
return position;
})();
time[i] = new Date - time[i];
};
time[++i] = new Date;
position[i] = aLongString.search(/\/*|\/\/|'|"|\//);
time[i] = new Date - time[i];

alert([
time.join("\n"),
position.join("\n")
].join("\n"));

Above benchmark try to replicate common techniques to find a single piece of the full map, in this case over about 150Kb of fake JavaScript code.

How To Read The Benchmark

dude ... the benh confirm that ... shut up! The bench shows that Regular Expressions are the fastest way to parse code via JavaScript. It does not matter if the score could not be the fastest in above case, what matters is that with a single Regular Expression we can grab 0, 1, or every match in the code.
If we think that in 150Kb of code to highlight showed time will increase for each mapped part in the same code, while the regular expression could be just one, we can easily see that:
  • the RegExp is able to find and validate at the same time what we are looking for, above benchmark misses all manual validations we need to do to understand if the case, first found char a part, is exactly the one we where looking for
  • with a single regular expression we save a large number of characters and the code is easier to maintain
  • for each manual parsed char we need to perform a manual check for the exact case, the total amount of manual checks increase with number of chars and code complexity
  • with a single regexp we don't care that much about code size because the engine, hopefully written in C, should be fast enough to be able to parse a large amount of data


We Still Need Good Regular Expressions

This is the most critical part, if we use bad regular expressions we could have a lot of false positive and we could mess up the map. The reason Regular Expressions are not always well considered, is that these could be hard to read, write, or understand, and people could put "chars around" without improving anyhow the regexp, adding more false positives instead. I do not pretend to give you best regular expressions ever for all cases, but it's since 2000 I am using RegExps and hopefully I know a chicky bit about them.

JSMap.parser = [
// WebReflection Suggestion
{
test:/\/\/[^\1]*?(\r\n|\r|\n)/g,
type:JSMap.COMMENT_SL,
place:function(value, a){
return value.charAt(2) === "@" ? value : a;
}
},{
test:/\/\*[^\1]*?(\*\/)/g,
type:JSMap.COMMENT_ML,
place:function(value){
return value.charAt(value.length - 3) === "@" ? value : JSMap.parser[0].place(value, "");
}
},{
test:/(["'])(?:(?=(\\?))\2.)*?\1/g,
type:JSMap.STRING
},{
test:/\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g,
type:JSMap.REGEXP
},{
// Note: experimental, not fully tested/supported
test:/<>[^\1]*?(<\/>)|<(\w+|\{\w+\})(?:\s*\/|[^\>]*?>.*?<\/\2\s*)>/g,
type:JSMap.E4X
}
];

Above collection contains all kind of things we would like to Map for JavaScript.
Some object contains a place method which aim is to avoid, if necessary, the match replacement. In this case I have considered Internet Explorer conditional comments and nothing else, but there are other cases where a comment should not be removed (e.g. /*! my license */ )
Each object contains a map type, 'cause we would like to know what we have found there, don't we?

JSMap.CODE = 1;
JSMap.COMMENT_SL= 2;
JSMap.COMMENT_ML= 4;
JSMap.STRING = 8;
JSMap.REGEXP = 16;
JSMap.E4X = 32;
JSMap.ALL = 63;
JSMap.DEFAULT = ~JSMap.E4X;

Since we are creating a customizable map, we would like to choose what we want to find or not. We can decide using some bit operation.
As we can see, the default one exclude the E4X case, it's not common, since it has not been implemented yet in every browser, plus for this post and example is not perfect.
To exclude something all we need to do is to use ~ char, while if we want to decide just few thing we can always use the or | :

JSMap.COMMENT_SL | JSMap.COMMENT_ML

Above example will look for single and multi line comments. Maybe we want just understand if there is some conditional comment?
In cany case, bear in mind that who comes first is still the main problem, so if we restrict the searh we could find false positives (e.g. comments inside strings or regexps)

How To Logically Proceed With The Mapper

OK, we have reduced code size already 1/3rd thanks to god regular expressions. Now what's next?
The problem is still the same: Who Comes First!
For each performed Regular Expression we should store results somewhere. In this case we will have 4 searches performed via RegExp for the entire code, rather than a check for each possible matched character, but false positives could always be there.

function JSMap(CODE, type){
if(!type)
type = JSMap.DEFAULT
;
for(var
Map = [],
length = JSMap.parser.length, i = 0,
a, b, exec, parser, value;
i < length; ++i
){
parser = JSMap.parser[i];
if(parser.type & type){
while(exec = parser.test.exec(CODE)){
value = exec[0];
Map.push({start:exec.index, end:exec.index + value.length, value:value, type:parser.type});
};
};
};
// ... the rest of the code

OK, looping over the list of RegExp we have collected every match.
Each match has been stored as an object where properties are:
  • where the match start, we can reuse start and end info regardless for other purposes
  • where the match end
  • the match itself, alone could be reused or simply replaced in the current CODE
  • the match type, what we have found

Start and end are valid for every match and compatible with substring.
If for some reason we will change a single value and its length will change, we can easily synchronize the Map adding or removing the difference between old length and the new one for every other mapped objects after the current one.
these properties could be superfluous but we want control for any kind of occasion.

An Ordered Map Without False Positives Collisions

The simplest way to reorder the map is a native Array.sort operation, perfect to understand who comes first!

Map.sort(function(a,b){
return b.start < a.start ? 1 : -1
});

The start point is the sort key and thanks to Regular Expressions we will rarely have same start point. If we have one, we need to rethink the RegExp because it is not good enough since, for example, a comment cannot be a regexp and viceversa.
The native sort operation is hopefully fast enough to guarantee still better performances. All we need to do now is to add, if necessary, the code.

Cleaning Adding Surrounding Code

Once we have mapped all we were looking for, we can consider CODE everything before, in between, or after our matches.

// type could NOT include CODE
// and if code starts with a comment, there
// is nothing to do ...
if(type && 0 < (i = Map[0].start))
Map.unshift({start:0, end:i, value:CODE.substring(0, i), type:JSMap.CODE})
;
// every other Mapped case will be ordered by start
// Accordingly, the first start we will encounter
// will be the valid one ... the first match
// will be considered valid by dafault (var a)
for(length = Map.length, i = 1, a = Map[0]; i < length; ++i){
b = Map[i];
switch(true){
// if there is a gap between the
// precedent a match and b one
// there MUST be code between
case a.end < b.start:
// let's add it if we need
// continue otherwise
if(type){
Map.splice(i, 0, {start:a.end, end:b.start, value:CODE.substring(a.end, b.start), type:JSMap.CODE});
++length;
++i;
};
// if there is no gap or code
// has been inserted
// we can go on with the for loop
// considering the current b
// valid, assigned for this reason to a
case a.end === b.start:
a = b;
break;
// if there is no gap
// which means next match starts
// before a.end and no after
// we have a false positive
default:
// remove this match in any case
Map.splice(i, 1);
--length;
--i;
break;
};
};
// if the last match ends before the string.length
// there must be another piece of code to add
if(type && (i = Map[i - 1].end) < CODE.length)
Map.push({start:i, end:CODE.length, value:CODE.substring(i, CODE.length), type:JSMap.CODE})
;

Finally, if there is no match but JSMap.CODE is part of the search, we can consider the entire string a piece of code:

Map.push({start:0, end:CODE.length, value:CODE, type:JSMap.CODE})


How To Use A Map

Almost the end of this tedious post. Once we have created a map of our code, we can reuse matches in any way we prefer.
This is a "paste and go" example that will let us test whatever code we want:

/**
* JSMap test case
*/
onload = function(){
document.body.appendChild(
document.createElement("textarea")
).onchange=function(){
var time = new Date,
Map = JSMap(this.value/*, JSMap.CODE|JSMap.COMMENT_ML|JSMap.COMMENT_SL|JSMap.REGEXP*/)
;
time = new Date - time;
if(!Map.map)Map.map=function(fn){for(var i = 0, length = this.length; i < length; ++i)this[i]=fn(this[i]);return this};
document.body.innerHTML = 'Mapped in ' + (time/1000) + ' milliseconds.<pre>' + Map.map(function(o){
var value = o.value.replace(/</g, "&lt;").replace(/>/g, "&gt;");
switch(o.type){
case JSMap.STRING:
value = '<span style="color:#66F">' + value + '</span>';
break;
case JSMap.COMMENT_SL:
value = '<span style="color:#999">' + value + '</span>';
break;
case JSMap.COMMENT_ML:
value = '<span style="color:#F66">' + value + '</span>';
break;
case JSMap.REGEXP:
value = '<span style="color:#F00">' + value + '</span>';
break;
case JSMap.E4X:
value = '<span style="color:#00F">' + value + '</span>';
break;
default:
// eventually we can parse
// the code with numbers
// spaces, keywords, methods
// and whatever we need
// that would be another Map
// specific for the language
// or simply a replacer via RegExp
break;
};
return value;
}).join("") + '</pre>';
};
};

Save in an html page, paste some valid code into the textarea, disabling spell check if the source is massive, and click somewhere outside the area.

Conclusion

Even if the case is JavaScript and a JavaScript map, with this post we can hopefully better understand problems behind generic code parsing.
What is missing in this post is a code parser able to highlight or understand every piece of code present in the map.
If we loop over the returned map it is possible to understand which one is code and what is next. If a variable is going to be assigned to a regexp or string, as example, the value will be found in the next object present in the map.
I did not consider numbers 'cause these are simple to parse in a code portion, while these could slow down every other operation since these could be easily spotted inside strings, comments, regexps, or E4X syntax.
Considerations, techniques, and benchmarks, are relative for this case but generally valid for any kind of purpose. CSS selectors, HTML, and other cases, need to consider when Regular Expressions are worthy (e.g. not that fast char by char parser due to used programming language) and when a simple indexOf could be the best solution ever. I hope you enjoyed this post, I surely did writing it since the argument is not that frequent and techniques extremely different (and trust me, I have showed maximum a 30% of what we could fnd out there).
Oooops, I almost forgot the JSMap source code!

13 comments:

Mr Speaker said...

Great post! I've tried my hand at rolling-my-own code highlighters a couple of times and the results have always been patchy (at best). I'm going to go revisit a couple of those projects now!

Gareth Heyes said...

Hmmmmmm that won't strip all comments and doesn't detect all regexps ;)

Andrea Giammarchi said...

can you guys be more constructive with comments please ... there is no code that strips code and all concepts are simply correct. Maybe RegExp are not perfect but this is another story, OK?
False positives can always happen and if you open your favorite IDE trying that weirdCase tell me it works without problems ... regards

Andrea Giammarchi said...

P.S. I have parsed the full ExtJS debug library, highlighted (no code) without problems. Bring me examples or you are not helping at all.

Gareth Heyes said...

Hmmmm well for a start:-

string = 'I am a \
string';

Probably same with the regexp:-

/multiline regexp\
on FF/

Andrea Giammarchi said...

As I have said, this is a RegExp problem, it could be better. This does not change anyhow what I have written in this post, suggested practices, and the logic behind.

This post title is: How To Map Your Code

and not: The Best RegExp in the world

(I even wrote I could not guarantee perfect RegExp, first comment: the RegExp is not perfect, ... oh really?)

You have better JS techniques not explained in this post? I'd love to see them.

You have better regexp to solve the problem? Please share.

There is something not clear about the applied logic? I'll be more than happy to better explain it.

Regards

sirdarckcat said...

Dude, you treat your readers like garbage..

so, you complain of people not making perfect code on the most stupid things like a String.fromCharCode.apply, but you dont like when someone tells you that your code is not perfect.. wtf?

Anyway.. its your blog, do whatever u want.

Andrea Giammarchi said...

Sorry you are right, it's just that I am having hard time with people not understanding my posts, commenting stuff not related.

I write about potatoes, people complain about tomatoes ... a bit different, if I was writing a code style, a technical solution, rather than a full abstract process about how to solve the problem ... he would have been more than right!

Gareth Heyes said...

Hey Andrea

I'm not having a go, I was just pointing out it's missing some stuff for what I want to use it for. You did ask for an example =)

I'm not saying your code sucks or anything. You are using it for one purpose I want to use it for another.

Andrea Giammarchi said...

@Gareth my apologies, but please if you have better RegExps share, thanks

Gareth Heyes said...

These handle everything apart from E4X.


unicode = /\\u[0-9a-fA-F]{4}/,
endStatement = /(?:(?:^\s*|\s*$)|[,]|[;\n\r]+)/, spaces = /\s*/,
variable = new RegExp("(?:(?:[a-zA-Z$_]|" + unicode.source + ")(?:[a-zA-Z0-9_$]|(?:" + unicode.source + "))*)"),
operators = /(?:[!][=]{0,2}|[%][=]?|[&]{1,2}|[&][=]|[*][=]?|[+]{1,2}|[+][=]|[\-]{1,2}|[\-][=]|[\/][=]?|[<]{1,2}[=]?|[=]{1,3}|[>]{1,3}[=]?|[\^][=]?|[|][=]?|[|]{2})/,
regexpObj = new RegExp("(?:[\\/](?:\\\\[\\/]|[^\\/*])(?:\\\\[\\/]|[^\\/])*?[\\/](?:[gmi]*))"),
strings = new RegExp("(?:(?:['](?:\\\\{2}|\\\\[']|\\\\[\\r\\n]|[^'])*['])|(?:[\"](?:\\\\{2}|\\\\[\"]|\\\\[\\r\\n]|[^\"])*[\"]))"),
numbers = new RegExp('(?:[0][xX][0-9a-fA-F]*)|(?:[0]|[1-9]\\d+)?(?:[.]?\\d+)+(?:[eE][+-]?\\d+)?'),
comments = new RegExp('(' + strings.source + '|' + numbers.source + '(?:' + operators.source + ')?(?![*\\/])' + '|' + spaces.source + "(?:[\\/](?:\\\\[\\/]|[^\\/*])(?:\\\\[\\/]|[^\\/])*?[\\/](?:[gmi]*))" + spaces.source + ')' + '|((?:^\s*[\\-]{2}[>]|[<][!][\\-]{2}).*(?=[\\r\\n]?)|[\\/]{2}.*(?=[\\r\\n]?))|([\\/][*](?:[\\s]|.)*?[*][\\/])', 'gm');

kangax said...

Seems to be failing with this one:

(function(x){
return 1/x//x/
})();

Expected: `1/x` followed by single-line comment

Actual: /x//x/ is recognized as regex literal

Andrea Giammarchi said...

The literal I wrote for literals suffers false positives indeed and that case with no char range is an error ... I guess I should change it soon ...