Most of the tools we use on a daily basis to develop applications, in any programming language, need to understand our code in order to help us while we are writing, reading, or generating it.
We open an editor, we read highlighted code, we like suggestions when we type a dot, and we use compilers, minifiers, compressors, obfuscators, converters, documentation generators ... but have we ever thought about the program behind our own program? Have we ever considered that highlighting, suggestions, minification, and syntax analysis do not come for free, and that these are fundamental programs we use all the time, often without even realizing it?
This post is about general suggestions, techniques, and practices used to map code. In this specific case, we will analyze step by step the logic behind a code mapper using the most used programming language in the world: JavaScript.
Map VS Tokens
First of all we need to understand what we need. A Map can be considered a generic list of coordinates able to tell us "what is where", while a tokenizer usually produces an extremely detailed map with extra info, where everything is deeply analyzed and hopefully nothing is left to chance, as it could be in a map.
Every programming language has an interpreter, and usually an interpreter is able to analyze the syntax and create tokens to use at runtime or to compile the language into another one (byte/machine code). As a metaphor, the world is an open source application and we, nature and animals included, are tokens: everything is part of a global application.
Services such as Google Maps are applications able to map the world. Google Maps does not (yet) care about our role in the system, it simply tells us what/who is where. Silly metaphors apart, since global token analysis is much harder to perform, this post will talk about the simple way: the generic Map. If interested, and again via JavaScript, we could have a look at Narcissus, a good old Mozilla project able to analyze JavaScript via JavaScript itself.
Time To Start Thinking About The Problem
OK, everything seems so easy for human eyes ... we take a string, we recognize what is where, and that's it! Easy! Especially with a highlighter able to make reading code a pleasure, isn't it?
Well, we'll see that things are not as simple as we think, and that is why historical projects such as Scintilla are still under development.
What We Are Interested In
As I have said, JavaScript will be the reference language, but problems and techniques are similar for every other language.
The first requirement is to understand what we are looking for. In the JavaScript case, we can split the language into these main cases:
- strings, we cannot touch them!
- literal regular expressions, untouchable as well
- comments, we may consider getting rid of them
- E4X, damn cool XML as is, again untouchable
- everything else, considered code
The above categories can be split into subcategories:
- single quoted strings
- double quoted strings
- single line comments
- multiline comments
So far, we have just decided what our map will be about, and nothing else. Now it's time to think about how to implement this map.
The Basic Problem: Who Comes First
Somebody could think: dude, what's the fuss man, just search via RegExp and that's it... and no, it's not.
This is a classic case where everything could go wrong, your favorite JavaScript editor included, even though the syntax is perfectly valid:
// ... code
var theWeirdCase = /"['//*"]'/;
// other code ...
Uhm, check that out: a regular expression with a string "['//*" inside, plus another intersecting string '//*"]', plus a single line comment //, plus the beginning of a multiline one /* ... well done. Surrounded by code, that case, reproducible with every combination of a string and a regexp inside, a comment with a string inside, and so on, could mess up all our parsing plans, couldn't it?
So rule number one: understand what part of the map is more relevant, where in this case, relevance is defined by who comes first.
If we have whatever inside a string, whatever IS inside a string.
If we have whatever inside a literal RegExp, whatever IS inside a regexp.
The same goes for E4X or comments, single or multi line: whatever is inside will be part of that comment. Does anybody agree?
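To make the rule concrete, here is a minimal sketch, assuming a hypothetical whoComesFirst helper that is not part of JSMap. It runs one naive indexOf per category over the weird case and shows that every search "finds" something, while only the smallest index tells us what we are really inside.

```javascript
// The weird case from above: a single regexp literal, nothing else.
var source = 'var theWeirdCase = /"[\'//*"]\'/;';

// One naive search per category: each of them reports a hit,
// even though everything after the first slash lives inside the regexp.
var naive = {
    multiLineComment: source.indexOf("/*"),
    singleLineComment: source.indexOf("//"),
    singleQuotedString: source.indexOf("'"),
    doubleQuotedString: source.indexOf('"'),
    regexp: source.indexOf("/")
};

// Hypothetical helper: the category with the smallest valid index wins.
function whoComesFirst(found) {
    var winner = null, key;
    for (key in found) {
        if (found.hasOwnProperty(key) && -1 < found[key] &&
            (winner === null || found[key] < found[winner])) {
            winner = key;
        }
    }
    return winner;
}

console.log(naive);
console.log(whoComesFirst(naive));
```

A lonely slash could of course be a division as well, so this is only about ordering, not about validation: that part is what the regular expressions shown later are for.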
Char By Char VS Regular Expressions
OK, here we are, let's define the best strategy to analyze code, right? Char By Char could be the easy way to analyze code: it's simple to implement, we feel like we have control over every single char, and we can spot who comes first without problems or errors, can't we?
Regular Expressions are simply abstract and clever char by char parsers. We delegate tedious code to the RegExp engine, whose aim is to understand the expression and give us the match, if any. Regular Expressions are somehow similar, conceptually speaking, to SQL: we type Mayan and Egyptian characters and magically we obtain the result, without caring at all about lower level layers.
I can still hear the dude ... guy's voice saying: mate, I am not a noob, if RegExps are char by char parsers, my char by char parser will be faster for sure.
This time the dude guy is not completely wrong but, as usual, it depends.
If we are programming with C, C++, or heyholetsGo!, without considering programming languages better suited for this purpose such as Caml or OCaml, Regular Expressions will require a library with so much overhead for generic purposes that maybe we can do better and faster for the specific case.
On the other hand, all we know about programming is that less is better, and if we can trust extremely well tested and famous regular expression engines (PCRE), why on earth should we waste time writing a specific parser whose aim will be the same as a RegExp's?
And why on earth do we think our check/test/verify implementation will be more stable than a short, quick, and dirty RegExp?
JavaScript Is Not As Fast As C Is
Thanks to competition and new strategies to manage dynamic JS code, performance is often better even than PHP, Ruby, or Python (not Iron or C-), but JS cannot compete with lower level languages compiled directly into machine code and without a dynamic nature.
Rather than explain in 3 hours why char by char is not always worth it with JavaScript, here I am with a test anybody can try, in every browser.
// commented to avoid problems with the highlighter used
// in this blog, an old char by char parser ;-)
// PLEASE REMOVE THE NEXT COMMENT CHARS "//" TO TEST
// var theWeirdCase = /"['//*"]'/;
var aLongString = new Array(150001).join(".") + theWeirdCase;
var i = 0;
var skip = false;
var time = [new Date];
var position = [(function(){
var length = aLongString.length;
var mlc = aLongString.indexOf("/*");
var slc = aLongString.indexOf("//");
var sqt = aLongString.indexOf("'");
var dqt = aLongString.indexOf('"');
var rex = aLongString.indexOf("/");
if(mlc < 0)mlc = length;
if(slc < 0)slc = length;
if(sqt < 0)sqt = length;
if(dqt < 0)dqt = length;
if(rex < 0)rex = length;
var position = Math.min(mlc, slc, sqt, dqt, rex);
if(position === length)position = -1;
return position;
})()];
time[i] = new Date - time[i];
time[++i] = new Date;
position[i] = (function(){
for(var
position = -1, c = 0, length = aLongString.length;
c < length; ++c
){
switch(aLongString.charAt(c)){
case "'":position=c;c=length;break;
case '"':position=c;c=length;break;
case "/":
if(c < length - 1){
switch(aLongString.charAt(c + 1)){
case "*":position=c;c=length;break;
case "/":position=c;c=length;break;
default:position=c;c=length;break;
}
}
break;
};
};
return position;
})();
time[i] = new Date - time[i];
try{aLongString[0]}catch(e){skip=true};
if(!skip && aLongString[0] === "."){
time[++i] = new Date;
position[i] = (function(){
for(var
position = -1, c = 0, length = aLongString.length;
c < length; ++c
){
switch(aLongString[c]){
case "'":position=c;c=length;break;
case '"':position=c;c=length;break;
case "/":
if(c < length - 1){
switch(aLongString.charAt(c + 1)){
case "*":position=c;c=length;break;
case "/":position=c;c=length;break;
default:position=c;c=length;break;
}
}
break;
};
};
return position;
})();
time[i] = new Date - time[i];
};
time[++i] = new Date;
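// the asterisk must be escaped: an unescaped \/* matches
// "zero or more slashes", which is an empty string at index 0
position[i] = aLongString.search(/\/\*|\/\/|'|"|\//);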
time[i] = new Date - time[i];
alert([
time.join("\n"),
position.join("\n")
].join("\n"));
The benchmark above tries to replicate common techniques to find a single piece of the full map, in this case over about 150Kb of fake JavaScript code.
How To Read The Benchmark
dude ... the bench confirms that ... shut up! The bench shows that Regular Expressions are the fastest way to parse code via JavaScript. It does not matter if the RegExp score is not the fastest one in the case above, what matters is that with a single Regular Expression we can grab 0, 1, or every match in the code.
If we think that, in 150Kb of code to highlight, the time shown will increase for each mapped part of the same code, while the regular expression could be just one, we can easily see that:
- the RegExp is able to find and validate at the same time what we are looking for, while the benchmark above misses all the manual validations we need to do to understand if the case, first found char apart, is exactly the one we were looking for
- with a single regular expression we save a large number of characters and the code is easier to maintain
- for each manually parsed char we need to perform a manual check for the exact case, and the total amount of manual checks increases with the number of chars and code complexity
- with a single regexp we don't care that much about code size because the engine, hopefully written in C, should be fast enough to parse a large amount of data
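As a small illustration of the "0, 1, or every match" point, this is a sketch with a hypothetical findAll helper (the name and the double quoted string pattern are my assumptions for the example, not the JSMap ones). One global RegExp both finds and validates each string, escaped quotes included.

```javascript
// Hypothetical helper: collect every match of a global RegExp.
function findAll(re, text) {
    if (!re.global) {
        throw new Error("findAll needs a RegExp with the g flag");
    }
    var matches = [], exec;
    re.lastIndex = 0; // always start from the beginning
    while ((exec = re.exec(text))) {
        matches.push({start: exec.index, value: exec[0]});
        // avoid an endless loop if the pattern can match an empty string
        if (exec[0] === "") re.lastIndex++;
    }
    return matches;
}

// Double quoted strings, escaped quotes included.
var doubleQuoted = /"[^"\\]*(?:\\.[^"\\]*)*"/g;

var found = findAll(doubleQuoted, 'a = "one"; b = "t\\"wo"; c = 3;');
var none = findAll(doubleQuoted, "c = 3;");

console.log(found.length, none.length);
```

The same loop written char by char would need its own state for "inside a string" and "previous char was a backslash", which is exactly the manual validation the list above talks about.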
We Still Need Good Regular Expressions
This is the most critical part: if we use bad regular expressions we could have a lot of false positives and we could mess up the map. The reason Regular Expressions are not always well regarded is that they can be hard to read, write, or understand, and people could put "chars around" without improving the regexp in any way, adding more false positives instead. I do not pretend to give you the best regular expressions ever for all cases, but I have been using RegExps since 2000 and hopefully I know a little bit about them.
JSMap.parser = [
// WebReflection Suggestion
{
test:/\/\/[^\1]*?(\r\n|\r|\n)/g,
type:JSMap.COMMENT_SL,
place:function(value, a){
return value.charAt(2) === "@" ? value : a;
}
},{
test:/\/\*[^\1]*?(\*\/)/g,
type:JSMap.COMMENT_ML,
place:function(value){
return value.charAt(value.length - 3) === "@" ? value : JSMap.parser[0].place(value, "");
}
},{
test:/(["'])(?:(?=(\\?))\2.)*?\1/g,
type:JSMap.STRING
},{
test:/\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g,
type:JSMap.REGEXP
},{
// Note: experimental, not fully tested/supported
test:/<>[^\1]*?(<\/>)|<(\w+|\{\w+\})(?:\s*\/|[^\>]*?>.*?<\/\2\s*)>/g,
type:JSMap.E4X
}
];
The collection above contains all the kinds of things we would like to Map for JavaScript.
Some objects contain a place method, whose aim is to avoid, if necessary, the match replacement. In this case I have considered Internet Explorer conditional comments and nothing else, but there are other cases where a comment should not be removed (e.g. /*! my license */ ).
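Here is a sketch of how the place callbacks behave when called directly. The two first functions are copies of the ones in the collection; I am assuming the second argument of the single line one is the replacement to use when the comment is dropped. The placeKeepLicense function is a hypothetical extension for the license case, not part of the collection.

```javascript
// Copies of the two place callbacks from the collection above.
function placeSingleLine(value, replacement) {
    // "//@cc_on" and friends are IE conditional comments: keep them
    return value.charAt(2) === "@" ? value : replacement;
}
function placeMultiLine(value) {
    // "/*@cc_on ... @*/" ends with "@*/": keep it, otherwise fall
    // back to the single line check with an empty replacement
    return value.charAt(value.length - 3) === "@" ?
        value : placeSingleLine(value, "");
}

// Hypothetical extension, not part of the original collection:
// also preserve "/*! license */" style comments.
function placeKeepLicense(value) {
    return value.charAt(2) === "!" ? value : placeMultiLine(value);
}

var kept = placeMultiLine("/*@cc_on @*/");
var dropped = placeMultiLine("/* just a note */");
var newline = placeSingleLine("// just a note\n", "\n");
var license = placeKeepLicense("/*! my license */");

console.log([kept, dropped, newline, license]);
```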
Each object contains a map type, 'cause we would like to know what we have found there, don't we?
JSMap.CODE = 1;
JSMap.COMMENT_SL= 2;
JSMap.COMMENT_ML= 4;
JSMap.STRING = 8;
JSMap.REGEXP = 16;
JSMap.E4X = 32;
JSMap.ALL = 63;
JSMap.DEFAULT = ~JSMap.E4X;
Since we are creating a customizable map, we would like to choose what we want to find or not. We can decide using some bitwise operations.
As we can see, the default one excludes the E4X case: it's not common, since it has not been implemented in every browser yet, and the RegExp used for it in this post is not perfect.
To exclude something all we need to do is to use the ~ char, while if we want to pick just a few things we can always use the or | :
JSMap.COMMENT_SL | JSMap.COMMENT_ML
The example above will look for single and multi line comments. Maybe we just want to understand if there is some conditional comment?
In any case, bear in mind that who comes first is still the main problem, so if we restrict the search we could find false positives (e.g. comments inside strings or regexps).
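The bit operations above can be tried in isolation. This is a minimal sketch where TYPE mirrors the JSMap constants and wants is a hypothetical helper name of mine, doing the same check JSMap performs with parser.type & type.

```javascript
// A stand-alone mirror of the JSMap constants shown above.
var TYPE = {
    CODE: 1, COMMENT_SL: 2, COMMENT_ML: 4,
    STRING: 8, REGEXP: 16, E4X: 32
};
TYPE.ALL = 63;
TYPE.DEFAULT = ~TYPE.E4X; // every bit set except the E4X one

// Hypothetical helper: is this type part of the requested mask?
function wants(mask, type) {
    return (mask & type) !== 0;
}

// pick just a few things with |
var commentsOnly = TYPE.COMMENT_SL | TYPE.COMMENT_ML;
// exclude one thing from a known set with & and ~
var allButStrings = TYPE.ALL & ~TYPE.STRING;

console.log(wants(commentsOnly, TYPE.COMMENT_ML)); // true
console.log(wants(commentsOnly, TYPE.STRING));     // false
console.log(wants(TYPE.DEFAULT, TYPE.E4X));        // false
console.log(wants(allButStrings, TYPE.REGEXP));    // true
```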
How To Logically Proceed With The Mapper
OK, we have already reduced code size by 1/3rd thanks to good regular expressions. Now what's next?
The problem is still the same: Who Comes First!
For each performed Regular Expression we should store results somewhere. In this case we will have 4 searches performed via RegExp over the entire code, rather than a check for each possibly matching character, but false positives could still be there.
function JSMap(CODE, type){
if(!type)
type = JSMap.DEFAULT
;
for(var
Map = [],
length = JSMap.parser.length, i = 0,
a, b, exec, parser, value;
i < length; ++i
){
parser = JSMap.parser[i];
if(parser.type & type){
while(exec = parser.test.exec(CODE)){
value = exec[0];
Map.push({start:exec.index, end:exec.index + value.length, value:value, type:parser.type});
};
};
};
// ... the rest of the code
OK, looping over the list of RegExps we have collected every match.
Each match has been stored as an object whose properties are:
- where the match starts; we can reuse start and end info for other purposes as well
- where the match ends
- the match itself, which alone could be reused or simply replaced in the current CODE
- the match type, what we have found
Start and end are valid for every match and compatible with substring.
If for some reason we change a single value and its length changes, we can easily synchronize the Map by adding the difference between the new length and the old one to every other mapped object after the current one.
These properties could be superfluous but we want control for any kind of occasion.
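The synchronization just described can be sketched like this. The replaceValue helper is hypothetical (it is not in the JSMap source) and it assumes the map is already ordered by start, so that every entry after the given index also comes later in the code.

```javascript
// Hypothetical helper: change one value and shift what follows.
// Assumes the map is ordered by start.
function replaceValue(map, index, newValue) {
    var entry = map[index];
    var diff = newValue.length - entry.value.length;
    entry.value = newValue;
    entry.end += diff;
    for (var i = index + 1; i < map.length; ++i) {
        map[i].start += diff;
        map[i].end += diff;
    }
    return diff;
}

// A hand written map for the source: /*c*/x="s"
var map = [
    {start: 0, end: 5, value: "/*c*/", type: 4},
    {start: 5, end: 7, value: "x=", type: 1},
    {start: 7, end: 10, value: '"s"', type: 8}
];

// drop the comment: every following entry moves 5 chars back
replaceValue(map, 0, "");

var rebuilt = "";
for (var i = 0; i < map.length; ++i) rebuilt += map[i].value;
console.log(rebuilt, map[2].start, map[2].end);
```

After the call, start and end are again compatible with substring over the rebuilt source.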
An Ordered Map Without False Positives Collisions
The simplest way to reorder the map is a native Array.sort operation, perfect to understand who comes first!
Map.sort(function(a,b){
// negative, zero, or positive: a consistent comparator
return a.start - b.start;
});
The start point is the sort key and, thanks to Regular Expressions, we will rarely have the same start point twice. If we have one, we need to rethink the RegExp because it is not good enough since, for example, a comment cannot be a regexp and vice versa.
The native sort operation is hopefully fast enough to still guarantee better performance. All we need to do now is to add, if necessary, the code.
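While tuning the regular expressions, a quick check for entries sharing the same start can tell us which patterns need a rethink. This is a sketch with two hypothetical helpers, sortByStart and findSameStart, and a hand written map.

```javascript
// Order the map by start, who comes first.
function sortByStart(map) {
    return map.sort(function (a, b) {
        return a.start - b.start;
    });
}

// Hypothetical debugging helper: pairs of neighbours with the same
// start, a hint that two RegExps are claiming the same piece of code.
function findSameStart(map) {
    var clashes = [];
    for (var i = 1; i < map.length; ++i) {
        if (map[i - 1].start === map[i].start) {
            clashes.push([map[i - 1], map[i]]);
        }
    }
    return clashes;
}

var map = sortByStart([
    {start: 12, end: 20, type: 8},
    {start: 0, end: 9, type: 2},
    {start: 0, end: 9, type: 16}
]);
var clashes = findSameStart(map);

console.log(clashes.length);
```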
Cleaning Up And Adding Surrounding Code
Once we have mapped all we were looking for, we can consider CODE everything before, in between, or after our matches.
// type could NOT include CODE
// and if code starts with a comment, there
// is nothing to do ...
if((type & JSMap.CODE) && 0 < (i = Map[0].start))
Map.unshift({start:0, end:i, value:CODE.substring(0, i), type:JSMap.CODE})
;
// every other Mapped case will be ordered by start
// Accordingly, the first start we will encounter
// will be the valid one ... the first match
// will be considered valid by default (var a)
for(length = Map.length, i = 1, a = Map[0]; i < length; ++i){
b = Map[i];
switch(true){
// if there is a gap between the
// previous match (a) and the b one
// there MUST be code between
case a.end < b.start:
// let's add it if we need
// continue otherwise
if(type & JSMap.CODE){
Map.splice(i, 0, {start:a.end, end:b.start, value:CODE.substring(a.end, b.start), type:JSMap.CODE});
++length;
++i;
};
// if there is no gap or code
// has been inserted
// we can go on with the for loop
// considering the current b
// valid, assigned for this reason to a
case a.end === b.start:
a = b;
break;
// if b is neither after a gap nor adjacent,
// which means the next match starts
// before a.end and not after it,
// we have a false positive
default:
// remove this match in any case
Map.splice(i, 1);
--length;
--i;
break;
};
};
// if the last match ends before the string.length
// there must be another piece of code to add
if((type & JSMap.CODE) && (i = Map[i - 1].end) < CODE.length)
Map.push({start:i, end:CODE.length, value:CODE.substring(i, CODE.length), type:JSMap.CODE})
;
Finally, if there is no match but JSMap.CODE is part of the search, we can consider the entire string a piece of code:
Map.push({start:0, end:CODE.length, value:CODE, type:JSMap.CODE})
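The whole cleaning logic can be tried on the weird case with a simplified, stand-alone sketch. The resolve and entry functions are hypothetical and are not the JSMap code above: they always add CODE entries and use a running end instead of the switch, but the rule is the same, the first start wins and whatever starts inside an accepted match is a false positive.

```javascript
var CODE = 1, COMMENT_SL = 2, STRING = 8, REGEXP = 16;

// Build a map entry compatible with the objects described above.
function entry(source, start, end, type) {
    return {
        start: start, end: end,
        value: source.substring(start, end), type: type
    };
}

// Hypothetical, simplified resolver: sort, drop overlaps, fill gaps.
function resolve(source, matches) {
    var sorted = matches.slice().sort(function (a, b) {
        return a.start - b.start;
    });
    var result = [], end = 0, i, match;
    for (i = 0; i < sorted.length; ++i) {
        match = sorted[i];
        // starts inside the previously accepted match: false positive
        if (match.start < end) continue;
        // a gap before this match: there must be code in between
        if (end < match.start) {
            result.push(entry(source, end, match.start, CODE));
        }
        result.push(match);
        end = match.end;
    }
    // trailing code, or the whole source when nothing matched
    if (end < source.length) {
        result.push(entry(source, end, source.length, CODE));
    }
    return result;
}

var source = 'var theWeirdCase = /"[\'//*"]\'/;';
var map = resolve(source, [
    entry(source, 23, 31, COMMENT_SL), // "//" up to the end of the line
    entry(source, 20, 27, STRING),     // double quoted false positive
    entry(source, 22, 29, STRING),     // single quoted false positive
    entry(source, 19, 30, REGEXP)      // the real regexp literal
]);

for (var i = 0; i < map.length; ++i) {
    console.log(map[i].type, map[i].value);
}
```

Only three entries survive: the code before the regexp, the regexp itself, and the final semicolon.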
How To Use A Map
Almost the end of this tedious post. Once we have created a map of our code, we can reuse matches in any way we prefer.
This is a "paste and go" example that will let us test whatever code we want:
/**
* JSMap test case
*/
onload = function(){
document.body.appendChild(
document.createElement("textarea")
).onchange=function(){
var time = new Date,
Map = JSMap(this.value/*, JSMap.CODE|JSMap.COMMENT_ML|JSMap.COMMENT_SL|JSMap.REGEXP*/)
;
time = new Date - time;
if(!Map.map)Map.map=function(fn){for(var i = 0, length = this.length; i < length; ++i)this[i]=fn(this[i]);return this};
document.body.innerHTML = 'Mapped in ' + (time/1000) + ' seconds.<pre>' + Map.map(function(o){
var value = o.value.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
switch(o.type){
case JSMap.STRING:
value = '<span style="color:#66F">' + value + '</span>';
break;
case JSMap.COMMENT_SL:
value = '<span style="color:#999">' + value + '</span>';
break;
case JSMap.COMMENT_ML:
value = '<span style="color:#F66">' + value + '</span>';
break;
case JSMap.REGEXP:
value = '<span style="color:#F00">' + value + '</span>';
break;
case JSMap.E4X:
value = '<span style="color:#00F">' + value + '</span>';
break;
default:
// eventually we can parse
// the code with numbers
// spaces, keywords, methods
// and whatever we need
// that would be another Map
// specific for the language
// or simply a replacer via RegExp
break;
};
return value;
}).join("") + '</pre>';
};
};
Save it in an HTML page, paste some valid code into the textarea (disabling spell check if the source is massive), and click somewhere outside the area.
Conclusion
Even if the case is JavaScript and a JavaScript map, with this post we can hopefully better understand the problems behind generic code parsing.
What is missing in this post is a code parser able to highlight or understand every piece of code present in the map.
If we loop over the returned map it is possible to understand which entry is code and what comes next. If a variable is going to be assigned a regexp or a string, for example, the value will be found in the next object present in the map.
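As a sketch of that loop, assuming a hypothetical assignedLiterals helper and a hand written map (only type and value matter here): when a CODE entry ends with a name followed by a single =, the literal is the next entry.

```javascript
var CODE = 1, STRING = 8, REGEXP = 16;

// Hypothetical helper: find "name = <string or regexp>" pairs by
// looking at each CODE entry and at the entry that follows it.
function assignedLiterals(map) {
    var found = [], i, current, next, name;
    for (i = 0; i + 1 < map.length; ++i) {
        current = map[i];
        next = map[i + 1];
        if (current.type !== CODE || !(next.type & (STRING | REGEXP))) {
            continue;
        }
        // a name, optional spaces, one "=" and nothing else;
        // ==, !=, <= or += do not match this pattern
        name = /([\w$]+)\s*=\s*$/.exec(current.value);
        if (name) {
            found.push({name: name[1], value: next.value});
        }
    }
    return found;
}

// Map of: var re = /a+/g, s = "text"; call("arg");
var map = [
    {type: CODE, value: "var re = "},
    {type: REGEXP, value: "/a+/g"},
    {type: CODE, value: ", s = "},
    {type: STRING, value: '"text"'},
    {type: CODE, value: "; call("},
    {type: STRING, value: '"arg"'},
    {type: CODE, value: ");"}
];

var literals = assignedLiterals(map);
console.log(literals);
```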
I did not consider numbers 'cause these are simple to parse in a code portion, while they could slow down every other operation, since they could easily be spotted inside strings, comments, regexps, or E4X syntax.
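Once the map exists, numbers can be searched only where they make sense. This is a sketch with a hypothetical numbersInCode helper and a deliberately simple pattern (plain integers and decimals, no hex or exponents), applied to CODE entries only.

```javascript
var CODE = 1, STRING = 8;

// Deliberately simple: integers and decimals only.
var number = /\b\d+(?:\.\d+)?\b/g;

// Hypothetical helper: look for numbers inside CODE entries only,
// reporting positions relative to the whole source.
function numbersInCode(map) {
    var found = [], i, entry, exec;
    for (i = 0; i < map.length; ++i) {
        entry = map[i];
        if (entry.type !== CODE) continue;
        number.lastIndex = 0;
        while ((exec = number.exec(entry.value))) {
            found.push({start: entry.start + exec.index, value: exec[0]});
        }
    }
    return found;
}

// Map of: var n = 42 + "7 apples" + 1.5;
var map = [
    {start: 0, end: 13, value: "var n = 42 + ", type: CODE},
    {start: 13, end: 23, value: '"7 apples"', type: STRING},
    {start: 23, end: 30, value: " + 1.5;", type: CODE}
];

var numbers = numbersInCode(map);
console.log(numbers);
```

The 7 inside the string is never even considered, which is the whole point of mapping first.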
Considerations, techniques, and benchmarks are relative to this case but generally valid for any kind of purpose. CSS selectors, HTML, and other cases need to consider when Regular Expressions are worth it (e.g. when a char by char parser is not that fast because of the programming language used) and when a simple indexOf could be the best solution ever. I hope you enjoyed this post, I surely did writing it, since the topic is not that frequent and the techniques are extremely different (and trust me, I have shown at most 30% of what we could find out there).
Oooops, I almost forgot the JSMap source code!