Web Reflection: escape

... sorry for the redundant title but that's exactly what is this post about ... after yesterday explanation about problem, logic, and solution, to grab valid strings inside JS code, here I am with the literal RegExp able to grab literals RegExps in a generic JavaScript code.

Why Do Not Add Just A "/" Into Other Strings RegExp

One comment gave me the hint to write this second post about RegExps. While time is a bit over during days, this answer is simple, but not obvious!
Differences between strings and literal regular expressions are basically these:

there must be at least one char, or the parser will consider the literal RegExp an inline comment //
the slash does NOT necessary need to be escaped. If we have a slash inside a range [a/b] the latter one won't break the RegExp and the slash will be considered just one valid char in that range
there could be one or more chars after, where i(ignore case), g(match all), and m(multi line) can be present one or more times

Latter point is not truly a problem since this syntax will break the code in any case:


function igm();
var a = "string"igm();

But still, we need to understand first couple of points.

The RegExp Safe Regular Expression


// WebReflection Solution
/\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g

Since yesterday after 10 seconds somebody pointed me another solution, I bet this will happen again but so far I have tested above little monster enough to say that should work without problems but obviously only if the code is valid, otherwise we don't even need to waste our time trying to parse it.
As example, yesterday somebody told me:look, it does not work with this


a = \"string"

Well, now consider that an escaped char could be everywhere in the code but again, these regular expressions are not code sanitizer, n any case improbable since:


// tell me what do you expect and WHY!
a = "string\"
b = \"other"
\"
" c = what?!"

So any kind of weird combination wont work but if the regular expression is valid, escaped or not escaped, the precedent solution should work like a charm.

Explanation

I won't go step by step for the entire RegExp this time, things are the same described in my precedent post so please read there if you want to know more. The emulated look-behind pattern has been included in this regexp to skip groupd of possible ranges present in the regexp. When a range is encountered, starting with char "[", it is skipped till the end. If there is no end theoretically the literal RegExp is broken and the code won't execute. Same strategy is used for the other case, where no [ is encountered, if there is a char followed by a slash, we go on as described in the other post. In this way we should be sure that whatever will be, we'll find the end of the RegExp included chars. I did not spend too much time ensuring consistency for these flags since "/whatever/ii" will be part of inconsistent code which is a syntax parser problem, and not mine.

Test Cases


//comment <-- should not be matched at all
var a = /a/;
var b = /\//i;
var c = /[/]/;

I bet there are hundreds of RegExp or minifier out there able to fail with the latest one, since even different Editors have problems trying to understand what is going on.

The Test Case

Same code I have posted yesterday, except the alert will be for all arguments. I know I have used an empty replace, which is a bad practice, but that was good enough for test purpose:


onload = function(){
  document.body.appendChild(
    document.createElement("textarea")
  ).onchange=function(){
    this.value.replace(
      // WebReflection Solution Test
      /\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g,
      function(){
        alert([].slice.call(arguments).join("\n"));
      }
    );
  };
};

Please let me know if you find a better solution or whatever gotcha via the test case, considering that arguments[0] should be exactly the matched RegExp, thanks.

P.S. about the inline comment, it's not worth it to avoid that case for two reasons: we can always test that match.charAt(1) !== "/" plus the problem is still: who comes first? If we have a string inside a regexp or vice-versa there is no way to exclude these cases in a single, reasonable, RegExp. As I have said, as soon as I'll find some time, I will explain how to create a proper function able to manage all JavaScript cases, stay tuned!

I should have probably investigated more but apparently I did it ... the most problematic I've encountered so far with JavaScript RegExp seems to be solved!

Update

Indeed, I should have investigated ... I just like to find solutions by my own. I am not surprised somebody already investigated this classic parsing problem.
Steve talked about it a year ago, using the lookbehind missed feature I talked in this post.
Above post has much more details than mine (and much more Edits as well).
The good part I am happy about is that both me and Steve came out with basically the same solution, but His one is definitively more compact:


// Steves Levithan compact solution
/(["'])(?:(?=(\\?))\2.)*?\1/g

The assumption of above regexp is that if there is a char followed by an escape one, there must be another char that cannot be the initial single or double quote, being the latter one outside the second uncaptured part, and after a non greedy operation.
If the second condition, \2, does not exist, the dot "." will pass the current char, no escape found, performing the char by char parsing I have described in my solution.
The dot is my [^\\], the double escape is represented by "\2.", as is for the escape plus whatever else that is not the end of the string, equivalent of my [\\(?=\1)]\1
I don't want to edit lots ot times this post, and I'll leave it as is to let you understand the problem, the logic, and the solution.
The only thing I would like to check are performances, since my less compact solution should be theoretically faster for common strings where the escape char is not present while Steve one will try to look for the escape plus will assign the possible missed match plus will pass whatever else char after, if any, considering outside there is a "break", and all these operations for whatever length, and still a char by char operation.
Whatever will be, we know we have at least two alternatives, and both mine and Steves one should be cross browser.

A Bit Of History

In all these years of programming with different languages, I have created dunno how many code parsers. WebReflection itself is using one of these parsers to highlight my sources. My good old PHP Comments Remover (2005 though ...) used another code parser. MyMin project used another one as well ... in few words, in my programming history I don't know how many times I had to deal with sources. The strategy I have always adopted, specially for JavaScript, is the char by char parser. The reason is simple, I have never created or found a good regular expression able to threat this case:


var code1 = "this is some \"test\"\\";
var code2 = "and this is \"anot\\her\" one!";

Above code, managed as a string, will become a stringe like:
"var code1 = \"this is some \\\"test\\\"\\\\";
var code2 = \"and this is \\\"anot\\her\\\" one!\";"
And if you know Regular Expressions, you know why this case is not that simple to manage isn't it?
Well, right now I was forking a project with a massive usage of Regular Expressions for CSS selectors and I could not avoid to notice the classical wrong match to manage strings:


/['"]([^'"]*?)['"]/g

Above match is almost a non-sense. If we have a string such "told'ya!" that RegExp will match told', leaving "ya!" out of the game. To make it a bit better the classic procedure is this one:


/(['"])([^\1]*?)\1/g

Whit above RegExp we are looking for quote or double quote char and we are searching the next one being sure if the first match is a single quote, the string will finish with a single one, and viceversa. There is still the problem that if we have the first matched quote or double quote and an escaped one in the middle of the string, that regular expression will truncate again the latter one giving us a untrustable result.

Why It Is More Difficult Via JavaScript

Regular Expressions in JavaScript miss at least one of most common features in PCRE world: the look-behind assertion!
Fortunately, we have an helpful Backreferences able in some case to slow down the match, but often the only or best way we have to create more clever matches!

The String Escape Safe Regular Expression


// WebReflection Solution
/(['"])((?:[^\\]|\\{2}|[\\(?=\1)]\1|[\\(?!\1)])*?)\1/g

I am not sure above little monster is the best RegExp you can find for this problem, and JavaScript features, what I am sure about, is that I have done dozen of tests and results seems to be perfect: Hooray!!!
If you are not familiar with RegExp, please let me try to explain what's going on there:


/
  // look for a single or a double quote char
  // this will be referenced as \1 in the rest of the regexp
  // in order to completely ignore the other one
  (['"])

  // the second match is performed over the string
  // that could be empty, or it could contain
  // any character included the first match, if escaped
  (

    // the second match will be a char by char parser
    // the only character we are worried about
    // is the one able to escape the first match
    (
      ?: // we are not interested about next capture
      // since the only scary char is the escape
      // but it is not necessary present
      // (let's say is less present than any other)
      // speed up the RegExp validating every char
      // but the escape ... these are all good!
      [^\\]
      |
      // if we encounter an escape char and this
      // is escaping itself we can skip 2 chars
      \\{2}
      |
      // alternatively, we could have
      // an escaped match (current one: single or double)
      // in this case we want to be sure that the escape
      // is for the matched char and not just an escape
      [\\(?=\1)]\1
      |
      // we need to validate whatever else has been
      // escaped as well so if the escape char is
      // NOT followed by the initial match or
      // another escape char it's ok
      // and we go on with next char
      [\\(?!\1)]

    // precedent cases should be performed for each
    // encountered char but these cannot be greedy
    // otherwise we risk to wrap the full string
    // var a = "a", b = "b";
    // 'a", b = "b' <-- greedy!
    )*?
  )

  // to make precedent assumptions valid
  // we need to be sure the string terminates
  // with the initial matched char
  \1
/g

That's pretty much it, if we use match method, replace, or exec, the matched[1] or RegExp.$1 will be the char used to encapsulate the string, single or double quote, while matched[2] or RegExp.$2 will contain the string itself.

In Any Case It Is Still Not Perfect

If we consider JavaScript regular expressions, same stuff used to solve the problem, we'll have another one.


var re = /ooo"yeah/;
var s = "no way";

In above example there will be some problem since the double quote inside the regular expression will be matched like a charm with my suggestion.
This is the reason we still need char by char parsers but hey ... I was trying to parse some selector and the usage of @test="case" which is even apparently not standard, so bear in mind we cannot use this RegExp unless the code won't contain literal regexps.
What is the trap here? That char by char a part, it's quite impossible to decide who comes first, "the slash or the quote"?

Quick And Dirty Solution Tester

With this code it should be simple to copy and paste some valid source to read parse after parse what is OK and what is not:


onload = function(){
  document.body.appendChild(
    document.createElement("textarea")
  ).onchange=function(){
    this.value.replace(
      // WebReflection Solution Test
      /(['"])((?:[^\\]|\\{2}|[\\(?=\1)]\1|[\\(?!\1)])*?)\1/g,
      function(){
        alert([arguments[1], arguments[2]].join("\n"));
      }
    );
  };
};

Please share whatever problem you'll find with such Regular Expression or suggest me a better faster approach to solve this problem with same test cases, thanks.

Wednesday, November 11, 2009

Literal Regular Expression Safe Regular Expression