Web Reflection: Literal Regular Expression Safe Regular Expression

... sorry for the redundant title, but that's exactly what this post is about ... after yesterday's explanation of the problem, logic, and solution for grabbing valid strings inside JS code, here I am with a literal RegExp able to grab literal RegExps in generic JavaScript code.

Why Not Just Add A "/" To The Strings RegExp

A comment gave me the hint to write this second post about RegExps. I am a bit short on time these days, but the answer is simple, though not obvious!
The differences between strings and literal regular expressions are basically these:

there must be at least one char between the slashes, or the parser will treat the literal RegExp as an inline comment //
the slash does NOT necessarily need to be escaped. If we have a slash inside a range such as [a/b], it won't break the RegExp and the slash will be considered just one valid char in that range
there could be zero or more chars after the closing slash: the flags i (ignore case), g (global, match all), and m (multi line)

The last point is not truly a problem, since the same suffix after a string breaks the code in any case:


function igm(){}
// SyntaxError: a string cannot be followed by flags or a call like this
var a = "string"igm();

But still, we need to understand the first couple of points.
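The points listed above can be checked directly. This is only a minimal sketch: parses is a hypothetical helper (not part of the original post) that uses the Function constructor purely to see whether a snippet is syntactically valid, without running it.

```javascript
// Hypothetical helper: true when `source` is a syntactically valid function body.
// The Function constructor compiles the text but does not execute it here.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (e) {
    return false;
  }
}

// Point 1: with nothing between the slashes, "//" starts a comment,
// so "var r =" is left without a value and the code does not parse.
var emptyLiteralParses = parses("var r = //;");          // false
var oneCharLiteralParses = parses("var r = /x/;");       // true

// Point 2: inside a range the slash does not need to be escaped.
var slashInRangeParses = parses("var r = /[a/b]/;");     // true
var rangeMatchesSlash = /[a/b]/.test("/");               // true

// Point 3 (for comparison): flags after a string are a syntax error anyway.
var flagsAfterStringParse = parses('var a = "string"igm;'); // false
```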

The RegExp Safe Regular Expression


// WebReflection Solution
/\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g

Since yesterday somebody pointed me to another solution within 10 seconds, I bet this will happen again. So far, though, I have tested the little monster above enough to say that it should work without problems, but obviously only if the code is valid; otherwise we don't even need to waste our time trying to parse it.
As an example, yesterday somebody told me: look, it does not work with this


a = \"string"

Well, an escaped char could be anywhere in the code but, again, these regular expressions are not code sanitizers, and such input is improbable in any case, since:


// tell me what do you expect and WHY!
a = "string\"
b = \"other"
\"
" c = what?!"

So any kind of weird combination won't work, but if the regular expression literal is valid, escaped or not escaped, the solution above should work like a charm.

Explanation

I won't go step by step through the entire RegExp this time: things are the same as described in my previous post, so please read that if you want to know more. The emulated look-behind pattern, (?=(\\?))\1. here, has been included in this regexp to skip groups of possible ranges present in the regexp. When a range is encountered, starting with the char "[", it is skipped till its end. If there is no end, theoretically the literal RegExp is broken and the code won't execute. The same strategy is used for the other case, where no [ is encountered: each char, escaped or not, is consumed as a single unit, and we go on as described in the other post. In this way we should be sure that, whatever the content, we'll find the end of the RegExp, trailing chars (flags) included. I did not spend too much time ensuring consistency for these flags, since "/whatever/ii" will be part of inconsistent code, which is a syntax parser problem and not mine.
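To see why the (?=(\\?))\1. unit matters, here is a small comparison. It is only a sketch: the names naive and escapeAware are mine, and escapeAware is a reduced form of the full expression (the branch without ranges), not the complete solution.

```javascript
// The JavaScript text being scanned is: var b = /\//i;
var source = "var b = /\\//i;";

// Naive attempt: lazily take any chars up to the next slash.
// It stops at the escaped slash, which is not the end of the literal.
var naive = /\/.+?\/[igm]*/;

// Reduced form of the post's expression: the lookahead captures an optional
// backslash, \1 consumes it, and "." consumes the char that follows,
// so "\/" is always swallowed as one unit.
var escapeAware = /\/(?:(?=(\\?))\1.)+?\/[igm]*/;

var naiveMatch = source.match(naive)[0];        // the text /\/   (cut short)
var awareMatch = source.match(escapeAware)[0];  // the text /\//i (whole literal)
```

The lookahead is never re-entered on backtracking, so once a backslash has been captured it cannot be split from the char it escapes.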

Test Cases


//comment <-- should not be matched at all
var a = /a/;
var b = /\//i;
var c = /[/]/;

I bet there are hundreds of RegExps or minifiers out there that fail with the last one, since even several editors have problems understanding what is going on.
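The test cases above can also be checked without a browser. This is a minimal sketch that assumes nothing beyond String.prototype.match; the variable names are mine.

```javascript
// The expression from this post, unchanged.
var literalRegExp = /\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g;

// The four test lines, written as JavaScript strings
// (so each backslash of the original code is doubled here).
var testCode = [
  "//comment <-- should not be matched at all",
  "var a = /a/;",
  "var b = /\\//i;",
  "var c = /[/]/;"
].join("\n");

// With the g flag, match returns every matched literal as a string.
var found = testCode.match(literalRegExp);
// found holds three entries: /a/ then /\//i then /[/]/
```

The comment line is skipped only because it contains no further slash before the line break; a comment that does contain one still needs the extra check mentioned in the P.S.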

The Test Case

This is the same code I posted yesterday, except the alert shows all arguments. I know I have used an empty replace, one whose result is discarded, which is bad practice, but it is good enough for testing purposes:


onload = function(){
  document.body.appendChild(
    document.createElement("textarea")
  ).onchange=function(){
    this.value.replace(
      // WebReflection Solution Test
      /\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g,
      function(){
        alert([].slice.call(arguments).join("\n"));
      }
    );
  };
};

Please let me know if you find a better solution, or any gotcha via the test case, considering that arguments[0] should be exactly the matched RegExp, thanks.

P.S. about the inline comment: it's not worth avoiding that case inside the RegExp, for two reasons. First, we can always test that match.charAt(1) !== "/". Second, the problem is still: who comes first? If we have a string inside a regexp, or vice versa, there is no way to exclude these cases in a single, reasonable RegExp. As I have said, as soon as I find some time, I will explain how to create a proper function able to manage all JavaScript cases, stay tuned!
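The charAt check from the P.S. could be applied like this. It is a sketch under assumptions: findRegExpLiterals is a hypothetical helper name, and it only drops matches that begin with "//", it does not solve the "who comes first" problem.

```javascript
// The expression from this post, unchanged.
var literalRegExp = /\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g;

// Hypothetical helper: collect candidate literals, then discard those
// whose second char is a slash, since they start an inline comment.
function findRegExpLiterals(code) {
  var matches = code.match(literalRegExp) || [];
  return matches.filter(function (match) {
    return match.charAt(1) !== "/";
  });
}

// A comment containing a slash is matched by the bare expression...
var sample = "// a/b is just a comment\nvar r = /x/g;";
var rawMatches = sample.match(literalRegExp);   // "// a/" and "/x/g"

// ...and removed by the charAt(1) check.
var literals = findRegExpLiterals(sample);      // only "/x/g"
```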

6 comments:

abozhilov, 12 November, 2009 18:59
Again you don't know what are you talking about.

var b = 10, g = 1;
var c = 20
/b/g; // <= division

var b = 10, g = 1;
var c = 20;
/b/g; // <= regular expression

Regular expression literals strongly depend on context. With a simple \/ you can't catch the start of a regular expression literal. You must write a better parser than the one presented here.

whole(true) /b/g; //<= division
while(true) /b/g; //<= regular expression
Andrea Giammarchi, 13 November, 2009 00:18
false positive ... expected behavior ... now please find a programmer able to write that code, and fire him ... AH AH AH !!!
Andrea Giammarchi, 13 November, 2009 00:31
P.S.
Again you don't know what are you talking about.
sure my favorite wannabe, teach me ;)

You think you can comment by default, right? Nobody is filtering here, right? Nite nite troll!
Andrea Giammarchi, 13 November, 2009 00:52
false positive ... expected behavior
what I mean is that you have demonstrated the RegExp work (AS IS)

You need a syntax parser to understand code and a RegExp will never be able to understand code syntax flow. If interested in asshole programmers syntax, unless this comment was not just to show off how ignorant you are, you can have a look into Narcissus which is a JavaScript tokenizer in JavaScript.

In any case, please enlighten us with your solution, but what I can do is trying to make the RegExp better (but still, where is the solution troll?)

Have fun with studies and netiquette, and I'd love to know your name, dear wannabe, mine is public ;)
Anonymous, 08 August, 2010 07:11
You know, I can find a lot of interesting information in your posts, and I think all do. But, it's a pity that "It is impossible for a man to learn what he thinks he already knows."
Anonymous, 09 August, 2010 03:13
“The way to get good ideas is to get lots of ideas, and throw the bad ones away” You have definitely thrown the worst away!


Wednesday, November 11, 2009
