My JavaScript book is out! Don't miss the opportunity to upgrade your beginner or average dev skills.

Tuesday, November 10, 2009

String Escape Safe Regular Expression

I should have probably investigated more but apparently I did it ... the most problematic I've encountered so far with JavaScript RegExp seems to be solved!

Update


Indeed, I should have investigated ... I just like to find solutions by my own. I am not surprised somebody already investigated this classic parsing problem.
Steve talked about it a year ago, using the lookbehind missed feature I talked in this post.
Above post has much more details than mine (and much more Edits as well).
The good part I am happy about is that both me and Steve came out with basically the same solution, but His one is definitively more compact:

// Steves Levithan compact solution
/(["'])(?:(?=(\\?))\2.)*?\1/g

The assumption of above regexp is that if there is a char followed by an escape one, there must be another char that cannot be the initial single or double quote, being the latter one outside the second uncaptured part, and after a non greedy operation.
If the second condition, \2, does not exist, the dot "." will pass the current char, no escape found, performing the char by char parsing I have described in my solution.
The dot is my [^\\], the double escape is represented by "\2.", as is for the escape plus whatever else that is not the end of the string, equivalent of my [\\(?=\1)]\1
I don't want to edit lots ot times this post, and I'll leave it as is to let you understand the problem, the logic, and the solution.
The only thing I would like to check are performances, since my less compact solution should be theoretically faster for common strings where the escape char is not present while Steve one will try to look for the escape plus will assign the possible missed match plus will pass whatever else char after, if any, considering outside there is a "break", and all these operations for whatever length, and still a char by char operation.
Whatever will be, we know we have at least two alternatives, and both mine and Steves one should be cross browser.


A Bit Of History

In all these years of programming with different languages, I have created dunno how many code parsers. WebReflection itself is using one of these parsers to highlight my sources. My good old PHP Comments Remover (2005 though ...) used another code parser. MyMin project used another one as well ... in few words, in my programming history I don't know how many times I had to deal with sources. The strategy I have always adopted, specially for JavaScript, is the char by char parser. The reason is simple, I have never created or found a good regular expression able to threat this case:

var code1 = "this is some \"test\"\\";
var code2 = "and this is \"anot\\her\" one!";

Above code, managed as a string, will become a stringe like:
"var code1 = \"this is some \\\"test\\\"\\\\";
var code2 = \"and this is \\\"anot\\her\\\" one!\";"
And if you know Regular Expressions, you know why this case is not that simple to manage isn't it?
Well, right now I was forking a project with a massive usage of Regular Expressions for CSS selectors and I could not avoid to notice the classical wrong match to manage strings:

/['"]([^'"]*?)['"]/g

Above match is almost a non-sense. If we have a string such "told'ya!" that RegExp will match told', leaving "ya!" out of the game. To make it a bit better the classic procedure is this one:

/(['"])([^\1]*?)\1/g

Whit above RegExp we are looking for quote or double quote char and we are searching the next one being sure if the first match is a single quote, the string will finish with a single one, and viceversa. There is still the problem that if we have the first matched quote or double quote and an escaped one in the middle of the string, that regular expression will truncate again the latter one giving us a untrustable result.


Why It Is More Difficult Via JavaScript

Regular Expressions in JavaScript miss at least one of most common features in PCRE world: the look-behind assertion!
Fortunately, we have an helpful Backreferences able in some case to slow down the match, but often the only or best way we have to create more clever matches!

The String Escape Safe Regular Expression


// WebReflection Solution
/(['"])((?:[^\\]|\\{2}|[\\(?=\1)]\1|[\\(?!\1)])*?)\1/g

I am not sure above little monster is the best RegExp you can find for this problem, and JavaScript features, what I am sure about, is that I have done dozen of tests and results seems to be perfect: Hooray!!!
If you are not familiar with RegExp, please let me try to explain what's going on there:

/
// look for a single or a double quote char
// this will be referenced as \1 in the rest of the regexp
// in order to completely ignore the other one
(['"])

// the second match is performed over the string
// that could be empty, or it could contain
// any character included the first match, if escaped
(

// the second match will be a char by char parser
// the only character we are worried about
// is the one able to escape the first match
(
?: // we are not interested about next capture
// since the only scary char is the escape
// but it is not necessary present
// (let's say is less present than any other)
// speed up the RegExp validating every char
// but the escape ... these are all good!
[^\\]
|
// if we encounter an escape char and this
// is escaping itself we can skip 2 chars
\\{2}
|
// alternatively, we could have
// an escaped match (current one: single or double)
// in this case we want to be sure that the escape
// is for the matched char and not just an escape
[\\(?=\1)]\1
|
// we need to validate whatever else has been
// escaped as well so if the escape char is
// NOT followed by the initial match or
// another escape char it's ok
// and we go on with next char
[\\(?!\1)]

// precedent cases should be performed for each
// encountered char but these cannot be greedy
// otherwise we risk to wrap the full string
// var a = "a", b = "b";
// 'a", b = "b' <-- greedy!
)*?
)

// to make precedent assumptions valid
// we need to be sure the string terminates
// with the initial matched char
\1
/g


That's pretty much it, if we use match method, replace, or exec, the matched[1] or RegExp.$1 will be the char used to encapsulate the string, single or double quote, while matched[2] or RegExp.$2 will contain the string itself.

In Any Case It Is Still Not Perfect

If we consider JavaScript regular expressions, same stuff used to solve the problem, we'll have another one.

var re = /ooo"yeah/;
var s = "no way";

In above example there will be some problem since the double quote inside the regular expression will be matched like a charm with my suggestion.
This is the reason we still need char by char parsers but hey ... I was trying to parse some selector and the usage of @test="case" which is even apparently not standard, so bear in mind we cannot use this RegExp unless the code won't contain literal regexps.
What is the trap here? That char by char a part, it's quite impossible to decide who comes first, "the slash or the quote"?

Quick And Dirty Solution Tester

With this code it should be simple to copy and paste some valid source to read parse after parse what is OK and what is not:

onload = function(){
document.body.appendChild(
document.createElement("textarea")
).onchange=function(){
this.value.replace(
// WebReflection Solution Test
/(['"])((?:[^\\]|\\{2}|[\\(?=\1)]\1|[\\(?!\1)])*?)\1/g,
function(){
alert([arguments[1], arguments[2]].join("\n"));
}
);
};
};


Please share whatever problem you'll find with such Regular Expression or suggest me a better faster approach to solve this problem with same test cases, thanks.

8 comments:

  1. Question:
    Can't we take as assumpion that every character past the '\' character is escaped?
    because if the assumpion is true, then the regex is more simpler:

    /(['"])((\\.)|((?!\1).))+\1/g

    take a look and give me feedback if you have case against this one ;)

    ReplyDelete
  2. I caught the second one but not the first one, thanks for the test case :)

    There are others that are difficult to catch....like e4x

    <>//not a comment</>;
    <>/* still not a comment*/</>

    ReplyDelete
  3. Just wondering: if your regex is capable of handling single or double quote strings with single or double quotes inside them (e.g. "here 'tis"), would it be that hard to add support for /hmm"yeah/ ? It's the same pattern, just different enclosing characters

    ReplyDelete
  4. @garethheyes unfortunately that is another difficult case for regexp.
    The problem is again "who comes first"?
    Unless we are not parsin char by char we'll never know (or better, we need to control the matched position).

    In JS we have 3 things to leave there, 2 to remove, the rest eventually to minify/munge.

    For each piece of code and without a char by char parser we should use 5 regexp:
    - this one for strings
    - another one for literal regexp
    - another one for e4x
    - one for inline comments
    - one for multi-line comments
    We need to understand which one comes first in order to avoid stripped comments from strings, strings inside comments, regexp inside strings, etc etc ... the first one, if the code is valid, is the right one.
    To understand if the code is valid all we need is a try catch over Function(code), which will create a function body, without executing it, so that is the simplest part.
    As summary, one function body, 5 robust regexps, a position mapper clever enough to create a map of the code removing inconsistent matches. Everything should be performed without modify the initial string but remembering the position ... anyway it's quite confusing to explain here, if I'll find some time, I will post about a JavScript code mapper ;)

    ReplyDelete
  5. @Fake51 theoretically it is the same pattern, problems come with dirty regexp, understood by JavaScript char2char parsers, but difficult to match.
    I am talking about this:
    var validRe = /[/]/;
    You don't necessarily need to escape characters inside a range of chars (square brackets [..range...])

    That is the most problematic thing for a single regexp

    ReplyDelete
  6. @Andrea

    And I fixed that first one ;)

    strings = new RegExp("(?:(?:['](?:\\\\{2}|\\\\[']|\\\\[\\r\\n]|[^'])*['])|(?:[\"](?:\\\\{2}|\\\\[\"]|\\\\[\\r\\n]|[^\"])*[\"]))")

    I can't use \1 etc

    ReplyDelete
  7. I remember that once I've worked on this problem in three different languages - PHP, JavaScript and Python. Only in PHP I was able to do it sensibly using regular expressions as PHP regex engine supports atomic expressions.
    Unfortunately Python and JavaScript (with no atomic regex solution) was to extent slow on large strings :/

    ReplyDelete
  8. Great information! I’ve been looking for something like this for a while now. Thanks!

    ReplyDelete

Note: Only a member of this blog may post a comment.