[libcxx] Multiline regular expression matching

Jonathan Sauer
Hello,

N3242 is silent on the issue of multiline regular expression matching, i.e. whether ^ and $ match only at the
beginning and end of the string, respectively, or whether they also match at occurrences of \n or \r inside the
string. The only control it offers is to turn off matching at the beginning and end of the string (via
match_not_bol and match_not_eol, respectively). ECMAScript, on whose regular expressions the C++0x regex library
is based (among others), provides an additional flag in the RegExp constructor to turn on multiline matching
(see also <http://www.regular-expressions.info/anchors.html> for more information on multiline matching).
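
As an illustration of the two flags named above, here is a minimal sketch. The helper name `found` is hypothetical and not part of any library. The inputs contain no line endings, so the result does not depend on how an implementation treats \n or \r.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` is found anywhere in `text`
// under the given match flags.
bool found(const std::string& text, const std::string& pattern,
           std::regex_constants::match_flag_type flags =
               std::regex_constants::match_default)
{
    const std::regex re(pattern);  // ECMAScript grammar is the default
    return std::regex_search(text, re, flags);
}
```

With the default flags, found("abc", "^abc") and found("abc", "abc$") both succeed. Passing std::regex_constants::match_not_bol makes the first fail, and std::regex_constants::match_not_eol makes the second fail.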

I looked into previous standard committee documents about regular expressions, but was unable to find anything
regarding this issue.

I then tried my hand at a workaround using a non-captured disjunction in the following test program, using
libc++ trunk:

// /opt/bin/clang -std=c++0x -stdlib=libc++ -lc++ clang.cpp
#include <cstdio>
#include <regex>
#include <string>

static const std::regex INCLUDE_REGEXP("(?:^|[\\n\\r])#include\\s*<([^>]+)>");

static const std::string s =
        "attribute vec3 vertexUV0;\n"
        "#include <shaders/include/Lighting.glsl>\n"
        "#include <shaders/include/ProjectTextureOnCube.glsl>\n"
        "uniform mat4 mvp;\n";


int main(int, char**)
{
        std::sregex_iterator it(s.begin(), s.end(), INCLUDE_REGEXP);
        std::sregex_iterator const end;
        if (it == end)
        {
                std::printf("Not found\n");
        }
        else
        {
                while (it != end)
                {
                        std::printf("Found '%s [%s]'\n", it->str().c_str(), it->str(1).c_str());
                        ++it;
                }
        }
}


This resulted in the output

        "Not found".

Exchanging the disjunction's alternatives ("(?:^|[\\n\\r])" => "(?:[\\n\\r]|^)"), resulted in a (seemingly)
endless stream of
        Found ' []'
        Found ' []'
        Found ' []'
        Found ' []'
        ...

Removing the disjunction results in two matches (as expected):
        Found '#include <shaders/include/Lighting.glsl> [shaders/include/Lighting.glsl]'
        Found '#include <shaders/include/ProjectTextureOnCube.glsl> [shaders/include/ProjectTextureOnCube.glsl]'


From my reading of the ECMAScript standard (<http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-262.pdf>),
the regular expressions above are (at least syntactically) valid. So I have the following questions:

- Is libc++'s current behaviour a bug?
- Is there another, simpler way to perform multiline matching using std::regex?


Thanks in advance,
Jonathan


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev

Re: [libcxx] Multiline regular expression matching

Howard Hinnant
On Mar 23, 2011, at 9:42 PM, Jonathan Sauer wrote:

> - Is libc++'s current behaviour a bug?

I believe it is a libc++ bug.  I've committed a fix in revision 128350.

> - Is there another, simpler way to perform multiline matching using std::regex?

Your way looks as good as any to me.
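
For reference, a minimal sketch of the disjunction workaround wrapped in a helper. The name `include_paths` is hypothetical and not part of any library, and the sketch assumes a standard library in which the disjunction is handled correctly (for libc++, one that contains the fix mentioned above).

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the path of every #include <...> directive
// that starts at the beginning of the text or right after a \n or \r.
std::vector<std::string> include_paths(const std::string& text)
{
    // Same expression as in the original test program.
    static const std::regex re("(?:^|[\\n\\r])#include\\s*<([^>]+)>");

    std::vector<std::string> paths;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it)
    {
        paths.push_back(it->str(1));  // capture group 1: the path
    }
    return paths;
}
```

Note that the whole match, it->str(), includes the preceding line ending whenever the second alternative is the one that matched; only capture group 1 is free of it.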

Thanks for bringing this to our attention.

-Howard


Re: [libcxx] Multiline regular expression matching

Jonathan Sauer
Hello,

>> - Is libc++'s current behaviour a bug?
>
> I believe it is a libc++ bug.  I've committed a fix revision 128350.

It works now as expected. Thank you!

To avoid capturing the previous line ending, I changed the expression slightly,
using a lookahead assertion (written as a C string, hence the double escapes):
"(?=^|[\\n\\r])#include\\s*<([^>]+)>". This also worked.
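
A lookahead consumes no characters, so inside (?=^|[\n\r]) the [\n\r] alternative can only succeed where the next character is a line ending, and # cannot follow at that same position. The lookahead form therefore relies on ^ itself matching at line starts, which may differ between implementations. As a sketch of an alternative that does not depend on this, the directive can be wrapped in its own capture group and read from there. The helper name `include_directives` is hypothetical, and this is a swapped-in technique rather than the expression used above.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns each #include directive without the line
// ending that precedes it, by reading capture group 1 instead of the
// whole match.
std::vector<std::string> include_directives(const std::string& text)
{
    // Group 1: the directive itself; group 2: the path between < and >.
    static const std::regex re("(?:^|[\\n\\r])(#include\\s*<([^>]+)>)");

    std::vector<std::string> directives;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it)
    {
        directives.push_back(it->str(1));
    }
    return directives;
}
```

If the position of the directive is needed as well, it->position(1) gives the offset of group 1 within the searched text.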


Jonathan

