Regular expression to detect semi-colon terminated C++ for & while loops

In my Python application, I need to write a regular expression that matches a C++ for or while loop that has been terminated with a semi-colon (;). For example, it should match this:

for (int i = 0; i < 10; i++);

... but not this:

for (int i = 0; i < 10; i++)

This looks trivial at first glance, until you realise that the text between the opening and closing parentheses may itself contain other parentheses, for example:

for (int i = funcA(); i < funcB(); i++);

I'm using Python's re module. Right now my regular expression looks like this (I've left my comments in so you can understand it more easily):

# match any line that begins with a "for" or "while" statement:
^\s*(for|while)\s*
\(  # match the initial opening parenthesis
    # Now make a named group 'balanced' which matches a balanced substring.
    (?P<balanced>
        # A balanced substring is either something that is not a parenthesis:
        [^()]
        | # …or a parenthesised string:
        \( # A parenthesised string begins with an opening parenthesis
            (?P=balanced)* # …followed by a sequence of balanced substrings
        \) # …and ends with a closing parenthesis
    )*  # Look for a sequence of balanced substrings
\)  # Finally, the outer closing parenthesis.
# must end with a semi-colon to match:
\s*;\s*

This works perfectly for all the above cases, but it breaks as soon as the third part of the for loop contains a function call, like so:

for (int i = 0; i < 10; doSomethingTo(i));

I think it breaks because as soon as you put some text between the opening and closing parentheses, the "balanced" group matches that contained text, and the (?P=balanced) part then no longer matches (because the text inside the parentheses is different).
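
The behaviour described above can be seen in isolation. This is a minimal sketch using a hypothetical two-character pattern (not the regex from the question): in Python's re module, (?P=name) is a backreference that matches the same text the named group captured, rather than re-applying the group's pattern.

```python
import re

# (?P=ch) must repeat the exact text captured by group 'ch';
# it does not mean "match [ab] again".
pair = re.compile(r"(?P<ch>[ab])(?P=ch)")

same_text = pair.fullmatch("aa") is not None       # second char repeats the capture
different_text = pair.fullmatch("ab") is not None  # 'b' fits [ab] but is not 'a'

print(same_text, different_text)  # True False
```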

In my Python code I'm using the VERBOSE and MULTILINE flags, and creating the regular expression like so:

import re

REGEX_STR = r"""# match any line that begins with a "for" or "while" statement:
^\s*(for|while)\s*
\(  # match the initial opening parenthesis
    # Now make a named group 'balanced' which matches
    # a balanced substring.
    (?P<balanced>
        # A balanced substring is either something that is not a parenthesis:
        [^()]
        | # …or a parenthesised string:
        \( # A parenthesised string begins with an opening parenthesis
            (?P=balanced)* # …followed by a sequence of balanced substrings
        \) # …and ends with a closing parenthesis
    )*  # Look for a sequence of balanced substrings
\)  # Finally, the outer closing parenthesis.
# must end with a semi-colon to match:
\s*;\s*"""

REGEX_OBJ = re.compile(REGEX_STR, re.MULTILINE | re.VERBOSE)

Can anyone suggest an improvement to this regular expression? It's getting too complicated for me to get my head around.


Asked by: Maddie261 | Posted: 06-12-2021






Answer 1

You could write a little, very simple routine that does it, without using a regular expression:

  • Set a position counter pos so that it points to just before the opening bracket after your for or while.
  • Set an open brackets counter openBr to 0.
  • Now keep incrementing pos, reading the characters at the respective positions, and increment openBr when you see an opening bracket, and decrement it when you see a closing bracket. That will increment it once at the beginning, for the first opening bracket in "for (", increment and decrement some more for some brackets in between, and set it back to 0 when your for bracket closes.
  • So, stop when openBr is 0 again.

The stopping position is the closing bracket of your for(...). Now you can check whether a semicolon follows it or not.
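
The steps above could be sketched in Python roughly as follows. The function name is hypothetical, and a small regex is used only to locate the for or while keyword; the bracket matching itself is done by the counter. It assumes the whole loop header sits on one line and makes no attempt to handle string literals or comments.

```python
import re

def loop_ends_with_semicolon(line):
    """Return True if a for/while header on this line is followed by ';'."""
    header = re.match(r"\s*(for|while)\s*\(", line)
    if header is None:
        return False
    pos = header.end() - 1  # index of the opening bracket
    open_br = 0
    while pos < len(line):
        ch = line[pos]
        if ch == "(":
            open_br += 1
        elif ch == ")":
            open_br -= 1
            if open_br == 0:
                break  # pos now points at the closing bracket
        pos += 1
    else:
        return False  # ran off the end: brackets never balanced
    # Only whitespace may sit between the closing bracket and the semicolon.
    return line[pos + 1:].lstrip().startswith(";")

print(loop_ends_with_semicolon("for (int i = 0; i < 10; doSomethingTo(i));"))  # True
print(loop_ends_with_semicolon("for (int i = 0; i < 10; i++)"))                # False
```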

Answered by: Brooke404 | Posted: 07-01-2022



Answer 2

This is the kind of thing you really shouldn't do with a regular expression. Just parse the string one character at a time, keeping track of opening/closing parentheses.

If this is all you're looking for, you definitely don't need a full-blown C++ grammar lexer/parser. If you want practice, you can write a little recursive-descent parser, but even that's a bit much for just matching parentheses.

Answered by: Madaline710 | Posted: 07-01-2022



Answer 3

This is a great example of using the wrong tool for the job. Regular expressions do not handle arbitrarily nested sub-matches very well. What you should do instead is use a real lexer and parser (a grammar for C++ should be easy to find) and look for unexpectedly empty loop bodies.

Answered by: Kate593 | Posted: 07-01-2022



Answer 4

Try this regexp

^\s*(for|while)\s*
\(
(?P<balanced>
[^()]*
|
(?P=balanced)
\)
\s*;\s

I removed the wrapping \( \) around (?P=balanced) and moved the * to directly after the non-parenthesis character class. I have had this work with Boost.Xpressive, and rechecked its website to refresh my memory.

Answered by: Daisy973 | Posted: 07-01-2022



Answer 5

I wouldn't even pay attention to the contents of the parens.

Just match any line that starts with for and ends with a semi-colon:

^\t*for.+;$

Unless you've got for statements split over multiple lines, that should work fine.
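
A quick way to try this pattern from Python is sketched below; the variable names and sample lines are made up for illustration. Note that the third sample line, a loop with its body on the same line, is also reported, since the pattern only looks at how the line starts and ends.

```python
import re

# MULTILINE makes ^ and $ match at the start and end of every line.
SIMPLE_FOR = re.compile(r"^\t*for.+;$", re.MULTILINE)

source = (
    "for (int i = 0; i < 10; i++);\n"
    "for (int i = 0; i < 10; i++)\n"
    "\tfor (int j = 0; j < n; j++) total += j;\n"
)

hits = [m.group(0) for m in SIMPLE_FOR.finditer(source)]
print(len(hits))  # 2: the first line and the third line
```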

Answered by: Elise481 | Posted: 07-01-2022



Answer 6

I don't know that regex would handle something like that very well. Try something like this

line = line.Trim();
if(line.StartsWith("for") && line.EndsWith(";")){
    //your code here
}
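
Since the question is about Python, the same check might look like this there. The function name is hypothetical, and like the snippet above it only inspects the start and end of the line.

```python
def looks_like_terminated_for(line):
    """True if the trimmed line starts with 'for' and ends with ';'."""
    stripped = line.strip()
    return stripped.startswith("for") and stripped.endswith(";")

print(looks_like_terminated_for("  for (int i = 0; i < 10; i++);  "))  # True
print(looks_like_terminated_for("for (int i = 0; i < 10; i++)"))       # False
```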

Answered by: Chloe850 | Posted: 07-01-2022



Answer 7

A little late to the party, but I think regular expressions are not the right tool for the job.

The problem is that you'll come across edge cases which add extraneous complexity to the regular expression. @est mentioned an example line:

for (int i = 0; i < 10; doSomethingTo("("));

This string literal contains an (unbalanced!) parenthesis, which breaks the logic. Clearly, you must ignore the contents of string literals. In order to do this, you must take the double quotes into account. But string literals themselves can contain double quotes. For instance, try this:

for (int i = 0; i < 10; doSomethingTo("\"(\\"));

If you address this using regular expressions, it'll add even more complexity to your pattern.
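
For comparison, a hand-written scanner can skip string literals with a few extra branches. This is only a rough sketch with a hypothetical function name: it handles double-quoted strings and backslash escapes, and ignores character literals, comments and raw string literals, all of which real C++ code may contain.

```python
def find_closing_paren(text, open_pos):
    """Return the index of the ')' matching the '(' at open_pos, or -1."""
    depth = 0
    in_string = False
    pos = open_pos
    while pos < len(text):
        ch = text[pos]
        if in_string:
            if ch == "\\":
                pos += 1  # skip the escaped character as well
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return pos
        pos += 1
    return -1

# The second example line from above, written as a raw Python string.
line = r'for (int i = 0; i < 10; doSomethingTo("\"(\\"));'
close = find_closing_paren(line, line.index("("))
print(line[close + 1:])  # ;
```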

I think you are better off parsing the language. You could, for instance, use a language recognition tool like ANTLR. ANTLR is a parser generator tool, which can also generate a parser in Python. You must provide a grammar defining the target language, in your case C++. There are already numerous grammars for many languages out there, so you can just grab the C++ grammar.

Then you can easily walk the parse tree, searching for empty statements used as the body of a while or for loop.

Answered by: Lily689 | Posted: 07-01-2022



Answer 8

Greg is absolutely correct. This kind of parsing cannot be done with regular expressions. I suppose it is possible to build some horrendous monstrosity that would work for many cases, but then you'll just run across something that it doesn't handle.

You really need to use more traditional parsing techniques. For example, it's pretty simple to write a recursive-descent parser to do what you need.
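
As an illustration, a recursive-descent routine for just the parenthesis structure might look like the sketch below. The function name is hypothetical, and the grammar it assumes is only "group := '(' (other-character | group)* ')'", with no handling of string literals or comments.

```python
def skip_group(text, pos):
    """text[pos] must be '('; return the index just past its matching ')'."""
    if text[pos] != "(":
        raise ValueError("expected '(' at position %d" % pos)
    pos += 1
    while pos < len(text):
        if text[pos] == "(":
            pos = skip_group(text, pos)  # recurse into the nested group
        elif text[pos] == ")":
            return pos + 1
        else:
            pos += 1
    raise ValueError("unbalanced parentheses")

line = "for (int i = funcA(); i < funcB(); i++);"
end = skip_group(line, line.index("("))
print(line[end:].strip() == ";")  # True
```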

Answered by: Lucas749 | Posted: 07-01-2022



Answer 9

Another thought that ignores parentheses and treats the for as a construct holding three semicolon-delimited values:

for\s*\([^;]+;[^;]+;[^;]+\)\s*;

This option works even when the statement is split over multiple lines, because [^;] also matches newline characters (MULTILINE only changes how ^ and $ behave, so it is not needed here). It assumes that for ( ... ; ... ; ... ) is the only valid construct, so it wouldn't work with a for ( x in y ) construct, or other deviations.

It also assumes that there are no functions containing semi-colons as arguments, such as:

for ( var i = 0; i < ListLen('a;b;c',';') ; i++ );

Whether this is a likely case depends on what you're actually doing this for.
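
A small sketch of how this pattern behaves from Python, using made-up sample strings: a header split across lines is found, a loop with a braced body is not, and the semi-colon-in-argument case above is missed as described.

```python
import re

THREE_PART_FOR = re.compile(r"for\s*\([^;]+;[^;]+;[^;]+\)\s*;")

empty_loop = "for (int i = 0;\n     i < 10;\n     i++);"
with_body = "for (int i = 0; i < 10; i++) { total += i; }"
semicolon_in_arg = "for ( var i = 0; i < ListLen('a;b;c',';') ; i++ );"

results = [
    THREE_PART_FOR.search(s) is not None
    for s in (empty_loop, with_body, semicolon_in_arg)
]
print(results)  # [True, False, False]
```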

Answered by: Freddie885 | Posted: 07-01-2022



Answer 10

As Frank suggested, this is best done without a regex. Here's an (ugly) one-liner:

match_string = orig_string[orig_string.index("("):len(orig_string)-orig_string[::-1].index(")")]

Matching the troll line est mentioned in his comment:

orig_string = "for (int i = 0; i < 10; doSomethingTo(\"(\"));"
match_string = orig_string[orig_string.index("("):len(orig_string)-orig_string[::-1].index(")")]

returns (int i = 0; i < 10; doSomethingTo("("))

This works by running through the string forward until it reaches the first open paren, and then backward until it reaches the first closing paren. It then uses these two indices to slice the string.

Answered by: Melissa164 | Posted: 07-01-2022


