Python regex to match a C comment spanning multiple lines

I'm trying to write a regex that works on multi-line C comments. I managed to make it work for /* comments here */, but it fails if the comment continues onto the next line. How do I write a regex that spans multiple lines?

Using this as my input:

/* this comment
must be recognized */

The problem is that "must", "be" and "recognized" are matched as IDs, and */ is reported as illegal characters.

#!/usr/bin/python
import ply.lex as lex
tokens = ['ID', 'COMMENT']

t_ID   = r'[a-zA-Z_][a-zA-Z0-9_]*'

def t_COMMENT(t):
    r'(?s)/\*(.*?).?(\*/)'
    #r'(?s)/\*(.*?).?(\*/)' does not work either.
    return t

# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

lex.lex()   #Build the lexer

lex.input('/* this comment\r\n must be recognised */\r\n')
while True:
    tok = lex.token()
    if not tok:
        break
    if tok.type == 'COMMENT':
        print(tok.type)

I tried quite a few things: "Create array of regex match(multiline)", "How to handle multiple rules for one token with PLY", and a few other suggestions from the PLY documentation at http://www.dabeaz.com/ply/ply.html



Solution 1:[1]

I use this regex when I want to find multi-line comments in C.

If I want to include the '/* */' delimiters:

\/\*(\*(?!\/)|[^*])*\*\/

If I don't want to include them:

(?<=\*)[\n]*.*[\n]*.*[\n]*[\n]*?[\n]*(?=\*)
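A minimal sketch of the first pattern above, tried with Python's standard re module outside of PLY. The sample text and variable names are assumptions made for illustration, and the extra capture group around the comment body is an addition for this sketch, not part of the original answer.

```python
import re

# The answer's first pattern, with the unneeded '\/' escapes dropped and a
# capture group (an addition for this sketch) around the comment body.
C_COMMENT = re.compile(r'/\*((?:\*(?!/)|[^*])*)\*/')

# Hypothetical sample input with a comment spanning two lines.
text = "int a; /* this comment\r\nmust be recognized */ int b;"

match = C_COMMENT.search(text)
whole = match.group(0)  # the comment including /* and */
body = match.group(1)   # the comment text without the delimiters
print(repr(whole))
print(repr(body))
```

Because [^*] is a negated character class, it matches newlines without any flag, so this pattern does not depend on (?s).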

Solution 2:[2]

By default, in the regex used by the PLY lexer, the dot . does not match a newline \n. So if you really want to match any character, use (.|\n) instead of .

(I had the same problem, and your comment on your own question helped me, so I created this answer for newcomers.)
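The difference between . and (.|\n) can be checked with plain re, which is what PLY builds its patterns on. This is only a sketch; the sample text and variable names are assumptions.

```python
import re

# Hypothetical input: one comment broken across two lines.
text = "/* this comment\r\nmust be recognized */"

# '.' alone cannot cross the '\n', so no match is found.
dot_only = re.search(r'/\*.*?\*/', text)

# '(.|\n)' accepts either a normal character or a newline.
dot_or_newline = re.search(r'/\*(.|\n)*?\*/', text)

print(dot_only)
print(repr(dot_or_newline.group(0)))
```

The first search returns None, while the second matches the whole comment.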

Solution 3:[3]

def t_COMMENT(t):
    r'(?s)/\*.*?\*/'
    return t

How it works:

  • (?s) is a modifier that makes . also match newlines
  • .*? is the non-greedy version of .*. It matches the shortest possible sequence of characters (up to the next */)
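The effect of both points can be seen with plain re on a hypothetical input that holds two comments. This is a sketch under assumptions: the sample text and variable names are made up. The last variant uses the scoped form (?s:...), which is not in the original answer. It is included because Python 3.11 and later require global inline flags such as (?s) to sit at the very start of the whole expression, so a pattern that begins with (?s) may be rejected if the lexer wraps or combines rule patterns.

```python
import re

# Hypothetical input: a two-line comment, some code, then another comment.
text = "/* first\ncomment */ x = 1; /* second */"

# Without (?s), '.' stops at the newline, so the first comment is missed.
no_flag = re.findall(r'/\*.*?\*/', text)

# (?s) lets '.' match newlines; the lazy .*? stops at the nearest '*/'.
lazy = re.findall(r'(?s)/\*.*?\*/', text)

# Greedy .* runs on to the last '*/', swallowing the code in between.
greedy = re.findall(r'(?s)/\*.*\*/', text)

# Scoped flag (an alternative, not from the answer): DOTALL only inside the group.
scoped = re.findall(r'/\*(?s:.*?)\*/', text)

for name, result in [("no_flag", no_flag), ("lazy", lazy),
                     ("greedy", greedy), ("scoped", scoped)]:
    print(name, result)
```

Only the lazy and scoped variants return the two comments separately.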

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Shoosha
Solution 2 Q-B
Solution 3