How to optimize a regular expression match search in Python
The Program
I am building a program that tracks which feature file steps are covered by a step definition. For example, I may have a feature step `I should not click on the panel`. This feature step matches the step definition `I {qualifier} click on the {place}`, assuming that `{qualifier}` maps to `(should not|should)` and `{place}` maps to `(panel|page)`.

For every feature step that matches a step definition, I want to keep track of which step definition it actually matched. So I need a connection between `I should not click on the panel` and `I {qualifier} click on the {place}`.
And for every feature step that does not match any of the step definitions, I am going to generate a step definition and connect the two.
The Problem
Right now I take every step definition and convert it into a regular expression. For example, `I {qualifier} click on the {place}` is converted to `(I (should not|should) click on the (panel|page))`.
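The conversion step described above can be sketched roughly as follows. The `placeholder_patterns` mapping and the function name `step_definition_to_regex` are assumptions for illustration, not names from the question:

```python
import re

# Hypothetical placeholder-to-pattern mapping (values taken from the
# question's examples).
placeholder_patterns = {
    "qualifier": "(?:should not|should)",
    "place": "(?:panel|page)",
}

def step_definition_to_regex(step_definition):
    """Replace each {placeholder} with its regex pattern."""
    def substitute(match):
        return placeholder_patterns[match.group(1)]
    # Note: literal text containing regex metacharacters would need
    # re.escape in a real implementation.
    return re.sub(r"\{(\w+)\}", substitute, step_definition)

pattern = step_definition_to_regex("I {qualifier} click on the {place}")
# pattern == "I (?:should not|should) click on the (?:panel|page)"
```

Non-capturing groups `(?:...)` are used here deliberately, which matters for the grouped-alternation approach in the answer below.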
I am using a Python dictionary where the key is the converted regular expression and the value is the original step definition.
My problem arises when I am going through every single feature step and trying to connect them to their matching step definitions. I am currently just looping through every single regular expression and trying to match it with the feature step, something like this...
```python
# every feature_step gets sent through this check
for regex in all_step_definition_regex:
    if re.match(regex, feature_step):
        step_definition = regex_to_step_definition_map[regex]
        return True, step_definition
return False, None
```
This is taking an incredibly long time to run, because every feature step has to be checked against each of the individual regular expressions. One way to speed up the initial check is to join every regular expression together with an 'or', as in `re.match('|'.join(all_step_definition_regex), feature_step)`, but then I have no way to connect the feature step with its matching step definition without looping back through all the individual regular expressions.
I was wondering if anyone has any idea how to speed up this process?
Solution 1:[1]
You can make each definition pattern a group, and then see which group matched, although you'll need to change your individual regexes to use non-capturing groups `(?:...)` (which would be more efficient in any case, if you're not using the captured information):
```python
definition_regex = re.compile(r'(' + r')|('.join(all_step_definition_regex) + r')')

def find_definition(feature_step):
    match = definition_regex.match(feature_step)
    if match is None:
        return None
    return match.lastindex - 1
```
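To recover the original step definition rather than just an index, the returned value can be used to look up a list kept in the same order as the compiled patterns. This is a sketch with sample data assumed from the question; it presumes the inner groups are all non-capturing, so each top-level group corresponds to exactly one step definition:

```python
import re

# Assumed sample data, ordered so that index i of step_definitions
# corresponds to index i of all_step_definition_regex.
step_definitions = [
    "I {qualifier} click on the {place}",
    "I see the {place}",
]
all_step_definition_regex = [
    r"I (?:should not|should) click on the (?:panel|page)",
    r"I see the (?:panel|page)",
]

definition_regex = re.compile(
    r"(" + r")|(".join(all_step_definition_regex) + r")"
)

def find_definition(feature_step):
    match = definition_regex.match(feature_step)
    if match is None:
        return False, None
    # lastindex is the 1-based number of the last (here: only)
    # top-level group that matched.
    return True, step_definitions[match.lastindex - 1]
```

This keeps the single-pass matching speed of the joined pattern while restoring the feature-step-to-definition connection the question asks for.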
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Ron Post |