I'm using Python regexes in a criminally inefficient manner

My goal here is to create a very simple template language. At the moment, I'm working on replacing a variable with a value, like this:

This input:

<%"TITLE"="This Is A Test Variable"%>The Web <%"TITLE"%>

Should produce this output:

The Web This Is A Test Variable

I've got it working. But looking at my code, I'm running multiple identical regexes on the same strings -- that just offends my sense of efficiency. There's got to be a better, more Pythonic way. (It's the two "while" loops that really offend.)

This does pass the unit tests, so if this is silly premature optimization, tell me -- I'm willing to let this go. There may be dozens of these variable definitions and uses in a document, but not hundreds. But I suspect there's obvious (to other people) ways of improving this, and I'm curious what the StackOverflow crowd will come up with.

import re

Variables = {}  # maps variable names to their definitions

def stripMatchedQuotes(item):
    # Remove one pair of matching single or double quotes, if present.
    MatchedSingleQuotes = re.compile(r"'(.*)'", re.LOCALE)
    MatchedDoubleQuotes = re.compile(r'"(.*)"', re.LOCALE)
    item = MatchedSingleQuotes.sub(r'\1', item, 1)
    item = MatchedDoubleQuotes.sub(r'\1', item, 1)
    return item

def processVariables(item):
    VariableDefinition = re.compile(r'<%(.*?)=(.*?)%>', re.LOCALE)
    VariableUse = re.compile(r'<%(.*?)%>', re.LOCALE)

    # First pass: record each <%name=value%> definition, then delete it.
    while VariableDefinition.search(item):
        VarName, VarDef = VariableDefinition.search(item).groups()
        VarName = stripMatchedQuotes(VarName).upper().strip()
        VarDef = stripMatchedQuotes(VarDef.strip())
        Variables[VarName] = VarDef
        item = VariableDefinition.sub('', item, 1)

    # Second pass: replace each <%name%> use with its stored value.
    while VariableUse.search(item):
        VarName = stripMatchedQuotes(VariableUse.search(item).group(1).upper()).strip()
        item = VariableUse.sub(Variables[VarName], item, 1)

    return item

Asked by: Agata693 | Posted: 05-10-2021

Answer 1

The first thing that may improve things is to move the re.compile calls outside the function. The compiled patterns are cached, but there is still a speed hit in checking that cache each time to see if a pattern is already compiled.

Another possibility is to use a single regex as below:

MatchedQuotes = re.compile(r"(['\"])(.*)\1", re.LOCALE)
item = MatchedQuotes.sub(r'\2', item, 1)
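A quick self-contained check of that backreference version (using plain `re.compile`, without the `re.LOCALE` flag, which modern Python rejects for str patterns):

```python
import re

# One regex handles both quote styles: \1 forces the closing quote
# to match whichever opening quote group 1 captured.
MatchedQuotes = re.compile(r"(['\"])(.*)\1")

print(MatchedQuotes.sub(r'\2', "'hello'", 1))   # hello
print(MatchedQuotes.sub(r'\2', '"world"', 1))   # world
print(MatchedQuotes.sub(r'\2', "'mixed\"", 1))  # unchanged: quotes don't match
```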

Finally, you can combine this into the regex in processVariables. Taking Torsten Marek's suggestion to use a function for re.sub, this improves and simplifies things dramatically.

VariableDefinition = re.compile(r'<%(["\']?)(.*?)\1=(["\']?)(.*?)\3%>', re.LOCALE)
VarRepl = re.compile(r'<%(["\']?)(.*?)\1%>', re.LOCALE)

def processVariables(item):
    vars = {}
    def findVars(m):
        vars[m.group(2).upper()] = m.group(4)
        return ""

    item = VariableDefinition.sub(findVars, item)
    return VarRepl.sub(lambda m: vars[m.group(2).upper()], item)

print(processVariables('<%"TITLE"="This Is A Test Variable"%>The Web <%"TITLE"%>'))

Here are my timings for 100000 runs:

Original       : 13.637
Global regexes : 12.771
Single regex   :  9.095
Final version  :  1.846

[Edit] Added the missing non-greedy specifier.

[Edit2] Added .upper() calls so it is case-insensitive like the original version.

Answered by: Julian890 | Posted: 06-11-2021

Answer 2

sub can take a callable as its replacement argument rather than a simple string. Using that, you can replace all variables with one function call:

>>> import re
>>> var_matcher = re.compile(r'<%(.*?)%>', re.LOCALE)
>>> string = '<%"TITLE"%> <%"SHMITLE"%>'
>>> values = {'"TITLE"': "I am a title.", '"SHMITLE"': "And I am a shmitle."}
>>> var_matcher.sub(lambda m: values[m.group(1)], string)
'I am a title. And I am a shmitle.'

Follow eduffy.myopenid.com's advice and keep the compiled regexes around.

The same recipe can be applied to the first loop; there the callable needs to store the value of the variable first, and always return "" as the replacement.
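A sketch of that first-loop recipe (hypothetical names; the callable records each definition and returns "" so the definition tag disappears from the output):

```python
import re

def_matcher = re.compile(r'<%(.*?)=(.*?)%>')
values = {}

def store_definition(m):
    # Record the definition, then replace the whole tag with nothing.
    values[m.group(1)] = m.group(2)
    return ""

string = '<%"TITLE"="This Is A Test Variable"%>The Web <%"TITLE"%>'
string = def_matcher.sub(store_definition, string)
# values now holds {'"TITLE"': '"This Is A Test Variable"'} (quotes intact)
# and string is 'The Web <%"TITLE"%>'
```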

Answered by: Edgar974 | Posted: 06-11-2021

Answer 3

Never create your own programming language. Ever. (I used to have an exception to this rule, but not any more.)

There is always an existing language you can use which suits your needs better. If you elaborated on your use-case, people may help you select a suitable language.

Answered by: Owen279 | Posted: 06-11-2021

Answer 4

Creating a templating language is all well and good, but shouldn't one of the goals of the templating language be easy readability and efficient parsing? The example you gave seems to be neither.

As Jamie Zawinski famously said:

Some people, when confronted with a problem, think "I know, I'll use regular expressions!" Now they have two problems.

If regular expressions are a solution to a problem you have created, the best bet is not to write a better regular expression, but to redesign your approach to eliminate their use entirely. Regular expressions are complicated, expensive, hugely difficult to maintain, and (ideally) should only be used for working around a problem someone else created.

Answered by: Thomas134 | Posted: 06-11-2021

Answer 5

You can match both kinds of quotes in one go with r"(\"|')(.*?)\1" - the \1 refers to the first group, so it will only match matching quotes.
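For instance (a small hypothetical snippet): the backreference means an apostrophe inside double quotes does not terminate the match.

```python
import re

# \1 matches whichever quote character group 1 captured, so the
# closing quote must be the same kind as the opening one.
quoted = re.compile(r"(\"|')(.*?)\1")

print(quoted.sub(r'\2', "'single' and \"double\""))  # single and double
print(quoted.sub(r'\2', "\"it's fine\""))            # it's fine
```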

Answered by: William173 | Posted: 06-11-2021

Answer 6

You're calling re.compile quite a bit. A global variable for these wouldn't hurt here.

Answered by: Adelaide657 | Posted: 06-11-2021

Answer 7

If a regexp contains only one .* wildcard plus literals, then you can use find and rfind to locate the opening and closing delimiters.

If it contains only a series of .*? wildcards plus literals, then you can use a series of find calls to do the work.

If the code is time-critical, this switch away from regexps altogether might give a little more speed.

Also, it looks to me like this is an LL-parsable language. You could look for a library that can already parse such things for you. You could also use recursive calls to do a one-pass parse -- for example, you could implement your processVariables function to consume only up to the first quote, then call a quote-matching function to consume up to the next quote, and so on.
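A regex-free sketch of the variable-use pass along those lines (hypothetical helper; assumes the <% %> delimiters and quote-stripping behaviour described in the question):

```python
def substitute(item, variables):
    """Replace each <%name%> tag with its value, using find instead of regexes."""
    out = []
    pos = 0
    while True:
        start = item.find('<%', pos)
        if start == -1:
            out.append(item[pos:])       # no more tags: emit the tail
            return ''.join(out)
        end = item.find('%>', start + 2)
        if end == -1:
            out.append(item[pos:])       # unterminated tag: emit verbatim
            return ''.join(out)
        # Strip whitespace and one layer of quotes, uppercase like the original.
        name = item[start + 2:end].strip().strip('\'"').upper()
        out.append(item[pos:start])
        out.append(variables.get(name, ''))
        pos = end + 2

print(substitute('The Web <%"TITLE"%>', {'TITLE': 'This Is A Test Variable'}))
# The Web This Is A Test Variable
```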

Answered by: Anna772 | Posted: 06-11-2021

Answer 8

Why not use Mako? Seriously. What feature do you require that Mako doesn't have? Perhaps you can adapt or extend something that already works.

Answered by: Rubie280 | Posted: 06-11-2021

Answer 9

Don't call search twice in a row (once in the loop conditional, and again as the first statement inside the loop). Call it once before the loop, cache the resulting match object, and refresh it as the final statement of the loop body.
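Applied to the first loop of the question's code, that pattern looks something like this (a sketch; quote stripping is simplified to str.strip for brevity, and re.LOCALE is dropped since Python 3 rejects it for str patterns):

```python
import re

VariableDefinition = re.compile(r'<%(.*?)=(.*?)%>')
Variables = {}
item = '<%"TITLE"="This Is A Test Variable"%>The Web <%"TITLE"%>'

match = VariableDefinition.search(item)   # search once, cache the Match
while match:
    name, definition = match.groups()
    Variables[name.strip().strip('\'"').upper()] = definition.strip().strip('\'"')
    item = item[:match.start()] + item[match.end():]  # drop the definition tag
    match = VariableDefinition.search(item)           # refresh as the last statement
```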

Answered by: Ada305 | Posted: 06-11-2021

Answer 10

Why not use XML and XSLT instead of creating your own template language? What you want to do is pretty easy in XSLT.

Answered by: Aida486 | Posted: 06-11-2021

