Caching compiled regex objects in Python?

Each time a Python file containing a large number of static regular expressions is imported, CPU cycles are spent compiling the pattern strings into their in-memory state machines.

a = re.compile("a.*b")
b = re.compile("c.*d")
...

Question: Is it possible to store these regular expressions in a cache on disk in a pre-compiled manner to avoid having to execute the regex compilations on each import?

Pickling the object only stores the pattern string and flags along with a reference to re's compile function, so compilation happens anyway on load:

>>> import pickle
>>> import re
>>> x = re.compile(".*")
>>> pickle.dumps(x)
"cre\n_compile\np0\n(S'.*'\np1\nI0\ntp2\nRp3\n."

And re objects are unmarshallable:

>>> import marshal
>>> import re
>>> x = re.compile(".*")
>>> marshal.dumps(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unmarshallable object
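
The two transcripts above come from Python 2. A minimal Python 3 sketch of the same checks follows; it assumes CPython's standard pickle and marshal modules, and the variable names are only illustrative. The pickled bytes carry the pattern text rather than a compiled state machine, so loading them compiles the pattern again.

```python
import marshal
import pickle
import re

x = re.compile("a.*b")

# The pickle payload contains the pattern text (plus flags), not compiled code.
data = pickle.dumps(x)
pattern_in_payload = b"a.*b" in data

# Unpickling rebuilds the object by compiling the stored pattern again.
restored = pickle.loads(data)

# marshal refuses compiled pattern objects outright.
try:
    marshal.dumps(x)
    marshal_failed = False
except ValueError:
    marshal_failed = True
```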


Asked by: Arthur937 | Posted: 01-10-2021






Answer 1

Is it possible to store these regular expressions in a cache on disk in a pre-compiled manner to avoid having to execute the regex compilations on each import?

Not easily. You'd have to write a custom serializer that hooks into the C sre implementation of the Python regex engine. Any performance benefits would be vastly outweighed by the time and effort required.

First, have you actually profiled the code? I doubt that compiling regexes is a significant part of the application's run-time. Remember that they are only compiled the first time the module is imported in the current execution -- thereafter, the module and its attributes are cached in memory.
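
One way to check this before optimizing is to time the compilation step directly. The following is a rough sketch, not a full profile: the PATTERNS list is a placeholder standing in for the module's real expressions, and re.purge() is called on every run so that re's internal cache does not hide the compile cost.

```python
import re
import timeit

# Placeholder patterns taken from the question; substitute the real ones.
PATTERNS = ["a.*b", "c.*d"]


def compile_all():
    re.purge()  # empty re's cache so every run really compiles
    return [re.compile(p) for p in PATTERNS]


runs = 1000
total_seconds = timeit.timeit(compile_all, number=runs)
per_run_us = total_seconds / runs * 1e6
print(f"{per_run_us:.1f} microseconds to compile {len(PATTERNS)} patterns")
```

If the figure is small compared with the program's total start-up time, caching compiled patterns on disk is unlikely to be worth the effort.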

If you have a program that basically spawns once, compiles a bunch of regexes, and then exits, you could try re-engineering it to perform multiple tests in one invocation. Then you could re-use the regexes, as above.

Finally, you could compile the regexes into C-based state machines and then link them in with an extension module. While this would likely be more difficult to maintain, it would eliminate regex compilation entirely from your application.

Answered by: Adelaide562 | Posted: 02-11-2021



Answer 2

Note that each module initializes itself only once during the life of an app, no matter how many times you import it. So if you compile your expressions at the module's global scope (i.e., not in a function), you should be fine.
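
A small sketch that demonstrates this behaviour, assuming a throwaway module named demo_patterns (a hypothetical name) written to a temporary directory. The module body, including its re.compile call, runs on the first import only; the second import returns the already initialized module.

```python
import importlib
import os
import sys
import tempfile

MODULE_SOURCE = '''
import re
print("compiling patterns")  # executes only when the module body runs
A = re.compile("a.*b")
'''

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "demo_patterns.py"), "w") as fh:
        fh.write(MODULE_SOURCE)
    sys.path.insert(0, tmp)
    try:
        first = importlib.import_module("demo_patterns")
        second = importlib.import_module("demo_patterns")  # served from sys.modules
    finally:
        sys.path.remove(tmp)

same_module = first is second
same_pattern = first.A is second.A
```

This only helps within a single process. A fresh interpreter still compiles the patterns once at start-up.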

Answered by: Blake209 | Posted: 02-11-2021



Answer 3

First of all, this is a clear limitation of the Python re module. It limits how many regular expressions, and how large, are reasonable to use. The limit is higher for long-running processes and lower for short-lived ones such as command-line applications.

Some years ago I looked into it, and it is possible to dig out the compilation result, pickle it, and then unpickle and reuse it. The problem is that this requires using the sre.py internals, so it probably won't work across different Python versions.

I would like to have this kind of feature in my toolbox. I would also like to know whether there are any separate modules that could be used instead.

Answered by: Maria686 | Posted: 02-11-2021



Answer 4

The shelve module appears to work just fine:


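import re
import shelve

a_pattern = "a.*b"
b_pattern = "c.*d"
a = re.compile(a_pattern)
b = re.compile(b_pattern)

# Store the compiled objects, keyed by their pattern strings.
x = shelve.open('re_cache')
x[a_pattern] = a
x[b_pattern] = b
x.close()

# ...
# Later (for example, in another run), load them back from the shelf.
x = shelve.open('re_cache')
a = x[a_pattern]
b = x[b_pattern]
x.close()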

You can then make a nice wrapper class that automatically handles the caching for you so that it becomes transparent to the user... an exercise left to the reader.
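
One possible shape for such a wrapper is sketched below. The class name RegexShelf and its methods are hypothetical, and the shelf lives in a temporary directory only to keep the example self-contained. Because shelve stores values with pickle, the caveat from the question still applies: reading an entry back rebuilds the pattern from its stored text and flags.

```python
import os
import re
import shelve
import tempfile


class RegexShelf:
    """Hypothetical wrapper: fetch patterns from a shelf, compiling on a miss."""

    def __init__(self, path):
        self._shelf = shelve.open(path)

    def get(self, pattern, flags=0):
        key = f"{flags}:{pattern}"  # shelf keys must be strings
        if key not in self._shelf:
            self._shelf[key] = re.compile(pattern, flags)
        # Reading the entry unpickles it, which rebuilds the pattern object.
        return self._shelf[key]

    def close(self):
        self._shelf.close()


with tempfile.TemporaryDirectory() as tmp:
    cache = RegexShelf(os.path.join(tmp, "re_cache"))
    a = cache.get("a.*b")
    matched = a.match("a123b") is not None
    cache.close()
```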

Answered by: Thomas875 | Posted: 02-11-2021



Answer 5

Hmm,

Doesn't shelve use pickle?

Anyway, I agree with the previous answers. Since a module is processed only once, I doubt compiling regexps will be your app's bottleneck. And the Python re module is wicked fast since it's coded in C :-)

But the good news is that Python has a nice community, so I am sure you can find somebody currently hacking on just what you need.

I googled for 5 seconds and found: http://home.gna.org/oomadness/en/cerealizer/index.html.

I don't know if it will do the job, but if not, good luck with your research :-)

Answered by: Anna431 | Posted: 02-11-2021



Answer 6

Open /usr/lib/python2.5/re.py and look for "def _compile". You'll find re.py's internal cache mechanism.
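
The effect of that cache can be observed without reading the source. This is a small sketch assuming CPython, where compiling the same pattern with the same flags twice returns the cached object; re.purge() is the documented call that empties the cache.

```python
import re

first = re.compile("a.*b")
second = re.compile("a.*b")   # same pattern and flags: served from re's cache
cached = first is second

re.purge()                    # empty the internal cache
third = re.compile("a.*b")    # compiled afresh into a new object
recompiled = third is not first
```

The cache lives in memory for the current process only, so it does not avoid compilation when a new interpreter starts.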

Answered by: Anna378 | Posted: 02-11-2021


