Capture the contents of a regex and delete them, efficiently

Situation:

  • text: a string
  • R: a regex that matches part of the string. This might be expensive to calculate.

I want to both delete the R-matches from the text, and see what they actually contain. Currently, I do this like:

import re
ab_re = re.compile("[ab]")
text="abcdedfe falijbijie bbbb laifsjelifjl"
ab_re.findall(text)
# ['a', 'b', 'a', 'b', 'b', 'b', 'b', 'b', 'a']
ab_re.sub('',text)
# 'cdedfe flijijie  lifsjelifjl'

This runs the regex twice, as near as I can tell. Is there a technique to do it all in one pass, perhaps using re.split? It seems like with split-based solutions I'd need to run the regex at least twice as well.


Asked by: Abigail977 | Posted: 01-10-2021






Answer 1

import re

r = re.compile("[ab]")
text = "abcdedfe falijbijie bbbb laifsjelifjl"

matches = []
replaced = []
pos = 0
for m in r.finditer(text):
    matches.append(m.group(0))
    replaced.append(text[pos:m.start()])
    pos = m.end()
replaced.append(text[pos:])

print(matches)
print(''.join(replaced))

Outputs:

['a', 'b', 'a', 'b', 'b', 'b', 'b', 'b', 'a']
cdedfe flijijie  lifsjelifjl

Answered by: Cadie437 | Posted: 02-11-2021



Answer 2

What about this:

import re

text = "abcdedfe falijbijie bbbb laifsjelifjl"
matches = []

ab_re = re.compile( "[ab]" )

def verboseTest( m ):
    matches.append( m.group(0) )
    return ''

textWithoutMatches = ab_re.sub( verboseTest, text )

print(matches)
# ['a', 'b', 'a', 'b', 'b', 'b', 'b', 'b', 'a']
print(textWithoutMatches)
# cdedfe flijijie  lifsjelifjl

The 'repl' argument of re.sub can be a function. It is called once per match with the match object, so you can report or save the matches from there, and whatever the function returns is what 'sub' substitutes.

The function could easily be modified to do a lot more too! Check out the re module documentation on docs.python.org for more information on what else is possible.
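As one possible variation, the callback and the list can be wrapped in a single function so nothing lives at module level. This is only a sketch: the name remove_and_collect is made up for illustration and is not part of the re module.

```python
import re

def remove_and_collect(pattern, text):
    """Return (matches, text_without_matches) from a single re.sub pass.

    `pattern` is assumed to be a compiled regex; the function name is
    hypothetical, chosen only for this example.
    """
    matches = []

    def record(m):
        # Called once per match: remember the matched text, substitute nothing.
        matches.append(m.group(0))
        return ''

    stripped = pattern.sub(record, text)
    return matches, stripped

ab_re = re.compile("[ab]")
matches, stripped = remove_and_collect(ab_re, "abcdedfe falijbijie bbbb laifsjelifjl")
print(matches)
# ['a', 'b', 'a', 'b', 'b', 'b', 'b', 'b', 'a']
print(stripped)
# cdedfe flijijie  lifsjelifjl
```

Because the list is created inside the function, repeated calls do not accumulate matches from earlier calls.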

Answered by: Blake853 | Posted: 02-11-2021



Answer 3

My revised answer, using re.split(), which does things in one regex pass:

import re
text="abcdedfe falijbijie bbbb laifsjelifjl"
ab_re = re.compile("([ab])")
tokens = ab_re.split(text)
non_matches = tokens[0::2]
matches = tokens[1::2]

(edit: here is a complete function version)

def split_matches(text, compiled_re):
    '''Given a compiled re whose whole pattern is wrapped in one
    capturing group, split a text into matching and nonmatching
    sections. Returns two lists: matches, non_matches.
    '''
    tokens = compiled_re.split(text)
    matches = tokens[1::2]
    non_matches = tokens[0::2]
    return matches, non_matches

m,nm = split_matches(text,ab_re)
''.join(nm) # equivalent to ab_re.sub('',text)
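One caveat with the odd/even slicing: it assumes the pattern has exactly one capturing group. If the pattern has inner groups as well, re.split returns every group's text, so the stride changes. A rough sketch of one way to handle that, assuming group 1 still wraps the whole pattern (split_matches_grouped and date_re are hypothetical names used only here):

```python
import re

def split_matches_grouped(text, compiled_re):
    """Like split_matches, but tolerates inner capturing groups.

    Assumption: group 1 wraps the entire pattern, so it holds the full match.
    """
    # split() yields: text, group 1, ..., group n, text, ...
    stride = compiled_re.groups + 1
    tokens = compiled_re.split(text)
    matches = tokens[1::stride]      # group 1 of each match, i.e. the whole match
    non_matches = tokens[0::stride]  # text between matches
    return matches, non_matches

date_re = re.compile(r"((\d+)-(\d+))")
m, nm = split_matches_grouped("from 10-12 to 3-4 inclusive", date_re)
print(m)
# ['10-12', '3-4']
print(''.join(nm))
# from  to  inclusive
```

Alternatively, writing the inner groups as non-capturing, (?:...), keeps the original stride of 2.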

Answered by: Carlos220 | Posted: 02-11-2021



Answer 4

You could use split with capturing parentheses. If you do, then the text of all groups in the pattern is also returned as part of the resulting list (from the Python docs).

So the code would be

import re
ab_re = re.compile("([ab])")
text="abcdedfe falijbijie bbbb laifsjelifjl"
matches = ab_re.split(text)
# matches = ['', 'a', '', 'b', 'cdedfe f', 'a', 'lij', 'b', 'ijie ', 'b', '', 'b', '', 'b', '', 'b', ' l', 'a', 'ifsjelifjl']

# now extract the matches
Rmatches = []
remaining = []
for i in range(1, len(matches), 2):
    Rmatches.append(matches[i])
# Rmatches = ['a', 'b', 'a', 'b', 'b', 'b', 'b', 'b', 'a']

for i in range(0, len(matches), 2):
    remaining.append(matches[i])
remainingtext = ''.join(remaining)
# remainingtext = 'cdedfe flijijie  lifsjelifjl'
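To make the effect of the capturing parentheses concrete, here is a small side-by-side comparison on a shorter string (the string and variable names are just for illustration):

```python
import re

text = "xaybz"

# Without a capturing group, the matched separators are discarded.
plain = re.split("[ab]", text)
print(plain)
# ['x', 'y', 'z']

# With a capturing group, each match is kept between the pieces,
# so non-matches sit at even indices and matches at odd indices.
captured = re.split("([ab])", text)
print(captured)
# ['x', 'a', 'y', 'b', 'z']
```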

Answered by: Melissa566 | Posted: 02-11-2021


