What is the best way to remove duplicate lines matching a regex from a string using Python?

This is a pretty straightforward attempt. I haven't been using Python for long. It seems to work, but I am sure I have much to learn, so let me know if I am way off here. The function needs to find lines matching a pattern, keep the first line that matches, replace the remaining consecutive matching lines with a summary message, and return the modified string.

Just to be clear, the regex .*Dog.* would take

Cat
Dog
My Dog
Her Dog
Mouse

and return

Cat
Dog
::::: Pattern .*Dog.* repeats 2 more times.
Mouse


#!/usr/bin/env python

import re

def remove_repeats(l_string, l_regex):
   """Take a string, remove similar lines and replace with a summary message.

   l_regex accepts strings and tuples.
   """

   # Convert a single string to a one-element tuple.
   if isinstance(l_regex, str):
      l_regex = (l_regex,)

   for t in l_regex:
      r = ''  # Output rebuilt for this pattern.
      p = ''  # Pending summary message.
      m = 0   # Consecutive matches so far (must be set before the first match).
      for l in l_string.splitlines(True):
         if l.startswith('::::: Pattern'):
            # Pass through summary lines written for an earlier pattern.
            r = r + l
         else:
            if re.search(t, l): # If line matches regex.
               m += 1
               if m == 1: # If this is first match in a set of lines add line to file.
                  r = r + l
               elif m > 1: # Else update the message string.
                  p = "::::: Pattern '" + t + "' repeats " + str(m-1) + ' more times.\n'
            else:
               if p: # Write the message string if it has value.
                  r = r + p
                  p = ''
               m = 0
               r = r + l

      if p: # Write the message if loop ended in a pattern.
         r = r + p
         p = ''

      l_string = r # Reset string to modified string.

   return l_string


Asked by: Stella822 | Posted: 28-01-2022






Answer 1

The rematcher function seems to do what you want:

import sys, re

def rematcher(re_str, iterable):

    matcher = re.compile(re_str)
    in_match = 0
    for item in iterable:
        if matcher.match(item):
            if in_match == 0:
                yield item
            in_match += 1
        else:
            if in_match > 1:
                yield "%s repeats %d more times\n" % (re_str, in_match-1)
            in_match = 0
            yield item
    if in_match > 1:
        yield "%s repeats %d more times\n" % (re_str, in_match-1)

for line in rematcher(".*Dog.*", sys.stdin):
    sys.stdout.write(line)

EDIT

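In your case, the final string should be built from lines that keep their newlines, because the summary message yielded by rematcher already ends with "\n":

final_string = ''.join(rematcher(".*Dog.*", your_initial_string.splitlines(True)))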
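As a quick check of the string case, here is a minimal self-contained sketch that repeats the generator above and runs it on the sample from the question. The variable names text and result are only for illustration.

```python
import re

def rematcher(re_str, iterable):
    """Yield items, collapsing each run of consecutive matches."""
    matcher = re.compile(re_str)
    in_match = 0
    for item in iterable:
        if matcher.match(item):
            if in_match == 0:
                yield item              # keep the first line of a run
            in_match += 1
        else:
            if in_match > 1:
                yield "%s repeats %d more times\n" % (re_str, in_match - 1)
            in_match = 0
            yield item
    if in_match > 1:                    # the run reached the end of the input
        yield "%s repeats %d more times\n" % (re_str, in_match - 1)

text = "Cat\nDog\nMy Dog\nHer Dog\nMouse\n"
# splitlines(True) keeps each "\n", which is how sys.stdin yields lines too.
result = "".join(rematcher(".*Dog.*", text.splitlines(True)))
print(result)
```

Keeping the line endings means the original lines and the summary message can be joined with an empty string, without doubled or missing newlines.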

Answered by: Ryan382 | Posted: 01-03-2022



Answer 2

I updated your code to be a bit more efficient:

#!/usr/bin/env python
#

import re

def remove_repeats (l_string, l_regex):
   """Take a string, remove similar lines and replace with a summary message.

   l_regex accepts strings/patterns or tuples of strings/patterns.
   """

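   # Convert string/pattern to tuple. In Python 3 a str has __iter__,
   # so it must be checked for explicitly.
   if isinstance(l_regex, str) or not hasattr(l_regex, '__iter__'):
      l_regex = l_regex,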

   ret = []
   last_regex = None
   count = 0

   for line in l_string.splitlines(True):
      if last_regex:
         # Previous line matched one of the regexes
         if re.match(last_regex, line):
            # This one does too
            count += 1
            continue  # skip to next line
         elif count > 1:
            ret.append("::::: Pattern %r repeats %d more times.\n" % (last_regex, count-1))
         count = 0
         last_regex = None

      ret.append(line)

      # Look for other patterns that could match
      for regex in l_regex:
         if re.match(regex, line):
            # Found one
            last_regex = regex
            count = 1
            break  # exit inner loop

   return ''.join(ret)

Answered by: Audrey683 | Posted: 01-03-2022



Answer 3

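First, your regular expression will match more slowly than it would if you left off the greedy wildcards. With re.search (though not re.match, which anchors at the start of the string), as a test of whether a line matches,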

.*Dog.*

is equivalent to

Dog

but the latter matches more quickly because no backtracking is involved. The longer the strings, the more likely "Dog" appears multiple times and thus the more backtracking work the regex engine has to do. As it is, ".*D" virtually guarantees backtracking.
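A small sketch of that equivalence, using lines like those in the question. It also shows why the shorter pattern only works with re.search: re.match anchors at the start of the string, so the answers that call re.match still need the leading wildcard. The list names are only for illustration.

```python
import re

lines = ["Cat", "Dog", "My Dog", "Her Dog", "Mouse", "Doggy"]

# re.search scans the whole line, so the wildcards add nothing.
with_wildcards = [bool(re.search(r".*Dog.*", s)) for s in lines]
bare = [bool(re.search(r"Dog", s)) for s in lines]

# re.match only tries at position 0, so "My Dog" and "Her Dog" are missed.
anchored = [bool(re.match(r"Dog", s)) for s in lines]

print(with_wildcards == bare)
print(anchored)
```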

That said, how about:

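#! /usr/bin/env python

import re            # regular expressions
import fileinput    # read from STDIN or file

my_regex = '.*Dog.*'
my_matches = 0

for line in fileinput.input():
    line = line.rstrip('\n')    # drop the newline; print() adds it back

    if re.search(my_regex, line):
        if my_matches == 0:
            print(line)         # first line of a run of matches
        my_matches = my_matches + 1
    else:
        if my_matches != 0:
            print('::::: Pattern %s repeats %i more times.' % (my_regex, my_matches - 1))
        print(line)
        my_matches = 0

# Report a run of matches that reaches the end of the input.
if my_matches != 0:
    print('::::: Pattern %s repeats %i more times.' % (my_regex, my_matches - 1))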

It's not clear what should happen with non-neighboring matches.

It's also not clear what should happen with single-line matches surrounded by non-matching lines. Append "Doggy" and "Hula" to the input file and you'll get a message saying the pattern repeats "0" more times.
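One possible reading of both open points, as a sketch only: treat each run of consecutive matches separately, and emit a message only when a run has more than one line. This swaps the counter loop for itertools.groupby, which is not what the script above uses, and collapse_runs is a hypothetical helper name.

```python
import re
from itertools import groupby

def collapse_runs(lines, pattern):
    """Keep the first line of each run of consecutive matches, summarise the rest."""
    regex = re.compile(pattern)
    out = []
    # groupby splits the input into alternating runs of matching / non-matching lines.
    for is_match, run in groupby(lines, key=lambda s: bool(regex.search(s))):
        run = list(run)
        if not is_match:
            out.extend(run)
            continue
        out.append(run[0])
        if len(run) > 1:    # a single match gets no "0 more times" message
            out.append('::::: Pattern %s repeats %d more times.' % (pattern, len(run) - 1))
    return out

sample = ["Cat", "Dog", "My Dog", "Her Dog", "Mouse", "Doggy", "Hula"]
for line in collapse_runs(sample, "Dog"):
    print(line)
```

Under this reading, "Doggy" is printed as-is with no summary, and a later, separate run of matches would get its own message.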

Answered by: Aldus860 | Posted: 01-03-2022


