Parsing a string for nested patterns

What would be the best way to do this?

The input string is:

<133_3><135_3><116_2>The other system worked for about 1 month</116_2> got some good images <137_3>on it then it started doing the same thing as the first one</137_3> so then I quit using either camera now they are just sitting and collecting dust.</135_3></133_3>

The expected output is:

{'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit \
using either camera now they are just sitting and collecting dust.':[133, 135],

'The other system worked for about 1 month': [116],

'on it then it started doing the same thing as the first one':[137]

}

That seems like a recursive regexp search, but I can't figure out exactly how to do it.

For now I can only think of a tedious recursive function, but I have a feeling that there should be a better way.

Related question: Can regular expressions be used to match nested patterns?


Asked by: Roland718 | Posted: 01-10-2021






Answer 1

Use expat or another XML parser; it's more explicit than anything else, considering you're dealing with XML data anyway.

However, note that XML element names can't start with a digit, as they do in your example, so the input in the code below prefixes each tag name with an underscore.

Here's a parser that will do what you need, although you'll need to tweak it to combine duplicate elements into one dict key:

from xml.parsers.expat import ParserCreate

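open_elements = {}   # names of the elements that are currently open
result_dict = {}     # element name -> all text seen inside that element

def start_element(name, attrs):
    open_elements[name] = True

def end_element(name):
    del open_elements[name]

def char_data(data):
    # Text belongs to every element that is open when it is seen
    for element in open_elements:
        cur = result_dict.setdefault(element, '')
        result_dict[element] = cur + data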

if __name__ == '__main__':
    p = ParserCreate()

    p.StartElementHandler = start_element
    p.EndElementHandler = end_element
    p.CharacterDataHandler = char_data

    p.Parse(u'<_133_3><_135_3><_116_2>The other system worked for about 1 month</_116_2> got some good images <_137_3>on it then it started doing the same thing as the first one</_137_3> so then I quit using either camera now they are just sitting and collecting dust.</_135_3></_133_3>', 1)

    print(result_dict)
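
One possible version of that tweak, as a rough sketch rather than part of the parser above: invert the element-to-text mapping so that elements with identical text share one key. The helper name `group_by_text` and the shortened sample dict are assumptions made for illustration; the sample only mimics the shape of `result_dict`.

```python
from collections import defaultdict

# Hypothetical stand-in for result_dict (element name -> text), shortened
sample = {
    '_133_3': 'first middle second last',
    '_135_3': 'first middle second last',
    '_116_2': 'first',
    '_137_3': 'second',
}

def group_by_text(by_element):
    """Map each distinct text to the numeric ids of the elements holding it."""
    grouped = defaultdict(list)
    for name, text in by_element.items():
        # '_133_3' -> '133_3' -> 133
        tag_id = int(name.lstrip('_').split('_')[0])
        grouped[text].append(tag_id)
    return dict(grouped)

print(group_by_text(sample))
```

The ids in each list follow the insertion order of the input dict, so elements that were opened first come first.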

Answered by: Arthur245 | Posted: 02-11-2021



Answer 2

Take an XML parser, have it generate a DOM (Document Object Model), and then write a recursive algorithm that traverses all the nodes. For each node, call a "text()"-style accessor (one that returns the text in the current node and all of its children) and use the result as a key in the dictionary.
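
A minimal sketch of this idea using the standard library's `xml.dom.minidom`. It assumes the tag names have been prefixed with a letter ("e" here) so that they are valid XML names, and the sample input is shortened. The helpers `node_text` and `collect` are hypothetical names, since minidom has no built-in "text()" method.

```python
from xml.dom import minidom

def node_text(node):
    """Return the text of a node together with all of its descendants."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return ''.join(node_text(child) for child in node.childNodes)

def collect(node, result):
    """Walk the tree and record text -> [tag ids], outermost element first."""
    if node.nodeType == node.ELEMENT_NODE:
        # 'e133_3' -> '133_3' -> 133
        tag_id = int(node.tagName[1:].split('_')[0])
        result.setdefault(node_text(node), []).append(tag_id)
        for child in node.childNodes:
            collect(child, result)
    return result

xml = ("<e133_3><e135_3><e116_2>first</e116_2> middle "
       "<e137_3>second</e137_3> last</e135_3></e133_3>")
result = collect(minidom.parseString(xml).documentElement, {})
print(result)
```

Recomputing the text for every node is quadratic in the worst case, which is probably acceptable for short strings like the one in the question.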

Answered by: Sawyer546 | Posted: 02-11-2021



Answer 3

from io import BytesIO
from collections import defaultdict
# The standard library works here too:
# from xml.etree import ElementTree as etree
from lxml import etree

xml = "<e133_3><e135_3><e116_2>The other system worked for about 1 month</e116_2> got some good images <e137_3>on it then it started doing the same thing as the first one</e137_3> so then I quit using either camera now they are just sitting and collecting dust. </e135_3></e133_3>"

d = defaultdict(list)
# By default iterparse yields each element when its closing tag is reached
for event, elem in etree.iterparse(BytesIO(xml.encode('utf-8'))):
    # Key: all text inside the element; value: numeric id ('e133_3' -> 133)
    d[''.join(elem.itertext())].append(int(elem.tag[1:-2]))

print(dict(d))

Output:

{'The other system worked for about 1 month': [116], 
'on it then it started doing the same thing as the first one': [137], 
'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit using \
either camera now they are just sitting and collecting dust. ': [135, 133]}

Answered by: Kelvin178 | Posted: 02-11-2021



Answer 4

I think a grammar would be the best option here. I found a link with some information: http://www.onlamp.com/pub/a/python/2006/01/26/pyparsing.html
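
To give a rough idea of what a grammar-based approach looks like without any third-party library, here is a hand-written recursive-descent sketch (it does not use pyparsing). It assumes the informal grammar `content := (TEXT | element)*` and `element := <name> content </name>`. The names `TOKEN` and `parse_nested` are made up for this example, and the sample input is shortened.

```python
import re

# Tokens: an opening tag, a closing tag, or a run of plain text
TOKEN = re.compile(r'<(/?)([^<>]+)>|([^<]+)')

def parse_nested(text):
    """Return {text: [tag ids]} with ids listed outermost element first."""
    pairs = []                      # (tag id, text) in order of opening tags
    tokens = TOKEN.finditer(text)   # one iterator shared by all nesting levels

    def content(open_name):
        parts = []
        for m in tokens:
            closing, name, chars = m.groups()
            if chars is not None:           # plain text
                parts.append(chars)
            elif closing:                   # closing tag must match the opener
                if name != open_name:
                    raise ValueError('unexpected </%s>' % name)
                return ''.join(parts)
            else:                           # opening tag: parse nested content
                slot = len(pairs)
                pairs.append(None)          # reserve the slot to keep open order
                inner = content(name)
                pairs[slot] = (int(name.split('_')[0]), inner)
                parts.append(inner)
        if open_name is not None:
            raise ValueError('missing </%s>' % open_name)
        return ''.join(parts)

    content(None)
    result = {}
    for tag_id, inner in pairs:
        result.setdefault(inner, []).append(tag_id)
    return result

txt = "<133_3><135_3><116_2>first</116_2> middle <137_3>second</137_3> last</135_3></133_3>"
print(parse_nested(txt))
```

Because this parser is not an XML parser, it accepts tag names that start with a digit, but it also skips over a stray "<" in the text without complaint, so treat it as a starting point only.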

Answered by: Maria304 | Posted: 02-11-2021



Answer 5

Note that you can't actually solve this with a regular expression, since regular expressions don't have the expressive power to enforce proper nesting.

Take the following mini-language:

A certain number of "(" followed by the same number of ")", no matter what the number.

You could very easily write a regular expression for a super-language of this mini-language (one where you don't enforce that the numbers of opening and closing parentheses are equal). You could also very easily write a regular expression for any finite sub-language (one where you limit yourself to some maximum depth of nesting). But you can never represent this exact language with a regular expression.

So you'd have to use a grammar, yes.
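
The three cases can be illustrated with a small sketch. The names `SUPER`, `bounded` and `balanced` are made up for this example. Note that the exact check in `balanced` only works because the counting happens in Python code, outside the regular expression.

```python
import re

# Super-language: any number of "(" followed by any number of ")"
SUPER = re.compile(r'\(*\)*')

def bounded(max_depth):
    """Regex for the finite sub-language with nesting depth <= max_depth."""
    words = ('(' * n + ')' * n for n in range(max_depth + 1))
    return re.compile('|'.join(re.escape(w) for w in words))

def balanced(s):
    """Exact mini-language: the regex splits, the length comparison counts."""
    m = re.fullmatch(r'(\(*)(\)*)', s)
    return m is not None and len(m.group(1)) == len(m.group(2))

print(SUPER.fullmatch('(()') is not None)            # True: counts not enforced
print(bounded(2).fullmatch('((()))') is not None)    # False: deeper than 2
print(balanced('((()))'), balanced('(()'))           # True False
```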

Answered by: Chester670 | Posted: 02-11-2021



Answer 6

Here's an unreliable, inefficient recursive regexp solution:

import re

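# Matches <tag>content</tag>; the backreference (?P=tag) requires the closing
# tag to have the same name as the opening one.
re_tag = re.compile(r'<(?P<tag>[^>]+)>(?P<content>.*?)</(?P=tag)>', re.S)

def iterparse(text, tag=None):
    """Yield (tag, raw content) for every element, outermost first."""
    if tag is not None:
        yield tag, text
    for m in re_tag.finditer(text):
        # Recurse into the content of each match at this level
        yield from iterparse(m.group('content'), m.group('tag'))

def strip_tags(content):
    """Remove all tags from content, keeping only the text."""
    # The replacement function strips nested tags recursively
    nested = lambda m: re_tag.sub(nested, m.group('content'))
    return re_tag.sub(nested, content)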


txt = "<133_3><135_3><116_2>The other system worked for about 1 month</116_2> got some good images <137_3>on it then it started doing the same thing as the first one</137_3> so then I quit using either camera now they are just sitting and collecting dust. </135_3></133_3>"
d = {}
for tag, text in iterparse(txt):
    d.setdefault(strip_tags(text), []).append(int(tag[:-2]))

print(d)

Output:

{'on it then it started doing the same thing as the first one': [137], 
 'The other system worked for about 1 month': [116], 
 'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit using \
 either camera now they are just sitting and collecting dust. ': [133, 135]}
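
One way to see why this is unreliable, as a small sketch with a made-up input: when a tag is nested inside another tag of the same name, the non-greedy `.*?` stops at the first matching closing tag, so the outer element is cut short.

```python
import re

re_tag = re.compile(r'<(?P<tag>[^>]+)>(?P<content>.*?)</(?P=tag)>', re.S)

# Hypothetical input: tag 1_1 nested directly inside another 1_1
m = re_tag.search('<1_1>a<1_1>b</1_1>c</1_1>')
print(m.group('content'))   # a<1_1>b
```

The input from the question never nests a tag inside itself, which is why the solution above still produces the expected result for it.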

Answered by: Walter687 | Posted: 02-11-2021



