Python regular expression to split paragraphs

How would one write a regular expression in Python to split text into paragraphs?

Paragraphs are separated by two line breaks (\n). Any number of spaces or tabs may surround the line breaks, and the sequence should still count as a paragraph break.

I am using Python, so the solution can use Python's extended regular expression syntax (for example, the (?P...) constructs).

Examples:

the_str = 'paragraph1\n\nparagraph2'
# splitting should yield ['paragraph1', 'paragraph2']

the_str = 'p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'
# should yield ['p1', 'p2\t\n\tstill p2', 'p3']

the_str = 'p1\n\n\n\tp2'
# should yield ['p1', '\n\tp2']

The best I could come up with is: r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', i.e.

import re
paragraphs = re.split(r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', the_str)

but that is ugly. Anything better?

EDIT:

Suggestions rejected:

r'\s*?\n\s*?\n\s*?' -> That would make examples 2 and 3 fail: since \s includes \n, it would allow paragraph breaks with more than two \n characters.
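The three examples above can be turned into a quick check for any candidate pattern. This is only a minimal sketch: the names EXAMPLES and check are made up for illustration and are not part of the question.

```python
import re

# (input, expected) pairs copied from the examples in the question.
EXAMPLES = [
    ('paragraph1\n\nparagraph2', ['paragraph1', 'paragraph2']),
    ('p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3', ['p1', 'p2\t\n\tstill p2', 'p3']),
    ('p1\n\n\n\tp2', ['p1', '\n\tp2']),
]

def check(pattern):
    """Return True if re.split with `pattern` reproduces every expected list."""
    return all(re.split(pattern, text) == expected for text, expected in EXAMPLES)

# Pattern from the question: horizontal whitespace around exactly two newlines.
QUESTION_PATTERN = r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*'
# The rejected suggestion from the EDIT.
REJECTED_PATTERN = r'\s*?\n\s*?\n\s*?'

print(check(QUESTION_PATTERN))   # True
print(check(REJECTED_PATTERN))   # False
```

With these inputs the rejected pattern already fails on example 2: its trailing lazy \s*? matches nothing, so the last piece comes back as '\tp3' instead of 'p3'.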


Asked by: John142 | Posted: 28-01-2022






Answer 1

Unfortunately there's no nice way to write "space but not a newline".

I think the best you can do is add some space with the x modifier and try to factor out the ugliness a bit, but that's questionable:

(?x) (?: [ \t\r\f\v]* \n ){2} [ \t\r\f\v]*

You could also try creating a subrule just for the character class and interpolating it three times.
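The subrule idea could look roughly like the sketch below. The names HSPACE, PARAGRAPH_BREAK and split_paragraphs are hypothetical; the character class is the one from the question, and the quantifier is greedy so that whitespace after the second newline is consumed too.

```python
import re

# Hypothetical helper fragment: whitespace other than "\n", as listed in the question.
HSPACE = r'[ \t\r\f\v]*'

# Interpolate the fragment three times around the two newlines.
PARAGRAPH_BREAK = re.compile(rf'{HSPACE}\n{HSPACE}\n{HSPACE}')

def split_paragraphs(text):
    """Split text wherever exactly two newlines appear, ignoring surrounding spaces/tabs."""
    return PARAGRAPH_BREAK.split(text)

print(split_paragraphs('p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'))
# ['p1', 'p2\t\n\tstill p2', 'p3']
```

The fragment could arguably also be written as the double negative [^\S\n]*, but with Python 3 str patterns that also covers other Unicode whitespace, so it is not the same character set as the one in the question.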

Answered by: Brad614 | Posted: 01-03-2022



Answer 2

Are you trying to deduce the structure of a document in plain text? Are you doing what docutils does?

You might be able to simply use the Docutils parser rather than rolling your own.

Answered by: Abigail651 | Posted: 01-03-2022



Answer 3

Not a regexp but really elegant:

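from itertools import groupby

def paragraph(lines):
    # `lines` is a single string; splitlines(True) keeps each line's newline.
    # groupby clusters consecutive lines by whether they are whitespace-only.
    for group_separator, line_iteration in groupby(lines.splitlines(True), key=str.isspace):
        if not group_separator:
            yield ''.join(line_iteration)

for p in paragraph('p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'):
    print(repr(p))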

'p1\n'
'p2\t\n\tstill p2\t   \n'
'\tp3'

It's up to you to strip the output as you need it of course.

Inspired by the famous "Python Cookbook" ;-)
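Stripping the output, as mentioned above, might be done along these lines. This is a sketch only; stripped_paragraphs is a made-up name, and the generator is repeated so the snippet runs on its own.

```python
from itertools import groupby

def paragraph(text):
    # Yield each run of consecutive non-blank lines as one string.
    for is_blank, group in groupby(text.splitlines(True), key=str.isspace):
        if not is_blank:
            yield ''.join(group)

def stripped_paragraphs(text):
    """Hypothetical helper: collect the paragraphs with outer whitespace removed."""
    return [p.strip() for p in paragraph(text)]

print(stripped_paragraphs('p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'))
# ['p1', 'p2\t\n\tstill p2', 'p3']
```

Note that this approach treats any run of blank lines as a single break, so for the question's third example ('p1\n\n\n\tp2') it gives ['p1', 'p2'] rather than ['p1', '\n\tp2'].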

Answered by: Carina726 | Posted: 01-03-2022



Answer 4

Almost the same, but using non-greedy quantifiers and taking advantage of the whitespace escape sequence (\s).

\s*?\n\s*?\n\s*?

Answered by: Catherine602 | Posted: 01-03-2022



Answer 5

FYI: I just wrote 2 solutions for this type of problem in another thread. First using regular expressions as requested here, and second using a state machine approach which streams through the input one line at a time:

https://stackoverflow.com/a/64863601/5201675

Answered by: Elian351 | Posted: 01-03-2022


