Python regular expression to split paragraphs
How would one write a regular expression in Python to split a string into paragraphs?
A paragraph break is defined by 2 line breaks (\n). But there can be any number of spaces/tabs mixed in with the line breaks, and it should still be considered a paragraph break.
I am using Python, so the solution can use Python's extended regular expression syntax (for example, it can make use of (?P...) constructs).
Examples:
the_str = 'paragraph1\n\nparagraph2'
# splitting should yield ['paragraph1', 'paragraph2']
the_str = 'p1\n\t\np2\t\n\tstill p2\t \n \n\tp3'
# should yield ['p1', 'p2\t\n\tstill p2', 'p3']
the_str = 'p1\n\n\n\tp2'
# should yield ['p1', '\n\tp2']
The best I could come up with is: r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', i.e.
import re
paragraphs = re.split(r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', the_str)
but that is ugly. Anything better?
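As a quick sanity check, the pattern above can be run against the three examples. This is only a minimal sketch; the names `PARAGRAPH_BREAK` and `examples` are made up for illustration.

```python
import re

# The pattern from the question, compiled once.
PARAGRAPH_BREAK = re.compile(r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*')

# Input string -> expected list of paragraphs, taken from the examples above.
examples = {
    'paragraph1\n\nparagraph2': ['paragraph1', 'paragraph2'],
    'p1\n\t\np2\t\n\tstill p2\t \n \n\tp3': ['p1', 'p2\t\n\tstill p2', 'p3'],
    'p1\n\n\n\tp2': ['p1', '\n\tp2'],
}

for text, expected in examples.items():
    result = PARAGRAPH_BREAK.split(text)
    print(repr(text), '->', result, 'OK' if result == expected else 'MISMATCH')
```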
EDIT:
Suggestions rejected:
r'\s*?\n\s*?\n\s*?'
-> That would make examples 2 and 3 fail, since \s includes \n, so it would allow paragraph breaks with more than 2 \ns.
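A short check of how `\s`-based patterns behave on the examples, assuming CPython's `re` module. The greedy variant `\s*\n\s*\n\s*` does not appear in the post and is included here only for comparison.

```python
import re

example_2 = 'p1\n\t\np2\t\n\tstill p2\t \n \n\tp3'
example_3 = 'p1\n\n\n\tp2'

# Lazy form: the trailing \s*? matches the empty string, so the tab
# before "p3" is left at the start of the last paragraph.
lazy = re.split(r'\s*?\n\s*?\n\s*?', example_2)
print(lazy)    # ['p1', 'p2\t\n\tstill p2', '\tp3']

# Greedy form: \s* also consumes newlines, so three line breaks
# collapse into a single paragraph break.
greedy = re.split(r'\s*\n\s*\n\s*', example_3)
print(greedy)  # ['p1', 'p2']
```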
Asked by: John142 | Posted: 28-01-2022
Answer 1
Unfortunately there's no nice way to write "space but not a newline".
I think the best you can do is add some space with the x modifier and try to factor out the ugliness a bit, but that's questionable: (?x) (?: [ \t\r\f\v]*? \n ){2} [ \t\r\f\v]*?
You could also try creating a subrule just for the character class and interpolating it three times.
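The "subrule" idea might look like the sketch below, which keeps the character class in one variable and interpolates it into a verbose pattern. The names `HSPACE`, `PARAGRAPH_BREAK`, and `PARAGRAPH_BREAK_ALT` are illustrative. The sketch uses greedy `*` rather than `*?`, because a trailing lazy quantifier can match the empty string and would leave leading spaces/tabs on the next paragraph. The negated class `[^\S\n]` ("whitespace, but not a newline") is shown as an alternative spelling.

```python
import re

# ASCII whitespace characters other than "\n" (illustrative name).
HSPACE = r'[ \t\r\f\v]'

PARAGRAPH_BREAK = re.compile(
    rf'''
    (?: {HSPACE}* \n ){{2}}   # two line breaks, each optionally preceded by spaces/tabs
    {HSPACE}*                 # spaces/tabs before the next paragraph
    ''',
    re.VERBOSE,
)

# Alternative spelling: [^\S\n] matches any whitespace character except "\n".
PARAGRAPH_BREAK_ALT = re.compile(r'(?:[^\S\n]*\n){2}[^\S\n]*')

text = 'p1\n\t\np2\t\n\tstill p2\t \n \n\tp3'
print(PARAGRAPH_BREAK.split(text))      # ['p1', 'p2\t\n\tstill p2', 'p3']
print(PARAGRAPH_BREAK_ALT.split(text))  # ['p1', 'p2\t\n\tstill p2', 'p3']
```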
Answered by: Brad614 | Posted: 01-03-2022
Answer 2
Are you trying to deduce the structure of a document in plain text? Are you doing what docutils does?
You might be able to simply use the Docutils parser rather than roll your own.
Answered by: Abigail651 | Posted: 01-03-2022
Answer 3
Not a regexp but really elegant:
from itertools import groupby
def paragraph(lines):
    # Group consecutive lines by whether they are whitespace-only;
    # whitespace-only groups are the separators and are skipped.
    for group_separator, line_iteration in groupby(lines.splitlines(True), key=str.isspace):
        if not group_separator:
            yield ''.join(line_iteration)
for p in paragraph('p1\n\t\np2\t\n\tstill p2\t \n \n\tp3'):
    print(repr(p))
'p1\n'
'p2\t\n\tstill p2\t \n'
'\tp3'
It's up to you to strip the output as you need it of course.
Inspired by the famous "Python Cookbook" ;-)
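One way to do the stripping is sketched below, with `stripped_paragraphs` as a hypothetical name. This approach treats any run of whitespace-only lines as a single separator, so the third example from the question yields ['p1', 'p2'] rather than ['p1', '\n\tp2'].

```python
from itertools import groupby

def stripped_paragraphs(text):
    """Yield paragraphs separated by one or more whitespace-only lines."""
    for is_blank, lines in groupby(text.splitlines(True), key=str.isspace):
        if not is_blank:
            # Join the lines of one paragraph and trim surrounding whitespace.
            yield ''.join(lines).strip()

print(list(stripped_paragraphs('p1\n\t\np2\t\n\tstill p2\t \n \n\tp3')))
# ['p1', 'p2\t\n\tstill p2', 'p3']
print(list(stripped_paragraphs('p1\n\n\n\tp2')))
# ['p1', 'p2']
```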
Answered by: Carina726 | Posted: 01-03-2022
Answer 4
Almost the same, but using non-greedy quantifiers and taking advantage of the whitespace sequence.
\s*?\n\s*?\n\s*?
Answered by: Catherine602 | Posted: 01-03-2022
Answer 5
FYI: I just wrote 2 solutions for this type of problem in another thread. First using regular expressions as requested here, and second using a state machine approach which streams through the input one line at a time:
https://stackoverflow.com/a/64863601/5201675
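For readers who only want the general shape of a line-at-a-time approach, here is a minimal sketch. It is not the code from the linked answer; `iter_paragraphs` is a hypothetical name, and it assumes that any run of whitespace-only lines ends a paragraph and that each paragraph should be stripped.

```python
import io

def iter_paragraphs(lines):
    """Yield stripped paragraphs from an iterable of lines (e.g. an open file)."""
    current = []                  # lines of the paragraph being collected
    for line in lines:
        if line.strip():          # non-blank line: stay in (or enter) a paragraph
            current.append(line)
        elif current:             # blank line: close the open paragraph, if any
            yield ''.join(current).strip()
            current = []
    if current:                   # flush the last paragraph at end of input
        yield ''.join(current).strip()

stream = io.StringIO('p1\n\t\np2\t\n\tstill p2\t \n \n\tp3')
print(list(iter_paragraphs(stream)))
# ['p1', 'p2\t\n\tstill p2', 'p3']
```

Because it only holds the current paragraph in memory, the same generator can be fed an open file object instead of a `StringIO`.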
Answered by: Elian351 | Posted: 01-03-2022