Python Regular Expression to add links to urls
I'm trying to write a regular expression that correctly captures URLs, including ones wrapped in parentheses, as in (http://example.com), a problem discussed on Coding Horror at https://blog.codinghorror.com/the-problem-with-urls/
I'm currently using the following to create HTML A tags in Python for links that start with http or www.
r1 = r"(\b(http|https)://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]))"
r2 = r"((^|\b)www\.([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]))"
return re.sub(r2,r'<a rel="nofollow" target="_blank" href="http://\1">\1</a>',re.sub(r1,r'<a rel="nofollow" target="_blank" href="\1">\1</a>',text))
This works well except when someone wraps the URL in parentheses. Does anyone have a better way?
Asked by: Kellan170 | Posted: 06-12-2021
Answer 1
The problem is that URLs can have parentheses as part of them, e.g. (http://en.wikipedia.org/wiki/Tropical_Storm_Alberto_(2006)). You can't handle that with a regexp alone, since it has no state to track whether each closing parenthesis matches an earlier opening one. You need a parser. So your best chance is to use a parser and try to guess the correct closing parenthesis. That is error-prone (the URL could open a parenthesis and never close it), so I guess you're out of luck anyway.
See also http://en.wikipedia.org/wiki/, or (http://en.wikipedia.org/wiki/)) and other similar valid URLs.
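A minimal sketch of the "guess the closing parenthesis" idea, not a full URL parser. It assumes the input text needs no further HTML escaping, and the names URL_RE, trim_url and linkify are hypothetical. The regex grabs a loose candidate, then plain Python trims trailing punctuation and any closing parenthesis that has no opening partner inside the candidate. Doing everything in one pass also means a www. inside an already matched http:// URL is not linked a second time.

```python
import re

# Candidate: "http://", "https://" or a bare "www.", then everything up to
# whitespace, an angle bracket or a double quote (a deliberately loose pattern).
URL_RE = re.compile(r'\b(?:https?://|www\.)[^\s<>"]+', re.IGNORECASE)

def trim_url(candidate):
    """Drop trailing punctuation and unmatched closing parentheses."""
    while candidate:
        last = candidate[-1]
        if last in ".,;:!?":
            candidate = candidate[:-1]
        elif last == ")" and candidate.count(")") > candidate.count("("):
            candidate = candidate[:-1]
        else:
            break
    return candidate

def linkify(text):
    """Wrap each URL in an A tag; assumes text needs no further HTML escaping."""
    def replace(match):
        raw = match.group(0)
        url = trim_url(raw)
        trailing = raw[len(url):]  # trimmed characters stay outside the link
        href = url if url.lower().startswith("http") else "http://" + url
        return '<a rel="nofollow" target="_blank" href="%s">%s</a>%s' % (href, url, trailing)
    return URL_RE.sub(replace, text)

print(linkify("(http://en.wikipedia.org/wiki/Tropical_Storm_Alberto_(2006))"))
```

A URL that opens a parenthesis and never closes it is still ambiguous, so this remains a guess rather than a guarantee.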
Answered by: Elise140 | Posted: 07-01-2022
Similar questions
python - Regular expression syntax for "match nothing"?
I have a python template engine that heavily uses regexp. It uses concatenation like:
re.compile( regexp1 + "|" + regexp2 + "*|" + regexp3 + "+" )
I can modify the individual substrings (regexp1, regexp2 etc).
Is there any small and light expression that matches nothing, which I can use inside a template where I don't want any matches? Unfortunately, sometimes '+' or '*' is appende...
regex - Python regular expression to match # followed by 0-7 followed by ##
I would like to intercept a string starting with *#*
followed by a number between 0 and 7
and ending with: ##
so something like *#*0##
but I could not find a regex for this
python - How do I use a regular expression to match a name?
I am a newbie in Python. I want to write a regular expression for some name checking.
My input string can contain a-z, A-Z, 0-9, and ' _ ', but it should start with either a-z or A-Z (not 0-9 and ' _ '). I want to write a regular expression for this. I tried, but nothing was matching perfectly.
Once the input string follows the regular expression rules, I can proceed further, otherwise discard that string.
regex - python regular expression for domain names
I am trying to use the following regular expression to extract the domain name from a text, but it just produces nothing. What's wrong with it? I don't know if it is suitable to ask this kind of "fix my code" question; maybe I should read more. I just want to save some time.
Thanks
pat_url = re.compile(r'''
(?:https?://)*
(?:[\w]+[\-\w]+[.])*
(?P<domain>[\w\-]*[\w.](com|net)([.]...
python - OR in regular expression?
I have text file with several thousands lines. I want to parse this file into database and decided to write a regexp. Here's part of file:
blablabla checked=12 unchecked=1
blablabla unchecked=13
blablabla checked=14
As a result, I would like to get something like
(12,1)
(0,13)
(14,0)
Is it possible?
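One possible approach, sketched under the assumption that each line carries at most one checked= and one unchecked= field: run two small searches instead of one alternation and default a missing field to 0. The function name parse_counts is hypothetical.

```python
import re

def parse_counts(line):
    """Return (checked, unchecked), using 0 for a missing field."""
    # \b keeps "checked=" from matching inside "unchecked=".
    checked = re.search(r"\bchecked=(\d+)", line)
    unchecked = re.search(r"\bunchecked=(\d+)", line)
    return (int(checked.group(1)) if checked else 0,
            int(unchecked.group(1)) if unchecked else 0)

for line in ("blablabla checked=12 unchecked=1",
             "blablabla unchecked=13",
             "blablabla checked=14"):
    print(parse_counts(line))
```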
regex - Find last match with python regular expression
I want to match the last occurrence of a simple pattern in a string, e.g.
matches = re.findall(r"\w+ AAAA \w+", "foo bar AAAA foo2 AAAA bar2")
print("last match:", matches[-1])
However, if the string is very long, a huge list of matches is generated. Is there a more direct way to match the second occurrence of " AAAA ", or should I use this workaround?
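One way to avoid building the list, shown as a sketch rather than the only answer: put a greedy .* in front of the pattern so the engine settles on the rightmost position where the pattern can start. The \b is an added assumption that a match should begin at a word boundary.

```python
import re

text = "foo bar AAAA foo2 AAAA bar2"

# The greedy ".*" first swallows the whole string, then backs off only as far
# as needed, so group 1 is the rightmost place the pattern can start.
# \b stops the match from starting in the middle of a word.
m = re.search(r".*\b(\w+ AAAA \w+)", text, re.DOTALL)
last = m.group(1) if m else None
print("last match:", last)
```

Backtracking from the end still costs time on a very long string, but no list of intermediate matches is kept.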
python - Regular expression works normally, but fails when placed in an XML schema
I have a simple doc.xml file which contains a single root element with a Timestamp attribute:
<?xml version="1.0" encoding="utf-8"?>
<root Timestamp="04-21-2010 16:00:19.000" />
I'd like to validate this document against my simple schema.xsd to make sure that the Timestamp is in the correct format:
<?xml version="1.0" encoding="utf...
regex - Python regular expression style
Is there a Pythonic 'standard' for how regular expressions should be used?
What I typically do is perform a bunch of re.compile statements at the top of my module and store the objects in global variables... then later on use them within my functions and classes.
I could define the regexs within the functions I would be using them, but then they would be recompiled every time.
Or, I could f...
python - What's wrong with my regular expression?
I'm expecting a string NOT to match a regular expression, but it is!
>>> re.compile('^P|([LA]E?)$').match('PE').group()
'P'
This seems like a bug, because I see no way for the $ to match. On the other hand, it seems unlikely that Python's re lib would not be able to handle this simple case. Am I missing something here?
btw, Python prints this out when I start it:
python - How do I search from the bottom up using a regular expression?
Here is an example of the type of text file I am trying to search (named usefile):
DOCK onomatopoeia
DOCK blah blah
blah DOCK blah
DOCK
blah blah blah
onomatopoeia
blah blah blah
blah blah DOCK
DOCK blah blah
DOCK blah
onomatopoeia
I am using a finditer statement to find everything between DOCK and onomatopoeia as follows:
re.finditer(r'((dock)(.+?)(onomat...
regex - Regular expression with [ or ( in python
I need to extract IP address in the form
prosseek.amer.corp.com [10.0.40.147]
or
prosseek.amer.corp.com (10.0.40.147)
with Python. How can I get the IP for either case with Python? I started with something like
site = "prosseek.amer.corp.com"
m = re.search("%s.*[\(\[](\d+\.\d+\.\d+\.\d+)" % site, r)
but it doesn't work....
regex - Python : One regular expression help
In python, I've got a string like
ABC(a =2,bc=2, asdf_3 = None)
By using regular expression, I want to make it like
ABC(a =2,bc=2)
I want to remove the parameter that named 'asdf_3', that's it!
Update: The parameters can be a lot, only the asdf_3 is same in all cases, the order is usually the last one.
python - How can I build a regular expression to match a single word?
Say I had the following strings:
Dublin, Ireland.
DublinIreland
Ireland, Dublin
What regular Expression could I use to find the word Dublin in the above strings, but, it cannot count DublinIreland. As in, DublinIreland doesn't say Dublin, it is a whole word that says DublinIreland.
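A short sketch using word boundaries, which is the usual tool for whole-word matching. The helper name mentions_dublin is hypothetical.

```python
import re

def mentions_dublin(s):
    """True if "Dublin" appears as a whole word in s."""
    # \b matches between a word character and a non-word character (or the
    # string edge), so "DublinIreland" is rejected.
    return re.search(r"\bDublin\b", s) is not None

for s in ("Dublin, Ireland.", "DublinIreland", "Ireland, Dublin"):
    print(s, mentions_dublin(s))
```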
python - Regular expression for a string like this
I need to match ANY strings that start with:
'/Engine
and end with:
ir_vrn'
I have used this:
vrn_page = re.compile('\'/Engine[a-zA-Z0-9._+-&/?:=]+ir_vrn\'')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/re.py", line 190, in compile
return _compile(pattern, flags)
...
python - Regular expression to match all words ending with and containing three 'e's
I am trying to write a regular expression that matches all words in which the only vowel is e and there are exactly three e's. I am writing this in Python. I tried writing
(?= e){3}[^aiou]*
but it didn't work.
regex - python regular expression help
I need to match any string that ends with a letter and whose second-to-last character is '>'.
It will match:
abc>a
ddd_4>f
It will not match:
abc>ab
abc>2
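A sketch that fits the four examples above, assuming "letter" means ASCII a-z or A-Z. The names ENDS_GT_LETTER and ends_with_gt_letter are hypothetical.

```python
import re

# '>' followed by exactly one ASCII letter at the very end of the string.
# \Z is used instead of $ so a trailing newline is not silently accepted.
ENDS_GT_LETTER = re.compile(r">[A-Za-z]\Z")

def ends_with_gt_letter(s):
    return ENDS_GT_LETTER.search(s) is not None

for s in ("abc>a", "ddd_4>f", "abc>ab", "abc>2"):
    print(s, ends_with_gt_letter(s))
```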
regex - Python regular expression to strip script tags
I'm a little scared to ask this for fear of retribution from the SO "You can't parse HTML with regular expressions" cult. Why does re.subn(r'<(script).*?</\1>', '', data, re.DOTALL) not strip the multiline 'script' but only the two single-line ones at the end, please?
Thanks, HC
>>> import re
>>> data = """\
<nothtml>
<head>
<title>Re...
python - Why doesn't this regular expression match in this string?
I want to be able to replace a string in a file using regular expressions. But my function isn't finding a match. So I've mocked up a test to replicate what's happening.
I have defined the string I want to replace as follows:
string = 'buf = O_strdup("ONE=001&TYPE=PUZZLE&PREFIX=EXPRESS&");'
I want to replace the "TYPE=PUZZLE&PREFIX=EXPRESS&" part with something else....
python - Regular expression in loops
I have two lists with data that I want to compare dates for. I tried using regular expression within a loop to find the corresponding entry from L1 in L2. The entries in the lists consist of strings 'code, name, date', and I want to match the entry from L1 with the entry that begins with the same code in L2. I wrote the regular expression like this:
for line in L2:
    if re.match((code), line):
python - need help regarding the following regular expression
I am using Python with the re module for regular expressions.
I want to remove all the characters from the string except numbers and characters.
To achieve this I am using the sub function.
Code Snippet:-
>>> text="foo.bar"
>>> re.sub("[^A-Z][^a-z]","",text)
'fobar'
I wanted to know why the above expression removes the "o."?
I am not able to understand why it ...
python - Regular expression to extract URL from an HTML link
I’m a newbie in Python. I’m learning regexes, but I need help here.
Here comes the HTML source:
<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>
I’m trying to code a tool that only prints out http://ptop.se. Can you help me please?
python - What is the regular expression for the "root" of a website in django?
I'm using django and when users go to www.website.com/ I want to point them to the index view.
Right now I'm doing this:
(r'^$', 'ideas.idea.views.index'),
However, it's not working. I'm assuming my regular expression is wrong. Can anyone help me out? I've looked at python regular expressions but they didn't help me.
regex - How do i write a regular expression for the following pattern in python?
How do I look for the following pattern using a regular expression in Python, for these two cases?
I am looking for str2 after the "=" sign.
Case 1: str1=str2
Case 2: str1 = str2
Please note there can be a space or none on either side of the "=" sign.
Mine is like this, but it only works for one of the cases!
m=re...
regex - How can I create a regular expression in Python?
I'm trying to create regular expressions to filter certain text from a text file. What I want to filter has this format:
word_*_word.word
So for example, I would like the Python code to find every match. Sample results would be:
program1_0.0-1_log.build
program2_0.1-3_log.build
How can I do this?
Thanks a lot for your help
python - How can I build a regular expression which has options part
How can I build a regular expression in python which can match all the following?
Each line is a "string (a-zA-Z)" followed by a space, followed by one or more groups of 4 integers, each group terminated by a comma:
Example:
someotherstring 42 1 48 17,
somestring 363 1 46 17,363 1 34 17,401 3 8 14,
otherstring 42 1 48 17,363 1 34 17,
I have tried the following, since I need t...
python - Regular expression to match start of filename and filename extension
What is the regular expression to match strings (in this case, file names) that start with 'Run' and have a filename extension of '.py'?
The regular expression should match any of the following:
RunFoo.py
RunBar.py
Run42.py
It should not match:
myRunFoo.py
RunBar.py1
Run42.txt
The SQL equivalent of what I am looking for is ... LIKE 'Run%.py' ...
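A sketch of the LIKE 'Run%.py' behaviour using re.fullmatch, so the whole name has to fit the pattern. The names RUN_PY and is_run_script are hypothetical.

```python
import re

# "Run", then any characters, then a literal ".py" at the end of the name.
RUN_PY = re.compile(r"Run.*\.py")

def is_run_script(name):
    # fullmatch anchors at both ends, so "myRunFoo.py" and "RunBar.py1" fail.
    return RUN_PY.fullmatch(name) is not None

for name in ("RunFoo.py", "RunBar.py", "Run42.py",
             "myRunFoo.py", "RunBar.py1", "Run42.txt"):
    print(name, is_run_script(name))
```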
regex - python regular expression to split paragraphs
How would one write a regular expression to use in python to split paragraphs?
A paragraph is defined by 2 linebreaks (\n). But one can have any amount of spaces/tabs together with the line breaks, and it still should be considered as a paragraph.
I am using python so the solution can use python's regular expression syntax whi...
python - Problem with Boolean Expression with a string value from a lIst
I have the following problem:
# line is a line from a file that contains ["baa","beee","0"]
line = TcsLine.split(",")
NumPFCs = eval(line[2])
if NumPFCs==0:
    print line
I want to print all the lines from the file if the second position of the list has a value == 0.
I print the lines but after that the following happens:
Traceback (most recent call last):
['baaa'...
python - split twice in the same expression?
Imagine I have the following:
inFile = "/adda/adas/sdas/hello.txt"
# this instruction gives me hello.txt
Name = inFile.split("/")[-1]
# this one gives me the name I want - just hello
Name1 = Name.split(".")[0]
Is there any chance to simplify that doing the same job in just one expression?
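Two possible one-liners, as a sketch: chaining the two splits directly, or using os.path from the standard library. The variable names in_file, name_by_split and name_by_path are hypothetical.

```python
import os.path

in_file = "/adda/adas/sdas/hello.txt"

# Chain both splits in a single expression.
name_by_split = in_file.split("/")[-1].split(".")[0]

# Or let os.path strip the directory and then the extension.
name_by_path = os.path.splitext(os.path.basename(in_file))[0]

print(name_by_split, name_by_path)
```

The two differ for names with several dots: for "a.tar.gz" the chained split gives "a", while splitext gives "a.tar".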
regex - How can I translate the following filename to a regular expression in Python?
I am battling regular expressions now as I type.
I would like to determine a pattern for the following example file: b410cv11_test.ext. I want to be able to do a search for files that match the pattern of the example file aforementioned. Where do I start (so lost and confused) and what is the best way of arriving at a solution that best matches the file pattern? Thanks in advance.
python - Regular expression to detect semi-colon terminated C++ for & while loops
In my Python application, I need to write a regular expression that matches a C++ for or while loop that has been terminated with a semi-colon (;). For example, it should match this:
for (int i = 0; i < 10; i++);
... but not this:
for (int i = 0; i < 10; i++)
This looks trivial at first glance, until you realise...
regex - Why is the regular expression returning an error in python?
I am trying the following regular expression in Python, but it returns an error:
import re
...
#read a line from a file to variable line
# looking for the pattern 'WORD' in the line ...
m=re.search('(?<=[WORD])\w+',str(line))
m.group(0)
I get the following error:
AttributeError: 'NoneType' object has no attribute 'group'