Python Regular Expression to add links to URLs

I'm trying to write a regular expression that correctly captures URLs, including ones wrapped in parentheses, as in (http://example.com), a problem discussed on Coding Horror at https://blog.codinghorror.com/the-problem-with-urls/

I'm currently using the following to create HTML A tags in Python for links that start with http or www.

import re

def linkify(text):  # wrapper name is illustrative; the original snippet showed only the body
    r1 = r"(\b(http|https)://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]))"  # http:// and https:// URLs
    r2 = r"((^|\b)www\.([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]))"  # bare www. hosts
    return re.sub(r2,r'<a rel="nofollow" target="_blank" href="http://\1">\1</a>',re.sub(r1,r'<a rel="nofollow" target="_blank" href="\1">\1</a>',text))

This works well except when someone wraps the URL in parentheses. Does anyone have a better way?
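To make the failure concrete, here is a minimal reproduction using the r1 pattern from above. The sample sentence is an assumption chosen for illustration; it is not from a real input.

```python
import re

# r1 copied verbatim from the question
r1 = r"(\b(http|https)://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]))"

# Hypothetical sample text with a URL wrapped in parentheses
match = re.search(r1, "See (http://example.com) for details")
captured = match.group(1)

# The closing parenthesis ends up inside the captured URL,
# because ")" is allowed both inside and at the end of the match.
print(captured)
```

The pattern has to allow ")" so that URLs containing parentheses still match, which is exactly why the wrapping parenthesis is swallowed as well.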


Asked by: Kellan170 | Posted: 06-12-2021






Answer 1

The problem is that URLs can contain parentheses themselves, for example (http://en.wikipedia.org/wiki/Tropical_Storm_Alberto_(2006)). You can't handle that with a regexp alone, since a regexp keeps no state and so cannot keep track of which parentheses are still open. You need a parser. Your best chance is to use a parser and try to guess the correct closing parenthesis. That is error-prone (the URL could open a parenthesis and never close it), so I guess you're out of luck anyway.

See also http://en.wikipedia.org/wiki/, or (http://en.wikipedia.org/wiki/)) and other similar valid URLs.
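The "guess the closing parenthesis" idea can be sketched as a small post-processing step on each regex match. This is a heuristic under stated assumptions, not a complete solution. The names URL_RE, split_trailing_parens and linkify are hypothetical, only http:// and https:// links are handled, and the character classes are borrowed from the question's r1.

```python
import re

# Character classes borrowed from the question's r1 pattern (http/https only)
URL_RE = re.compile(
    r"\bhttps?://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]"
)

def split_trailing_parens(url):
    """Move unmatched trailing ')' characters out of url; return (url, tail)."""
    tail = ""
    # Heuristic: a trailing ')' with no matching '(' inside the URL
    # is assumed to belong to the surrounding text.
    while url.endswith(")") and url.count(")") > url.count("("):
        url, tail = url[:-1], ")" + tail
    return url, tail

def linkify(text):
    """Wrap each matched URL in an anchor tag, leaving stray ')' outside it."""
    def repl(match):
        url, tail = split_trailing_parens(match.group(0))
        return f'<a rel="nofollow" target="_blank" href="{url}">{url}</a>{tail}'
    return URL_RE.sub(repl, text)

print(linkify("(http://example.com)"))
print(linkify("(http://en.wikipedia.org/wiki/Tropical_Storm_Alberto_(2006))"))
```

As the answer warns, this can still guess wrong: a valid URL that really ends in an unmatched ")" will be trimmed, and a URL that opens a parenthesis without closing it will keep a ")" that belonged to the surrounding text.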

Answered by: Elise140 | Posted: 07-01-2022



Similar questions

python - Regular expression syntax for "match nothing"?

I have a python template engine that heavily uses regexp. It uses concatenation like: re.compile( regexp1 + "|" + regexp2 + "*|" + regexp3 + "+" ) I can modify the individual substrings (regexp1, regexp2 etc). Is there any small and light expression that matches nothing, which I can use inside a template where I don't want any matches? Unfortunately, sometimes '+' or '*' is appende...


regex - Python regular expression to match # followed by 0-7 followed by ##

I would like to intercept strings starting with *#* followed by a number between 0 and 7 and ending with ##, so something like *#*0##, but I could not find a regex for this.


python - How do I use a regular expression to match a name?

I am a newbie in Python. I want to write a regular expression for some name checking. My input string can contain a-z, A-Z, 0-9, and ' _ ', but it should start with either a-z or A-Z (not 0-9 and ' _ '). I want to write a regular expression for this. I tried, but nothing was matching perfectly. Once the input string follows the regular expression rules, I can proceed further, otherwise discard that string.


regex - python regular expression for domain names

I am trying to use the following regular expression to extract a domain name from a text, but it just produces nothing. What's wrong with it? I don't know if this is suitable to ask this "fix code" question, maybe I should read more. I just want to save some time. Thanks pat_url = re.compile(r''' (?:https?://)* (?:[\w]+[\-\w]+[.])* (?P<domain>[\w\-]*[\w.](com|net)([.]...


python - OR in regular expression?

I have text file with several thousands lines. I want to parse this file into database and decided to write a regexp. Here's part of file: blablabla checked=12 unchecked=1 blablabla unchecked=13 blablabla checked=14 As a result, I would like to get something like (12,1) (0,13) (14,0) Is it possible?


regex - Find last match with python regular expression

I want to match the last occurrence of a simple pattern in a string, e.g. list = re.findall(r"\w+ AAAA \w+", "foo bar AAAA foo2 AAAA bar2") print "last match: ", list[len(list)-1] However, if the string is very long, a huge list of matches is generated. Is there a more direct way to match the second occurrence of " AAAA ", or should I use this workaround?


python - Regular expression works normally, but fails when placed in an XML schema

I have a simple doc.xml file which contains a single root element with a Timestamp attribute: <?xml version="1.0" encoding="utf-8"?> <root Timestamp="04-21-2010 16:00:19.000" /> I'd like to validate this document against my simple schema.xsd to make sure that the Timestamp is in the correct format: <?xml version="1.0" encoding="utf...


regex - Python regular expression style

Is there a Pythonic 'standard' for how regular expressions should be used? What I typically do is perform a bunch of re.compile statements at the top of my module and store the objects in global variables... then later on use them within my functions and classes. I could define the regexs within the functions I would be using them, but then they would be recompiled every time. Or, I could f...


python - What's wrong with my regular expression?

I'm expecting a string NOT to match a regular expression, but it is! >>> re.compile('^P|([LA]E?)$').match('PE').group() 'P' This seems like a bug, because I see no way for the $ to match. On the other hand, it seems unlikely that Python's re lib would not be able to handle this simple case. Am I missing something here? btw, Python prints this out when I start it:


python - How do I search from the bottom up using a regular expression?

Here is an example of the type of text file I am trying to search (named usefile): DOCK onomatopoeia DOCK blah blah blah DOCK blah DOCK blah blah blah onomatopoeia blah blah blah blah blah DOCK DOCK blah blah DOCK blah onomatopoeia I am using a finditer statement to find everything between DOCK and onomatopoeia as follows: re.finditer(r'((dock)(.+?)(onomat...


regex - Regular expression with [ or ( in python

I need to extract IP address in the form prosseek.amer.corp.com [10.0.40.147] or prosseek.amer.corp.com (10.0.40.147) with Python. How can I get the IP for either case with Python? I started with something like site = "prosseek.amer.corp.com" m = re.search("%s.*[\(\[](\d+\.\d+\.\d+\.\d+)" % site, r) but it doesn't work....


regex - Python : One regular expression help

In python, I've got a string like ABC(a =2,bc=2, asdf_3 = None) By using regular expression, I want to make it like ABC(a =2,bc=2) I want to remove the parameter that named 'asdf_3', that's it! Update: The parameters can be a lot, only the asdf_3 is same in all cases, the order is usually the last one.


python - How can I build a regular expression to match a single word?

Say I had the following strings: Dublin, Ireland. DublinIreland Ireland, Dublin What regular Expression could I use to find the word Dublin in the above strings, but, it cannot count DublinIreland. As in, DublinIreland doesn't say Dublin, it is a whole word that says DublinIreland.


python - Regular expression for a string like this

I need to match ANY strings that start with: '/Engine and end with: ir_vrn' I have used this: vrn_page = re.compile('\'/Engine[a-zA-Z0-9._+-&/?:=]+ir_vrn\'') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/lib/python2.6/re.py", line 190, in compile return _compile(pattern, flags) ...


python - Regular expression to match all words ending with and containing three 'e's

I am trying to write a regular expression that matches all words so that the only vowel is e and there are exactly three e's in the word, am writing this in python. I tried writing (?= e){3}[^aiou]* but it didn't work.


regex - python regular expression help

I need to match any string that ends with a letter and whose second-to-last character is '>'. It will match: abc>a ddd_4>f It will not match: abc>ab abc>2


regex - Python regular expression to strip script tags

I'm a little scared to ask this for fear of retribution from the SO "You can't parse HTML with regular expressions" cult. Why does re.subn(r'<(script).*?</\1>', '', data, re.DOTALL) not strip the multiline 'script' but only the two single-line ones at the end, please? Thanks, HC >>> import re >>> data = """\ <nothtml> <head> <title>Re...


python - Why doesn't this regular expression match in this string?

I want to be able to replace a string in a file using regular expressions. But my function isn't finding a match. So I've mocked up a test to replicate what's happening. I have defined the string I want to replace as follows: string = 'buf = O_strdup("ONE=001&TYPE=PUZZLE&PREFIX=EXPRESS&");' I want to replace the "TYPE=PUZZLE&PREFIX=EXPRESS&" part with something else....


python - Regular expression in loops

I have two lists with data that I want to compare dates for. I tried using regular expression within a loop to find the corresponding entry from L1 in L2. The entries in the lists consist of strings 'code, name, date', and I want to match the entry from L1 with the entry that begins with the same code in L2. I wrote the regular expression like this: for line in L2: if re.match((code), line):


python - need help regarding the following regular expression

I am using Python with the re module for regular expressions. I want to remove all the characters from the string except numbers and letters. To achieve this I am using the sub function. Code snippet: >>> text="foo.bar" >>> re.sub("[^A-Z][^a-z]","",text) 'fobar' I wanted to know why the above expression removes the "o."? I am not able to understand why it ...


python - Regular expression to extract URL from an HTML link

I’m a newbie in Python. I’m learning regexes, but I need help here. Here comes the HTML source: <a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a> I’m trying to code a tool that only prints out http://ptop.se. Can you help me please?


python - What is the regular expression for the "root" of a website in django?

I'm using django and when users go to www.website.com/ I want to point them to the index view. Right now I'm doing this: (r'^$', 'ideas.idea.views.index'), However, it's not working. I'm assuming my regular expression is wrong. Can anyone help me out? I've looked at python regular expressions but they didn't help me.


regex - How do i write a regular expression for the following pattern in python?

How do I look for the following pattern using a regular expression in Python? In both cases I am looking for str2 after the "=" sign. Case 1: str1=str2 Case 2: str1 = str2 Please note there can be a space or none on either side of the "=" sign. Mine is like this, but only works for one of the cases! m=re...


regex - How can I create a regular expression in Python?

I'm trying to create regular expressions to filter certain text from a text file. What I want to filter has this format: word_*_word.word So for example, I would like the python code every match. Sample results would be: program1_0.0-1_log.build program2_0.1-3_log.build How can I do this? Thanks a lot for your help


python - How can I build a regular expression which has options part

How can I build a regular expression in python which can match all the following? where it is a "string (a-zA-Z)" follow by a space follow by 1 or multiple 4 integers which separates by a comma: Example: someotherstring 42 1 48 17, somestring 363 1 46 17,363 1 34 17,401 3 8 14, otherstring 42 1 48 17,363 1 34 17, I have tried the following, since I need t...


python - Regular expression to match start of filename and filename extension

What is the regular expression to match strings (in this case, file names) that start with 'Run' and have a filename extension of '.py'? The regular expression should match any of the following: RunFoo.py RunBar.py Run42.py It should not match: myRunFoo.py RunBar.py1 Run42.txt The SQL equivalent of what I am looking for is ... LIKE 'Run%.py' ...


regex - python regular expression to split paragraphs

How would one write a regular expression to use in python to split paragraphs? A paragraph is defined by 2 linebreaks (\n). But one can have any amount of spaces/tabs together with the line breaks, and it still should be considered as a paragraph. I am using python so the solution can use python's regular expression syntax whi...


python - Problem with Boolean Expression with a string value from a lIst

I have the following problem: # line is a line from a file that contains ["baa","beee","0"] line = TcsLine.split(",") NumPFCs = eval(line[2]) if NumPFCs==0: print line I want to print all the lines from the file if the second position of the list has a value == 0. I print the lines but after that the following happens: Traceback (most recent call last): ['baaa'...


python - split twice in the same expression?

Imagine I have the following: inFile = "/adda/adas/sdas/hello.txt" # that instruction give me hello.txt Name = inFile.name.split("/") [-1] # that one give me the name I want - just hello Name1 = Name.split(".") [0] Is there any chance to simplify that doing the same job in just one expression?


regex - How can I translate the following filename to a regular expression in Python?

I am battling regular expressions now as I type. I would like to determine a pattern for the following example file: b410cv11_test.ext. I want to be able to do a search for files that match the pattern of the example file aforementioned. Where do I start (so lost and confused) and what is the best way of arriving at a solution that best matches the file pattern? Thanks in advance.


python - Regular expression to detect semi-colon terminated C++ for & while loops

In my Python application, I need to write a regular expression that matches a C++ for or while loop that has been terminated with a semi-colon (;). For example, it should match this: for (int i = 0; i < 10; i++); ... but not this: for (int i = 0; i < 10; i++) This looks trivial at first glance, until you realise...


regex - Why is the regular expression returning an error in python?

I am trying the following regular expression in Python, but it returns an error: import re ... #read a line from a file to variable line # looking for the pattern 'WORD' in the line ... m=re.search('(?<=[WORD])\w+',str(line)) m.group(0) I get the following error: AttributeError: 'NoneType' object has no attribute 'group'





