Regular expression to extract URL from an HTML link

I’m a newbie in Python. I’m learning regexes, but I need help here.

Here comes the HTML source:

<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>

I’m trying to write a tool that prints out only the URL, http://www.ptop.se. Can you help me, please?


Asked by: John721 | Posted: 06-12-2021






Answer 1

If you're only looking for one:

import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))

If you have a long string, and want every instance of the pattern in it:

import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))

Where s is the string that you're looking for matches in.
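For a concrete s, here is a small sketch that runs the same pattern over a few variations of the link from the question. The names HREF_PATTERN and samples are made up for illustration, and the single-quoted and unquoted inputs are hypothetical; only the pattern itself comes from this answer.

```python
import re

# Same pattern as above: an optional quote after href=, then capture
# everything up to the next quote, space or '>'.
HREF_PATTERN = re.compile(r'href=[\'"]?([^\'" >]+)')

# Hypothetical inputs: double-quoted, single-quoted and unquoted values.
samples = [
    '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>',
    "<a href='http://www.ptop.se'>link</a>",
    '<a href=http://www.ptop.se>link</a>',
]

extracted = [HREF_PATTERN.search(s).group(1) for s in samples]
print(extracted)
```

All three inputs should yield the same URL. Note that the pattern looks at href= anywhere in the text, so it would also pick up the href of tags other than a, such as link.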

Quick explanation of the regexp bits:

r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)

"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.

Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."

"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.

The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.

It's pretty easy to do:

from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)
soup = BeautifulSoup(html_to_parse, 'html.parser')
for tag in soup.find_all('a', href=True):
    print(tag['href'])

Once you've installed BeautifulSoup, anyway.

Answered by: Kate237 | Posted: 07-01-2022



Answer 2

Don't use regexes; use BeautifulSoup. Alternatively, be crufty and shell out to, say, w3m or lynx and read back what it renders. The first is probably more elegant; the second just worked a heck of a lot faster on some unoptimized code I wrote a while back.

Answered by: Joyce456 | Posted: 07-01-2022



Answer 3

This should work, although there might be more elegant ways.

import re
url = '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'
# Lookbehind and lookahead: match only the text between href=" and the next "
r = re.compile(r'(?<=href=").*?(?=")')
print(r.findall(url))

Answered by: Rubie829 | Posted: 07-01-2022



Answer 4

John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:

http://daringfireball.net/2009/11/liberal_regex_for_matching_urls

If you just want to grab the URL (i.e. you’re not really trying to parse the HTML), this might be more lightweight than an HTML parser.

Answered by: Miller517 | Posted: 07-01-2022



Answer 5

Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.

In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.
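As a rough illustration of the parser route without third-party packages, here is a minimal sketch using the standard library's html.parser module (the Python 3 home of HTMLParser). The class name LinkCollector is made up for this example, and the input is the link from the question.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag it is fed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lower-cased.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>')
print(collector.links)
```

The parser deals with quoting and attribute order itself, which is the part a hand-written regex tends to get wrong.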

Answered by: Marcus664 | Posted: 07-01-2022



Answer 6

There's tonnes of them on regexlib

Answered by: Emily667 | Posted: 07-01-2022



Answer 7

This regex can help you. Get the first group with \1 or whatever method your language provides.

href="([^"]*)

example:

<a href="http://www.amghezi.com">amgheziName</a>

result:

http://www.amghezi.com
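In Python, "the first group" is match.group(1). A minimal sketch applying this pattern to the example above (the variable names html, match and result are only illustrative):

```python
import re

html = '<a href="http://www.amghezi.com">amgheziName</a>'

# Capture everything after href=" up to, but not including, the next ".
match = re.search(r'href="([^"]*)', html)
result = match.group(1) if match else None
print(result)
```

Unlike the pattern in Answer 1, this one only handles double-quoted attribute values.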

Answered by: Blake178 | Posted: 07-01-2022



Answer 8

Yes, there are tons of them on regexlib. That only proves that REs should not be used to do that. Use SGMLParser or BeautifulSoup or write a parser, but don't use REs. The ones that seem to work are extremely complicated and still don't cover all cases.

Answered by: Grace943 | Posted: 07-01-2022



Answer 9

This works pretty well: the non-capturing group skips past href= and the opening quote, and the capturing group gets the link only. Tested on http://pythex.org/

(?:href=['"])([:/.A-Za-z?<_&\s=>0-9;-]+)

Output:

Match 1. /wiki/Main_Page

Match 2. /wiki/Portal:Contents

Match 3. /wiki/Portal:Featured_content

Match 4. /wiki/Portal:Current_events

Match 5. /wiki/Special:Random

Match 6. //donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en

Answered by: Lyndon872 | Posted: 07-01-2022



Answer 10

You can use this.

<a[^>]+href=["'](.*?)["']
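A short sketch of this pattern with re.findall. The sample page string and the name LINK_PATTERN are hypothetical, and the re.IGNORECASE flag is an addition not in the answer, so that upper-case tags such as <A HREF=...> are matched too.

```python
import re

# The pattern from this answer; re.IGNORECASE is an extra assumption.
LINK_PATTERN = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)

# Hypothetical page fragment: a <link> tag that should be skipped,
# followed by two anchors with different quoting and letter case.
page = (
    '<link href="style.css" rel="stylesheet">'
    '<a class="nav" href="/wiki/Main_Page">Main</a>'
    "<A HREF='http://www.ptop.se'>ptop</A>"
)

links = LINK_PATTERN.findall(page)
print(links)
```

Because the pattern starts at <a, the href of the link tag is skipped, which is the main difference from the simpler href= patterns in the other answers.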

Answered by: Freddie483 | Posted: 07-01-2022








