Python regular expression for HTML parsing (BeautifulSoup)

I want to grab the value of a hidden input field in HTML.

<input type="hidden" name="fooId" value="12-3456789-1111111111" />

I want to write a regular expression in Python that will return the value of fooId, given that I know the line in the HTML follows the format

<input type="hidden" name="fooId" value="**[id is here]**" />

Can someone provide an example in Python to parse the HTML for the value?


Asked by: Charlie520 | Posted: 06-10-2021






Answer 1

For this particular case, BeautifulSoup is harder to write than a regex, but it is much more robust... I'm just contributing with the BeautifulSoup example, given that you already know which regexp to use :-)

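# BeautifulSoup 3 import (Python 2); with BeautifulSoup 4 use: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

# Or retrieve it from the web, etc.
html_data = open('/yourwebsite/page.html', 'r').read()

# Create the soup object from the HTML data
soup = BeautifulSoup(html_data)
# Find the proper tag. The attributes go in attrs because find()'s first
# parameter is itself called name, so passing name='fooId' as a keyword clashes.
fooId = soup.find('input', attrs={'name': 'fooId', 'type': 'hidden'})
value = fooId.attrs[2][1]  # The value of the third attribute of the desired tag
                           # (BeautifulSoup 3 stores attrs as a list of pairs),
                           # or index it directly via fooId['value']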

Answered by: Michael833 | Posted: 07-11-2021



Answer 2

I agree with Vinko: BeautifulSoup is the way to go. However, I suggest using fooId['value'] to get the attribute rather than relying on value being the third attribute.

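# BeautifulSoup 3 import (Python 2); with BeautifulSoup 4 use: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup
# Or retrieve it from the web, etc.
html_data = open('/yourwebsite/page.html', 'r').read()
# Create the soup object from the HTML data
soup = BeautifulSoup(html_data)
# Find the proper tag. The attributes go in attrs because find()'s first
# parameter is itself called name, so passing name='fooId' as a keyword clashes.
fooId = soup.find('input', attrs={'name': 'fooId', 'type': 'hidden'})
value = fooId['value']  # The value attribute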

Answered by: Brianna770 | Posted: 07-11-2021



Answer 3

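import re
# inputHTML is assumed to hold the page source as a string
# Fix the name to fooId and capture the contents of the value attribute
reg = re.compile(r'<input type="hidden" name="fooId" value="([^"]*)" />')
value = reg.search(inputHTML).group(1)
print('Value is %s' % value)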

Answered by: Brooke452 | Posted: 07-11-2021



Answer 4

Parsing is one of those areas where you really don't want to roll your own if you can avoid it, as you'll be chasing down the edge cases and bugs for years to come.

I'd recommend using BeautifulSoup. It has a very good reputation and looks from the docs like it's pretty easy to use.
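
If installing a third-party library is not an option, the standard library's html.parser module is also a real parser rather than a hand-rolled one. The following is only a minimal sketch under that assumption; the class name HiddenInputFinder, the helper find_hidden_values, and the sample string are made up for illustration and are not part of any answer here.

```python
from html.parser import HTMLParser


class HiddenInputFinder(HTMLParser):
    """Collect the value of every hidden <input> whose name matches."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names;
        # attrs arrives as a list of (name, value) pairs.
        if tag != "input":
            return
        attributes = dict(attrs)
        if (attributes.get("type") == "hidden"
                and attributes.get("name") == self.field_name):
            self.values.append(attributes.get("value"))


def find_hidden_values(html, field_name):
    """Return all matching values found in html, in document order."""
    finder = HiddenInputFinder(field_name)
    finder.feed(html)
    finder.close()
    return finder.values


# Hypothetical sample: upper-case names and reordered attributes still match.
sample = '<form><INPUT NAME="fooId" value="12-3456789-1111111111" type="hidden" /></form>'
print(find_hidden_values(sample, "fooId"))
```

Because the attributes are turned into a dict before they are checked, their order in the tag does not matter, and tags inside HTML comments are never passed to handle_starttag.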

Answered by: Emma314 | Posted: 07-11-2021



Answer 5

Pyparsing is a good interim step between BeautifulSoup and regex. It is more robust than plain regexes, since its HTML tag parsing handles variations in case, whitespace, and attribute presence, absence, and order, yet this kind of basic tag extraction is simpler to write than with BS.

Your example is especially simple, since everything you are looking for is in the attributes of the opening "input" tag. Here is a pyparsing example showing several variations on your input tag that would give regexes fits, and also showing how NOT to match a tag if it is within a comment:

html = """<html><body>
<input type="hidden" name="fooId" value="**[id is here]**" />
<blah>
<input name="fooId" type="hidden" value="**[id is here too]**" />
<input NAME="fooId" type="hidden" value="**[id is HERE too]**" />
<INPUT NAME="fooId" type="hidden" value="**[and id is even here TOO]**" />
<!--
<input type="hidden" name="fooId" value="**[don't report this id]**" />
-->
<foo>
</body></html>"""

from pyparsing import makeHTMLTags, withAttribute, htmlComment

# use makeHTMLTags to create tag expression - makeHTMLTags returns expressions for
# opening and closing tags, we're only interested in the opening tag
inputTag = makeHTMLTags("input")[0]

# only want input tags with special attributes
inputTag.setParseAction(withAttribute(type="hidden", name="fooId"))

# don't report tags that are commented out
inputTag.ignore(htmlComment)

# use searchString to skip through the input 
foundTags = inputTag.searchString(html)

# dump out first result to show all returned tags and attributes
print(foundTags[0].dump())
print('')

# print out the value attribute for all matched tags
for inpTag in foundTags:
    print(inpTag.value)

Prints:

['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**[id is here]**'], True]
- empty: True
- name: fooId
- startInput: ['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**[id is here]**'], True]
  - empty: True
  - name: fooId
  - type: hidden
  - value: **[id is here]**
- type: hidden
- value: **[id is here]**

**[id is here]**
**[id is here too]**
**[id is HERE too]**
**[and id is even here TOO]**

You can see that not only does pyparsing match these unpredictable variations, it returns the data in an object that makes it easy to read out the individual tag attributes and their values.

Answered by: Wilson891 | Posted: 07-11-2021



Answer 6

/<input type="hidden" name="fooId" value="([\d-]+)" \/>/
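
Since the question asks for Python, here is a rough sketch of how the slash-delimited pattern above might be used with the re module. The names FOO_ID_PATTERN and extract_foo_id are invented for this example; the pattern itself is the one from this answer with the delimiters removed.

```python
import re

# The answer's pattern without the /.../ delimiters; "/" needs no escaping in Python.
FOO_ID_PATTERN = re.compile(r'<input type="hidden" name="fooId" value="([\d-]+)" />')


def extract_foo_id(html):
    """Return the fooId value, or None if the exact tag format is not found."""
    match = FOO_ID_PATTERN.search(html)
    return match.group(1) if match else None


line = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'
print(extract_foo_id(line))

# Reordered attributes no longer fit the fixed format, so nothing is found.
print(extract_foo_id('<input name="fooId" type="hidden" value="1-2" />'))
```

Returning None instead of calling group() on a failed search avoids an AttributeError when the page does not contain the tag in exactly this format.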

Answered by: Abigail943 | Posted: 07-11-2021



Answer 7

/<input\s+type="hidden"\s+name="([A-Za-z0-9_]+)"\s+value="([A-Za-z0-9_\-]*)"\s*\/>/

>>> import re
>>> s = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'
>>> re.match(r'<input\s+type="hidden"\s+name="([A-Za-z0-9_]+)"\s+value="([A-Za-z0-9_\-]*)"\s*/>', s).groups()
('fooId', '12-3456789-1111111111')
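
One caveat when moving from a single line to a whole page: re.match only tries to match at the very start of the string, so re.search is the call to use when the tag sits somewhere inside the HTML. A small sketch, with a made-up page fragment as the assumption:

```python
import re

pattern = re.compile(
    r'<input\s+type="hidden"\s+name="([A-Za-z0-9_]+)"\s+value="([A-Za-z0-9_\-]*)"\s*/>'
)

# Hypothetical page fragment: the tag is not at the start of the string.
page = (
    '<form>\n'
    '  <input type="hidden" name="fooId" value="12-3456789-1111111111" />\n'
    '</form>'
)

# match() only tries position 0, so it fails on the full page.
print(pattern.match(page))

# search() scans forward until the pattern fits.
print(pattern.search(page).groups())
```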

Answered by: Paul686 | Posted: 07-11-2021


