How can I retrieve the page title of a webpage using Python?

How can I retrieve the page title of a webpage (the HTML title tag) using Python?


Asked by: Ned980 | Posted: 01-10-2021






Answer 1

Here's a simplified version of @Vinko Vrsalovic's answer:

# Python 2 with BeautifulSoup 3
import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen("https://www.google.com"))
print soup.title.string

NOTE:

  • soup.title finds the first title element anywhere in the html document

  • title.string assumes it has only one child node, and that child node is a string

For BeautifulSoup 4.x, use a different import:

from bs4 import BeautifulSoup

Answered by: Miller793 | Posted: 02-11-2021



Answer 2

I always use lxml for such tasks. You could use BeautifulSoup as well.

import lxml.html
t = lxml.html.parse(url)
print(t.find(".//title").text)

EDIT based on comment:

from urllib.request import urlopen  # Python 3; Python 2 used urllib2
from lxml.html import parse

url = "https://www.google.com"
page = urlopen(url)
p = parse(page)
print(p.find(".//title").text)

Answered by: Walter805 | Posted: 02-11-2021



Answer 3

There is no need for an HTML parsing library here. Fetch the page with requests and slice the title out of the response text.

>>> import requests
>>> headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
>>> n = requests.get('http://www.imdb.com/title/tt0108778/', headers=headers)
>>> al = n.text
>>> al[al.find('<title>') + 7 : al.find('</title>')]
u'Friends (TV Series 1994\u20132004) - IMDb'
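The slice above silently returns garbage when either tag is missing, because find() returns -1 in that case. A minimal sketch of a guarded version follows; the helper name title_by_slicing is hypothetical, and it assumes the page text has already been downloaded into a string.

```python
def title_by_slicing(page_text):
    """Return the text between <title> and </title>, or None if either is missing."""
    start = page_text.find("<title>")
    if start == -1:
        return None
    start += len("<title>")
    # Search for the closing tag only after the opening tag.
    end = page_text.find("</title>", start)
    if end == -1:
        return None
    return page_text[start:end].strip()


print(title_by_slicing("<html><head><title>Example Domain</title></head></html>"))
```

Like the original, this only matches a lowercase title tag without attributes.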

Answered by: Max444 | Posted: 02-11-2021



Answer 4

The mechanize Browser object has a title() method. So the code from this post can be rewritten as:

from mechanize import Browser
br = Browser()
br.open("http://www.google.com/")
print(br.title())

Answered by: Lucas338 | Posted: 02-11-2021



Answer 5

This is probably overkill for such a simple task, but if you plan to do more than that, it is saner to start with these tools (mechanize, BeautifulSoup), because they are much easier to use than the alternatives (urllib to fetch the content, and regexes or some other parser to parse the HTML).

Links: BeautifulSoup mechanize

#!/usr/bin/env python
#coding:utf-8

from bs4 import BeautifulSoup
from mechanize import Browser

#This retrieves the webpage content
br = Browser()
res = br.open("https://www.google.com/")
data = res.get_data() 

#This parses the content
soup = BeautifulSoup(data, 'html.parser')
title = soup.find('title')

#This outputs the content :)
print(title.get_text())

Answered by: Arnold528 | Posted: 02-11-2021



Answer 6

Using HTMLParser:

from urllib.request import urlopen
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.match = False
        self.title = ''

    def handle_starttag(self, tag, attributes):
        self.match = tag == 'title'

    def handle_data(self, data):
        if self.match:
            self.title = data
            self.match = False

url = "http://example.com/"
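# Decode the response bytes; str() on bytes would yield the "b'...'" repr instead.
html_string = urlopen(url).read().decode('utf-8', errors='replace')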

parser = TitleParser()
parser.feed(html_string)
print(parser.title)  # prints: Example Domain
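One thing to watch with this approach is that a page can contain more than one title element (inline SVG images often carry their own), and a later one overwrites the earlier one. The following is a sketch of a variant that keeps only the first title; the names FirstTitleParser and first_title are hypothetical, and the HTML is passed in as a string so the example runs without network access.

```python
from html.parser import HTMLParser


class FirstTitleParser(HTMLParser):
    """Collect the text of the first <title> element only."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Ignore any <title> that appears after the first one has closed.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.chunks.append(data)  # text may arrive in several pieces

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True


def first_title(html_text):
    """Return the first title as a stripped string, or None if there is none."""
    parser = FirstTitleParser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.chunks).strip() if parser.done else None


sample = (
    "<html><head><title>Example Domain</title></head>"
    "<body><svg><title>Icon</title></svg></body></html>"
)
print(first_title(sample))
```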

Answered by: Brianna937 | Posted: 02-11-2021



Answer 7

Use soup.select_one to target the title tag:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('url')  # replace 'url' with the address of the page
soup = bs(r.content, 'lxml')
print(soup.select_one('title').text)

Answered by: Edgar971 | Posted: 02-11-2021



Answer 8

Using regular expressions

import re
match = re.search('<title>(.*?)</title>', raw_html)
title = match.group(1) if match else 'No title'
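The pattern above misses titles that span several lines, use uppercase tags, or carry attributes. A slightly more tolerant sketch is shown below; the names TITLE_RE and title_by_regex are hypothetical, and raw_html is assumed to be an already decoded string. Regular expressions remain a rough tool for HTML, so treat this as a quick heuristic rather than a parser.

```python
import html
import re

# Allow attributes on the tag, any letter case, and newlines inside the title.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def title_by_regex(raw_html, default="No title"):
    match = TITLE_RE.search(raw_html)
    if not match:
        return default
    # Decode entities such as &amp; and collapse runs of whitespace.
    text = html.unescape(match.group(1))
    return " ".join(text.split())


print(title_by_regex('<TITLE id="t">\n  Tom &amp; Jerry\n</TITLE>'))
```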

Answered by: Alberta240 | Posted: 02-11-2021



Answer 9

soup.title.string actually returns a unicode string. To convert it into a plain byte string (the str type in Python 2), use string = string.encode('ascii', 'ignore'). Note that this drops every non-ASCII character.
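For illustration, here is what that conversion does in Python 3, where encode() returns bytes rather than str. The sample title is the one shown in Answer 3; the variable names are made up for this sketch.

```python
title = "Friends (TV Series 1994\u20132004) - IMDb"

# 'ignore' drops characters that ASCII cannot represent (here the en dash).
ascii_bytes = title.encode("ascii", "ignore")

# Decode again if a str is needed instead of bytes.
ascii_text = ascii_bytes.decode("ascii")
print(ascii_text)
```

The en dash between the years is lost, so this conversion is only appropriate when ASCII output is a hard requirement.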

Answered by: Michael221 | Posted: 02-11-2021



Answer 10

Here is a fault-tolerant HTMLParser implementation.
You can throw pretty much anything at get_title() without it breaking. If anything unexpected happens, get_title() will return None.
When Parser() downloads the page, it converts it to ASCII regardless of the charset used in the page, ignoring any errors. It would be trivial to change to_ascii() to convert the data into UTF-8 or any other encoding: just add an encoding argument and rename the function to something like to_encoding().
Older versions of HTMLParser raised an error on broken HTML, even on trivial things like mismatched tags. To prevent this behavior I replaced HTMLParser's error method with a function that ignores the errors. Current Python 3 releases already parse leniently, so the override is only a safeguard.

#-*-coding:utf8;-*-
#qpy:3
#qpy:console

''' 
Extract the title from a web page using
the standard lib.
'''

from html.parser import HTMLParser
from urllib.request import urlopen
import urllib.error

def error_callback(*_, **__):
    pass

def is_string(data):
    return isinstance(data, str)

def is_bytes(data):
    return isinstance(data, bytes)

def to_ascii(data):
    # Always return a str holding only ASCII characters, since feed() expects str.
    if is_bytes(data):
        return data.decode('ascii', errors='ignore')
    if not is_string(data):
        data = str(data)
    return data.encode('ascii', errors='ignore').decode('ascii')


class Parser(HTMLParser):
    def __init__(self, url):
        self.title = None
        self.rec = False
        HTMLParser.__init__(self)
        # Install the no-op error handler before parsing so it can take effect.
        self.error = error_callback
        try:
            self.feed(to_ascii(urlopen(url).read()))
        except urllib.error.HTTPError:
            return
        except urllib.error.URLError:
            return
        except ValueError:
            return

        self.rec = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.rec = True

    def handle_data(self, data):
        if self.rec:
            self.title = data

    def handle_endtag(self, tag):
        if tag == 'title':
            self.rec = False


def get_title(url):
    return Parser(url).title

print(get_title('http://www.google.com'))
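The to_encoding() variant mentioned in the description might look like the sketch below. This is an assumption about what the author had in mind, not code from the answer: it takes an encoding argument, defaults to UTF-8, and always returns a str so the result can be passed to feed().

```python
def to_encoding(data, encoding="utf-8"):
    """Return data as a str limited to characters representable in `encoding`."""
    if isinstance(data, bytes):
        # Bytes that are invalid in the chosen encoding are dropped.
        return data.decode(encoding, errors="ignore")
    if not isinstance(data, str):
        data = str(data)
    # Round-trip through the encoding to strip unrepresentable characters.
    return data.encode(encoding, errors="ignore").decode(encoding)


print(to_encoding(b"caf\xc3\xa9"))
print(to_encoding("caf\u00e9", "ascii"))
```

Decoding with the charset declared by the page would preserve more text than a fixed encoding, at the cost of having to detect that charset first.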

Answered by: Marcus641 | Posted: 02-11-2021



Answer 11

In Python 3, we can call urlopen from urllib.request and use BeautifulSoup from the bs4 library to fetch the page title.

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.google.com")
soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)

Here we are using the 'lxml' parser, which the BeautifulSoup documentation describes as very fast. It requires the lxml package to be installed.

Answered by: Roland468 | Posted: 02-11-2021



Answer 12

Using lxml...

Getting it from the page's meta tags, which follow the Facebook Open Graph protocol:

import lxml.html  # parse is a function in lxml.html, not a submodule
html_doc = lxml.html.parse(some_url)

t = html_doc.xpath('//meta[@property="og:title"]/@content')[0]

Or reading the title element itself with .xpath:

t = html_doc.xpath(".//title")[0].text
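Both expressions above raise IndexError when the element is absent. As a rough illustration of the same og:title lookup with a None fallback, here is a sketch that swaps lxml for the standard library's html.parser so it runs without extra packages. The names OgTitleParser and og_title are hypothetical, and the HTML is assumed to be available as a string.

```python
from html.parser import HTMLParser


class OgTitleParser(HTMLParser):
    """Remember the content of the first <meta property="og:title"> tag."""

    def __init__(self):
        super().__init__()
        self.og_title = None

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are routed here as well by default.
        if tag == "meta" and self.og_title is None:
            attributes = dict(attrs)
            if attributes.get("property") == "og:title":
                self.og_title = attributes.get("content")


def og_title(html_text):
    """Return the og:title value, or None when the page does not declare one."""
    parser = OgTitleParser()
    parser.feed(html_text)
    return parser.og_title


print(og_title('<head><meta property="og:title" content="Friends" /></head>'))
```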

Answered by: Rubie126 | Posted: 02-11-2021


