Extracting text from HTML file using Python

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.

I'd like something more robust than regular expressions, which may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in HTML source to be converted to an apostrophe in text, just as if I'd pasted the browser content into notepad.

Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces Markdown that would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.




Asked by: Rafael283 | Posted: 01-10-2021






Answer 1

The best piece of code I found for extracting text without picking up JavaScript or other unwanted content:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

You just have to install BeautifulSoup first:

pip install beautifulsoup4
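
The whitespace clean-up at the end of the snippet does not depend on BeautifulSoup, so it can be tried on its own. Below is a minimal sketch of that step only, assuming a made-up sample string in place of the output of soup.get_text(); the helper name tidy_text is hypothetical and not part of the answer.

```python
def tidy_text(text):
    """Strip each line, split on double spaces, and drop empty chunks."""
    # strip leading and trailing whitespace from every line
    lines = (line.strip() for line in text.splitlines())
    # a run of two spaces usually separates headlines that shared a line
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # keep only non-empty chunks, one per output line
    return "\n".join(chunk for chunk in chunks if chunk)


# Made-up sample standing in for soup.get_text()
sample = "  Headline one  Headline two \n\n   Body text here.  \n"
print(tidy_text(sample))
```

Splitting on two spaces is a heuristic: single spaces inside a sentence are left alone, while wider gaps are treated as separators between phrases.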

Answered by: Kate539 | Posted: 02-11-2021



Answer 2

html2text is a Python program that does a pretty good job at this.

Answered by: Lenny905 | Posted: 02-11-2021



Answer 3

NOTE: NLTK no longer supports the clean_html function

Original answer below, and an alternative in the comments sections.


Use NLTK

I wasted 4-5 hours fixing the issues with html2text. Luckily I came across NLTK.
It works magically.

import nltk   
from urllib import urlopen

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"    
html = urlopen(url).read()    
raw = nltk.clean_html(html)  
print(raw)

Answered by: Grace888 | Posted: 02-11-2021



Answer 4

Found myself facing just the same problem today. I wrote a very simple HTML parser to strip incoming content of all markups, returning the remaining text with only a minimum of formatting.

from html.parser import HTMLParser  # Python 3; on Python 2 this was "from HTMLParser import HTMLParser"
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except Exception:
        print_exc(file=stderr)
        return text


def main():
    text = r'''
        <html>
            <body>
                <b>Project:</b> DeHTML<br>
                <b>Description</b>:<br>
                This small script is intended to allow conversion from HTML markup to 
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()

Answered by: Sam559 | Posted: 02-11-2021



Answer 5

I know there are a lot of answers already, but the most elegant and Pythonic solution I have found is described, in part, here.

from bs4 import BeautifulSoup

text = ' '.join(BeautifulSoup(some_html_string, "html.parser").findAll(text=True))

Update

Based on Fraser's comment, here is a more elegant solution:

from bs4 import BeautifulSoup

clean_text = ' '.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)

Answered by: Ada543 | Posted: 02-11-2021



Answer 6

Here is a version of xperroni's answer which is a bit more complete. It skips script and style sections and translates charrefs (e.g., &#39;) and HTML entities (e.g., &amp;).

It also includes a trivial plain-text-to-html inverse converter.

"""
HTML <-> text conversions.
"""
from HTMLParser import HTMLParser, HTMLParseError
from htmlentitydefs import name2codepoint
import re

class _HTMLToText(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self._buf = []
        self.hide_output = False

    def handle_starttag(self, tag, attrs):
        if tag in ('p', 'br') and not self.hide_output:
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = True

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self._buf.append('\n')

    def handle_endtag(self, tag):
        if tag == 'p':
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = False

    def handle_data(self, text):
        if text and not self.hide_output:
            self._buf.append(re.sub(r'\s+', ' ', text))

    def handle_entityref(self, name):
        if name in name2codepoint and not self.hide_output:
            c = unichr(name2codepoint[name])
            self._buf.append(c)

    def handle_charref(self, name):
        if not self.hide_output:
            n = int(name[1:], 16) if name.startswith('x') else int(name)
            self._buf.append(unichr(n))

    def get_text(self):
        return re.sub(r' +', ' ', ''.join(self._buf))

def html_to_text(html):
    """
    Given a piece of HTML, return the plain text it contains.
    This handles entities and char refs, but not javascript and stylesheets.
    """
    parser = _HTMLToText()
    try:
        parser.feed(html)
        parser.close()
    except HTMLParseError:
        pass
    return parser.get_text()

def text_to_html(text):
    """
    Convert the given text to html, wrapping what looks like URLs with <a> tags,
    converting newlines to <br> tags and converting confusing chars into html
    entities.
    """
    def f(mo):
        t = mo.group()
        if len(t) == 1:
            return {'&':'&amp;', "'":'&#39;', '"':'&quot;', '<':'&lt;', '>':'&gt;'}.get(t)
        return '<a href="%s">%s</a>' % (t, t)
    return re.sub(r'https?://[^] ()"\';]+|[&\'"<>]', f, text)
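
The code above targets Python 2: the HTMLParser and htmlentitydefs modules, HTMLParseError, and unichr are not available under those names in Python 3. A rough Python 3 sketch of the HTML-to-text half follows, assuming the standard library's html.parser with convert_charrefs=True, which decodes entities and character references before handle_data() is called. The class name HTMLToText and the _hidden counter are illustrative choices, not part of the original answer.

```python
import re
from html.parser import HTMLParser


class HTMLToText(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        # convert_charrefs=True decodes &amp; and &#39; before handle_data()
        super().__init__(convert_charrefs=True)
        self._buf = []
        self._hidden = 0  # number of currently open script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._hidden += 1
        elif tag in ("p", "br") and not self._hidden:
            self._buf.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._hidden = max(0, self._hidden - 1)
        elif tag == "p" and not self._hidden:
            self._buf.append("\n")

    def handle_data(self, data):
        if not self._hidden:
            # collapse any run of whitespace inside the text node
            self._buf.append(re.sub(r"\s+", " ", data))

    def get_text(self):
        return re.sub(r" +", " ", "".join(self._buf)).strip()


def html_to_text(html):
    parser = HTMLToText()
    parser.feed(html)
    parser.close()
    return parser.get_text()


print(html_to_text("<p>Tom &amp; Jerry&#39;s</p><script>var x = 1;</script>"))
```

Unlike the original, this sketch strips leading and trailing whitespace from the result, and it does not include the text-to-HTML inverse.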

Answered by: Wilson151 | Posted: 02-11-2021



Answer 7

You can also use the html2text method in the stripogram library.

from stripogram import html2text
text = html2text(your_html_string)

To install stripogram run sudo easy_install stripogram

Answered by: Lucas662 | Posted: 02-11-2021



Answer 8

I know there's plenty of answers here already but I think newspaper3k also deserves a mention. I recently needed to complete a similar task of extracting the text from articles on the web and this library has done an excellent job of achieving this so far in my tests. It ignores the text found in menu items and side bars as well as any JavaScript that appears on the page as the OP requests.

from newspaper import Article

article = Article(url)
article.download()
article.parse()
article.text

If you already have the HTML files downloaded you can do something like this:

article = Article('')
article.set_html(html)
article.parse()
article.text

It even has a few NLP features for summarizing the topics of articles:

article.nlp()
article.summary

Answered by: Melissa828 | Posted: 02-11-2021



Answer 9

There is the Pattern library for data mining.

http://www.clips.ua.ac.be/pages/pattern-web

You can even decide what tags to keep:

from pattern.web import URL, plaintext

s = URL('http://www.clips.ua.ac.be').download()
s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']})
print(s)

Answered by: Briony207 | Posted: 02-11-2021



Answer 10

If you need more speed and less accuracy, then you could use raw lxml.

import lxml.html as lh
from lxml.html.clean import clean_html

def lxml_to_text(html):
    doc = lh.fromstring(html)
    doc = clean_html(doc)
    return doc.text_content()

Answered by: Alissa294 | Posted: 02-11-2021



Answer 11

PyParsing does a great job. The PyParsing wiki was killed, so here is another location where there are examples of the use of PyParsing (example link). One reason for investing a little time in pyparsing is that its author has also written a very brief, very well organized O'Reilly Short Cut manual that is also inexpensive.

Having said that, I use BeautifulSoup a lot and it is not that hard to deal with the entities issues, you can convert them before you run BeautifulSoup.
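
As a sketch of what converting entities yourself can look like, the standard library's html.unescape decodes both named entities and numeric character references. The sample string below is made up. Two caveats: current bs4 releases generally decode entities while parsing, so this step is often unnecessary, and unescaping before parsing also turns &lt; and &gt; into real angle brackets that a parser may then read as tags.

```python
import html

# Made-up sample containing numeric and named references
raw = "It&#39;s a &quot;test&quot; &amp; more&nbsp;text"
decoded = html.unescape(raw)
print(decoded)

# &nbsp; becomes a non-breaking space (U+00A0), not a regular space
print("\xa0" in decoded)
```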

Good luck

Answered by: Chester346 | Posted: 02-11-2021



Answer 12

This isn't exactly a Python solution, but it will convert text that JavaScript would generate into text, which I think is important (e.g. google.com). The browser Links (not Lynx) has a JavaScript engine, and will convert source to text with the -dump option.

So you could do something like:

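import subprocess
import tempfile

# Write the HTML to a temporary file, then let links render it as text
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as f:
    f.write(html_source)
    fname = f.name
proc = subprocess.Popen(['links', '-dump', fname],
                        stdout=subprocess.PIPE,
                        stderr=subprocess.DEVNULL)
text = proc.stdout.read()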

Answered by: Ryan406 | Posted: 02-11-2021



Answer 13

Instead of the HTMLParser module, check out htmllib (Python 2 only; it was removed in Python 3). It has a similar interface, but does more of the work for you. (It is pretty ancient, so it's not much help in terms of getting rid of JavaScript and CSS. You could make a derived class and add methods with names like start_script and end_style (see the Python docs for details), but it's hard to do this reliably for malformed HTML.) Anyway, here's something simple that prints the plain text to the console:

from htmllib import HTMLParser, HTMLParseError
from formatter import AbstractFormatter, DumbWriter
p = HTMLParser(AbstractFormatter(DumbWriter()))
try: p.feed('hello<br>there'); p.close() #calling close is not usually needed, but let's play it safe
except HTMLParseError: print ':(' #the html is badly malformed (or you found a bug)

Answered by: Leonardo578 | Posted: 02-11-2021



Answer 14

I recommend a Python package called goose-extractor. Goose will try to extract the following information:

- Main text of an article
- Main image of article
- Any YouTube/Vimeo movies embedded in article
- Meta Description
- Meta tags

More: https://pypi.python.org/pypi/goose-extractor/

Answered by: Cherry450 | Posted: 02-11-2021



Answer 15

Has anyone tried bleach.clean(html, tags=[], strip=True) with bleach? It's working for me.

Answered by: Rafael203 | Posted: 02-11-2021



Answer 16

Install html2text using:

pip install html2text

Then:

>>> import html2text
>>>
>>> h = html2text.HTML2Text()
>>> # Ignore converting links from HTML
>>> h.ignore_links = True
>>> print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))
Hello, world!

Answered by: Freddie660 | Posted: 02-11-2021



Answer 17

What worked best for me is inscriptis.

https://github.com/weblyzard/inscriptis

import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)
print(text)

The results are really good

Answered by: Fiona239 | Posted: 02-11-2021



Answer 18

Beautiful soup does convert html entities. It's probably your best bet considering HTML is often buggy and filled with unicode and html encoding issues. This is the code I use to convert html to raw text:

import copy
import re

import BeautifulSoup  # the old BeautifulSoup 3 package, not bs4
def getsoup(data, to_unicode=False):
    data = data.replace("&nbsp;", " ")
    # Fixes for bad markup I've seen in the wild.  Remove if not applicable.
    masssage_bad_comments = [
        (re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1)),
        (re.compile('<!WWWAnswer T[=\w\d\s]*>'), lambda match: '<!--' + match.group(0) + '-->'),
    ]
    myNewMassage = copy.copy(BeautifulSoup.BeautifulSoup.MARKUP_MASSAGE)
    myNewMassage.extend(masssage_bad_comments)
    return BeautifulSoup.BeautifulSoup(data, markupMassage=myNewMassage,
        convertEntities=BeautifulSoup.BeautifulSoup.ALL_ENTITIES 
                    if to_unicode else None)

remove_html = lambda c: getsoup(c, to_unicode=True).getText(separator=u' ') if c else ""

Answered by: Owen525 | Posted: 02-11-2021



Answer 19

Another non-python solution: Libre Office:

soffice --headless --invisible --convert-to txt input1.html

The reason I prefer this one over other alternatives is that every HTML paragraph gets converted into a single text line (no line breaks), which is what I was looking for. Other methods require post-processing. Lynx does produce nice output, but not exactly what I was looking for. Besides, Libre Office can be used to convert from all sorts of formats...

Answered by: Elise625 | Posted: 02-11-2021



Answer 20

I had a similar question and actually used one of the answers with BeautifulSoup. The problem was that it was really slow. I ended up using a library called selectolax. It's pretty limited, but it works for this task. The only issue was that I had to manually remove unnecessary whitespace. But it seems to work much faster than the BeautifulSoup solution.

from selectolax.parser import HTMLParser

def get_text_selectolax(html):
    tree = HTMLParser(html)

    if tree.body is None:
        return None

    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()

    text = tree.body.text(separator='')
    text = " ".join(text.split())  # collapse every run of whitespace into a single space
    return text

Answered by: Alina966 | Posted: 02-11-2021



Answer 21

Another option is to run the html through a text based web browser and dump it. For example (using Lynx):

lynx -dump html_to_convert.html > converted_html.txt

This can be done within a python script as follows:

import subprocess

with open('converted_html.txt', 'w') as outputFile:
    subprocess.call(['lynx', '-dump', 'html_to_convert.html'], stdout=outputFile)

It won't give you exactly just the text from the HTML file, but depending on your use case it may be preferable to the output of html2text.

Answered by: Justin631 | Posted: 02-11-2021



Answer 22

@PeYoTIL's answer using BeautifulSoup and eliminating style and script content didn't work for me. I tried it using decompose instead of extract but it still didn't work. So I created my own which also formats the text using the <p> tags and replaces <a> tags with the href link. Also copes with links inside text. Available at this gist with a test doc embedded.

from bs4 import BeautifulSoup, NavigableString

def html_to_text(html):
    "Creates a formatted text email message as a string from a rendered html template (page)"
    soup = BeautifulSoup(html, 'html.parser')
    # Ignore anything in head
    body, text = soup.body, []
    for element in body.descendants:
        # We use type and not isinstance since comments, cdata, etc are subclasses that we don't want
        if type(element) == NavigableString:
            # We use the assumption that other tags can't be inside a script or style
            if element.parent.name in ('script', 'style'):
                continue

            # remove any multiple and leading/trailing whitespace
            string = ' '.join(element.string.split())
            if string:
                if element.parent.name == 'a':
                    a_tag = element.parent
                    # replace link text with the link
                    string = a_tag['href']
                    # concatenate with any non-empty immediately previous string
                    if (    type(a_tag.previous_sibling) == NavigableString and
                            a_tag.previous_sibling.string.strip() ):
                        text[-1] = text[-1] + ' ' + string
                        continue
                elif element.previous_sibling and element.previous_sibling.name == 'a':
                    text[-1] = text[-1] + ' ' + string
                    continue
                elif element.parent.name == 'p':
                    # Add extra paragraph formatting newline
                    string = '\n' + string
                text += [string]
    doc = '\n'.join(text)
    return doc

Answered by: Walter823 | Posted: 02-11-2021



Answer 23

I've had good results with Apache Tika. Its purpose is the extraction of metadata and text from content, hence the underlying parser is tuned accordingly out of the box.

Tika can be run as a server, is trivial to run / deploy in a Docker container, and from there can be accessed via Python bindings.

Answered by: Dexter561 | Posted: 02-11-2021



Answer 24

A simple way:

import re

html_text = open('html_file.html').read()
text_filtered = re.sub(r'<(.*?)>', '', html_text)

This code finds every part of html_text that starts with '<' and ends with '>' and replaces each match with an empty string.
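
To see what this pattern does and does not handle, here is a small sketch on a made-up sample string. It applies the same regular expression and shows two limits raised in the question: entities are left undecoded, and the contents of a script element survive because only the tags themselves are removed.

```python
import re

# Made-up sample with an entity and an inline script
sample = "<p>Tom &amp; Jerry</p><script>var x = 1;</script>"
stripped = re.sub(r"<(.*?)>", "", sample)
print(stripped)
```

Note also that '.' does not match newlines by default, so a tag that spans several lines would not be removed unless re.DOTALL is passed.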

Answered by: Adrian895 | Posted: 02-11-2021



Answer 25

In Python 3.x you can do it very easily by importing the 'imaplib' and 'email' packages. Although this is an older post, maybe my answer can help newcomers to this post.

status, data = self.imap.fetch(num, '(RFC822)')
email_msg = email.message_from_bytes(data[0][1]) 
#email.message_from_string(data[0][1])

#If message is multi part we only want the text version of the body, this walks the message and gets the body.

if email_msg.is_multipart():
    for part in email_msg.walk():       
        if part.get_content_type() == "text/plain":
            body = part.get_payload(decode=True) #to control automatic email-style MIME decoding (e.g., Base64, uuencode, quoted-printable)
            body = body.decode()
        elif part.get_content_type() == "text/html":
            continue

Now you can print the body variable and it will be in plain-text format :) If it is good enough for you, then it would be nice to select it as the accepted answer.
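
The snippet above depends on an existing IMAP connection (self.imap), so it cannot be run as shown. Below is a self-contained sketch of the same idea, assuming a small multipart message built locally in place of a fetched one; the subject and body strings are made up. It only helps when the message actually carries a text/plain alternative, since it does not convert the HTML part.

```python
import email
from email import policy
from email.message import EmailMessage

# Build a small multipart/alternative message in place of an IMAP fetch
msg = EmailMessage()
msg["Subject"] = "demo"
msg.set_content("Hello from the plain-text part.")
msg.add_alternative("<p>Hello from the <b>HTML</b> part.</p>", subtype="html")
raw_bytes = msg.as_bytes()

# Parse the raw bytes the same way a fetched message would be parsed
parsed = email.message_from_bytes(raw_bytes, policy=policy.default)

# get_body() searches the parts for the first match in the preference list
plain_part = parsed.get_body(preferencelist=("plain",))
body = plain_part.get_content().strip() if plain_part is not None else ""
print(body)
```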

Answered by: Dainton293 | Posted: 02-11-2021



Answer 26

Here's the code I use on a regular basis.

from bs4 import BeautifulSoup
import urllib.request


def processText(webpage):

    # EMPTY LIST TO STORE PROCESSED TEXT
    proc_text = []

    try:
        news_open = urllib.request.urlopen(webpage.group())
        news_soup = BeautifulSoup(news_open, "lxml")
        news_para = news_soup.find_all("p", text = True)

        for item in news_para:
            # SPLIT WORDS, JOIN WORDS TO REMOVE EXTRA SPACES
            para_text = (' ').join((item.text).split())

            # COMBINE LINES/PARAGRAPHS INTO A LIST
            proc_text.append(para_text)

    except urllib.error.HTTPError:
        pass

    return proc_text

I hope that helps.

Answered by: Sam708 | Posted: 02-11-2021



Answer 27

You can extract only the text from HTML with BeautifulSoup:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org/extracting-email-addresses-using-regular-expressions-python/"
con = urlopen(url).read()
soup = BeautifulSoup(con,'html.parser')
texts = soup.get_text()
print(texts)

Answered by: Sydney124 | Posted: 02-11-2021



Answer 28

While a lot of people have mentioned using regex to strip HTML tags, that approach has a lot of downsides.

For example:

<p>hello&nbsp;world</p>I love you

Should be parsed to:

hello world
I love you

Here's a snippet I came up with. You can customize it to your specific needs, and it works like a charm:

import re
import html
def html2text(htm):
    ret = html.unescape(htm)
    ret = ret.translate({
        8209: ord('-'),
        8220: ord('"'),
        8221: ord('"'),
        160: ord(' '),
    })
    ret = re.sub(r"\s", " ", ret, flags = re.MULTILINE)
    ret = re.sub(r"<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags = re.IGNORECASE)
    ret = re.sub('<.*?>', ' ', ret, flags=re.DOTALL)
    ret = re.sub(r"  +", " ", ret)
    return ret

Answered by: Cherry782 | Posted: 02-11-2021



Answer 29

Another example using BeautifulSoup4 in Python 2.7.9+

includes:

import urllib2
from bs4 import BeautifulSoup

Code:

def read_website_to_text(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    for script in soup(["script", "style"]):
        script.extract() 
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return str(text.encode('utf-8'))

Explained:

Read the URL data as HTML (using BeautifulSoup), remove all script and style elements, and get just the text using .get_text(). Break the text into lines and strip leading and trailing space from each, then split each line on double spaces so that multiple headlines end up on a line each. Finally, join the non-empty chunks with '\n' to drop blank lines, and return the result encoded as UTF-8.


Answered by: Steven556 | Posted: 02-11-2021



Answer 30

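I do it with something like this. Note that res.text is the decoded response body, which is still HTML markup, so one of the extraction steps from the other answers is needed afterwards.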

>>> import requests
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> res = requests.get(url)
>>> text = res.text

Answered by: Lyndon190 | Posted: 02-11-2021



Similar questions

Python - Extracting Text From a File

My Code (so far): ins = open( "log", "r" ) array = [] for line in ins: array.append( line ) for line in array: if "xyz" in line: print "xyz found!" else: print "xyz not found!" Log File Example: Norman xyz Cat Cat xyz Norman Dog xyz Dog etc. etc. The Python script I have currently finds xyz and prints that it found it. But I w...


Extracting information from a text file with python

I've a project that includes writing a program that extracts certain data (numeric) from a text file, that has to be generalized to function with different text files the in the same format. The file is an analyse of a molecule, the data to extract is the coordinates of every atom inside the molecule, so it has to be generalised in a way that it extracts as much data as there ate atoms in different files. ...


python - Extracting text from a span with lxml?

Given: import urllib2 from lxml import etree url = "http://www.ebay.com/sch/i.html?rt=nc&amp;LH_Complete=1&amp;_nkw=Under+Armour+Dauntless+Backpack&amp;LH_Sold=1&amp;_sacat=0&amp;LH_BIN=1&amp;_from=R40&amp;_sop=3&amp;LH_ItemCondition=1000" response = urllib2.urlopen(url) htmlparser = etree.HTMLParser() tree = etree.parse(response, htmlparser) where the


Extracting text from JATS XML file using Python

I want to extract text from a JATS-XML file JATS is a standardized XML format for representation of research publications. &lt;article&gt; &lt;front&gt; &lt;journal-meta&gt; &lt;journal-title-gr...


Extracting text from PDF in Python

I have a PDF full of quotes: https://www.pdf-archive.com/2017/03/22/test/ I can extract the text in python using the following code: import PyPDF2 pdfFileObj = open('example.pdf','rb') pdfReader = PyPDF2.PdfFileReader(pdfFileObj) pageObj = pdfReader.getPage(0) print (pageObj.extractText()) ...


Extracting text from link in python

I have a script in python 2.7 that scrapes the table in this page: http://www.the-numbers.com/movie/budgets/all I want to extract each of the columns, the problem is that my code doesn't recognize the columns that have links (2nd and 3rd columns). budgeturl = "http://www.the-numbers.com/movie/budgets/all" s = urlli...


python - Extracting particular text

I am trying to extract all links to videos on a particular WordPress website. Each page has only one video. Inside each page crawled, there is the following code: &lt;p&gt;&lt;script src="https://www.vooplayer.com/v3/watch/video.js"&gt;&lt;/script&gt; &lt;iframe id="" voo-auto-adj="true" name="vooplayerframe" style="max-width:100%" allowtransparency="true" allowfullscreen="true" src="//www.vooplayer...


python - Extracting next line of text file

This question already has answers here:


python - Extracting particular text section between tags from HTML

I would like to extract text in a specific section from HTML file (section "Item 1A"). I want to get text start from "Item 1A", in the content section not from the table of content, and stop at "Item 1B." But there are several same texts of "Item 1A" and "Item 1B". How can I identify which specific text to start and to stop. import requests from bs4 import BeautifulSoup import re url = "https://www.sec.gov...


Extracting text from PDF with Python in repl

I am trying to read data from a PDF in python, and I am trying to use a repl.it file just because it is easier to test out different libraries. I have tried PyPDF2, and PyPDF4, which work but do not give any whitespace. tika gives me a server starting error, pdfminer does not work and pdfminer3 works without whitespace. pdftotext does not download properly. I was wondering if there was more clear documentation on how to my...


python - Extracting text from span

I have a problem regarding a span tag, that has no id or class. The larger approach is to extract the text between "ITEM 1. BUSINESS" TO "ITEM 1A. RISK FACTORS" from the link below. However, I can't figure out a way to find this part, because the span it is in, has no id nor a class I can search for (only the parent div the span is in: div = soup.find("div", {"id": "dynamic-xbrl-form"}). This code does...


python - Extracting information from raw text

Problem Description Here is the text pattern I have: 05.04.0090 1 erhältlichen Tableau Interfaces lassen sich zusätzliche GLT-Kontakte aufschalten. Das System kann die zwei Szenarien-Modi "Urlaub" und Abwesenheit" verwalten. Für beide Modi können bestimmte Parameter programmiert werden. Das WAREMA climatronic Bediengerät kann preisgleich auch in den Farben "schwarz" oder "schwa...


python - Extracting the text from some HTML tags

I am using BeautifulSoup to webscrape job listings on a career page. I am having trouble just printing out the information I need. This is was the HTML looks like &lt;ul class="list-group"&gt; &lt;li class="list-group-item"&gt; &lt;h4 class="list-group-item-heading"&gt; &lt;a href="http://careers.steelseries.com/apply/3LXwyjYOrb/Customer-Experience-Specialist"&gt; ...


Extracting certain text from a cell in python

I have an excel file and i need a way of being able to search a certain row of cells and extract that line from the cell and delete the rest of the text around it. not sure what the best way to go about that is. I feel some sort of script in python but I'm not really sure how to go about that. this is what I have so far and I'm stuck now. any help would be appreciated import xlrd datafile = &quot;test.xlsx&q...


Extracting text from PDF url file with Python

I want to extract text from PDF file thats on one website. The website contains link to PDF doc, but when I click on that link it automaticaly downloads that file. Is it possible to extract text from that file without downloading it import fitz # this is pymupdf lib for text extraction from bs4 import BeautifulSoup import requests from io import StringIO url = &quot;https://www.blv.admin.ch/blv/de/home/leb...


Extracting the first line from pdf text in python

I am splitting the text extracted from pdf by &quot;\n&quot; But having an issue with the position of the string after the split. for some, it is working with [0] and for some, it is [2]. I want to put this in a loop and extract the first line from the page irrespective of the position Here is my code : for fil in new_pdf_files: object = PyPDF2.PdfFileReader(fil) pdfFileObj = open(fil, 'rb') ...


python - OCR not Extracting any Text

I am trying to extract text from an image that looks like OCR should be able to easily extract but it just extracting nothing or garbage in some cases. I have tried the following OpenCV techniques from other stackoverflow resources but nothing seems to help. Image Resisizing GrayScaling Dilation and Erosion adaptiveThreshold If someone could help me how to extrac...


python - Extracting the last text in <p> tag

I wanted to extract the last text within each drop-down of the list belonging to a webpage. The last text should be an address in this list. For example: url = 'https://www.housebeautiful.com/lifestyle/g26859396/movie-homes-you-can-visit/' soup = BeautifulSoup((requests.get(url)).content, 'lxml') for i in soup.select('p'): print(i.text.strip) Prints me all the text within the


text - Extracting information from txt file using python

I have a TXT file that looks like this ETP 474654 0|170122|160222|MXP| 14045.84| | 4711.00| 0| 0| 0| 0| 4711| 0 BA6 91215257 1|310122| |MXP| | 9053.93| | | | | | | TDO 301530 1|010222| |MXP| | 280.91| | | | | | | ETP 475384 0|260122|2502...


python - Extracting data from MS Word

I am looking for a way to extract / scrape data from Word files into a database. Our corporate procedures have Minutes of Meetings with clients documented in MS Word files, mostly due to history and inertia. I want to be able to pull the action items from these meeting minutes into a database so that we can access them from a web-interface, turn them into tasks and update them as they are completed. Which ...


Extracting a URL in Python

In regards to: Find Hyperlinks in Text using Python (twitter related) How can I extract just the url so I can put it into a list/array? Edit Let me clarify, I don't want to parse the URL into pieces. I want to extract the URL from the text of the string to put it into an array. Thank...


Extracting data from a CSV file in Python

I just got my data and it is given to me as a csv file. It looks like this in data studio(where the file was taken). Counts frequency 300 1 302 5 303 7 Excel can't handle the computations so that's why I'm trying to load it in python(it has scipy :D). I want to load the data in an array: Counts = [300, 302, 303] frequen...


Bash or Python for extracting blocks from text files

Closed. This question is opinion-based. It is not c...


regex - Extracting data from a text file to use in a python script?

Basically, I have a file like this: Url/Host: www.example.com Login: user Password: password Data_I_Dont_Need: something_else How can I use RegEx to separate the details to place them into variables? Sorry if this is a terrible question, I can just never grasp RegEx. So another question would be, can you provide the RegEx, but kind of explain what each part of it is for?...


python - Extracting Information from Images

What are some fast and somewhat reliable ways to extract information about images? I've been tinkering with OpenCV and this seems so far to be the best route plus it has Python bindings. So to be more specific I'd like to determine what I can about what's in an image. So for example the haar face detection and full body detection classifiers are great - now I can tell that most likely there are faces and / or peo...


python - How do I say "not" using a regex when extracting a group of text?

I am trying to extract a section of text that looks something like this: Thing 2A blah blah Thing 2A blah blah Thing 3 Where the "3" above could actually be ANY single digit. The code I have that doesn't work is: ((Thing\s2A).+?(Thing\s\d)) Since the 3 could be any single digit, I cannot simply replace the "\d" with "3". I tried the following code, but it doesn't work eit...


python - better way of extracting values from string

str = 'This is first line \n 2 line start from her\ Service Name: test \n some 4 line \n User Name: amit \n some something \n Last Name: amit amit \n Basically What I am interested is getting the service name and user name. Should I user regular expression for doing this. I want to create a dict like dict['service_name'] = 'test' dict['user_name'] = 'amit' dict['last_name'] = 'amit...


python - Extracting a value from a tuple when the other values are unused

I have a tuple foo which contains something I don't care about and something I do. foo = (something_i_dont_need, something_i_need) Is it more correct to use _, x = foo or x = foo[1]? The only difference I can think of is the behaviour if foo isn't of length two. I suppose this is fairly case-sp...


python - Extracting nouns from Noun Phrase in NLP

Could anyone please tell me how to extract only the nouns from the following output: I have tokenized and parsed the string "Give me the review of movie" based on a given grammar using following procedure:- sent=nltk.word_tokenize(msg) parser=nltk.ChartParser(grammar) trees=parser.nbest_parse(sent) for tree in trees: print tree tokens=find_all_NP(tree) tokens1=nltk.word_tokenize(tokens[0]) print...


python - Extracting unique items from a list of mappings

Here's an interesting problem that calls for the most Pythonic solution. Suppose I have a list of mappings {'id': id, 'url': url}. Some ids in the list are duplicate, and I want to create a new list, with all the duplicates removed. I came up with the following function: def unique_mapping(map): d = {} for res in map: d[res['id']] = res['url'] return [{'id': id,...


python - Extracting Embedded Images From Outlook Email

I am using Microsoft's CDO (Collaboration Data Objects) to programmatically read mail from an Outlook mailbox and save embedded image attachments. I'm trying to do this from Python using the Win32 extensions, but samples in any language that uses CDO would be helpful. So far, I am here... The following Python code will read the last email in my mailbox, print the names of the attachments, and print the mes...


python - Extracting data from MS Word

I am looking for a way to extract / scrape data from Word files into a database. Our corporate procedures have Minutes of Meetings with clients documented in MS Word files, mostly due to history and inertia. I want to be able to pull the action items from these meeting minutes into a database so that we can access them from a web-interface, turn them into tasks and update them as they are completed. Which ...


python - Extracting info from large structured text files

I need to read some large files (from 50k to 100k lines), structured in groups separated by empty lines. Each group starts with the same pattern "No.999999999 dd/mm/yyyy ZZZ". Here's some sample data. No.813829461 16/09/1987 270 Tit.SUZANO PAPEL E CELULOSE S.A. (BR/BA) C.N.P.J./C.I.C./N INPI : 16404287000155 Procurador: MARCELLO DO NASCIMENTO No.815326777 28/12/1989 ...


Extracting a URL in Python

In regards to: Find Hyperlinks in Text using Python (twitter related) How can I extract just the url so I can put it into a list/array? Edit Let me clarify, I don't want to parse the URL into pieces. I want to extract the URL from the text of the string to put it into an array. Thank...


Extracting YouTube Video's author using Python and YouTubeAPI

How do I get the author/username from an object using: GetYouTubeVideoEntry(video_id=youtube_video_id_to_output)? I'm using Google's gdata.youtube.service Python library. Thanks in advance! :)


python - Extracting the To: header from an attachment of an email

I am using Python to open an email on the server (POP3). Each email has an attachment which is a forwarded email itself. I need to get the "To:" address out of the attachment. I am using Python to help me learn the language, and I'm not that good yet! The code I have already is this: import poplib, email, mimetypes oPop = poplib.POP3( 'xx.xxx.xx.xx' ) oPop.user( 'a...


Extracting decimals from a number in Python

I am writing a function to extract decimals from a number. Ignore the exception and its syntax, I am working on 2.5.2 (default Leopard version). My function does not yet handle 0's. My issue is, the function produces random errors with certain numbers, and I don't understand the reason. I will post an error readout after the code. Function: def extractDecimals(num): try: if(num &gt...


Python: Extracting data from buffer with ctypes

I am able to successfully call a function with ctypes in Python. I now have a buffer that is filled with Structures of data I want to extract. What is the best strategy for this? Anything else I should post? Function: class list(): def __init__(self): #[...] def getdirentries(self, path): self.load_c() self.fd = os.open(path, os.O_RDONLY) self.statinfo = o...


Extracting text fields from HTML using Python?

What is the best way to extract data from this HTML file and put it into MySQL database with company phone number, company name and email with a primary key as phone number? </tr><tr class="tableRowOdd"> <td>"JSC company inc. 00" &lt;email@email.com&gt;</td> <td>1231231234</td> </tr><tr class="tableRowEven"> ...





