Converting PDF to HTML with Python [duplicate]

How can I convert PDF files to HTML with Python?

I was thinking of something along the lines of what Google does (or seems to do) to index PDF files.

My final goal is to set up Apache to show the HTML for the PDF files, so anything leading me in that direction would also be appreciated.


Asked by: Robert585 | Posted: 24-09-2021






Answer 1

The poppler package provides a pdftohtml utility that you might be able to use. There is also a Python binding to libpoppler.
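
As a rough sketch of the first suggestion: Poppler's converter is installed as the `pdftohtml` command, and it can be driven from Python with `subprocess`. The function names and file names below are hypothetical, and the sketch assumes `pdftohtml` is on the PATH. The `-noframes` flag asks for a single HTML file rather than a frameset; the exact name of the output file may depend on the Poppler version, so check what your installation writes.

```python
import subprocess


def build_pdftohtml_command(pdf_path, output_base):
    """Build the argument list for Poppler's pdftohtml command-line tool."""
    return [
        "pdftohtml",
        "-noframes",       # one HTML document instead of a frameset
        str(pdf_path),     # input PDF
        str(output_base),  # base name for the generated HTML (assumed naming)
    ]


def convert_pdf_to_html(pdf_path, output_base):
    """Run pdftohtml; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(build_pdftohtml_command(pdf_path, output_base), check=True)
```

Calling `convert_pdf_to_html("report.pdf", "report")` would then hand both paths to the tool. Passing the arguments as a list avoids `shell=True` and any quoting problems with unusual file names.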

Answered by: Walter904 | Posted: 25-10-2021



Similar questions

Converting html to text with Python

I am trying to convert an html block to text using Python. Input: <div class="body"><p><strong></strong></p> <p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p> <p>Consectetuer adipiscing elit. <a href="http://example.com/" tar...


python - Converting HTML to DOC with look and feel

Is it in any way possible to convert HTML pages to Word with some basic styling, such as tables, some colored headers, and a few images? I work with Python. Are there any good libraries to mimic the representation as closely as possible?


python - converting .py into HTML

I have a Python file, test.py:

    import os
    print os.system("dir")
    def test():
        a = 5 + 6
        print a
    test()

I want to convert this test.py into an HTML file so that I can view the file in an HTML browser with the same indentation and formats. Is there any way or module which converts my Python file into HTML and vice versa?


converting pdf to html page wise using python

I have this:

    for root, dirnames, filenames in os.walk('FilePath'):
        for filename in fnmatch.filter(filenames, 'page-*.pdf'):
            # matches.append(os.path.join(root, filename))
            subprocess.call('pdf2txt.py > myoutput.html', shell = True)

I need to run a subprocess every time a file matching a particular pattern (the filtered condition) is found, converting that file from PDF to HTML.
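
One possible way to run a conversion per matched file is sketched below. It assumes pdfminer's `pdf2txt.py` script is on the PATH and uses its `-t html` (output type) and `-o` (output file) options. The helper names and the choice to write each HTML file next to its PDF are illustrative assumptions, not something the question specifies.

```python
import fnmatch
import os
import subprocess


def find_page_pdfs(top_dir, pattern="page-*.pdf"):
    """Yield the full path of every file under top_dir whose name matches pattern."""
    for root, _dirnames, filenames in os.walk(top_dir):
        for filename in fnmatch.filter(filenames, pattern):
            yield os.path.join(root, filename)


def pdf2txt_command(pdf_path):
    """Return (argument list, html_path) for converting one PDF with pdf2txt.py."""
    # Assumed naming scheme: page-1.pdf -> page-1.html in the same directory.
    html_path = os.path.splitext(pdf_path)[0] + ".html"
    return ["pdf2txt.py", "-t", "html", "-o", html_path, pdf_path], html_path


def convert_all(top_dir):
    """Convert each matching PDF to its own HTML file; return the HTML paths."""
    written = []
    for pdf_path in find_page_pdfs(top_dir):
        command, html_path = pdf2txt_command(pdf_path)
        subprocess.run(command, check=True)  # raises if the conversion fails
        written.append(html_path)
    return written
```

Giving each PDF its own output name avoids every conversion overwriting a single `myoutput.html`, and passing the file name as a list element avoids `shell=True`.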


python - Converting HTML to TXT

I am trying to convert an HTML page to text and store it in a file. I am able to, however there's some random slashes and stars in the file. Here's the code that I am using import html2text from bs4 import BeautifulSoup import requests as r url = r.get("https://dev.bizlem.io:8082/scorpio1/HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018_1.html") # print(html2text.ht...


Converting html file to a word doc with python

I am trying to convert an html file to a word document with python, my code is import win32com.client word = win32com.client.Dispatch('Word.Application') doc = word.Documents.Add('htmlFile.html') doc.SaveAs('example.doc', FileFormat=0) doc.Close() word.Quit() When I run this I get the error $ python HTMLtoWord.py Traceback (most recent call last): File "HTMLtoWord.py...


converting html data to json using python

I tried converting my html file data to json using below code import html_to_json import json def htmltojson(): with open("C:\Extraction\Sample.html", "r") as html_file: html = html_file.read() output_json = html_to_json.convert(html,capture_element_attributes=False,capture_element_values=True) with open('Final.json', 'w') as outfile: json.dump(output_json,...


python - Converting a string of 1's and 0's to a byte array

I have a string with a length that is a multiple of 8 that contains only 0's and 1's. I want to convert the string into a byte array suitable for writing to a file. For instance, if I have the string "0010011010011101", I want to get the byte array [0x26, 0x9d], which, when written to file, will give 0x269d as the binary (raw) contents. How can I do this in Python?


Is there a Python module for converting RTF to plain text?

Closed. This question does not meet Stack Overflow guid...


Converting Python code to PHP

What is the following Python code in PHP? import sys li = range(1,777); def countFigure(li, n): m = str(n); return str(li).count(m); # counting figures for substr in range(1,10): print substr, " ", countFigure(li, substr); Wanted output for 777 1 258 2 258 3 258 4 258 5 258 6 258 7 231 8 147 9 147


Help in Converting Small Python Code to PHP

please i need some help in converting a python code to a php syntax the code is for generating an alphanumeric code using alpha encoding the code : def mkcpl(x): x = ord(x) set="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" for c in set: d = ord(c)^x if chr(d) in set: return 0,c,chr(d) if chr(0xff^d) in set: ...


python - Converting \n to <br> in mako files

I'm using Python with Pylons. I want to display the saved data from a textarea in a Mako file, with new lines formatted correctly for display. Is this the best way of doing it?

    ${c.info['about_me'].replace("\n", "<br />") | n}


utf 8 - Converting from ascii to utf-8 with Python

I have an XMPP bot written in Python. One of its plugins is able to execute OS commands and send output to the user. As far as I know output should be unicode-like to send it over xmpp protocol. So I tried to handle it this way: output = os.popen(cmd).read() if not isinstance(output, unicode): output = unicode(output,'utf-8','ignore') bot.send(xmpp.Message(mess.getFrom(),output)) But whe...


python - Converting a list to a string

This question already has answers here:


python - Error Converting PIL B&W images to Numpy Arrays

I am getting weird errors when I try to convert a black and white PIL image to a numpy array. An example of the code I am working with is below. if image.mode != '1': image = image.convert('1') #convert to B&W data = np.array(image) #Have also tried np.asarray(image) n_lines = data.shape[0] #number of raster passes line_range = range(data.shape[1]) for l in range(n_lines): ...


python - Converting videos for iPhone - ffmpeg

Closed. This question is off-topic. It is not curre...


python - Help converting .py to .exe, using py2exe

The script I ran # p2e_simple_con.py # very simple script to make an executable file with py2exe # put this script and your code script into the same folder # run p2e_simple_con.py # it will create a subfolder 'dist' where your exe file is in # has the same name as the script_file with extension exe # (the other subfolder 'build' can be deleted) # note: with console code put a wait line at the end from dis...


python - Converting datetime to POSIX time

How do I convert a datetime or date object into a POSIX timestamp in python? There are methods to create a datetime object out of a timestamp, but I don't seem to find any obvious ways to do the operation the opposite way.


python - Library for converting a traceback to its exception?

Just a curiosity: is there an already-coded way to convert a printed traceback back to the exception that generated it? :) Or to a sys.exc_info-like structure?


vb6 - Is there a tool for converting VB to a scripting language, e.g. Python or Ruby?

I've discovered VB2Py, but it's been silent for almost 5 years. Are there any other tools out there which could be used to convert VB6 projects to Python, Ruby, Tcl, whatever?


python - Converting from mod_python to mod_wsgi

My website is written in Python and currently runs under mod_python with Apache. Lately I've had to put in a few ugly hacks that make me think it might be worth converting the site to mod_wsgi. But I've gotten used to using some of mod_python's utility classes, especially FieldStorage and Session (and sometimes Cookie), and from a scan of


django - Converting to safe unicode in python

I'm dealing with unknown data and trying to insert into a MySQL database using Python/Django. I'm getting some errors that I don't quite understand and am looking for some help. Here is the error. Incorrect string value: '\xEF\xBF\xBDs m...' My guess is that the string is not being properly converted to unicode? Here is my code for unicode conversion. s = unicode(content...


Oracle / Python Converting to string -> HEX (for RAW column) -> varchar2

I have a table with a RAW column for holding an encrypted string. I have the PL/SQL code for encrypting from plain text into this field. I wish to create a trigger containing the encryption code. I wish to 'misuse' the RAW field to pass the plain text into the trigger. (I can't modify the schema, for example to add another column for the plain text field) The client inserting the data is Pytho...


python - How can I check Hamming Weight without converting to binary?

How can I get the number of "1"s in the binary representation of a number without actually converting and counting? e.g.

    def number_of_ones(n):
        # do something
        # I want to MAKE this FASTER (computationally less complex).
        c = 0
        while n:
            c += n%2
            n /= 2
        return c

    >>> number_of_ones(5)
    2
    >>> number_of_ones(4)
    1
    ...


python - Django: Converting an entire set of a Model's objects into a single dictionary

If you came here from Google looking for model to dict, skip my question, and just jump down to the first answer. My question will only confuse you. Is there a good way in Django to convert an entire set of a Model's objects into a single dictionary? I mean, like this: class DictModel(models.Model): key = models.CharField(20) value = models.CharField(200) DictModel.objects.all().to_dict()


python - What is the difference between converting to hex on the client end and using rawtohex?

I have a table that's created like this: CREATE TABLE bin_test (id INTEGER PRIMARY KEY, b BLOB) Using Python and cx_Oracle, if I do this: value = "\xff\x00\xff\x00" #The string represented in hex by ff00ff00 self.connection.execute("INSERT INTO bin_test (b) VALUES (rawtohex(?))", (value,)) self.connection.execute("SELECT b FROM bin_test")





