

When testing highly data-dependent products, I find it very useful to use data published by governments. When government organizations publish data online, barring a few notable exceptions, they usually release it as a series of PDFs. The PDF file format was not designed to hold structured data, which makes extracting data from PDFs difficult. In this post, I will show you a couple of ways to extract text and table data from a PDF file using Python and write it into a CSV or Excel file.

We will take an example of US census data for the Hispanic Population for 2010. If you look at the content of the PDF, you can see that there is a lot of text data, table data, graphs, maps, etc. I will extract the table data for Hispanic or Latino Origin Population by Type: 20 from the PDF file.

For achieving this, I first tried using PyPDF2 (for extracting) and PDFTables (for converting PDF tables to Excel/CSV). It did serve my requirement, but PDFTables is a paid service. Later I came across PDFMiner and started exploring it for extracting data using its pdf2txt.py script. I liked this solution much better and I am using it for my work.

Method 1: Extract the Pages with Tables using PyPDF2 and PDFTables

When I Googled around for ‘Python read pdf’, PyPDF2 was the first tool I stumbled upon. PyPDF2 can extract data from PDF files and manipulate existing PDFs to produce a new file. After spending a little time with it, I realized PyPDF2 does not have a way to extract images, charts, or other media from PDF documents. But it can extract text and return it as a Python string. Reading a PDF document is pretty simple and straightforward. I used the PdfFileReader() and PdfFileWriter() classes for reading and writing the table data.

I get the Page object by calling the getPage() method on the PdfFileReader object, with the page number minus 1 as the parameter (pages start at 0). After that, I created a PdfFileWriter object, which will eventually write a new PDF, and added the page to it.

    import PyPDF2

    # pg3 is the Page object returned by getPage() on the PdfFileReader object
    writer = PyPDF2.PdfFileWriter()  # create PdfFileWriter object
    # add pages
    writer.addPage(pg3)
    NewPDFfilename = "hispanic_tables.pdf"  # filename of your PDF/directory where you want your new PDF to be
    with open(NewPDFfilename, "wb") as outputStream:  # create new PDF
        writer.write(outputStream)  # write pages to new PDF

The purpose of writing this page with tables into a separate PDF file is that I used PDFTables for extracting the data. PDFTables puts everything (not just tables) in the PDF document into the output Excel or CSV, so to avoid having a lot of junk data in the Excel file I created a separate PDF with just the table that I want to extract. The PyPDF2 library extracts the text from a PDF document very nicely.
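
The page-extraction step of Method 1 can be put together roughly as follows. This is only a minimal sketch, not the exact code used for the post: the helper names page_index and extract_pages and the input file name "hispanic.pdf" are hypothetical, and page 3 is an assumption based on the pg3 variable in the post's code. It uses the PdfFileReader/PdfFileWriter API named in the post, which belongs to PyPDF2 versions before 3.0; PyPDF2 3.0 and later use PdfReader and PdfWriter instead.

```python
def page_index(page_number):
    """Convert a 1-based page number into the 0-based index getPage() expects."""
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    return page_number - 1


def extract_pages(source_path, target_path, page_numbers):
    """Copy the given 1-based pages of source_path into a new PDF at target_path."""
    # Imported here so page_index() stays usable without PyPDF2 installed.
    # PdfFileReader/PdfFileWriter are the pre-3.0 PyPDF2 class names.
    import PyPDF2

    writer = PyPDF2.PdfFileWriter()
    # Keep the source file open until the new PDF has been written,
    # because PyPDF2 reads page content from it lazily.
    with open(source_path, "rb") as source:
        reader = PyPDF2.PdfFileReader(source)
        for number in page_numbers:
            writer.addPage(reader.getPage(page_index(number)))
        with open(target_path, "wb") as output_stream:
            writer.write(output_stream)


# Hypothetical usage (input file name and page number are assumptions):
# extract_pages("hispanic.pdf", "hispanic_tables.pdf", [3])
```

Keeping the page-number conversion in its own small function makes the off-by-one explicit: page 3 of the document is index 2 for getPage().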
