Blog

Merge PDF files then convert specific pages to Excel with Python

Merge PDF files then convert specific pages to Excel with Python

In this tutorial, I’ll be showing you how to do a PDF merge online using Python and then how to extract specific data from PDF to Excel, CSV or XML in the same script. We'll be using the PDF to Excel API.

I’ll be merging 3 PDFs then converting pages 1, 3 and 5 into an Excel workbook. The script I will be using also allows you to convert to CSV and XML.

Before we start

I've used a tool from PDF Labs called PDFtk. You will need to download the PDFtk Server version suitable for the OS you are working on.

If you don't have the PDFTables Python library set up and running on your machine, first go to our tutorial How to convert a PDF to Excel with Python and follow steps 1 and 2.

Additionally, you'll need an API key and the PyPDF2 library installed. To install this library, run the following command in your terminal:

pip install pypdf2

Step 1

In the folder where your PDFs are located, create a new Python file (.py) in your code editor, with a name of your choice, then add the following code:

#!/usr/bin/env python3

#import libraries
import os
import sys
import pdftables_api
from PyPDF2 import PdfFileWriter, PdfFileReader
import subprocess

#define arguments
pdf_input_file = sys.argv[1];
chosen_format = sys.argv[2].lower();
api_key = sys.argv[3];
pages_args = ",".join(sys.argv[4:]).replace(" ","")

#subprocess to merge PDFs
subprocess.call("pdftk *.pdf cat output "+ pdf_input_file)

#error if not enough arguments defined
if len(sys.argv) < 3:
    command = os.path.basename(__file__)
    sys.exit('Usage: {} pdf-file page-number, ...'.format(command))

#define file names
output_file_name = pdf_input_file.split(".pdf")
excel_output_file = output_file_name[0]

#if no page numbers are defined
if len(pages_args) == 0:
	print("All PDFs combined. No pages given, converting all pages to " + chosen_format)
	c = pdftables_api.Client(api_key)
	if chosen_format == "xml":
		c.xml(pdf_input_file, excel_output_file) 
	elif chosen_format == "csv":
		c.csv(pdf_input_file, excel_output_file) 
	elif chosen_format == "xlsx":
		c.xlsx(pdf_input_file, excel_output_file) 
	elif chosen_format == "xlsx_single":
		c.xlsx_single(pdf_input_file, excel_output_file)
	else:
		print("Format given not recognised, converting to xlsx")
		c.xlsx(pdf_input_file, excel_output_file) 
	print("Complete")
	
#if page numbers are defined
else:
	pages_required = [int(p) for p in filter(None, pages_args.split(","))]

	print("All PDFs combined. Now converting pages: {}".format(str(pages_required)[1:-1]) + " to " + chosen_format)

	pages_out_of_range = []
	pdf_file_reader = PdfFileReader(open(pdf_input_file, 'rb'))
	pdf_file_pages = pdf_file_reader.getNumPages()

	for page_number in pages_required:
		if page_number < 1 or page_number > pdf_file_pages:
			pages_out_of_range.append(page_number)

	if len(pages_out_of_range) > 0:
		pages_str = str(pages_out_of_range)[1:-1]
		sys.exit('Error: page numbers out of range: {}'.format(pages_str))

	pdf_writer_selected_pages = PdfFileWriter()

	for page_number in pages_required:
		page = pdf_file_reader.getPage(page_number-1)
		pdf_writer_selected_pages.addPage(page)

	pdf_file_selected_pages = pdf_input_file + '.tmp'

	with open(pdf_file_selected_pages, 'wb') as f:
	   pdf_writer_selected_pages.write(f)

	   
	c = pdftables_api.Client(api_key)

	if chosen_format == "xml":
		c.xml(pdf_file_selected_pages, excel_output_file) 
	elif chosen_format == "csv":
		c.csv(pdf_file_selected_pages, excel_output_file) 
	elif chosen_format == "xlsx":
		c.xlsx(pdf_file_selected_pages, excel_output_file) 
	elif chosen_format == "xlsx_single":
		c.xlsx_single(pdf_file_selected_pages, excel_output_file)
	else:
		print("Format given not recognised, converting to xlsx")
		c.xlsx(pdf_file_selected_pages, excel_output_file) 
	print("Complete")
	os.remove(pdf_file_selected_pages)

If you don’t understand the script above, see the script overview section.

Step 2

If you are converting all PDFs in the folder, you do not need to change the script. However, if you would like to convert only some of the PDFs in the folder, change *.pdf from line 13 (#subprocess to merge PDFs) to be a list of the PDFs, for example:

#subprocess to merge PDFs
subprocess.call("pdftk.exe invoice1.pdf invoice2.pdf invoice3.pdf cat output "+ pdf_input_file)

Step 3

Navigate to your Python file in the terminal and run the following command:

python merge_and_convert.py merged_pdfs.pdf xlsx your_api_key 1,3,5
  • Replace merge_and_convert.py with the name of your Python file.
  • Replace merged_pdfs.pdf with what you'd like to call the PDF file containing all merged PDFs.
  • Replace xlsx with the format you'd like to convert the PDF to. The options are xlsx, xlsx_single, csv or xml.
  • Replace your_api_key with your unique API key found on our API page.
  • Replace 1,3,5 with the page numbers of the merged PDFs file that you would like to convert, ensuring they are comma separated.
Command line

The script will print a message dependent on the arguments you have given. If you do not specify a recognised format, the PDF will be converted to xlsx by default. If no page numbers are given, the entire PDF will be converted to the format you have specified.

Once you see the message 'Complete', the conversion has been successful and the converted file can be found in the same folder as your Python script.

You're all done! You have successfully merged multiple PDFs then converted PDF to Excel using Python.

Script overview

  • #import libraries: Here we are importing all libraries required within the script, including the PDFTables API library and PDFtk toolkit.
  • #define arguments: We are defining the command-line arguments passed to the script.
  • #subprocess to merge PDFs: This allows us to call a command-line argument, to merge the PDFs, within the Python script and obtain the returned results.
  • #error if not enough arguments defined: If you have not defined enough arguments when you run the command in your terminal, an error will be shown.
  • #define file names: We are setting the name of the converted output file.
  • #if no page numbers are defined: If you do not define any page numbers, the entire PDF will be converted to your chosen format. If an unrecognised format is given, the PDF will be converted to xlsx.
  • #if page numbers are defined: Your PDF pages will be converted to your chosen format. If an unrecognised format is given, the PDF will be converted to xlsx.

Do you have more questions?

Check out our other blog posts here or our FAQ page. Also, feel free to contact us.

Love PDFTables? Leave us a review on our Trustpilot page!

Icons made by Smashicons from www.flaticon.com is licensed by CC 3.0 BY
PDFTables.com uses cookies to provide a service and collect information about how you use our site. If you don't want us to collect information about your site behaviour, please go to our privacy page for more information. Read about our use of cookies.