PDF files are widely used to store and share documents, but extracting data from them can be a challenge. In this tutorial, we will explore how to extract tables from a PDF file using Python, and specifically the tabula-py package.
Installing tabula-py
Before we can use tabula-py, we need to install it. This can be done using pip, the Python package manager. Open your terminal or command prompt and type the following command:
pip install tabula-py
This will download and install the tabula-py package and its dependencies.
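Note that pip only installs the Python side. tabula-py also needs a Java runtime when it runs (see the section on how it works), so it can help to check that a java executable is on the PATH before going further. A minimal sketch; the helper name java_available is ours and is not part of tabula-py:

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


if java_available():
    # `java -version` writes its version banner to stderr, not stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    print(result.stderr.strip())
else:
    print("Java not found: install a Java runtime before using tabula-py.")
```

If the check fails, installing a Java runtime and reopening the terminal is usually enough for tabula-py to find it.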
Reading a table from a PDF file
Let's start by reading a table from a PDF file using tabula-py. In this example, we will use a sample PDF file that contains a table of Olympic medalists.
First, let's import the tabula module and use the read_pdf() function to extract the table from the PDF file:
import tabula
# Read page 1 as a single table; read_pdf() returns a list, so take its first element
df = tabula.read_pdf('olympic_medalists.pdf', pages=1, multiple_tables=False)[0]
In this example, we read the first page of the PDF file olympic_medalists.pdf. Setting multiple_tables=False tells tabula-py to treat the page's contents as a single table rather than trying to detect several separate tables. read_pdf() returns a list of Pandas dataframes, so we take the first element with [0] and assign it to the variable df.
Let's print the contents of the dataframe to the console to see what we've extracted:
print(df)
The output should look something like this:
Rank Team/NOC Gold Silver Bronze Total
1 United States (USA) 46 37 38 121
2 Japan (JPN) 27 14 17 58
3 Great Britain (GBR) 22 21 22 65
4 China (CHN) 18 20 14 52
5 Russia (ROC) 17 24 19 60
.. ... ... ... ... ...
78 Venezuela (VEN) 0 1 0 1
79 Zambia (ZAM) 0 1 0 1
80 Hong Kong (HKG) 0 0 4 4
81 Nepal (NEP) 0 0 1 1
82 Uganda (UGA) 0 0 2 2
[87 rows x 6 columns]
As we can see, we have successfully extracted the table from the PDF file and converted it to a Pandas dataframe.
Full Sample Code
Here is the full sample code to read a table from a PDF file using tabula-py:
import tabula
# Read page 1 as a single table; read_pdf() returns a list, so take its first element
df = tabula.read_pdf('olympic_medalists.pdf', pages=1, multiple_tables=False)[0]
# Print the dataframe
print(df)
How tabula-py works
tabula-py is a Python wrapper for tabula-java, the Java library behind Tabula, which uses Apache PDFBox to extract data from PDF files. When tabula-py is called, it starts a Java virtual machine as a subprocess and runs tabula-java in it. The Java library extracts the table data from the PDF file and hands it back as text (CSV or JSON); tabula-py then converts that output into Pandas dataframes in the Python process.
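To make the subprocess step more concrete, here is a rough sketch of the kind of command line involved. It only builds and prints the argument list; it does not run Java. The helper build_tabula_command and the jar path tabula.jar are illustrative assumptions (tabula-py ships and locates its own jar), and the real call made by tabula-py may include other options:

```python
import shlex


def build_tabula_command(pdf_path, pages="1", jar_path="tabula.jar", lattice=False):
    """Build an argument list resembling a tabula-java command-line call.

    jar_path is a placeholder; tabula-py bundles and locates its own jar.
    """
    cmd = ["java", "-jar", jar_path, "--pages", str(pages), "--format", "JSON"]
    if lattice:
        # Ask tabula-java to use ruling lines between cells to find the table.
        cmd.append("--lattice")
    cmd.append(pdf_path)
    return cmd


cmd = build_tabula_command("olympic_medalists.pdf", pages=1)
print(shlex.join(cmd))
# java -jar tabula.jar --pages 1 --format JSON olympic_medalists.pdf
```

The JSON or CSV text that such a command prints is what the Python side then parses into dataframes.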
tabula-py has several options that allow us to customize how the PDF file is parsed. For example, we can specify the pages to extract, the area of the page to extract, and whether to output multiple tables or not. The tabula.read_pdf() function takes, among others, the following arguments:
input_path: The path to the PDF file to extract data from.
pages: The pages of the PDF file to extract data from. This can be a single page number, a range of pages (e.g., "1-3"), a list of page numbers, or "all".
area: The area of the PDF page to extract data from. This is specified as a list of four numbers representing the top, left, bottom, and right coordinates of the area to extract (e.g., [0, 0, 100, 100]).
lattice: Whether to force lattice-mode extraction, which suits tables whose cells are separated by ruling lines, as in a spreadsheet (default: False). Older releases of tabula-py called this option spreadsheet.
multiple_tables: Whether to extract multiple tables from the PDF file (default: True).
java_options: Additional options to pass to the Java virtual machine.
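As an illustration of the area argument, the sketch below converts a region measured in inches into the [top, left, bottom, right] list, assuming the coordinates are expressed in PDF points (72 per inch). The helper names area_from_inches and read_region, and the region itself, are made up for this example; the tabula import sits inside read_region so the conversion helper can be tried without tabula-py or Java installed:

```python
POINTS_PER_INCH = 72  # assumption: area coordinates are in PDF points


def area_from_inches(top, left, bottom, right):
    """Convert a region in inches to a [top, left, bottom, right] list in points."""
    if bottom <= top or right <= left:
        raise ValueError("bottom/right must be greater than top/left")
    return [float(value) * POINTS_PER_INCH for value in (top, left, bottom, right)]


def read_region(pdf_path, page, area):
    """Extract the tables found inside `area` on a single page."""
    import tabula  # imported lazily; requires tabula-py and Java

    return tabula.read_pdf(pdf_path, pages=page, area=area)


area = area_from_inches(1, 0.5, 5, 8)
print(area)  # [72.0, 36.0, 360.0, 576.0]

# With tabula-py installed and a real file at hand:
# tables = read_region("olympic_medalists.pdf", 1, area)
```

Restricting the area is mainly useful when a page mixes a table with headers, footers, or body text that would otherwise be pulled into the result.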
The output of tabula.read_pdf() is a list of Pandas dataframes, with each dataframe representing a table extracted from the PDF file.
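Because the result is a list, a table that continues across several pages typically arrives as several dataframes. One way to stitch them together, assuming every piece has the same columns, is pd.concat. In this sketch the two small dataframes stand in for what read_pdf() might return for two pages, and combine_tables is a hypothetical helper:

```python
import pandas as pd


def combine_tables(tables):
    """Stack a list of dataframes with matching columns into one dataframe."""
    non_empty = [table for table in tables if not table.empty]
    if not non_empty:
        return pd.DataFrame()
    # ignore_index renumbers the rows 0..n-1 across all pieces.
    return pd.concat(non_empty, ignore_index=True)


# Stand-ins for the list returned by read_pdf() over two pages
page_1 = pd.DataFrame({"Team/NOC": ["United States (USA)"], "Gold": [46]})
page_2 = pd.DataFrame({"Team/NOC": ["Japan (JPN)"], "Gold": [27]})

combined = combine_tables([page_1, page_2])
print(len(combined))  # 2
```

If the pieces come back with different headers, for instance because only the first page carries the header row, the columns need to be aligned before concatenating.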
tabula-py is a powerful tool for extracting tables from PDF files, but it may not work perfectly for all PDF files. In some cases, the extracted data may require additional cleaning or manipulation to be useful. However, for many PDF files, tabula-py can be a quick and effective solution for extracting table data.
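As an example of such cleaning, the sketch below takes a dataframe shaped like the medal table above, splits the combined Team/NOC column into two columns, and converts the medal counts to integers. It assumes the counts may arrive as strings and that every team name ends with a three-letter code in parentheses; clean_medal_table is a hypothetical helper, and the two rows are copied from the sample output rather than read from a PDF:

```python
import pandas as pd


def clean_medal_table(df):
    """Split 'Team/NOC' into two columns and coerce medal counts to integers."""
    out = df.copy()
    # "United States (USA)" -> Team="United States", NOC="USA"
    parts = out["Team/NOC"].str.extract(
        r"^\s*(?P<Team>.*?)\s*\((?P<NOC>[A-Z]{3})\)\s*$"
    )
    out["Team"] = parts["Team"]
    out["NOC"] = parts["NOC"]
    for col in ["Gold", "Silver", "Bronze", "Total"]:
        # Values that cannot be parsed become missing (<NA>) instead of raising.
        out[col] = pd.to_numeric(out[col], errors="coerce").astype("Int64")
    return out.drop(columns=["Team/NOC"])


raw = pd.DataFrame(
    {
        "Rank": [1, 2],
        "Team/NOC": ["United States (USA)", "Japan (JPN)"],
        "Gold": ["46", "27"],
        "Silver": ["37", "14"],
        "Bronze": ["38", "17"],
        "Total": ["121", "58"],
    }
)

cleaned = clean_medal_table(raw)
print(cleaned[["Team", "NOC", "Total"]])
```

The exact cleaning steps depend on how a particular PDF was laid out, so a function like this usually has to be adjusted per document.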