Lim Yoong Kang

Filling in PDF forms with Python

Recently, I worked on a project that involves filling in PDF forms programmatically.

Spoiler alert: I had a horrible experience.

But if you have any experience dealing with PDFs, you already knew that.

We inherited a Django and an Ionic/Cordova codebase. One of the things the Cordova app does is send data to the Django backend, which in turn creates a PDF.

The situation is that we have a number of PDFs which are basically forms that get filled in based on the data sent from the mobile app. The nature of the project is such that we cannot change how the PDF looks.

The mobile app isn’t used all year round, but only during certain sales cycles. Each sales cycle we are given a set of PDFs that not only look different from each other – but also look different from previous sales cycles.

That means for this particular part of the project, we can’t standardise the PDF that gets generated.

Essentially, we need to prepare the form for each sales cycle.

Typically, the form has a number of text fields, a lot of checkboxes, and a number of radio buttons.

One unusual quirk was that it also requires a signature, which is captured from the mobile app.

Given these constraints, I tried a few things.

Idea #1: Use LibreOffice

I tried using LibreOffice.

This was actually how things were done before we inherited the codebase.

The idea was that we can put in named fields in an ODT document, which we can then fill in via something called UNO.

Basically the workflow goes like this:

  • Step 1: Convert all the pages in the form to images, and make sure it fits in A4.
  • Step 2: Create a new document in LibreOffice, and set the page styles to use these images as the background
  • Step 3: Put in named fields in the text fields, checkboxes, and radio buttons.
  • Step 4: Use UNO to put text into the fields. For checkboxes, we can put in a unicode checkbox character.
  • Step 5: Use UNO to convert the ODT document to PDF
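Steps 4 and 5 can be sketched roughly as below. This is not the inherited code: it swaps the UNO conversion for LibreOffice's headless command line (`soffice --headless --convert-to pdf`), and the field names and file paths are made up for illustration.

```python
from typing import Dict, List, Union

CHECKED = "\u2611"    # BALLOT BOX WITH CHECK
UNCHECKED = "\u2610"  # BALLOT BOX


def render_field_values(data: Dict[str, Union[str, bool]]) -> Dict[str, str]:
    """Turn submitted data into the text that goes into each named field."""
    rendered = {}
    for name, value in data.items():
        if isinstance(value, bool):
            # Checkboxes are faked with a Unicode glyph, as in step 4.
            rendered[name] = CHECKED if value else UNCHECKED
        else:
            rendered[name] = str(value)
    return rendered


def build_convert_command(odt_path: str, out_dir: str) -> List[str]:
    """Build the command line that converts the filled-in ODT to a PDF."""
    return [
        "soffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", out_dir,
        odt_path,
    ]
```

Assuming `soffice` is on the PATH, the command can be run with `subprocess.run(build_convert_command("form.odt", "out"), check=True)`. Writing the rendered values into the ODT's named fields still needs UNO (or some other ODT tooling), which this sketch leaves out.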

Now, in theory that approach was promising – however, there were a number of problems.

Firstly, have you tried aligning anything in an ODT document? It’s freaking awful.

Preparing the forms this way could easily take days, not to mention that it is super annoying to do for 12 different forms. And I had to do that for each sales cycle.

Also, I wasn’t actually able to get this technique to work on my local machine, which has a newer version of LibreOffice. I was told OpenOffice works better.

Using images as the background of an ODT document also results in pretty poor quality documents. We could use less compression, of course, but we end up with massive files that are hard to work with within LibreOffice.

So after a couple of false starts, I abandoned this approach.

Idea #1.5: Use HTML-to-PDF or RML/ReportLab

So I looked at a number of different ideas. One of them was to use some HTML-to-PDF tools.

Unfortunately, that meant recreating the form in HTML and CSS. We definitely can’t do that.

That also rules out approaches that use RML or ReportLab.

Idea #2: Use FDF

So, we know fillable PDF forms are a thing.

Adobe actually has several standards to fill in PDF documents. One of them is called the Forms Data Format, or FDF.

I wish I could tell you I understood the FDF specification, but the information I see online isn’t very helpful. Just go read the Wikipedia page on it.

The idea is that an FDF file contains data either from a submitted PDF form, or data that is meant to be filled into a PDF (usually the former, from what I can gather).

That means given an FDF, it should be possible to merge it with a PDF form to create a final filled-in PDF form.
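To make that a bit more concrete, here is a minimal sketch of what a hand-built FDF file can look like. It is not taken from any of the libraries below; `build_fdf` and the field names are hypothetical, and it only handles text that fits in Latin-1 (other text would need a different string encoding).

```python
from typing import Mapping


def _escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def build_fdf(fields: Mapping[str, str]) -> bytes:
    """Build a minimal FDF document with one /T (name), /V (value) pair per field."""
    entries = "\n".join(
        f"<< /T ({_escape_pdf_string(name)}) /V ({_escape_pdf_string(value)}) >>"
        for name, value in fields.items()
    )
    document = (
        "%FDF-1.2\n"
        "1 0 obj\n"
        "<< /FDF << /Fields [\n"
        f"{entries}\n"
        "] >> >>\n"
        "endobj\n"
        "trailer\n"
        "<< /Root 1 0 R >>\n"
        "%%EOF\n"
    )
    # Assumption: Latin-1 is enough for the values; this raises otherwise.
    return document.encode("latin-1")
```

The resulting bytes can be written to a `.fdf` file and handed to whatever tool does the merge.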

So, I talked to my employer who let me purchase Adobe Acrobat Pro DC for about $30 a month, and then used that to add PDF form fields to the PDFs the client gave me.

Great, so all I need to do is to find some Python libraries to create an FDF and merge it with the PDF form I created.

I found a few:

PyPDFtk

The PyPDFtk Python library is a Python wrapper for a server application called PDFtk.

It generates the (X)FDF for you, and merges it by calling PDFtk shell commands. It largely hides the implementation from you.
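As a rough idea of what happens under the hood, the sketch below builds an XFDF document and the PDFtk `fill_form` command line by hand. It is an illustration, not PyPDFtk's actual code; `build_xfdf`, `build_fill_command`, and the file names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Mapping

XFDF_NS = "http://ns.adobe.com/xfdf/"


def build_xfdf(fields: Mapping[str, str]) -> bytes:
    """Build an XFDF document with one <field name=...><value> per form field."""
    root = ET.Element("xfdf", {"xmlns": XFDF_NS})
    fields_el = ET.SubElement(root, "fields")
    for name, value in fields.items():
        field_el = ET.SubElement(fields_el, "field", {"name": name})
        ET.SubElement(field_el, "value").text = value
    # xml_declaration requires Python 3.8 or newer.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def build_fill_command(form_path: str, data_path: str, out_path: str) -> List[str]:
    """Build the PDFtk command that merges the data file into the form.

    The trailing "flatten" makes the filled-in fields non-editable.
    """
    return ["pdftk", form_path, "fill_form", data_path, "output", out_path, "flatten"]
```

With PDFtk installed, you would write the XFDF bytes to a file and then call something like `subprocess.run(build_fill_command("form.pdf", "data.xfdf", "filled.pdf"), check=True)`.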

For most things, this works fine. Textboxes, radio buttons, checkboxes – all fine.

However, it doesn’t handle image fields.

Unfortunately, one of the requirements was to paste a signature to the form, so I can’t actually use this.

PdfJinja

This one does allow you to paste images, which was great!

Here’s the repo: https://github.com/rammie/pdfjinja

I was actually really surprised to see a Python library that does this. I was under the impression that it was a really obscure niche.

I’m not actually 100% sure how it achieves this, but it looks like it uses an API meant for creating watermarks on PDFs. Watermarks are essentially images, so it does the job.

But… the library itself doesn’t seem to work for checkboxes and radio buttons.

In fact, I pulled the examples from the repo and ran them. The text fields and the signature were filled in fine, but the checkboxes didn’t work.

Perhaps it worked at some point, but it has obviously stopped working for checkboxes since.

But I also knew that FDFs can select checkboxes correctly – PyPDFtk worked fine.

So, I looked at the differences in the implementations between PyPDFtk and PdfJinja, and what I found was that for some reason, if you want a checkbox to be checked – the FDF output needs to contain the specific value of “Yes”.

I created my own fork, and replaced the value with the string “Yes”.
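The idea can be sketched like this. It is an illustration rather than the actual diff from the fork, and `to_form_value` is a hypothetical helper. It assumes the checkbox's "on" state is named "Yes", which was the case for these forms; a form can define a different on-state name, so check yours.

```python
def to_form_value(value) -> str:
    """Convert a Python value to the string written into the FDF."""
    if isinstance(value, bool):
        # Assumption: the form's checkboxes use "Yes" as their on-state.
        # "Off" is the standard name for the unchecked state.
        return "Yes" if value else "Off"
    return str(value)


form_data = {"full_name": "Ann", "agree_terms": True, "newsletter": False}
fdf_values = {name: to_form_value(value) for name, value in form_data.items()}
```

Everything that is not a boolean passes through as plain text, so text fields are unaffected.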

That fixed checkboxes for me, but radio buttons were still broken.

So I pulled out my debugger to see exactly what went wrong, and it turns out that PdfMiner, the library PdfJinja uses, handles some grouped form controls like radio buttons slightly differently from normal text fields.

I changed my fork to handle that specific case, and now radio buttons worked for me.

I was hesitant to open a pull request against the original repo because:

  • I didn’t really understand what was happening
  • I didn’t know if my change actually breaks something else

In the end I decided to maintain my own fork.

Other weird shit

Sometimes, I run into some errors related to fonts. Searching the error on StackOverflow gave me this gem from 2014: https://stackoverflow.com/questions/23948647/font-issue-with-pdftk

The original developer of PDFtk (whose name is in the Java stacktrace) commented on the question with a recommendation to use a newer incarnation of PDFtk.

Unfortunately, that doesn’t have a Python SDK, which I could in theory create, but that seems like too much effort.

Someone else found a workaround: open the PDF form in Preview (on Mac), type something in the text fields, remove the text, then save it again.

And then it magically works.

Turns out sometimes it happens for checkboxes too. Sometimes, I’ve had to open the PDF form, click on the checkboxes, unselect them, and then save. Magically works.

I really hate dealing with PDFs at this point.

Conclusion

For a file format as widely used as PDF, the experience of working with it was horrible.

If you’re looking to do some open source work, there’s a lot of low hanging fruit here.

Another interesting path to go down would be to create a new document format with an API that is friendlier to developers – and then make it possible to convert that to PDFs. Seems like a herculean task, though.

If anyone is interested in looking at sample code, I created a Django project that does this, with signatures and all: https://github.com/yoongkang/pdffun

EDIT 29 May 2018: Some people seem to be interested in my fork of PdfJinja, which works for checkboxes and radio buttons. Here is the link to it: https://github.com/yoongkang/pdfjinja