Tuesday, September 29, 2009

Font Encoding and Searchable PDFs

I ran into a weird issue today that I thought I'd share in case anyone
else runs into it.

In one of my applications I'm populating PDF forms via CFPDFFORM in
ColdFusion. It works great, but the generated PDFs aren't fully
searchable. In Acrobat Reader (or any other PDF reader I tested) you can
search the PDF, but any data that was programmatically inserted into the
form fields isn't included in the search. For example, I can be looking at
the name "Smith" in the PDF, but a search for "Smith" yields 0 results.

It turns out that the cause is the encoding of the font used on the form
fields. I chose Arial for the font (in Acrobat Pro on the Mac, if I
remember correctly) when I was creating the empty form, but didn't realize
that the version of Arial I chose used Identity-H encoding. Identity-H is
a double-byte encoding, so I find it a bit odd that it's not searchable,
but the solution (at least the one I've found so far) is to use a font
with ANSI encoding instead.
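
If you want to check which encodings a PDF's fonts declare, one rough
approach is to scan the raw file for /Encoding entries. The Python sketch
below is only an illustration and not part of the app described here. The
function name find_font_encodings is made up for this example. It only sees
font dictionaries stored uncompressed with a named encoding (for example
/Identity-H or /WinAnsiEncoding, the PDF name for the Windows ANSI
encoding). Anything inside compressed object streams, or any encoding
given as an indirect reference, will be missed.

```python
import re

# Matches named encodings such as "/Encoding /Identity-H" or
# "/Encoding/WinAnsiEncoding" in the raw bytes of a PDF file.
# Assumption: the font dictionaries are stored uncompressed.
ENCODING_RE = re.compile(rb"/Encoding\s*/([A-Za-z0-9_.\-]+)")


def find_font_encodings(pdf_bytes):
    """Return the set of named encodings found in the raw PDF bytes."""
    return {m.group(1).decode("ascii") for m in ENCODING_RE.finditer(pdf_bytes)}


def read_pdf(path):
    """Read a PDF from disk as raw bytes."""
    with open(path, "rb") as f:
        return f.read()
```

With a hypothetical file name, usage would look like
find_font_encodings(read_pdf("blank_form.pdf")). An empty result does not
prove the fonts are fine, since the relevant objects may simply be
compressed.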

Since I've been generating PDFs with this app for 2+ years now (funny no
one noticed until now!), I guess I'll be regenerating a lot of PDFs if I
want them to be searchable. Luckily there's a function in the app for just
that purpose, but my server's going to hate me for having to do all that
work over again.

Hope that saves someone else's head and nearest wall from unnecessary

1 comment:

Matthew Woodward said...

Turns out this did NOT fix the issue when the PDF form is populated by ColdFusion. We'll see what Adobe Support has to say, because I'm at a loss. If I type into the PDF form manually the text is searchable, but if it's put there by CF it isn't.