Optimizing PDFs for SEO

Filed in MY BEST POSTS, SEO, Web Site Content by Matt McGee on October 31, 2006 27 Comments

A couple of years ago, while I was working at OWT, one of our clients was launching a new product, but they weren’t the only company doing so. One of their main competitors was launching the same product, sourced from the same manufacturer. The product didn’t have much search history. It also didn’t have much competition for the relevant keywords.

Still, we screwed up. We took a very traditional approach to SEOing the new product: build out some great content on the web site, go after some links, do some PR, etc. The competition took a non-traditional approach: They slapped together a PDF with a couple pages of text content about the product, uploaded it to their web site, linked to it from their home page, and in no time flat that PDF file had the No. 1 ranking in both Google and Yahoo! for the relevant keywords. What’s worse — we had a heck of a time getting the great content we developed to outrank the PDF file. Ultimately, we followed the “if you can’t beat ’em, join ’em” theory and produced our own PDF which immediately started battling the competitor’s PDF for search engine supremacy, until our great content and links eventually caught up and won the battle.

With so many businesses — especially retailers — having access to PDFs full of product information, here are some thoughts on optimizing PDFs for search engine visibility.

1. All three major engines can crawl and index text-based PDFs. If you need proof, just do a search on each SE with [pdf] in the query. Google: white paper pdf … Yahoo: white paper pdf … MSN: white paper PDF

2. PDF optimization is similar to optimization for a regular content page. The same basics apply: good use of keywords/phrases, appropriate headlines and sub-headlines, solid content that reads well to a human eye, etc. If the PDF will include images, a caption underneath each image would be a good idea, especially if the caption includes a targeted keyword/phrase. (Of course, don’t overdo it. Remember my mom’s advice about SEO.)

Proof: Using the search above, we find this PDF ranked prominently in all three engines. On page 9 of this PDF, there’s a bold content heading (the equivalent of an H2): Awareness and Usage of the XML Button. Let’s not use the exact text, but something close: Here are the SERPs for [xml button awareness]: Yahoo, Google, and MSN. In each case, you find the PDF ranked highly in the SERPs and that exact bold content heading showing prominently in the snippet.

3. The most important thing where PDFs and SEO are concerned is how the PDF is created. Don’t use Photoshop to make your PDF, because when you do that, you’re really just wrapping one big image in a PDF file, and the spiders cannot crawl or “read” the text from that image. The PDF should be created with a text-based program, like MS Word or Adobe PageMaker, so that the final product is text-based and can be crawled.

4. Your PDF can reside anywhere on your site, but the usual rule still applies: spiders are less likely to crawl content that’s buried too deep. The safest thing to do is to put it as close to the root directory as possible.

5. When publishing a PDF on your site, you should very visibly link to the PDF from the home page, or from some page that gets crawled regularly. You have to lead the crawler along so it finds the new content as quickly as possible. Don’t just post the PDF and then cross your fingers that it gets crawled. (See my old post, Training the Crawlers for more.)

6. It’s probably a good idea to use a keyword when naming the file, such as keyword.pdf. I haven’t done any serious investigation into what impact this has, but it seems worth doing in case there’s a little boost to be had.
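
For anyone generating a lot of PDFs, the file-naming tip can be automated. Here is a minimal sketch in Python, assuming you want lowercase, hyphen-separated file names built from a target phrase. The function name keyword_filename and the choice of hyphens as separators are my own assumptions for illustration, not something tested in this post.

```python
import re
import unicodedata


def keyword_filename(phrase, extension="pdf"):
    """Turn a target keyword phrase into a URL-friendly file name.

    Hypothetical helper: lowercases the phrase, drops accents and
    punctuation, and joins the remaining words with hyphens.
    """
    # Decompose accented characters and discard anything non-ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", phrase)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")
    if not slug:
        raise ValueError("phrase contains no usable characters")
    return f"{slug}.{extension}"


if __name__ == "__main__":
    print(keyword_filename("XML Button Awareness"))
```

Whether the file name actually moves rankings is still an open question; the helper only keeps the names consistent.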

So that’s my quick and dirty overview on PDF optimization for SEO. What do you do with PDFs, if anything?


Comments (27)


  1. gradiva says:

    Hi Matt – thanks for the interesting post! I definitely think that PDF listings in the search engine results can be about as ugly as it gets.

    From what I can tell, Google (the only engine I researched) may grab a page title from any of the following:

    – document meta data (title)
    – the first line of text
    – the file name
    – text from within the document that is formatted in larger font

    I never saw an example of a document that *did* have a metadata title defined in which the metadata title was not used. (In other words, as far as I can tell Google will always take the metadata title first, before the other options). (My research is about 6 months old so of course things may have changed.)

    Your readers might want to know how easy it is to define a document metadata title: just select File > Properties or File > Document Properties.

    Best wishes,

    Gradiva Couzin

  2. Matt McGee says:

    Great reply and information, thanks Gradiva — much appreciated. I haven’t done much study of how and where that metadata title property gets used, so it’s good to see what you’ve discovered. If you uncover more tricks/secrets about PDF SEO, please let me know. 🙂

  3. technomatters says:

    I don’t think making PDFs for SEO is necessary, because many website CMSs and directory software packages already provide URL rewriting.

  4. Matt McGee says:

    That’s true, but this isn’t about repurposing a web page in PDF form. It’s about taking PDF material and making it more search engine friendly. Companies generate a lot of PDFs, so if they’re being posted to the web, they should be optimized!

  5. Keira says:

    What can stop Googlebot from crawling a PDF? If anyone could tell me, I would be very grateful.

  6. Matt McGee says:

    Keira – I’d place the PDFs in a new directory and use robots.txt to disallow bots from crawling that directory. If you just have one PDF to block, put the URL in your robots.txt and disallow it.
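
The robots.txt approach described in the reply above might look something like this. It is a sketch only: the /private-pdfs/ directory and the price-list.pdf file are hypothetical names, and the check uses Python's standard urllib.robotparser module to confirm which paths the rules would block.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one rule blocks a whole directory of PDFs,
# the other blocks a single PDF by its URL path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private-pdfs/
Disallow: /downloads/price-list.pdf
"""


def is_allowed(url_path, user_agent="Googlebot"):
    """Return True if the rules above let user_agent fetch url_path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url_path)


if __name__ == "__main__":
    for path in ("/white-paper.pdf",
                 "/private-pdfs/specs.pdf",
                 "/downloads/price-list.pdf"):
        print(path, "allowed" if is_allowed(path) else "blocked")
```

Running a check like this before uploading the file makes it easy to catch a typo in a Disallow line.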

  7. Chris says:

    I never realized that PDFs could be optimized beyond the name of the file. This has its good and bad points, though. Many of the PDFs that people make aren’t intended to be indexed, so they won’t realize that their info is out there for everybody to see. How many thousands of dollars do online marketers lose because of this?

  8. Mike says:

    Any idea of the impact of filesize of the PDF?
    Will crawlers shy away from very large PDFs?

    Many thanks.

  9. Liam says:

    Great info, guys! Here’s a question for you.

    If a high-ranking site in Google had a link on its homepage to my site (for example), this would be a great boost for my site, correct?

    If the same high-ranking site had a link to my site in its SEOed PDF, would this have the same effect?


  10. andy says:

    If you have an HTML file and a duplicate of it as a PDF file, would Google treat this as duplicate content?

  11. Ali Nasir says:

    A very good thing about this article is that it was written in 2006 and its contents are still very useful in 2010.

  12. Jeff Swanson says:

    I think andy has a good question here – about duplicate content. I’m not certain of the answer, but I would suggest using a unique summary of the PDF on the HTML page and having them download or open it for the full text. That way you avoid duplicate content issues – if they exist here.

    It’s a good question because I want to be able to provide a PDF for users, but also want to make sure they visit my site. If they see a PDF result in the SERPs and click on it, they aren’t actually visiting the domain, they’re just opening the PDF. With an HTML version, you could at least get them on the site and give them the option.

  13. James says:

    The thing is, crawlers do now index content from PDF files. I think this data is a little out of date, but hehe =-) thanks for the information.

  14. Jonathan says:

    Interesting article. I have just started work for a client that has a low ranking but a whole stack of PDF-based literature that is very well written.
    This could be the icing on the cake!
    Thanks Matt

  15. Chad says:

    We have many PDFs, but since PDFs can be “landing pages” when properly optimized, we prefer to “embed” the PDFs within the body of our site… that way, we can have title tags, navigation, the asset itself, and anything else we want. Plus, visitors can actually see our website this way instead of just seeing a PDF. Does this technique have any drawbacks other than the embedded PDF not being liked by certain browsers? Or is there another way for us to embed PDFs in our site’s template that all the engines/browsers prefer?

  16. That one lone dissenter (James) now has me wondering. That said, search engines are info-hungry monsters and I suspect they crawl everything possible. I have been doing a lot of reading lately about RDF, microdata, microformats etc. and Google itself contributes to schema.org. With those techniques you can readily specify that a pdf, or any other linked document is a publication of your site and I’d imagine that would banish “duplicate content” worries.

  17. Nayan Mewada says:

    Good one, Adam. However, I would like to know whether just putting the first line in bold will be picked up by search engines in today’s world. I am asking because I have already added a Title and Description to the PDF, but unfortunately they are currently not being reflected in the search engines, specifically in Google.

    Looking for some help from the masters above. 🙂

  18. HAHA #1 – I just Googled “SEO PDF” looking for an article like this one, and I ended up with a bunch of PDF results on beginner’s SEO. Thumbs up for optimizing the “how to”!

    Love your advice, and intrigued by your anchor texts. I am ACTUALLY bookmarking! I have to read about your mom’s advice on SEO now!

  19. Phil Ryan says:

    About three years ago I had someone come up to me and ask, “Can you optimize PDFs?” I thought to myself, why not? I did some reading and found an article very similar to this one. It’s amazing how many PDFs are out there, especially from universities and companies that publish their earnings. So many of them just fail to fill out the information in the back end of these PDFs and lose out on the SEO value. Thanks for putting this together. Maybe an updated post, since things have changed since 2006?
