
How to download 100 pdf files from a website in one batch

How do we download pdf files from a website without opening and saving each one separately?

Wheat harvest before sunset near Branderslev, Lolland by Lars Plougmann via Flickr

I am doing a little research job and I want to read over 100 pdf files linked from several pages on a website.  Obviously, I could select each link, open the file, save the file to my hard drive, and go on to the next link.  But downloading all the files that way would take 2 to 3 hours, and I would probably miss a file or two through fatigue.

I wanted to download all the files more quickly and accurately, so I looked around and put together a 5-step solution.

Step 1:  Find a complete list of the links

Problem: The listings visible to the casual visitor were paginated, and I would have had to step through 20 pages to reach them all.

Solution: I used the website’s sitemap to dig deep into the website and find a “page” where all the links were on one scrollable, rather than paginated, page.

Step 2:  Download all the links on the page

Problem: To download all the links on the page would be terribly tedious.

Solution: I used the “find links” function in OutWit to list all the links, which I then copied and pasted into an Excel file.
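For readers who would rather script this step than use a browser add-on, here is a minimal sketch in Python. It is not what I ran; it only illustrates the same idea of listing every link on a page. The sample HTML and the example.com address are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links into absolute ones.
                self.links.append(urljoin(self.base_url, href))


def find_links(html, base_url):
    """Return every link found in the given HTML, in page order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, standing in for the real listing page.
sample_html = '<a href="/docs/report1.htm">Report 1</a> <a href="about.html">About</a>'
links = find_links(sample_html, "http://example.com/list/")
```

In practice the HTML would come from the listing page itself, for example by saving the page from the browser and reading the file.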

Step 3:  Create a list of pdf files

Problem: The list of links produced by OutWit included links to everything on the page, not just the pdf files.  Also, the links to the pdf files were listed with an .htm extension.

Solution: I did a simple alphabetical sort of the links in Excel and deleted everything except the links ending with .htm.
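The same sort-and-delete step could be sketched in Python as below. The function name `keep_htm_links` is my own invention, and the sketch assumes, as on the site I was using, that the links of interest are exactly the ones whose path ends in .htm.

```python
from urllib.parse import urlparse


def keep_htm_links(links):
    """Return the unique links whose path ends in .htm, sorted alphabetically."""
    kept = {
        link
        for link in links
        # Look at the path only, so a query string does not hide the extension.
        if urlparse(link).path.lower().endswith(".htm")
    }
    return sorted(kept)
```

Using a set also drops duplicate links, which is easy to overlook when sorting by eye in a spreadsheet.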

Step 4:  Reformat the links to link directly to the pdf files

Problem: I still had a list of .htm links, not links to the pdf files.

Solution: I used the “Inspect Element” feature of Firebug to inspect the link on the website page and found the source and format of the underlying link to the pdf file.  Then, I edited the .htm links into links that pointed directly to the pdf files.

Finally, I saved this list of links to the pdf files in a text file.
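The exact rewrite depends on how the website builds its pdf links, so the rule in this sketch is purely an assumption: it pretends each pdf sits at the same address as its .htm page, with only the extension changed. On a real site you would replace `to_pdf_link` with whatever pattern Inspect Element reveals. Both function names are hypothetical.

```python
from pathlib import Path


def to_pdf_link(htm_link):
    """Rewrite an .htm link into a pdf link.

    ASSUMPTION: the pdf lives at the same address with a .pdf extension.
    The real pattern must be read off the site being downloaded from.
    """
    if not htm_link.lower().endswith(".htm"):
        raise ValueError(f"not an .htm link: {htm_link}")
    return htm_link[: -len(".htm")] + ".pdf"


def save_links(links, path):
    """Write the links to a text file, one per line."""
    Path(path).write_text("\n".join(links) + "\n", encoding="utf-8")
```

A list comprehension such as `[to_pdf_link(link) for link in htm_links]` would then convert the whole list before it is saved.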

Step 5:  Download the files with GoZilla

Problem: I still had the problem of downloading the pdf files and did not want to download them one-by-one.

Solution: I loaded the text file into GoZilla and used automatic downloading to complete the job while I did something else.
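If you do not have a download manager to hand, a short Python script can work through the same text file. This is only a sketch of the idea, not what GoZilla does internally. The names `filename_for` and `download_all` are hypothetical, and the script assumes the text file holds one link per line.

```python
import os
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def filename_for(url):
    """Use the last path segment of the URL as the local filename."""
    return os.path.basename(urlparse(url).path) or "download.pdf"


def download_all(list_path, out_dir, fetch=urlretrieve):
    """Download every URL listed (one per line) in list_path into out_dir.

    Returns a list of (url, error message) pairs for downloads that failed,
    so that no file is silently missed.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failures = []
    for line in Path(list_path).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        try:
            fetch(url, str(out / filename_for(url)))
        except OSError as exc:  # urllib's URLError and HTTPError are OSError subclasses
            failures.append((url, str(exc)))
    return failures
```

Calling `download_all("links.txt", "pdfs")` would save each file into a `pdfs` folder and return the links that need a second look. Note that two links ending in the same filename would overwrite each other in this sketch.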

Result!

140 pdf files downloaded onto my hard drive ready for reading or conversion into text for further searching.

It took me just as long to work this out as it would have taken to do it manually, but next time I will be able to do it more quickly!

Published in SOCIAL MEDIA & IT

9 Comments

  1. Bruce

    That’s a lot of effort, I applaud your determination. I’d think something like HTTrack (free, GPL) would work – I use it all the time to scrape all the PDFs off a website – you can grab indexes, too. http://www.httrack.com/

  2. Anton Makievsky and Sam Fischer

    Wow, this is an incredibly dumb process. All you need is a single step. A simple downloader will do.

  3. umar

    You could just use wget using bash. If you’re using windows you can download bash for windows.

  4. vik

    I went through all ‘solutions’ but none of them worked for this website: http://www.bkk.hu/apps/menetrend/

    It’s the public transportation timetable. The PDFs are opened by clicking on the 1st station, and each one then opens in a new tab.

    Any solutions are welcome.

  5. Uclar

    OutWit is a great tool, but only the pro version allows you to grab documents automatically. They do have another product called OutWit Docs, which is dedicated to document extraction: http://www.outwit.com

  6. SF

    If there is a folder with multiple files, DownThemAll works well. If a page has multiple PDF links (the PDFs are not in one folder, but are linked), DownThemAll only picks up the first one.
