How to download 100 pdf files from a website in one batch

How do we download pdf files from a website without opening and saving each one separately?

I am doing a little research job and I want to read over 100 pdf files linked from several pages on a website. Obviously, I could select each link, open the file, save the file to my hard drive, and go on to the next link. But downloading all the files that way would take 2 to 3 hours, and I would probably miss a file or two through fatigue.

I wanted to download all the files more quickly and accurately, so I looked around and put together a 5-step solution.

Step 1: Find a complete list of the links

Problem: The listings visible to the casual visitor were paginated and I would have to step through 20 pages to download them all.

Solution: I used the website’s sitemap to dig deep into the website and find a single scrollable page that listed all the links, rather than a paginated listing.

Step 2: List all the links on the page

Problem: Copying every link on the page by hand would be terribly tedious.

Solution: I used the “find links” function in OutWit to list all the links, which I then copied and pasted into an Excel file.
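For readers who would rather script this step, here is a minimal sketch of the same idea using only Python's standard library. It is not how OutWit works internally; the class name `LinkCollector`, the function `find_links`, and the sample page and `example.com` URLs are all hypothetical stand-ins for the real listing page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute ones.
                self.links.append(urljoin(self.base_url, value))


def find_links(html, base_url):
    """Return every link found in the HTML text, in page order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, standing in for the real listing page.
sample_html = """
<a href="/docs/report-01.htm">Report 1</a>
<a href="about.html">About</a>
<a href="http://example.com/docs/report-02.htm">Report 2</a>
"""
links = find_links(sample_html, "http://example.com/index/all.html")
for link in links:
    print(link)
```

The printed list can be pasted into a spreadsheet just like the OutWit output.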

Step 3: Create a list of pdf files

Problem: The list of links produced by OutWit included links to everything on the page, not just the pdf files. The links to the pdf files were also listed as .htm links rather than as .pdf links.

Solution: I did a simple alphabetical sort of the links in Excel and deleted everything except the links ending with .htm.
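The same sort-and-delete can be sketched in a few lines of Python. This is only an illustration of the filtering idea, not what Excel does; the function name `keep_htm_links` and the sample links are hypothetical.

```python
from urllib.parse import urlparse


def keep_htm_links(links):
    """Return the unique links whose path ends in .htm, sorted alphabetically."""
    return sorted(
        {link for link in links if urlparse(link).path.lower().endswith(".htm")}
    )


# Hypothetical mix of links, like the raw list pasted into the spreadsheet.
all_links = [
    "http://example.com/docs/report-02.htm",
    "http://example.com/about.html",
    "http://example.com/docs/report-01.htm",
    "http://example.com/style.css",
    "http://example.com/docs/report-01.htm",
]
htm_links = keep_htm_links(all_links)
```

Note that a link ending in .html is dropped, because only an exact .htm ending is kept, and that duplicates are removed along the way.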

Step 4: Reformat the links to link directly to the pdf files

Problem: I still had a list of .htm links, not links to the pdf files.

Solution: I used the “Inspect Element” feature of Firebug to inspect the link on the website page and found the source and format of the underlying link to the pdf file. Then, I edited the .htm links into links that pointed directly to the pdf files.

Finally, I saved this list of links to the pdf files in a text file.
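The editing can also be scripted once the pattern is known. The post does not say how the real .htm links map to the pdf links, so the sketch below ASSUMES the simplest possible pattern, where only the file extension changes. The names `to_pdf_link` and `save_links` and the `example.com` URLs are hypothetical; the rule inside `to_pdf_link` would need to be replaced with whatever Firebug reveals for the site in question.

```python
from pathlib import Path


def to_pdf_link(htm_link):
    """Rewrite one .htm link into a direct pdf link.

    ASSUMPTION: the pdf lives at the same address with a .pdf extension.
    The real site's pattern is not given in the post and may differ.
    """
    if not htm_link.endswith(".htm"):
        raise ValueError(f"not an .htm link: {htm_link}")
    return htm_link[: -len(".htm")] + ".pdf"


def save_links(links, path):
    """Write the links to a text file, one per line."""
    Path(path).write_text("\n".join(links) + "\n", encoding="utf-8")


# Hypothetical .htm links left over after the filtering step.
htm_links = [
    "http://example.com/docs/report-01.htm",
    "http://example.com/docs/report-02.htm",
]
pdf_links = [to_pdf_link(link) for link in htm_links]
```

Calling `save_links(pdf_links, "pdf_links.txt")` then produces the text file that the download step reads.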

Step 5: Download the files in one batch with GoZilla

Problem: I still had the problem of downloading the pdf files and did not want to download them one-by-one.

Solution: I loaded the text file into GoZilla and used automatic downloading to complete the job while I did something else.
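If a download manager is not available, the batch download itself can be approximated with Python's standard library. This is a rough sketch, not a replacement for GoZilla's features such as resuming or parallel downloads; the function name `download_all` and the file names in the usage note are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def download_all(list_file, target_dir):
    """Download every URL listed in list_file (one per line) into target_dir.

    Returns (saved, failed): the saved file paths and the URLs that failed.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved, failed = [], []
    for line in Path(list_file).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        # Use the last path segment of the URL as the local file name.
        name = Path(urlparse(url).path).name or "download.pdf"
        destination = target / name
        try:
            with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
                shutil.copyfileobj(response, out)
            saved.append(destination)
        except (OSError, ValueError) as error:
            # Network and HTTP errors are OSError subclasses; a malformed
            # URL raises ValueError. Record the failure and carry on.
            print(f"Failed: {url} ({error})")
            failed.append(url)
    return saved, failed
```

A call such as `saved, failed = download_all("pdf_links.txt", "pdfs")` returns the list of failed links, which makes it easy to see whether a file or two was missed.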

Result!

140 pdf files downloaded onto my hard drive ready for reading or conversion into text for further searching.

It took me just as long to work this out as it would have taken to do it manually, but next time I will be able to do it more quickly!

9 Comments

Bruce

That’s a lot of effort, I applaud your determination. I’d think something like HTTrack (free, GPL) would work – I use it all the time to scrape all the PDFs off a website – you can grab indexes, too. http://www.httrack.com/

September 26, 2012
Anton Makievsky and Sam Fischer

Wow, this is an incredibly dumb process. All you need is a single step: a simple downloader will do.

November 7, 2012
Tom Lyle

Or you could just use
https://addons.mozilla.org/en-US/firefox/addon/downthemall/

March 15, 2013
- joy
  
  That is incredibly awesome, thanks !
  https://addons.mozilla.org/en-US/firefox/addon/downthemall/
  this is the best!
  
  November 12, 2013
  - That Guy
    
    Yeah, I second this, fricking awesome. It was so user friendly I had no idea what I was doing and BAM! 188 pdf files in an instant
    
    January 10, 2014
umar

You could just use wget from bash. If you’re using Windows, you can download bash for Windows.

August 19, 2013
vik

I went through all ‘solutions’ but none of them worked for this website: http://www.bkk.hu/apps/menetrend/

It’s the public transportation timetable. Each PDF is opened by clicking on the 1st station, and it then opens in a new tab.

Any solutions are welcome.

October 8, 2013
Uclar

OutWit is a great tool, but only the pro version allows you to grab documents automatically. They do have another product called OutWit Docs, which is dedicated to document extraction: http://www.outwit.com

November 16, 2013
SF

If there is a folder with multiple files, DownThemAll works well. If a page has multiple PDF links (the PDFs are not in one folder, but are linked), DownThemAll only picks up the first one.

September 18, 2016