How do we download pdf files from a website without opening and saving each one separately?
I am doing a little research job and I want to read over 100 pdf files linked from several pages on a website. Obviously, I could select each link, open the file, save the file to my hard drive, and go on to the next link. But downloading all the files that way would take 2 to 3 hours, and I would probably miss a file or two through fatigue.
I wanted to download all the files more quickly and accurately, so I looked around and put together a 5-step solution.
Step 1: Find a complete list of the links
Problem: The listings visible to the casual visitor were paginated and I would have to step through 20 pages to download them all.
Solution: I used the website’s sitemap to dig deep into the website and find a “page” where all the links were on one scrollable, rather than paginated, page.
Step 2: Download all the links on the page
Problem: To download all the links on the page would be terribly tedious.
Solution: I used the “find links” function in Outwit to list all the links, which I then copied and pasted into an Excel file.
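For readers who would rather script this step than use a browser add-on, here is a minimal sketch in Python of what a "find links" function does: it walks the HTML and collects every link target. This is not how Outwit works internally, just an illustration using the standard library. The sample HTML and the example.com address are made up for the demonstration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Turn relative links into absolute ones.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content, only for illustration.
sample = '<a href="/docs/report1.htm">Report 1</a> <a href="about.html">About</a>'
for link in extract_links(sample, "http://example.com/index/"):
    print(link)
```

In practice the HTML would come from the saved source of the single long page found in Step 1.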
Step 3: Create a list of pdf files
Problem: The list of links produced by Outwit included links to everything on the page, not just the pdf files. The links to the pdf files were also listed as .htm.
Solution: I did a simple alphabetical sort of the links in Excel and deleted everything except the links ending with .htm.
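The same sort-and-delete can be sketched in a few lines of Python instead of Excel. The function name and the sample links below are my own, chosen only to illustrate the idea; it keeps links whose path ends in .htm, drops duplicates, and sorts the rest.

```python
from urllib.parse import urlparse


def keep_htm_links(links):
    """Return the sorted, de-duplicated links whose path ends in .htm."""
    kept = {
        link for link in links
        if urlparse(link).path.lower().endswith(".htm")
    }
    return sorted(kept)


# Hypothetical list of links, standing in for the Outwit export.
all_links = [
    "http://example.com/docs/report2.htm",
    "http://example.com/style.css",
    "http://example.com/docs/report1.htm",
    "http://example.com/about.html",
    "http://example.com/docs/report1.htm",
]
for link in keep_htm_links(all_links):
    print(link)
```

Checking the path rather than the whole address means a link with a query string after the .htm is still kept, while .html pages are left out.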
Step 4: Reformat the links to link directly to the pdf files
Problem: I still had a list of .htm links, not links to the pdf files.
Solution: I used the “Inspect Element” feature of Firebug to inspect the link on the website page and found the source and format of the underlying link to the pdf file. Then, I edited the .htm links so that they pointed directly to the pdf files.
Finally, I saved this list of links to the pdf files in a text file.
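The rewrite rule depends entirely on the website, so the sketch below is only a stand-in: it assumes, purely for illustration, that the pdf lives at the same address with .pdf in place of .htm. The real pattern has to be read off the page with an inspector, as described above. The function names and the example.com addresses are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def to_pdf_url(htm_url):
    """Assumed rule (site-specific in reality): swap the .htm extension for .pdf."""
    parts = urlsplit(htm_url)
    path = parts.path
    if path.lower().endswith(".htm"):
        path = path[:-len(".htm")] + ".pdf"
    # Query string and fragment are dropped in this sketch.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def write_url_list(urls, filename):
    """Save one URL per line, the format most download managers accept."""
    with open(filename, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")


print(to_pdf_url("http://example.com/docs/report1.htm"))
```

Calling write_url_list with the converted links and a file name such as links.txt would produce the text file used in the next step.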
Step 5: GoZilla
Problem: I still had the problem of downloading the pdf files and did not want to download them one-by-one.
Solution: I uploaded the text file to GoZilla and used automatic downloading to complete the job while I did something else.
140 pdf files downloaded onto my hard drive ready for reading or conversion into text for further searching.
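If no download manager is at hand, the batch download itself can be sketched with Python's standard library. This is not what GoZilla does internally, only a rough equivalent: read the text file of links, fetch each one, and note any failures so that no file is silently missed. The function names, the one-second pause, and the fallback file name are my own assumptions.

```python
import os
import time
import urllib.request
from urllib.parse import unquote, urlsplit


def filename_for(url):
    """Derive a local file name from the last part of the URL path."""
    name = os.path.basename(unquote(urlsplit(url).path))
    return name or "download.pdf"  # fallback name is an arbitrary choice


def read_url_list(list_file):
    """Read one URL per line, skipping blank lines."""
    with open(list_file, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def download_all(list_file, dest_dir, delay_seconds=1.0):
    """Download every URL in list_file into dest_dir; return the failures."""
    os.makedirs(dest_dir, exist_ok=True)
    failed = []
    for url in read_url_list(list_file):
        target = os.path.join(dest_dir, filename_for(url))
        try:
            urllib.request.urlretrieve(url, target)
        except OSError as error:  # covers network and HTTP errors
            failed.append((url, str(error)))
        time.sleep(delay_seconds)  # be polite to the server
    return failed
```

Calling download_all("links.txt", "pdfs") would save the files into a folder named pdfs and return a list of any links that could not be fetched, which can then be retried by hand.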
It took me just as long to work this out as it would have to do it manually, but next time I will be able to do it more quickly!
That’s a lot of effort, I applaud your determination. I’d think something like HTTrack (free, GPL) would work – I use it all the time to scrape all the PDFs off a website – you can grab indexes, too. http://www.httrack.com/
Wow, this is an incredibly dumb process. All you need is a single step. A simple downloader will do.
Or you could just use
That is incredibly awesome, thanks !
this is the best!
Yeah, I second this, fricking awesome. It was so user friendly that I had no idea what I was doing and BAM! 188 pdf files in an instant.
You could just use wget from bash. If you’re using Windows, you can install bash for Windows.
I went through all ‘solutions’ but none of them worked for this website: http://www.bkk.hu/apps/menetrend/
It’s the public transportation timetable. Clicking on the 1st station opens the PDF in a new tab.
Any solutions are welcome.
OutWit is a great tool, but only the pro version lets you grab documents automatically. They do have another product called OutWit Docs, which is dedicated to document extraction: http://www.outwit.com
If there is a folder with multiple files, DownThemAll works well. If a page has multiple PDF links (the PDFs are not in one folder, but are linked), DownThemAll only picks up the first one.