scrape pdf files – flowingmotion

How do we download pdf files from a website without opening and saving each one separately?

I am doing a little research job and I want to read over 100 pdf files linked from several pages on a website. Obviously, I could select each link, open the file, save the file to my hard drive, and go on to the next link. But downloading all the files that way would take 2 to 3 hours, and I would probably miss a file or two through fatigue.

I wanted to download all the files more quickly and accurately, so I looked around and put together a 5-step solution.

Step 1: Find a complete list of the links

Problem: The listings visible to the casual visitor were paginated and I would have to step through 20 pages to download them all.

Solution: I used the website’s sitemap to dig deep into the website and find a “page” where all the links were on one scrollable, rather than paginated, page.

Step 2: Download all the links on the page

Problem: Copying every link on the page by hand would be terribly tedious.

Solution: I used the “find links” function in Outwit to list all the links, which I then copied and pasted into an Excel file.
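For anyone who would rather script this step than use Outwit, here is a minimal sketch of the same idea using only Python's standard library. It is not what I did; the sample HTML and the example.com addresses are made up for illustration, and a real page would first have to be saved or fetched.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute ones.
                self.links.append(urljoin(self.base_url, value))


def find_links(html, base_url):
    """Return every link found in the HTML, in page order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page fragment standing in for the real listing page.
sample_html = """
<a href="/reports/report-001.htm">Report 1</a>
<a href="/about.php">About</a>
<a href="http://example.com/reports/report-002.htm">Report 2</a>
"""

links = find_links(sample_html, "http://example.com/index.htm")
for link in links:
    print(link)
```

The result is a plain list of absolute links, the equivalent of the column I pasted into Excel.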

Step 3: Create a list of pdf files

Problem: The list of links produced by Outwit included links to everything on the page, not just the pdf files. The links to the pdf files were also listed as .htm links.

Solution: I did a simple alphabetical sort of the links in Excel and deleted everything except the links ending with .htm.
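The sort-and-delete I did in Excel could be sketched in Python roughly as follows. The function name and the sample links are my own invention; the only rule taken from the step above is "keep the links ending in .htm".

```python
from urllib.parse import urlparse


def keep_htm_links(links):
    """Keep only links whose path ends in .htm, de-duplicated and sorted."""
    kept = {
        link
        for link in links
        if urlparse(link).path.lower().endswith(".htm")
    }
    return sorted(kept)


# Hypothetical list of links, as a link-finding tool might produce it.
all_links = [
    "http://example.com/reports/report-002.htm",
    "http://example.com/about.php",
    "http://example.com/reports/report-001.htm",
    "http://example.com/reports/report-001.htm",
    "http://example.com/style.css",
]

htm_links = keep_htm_links(all_links)
for link in htm_links:
    print(link)
```

Checking the path rather than the whole link means a link with a query string after .htm would still be kept.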

Step 4: Reformat the links to link directly to the pdf files

Problem: I still had a list of .htm links, not links to the pdf files themselves.

Solution: I used the “Inspect Element” feature of Firebug to inspect the link on the website page and found the source and format of the underlying link to the pdf file. Then, I edited the .htm links into links that described the pdf files.

Finally, I saved this list of links to the pdf files in a text file.
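The exact rewrite depends entirely on how the website stores its files, so the sketch below is only an illustration. It assumes, hypothetically, that each pdf sits at the same address as its .htm page with the extension swapped; on the real site the pattern has to be read off with Inspect Element first. The function names and the file name pdf_links.txt are also my own.

```python
from pathlib import Path
from urllib.parse import urlsplit, urlunsplit


def htm_to_pdf_url(url):
    """Rewrite a .htm link into a .pdf link (assumed pattern, see above)."""
    parts = urlsplit(url)
    if not parts.path.lower().endswith(".htm"):
        raise ValueError(f"not a .htm link: {url}")
    new_path = parts.path[: -len(".htm")] + ".pdf"
    return urlunsplit(parts._replace(path=new_path))


def save_links(urls, filename):
    """Write one link per line to a text file."""
    Path(filename).write_text("\n".join(urls) + "\n", encoding="utf-8")


# Hypothetical .htm links left over from the previous step.
htm_links = [
    "http://example.com/reports/report-001.htm",
    "http://example.com/reports/report-002.htm",
]

pdf_links = [htm_to_pdf_url(link) for link in htm_links]
for link in pdf_links:
    print(link)
```

Calling save_links(pdf_links, "pdf_links.txt") then produces the text file of direct links that a download manager can read.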

Step 5: GoZilla

Problem: I still had the problem of downloading the pdf files and did not want to download them one-by-one.

Solution: I uploaded the text file to GoZilla and used automatic downloading to complete the job while I did something else.
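A download manager is not the only way to work through the list. As a rough sketch, and not what I actually ran, the same batch download could be done with Python's standard library. The function names, the one-second pause between files, and the folder name in the note afterwards are all assumptions of mine.

```python
import time
from pathlib import Path
from urllib.parse import urlsplit
from urllib.request import urlopen


def fetch_bytes(url):
    """Download one URL and return its contents as bytes."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def download_all(urls, dest_dir, fetch=fetch_bytes, delay=1.0):
    """Save each URL into dest_dir; return (saved names, failed downloads)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved, failed = [], []
    for url in urls:
        # Name the local file after the last part of the URL path.
        name = Path(urlsplit(url).path).name
        try:
            (dest / name).write_bytes(fetch(url))
            saved.append(name)
        except OSError as error:
            # Network errors from urlopen are OSError subclasses;
            # record them so no file is silently missed.
            failed.append((url, str(error)))
        time.sleep(delay)  # pause between files to go easy on the server
    return saved, failed
```

With the links read from the text file into a list called urls, download_all(urls, "pdfs") would save everything into a pdfs folder and return the list of failures, which answers my worry about missing a file or two.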

Result!

140 pdf files downloaded onto my hard drive ready for reading or conversion into text for further searching.

It took me just as long to work this out as it would have taken to do it manually, but next time I will be able to do it more quickly!

How to download 100 pdf files from a website in one batch