Downloading all linked PDFs from multiple URLs using Python

My first try with web scraping using Python

I’ve been learning Python in my spare time for the past couple of months — initially for data analysis and visualisation. I am still relatively new to the language but have been using it to automate tasks as needed (outside of the data work), primarily by adapting and modifying code others have shared on Stack Overflow. Recently, I had to download loads of PDF reports related to the SDGs, submitted by a select group of countries, from the UN’s SDG portal. So, I decided to use Python to automate the task. Below is the simple web-scraping script I wrote for the purpose, based on this from Stack Overflow.

My key aim was to download all PDFs linked on each member country page and organise them into a folder per country. Where a link was broken, I wanted the code to skip it and continue (while printing an error message). Each component of the code is explained below.

Loading necessary libraries/packages.

import os
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

Define the base URL and build a list of country-level URLs; these are the pages we will scrape for PDFs later. You can get the correct country names for the country-specific pages here.

baseurl = 'https://sustainabledevelopment.un.org/memberstates/'
# these are the countries I wanted; add or remove entries as necessary
countries = ['bangladesh', 'china', 'colombia', 'india', 'kenya', 'madagascar', 'malawi', 'mozambique', 'peru', 'tanzania']
# build list of country-level urls
def build_url(country):
    return baseurl + country
urls = []
for country in countries:
    urls.append(build_url(country))
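
As a small aside, the same list can also be built in one line with a list comprehension; this is just an equivalent, more compact alternative to the build_url helper above.

# equivalent one-line alternative to build_url and the loop above
urls = [baseurl + country for country in countries]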

Using a for loop, go through each member country page and download the linked PDFs into the respective country folders. To skip broken PDF links and continue scraping, the download request is wrapped in a try/except block.

for url in urls:
    # define folder name from member country portion of the url
    foldername = url.split('/')[-1]
    # create a folder for the country if it doesn't exist
    if not os.path.exists(foldername):
        os.mkdir(foldername)
    page = requests.get(url).text
    soup = bs(page, 'html.parser')
    for link in soup.select("a[href$='.pdf']"):
        pdf_url = urljoin(url, link['href'])
        filename = os.path.join(foldername, link['href'].split('/')[-1])
        try:
            response = requests.get(pdf_url)
            # only create the file once the download has succeeded
            with open(filename, 'wb') as f:
                f.write(response.content)
        except requests.exceptions.RequestException:
            print('Could not open url:', pdf_url)
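
For bigger downloads, a slightly more defensive version of the same loop can be useful. The sketch below is optional and not part of my original script: it reuses a single requests.Session, sets a timeout, treats HTTP error codes (e.g. 404) as failures via raise_for_status(), and skips files that have already been downloaded. It assumes the same imports and urls list defined above.

# optional, more defensive variant of the download loop
# (assumes the imports and the `urls` list defined above)
session = requests.Session()              # reuse one connection pool for all requests

for url in urls:
    foldername = url.split('/')[-1]
    os.makedirs(foldername, exist_ok=True)    # create the folder if missing
    page = session.get(url, timeout=30).text
    soup = bs(page, 'html.parser')
    for link in soup.select("a[href$='.pdf']"):
        pdf_url = urljoin(url, link['href'])
        filename = os.path.join(foldername, pdf_url.split('/')[-1])
        if os.path.exists(filename):          # skip files already downloaded
            continue
        try:
            response = session.get(pdf_url, timeout=30)
            response.raise_for_status()       # treat 404s and other HTTP errors as failures
            with open(filename, 'wb') as f:
                f.write(response.content)
        except requests.exceptions.RequestException:
            print('Could not open url:', pdf_url)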

I was able to download 400+ documents in a few minutes, which would perhaps have taken me hours to do manually! Full code below:

import os
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

baseurl = 'https://sustainabledevelopment.un.org/memberstates/'
# these are the countries I wanted; add or remove entries as necessary
countries = ['bangladesh', 'china', 'colombia', 'india', 'kenya', 'madagascar', 'malawi', 'mozambique', 'peru', 'tanzania']
# build list of country-level urls
def build_url(country):
    return baseurl + country
urls = []
for country in countries:
    urls.append(build_url(country))

for url in urls:
    # define folder name from member country portion of the url
    foldername = url.split('/')[-1]
    # create a folder for the country if it doesn't exist
    if not os.path.exists(foldername):
        os.mkdir(foldername)
    page = requests.get(url).text
    soup = bs(page, 'html.parser')
    for link in soup.select("a[href$='.pdf']"):
        pdf_url = urljoin(url, link['href'])
        filename = os.path.join(foldername, link['href'].split('/')[-1])
        try:
            response = requests.get(pdf_url)
            # only create the file once the download has succeeded
            with open(filename, 'wb') as f:
                f.write(response.content)
        except requests.exceptions.RequestException:
            print('Could not open url:', pdf_url)
