Downloading all linked PDFs from multiple URLs using Python
My first try at web scraping with Python
I’ve been learning Python in my spare time for the past couple of months, initially for data analysis and visualisation. I am still relatively new to the language, but I have been using it to automate tasks as needed (outside of the data work), primarily by adapting and modifying code others have shared on Stack Overflow. Recently, I had to download loads of PDF reports related to the SDGs, submitted by a select group of countries, from the UN’s SDG portal, so I decided to use Python to automate the task. Below is the simple web-scraping code I wrote for the purpose, based on this from Stack Overflow.
My key aim was to download all the PDFs linked on each member country’s page and organise them into a folder per country. Where there were errors in the links, I wanted the code to ignore those, print an error message, and continue. Each component of the code is explained below, with comments.
Loading necessary libraries/packages.
import os
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin
Define the base URL and build a list of country-level URLs; these are the pages we will scrape for PDFs. You can get the correct country names for the country-specific pages here.
baseurl = 'https://sustainabledevelopment.un.org/memberstates/'
# these are the countries I wanted; add or remove as necessary
countries = ['bangladesh', 'china', 'colombia', 'india', 'kenya', 'madagascar', 'malawi', 'mozambique', 'peru', 'tanzania']
# build list of country-level urls
def build_url(country):
    return baseurl + country

urls = []
for country in countries:
    urls.append(build_url(country))
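As an aside, the same list could be built in a single line with a list comprehension; the explicit loop above does exactly the same thing.

# equivalent one-liner for building the list of country-level urls
urls = [build_url(country) for country in countries]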
Using a for loop, go through each member country’s page and download the linked PDFs into the respective folders. To ignore any broken PDF links and continue scraping, I use try/except around the actual download of each file from its URL.
for url in urls:
    # define folder name from the member country portion of the url
    foldername = url.split('/')[-1]
    # create a folder for the country if it doesn't exist
    if not os.path.exists(foldername):
        os.mkdir(foldername)
    # fetch the country page and parse it
    page = requests.get(url).text
    soup = bs(page, 'html.parser')
    # find every link whose href ends in .pdf
    for link in soup.select("a[href$='.pdf']"):
        pdf_url = urljoin(url, link['href'])
        filename = os.path.join(foldername, link['href'].split('/')[-1])
        # skip broken links but keep scraping the rest
        try:
            content = requests.get(pdf_url).content
        except requests.exceptions.RequestException:
            print('Could not open url:', pdf_url)
            continue
        with open(filename, 'wb') as f:
            f.write(content)
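One caveat with the snippet above: requests.get() does not raise an exception for an HTTP error response such as a 404, so a dead link that still returns a page could be saved under a .pdf name. If that matters, the download step can be made stricter with response.raise_for_status(), which is part of the requests API. A minimal sketch, with download_pdf as a hypothetical helper name:

def download_pdf(pdf_url, filename):
    # return True on success; print a message and return False for broken links
    try:
        response = requests.get(pdf_url)
        response.raise_for_status()  # 4xx/5xx responses raise an HTTPError
    except requests.exceptions.RequestException:
        print('Could not open url:', pdf_url)
        return False
    with open(filename, 'wb') as f:
        f.write(response.content)
    return True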
I was able to download 400+ documents in a few minutes, which would perhaps have taken me hours to do manually! Full code below:
import os
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin
baseurl = 'https://sustainabledevelopment.un.org/memberstates/'
# these are the countries I wanted; add or remove as necessary
countries = ['bangladesh', 'china', 'colombia', 'india', 'kenya', 'madagascar', 'malawi', 'mozambique', 'peru', 'tanzania']
# build list of country-level urls
def build_url(country):
    return baseurl + country

urls = []
for country in countries:
    urls.append(build_url(country))
for url in urls:
    # define folder name from the member country portion of the url
    foldername = url.split('/')[-1]
    # create a folder for the country if it doesn't exist
    if not os.path.exists(foldername):
        os.mkdir(foldername)
    # fetch the country page and parse it
    page = requests.get(url).text
    soup = bs(page, 'html.parser')
    # find every link whose href ends in .pdf
    for link in soup.select("a[href$='.pdf']"):
        pdf_url = urljoin(url, link['href'])
        filename = os.path.join(foldername, link['href'].split('/')[-1])
        # skip broken links but keep scraping the rest
        try:
            content = requests.get(pdf_url).content
        except requests.exceptions.RequestException:
            print('Could not open url:', pdf_url)
            continue
        with open(filename, 'wb') as f:
            f.write(content)
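If I ever re-run the script (for example after adding more countries to the list), one small improvement would be to skip PDFs that are already on disk, so only new files get downloaded. A rough sketch of that inner loop, reusing the imports above; the os.path.exists check and the download_new_pdfs helper name are my additions, not part of the original script:

def download_new_pdfs(soup, url, foldername):
    # fetch only the linked PDFs that are not already in the country folder
    for link in soup.select("a[href$='.pdf']"):
        pdf_url = urljoin(url, link['href'])
        filename = os.path.join(foldername, link['href'].split('/')[-1])
        if os.path.exists(filename):
            continue  # already downloaded on a previous run
        try:
            content = requests.get(pdf_url).content
        except requests.exceptions.RequestException:
            print('Could not open url:', pdf_url)
            continue
        with open(filename, 'wb') as f:
            f.write(content)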