Mahesh Poudyal

Moving from PowerPoint to Quarto presentations

Sat, 19 Nov 2022 00:00:00 +0000

Let me first explain why I wanted to password-protect these slides. First, a lot of them are incomplete and/or out of date, and since I’ve set up continuous deployment on netlify, I don’t want them to be public. Second, these are current and future teaching materials that I have been building from scratch, so I want to restrict public access to them, at least for now.

I’ve been a big fan of markdown ever since the format came into being, and I often take notes and write early versions of my papers in markdown, especially in R Markdown if they also involve data analysis in R, as I have described in one of my earlier posts. I had earlier tried making slides in RStudio using xaringan but couldn’t really stick with it. When I came across the quarto presentations demo last year, I was amazed by its smoothness and the features it provided. So, last summer I started converting my old lecture slides to quarto just to check if I would be able to deliver my teaching without using PowerPoint. Luckily, I was able to test it before the teaching term started, as I had to give an invited talk (35-40 minutes) just before the start of the term this year, and I used presentation slides built completely in quarto, with animations, slide transitions and all. It worked really well, and being able to zoom in on a section of a slide was like magic to my audience! So that was that: I decided to move all my lecture slides to quarto.

Quarto presentations are basically static html pages with javascript doing all the fancy stuff in the background. So, I thought, why not set up a website with all my lecture slides, which I could access wherever and whenever I needed? I already host my main homepage on github, deploying via netlify, so I decided to do the same for my presentations. But I also needed the website to be password-protected so that only I had access. After a bit of searching, I came across staticrypt, which seemed to be exactly what I needed. Better still, I found a netlify plugin that implements staticrypt, making it even easier to set the whole thing up. So, that is what I did, the result of which you can see at lectures.poudyal.me. There are already plenty of great ‘step-by-step’ guides on how to build/host websites using github and netlify, so I’m not going to repeat them here. Once you have set that up, you just need three more steps to get quarto to render documents (presentations and any other files, e.g. index files) and encrypt them:

I. Set up a new environment variable called PROTECTED_PASSWORD within your Site settings in netlify. This will be used to encrypt/decrypt your html pages.

II. Add a package.json file at the root of your repository with the following entry:

{
    "dependencies": {
        "@quarto/netlify-plugin-quarto": "^0.0.5"
    },
    "devDependencies": {
        "netlify-plugin-password-protection": "^3.0.2"
    }
}

III. Add a netlify.toml file at the root of your repository with the following entry. See the netlify password protection plugin page for further details about [plugins.inputs].

[[plugins]]
    package = "@quarto/netlify-plugin-quarto"
[[plugins]]
    package = "netlify-plugin-password-protection"

    [plugins.inputs]
        directoryFilter = ["!node_modules"]
        title = "Protected Page"
        instructions = "Enter your passphrase"

After that, push everything to your github repository¹ and deploy on netlify. Everything should work just as required. Below are some screenshots of how my website with lecture materials looks now.

Image 1: Encrypted landing page

Image 2: Once I enter the passphrase above, I get here.

Image 3: Title page of one of my quarto presentation documents.

Next step for me is to use quarto to build all my websites - currently I use hugo.

¹ Make sure it is a private repo, otherwise password protection is pointless.

Moving from twitter to mastodon

Sat, 05 Nov 2022 00:00:00 +0000

I’m on mastodon, on the fediscience.org instance to be precise. I’ve been fairly inactive on twitter because I’ve not liked what it has become for a while now; my followers will have noticed that from my general lack of engagement on that platform over the last 2-3 years.

This is a big deal for me given I’ve been on the platform since 2007. I never got into facebook despite joining the platform quite early on, and completely left that platform (and its associated ones, instagram and WhatsApp) some 5 years ago. So, twitter has been my main social media platform for the best part of 15 years! I liked twitter from the very beginning.

In fact the original 140-character limit was perfect for me because during most of my first two years on the platform, I used to send SMS messages to a specific UK number from my fieldwork sites in Northern Ghana to post my tweets. In those days, each SMS was limited to 160 characters, so each tweet would cost the price of a single SMS, which I was happy to pay given the service it was providing me: helping me stay in touch with the outside world via my tweets from field sites that were often completely cut off. Occasionally, at one of my fieldwork sites, I even had to climb a tree to get a good enough mobile signal (mostly GPRS and, if lucky, EDGE; those old enough to have used mobile internet before 3G will remember!). I had a Nokia E50 phone back then, which I loved, and I still think it is one of the best mobile phones I’ve used. So I tweeted using my reliable Nokia phone for a number of years, and when I got my first Kindle Keyboard with 3G, it had an experimental browser feature, which I used to tweet as well.

So all of these fond memories of using Twitter in the early years are the ones I will cherish the most. Interestingly, my first week or so on the fediverse feels quite similar in many ways: looking for people to follow, seeing others find you on the platform, and just having a clean timeline with no ads and largely interesting toots. I really hope this will continue and become even better. I have no intention of going back to twitter now, although I’ll keep my account there active for now so that people know where to find me. My name and the link on my profile should lead them here 😄

Cartographic desktop backgrounds

Mon, 28 Dec 2020 00:00:00 +0000

Inspired by this tweet and this tool that the tweet referred to, I tried to make my own cartographic desktop background for one of my monitors — for the second monitor I’m simply using the one I got for London using the above tool, which on my desktop looks like this:

Below, I outline the steps for creating a similar desktop background, highlighting all the roads within a particular area but with a base map (e.g. terrain) instead of a plain background. I made one for Kathmandu and surrounding areas in Nepal in R using the osmdata, sf and ggmap packages, and it looks like this:

First step, load the required libraries in R.

# load packages
library(osmdata)
library(tidyverse)
library(sf)
library(ggmap)

Next, get the base map outline and the necessary map data from OSM. With the osmdata package, you use the Overpass API to extract OSM data. To get the map outline, you have to define a ‘bounding box’, basically the four coordinates (corners) of your outline. If you just want to create a bounding box automatically for a certain location, you can use the getbb() function, for example getbb("Kathmandu, Nepal"). However, I wanted to create the bounding box manually, so I simply went to the OSM webpage and used the ‘Export’ feature to select the area I needed, which gave me the bounding box coordinates used below. Once you have your bounding box, you use the opq() function to build the query. Then you use the add_osm_feature() function to add the feature you require in your map (in this case “highway”, which includes all the roads). Once the final query is defined, you use the osmdata_sf() function to send it to the Overpass server, which returns the data in ‘simple features (sf)’ format for plotting later. For the base map, you can use the get_map() function from ggmap to pull the base map you want from among the options available.

# set bounding box
bb <- c(85.25, 27.64, 85.46, 27.75) #these coordinates bound Kathmandu and surrounding areas

# build query
q <- opq(bbox = bb) %>% 
  add_osm_feature("highway")

# get data in sf format
roads <- osmdata_sf(q)

# get base map, I'm using 'toner-background'
basemap <- get_map(bb, maptype = "toner-background")
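For readers curious about what is sent to the Overpass server, here is a rough Python sketch of a simplified query for the same bounding box. It is not what osmdata generates verbatim: the function name overpass_highway_query is hypothetical, and the query is cut down to ways only. One thing worth noting is the coordinate order: the R vector above is (min lon, min lat, max lon, max lat), while Overpass QL expects (south, west, north, east).

```python
def overpass_highway_query(bbox, timeout=25):
    """Build a simplified Overpass QL query for all 'highway' ways in a bbox.

    bbox is (min_lon, min_lat, max_lon, max_lat), the same order as the R vector `bb`.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    if not (min_lon < max_lon and min_lat < max_lat):
        raise ValueError("bbox must be (min_lon, min_lat, max_lon, max_lat)")
    # Overpass QL expects the bounding box as (south, west, north, east)
    area = f"{min_lat},{min_lon},{max_lat},{max_lon}"
    return (
        f"[out:json][timeout:{timeout}];\n"
        f'way["highway"]({area});\n'
        "out geom;"
    )


# same coordinates as the Kathmandu bounding box above
query = overpass_highway_query((85.25, 27.64, 85.46, 27.75))
print(query)
```

The sketch only builds the query text; sending it to a server and parsing the response is what osmdata_sf() takes care of in R.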

You now have everything necessary to produce the map. If you are familiar with ggplot functions, the following steps should look familiar. The first step below plots the base map, and the next adds the roads from the sf data (note that you have to specify ‘osm_lines’ from the returned object). Finally, theme_void() removes all the axes etc. to give a clean cartographic plot.

ggmap(basemap) + 
    geom_sf(data = roads$osm_lines,
            inherit.aes = FALSE) + 
    theme_void()

Output from above looks like this, which is my main monitor background shown earlier!

Visualising NVivo coding with plotly treemap

Sat, 12 Dec 2020 00:00:00 +0000

TL;DR This post is only interesting/useful if you work with qualitative data and want to customise the “treemap” you get in NVivo, one of the most commonly used computer-assisted qualitative data analysis software (CAQDAS) packages. Basically, you can make much better treemap plots with the plotly package in R, using the coding frequency data that you can export from NVivo.

I’ve been coding qualitative data in NVivo for my research for the last few weeks, and one of the things I like doing as soon as I have done a decent amount of coding is to visualise it in some way. While the latest versions of NVivo do come with quite a few options for visualisation, the “treemap”, which you can get through the Hierarchy Chart option in NVivo, is my favourite. The problem is that I can’t do much with these charts in NVivo except change the colours, and even that only within the limited options available. So, I decided to export the coding data that NVivo uses to produce these charts and use the plotly package in R to create customisable treemap plots. Once you are in R, you just need the tidyverse, plotly and RColorBrewer packages for the code below to run successfully.

I. Exporting coding data from NVivo

You basically have two options: if you use the Windows version of NVivo, you can export the data as an .xlsx file (i.e. Microsoft Excel format); if you use the Mac version, you can export the data as .csv to read into R later. Below are two screenshots of the macOS version of NVivo showing the treemap and the underlying data that can be exported.

This is the default treemap you get in NVivo.

You can use the Export List... menu item to export the data from NVivo.

II. Importing data into R and structuring the df for `plotly` treemap plot

This is the only tricky bit in this workflow, as the data from NVivo needs some processing in R to get it into the structure needed for a treemap plot using the plotly package. I provide replicable steps below, with code, using data from NVivo’s built-in example project.

First, read data into a new dataframe, clean it a bit, remove unnecessary columns, unnecessary strings from the Codes column, and split hierarchical nodes (coding terms) into separate columns.

# load necessary libraries
library(tidyverse)
library(plotly)
library(RColorBrewer)

# read data
# this excludes autocoded nodes (can be selected when exporting data from NVivo)
df <- read.csv("https://raw.githubusercontent.com/mpoudyal/test-data/main/data/nvivo/ex_proj_codes.csv") 
glimpse(df) #check what you've just imported
names(df)[2:3] <- c("cref", "agg_cref") # simple naming for code frequency columns
df <- df[-c(4,5)] # remove unnecessary columns

# remove "Codes\\" string from the `Codes` column
df$Codes <- gsub("Codes\\\\", "", df$Codes, fixed=TRUE)

# prepare data for plotly treemap
# separate nodes (coding terms) into different columns, this is needed as NVivo exports hierarchical coding as single string with `\` separator
df <- df %>%
    separate(.,
             col = Codes,
             into = c("l1node", "l2node","l3node","l4node"),
             sep = "\\\\",
             remove = FALSE,
             extra = "merge")

Create ids, labels and parents columns for treemap plot. This step creates the three columns of codes preserving hierarchy in the structure required for plotly treemap.

df <- df %>%
    mutate(ids = case_when(
        !is.na(l4node) ~ paste0(l3node,"-",l4node),
        (is.na(l4node) & !is.na(l3node)) ~ paste0(l2node,"-",l3node),
        (is.na(l3node) & !is.na(l2node)) ~ paste0(l1node,"-",l2node),
        TRUE ~ l1node
    )) %>%
    mutate(labels = case_when(
        !is.na(l4node) ~ l4node,
        (is.na(l4node) & !is.na(l3node)) ~ l3node,
        (is.na(l3node) & !is.na(l2node)) ~ l2node,
        TRUE ~ l1node
    )) %>%
    mutate(parents = case_when(
        labels == l1node ~ "",
        labels == l2node ~ l1node,
        labels == l3node ~ paste0(l1node,"-",l2node),
        labels == l4node ~ paste0(l2node,"-",l3node)
    ))
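The id/label/parent logic above can be easier to follow on a single code path. Below is a small Python sketch of the same rules, written only as an illustration: the function name treemap_row and the child code names are hypothetical, and it assumes the "Codes\\" prefix has already been stripped, as in the R code.

```python
def treemap_row(code_path, sep="\\"):
    """Return (id, label, parent) for one hierarchical code path.

    Mirrors the case_when() rules: at most four levels, with any deeper
    levels left merged into the fourth (like extra = "merge" in separate()).
    """
    nodes = code_path.split(sep, 3)
    depth = len(nodes)
    label = nodes[-1]
    if depth == 1:
        # top-level code: id is the name itself and there is no parent
        return nodes[0], label, ""
    node_id = f"{nodes[-2]}-{label}"
    if depth == 2:
        parent = nodes[0]  # parent is a top-level id
    else:
        parent = f"{nodes[-3]}-{nodes[-2]}"  # parent id has the same "upper-lower" form
    return node_id, label, parent


# hypothetical code paths under the 'Economy' coding group
for path in ["Economy", "Economy\\Child", "Economy\\Child\\Grandchild"]:
    print(treemap_row(path))
```

Each parent value matches the id of the row one level up, which is the condition a treemap needs in order to nest the boxes correctly.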

The data is now ready to be plotted.

III. Plot the treemaps

First, treemap of all the coding.

# basic treemap
fig <- plot_ly(
    type = "treemap",
    ids = df$ids,
    labels = df$labels,
    parents = df$parents,
    values = df$cref,
    textinfo = "label+value")

# customise the plot with title and annotations
fig <- fig %>% 
    layout(title = list(text = "Treemap of all coding*",
                        xref = "paper", yref = "paper"),
               annotations = list(x = 1, y = -0.05,
                                  text = "*Numbers indicate frequency of occurrence for the code",
                                  showarrow = F, xref = "paper", yref = "paper",
                                  font = list(size = 12, color = "charcoal")))
fig

Output from above looks like this:

While we can zoom in on the coding groups and subgroups in the interactive plotly chart above, it is often useful to create a new treemap only for the coding group(s) of interest. Below I create two further treemaps simply by subsetting the original data and using the same basic code as above.

Treemap for the coding group ‘Economy’

## subset data
df1 <- df[grepl("Economy", df[["Codes"]]),]

fig1 <- plot_ly(
    type = "treemap",
    ids = df1$ids,
    labels = df1$labels,
    parents = df1$parents,
    values = df1$cref,
    textinfo = "label+value",
    marker = list(colors = brewer.pal(12,"Set3"))) # using RColorBrewer package for custom colour

fig1 <- fig1 %>% 
    layout(title = list(text = "Treemap of codes for 'Economy'*",
                        xref = "paper", yref = "paper" ),
               annotations = list(x = 1, y = -0.05,
                                  text = "*Numbers indicate frequency of occurrence for the code",
                                  showarrow = F, xref = "paper", yref = "paper",
                                  font = list(size = 12, color = "charcoal")))
fig1

Output from the code above looks like this:

Treemap for the coding group ‘Natural Environment’

## subset data
df2 <- df[grepl("Natural", df[["Codes"]]),]

fig2 <- plot_ly(
    type = "treemap",
    ids = df2$ids,
    labels = df2$labels,
    parents = df2$parents,
    values = df2$cref,
    textinfo = "label+value",
    marker = list(colors = brewer.pal(8,"Accent"))) # using RColorBrewer package for custom colour

fig2 <- fig2 %>% 
    layout(title = list(text = "Treemap of codes for 'Natural Environment'*",
                        xref = "paper", yref = "paper" ),
               annotations = list(x = 1, y = -0.05,
                                  text = "*Numbers indicate frequency of occurrence for the code",
                                  showarrow = F, xref = "paper", yref = "paper",
                                  font = list(size = 12, color = "charcoal")))
fig2

Output for the above code:

As you can see above, with plotly in R, there is much we can do to customise the treemaps and produce publication-quality figures compared to basic output you get from NVivo. I hope this workflow will come in handy for those of you who, like me, want to produce figures in R but have to rely on NVivo for much of the qualitative data analysis.

Downloading all linked PDFs from multiple URLs using Python

Sat, 14 Nov 2020 00:00:00 +0000

I’ve been learning Python in my spare time for the past couple of months, initially for data analysis and visualisation. I am still relatively new to the language but have been using it to automate tasks as needed (outside of the data stuff), primarily by adapting/modifying code others have shared on Stack Overflow. Recently, I had to download loads of PDF reports related to the SDGs, submitted by a select group of countries, from the UN’s SDG portal. So, I decided to use Python to automate the task. Below is the simple web-scraping code I wrote for the purpose, based on this from Stack Overflow.

My key aim was to download all PDFs linked on a member country page and organise them in folders for each country. Also, where there were any errors in the links, I wanted the code to ignore those and continue (but also print an error message). Each component of the code is explained below.

Loading necessary libraries/packages.

import os
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

Define the base URL and create a list of country-level URLs, which are the ones we want to scrape for PDFs later. You can get the correct country names for country-specific pages here.

baseurl = 'https://sustainabledevelopment.un.org/memberstates/'
# these are the countries I wanted; others can be added or removed as necessary
countries = ['bangladesh', 'china', 'colombia', 'india', 'kenya', 'madagascar', 'malawi', 'mozambique', 'peru', 'tanzania']
# build list of country-level urls
def build_url(country):
    return baseurl + country
urls = []
for country in countries:
    urls.append(build_url(country))

Using a for loop, go through each member country page and download the linked PDFs into the respective folders. To ignore any errors in links to PDFs and continue the scraping, I use try when actually downloading the files from their linked URLs.

for url in urls:
    # define folder name from member country portion of the url
    foldername = url.split('/')[-1]
    # create a folder for the country if it doesn't exist
    os.makedirs(foldername, exist_ok=True)
    page = requests.get(url).text
    # name the parser explicitly to avoid a BeautifulSoup warning
    soup = bs(page, 'html.parser')
    for link in soup.select("a[href$='.pdf']"):
        pdf_url = urljoin(url, link['href'])
        filename = os.path.join(foldername, link['href'].split('/')[-1])
        try:
            response = requests.get(pdf_url)
            response.raise_for_status()
        except requests.RequestException:
            print('Could not open url: ', pdf_url)
            continue
        # write the file only after a successful download, so failures leave no empty files
        with open(filename, 'wb') as f:
            f.write(response.content)
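The loop above relies on urljoin to turn whatever is in each href into a full URL, and on the last path segment for the local file name. Here is a small offline sketch of just that step. The helper name resolve_pdf_link and the example hrefs are made up for illustration; they are not taken from the actual UN pages.

```python
from urllib.parse import urljoin


def resolve_pdf_link(page_url, href):
    """Return (absolute_url, local_filename) for a link found on page_url."""
    absolute_url = urljoin(page_url, href)  # handles relative and absolute hrefs
    filename = href.split('/')[-1]          # last path segment becomes the file name
    return absolute_url, filename


page = 'https://sustainabledevelopment.un.org/memberstates/kenya'

# a site-relative href (hypothetical) is resolved against the site root
print(resolve_pdf_link(page, '/content/documents/report.pdf'))

# an href that is already absolute (hypothetical) is left unchanged
print(resolve_pdf_link(page, 'https://example.org/files/review.pdf'))
```

Because the file name is only the last path segment, two links ending in the same name would overwrite each other within a country folder, which is worth checking if the download counts look low.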

I was able to download 400+ documents in a few minutes, which manually would have perhaps taken me hours! Full code below:

import os
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

baseurl = 'https://sustainabledevelopment.un.org/memberstates/'
# these are the countries I wanted; others can be added or removed as necessary
countries = ['bangladesh', 'china', 'colombia', 'india', 'kenya', 'madagascar', 'malawi', 'mozambique', 'peru', 'tanzania']
# build list of country-level urls
def build_url(country):
    return baseurl + country
urls = []
for country in countries:
    urls.append(build_url(country))

for url in urls:
    # define folder name from member country portion of the url
    foldername = url.split('/')[-1]
    # create a folder for the country if it doesn't exist
    os.makedirs(foldername, exist_ok=True)
    page = requests.get(url).text
    # name the parser explicitly to avoid a BeautifulSoup warning
    soup = bs(page, 'html.parser')
    for link in soup.select("a[href$='.pdf']"):
        pdf_url = urljoin(url, link['href'])
        filename = os.path.join(foldername, link['href'].split('/')[-1])
        try:
            response = requests.get(pdf_url)
            response.raise_for_status()
        except requests.RequestException:
            print('Could not open url: ', pdf_url)
            continue
        # write the file only after a successful download, so failures leave no empty files
        with open(filename, 'wb') as f:
            f.write(response.content)

Mapping countries' GDP against Fortune Global 500 top 10

Sat, 11 Jul 2020 00:00:00 +0000

I’ve been experimenting with and learning R and various packages whenever I get time away from work, both to stay well-versed in skills I already possess and to learn new tricks and keep up with new developments. As I’ve been curious about using R for mapping (GIS) for a while, this weekend I thought I should learn something new. So, I set about producing a series of maps highlighting the countries of the world with GDPs lower than the annual revenues of some of the largest companies in the world - the top 10 Fortune Global 500 companies, to be exact.

Another motivation came from reading the UN Special Rapporteur Philip Alston’s report on extreme poverty and human rights earlier this week. Two key points stood out for me, quoted below.

“Poverty is a political choice”

“By single-mindedly focusing on the World Bank’s flawed international poverty line, the international community mistakenly gauges progress in eliminating poverty by reference to a standard of miserable subsistence rather than an even minimally adequate standard of living. This in turn facilitates greatly exaggerated claims about the impending eradication of extreme poverty and downplays the parlous state of impoverishment in which billions of people still subsist.”

The report makes for very grim reading, especially given the current and potential future impacts of the ongoing COVID-19 pandemic - not to mention the kinds of leadership we have in most of the major economies at this moment. While much of what Prof Alston says in his report isn’t new, particularly the criticism of the World Bank’s measures of global poverty and of the failures of other large global bodies, including UN agencies, to tackle global poverty over the years, it is still good to see a report at the highest level highlighting these issues.

Coming back to the maps I created: they not only show how large some of these companies are purely in terms of economic power, but I think they also shed light on global inequalities.

Here is the first one highlighting all the countries in the world that had lower GDP in 2019 than Walmart’s revenue for the year. The American retail giant is number one in the global list of companies with 2019 revenue of US$514 billion.

The fifth-placed State Grid of China (US$387 billion revenue) is next; the map below highlights all the countries with GDPs lower than this company’s revenue in 2019.

Finally, the map below highlights all the countries with annual GDP lower than the 2019 revenue of Toyota Motor (US$273 billion), the company ranked 10th in the global list.

As we can see, Walmart’s revenue in 2019 was larger than the GDP of every single country in Africa and most of South America except Brazil. Fifth-placed State Grid of China also had revenue larger than the GDP of all African countries except Nigeria, and even 10th-placed Toyota Motor had revenue larger than the GDP of all African countries except Nigeria and South Africa.

I used publicly available datasets and global maps from the R package rnaturalearth to produce these maps using the ggplot2 and sf packages in R version 4.0.2. You can see/download the data and steps, including the code used to generate the maps, from my GitHub repository.

Mass conversion of SPSS files to CSV format in R

Wed, 10 Jun 2020 00:00:00 +0000

TLDR: In this rather long post, I provide a few options for mass conversion of SPSS data files to CSV, including steps to test out the functions on a simulated SPSS dataset. It assumes you already know the basics of working in the R environment, including installing packages where necessary.¹

I’ve had to work with a large number of SPSS data files in my job lately, not an ideal scenario as I primarily use R for data processing/analysis. However, if you ever use secondary data, especially in social science disciplines, you are likely to come across survey data recorded in SPSS more often than not. SPSS has certainly been a mainstay of social science research, particularly research involving surveys, for as long as I can remember - I learned to use SPSS for the first time as an undergrad, and that was over two decades ago (giving away my age here!). And it seems the software is still going strong. Digressions aside, I needed a way to easily convert all the SPSS files I had to open data formats like CSV for better archiving and sharing.

As usual, I started by searching stackoverflow for mass conversion of SPSS to CSV in R, and found answers like this and a slightly better version here. Both were useful in giving me ideas on what I wanted to do, but neither worked for me as they were. So, I decided to write my own function(s) to mass convert SPSS files to CSV in R. Below I outline three functions and highlight their pros and cons.

Function 1: Using `convert()` function from the `rio` package

RIO_SPSS2CSV <- function(filepath) {
  setwd(filepath) #this is the root dir where SPSS data files/folders are located; .csv files will be stored in the same dir
  library(rio)
  #recursive option to check all folders inside the root dir; pattern is a regular expression, so anchor it to names ending in .sav
  files <- list.files(path = filepath, pattern = '\\.sav$', recursive = TRUE)
  for (f in files) {
    #keep the same name and location, swapping only the final extension for .csv
    convert(f, paste0(tools::file_path_sans_ext(f), '.csv'))
  }
}

This is the easiest and most straightforward of the three options I outline here. The function above recursively looks for SPSS files within the specified filepath, and then uses the convert() function from the rio package to convert them to CSV files in the same location. convert() simply wraps the import() and export() functions, which makes the conversion simpler but not faster, as we see below. It is also worth noting that this method writes the values of categorical variables rather than their value labels (e.g., for a variable sex in the original SPSS data with 1=Female and 2=Male, the CSV would have 1 or 2 under sex and not Female or Male), which means you’d need an extra variable definition file for the categorical variables to fully understand the converted CSV files.
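For readers who think more easily in Python, the file-discovery and naming part of this function (not the SPSS conversion itself) can be sketched with pathlib. This is only an illustration: plan_conversions and the file names are hypothetical, and the temporary directory is just there to make the example self-contained.

```python
import tempfile
from pathlib import Path


def plan_conversions(root):
    """Map every .sav file under root (searched recursively) to its .csv target path."""
    root = Path(root)
    # rglob matches on the file extension only; with_suffix swaps just the last extension
    return {sav: sav.with_suffix(".csv") for sav in sorted(root.rglob("*.sav"))}


with tempfile.TemporaryDirectory() as tmp:
    # hypothetical survey files: one real SPSS file name and one look-alike
    (Path(tmp) / "wave1").mkdir()
    (Path(tmp) / "wave1" / "household.v2.sav").touch()
    (Path(tmp) / "notes_sav.txt").touch()
    plan = plan_conversions(tmp)
    for sav, csv in plan.items():
        print(sav.relative_to(tmp), "->", csv.relative_to(tmp))
```

The two edge cases in the example, a name with more than one dot and a name that merely contains "sav", are the ones most likely to trip up pattern matching and extension handling.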

Function 2: Using `characterize()` function together with `import()` and `export()` from the `rio` package

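RIO_SPSS2CSV_VL <- function(filepath) {
  setwd(filepath)
  library(rio)
  #pattern is a regular expression, so anchor it to names ending in .sav
  files <- list.files(path = filepath, pattern = '\\.sav$', recursive = TRUE)
  for (f in files) {
    #characterize() swaps labelled values for their value labels before export
    export(characterize(import(f)), paste0(tools::file_path_sans_ext(f), '.csv'))
  }
}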

This is just a slight (but very useful) tweak to the previous option. It still uses the rio package for the conversion, but instead of the convert() function, it uses the generic import() and export() functions together with characterize() to convert variables with defined value labels (i.e., categorical variables) to character or factor (e.g., for a variable sex in the original SPSS data with 1=Female and 2=Male, the CSV would now have Female or Male under sex and not 1 or 2). This is particularly useful as you do not need a separate document defining the value labels for categorical variables.
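Conceptually, the difference between Function 1 and Function 2 is a lookup from coded values to their labels. The plain-Python sketch below illustrates that idea only; it is not how rio implements characterize(), and apply_value_labels and the tiny dataset are made up for the example.

```python
def apply_value_labels(rows, value_labels):
    """Replace coded values with their labels for the columns listed in value_labels.

    Columns without labels, and values without a matching label, are left unchanged.
    """
    return [
        {col: value_labels.get(col, {}).get(val, val) for col, val in row.items()}
        for row in rows
    ]


# two example records using the sex coding described above (1=Female, 2=Male)
rows = [{"id": 1, "sex": 1}, {"id": 2, "sex": 2}]
labels = {"sex": {1: "Female", 2: "Male"}}

print(apply_value_labels(rows, labels))
```

Writing the labelled version to CSV is what makes the file self-explanatory without a separate codebook.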

Function 3: Using `foreign` package with `write.csv` function

FOR_SPSS2CSV <- function(filepath) {
  setwd(filepath)
  files <- list.files(path = filepath, pattern = '\\.sav$', recursive = TRUE) #pattern is a regular expression, so anchor it to names ending in .sav
  for (f in files) {
    write.csv(
      x = foreign::read.spss(file = f, to.data.frame = TRUE, use.value.labels = TRUE, use.missings = TRUE, reencode = FALSE),
      file = sprintf("%s.csv", tools::file_path_sans_ext(f)),
      row.names = FALSE, na = ""
      )
  }
}

This final option uses the foreign package, one of the packages that ships with standard R installations, so there is no need to install any extra package for this task. There are a few good points about using the foreign package. First, you can easily switch between copying values and value labels using the use.value.labels option (see the function above). Second, it gives you the option to mark user-defined missing values from SPSS as missing in the converted file by setting use.missings to TRUE. Finally, foreign provides warnings for unexpected values in the original SPSS files, for example when a certain value in a categorical variable is undefined, allowing users to take action on unexpected cases.

Testing the functions with simulated SPSS data

This section is really only possible because I found this excellent post on simulating SPSS data by Martin Chan. The example below is a more or less literal copy from his post linked above - I’ve only tweaked the variables and data types to make them more relatable to the type of data with which I normally work. Start by loading the necessary packages - tidyverse, surveytoolbox and haven - and creating a directory to save the simulated SPSS data file.

library(tidyverse)
library(surveytoolbox) #if you don't have this, you'll need to install it from the source with devtools::install_github("martinctc/surveytoolbox") 
library(haven)
#create a directory to save simulated SPSS data. this will also be the base directory/filepath to test conversion functions above
dir.create("sav")

I want to simulate more-or-less typical rural household survey data where the majority of household heads are male. So, I’m going to create a dataset with 200 observations and a high proportion of male respondents in the sample - the dataset will have the variables sex (sex of HH head), education (highest education attainment of the HH head), and place_attach (place attachment). In addition, I’ll make the education variable dependent on sex (with higher educational attainment skewed towards male HH heads), and the place attachment variable dependent on highest educational attainment (those with higher education more likely to have lower place attachment).

Let’s begin by creating the id and sex variables.

set.seed(97) #this is to ensure reproducibility of this example but not necessary if simply testing SPSS to CSV conversion functions with your own data.

#id variable
v_id <- seq(1, 200) %>% set_varl("Household Identifier")

#sex variable
v_sex <- sample(x = 1:2,
                size = 200, replace = TRUE,
                prob = c(.25 , .75)) %>%  #skewed probability to reflect more male HH heads
  set_vall(value_labels = c("Female" = 1,
                            "Male" = 2)) %>% 
  set_varl("HH Head's Sex")

Then create education variable that depends on sex variable above.

#Highest education attainment variable - sex-dependent sampling
v_edu <-
  v_sex %>%
  map_dbl(function(x){
    if(x == 1){
      sample(0:6,
             size = 1,
             prob = c(25, 15, 20, 20, 15, 5, 5)) #these weights sum to 105, not 100; sample() rescales them, so the code still runs
    } else {
      sample(0:6,
             size = 1,
             prob = c(10, 10, 20, 15, 25, 10, 10)) #Sum to 100
    }
  }) %>%
  set_vall(value_labels = c("Illiterate" = 0,
                            "Literate - no formal education" = 1,
                            "Primary school" = 2,
                            "Lower secondary school" = 3,
                            "Secondary school" = 4,
                            "College/Technical college" = 5,
                            "University degree" = 6)) %>%
  set_varl("Highest education level")

Finally create variable for place attachment which depends on education variable above.

#Place attachment variable - education-dependent sampling
v_place <- 
  v_edu %>% 
  map_dbl(function(x){
    if(x>=4){
      sample(1:5,
             size = 1,
             prob = c(25, 25, 20, 20, 10)) #Sum to 100
    } else {
      sample(1:5,
             size = 1,
             prob = c(5, 10, 20, 30, 35)) #Sum to 100
    }
  }) %>% 
  set_vall(value_labels = c("Not attached at all" = 1,
                            "Not very attached" = 2,
                            "Neutral" = 3,
                            "Attached" = 4,
                            "Very attached" = 5)) %>% 
  set_varl("Place attachment")
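The conditional sampling used for the education and place attachment variables can be tried out on its own. The following is a minimal base R sketch, not the code used for this post: it swaps map_dbl() for vapply(), leaves out the labels, and the names sex and edu are purely illustrative. It also relies on the fact that sample() rescales the prob weights, so they do not strictly have to sum to 100.

```r
set.seed(97)

# Illustrative stand-in for the sex variable: 1 = Female, 2 = Male
sex <- sample(1:2, size = 200, replace = TRUE, prob = c(0.25, 0.75))

# Draw one education level (0 to 6) per household,
# using a different set of weights depending on sex
edu <- vapply(sex, function(s) {
  w <- if (s == 1) {
    c(25, 15, 20, 20, 15, 5, 5)
  } else {
    c(10, 10, 20, 15, 25, 10, 10)
  }
  sample(0:6, size = 1, prob = w)
}, integer(1))

# Row-wise proportions: education levels within each sex
round(prop.table(table(sex, edu), margin = 1), 2)
```

With only 200 draws, the proportions in each row will be close to the weights, but will not match them exactly.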

You can now combine the individual vectors into one dataset and save it².

combined_df <-
  tibble(id = v_id,
         sex = v_sex,
         education = v_edu,
         place_attach = v_place)

Save the combined data to the new directory created at the beginning. Also create a couple more SPSS files by subsetting the main simulated data, so that there is more than one file to test the conversion functions on.

#save simulated data in SPSS format
combined_df %>% haven::write_sav("sav/Simulated_Dataset.sav")
#create more SPSS files from the same dataframe to test file conversion functions
combined_df %>% filter(sex==1) %>% haven::write_sav("sav/female_only.sav")
combined_df %>% filter(sex==2) %>% haven::write_sav("sav/male_only.sav")

Assuming the functions above are already loaded in your R environment, you simply call each function with the path to the sav directory (created at the beginning of the simulated data section) in place of filepath, as follows:

#using Function 1
RIO_SPSS2CSV("Drive://path/to/sav") #make sure you provide full file path as in the example, NOT relative path

#using Function 2
RIO_SPSS2CSV_VL("Drive://path/to/sav") #make sure you provide full file path as in the example, NOT relative path

#using Function 3
FOR_SPSS2CSV("Drive://path/to/sav") #make sure you provide full file path as in the example, NOT relative path
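If you only have a relative path to hand, base R can expand it to a full path before you pass it to any of the functions. A small sketch, assuming the sav folder sits in your current working directory (sav_dir is just an illustrative name):

```r
# Create the folder if it is missing (does nothing if it already exists)
dir.create("sav", showWarnings = FALSE)

# Expand the relative path to a full path;
# winslash = "/" keeps forward slashes on Windows
sav_dir <- normalizePath("sav", winslash = "/", mustWork = TRUE)
sav_dir

# Then pass it to one of the conversion functions, for example:
# RIO_SPSS2CSV_VL(sav_dir)

# List the CSV files present in the folder after conversion
list.files(sav_dir, pattern = "\\.csv$")
```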

Each time you run one of the above functions, the SPSS files in your sav folder are converted to CSV files with corresponding names, as shown in the screenshot of my sav directory below:

Processing time and choice of option

I used the tictoc package to get processing times for each of the functions. For the simulated data above, my processing times were 0.12s, 0.08s and 0.05s for Functions 1, 2 and 3 respectively. These functions were tested using R version 4.0.1 in the RStudio environment, on an Intel i7-6700 (3.4 GHz) with 32GB RAM and an SSD, running Windows 10. I also tested the functions on an actual SPSS dataset: data from a very large household survey spread over multiple folders and files, with each file containing 160 to over 1000 observations (rows) and five to over 50 variables (columns). Altogether, 202 SPSS files were processed in 4 folders, with the directory structure as follows:

basedir
+--subfolder1
    +--subsubfolder1.1 (49 SPSS files, 3.47MB)
    +--subsubfolder1.2 (50 SPSS files, 4.36MB)
+--subfolder2
    +--subsubfolder2.1 (51 SPSS files, 5.72MB)
    +--subsubfolder2.2 (52 SPSS files, 5.46MB)

In terms of processing time (averaged over three runs for each function), Function 1 took the longest, followed by Function 3, with Function 2 being the fastest (see table below for summary).

Function	Description	Processing time (seconds)
Function 1	Function using `convert()` function from `rio` package.	71.48
Function 2	Function using `export()` function in `rio` package with `characterize()` option to write value labels for categorical variables.	52.10
Function 3	Function using `foreign` package to read SPSS files and `write.csv()` function to write CSV files.	56.86
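Timings averaged over several runs can be collected with a small wrapper. The sketch below uses base R's system.time() instead of tictoc, and both time_conversion() and dummy_convert() are hypothetical helpers made up for this illustration, not functions from this post.

```r
# Hypothetical helper: mean elapsed time (in seconds) of fun(path) over `runs` runs
time_conversion <- function(fun, path, runs = 3) {
  elapsed <- vapply(seq_len(runs), function(i) {
    system.time(fun(path))[["elapsed"]]
  }, numeric(1))
  mean(elapsed)
}

# Stand-in for a conversion function, so that the sketch runs on its own
dummy_convert <- function(path) Sys.sleep(0.05)

avg_time <- time_conversion(dummy_convert, "Drive://path/to/sav")
avg_time

# With the real functions loaded, the call would look like:
# time_conversion(RIO_SPSS2CSV_VL, "Drive://path/to/sav")
```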

So, looking at processing time alone, the obvious choice is Function 2. However, if you like the options that the foreign package provides for generating different outputs for different types of SPSS variables, you might still consider Function 3, as those options can be important, especially for data from social surveys.

To sum up, if you simply want to read SPSS files to work with them in R, a package like haven, or rio (which wraps haven among other packages), gives you better options for reading and using metadata-rich formats like SPSS. On the other hand, if you simply want to mass-convert SPSS files to CSV files, you can pick Function 2 or Function 3, depending on the kind of data you have in SPSS and the options you require in the conversion.

¹ Post updated on 16 June 2020 to include the section on simulated SPSS data.
² In order to keep this post at a manageable length, I’ve left out some of the checks and verification you can do on the simulated data in this post, which you can see in Martin’s post here.

My Writing Workflow

Sat, 22 Jul 2017 00:00:00 +0000

Following on from my previous post about my data workflow, I outline my basic writing workflow here. As mentioned in the previous post, I use Scapple as a tool to organise my thoughts, brainstorm, and plan my work, including a basic outline for my write-up (background on the figure above).¹

I start my initial drafts, particularly the methods and results sections, within RStudio, as that is where I do my data analysis and visualisation, and it is simply easier to write about methods and results while they are being worked on. However, for most of my original writing, I use Scrivener, with Bookends as my reference manager. I can easily export Markdown written within RStudio in a format like RTF for Scrivener. Scrivener is one of the very best applications available for academic writing, possibly for any kind of writing, as it allows you to organise your writing in small segments, set targets (word count) and track progress easily, as well as collate and organise research materials, such as relevant papers, snippets or any other kinds of materials. While I use my Mac for most of my writing, Scrivener is also available for Windows, and I often work on it in Windows, especially in the office where I have to use a Windows PC. I normally use Dropbox to store my Scrivener projects so I can pick up from where I left off on any of my PCs, including on my iPad with the Scrivener app. On the Mac, Bookends works very well with Scrivener as a reference manager; however, if you are working on a Windows PC, you can easily use Endnote or Mendeley for reference management, and to insert citations into your write-up (as citation codes that can later be scanned automatically by a reference manager like Endnote to create a formatted bibliography).

Once I have a full draft of the paper/report, I export it from Scrivener to a specialised word processing application. If I am working alone on the project, and do not need others to edit the text, I often work in Mellel on my Mac. Mellel is one of the best and most stable word processing applications on macOS, especially when you are writing a long text document, such as a thesis; and it works perfectly with Bookends for reference management. For example, I finalised my PhD dissertation in Mellel with Bookends for reference management. When I needed my thesis chapters to be commented on, I sent them as RTF to my supervisors so they could comment on them using MS Word. But if you do not need others to directly edit the document, Mellel can export the document as a PDF directly from the main menu.

When I am working on a co-authored paper, I move from Scrivener to MS Word once other authors need to work on the paper as well, as everybody I work with uses Word and is comfortable working in it (I don’t know any of my co-authors who use Mellel, for example). I have on occasion used Google Docs when I’ve wanted input from more than one co-author at the same time, and also to make versioning easier. However, being online-only makes Google Docs hard to use, especially when your collaborators are travelling or are in places with a poor internet connection. Hence MS Word is usually the go-to application when working on a co-authored paper. In terms of reference managers, there are a number of options that work well with Word. Bookends works with Word on a Mac, but as it is not available on Windows, I use either Endnote or Mendeley for organising references and citations when working on co-authored papers.

Once the papers are finalised on Word (or Mellel), I convert them to PDF for submission to journals or for wider circulation if they are research reports or working papers.

¹ A note on the use of proprietary/for-cost applications, and availability of free/open-source alternatives. While I do like to use free and open-source applications as much as possible, they also have to have the necessary features that you are after. Sometimes, you just want an application that works out-of-the-box without having to do much tweaking. For these reasons, I do have quite a few for-cost applications in my workflow; however, some of these applications do have potential free/open-source alternatives. While there are some free/open-source mind-mapping tools, I haven’t found one that is as easy to use and flexible as Scapple and that works seamlessly on both Mac and Windows. RStudio comes in a free, open-source edition. For the main writing environment, again I don’t know of any free/open-source alternative to Scrivener with a similar set of features. Scapple and Scrivener both have a slightly cheaper education licence. For reference management, Mendeley is free but requires an online account (also free), and there are other similar free alternatives like Endnote basic or Zotero. For final writing, the free and open-source LibreOffice is more or less a complete replacement for the MS Office suite, and its Writer can be used instead of MS Word.

My Data Workflow

Fri, 21 Jul 2017 00:00:00 +0000

First of all, I use Scapple as a tool to organise my thoughts, brainstorm, and plan my work (background on the figure above).

Most of the data I work with comes from structured surveys. The original raw data is usually entered and cleaned in Excel, primarily because virtually everybody knows how to work with Excel. Once the data is cleaned and ready to be analysed, I export it to .csv format. If the data is also going to be deposited in public data archives, .csv is one of the most commonly accepted formats. I then import the data into R for analysis and visualisation. I use RStudio as the main work environment, for data organising and manipulation (using packages like dplyr and reshape2), for analysis and visualisation (packages like gmnl and ggplot2 - also see my other post about my favourite visualisation packages), and also for initial drafts of my reports/papers (using rmarkdown and knitr).

My favourite visualisation packages in R

Tue, 04 Jul 2017 00:00:00 +0000

Over the past two years I’ve used R within the RStudio environment as my only data analysis/visualisation application for my research. For the most part I’m a self-taught R/RStudio user, and I’m quite pleased with how far I’ve come: I can do pretty much everything I need in terms of data analysis and visualisation, and a significant part of the writing up using RMarkdown in RStudio. In terms of data visualisation in R, I guess ggplot2 is what everybody turns to first, and I’m no exception. I love ggplot and the flexibility it allows in terms of creating figures. However, there are some other packages which let you create interesting plots, either for exploratory analysis or from regression outputs. I briefly discuss two such packages that have become my favourites over the last couple of years.

beeswarm

I’ve become a huge fan of beeswarm plots ever since I discovered this package while looking for ways to plot individual overlapping points on a two-dimensional plot. Not only does this package allow us to plot individual data points that would otherwise overlap, it also allows us to save the beeswarm plot data as a data table, which can then be plotted using ggplot with additional dimensions as necessary. This is exactly what I did for the figure below that formed part of this journal paper published in 2016. In addition to the location of data points in the beeswarm hex arrangement, we changed the colour as well as the size of the data points based on additional information in two other variables. The resulting plot is a simple representation of the distance of respondents’ dwellings from the park boundary, while also providing much richer information without making it too complicated or confusing to look at.
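As a rough sketch of that approach, and not the code behind the published figure, the example below computes a beeswarm layout without drawing it and then hands the resulting data frame to ggplot2. The data and the names df, zone and dist are made up for illustration; it assumes that the beeswarm and ggplot2 packages are installed, and the "hex" method is used only to echo the arrangement mentioned above.

```r
library(beeswarm)
library(ggplot2)

set.seed(1)

# Made-up data: distance of dwellings from a boundary, in two zones
df <- data.frame(
  zone = rep(c("A", "B"), each = 30),
  dist = c(rnorm(30, mean = 2, sd = 0.5), rnorm(30, mean = 3, sd = 0.8))
)

# Compute the swarm positions only; do.plot = FALSE suppresses the base plot.
# The returned data frame holds the x/y position of every point.
bs <- beeswarm(dist ~ zone, data = df, method = "hex", do.plot = FALSE)

# Re-plot the positions with ggplot2; x.orig holds the original group
p <- ggplot(bs, aes(x = x, y = y, colour = x.orig)) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = 1:2, labels = c("Zone A", "Zone B")) +
  labs(x = NULL, y = "Distance", colour = "Zone")
p
```

Further variables can then be mapped to size or colour in the usual ggplot2 way, once they have been matched to the rows of the layout data.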

sjPlot

I first came across the sjPlot package while trying to find a way to create nicely formatted tables for regression outputs in R. However, over time I’ve used this package more to visualise results from the different types of statistical analyses that I carry out in R for my socio-economic research, which I guess is not surprising given the package description, which I quote below:

“Results of various statistical analyses (that are commonly used in social sciences) can be visualized using this package, including simple and cross tabulated frequencies, histograms, box plots, (generalized) linear models, mixed effects models, PCA and correlation matrices, cluster analyses, scatter plots, Likert scales, effects plots of interaction terms in regression models, constructing index or score variables and much more.”

Among several other types of plots, I used this package to create the odds-ratio plot shown below, which featured in our journal paper published in 2016.
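For readers unfamiliar with this kind of plot, the sketch below shows the numbers behind it: odds ratios and Wald confidence intervals from a logistic regression, computed in base R. The model on the built-in mtcars data is purely illustrative and has nothing to do with the paper. If sjPlot is installed, plot_model() draws these estimates as a plot.

```r
# Illustrative logistic regression on built-in data
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Odds ratios with Wald 95% confidence intervals:
# exponentiate the coefficients and their interval bounds
odds_ratios <- exp(cbind(OR = coef(fit), confint.default(fit)))
odds_ratios

# Draw the plot only if sjPlot is available
if (requireNamespace("sjPlot", quietly = TRUE)) {
  print(sjPlot::plot_model(fit))
}
```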

I often use the sjmisc package together with sjPlot, especially to create nice variable labels to use in the plots or tables. Another reason why sjPlot is among my favourite packages is its active development, along with a very useful set of blog posts with examples and the prompt responses to comments on these posts whenever I’ve had any queries regarding the package.