Inferring Missing Data

Today we’re going to do some basic regression to fill in some missing data. The following is data from the World Bank showing the percentage of households that have access to some kind of telephone, whether a land line or a cell phone.

library(dplyr)
library(tibble)
library(knitr)

telephone_data <- read.csv("percentage_household_telephone.csv")

#getting rid of redundant columns,
#mostly containing information about how/what the data is measuring,
#which does not change from source to source
telephone_data[, c(1:2, 5:18)] <- NULL
telephone_data[, 1:13] %>% head(n = 20) %>% kable()
| REF_AREA | REF_AREA_LABEL | X2000 | X2001 | X2002 | X2003 | X2004 | X2005 | X2006 | X2007 | X2008 | X2009 | X2010 |
|----------|----------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| ABW | Aruba | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 55.7827 |
| ALB | Albania | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| AND | Andorra | 36.1888 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| ARE | United Arab Emirates | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| ARG | Argentina | NA | 17.6571 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| ARM | Armenia | NA | NA | NA | NA | NA | NA | NA | NA | 53.0371 | 52.3044 | 64.6410 |
| AZE | Azerbaijan | NA | NA | NA | NA | NA | NA | NA | NA | 44.3644 | 44.3109 | 51.3909 |
| BDI | Burundi | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BEL | Belgium | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BFA | Burkina Faso | NA | NA | NA | NA | NA | NA | NA | 2.29767 | NA | NA | NA |
| BGD | Bangladesh | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BHR | Bahrain | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BIH | Bosnia and Herzegovina | NA | NA | NA | NA | NA | NA | NA | 49.97480 | NA | NA | NA |
| BLR | Belarus | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BLZ | Belize | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BMU | Bermuda | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BRA | Brazil | NA | NA | NA | NA | NA | NA | NA | NA | NA | 34.6078 | 34.6773 |
| BRN | Brunei Darussalam | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| CAN | Canada | NA | NA | NA | NA | NA | NA | NA | 64.59160 | 65.7311 | 67.1909 | NA |
| CHL | Chile | NA | NA | 28.3403 | 25.5887 | NA | NA | 28.9019 | NA | NA | 40.4815 | NA |

Keep in mind I’m only showing a portion of the data here. As you can see, it is mostly missing. But we can infer some of the missing data by doing some regression analysis.

Let’s look at the instances where the data does show values in consecutive years, using a function that collects those pairs into a data frame and graphs them.

cons_relate <- function(step_out = 1, data = telephone_data, return_known = FALSE) {
  x <- c()
  y <- c()
  for (i in 1:nrow(data)) {
    for (j in 3:ncol(data)) {
      if (j + step_out > ncol(data)) break
      #keep the pair only when both the current year and the year
      #step_out years later are present
      if (!is.na(data[i, j]) &&
          !is.na(data[i, j + step_out])) {
        x <- c(x, data[i, j])
        y <- c(y, data[i, j + step_out])
      }
    }
  }
  known <- tibble(
    x = x,
    y = y
  )
  
  plot(known, asp = 1)
  
  if (return_known) {
    return(known)
  }
}

known <- cons_relate(return_known = TRUE)

What you’re looking at is a graph comparing every known value against the known value from the following year. (As determined by the step_out parameter.) You’ll notice an almost one-to-one correspondence, meaning the previous year’s value is a pretty good predictor of the next year’s.
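To make the pairing logic concrete, here is a small self-contained sketch of what the inner loop collects, using a made-up data frame shaped like telephone_data (the country codes and values here are invented for illustration):

```r
# toy frame shaped like telephone_data: two label columns, then year columns
toy <- data.frame(
  REF_AREA = c("AAA", "BBB"),
  REF_AREA_LABEL = c("Country A", "Country B"),
  X2000 = c(10, NA),
  X2001 = c(12, 20),
  X2002 = c(NA, 22)
)

step_out <- 1
x <- c()
y <- c()
for (i in 1:nrow(toy)) {
  for (j in 3:(ncol(toy) - step_out)) {
    # keep the pair only when both years are present
    if (!is.na(toy[i, j]) && !is.na(toy[i, j + step_out])) {
      x <- c(x, toy[i, j])
      y <- c(y, toy[i, j + step_out])
    }
  }
}
# collected pairs: (10, 12) from Country A and (20, 22) from Country B
```

Any (current year, later year) pair with an NA on either end simply contributes nothing, which is why the scatter plot only shows the countries with runs of consecutive observations.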

The same thing holds true if we look at data with more than a year of separation.

cons_relate(step_out = 2)

cons_relate(step_out = 3)

cons_relate(step_out = 4)

cons_relate(step_out = 5)

Predictably, the predictive power diminishes the further apart the years are, but the same general pattern holds. A pattern that we can model.

model <- lm(y ~ x, data = known)
model %>% summary()
## 
## Call:
## lm(formula = y ~ x, data = known)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.617  -1.522   0.334   1.671  25.724 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.61667    0.54984  -1.122    0.263    
## x            0.98403    0.01148  85.754   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.249 on 301 degrees of freedom
## Multiple R-squared:  0.9607, Adjusted R-squared:  0.9605 
## F-statistic:  7354 on 1 and 301 DF,  p-value: < 2.2e-16
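To make the fitted relation concrete, here is a quick check using the coefficients reported above (the 50% starting value is just an arbitrary example): a country at 50% access this year is predicted at roughly 48.6% the next.

```r
# coefficients from the summary above
intercept <- -0.61667
slope <- 0.98403

# predicted next-year value for a hypothetical country at 50% access
intercept + slope * 50
#> 48.58483
```

The slope just below 1 plus the slightly negative intercept is what produces the gentle downward drift in everything that follows.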

We can use this to fill in the missing data and get a “reasonable” estimate of where each country stands in the most recent years.

telephone_filled <- telephone_data

for (i in 1:nrow(telephone_filled)) {
  for (j in 4:ncol(telephone_filled)) {
    #fill a missing year from the previous year's (possibly already filled) value
    if (is.na(telephone_filled[i, j]) &&
        !is.na(telephone_filled[i, j - 1])) {
      #the slightly negative intercept can push low values below zero,
      #so clamp the prediction at 0
      telephone_filled[i, j] <- predict(model, tibble(x = telephone_filled[i, j - 1])) %>% max(0)
    }
  }
}

telephone_filled[, 17:length(telephone_filled)] %>% head(n = 10) %>% kable()
| X2014 | X2015 | X2016 | X2017 | X2018 | X2019 | X2020 | X2021 | X2022 | X2023 | X2024 |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| 49.894894 | 48.481248 | 47.09018 | 45.721337 | 44.374356 | 43.048890 | 41.744596 | 40.461135 | 39.198175 | 37.9553890 | 36.7324538 |
| NA | 21.592300 | 20.04810 | 19.111199 | 13.955100 | 13.115523 | 12.289357 | 11.476387 | 10.676402 | 9.8891962 | 9.1145641 |
| 21.093105 | 20.139512 | 19.20115 | 18.277778 | 17.369155 | 16.475045 | 15.595217 | 14.729442 | 13.877496 | 13.0391589 | 12.2142123 |
| 59.000000 | 57.440918 | 99.50000 | 97.294009 | 83.000000 | 84.660000 | 98.680000 | 98.680000 | 99.000000 | 99.7000000 | 100.0000000 |
| 7.030547 | 6.301576 | 5.58425 | 4.878381 | 4.183787 | 3.500288 | 2.827706 | 2.165868 | 1.514601 | 0.8737368 | 0.2431093 |
| 61.929330 | 60.323458 | 60.80730 | 59.219350 | 57.656764 | 30.454100 | 23.740000 | 18.587000 | 20.485700 | 18.1620000 | 17.2552259 |
| 45.777044 | 44.429173 | 43.10283 | 61.600000 | 61.110000 | 60.748500 | 60.944700 | 53.321400 | 53.026200 | 51.8302000 | 50.3856415 |
| 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0.0000000 |
| NA | NA | NA | NA | 61.111600 | 59.518789 | 57.951420 | 56.409087 | 54.891390 | 53.3979344 | 51.9283343 |
| 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0.0000000 |
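It’s worth seeing what repeated one-step prediction does over a long stretch of missing years. A minimal sketch with the fitted coefficients (the starting value of 50 is arbitrary): each iteration applies the same update as the fill loop, clamped at zero, and the value steadily decays.

```r
# coefficients from the fitted one-year model
intercept <- -0.61667
slope <- 0.98403

x <- 50  # arbitrary starting value
for (k in 1:10) {
  # same update the fill loop applies, clamped at zero
  x <- max(intercept + slope * x, 0)
}
x  # roughly 36.8 after ten imputed years
```

Because the slope is below 1 and the intercept is negative, chaining predictions always drifts downward, which explains the smooth declining runs visible in the filled table above.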

Thoughts and concerns

To start with, there are several problems with what we did here. The most glaring, for me, is that the slight drop predicted by the model could very easily be a result of noise, which means extrapolating outwards gets less reliable the further out you go.

Let’s take a look at the 5 year prediction again.

yr_5_known <- cons_relate(step_out = 5, return_known = TRUE)

yr_5_model <- lm(y ~ x, data = yr_5_known)

yr_5_model %>% summary()
## 
## Call:
## lm(formula = y ~ x, data = yr_5_known)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.652  -6.551  -0.218   6.076  54.844 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.86398    1.68511  -0.513    0.609    
## x            0.87369    0.03417  25.572   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.67 on 205 degrees of freedom
## Multiple R-squared:  0.7613, Adjusted R-squared:  0.7602 
## F-statistic: 653.9 on 1 and 205 DF,  p-value: < 2.2e-16

The important number is the slope of the regression line, estimated here as 0.87369. Earlier, the slope for the one year regression was 0.98403.

Let’s take the one year slope and extrapolate outwards, multiplying its implied annual drop by 5 to get a predicted 5 year change.

\[ 1-0.98403=0.01597 \\ 0.01597 \times 5 = 0.07985 \\ 1 - 0.87369 = 0.12631 \\ 0.12631 > 0.07985 \] The 5 year difference is larger even after we incorporate the standard errors from the two fits.

\[ 0.12631 - 0.03417 = 0.09214 \\ 0.07985 + 0.01148 = 0.09133 \]
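As a sanity check on the linear approximation (multiplying the one year drop by 5 rather than compounding it), compounding the one year slope gives nearly the same figure, so the comparison isn’t sensitive to that simplification:

\[ 0.98403^5 \approx 0.92266 \\ 1 - 0.92266 = 0.07734 \approx 0.07985 \]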

This suggests that the drop in telephone access predicted by the one year model is actually an underestimate, and that access to telephone lines is genuinely decreasing over time. I’ll be honest, I did not expect to see that in the data.

Keep in mind, this data is agnostic as to whether the access comes from home lines or cell phones, which would presumably make it an overestimate of individual access: a household only needs one cell phone to be counted. If a household of three people loses its land line and gains a single cell phone, the number of people with ready access to a phone goes down while the data still counts that household as having one. Either way, the trend goes against the conventional wisdom that access to technology is growing. If we assume the predictions made by this analysis hold up (because, keep in mind, there still could be some shenanigans with how the data is collected), then I can think of a couple of explanations as to why this is happening:

  1. This could be a result of higher internet access and the advent of chat apps replacing phone usage. I myself have found that I’m using my cell phone less and less as a phone, and more and more as a conduit to the internet; if I want to text someone, I can use WhatsApp or Snapchat or Instagram or any number of applications designed for chatting with friends. It’s reasonable to think people would opt out of having a phone line if they get all of those things from the internet. You could try to confirm or deny this by looking at things such as internet access, internet speeds, the number of people with a standalone data plan on their phone, as well as cell phone ownership. Perhaps in a future post I’ll explore those avenues of data analysis, but for now we’ll leave the question open.
  2. In the event that this is not explained by replacements for phone lines, then people are simply losing their access to technology and becoming less connected. The implications, in my view, are an increase in worldwide corruption and inequality. If you take a look at the Corruption Perception Index, you’ll see a clear correlation between how corrupt a country is and the well-being of its average citizen. If the median person is losing phone access without anything to replace it, then the median person is, in all likelihood, becoming poorer, which doesn’t square well with our general understanding of how society has trended over the last 20 years or so.

Other than that, the analysis technique is incomplete. There are ways we could expand on the regression done here by incorporating all the years into one model: perhaps treating the year as a factor, using future years as predictors as well, and folding in other information we have about the countries in the data set. In that case you do run into concerns about overfitting and (as today) over-extrapolating from granular trends. Even for the analysis we did today there are various assumptions one ought to check, which I skipped in a vain attempt at brevity. All of that seems like the topic of a future blog post, if I get to it.
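As a rough sketch of what that combined model could look like (hypothetical: the reshaping uses tidyr’s pivot_longer on simulated data shaped like telephone_data, and this is only one plausible formulation, not the analysis itself):

```r
library(dplyr)
library(tidyr)

# simulated stand-in for telephone_data: one row per country, year columns
set.seed(42)
toy <- tibble(
  REF_AREA = c("AAA", "BBB", "CCC"),
  X2000 = runif(3, 20, 80),
  X2001 = runif(3, 20, 80),
  X2002 = runif(3, 20, 80),
  X2003 = runif(3, 20, 80)
)

# reshape to long format and pair each year with the following year's value
long <- toy %>%
  pivot_longer(starts_with("X"), names_to = "year", values_to = "access") %>%
  mutate(year = as.integer(sub("X", "", year))) %>%
  group_by(REF_AREA) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(next_access = lead(access)) %>%
  ungroup()

# one model across all years, with a year effect alongside the lagged value
combined_model <- lm(next_access ~ access + factor(year), data = long)
```

This folds every consecutive-year pair into a single fit and lets each year carry its own intercept; country-level predictors could be joined onto `long` in the same way.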