Inferring Missing Data

Today we’re going to do some basic regression to fill in some missing data. The following is data from the World Bank showing the percentage of households that have access to some kind of telephone, whether a land line or a cell phone.

library(dplyr)
library(tibble)
library(knitr)

telephone_data <- read.csv("percentage_household_telephone.csv")

#getting rid of redundant columns,
#mostly containing information about how/what the data is measuring,
#which does not change from source to source
telephone_data[, c(1:2, 5:18)] <- NULL
telephone_data[, 1:13] %>% head(n = 20) %>% kable()
| REF_AREA | REF_AREA_LABEL | X2000 | X2001 | X2002 | X2003 | X2004 | X2005 | X2006 | X2007 | X2008 | X2009 | X2010 |
|----------|----------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| ABW | Aruba | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 55.7827 |
| ALB | Albania | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| AND | Andorra | 36.1888 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| ARE | United Arab Emirates | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| ARG | Argentina | NA | 17.6571 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| ARM | Armenia | NA | NA | NA | NA | NA | NA | NA | NA | 53.0371 | 52.3044 | 64.6410 |
| AZE | Azerbaijan | NA | NA | NA | NA | NA | NA | NA | NA | 44.3644 | 44.3109 | 51.3909 |
| BDI | Burundi | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BEL | Belgium | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BFA | Burkina Faso | NA | NA | NA | NA | NA | NA | NA | 2.29767 | NA | NA | NA |
| BGD | Bangladesh | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BHR | Bahrain | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BIH | Bosnia and Herzegovina | NA | NA | NA | NA | NA | NA | NA | 49.97480 | NA | NA | NA |
| BLR | Belarus | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BLZ | Belize | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BMU | Bermuda | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BRA | Brazil | NA | NA | NA | NA | NA | NA | NA | NA | NA | 34.6078 | 34.6773 |
| BRN | Brunei Darussalam | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| CAN | Canada | NA | NA | NA | NA | NA | NA | NA | 64.59160 | 65.7311 | 67.1909 | NA |
| CHL | Chile | NA | NA | 28.3403 | 25.5887 | NA | NA | 28.9019 | NA | NA | 40.4815 | NA |

Keep in mind I’m only showing a portion of the data here. As you can see, it is mostly missing. But we can infer some of the missing data by doing some regression analysis.

Let’s look at the instances where the data does show values in consecutive years, using a function that collects those pairs into a data frame and graphs them.

cons_relate <- function(step_out = 1, data = telephone_data, return_known = FALSE) {
  x <- c()
  y <- c()
  for (i in 1:nrow(data)) {
    for (j in 3:ncol(data)) {
      if (j + step_out > ncol(data)) break
      #keep the pair only when both the current year and the year
      #step_out years later are present
      if (!is.na(data[i, j]) &&
          !is.na(data[i, j + step_out])) {
        x <- c(x, data[i, j])
        y <- c(y, data[i, j + step_out])
      }
    }
  }
  known <- tibble(
    x = x,
    y = y
  )
  
  plot(known, asp = 1)
  
  if (return_known) {
    return(known)
  }
}

known <- cons_relate(return_known = TRUE)

What you’re looking at is a graph comparing every known value against the known value from the following year. (As determined by the step_out parameter.) You’ll notice an almost one-to-one correspondence, meaning the previous year’s value is a pretty good predictor of the next year’s.
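To make the pairing logic concrete, here is a small self-contained sketch of what the inner loop collects, using a made-up data frame shaped like telephone_data (the country codes and values here are invented for illustration):

```r
# toy frame shaped like telephone_data: two label columns, then year columns
toy <- data.frame(
  REF_AREA = c("AAA", "BBB"),
  REF_AREA_LABEL = c("Country A", "Country B"),
  X2000 = c(10, NA),
  X2001 = c(12, 20),
  X2002 = c(NA, 22)
)

step_out <- 1
x <- c()
y <- c()
for (i in 1:nrow(toy)) {
  for (j in 3:(ncol(toy) - step_out)) {
    # keep the pair only when both years are present
    if (!is.na(toy[i, j]) && !is.na(toy[i, j + step_out])) {
      x <- c(x, toy[i, j])
      y <- c(y, toy[i, j + step_out])
    }
  }
}
# collected pairs: (10, 12) from Country A and (20, 22) from Country B
```

Any (current year, later year) pair with an NA on either end simply contributes nothing, which is why the scatter plot only shows the countries with runs of consecutive observations.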

The same thing holds true if we look at data with more than a year of separation.

cons_relate(step_out = 2)

cons_relate(step_out = 3)

cons_relate(step_out = 4)

cons_relate(step_out = 5)

Predictably, the predictive power diminishes the further apart the years are, but the same general pattern holds. A pattern that we can model.

model <- lm(y ~ x, data = known)
model %>% summary()
## 
## Call:
## lm(formula = y ~ x, data = known)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.617  -1.522   0.334   1.671  25.724 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.61667    0.54984  -1.122    0.263    
## x            0.98403    0.01148  85.754   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.249 on 301 degrees of freedom
## Multiple R-squared:  0.9607, Adjusted R-squared:  0.9605 
## F-statistic:  7354 on 1 and 301 DF,  p-value: < 2.2e-16
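To make the fitted relation concrete, here is a quick check using the coefficients reported above (the 50% starting value is just an arbitrary example): a country at 50% access this year is predicted at roughly 48.6% the next.

```r
# coefficients from the summary above
intercept <- -0.61667
slope <- 0.98403

# predicted next-year value for a hypothetical country at 50% access
intercept + slope * 50
#> 48.58483
```

The slope just below 1 plus the slightly negative intercept is what produces the gentle downward drift in everything that follows.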

We can use this to fill in the missing data and get a “reasonable” estimate of where each country stands in the most recent years.

telephone_filled <- telephone_data

for (i in 1:nrow(telephone_filled)) {
  for (j in 4:ncol(telephone_filled)) {
    #fill a missing year from the previous year's (possibly already filled) value
    if (is.na(telephone_filled[i, j]) &&
        !is.na(telephone_filled[i, j - 1])) {
      #the slightly negative intercept can push low values below zero,
      #so clamp the prediction at 0
      telephone_filled[i, j] <- predict(model, tibble(x = telephone_filled[i, j - 1])) %>% max(0)
    }
  }
}

telephone_filled[, 17:length(telephone_filled)] %>% head(n = 10) %>% kable()
| X2014 | X2015 | X2016 | X2017 | X2018 | X2019 | X2020 | X2021 | X2022 | X2023 | X2024 |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| 49.894894 | 48.481248 | 47.09018 | 45.721337 | 44.374356 | 43.048890 | 41.744596 | 40.461135 | 39.198175 | 37.9553890 | 36.7324538 |
| NA | 21.592300 | 20.04810 | 19.111199 | 13.955100 | 13.115523 | 12.289357 | 11.476387 | 10.676402 | 9.8891962 | 9.1145641 |
| 21.093105 | 20.139512 | 19.20115 | 18.277778 | 17.369155 | 16.475045 | 15.595217 | 14.729442 | 13.877496 | 13.0391589 | 12.2142123 |
| 59.000000 | 57.440918 | 99.50000 | 97.294009 | 83.000000 | 84.660000 | 98.680000 | 98.680000 | 99.000000 | 99.7000000 | 100.0000000 |
| 7.030547 | 6.301576 | 5.58425 | 4.878381 | 4.183787 | 3.500288 | 2.827706 | 2.165868 | 1.514601 | 0.8737368 | 0.2431093 |
| 61.929330 | 60.323458 | 60.80730 | 59.219350 | 57.656764 | 30.454100 | 23.740000 | 18.587000 | 20.485700 | 18.1620000 | 17.2552259 |
| 45.777044 | 44.429173 | 43.10283 | 61.600000 | 61.110000 | 60.748500 | 60.944700 | 53.321400 | 53.026200 | 51.8302000 | 50.3856415 |
| 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0.0000000 |
| NA | NA | NA | NA | 61.111600 | 59.518789 | 57.951420 | 56.409087 | 54.891390 | 53.3979344 | 51.9283343 |
| 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0.0000000 |
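It’s worth seeing what repeated one-step prediction does over a long stretch of missing years. A minimal sketch with the fitted coefficients (the starting value of 50 is arbitrary): each iteration applies the same update as the fill loop, clamped at zero, and the value steadily decays.

```r
# coefficients from the fitted one-year model
intercept <- -0.61667
slope <- 0.98403

x <- 50  # arbitrary starting value
for (k in 1:10) {
  # same update the fill loop applies, clamped at zero
  x <- max(intercept + slope * x, 0)
}
x  # roughly 36.8 after ten imputed years
```

Because the slope is below 1 and the intercept is negative, chaining predictions always drifts downward, which explains the smooth declining runs visible in the filled table above.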

Thoughts and concerns

To start with, there are several problems with what we did here. The most glaring, for me, is that the slight drop predicted by the model could very easily be a result of noise, which means extrapolating outwards gets less reliable the further out you go.

Let’s take a look at the 5 year prediction again.

yr_5_known <- cons_relate(step_out = 5, return_known = TRUE)

yr_5_model <- lm(y ~ x, data = yr_5_known)

yr_5_model %>% summary()
## 
## Call:
## lm(formula = y ~ x, data = yr_5_known)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.652  -6.551  -0.218   6.076  54.844 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.86398    1.68511  -0.513    0.609    
## x            0.87369    0.03417  25.572   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.67 on 205 degrees of freedom
## Multiple R-squared:  0.7613, Adjusted R-squared:  0.7602 
## F-statistic: 653.9 on 1 and 205 DF,  p-value: < 2.2e-16

The important number is the slope of the regression line, estimated here as 0.87369. Earlier, the slope for the one year regression was 0.98403.

Let’s take the one year slope and extrapolate outwards, multiplying its implied annual drop by 5 to get a predicted 5 year change.

\[ 1-0.98403=0.01597 \\ 0.01597 \times 5 = 0.07985 \\ 1 - 0.87369 = 0.12631 \\ 0.12631 > 0.07985 \] The 5 year difference is larger even after we incorporate the standard errors from the two fits.

\[ 0.12631 - 0.03417 = 0.09214 \\ 0.07985 + 0.01148 = 0.09133 \]
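As a sanity check on the linear approximation (multiplying the one year drop by 5 rather than compounding it), compounding the one year slope gives nearly the same figure, so the comparison isn’t sensitive to that simplification:

\[ 0.98403^5 \approx 0.92266 \\ 1 - 0.92266 = 0.07734 \approx 0.07985 \]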

This suggests that the drop in telephone access predicted by the one year model is actually an underestimate, and that access to telephone lines is genuinely decreasing over time. I’ll be honest, I did not expect to see that in the data.

Keep in mind, this data is agnostic as to whether the access comes from home lines or cell phones, which would presumably make it an overestimate of individual access: a household only needs one cell phone to be counted. If a household of three people loses its land line and gains a single cell phone, the number of people with ready access to a phone goes down while the data still counts that household as having one. Either way, the trend goes against the conventional wisdom that access to technology is growing. If we assume the predictions made by this analysis hold up (because, keep in mind, there still could be some shenanigans with how the data is collected), then I can think of a couple of explanations as to why this is happening:

  1. This could be a result of higher internet access and the advent of chat apps replacing phone usage. I myself have found that I’m using my cell phone less and less as a phone, and more and more as a conduit to the internet; if I want to text someone, I can use WhatsApp or Snapchat or Instagram or any number of applications designed for chatting with friends. It’s reasonable to think people would opt out of having a phone line if they get all of those things from the internet. You could try to confirm or deny this by looking at things such as internet access, internet speeds, the number of people with a standalone data plan on their phone, as well as cell phone ownership. Perhaps in a future post I’ll explore those avenues of data analysis, but for now we’ll leave the question open.
  2. In the event that this is not explained by replacements for phone lines, then people are simply losing their access to technology and becoming less connected. The implications, in my view, are an increase in worldwide corruption and inequality. If you take a look at the Corruption Perception Index, you’ll see a clear correlation between how corrupt a country is and the well-being of its average citizen. If the median person is losing phone access without anything to replace it, then the median person is, in all likelihood, becoming poorer, which doesn’t square well with our general understanding of how society has trended over the last 20 years or so.

Other than that, the analysis technique is incomplete. There are ways we could expand on the regression done here by incorporating all the years into one model: perhaps treating the year as a factor, using future years as predictors as well, and folding in other information we have about the countries in the data set. In that case you do run into concerns about overfitting and (as today) over-extrapolating from granular trends. Even for the analysis we did today there are various assumptions one ought to check, which I skipped in a vain attempt at brevity. All of that seems like the topic of a future blog post, if I get to it.
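As a rough sketch of what that combined model could look like (hypothetical: the reshaping uses tidyr’s pivot_longer on simulated data shaped like telephone_data, and this is only one plausible formulation, not the analysis itself):

```r
library(dplyr)
library(tidyr)

# simulated stand-in for telephone_data: one row per country, year columns
set.seed(42)
toy <- tibble(
  REF_AREA = c("AAA", "BBB", "CCC"),
  X2000 = runif(3, 20, 80),
  X2001 = runif(3, 20, 80),
  X2002 = runif(3, 20, 80),
  X2003 = runif(3, 20, 80)
)

# reshape to long format and pair each year with the following year's value
long <- toy %>%
  pivot_longer(starts_with("X"), names_to = "year", values_to = "access") %>%
  mutate(year = as.integer(sub("X", "", year))) %>%
  group_by(REF_AREA) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(next_access = lead(access)) %>%
  ungroup()

# one model across all years, with a year effect alongside the lagged value
combined_model <- lm(next_access ~ access + factor(year), data = long)
```

This folds every consecutive-year pair into a single fit and lets each year carry its own intercept; country-level predictors could be joined onto `long` in the same way.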