Today we’re going to do some basic regression to fill in some missing data. The following is data from the World Bank showing the percentage of households that have access to some kind of telephone, whether a land line or a cell phone.
#dplyr provides the pipe and tibble(); knitr provides kable()
library(dplyr)
library(knitr)

telephone_data <- read.csv("percentage_household_telephone.csv")

#getting rid of redundant columns,
#mostly containing information on how/what the data is measuring,
#which does not change from source to source
telephone_data[, c(1, 2, 5:18)] <- NULL
telephone_data[, 1:13] %>% head(n = 20) %>% kable()
| REF_AREA | REF_AREA_LABEL | X2000 | X2001 | X2002 | X2003 | X2004 | X2005 | X2006 | X2007 | X2008 | X2009 | X2010 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ABW | Aruba | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 55.7827 |
| ALB | Albania | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| AND | Andorra | 36.1888 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| ARE | United Arab Emirates | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| ARG | Argentina | NA | 17.6571 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| ARM | Armenia | NA | NA | NA | NA | NA | NA | NA | NA | 53.0371 | 52.3044 | 64.6410 |
| AZE | Azerbaijan | NA | NA | NA | NA | NA | NA | NA | NA | 44.3644 | 44.3109 | 51.3909 |
| BDI | Burundi | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BEL | Belgium | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BFA | Burkina Faso | NA | NA | NA | NA | NA | NA | NA | 2.29767 | NA | NA | NA |
| BGD | Bangladesh | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BHR | Bahrain | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BIH | Bosnia and Herzegovina | NA | NA | NA | NA | NA | NA | NA | 49.97480 | NA | NA | NA |
| BLR | Belarus | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BLZ | Belize | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BMU | Bermuda | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| BRA | Brazil | NA | NA | NA | NA | NA | NA | NA | NA | NA | 34.6078 | 34.6773 |
| BRN | Brunei Darussalam | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| CAN | Canada | NA | NA | NA | NA | NA | NA | NA | 64.59160 | 65.7311 | 67.1909 | NA |
| CHL | Chile | NA | NA | 28.3403 | 25.5887 | NA | NA | 28.9019 | NA | NA | 40.4815 | NA |
Keep in mind that I’m only showing a portion of the data here. As you can see, it is mostly missing, but we can infer some of the missing values with a bit of regression analysis.
Let’s look at the instances where the data does have points in consecutive years. The function below collects every such pair into a data frame and graphs it.
cons_relate <- function(step_out = 1, data = telephone_data, return_known = FALSE) {
  x <- c()
  y <- c()
  #collect every pair of observations that are step_out years apart
  for(i in 1:nrow(data)) {
    for(j in 3:ncol(data)) {
      if(j + step_out > ncol(data)) break
      if(!is.na(data[i, j]) &&
         !is.na(data[i, j + step_out])) {
        x <- c(x, data[i, j])
        y <- c(y, data[i, j + step_out])
      }
    }
  }
  known <- tibble(
    x = x,
    y = y
  )
  plot(known, asp = 1)
  if(return_known) {
    return(known)
  }
}
known <- cons_relate(return_known = TRUE)
What you’re looking at is a graph comparing each known value with the known value step_out years later (one year, by default). You’ll notice an almost one-to-one correspondence, meaning the previous year’s value is a pretty good predictor of the next year’s.
The same thing holds true if we look at data with more than a year of separation.
cons_relate(step_out = 2)
cons_relate(step_out = 3)
cons_relate(step_out = 4)
cons_relate(step_out = 5)
Predictably, the predictive power diminishes as the gap widens, but the same general pattern holds. A pattern that we can model.
model <- lm(y ~ x, data = known)
model %>% summary()
##
## Call:
## lm(formula = y ~ x, data = known)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.617 -1.522 0.334 1.671 25.724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.61667 0.54984 -1.122 0.263
## x 0.98403 0.01148 85.754 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.249 on 301 degrees of freedom
## Multiple R-squared: 0.9607, Adjusted R-squared: 0.9605
## F-statistic: 7354 on 1 and 301 DF, p-value: < 2.2e-16
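As a quick sanity check, we can read a prediction straight off the fitted line. The intercept and slope below are copied from the (rounded) summary output above, so the result is approximate:

```r
#coefficients copied from the model summary above (rounded values)
intercept <- -0.61667
slope <- 0.98403

#a country at 50% access this year is predicted just under 50% next year
pred_50 <- intercept + slope * 50
pred_50
```

In practice we’ll let `predict()` do this for us with the unrounded coefficients, as in the fill loop below.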
We can use this model to fill in the missing data and get a “reasonable” estimate for each country through the most recent year.
telephone_filled <- telephone_data
#walk left to right, filling each missing year from the year before it
#(filled values then feed the next year's prediction)
for(i in 1:nrow(telephone_filled)) {
  for(j in 4:ncol(telephone_filled)) {
    if(is.na(telephone_filled[i, j]) &&
       !is.na(telephone_filled[i, j - 1])) {
      #the model's negative intercept can push low values below zero,
      #so clamp predictions at 0
      telephone_filled[i, j] <- predict(model, tibble(x = telephone_filled[i, j - 1])) %>% max(0)
    }
  }
}
telephone_filled[, 17:length(telephone_filled)] %>% head(n = 10) %>% kable()
| X2014 | X2015 | X2016 | X2017 | X2018 | X2019 | X2020 | X2021 | X2022 | X2023 | X2024 |
|---|---|---|---|---|---|---|---|---|---|---|
| 49.894894 | 48.481248 | 47.09018 | 45.721337 | 44.374356 | 43.048890 | 41.744596 | 40.461135 | 39.198175 | 37.9553890 | 36.7324538 |
| NA | 21.592300 | 20.04810 | 19.111199 | 13.955100 | 13.115523 | 12.289357 | 11.476387 | 10.676402 | 9.8891962 | 9.1145641 |
| 21.093105 | 20.139512 | 19.20115 | 18.277778 | 17.369155 | 16.475045 | 15.595217 | 14.729442 | 13.877496 | 13.0391589 | 12.2142123 |
| 59.000000 | 57.440918 | 99.50000 | 97.294009 | 83.000000 | 84.660000 | 98.680000 | 98.680000 | 99.000000 | 99.7000000 | 100.0000000 |
| 7.030547 | 6.301576 | 5.58425 | 4.878381 | 4.183787 | 3.500288 | 2.827706 | 2.165868 | 1.514601 | 0.8737368 | 0.2431093 |
| 61.929330 | 60.323458 | 60.80730 | 59.219350 | 57.656764 | 30.454100 | 23.740000 | 18.587000 | 20.485700 | 18.1620000 | 17.2552259 |
| 45.777044 | 44.429173 | 43.10283 | 61.600000 | 61.110000 | 60.748500 | 60.944700 | 53.321400 | 53.026200 | 51.8302000 | 50.3856415 |
| 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0.0000000 |
| NA | NA | NA | NA | 61.111600 | 59.518789 | 57.951420 | 56.409087 | 54.891390 | 53.3979344 | 51.9283343 |
| 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0.0000000 |
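Because each filled year feeds the next prediction, the fitted line gets applied over and over. A small sketch (using the rounded coefficients from the one-year model and a hypothetical starting value of 50%) shows how that compounds into the steady downward drift visible in the filled columns above:

```r
a <- 0.98403   #slope from the one-year model summary (rounded)
b <- -0.61667  #intercept from the one-year model summary (rounded)

traj <- numeric(11)
traj[1] <- 50  #hypothetical starting value: 50% access
for(k in 2:11) {
  #same clamped one-step prediction used in the fill loop
  traj[k] <- max(b + a * traj[k - 1], 0)
}
round(traj, 2)
```

Every step shaves a little off the previous value, so a long run of filled-in years inevitably trends down.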
To start with, there are several problems with what we did here. The most glaring, for me, is that the slight drop the model predicts could very easily be a result of noise, so extrapolating outwards gets less trustworthy the further out you go.
Let’s take a look at the 5 year prediction again.
yr_5_known <- cons_relate(step_out = 5, return_known = TRUE)
yr_5_model <- lm(y ~ x, data = yr_5_known)
yr_5_model %>% summary()
##
## Call:
## lm(formula = y ~ x, data = yr_5_known)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.652 -6.551 -0.218 6.076 54.844
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.86398 1.68511 -0.513 0.609
## x 0.87369 0.03417 25.572 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.67 on 205 degrees of freedom
## Multiple R-squared: 0.7613, Adjusted R-squared: 0.7602
## F-statistic: 653.9 on 1 and 205 DF, p-value: < 2.2e-16
The important number is the slope of the regression line, which now comes out to 0.87369. Earlier, the slope for the one year regression was 0.98403.
Let’s take the one year slope and extrapolate outwards by multiplying its implied yearly drop by 5 to get a predicted 5 year change.
\[ 1-0.98403=0.01597 \\ 0.01597 \times 5 = 0.07985 \\ 1 - 0.87369 = 0.12631 \\ 0.12631 > 0.07985 \] The 5 year difference is larger even after we incorporate the standard errors of the two slopes.
\[ 0.12631 - 0.03417 = 0.09214 \\ 0.07985 + 0.01148 = 0.09133 \]
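We can check that arithmetic directly, using the slope standard errors reported in the two summaries (0.01148 for the one year model, 0.03417 for the five year model):

```r
one_yr_drop  <- 1 - 0.98403   #implied yearly drop from the one-year slope
five_yr_drop <- 1 - 0.87369   #implied drop from the five-year slope

#worst case for the claim: shrink the 5-year drop by its slope SE,
#and grow the extrapolated 1-year drop by its slope SE
(five_yr_drop - 0.03417) > (5 * one_yr_drop + 0.01148)
```

The comparison still holds, though only barely, so the conclusion should be treated as tentative.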
This suggests that the drop in telephone access predicted by the one year model is actually an underestimate, and that access to telephone lines is genuinely decreasing over time. I’ll be honest, I did not expect to see that in the data. Keep in mind, this data is agnostic as to whether the access comes from a home line or a cell phone, which presumably makes it an overestimate of individual access: you only need one cell phone in a household for that household to be counted. If a household of, say, three people loses its land line but gains a single cell phone, the number of people with access to a phone goes down, yet the data still counts that household as having one. Either way, it goes against the conventional wisdom that access to technology is growing. If we assume that the predictions made by this analysis are true (because, keep in mind, there could still be some shenanigans with how the data is collected), then there are a couple of possible explanations worth exploring.
Beyond that, the analysis technique is incomplete. We could expand on the regression done here by incorporating all the years into one model: perhaps treating the year as a factor, using future years as predictors, and folding in other information we have about the countries in the data set. In that case you run into concerns about overfitting and, as today, over-extrapolating from granular trends. Even for the analysis we did today, there are various assumptions one should confirm, which I skipped in a vain attempt at brevity. All of that seems like the topic of a future blog post, if I get to it.
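For a taste of what such a pooled model might look like, here is a hypothetical sketch on synthetic long-format data (the country names, years, and values are all made up, and the real data would first need reshaping from wide to long):

```r
set.seed(42)
#synthetic stand-in for telephone_data reshaped to long format
long_data <- data.frame(
  country = rep(c("A", "B", "C"), each = 5),
  year    = factor(rep(2010:2014, times = 3)),
  value   = runif(15, min = 20, max = 80)
)

#one model pooling every year, with year as a factor plus a country effect
pooled <- lm(value ~ year + country, data = long_data)
coef(pooled)
```

With the real data you would also want to think hard about which terms are justified, since every added factor level is another chance to overfit.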