Machine Learning and Econometrics: Model Selection and Assessment Statistical Learning Style

This is the second in a series of posts where I document my own process in figuring out how machine learning relates to the classic econometrics one learns in a graduate program in economics. Note that I am a humble practitioner of applied econometrics, so the ideas I am working through are not new, or may first take on them might not even be fully correct. But, I think many of us applied economists are starting to learn and dabble in this area and I thought it might be fruitful to community if I learn publicly in these blog posts. These posts are certainly going to solidify my own understanding and serve as helpful notes for my own purposes.

I had at least three people reach out after my first machine learning post – two in academic positions saying that they are starting to use these methods in their own research and one former classmate in the corporate world who reached out to say these methods are already deeply embedded in their business. Further I saw @DrKeIthHCoble, Ashok Mishra, @shanferrell, and @SpacePlowboy just dropped an AEPP article on the topic, Big Data in Agriculture: A Challenge for the Future. I’m even more convinced now is a really great time to invest in learning this machine learning stuff. At the end I do an Ag example by fitting models of the relationship between corn prices received by farmers and stocks to use.

Flexibility versus Interpretability

I started going through An Introduction to Statistical Learning, and the first thing that struck me is that linear regression gets two whole chapters devoted to it (Chapters 3 and 6). I was a bit surprised because my naive view was that linear regression was something primitive and completely separate from the concepts of machine learning, whereas in this book, linear regression is presented as a baseline or starting place for the concepts of machine learning. This left me with a question. Why is econometrics as usually presented as something separate methodologically from machine learning?

In chapter 2, I think I got the answer to my question. The authors note that there is always a tension between flexibility and interpret ability of a model. Linear regression is quite inflexible, but it is readily interpret-able. Non-linear models can be quite complex in how an independent variable is functionally related to the dependent variable. In economics, we care about interpretability. We go through great pains to find instruments, design experiments, get random samples, in order to argue causality of one variable on another. Inference and interpretability matter a great deal to us, so inflexible models are often attractive for this purpose. Additionally, less flexible models often perform fairly well compared to more flexible models in terms of predictive accuracy, at least in out of sample testing, due to the problem of over-fitting in flexible models. So you get a lot of mileage out of an inflexible model like linear regression, and more flexible models might not perform much better on medium data anyway.

Our analysis is often naturally constrained in terms of the number of independent variables one might include. First, economic theory often suggests what things might be important independent variables, so there is less need to throw everything into a regression and see if it predicts. Or, you might be constrained by the scarcity of relevant data. If you conduct a survey, you can only ask so many questions and then you only have that many potential independent variables. To use a concept from my first machine learning post, we often have ‘medium data’ not big data.

I think there are good reasons why econometrics has gravitated to linear regression models as our bread and butter.

Econometrics:

• We care very much about inference, less emphasis on prediction $$\Rightarrow$$ Linear regression often works well
• We often have medium data $$\Rightarrow$$ Linear regression often works well.

So does that mean we should scoff at this statistical learning stuff? No I don’t think so. Reading into chapter 5 (Re-sampling Methods) and chapter 6 (Linear Model Selection and Regularization), I think there are really nice opportunities to get a lot of benefit even in a fairly inflexible model and even with medium data. Chapter 5 covers re-sampling methods including Cross-Validation and Bootstrap methods. Bootstrap is ubiquitous in econometrics, but cross-validation could be successfully utilized more, I think. Among those of us working with large ‘medium data’, like the Census or Ag Census, the methods are chapter 6 are fairly commonly employed, I think.

Statistical Learning:

• More focus on prediction $$\Rightarrow$$ Larger baseline set of models to choose from. Just so long as it predicts, who cares about causality!
• Means much more effort exerted at the model selection stage.

Chapters 2 and 5 discuss the bias-variance trade-off in model selection and the k-Fold cross validation method of model selection, respectively. I briefly summarize these concepts next.

A key to mastering model assessment and model selection is to understand the bias-variance trade-off. If you have a model, $$y = f(x) + \epsilon$$, so that $$y$$ is a function of $$f$$ plus some noise. Then, if we want to find out how $$x$$ affects $$y$$, or use known values of $$x$$ to predict $$y$$, we need to find a function $$\hat{f}(x)$$ that is as close as possible to $$f$$. The most common way to assess if our $$\hat{f}$$ is good or not is with mean squared error (MSE), or root mean squared error where we take the square root of the MSE. MSE is defined like this.

$MSE = \frac{1}{n}\sum_{i=1}^n (y_i – \hat{f}(x_i))^2$

It just takes all the model’s predictions, $$\hat{f}(x_i)$$, calculates the square of how far they are from the actual value, $$y_i$$, and averages them. So a low MSE is good – your predictions were not far away from the actual values.

So what contributes to a model’s errors? Head over to Wikipedia for a proof, but for any particular observation the expectation of the terms contributing to the MSE can be broken down as follows,

$E(y_i – \hat{f}x_i)^2 = Var(\hat{f}(x_i)) + Bias(\hat{f}(x_i))^2 + Var(\epsilon_i),$

where, $$Bias(\hat{f}(x_i)) = E(\hat{f}(x_i) – f(x_i))$$. Clearly to get a low MSE you need a model that will give you low variance and low bias. But, typically these things move in opposite directions. You get a high variance in $$\hat{f}$$ with a more flexible model because when a model is flexible (think for example of a high degree polynomial versus a linear model where the polynomial has many more parameters to estimate) different training data sets will make the fitted function ‘wiggle’ around much more than the linear model. Conversely, models that are less flexible have higher bias, because it can not accommodate the true shape of the data generating process as easily as a more flexible model.

In ISL, figure 2.11 illustrates this very well.

This figure is taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani

Here you see a scatter-plot of a data set with variables X and Y as circles in the left panel. To model the data they fit a linear model (in gold), and increasingly higher degree polynomials in black, blue, and green. Of course, the higher the degree of the polynomial the more ‘wiggly’ the $$\hat{f}$$. Then in the right panel, they plot the MSE of these different candidate $$\hat{f}$$’s. The training set MSE is in grey and the test set (out of sample in the parlance of price forecasting) MSE is in red.

This illustrates an important point. Yes, variance and bias of a model move in opposite directions, but not necessarily at the same rate (as flexibility increases I mean). Here, when you move from a linear, to a 3rd degree, to a 5th degree polynomial, you get dramatic reductions in the MSE of both the training set and the test set.

This is because the reduction in bias as you move away from the linear model overwhelms the increase in variance as you move toward a higher degree polynomial. Though, after about a 5th degree polynomial, the gains disappear and you do not get much more out of adding higher polynomial terms to the $$\hat{f}$$. So, if this were a model selection exercise in a real research problem, we would probably select a 5th degree polynomial.

The approach of using MSE for model selection seems to be pretty ubiquitous in machine learning contexts. In economics, it is commonly used to select among competing forecasting models, but I don’t think it is used that widely for model selection in models where inference is the goal.

Cross Validation

Cross validation is a technique for model selection that I do not believe is widely used by economists, but I think it has a lot of promise. It does not seem to me to require a particularly large data set, so it could be an effective tool even on our ‘medium’ data sets.

The idea behind it is pretty simple. At the model selection stage, you will have a number of candidate models. Conceptually, the example above is pretty simple, your candidate models are polynomials of higher and higher degree. But, you could have a number of candidate models that look very different from one another. The MSE approach to model selection is to just see which one predicts the best on a ‘hold out’ data set. That is basically what we did in the example shown in figure 2.11.

Cross Validation just takes this one step further. In k-Fold cross validation you do the MSE exercise with k different randomly selected hold out samples for every model you are considering. An illustration for a 5-Fold cross validation sample split is shown in figure 5.5 from the ISL book.

This figure is taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani

The advantage of this is that it allows the variance and bias of the fitted models to be revealed a little more completely. By using different subsets of your data as the test set you get a sense of how much individual observations are moving around your MSE estimates. When conducting a k-Fold cross validation, you typically just average the MSE’s given by each of the k model estimates.

An Ag Example

To illustrate the potential of k-Fold cross validation in agricultural economics I will use it in a simple example on a pretty small data set. Every month the USDA puts out the World Agricultural Supply and Demand Estimates (WASDE). This report is followed very closely, especially during the growing season and harvest of commodities important to U.S. agriculture. It publishes what are called ‘balance sheets’ because they state in a table expectations for supply and demand from various sources for each commodity.

For store-able commodities the WASDE also have an estimate of ending stocks, the amount of the grain that is not consumed in the same marketing year it was grown in, and thus carried over to be consumed during the next marketing year. Ending stocks are an important indicator or scarcity or surplus, and thus correlate with prices quite nicely. Rather than simply use ending stocks, a ratio called $$Stocks$$ $$to$$ $$Use$$ has come to be important in price analysis because it allows ending stocks to be expressed in a way that is not influenced by the (potentially growing) size of the market.

$Stocks \text{ } to \text{ } Use_t = \frac{Ending \text{ } Stocks_t}{Total \text{ } Use_t}$

Stocks to Use is simply the ratio of ending stocks to the total amount of the commodity ‘used’.

The Data

Data from historical WASDE reports can be retrieved from the USDA’s Feed Grains Database. If you want to follow along, download the .csv file containing ending stocks for corn in the the U.S. and the rest of the world (ROW), as well as average prices received by farmers in the U.S. The following code imports the .csv file (make sure to modify the file path in read.csv() to wherever you put the stocks.csv file if you are following along). Then the code creates the stocks to use variable in the line that calls the mutate() function.

# If you haven't installed any of the packages listed below, do so with the "install.packages('tibble')" command.
library(tibble)
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(kableExtra)
library(gridExtra)
library(boot)
stocks  <- as_tibble(stocks)
stocks  <- mutate(stocks, USStockUse = USEndingStocks/USTotalUse, WorldStockUse = ROWEndingStocks/WorldTotalUse)

stocks %>% kable(caption = "U.S. and World Ending Stocks, Total Use, Prices Recieved by Farmers, and Stocks to USE, 1975-2017", "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
U.S. and World Ending Stocks, Total Use, Prices Recieved by Farmers, and Stocks to USE, 1975-2017
Year USEndingStocks ROWEndingStocks USTotalUse WorldTotalUse PriceRecievedFarmers USStockUse WorldStockUse
1975 633.200 36411 5767.054 228199 2.54 0.1097961 0.1595581
1976 1135.600 39491 5789.200 235034 2.15 0.1961584 0.1680225
1977 1435.900 40833 6207.139 246973 2.02 0.2313304 0.1653339
1978 1709.500 47957 6995.479 254029 2.25 0.2443721 0.1887855
1979 2034.300 59481 7604.060 273641 2.52 0.2675281 0.2173687
1980 1392.100 67180 7282.444 293625 3.11 0.1911584 0.2287952
1981 2536.600 62725 6974.706 290686 2.50 0.3636856 0.2157827
1982 3523.100 60273 7249.090 279377 2.55 0.4860058 0.2157407
1983 1006.300 63421 6692.759 287033 3.21 0.1503565 0.2209537
1984 1648.200 76287 7031.963 297524 2.63 0.2343869 0.2564062
1985 4039.522 75069 6494.029 285517 2.23 0.6220363 0.2629230
1986 4881.693 80862 7385.350 298388 1.50 0.6609968 0.2709962
1987 4259.086 89488 7757.318 304440 1.94 0.5490410 0.2939430
1988 1930.428 96217 7260.121 319881 2.54 0.2658947 0.3007900
1989 1344.457 98715 8119.826 327783 2.36 0.1655771 0.3011596
1990 1521.245 102761 7760.655 319954 2.28 0.1960202 0.3211743
1991 1100.311 113106 7915.336 332232 2.37 0.1390100 0.3404428
1992 2112.981 109068 8471.119 339172 2.07 0.2494335 0.3215714
1993 850.143 107849 7621.383 349304 2.50 0.1115471 0.3087540
1994 1557.840 113777 9352.380 353437 2.26 0.1665715 0.3219159
1995 425.942 122467 8548.436 376204 3.24 0.0498269 0.3255335
1996 883.161 143886 8788.599 382278 2.71 0.1004894 0.3763910
1997 1307.803 133982 8791.000 388191 2.43 0.1487661 0.3451445
1998 1786.977 145972 9298.317 395867 1.94 0.1921828 0.3687400
1999 1717.549 150779 9514.784 412536 1.82 0.1805137 0.3654930
2000 1899.108 126880 9740.316 412710 1.85 0.1949740 0.3074314
2001 1596.426 110808 9815.402 424591 1.97 0.1626450 0.2609759
2002 1086.673 99276 9490.986 427821 2.32 0.1144953 0.2320503
2003 958.091 80334 10229.950 438453 2.42 0.0936555 0.1832215
2004 2113.972 77364 10660.530 465941 2.06 0.1982990 0.1660382
2005 1967.161 73472 11267.804 475772 2.00 0.1745825 0.1544269
2006 1303.647 75639 11206.620 499570 3.04 0.1163283 0.1514082
2007 1624.150 86418 12737.393 515187 4.20 0.1275104 0.1677410
2008 1673.311 100739 12007.572 526638 4.06 0.1393547 0.1912870
2009 1707.787 97474 13041.023 546250 3.55 0.1309550 0.1784421
2010 1127.645 93924 13033.141 570041 5.18 0.0865214 0.1647671
2011 989.027 103071 12481.947 607875 6.22 0.0792366 0.1695595
2012 821.185 112089 11082.899 606563 6.89 0.0740948 0.1847937
2013 1231.904 142987 13454.039 661861 4.46 0.0915639 0.2160378
2014 1731.164 165766 13747.918 686099 3.70 0.1259219 0.2416065
2015 1737.058 170837 13663.635 669447 3.61 0.1271300 0.2551912
2016 2293.303 171509 14648.801 747315 3.36 0.1565523 0.2295003
2017 2352.370 143333 14595.000 749749 3.30 0.1611764 0.1911746

Source: USDA Feedgrains Database

This is clearly not big data if I can print the whole data set right in the blog post! We have 43 years of data. Note that the 2017 observation is an expectation. We are not sure what carryout for the 2017 marketing year will be. The U.S. ending stocks and total use are in millions of bushels. The World ending stocks and total use are in 1,000 MT.

Now lets just plot the U.S. and World stocks to use versus price to get a sense of what the relationship looks like.

us      <- ggplot(stocks, aes(x = USStockUse, y = PriceRecievedFarmers)) + geom_point() + theme_bw() + xlim(0, .7)

ROW      <- ggplot(stocks, aes(x = WorldStockUse, y = PriceRecievedFarmers)) + geom_point() + theme_bw()  + xlim(0, .7)

grid.arrange(us, ROW, nrow = 1, top = "U.S. and World Stocks to Use vs Prices Recieved by U.S. Farmers, 1975-2017")

In the U.S. panel, a clear non-linear relationship if evident. The points from 2010-2012 had low stocks to use and exceptionally high prices. The points with the highest stocks to use and lowest prices are from the 1980’s when the U.S. government was in the business of holding stocks to support prices. Despite the non-linearity, there is clearly a relationship between stocks to use and prices for corn, as we would expect.

In a farmdoc Daily article Darrel Good and Scott Irwin propose,

$Price = \alpha + \beta \frac{1}{Stocks \text{ } to \text{ } Use} + \epsilon,$

as a model for ‘predicting’ the average price received by farmers variable. This certainly seems like a sensible choice, given that the shape of the relationship resembles $$y = \frac{1}{x}$$, but in this example, we will see if we can find another functional form that performs better in terms of cross validation MSE. We will also consider if the log of the stocks to use variable might fit, since this could also produce close to the correct curved shape (with a negative coefficient of course). I should note that in the farmdoc daily article, Good and Irwin use a much shorter sample that possibly reflects more closely current supply and demand dynamics better. But, in our case, we need the sample size to do the more data intensive cross validation.

We will use a 5-Fold cross validation to assess the models. Since World Stocks to Use might give additional predictive power, we will work it into some of our specifications. We will consider each of the following model types, and we will try them with the polynomial terms with degree from 1 to 5:

Model Functional Form
Model 1 $$Price = \beta_0 + \sum_{i=1}^n \beta_i \Big(\frac{1}{US \text{ } Stocks \text{ } to \text{ } Use} \Big)^i + \epsilon$$
Model 2 $$Price = \beta_0 + \sum_{i=1}^n \beta_i \Big(\frac{1}{US \text{ } Stocks \text{ } to \text{ } Use} \Big)^i + \sum_{i=1}^n \gamma_i \Big(\frac{1}{World \text{ } Stocks \text{ } to \text{ } Use} \Big)^i + \epsilon$$
Model 3 $$Price = \beta_0 + \sum_{i=1}^n \beta_i \Big(log(US \text{ } Stocks \text{ } to \text{ } Use \Big)^i + \epsilon$$
Model 4 $$Price = \beta_0 + \sum_{i=1}^n \beta_i \Big(log(US \text{ } Stocks \text{ } to \text{ } Use \Big)^i + \sum_{i=1}^n \gamma_i \Big(log(World \text{ } Stocks \text{ } to \text{ } Use \Big)^i + \epsilon$$
Model 5 $$Price = \beta_0 + \sum_{i=1}^n \beta_i \Big(\frac{1}{US \text{ } Stocks \text{ } to \text{ } Use} \Big)^i + \sum_{i=1}^n \gamma_i \Big(log(World \text{ } Stocks \text{ } to \text{ } Use \Big)^i + \epsilon$$

In the code below, each for loop is fitting one of the models 1-5 and looping over the polynomial degree. So in total, we have fit 25 different models. In addition to that the line with the cv.glm() function is performing the k-Fold cross validation, i.e. generating 5 test and hold out samples, calculating the MSE of each and then averaging it. The cv.XXXX vectors are storing the cross validation MSE to plot and display below.

set.seed(17)
cv.us=rep(0,5)
cv.World=rep(0,5)
cv.Lus=rep(0,5)
cv.LWorld=rep(0,5)
cv.World2=rep(0,5)
for (i in 1:5){
glm.fit=glm(PriceRecievedFarmers~poly(1/USStockUse, i) ,data=stocks)
cv.us[i]=cv.glm(stocks,glm.fit,K=5)$delta[1] } for (i in 1:5){ glm.fit.World=glm(PriceRecievedFarmers~poly(1/USStockUse, i) + poly(1/WorldStockUse, i) ,data=stocks) cv.World[i]=cv.glm(stocks,glm.fit.World,K=5)$delta[1]
}

for (i in 1:5){
glm.fit=glm(PriceRecievedFarmers~poly(log(USStockUse), i) ,data=stocks)
cv.Lus[i]=cv.glm(stocks,glm.fit,K=5)$delta[1] } for (i in 1:5){ glm.fit.World=glm(PriceRecievedFarmers~poly(log(USStockUse), i) + poly(log(WorldStockUse), i),data=stocks) cv.LWorld[i]=cv.glm(stocks,glm.fit.World,K=5)$delta[1]
}

for (i in 1:5){
glm.fit.World=glm(PriceRecievedFarmers~poly(1/USStockUse, i) + poly(log(WorldStockUse), i),data=stocks)
cv.World2[i]=cv.glm(stocks,glm.fit.World,K=5)$delta[1] } res_tibble <- cbind(cv.us, cv.World, cv.Lus, cv.LWorld, cv.World2) %>% as_tibble() %>% add_column(PolynomialDegree = seq(1:5)) %>% round(2) colnames(res_tibble) <- c('Model 1', 'Model 2', 'Model 3', 'Model 4', 'Model 5', 'PolynomialDegree') p1 <- ggplot(res_tibble, aes(x = PolynomialDegree, y = cv.us)) + geom_line() + ylim(0, 75) + labs(x = 'Model 1 Poly Degree', y = 'Cross Validation MSE') + theme_bw() p2 <- ggplot(res_tibble, aes(x = PolynomialDegree, y = cv.World)) + geom_line() + ylim(0, 75) + labs(x = 'Model 2 Poly Degree', y = '') + theme_bw() p3 <- ggplot(res_tibble, aes(x = PolynomialDegree, y = cv.Lus)) + geom_line() + ylim(0, 75) + labs(x = 'Model 3 Poly Degree', y = '') + theme_bw() p4 <- ggplot(res_tibble, aes(x = PolynomialDegree, y = cv.LWorld)) + geom_line() + ylim(0, 75) + labs(x = 'Model 4 Poly Degree', y = '') + theme_bw() p5 <- ggplot(res_tibble, aes(x = PolynomialDegree, y = cv.World2)) + geom_line() + ylim(0, 75) + labs(x = 'Model 5 Poly Degree', y = '') + theme_bw() grid.arrange(p1, p2, p3, p4, p5, nrow = 1, top = "Cross Validation MSE for Competing Models") The figure shows the cross validation MSE we get from each of the five models fit with polynomial degree 1:5. Remember, to get the cross validation MSE the cv.glm() function with k = 5 is creating 5 different non-overlapping test and hold out samples. The MSE plotted in the figure is average of the MSE from each of the 5 different hold out samples. In the table below we present the same information, the cross validation MSE by model type and for each polynomial degree. Since there were several models with very similar MSE we can see exactly which models are getting the minimum MSE. Recall that Model 1 with a 1 degree polynomial is the model proposed by Good and Irwin in the farmdoc Daily article. This model does quite well, it comes in third, and the models that do better only do marginally better. The first place model is Model 5 with polynomial degree 1; the second place model is Model 4 with polynomial degree 2. Now to get a sense of whether or not these models are providing economically different predictions, we show predictions from the top three models in the table below. Rank Specification 2017 Price 1 $$Price = \beta_0 + \beta_1 \frac{1}{US \text{ } Stocks \text{ } to \text{ } Use} + \gamma_1 log(World \text{ } Stocks \text{ } to \text{ } Use) + \epsilon$$ 3.06 2 $$Price = \beta_0 + \sum_{i=1}^2 \beta_i \Big(log(US \text{ } Stocks \text{ } to \text{ } Use \Big)^i + \sum_{i=1}^2 \gamma_i \Big(log(World \text{ } Stocks \text{ } to \text{ } Use \Big)^i + \epsilon$$ 3.15 3 $$Price = \beta_0 + \beta_1 \frac{1}{US \text{ } Stocks \text{ } to \text{ } Use} + \epsilon$$ 2.77 Currently, the March 2018 futures contract is trading at around$3.65, which is significantly higher than each of these predictions, but if you look back at the scatter-plot of the actual data, \$3.65 is well within the typical variability for a stocks to use value of 0.16 which is our current expectation for the 2017/18 marketing year.

That’s It!

That’s it for this time! I hope you enjoyed this explanation and toy example of cross validation. I certainly learned a lot writing it! I think you would only consider using cross validation for model selection if your data set was at least modestly bigger that the one in this post. However, this example goes to show that the technique can be useful even in medium data!

In the next post I will talk about Tree-Based Methods, Random Forests, Boosting, and Bagging from chapter 8 of ISL. I really currently know zero about these methods, but I have heard a lot of buzz about them. It is my sense that these are ‘real’ machine learning topics, whereas the cross validation and model selection stuff is more like necessary background.

Stay Tuned!

Reposts

• Ana
• Rstats
• tomas nilsson

Mentions

• Abraham Mathew
• Mubashir Qasim
• mindymallory

1. Edward

Great information and well articulated. Greatly appreciate this wonderful lesson. Waiting for the next post.

• mindymallory

Thank you!

2. There are some tidy modeling packages that can get rid of some of those loops. Would you be interested in a modified version of the code that you’ve shown?

3. Certainly!

Webmentions

• Machine Learning and Econometrics: Model Selection and Assessment Statistical Learning Style blog.mindymallory.com/2018/02/machin…

• Mubashir Qasim February 28, 2018

Mubashir Qasim mentioned this article on mqasim.me.

• mindymallory mentioned this article on blog.mindymallory.com.