November 28, 2014
If you ever read an academic study or about analytics, you must’ve heard the word “regression”. Regression is a mathematical tool used for establishing a
relationship between two or more variables.
Simply put, regression analysis helps to determine if there is or there is not a correlation between two or more variables. At it’s most basic form, it
involves plotting datapoints on X and Y axes and looking for a trendline, and thus, a potential relationship. you can plot your data on a graph and do it,
like a school kid or you can use a spreadsheet application, like Excel to do it more accurately and with greater certainity.
In this blogpost, we will understand how to do basic regression analysis using a case as example. Anyone who’s ever had a meeting with a bank loan officer
knows that as interest rates rise, mortgages become more expensive and thus the maximum mortgage that one can afford declines. In response to this, median
housing prices frequently decline. (For the sake of simplicity, we’ll leave aside the “unconventional” mortgages that contributed to the housing boom and
bust.) But what exactly is that relationship between mortgage rates and median house prices? If mortgage interest rates increase by 1 percentage point, how
much would the median selling price of properties fall?
To solve this, first, we need some data and then a spreadsheet software. This tutorial deals with Microsoft Excel, but free tools such as Open Office Calc
can be used.
Once you have Excel installed in your PC, verify that Excel is set up to calculate regressions. To do so, it must have the Analysis ToolPak enabled:
1. Go to the File menu.
2. Click on Options.
3. Click on Add-Ins.
4. Select Analysis ToolPak.
5. Click the Go button.
6. In the Add-Ins dialog box, ensure that “Analysis ToolPak” is checked.
7. Click OK.
The data we are going to use is a list of the average interest rates on 30 year mortgages and the median home selling price from 1988 to 2003. We’d want to
put the home prices on vertical axis and the interest rate on horizontal axis (Why, you ask, sign up for _______).
Now create a diagram of the data points — this is known as a “scatter plot” — with a regression line that shows the relationship between mortgage interest
rates and median home prices:
1. Download the spreadsheet and open it with Excel. Ensure that you’re on the first tab, “Interest rates and home prices.”
2. With your cursor, select the range from cells B6 to C21 (note that we don’t want to plot the averages, which are on line 22).
3. Click on Insert and then Scatter. Excel will generate a scatter chart and place it on your worksheet.
4. On the scatter chart, right-click on any point in the chart.
5. Choose “Add Trend Line.”
6. Select “Linear,” “Display Equation on Chart” and “R2 Value on Chart.”
When you press OK, you should see the following chart, or something similar:
What is being shown in the graph is the regression line, which is a “trend line”, meaning it gives a general trend which is flowing through the data. One
important thing that you need to remember is that “correlation doesn’t mean causation”. To understand causation different kind of
experiments should be designed.
From the graph, we can see that as the interest rates increases, median home prices decrease. The reason we plotted interest rate on the X axis is because
it is the independent variable in our data set and median home price is the dependent variable. Hence it is on the Y axis.
While the regression formula looks complicated — “y = -23409x + 393349” — it’s actually not too hard to read and use. The formula’s end result is a Y
value, which will be a median home price; this is influenced by the X value, which is the interest rate: For any 1 percentage-point increase in interest
rates, the median value of homes sold decreases by $23,409, and vice-versa. To give an obvious example, if interest rates are zero, the X in the formula is
zero; this would completely eliminate the $23,409 multiplier (also known as a coefficient), and the median home price would be $393,349 (this is known as
the Y-intercept). If the average interest rate were 10%, then the median sales price would be $159,259 — $393,349-($23,409*10). You can plug in other
interest rates, and see what the median sales prices would be.
The “R2” value tells you how compelling the dataset is — and you can see this visually by how tightly the data points are or aren’t distributed
around the trend line. Excel calculates R2 for you based on the vertical distance of the individual data points from the trend line, and it’s
always between 0 and 1: The higher the value, the better the “fit” of the data and the more accurate the formula. In our example, the R2 is
0.3846 — decent, but not great. Consequently, we might want to get additional data to see if we can establish a stronger relationship (or not) between
mortgage interest rates and home prices.
The R2 value is a measure of how much the data is close to our trend line. You can see the impact of the size of your dataset on the R 2 value and thus the regression formula itself. R2 value always lies between 0 and 1. In our case, R2 value is 0.3846,
which is decent but not great. Perhaps, if we used additional data, perhaps we can get higher R2 values.
If you want to generate more exhaustive diagnostics, you can easily do this in Excel:
1. Click on Data, “Data Analysis,” and then select “Regression.”
2. In the Input Y Range, select C5:C20.
3. In the Input X Range, select B5:B20.
4. Under Output Options, choose “New Worksheet Ply,” then click OK.