Predicting Housing Prices with Linear Regression Using Python, pandas, and statsmodels
This post was originally published
In this post, we’ll walk through building linear regression models to predict housing prices resulting from economic activity. Topics covered will include:
Future posts will cover related topics such as exploratory analysis, regression diagnostics, and advanced regression modeling, but I wanted to jump right in so readers could get their hands dirty with data.
What is Regression?
Linear regression is a model that predicts a relationship of direct proportionality between the dependent variable (plotted on the vertical or Y axis) and the predictor variables (plotted on the X axis) that produces a straight line, like so:
Linear regression will be discussed in greater detail as we move through the modeling process.
Variable Selection
For our dependent variable we’ll use housing_price_index (HPI), which measures price changes of residential housing.
For our predictor variables, we use our intuition to select drivers of macro- (or “big picture”) economic activity, such as unemployment, interest rates, and gross domestic product (total productivity). For an explanation of our variables, including assumptions about how they impact housing prices, and all the sources of data used in this post, see .
Reading in the Data with pandas
Once we’ve downloaded the data, read it in using pandas’ read_csv method.
import pandas as pd
# read in from csv using pd.read_csv
# be sure to use the file path where you saved the data
housing_price_index = pd.read_csv('/Users/tdobbins/Downloads/hpi/monthly-hpi.csv')
unemployment = pd.read_csv('/Users/tdobbins/Downloads/hpi/unemployment.csv')
federal_funds_rate = pd.read_csv('/Users/tdobbins/Downloads/hpi/fed_funds.csv')
shiller = pd.read_csv('/Users/tdobbins/Downloads/hpi/shiller.csv')
gross_domestic_product = pd.read_csv('/Users/tdobbins/Downloads/hpi/gdp.csv')
Once we have the data, invoke pandas’ merge method to join the data together in a single dataframe for analysis. Some data is reported monthly, others are reported quarterly. No worries. We merge the dataframes on a certain column so each row is in its logical place for measurement purposes. In this example, the best column to merge on is the date column.
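As a rough sketch of the merge step, assuming each file has a date column: the two tiny frames below are hypothetical stand-ins for the CSVs, with toy values chosen only for illustration. An inner merge on date keeps only the dates present in both frames, which is how a monthly series and a quarterly series end up aligned row by row.

```python
import pandas as pd

# Hypothetical stand-ins for the CSVs: one monthly series, one quarterly series.
# The numbers are toy values, not real data.
monthly = pd.DataFrame({
    "date": ["2011-01-01", "2011-02-01", "2011-03-01", "2011-04-01"],
    "housing_price_index": [100.0, 101.0, 102.0, 103.0],
})
quarterly = pd.DataFrame({
    "date": ["2011-01-01", "2011-04-01"],
    "gross_domestic_product": [1000.0, 1010.0],
})

# An inner merge on "date" keeps only the dates that appear in both frames,
# so monthly rows with no quarterly counterpart are dropped.
df = monthly.merge(quarterly, on="date", how="inner")
print(df.head())
```

With more than two frames, the same call can be chained, merging each additional frame on date in turn.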
Let’s get a quick look at our variables with pandas’ head method. The headers in bold text represent the date and the variables we’ll test for our model. Each row represents a different time period.
Out[23]:
   date        sp500    consumer_price_index  long_interest_rate  housing_price_index  total_unemployed
0  2011-01-01  1282.62  220.22                3.39                181.35               16.2
1  2011-04-01  1331.51  224.91                3.46                180.80               16.1
2  2011-07-01  1325.19  225.92                3.00                184.25               15.9
3  2011-10-01  1207.22  226.42                2.15                181.51               15.8
4  2012-01-01  1300.58  226.66                1.97                179.13               15.2
Usually, the next step after gathering data would be exploratory analysis. Exploratory analysis is the part of the process where we analyze the variables (with plots and descriptive statistics) and figure out the best predictors of our dependent variable. For the sake of brevity, we’ll skip the exploratory analysis. Keep in the back of your mind, though, that it’s of utmost importance and that skipping it in the real world would preclude ever getting to the predictive section.
We’ll use ordinary least squares (OLS), a basic yet powerful way to fit and assess our model.
Ordinary Least Squares Assumptions
OLS fits a linear regression model by choosing the coefficients that minimize the sum of squared residuals, the squared distances between the observed values and the fitted line.
OLS is built on assumptions which, if they hold, indicate the model may be the correct lens through which to interpret our data. If the assumptions don’t hold, our model’s conclusions lose their validity. Take extra effort to choose the right model to avoid .
Here are the OLS assumptions:
1. Linearity: A linear relationship exists between the dependent and predictor variables. If no linear relationship exists, linear regression isn’t the correct model to explain our data.
2. No multicollinearity: Predictor variables are not collinear, i.e., they aren’t highly correlated. If the predictors are highly correlated, try removing one or more of them. Since additional predictors are supplying redundant information, removing them shouldn’t drastically reduce the Adj. R-squared (see below).
3. Zero conditional mean: The average of the distances (or residuals) between the observations and the trend line is zero. Some will be positive, others negative, but they won’t be biased toward a set of values.
4. Homoskedasticity: The certainty (or uncertainty) of our dependent variable is equal across all values of a predictor variable; that is, there is no pattern in the residuals. In statistical jargon, the variance is constant.
5. No autocorrelation (serial correlation): Autocorrelation is when a variable is correlated with itself across observations. For example, a stock price might be serially correlated if one day’s stock price impacts the next day’s stock price.
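Two of these assumptions are easy to screen numerically. The sketch below uses made-up numbers and a hypothetical helper name. It computes the correlation between two predictors (a quick screen for multicollinearity) and the Durbin-Watson statistic of a residual series, which ranges from 0 to 4: values near 2 suggest no serial correlation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Made-up predictors: x2 is almost a copy of x1, so the two are collinear.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0])
predictor_corr = np.corrcoef(x1, x2)[0, 1]

# Made-up residuals: one series alternates in sign, the other drifts slowly.
alternating = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
drifting = np.array([1.0, 0.9, 0.8, -0.8, -0.9, -1.0])

print(f"corr(x1, x2)   = {predictor_corr:.3f}")
print(f"DW alternating = {durbin_watson_stat(alternating):.2f}")
print(f"DW drifting    = {durbin_watson_stat(drifting):.2f}")
```

These are screens, not proofs: a high pairwise correlation or a Durbin-Watson value far from 2 is a prompt to look more closely at the model.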
Let’s begin modeling.
Simple Linear Regression
Simple linear regression uses a single predictor variable to explain a dependent variable. A simple linear regression equation is as follows:
y = α + ßx + ε
Where:
y = dependent variable
ß = regression coefficient
α = intercept (expected mean value of housing prices when our independent variable is zero)
x = predictor (or independent) variable used to predict y
ε = the error term, which accounts for the randomness that our model can’t explain.
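For a single predictor, the least-squares estimates have a closed form: ß is the covariance of x and y divided by the variance of x, and α is the mean of y minus ß times the mean of x. A minimal sketch with NumPy follows, using only the five rows shown in the head output earlier, so the estimates will differ from any fit on the full sample.

```python
import numpy as np

# total_unemployed (x) and housing_price_index (y) from the five rows shown earlier
x = np.array([16.2, 16.1, 15.9, 15.8, 15.2])
y = np.array([181.35, 180.80, 184.25, 181.51, 179.13])

# closed-form OLS estimates for y = alpha + beta * x + error
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()

# residuals are the part of y the fitted line does not explain
residuals = y - (alpha + beta * x)
print(f"beta = {beta:.4f}, alpha = {alpha:.4f}")
print(f"mean residual = {residuals.mean():.2e}")
```

By construction the residuals of a least-squares fit with an intercept average to zero, which is why the last printed value is zero up to floating-point error.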
Using statsmodels’ ols function, we construct our model setting housing_price_index as a function of total_unemployed. We assume that an increase in the total number of unemployed people will have downward pressure on housing prices. Maybe we’re wrong, but we have to start somewhere!
The code below shows how to set up a simple linear regression model with total_unemployed as our predictor variable.
from IPython.display import HTML, display
import statsmodels.api as sm
from statsmodels.formula.api import ols
# fit our model with .fit() and show results
# we use statsmodels' formula API to invoke the syntax below,
# where we write out the formula using ~
housing_model = ols("housing_price_index ~ total_unemployed", data=df).fit()
# summarize our model
housing_model_summary = housing_model.summary()
# convert our table to HTML for display in the notebook
HTML(housing_model_summary.as_html())
Out[24]:
OLS Regression Results
Dep. Variable:     housing_price_index   R-squared:            0.952
Model:             OLS                   Adj. R-squared:       0.949
Method:            Least Squares         F-statistic:          413.2
Date:              Fri, 17 Feb 2017      Prob (F-statistic):   2.71e-15
Time:              17:57:05              Log-Likelihood:       -65.450
No. Observations:  23                    AIC:                  134.9
Df Residuals:      21                    BIC:                  137.2
Df Model:          1
Covariance Type:   nonrobust

                   coef      std err   t         P>|t|   [95.0% Conf. Int.]
Intercept          313.3128  5.408     57.938    0.000   302.067  324.559
total_unemployed   -8.3324   0.410     -20.327   0.000   -9.185   -7.480

Omnibus:           0.492                 Durbin-Watson:        1.126
Prob(Omnibus):     0.782                 Jarque-Bera (JB):     0.552
Skew:              0.294                 Prob(JB):             0.759
Kurtosis:          2.521                 Cond. No.             78.9
Referring to the OLS regression results above, we’ll offer a high-level explanation of a few metrics to understand the strength of our model: Adj. R-squared, coefficients, standard errors, and p-values.
To explain:
Adj. R-squared indicates that about 95% of the variation in housing_price_index can be explained by our predictor variable, total_unemployed.
The regression coefficient (coef) represents the change in the dependent variable resulting from a one unit change in the predictor variable, all other variables being held constant. In our model, a one unit increase in total_unemployed is associated with a decrease of 8.33 in housing_price_index. In line with our assumptions, an increase in unemployment appears to reduce housing prices.
The standard error measures the accuracy of total_unemployed’s coefficient by estimating how much the coefficient would vary if the same test were run on a different sample of our population. Our standard error, 0.41, is small relative to the coefficient of -8.33, so the estimate appears precise.
The p-value is the probability of seeing an effect at least as large as an 8.33 decrease in housing_price_index per one unit increase in total_unemployed if there were in fact no relationship between the two variables. Here that probability rounds to 0.000. A low p-value, by convention one below 0.05, indicates that the result is statistically significant.
The confidence interval is a range within which our coefficient is likely to fall. We can be 95% confident that total_unemployed’s coefficient lies within our confidence interval, [-9.185, -7.480].
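Several of these figures can be reproduced from one another, which helps make the definitions concrete. The sketch below plugs in the rounded values from the summary table. The critical value 2.0796 is the 97.5th percentile of the t distribution with 21 degrees of freedom; it is hardcoded here as an assumption rather than computed.

```python
# rounded values copied from the OLS summary above
coef = -8.3324        # total_unemployed coefficient
std_err = 0.410       # its standard error
f_statistic = 413.2
n_obs = 23
df_resid = 21         # n_obs minus 2 estimated parameters
t_crit = 2.0796       # assumed: 97.5th percentile of t with 21 df

# t statistic: the coefficient measured in standard-error units
t_stat = coef / std_err

# 95% confidence interval: coefficient plus or minus t_crit standard errors
ci_lower = coef - t_crit * std_err
ci_upper = coef + t_crit * std_err

# with a single predictor, R-squared can be recovered from the F-statistic
r_squared = f_statistic / (f_statistic + df_resid)

# adjusted R-squared penalizes for the number of predictors (here, 1)
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

print(f"t = {t_stat:.2f}")
print(f"95% CI = [{ci_lower:.3f}, {ci_upper:.3f}]")
print(f"R-squared = {r_squared:.3f}, Adj. R-squared = {adj_r_squared:.3f}")
```

Small differences from the table (for example in the third decimal of the t statistic) come from starting with rounded inputs.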
Let’s use statsmodels’ plot_regress_exog function to help us understand our model.
Regression Plots
Please see the four graphs below.
1. The “Y and Fitted vs. X” graph plots the dependent variable against our predicted values with a confidence interval. The inverse relationship in our graph indicates that housing_price_index and total_unemployed are negatively correlated, i.e., when one variable increases the other decreases.
2. The “Residuals versus total_unemployed” graph shows our model’s errors versus the specified predictor variable. Each dot is an observed value; the line represents the mean of those observed values. Since there’s no pattern in the distance between the dots and the mean value, the OLS assumption of homoskedasticity holds.
3. The “Partial regression plot” shows the relationship between housing_price_index and total_unemployed, taking into account the impact of adding other independent variables on our existing total_unemployed coefficient. We’ll see later how this same graph changes when we add more variables.
4. The Component and Component Plus Residual (CCPR) plot is an extension of the partial regression plot, but shows where our trend line would lie after adding the impact of our other independent variables on our existing total_unemployed coefficient. More on this plot .
Simple Linear Regression Plot
The next plot graphs our trend line (green), the observations (dots), and our confidence interval (red).
# this produces our trend line
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import numpy as np
# predictor variable
x = df[['total_unemployed']]
# dependent variable
y = df[['housing_price_index']]
# retrieve our confidence interval values
# _ is a dummy variable since we don't actually use it for plotting but need it as a placeholder
# since wls_prediction_std(housing_model) returns 3 values
_, confidence_interval_lower, confidence_interval_upper = wls_prediction_std(housing_model)
fig, ax = plt.subplots(figsize=(10,7))
# plot the dots
# 'o' specifies the shape (circle), we can also use 'd' (diamonds), 's' (squares)
ax.plot(x, y, 'o', label="data")
# plot the trend line
# g-- and r-- specify the color to use
ax.plot(x, housing_model.fittedvalues, 'g--.', label="OLS")
# plot upper and lower ci values
ax.plot(x, confidence_interval_upper, 'r--')
ax.plot(x, confidence_interval_lower, 'r--')
# plot legend
ax.legend(loc='best');
Trend Line Plot
So far, our model looks decent. Let’s add some more variables and see how total_unemployed reacts.
Multiple Linear Regression
Mathematically, multiple linear regression is:
y = α + ß1x1 + ß2x2 + ... + ßnxn + ε
We know that unemployment cannot entirely explain housing prices. To get a clearer picture of what influences housing prices, we add and test different variables and analyze the regression results to see which combinations of predictor variables satisfy OLS assumptions, while remaining intuitively appealing from an economic perspective.
We arrive at a model that contains the following variables: fed_funds, consumer_price_index, long_interest_rate, and gross_domestic_product, in addition to our original predictor, total_unemployed.
Adding the new variables decreased the impact of total_unemployed on housing_price_index. total_unemployed’s impact is now more unpredictable (standard error increased from 0.41 to 2.399), and, since the p-value is higher (from 0 to 0.943), its coefficient is no longer statistically distinguishable from zero.
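One common reason a coefficient’s standard error grows when predictors are added is that the new predictors are correlated with it. Whether that is what happened here is not established by the output alone; the sketch below only illustrates the mechanism on synthetic data. The helper name and all numbers are made up for the illustration.

```python
import numpy as np

def ols_standard_errors(X, y):
    """Fit y = X @ b by least squares and return (coefficients, standard errors)."""
    coefs, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coefs
    dof = X.shape[0] - X.shape[1]              # observations minus parameters
    sigma2 = residuals @ residuals / dof       # estimated error variance
    cov = sigma2 * np.linalg.inv(X.T @ X)      # coefficient covariance matrix
    return coefs, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # synthetic predictor, nearly a copy of x1
y = 2.0 + 3.0 * x1 + rng.normal(size=n)

ones = np.ones(n)
_, se_simple = ols_standard_errors(np.column_stack([ones, x1]), y)
_, se_multi = ols_standard_errors(np.column_stack([ones, x1, x2]), y)

print(f"std err of x1 alone:   {se_simple[1]:.3f}")
print(f"std err of x1 with x2: {se_multi[1]:.3f}")
```

Because x2 carries almost the same information as x1, the model can no longer tell their effects apart, and the standard error on x1 grows several times over even though the data-generating process has not changed.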