Understanding Regression Error Metrics
Human brains are built to recognize patterns in the world around us. For example, we observe that if we practice our programming every day, our related skills grow. But how do we precisely describe this relationship to other people? How can we describe how strong this relationship is? Luckily, we can describe relationships between phenomena, such as practice and skill, in terms of formal mathematical estimations called regressions.
Regressions are one of the most commonly used tools in a data scientist’s kit. When you learn Python or R, you gain the ability to create regressions in single lines of code without having to deal with the underlying mathematical theory. But this ease can cause us to forget to evaluate our regressions to ensure that they are a sufficient representation of our data.
We can plug our data back into our regression equation to see if the predicted output matches the corresponding observed value seen in the data. The quality of a regression model is how well its predictions match up against actual values, but how do we actually evaluate quality?
Luckily, statisticians have developed error metrics to judge the quality of a model and enable us to compare regressions against other regressions with different parameters. These metrics are short and useful summaries of how well a model fits our data. This article will dive into four common regression metrics and discuss their use cases.
There are many types of regression, but this article will focus exclusively on metrics related to linear regression. Linear regression is one of the most commonly used models in research and business and is the simplest to understand, so it makes sense to use it to start developing your intuition on how models are assessed. The intuition behind many of the metrics we’ll cover here extends to other types of models and their respective metrics.
If you’d like a quick refresher on linear regression, you can consult an introductory tutorial or reference before reading on.
A primer on linear regression
In the context of regression, models refer to mathematical equations used to describe the relationship between two or more variables. In general, these models deal with prediction and estimation of values of interest in our data called outputs. Models look at other aspects of the data called inputs that we believe affect the outputs, and use them to generate estimated outputs. These inputs and outputs have many names that you may have heard before. Inputs can also be called independent variables or predictors, while outputs are also known as responses or dependent variables. Simply speaking, models are just functions where the outputs are some function of the inputs.
The linear part of linear regression refers to the fact that a linear regression model is described mathematically in the form:
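Written out with assumed notation (β for the coefficients, x for the inputs, p for the number of inputs, and y for the output), the standard linear form looks like this:

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon
```

Here β_0 is the intercept, each remaining β_j scales one input x_j, and ϵ is the error term.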
If that looks too mathematical, take solace in the fact that linear thinking is particularly intuitive. If you’ve ever heard that “practice makes perfect,” then you know that more practice means better skills; there is some linear relationship between practice and perfection.
The regression part of linear regression does not refer to some return to a lesser state. Regression here simply refers to the act of estimating the relationship between our inputs and outputs. In particular, regression deals with the modelling of continuous values (think: numbers) as opposed to discrete states (think: categories).
Taken together, a linear regression creates a model that assumes a linear relationship between the inputs and outputs. The higher the inputs are, the higher (or lower, if the relationship is negative) the outputs are.
Our coefficients determine how strong the relationship between the inputs and outputs is and what its direction is. The first coefficient, which has no input attached, is called the intercept, and it sets what the model predicts when all your inputs are 0.
We will not delve into how these coefficients are calculated, but know that there exists a method to calculate the optimal coefficients, given which inputs we want to use to predict the output. Given the coefficients, if we plug in values for the inputs, the linear regression will give us an estimate for what the output should be.
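As a rough illustration of "calculate the coefficients, then plug in inputs", here is a minimal sketch for the single-input case. It assumes ordinary least squares as the fitting method, which the text does not name, and the data and function names are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares (assumed method)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Plug an input value into the fitted linear equation."""
    return intercept + slope * x


# Hypothetical toy data: hours of practice (input) and a skill score (output).
practice_hours = [1.0, 2.0, 3.0, 4.0, 5.0]
skill_score = [3.0, 5.0, 7.0, 9.0, 11.0]

intercept, slope = fit_simple_linear_regression(practice_hours, skill_score)
estimate = predict(intercept, slope, 6.0)
print(intercept, slope, estimate)
```

Because the toy data lies exactly on a line, the fitted model reproduces it perfectly; real data would leave residuals behind.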
As we’ll see, these outputs won’t always be perfect. Unless our data is a perfectly straight line, our model will not precisely hit all of our data points. One of the reasons for this is the ϵ (named “epsilon”) term. This term represents error that comes from sources out of our control, causing the data to deviate slightly from their true position.
Our error metrics will be able to judge the differences between predicted and actual values, but we cannot know how much of the discrepancy is due to this error term. While we cannot ever completely eliminate epsilon, it is useful to retain a term for it in a linear model.
Comparing model predictions against reality
Since our model will produce an output given any input or set of inputs, we can then check these estimated outputs against the actual values that we tried to predict. We call the difference between the actual value and the model’s estimate a residual. We can calculate the residual for every point in our data set, and each of these residuals will be of use in assessment. These residuals will play a significant role in judging the usefulness of a model.
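The residual calculation can be sketched in a few lines. The function name and the numbers below are hypothetical and only illustrate "actual minus estimate" for each point.

```python
def residuals(actual, predicted):
    """Return actual - predicted for each data point."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return [a - p for a, p in zip(actual, predicted)]


# Hypothetical values: what we observed versus what a model estimated.
actual_values = [1.0, 2.5, 4.0]
predicted_values = [1.5, 2.0, 4.0]

example_residuals = residuals(actual_values, predicted_values)
print(example_residuals)
```

A negative residual means the model overestimated that point, a positive one means it underestimated, and zero means an exact hit.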
If the residuals in our collection are small, it implies that the model that produced them does a good job at predicting our output of interest. Conversely, if these residuals are generally large, it implies that the model is a poor estimator.
We technically can inspect all of the residuals to judge the model’s accuracy, but unsurprisingly, this does not scale if we have thousands or millions of data points. Thus, statisticians have developed summary measurements that take our collection of residuals and condense them into a single value that represents the predictive ability of our model.
There are many of these summary statistics, each with its own advantages and pitfalls. For each, we’ll discuss what the statistic represents, its intuition, and its typical use case. We’ll cover:
Mean Absolute Error
Mean Square Error
Mean Absolute Percentage Error
Mean Percentage Error
Note: Even though you see the word error here, it does not refer to the epsilon term from above! The error described in these metrics refers to the residuals!
Staying rooted in real data
In discussing these error metrics, it is easy to get bogged down by the various acronyms and equations used to describe them. To keep ourselves grounded, we’ll use a model that I’ve created using a dataset of video game sales and review scores.
The specifics of the model I’ve created are shown below.
My regression model takes in two inputs (critic score and user score), so it is a multiple variable linear regression. The model took in my data and found that 0.039 and -0.099 were the best coefficients for the inputs. For my model, I chose my intercept to be zero since I’d like to imagine there’d be zero sales for scores of zero. Thus, the intercept term is crossed out. Finally, the error term is crossed out because we do not know its true value in practice. I have shown it because it gives a more detailed picture of what information is encoded in the linear regression equation.
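To make the model concrete, here is a minimal sketch of it as a function. It assumes the two coefficients are listed in the same order as the inputs (0.039 for critic score, -0.099 for user score); the function name and the example scores are hypothetical, and the units of the predicted sales are not stated in the text.

```python
# Assumption: coefficients are paired with the inputs in the order given.
CRITIC_COEFFICIENT = 0.039
USER_COEFFICIENT = -0.099


def predict_sales(critic_score, user_score):
    """Estimate global sales with a zero intercept and no error term."""
    return CRITIC_COEFFICIENT * critic_score + USER_COEFFICIENT * user_score


# Hypothetical review scores for a new game.
zero_case = predict_sales(0, 0)
example_prediction = predict_sales(80, 8)
print(zero_case, example_prediction)
```

With the intercept fixed at zero, scores of zero always produce a prediction of zero sales, which is the behaviour described above.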
Rationale behind the model
Let’s say that I’m a game developer who just created a new game, and I want to know how much money I will make. I don’t want to wait, so I developed a model that predicts total global sales (my output) based on an expert critic’s judgment of the game and general player judgment (my inputs). If both critics and players love the game, then I should make more money… right? When I actually get the critic and user reviews for my game, I can predict how much glorious money I’ll make.
Currently, I don’t know if my model is accurate or not, so I need to calculate my error metrics to check if I should perhaps include more inputs or if my model is even any good!
Mean absolute error
The mean absolute error (MAE) is the simplest regression error metric to understand. We’ll calculate the residual for every data point, taking only the absolute value of each so that negative and positive residuals do not cancel out. We then take the average of all these absolute residuals. Effectively, MAE describes the typical magnitude of the residuals. If you’re unfamiliar with the mean, it is simply the sum of the values divided by how many values there are. The formal equation is shown below:
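Using assumed notation (n data points, y_i for the actual value of point i, and ŷ_i for the model’s prediction), the standard definition of MAE can be written as:

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
```

Each term inside the sum is the absolute value of one residual, and dividing by n turns the total into an average.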
The picture below is a graphical description of the MAE. The green line represents our model’s predictions, and the blue points represent our data.
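The MAE calculation described in this section can be sketched in plain Python. The function name and the numbers are hypothetical; the example also computes the plain mean of the signed residuals to show the cancellation that taking absolute values avoids.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute residuals |actual - predicted|."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical observed values and model predictions.
actual = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]

mae = mean_absolute_error(actual, predicted)

# Without absolute values, positive and negative residuals offset each other.
signed_mean = sum(a - p for a, p in zip(actual, predicted)) / len(actual)
print(mae, signed_mean)
```

The signed mean understates how far off the predictions typically are, which is why MAE averages the magnitudes instead.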