Pearson product-moment correlation coefficient

In statistics, the Pearson product-moment correlation coefficient (sometimes referred to as the PMCC, and typically denoted by r) is a measure of the correlation (linear dependence) between two variables X and Y, giving a value between +1 and −1 inclusive. It is widely used in the sciences as a measure of the strength of linear dependence between two variables. It was developed by Karl Pearson from a similar but slightly different idea introduced by Francis Galton in the 1880s.[1][2] The correlation coefficient is sometimes called "Pearson's r."

[Figure: Several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y is zero.]
Definition
Pearson's correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations:

    ρ_{X,Y} = cov(X, Y) / (σ_X σ_Y) = E[(X − μ_X)(Y − μ_Y)] / (σ_X σ_Y)

The above formula defines the population correlation coefficient, commonly represented by the Greek letter ρ (rho). Substituting estimates of the covariances and variances based on a sample gives the sample correlation coefficient, commonly denoted r:

    r = Σ (X_i − X̄)(Y_i − Ȳ) / ( √(Σ (X_i − X̄)²) · √(Σ (Y_i − Ȳ)²) )
An equivalent expression gives the correlation coefficient as the mean of the products of the standard scores. Based on a sample of paired data (X_i, Y_i), the sample Pearson correlation coefficient is

    r = (1 / (n − 1)) Σ ((X_i − X̄) / s_X) ((Y_i − Ȳ) / s_Y)

where (X_i − X̄)/s_X, X̄, and s_X are the standard score, sample mean, and sample standard deviation, respectively.
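The standard-score formulation can be computed directly. This is a minimal sketch (the function name `pearson_r` is illustrative, not from the article); the data reuse the GNP/poverty example given later in the article, for which r should be exactly 1:

```python
import math

def pearson_r(x, y):
    """Sample Pearson r as the mean (over n - 1) of products of standard scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Sample standard deviations (n - 1 in the denominator)
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))
    # Sum of products of standard scores, divided by n - 1
    return sum(((xi - mx) / sx) * ((yi - my) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

x = [1, 2, 3, 5, 8]
y = [0.11, 0.12, 0.13, 0.15, 0.18]   # y = 0.10 + 0.01 x, perfectly linear
print(pearson_r(x, y))               # ≈ 1.0, up to floating-point rounding
```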
Mathematical properties
The absolute values of both the sample and population Pearson correlation coefficients are less than or equal to 1. Correlations equal to 1 or −1 correspond to data points lying exactly on a line (in the case of the sample correlation), or to a bivariate distribution entirely supported on a line (in the case of the population correlation). The Pearson correlation coefficient is symmetric: corr(X, Y) = corr(Y, X).
A key mathematical property of the Pearson correlation coefficient is that it is invariant to separate changes in location and scale in the two variables. That is, we may transform X to a + bX and transform Y to c + dY, where a, b, c, and d are constants, without changing the correlation coefficient (this fact holds for both the population and sample Pearson correlation coefficients). Note that more general linear transformations do change the correlation: see a later section for an application of this.
The Pearson correlation can be expressed in terms of uncentered moments. Since μ_X = E(X), σ_X² = E[(X − E(X))²] = E(X²) − [E(X)]², and likewise for Y, and since

    E[(X − μ_X)(Y − μ_Y)] = E(XY) − E(X)E(Y),

the correlation can also be written as

    ρ_{X,Y} = (E(XY) − E(X)E(Y)) / ( √(E(X²) − [E(X)]²) · √(E(Y²) − [E(Y)]²) )
Alternative formulae for the sample Pearson correlation coefficient are also available:

    r = (n Σ x_i y_i − Σ x_i Σ y_i) / ( √(n Σ x_i² − (Σ x_i)²) · √(n Σ y_i² − (Σ y_i)²) )
The above formula conveniently suggests a single-pass algorithm for calculating sample correlations, but, depending on the numbers involved, it can sometimes be numerically unstable.
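A sketch of that single-pass accumulation (the function name is illustrative); the subtractions of large, nearly equal sums at the end are where precision can be lost:

```python
import math

def pearson_single_pass(x, y):
    # Accumulate the five running sums in one pass over the data.
    n = 0
    sx = sy = sxx = syy = sxy = 0.0
    for xi, yi in zip(x, y):
        n += 1
        sx += xi
        sy += yi
        sxx += xi * xi
        syy += yi * yi
        sxy += xi * yi
    # These differences can suffer catastrophic cancellation when the
    # means are large relative to the spread of the data.
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy)
    return num / den

print(pearson_single_pass([1, 2, 3, 5, 8],
                          [0.11, 0.12, 0.13, 0.15, 0.18]))  # ≈ 1.0
```

A numerically safer alternative is a two-pass computation that first finds the means and then accumulates the centered sums.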
Interpretation
The correlation coefficient ranges from −1 to 1. A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases. A value of −1 implies that all data points lie on a line for which Y decreases as X increases. A value of 0 implies that there is no linear correlation between the variables.
More generally, note that (X_i − X̄)(Y_i − Ȳ) is positive if and only if X_i and Y_i lie on the same side of their respective means. Thus the correlation coefficient is positive if X_i and Y_i tend to be simultaneously greater than, or simultaneously less than, their respective means. The correlation coefficient is negative if X_i and Y_i tend to lie on opposite sides of their respective means.
Geometric interpretation
[Figure: Regression lines for y = g_x(x) [red] and x = g_y(y) [blue]. For uncentered data, the correlation coefficient corresponds with the cosine of the angle between both possible regression lines y = g_x(x) and x = g_y(y).]
For centered data (i.e., data which have been shifted by the sample mean so as to have an average of zero), the correlation coefficient can also be viewed as the cosine of the angle between the two vectors of samples drawn from the two random variables (see below).

Some practitioners prefer an uncentered (non-Pearson-compliant) correlation coefficient. See the example below for a comparison.
As an example, suppose five countries are found to have gross national products of 1, 2, 3, 5, and 8 billion dollars, respectively. Suppose the same five countries (in the same order) are found to have 11%, 12%, 13%, 15%, and 18% poverty. Then let x and y be ordered 5-element vectors containing the above data: x = (1, 2, 3, 5, 8) and y = (0.11, 0.12, 0.13, 0.15, 0.18).

By the usual procedure for finding the angle between two vectors (see dot product), the uncentered correlation coefficient is:

    cos θ = (x · y) / (‖x‖ ‖y‖) = 2.93 / (√103 · √0.0983) ≈ 0.920814711

Note that the above data were deliberately chosen to be perfectly correlated: y = 0.10 + 0.01 x. The Pearson correlation coefficient must therefore be exactly one. Centering the data (shifting x by E(x) = 3.8 and y by E(y) = 0.138) yields x = (−2.8, −1.8, −0.8, 1.2, 4.2) and y = (−0.028, −0.018, −0.008, 0.012, 0.042), from which

    cos θ = 0.308 / (√30.8 · √0.00308) = 1 = r,

as expected.
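The uncentered and centered cosines from the example above can be reproduced numerically (a minimal sketch; `cos_angle` is an illustrative name):

```python
import math

def cos_angle(u, v):
    """Cosine of the angle between two vectors: u.v / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

x = [1, 2, 3, 5, 8]
y = [0.11, 0.12, 0.13, 0.15, 0.18]

uncentered = cos_angle(x, y)          # ≈ 0.9208, not 1
mx, my = sum(x) / len(x), sum(y) / len(y)
centered = cos_angle([v - mx for v in x], [v - my for v in y])
print(uncentered, centered)           # the centered cosine equals Pearson r = 1
```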
Interpretation of the size of a correlation:

    Correlation   Negative         Positive
    None          −0.09 to 0.0     0.0 to 0.09
    Small         −0.3 to −0.1     0.1 to 0.3
    Medium        −0.5 to −0.3     0.3 to 0.5
    Large         −1.0 to −0.5     0.5 to 1.0
Several authors[3] have offered guidelines for the interpretation of a correlation coefficient. Cohen (1988)[3] has observed, however, that all such criteria are in some ways arbitrary and should not be observed too strictly. The interpretation of a correlation coefficient depends on the context and purpose. A correlation of 0.9 may be very low if one is verifying a physical law using high-quality instruments, but may be regarded as very high in the social sciences, where there may be a greater contribution from complicating factors.
Inference
[Figure: A graph showing the minimum value of Pearson's correlation coefficient that is significantly different from zero at the 0.05 level, for a given sample size.]

Statistical inference based on Pearson's correlation coefficient often focuses on one of the following two aims. One aim is to test the null hypothesis that the true correlation coefficient is ρ, based on the value of the sample correlation coefficient r. The other aim is to construct a confidence interval around r that has a given probability of containing ρ.
Randomization approaches
Permutation tests provide a direct approach to performing hypothesis tests and constructing confidence intervals. A permutation test for Pearson's correlation coefficient involves the following two steps:

(i) Using the original paired data (x_i, y_i), randomly redefine the pairs to create a new data set (x_i, y_{i′}), where the i′ are a permutation of the set {1, ..., n}. The permutation i′ is selected randomly, with equal probabilities placed on all n! possible permutations. This is equivalent to drawing the i′ randomly "without replacement" from the set {1, ..., n}. A closely related and equally justified (bootstrapping) approach is to separately draw the i and the i′ "with replacement" from {1, ..., n}.

(ii) Construct a correlation coefficient r from the randomized data.

To perform the permutation test, repeat (i) and (ii) a large number of times. The p-value for the permutation test is the proportion of the r values generated in step (ii) that are larger than the Pearson correlation coefficient that was calculated from the original data. Here "larger" can mean either that the value is larger in magnitude, or larger in signed value, depending on whether a two-sided or one-sided test is desired.
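The two steps can be sketched as follows (a minimal one-sided version; the function names and the permutation count are illustrative choices, not prescribed by the article):

```python
import random

def pearson_r(a, b):
    """Sample Pearson correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    den = (sum((ai - ma) ** 2 for ai in a)
           * sum((bi - mb) ** 2 for bi in b)) ** 0.5
    return num / den

def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """One-sided p-value: proportion of permuted r values >= the observed r."""
    rng = random.Random(seed)
    r_obs = pearson_r(x, y)
    yp = list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(yp)                # step (i): randomly re-pair the data
        if pearson_r(x, yp) >= r_obs:  # step (ii): recompute r
            count += 1
    return count / n_perm

p = permutation_pvalue([1, 2, 3, 5, 8], [0.11, 0.12, 0.13, 0.15, 0.18])
print(p)   # small, since the original pairing is perfectly correlated
```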
The bootstrap can be used to construct confidence intervals for Pearson's correlation coefficient. In the "non-parametric" bootstrap, n pairs (x_i, y_i) are resampled "with replacement" from the observed set of n pairs, and the correlation coefficient r is calculated based on the resampled data. This process is repeated a large number of times, and the empirical distribution of the resampled r values is used to approximate the sampling distribution of the statistic. A 95% confidence interval for ρ can be defined as the interval spanning from the 2.5th to the 97.5th percentile of the resampled r values.
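A sketch of the percentile bootstrap interval just described (function names are illustrative; degenerate resamples with zero variance are simply skipped):

```python
import random

def pearson_r(a, b):
    """Sample Pearson correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    den = (sum((ai - ma) ** 2 for ai in a)
           * sum((bi - mb) ** 2 for bi in b)) ** 0.5
    return num / den

def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample the pairs with replacement."""
    rng = random.Random(seed)
    n = len(x)
    rs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            rs.append(pearson_r([x[i] for i in idx], [y[i] for i in idx]))
        except ZeroDivisionError:
            continue  # degenerate resample (e.g. the same pair drawn n times)
    rs.sort()
    lo = rs[int((alpha / 2) * len(rs))]
    hi = rs[int((1 - alpha / 2) * len(rs)) - 1]
    return lo, hi

lo, hi = bootstrap_ci([1, 2, 3, 5, 8], [0.11, 0.12, 0.13, 0.15, 0.18])
print(lo, hi)   # both ≈ 1 here: every resample of perfectly linear data has r = 1
```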
Approaches based on mathematical approximations
For approximately Gaussian data, the sampling distribution of Pearson's correlation coefficient approximately follows Student's t-distribution with n − 2 degrees of freedom. Specifically, if the underlying variables have a bivariate normal distribution, the variable

    t = r √((n − 2) / (1 − r²))

has a Student's t-distribution in the null case (zero correlation).[4] This also holds approximately even if the observed values are non-normal, provided sample sizes are not very small.[5] For constructing confidence intervals and performing power analyses, the inverse of this transformation is also needed:

    r = t / √(n − 2 + t²)
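The transformation between r and t and its inverse amount to the following (a minimal sketch; function names are illustrative):

```python
import math

def t_from_r(r, n):
    # t statistic with n - 2 degrees of freedom under the null of zero correlation
    return r * math.sqrt((n - 2) / (1 - r * r))

def r_from_t(t, n):
    # Inverse transformation, needed for confidence intervals and power analyses
    return t / math.sqrt(n - 2 + t * t)

t = t_from_r(0.3, 50)
print(t)                 # ≈ 2.179
print(r_from_t(t, 50))   # round-trips back to 0.3
```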
Alternatively, large-sample approaches can be used.
Early work on the distribution of the sample correlation coefficient was carried out by R. A. Fisher[6][7] and A. K. Gayen.[8] Another early paper[9] provides graphs and tables for general values of ρ, for small sample sizes, and discusses computational approaches.
Fisher Transformation
In practice, confidence intervals and hypothesis tests relating to ρ are usually carried out using the Fisher transformation:

    F(r) = ½ ln((1 + r) / (1 − r)) = artanh(r)

If F(r) is the Fisher transformation of r, and n is the sample size, then F(r) approximately follows a normal distribution with

    mean F(ρ) = artanh(ρ)

and standard error

    SE = 1 / √(n − 3).

Thus, a z-score is

    z = (F(r) − F(ρ₀)) √(n − 3)

under the null hypothesis that ρ = ρ₀, given the assumption that the sample pairs are independent and identically distributed and follow a bivariate normal distribution. Thus an approximate p-value can be obtained from a normal probability table. For example, if z = 2.2 is observed and a two-sided p-value is desired to test the null hypothesis that ρ = 0, the p-value is 2·Φ(−2.2) = 0.028, where Φ is the standard normal cumulative distribution function.
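The z-test based on the Fisher transformation needs only the standard library (`fisher_z_pvalue` is an illustrative name; Φ(−|z|) is obtained from the complementary error function):

```python
import math

def fisher_z_pvalue(r, n, rho0=0.0):
    """Two-sided p-value for H0: rho = rho0, via the Fisher transformation."""
    z = (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)
    # 2 * Phi(-|z|) = erfc(|z| / sqrt(2)), where Phi is the standard normal CDF
    return math.erfc(abs(z) / math.sqrt(2))

# The article's example: z = 2.2 gives a two-sided p-value of about 0.028
print(math.erfc(2.2 / math.sqrt(2)))
# A full computation from r and n:
print(fisher_z_pvalue(0.3, 50))   # z ≈ 2.12, p ≈ 0.034
```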
Confidence Intervals
To obtain a confidence interval for ρ, we first compute a confidence interval for F(ρ):

    artanh(r) ± z_{α/2} / √(n − 3)

The inverse Fisher transformation brings the interval back to the correlation scale.
For example, suppose we observe r = 0.3 with a sample size of n = 50, and we wish to obtain a 95% confidence interval for ρ. The transformed value is artanh(r) = 0.30952, so the confidence interval on the transformed scale is 0.30952 ± 1.96/√47, or (0.023624, 0.595415). Converting back to the correlation scale yields (0.024, 0.534).
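The worked example can be checked directly (`fisher_ci` is an illustrative name; 1.96 is the two-sided 95% normal critical value):

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """CI for rho: Fisher-transform, add/subtract z * SE, transform back."""
    center = math.atanh(r)             # artanh(r), the Fisher transform
    half = z_crit / math.sqrt(n - 3)   # standard error is 1 / sqrt(n - 3)
    return math.tanh(center - half), math.tanh(center + half)

lo, hi = fisher_ci(0.3, 50)
print(round(lo, 3), round(hi, 3))   # → 0.024 0.534, matching the example
```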