How to compute confidence interval for Pearson’s r? A brief guide

Shan Dou
4 min read · May 30, 2018


1. Brush-up on Pearson’s r

We use Pearson’s r (a.k.a. the correlation coefficient) to quantify the strength and direction of the linear correlation between an independent variable x and a dependent variable y:

r = cov(x, y) / (Sx × Sy)

where cov(x, y) is the covariance of x and y, which measures how much x and y vary together, and Sx and Sy are the sample standard deviations of x and y (i.e., with Bessel’s correction (n-1) applied when computing the standard deviation). Note that r is a unitless ratio, not a percentage.

The magnitude of r tells us how closely the data fall along a straight line. If the data fall perfectly along a straight line in the positive direction, we have r = 1; if they fall perfectly along a straight line in the negative direction, we get r = -1. If x and y are not linearly correlated at all, r = 0.
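To make the formula concrete, here is a quick sketch in Python (the data values are made up for illustration): it computes r directly from the covariance and the sample standard deviations, then checks the result against scipy.stats.pearsonr.

```python
import numpy as np
from scipy import stats

# Made-up sample data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Pearson's r from the definition: covariance divided by the product of the
# sample standard deviations (both use Bessel's correction, i.e., n - 1)
r_manual = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# The same quantity from scipy
r_scipy, p_value = stats.pearsonr(x, y)

print(r_manual, r_scipy)  # the two values agree
```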

2. Hypothesis test for Pearson’s r

Before computing confidence intervals, we should first frame the underlying hypothesis test. When we try to find a linear correlation between two variables from sampled data, sampling errors are inevitable, so we always want to check whether the observed correlation is robust against sampling error.

In this light, we can start articulating the null and alternative hypotheses. Denoting the sample correlation coefficient as r and the population correlation coefficient as ⍴, we can state the hypotheses as follows:

  • Null hypothesis H₀: ⍴ = 0 (the populations of x and y are not correlated)
  • Alternative hypothesis Hₐ: ⍴ < 0 (one-tail test in the negative direction), ⍴ > 0 (one-tail test in the positive direction), or ⍴ ≠ 0 (two-tail test in both directions)

Next, let’s carry out a t-test for Pearson’s r:

t = r × √(n - 2) / √(1 - r²)

where r is Pearson’s r computed from the sampled data and n is the sample size. The degrees of freedom for the t-test are n - 2.
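As a quick illustration (the values of r and n below are made up), the t-statistic follows directly from this formula:

```python
import numpy as np

r = 0.5   # hypothetical sample correlation
n = 25    # hypothetical sample size

# t-statistic for testing H0: rho = 0, with n - 2 degrees of freedom
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
print(t_stat)  # ≈ 2.77 for these made-up values
```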

Let’s quickly go through an example: given a sample size of n = 25, we obtain a t-statistic of 2.71. If we conduct a non-directional (i.e., two-tail) test with significance level α = 0.05, what decision should we make about the hypothesis on the population correlation coefficient?

Here is a code snippet in Python (a minimal sketch that uses scipy.stats to look up the critical t value):
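```python
from scipy import stats

n = 25        # sample size
alpha = 0.05  # significance level for a two-tail test
df = n - 2    # degrees of freedom for Pearson's r

# Critical t value: the upper alpha/2 quantile of the t distribution
t_critical = stats.t.ppf(1 - alpha / 2, df)
print(t_critical)  # ≈ 2.069
```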

The above code will return a t_critical value of 2.069. Given that our t-statistic is larger than the critical t value, we can conclude that there is enough evidence to reject the null hypothesis. In other words, there is a significant linear relationship between x and y.

3. How to compute the confidence interval for the population correlation coefficient ⍴?

Many linear regression software tools can also provide a 95% confidence interval for Pearson’s r. This is also an effective way of telling us whether there is indeed a significant linear relationship between x and y: if the CI includes 0, we do not have enough evidence to reject the null hypothesis.

How do we compute the confidence interval for Pearson’s r? It is slightly more complicated than the cases of the standard normal distribution and Student’s t distribution. The root of the complication is that r does not follow the bell-shaped normal distribution; its sampling distribution is skewed whenever ⍴ is nonzero. To work around this, the confidence interval calculation for ⍴ requires the following three steps (a short Python sketch that puts them together appears right after the list):

1. Convert r to z’ using Fisher’s z’ transformation: z’ = 0.5 × ln[(1 + r) / (1 - r)]

2. Compute the confidence interval using the resulting z’ value: z’ ± z’_critical × SE

where z’_critical is the critical value from the standard normal distribution for the chosen significance level (easily obtained from a z-table), and SE is the standard error of z’: SE = 1 / √(n - 3)

3. Convert the confidence interval limits from z’ back into r values: r = (e^(2z’) - 1) / (e^(2z’) + 1)
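Putting the three steps together, a compact helper might look like the sketch below (the function name r_confidence_interval is made up for illustration; scipy.stats.norm supplies the critical value):

```python
import numpy as np
from scipy import stats

def r_confidence_interval(r, n, alpha=0.05):
    """Confidence interval for the population correlation coefficient."""
    # Step 1: Fisher's z' transform of the sample correlation
    z_prime = np.arctanh(r)                     # = 0.5 * ln((1 + r) / (1 - r))

    # Step 2: normal-theory interval in z' space
    se = 1 / np.sqrt(n - 3)                     # standard error of z'
    z_critical = stats.norm.ppf(1 - alpha / 2)  # e.g., 1.96 for a 95% CI
    lo_z = z_prime - z_critical * se
    hi_z = z_prime + z_critical * se

    # Step 3: transform the interval limits back to the r scale
    return np.tanh(lo_z), np.tanh(hi_z)
```

Calling r_confidence_interval(-0.654, 34), for instance, reproduces the interval derived step by step in the worked example below.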

Let’s get to work!

Let’s use an example to drive the point home (a similar example can be found at http://onlinestatbook.com/):

Task: Compute the 95% confidence interval (two-tail) for the population correlation coefficient ⍴, given n = 34 and r = -0.654.

Solution:

[Step 1] Compute the critical value z’_critical
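In Python, this is a one-line lookup from the standard normal distribution (a sketch using scipy.stats.norm):

```python
from scipy import stats

alpha = 0.05
# Two-tail critical value from the standard normal distribution
z_critical = stats.norm.ppf(1 - alpha / 2)
print(z_critical)  # ≈ 1.96
```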

We obtain z’_critical = 1.96

[Step 2] Compute confidence interval in terms of z’
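In Python (a self-contained sketch; the numbers match the result quoted below):

```python
import numpy as np
from scipy import stats

r = -0.654
n = 34

z_prime = np.arctanh(r)                    # Fisher's z' transform, ≈ -0.78
se = 1 / np.sqrt(n - 3)                    # standard error of z', ≈ 0.18
z_critical = stats.norm.ppf(1 - 0.05 / 2)  # ≈ 1.96, as in Step 1

lo_z = z_prime - z_critical * se
hi_z = z_prime + z_critical * se
print(lo_z, hi_z)  # ≈ -1.13, -0.43
```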

We obtain the 95% confidence interval in terms of z’: (-1.13, -0.43)

[Step 3] Convert the z’ interval back to r values, and we obtain (-0.81, -0.41) as the confidence interval for the population correlation coefficient. Because this interval is far from 0, we can conclude that there is a significant negative correlation between the dependent and independent variables.
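The back-conversion in Python, continuing from the lo_z and hi_z values computed in the Step 2 snippet:

```python
# Step 3: convert the z' limits back to the r scale
lo_r = np.tanh(lo_z)  # ≈ -0.81
hi_r = np.tanh(hi_z)  # ≈ -0.41
print(lo_r, hi_r)
```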

Although most stats libraries can compute this confidence interval for us, knowing what is under the hood is always helpful for solidifying our understanding. Happy stats!
