
coefficient of determination, in statistics, R2 (or r2), a measure that assesses the ability of a model to predict or explain an outcome in the linear regression setting. More specifically, R2 indicates the proportion of the variance in the dependent variable (Y) that is predicted or explained by linear regression and the predictor variable (X, also known as the independent variable).

In general, a high R2 value indicates that the model is a good fit for the data, although interpretations of fit depend on the context of analysis. An R2 of 0.35, for example, indicates that 35 percent of the variation in the outcome has been explained just by predicting the outcome using the covariates included in the model. That percentage might be a very high portion of variation to predict in a field such as the social sciences; in other fields, such as the physical sciences, one would expect R2 to be much closer to 100 percent. The theoretical minimum R2 is 0. However, since linear regression is based on the best possible fit, R2 will always be greater than zero, even when the predictor and outcome variables bear no relationship to one another.

R2 increases (or at least does not decrease) when a new predictor variable is added to the model, even if the new predictor is not associated with the outcome. To account for that effect, the adjusted R2 (typically denoted with a bar over the R in R2) incorporates the same information as the usual R2 but then also penalizes for the number of predictor variables included in the model. As a result, R2 increases as new predictors are added to a multiple linear regression model, but the adjusted R2 increases only if the increase in R2 is greater than one would expect from chance alone. In such a model, the adjusted R2 is the most realistic estimate of the proportion of the variation that is predicted by the covariates included in the model.
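
In symbols, the adjusted R2 is commonly computed as

\[\bar{R}^2=1-(1-R^2)\frac{n-1}{n-k-1}\]

where n is the number of observations and k is the number of predictor variables in the model.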

When only one predictor is included in the model, the coefficient of determination is mathematically related to the Pearson correlation coefficient, r. Squaring the correlation coefficient results in the value of the coefficient of determination. The coefficient of determination can also be found with the following formula: R2 = MSS/TSS = (TSS − RSS)/TSS, where MSS is the model sum of squares (also known as ESS, or explained sum of squares), which is the sum of the squared differences between the predictions from the linear regression and the mean of the outcome variable; TSS is the total sum of squares associated with the outcome variable, which is the sum of the squared differences between the measurements and their mean; and RSS is the residual sum of squares, which is the sum of the squared differences between the measurements and the predictions from the linear regression.
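
Written out explicitly, with \(y_i\) denoting the measurements, \(\hat{y}_i\) the predictions from the linear regression, and \(\bar{y}\) the mean of the measurements, these quantities are

\[TSS=\sum_{i=1}^{n}(y_i-\bar{y})^2,\qquad MSS=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2,\qquad RSS=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2.\]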

The coefficient of determination shows only association. As with linear regression, it is impossible to use R2 to determine whether one variable causes the other. In addition, the coefficient of determination shows only the magnitude of the association, not whether that association is statistically significant.

Felicity Boyd Enders

Let's start our investigation of the coefficient of determination, r2, by looking at two different examples — one example in which the relationship between the response y and the predictor x is very weak and a second example in which the relationship between the response y and the predictor x is fairly strong. If our measure is going to work well, it should be able to distinguish between these two very different situations.

Here's a plot illustrating a very weak relationship between y and x. There are two lines on the plot, a horizontal line placed at the average response, \(\bar{y}\), and a shallow-sloped estimated regression line, \(\hat{y}\). Note that the slope of the estimated regression line is not very steep, suggesting that as the predictor x increases, there is not much of a change in the average response y. Also, note that the data points do not "hug" the estimated regression line:

\(SSR=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2=119.1\)

\(SSE=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2=1708.5\)

\(SSTO=\sum_{i=1}^{n}(y_i-\bar{y})^2=1827.6\)

The calculations on the right of the plot show contrasting "sums of squares" values:

  • SSR is the "regression sum of squares" and quantifies how far the estimated sloped regression line, \(\hat{y}_i\), is from the horizontal "no relationship line," the sample mean or \(\bar{y}\).
  • SSE is the "error sum of squares" and quantifies how much the data points, \(y_i\), vary around the estimated regression line, \(\hat{y}_i\).
  • SSTO is the "total sum of squares" and quantifies how much the data points, \(y_i\), vary around their mean, \(\bar{y}\).

Note that SSTO = SSR + SSE. The sums of squares appear to tell the story pretty well. They tell us that most of the variation in the response y (SSTO = 1827.6) is just due to random variation (SSE = 1708.5), not due to the regression of y on x (SSR = 119.1). You might notice that SSR divided by SSTO is 119.1/1827.6 or 0.065. Do you see where this quantity appears on the above fitted line plot?

Contrast the above example with the following one in which the plot illustrates a fairly convincing relationship between y and x. The slope of the estimated regression line is much steeper, suggesting that as the predictor x increases, there is a fairly substantial change (decrease) in the response y. And, here, the data points do "hug" the estimated regression line:

\(SSR=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2=6679.3\)

\(SSE=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2=1708.5\)

\(SSTO=\sum_{i=1}^{n}(y_i-\bar{y})^2=8487.8\)

The sums of squares for this dataset tell a very different story, namely that most of the variation in the response y (SSTO = 8487.8) is due to the regression of y on x (SSR = 6679.3) not just due to random error (SSE = 1708.5). And, SSR divided by SSTO is 6679.3/8487.8 or 0.799, which again appears on the fitted line plot.

The previous two examples have suggested how we should define the measure formally. In short, the "coefficient of determination" or "r-squared value," denoted r2, is the regression sum of squares divided by the total sum of squares. Alternatively, since SSTO = SSR + SSE, the quantity r2 also equals one minus the ratio of the error sum of squares to the total sum of squares:

\[r^2=\frac{SSR}{SSTO}=1-\frac{SSE}{SSTO}\]
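
If you would like to check these relationships numerically, here is a minimal Python sketch (using NumPy and made-up data rather than the datasets plotted above) that fits a simple linear regression and verifies that SSTO = SSR + SSE and that the two formulas for r2 give the same answer:

import numpy as np

# Made-up example data; any paired x and y values would work here
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 3.6, 4.4, 5.2, 5.8, 6.9, 7.5])

# Least-squares fit of the simple linear regression: y-hat = b0 + b1*x
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x
y_bar = y.mean()

ssto = np.sum((y - y_bar) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y_bar) ** 2)   # regression sum of squares
sse = np.sum((y - y_hat) ** 2)       # error sum of squares

print(np.isclose(ssto, ssr + sse))   # True: SSTO = SSR + SSE
print(ssr / ssto)                    # r-squared as SSR/SSTO
print(1 - sse / ssto)                # the same value as 1 - SSE/SSTO
print(np.corrcoef(x, y)[0, 1] ** 2)  # also equals the squared correlation r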

Here are some basic characteristics of the measure:

  • Since r2 is a proportion, it is always a number between 0 and 1.
  • If r2 = 1, all of the data points fall perfectly on the regression line. The predictor x accounts for all of the variation in y!
  • If r2 = 0, the estimated regression line is perfectly horizontal. The predictor x accounts for none of the variation in y!

We've learned the interpretation for the two easy cases — when r2 = 0 or r2 = 1 — but, how do we interpret r2 when it is some number between 0 and 1, like 0.23 or 0.57, say? Here are two similar, yet slightly different, ways in which the coefficient of determination r2 can be interpreted. We say either:

"r2 ×100 percent of the variation in y is reduced by taking into account predictor x"

or:

"r2 ×100 percent of the variation in y is "explained by" the variation in predictor x."

Many statisticians prefer the first interpretation. I tend to favor the second. The risk with using the second interpretation — and hence why "explained by" appears in quotes — is that it can be misunderstood as suggesting that the predictor x causes the change in the response y. Association is not causation. That is, just because a dataset is characterized by having a large r-squared value, it does not imply that x causes the changes in y. As long as you keep the correct meaning in mind, it is fine to use the second interpretation. A variation on the second interpretation is to say, "r2 ×100 percent of the variation in y is accounted for by the variation in predictor x."

Students often ask: "what's considered a large r-squared value?" It depends on the research area. Social scientists who are often trying to learn something about the huge variation in human behavior will tend to find it very hard to get r-squared values much above, say 25% or 30%. Engineers, on the other hand, who tend to study more exact systems would likely find an r-squared value of just 30% unacceptable. The moral of the story is to read the literature to learn what typical r-squared values are for your research area!

Let's revisit the skin cancer mortality example (skincancer.txt). Any statistical software that performs simple linear regression analysis will report the r-squared value for you, which in this case is 67.98% or 68% to the nearest whole number.

We can say that 68% of the variation in the skin cancer mortality rate is reduced by taking into account latitude. Or, we can say — with knowledge of what it really means — that 68% of the variation in skin cancer mortality is "explained by" latitude.
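
For readers who want to reproduce this, here is one way it could be done in Python with pandas and statsmodels. This is only a sketch: it assumes skincancer.txt is whitespace-delimited and that the latitude and mortality columns are named "Lat" and "Mort", so the column names may need to be adjusted to match the actual file:

import pandas as pd
import statsmodels.api as sm

# Assumed (hypothetical) column names "Lat" and "Mort"; adjust to the real file
df = pd.read_csv("skincancer.txt", sep=r"\s+")

X = sm.add_constant(df["Lat"])     # predictor plus an intercept column
fit = sm.OLS(df["Mort"], X).fit()  # regress mortality on latitude

print(fit.rsquared)                # the r-squared value reported by the software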

What does it mean when the coefficient of determination is 0?

A value of 0.0 indicates that the model fails to explain any of the variation in the data: the regression does no better than simply predicting the mean of the outcome. By contrast, a value of 1.0 indicates a perfect fit and thus a highly reliable model for future forecasts.

When given as a percent, is the coefficient of determination always between 0 and 100%?

Yes. When given as a percent, the coefficient of determination is always between 0% and 100%. For example, a coefficient of determination of 0.6688 states that about 66.88% of the variation in goals scored per game is explained by minutes on the ice.

Is the coefficient of determination always between 0 and 1?

It is always between 0 and 1. It can never be negative, since it is a squared value.

Can the coefficient of determination exceed 100%?

No. The coefficient of determination is a statistical measurement that examines how differences in one variable can be explained by differences in a second variable when predicting the outcome of a given event. Values can range only from 0.00 to 1.00, or 0 to 100%.
