
Are We Comparing Apples or Squared Apples? The Proportion of Explained Variance Exaggerates Differences Between Effects

Submission status
Accepted

Submission Editor
Emil O. W. Kirkegaard

Author
Marco Del Giudice

Title
Are we comparing apples or squared apples? The proportion of explained variance exaggerates differences between effects

Abstract

This brief note addresses a known problem whose implications are still not widely appreciated: using the proportion of explained variance as an index of effect size does not just distort the real-world magnitude of individual effects, but also exaggerates the differences between effects, which may lead to strikingly incorrect judgements of relative importance. Luckily, a meaningful and interpretable “effect ratio” can be easily calculated as the square root of the ratio between proportions of explained variance. In a variety of practical examples, effect ratios tell a different story than variance components, and suggest a different perspective on certain canonical results (e.g., regarding the role of the shared environment in the development of psychological traits). This simple but highly consequential point should be understood more widely, to help researchers avoid fallacious interpretations of empirical findings.
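
The "effect ratio" described in the abstract can be sketched in a few lines. This is a minimal illustration, not code from the paper; the 50%/5% and 50%/10% pairs are illustrative values mirroring the heritability vs. shared-environment and polygenic-score comparisons discussed in the thread below.

```python
import math

def effect_ratio(r2_a, r2_b):
    """Square root of the ratio between two proportions of explained variance."""
    return math.sqrt(r2_a / r2_b)

# A variance component of 50% vs. one of 5%: the variance ratio is 10,
# but on the scale of the original units the effect is only ~3.16 times larger.
print(effect_ratio(0.50, 0.05))   # ~3.162
print(effect_ratio(0.50, 0.10))   # ~2.236 (e.g., 50% heritability vs. a 10% polygenic score)
```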

Keywords
explained variance, variance components, effect size, correlation


Reviewers ( 0 / 0 / 2 )
Reviewer 1: Accept
Reviewer 2: Accept

Wed 05 May 2021 21:43

Bot

Author has updated the submission to version #2

Bot

Author has updated the submission to version #3

Reviewer

Thanks for the opportunity to review this interesting methodological brief-report communication. I think it has significant value both as a methodological refresher for seasoned investigators and as a novel pedagogic tool for junior scientists, particularly in fields where the relative magnitudes of predictors or indicator variables often break along biological (genetic) and environmental (social) conceptual and empirical lines. Below I make a few suggestions that I think will sharpen the manuscript.

- Most broadly, the piece gives the impression that the author was motivated by attention to a particular literature--namely the behavior genetics (BG) literature--when crafting it. Heritabilities feature prominently as pedagogic examples (albeit not exclusively--e.g., the Big Five/social-personality literature), and it occurred to me that drawing out the author's motivation for the piece more sharply would contextualize and strengthen the cogent points it makes.

- 2nd paragraph, 3rd sentence reads improperly. I believe "when" should be replaced with "while" to clarify the intended meaning of the sentence.

- A more general issue is whether this confusion is truly as widespread as the author implies. I don't recall comparisons in the social sciences--or at least in the work I have done and am most familiar with--using R^2 to evaluate relative comparisons of predictors or indicator components; rather, the comparisons I am most familiar with have always used standardized betas for rival predictors. This may be a feature of the work under my purview, but I would also note that standardized betas are produced by default in SPSS MLR output (along with the model ANOVA, which provides variance components, R^2, etc.). Related: in what way does this piece move beyond or advance (or serve as an ancillary to) the information that can be gleaned from Cohen's (1988, 2nd edition) treatise? A sentence or two addressing these points would be useful, I think.

- Some clarification of the author's meaning of the standardized effect ratio as it relates to scenarios *across different outcomes* would be useful. This is because in the case of disparate outcomes, effects in MLR (and SEM) will often be standardized as a function of different Y outcome variances, so unless the two 'standardized' variables being compared share the same Y (i.e., are within the same model), the meaning may be more or less informative.

- On p. 3, the reciprocal of the effect ratio (as a unit of interpretation) should be introduced in the 1st full paragraph (first didactic example) rather than the 2nd full paragraph, as this way of interpreting the relationship may be more intuitive to a given reader. Its mention can be retained in the 3rd paragraph (second didactic example) for consistency.

- Related to the point above: it may be useful to draw a juxtaposition with the IRT literature to further clarify. In IRT, reliability can be computed as 1 minus the squared reciprocal of the square root of the information value for scores in a specified latent trait range; the reciprocal of the square root of the information value provides an estimate of the standard error of latent trait measurement in that range, which, when squared, provides an index of error variance in the same range.

I say this because, to my surprise, in some of the pedagogic examples in the piece, 1 minus the squared reciprocal does approximately yield the effect-ratio percentage difference discussed (e.g., the Polderman et al. example and the criminality and substance abuse example), whereas in others it does not (e.g., the following 50/5 heritability/shared-environment example). The distinction would be pedagogically useful, I think, to show how similar ways of parsing data can be used to calculate different metrics that share some underlying formal operations (especially since CTT reliability is discussed in the piece). Of course, one may also take the view that it complicates the focal issue, but a brief reaction to the suggestion would be informative nonetheless.
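
The reviewer's IRT formula chain can be sketched numerically. This is only an illustration of the arithmetic (SE = 1/sqrt(information), error variance = SE squared, reliability = 1 minus error variance); the information value of 10 is an arbitrary example number, not taken from the paper.

```python
import math

def irt_reliability(information):
    """Reliability in a latent-trait range: SE = 1/sqrt(information);
    error variance = SE**2 = 1/information; reliability = 1 - error variance."""
    se = 1.0 / math.sqrt(information)
    error_variance = se ** 2
    return 1.0 - error_variance

print(irt_reliability(10.0))   # 0.9
```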

One more general suggestion is to include a mock-up of a bivariate data table demonstrating that the square root of the variance ratio is equivalent to the beta ratio for a given pair of values.
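
Such a mock-up can be sketched in a few lines: in a bivariate regression the standardized beta equals Pearson's r, and R^2 = beta^2, so the ratio of betas necessarily equals the square root of the ratio of explained variances. The data values below are invented purely for illustration.

```python
import math

def std_beta(x, y):
    """Standardized slope of a bivariate regression (equal to Pearson's r)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(ssx * ssy)

# Invented data: two predictors of the same outcome y
y  = [1.0, 2.0, 3.0, 4.0, 5.0]
x1 = [1.1, 1.9, 3.2, 3.8, 5.0]   # stronger predictor
x2 = [2.0, 1.0, 3.5, 2.5, 4.0]   # weaker predictor

b1, b2 = std_beta(x1, y), std_beta(x2, y)
r2_1, r2_2 = b1 ** 2, b2 ** 2           # bivariate R^2 = beta^2

# The beta ratio equals the square root of the variance ratio
print(b1 / b2, math.sqrt(r2_1 / r2_2))
```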

Reviewer

-- Minor revisions -- 

I found this paper interesting, and I like that it has several easily understandable examples.

That said, it is basically just an opinion, and is something many (or at least some) people are already aware of.

I would argue that it is an opinion that one metric is "distorted" and the other is not. Explained variance is a metric that shows exactly that: explained variance. It may not always be the most enlightening metric, but there is nothing distorted about it in itself.

So I would suggest changing the emphasis of the paper to instead be: here is a way of viewing things that can aid understanding--a different perspective (as you also write). But focus on this, that it is a suggested perspective, not that using explained variance is distorted.

Best,

Anon Anonsen

Bot

Author has updated the submission to version #4

Author
Replying to Forum Bot


I just changed some keywords; it's not a revised manuscript yet.

Bot

Author has updated the submission to version #5

Author
Replying to Forum Bot


Same (changed the abstract)

Bot

Author has updated the submission to version #6

Author

Replying to Reviewer 2

(the tracked revision is in version 6 of the manuscript; sorry for the minor updates, which I just learned have triggered useless emails to you)

Many thanks for the positive assessment!

- I know that many people are aware of these issues, and tried to explicitly disclaim any originality in my contribution. I think some of my examples are more convincing than ones that have been used previously, and hopefully the idea of an effect ratio will stimulate more awareness.

- I substantially edited the manuscript and changed the wording of several passages to make it as clear as possible that variances offer a "distorted" picture of the effects, if those effects are best understood in the original units of the variables (e.g., factors predicting "intelligence" rather than "variance of intelligence"). In this case, I think there is a strong argument that comparisons are in fact distorted or exaggerated, even if there is nothing technically incorrect in the variance.

- I think some of the quotes directly suggest an inflated interpretation of the differences, e.g. the quote about polygenic scores that predict 10% of the variance being "a long way" from explaining 50% of the variance.

Best,

MarcoDG


Author

Replying to Reviewer 1

(the tracked revision is in version 6 of the manuscript; sorry for the minor updates, which I just learned have triggered useless emails to you)

many thanks for the helpful and thorough review!

- I singled out behavior genetics in the opening paragraph, but didn't want to focus the paper on this discipline specifically. So I added a final paragraph in which I briefly discuss methods with broad application, such as "relative importance analysis" (very popular in applied psychology) and PCA/EFA.

- "2nd paragraph, 3rd sentence reads improperly. I believe "when" should be replaced with "while" to clarify the intended meaning of the sentence." I think the sentence is OK--I meant that more often than not, the squared units are substantively meaningless (e.g., squared IQ points)

- I tried to drive home the broader implications of the paper in the final paragraph

- "Some clarification of the author's meaning of the standardized effect ratio as it relates to scenarios *across different outcomes* would be useful." I did not specifically include a discussion of this issue to avoid miring the paper in excessive detail. I tried to keep the paper as general as possible, and noted that the effect ratio only makes sense in cases where the ratio of variances also makes sense.

- "On p. 3, the reciprocal of the effect ratio (as a unit of interpretation) should be introduced in the 1st full paragraph (first didactic example) rather than the 2nd full paragraph, as this way of interpreting the relationship may be more intuitive to a given reader. Its mention can be retained in the 3rd paragraph (second didactic example) for consistency." Done. Thanks for the suggestion.

- "It may be useful to draw a juxtaposition with the IRT literature to further clarify. In IRT, reliability can be computed as 1 minus the squared reciprocal of the square root of the information value for scores in a specified latent trait range; the reciprocal of the square root of the information value provides an estimate of the standard error of latent trait measurement in that range, which, when squared, provides an index of error variance in the same range." I've considered this, but I wonder if this would be too specific for the average reader. Since reliability is not a primary concern of IRT analyses (which focus more on estimating the SEM), I'm not sure how to go about this without adding a lengthy discussion of reliability in the two approaches. If you still think this should be included, please let me know and I'll be happy to add a footnote or a new paragraph.

All the best,

Marco DG

 


The paper's strength is its examples, and I like that real explained-variance estimates (e.g., heritabilities) are being used to illustrate the author's point. But there is one glaring lacuna: nowhere in the paper does the author quote or cite anyone who makes the error that is claimed to be so common. I strongly recommend that the author find an example or two of someone saying something like, "Look at this missing heritability problem! Polygenic scores only explain 5% of the IQ phenotype, but twin and family studies show that the heritability is 50%. We're missing 90% of the heritability! Polygenic scores must be trivial." Without an example or two of people making this mistake, the author doesn't have a strong argument that this error is common or that it is leading people to draw wrong conclusions. The example on p. 5 (from McCrae, 2015) sort of gets to this point, but it comes so late in the paper... An example early on would do a lot of good.

An example of ranking factors in an EFA would also help illustrate the point on pp. 5-6. Eigenvalues are a percentage of variance (standardized across observed variables) that a factor explains in exploratory factor analysis. Differences in eigenvalues may be another example of how variance components can appear inflated; this can have serious consequences when judging whether a factor is worth retaining or not (e.g., with the Guttman rule of retaining factors with eigenvalues greater than 1).
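
To illustrate the reviewer's EFA point with invented numbers: a first factor whose eigenvalue is four times that of a second explains four times the variance, but on the square-root (loading-like) scale the gap is only twofold. The eigenvalues below are purely hypothetical.

```python
import math

# Hypothetical eigenvalues from an EFA (illustrative values only)
eig_first, eig_second = 4.0, 1.0

print(eig_first / eig_second)             # variance ratio: 4.0
print(math.sqrt(eig_first / eig_second))  # square-root scale: 2.0
```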

Minor suggestions to make the paper read more smoothly:

  • Page 2: Change ". . . such as behavior genetics" to ". . . such as behavioral genetics."
  • Footnote 4: Specify that the correction factor is the portion in the second equation that is inside the square root.
  • Footnote 5: I think the author should clarify that 80% heritability for IQ is in wealthy countries; there are few heritability estimates for this phenotype in developing or poor nations, and none seem to be as high as what is seen in Europe and North America.
Author

Many thanks for the feedback! This was supposed to be a very short note, and certain interpretations of the data seem to be endemic and rarely made explicit, so I didn't want to turn it into a collection of mined quotes. I think the Plomin & von Stumm quote nicely illustrates the perception of 10% being "a long way" from 50%, which is true in terms of explaining the variance but not so much in terms of explaining the phenotype. In the next revision, I will add these three examples, one very general and two specific:

- In his "three laws" paper, Turkheimer (2000) wrote: "Although according to the second law shared environment accounts for a small proportion of the variability in behavioral outcomes, according to the third law, nonshared environment usually accounts for a substantial portion. So perhaps the appropriate conclusion is not so much that the family environment does not matter for development, but rather that the part of the family environment that is shared by siblings does not matter." (emphasis mine)

The following two are from Knopik et al.'s (2017) textbook:

- "Memory and verbal fluency show lower heritability, about 30 percent; the other abilities yield heritabilities of 40 to 50 percent. [...] however, adoption designs show little influence of shared environment. For example, the correlations for adoptive siblings are only about 0.10, suggesting that only 10 percent of the variance of verbal and spatial abilities is due to shared environmental factors." (emphasis mine)

- "Large twin studies found similar results [heritabilities around 60%] in the early school years for both reading disability and reading ability. However, in all of these studies, shared environmental influence is modest, typically accounting for less than 20 percent of the variance." (emphasis mine)

I should also note that "small" shared environmental effects are often dropped from the best-fitting models and set to zero, because they cannot be detected reliably unless sample size is quite large (e.g., Burt, 2014).

 

Replying to Fri 14 May 2021 05:27


Reviewer

I think these edits have been very responsive to my original review, and I appreciate the author's attentiveness.

I still would suggest the modification to the 3rd sentence (2nd paragraph) that I mentioned earlier. While I appreciate the point (and edit), the sentence needs a leading "Even" here. You are suggesting something redeeming about the units (i.e., when they are not *entirely* meaningless, as they would be if standardized), but in the 2nd part of the sentence you are suggesting they are still quite limited in their interpretive ramifications. Thus, in my opinion, it reads a bit awkwardly when only "when" is used, as you are drawing a distinction here.

In the prior version I reviewed, I did not see the current footnote #4, which I think is a welcome addition. Incidentally, I think this may also explain my curious observation about the (seemingly random) coherency of the IRT reliability calculation (using the reciprocal) with some of the provided examples in the paper.


Author

Replying to Reviewer 1

Thank you. I will change the sentence as suggested in the next revision (as soon as I get Reviewer 2's reply)


Reviewer

Thanks. Could you also note (here) in what version the current footnote #4 came into the manuscript?

Author
Replying to Reviewer 1


It was there from version 1 (then as footnote 3). In fact, versions 1 to 5 are identical; I didn't know how this system worked, and made minor edits to the keywords, abstract, etc. without knowing that they would trigger notifications to the reviewers.

Reviewer
Replying to Marco Del Giudice

Thanks!

Bot

Author has updated the submission to version #7

Author

Replying to Forum Bot

I uploaded a new version (#7), with the changes suggested by Reviewer 1 and the additional quotes in response to "Gold Vanth".
