Looking for evidence of the Dunning-Kruger effect: an analysis of 2400 online test takers

Submission status
Reviewing

Submission Editor
Submission editor not assigned yet.

Authors
Emil O. W. Kirkegaard
Arjen Gerritsen

Title
Looking for evidence of the Dunning-Kruger effect: an analysis of 2400 online test takers

Abstract

The Dunning-Kruger effect is a well-known psychological finding. Unfortunately, there are two aspects to the finding: one trivial, indeed a simple, statistically necessary empirical pattern, and the other an unsupported theory that purports to explain this pattern. Recently, (Gignac & Zajenkowski, 2020) suggested two ways to operationalize and test the theory. We carried out a replication of their study using archival data from a larger dataset. We used two measures of self-estimated ability: estimated sum score (correct responses) and estimated own centile. We find no evidence of nonlinearity for either. We find evidence of heteroscedasticity for self-centile estimates, but not for raw score estimates. Overall, the evidence was mostly inconsistent with Dunning-Kruger theory.

Keywords
intelligence, regression towards the mean, self-perception, Dunning-Kruger effect, self-estimated intelligence

Supplemental materials link
https://osf.io/fhqap/

Reviewers
Reviewer 1: Accept
Reviewer 2: Accept

Mon 15 Feb 2021 04:00

Reviewer

I like the paper. It provides an important replication regarding the Kruger-Dunning effect. However, I have two major concerns. These concerns have to do more with Gignac & Zajenkowski's (2020) original method than with the implementation of this method in the current paper. Gignac & Zajenkowski (2020) proposed that the existence of a Kruger-Dunning effect should lead to (1) heteroscedasticity, in which the variance of the residuals is higher when ability is low than when it is high, and (2) convexity of the function relating estimated ability to actual ability. However, the problem with these effects is that they may be the result of a ceiling effect, likely stemming from the “better than average” effect that clearly exists in ability evaluations.

These issues could be resolved by a simulation that introduces a ceiling effect into the ability estimates. I should also note here that, of the two effects, the second is a bit more complex. The reason is that a Kruger-Dunning effect coupled with a better-than-average effect should lead not only to convexity at low ability (as Gignac & Zajenkowski argue), but also to concavity at high ability. These two effects are not easy to model (and may produce apparent linearity if their simultaneous existence is not acknowledged). In this regard, I am not sure that LOESS is the right model to use. Perhaps a more appropriate model is Tobit.

In sum, I am neither sure that the heteroscedasticity at high levels observed by the authors necessarily implies the existence of a Kruger-Dunning effect, nor that the linearity they observe necessarily implies the lack of one. Both may be due to a ceiling effect resulting from a better-than-average effect.

If I am right, these issues may be resolved by a simulation, although such a simulation may be more appropriate to publish in Intelligence as a response to Gignac & Zajenkowski (2020). A minimal sketch of such a simulation follows.
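For concreteness, such a simulation could be as simple as the R sketch below. All parameter values are illustrative assumptions: the 0.75 d shift matches a typical better-than-average effect, the correlation of .33 the Freund & Kasten meta-analytic mean, and the ceiling location is arbitrary. The point is only to check whether censoring alone produces the two diagnostic patterns.

set.seed(1)
n <- 5000
ability <- rnorm(n)                                # true ability, z-scored
# purely linear self-estimate: r = .33 with ability, plus a better-than-average shift
estimate <- 0.75 + 0.33 * ability + rnorm(n, sd = sqrt(1 - 0.33^2))
estimate <- pmin(estimate, 1.5)                    # impose a hard ceiling

fit <- lm(estimate ~ ability)                      # the model of interest
abs_res <- abs(as.numeric(scale(resid(fit))))      # absolute standardized residuals
cor.test(abs_res, ability)                         # Glejser-type test: does censoring
                                                   # alone mimic heteroscedasticity?
anova(fit, lm(estimate ~ ability + I(ability^2)))  # does censoring alone mimic curvature?
# A Tobit model, e.g. AER::tobit(estimate ~ ability, right = 1.5), would instead
# model the censoring directly, as suggested above.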

Minor concerns

The paper is written very carelessly. If you want your work to have influence, you need to do a much better job with the writing. The casual way in which the paper is written leaves a bad impression. Here are some examples.

All Figures are labeled Figure X. It seems that you forgot to go over the paper and insert figure numbers.

In the abstract you write “(Gignac & Zajenkowski, 2020) suggested two ways to operationalize and test the theory. We carried out a replication of their study using archival data from a larger dataset” – but what are these two ways? They should be mentioned here, not in the next sentence.

“A large meta-analysis found a mean observed r = .33 (Freund & Kasten, 2012)” – redundant; we already saw it in Fig 1. Confusing – the reader thinks for a moment that this sentence presents new information that was not communicated before, whereas it does not.

“We scored the cognitive data” – what are “cognitive data”? You didn’t define this term before.

“the model of interest” – I don’t understand this. Do you mean the parameter of interest is fit?

“Figure X. Distributions of cognitive ability” – Are we talking about cognitive ability or about scores on a science test? If the authors think they are the same, they should state so clearly. Otherwise the reader is confused.

You need to include the figure caption from the original Dunning-Kruger paper; otherwise the meaning of the figure is not clear. Reviewers should not have to go to the original paper to understand the figure.

You say that your purpose is to replicate Gignac & Zajenkowski’s findings, but you never specify what their findings are!

Author | Admin
Replying to Reviewer 1

Thanks for the helpful review.

Ceiling issues

Our data do not suffer much from ceiling problems, so I am not sure what to make of this comment. Distribution of objective scores from the submission:

The maximum score is 25, and no person attained this. Distribution of self-rated ability:

Only the centile guesses show some ceiling issues. But even here, not much to write about:

> #in %
> (quiz25_noOutlier$score == 25) %>% describe()
   vars    n mean sd median trimmed mad min max range skew kurtosis se
X1    1 2392    0  0      0       0   0   0   0     0  NaN      NaN  0
> (quiz25_noOutlier$score_guess == 25) %>% describe()
   vars    n    mean     sd median trimmed mad min max range skew kurtosis       se
X1    1 2388 0.00168 0.0409      0       0   0   0   1     1 24.4      592 0.000837
> (quiz25_noOutlier$g == max(quiz25_noOutlier$g, na.rm = T)) %>% describe()
   vars    n     mean     sd median trimmed mad min max range skew kurtosis       se
X1    1 2392 0.000418 0.0204      0       0   0   0   1     1 48.8     2385 0.000418
> (quiz25_noOutlier$centile_guess == 100) %>% describe()
   vars    n    mean     sd median trimmed mad min max range skew kurtosis      se
X1    1 2386 0.00712 0.0841      0       0   0   0   1     1 11.7      135 0.00172

Thus, for the worst case variable, 0.71% rated themselves as being in the 100th centile, and 0.17% guessed their own score was 25/25. It is true that the centile variable is skewed towards the upper ceiling:

> quiz25_noOutlier %>%
+   select(score, g, score_guess, centile_guess) %>%
+   describe()
              vars    n    mean    sd   median  trimmed   mad   min    max  range    skew kurtosis     se
score            1 2392 15.3198  3.36 15.00000 15.34065  2.97  5.00  24.00  19.00 -0.0626   -0.333 0.0688
g                2 2392  0.0101  0.97  0.00671  0.00547  1.03 -2.48   2.48   4.96  0.0456   -0.503 0.0198
score_guess      3 2388 14.0507  4.25 14.00000 14.15481  4.45  0.00  25.00  25.00 -0.2087   -0.313 0.0870
centile_guess    4 2386 68.2481 20.20 70.00000 70.08901 19.27  0.00 100.00 100.00 -0.7645    0.172 0.4135

But it is not much. Skew is -0.76.

Gignac and Zajenkowski's paper

R1 is right. We forgot to explain what this study did. We have now added this to the introduction:

(Gignac & Zajenkowski, 2020) applied these two methods to a dataset of 929 subjects who had taken the Raven’s Advanced Progressive Matrices test (a standard nonverbal test) as well as rated themselves on a 1-25 scale. First, they found no evidence of heteroscedasticity using the Glejser test. This test involves saving the residuals from the linear model (self-estimated ability ~ objectively measured ability, where ~ denotes “regressed on”), converting them to absolute values, and correlating them with the predictor (i.e., objectively measured ability). The correlation was -.05 with a 95% confidence interval of -.11 to .02. Second, they looked for a nonlinear association using a model comparison with a quadratic model. The model comparison found no incremental validity of the nonlinear model (incremental R2 < 1%). They also plotted the data using a smoothing function (local regression, LOESS), which showed no notable deviation from linearity. The purpose of this paper was to replicate the findings of (Gignac & Zajenkowski, 2020) in a new and larger sample using more robust methods for testing for heteroscedasticity and nonlinearity.
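In R, the two tests described above amount to only a few lines. This is a sketch, not their exact code: the data frame and column names (dat, self_estimate, ability) are hypothetical, and complete cases are assumed.

dat <- na.omit(dat[, c("self_estimate", "ability")])  # hypothetical data frame / columns
m1 <- lm(self_estimate ~ ability, data = dat)         # linear model of interest

cor.test(abs(resid(m1)), dat$ability)                 # Glejser test: |residuals| vs. predictor

m2 <- update(m1, . ~ . + I(ability^2))                # add a quadratic term
summary(m2)$r.squared - summary(m1)$r.squared         # incremental R2 of the nonlinear model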

Writing

All Figures are labeled Figure X. It seems that you forgot to go over the paper and insert figure numbers.

These change as the manuscript changes, and will be added in the typesetting step at the end.

In the abstract you write “(Gignac & Zajenkowski, 2020) suggested two ways to operationalize and test the theory. We carried out a replication of their study using archival data from a larger dataset” – but what are these two ways? They should be mentioned here, not in the next sentence.

Their methods and results are now given in the introduction.

“A large meta-analysis found a mean observed r = .33 (Freund & Kasten, 2012)” – redundant; we already saw it in Fig 1. Confusing – the reader thinks for a moment that this sentence presents new information that was not communicated before, whereas it does not.

Figure 1 (Typical Dunning-Kruger pattern. Reproduced from (Kruger & Dunning, 1999)) is not related to Freund and Kasten's meta-analysis, so I am not sure what to make of this comment. The point of including the effect sizes is to give the reader an idea of how strong the correlation typically is between these constructs. As it is, our paper finds substantially stronger correlations than this meta-analysis.

“We scored the cognitive data” – what are “cognitive data”? You didn’t define this term before.

“the model of interest” – I don’t understand this. Do you mean the parameter of interest is fit?

Cognitive data are the data from the objective test. This is a fairly common term (18.5k hits on Google Scholar), so I don't know why it is objected to. We changed the phrase to "cognitive ability data"; maybe this will be enough clarification.

The model of interest refers to whatever model the researcher is looking for heteroscedasticity in. I changed the text to be clearer:

Turning to the question of heteroscedasticity, we employed the same method as in (Kirkegaard, 2021). The approach is as follows: first, the model of interest is fit. This is the statistical model that one wants to evaluate for heteroscedasticity. Second, the residuals are saved, standardized, and then converted to positive (absolute) values. Third, linear and nonlinear models are fit to these residuals using the predictor of interest, to look for evidence of heteroscedasticity.
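A sketch of these three steps in R (variable names are hypothetical, and the quadratic term stands in for whatever nonlinear specification one prefers):

dat <- na.omit(dat[, c("centile_guess", "g")])     # complete cases (hypothetical names)
fit <- lm(centile_guess ~ g, data = dat)           # step 1: fit the model of interest
dat$abs_res <- abs(as.numeric(scale(resid(fit))))  # step 2: absolute standardized residuals
summary(lm(abs_res ~ g, data = dat))               # step 3: linear trend in residual size...
summary(lm(abs_res ~ g + I(g^2), data = dat))      # ...and a nonlinear (quadratic) trend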

“Figure X. Distributions of cognitive ability” – Are we talking about cognitive ability or about scores on a science test? If the authors think they are the same, they should state so clearly. Otherwise the reader is confused.

We changed the caption to:

Figure X. Distributions of scientific knowledge by scoring method. Left panel shows sum scores, and the right panel, item response theory standard scores (density curve overlaid).

You need to include the figure caption from the original Dunning-Kruger paper; otherwise the meaning of the figure is not clear. Reviewers should not have to go to the original paper to understand the figure.

I don't know what you mean. The caption already says this:

Figure X. Typical Dunning-Kruger pattern. Reproduced from (Kruger & Dunning, 1999).

Bot

Authors have updated the submission to version #2

Reviewer

Emil addressed most of my concerns. The only remaining comment I have has to do with the way he explains the Dunning-Kruger effect as arising from “two simple facts. First, self-estimated ability is positively, but imperfectly, correlated with actual ability. A large meta-analysis found a mean observed r = .33 (Freund & Kasten, 2012). Second, there is a general tendency to overestimate own performance in general”. What’s missing here is the importance of regression to the mean. In fact, this is the most important mistake that Dunning and Kruger committed.

Congratulations on a nice piece of work.

Reviewer

After reading the paper twice, I found that it was clearly written with interesting and appropriate analysis.

This typo may have already been fixed:
After the heading "Results" ... "elderly persons who were pasted again..." obviously needs to be changed to "tested."

The only thing I found that I would recommend changing is the final part (DISCUSSION). Just before the DATA section, the text is:

"The purpose of this paper was to replicate the findings of (Gignac & Zajenkowski, 2020) in a new and larger sample using more robust methods for testing for heteroscedasticity and nonlinearity."

After the section DISCUSSION, the text reads: "First, our replication study was about 2.5 times larger than the study by (Gignac & Zajenkowski, 2020, n=929)." The remainder of the paper does not mention the Gignac & Zajenkowski paper again. It discusses the findings of the current study and compares some results to other papers, but does not close by going back to Gignac & Zajenkowski and making some kind of final comparison, such as the degree to which their paper was replicated, or something a bit more detailed. My impression is that the two methods suggested by Gignac & Zajenkowski were tested appropriately, but with different results and uncertainty as to the cause of the differences.

One way to end the paper, and accommodate the comment above, would be to add a short "CONCLUSIONS" section. It could be a single paragraph that succinctly sums up the findings in relation to Gignac & Zajenkowski (and other papers, if an appropriate comparison would be informative) and states the conclusions that are already part of the DISCUSSION. These could be single sentences... here are the things that are important.

There is a lot to like about the paper.  The correlation matrix, for example, is precisely what one would expect and is a good confirmation that nothing strange has entered the study.

These are just some (obvious) general thoughts that might be useful if any additional tweaking is done:

If we go back to the D-K plot given at the beginning of this paper, the crossing point strikes me as something that could vary between different data sets and different test formats. In fact, Gignac & Zajenkowski show the same general outcome as D-K, except that the lines do not cross, leading to a slightly different conclusion: the actual scores are always below the perceived-ability line. Anything that could move the intercepts or the slope of one of the lines could result in them crossing.

The final paragraph leaves a sense of uncertainty as to whether the analysis is better, worse, or similar to the several studies mentioned.  It has high N and a test that produced a nice looking score distribution, but it isn't clear that the studies are directly comparable.  One issue is how the science information test (complete with a religion question!) compares to the RAPM test used by Gignac & Zajenkowski.  If we are really comparing g measures only, the differences may be unimportant.  [If so, this could be stated.]  The RAPM has a long history of research usage and is well understood, while the science test strikes me as a weaker test.  

I wonder about the self-selected nature of the online test. We don't know how the test takers were distributed, but it seems likely that there is a narrow age range (probably young), a large male fraction, and, as the authors mentioned, the test takers may be above average in intelligence. The question is how the procedures followed in this paper would change if the study cohorts were known to be representative of a more general population. I found little information about the participants in the Gignac & Zajenkowski study, other than "general community participants." I think the difference in data sets is an obvious candidate as the source of the differences in results.

Author | Admin
Replying to Reviewer 1

Emil addressed most of my concerns. The only remaining comment I have has to do with the way he explains the Dunning-Kruger effect as arising from “two simple facts. First, self-estimated ability is positively, but imperfectly, correlated with actual ability. A large meta-analysis found a mean observed r = .33 (Freund & Kasten, 2012). Second, there is a general tendency to overestimate own performance in general”. What’s missing here is the importance of regression to the mean. In fact, this is the most important mistake that Dunning and Kruger committed.

Congratulations on a nice piece of work.

I have expanded the discussion of this. I prefer not to use this term for the effect we see here, but others like it, so I now refer to it explicitly, and it is also a keyword of the article.

Author | Admin
Replying to Reviewer 2

After reading the paper twice, I found that it was clearly written with interesting and appropriate analysis.

This typo may have already been fixed:
After the heading "Results" ... "elderly persons who were pasted again..." obviously needs to be changed to "tested."

Fixed.

The only thing I found that I would recommend changing is the final part (DISCUSSION). Just before the DATA section, the text is:

"The purpose of this paper was to replicate the findings of (Gignac & Zajenkowski, 2020) in a new and larger sample using more robust methods for testing for heteroscedasticity and nonlinearity."

After the section DISCUSSION, the text reads: "First, our replication study was about 2.5 times larger than the study by (Gignac & Zajenkowski, 2020, n=929)." The remainder of the paper does not mention the Gignac & Zajenkowski paper again. It discusses the findings of the current study and compares some results to other papers, but does not close by going back to Gignac & Zajenkowski and making some kind of final comparison, such as the degree to which their paper was replicated, or something a bit more detailed. My impression is that the two methods suggested by Gignac & Zajenkowski were tested appropriately, but with different results and uncertainty as to the cause of the differences.

One way to end the paper, and accommodate the comment above, would be to add a short "CONCLUSIONS" section. It could be a single paragraph that succinctly sums up the findings in relation to Gignac & Zajenkowski (and other papers, if an appropriate comparison would be informative) and states the conclusions that are already part of the DISCUSSION. These could be single sentences... here are the things that are important.

The discussion now adds: “However, both studies found the same result, namely that the patterns the Dunning-Kruger effect should generate were not found. Thus, we successfully replicated (Gignac & Zajenkowski, 2020).”

There is a lot to like about the paper.  The correlation matrix, for example, is precisely what one would expect and is a good confirmation that nothing strange has entered the study.

These are just some (obvious) general thoughts that might be useful if any additional tweaking is done:

If we go back to the D-K plot given at the beginning of this paper, the crossing point strikes me as something that could vary between different data sets and different test formats. In fact, Gignac & Zajenkowski show the same general outcome as D-K, except that the lines do not cross, leading to a slightly different conclusion: the actual scores are always below the perceived-ability line. Anything that could move the intercepts or the slope of one of the lines could result in them crossing.

Yes, in their Figure 1 the lines do not cross in the visible portion because the better-than-average effect is very large. In their case, it looks like about 25 IQ points, i.e., 1.67 d. The meta-analysis by Zell et al. shows that a typical effect size is about 0.75 d. https://www.researchgate.net/publication/337692412_The_Better-Than-Average_Effect_in_Comparative_Self-_Evaluation_A_Comprehensive_Review_and_Meta-Analysis

In our study, the mean centile guess is 68, which corresponds to a fairly small better-than-average effect (d = 0.47). However, our sample was somewhat elite, being drawn from the Twitter followers of the author and thus selected for interest in science and probably above-average intelligence. This number is therefore difficult to interpret.
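For reference, these d values are simple normal-deviate conversions:

qnorm(0.68)  # 0.47: d implied by a mean self-placement at the 68th centile
25 / 15      # 1.67: a 25 IQ-point shift expressed in SD units (SD = 15)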

The final paragraph leaves a sense of uncertainty as to whether the analysis is better, worse, or similar to the several studies mentioned.  It has high N and a test that produced a nice looking score distribution, but it isn't clear that the studies are directly comparable.  One issue is how the science information test (complete with a religion question!) compares to the RAPM test used by Gignac & Zajenkowski.  If we are really comparing g measures only, the differences may be unimportant.  [If so, this could be stated.]  The RAPM has a long history of research usage and is well understood, while the science test strikes me as a weaker test.  

I wonder about the self-selected nature of the online test. We don't know how the test takers were distributed, but it seems likely that there is a narrow age range (probably young), a large male fraction, and, as the authors mentioned, the test takers may be above average in intelligence. The question is how the procedures followed in this paper would change if the study cohorts were known to be representative of a more general population. I found little information about the participants in the Gignac & Zajenkowski study, other than "general community participants." I think the difference in data sets is an obvious candidate as the source of the differences in results.

Added a final sentence: “Taken together, then, it is unclear why we find larger correlations between measured and self-rated ability scores compared to other studies.” The final paragraph now discusses the stronger observed correlation between measured and self-estimated scores.

It is a good point. The descriptive statistics were missing. These have been added in a new table, along with text discussing the sample.

---

The new revision hopefully improves many things. I added more tables and some more figures. Some of the other text was rewritten to be clearer and more on point.

Bot

Authors have updated the submission to version #4

Bot

The submission was accepted for publication.

Bot

Authors have updated the submission to version #6

Bot

Authors have updated the submission to version #7