Black-White differences in an English vocabulary test using an online Prolific sample

Submission status
Published

Submission Editor
Noah Carl

Authors
Emil O. W. Kirkegaard
Meng Hu

Title
Black-White differences in an English vocabulary test using an online Prolific sample

Abstract

We sought to examine the Black-White difference in performance on a new English vocabulary test based on 225 items. Using data from the norm sample (N = 499, Prolific), we found a gap of d = 0.74. Adjusting for test reliability (reliability = .977), this was d = 0.75. We examined measurement invariance using Differential Item Functioning (DIF). Biased items were flagged based on a p-value < .05. We found 1 biased item after Bonferroni correction for multiple comparisons. An application of Jensen’s Method of Correlated Vectors (MCV) to the item data showed a positive relationship between the Black-White difference and the items’ factor loadings, with a predicted gap of d = 0.76 at loading = 1. Findings were in line with prior research showing minimal bias in vocabulary tests and a g-related difference.

 

Keywords
cognitive ability, method of correlated vectors, vocabulary, differential item functioning, Test Bias, Spearman’s Hypothesis, Black-White gap

Supplemental materials link
https://osf.io/t2j4s/

Pdf

Paper

Typeset Pdf

Typeset Paper

Reviewers ( 0 / 0 / 2 )
Anon Anonsen: Accept
Gregory Connor: Accept

Fri 14 Mar 2025 18:22

Author | Admin

We have finally received two reviews.

I initially asked several people, and only Gregory Connor responded positively to the request; his review was sent on April 12th, 2025 (uploaded at the supplemental material link: https://osf.io/u2zaf). Emil Kirkegaard asked another reviewer, who likely wanted to remain anonymous; that review was sent to us on May 22nd, 2025 (uploaded at the supplemental material link: https://osf.io/bj726).

I will fix what is necessary as soon as possible, consult with my co-author, Emil, and then upload the revised version here.

Reviewer

I am not an expert in the topics of this paper. I have skimmed through it, and it looks fine, as far as I am able to evaluate. I have some minor, mostly language-related notes below:

 

page 1

I would state in the introduction that a representative US sample is used.

 

page 3

“Each item was scored as correct or incorrect if the subject chose the right option(s)”

-> depending on whether the subject chose the right option(s).

 

page 3

“This allows for the calculation of item loadings that are on the same scale as regular test-level loadings and which are not affected by item pass rates, unlike the item-test correlation which lacks the invariance property due to being computed as a within-group correlation based on sum scores, therefore circumventing issues mentioned by Wicherts (2017).”

->

This approach calculates item loadings on the same scale as test-level loadings, unaffected by item difficulty. Unlike item-test correlations, these loadings have the invariance property since they're not computed as within-group correlations based on sum scores. This addresses the issues raised by Wicherts (2017).

 

page 4
-> 

“Jensen's MCV was applied to the item data. First, we derived the item standardized factor loadings. We computed the Cohen's d gap, estimated using the inverse normal distribution function (qnorm() in R) based on the pass rates by race. For instance, on the first item, Whites had a pass rate of 82.2% and Blacks of 68.3%. This corresponded to z score means of 0.922 and 0.475, yielding a gap of d = 0.447. Figure 2 shows the results.”
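For reference, the computation this paragraph describes can be reproduced in a few lines of R (a sketch using the pass rates from the quoted example, not the authors' actual script):

    # Cohen's d gap from group pass rates via the inverse normal CDF
    pass_white <- 0.822                    # White pass rate on the first item
    pass_black <- 0.683                    # Black pass rate on the first item
    qnorm(pass_white) - qnorm(pass_black)  # ~0.923 - ~0.476 = ~0.447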

 

page 5

I don’t understand why some of the labels in the plot say “1of5_3” instead of a word, when other points are words.

 

page 6

“In this study, the Black sample was relatively small (N=63), so the true correlation will be larger as sampling error biases the correlation towards 0 (Dutton & Kirkegaard, 2021). The estimated gap at a factor loading of 1.00 should be unbiased as random error in the outcome variable should not affect the slope, and the random error in the predictor variable is relatively small. Indeed, the estimated gap (intercept + slope × g-loading) at a perfect factor loading was 0.76 d, which is practically the same as the observed gap, and slightly closer to the reliability adjusted value (d = 0.74 and d = 0.75, respectively).”

I found this paragraph confusing. Here is an AI suggested version FWIW:

"In this study, the Black sample was relatively small (N=63), which increases uncertainty in the correlation estimate due to sampling variability (Dutton & Kirkegaard, 2021). The estimated gap at a factor loading of 1.00 should provide a reliable estimate, as measurement error in the outcome variable does not bias the regression slope, and measurement error in the predictor variable is relatively small. The estimated gap (intercept + slope_g-loading) at a perfect factor loading was 0.76d, which closely matches both the observed gap (d = 0.74) and the reliability-adjusted value (d = 0.75)."

 

 

page 6

“Because the main analysis uses all items, including the 8 easy items that could have served as attention check to detect potentially poor responses”

-> The main analysis uses all items, including the 8 easy items that could have served as an attention check to detect potentially poor responses. We conducted a supplemental analysis that involved ...

Comment: It seems to me that this may remove Blacks who are not merely inattentive, but answer wrong because they are low IQ. So I would question even including it in the paper.

 

 

page 7

“But the magnitude of the Black-White gap found in the present study (d = 0.75) is smaller than what was reported in another, recent online test (d = 0.99) on the same Prolific platform comprising a vocabulary subscale and a paper folding subscale (Kirkegaard, 2022).”

I don’t think the ‘but’ is warranted here.

 

 

page 8

“Bates & Gignac (2022) conducted several analyses using the Prolific platform and found a modest effect size of (≈2.5 IQ) favoring the incentive group, a finding consistent with Gignac (2018) and Merritt et al. (2019).”

-> favoring the group who received extra payment for correct answers.

 

page 8

“Future research should explicitly measure motivation and its impact on test performance, particularly in online settings, to better understand its role in group differences.”

the separation of paragraphs should be different. This makes me think that this whole paragraph (also) is about the motivation thing.

 

Comment:

I would expect that the way the subjects are recruited imposes IQ range restrictions. You can't be too low in IQ, or you wouldn't be able to figure out how to use Prolific; and you can't be too high in IQ, or your time would be worth more than 8 dollars per hour. Thus I would expect the estimated group differences to be smaller than those in the full population. This could be mentioned in the discussion.


-----

Overall, I accept it for publication.

Author | Admin

Reviewer 1: Gregory Connor's (GC) review is available in the OSF file uploaded months ago.

GC argued that the abstract should probably remove the part mentioning that the number of biased items decreased from 27 to 1 after Bonferroni correction, if no clarification is provided as to why such a huge change happened. So I revised the sentence to simply report that there is only 1 biased item after Bonferroni correction.

GC argued that "If all the items are unbiased, the bionomial probability of finding 27 or more items biased using a correctly specified test with 95% confidence is 0.0000266. So you can reject unbiasedness of the full set of 225 of items with 99.99% confidence." and then later "You are using a p-value of 0.02%, which means that you are relying very heavily on EXACT normality of the test statistic distribution. Do you believe it is extremely close to an exact Gaussian distribution? You do not mention what test statistic you are putting to this extreme test of tail-distribution reliable normality."

I addressed this in the result section and modified the original paragraph as follows: 

In this analysis, we fit a partially invariant model based on items with p < .05 on the test, and scores based on this model and the initial, baseline model are compared. The difference in the gap size is then computed. Since there are 225 items, the p-values can be corrected for multiple testing using Bonferroni. Without Bonferroni correction, 27 items were found to be biased (27/225 = 12%, i.e., above the chance expectation of 5%), but after correction for multiple testing only 1 item was biased. In light of this latter result, one might reason that the conclusion of no bias is unwarranted, because finding 27 significant items by chance alone is astronomically unlikely if all items were truly unbiased. To address concerns that Bonferroni correction is overly conservative and relies on extreme-tail assumptions, we also applied the Benjamini-Hochberg (BH) procedure, which controls the False Discovery Rate (FDR). The BH correction yielded the same result, identifying only a single biased item. Crucially, regardless of the method, the impact on test-level bias was minimal (in Cohen’s d, 0.02 against Blacks without correction, or 0.01 against Whites with Bonferroni correction). This means the statistical significance does not translate into practical significance.
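As a sketch of the correction step, assuming p_vals is the vector of 225 per-item LRT p-values (a hypothetical name, not our actual code):

    # Count items still flagged after each multiple-testing correction
    sum(p.adjust(p_vals, method = "bonferroni") < .05)  # Bonferroni
    sum(p.adjust(p_vals, method = "BH") < .05)          # Benjamini-Hochberg FDR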

And with this added footnote

Regarding the reviewer’s concern about the extreme Bonferroni-corrected alpha (α = 0.05 / 225 ≈ 0.00022), the test statistic used was the Likelihood Ratio Test (LRT) statistic, which is asymptotically χ²-distributed, not Gaussian. We agree that all asymptotic approximations, including the χ², must be treated with caution in the extreme tails.
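To illustrate how far into the tail this takes the test, the per-item critical value at the corrected alpha can be computed in R (df = 1 is assumed purely for illustration; the actual df depends on the number of constrained item parameters):

    # Chi-squared critical values at the corrected vs. nominal alpha
    qchisq(1 - 0.05/225, df = 1)  # ~13.6
    qchisq(1 - 0.05, df = 1)      # ~3.84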

We also corrected several typos pointed out by GC and two confusing sentences.

Reviewer 2: Unfortunately, the anonymous review requested by Emil Kirkegaard was lost. I made a PDF of the review and uploaded it to OSF, but the OSF file disappeared for some reason. Emil did not keep a copy of this review, and neither did I.

Reviewer 3 (in this thread): 

In the revised paper, I addressed all of your points and suggestions, with the exception of the one about the abstract: if I did not mention in the abstract that it is a representative US sample, it is because it isn't. Very low ability and very high ability participants are certainly lacking. I added this detail at the beginning of the discussion section.

Regarding your questions and observations:

"I don’t understand why some of the labels in the plot say “1of5_3” instead of a word, when other points are words."

We use the same label for all items in the follow-up tests. 1of5_3 means the 3rd item in the "select 1 out of 5" item format. We could have used the target word instead, but this would still cause confusion because there is no single target word for the "select 2 out of 5" and "select 3 out of 5" formats. We use the same label to indicate that these items all belong to the same set, which was administered in the follow-up test.

"Comment: It seems to me that this may remove Blacks who are not merely inattentive, but answer wrong because they are low IQ. So I would question even including it in the paper."

These words are very easy; even non-natives should be able to answer them. I doubt this is related to true IQ. In any case, this was just a supplementary analysis, and the paper reports the results for all Blacks, including those who did not pass the attention check.

"the separation of paragraphs should be different. This makes me think that this whole paragraph (also) is about the motivation thing."

I agree. This last paragraph reads like this:

Future research should explicitly measure motivation and its impact on test performance, particularly in online settings, to better understand its role in group differences. Such test-taking behavior has already been examined using response time data to account for rapid-guessing behavior (Michaelides et al., 2024; Lee & Jia, 2014). A more convincing test of Spearman’s hypothesis is to apply MCV in other subtests which are less subject to cultural influences than vocabulary.

It starts with "future research" and then moves on to discussing SH. I split this long paragraph right where it discusses SH, and moved that portion so that it immediately follows this paragraph:

Research on test bias is useful for examining the cultural hypothesis, but also for vindicating Spearman’s Hypothesis: not only for the purpose of valid cross-group comparison, but also for estimating the true impact of g on group differences after accounting for item bias. There was indeed an account that the effect of g increases when biased items are removed (te Nijenhuis et al., 2016b).

So the flow is now much more natural and fluid.

Bot

Authors have updated the submission to version #2

Bot

The submission was accepted for publication.

Author | Admin

Good news: I was able to find (by accident, in fact) the second review that we had lost. It is now available at the OSF project of this paper. The main concern of the review is the following:

While the study appears objective and clear, (in the current climate) it may receive greater criticism from the academic community due to the lack of cultural, motivational and socio-structural insights/clarifications. Also, a number of cited authors and their theories were heavily criticised in the literature – not addressing this may attract higher scrutiny from the reader.

I think the theory and methodology are robust, as explained in the paper. Other research (cited in the paper) confirmed Spearman's hypothesis in the case of Black-White differences. As for the lack of clarification of the motivation for this paper, I admit we were not very clear here, except that our goal is to test for item bias and Spearman's hypothesis. Anyone who knows about SH understands that item bias analysis is required for correct inferences about the nature of the group differences implied by SH, and the value brought by SH is that g matters if one wishes to examine the nature of group differences. So applying MCV at the item level makes sense here.