Black-White differences in an English vocabulary test using an online Prolific sample

Submission status
Reviewing

Submission Editor
Noah Carl

Authors
Emil O. W. Kirkegaard
Meng Hu

Title
Black-White differences in an English vocabulary test using an online Prolific sample

Abstract

We sought to examine the Black-White difference in performance on a new English vocabulary test based on 225 items. Using data from the norm sample (N=499, Prolific) we found a gap of d = 0.74. Adjusting for test reliability, this was d = 0.75 (reliability = .977). We examined measurement invariance using Differential Item Functioning (DIF). Biased items were flagged based on p-value < .05. We found 27 biased items, which decreased to 1 biased item after Bonferroni correction. An application of Jensen’s Method of Correlated Vectors (MCV) to the item data showed a positive relationship between the Black-White difference and the items’ factor loading, with a predicted gap of d = 0.76 at loading=1. Findings were in line with prior research showing minimal bias in vocabulary tests, and a g-related difference.
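
The two adjustments mentioned in the abstract can be checked with simple arithmetic. The sketch below assumes the reliability adjustment is the usual disattenuation, d / sqrt(reliability), and that the Bonferroni threshold is .05 divided by the 225 items tested. The abstract states neither formula, so both are assumptions, and the variable names are illustrative.

```python
import math

observed_d = 0.74    # observed gap reported in the abstract
reliability = 0.977  # test reliability reported in the abstract
n_items = 225        # number of items, assumed to be one DIF test each
alpha = 0.05         # uncorrected flagging threshold from the abstract

# Assumed formula: divide the observed d by the square root of reliability.
adjusted_d = observed_d / math.sqrt(reliability)

# Assumed Bonferroni rule: split alpha evenly across all item-level tests.
bonferroni_alpha = alpha / n_items

print(f"reliability-adjusted d = {adjusted_d:.2f}")
print(f"Bonferroni threshold   = {bonferroni_alpha:.6f}")
```

Under these assumptions the adjusted value rounds to 0.75, which agrees with the figure given in the abstract.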

Keywords
cognitive ability, method of correlated vectors, vocabulary, differential item functioning, test bias, Spearman’s Hypothesis, Black-White gap, Prolific

Supplemental materials link
https://osf.io/t2j4s/


Reviewers ( 0 / 0 / 1 )
Reviewer 1: Accept

Fri 14 Mar 2025 18:22

Author | Admin

We have finally received two reviews.

I initially asked several people, and only Gregory Connor responded positively to the request; his review was sent on April 12, 2025 (uploaded in the supplemental material link: https://osf.io/u2zaf). Emil Kirkegaard asked for another reviewer, who likely wanted to remain anonymous; that review was sent to us on May 22, 2025 (uploaded in the supplemental material link: https://osf.io/bj726).

I will fix what is necessary as soon as possible, and consult with my co-author, Emil, then upload it here.

Reviewer

I am not an expert in the topics of this paper. I have skimmed through it, and it looks fine, as far as I am able to evaluate. I have some minor, mostly language-related notes below:

 

page 1

I would state in the introduction that a representative US sample is used.

 

page 3

“Each item was scored as correct or incorrect if the subject chose the right option(s)”

-> depending on whether the subject chose the right option(s).

 

page 3

“This allows for the calculation of item loadings that are on the same scale as regular test-level loadings and which are not affected by item pass rates, unlike the item-test correlation which lacks the invariance property due to being computed as a within-group correlation based on sum scores, therefore circumventing issues mentioned by Wicherts (2017).”

->

This approach calculates item loadings on the same scale as test-level loadings, unaffected by item difficulty. Unlike item-test correlations, these loadings have the invariance property since they're not computed as within-group correlations based on sum scores. This addresses the issues raised by Wicherts (2017).

 

page 4
-> 

“Jensen's MCV was applied to the item data. First, we derived the item standardized factor loadings. We computed the Cohen's d gap, estimated using the inverse normal distribution function (qnorm() in R) based on the pass rates by race. For instance, on the first item, Whites had a pass rate of 82.2% and Blacks of 68.3%. This corresponded to z score means of 0.922 and 0.475, yielding a gap of d = 0.447. Figure 2 shows the results.”
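
The pass-rate conversion described in the passage above can be sketched in a few lines. This is a minimal Python illustration, not the authors' R code: it uses `statistics.NormalDist().inv_cdf` as a stand-in for R's `qnorm()`, and the function name `gap_from_pass_rates` is hypothetical.

```python
from statistics import NormalDist


def gap_from_pass_rates(pass_rate_a: float, pass_rate_b: float) -> float:
    """Item-level gap: difference between the z scores implied by two pass rates.

    Both pass rates must lie strictly between 0 and 1, because the inverse
    normal CDF is undefined at exactly 0 or 1.
    """
    inverse_normal = NormalDist().inv_cdf  # standard normal, like qnorm() in R
    return inverse_normal(pass_rate_a) - inverse_normal(pass_rate_b)


# Pass rates quoted for the first item: 82.2% (Whites) and 68.3% (Blacks).
d_item1 = gap_from_pass_rates(0.822, 0.683)
print(f"d for item 1 = {d_item1:.3f}")
```

With the two quoted pass rates this gives a gap of about 0.447, in line with the value in the passage.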

 

page 5

I don’t understand why some of the labels in the plot say “1of5_3” instead of a word, when other points are words.

 

page 6

“In this study, the Black sample was relatively small (N=63), so the true correlation will be larger as sampling error biases the correlation towards 0 (Dutton & Kirkegaard, 2021). The estimated gap at a factor loading of 1.00 should be unbiased as random error in the outcome variable should not affect the slope, and the random error in the predictor variable is relatively small. Indeed, the estimated gap (intercept + slope_g-loading) at a perfect factor loading was 0.76 d, which is practically the same as the observed gap, and slightly closer to the reliability adjusted value (d = 0.74 and d = 0.75, respectively).”

I found this paragraph confusing. Here is an AI suggested version FWIW:

"In this study, the Black sample was relatively small (N=63), which increases uncertainty in the correlation estimate due to sampling variability (Dutton & Kirkegaard, 2021). The estimated gap at a factor loading of 1.00 should provide a reliable estimate, as measurement error in the outcome variable does not bias the regression slope, and measurement error in the predictor variable is relatively small. The estimated gap (intercept + slope_g-loading) at a perfect factor loading was 0.76d, which closely matches both the observed gap (d = 0.74) and the reliability-adjusted value (d = 0.75)."

 

 

page 6

“Because the main analysis uses all items, including the 8 easy items that could have served as attention check to detect potentially poor responses”

-> The main analysis uses all items, including the 8 easy items that could have served as attention check to detect potentially poor responses. We conducted a supplemental analysis that involved ...

Comment: It seems to me that this may remove Blacks who are not merely inattentive, but who answer incorrectly because of low IQ. So I would question even including it in the paper.

 

 

page 7

“But the magnitude of the Black-White gap found in the present study (d = 0.75) is smaller than what was reported in another, recent online test (d = 0.99) on the same Prolific platform comprising a vocabulary subscale and a paper folding subscale (Kirkegaard, 2022).”

I don’t think the ‘but’ is warranted here.

 

 

page 8

“Bates & Gignac (2022) conducted several analyses using the Prolific platform and found a modest effect size of (≈2.5 IQ) favoring the incentive group, a finding consistent with Gignac (2018) and Merritt et al. (2019).”

-> favoring the group who received extra payment for correct answers.

 

page 8

“Future research should explicitly measure motivation and its impact on test performance, particularly in online settings, to better understand its role in group differences.”

The paragraph break should be placed differently. As it stands, it makes me think that this whole paragraph is (also) about motivation.

 

Comment:

I would expect the way the subjects are recruited to restrict the IQ range. You cannot be too low in IQ to figure out how to use Prolific, and you cannot be too high in IQ if your time is not worth more than 8 dollars per hour. Thus I would expect the estimated group differences to be smaller than those in the full population. This could be mentioned in the discussion.


-----

Overall, I accept it for publication.