I am not an expert in the topics of this paper. I have skimmed through it, and it looks fine as far as I am able to evaluate. I have some minor notes below, mostly about language:
page 1
I would state in the introduction that a representative US sample is used.
page 3
“Each item was scored as correct or incorrect if the subject chose the right option(s)”
-> depending on whether the subject chose the right option(s).
page 3
“This allows for the calculation of item loadings that are on the same scale as regular test-level loadings and which are not affected by item pass rates, unlike the item-test correlation which lacks the invariance property due to being computed as a within-group correlation based on sum scores, therefore circumventing issues mentioned by Wicherts (2017).”
->
This approach calculates item loadings on the same scale as test-level loadings, unaffected by item difficulty. Unlike item-test correlations, these loadings have the invariance property since they're not computed as within-group correlations based on sum scores. This addresses the issues raised by Wicherts (2017).
page 4
->
“Jensen's MCV was applied to the item data. First, we derived the item standardized factor loadings. We computed the Cohen's d gap, estimated using the inverse normal distribution function (qnorm() in R) based on the pass rates by race. For instance, on the first item, Whites had a pass rate of 82.2% and Blacks of 68.3%. This corresponded to z score means of 0.922 and 0.475, yielding a gap of d = 0.447. Figure 2 shows the results.”
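To check the arithmetic in this passage, here is a minimal sketch in Python rather than R. It uses `statistics.NormalDist().inv_cdf` from the standard library as a stand-in for `qnorm()`. The function name `gap_from_pass_rates` is my own, and the sketch assumes the gap is simply the difference between the two z-scores (equal unit variances in both groups), which is how I read the quoted description.

```python
from statistics import NormalDist


def gap_from_pass_rates(pass_rate_a, pass_rate_b):
    """Convert two pass rates to z-scores and return (z_a, z_b, z_a - z_b).

    Hypothetical helper for illustration; assumes a standard normal latent
    trait with equal variance in both groups.
    """
    std_normal = NormalDist()  # mean 0, SD 1; inv_cdf plays the role of R's qnorm()
    z_a = std_normal.inv_cdf(pass_rate_a)
    z_b = std_normal.inv_cdf(pass_rate_b)
    return z_a, z_b, z_a - z_b


# Pass rates for the first item, as quoted from the manuscript
z_white, z_black, d = gap_from_pass_rates(0.822, 0.683)
print(round(z_white, 3), round(z_black, 3), round(d, 3))
```

The z-scores may differ from the quoted 0.922 and 0.475 in the third decimal, presumably because the quoted pass rates are rounded, but the gap should agree with d = 0.447 to within rounding.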
page 5
I don’t understand why some of the labels in the plot say “1of5_3” instead of a word, while other points are labeled with words.
page 6
“In this study, the Black sample was relatively small (N=63), so the true correlation will be larger as sampling error biases the correlation towards 0 (Dutton & Kirkegaard, 2021). The estimated gap at a factor loading of 1.00 should be unbiased as random error in the outcome variable should not affect the slope, and the random error in the predictor variable is relatively small. Indeed, the estimated gap (intercept + slopeg-loading ) at a perfect factor loading was 0.76 d, which is practically the same as the observed gap, and slightly closer to the reliability adjusted value (d = 0.74 and d = 0.75, respectively).”
I found this paragraph confusing. Here is an AI-suggested version, for what it's worth:
"In this study, the Black sample was relatively small (N=63), which increases uncertainty in the correlation estimate due to sampling variability (Dutton & Kirkegaard, 2021). The estimated gap at a factor loading of 1.00 should provide a reliable estimate, as measurement error in the outcome variable does not bias the regression slope, and measurement error in the predictor variable is relatively small. The estimated gap (intercept + slope_g-loading) at a perfect factor loading was 0.76 d, which closely matches both the observed gap (d = 0.74) and the reliability-adjusted value (d = 0.75)."
page 6
“Because the main analysis uses all items, including the 8 easy items that could have served as attention check to detect potentially poor responses”
-> The main analysis uses all items, including the 8 easy items that could have served as attention checks to detect potentially poor responses. We conducted a supplemental analysis that involved ...
Comment: It seems to me that this may remove Black participants who are not inattentive but who answer incorrectly because of low IQ. I would therefore question whether this analysis should be included in the paper at all.
page 7
“But the magnitude of the Black-White gap found in the present study (d = 0.75) is smaller than what was reported in another, recent online test (d = 0.99) on the same Prolific platform comprising a vocabulary subscale and a paper folding subscale (Kirkegaard, 2022).”
I don’t think the ‘but’ is warranted here.
page 8
“Bates & Gignac (2022) conducted several analyses using the Prolific platform and found a modest effect size of (≈2.5 IQ) favoring the incentive group, a finding consistent with Gignac (2018) and Merritt et al. (2019).”
-> favoring the group who received extra payment for correct answers.
page 8
“Future research should explicitly measure motivation and its impact on test performance, particularly in online settings, to better understand its role in group differences.”
The paragraph break should be placed differently. As it stands, it makes me think that this whole paragraph is (also) about motivation.
Comment:
I would expect the recruitment method to restrict the IQ range of the sample. Participants cannot have too low an IQ, since they must figure out how to use Prolific, and they are unlikely to have a very high IQ, since their time is evidently not worth more than 8 dollars per hour. I would therefore expect the estimated group differences to be smaller than those in the full population. This could be mentioned in the discussion.
-----
Overall, I accept it for publication.