Reviewer 1: Gregory Connor's (GC) review is available in the OSF file uploaded months ago.
GC argued that the abstract should probably remove the part mentioning that the number of biased items decreased from 27 to 1 after Bonferroni correction, unless some clarification is provided as to why such a large drop occurred. So I reworded the sentence to simply report that there is only 1 biased item after Bonferroni correction.
GC argued that "If all the items are unbiased, the bionomial probability of finding 27 or more items biased using a correctly specified test with 95% confidence is 0.0000266. So you can reject unbiasedness of the full set of 225 of items with 99.99% confidence." and then later "You are using a p-value of 0.02%, which means that you are relying very heavily on EXACT normality of the test statistic distribution. Do you believe it is extremely close to an exact Gaussian distribution? You do not mention what test statistic you are putting to this extreme test of tail-distribution reliable normality."
I addressed this in the results section and modified the original paragraph as follows:
In this analysis, we fit a partially invariant model based on items with p < .05 for the test, and compare scores based on this model with those from the initial, baseline model. The difference in the gap size is then computed. Since there are 225 items, the p values can be corrected for multiple testing using the Bonferroni procedure. Without Bonferroni correction, 27 items were found to be biased (27/225 = 12%, i.e., above the chance expectation of 5%), but after multiple testing correction only 1 item remained biased. In light of this latter result, one might reason that the conclusion of no bias is unwarranted, because finding 27 significant items by chance alone is astronomically unlikely if all items were truly unbiased. To address concerns that the Bonferroni correction is overly conservative and relies on extreme-tail assumptions, we also applied the Benjamini-Hochberg (BH) procedure, which controls the false discovery rate (FDR). The BH correction yielded the same result, identifying only a single biased item. Crucially, regardless of the method, the impact on test-level bias was minimal (in Cohen's d, 0.02 against Blacks without correction, or 0.01 against Whites with Bonferroni correction). This means the statistical significance does not translate into practical significance.
And added this footnote:
Regarding the reviewer's concern about the extreme Bonferroni-corrected alpha (α = 0.05 / 225 = 0.00022), the test statistic used was the Likelihood Ratio Test (LRT) statistic, which is asymptotically χ²-distributed, not Gaussian. We agree that all asymptotic approximations, including the χ², must be treated with caution in the extreme tails.
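For transparency, the two corrections described above can be illustrated with a minimal sketch; the p values below are random placeholders standing in for the 225 item-level LRT p values, and statsmodels implements both procedures:

```python
# Minimal sketch of the two multiplicity corrections discussed above.
# The p values here are random placeholders; the real ones come from
# the 225 item-level likelihood ratio tests reported in the paper.
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
p_values = rng.uniform(size=225)  # placeholder item-level p values

# Bonferroni: reject when p < 0.05 / 225 (= 0.00022, the alpha in the footnote)
bonf_reject = multipletests(p_values, alpha=0.05, method="bonferroni")[0]

# Benjamini-Hochberg: controls the false discovery rate instead of the
# family-wise error rate, so it is less conservative in the extreme tail
bh_reject = multipletests(p_values, alpha=0.05, method="fdr_bh")[0]

print("Items flagged, Bonferroni:", bonf_reject.sum())
print("Items flagged, BH:", bh_reject.sum())
```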
We also corrected several typos pointed out by GC and reworded two confusing sentences.
Reviewer 2: Unfortunately, the anonymous review requested by Emil Kirkegaard was lost. I made a PDF of the review and uploaded it to OSF, but the OSF file disappeared for some reason. Emil did not keep a copy of this review, and neither did I.
Reviewer 3 (in this thread):
In the revised paper, I addressed all of your points and suggestions, with the exception of the abstract: I did not mention there that it is a representative US sample because it isn't. Very low ability and very high ability participants are certainly lacking. I added this detail at the beginning of the discussion section.
Regarding your questions and observations:
"I don’t understand why some of the labels in the plot say “1of5_3” instead of a word, when other points are words."
We use the same labels for all items in the follow-up test. 1of5_3 means the 3rd item in the select-1-out-of-5 item format. We could have used the target word instead, but this would still cause confusion because there is no single target word for the select-2-out-of-5 and select-3-out-of-5 formats. We use the same labels to indicate that these items all belong to the same set, which was administered in the follow-up test.
"Comment: It seems to me that this may remove Blacks who are not merely inattentive, but answer wrong because they are low IQ. So I would question even including it in the paper."
These words are very easy; even non-native speakers should be able to answer them. I doubt this is related to true IQ. In any case, this was just a supplementary analysis, and the paper reports the results for all Blacks, including those who did not pass the attention check.
"the separation of paragraphs should be different. This makes me think that this whole paragraph (also) is about the motivation thing."
I agree. This last paragraph currently reads as follows:
Future research should explicitly measure motivation and its impact on test performance, particularly in online settings, to better understand its role in group differences. Such test-taking behavior has already been examined using response time data to account for rapid-guessing behavior (Michaelides et al., 2024; Lee & Jia, 2014). A more convincing test of Spearman's hypothesis is to apply MCV to other subtests that are less subject to cultural influences than vocabulary.
It starts with "future research" and then moves on to discussing SH. I split this long paragraph right where it turns to SH, and moved that portion so that it immediately follows this paragraph:
Research on test bias is useful not only for examining the cultural hypothesis but also for vindicating Spearman's hypothesis: not only for the purpose of valid cross-group comparison, but also for estimating the true impact of g on group differences after accounting for item bias. Indeed, there is a report that the effect of g increases when biased items are removed (te Nijenhuis et al., 2016b).
So the flow is now much more natural and fluid.