
On The Validity of The GSS Vocabulary Test Across Groups


Submission Editor
Emil O. W. Kirkegaard

Author
Meng Hu



The psychometric properties of the Wordsum vocabulary test across race and gender groups have not been studied yet. Taking advantage of a large sample of American adults from the General Social Survey (GSS), differential item and test functioning (DIF/DTF) was evaluated across gender and ethnicity groups using Item Response Theory (IRT). Two items displayed DIF with respect to race, whereas four items displayed DIF with respect to gender. However, because the DIFs run in both directions, there is no consistent bias against either group at the test level. Despite being culturally loaded, the Wordsum shows no evidence of culture bias.

intelligence, group differences, vocabulary, differential item functioning, Wordsum, test bias


Reviewers (0/0/2)
John Herrnstein: Accept
Simon Wright: Accept

Sun 26 Feb 2023 08:07


Firstly, this is a very useful paper. I'm glad to review it. 

There is considerable need for editing. Sentences like "Instead, one should instead think of cultural distance" are awkward, and such problems are common throughout the paper.

I was surprised to see no mention of ipsativity. If it is not relevant, I would like to know why. To my knowledge, it may be an adequate explanation for many equalized-bias findings.

If the Wordsum reliability difference is significant, it likely means that a total lack of bias cannot be established, since strict factorial invariance (SFI), or its IRT equivalent, is untenable. The equivalence of the results to, at best, strong/scalar invariance should therefore be stated.

In section 2.1 you say that "Our study sample includes 2,826 blacks and 15,186 whites, of which 8,051 are males and 7,135 are males." Make sure to fix this (presumably one of these should read "females").

When discussing eigenvalues, please state nearby in the text the number of indicators they are based on, so that we know the total variance they explain.

All of the items in Table 1 seem to display DIF. Could you use different numbers of anchors and reassess the results as a robustness check? The results are noticeably different in Table 3, so this seems like something to be concerned about.





Thanks for your review. 

Concerning the awkward sentences, I fixed them. I detected only one more sentence with repeated words (the first sentence of the first paragraph of the Discussion) but was unable to find others.

I added text regarding differences in reliability and ipsativity in the second paragraph of the Discussion.

With respect to ipsativity, I admit it is rarely discussed in DIF studies, but I believe it is unlikely to be a big issue in general. The reason is that DIF analysis evaluates the relative size of DIFs, so one needs to ensure that the anchor is valid (i.e., DIF-free), which in principle is impossible if one must run a DIF model to identify the DIF-free items in the first place. This is why some statisticians recommend relying on content expertise. For well-known, widely used IQ tests, items are heavily scrutinized for fairness. The problem with the Wordsum is that I know of no studies that validated the fairness of its items across racial or even gender groups. I searched again yesterday and today but could not find any such content review.

With respect to reliabilities, I added only a sentence, but I'm not too sure what to say. You're right that if a bias is introduced, it would seem to overestimate bias, not underestimate it, but strict factorial invariance (i.e., equivalence in error variances) is not a requirement for measurement equivalence. I remember a few papers (I think one was by Dolan, and one by Millsap or Meredith) mentioning that strong equivalence is sufficient. I could add a warning about this if you feel it is needed, though.

I added a few words describing the factor model the eigenvalues are based on.

With respect to Table 1, as I explained in the text, I completely disregard significance tests because a large sample will almost always flag everything as biased. Significance merely indicates that the sample has sufficient power to detect even extremely small (and therefore unimportant) bias, and I know of no statisticians who would recommend flagging items based on significance alone. My method relies instead on an exploratory approach, which I now describe briefly to make this clear. The analysis should be robust because some of the largest fit index values display small biases, so the remaining items (assumed to be DIF-free) are unlikely to reveal sizeable DIF.
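The sample-size point can be made concrete with a toy calculation (hypothetical pass rates, not the GSS data): the same trivial two-point gap in an item's pass rate is "significant" at GSS-like group sizes but nowhere near significant in a modest sample.

```python
# Toy illustration of why p-values are uninformative for DIF in large samples.
# The pass rates and group sizes below are invented for the example.
from scipy.stats import chi2_contingency

def dif_pvalue(n_per_group, p_ref=0.50, p_focal=0.52):
    """Chi-square test on a 2x2 pass/fail table for two groups."""
    table = [
        [round(n_per_group * p_ref), round(n_per_group * (1 - p_ref))],
        [round(n_per_group * p_focal), round(n_per_group * (1 - p_focal))],
    ]
    return chi2_contingency(table)[1]   # p-value

p_large = dif_pvalue(8000)   # roughly GSS-scale group sizes: "significant"
p_small = dif_pvalue(300)    # modest sample, same tiny effect: not significant
```

This is the rationale for flagging on effect size (here, the fit index values) rather than on p-values when groups number in the thousands.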

I have added a few robustness checks (Sections 3.1 and 3.2). For the black-white analysis, one more item was tested for DIF (one model adds word A, another adds word D, because those had the next biggest fit index values). Both A and D showed small bias, and the other items retained similar values. In the gender analysis, I simply removed word G, since it has the smallest (in fact very small) bias; the remaining items tested for DIF show similar values. I briefly discussed why I do not extend the robustness tests further by adding even more items to be tested for DIF: DIF methods perform poorly when DIF is not a minor phenomenon, as DeMars & Lau (2011) explain quite well. Simulations by Finch (2005) show that the higher the percentage of DIF items a test contains, the lower the power and the higher the false-positive rate.

DeMars, C. E., & Lau, A. (2011). Differential item functioning detection with latent classes: How accurately can we detect who is responding differentially? Educational and Psychological Measurement, 71(4), 597-616.

Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological Measurement, 29(4), 278-295.

EDIT: I just realized that in Sections 3.1 and 3.2, where I added the robustness checks, I wrote sDIF and dDIF instead of sDRF and dDRF. The next update will fix these.


Author has updated the submission to version #2


Author has updated the submission to version #3


Small update. I displayed the % of variance accounted for by the eigenvalues from parallel analysis, and at the same time fixed the typos for sDRF in the result section.



I mostly like this paper and the findings are very important. However, at the moment the captions on the figures are not very informative; to anyone not already fairly familiar with IRT they are confusing. This is particularly apparent with Figure 2, which is in my view the most important result in the paper. A clear explanation of what the axes on this graph actually mean, preferably within a longer caption but alternatively in the main body of the text, would be very useful. I would also provide a quick summary of the results from the 3PL model in the conclusion, rather than making the reader follow supplementary files to find out anything about these results, unless the 3PL model standard errors were too large to provide anything useful.


Author has updated the submission to version #4


Thank you for your review. About your first suggestion, I was unable to change the captions of the figures because they were produced by mirt package functions, which apparently do not offer such options. Instead, I have added a short description of the X and Y axes of each figure; see the updated version. Regarding the second suggestion, the standard errors were indeed too large to be acceptable. This was true for both the black-white and gender analyses; in fact, so large that I doubt any journal would accept such a finding. I changed the sentence in Section 2.2 from "large standard errors" to "standard errors that were too large to be acceptable" in the hope that readers better understand that the CIs/standard errors were simply not acceptable.


Author has updated the submission to version #5


The submission was accepted for publication.


Author has updated the submission to version #7


Author has updated the submission to version #8


Author has updated the submission to version #9