Thanks for your review.
Concerning the awkward sentences, I have fixed them. I found only one more sentence with repetitive wording (the first sentence of the first paragraph of the Discussion); I could not find any others.
I added text regarding differences in reliability and ipsativity in the second paragraph of the Discussion.
With respect to ipsativity, I admit it is rarely discussed in DIF studies, but I believe it is unlikely to be a big issue in general. The reason is that, because DIF detection evaluates the relative size of item biases, one needs to ensure that the anchor is valid (i.e., DIF-free), which in principle is impossible if one must first run a DIF model to identify the DIF-free items. This is why some statisticians recommend relying on content expertise instead. For well-known, widely used IQ tests, items are heavily scrutinized for fairness. The problem with the Wordsum is that I know of no study that has validated the fairness of its items across racial or even gender groups. I searched again yesterday and today but could not find any such content review.
With respect to reliabilities, I added only a sentence, but I am not too sure what else to say. You are right that if a bias is introduced this way, it would tend to overestimate bias rather than underestimate it, but strict factorial invariance (i.e., additionally requiring equal error variances) is not a requirement for measurement equivalence. I remember a few papers (I think one by Dolan and one by Millsap or Meredith) arguing that strong (scalar) invariance is sufficient. I could add a warning about this if you feel it is needed, though.
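To spell out the distinction in generic common-factor notation (mine, not necessarily the notation used in the paper): the model for group $g$ is

$$y_{ig} = \nu_g + \Lambda_g \eta_{ig} + \varepsilon_{ig}, \qquad \mathrm{Var}(\varepsilon_{ig}) = \Theta_g,$$

where strong (scalar) invariance constrains $\Lambda_g = \Lambda$ and $\nu_g = \nu$ across groups, while strict invariance additionally constrains the error variances, $\Theta_g = \Theta$.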
I added a few words describing the factor model on which the eigenvalues are based.
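If it helps, this is roughly what the eigenvalue computation looks like; a minimal sketch only, since the variable names, the placeholder data, and the use of Pearson rather than tetrachoric correlations are my assumptions, not necessarily what the paper does:

```python
import numpy as np

# Placeholder 0/1 item responses: n_respondents x n_items (hypothetical data)
items = np.random.binomial(1, 0.6, size=(1000, 10))

# Inter-item correlation matrix (Pearson here; tetrachoric correlations would
# be more appropriate for binary items but require an extra package)
R = np.corrcoef(items, rowvar=False)

# Eigenvalues of the correlation matrix, largest first; in real Wordsum data a
# dominant first eigenvalue is what supports a single common factor (with this
# random placeholder data the eigenvalues will all hover around 1)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
print(eigvals)
```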
With respect to Table 1, as I explained in the text, I completely disregard significance tests because a large sample will almost always flag every item as biased. A significant result merely shows that the sample has enough power to detect even extremely small (and therefore unimportant) bias, and I know of no statistician who recommends flagging items on significance alone. My method relies on an exploratory approach, which I now describe briefly to make this clear. The analysis should be robust because the items with the largest fit index values display only small biases, so the remaining items (assumed to be DIF-free) are unlikely to reveal a sizeable DIF.
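To illustrate the sample-size point with made-up numbers (not the actual Wordsum counts): with 50,000 respondents per group, even a one-percentage-point difference in pass rates on a single item comes out highly significant.

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table for one item: two groups of 50,000 respondents,
# pass rates of 61% vs 60% -- a negligible difference in practical terms
table = [[30500, 19500],   # group 1: correct, incorrect
         [30000, 20000]]   # group 2: correct, incorrect

chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)  # p is around .001, well below .05, despite the trivial gap
```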
I have added a few robustness checks (Sections 3.1 and 3.2). For the black-white analysis, one more item is tested for DIF (one model adds word A, another adds word D, since those had the next-largest fit index values). Both A and D show small bias, and the other items retain similar values. In the gender analysis, I simply removed word G, since that item has the smallest (in fact very small) bias, and the remaining items tested for DIF show similar values. I also briefly discuss why I do not extend the robustness checks by adding even more items to be tested for DIF: DIF methods do not perform well when DIF is not a minor phenomenon, as DeMars & Lau (2011) explain quite well, I believe. Simulations (e.g., Finch, 2005) show that the higher the percentage of DIF items a test contains, the lower the power and the higher the false positive rate.
DeMars, C. E., & Lau, A. (2011). Differential item functioning detection with latent classes: How accurately can we detect who is responding differentially? Educational and Psychological Measurement, 71(4), 597-616.
Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological Measurement, 29(4), 278-295.
EDIT: I just realized that in Sections 3.1 and 3.2, where I added the robustness checks, I wrote sDIF and dDIF instead of sDRF and dDRF. The next update will fix this.