Thank you for updating the paper. I see a few improvements and some clarifications, but several points remain unanswered.
A general complaint first. Although I didn't mention it earlier, I really wish the figures and tables were numbered: if I want to comment on several figures or tables but each one is labeled X, it becomes very tedious.
Now about the content itself.
In the method section, you still did not explain what you mean by "bias-adjusted". I believe you estimated the means after removing the DIF items; it would be much clearer to say that you "removed" these offending items. As I mentioned in my earlier comment, you need to specify clearly in the text which criteria you used for removing DIF items. What do you consider DIF? As I suggested before, an effect size (which one, then? Chalmers's DRF or Meade's IDS?) is quite convenient for deciding which items to keep.
Similarly, you still did not define the "LOO method" (leave-one-out?). In fact, even leave-one-out should be explained properly; non-R users are not familiar with it.
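For what it's worth, this is what "leave-one-out" usually means in this context (my reading of the method, not necessarily the authors'): recompute some statistic repeatedly, each time with one item dropped.

```python
def leave_one_out(items, stat):
    """Apply `stat` to the item set with each item dropped in turn.

    Returns one value per left-out item, so you can see how much
    each item alone changes the statistic.
    """
    return [stat(items[:i] + items[i + 1:]) for i in range(len(items))]

# e.g., sums of three item scores with each one left out:
# leave_one_out([1, 2, 3], sum) -> [5, 4, 3]
```

A one-paragraph explanation along these lines in the method section would make the procedure clear to readers who have never used the R packages involved.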
You should also, at least briefly, describe the 2PL, 3PL and 4PL models, their differences, and why you prefer the 4PL over a simpler model. A reference would be a plus.
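For readers' orientation, the nesting of these models can be shown in a single line (a generic sketch using the conventional a/b/c/d parameter names, not anything taken from the paper):

```python
import math

def irt_prob(theta, a, b, c=0.0, d=1.0):
    """4PL item response function.

    a: discrimination, b: difficulty, c: lower asymptote (guessing),
    d: upper asymptote (slipping/carelessness).
    Fixing d=1 gives the 3PL; additionally fixing c=0 gives the 2PL.
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the curve sits halfway between the asymptotes:
# irt_prob(0.0, 1.5, 0.0, 0.2, 0.9) -> 0.55
```

A description of this kind, with a sentence on why the extra c and d parameters are needed for your items, would suffice.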
calculating the raw percentage of people who the individual scored higher than a threshold
This sentence is oddly written; I assume you mean the percentage of people whom the individual outscored, i.e., an empirical percentile rank.
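If that reading is right, the computation can be stated in one line (my guess at the intended calculation, not the authors' code):

```python
def percentile_rank(score, sample):
    """Percentage of scores in `sample` strictly below `score`."""
    return 100.0 * sum(s < score for s in sample) / len(sample)

# percentile_rank(3, [1, 2, 3, 4]) -> 50.0
```

Stating it this plainly in the text would remove the ambiguity.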
The predicted average score for every cohort was calculated using the restricted cubic splines.
It would be wise to state clearly that the restricted cubic spline is used to adjust for age.
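For concreteness, and assuming the standard (Harrell) parameterization is the one used, a restricted cubic spline with knots $t_1 < \dots < t_k$ models the age trend as

```latex
f(x) = \beta_0 + \beta_1 x + \sum_{j=1}^{k-2} \beta_{j+1} X_j(x),
\qquad
X_j(x) = (x - t_j)_+^3
       - (x - t_{k-1})_+^3 \,\frac{t_k - t_j}{t_k - t_{k-1}}
       + (x - t_k)_+^3 \,\frac{t_{k-1} - t_j}{t_k - t_{k-1}}
```

with $(u)_+ = \max(u, 0)$; the "restriction" forces the fit to be linear beyond the boundary knots. A sentence stating the number and placement of knots would also help readers reproduce the adjustment.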
Figure X. Relationship between General Knowledge by age, modeled with a restricted cubic spline (ages of above 100 excluded in the analysis).
I'm pretty sure this wasn't in the earlier version, but "ages of above 100" is surprising. Maybe it needs to be explained more clearly?
The effect of age and the time taken to do the test on the result of the test were calculated to observe whether there was a notable age or effort effect.
Typically, after a sentence like this, we expect results to be presented and discussed. But after this sentence we jump to an entirely different section: "Factor Structure".
Now we move to the methodological issues.
You said that the CFA "model fit was mediocre, but not terrible". I did cite Hu & Bentler, but only as one example. In fact, there are no one-size-fits-all cutoff values: cutoffs are always conditional on the design of the simulation that produced them, and the issue with the great majority of simulation studies is that they use very simple models. Sivo et al. (2006), "The Search for 'Optimal' Cutoff Properties: Fit Index Criteria in Structural Equation Modeling", provide an excellent illustration of the problem and show how the cutoffs change depending on the model. Given the complexity of your model, the fit seems acceptable.
Since you ran a factor analysis specifying 7 factors, I suggest reporting the 6-factor EFA instead, since the 7th factor is meaningless. I do not expect the loadings to be exactly the same between the 6- and 7-factor EFAs, and I would not specify a 6-factor CFA model based on loadings suggested by a 7-factor EFA.
Speaking of which, I still do not see an explanation of the model specification for your higher-order factor (HOF) model. As I mentioned earlier, you need to justify the cross-loadings and specify which cutoff you are using. Typically, many people pick 0.30, but based on the great majority of the simulation studies I found, a cutoff of 0.20 or 0.25 is more appropriate. Here are a few studies:
Xiao, Y., Liu, H., & Hau, K. T. (2019). A comparison of CFA, ESEM, and BSEM in test structure analysis. Structural Equation Modeling: A Multidisciplinary Journal, 26(5), 665-677.
Ximénez, C., Revuelta, J., & Castañeda, R. (2022). What are the consequences of ignoring cross-loadings in bifactor models? A simulation study assessing parameter recovery and sensitivity of goodness-of-fit indices. Frontiers in Psychology, 13, 923877.
Zhang, B., Luo, J., Sun, T., Cao, M., & Drasgow, F. (2023). Small but nontrivial: A comparison of six strategies to handle cross-loadings in bifactor predictive models. Multivariate Behavioral Research, 58(1), 115-132.
If I apply a cutoff of .25 (or close to it), then you are missing tons of cross-loadings (at least based on my reading of your 7-factor EFA). Perhaps the 6-factor EFA has fewer cross-loadings. Speaking of "Table X. Oblimin rotated factor analysis of the 32 questions.": what are the columns "loadings" and "cumulative"?
In your HOF model, you now say that you allowed correlated (latent) residuals because they have a non-trivial correlation. This is not what I meant by justification: what I meant is theoretical justification. A correlation of .2 or .3 or whatever is not a justification. For instance, Cole et al. (2007) mention method effects as a justification. That of course does not apply here, so you may not want to cite Cole et al. (I cited them as an example of justifiable correlated residuals), but a more general rationale for specifying correlated residuals is the presence of another, unmeasured (small) factor. So the question is: what is the common source between computational knowledge and technical knowledge, and between aesthetic knowledge and literary knowledge? If you can't identify the sources, I suggest removing these correlated residuals. It is very easy to overfit your model this way, "improving" it in terms of fit but making it less defensible scientifically and less replicable (MacCallum et al., 1992).
Table X. Correlation matrix of the 6 knowledge subfactors of general knowledge.
This gave me a lot of trouble because of the numbers on the diagonal: I did not realize at first that this was actually a matrix of residual correlations, not simple correlations. The title needs to be fixed. This also shows why Table/Figure "X" is very hard to read and follow; they must be numbered, even in drafts. However, as I noted earlier, if you opt not to use correlated residuals, then you no longer need this table.
Hallquist (2017) lists some common mistakes he has come across when reviewing SEM papers. A few of the points mentioned there are relevant to your current problems: https://psu-psychology.github.io/psy-597-SEM/12_best_practices/best_practices.html
I suggest removing the table "Fit as a function of modeling choices." because it is uninformative. Since model 1 is more saturated than model 2, you should of course expect better fit for model 1; but parsimony is a desirable feature in CFA modeling. And since you do not mention this table at all in the main text, why not remove it?
Since I wasn't sure whether you used a bifactor model in the earlier version, and you now state clearly that you didn't, I can lay out the issue more precisely. If you can fit an HOF model, regress the factors on the sex variable, and obtain sex estimates of the factor means, I do not see the added value of the second analysis, which uses a separate one-factor CFA for each facet, as illustrated in "Table X. Latent differences in knowledge by sex and facet of knowledge.".
Why is this a problem? The factors are obviously correlated and there are cross-loadings, which makes separate simple-structure CFAs a rather dubious approach. Either stick with the earlier table reporting gaps estimated by the HOF model (which for some reason was removed in this version), or also provide estimates from a bifactor model.
The added value here is that a bifactor structure provides more accurate estimates of the specific factors, because the general factor is completely separated out in a bifactor model, whereas in a higher-order model the specific factors are represented as residuals. The literature on this matter abounds:
Bornovalova, M. A., Choate, A. M., Fatimah, H., Petersen, K. J., & Wiernik, B. M. (2020). Appropriate use of bifactor analysis in psychopathology research: Appreciating benefits and limitations. Biological psychiatry, 88(1), 18-27.
Beaujean, A. A., Parkin, J., & Parker, S. (2014). Comparing Cattell–Horn–Carroll factor models: Differences between bifactor and higher order factor models in predicting language achievement. Psychological Assessment, 26(3), 789.
Gignac, G. E. (2008). Higher-order models versus direct hierarchical models: g as superordinate or breadth factor?. Psychology Science, 50(1), 21.
In particular, Beaujean et al. (2014) and Gignac (2008) explain that a higher-order factor (HOF) model posits that the specific factors explain all the covariance among the observed test scores, while the bifactor model posits that the specific factors account for the residual covariance of the test scores that remains after extracting the covariance due to g.
This means that under the HOF model the specific factors are not independent of g. Depending on how you wish to interpret these factors, either the bifactor or the HOF model is preferred. But if you are curious whether the gaps on the specific factors are similar across the bifactor and HOF models, then I advise fitting a bifactor model as well. Given your focus on the specific factors, a bifactor model may even be more appropriate. If you don't use one, it would be wise to explain in the discussion why not.
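In equations (a generic sketch with my own symbols, where test score $x_i$ loads on specific factor number $f(i)$):

```latex
\text{HOF:}\quad
x_i = \lambda_i \eta_{f(i)} + \varepsilon_i,
\qquad \eta_j = \gamma_j g + \zeta_j
\;\Rightarrow\;
x_i = (\lambda_i \gamma_{f(i)})\, g + \lambda_i \zeta_{f(i)} + \varepsilon_i

\text{Bifactor:}\quad
x_i = \lambda_{g i}\, g + \lambda_{s i}\, s_{f(i)} + \varepsilon_i,
\qquad g \perp s_j
```

In the HOF model each test's g loading is constrained to be the product $\lambda_i \gamma_{f(i)}$, and the "specific" part of a factor is only its disturbance $\zeta_j$; in the bifactor model the g loadings $\lambda_{g i}$ are free and the specific factors $s_j$ are orthogonal to g, which is why the specific-factor gaps are cleaner to interpret there.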
In any case, what you should be attempting is something similar to Reynolds et al. (2008). The method is very straightforward: add paths from the sex variable to all factors. I suggest removing any regression path whose sex difference is close to zero or whose confidence interval includes zero; this achieves parsimony and likely yields more accurate parameter estimates (since that is apparently one of your concerns). Although Reynolds et al. tested for measurement bias beforehand, you don't have to, but in that case add a discussion of this potential issue. In other words, I suggest removing these one-factor CFA models: something close to your latax2 with the sex dummy variable, but without the correlated residuals. (EDIT: I'm not sure how the national differences in latent factors were computed; a short description would be helpful.)
Reynolds, M. R., Keith, T. Z., Ridley, K. P., & Patel, P. G. (2008). Sex differences in latent general and broad cognitive abilities for children and youth: Evidence from higher-order MG-MACS and MIMIC models. Intelligence, 36(3), 236-260.
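In MIMIC form, this suggestion amounts to the following (a sketch in my own notation, with $z$ the sex dummy and $x_i$ loading on factor number $f(i)$):

```latex
\eta_j = \gamma_j z + \zeta_j \quad (j = 1, \dots, 6),
\qquad
x_i = \lambda_i \eta_{f(i)} + \varepsilon_i
```

Each $\gamma_j$ is then the latent sex difference on factor $j$ (in SD units if the factors are standardized), and paths with $\gamma_j$ indistinguishable from zero are pruned from the final model.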
Given that it’s implausible that the difference is that large (or even in the right direction),
You should add references to back this up, because "implausible" is a rather bold and strong claim. Given your discussion, it does not necessarily surprise me: the test is highly culture-specific, after all. The difference likely reflects content- or test-specific differences rather than g differences (indeed, g here is heavily contaminated by specific knowledge, so it is probably not a good g to begin with).
Table AX. General Knowledge by country (no bias adjustment)
This is merely a suggestion, but it might be useful to display confidence intervals for the general knowledge score, since the sample is small for some countries (also, specify that this is the general factor of knowledge; I don't think that is obvious at first glance).
The latent difference was unbelievably large (d = -.7), probably due to the poor fit of the model (CFI < .7).
I am not aware of any research paper or textbook claiming that the effect size has anything to do with model fit. Furthermore, I have seen many instances of incredibly large latent mean differences going along with very good fit. What a poor fit tells you is merely this: the estimates should not be trusted, because the model is largely misspecified (i.e., wrong). A more appropriate model may (or may not) yield different point estimates.
Regarding whether the gap is "too" large or not: remember this is more an aptitude test than an IQ test, so the small sex gap on IQ tests does not generalize to aptitude sex gaps. The g depends on the content of the tests. In an IQ test this is not a problem, since the test includes a vast, representative array of abilities. Here, on the other hand, the latent factors reflect highly specific knowledge (technical, literary, aesthetic), which is perhaps why DIFs were detected. Whether this g gap is implausibly large requires looking at past studies of sex differences on very similar test batteries. One might argue that achievement-g and IQ-g are highly correlated, but that does not imply the means are identical.
A limitation worth mentioning in the discussion is that your use of SEM does not allow testing for measurement bias at the subtest level. An alternative approach is MGCFA, which not only estimates these gaps but also tests for measurement invariance at the subtest level. Testing for measurement bias at the item level is one step, but it is simply not enough: the test could still be biased at the subtest level if loadings or intercepts differ. Another reason to expect possible bias at the subtest level is simply that the item-level test is not optimal: your results show that the number of biased items is very large, and internal DIF methods are reliable only if the ratio of biased to unbiased items is very small. If DIF items are not a small minority, the method may not be very accurate (DeMars & Lau, 2011). I am not necessarily asking for an MGCFA analysis, as it is a very complicated technique, but the related issues should be exposed. The data are accessible, so any researcher interested in the matter can apply MGCFA in the future.
DeMars, C. E., & Lau, A. (2011). Differential item functioning detection with latent classes: how accurately can we detect who is responding differentially?. Educational and Psychological Measurement, 71(4), 597-616.
In the discussion, you say the reliability is high but refer to the symbol ω, which is odd because this symbol generally denotes coefficient omega when referring to reliability, whereas in the main text you say you used the Spearman-Brown reliability. Is this what is reported in "Table X. Comparison of the seven methods used to calculate general knowledge."? In any case, Spearman-Brown reliability is usually denoted ρ.
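For reference, this is the standard Spearman-Brown prophecy formula I have in mind (a generic statement of the formula, not the authors' code):

```python
def spearman_brown(r, n=2.0):
    """Reliability of a test lengthened by factor n, given reliability r.

    With n=2 and r a split-half correlation, this is the usual
    split-half (Spearman-Brown) reliability of the full test.
    """
    return n * r / (1.0 + (n - 1.0) * r)
```

If this is indeed what the table reports, labeling it ρ (and stating the split used) would resolve the ambiguity with ω.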
The last sentence of the discussion section does not seem an appropriate way to end the section; it feels abrupt and unfinished. I suggest adding a brief discussion of the implications of these findings and/or suggestions for future research.
In the reference section, you have Brown et al. (2023), but in the main text you cite both Brown et al. (2023) and Brown et al. (2022).
One more question about methodology: why do you present the results of the CFA model first (p. 11) and only then the results of the DIF analyses? Does that mean you fitted the CFA before testing for DIF? Because DIF testing should be conducted first.
Now, I would like to comment on your answers:
Cultural knowledge was initially called medical knowledge, because a lot of medicine related (e.g. cancer) topics loaded on to it. However, knowledge of things like serial killers also loaded on to it, so a more broad name was given.
This is fine, but the last thing we expect when reading "cultural knowledge" is a strong medical-knowledge flavour. It is clear from your description that "cultural knowledge" is probably misleading. How many medical items load onto this factor? If the percentage is high, I suggest you stick with "medical knowledge". EDIT2: For general clarity, I also recommend describing each latent factor briefly, telling us which kinds of items load onto each of these factors. This helps show how well the factors are defined. For instance, if the computational factor contains a non-trivial percentage of verbal items, then it also measures English knowledge, and not just math ability.