Psychometric Analysis of the Multifactor General Knowledge Test

Submission status
Reviewing

Submission Editor
Noah Carl

Authors
Emil O. W. Kirkegaard
Sebastian Jensen

Title
Psychometric Analysis of the Multifactor General Knowledge Test

Abstract

The Multifactor General Knowledge Test from the openpsychometrics website was evaluated on multiple dimensions, including its reliability, its ability to recover differences in areas where groups are known to differ, how it should be scored, whether older individuals scored higher, and its dimensionality. The best scoring method was to treat every checkbox as an item and sum the correct and incorrect responses. This yielded a highly reliable test (ω = 0.93) with a low median completion time (577 seconds) and a high ceiling (IQ = 149). One set of items (internet abbreviations) was found to have very low g-loadings, so we recommend removing it. The test also showed age, national, and gender differences that replicate the previous literature.

The test was clearly biased against non-Anglos, especially in the sections on aesthetic, cultural, literary, and technical knowledge. DIF testing suggested that the test was not biased in favor of Anglo countries, calling into question its usefulness in identifying highly biased tests. Between the sexes, DIF testing found many items biased against one gender or the other, but the bias did not favor either sex overall. We highly recommend using this test to examine the general knowledge of native English speakers, and using a culturally and linguistically adapted translation for non-English speakers.

Keywords
intelligence, sex differences, statistics, knowledge, methods

Supplemental materials link
https://osf.io/erx6q/

Reviewers ( 0 / 2 / 2 )
Reviewer 1: Considering / Revise
Reviewer 2: Accept
Reviewer 3: Considering / Revise
Reviewer 4: Accept

Tue 20 Jun 2023 02:43

Reviewer

The paper evaluates the psychometric properties of the MGKT scores, first using an IRT graded response model (since the questions have multiple possible correct responses) and then applying CFA to evaluate gender differences in each cognitive dimension reflected by the test. The study found that the test was biased against non-English speakers and biased with respect to gender.

The first problem which immediately comes to mind is the lack of a proper introduction and discussion section. The small section "Online tests" can be part of the introduction, but you must clearly introduce the problem, explain how you intend to address its intricacies, and state why the research is important.

Secondly, the sections should be better separated by numbering them (e.g., 1. introduction, 2. methodology, 3. results, 4. discussion). The current format of the article fits well on a website or blog, but for a reviewed publication I suggest following the format of a typical academic paper. Do not forget the list of references at the end as well.

Having said that, let's move on to the contents. There are a couple of quibbles with respect to the methods and their descriptions.

On page 4, where you discuss device bias and refer to other devices, could you be more explicit? The table on page 5 has a column named "Desktop advantage"; without explaining what that means, I doubt many readers will follow. I think I do, but the paper still lacks clarity.

whether certain items were answered correctly by one gender independent of general ability

I understand what you mean, but it is a bit clearer to write, for instance: "whether certain items exhibit a gender difference in the probability of correct response when controlling for general ability".

Differential item functioning testing was used to ...

Add "(DIF)", because you use the abbreviation quite a few times afterwards.

The item probability functions by gender where ability is calculated using the LOO method ...

If this is the leave-one-out method of IRT-DIF from the mirt package, this should be made clear. And if you are using the mirt package, you should probably mention it. Also, in the same paragraph you say there are 180 items among the distractors and 180 among the correct answers. Did you mean 160 or 180? Earlier in the text you said there are 160+160 items.
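
For readers unfamiliar with that workflow, here is a minimal sketch of gender DIF testing in mirt. It is an assumption about the authors' approach, not their actual code: the item matrix `items`, the grouping vector `sex`, and the 2PL item type are placeholders.

  library(mirt)
  # Two-group 2PL with item parameters constrained equal across sexes but
  # group means and variances free (anchor-all baseline).
  mg <- multipleGroup(items, model = 1, group = sex, itemtype = "2PL",
                      invariance = c("free_means", "free_var",
                                     "slopes", "intercepts"))
  # Test each item by freeing its slope and intercept (the "drop" scheme),
  # so the remaining items anchor ability while the tested item is left out.
  dif_res <- DIF(mg, which.par = c("a1", "d"), scheme = "drop")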

The bias-adjusted sex difference in general knowledge (d = 0.4683) was hardly different from the difference before adjustment after adjustment (d = 0.4689), after adjusting for reliability as well.

You did not explain what you mean by the bias-adjusted difference. Is this the gap calculated from the items left after removing DIFs? If so, how were the DIF items selected? Using which criteria? Is it based on a significance test (which I do not recommend) or on effect sizes (in that case, which one)? These questions also apply to the entire "sex bias" section, where you report the number of items biased against either males or females.

This leads me to believe that, while the amount of sex bias in the test is fairly large, this is not leading to a significant bias in favor of either gender.

This is because the DIFs cancel out. Make this clear in the text, because it is hard to understand for the average reader. Instead of "significant", which typically refers to significance testing, use "substantial" or "practical significance", as these refer to effect sizes.

Internet abbreviations also has a very low g-loading as a question

It should be made clear which cutoff you are using. If there are good recommendations on cutoffs, cite those references.

A confirmatory factor analysis was conducted based on these results, which was somewhat successful, yielding a CFI of .93 and a RMSEA of 0.055.

Please report the SRMR. This measure is no less important than the other two. Also, refer to research on recommended model fit indices, Hu & Bentler (1999) being one example; a short lavaan sketch follows the reference below.

Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6(1), 1-55.
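
In lavaan, the SRMR is computed alongside the other indices; a minimal sketch, assuming the fitted model object is stored in `fit`:

  fitMeasures(fit, c("cfi", "tli", "rmsea", "srmr"))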

Medical knowledge showed a small difference in favor of men ...

Probably a mistake. I don't see medical knowledge.

Table X. Gender differences in knowledge by facet of knowledge. Reference group is men.

If the table provides effect size values from the CFA model, it should be made clear in what metric they are expressed. In the main text you report the gender difference in general knowledge as a Cohen's d, but I don't see that in the syntax you provide. Also, if I read correctly, this CFA model is represented by latax2 in your code, but for this model you neither included the "group" argument in cfa() nor regressed the latent factors on the sex variable. It is like analyzing a single group, and there is no way to obtain group differences from this output. So I'm confused: where do these effect sizes come from? This is why you should, once again, explain your methods in detail in a dedicated, separate section.

Furthermore, in this same table, add a note that negative values denote a disadvantage for females and positive values an advantage for females. However, if negative values really do show a female disadvantage, I see a problem with the following statements:

Items with a pro-female bias typically were associated with cultural knowledge ...

Women tended to score higher in facets related to cultural knowledge and aesthetic knowledge ...

Men tend to score higher in fields related to science and geography, while women tend to know more about fashion and cultural works. ...

Because the table shows females score higher on aesthetic and literary knowledge, but not cultural knowledge. 

The next point is probably the most obscure. If the gender gaps are calculated based on the CFA model shown in "Figure X. Confirmatory factor analysis of the Multifactor General Knowledge test," then there are more complications due to the peculiarities of this model.

In this figure, I notice two cross-loadings. Why? Typically, the decision to include (or not include) cross-loadings is based on an "exploratory" factor analysis. If that was the case, this step should be described in the main text.

The figure also shows that the model allows several latent residual correlations (IK with TK and AK with LK). The decision to correlate residuals is critical in CFA modeling and should always be justified, because it affects model fit and model parameters (including means). There are several reasons why researchers might want to correlate residuals. One is specification searches (e.g., using modification indices), which tend to capitalize on chance (MacCallum et al., 1992) and are well known to be a bad approach. Another is theory-driven and concerns redundancy between variables (more generally called linear dependency). In the latter case, if there are a priori reasons to believe that IK with TK and AK with LK suffer from this redundancy issue, or that they likely form a minor (latent) factor, it makes sense to correlate their residuals. If that is the approach, however, it should be explained in the main text why you believe their residuals are correlated and, if possible, empirical evidence from prior research should be provided. Although I do not have access to Cole et al. (2007), I often read that they give multiple reasons as to why certain cases justify correlated residuals.

MacCallum, R. C., Roznowski, M., & Necowitz, L. B. (1992). Model modifications in covariance structure analysis: The problem of capitalization on chance. Psychological Bulletin, 111(3), 490.
Cole, D. A., Ciesla, J. A., & Steiger, J. H. (2007). The insidious effects of failing to include design-driven correlated residuals in latent-variable covariance structure analysis. Psychological Methods, 12, 381-398.

Also, with respect to "Figure X. Confirmatory factor analysis...", the figure makes it very clear that GK is the second-order general factor, but this isn't explained anywhere in the text. A reader who does not pay enough attention to Figure X will get the false impression that GK is another first-order latent factor like the others.

Now, as I move to the next table, "Table X. Latent differences in knowledge by sex and facet of knowledge. Reference group is men.": I think the bad fit of the AK, LK, CK, and GK models deserves to be mentioned. A CFI of 0.90 is borderline, but an RMSEA close to or above 0.10 is quite concerning. The model fit of GK is so bad that the gap estimate should be called into question. Also, if the fit indices are from the bifactor models, this should be stated clearly either in the text or in the table's title (or in a note under the table).

However, looking at the syntax you provided in the supplementary file, I understand that the latax3-latax8 models are the ones displayed in the table I just mentioned. But these are simple structures, not bifactor structures. I now understand why some models have such poor fit. For instance:

latax8 <- "
  #latents:
  GK =~ Q1 + Q2 + Q3 + Q4 + Q6 + Q7 + Q8 + Q9 + Q10 + Q11 + Q12 + Q13 + Q14 + Q15 + Q16 + Q17 + Q18 + Q19 + Q20 + Q21 + Q22 + Q23 + Q24 + Q25 + Q26 + Q27 + Q28 + Q29 + Q30 + Q31 + Q32
  
  GK ~ sex
"

This is not a bifactor model but a one-factor model. Given that these questions tap different abilities, it is no wonder that fitting a single factor to all observed variables leads to poor fit. If you wanted to build a bifactor model, the correct syntax looks more like this:

  GK =~ Q1 + Q2 + Q3 + Q4 + Q6 + Q7 + Q8 + Q9 + Q10 + Q11 + Q12 + Q13 + Q14 + Q15 + Q16 + Q17 + Q18 + Q19 + Q20 + Q21 + Q22 + Q23 + Q24 + Q25 + Q26 + Q27 + Q28 + Q29 + Q30 + Q31 + Q32
  COK =~ Q13 + Q14 + Q15 + Q16 + Q22 + Q30 + Q21 + Q26
  IK =~ Q9 + Q10 + Q11 + Q12 + Q23
  TK (?)*
  CK =~ Q3 + Q5 + Q6 + Q7 + Q8 + Q24 + Q31 + Q20
  AK =~ Q4 + Q17 + Q19 + Q27 + Q32
  LK =~ Q1 + Q2 + Q25
  GK ~ sex

I marked TK with (?)* because the code is missing from your supplementary materials. Regardless, in cfa() you must remember to use the argument orthogonal = TRUE for bifactor modeling, if that was the model you attempted. You are using sam() rather than cfa(), and I have never worked with sam(), so I don't know how to implement it there. In any case, the advantage of this model is that you get the effect of the specific factors IK, TK, etc., net of g, whereas in your original latax3-latax8 models the manifest variables load onto a single factor, without accounting for the variance that could have been accounted for by g. Perhaps you only intended to investigate each dimension without accounting for g, but this is not clear to me (in that case, explain your intention more clearly in the main text). Finally, you should warn readers who are not familiar with CFA or SEM that the model fit indices are not directly comparable, because the models are not fit to the same variables. Since they are all displayed in the same table as separate columns, it may give the false impression that the fit values are directly comparable.
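
For concreteness, a hypothetical cfa() call for the bifactor specification sketched above, assuming the model string is stored as `bifactor_model` and the data frame as `dat`:

  # orthogonal = TRUE keeps the general and specific factors uncorrelated,
  # as a bifactor model requires; DWLS matches the estimator used elsewhere.
  fit_bi <- cfa(bifactor_model, data = dat, orthogonal = TRUE, estimator = "DWLS")
  summary(fit_bi, fit.measures = TRUE, standardized = TRUE)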

Taken as a whole, you should explain why you want to display in the first table the effect sizes from a second-order g CFA model (if that is what you did) and in the next table the effect sizes from separate one-factor models. Typically, if I did this, I would also need to explain the differences between the two methods, how they potentially affect the estimates, and the advantage (if there is indeed one) of one method over the other.

national IQs taken from Becker’s latest version of the NIQ dataset (V1.3.3)

I suggest providing a link (supplementary file?) or reference for the data.

Table X. Differences by specific ability by region. 

You did not explain which model you used to calculate the national differences in each cognitive factor.

calculate the raw percentage of people who the individual scored higher than, use a linear regression model which predicts the converted IQ score based on the summed score, or

"the individual scored higher than" : I believe something is missing here.

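For clarity, a minimal sketch of the two conversions described in the quoted sentence; `sum_score` and the calibration variable `iq_ref` are placeholder names for illustration, not the authors' variables.

  pct <- ecdf(sum_score)(sum_score)        # percentile rank of each summed score
  pct <- pmin(pmax(pct, 0.001), 0.999)     # avoid infinite normal quantiles
  iq_percentile <- qnorm(pct) * 15 + 100   # map percentiles to an IQ metric (M = 100, SD = 15)
  # Regression-based alternative: predict an already-converted IQ score
  # from the summed score.
  iq_regression <- predict(lm(iq_ref ~ sum_score))
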
The first method works well when you have a very large sample size and there are departures from normality within the test.

I would not recommend using "you" in an academic paper.

Reviewer

On top of what I just said, I realized that in your models latax and latax2 (the latter being the one displayed in the CFA figure) you used the DWLS estimator, which is typically the default estimator for categorical data in lavaan, as seen here, where the syntax for cfa() looks clearly different. If the variables Q1-Q32 really aren't treated as continuous variables (in which case, I don't see why), it should be made clear in the text.
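
As an illustration of the distinction, a sketch of how DWLS is usually triggered in lavaan; the variable and object names are assumed:

  # Declaring the indicators as ordered makes lavaan treat them as categorical,
  # which switches the default estimator to DWLS (with robust corrections).
  fit_cat <- cfa(latax2, data = dat, ordered = paste0("Q", 1:32))
  # For variables treated as continuous, estimator = "DWLS" would have to be
  # requested explicitly.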

Out of curiosity, what is the argument sam.method = 'global' that you used for the latax3-latax8 models? I have never used the sam() function, and in fact I wasn't able to find anything about it.

Reviewer

I am approaching this as a curious reader without extensive training in the methods involved. Thus, I will focus narrowly on those aspects of this paper about which I can comment. (Also, Reviewer 2 addressed many of the methodological issues.) 

The major problem is that this reads more like a set of notes than a finished article and would likely be perplexing to most readers. It needs more order, more description, and more polishing. Obviously, the prose does not need to be scintillating. But it does need to be clear. And it needs legitimate intro and discussion sections. The results need to be contextualized so a reader understands the purpose of the paper.

Throughout, the word “data” is used as a singular noun, as in “Data was taken,” but it should be plural, as in “Data were taken.”

Eloquent prose is not necessary, as I noted above, but clarity is. This sentence, for example, is confusing:

“This test consists of 32 questions where individuals are asked to identity whether 10 items correspond to a criteria asked in a question, with the constraint that there are only 5 correct answers.”

Perhaps:

“The test consists of 32 general knowledge questions in which a participant is asked which of 10 items satisfies a particular criterion (e.g., “Which of these players won MVP” [replace my example with a real example]). Five of the 10 items are correct for each question.”



Other sections are also confusing, but it is difficult to assess how confusing because the manuscript does not read like a coherent article. The author should spend 10-20 hours structuring it properly, striving to explain to readers the goals of the article, its shortcomings, et cetera. This probably doesn't require more than 5-8 paragraphs. Most articles are too long; but a few brief paragraphs here would be helpful.

After such revisions, the paper would be easier to assess for standard academic qualities.

Bot

Author has updated the submission to version #2

Author
Emil Kirkegaard was added as an author - the page doesn't list me as one, but hopefully this will be fixed in due time. Structure (e.g., an introduction and a discussion) was added at the request of multiple reviewers. Most of this comment addresses reviewer 2; if something is not mentioned, it should be assumed that it was fixed according to that reviewer's specifications.

It should be made clear which cutoff you are using. If there are also good recommendations on cutoffs, cite those references.

It had a g-loading of 0.14. Not atrocious, but still worth removing.

If the table provides effect size values from the CFA model, it should be made clear in what metric they have been expressed. In the main text you mention the gender difference in general knowledge in Cohen's d but in the syntax you provide, I don't see that. Also, if I read correctly this CFA model is represented by latax2 in your code, but you didn't for this model include either the "group" argument in cfa() or either the latent factors being regressed on sex variable. It's like analyzing a single group, and there is no way to obtain group differences from this output. So I'm confused. Where do these effect sizes come from? This is why you should once again explain in detail your methods in a dedicated, separate section. 

Both observed and latent (SEM) differences are now reported in the paper. Structural-after-measurement models (sam()) were abandoned, as they seem to return the same results as SEMs, according to other papers I am in the process of writing.

Although I do not have access to Cole et al. (2007) I often read they give multiple reasons as to why certain cases justify correlated residuals.

https://sci-hub.ru/10.1037/1082-989X.12.4.381

The figure also shows the model allows for several latent residual correlations (IK with TK and AK with LK). The decision of correlating residuals is critical in CFA modeling, and should always be justified, because it affects model fit and model parameters (including means).

Incidentally, the wrong variables (technical knowledge and international knowledge) had been assigned correlated residuals. Now, technical knowledge and computational knowledge have been assigned correlated residuals. In both cases, the correlated residuals were justified by a moderate correlation (0.2 and 0.21) independent of the general factor of knowledge.
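
In lavaan syntax, such a latent residual correlation is written with the ~~ operator; a minimal sketch under assumed factor names, not the paper's actual model:

  hof_model <- "
    GK =~ COK + IK + TK + CK + AK + LK
    TK ~~ COK   # disturbances of technical and computational knowledge covary
  "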

Probably a mistake. I don't see medical knowledge.

Cultural knowledge was initially called medical knowledge because a lot of medicine-related topics (e.g., cancer) loaded onto it. However, knowledge of things like serial killers also loaded onto it, so a broader name was given.

Now, as I move to the next table, "Table X. Latent differences in knowledge by sex and facet of knowledge. Reference group is men.": I think the bad fit of the AK, LK, CK, and GK models deserves to be mentioned. A CFI of 0.90 is borderline, but an RMSEA close to or above 0.10 is quite concerning. The model fit of GK is so bad that the gap estimate should be called into question. Also, if the fit indices are from the bifactor models, this should be stated clearly either in the text or in the table's title (or in a note under the table).

However, looking at the syntax you provided in the supplementary file, I understand that the latax3-latax8 models are the ones displayed in the table I just mentioned. But these are simple structures, not bifactor structures. I now understand why some models have such poor fit. For instance:

latax8 <- "
  #latents:
  GK =~ Q1 + Q2 + Q3 + Q4 + Q6 + Q7 + Q8 + Q9 + Q10 + Q11 + Q12 + Q13 + Q14 + Q15 + Q16 + Q17 + Q18 + Q19 + Q20 + Q21 + Q22 + Q23 + Q24 + Q25 + Q26 + Q27 + Q28 + Q29 + Q30 + Q31 + Q32
  
  GK ~ sex
"

This is not a bifactor model but a one-factor model. Given that these questions tap different abilities, it is no wonder that fitting a single factor to all observed variables leads to poor fit. If you wanted to build a bifactor model, the correct syntax looks more like this:

  GK =~ Q1 + Q2 + Q3 + Q4 + Q6 + Q7 + Q8 + Q9 + Q10 + Q11 + Q12 + Q13 + Q14 + Q15 + Q16 + Q17 + Q18 + Q19 + Q20 + Q21 + Q22 + Q23 + Q24 + Q25 + Q26 + Q27 + Q28 + Q29 + Q30 + Q31 + Q32
  COK =~ Q13 + Q14 + Q15 + Q16 + Q22 + Q30 + Q21 + Q26
  IK =~ Q9 + Q10 + Q11 + Q12 + Q23
  TK (?)*
  CK =~ Q3 + Q5 + Q6 + Q7 + Q8 + Q24 + Q31 + Q20
  AK =~ Q4 + Q17 + Q19 + Q27 + Q32
  LK =~ Q1 + Q2 + Q25
  GK ~ sex

I marked TK with (?)* because the code is missing from your supplementary materials. Regardless, in cfa() you must remember to use the argument orthogonal = TRUE for bifactor modeling, if that was the model you attempted. You are using sam() rather than cfa(), and I have never worked with sam(), so I don't know how to implement it there. In any case, the advantage of this model is that you get the effect of the specific factors IK, TK, etc., net of g, whereas in your original latax3-latax8 models the manifest variables load onto a single factor, without accounting for the variance that could have been accounted for by g. Perhaps you only intended to investigate each dimension without accounting for g, but this is not clear to me (in that case, explain your intention more clearly in the main text). Finally, you should warn readers who are not familiar with CFA or SEM that the model fit indices are not directly comparable, because the models are not fit to the same variables. Since they are all displayed in the same table as separate columns, it may give the false impression that the fit values are directly comparable.

Taken as a whole, you should explain why you want to display in the first table the effect sizes from a second-order g CFA model (if that is what you did) and in the next table the effect sizes from separate one-factor models. Typically, if I did this, I would also need to explain the differences between the two methods, how they potentially affect the estimates, and the advantage (if there is indeed one) of one method over the other.

The differences were not initially calculated with a bifactor model - this is a mistake in the text. Instead, I have decided to stick with SEM models which model the sex difference in a simple way (creating the latent variable from the observed ones, then regressing the latent variable on the sex variable). Given the poor fit of the general model, I also attempted to use the method of correlated vectors to test for a latent difference in general knowledge, but the test was inconclusive.
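
A sketch of that setup for one facet, borrowing the aesthetic-knowledge items from the reviewer's earlier example (the item assignment and data frame name are assumptions):

  ak_model <- "
    AK =~ Q4 + Q17 + Q19 + Q27 + Q32
    AK ~ sex   # the latent facet is regressed on the sex dummy
  "
  fit_ak <- cfa(ak_model, data = dat)
  standardizedSolution(fit_ak)   # the AK ~ sex row gives the latent sex effect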

On top of what I just said, I realized that in your models latax and latax2 (the latter being the one displayed in the CFA figure) you used the DWLS estimator, which is typically the default estimator for categorical data in lavaan, as seen here, where the syntax for cfa() looks clearly different. If the variables Q1-Q32 really aren't treated as continuous variables (in which case, I don't see why), it should be made clear in the text.

Will be added in the next edition. Unfortunately I cannot edit now.

Reviewer

Thank you for updating the paper. I see there are a few improvements and some clarifications. There are still several points left unanswered.

A first complaint before the content. Although I didn't mention it earlier, I really wish the figures and tables were numbered, because when I comment on several figures or tables that are all labeled X, it is very tedious.

Now about the content itself.

In the method section, you still did not explain what you mean by bias-adjusted. I believe you estimated the means after removing the DIF items. I think it's much clearer if you say that you "removed" these offending items. I mentioned in my earlier comment that you need to specify clearly in the text which criteria you use for removing DIF. What do you consider as DIF? As I suggested before, an effect size (which one then? Chalmers's DRF or Meade's IDS?) is quite convenient for deciding which items to keep.

Similarly, you still did not define the "LOO method" (leave-one-out?). In fact, even leave-one-out should be explained properly. Non-R users aren't familiar with this.

You should also, at least briefly, describe the 2PL, 3PL, and 4PL models, their differences, and why you prefer the 4PL over a simpler model. Giving a reference is a plus.

calculating the raw percentage of people who the individual scored higher than a threshold

This sentence is oddly written.

The predicted average score for every cohort was calculated using the restricted cubic splines.

It's wise to specify clearly that the restricted cubic splines are used to adjust for age.
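
One way such an age adjustment could be implemented in R; this is an assumption for illustration, not the authors' code, and `score`, `age`, and `dat` are placeholder names:

  library(splines)
  # Natural (restricted) cubic spline of general knowledge on age with 4 df;
  # the fitted values give the predicted average score at each age.
  spline_fit <- lm(score ~ ns(age, df = 4), data = dat)
  dat$pred_score <- predict(spline_fit)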

Figure X. Relationship between General Knowledge by age, modeled with a restricted cubic spline (ages of above 100 excluded in the analysis).

I'm pretty sure this wasn't there in the earlier version, but "ages of above 100" is surprising. Maybe it needs to be explained more clearly?

The effect of age and the time taken to do the test on the result of the test were calculated to observe whether there was a notable age or effort effect.

Typically, after a sentence like this, we expect results to be presented and discussed. But after this sentence, we jump into an entirely different section: "Factor Structure".

Now we move to the methodological issues.

You said that the CFA "model fit was mediocre, but not terrible". I did indeed cite Hu & Bentler, but only as one example. In fact, there are no one-size-fits-all cutoff values. Cutoffs are always provided under the conditions of a given simulation, and the issue with the great majority of simulation studies is that they use very simple models. Sivo et al. (2006), "The Search for 'Optimal' Cutoff Properties: Fit Index Criteria in Structural Equation Modeling", provide an excellent illustration of the problem and show how the cutoffs change depending on the models. Given the complexity of your model, the fit seems OK.

Since you ran a factor analysis specifying 7 factors, I suggest reporting the 6-factor EFA, since the 7th factor is meaningless. I do not expect the loadings to be exactly the same between the 6- and 7-factor EFAs, and I wouldn't specify a 6-factor CFA model based on loadings suggested by a 7-factor EFA.

Speaking of which, I still do not see an explanation of the model specification of your higher-order factor (HOF) model. As I mentioned earlier, you need to justify the cross-loadings and specify which cutoffs you are using. Typically, many people pick 0.30, but considering the great majority of the simulation studies I found, a cutoff of 0.20 or 0.25 is more appropriate. Here are a few studies:

Xiao, Y., Liu, H., & Hau, K. T. (2019). A comparison of CFA, ESEM, and BSEM in test structure analysis. Structural Equation Modeling: A Multidisciplinary Journal, 26(5), 665-677. 
Ximénez, C., Revuelta, J., & Castañeda, R. (2022). What are the consequences of ignoring cross-loadings in bifactor models? A simulation study assessing parameter recovery and sensitivity of goodness-of-fit indices. Frontiers in Psychology, 13, 923877. 
Zhang, B., Luo, J., Sun, T., Cao, M., & Drasgow, F. (2023). Small but nontrivial: A comparison of six strategies to handle cross-loadings in bifactor predictive models. Multivariate Behavioral Research, 58(1), 115-132. 

If I consider a cutoff of .25 (or close to .25), then you are missing tons of cross-loadings (at least based on my reading of your 7-factor EFA). Perhaps the 6-factor EFA has fewer cross-loadings. Speaking of "Table X. Oblimin rotated factor analysis of the 32 questions.", what are the columns "loadings" and "cumulative"?

In your HOF model, you now say that you allowed correlated (latent) residuals because they have a non-trivial correlation. This is not what I meant by justification: what I meant is theoretical justification. A correlation of .2 or .3 or whatever is not a justification. For instance, Cole et al. (2007) mention method effects as a justification. This of course does not apply here, so you may not want to cite Cole et al. (I cited them as an example of justifiable correlated residuals), but a more general reason for specifying correlated residuals is the presence of another, unmeasured (small) factor. So the question is: what is the common source between computational knowledge and technical knowledge, and the common source between aesthetic knowledge and literary knowledge? If you can't identify the sources, I suggest removing these correlated residuals. It's very easy to overfit your model this way, thereby "improving" the model in terms of fit but making it less defensible scientifically and also less replicable (MacCallum et al., 1992).

Table X. Correlation matrix of the 6 knowledge subfactors of general knowledge.

This gave me tons of trouble due to the numbers I saw in the diagonal, because I did not realize at first that this was actually a matrix of residual correlations, not simple correlations. The title needs to be fixed. This also shows why Table/Figure "X" is very hard to read and follow. They must be numbered, even in drafts. However, as I noted earlier, if you opt not to use correlated residuals, then you don't need this table anymore.

Hallquist (2017) provides here some instances of common mistakes he came across when reviewing SEM papers. There are few points mentioned there which are relevant to your current problem: https://psu-psychology.github.io/psy-597-SEM/12_best_practices/best_practices.html

I suggest removing the table "Fit as a function of modeling choices." because it is uninformative. Since model 1 is more saturated than model 2, of course you should expect better fit from model 1. But parsimony is a desirable feature of CFA modeling. And since you did not mention this table at all in the main text, why not remove it?

Since I wasn't sure whether you used a bifactor model or not in the earlier version, and you have now stated clearly that you didn't, I can specify the issue with more clarity. If you can fit a HOF model, regress the factors on the sex variable, and get sex estimates for the factor means, I do not see the added value of the second analysis, which uses a simple one-factor CFA for each factor, as illustrated in your table "Table X. Latent differences in knowledge by sex and facet of knowledge.".

Why is this a problem? The factors are obviously correlated and there are cross-loadings, which makes using separate simple-structure CFAs a quite dubious approach. Either you should stick with the earlier table reporting gaps estimated by the HOF model (which for some reason was removed in this version), or you should also provide estimates from a bifactor model.

The added value here is that a bifactor structure provides more accurate estimates of the specific factors, because the general factor is completely separated out in a bifactor model, whereas in a higher-order model the specific factors are represented as residuals. The literature on this matter abounds.

Bornovalova, M. A., Choate, A. M., Fatimah, H., Petersen, K. J., & Wiernik, B. M. (2020). Appropriate use of bifactor analysis in psychopathology research: Appreciating benefits and limitations. Biological psychiatry, 88(1), 18-27.
Beaujean, A. A., Parkin, J., & Parker, S. (2014). Comparing Cattell–Horn–Carroll factor models: Differences between bifactor and higher order factor models in predicting language achievement. Psychological Assessment, 26(3), 789.
Gignac, G. E. (2008). Higher-order models versus direct hierarchical models: g as superordinate or breadth factor?. Psychology Science, 50(1), 21.

In particular, Beaujean (2014) and Gignac (2008) explain that a higher order factor (HOF) model posits that the specific factors explain all the covariance among the observed test scores while the bifactor model posits that the specific factors account for the test scores’ residual covariance that remains after extraction of the covariance that is due to g.

This means that if you use the HOF model, the specific factors are not independent of g. Depending on how you wish to interpret these factors, either the bifactor or the HOF model is preferred. But if you are curious about whether the gaps in the specific factors are similar across the bifactor and HOF models, then I advise using the bifactor as well. Given your focus on specific factors, a bifactor model may be even more appropriate. If you don't use the bifactor, it may be wise to explain in the discussion why you don't.

In any case, what you should be attempting is something similar to Reynolds et al. (2008). The method is very straightforward: add a path from the sex variable to all factors. I suggest removing a regression path if it shows a sex difference close to zero or a confidence interval including zero. This achieves parsimony and likely provides more accurate parameter estimates (since that is apparently one of your concerns); see the sketch after the reference below. Although Reynolds et al. tested for measurement bias beforehand, you don't have to, but in that case add a discussion of this potential issue. In other words, I suggest removing these one-factor CFA models. Something close to your latax2 with the sex dummy variable, but without the correlated residuals. (EDIT: I'm not so sure how national differences in latent factors are computed; maybe a short description would be fine.)

Reynolds, M. R., Keith, T. Z., Ridley, K. P., & Patel, P. G. (2008). Sex differences in latent general and broad cognitive abilities for children and youth: Evidence from higher-order MG-MACS and MIMIC models. Intelligence, 36(3), 236-260.
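
A sketch of that Reynolds-style MIMIC higher-order model in lavaan, reusing the item assignments from the earlier bifactor sketch (those assignments, the omission of TK, and the data frame name are assumptions):

  hof_mimic <- "
    COK =~ Q13 + Q14 + Q15 + Q16 + Q22 + Q30 + Q21 + Q26
    IK  =~ Q9 + Q10 + Q11 + Q12 + Q23
    CK  =~ Q3 + Q5 + Q6 + Q7 + Q8 + Q24 + Q31 + Q20
    AK  =~ Q4 + Q17 + Q19 + Q27 + Q32
    LK  =~ Q1 + Q2 + Q25
    GK  =~ COK + IK + CK + AK + LK      # second-order general factor
    GK ~ sex                            # sex effect on g
    COK + IK + CK + AK + LK ~ sex       # sex effects on the broad factors, over and above g
  "
  fit_mimic <- cfa(hof_mimic, data = dat)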

Given that it’s implausible that the difference is that large (or even in the right direction),

You should add references to back this up, because "implausible" is a rather bold and strong claim. Given your discussion, it does not necessarily surprise me. The test is highly culturally specific, after all. The difference likely reflects content- and test-specific differences rather than g differences (and g here is highly contaminated by specific knowledge, so it's probably not even a good g to begin with).

Table AX. General Knowledge by country (no bias adjustment)

This is merely a suggestion, but it might be useful to display the confidence intervals of the general knowledge scores, since the sample is small for some countries (also, specify that this is the general factor of knowledge - I don't think that is obvious at first glance).

The latent difference was unbelievably large (d = -.7), probably due to the poor fit of the model (CFI < .7).

I am not aware of any research paper or textbooks which claimed that the effect size has anything to do with model fit. Furthermore, I have seen many instances of incredibly large latent mean differences going along with a very good fit. What a poor fit tells you is merely the following: the estimates should not be trusted because the model is largely misspecified (i.e., wrong). A more appropriate model may (or may not) yield different point estimates.

Regarding whether the gap is "too" large or not: remember this is more an aptitude test than an IQ test, so the small sex gap on IQ tests does not generalize to aptitude sex gaps. The g depends on the content of the tests. In an IQ test this isn't a problem, since the test includes a vast, representative array of abilities. Here, on the other hand, the latent factors reflect highly specific knowledge (technical, literary, aesthetic), which is perhaps why DIFs were detected. Whether this g gap is implausibly large requires looking at past studies of sex differences on very similar test batteries. One might argue that achievement-g and IQ-g are highly correlated, but this doesn't imply the means are identical.

A limitation worth mentioning in the discussion is that the use of SEM does not allow testing for measurement bias at the subtest level. Another approach would be MGCFA, which not only estimates these gaps but also tests for measurement invariance at the subtest level. Testing for measurement bias at the item level is one step, but it is simply not enough: the test could still be biased at the subtest level if loadings or intercepts differ. Another reason to expect possible bias at the subtest level is that the item-level test is not optimal. The results show that the number of biased items is very large, and internal DIF methods are reliable only if the ratio of biased to unbiased items is very small. If DIF items aren't a minority, the method may not be very accurate (DeMars & Lau, 2011). I am not necessarily asking for an MGCFA analysis, as it is a very complicated technique, but the related issues should be exposed. The data are accessible, so any researchers interested in the matter can apply MGCFA in the future.

DeMars, C. E., & Lau, A. (2011). Differential item functioning detection with latent classes: how accurately can we detect who is responding differentially?. Educational and Psychological Measurement, 71(4), 597-616.

In the discussion, you say the reliability is high and refer to the symbol ω, which is odd because this symbol generally denotes omega when referring to reliability, while in the main text you say you used the Spearman-Brown reliability. Is this what you used in "Table X. Comparison of the seven methods used to calculate general knowledge."? In any case, the Spearman-Brown reliability is usually denoted ρ.
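
For reference, the two coefficients can be computed and labeled separately in R with the psych package; a sketch, with `items` as an assumed item-score matrix:

  library(psych)
  omega(items)       # McDonald's omega (ω)
  splitHalf(items)   # split-half reliabilities with Spearman-Brown correction (ρ)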

The last sentence of the discussion section does not seem to be an appropriate way of ending the section. It feels abrupt, unfinished. What I suggest is a brief discussion on the implication of these findings and/or suggestion for future research.

In the reference section, you have Brown et al. 2023 but in the main text, you have Brown et al. 2023 and also Brown et al. 2022.

Still one more question about methodology. Why do you display the results from the CFA model first (p. 11) and then the results of the DIF analyses? Does that mean you used CFA modeling before testing for DIF? Because DIF testing should be conducted first.

Now, I would like to comment on your answers:

Cultural knowledge was initially called medical knowledge because a lot of medicine-related topics (e.g., cancer) loaded onto it. However, knowledge of things like serial killers also loaded onto it, so a broader name was given.

This is fine, but the last thing we expect when reading "cultural knowledge" is having a strong medical knowledge flavour. It's clear from your description that "cultural knowledge" is probably misleading. How many medical items load into this factor? If the percentage is high, I suggest you stick with medical knowledge. EDIT2: For general clarity as well I recommend you describe a little bit each latent factor, by telling us which kind of items are loading onto each these factors. This helps knowing how well the factors are properly defined. Like, if a computational factor has a non-trivial percentage of verbal items, then it also measures English knowledge as well, and not just math ability. 

Author
As I stated in the body, the purpose of the factor analysis/parallel analysis is to compute the sex and national differences in subfactors of general knowledge. It is doubtful that minor fluctuations in modeling are going to have effects on the rank order or magnitude of these differences. For the sake of simplicity, the cross-loadings have been removed.
 
You should add references to back this up, because "implausible" is a rather bold and strong claim. Given your discussion, it does not necessarily surprise me. The test is highly culturally specific, after all. The difference likely reflects content- and test-specific differences rather than g differences (and g here is highly contaminated by specific knowledge, so it's probably not even a good g to begin with).
I wouldn't consider "implausible" a strong or bold claim - just that it's intuitively unlikely. Regardless, I added a section to the discussion which goes over why this is not likely to be a true difference.
 
This is fine, but the last thing we expect when reading "cultural knowledge" is having a strong medical knowledge flavour. It's clear from your description that "cultural knowledge" is probably misleading. How many medical items load into this factor? If the percentage is high, I suggest you stick with medical knowledge. EDIT2: For general clarity as well I recommend you describe a little bit each latent factor, by telling us which kind of items are loading onto each these factors. This helps knowing how well the factors are properly defined. Like, if a computational factor has a non-trivial percentage of verbal items, then it also measures English knowledge as well, and not just math ability. 
Three out of the eight questions were unambiguously related to health (STDs, cancers, painkillers). Some of them were drug-related (cigarettes, weed). Others include holidays, famous criminals, and card games.
 
Still one more question about methodology. Why do you display the results from the CFA model first (p. 11) and then the results of the DIF analyses? Does that mean you used CFA modeling before testing for DIF? Because DIF testing should be conducted first.
Is there a purpose to conducting DIF before CFA, if removing biased items was not the reason for using the method?
 
 EDIT2: For general clarity as well I recommend you describe a little bit each latent factor, by telling us which kind of items are loading onto each these factors. This helps knowing how well the factors are properly defined. Like, if a computational factor has a non-trivial percentage of verbal items, then it also measures English knowledge as well, and not just math ability. 
The main body mentions that the name of the question each factor loads on is available in the Appendix. Besides one odd case where a question asking for 'synonyms of fancy' loads onto technical knowledge, the association between questions and factors is fairly intuitive, and the factor names are self-explanatory.
 

A first complaint before the content. Although I didn't mention it earlier, I really wish the figures and tables were numbered, because when I comment on several figures or tables that are all labeled X, it is very tedious.

Now about the content itself.

In the method section, you still did not explain what you mean by bias-adjusted. I believe you estimated the means after removing the DIF items. I think it's much clearer if you say that you "removed" these offending items. I mentioned in my earlier comment that you need to specify clearly in the text which criteria you use for removing DIF. What do you consider as DIF? As I suggested before, an effect size (which one then? Chalmers's DRF or Meade's IDS?) is quite convenient for deciding which items to keep.

Similarly, you still did not define the "LOO method" (leave-one-out?). In fact, even leave-one-out should be explained properly. Non-R users aren't familiar with this.

You should also, at least briefly, describe the 2PL, 3PL, and 4PL models, their differences, and why you prefer the 4PL over a simpler model. Giving a reference is a plus.

Fixed.

Since I wasn't sure whether you used a bifactor model or not in the earlier version, and you have now stated clearly that you didn't, I can specify the issue with more clarity. If you can fit a HOF model, regress the factors on the sex variable, and get sex estimates for the factor means, I do not see the added value of the second analysis, which uses a simple one-factor CFA for each factor, as illustrated in your table "Table X. Latent differences in knowledge by sex and facet of knowledge.".

Why is this a problem? The factors are obviously correlated and there are cross-loadings, which makes using separate simple-structure CFAs a quite dubious approach. Either you should stick with the earlier table reporting gaps estimated by the HOF model (which for some reason was removed in this version), or you should also provide estimates from a bifactor model.

The added value here is that a bifactor structure provides more accurate estimates of the specific factors, because the general factor is completely separated out in a bifactor model, whereas in a higher-order model the specific factors are represented as residuals. The literature on this matter abounds.

I agree, though at this point, I think the simple structure CFAs are worth keeping in the paper for the sake of noting the difference in results between the observed general knowledge means and the latent one. 

In the discussion, you say the reliability is high and refer to the symbol ω, which is odd because this symbol generally denotes omega when referring to reliability, while in the main text you say you used the Spearman-Brown reliability. Is this what you used in "Table X. Comparison of the seven methods used to calculate general knowledge."? In any case, the Spearman-Brown reliability is usually denoted ρ.

The last sentence of the discussion section does not seem to be an appropriate way of ending the section. It feels abrupt, unfinished. What I suggest is a brief discussion on the implication of these findings and/or suggestion for future research.

In the reference section, you have Brown et al. 2023 but in the main text, you have Brown et al. 2023 and also Brown et al. 2022.

Fixed.

I am not aware of any research paper or textbooks which claimed that the effect size has anything to do with model fit. Furthermore, I have seen many instances of incredibly large latent mean differences going along with a very good fit. What a poor fit tells you is merely the following: the estimates should not be trusted because the model is largely misspecified (i.e., wrong). A more appropriate model may (or may not) yield different point estimates.

Regarding whether the gap is "too" large or not: remember this is more an aptitude test than an IQ test, so the small sex gap on IQ tests does not generalize to aptitude sex gaps. The g depends on the content of the tests. In an IQ test this isn't a problem, since the test includes a vast, representative array of abilities. Here, on the other hand, the latent factors reflect highly specific knowledge (technical, literary, aesthetic), which is perhaps why DIFs were detected. Whether this g gap is implausibly large requires looking at past studies of sex differences on very similar test batteries. One might argue that achievement-g and IQ-g are highly correlated, but this doesn't imply the means are identical.

I never implied that the effect size was related to the model fit, merely that models that are poor fits for the data are potential causes of unrealistic effect sizes. The phrase 'too large' was not implicitly comparing it to an assumed magnitude of a sex difference in intelligence; it was a comparison to published studies suggesting the difference is about .3-.5.

Bot

Authors have updated the submission to version #3

Reviewer

I approve the article, with the caveat that the latent models used to estimate sex differences are not optimal. Now, I would appreciate it if the sections were numbered. In some other papers I've seen submitted to the journal here, this wasn't done properly. Section 1: intro, section 2: method (possibly with subsections), section 3: results, section 4: discussion, etc.

Finally, I have some remarks. Note that these points are not mandatory, you can ignore them. However, I still recommend you take my point on cross loadings seriously because there are enough studies out there indicating their importance. Mentioning these references in the paper at least shows you are aware of the issue mentioned by other researchers.

In response to your comments:

It is doubtful that minor fluctuations in modeling are going to have effects on the rank order or magnitude of these differences.

While model specification may not produce a big difference in parameter estimates, it allows for valid estimation. Whatever the magnitude/difference of the true group difference, a misspecification is still a misspecification.

For the sake of simplicity, the cross-loadings have been removed.

I would have preferred cross loadings being specified according to EFA procedures. If you don't feel like including cross loadings, I hope you would mention at least these references: Hsu et al. (2014) and Ximenez (2022). These authors explain the consequence of ignoring cross loadings in parameter estimates.

Hsu, H. Y., Skidmore, S. T., Li, Y., & Thompson, B. (2014). Forced zero cross-loading misspecifications in measurement component of structural equation models. Methodology. doi: 10.1027/1614-2241/a000084
Ximénez, C., Revuelta, J., & Castañeda, R. (2022). What are the consequences of ignoring cross-loadings in bifactor models? A simulation study assessing parameter recovery and sensitivity of goodness-of-fit indices. Frontiers in Psychology, 13, 923877. doi: 10.3389/fpsyg.2022.923877

Is there a purpose to conducting DIF before CFA, if removing biased items was not the reason for using the method?

Knowing which items show DIF before CFA allows you to remove the offending items when conducting CFA. It ensures the loadings and intercepts (or group difference) aren't affected by item bias.

Three out of the eight questions were unambiguously related to health (STDs, cancers, painkillers). Some of them were drug-related (cigarettes, weed). Others include holidays, famous criminals, and card games.

With 3 out of 8, I think it will be worth mentioning in the main text that there is a modest health "flavor" to this general knowledge.

I think the simple structure CFAs are worth keeping in the paper for the sake of noting the difference in results between the observed general knowledge means and the latent one. 

While it is a fine procedure to compare results from different methods (in your case, observed and latent variable approaches) whenever these methods have different strengths and weaknesses and/or different applications, in the case of a simple latent structure we know it to be very, very wrong for complex intelligence structures. Comparing results between a good and a wrong method is not informative at all. I believe this is important because you wrote somewhere that "the latent models failed to be useful in determining whether there was a sex difference in the general factor of knowledge". This conclusion could have been different if appropriate models had been used.

I never implied that the effect size was related to the model fit, merely that models that are poor fits for the data are potential causes of unrealistic effect sizes.

Perhaps related wasn't the best word. But the sentence "except for general knowledge, where there is a much larger gender difference (d = -0.7), probably due to the poor model fit (CFI = .61)" indicates that you suspect poor fit to be the cause of the large gender gap. My point was that a poor fit indicates that the difference of -0.7 can't be trusted due to poor model fit, rather than the magnitude being due to poor model fit. In fact, it is not impossible to find a large gender difference with a very good model fit (refer to this submission for instance).

The phrase 'too large' was not implicitly comparing it to an assumed magnitude of a sex difference in intelligence; it was a comparison to published studies suggesting the difference is about .3-.5.

This is fine. It is the previous statement that was weird.

In response to your updated paper:

Otherwise, the omega reliability was used to estimate reliability.

I suggest adding "(denoted ω)" to improve clarity.

Bias-adjusted differences were computed by estimating the test bias using partial invariant fits

I suggest either adding a reference to this method or describing in a few words the method.

where ability is calculated without taking that item into consideration

It should have been: "where DIF is calculated".

Within the answers, 2 methods supported a negative relationship between g-loadings and female advantages, while only the method with four logistic parameters found a positive relationship between g-loadings and female advantages.

Be specific about these 2 methods supporting a negative relationship. Are these the 2- and 3-parameter models? If yes, it is preferable to be explicit.

and general knowledge correlates with intelligence at about .8 . 

Spacing issue.

Beyond this, analysis suggested that desktop users scored

It should be "Beyond this, the analysis".

The title of Figure 6 says "DWLS estimation was used.". This wasn't mentioned before in the method section. I suggest mentioning this briefly and stating exactly what DWLS stands for (diagonally weighted least squares?). Too many times acronyms were used before in the paper without explaining what they are.

The title of table 8 lacks a dot. Same issue with table A2, figure A1, figure A2. Title in Figure 3 lacks a dot.

Author
Replying to Reviewer 2

I approve the article, with the caveat that the latent models used to estimate sex differences are not optimal. Now, I would appreciate it if the sections were numbered. In some other papers I've seen submitted to the journal here, this wasn't done properly. Section 1: intro, section 2: method (possibly with subsections), section 3: results, section 4: discussion, etc.

Numbering was added.

Finally, I have some remarks. Note that these points are not mandatory, you can ignore them. However, I still recommend you take my point on cross loadings seriously because there are enough studies out there indicating their importance. Mentioning these references in the paper at least shows you are aware of the issue mentioned by other researchers.

In response to your comments:

It is doubtful that minor fluctuations in modeling are going to have effects on the rank order or magnitude of these differences.

While model specification may not produce a big difference in parameter estimates, it allows for valid estimation. Whatever the magnitude/difference of the true group difference, a misspecification is still a misspecification.

For the sake of simplicity, the cross-loadings have been removed.

I would have preferred cross loadings being specified according to EFA procedures. If you don't feel like including cross loadings, I hope you would mention at least these references: Hsu et al. (2014) and Ximenez (2022). These authors explain the consequence of ignoring cross loadings in parameter estimates.

Hsu, H. Y., Skidmore, S. T., Li, Y., & Thompson, B. (2014). Forced zero cross-loading misspecifications in measurement component of structural equation models. Methodology. doi: 10.1027/1614-2241/a000084
Ximénez, C., Revuelta, J., & Castañeda, R. (2022). What are the consequences of ignoring cross-loadings in bifactor models? A simulation study assessing parameter recovery and sensitivity of goodness-of-fit indices. Frontiers in Psychology, 13, 923877. doi: 10.3389/fpsyg.2022.923877

Citations have been added.

Is there a purpose to conducting DIF before CFA, if removing biased items was not the reason for using the method?

Knowing which items show DIF before CFA allows you to remove the offending items when conducting CFA. It ensures the loadings and intercepts (or group difference) aren't affected by item bias.

Very well. I did not intend to remove offending items before conducting CFA.

Three out of the eight questions were unambiguously related to health (STDs, cancers, painkillers). Some of them were drug-related (cigarettes, weed). Others include holidays, famous criminals, and card games.

With 3 out of 8, I think it will be worth mentioning in the main text that there is a modest health "flavor" to this general knowledge.

 

I think the simple structure CFAs are worth keeping in the paper for the sake of noting the difference in results between the observed general knowledge means and the latent one. 

While it is a fine procedure to compare results from different methods (in your case, observed and latent variable approaches) whenever these methods have different strengths and weaknesses and/or different applications, in the case of a simple latent structure we know it to be very, very wrong for complex intelligence structures. Comparing results between a good and a wrong method is not informative at all. I believe this is important because you wrote somewhere that "the latent models failed to be useful in determining whether there was a sex difference in the general factor of knowledge". This conclusion could have been different if appropriate models had been used.

 

I never implied that the effect size was related to the model fit, merely that models that are poor fits for the data are potential causes of unrealistic effect sizes.

Perhaps 'related' wasn't the best word. But the sentence "except for general knowledge, where there is a much larger gender difference (d = -0.7), probably due to the poor model fit (CFI = .61)" indicates that you suspect the poor fit to be the cause of the large gender gap. My point was that a poor fit means the difference of -0.7 can't be trusted, not that its magnitude is caused by the poor fit. In fact, it is not impossible to find a large gender difference with a very good model fit (refer to this submission, for instance).

 

The phrase 'too large' was not implicitly comparing it to an assumed magnitude of a sex difference in intelligence; it was in comparison to published studies suggesting the difference is about .3-.5.

This is fine. It is the previous statement that was weird.

In response to your updated paper:

Otherwise, the omega reliability was used to estimate reliability.

I suggest adding "(denoted ω)" to improve clarity.
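For reference, the most common single-factor form of McDonald's omega, written in terms of the standardized loadings λ_i and unique variances θ_i, is ω = (Σ λ_i)² / [(Σ λ_i)² + Σ θ_i]; spelling this out once alongside the symbol would also help readers unfamiliar with the notation.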

Added.

Bias-adjusted differences were computed by estimating the test bias using partial invariant fits

I suggest either adding a reference for this method or describing it in a few words.

The method has been described in more detail in the upcoming second edition.

where ability is calculated without taking that item into consideration

It should have been: "where DIF is calculated".

Within the answers, 2 methods supported a negative relationship between g-loadings and female advantages, while only the method with four logistic parameters found a positive relationship between g-loadings and female advantages.

Be specific about the two methods supporting a negative relationship. Are these the 2- and 3-parameter models? If yes, it is preferable to be explicit.
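(For readers who may not know the shorthand: the 4-parameter logistic item response function is P_j(θ) = c_j + (d_j − c_j) / (1 + exp(−a_j(θ − b_j))), where a_j is the discrimination, b_j the difficulty, c_j the lower (guessing) asymptote, and d_j the upper asymptote; the 3PL fixes d_j = 1 and the 2PL additionally fixes c_j = 0.)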

Yes.

and general knowledge correlates with intelligence at about .8 . 

Spacing issue.

Fixed.

Beyond this, analysis suggested that desktop users scored

It should be "Beyond this, the analysis".

The title of Figure 6 says "DWLS estimation was used." This wasn't mentioned in the method section. I suggest mentioning it briefly and stating exactly what DWLS stands for (diagonally weighted least squares?). Too many acronyms are used in the paper without being spelled out.

Added.

The title of Table 8 lacks a dot. The same issue applies to Table A2, Figure A1, and Figure A2. The title of Figure 3 also lacks a dot.

Fixed.

 

Bot

Authors have updated the submission to version #4

Reviewer

I read the new version, it looks fine. One last nitpick: 

This analysis also suggests that using DIF is not an optimal method for assessing bias in highly biased tests, and that different methods, such as MGCFA should be used to assess it.

I don't remember this sentence from before, but I just wanted to comment on it. DIF methods are indeed weak when the percentage of biased items is high, e.g., 30% (often used as a cutoff for strong bias in simulation studies), but I don't think MGCFA handles the problem better when the percentage of biased subtests is high (although there is a lack of research in MGCFA studies on what percentage cutoff should count as highly biased: some researchers apply MGCFA, find 35% biased subtests, and still claim the test is culture fair). If a test is highly biased, no method seems to handle it properly. The test simply needs to be revised.

Author
Replying to Reviewer 2

I read the new version, it looks fine. One last nitpick: 

This analysis also suggests that using DIF is not an optimal method for assessing bias in highly biased tests, and that different methods, such as MGCFA should be used to assess it.

I don't remember this sentence from before, but I just wanted to comment on it. DIF methods are indeed weak when the percentage of biased items is high, e.g., 30% (often used as a cutoff for strong bias in simulation studies), but I don't think MGCFA handles the problem better when the percentage of biased subtests is high (although there is a lack of research in MGCFA studies on what percentage cutoff should count as highly biased: some researchers apply MGCFA, find 35% biased subtests, and still claim the test is culture fair). If a test is highly biased, no method seems to handle it properly. The test simply needs to be revised.

I have no idea why I wrote that. I have my doubts about MGCFA too, though I am not confident enough in my assessment of it to strongly disavow it. It is better to assess bias using the traditional Jensen slope/intercept method (see https://arthurjensen.net/wp-content/uploads/2014/06/1980-jensen.pdf), where the criterion variables (e.g. income, GPA, graduation rate) are regressed on the predictor being tested (e.g. IQ) separately in each group, and one checks whether the predictor over- or under-predicts performance on those variables for one group, or whether it predicts performance better in one group than in the other.
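To make the slope/intercept idea concrete, here is a minimal sketch under hypothetical variable names (gpa as criterion, iq as predictor, group as the grouping variable); a significant interaction term indicates slope bias, while a significant group main effect without an interaction indicates intercept bias.

# Hedged sketch of a Jensen/Cleary-style slope and intercept bias check.
# Column names (gpa, iq, group) are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("validity_data.csv")  # one row per person: criterion, predictor, group

# Regress the criterion on the predictor, the group, and their interaction.
fit = smf.ols("gpa ~ iq * C(group)", data=df).fit()
print(fit.summary())  # inspect the C(group) and iq:C(group) terms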

Bot

Authors have updated the submission to version #5

Reviewer

The manuscript is well written and the methodology is explained clearly.

I do not have major objections to its publication. However, I would like to suggest a novel methodology that could be applied to multiple-choice data of this type, with correct answers and distractors. Please note that this is just an addition and is not necessary for publication of the paper.

I acknowledge that this is not a vetted method, hence I cannot require the authors to implement it. However, they can take this as a suggestion.

You can leverage the information gathered in your study to build a confusion matrix, which would tell you the true positive and false positive rates for each item. This would require generating correct answer–distractor pairs within each item. Given this, we can build a 2x2 contingency matrix (in this case called a confusion matrix). Distractors wrongly chosen as correct answers would be false positives, distractors left blank (not ticked) would be true negatives, ticked correct options would be true positives, and correct options left blank would be false negatives.

For each of the 32 items, you could compute 1) Accuracy, 2) Precision, 3) Recall and 4) F1-score. 

These 4 metrics can be computed item-wise, by aggregating the answers from all the subjects, or subject-wise, by aggregating a given subject's answers across all the items. So instead of a single score, we would have 4 metrics for each subject and for each item. The latter could be used to drop items that have too high or too low average values on any of the 4 measures. Moreover, you would get more nuanced information about each item or each participant, and you might be able to use this to improve the test by, for example, selecting items whose false positive or false negative rates are not too high.
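A minimal sketch of how these metrics could be computed, assuming a 0/1 array of ticked options and a matching 0/1 answer key (the array and function names are illustrative, not taken from the paper):

# Per-item (or per-subject) confusion-matrix metrics for a checkbox test.
# `ticked` and `key` are hypothetical 0/1 arrays covering one item's options
# (aggregate over subjects for item-wise metrics, over items for subject-wise ones).
import numpy as np

def confusion_metrics(ticked: np.ndarray, key: np.ndarray) -> dict:
    tp = np.sum((ticked == 1) & (key == 1))  # correct options ticked
    fp = np.sum((ticked == 1) & (key == 0))  # distractors ticked
    tn = np.sum((ticked == 0) & (key == 0))  # distractors left blank
    fn = np.sum((ticked == 0) & (key == 1))  # correct options left blank
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}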

 

Author
Replying to Reviewer 4

The manuscript is well written and the methodology is explained clearly.

I do not have major objections to its publication. However, I would like to suggest a novel methodology that could be applied to multiple-choice data of this type, with correct answers and distractors. Please note that this is just an addition and is not necessary for publication of the paper.

I acknowledge that this is not a vetted method, hence I cannot require the authors to implement it. However, they can take this as a suggestion.

You can leverage the information gathered in your study to build a confusion matrix, which would tell you the true positive and false positive rates for each item. This would require generating correct answer–distractor pairs within each item. Given this, we can build a 2x2 contingency matrix (in this case called a confusion matrix). Distractors wrongly chosen as correct answers would be false positives, distractors left blank (not ticked) would be true negatives, ticked correct options would be true positives, and correct options left blank would be false negatives.

For each of the 32 items, you could compute 1) Accuracy, 2) Precision, 3) Recall and 4) F1-score. 

These 4 metrics can be computed item-wise, by aggregating the answers from all the subjects, or subject-wise, by aggregating a given subject's answers across all the items. So instead of a single score, we would have 4 metrics for each subject and for each item. The latter could be used to drop items that have too high or too low average values on any of the 4 measures. Moreover, you would get more nuanced information about each item or each participant, and you might be able to use this to improve the test by, for example, selecting items whose false positive or false negative rates are not too high.

 

Unfortunately, I am not familiar with ML theory; perhaps when I am more familiar with it, I could write a follow-up to the paper that contains MGCFA analysis as well.

Reviewer

Replying to Sebastian Jensen

The introduction lacks an explanation of what the goal of the paper is, or which hypotheses your study intends to test. Please add this, as it is a fundamental part of any scientific paper.

" Given that the sex difference in brain size is about d = 0.84 (Nyborg, 2005), the predicted male-female standardized difference in intelligence is 0.24.". Does this account for body size differences?


Reviewer

As I wrote before, it is not clear from the introduction what the goals of this paper are or which hypothesis it intends to test. There is a hint in this sentence: "Some research has intended to examine whether the manner in which a test is scored affects how valid it is. For example, it could be possible that a knowledge test that uses free responses is testing a different retrieval mechanism than the ones that use multiple choice mechanisms".

Is the goal of this paper to examine how scoring methods affect its validity? If so, please state this clearly.
