
Spearman’s g Explains Black-White but not Sex Differences in Cognitive Abilities in the Project Talent

Submission status
Accepted

Submission Editor
Emil O. W. Kirkegaard

Author
Meng Hu

Title
Spearman’s g Explains Black-White but not Sex Differences in Cognitive Abilities in the Project Talent

Abstract


The weak form of Spearman’s Hypothesis, which states that the racial group differences are primarily due to differences in the general factor (g), was tested and confirmed in this analysis of the Project Talent data, based on 34 aptitude tests among 9th-12th grade students. Multi-Group Confirmatory Factor Analysis (MGCFA) detected small to modest bias with respect to race but strong bias with respect to within-race sex differences. After establishing partial measurement equivalence, SH was tested by comparing the model fit of the correlated factors (non-g) model with that of a bifactor (g) model, as well as by comparing the relative contribution of the g factor means to that of the specific factor means. While g was the main source of the Black-White differences, this was not the case for the within-race sex differences. The average proportion of the score gaps accounted for by g is large (.73/.90) in the Black-White analyses but modest (.43/.50) in the sex analyses. The evidence of measurement bias in the sex analyses may cause ambiguity in interpreting SH for sex differences. Results from MGCFA were somewhat corroborated by the Method of Correlated Vectors, with high correlations of the subtests’ g-loadings with Black-White differences but near-zero correlations with sex differences. This finding replicates earlier MGCFA studies supporting SH with respect to the Black-White cognitive gap, as well as earlier MGCFA studies revealing stronger gender bias than racial bias.

Keywords
measurement invariance, MCV, Spearman’s Hypothesis, MGCFA, Black-White IQ gap, Project Talent, Sex IQ gap

Supplemental materials link
https://osf.io/qn67k/

Reviewers
Reviewer 1: Considering / Revise
Reviewer 2: Accept
Reviewer 3: Accept
Reviewer 4: Accept

Sat 16 Sep 2023 22:27

Bot

Author has updated the submission to version #2

Reviewer

The paper here contains very important findings. However, at present it is written in a very technical way, to the extent that anyone without intimate knowledge of the literature would be entirely lost.

 

Some suggested edits: 

A very brief and simple definition of Spearman's Hypothesis should be provided in the Abstract

 

On page 2: “One comes from Scheiber (2016b) who found strong measurement bias in the analysis of the WISC-V between 777/830 […]” it is unclear what the fractions in this sentence mean.

 

On page 5: “When within-factor correlated residuals are misspecified, all fit indices favor the correlated factors model regardless of conditions, except for SRMR, show a bias in favor of the correlated factors model (Greene et al., 2019)”. This sentence needs rewording, it does not make sense at the minute.

 

In the analysis section: it would be useful to provide diagrams of what the CF, HOF and BF models look like. This will make it easier for a reader to understand what hypotheses are being tested.

 

Table 1: There could be another column which states, in plain english, what each of these models is used to test for.

 

Table 2: please indicate, for each fit measure what is considered a better fit. E.g: CFI higher is better, RSMEA lower is better.

 

As far as I can tell the model specification is the same wherever it is stated. In which case it should only be stated once, with a phrase along the lines of “for all of our models the model specification is:”

 

Providing a table of g-loadings for math, speed etc would be useful information.

 

The horizontal axis on figures should be “average g-loading (from Black and White male sample)” or similar. This is practically the most important part of the paper, as these graphs are easy to digest for laymen. They need to be as easy to understand as possible.

Bot

Author has updated the submission to version #3

Author

Thank you for the review.

I understand that the analysis is complex. In reality, MGCFA can be much easier if the data is ideal (clean factor structure, near equivalent group samples, no Heywood cases, no pro-bifactor bias, assumption of no cross loadings for computing effect sizes of bias). Unfortunately, the data usually does not fulfill most of these ideal conditions. And in the case of the Project Talent, the large number of subtests, subgroups and models complicates the situation even more. I wish I could simplify as much as possible, but at the same time it is necessary to explain and address the problems that are often ignored in MGCFA studies.

I modified my article according to your suggestions, clarifying and fixing whenever necessary. I also updated my supplementary file.

The weak form of Spearman’s Hypothesis, which states that the racial group differences are primarily due to differences in the general factor (g), was tested and confirmed in this analysis of the Project Talent data, based on 34 aptitude tests among 9th-12th grade students. 

...

One comes from Scheiber (2016b), who found strong measurement bias in the analysis of the WISC-V between 777 White males and 830 White females, 188 Black males and 221 Black females, and 308 Hispanic males and 313 Hispanic females.

...

When within-factor correlated residuals are misspecified, all fit indices correctly favor the correlated factors model regardless of conditions, except for SRMR, which incorrectly favors the bifactor model (Greene et al., 2019, Table 4).

I now provided a new Figure 1, along with the following text:

Figure 1 displays hypothetical competing CFA models that are investigated in the present analysis: 1) the correlated factors model, which specifies that the first-order specific factors are correlated without the existence of a general factor; 2) the higher order factor model, which specifies that the second-order general factor operates through the first-order specific factors and thus only indirectly influences the subtests; 3) the bifactor model, which, unlike the higher order factor model, specifies that both the general and specific factors have direct influences on the subtests.

I also added a note under the model fit tables:

Note: higher values of CFI and Mc indicate better fit, while lower values of χ2, RMSEA, RMSEAD, SRMR indicate better fit.

I, however, found one of your requests difficult to fulfill. Specifically, this one:

Table 1: There could be another column which states, in plain english, what each of these models is used to test for.

This is because summarizing the purpose of each model in just one or two words is extremely difficult. Considering that the specification column is already loaded with information, adding another column filled with more information would, I believe, make the table more tedious to read.

The models in Table 1 have been somewhat summarized prior, but now expanded a bit more, with a reference to Table 1 as well. 

MGCFA starts by adding constraints to the initial configural model, in the following incremental steps: metric, scalar, strict. A rejection of configural invariance implies that the groups use different latent abilities to solve the same set of item variables. A rejection of metric (loading) invariance implies that the indicators of a latent factor are unequally weighted across groups. A rejection of scalar (intercept) invariance implies that the subtest scores differ across groups even when their latent factor means are equalized. A rejection of strict (residual) invariance implies that there is a group difference in specific variance and/or measurement error. When invariance is rejected, partial invariance requires freeing parameters until acceptable fit is achieved, and these freed parameters must be carried over to the next levels of MGCFA models. The variances of the latent factors are then constrained to be equal across groups to examine whether the groups use the same range of abilities to answer the subtests. The final step is to determine which latent factors can have their mean differences constrained to zero without deteriorating the model fit: a worsening of the model fit indicates that the factor is needed to account for the group differences. These model specifications will be presented in Table 1 further below.
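To illustrate these incremental steps, here is a minimal lavaan/semTools sketch (the data frame, grouping variable and factor structure are hypothetical placeholders, not the exact code used in this paper):

library(lavaan)
library(semTools)

# Hypothetical two-factor model; subtest names are placeholders
model <- '
  english =~ S1 + S13 + S19 + S20
  math    =~ S5 + S25 + S32 + S33
'

configural <- cfa(model, data = dat, group = "race")
metric     <- cfa(model, data = dat, group = "race", group.equal = "loadings")
scalar     <- cfa(model, data = dat, group = "race",
                  group.equal = c("loadings", "intercepts"))
strict     <- cfa(model, data = dat, group = "race",
                  group.equal = c("loadings", "intercepts", "residuals"))

# A clear worsening of fit at a given step flags that level of non-invariance
compareFit(configural, metric, scalar, strict)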

I provided the R output displaying all parameter values for the best model in Tables 2 through 13 in the supplementary file. The output is in fact so large that, even if I only displayed the group factor loadings, it would drastically increase the number of pages. The article is already very long. Notice that I did not originally display the general factor loadings anywhere in the paper. This is because, again, there were too many models and subgroups (loadings, for both g and the specific factors, would need to be displayed for each subgroup: Black men, White men, White women, and Black women, for each g model).

The X axis title in figures 2-5 has been modified as per your suggestion.

One final remark. I do not understand this sentence.

As far as I can tell the model specification is the same wherever it is stated. In which case it should only be stated once, with a phrase along the lines of “for all of our models the model specification is:”

Are you referring to the competing models (CF, HOF, BF) or rather to the model constraints (M1-M6)? I suspect the latter, though I'm not sure. If this is the case, each subgroup shows a different pattern of non-invariance, so I have to discuss them separately rather than making general and rather imprecise statements about the results. I understand it may be tedious for the readers, but I believe it is necessary.

Reviewer
Replying to Meng Hu


Thank you for your reply. I will review the revised version. But I will quickly clarify that last part. I am referring to the bit that states: "The model specification is displayed as follows:

english =~ S1 + S13 + S19 + S20 + S21 + S22 + S23 + S24 + S25 + S26 + S31 + S34

math =~ S5 + S25 + S32 + S33 + S34

speed =~ S19 + S34 + S35 + S36 + S37

info =~ S1 + S2 + S3 + S4 + S7 + S8 + S11 + S12 + S13 + S14 + S15 + S16 + S19 + S26

science =~ S1 + S6 + S7 + S8 + S9 + S10

spatial =~ S28 + S29 + S30 + S31 + S37"

This seems to be repeated exactly several times in the paper, e.g. on page 16, page 21, and page 24.

Author

I understand now. At first glance, it seems the models are identical across subgroups. In fact, they differ a little bit by subgroup. Usually, some subtests have additional cross loadings in some subgroups (e.g. S13 Health, S32 Arithmetic Reasoning, etc.).

Bot

The submission was accepted for publication.

Reviewer | Admin | Editor

 This finding replicates earlier MGCFA studies supporting SH with respect to the Black-White cognitive gap as well as earlier MGCFA studies revealing stronger gender bias than racial bias

I would not frame failure of Spearman's hypothesis as test bias. It can mean that, but it can also just mean that non-g factors have a large role in explaining why differences are larger or smaller on some tests. This is obviously the case for sex differences, where the g factor difference (assuming this exists) has a relatively minor role in explaining differences across subtests compared to differences in non-g abilities. For black-white, the g gap is so large that g will explain most of the differences between subtests. Also, could you decompose the explained proportion and give these in the abstract?

Jensen (1998) proposed that the magnitude of the racial differences in IQ, at least between Black and White Americans, as well as differences in cognitive-related socio-economic outcomes are a function of the g-loadings (i.e., the correlation between the tests or outcomes and the general factor of intelligence) of the respective cognitive tests and outcomes, making the g factor an essential element in the study of socio-economic inequalities

Can you give the page reference for this? I think this is generally called the g-nexus.

 More specifically, Jensen (1998) proposed that race/ethnic group differences on cognitive tests are largely due to latent differences in general mental ability. This is known as Spearman’s hypothesis (SH) which exists in two forms: the strong and the weak form, the latter of which was endorsed by Jensen.

He did propose that, but not in that book. Proposed makes it sound like this was the original writing to propose this, but Jensen's writings on SH go back decades prior. Maybe better would be to discuss:

This is known as Spearman’s hypothesis (SH) which exists in two forms: the strong and the weak form, the latter of which was endorsed by Jensen. The strong form affirms that the differences are solely due to g factor differences while the weak form affirms that the differences are mainly due to differences in g. The alternative contra hypothesis states that group differences reside entirely or mainly in the tests’ group or broad factors and/or test specificity and that g differences contribute little or nothing to the overall ones.

I prefer thinking about this as a continuum of g vs. non-g contributions to the observed gaps. It ranges from 0 to 100% and the strong form is just the point at the end of this line. I think that conceptualizing it this way is more sensible, but I recognize that MGCFA-like methods have to split it up for hypothesis testing purposes. Still, Lasker often uses g's contribution as a % which will be useful for meta-analysis.

Millsap & Olivera-Aguilar (2012, p. 388) provides an illustration: the inclusion of a math test having a mixture of multiple-choice items and problem-solving items, with the latter being story problems, may introduce bias against foreign language speakers due to the verbal content of this test. If the math factor is not supposed to measure verbal skill, then such a test should be discarded.

I think it would be better to say that such a test should not be used with foreign language users, or at least, should be scored differently in order to compensate for this language bias.

Another comes from Benson et al. (2020) who analyzed the UNIT2 norming sample and found that scalar invariance was rejected not only for race (Whites, Blacks, Asians) and ethnicity (Hispanic) groups but also for gender groups. Metric invariance was also rejected for age and gender groups, suggesting that the UNIT2 overall is somewhat biased with respect to any group.3

These are in stark contrast to most findings. What is special about this study? Why does it give so different results? Can one reanalyze the data using the same methods as most papers use?

So far the evidence of strong measurement bias in race differences comes mainly from studies conducted in African countries. Dolan et al. (2004) compared the Junior Aptitude Test (JAT) scores of South African Black and White students and found that both metric and scalar invariance are violated.4 Lasker (2021) re-analyzed Cockroft et al. (2015) and compared the WAIS-III scores of undergraduate South African students enrolled at an English medium University to undergraduate UK university students, and found that metric and scalar invariance are rejected. Warne (2023) compared the WISC-III scores of Kenyan Grade 8 students in Nairobi schools to the American norm and the WAIS-IV scores of Ghanaian students who showed English fluency at high school or university to the American norm. While measurement equivalence was established for Ghanaian students, it was rejected for Kenyan students.5

Many readers will not be familiar with the exact meanings of metric, scalar etc. invariance, and what it means when it is rejected/violated. Can you explain these? Usually, one can attain partial invariance allowing score comparisons, was this done?

Yet research employing MGCFA showed mixed evidence of gender fairness. Some studies reported small or no measurement bias (Chen et al., 2015;6 Dombrowski et al., 2021; Irwing, 2012; Keith et al., 2011; Palejwala & Fine, 2015;7 Reynolds et al., 2008; van der Sluis et al., 2006) while others reported non-trivial bias, intercepts almost always being the common source of bias (Arribas-Aguila et al., 2019; Dolan et al., 2006; Lemos et al., 2013; Pauls et al., 2020; Pezzuti et al., 2020; Saggino et al., 2014; Van der Sluis et al., 2008; Walter et al., 2021).

I would add that it is trivially easy to get intercept bias for group differences if they differ in non-g factors as well as g. If the number of tests isn't sufficiently large to model the non-g factors, differences in them will show up as intercept bias. A good example of this is:

https://www.researchgate.net/publication/236597029_Decoding_the_Meaning_of_Factorial_Invariance_and_Updating_the_Practice_of_Multi-group_Confirmatory_Factor_Analysis_A_Demonstration_With_TIMSS_Data

Which used a g-only model for 5 math subtests from TIMSS.

After fitting a parsimonious weak SH model, they discovered that the equality constraint on the g factor mean did not worsen model fit in 3 of the 4 age subgroups. In the end, there is no compelling evidence that g is the main source of the subtest differences between sex groups.

Given how small the g difference is between the sexes, and the limited power of MGCFA methods, it is not surprising that the null-g models are often not rejected. This is an inherent issue with hypothesis testing. Jensen's method has similar issues, because it is biased downward by sampling errors in the estimates. When the true difference is relatively small, the bias will often overcome the signal. This was shown by simulation by:

https://www.researchgate.net/publication/353465907_The_Negative_Religiousness-IQ_Nexus_is_a_Jensen_Effect_on_Individual-Level_Data_A_Refutation_of_Dutton_et_al's_'The_Myth_of_the_Stupid_Believer'

The Project Talent is the largest study ever conducted in the United States involving 377,016 9th-12th grade students during 1960 and drawn from all of the 50 states (Flanagan et al., 1962).

Largest what study? It is probably not the largest study ever of any social science in the USA. For instance, the current MVP is larger. https://www.research.va.gov/mvp/ It is probably the largest dataset from the USA concerning detailed measures of intelligence.

Three tests have been removed in the present analysis: memory for sentences (S17), memory for words (S18), and creativity (S27). The memory tests are highly correlated with each other but are poorly correlated with all other variables (between r=.10 and r=.20), which makes them unsuitable for CFA. Creativity has moderate correlations with other variables, has no main loading and its loadings are of modest or small size. Thus, a total of 34 aptitude/cognitive tests are used.

Why are they so low in g-loading? It's a general problem of PT that it has too many tests with small item counts. The item data is not available for analysis for most of the sample, so one cannot attempt to adjust for this so easily.

In their study, the VPR-g model fitted much better than the CHC-based HOF g model.

I think this is the first time "CHC" is used, but not explained.

Figure 1 images are fairly ugly. Maybe draw some nicer ones using e.g. https://app.diagrams.net/

To evaluate and compare model specifications, fit indices such as CFI, RMSEA, RMSEAD, SRMR and McDonald’s Noncentrality Index (Mc) are used to assess model fit, along with the traditional χ2. Higher values of CFI and Mc indicate better fit, while lower values of χ2, RMSEA, RMSEAD, SRMR indicate better fit. Simulation studies established the strength of these indices to detect misspecification (Chen, 2007; Cheung & Rensvold, 2002; Khojasteh & Lo, 2015; Meade et al., 2008). However, with respect to ∆RMSEA, doubts about its sensitivity to detect worse fit among nested models were raised quite often. Savalei et al. (2023) provided the best illustration of its shortcomings. According to them, this was expected because the initial Model A often has large degrees of freedom (dfA) relative to the degrees of freedom introduced by the constraints in Model B (dfB), resulting in very similar values of RMSEAB and RMSEAA, hence a very small ΔRMSEA. For evaluating nested models, including constrained ones, their proposed RMSEAD solves this issue. RMSEAD is based on the same metric as RMSEA and is interpreted exactly the same way: a value of .08 suggests fair fit while a value of .10 suggests poor fit.

Is there some reason people in this area don't use cross-validation? It seems to me that this is a better way to deal with overfitting issues relating to all of these model fit statistics. I see that some people use split-half cross-validation as a measure of overfitting. https://www.tandfonline.com/doi/abs/10.1080/1091367X.2014.952370 In general, though, k-fold cross-validation is a much better approach.
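For concreteness, a rough sketch of the split-half variant in lavaan (hypothetical dat and model objects, purely illustrative):

library(lavaan)

set.seed(1)
half  <- sample(seq_len(nrow(dat)), size = floor(nrow(dat) / 2))
fit_a <- cfa(model, data = dat[half, ],  group = "race")
fit_b <- cfa(model, data = dat[-half, ], group = "race")

# If a specification mainly capitalizes on chance, its fit advantage should
# shrink or disappear in the holdout half
fitMeasures(fit_a, c("cfi", "rmsea", "srmr"))
fitMeasures(fit_b, c("cfi", "rmsea", "srmr"))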

Because the Predictive Mean Matching (PMM) method of imputation calculates the predicted value of target variable Y according to the specified imputation model, the imputation was conducted within race and within gender groups, totaling four imputations. It is inappropriate to impute the entire sample because it implies that the correlation pattern is identical across groups, an assumption that may not be true and may eventually conceal measurement non-invariance.

Alternatively, different imputation models may create spurious non-invariance because they fill in the missing data using different model parameters.
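To make the within-group approach concrete, a hedged sketch with the mice package (hypothetical race and sex columns; not the author's actual code):

library(mice)

# Impute each race-by-sex subgroup separately so that each group's own
# covariance structure drives the PMM imputation model
groups  <- split(dat, interaction(dat$race, dat$sex, drop = TRUE))
imputed <- lapply(groups, function(g)
  complete(mice(g, method = "pmm", m = 1, printFlag = FALSE)))
dat_imp <- do.call(rbind, imputed)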

The model specification is displayed as follows:

Use code formatting (equal space font) for code.

Table 2. Is "RMSEAD [CI]" difference in RMSEA? The width of the confidence interval is not specified, but one can assume it is 95%. I don't see where this value is from. The differences in RMSEA are not that large between the models, they are all between 0.042 and 0.044.

Table 3. Black-White differences among males using Higher Order Factor

Better captions would be e.g. "Table 3. Black-White differences among males using the Higher Order Factor model"

What do the bold rows in the tables mean?

Finally, as an additional robustness analysis, all models for both the Black-White male and female groups were rerun after removing multivariate outliers with the Minimum Covariance Determinant (MCD) proposed by Leys et al. (2018) who argued that the basic Mahalanobis Distance was not a robust method. Although the multivariate normality was barely acceptable, the number of outliers was large: MCD removed 1,948 White males and 338 Black males, and 1,005 White females and 372 Black females.

Is there some information about why these were outliers? In what sense are they outliers?

Robustness analysis was conducted for the gender difference in the White group because the multivariate normality was non-normal

I don't understand this.

Table 14. d gaps (with their S.E.) from the best fitting g models per group analysis

Are these standard errors really accurate? You could bootstrap the data to find out. It is hard to believe these estimates are that precise.

Note: Negative values indicate advantage for Whites (or males).

Might be more sensible to use positive values for these gaps since these are more numerous and usually the values of interest.

The average proportion is .43 for the sex group among Whites and .50 for the sex group among Blacks. If SH explicitly states that g is the main source of the group difference, it seems that even the weak SH model does not explain well the pattern of sex differences.

On the contrary, these seem high values. It says g gaps explain quite a bit of the variation between tests. Based on the popular view that g gap = 0, these kinds of findings are unexpected.

Table 15. Proportions of subtest group differences due to g based on Bifactor model

It is hard to understand the table when subtests are not given by their names, but by e.g. S15. The reader then has to scroll back to a much earlier page to look up each value.

The g-loadings correlate highly with Black-White d gaps but not with sex d gaps. After correction for unreliability, the correlations (g*d) for the Black-White male, Black-White female, male-female White, and male-female Black groups are, respectively, .79, .79, -.06, -.12. If the reliability for speed subtests is assumed to be .70 instead of .60, the correlations are .80, .80, -.05, -.12.

I think this is a good example of how Jensen's method gives misleading results compared to the proportions calculated above. There it was found g explained about half the variation in subtest differences, whereas Jensen's method suggests it is about 0%.

Figures 2-5 also show us the issue of single regression. In reality, since tests load on multiple factors, you should use multiple regression. The predictors are each test's factor loadings on all the factors. When this is done, the results (unstandardized slopes) will tell you the estimated gap sizes for tests that measure each factor perfectly and no other factor. Do these values align with the MGCFA results? They should.

You don't seem to think of the implied gaps based on the regression models in the figures, but we can read them off at the loading = 1 line. They are about 1.5 d for Black-White males, about 1.6 d for Black-White females, and quite small for the two male-female gaps. The values for the Black-White comparisons are close to the ones from MGCFA in Table 14.
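Something along these lines, where d holds the standardized subtest gaps and L is a hypothetical subtest-by-factor matrix of loadings (illustrative only):

reg <- lm(d ~ g + english + math + speed + info + science + spatial,
          data = data.frame(d = d, L))

# Each unstandardized slope estimates the gap on a hypothetical test that loads
# 1.0 on that factor and 0 on the others
coef(reg)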

After establishing partial invariance, SH was tested in all subgroups. This was validated in the Black-White analyses based on two findings: 1) non-g models fit worse than g-models and 2) the proportion of the subtests’ mean differences due to g is very large.

Maybe it is larger than usual because this sample was large enough so that single sex samples could be used. When joint sex samples are used, some of the variation in the data will be explained by unmodeled sex differences. I think this may make the g proportions lower.

SH is supported through the examination of Forward and Backward Digit Span, showing a BDS Black-White gap that is larger (d=.50) than the FDS gap (Jensen, 1998, p. 370).

See https://www.sciencedirect.com/science/article/abs/pii/S104160801530056X

The great majority of the studies confirms the cross-cultural comparability of IQ tests, the exception mainly comes from South African samples (Dolan et al., 2004; Lasker, 2021). Due to the omnipresent force of the mass-market culture in developed countries, it is not surprising that culture bias is rarely noticeable (Rowe et al. 1994, 1995).

A great many? I thought there were almost no studies examining cross-cultural, that is, between country, comparability of IQ tests.

Attempts to reduce the racial IQ gap using alternative cognitive tests have always been proposed (Jensen, 1973, pp. 299-301; 1980, pp. 518, 522). The most recent, but unconvincing, attempt at reducing the cognitive gap comes from Goldstein et al. (2023). They devised a reasoning test composed of novel tasks that do not require previously learned language and quantitative skills. Because they found a Black-White d gap ranging between 0.35 and 0.48 across their 6 independent samples, far below the typically found d gap of 1.00, they concluded that traditional IQ tests are biased. First, they carefully ignore measurement invariance studies. Second, traditional IQ tests were not administered alongside to serve as benchmarks. Third, their analysis adjusted for socio-economic status because they compare Blacks and Whites who had the same jobs (police officers, deputy sheriffs, firefighters) within the same cities. This study reflects the traditional view that IQ tests are invalid as long as they contain even the slightest cultural component.

Murray's recent book also provides gap sizes within occupations, though not within the same cities. They are generally smaller, of course.

The Project Talent administered aptitude tests. They serve as a proxy for cognitive tests, but they are not cognitive tests. For instance, most of the information test items require specific knowledge: asking who was the hero of the Odyssey, or what a female horse is called, etc. They do not call for relation eduction. Jensen (1985) himself has been highly critical of the Project Talent test battery: “Many of these tests are very short, relatively unreliable, and designed to assess such narrow and highly culture-loaded content as knowledge about domestic science, farming, fishing, hunting, and mechanics” (p. 218).

It is strange to call aptitude tests not cognitive tests. All of these are cognitive or mental tests per Jensen's definition given in The g Factor because performance on them depends mainly on cognition, as opposed to e.g. dexterity. Every cognitive test will have some flavor, at least, no one has made a pure g test so far. I don't think these aptitude or knowledge tests are any worse than many other tests commonly used. In fact, knowledge tests usually have higher loadings due to higher reliability despite not involving anything but recalling previously learned facts and some guesswork in cases of uncertainty.

 

Reviewer | Admin

 

The author applies MGCFA to aptitude tests in the Project Talent dataset to test measurement equivalence and Spearman’s Hypothesis by race and sex. The sample size is large and the cognitive data is rich, so it is well-deserving of publication. The author finds Spearman’s hypothesis applies to the black-white difference but not sex differences. Through the many citations and explication of methods, the author shows a deep knowledge of a difficult literature, where recommendations are constantly evolving (e.g. how to compare the fit of BF and HOF models). So I’m confident that the manuscript provides a rigorous contribution. 

 

My main criticism of the paper is that it is difficult to follow. I am not an expert in MGCFA, or test bias, and other papers in the field are hard to understand, nevertheless there are times in the manuscript where I became more confused than was necessary. Writing more clearly and concisely, while avoiding tangents and moving details to a supplement might be helpful in the future. Sometimes the extra details regarding methods seem unnecessary, while the core parts are not always clearly explained. 

 

As I’ve said this is not my area of expertise and there were areas where I was confused. As such, my comments are only recommendations, some of which may be wrong, but which the author can consider and use as he wishes. I categorise my comments into high level comments regarding the key findings which I found unclear and a larger range of miscellaneous comments I had whilst reading through the manuscript.




 

High level comments

 

The author concludes that there is gender bias in the cognitive tests. Is that accurate? Couldn't the results be interpreted as there being large sex differences in specific abilities?

 

The author finds huge sex differences in g using MGCFA. Isn’t this an interesting finding, since the received opinion of the field is that sex differences in g are non-existent or trivial. Does the author think his results should make us confident in sex differences in g?

 

Does the author have confidence that the g factor is well-specified? You seem to think the use of knowledge tests has made the latent “g factor” biased towards capturing general knowledge ability more so than it should. Is this really a good explanation for the large sex differences given the women do better on the information component? It looks like the science component perhaps is the key problem? 

 

The author says that there is non-trivial racial bias in the aptitude tests. Could the author clarify how big these biases are? How different is this result to past studies and if the result is different, why is it different? 




 

Miscellaneous Comments:



 

This is for answering this question that Jensen (1998) devised the Method of Correlated Vectors (MCV) to find out how much g explains the variation in subtest differences between groups. 

 

Awkward wording

 

the largest study ever conducted

 

I don’t think this is true. 

 

The Project Talent administered a considerable amount of tests

 

Just "Project Talent administered" would read better.

 

Wise et al. (1979). Major et al. (2012)

 

"And" instead of a full stop might be clearer here.

 

In their study, the VPR-g model fitted much better than the CHC-based HOF g model. The VPR was initially used in this study but it was found that the VPR model does not fit better than the CHC-based HOF model and produces sometimes inadmissible solutions such as negative variance.

 

Why do the results differ? My understanding was that VPR > CHC but CHC is primarily used because it is tradition

 

Footnote

Major et al. (2012) analyzed and used multiple imputation on the entire sample and separated the analysis by gender and by grade level (9-12). They included Memory for Words, Memory for Sentences, and Creativity subtests. In the present study, the VPR fits marginally better with a CFI=.002 at best, regardless of the subgroups being analyzed, and this remained true even after analyzing subgroups by grade level (9-12).

 

This appears to say the VPR does fit better than the CHC? Which is it? 

 

Supplementary excel sheet says # GET MY DATA HERE : https://osf.io/qn67k/ OR HERE https://osf.io/5y82g/

 

Perhaps put everything relevant in one OSF page? 

 

I think saving R code into excel files is quite ugly. I realise you might want to show off the console output, but R markdown can do this in a much cleaner way



 

Univariate normality is then scrutinized. Curran et al. (1996) determined that univariate skewness of 2.0 and kurtosis of 7.0 are suspect, and that ML is robust to modest deviation from multivariate non-normality but that ML χ2 is inflated otherwise. Values for univariate kurtosis and skewness were acceptable, although the kurtosis values for Table Reading are a little high among White males (3.2) and White females (4.58). On the other hand, multivariate normality was often rejected. The multivariate non-normality displayed by the QQ plot was moderate for Black-White analysis in both male and female groups and sex analysis in the White group but perfectly fine for sex analysis in the Black group. 

 

Exploratory Factor Analysis (EFA) was used to determine the appropriate number of factors. Similar to Major et al. (2012), it was found here that the 6-factor model was the most interpretable in all subgroups tested. The 4- and 5-factor models blend indicators into factors which are more ambiguous (math and english tests form one common factor; information and science tests form one common factor) and cause severe unbalances in factor sizes. The 7- and 8-factor models produce additional factors which are not associated with any particular ability or do not have any indicators with high loading. EFA reveals a large number of medium-size cross loadings. Since the results from simulation studies (Cao & Liang, 2023; Hsu et al., 2014; Xiao et al., 2019; Ximénez et al., 2022; Zhang et al., 2023) indicated that ignoring small cross loadings, typically set at .15 or .20 in these studies, has a tendency to reduce the sensitivity in commonly used fit indices, cross loadings are allowed when the average of the two groups is close to .20 but with a minimum of .15 per group. 



 

A lot of the results referenced in these two paragraphs don't appear to be presented in the paper. I can't see the QQ plot. What was the basis for supposing that the "6-factor model was the most interpretable in all subgroups tested"?

 

Overall fit is acceptable in all models, except maybe for Mc. The configural and regression invariance both hold perfectly, thus only the next steps will be critically analyzed. 

 

On what basis are these statements made and the statements in the rest of the paragraph?

 

Table 3 contains a summary of the fit indices of the HOF model and the freed parameters

 

I think “free parameters” sounds nicer than “freed parameters”



 

The male-female g gap in the White and Black groups are, respectively, 0.85 and 0.55. The sex gap seems large compared to earlier reports on IQ gaps, until one realizes that this battery of tests has a strong knowledge component, especially specific knowledge

 

I understand that all variables have been standardized, but a near 1SD gap seems incredible. Even if the tests are very much based on general knowledge. I don’t know the literature, but would sex gaps in general knowledge be so large in the age range of Project Talent?

 

Is the explanation of the large sex gaps being caused by lots of knowledge batteries very plausible? In the information ability, females have a large advantage, which seems contradictory. In fact, most of the broad abilities show a female advantage, yet still there is a male advantage in g. Men are really outperforming women on science ability (d = -1.7), even more so than on spatial ability (d = -0.33). So the sex differences are really being driven by this one odd science factor. Maybe it is worth reanalysing the sex differences after dropping the science subtests? Mind you, isn't the male advantage in spatial ability normally bigger?

 

Also the g gap between the sexes in black group seems much smaller than the white group. Should we be confident in that difference? It seems surprising and interesting and worthy of discussion. I personally am quite interested to know if there are racial differences in sexual dimorphism.

 

If g factors from project talent capture a lot of variance related to general knowledge, doesn’t that pose a big problem for the paper? If the g factor is improperly specified, then testing Spearman’s hypothesis in the sample, doesn’t mean much?

 

Is it interesting or surprising that the bifactor models produce much larger g differences for sex than the HOF models? My understanding is that the large differences in specific abilities, makes it difficult to isolate g differences between the sexes. So the BF model, in more precisely specifying the varying influences of specific abilities, might provide better estimates of sex differences in g?



 

Table 15. Proportions of subtest group differences due to g based on Bifactor model

 

All values for the sex differences are positive. But if males have an advantage in g and perform worse than women on some subtests, then g should contribute negatively to the sex difference in favour of women. Shouldn't some values be positive and some be negative?

 

 Freed parameters (by descending order of χ2 size) are: table reading~1, mechanical reasoning~1, social studies~1, clerical checking~1.

 

Most of the abilities are capitalised, but sometimes not as above. Best to be consistent.



 

Figures 2-5. The image quality is poor and the figures take up multiple pages. They appear to have been copied and pasted from RStudio. This causes the plots to have very poor quality. I recommend using the function ggsave instead. Then insert the picture into your Word document rather than copying and pasting it. In choosing a device, use a vector graphics format (pdf, svg, eps), not a raster format (jpeg, png). Vector graphics can be scaled in size without losing resolution.

 

Since Figures 2-5 are very similar, I recommend using facet wrap, gg grid, ggarrange or some other function or method to put all the plots into one image which can fit on one page. 

 

The plots are not very appealing. I'd consider using theme_bw(). This style is sleek and appealing. It is also used by Emil and I think is the common style used in OpenPsych. There are other issues too: the numbers are right on top of the points, and they are hidden by the regression line. I don't know how to fix the latter issue, but the former might be fixed by jittering the labels slightly. Alternatively, the points could be removed to leave just the numbered labels. Making the numbered labels larger might be more visually appealing.
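A sketch of the kind of ggplot2 call I have in mind (hypothetical data frame mcv with columns loading, d, subtest and comparison; illustrative only):

library(ggplot2)
library(ggrepel)

p <- ggplot(mcv, aes(x = loading, y = d)) +
  geom_smooth(method = "lm", se = FALSE, colour = "grey50") +
  geom_point(size = 1) +
  geom_text_repel(aes(label = subtest), size = 3) +  # keeps labels off the points
  facet_wrap(~ comparison) +                         # all four comparisons on one page
  theme_bw()

ggsave("mcv_plots.pdf", p, width = 8, height = 6)    # vector format scales without quality loss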



 

 the negative correlation amplifies (r=-.304) but this is simply because of the removal of two speed subtests.

 

Check the spacing in the equations. (r = -.304) looks prettier.



 

The effect size in non-invariance indicates a non-trivial bias, mostly disfavoring Blacks, but given the small ratio of biased/unbiased tests, the total effect should not be large… If this is the case, the racial bias cannot be considered as minor anymore

 

The author seems to be saying the tests used in Project Talent are biased in favour of Whites. The use of the negative phrasing "disfavoring Blacks" confused me a little bit. "Biased in favour of Whites" would have been clearer.

 

The reporting of this conclusion is contradictory; “non-trivial” then “the total effect should not be large”. Some rewording might clarify the issue. Given the importance of the question, any more information to explain the size of this bias would be very valuable. 



 

MCV was applied to check the similarity of obtained results with MGCFA. The finding of a large correlation between the Black-White d gaps and g-loadings is consistent with earlier reports on Black adults…

 

If the author is sceptical of MCV, why not leave it for the supplement? If its results are interesting, it would be good to explain how the author interprets the MCV results for sex differences. Only the racial difference in MCV is mentioned in the discussion.

 

The discussion seems rather long and goes into tangents in the literature? Maybe this can be shortened. A lot of the writing interpreting and discussing the literature on the effect of culture on IQ tests seems quite tangential and long.



 

 That cross-cultural comparability and Spearman’s g has been repeatedly confirmed for racial differences but not gender differences. 

 

The last sentence does not make sense. 

 

Bot

Author has updated the submission to version #6

Author

Thank you all for these reviews.

 

Reviewer 3: 

I would not frame failure of Spearman's hypothesis as test bias. 

I didn't say the two are related; in this sentence I just said that MGCFA studies found 1) larger gender bias than racial bias and 2) that SH is supported. And my paper here confirms both findings. In section 1 I reviewed prior MGCFA studies and clearly made a distinction between invariance and Spearman's g. For instance: “While measurement equivalence with respect to racial groups is well established in Western countries, only a few studies have tested the Spearman’s Hypothesis (SH).” This single sentence implies that MI is not a test of SH. But MI is necessary for valid inference about SH.

The explained proportion due to g is now displayed in the abstract.

This will be done.

Can you give the page reference for this? I think this is generally called the g-nexus.

Yes, this is the g-nexus, but I added the referenced chapter; I don't think a single page would do it justice.

He did propose that, but not in that book

I changed the reference, now using his 1985 paper.

I prefer thinking about this as a continuum of g vs. non-g contributions to the observed gaps.

This is correct, although I believe Jensen never said this shouldn't be treated as a continuum. This is exactly why Dolan (2000) originally complained that SH is ill defined, because in terms of correlation (using MCV), it's hard to tell whether r = .3 supports the weak SH or not. However, I don't think it's necessary for MGCFA to split between these hypotheses.

What is special about this study? Why does it give so different results? Can one reanalyze the data using the same methods as most papers use?

I don't know. One detail I noticed though is that they use a bifactor model composed of 3 specific factors, each of which is defined by only 2 subtests. But I still don't think this explains their results.

Many readers will not be familiar with the exact meanings of metric, scalar etc. invariance, and what it means when it is rejected/violated.

I explained these in great detail in the paragraphs following the review of MGCFA studies. It's not easy to introduce the methods while reviewing studies. I'll just add a few words (e.g., loadings and subtest means) to the paragraph you cited. I added a short sentence at the beginning of section 1: “Group differences in subtest loadings and means are identified as metric and scalar non-invariance, respectively.”

I would add that it is trivially easy to get intercept bias for group differences if they differ in non-g factors as well as g.

Blacks and Whites often differ in several non-g factor means (in this paper as well), yet intercept bias is almost non-existent. I agree however that an oversimplistic model as shown in the paper you cited likely contains misspecification (despite CFI=1) that can show up as bias.

Given how small the g difference is between sexes, and the limited power of MGCFA methods, it is not surprising that often the null-g models are not rejected. This is an inherent issue with hypothesis testing. Jensen's method has similar issues, because it is biased downward by samping errors in the estimates. When the true difference is relatively small, the bias will often overcome the signal. This was shown by simulation by:

This is a bit complex, so let me address these issues one by one. Regarding MGCFA, I'm well aware of the simulation studies, especially those done by van der Maas, but as you can expect, their results merely show that if the sample size is very small and group differences are also small, power may not be enough. In my situation, not only is the group difference large, but the sample size is large as well. The simulation you are citing confirms this: Dutton & Kirkegaard indeed showed that with small samples and small group differences, the observed Jensen effect deviates from its predicted correlation. If power is an issue in MGCFA, it is not related to the issue you pointed out, but to the possible lack of sensitivity of fit indices. This is why I also use effect size measures such as SDI or MIVI, although these shouldn't replace commonly used fit indices.

It is probably not the largest study ever of any social science in the USA.

I modified the sentence. I also clarified the CHC model.

Figure 1 images are fairly ugly. Maybe draw some nicer ones using e.g. https://app.diagrams.net/

I find it extremely tedious to work with this tool, so instead I used the DiagrammeR package, which also makes the figure easier to replicate, although it was a bit tedious to work with too. I think the plot looks quite nice now.
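For readers curious about the tool, a toy DiagrammeR sketch of a small bifactor-style path diagram (not the actual code behind Figure 1):

library(DiagrammeR)

grViz("
digraph bifactor {
  rankdir = TB
  node [shape = circle]
  g; F1; F2
  node [shape = box]
  S1; S2; S3; S4
  g  -> {S1 S2 S3 S4}
  F1 -> {S1 S2}
  F2 -> {S3 S4}
}
")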

Is there some reason people in this area don't use cross-validation?

I now added cross validation using split half (added this in the results section and supplementary file). It didn’t affect the results.

Table 2. Is "RMSEAD [CI]" difference in RMSEA?

Yes, it is the 95% confidence interval; however, RMSEAD should not be regarded as ∆RMSEA, as noted in section 2: “RMSEAD is based on the same metric as RMSEA and is interpreted exactly the same way: a value of .08 suggests fair fit while a value of .10 suggests poor fit.” RMSEAD is not designed to compare models.
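Roughly speaking, RMSEAD is not the difference between two RMSEA values; it applies the RMSEA formula to the χ2 difference test of the nested comparison. Schematically (a simplified sketch; see Savalei et al., 2023, for the exact estimator and multi-group conventions):

# chisq_d and df_d come from the nested-model chi-square difference test; n is the sample size
rmsea_d <- sqrt(max((chisq_d - df_d) / (df_d * (n - 1)), 0))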

The bold rows in the tables display the best-fitting models.

Is there some information about why these were outliers? In what sense are they outliers?

This is given in Ley’s paper: “The Minimum Covariance Determinant approach was proposed by Rousseeuw (1984, 1985). The idea is quite simple: to find a fraction h of “good observations” which are not considered to be outliers and to compute the sample mean and covariance from this sub-sample. In other words, for a sample of size n, a number h of observations, where h lies between n/2 and n, is selected on which the empirical mean and empirical covariance matrix are calculated. This procedure is repeated for all possible sub-samples of size h and at the end the sub-sample which has the minimum determinant is selected. This deletes the effect of the most extreme observations, hence also of the outliers, and results in a very robust procedure. The goal is to find the “most central” subsample as that one will correspond to the one having least variability among the observations, meaning whose covariance matrix has minimal determinant, hence the name Minimum Covariance Determinant (MCD). The MCD estimators of location and scatter, denoted μMCD and ΣMCD, correspond to the sample mean and covariance matrix of this most central sub-sample.”
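In R this can be done, for example, with covMcd() from the robustbase package (an illustrative sketch with a hypothetical matrix of subtest scores X; cutoff conventions vary):

library(robustbase)

mcd  <- covMcd(X, alpha = 0.75)                    # robust centre and scatter from the most central 75%
d2   <- mahalanobis(X, center = mcd$center, cov = mcd$cov)
keep <- d2 <= qchisq(0.999, df = ncol(X))          # flag extreme robust distances as outliers
X_clean <- X[keep, ]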

I don't understand this.

I rephrased it this way: “Robustness analysis was conducted for the gender analysis in the White group because the multivariate normality was non-normal”

Are these standard errors really accurate? You could bootstrap the data to find out. 

I now added bootstrap results, but it didn’t change anything. Also, remember the sample size is extremely large.
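For reference, bootstrapped standard errors can be requested directly in lavaan; a minimal sketch (hypothetical objects, not the exact call used):

fit_boot <- cfa(model, data = dat, group = "race",
                group.equal = c("loadings", "intercepts"),
                se = "bootstrap", bootstrap = 500)
parameterEstimates(fit_boot)  # SEs and CIs now come from the bootstrap distribution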

Might be more sensible to use positive values for these gaps since these are more numerous and usually the values of interest.

I usually follow the procedure of most other MGCFA studies for consistency. Minorities and women are the focal groups, whereas the majority group and men are the reference groups. The signs displayed here are based on the reference group being the White majority or men.

On the contrary, these seem high values. It says g gaps explain quite a bit of the variation between tests. Based on the popular view that g gap = 0, these kinds of findings are unexpected.

[…]

I think this is a good example of how Jensen's method gives misleading results compared to the proportions calculated above.

There it was found g explained about half the variation in subtest differences, whereas Jensen's method suggests it is about 0%.

I see the results differently. Based on the definition of weak SH, i.e., that a large proportion of the gap is due to g, it follows that a mere 40-50% obviously rejects weak SH, despite being “seemingly” large. Because if 50% is large for g, then non-g is equally large. But under weak SH, the proportion due to non-g factors cannot be large. This is why I concluded that MGCFA and MCV are roughly showing us the same picture.

It is hard to understand the table when subtests are not given by their names, but by e.g. S15. The reader then has to scroll back to a much earlier page to look up each value.

I tried to use initials instead, but for me at least it made things worse. The problem is that there are too many subtests, which makes it hard to follow. But I don't see any better way.

Figures 2-5 also show us the issue of single regression. In reality, since tests load in multiple factors, you should use multiple regression.

This is complicated because latent factors are unobserved and as you said, numerous, so using regression doesn’t help. But that’s why MGCFA answers this problem. MGCFA deals with multiple dependent variables as it treats subtest variables as dependent variables.

You don't see to think of the implied gaps based on the regression models in the figures, but we can read them off at the loading = 1 line.

As you may have noticed, the correlations are not robust. Remove the speed subtests (which are horrible measures of g in this battery) and the strength of the correlation diminishes greatly. That's another reason why I don't want to put too much faith in the d gap predicted by MCV. Also, MGCFA is a latent variable method, not MCV. If I find consistency in d gaps between these methods and argue that this is something we should expect, then it's easy to attack me later if other studies find discrepancies. I find the congruence between MGCFA and MCV interesting, but I don't think it's a necessary outcome.

When joint sex samples are used, some of the variation in the data will be explained by unmodeled sex differences. I think this may make the g proportions lower.

This could indeed be tested later on. Presently, the data show sex bias, so averaging isn’t ideal here.

A great many? I thought there were almost no studies examining the cross-cultural, that is, between-country, comparability of IQ tests.

I didn’t mean between countries; I have revised the sentence. However, I thought culture could also refer to people of different cultures within countries.

It is strange to call aptitude tests not cognitive tests.

I agree these tests require mental abilities, and I never said or implied otherwise. But compared to IQ tests, some of them require quite specific knowledge. Since you mentioned Jensen, I also cited Jensen in my discussion: “Jensen (1985) himself has been highly critical of the Project Talent test battery: “Many of these tests are very short, relatively unreliable, and designed to assess such narrow and highly culture-loaded content as knowledge about domestic science, farming, fishing, hunting, and mechanics” (p. 218).” So Jensen knew about the PT tests, and he did not like them at all. I think he exaggerates quite a bit, although his general criticism is fair. A battery that resembles PT is the ASVAB used in the NLSY. It is overly reliant on crystallized ability, has a strong “technical” knowledge component just like the PT tests, and suffers from what Jensen called psychometric sampling bias in his book The g Factor. The same criticism can be made against the PT battery. Yet the ASVAB still seems to be a good proxy for g, and I don’t deny it is likely the same for the PT battery, as suggested in the last sentences of my discussion section. This is why I call it a mixed bag: it looks like a parody of an IQ test, yet it is likely a good proxy for g.

Author

 

Reviewer 4:

I understand the paper is a bit technical, but I believe everything I’ve written is important enough to be included in the main text rather than in the supplementary materials: explanations of whether strict invariance is necessary to establish measurement invariance, of why a pro-bifactor bias in model fit may (or may not) be an issue, of the cutoffs used in MGCFA, of why RMSEA has recently been found to be a poor measure, etc.

Of course, I could delete those elements and do as some authors do, e.g., not discuss the technical problems related to MGCFA. But I believe these points are not discussed enough in the literature, or, when authors do mention them, they give a rough summary that does not accurately reflect the findings. For instance, Greene et al. (2019) is often cited as evidence that fit indices almost always favor the bifactor model, but in their simulation there are many conditions in which the bifactor is not favored at all. Yet this is never discussed, because authors provide only a brief one-sentence (and non-technical) summary.

The author concludes that there is gender bias in the cognitive tests. Is that accurate? Couldn’t the results be interpreted as there being large sex differences in specific abilities?

Group differences in the subtests within any given ability should be entirely accounted for by their latent factor; if not, there is measurement bias. Whether men are found to have higher specific abilities is a separate question from whether there is bias. But if there is bias, then those group differences in specific abilities cannot be interpreted as reflecting cognitive ability alone.

Isn’t this an interesting finding, since the received opinion in the field is that sex differences in g are non-existent or trivial? Does the author think his results should make us confident in sex differences in g?

The battery of tests displays non-negligible gender bias and is very unbalanced (a strong knowledge component, especially specific knowledge), so interpreting differences in g is not exactly straightforward. Although all the subtests require cognitive skills, some definitely require specific knowledge. It’s a rather unique set of tests.

You seem to think the use of knowledge tests has biased the latent “g factor” towards capturing general knowledge ability more than it should. Is this really a good explanation for the large sex differences, given that women do better on the information component? It looks like the science component is perhaps the key problem?

No, because the test is somewhat biased, so interpreting group differences in latent factors is a bit complicated. Not impossible, but one should not rule out that this odd result has its source in non-invariance. Also, the HOF models display much smaller sex gaps, more in line with estimates from past studies. Obviously the HOF and BF models do not have to display equal latent mean differences, but the magnitude of this discrepancy is not trivial and might be worth investigating in future analyses. I have added a footnote (n°12) in the relevant paragraph about Table 14. I think the estimates are robust, but I can’t tell exactly why the sex gap is so large in the bifactor model.

Could the author clarify how big these biases are? How different is this result to past studies and if the result is different, why is it different?

In section 1 I reviewed other MGCFA studies in detail. However, it is difficult to compare, because you need to quantify the magnitude of the bias in both the loadings and the intercepts. Sometimes researchers found bias in the intercepts only, sometimes in both the loadings and the intercepts. But they typically use smaller batteries, and the more subtests you have, the more likely it is that one of them will exhibit non-invariance. There is no guidance in the MGCFA literature except that the majority of the subtest loadings and/or intercepts should be invariant. Yet quite often researchers accept bias in as many as 30 or 40% of the subtest intercepts and still conclude in their paper that the test is “fair”. I don’t agree with that conclusion. In my case, the ratio of biased to unbiased subtests is close to 20%. Is that high? Maybe not, but it is definitely not trivial.

Regarding effect sizes, it’s difficult to compare with earlier studies because they never report the magnitude of group differences in the loadings, and rarely in the intercepts.

I fixed the wording issues you pointed out.

Regarding Major et al. (2012), as I suggested, I failed to replicate their results. A CFI advantage of .002 is really small, whereas they found an astounding CFI advantage of .01 for the VPR model. I have now added a new supplementary file which attempts to replicate Major et al.’s procedure as closely as possible (given the description of their methods), but once more I could not consistently replicate the superiority of the VPR. In one school grade level the CFI is larger by .005, but in the other grade levels the difference is marginal or null.

A lot of the results referenced in these two paragraphs do not appear to be presented in the paper.

They all appear in my original draft, and I checked once more in the pdf: all references are there. Regarding the QQ plot for normality, it is an analysis that is required before CFA, but I didn’t see the need to display it in the paper. The 6-factor model is the most interpretable because of the sentence that follows: “The 4- and 5-factor models blend indicators into factors which are more ambiguous (math and english tests form one common factor; information and science tests form one common factor) and cause severe unbalances in factor sizes. The 7- and 8-factor models produce additional factors which are not associated with any particular ability or do not have any indicators with high loading.” I didn’t see anything wrong with the interpretability of the 6-factor model, unlike the others.

On what basis are these statements made and the statements in the rest of the paragraph?

This is a summary of all the competing models (non-g and g models). Basically, configural and regression invariance hold, and the model fit values are generally acceptable, except for Mc (McDonald’s Non-Centrality Index), whose values here are much lower than the recommended cutoff.
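For readers unfamiliar with that index, a small helper shows how Mc is commonly computed (some sources divide by N − 1 rather than N); the inputs below are purely illustrative, not the paper’s values.

```r
# McDonald's Non-Centrality Index (Mc), as commonly defined; values near 1 indicate
# good fit, and cutoffs around .90 are often recommended. Illustrative inputs only.
mc_index <- function(chisq, df, n) exp(-0.5 * (chisq - df) / n)

mc_index(chisq = 5000, df = 500, n = 10000)  # ~0.80, below a .90 cutoff
```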

I don’t know the literature, but would sex gaps in general knowledge be so large in the age range of Project Talent?

[...]

In the information ability, females have a large advantage, which seems contradictory. In fact, most of the broad abilities show a female advantage, yet still there is a male advantage in g.

I don’t find it contradictory. In the bifactor model, the specific factors are independent of g. Also, the sign of the difference in the bifactor model is consistent with the HOF model; only the magnitude of the difference is much larger.

I don’t think it would be wise to reanalyze the models after dropping the Science subtests; one needs a good reason. For instance, I dropped three subtests (two memory tests and one creativity test), despite Major et al. (2012) using them, because I found they don’t correlate very well with most other tests. I don’t see why Science should be dropped simply because it could contribute much of the g difference. If that is the case, it simply means that general knowledge measures g better and therefore amplifies the group gaps. Furthermore, in the bifactor model the latent factors are orthogonal, so g cannot be driven by Science here.

Also, the g gap between the sexes in the Black group seems much smaller than in the White group. Should we be confident in that difference?

I decided not to discuss this because when researchers analyze sex differences with latent variable models, they often don’t separate by race, or they analyze only the White sample and not the Black sample. I doubt it will replicate, but it’s a bit early to say.

If the g factors from Project Talent capture a lot of variance related to general knowledge, doesn’t that pose a big problem for the paper?

It depends. I wrote this in the discussion: “As te Nijenhuis & van der Flier (2003) expressed clearly, cultural loading is unavoidable and even desirable as long as future school and work achievement may have a high cultural loading. Removing such items and/or subtests may adversely affect the predictive validity of the test.” But I admit the problem with the PT battery is that the cognitive domains are quite unbalanced. Many other IQ batteries have this problem, though they require less (specific) knowledge.

If your question is about the magnitude of the bias, it is hard to tell. One would first need to conduct meta-analyses of sex differences in IQ (and in latent g) using test composition as a moderator. But so far I don’t find that explanation convincing, because, as I reviewed in section 1, even well-constructed IQ tests are not consistent in the magnitude and direction of sex differences in g.

Is it interesting or surprising that the bifactor models produce much larger g differences for sex than the HOF models? My understanding is that the large differences in specific abilities make it difficult to isolate g differences between the sexes. So the BF model, by more precisely specifying the varying influences of specific abilities, might provide better estimates of sex differences in g?

Not many researchers contrast the latent mean differences in both the HOF and BF models. I know of two such studies (Keith et al. and Reynolds et al.), and the means are similar across models. But these studies show invariance and focus on IQ tests, not achievement tests. There is a reason why I didn’t discuss this much in the paper: we simply lack MGCFA studies contrasting BF and HOF models in latent means.

All values for the sex differences are positive. But if males have an advantage in g and perform worse than women on some subtests, then g should contribute negatively to the sex difference in favour of women? Shouldn’t some be positive and some be negative?

Those are proportions, so they have to be positive. In the main text, I wrote: “It can be computed, in the case of the bifactor model, by dividing the product of the g mean difference and subtest’s loading on g by the sum of the product of all latent mean differences and their subtest’s loadings”.
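To spell out that computation with a toy example (the numbers below are made up, not estimates from the paper): in the bifactor model a subtest loads on g and on one specific factor, so its proportion due to g is the g term divided by the sum of both terms.

```r
# Worked toy example of the proportion formula quoted above (made-up numbers).
d_g      <- 0.60  # latent g mean difference
d_spec   <- 0.30  # latent mean difference on the subtest's specific factor
lambda_g <- 0.70  # subtest loading on g
lambda_s <- 0.40  # subtest loading on its specific factor

prop_g <- (d_g * lambda_g) / (d_g * lambda_g + d_spec * lambda_s)
prop_g  # = 0.42 / 0.54, i.e. about 0.78 of this subtest's modeled gap is due to g
```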

I recommend using facet_wrap, a grid function, ggarrange, or some other function or method to put all the plots into one image that fits on one page.
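For instance, something along these lines, where p1 to p4 are hypothetical names standing for the existing plot objects:

```r
# Hypothetical sketch: p1-p4 stand for the four existing ggplot objects for the MCV plots.
library(ggplot2)
library(ggpubr)

combined <- ggarrange(p1, p2, p3, p4, ncol = 2, nrow = 2,
                      labels = c("A", "B", "C", "D"))
ggsave("mcv_plots.pdf", combined, width = 10, height = 8)  # one page, size in inches
```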

If I do this, it is no longer possible to even see the item numbers displayed in the plots. I find them important because they tell us which items display larger or smaller group differences than expected on the basis of their g-loadings. Specifically, I want to see which items behave badly under MCV, and this is best done with a clear visualization of each plot.

The reporting of this conclusion is contradictory: “non-trivial”, but then “the total effect should not be large”.

The values of the subtest intercept bias show medium or large biases (“non-trivial”), but only a very small number of subtests are affected. Given that there are 34 subtests in total, the impact on the total score is very small (“should not be large”).
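As a rough back-of-the-envelope illustration (the numbers are hypothetical, not the estimated biases): suppose 3 of the 34 subtests carry an intercept bias of about 0.30 SD each, and the average inter-subtest correlation is around 0.40; the induced shift in an equally weighted sum score is then only a few hundredths of a composite SD.

```r
# Back-of-the-envelope sketch with hypothetical numbers (not the paper's estimates).
k    <- 3     # number of biased subtests
bias <- 0.30  # intercept bias per affected subtest, in subtest SD units
p    <- 34    # total number of subtests
rbar <- 0.40  # assumed average inter-subtest correlation

composite_sd <- sqrt(p + p * (p - 1) * rbar)  # SD of the sum of standardized subtests
(k * bias) / composite_sd                     # ~0.04 SD shift in the composite
```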

If the author is sceptical of MCV, why not leave it for the supplement? If its results are interesting, it would be good to explain how the author interprets the MCV results for sex differences.

I added a discussion of the sex differences. I find MCV interesting despite its shortcomings because, more often than not, its results are consistent with MGCFA. And sometimes the raw data or input data (e.g., variances/covariances/means) are not available; in that case MGCFA cannot be run, whereas MCV can still be applied more generally. I also modified some sentences you found oddly written.

Bot

Author has updated the submission to version #7

Reviewer | Admin

I thank the author for his clear and helpful response and welcome the publication of the manuscript!

Bot

Author has updated the submission to version #8

Bot

The submission was accepted for publication.