Spearman’s g Explains Black-White but not Sex Differences in Cognitive Abilities in the Project Talent

Submission status
Reviewing

Submission Editor
Submission editor not assigned yet.

Author
Meng Hu

Title
Spearman’s g Explains Black-White but not Sex Differences in Cognitive Abilities in the Project Talent

Abstract


In this analysis of the Project Talent data, the g factor model as represented by Spearman’s Hypothesis (SH) was confirmed for the Black-White cognitive difference but not for the sex difference. Multi-Group Confirmatory Factor Analysis (MGCFA) detected small-to-modest bias with respect to race but strong bias with respect to the within-race sex cognitive difference. After establishing partial measurement equivalence, SH was tested by comparing the model fit of a correlated factors (non-g) model with that of a bifactor (g) model, as well as the relative contribution of the g factor means to that of the specific factors. While g was the main source of the Black-White differences, this was not the case for the within-race sex differences. The evidence of measurement bias in the sex analysis may cause ambiguity in interpreting SH for sex differences. Results from MGCFA were somewhat corroborated by the Method of Correlated Vectors, with high correlations of subtests’ g-loadings with Black-White differences but near-zero correlations with sex differences. This finding replicates earlier MGCFA studies supporting SH with respect to the Black-White cognitive gap as well as earlier MGCFA studies revealing stronger gender bias than racial bias.

Keywords
measurement invariance, MCV, Spearman’s Hypothesis, MGCFA, Black-White IQ gap, Project Talent, Sex IQ gap

Supplemental materials link
https://osf.io/qn67k/


Reviewers ( 0 / 2 / 2 )
Reviewer 1: Considering / Revise
Reviewer 2: Accept
Reviewer 3: Accept
Reviewer 4: Considering / Revise

Sat 16 Sep 2023 22:27

Bot

Author has updated the submission to version #2

Reviewer

The paper here contains very important findings. However, at present it is written in a very technical way, to the extent that anyone without intimate knowledge of the literature would be entirely lost.

 

Some suggested edits: 

A very brief and simple definition of Spearman's Hypothesis should be provided in the Abstract

 

On page 2: “One comes from Scheiber (2016b) who found strong measurement bias in the analysis of the WISC-V between 777/830 […]” it is unclear what the fractions in this sentence mean.

 

On page 5: “When within-factor correlated residuals are misspecified, all fit indices favor the correlated factors model regardless of conditions, except for SRMR, show a bias in favor of the correlated factors model (Greene et al., 2019)”. This sentence needs rewording; as it stands, it does not make sense.

 

In the analysis section: it would be useful to provide diagrams of what the CF, HOF and BF models look like. This will make it easier for a reader to understand what hypotheses are being tested.

 

Table 1: There could be another column which states, in plain english, what each of these models is used to test for.

 

Table 2: please indicate, for each fit measure, what is considered a better fit. E.g., CFI: higher is better; RMSEA: lower is better.

 

As far as I can tell the model specification is the same wherever it is stated. In which case it should only be stated once, with a phrase along the lines of “for all of our models the model specification is:”

 

Providing a table of g-loadings for math, speed etc would be useful information.

 

The horizontal axis on figures should be “average g-loading (from Black and White male sample)” or similar. This is practically the most important part of the paper, as these graphs are easy to digest for laymen. They need to be as easy to understand as possible.

Bot

Author has updated the submission to version #3

Author

Thank you for the review.

I understand that the analysis is complex. In reality, MGCFA can be much easier if the data is ideal (clean factor structure, near equivalent group samples, no Heywood cases, no pro-bifactor bias, assumption of no cross loadings for computing effect sizes of bias). Unfortunately, the data usually does not fulfill most of these ideal conditions. And in the case of the Project Talent, the large number of subtests, subgroups and models complicates the situation even more. I wish I could simplify as much as possible, but at the same time it is necessary to explain and address the problems that are often ignored in MGCFA studies.

I modified my article according to your suggestions, clarifying and fixing whenever necessary. I also updated my supplementary file.

The weak form of Spearman’s Hypothesis, which states that the racial group differences are primarily due to differences in the general factor (g), was tested and confirmed in this analysis of the Project Talent data, based on 34 aptitude tests among 9th-12th grade students. 

...

One comes from Scheiber (2016b) who found strong measurement bias in the analysis of the WISC-V between 777 White males and 830 White females, 188 Black males and 221 Black females, and 308 Hispanic males and 313 Hispanic females.

...

When within-factor correlated residuals are misspecified, all fit indices correctly favor the correlated factors model regardless of conditions, except for SRMR, which incorrectly favors the bifactor model (Greene et al., 2019, Table 4).

I now provided a new Figure 1, along with the following text:

Figure 1 displays hypothetical competing CFA models that are investigated in the present analysis: 1) the correlated factors model, which specifies that the first-order specific factors are correlated without the existence of a general factor; 2) the higher order factor model, which specifies that the second-order general factor operates through the first-order specific factors and thus only indirectly influences the subtests; 3) the bifactor model, which, unlike the higher order factor model, specifies that both the general and specific factors have direct influences on the subtests.
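
For readers who prefer syntax to path diagrams, the three structures can be sketched in lavaan roughly as follows (y1-y9 and the factor names are hypothetical placeholders, not the subtests or models actually fitted in the paper):

library(lavaan)

# correlated factors (CF): first-order factors only, freely correlated
cf_model <- '
  verbal  =~ y1 + y2 + y3
  numeric =~ y4 + y5 + y6
  spatial =~ y7 + y8 + y9
'

# higher order factor (HOF): g influences the subtests only through the first-order factors
hof_model <- '
  verbal  =~ y1 + y2 + y3
  numeric =~ y4 + y5 + y6
  spatial =~ y7 + y8 + y9
  g =~ verbal + numeric + spatial
'

# bifactor (BF): g and the specific factors all load directly on the subtests,
# with the factors specified as orthogonal
bf_model <- '
  g       =~ y1 + y2 + y3 + y4 + y5 + y6 + y7 + y8 + y9
  verbal  =~ y1 + y2 + y3
  numeric =~ y4 + y5 + y6
  spatial =~ y7 + y8 + y9
'

# fit_cf  <- cfa(cf_model,  data = dat)
# fit_hof <- cfa(hof_model, data = dat)
# fit_bf  <- cfa(bf_model,  data = dat, orthogonal = TRUE, std.lv = TRUE)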

I also added a note under the model fit tables:

Note: higher values of CFI and Mc indicate better fit, while lower values of χ2, RMSEA, RMSEAD, SRMR indicate better fit.

I, however, found one of your requests difficult to fulfill. Specifically, this one:

Table 1: There could be another column which states, in plain english, what each of these models is used to test for.

This is because summarizing the purpose of each model in just one or two words is extremely difficult. Considering that the specification column is already loaded with information, I believe adding another column filled with more information would make the table more tedious to read.

The models in Table 1 were already briefly summarized earlier in the text; that summary has now been expanded a bit, with a reference to Table 1 as well.

MGCFA proceeds by adding constraints to the initial configural model in the following incremental steps: metric, scalar, strict. A rejection of configural invariance implies that the groups use different latent abilities to solve the same set of item variables. A rejection of metric (loading) invariance implies that the indicators of a latent factor are unequally weighted across groups. A rejection of scalar (intercept) invariance implies that the subtest scores differ across groups when their latent factor means are equalized. A rejection of strict (residual) invariance implies there is a group difference in specific variance and/or measurement error. When invariance is rejected, partial invariance requires releasing parameters until acceptable fit is achieved, and these freed parameters must be carried over to the subsequent MGCFA models. The variances of the latent factors are then constrained to be equal across groups to examine whether the groups use the same range of abilities to answer the subtests. The final step is to determine which latent factors can have their mean differences constrained to zero without deteriorating the model fit: a worsening of the model fit indicates that the factor is needed to account for the group differences. These model specifications are presented in Table 1 further below.
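
In lavaan terms, these incremental steps amount to adding group.equal constraints; a schematic sketch, where model, dat, and the grouping variable are placeholders rather than the exact code used for the paper:

library(lavaan)

# configural: same structure, all parameters free across groups
fit_configural <- cfa(model, data = dat, group = "race")

# metric: factor loadings constrained equal across groups
fit_metric <- cfa(model, data = dat, group = "race",
                  group.equal = "loadings")

# scalar: loadings + intercepts constrained equal
fit_scalar <- cfa(model, data = dat, group = "race",
                  group.equal = c("loadings", "intercepts"))

# strict: loadings + intercepts + residual variances constrained equal
fit_strict <- cfa(model, data = dat, group = "race",
                  group.equal = c("loadings", "intercepts", "residuals"))

# partial invariance: free the offending parameters flagged by modification indices
fit_partial <- cfa(model, data = dat, group = "race",
                   group.equal = c("loadings", "intercepts"),
                   group.partial = c("y1 ~ 1"))   # hypothetical freed intercept

# compare the nested models
lavTestLRT(fit_configural, fit_metric, fit_scalar, fit_strict)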

I provided the R output displaying all parameter values for the best model in Tables 2 through 13 in the supplementary file. The output is in fact so large that even displaying only the group factor loadings would drastically increase the number of pages, and the article is already very long. Notice that I did not originally display the general factor loadings anywhere in the paper. This is because, again, there were too many models and subgroups (loadings for both g and the specific factors would need to be displayed for each of the four subgroups, Black men, White men, White women, and Black women, for each g model).

The X axis title in figures 2-5 has been modified as per your suggestion.

One final remark. I do not understand this sentence.

As far as I can tell the model specification is the same wherever it is stated. In which case it should only be stated once, with a phrase along the lines of “for all of our models the model specification is:”

Are you referring to the competing models (CF, HOF, BF) or rather to the model constraints (M1-M6)? I suspect the latter, though I'm not sure. If this is the case, each subgroup shows a different pattern of non-invariance, so I have to discuss them separately rather than make general and rather imprecise statements about the results. I understand it may be tedious for the readers but I believe it is necessary.

Reviewer
Replying to Meng Hu

Thank you for your reply. I will review the revised version. But I will quickly clarify that last part. I am referring to the bit that states: "The model specification is displayed as follows:

english =~ S1 + S13 + S19 + S20 + S21 + S22 + S23 + S24 + S25 + S26 + S31 + S34

math =~ S5 + S25 + S32 + S33 + S34

speed =~ S19 + S34 + S35 + S36 + S37

info =~ S1 + S2 + S3 + S4 + S7 + S8 + S11 + S12 + S13 + S14 + S15 + S16 + S19 + S26

science =~ S1 + S6 + S7 + S8 + S9 + S10

spatial =~ S28 + S29 + S30 + S31 + S37"

This seems to be repeated exactly several times in the paper, e.g. on pages 16, 21, and 24.

Author

I understand now. At first glance, it seems the models are identical across subgroups. In fact, they differ a little bit by subgroup. Usually, some subtests have additional cross loadings in some subgroups (e.g. S13 Health, S32 Arithmetic Reasoning, etc.).

Bot

The submission was accepted for publication.

Reviewer | Admin

 This finding replicates earlier MGCFA studies supporting SH with respect to the Black-White cognitive gap as well as earlier MGCFA studies revealing stronger gender bias than racial bias

I would not frame failure of Spearman's hypothesis as test bias. It can mean that, but it can also just mean that non-g factors have a large role in explaining why differences are larger or smaller on some tests. This is obviously the case for sex differences, where the g factor difference (assuming this exists) has a relatively minor role in explaining differences across subtests compared to differences in non-g abilities. For black-white, the g gap is so large that g will explain most of the differences between subtests. Also, could you decompose the explained proportion and give these in the abstract?

Jensen (1998) proposed that the magnitude of the racial differences in IQ, at least between Black and White Americans, as well as differences in cognitive-related socio-economic outcomes are a function of the g-loadings (i.e., the correlation between the tests or outcomes and the general factor of intelligence) of the respective cognitive tests and outcomes, making the g factor an essential element in the study of socio-economic inequalities

Can you give the page reference for this? I think this is generally called the g-nexus.

 More specifically, Jensen (1998) proposed that race/ethnic group differences on cognitive tests are largely due to latent differences in general mental ability. This is known as Spearman’s hypothesis (SH) which exists in two forms: the strong and the weak form, the latter of which was endorsed by Jensen.

He did propose that, but not in that book. "Proposed" makes it sound like this was the original writing to propose this, but Jensen's writings on SH go back decades prior. Maybe it would be better to discuss:

This is known as Spearman’s hypothesis (SH) which exists in two forms: the strong and the weak form, the latter of which was endorsed by Jensen. The strong form affirms that the differences are solely due to g factor differences while the weak form affirms that the differences are mainly due to differences in g. The alternative contra hypothesis states that group differences reside entirely or mainly in the tests’ group or broad factors and/or test specificity and that g differences contribute little or nothing to the overall ones.

I prefer thinking about this as a continuum of g vs. non-g contributions to the observed gaps. It ranges from 0 to 100% and the strong form is just the point at the end of this line. I think that conceptualizing it this way is more sensible, but I recognize that MGCFA-like methods have to split it up for hypothesis testing purposes. Still, Lasker often uses g's contribution as a % which will be useful for meta-analysis.
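
To illustrate what such a percentage means, here is a toy decomposition under a bifactor model with invariant loadings and intercepts (all numbers hypothetical):

# model-implied subtest gap: d_i = lambda_g[i] * d_g + lambda_s[i] * d_s
lambda_g <- 0.7   # hypothetical g loading of subtest i
lambda_s <- 0.4   # hypothetical specific-factor loading
d_g      <- 1.0   # latent g mean difference (in SD units)
d_s      <- 0.3   # latent specific-factor mean difference

gap_total <- lambda_g * d_g + lambda_s * d_s
prop_g    <- (lambda_g * d_g) / gap_total
prop_g    # ~0.85: share of this subtest's gap carried by g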

Millsap & Olivera-Aguilar (2012, p. 388) provides an illustration: the inclusion of a math test having a mixture of multiple-choice items and problem-solving items, with the latter being story problems, may introduce bias against foreign language speakers due to the verbal content of this test. If the math factor is not supposed to measure verbal skill, then such a test should be discarded.

I think it would be better to say that such a test should not be used with foreign language users, or at least, should be scored differently in order to compensate for this language bias.

Another comes from Benson et al. (2020) who analyzed the UNIT2 norming sample and found that scalar invariance was rejected not only for race (Whites, Blacks, Asians) and ethnicity (Hispanic) groups but also for gender groups. Metric invariance was also rejected for age and gender groups, suggesting that the UNIT2 overall is somewhat biased with respect to any group.3

These are in stark contrast to most findings. What is special about this study? Why does it give so different results? Can one reanalyze the data using the same methods as most papers use?

So far the evidence of strong measurement bias in race differences comes mainly from studies conducted in African countries. Dolan et al. (2004) compared the Junior Aptitude Test (JAT) scores of South African Black and White students and found that both metric and scalar invariance are violated.4 Lasker (2021) re-analyzed Cockroft et al. (2015) and compared the WAIS-III scores of undergraduate South African students enrolled at an English medium University to undergraduate UK university students, and found that metric and scalar invariance are rejected. Warne (2023) compared the WISC-III scores of Kenyan Grade 8 students in Nairobi schools to the American norm and the WAIS-IV scores of Ghanaian students who showed English fluency at high school or university to the American norm. While measurement equivalence was established for Ghanaian students, it was rejected for Kenyan students.5

Many readers will not be familiar with the exact meanings of metric, scalar etc. invariance, and what it means when it is rejected/violated. Can you explain these? Usually, one can attain partial invariance allowing score comparisons, was this done?

Yet research employing MGCFA showed mixed evidence of gender fairness. Some studies reported small or no measurement bias (Chen et al., 2015;6 Dombrowski et al., 2021; Irwing, 2012; Keith et al., 2011; Palejwala & Fine, 2015;7 Reynolds et al., 2008; van der Sluis et al., 2006) while others reported non-trivial bias, intercepts almost always being the common source of bias (Arribas-Aguila et al., 2019; Dolan et al., 2006; Lemos et al., 2013; Pauls et al., 2020; Pezzuti et al., 2020; Saggino et al., 2014; Van der Sluis et al., 2008; Walter et al., 2021).

I would add that it is trivially easy to get intercept bias for group differences if they differ in non-g factors as well as g. If the number of tests isn't sufficiently large to model the non-g factors, differences in them will show up as intercept bias. A good example of this is:

https://www.researchgate.net/publication/236597029_Decoding_the_Meaning_of_Factorial_Invariance_and_Updating_the_Practice_of_Multi-group_Confirmatory_Factor_Analysis_A_Demonstration_With_TIMSS_Data

Which used a g-only model for 5 math subtests from TIMSS.

After fitting a parsimonious weak SH model, they discovered that the equality constraint on the g factor mean did not worsen model fit in 3 of the 4 age subgroups. In the end, there is no compelling evidence that g is the main source of the subtest differences between sex groups.

Given how small the g difference is between sexes, and the limited power of MGCFA methods, it is not surprising that the null-g models are often not rejected. This is an inherent issue with hypothesis testing. Jensen's method has similar issues, because it is biased downward by sampling errors in the estimates. When the true difference is relatively small, the bias will often overcome the signal. This was shown by simulation by:

https://www.researchgate.net/publication/353465907_The_Negative_Religiousness-IQ_Nexus_is_a_Jensen_Effect_on_Individual-Level_Data_A_Refutation_of_Dutton_et_al's_'The_Myth_of_the_Stupid_Believer'
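
The attenuation is easy to reproduce with a toy simulation: even when the true subtest gaps are perfectly collinear with the true g-loadings, sampling noise pulls the observed vector correlation down, and far more so when the true gaps are small. A rough sketch (all parameter values are arbitrary):

set.seed(42)

n_tests <- 34
true_g  <- runif(n_tests, .4, .8)              # true g-loadings

sims <- replicate(1000, {
  small_d <- 0.10 * true_g                     # small true gaps (sex-like), perfectly collinear with g
  large_d <- 1.00 * true_g                     # large true gaps (Black-White-like)
  obs_g   <- true_g  + rnorm(n_tests, 0, .05)  # sampling error in the loadings
  obs_sml <- small_d + rnorm(n_tests, 0, .05)  # sampling error in the observed gaps
  obs_lrg <- large_d + rnorm(n_tests, 0, .05)
  c(small = cor(obs_g, obs_sml), large = cor(obs_g, obs_lrg))
})

rowMeans(sims)   # the correlation survives for large gaps but collapses for small ones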

The Project Talent is the largest study ever conducted in the United States involving 377,016 9th-12th grade students during 1960 and drawn from all of the 50 states (Flanagan et al., 1962).

Largest what study? It is probably not the largest study ever of any social science in the USA. For instance, the current MVP is larger. https://www.research.va.gov/mvp/ It is probably the largest dataset from the USA concerning detailed measures of intelligence.

Three tests have been removed in the present analysis: memory for sentences (S17), memory for words (S18), and creativity (S27). The memory tests are highly correlated with each other but are poorly correlated with all other variables (between r=.10 and r=.20), which makes them unsuitable for CFA. Creativity has moderate correlations with other variables, has no main loading and its loadings are of modest or small size. Thus, a total of 34 aptitude/cognitive tests are used.

Why are they so low in g-loading? It's a general problem of PT that it has too many tests with small item counts. The item data is not available for analysis for most of the sample, so one cannot attempt to adjust for this so easily.

In their study, the VPR-g model fitted much better than the CHC-based HOF g model.

I think this is the first time "CHC" is used, but not explained.

Figure 1 images are fairly ugly. Maybe draw some nicer ones using e.g. https://app.diagrams.net/

To evaluate and compare model specifications, fit indices such as CFI, RMSEA, RMSEAD, SRMR and McDonald’s Noncentrality Index (Mc) are used to assess model fit, along with the traditional χ2. Higher values of CFI and Mc indicate better fit, while lower values of χ2, RMSEA, RMSEAD, SRMR indicate better fit. Simulation studies established the strength of these indices to detect misspecification (Chen, 2007; Cheung & Rensvold, 2002; Khojasteh & Lo, 2015; Meade et al., 2008). However, with respect to ∆RMSEA, doubts about its sensitivity to detect worse fit among nested models were raised quite often. Savalei et al. (2023) provided the best illustration of its shortcomings. According to them, this was expected because the initial Model A often has large degrees of freedom (dfA) relative to the degrees of freedom introduced by the constraints in Model B (dfB), resulting in very similar values of RMSEAB and RMSEAA, hence a very small ΔRMSEA. For evaluating nested models, including constrained ones, their proposed RMSEAD solves this issue. RMSEAD is based on the same metric as RMSEA and is interpreted exactly the same way: a value of .08 suggests fair fit while a value of .10 suggests poor fit.

Is there some reason people in this area don't use cross-validation? It seems to me that this is a better way to deal with overfitting issues relating to all of these model fit statistics. I see that some people use split-half cross-validation as a measure of overfitting. https://www.tandfonline.com/doi/abs/10.1080/1091367X.2014.952370 In general, though, k-fold cross-validation is a much better approach.
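
For instance, a minimal two-fold sketch in lavaan might fit each competing model on one half and score the held-out half under the training-implied moments; the models and the use of lavaan's built-in HolzingerSwineford1939 data below are purely illustrative:

library(lavaan)
library(mvtnorm)

dat <- HolzingerSwineford1939[, paste0("x", 1:9)]
set.seed(1)
idx   <- sample(nrow(dat))
train <- dat[idx[1:150], ]
test  <- dat[idx[151:nrow(dat)], ]

# two competing (illustrative) models
m_cf <- ' visual  =~ x1 + x2 + x3
          textual =~ x4 + x5 + x6
          speed   =~ x7 + x8 + x9 '
m_g  <- ' g =~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 '

oos_loglik <- function(model, train, test) {
  fit <- cfa(model, data = train, meanstructure = TRUE)
  imp <- lavInspect(fit, "implied")     # training-implied mean vector and covariance matrix
  sum(dmvnorm(as.matrix(test), mean = as.numeric(imp$mean),
              sigma = as.matrix(imp$cov), log = TRUE))
}

oos_loglik(m_cf, train, test)   # higher out-of-sample log-likelihood = better generalization
oos_loglik(m_g,  train, test)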

Because the Predictive Mean Matching (PMM) method of imputation calculates the predicted value of target variable Y according to the specified imputation model, the imputation was conducted within race and within gender groups, totaling four imputations. It is inappropriate to impute the entire sample because it implies that the correlation pattern is identical across groups, an assumption that may not be true and may eventually conceal measurement non-invariance.

Alternatively, different imputation models may create spurious non-invariance because they fill in the missing data using different model parameters.
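
For concreteness, the within-group PMM approach amounts to something like the following mice sketch (dat, race, sex, and subtest_cols are placeholder names; in practice one would also use m > 1 imputations):

library(mice)

# impute the subtest columns separately within each race-by-sex subgroup, then recombine
impute_subgroup <- function(d) {
  complete(mice(d[, subtest_cols], method = "pmm", m = 1, printFlag = FALSE))
}

groups  <- split(dat, list(dat$race, dat$sex))
imputed <- lapply(groups, impute_subgroup)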

The model specification is displayed as follows:

Use code formatting (a monospaced font) for code.

Table 2. Is "RMSEAD [CI]" difference in RMSEA? The width of the confidence interval is not specified, but one can assume it is 95%. I don't see where this value is from. The differences in RMSEA are not that large between the models, they are all between 0.042 and 0.044.

Table 3. Black-White differences among males using Higher Order Factor

Better captions would be e.g. "Table 3. Black-White differences among males using the Higher Order Factor model"

What do the bold rows in the tables mean?

Finally, as an additional robustness analysis, all models for both the Black-White male and female groups were rerun after removing multivariate outliers with the Minimum Covariance Determinant (MCD) proposed by Leys et al. (2018) who argued that the basic Mahalanobis Distance was not a robust method. Although the multivariate normality was barely acceptable, the number of outliers was large: MCD removed 1,948 White males and 338 Black males, and 1,005 White females and 372 Black females.

Is there some information about why these were outliers? In what sense are they outliers?
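
As I understand the Leys et al. (2018) procedure, cases are flagged purely statistically, by their robust Mahalanobis distance from an MCD estimate of the center and covariance, along the lines of this sketch (X and the cutoff are placeholders):

library(MASS)

# X is a hypothetical n-by-p matrix of subtest scores
flag_mcd_outliers <- function(X, alpha = .999) {
  rob <- cov.rob(X, method = "mcd")                    # robust (MCD) center and covariance
  d2  <- mahalanobis(X, center = rob$center, cov = rob$cov)
  d2 > qchisq(alpha, df = ncol(X))                     # TRUE = flagged as a multivariate outlier
}

This says nothing substantive about why those particular students are extreme, hence the question.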

Robustness analysis was conducted for the gender difference in the White group because the multivariate normality was non-normal

I don't understand this.

Table 14. d gaps (with their S.E.) from the best fitting g models per group analysis

Are these standard errors really accurate? You could bootstrap the data to find out. It is hard to believe these estimates are that precise.
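
For what it is worth, lavaan can bootstrap the estimates directly; a sketch with placeholder model and data names:

fit_boot <- cfa(best_model, data = dat, group = "race",
                group.equal = c("loadings", "intercepts"),
                se = "bootstrap", bootstrap = 1000)
parameterEstimates(fit_boot)   # bootstrap SEs, including those of the latent mean differences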

Note: Negative values indicate advantage for Whites (or males).

Might be more sensible to use positive values for these gaps since these are more numerous and usually the values of interest.

The average proportion is .43 for the sex group among Whites and .50 for the sex group among Blacks. If SH explicitly states that g is the main source of the group difference, it seems that even the weak SH model does not explain well the pattern of sex differences.

On the contrary, these seem high values. It says g gaps explain quite a bit of the variation between tests. Based on the popular view that g gap = 0, these kinds of findings are unexpected.

Table 15. Proportions of subtest group differences due to g based on Bifactor model

It is hard to understand the table when subtests are not given by their names, but by e.g. S15. The reader then has to scroll back to a much earlier page to look up each value.

The g-loadings correlate highly with Black-White d gaps but not with sex d gaps. After correction for unreliability, the correlations (g*d) for the Black-White male, Black-White female, male-female White, and male-female Black groups are, respectively, .79, .79, -.06, -.12. If the reliability for speed subtests is assumed to be .70 instead of .60, the correlations are .80, .80, -.05, -.12.

I think this is a good example of how Jensen's method gives misleading results compared to the proportions calculated above. There it was found g explained about half the variation in subtest differences, whereas Jensen's method suggests it is about 0%.

Figures 2-5 also show us the issue with simple regression. In reality, since tests load on multiple factors, you should use multiple regression. The predictors are each test's factor loadings on all the factors. When this is done, the results (unstandardized slopes) will tell you the estimated gap sizes for tests that measure each factor perfectly and no other factor. Do these values align with the MGCFA results? They should.

You don't seem to consider the implied gaps based on the regression models in the figures, but we can read them off at the loading = 1 line. They are about 1.5 d for Black-White males, about 1.6 d for Black-White females, and quite small for the two male-female gaps. The values for the Black-White comparisons are close to the ones from MGCFA in Table 14.
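
For concreteness, the multiple-regression version might look like the following sketch, where loadings (a 34-by-k matrix of each subtest's loadings on g and the group factors) and d (the vector of observed subtest gaps) are hypothetical objects:

# multiple-regression version of MCV
dat_mcv <- data.frame(d = d, loadings)
fit_mr  <- lm(d ~ ., data = dat_mcv)
coef(fit_mr)   # each unstandardized slope = implied gap for a test loading 1.0 on that factor and 0 elsewhere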

After establishing partial invariance, SH was tested in all subgroups. This was validated in the Black-White analyses based on two findings: 1) non-g models fit worse than g-models and 2) the proportion of the subtests’ mean differences due to g is very large.

Maybe it is larger than usual because this sample was large enough so that single sex samples could be used. When joint sex samples are used, some of the variation in the data will be explained by unmodeled sex differences. I think this may make the g proportions lower.

SH is supported through the examination of Forward and Backward Digit Span, showing a BDS Black-White gap that is larger (d=.50) than the FDS gap (Jensen, 1998, p. 370).

See https://www.sciencedirect.com/science/article/abs/pii/S104160801530056X

The great majority of the studies confirms the cross-cultural comparability of IQ tests, the exception mainly comes from South African samples (Dolan et al., 2004; Lasker, 2021). Due to the omnipresent force of the mass-market culture in developed countries, it is not surprising that culture bias is rarely noticeable (Rowe et al. 1994, 1995).

A great many? I thought there were almost no studies examining cross-cultural, that is, between country, comparability of IQ tests.

Attempts to reduce the racial IQ gap using alternative cognitive tests have always been proposed (Jensen, 1973, pp. 299-301; 1980, pp. 518, 522). The most recent, but unconvincing, attempt at reducing the cognitive gap comes from Goldstein et al. (2023). They devised a reasoning test composed of novel tasks that do not require previously learned language and quantitative skills. Because they found a Black-White d gap ranging between 0.35 and 0.48 across their 6 independent samples, far below the typically found d gap of 1.00, they concluded that traditional IQ tests are biased. First, they carefully ignore measurement invariance studies. Second, traditional IQ tests were not administered alongside to serve as benchmarks. Third, their analysis adjusted for socio-economic status because they compare Blacks and Whites who had the same jobs (police officers, deputy sheriffs, firefighters) within the same cities. This study reflects the traditional view that IQ tests are invalid as long as they contain even the slightest cultural component.

Murray's recent book also provides gap sizes within occupations, though not within the same cities. They are generally smaller, of course.

The Project Talent administered aptitude tests. They serve as a proxy for cognitive tests, but they are not cognitive tests. For instance, most of the information test items require specific knowledge: asking who was the hero of the Odyssey, or what a female horse is called, etc. They do not call for relation eduction. Jensen (1985) himself has been highly critical of the Project Talent test battery: “Many of these tests are very short, relatively unreliable, and designed to assess such narrow and highly culture-loaded content as knowledge about domestic science, farming, fishing, hunting, and mechanics” (p. 218).

It is strange to call aptitude tests not cognitive tests. All of these are cognitive or mental tests per Jensen's definition given in The g Factor, because performance on them depends mainly on cognition, as opposed to e.g. dexterity. Every cognitive test has some flavor; at least, no one has made a pure g test so far. I don't think these aptitude or knowledge tests are any worse than many other tests commonly used. In fact, knowledge tests usually have higher loadings due to higher reliability, despite not involving anything but recalling previously learned facts and some guesswork in cases of uncertainty.

 

Reviewer | Admin

 

The author applies MGCFA to aptitude tests in the Project Talent dataset to test measurement equivalence and Spearman’s Hypothesis by race and sex. The sample size is large and the cognitive data is rich, so it is well-deserving of publication. The author finds Spearman’s hypothesis applies to the black-white difference but not sex differences. Through the many citations and explication of methods, the author shows a deep knowledge of a difficult literature, where recommendations are constantly evolving (e.g. how to compare the fit of BF and HOF models). So I’m confident that the manuscript provides a rigorous contribution. 

 

My main criticism of the paper is that it is difficult to follow. I am not an expert in MGCFA or test bias, and other papers in the field are hard to understand; nevertheless, there are times in the manuscript where I became more confused than was necessary. Writing more clearly and concisely, while avoiding tangents and moving details to a supplement, might be helpful in the future. Sometimes the extra details regarding methods seem unnecessary, while the core parts are not always clearly explained.

 

As I’ve said this is not my area of expertise and there were areas where I was confused. As such, my comments are only recommendations, some of which may be wrong, but which the author can consider and use as he wishes. I categorise my comments into high level comments regarding the key findings which I found unclear and a larger range of miscellaneous comments I had whilst reading through the manuscript.




 

High level comments

 

The author concludes that there is gender bias in the cognitive tests. Is that accurate? Couldn't the results be interpreted as there being large sex differences in specific abilities?

 

The author finds huge sex differences in g using MGCFA. Isn’t this an interesting finding, since the received opinion of the field is that sex differences in g are non-existent or trivial. Does the author think his results should make us confident in sex differences in g?

 

Does the author have confidence that the g factor is well-specified? You seem to think the use of knowledge tests has made the latent “g factor” biased towards capturing general knowledge ability more so than it should. Is this really a good explanation for the large sex differences, given that women do better on the information component? It looks like the science component is perhaps the key problem.

 

The author says that there is non-trivial racial bias in the aptitude tests. Could the author clarify how big these biases are? How different is this result to past studies and if the result is different, why is it different? 




 

Miscellaneous Comments:



 

This is for answering this question that Jensen (1998) devised the Method of Correlated Vectors (MCV) to find out how much g explains the variation in subtest differences between groups. 

 

Awkward wording

 

the largest study ever conducted

 

I don’t think this is true. 

 

The Project Talent administered a considerable amount of tests

 

Just “Project Talent administered” would read better.

 

Wise et al. (1979). Major et al. (2012)

 

“And” instead of a full stop might be clearer here.

 

In their study, the VPR-g model fitted much better than the CHC-based HOF g model. The VPR was initially used in this study but it was found that the VPR model does not fit better than the CHC-based HOF model and produces sometimes inadmissible solutions such as negative variance.

 

Why do the results differ? My understanding was that VPR > CHC, but CHC is primarily used because it is tradition.

 

Footnote

Major et al. (2012) analyzed and used multiple imputation on the entire sample and separated the analysis by gender and by grade level (9-12). They included Memory for Words, Memory for Sentences, and Creativity subtests. In the present study, the VPR fits marginally better with a CFI=.002 at best, regardless of the subgroups being analyzed, and this remained true even after analyzing subgroups by grade level (9-12).

 

This appears to say the VPR does fit better than the CHC? Which is it? 

 

Supplementary excel sheet says # GET MY DATA HERE : https://osf.io/qn67k/ OR HERE https://osf.io/5y82g/

 

Perhaps put everything relevant in one OSF page? 

 

I think saving R code into Excel files is quite ugly. I realise you might want to show off the console output, but R Markdown can do this in a much cleaner way.



 

Univariate normality is then scrutinized. Curran et al. (1996) determined that univariate skewness of 2.0 and kurtosis of 7.0 are suspect, and that ML is robust to modest deviation from multivariate non-normality but that ML χ2 is inflated otherwise. Values for univariate kurtosis and skewness were acceptable, although the kurtosis values for Table Reading are a little high among White males (3.2) and White females (4.58). On the other hand, multivariate normality was often rejected. The multivariate non-normality displayed by the QQ plot was moderate for Black-White analysis in both male and female groups and sex analysis in the White group but perfectly fine for sex analysis in the Black group. 

 

Exploratory Factor Analysis (EFA) was used to determine the appropriate number of factors. Similar to Major et al. (2012), it was found here that the 6-factor model was the most interpretable in all subgroups tested. The 4- and 5-factor models blend indicators into factors which are more ambiguous (math and english tests form one common factor; information and science tests form one common factor) and cause severe unbalances in factor sizes. The 7- and 8-factor models produce additional factors which are not associated with any particular ability or do not have any indicators with high loading. EFA reveals a large number of medium-size cross loadings. Since the results from simulation studies (Cao & Liang, 2023; Hsu et al., 2014; Xiao et al., 2019; Ximénez et al., 2022; Zhang et al., 2023) indicated that ignoring small cross loadings, typically set at .15 or .20 in these studies, has a tendency to reduce the sensitivity in commonly used fit indices, cross loadings are allowed when the average of the two groups is close to .20 but with a minimum of .15 per group. 



 

A lot of the results referenced in these two paragraphs don’t appear to be presented in the paper. I can’t see the QQ plot. What was the basis for supposing that the “6-factor model was the most interpretable in all subgroups tested”?

 

Overall fit is acceptable in all models, except maybe for Mc. The configural and regression invariance both hold perfectly, thus only the next steps will be critically analyzed. 

 

On what basis are these statements made and the statements in the rest of the paragraph?

 

Table 3 contains a summary of the fit indices of the HOF model and the freed parameters

 

I think “free parameters” sounds nicer than “freed parameters”



 

The male-female g gap in the White and Black groups are, respectively, 0.85 and 0.55. The sex gap seems large compared to earlier reports on IQ gaps, until one realizes that this battery of tests has a strong knowledge component, especially specific knowledge

 

I understand that all variables have been standardized, but a near 1SD gap seems incredible. Even if the tests are very much based on general knowledge. I don’t know the literature, but would sex gaps in general knowledge be so large in the age range of Project Talent?

 

Is the explanation of the large sex gaps being caused by lots of knowledge batteries very plausible? In the information ability, females have a large advantage, which seems contradictory. In fact, most of the broad abilities show a female advantage, yet still there is a male advantage in g. Men are really outperforming women on science ability (d = -1.7), even more so than on spatial ability (d = -0.33). So the sex differences are really being driven by this one odd science factor. Maybe it is worth reanalysing the sex differences after dropping the science subtests? Mind you, isn’t the male advantage in spatial ability normally bigger?

 

Also the g gap between the sexes in black group seems much smaller than the white group. Should we be confident in that difference? It seems surprising and interesting and worthy of discussion. I personally am quite interested to know if there are racial differences in sexual dimorphism.

 

If the g factors from Project Talent capture a lot of variance related to general knowledge, doesn’t that pose a big problem for the paper? If the g factor is improperly specified, then testing Spearman’s hypothesis in this sample doesn’t mean much.

 

Is it interesting or surprising that the bifactor models produce much larger g differences for sex than the HOF models? My understanding is that the large differences in specific abilities make it difficult to isolate g differences between the sexes. So the BF model, in more precisely specifying the varying influences of specific abilities, might provide better estimates of sex differences in g?



 

Table 15. Proportions of subtest group differences due to g based on Bifactor model

 

All values for the sex differences are positive. But if males have an advantage in g and perform worse than women on some subtests, then g should contribute negatively to the sex difference in favour of women? Shouldn’t some be positive and some be negative?

 

 Freed parameters (by descending order of χ2 size) are: table reading~1, mechanical reasoning~1, social studies~1, clerical checking~1.

 

Most of the abilities are capitalised, but sometimes not as above. Best to be consistent.



 

Figures 2-5. The image quality is poor and the figures take up multiple pages. They appear to have been copied and pasted from RStudio, which causes the plots to have very poor quality. I recommend using the function ggsave() instead, then inserting the file rather than copying and pasting the picture into your Word document. In choosing a device, use a vector graphics format (pdf, svg, eps), not a raster format (jpeg, png). Vector graphics can be scaled in size without losing resolution.

 

Since Figures 2-5 are very similar, I recommend using facet_wrap(), ggarrange(), grid.arrange(), or some other function or method to put all the plots into one image which can fit on one page.

 

The plots are not very appealing. I’d consider using theme_bw(). This style is sleek and appealing. It is also used by Emil and I think is the common style used in OpenPsych. There are other issues too: the numbers are right on top of the points, and they are hidden by the regression line. I don’t know how to fix the latter issue, but the former might be fixed by slightly jittering the label positions. Alternatively, the points could be removed to just leave the numbered labels. Making the numbered labels larger might be more visually appealing.
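
Putting these figure suggestions together, a rough ggplot2 sketch might look like this (mcv and its columns g_loading, d_gap, subtest, and comparison are hypothetical):

library(ggplot2)
library(ggrepel)   # for labels that do not overlap the points

p <- ggplot(mcv, aes(x = g_loading, y = d_gap, label = subtest)) +
  geom_smooth(method = "lm", se = FALSE, colour = "grey50") +  # drawn first so points and labels sit on top
  geom_point(size = 1) +
  geom_text_repel(size = 3) +
  facet_wrap(~ comparison) +                                   # all four comparisons in one figure
  theme_bw() +
  labs(x = "Average g-loading (Black and White male samples)", y = "Subtest d gap")

ggsave("figure2.pdf", p, width = 8, height = 6)                # vector format keeps it sharp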



 

 the negative correlation amplifies (r=-.304) but this is simply because of the removal of two speed subtests.

 

Check the spacing in the equations. (r = -.304) looks prettier.



 

The effect size in non-invariance indicates a non-trivial bias, mostly disfavoring Blacks, but given the small ratio of biased/unbiased tests, the total effect should not be large… If this is the case, the racial bias cannot be considered as minor anymore

 

The author seems to be saying the tests used in Project Talent are biased in favour of Whites. The negative phrasing “disfavoring Blacks” confused me a little bit; “biased in favour of Whites” would have been clearer.

 

The reporting of this conclusion is contradictory; “non-trivial” then “the total effect should not be large”. Some rewording might clarify the issue. Given the importance of the question, any more information to explain the size of this bias would be very valuable. 



 

MCV was applied to check the similarity of obtained results with MGCFA. The finding of a large correlation between the Black-White d gaps and g-loadings is consistent with earlier reports on Black adults…

 

If the author is sceptical of MCV, why not leave it for the supplement? If its results are interesting, it would be good to explain how the author interprets the MCV results for sex differences. Only the racial difference in MCV is mentioned in the discussion.

 

The discussion seems rather long and goes off into tangents in the literature. Maybe this can be shortened. A lot of the writing interpreting and discussing the literature on the effect of culture on IQ tests seems quite tangential and long.



 

 That cross-cultural comparability and Spearman’s g has been repeatedly confirmed for racial differences but not gender differences. 

 

The last sentence does not make sense.