This finding replicates earlier MGCFA studies supporting SH with respect to the Black-White cognitive gap as well as earlier MGCFA studies revealing stronger gender bias than racial bias
I would not frame failure of Spearman's hypothesis as test bias. It can mean that, but it can also just mean that non-g factors have a large role in explaining why differences are larger or smaller on some tests. This is obviously the case for sex differences, where the g factor difference (assuming it exists) has a relatively minor role in explaining differences across subtests compared to differences in non-g abilities. For the Black-White comparison, the g gap is so large that g will explain most of the differences between subtests. Also, could you decompose the explained proportions and give these in the abstract?
Jensen (1998) proposed that the magnitude of the racial differences in IQ, at least between Black and White Americans, as well as differences in cognitive-related socio-economic outcomes are a function of the g-loadings (i.e., the correlation between the tests or outcomes and the general factor of intelligence) of the respective cognitive tests and outcomes, making the g factor an essential element in the study of socio-economic inequalities
Can you give the page reference for this? I think this is generally called the g-nexus.
More specifically, Jensen (1998) proposed that race/ethnic group differences on cognitive tests are largely due to latent differences in general mental ability. This is known as Spearman’s hypothesis (SH) which exists in two forms: the strong and the weak form, the latter of which was endorsed by Jensen.
He did propose that, but not in that book. "Proposed" makes it sound like this was the original text to propose it, but Jensen's writings on SH go back decades prior; it would be better to cite those earlier works here. Maybe better would be to discuss:
This is known as Spearman’s hypothesis (SH) which exists in two forms: the strong and the weak form, the latter of which was endorsed by Jensen. The strong form affirms that the differences are solely due to g factor differences while the weak form affirms that the differences are mainly due to differences in g. The alternative contra hypothesis states that group differences reside entirely or mainly in the tests’ group or broad factors and/or test specificity and that g differences contribute little or nothing to the overall ones.
I prefer thinking about this as a continuum of g vs. non-g contributions to the observed gaps. It ranges from 0 to 100% and the strong form is just the point at the end of this line. I think that conceptualizing it this way is more sensible, but I recognize that MGCFA-like methods have to split it up for hypothesis testing purposes. Still, Lasker often uses g's contribution as a % which will be useful for meta-analysis.
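To make that continuum explicit, the decomposition could be written out in symbols along these lines (a sketch in generic MGCFA notation, not necessarily the notation used in the paper):

```latex
% Model-implied mean difference on subtest j between two groups, with
% g-loading \lambda_{gj}, broad-factor loadings \lambda_{sj}, and latent
% mean differences \Delta\kappa_g and \Delta\kappa_s:
\[
  \Delta\mu_j \;=\; \lambda_{gj}\,\Delta\kappa_g \;+\; \sum_{s}\lambda_{sj}\,\Delta\kappa_s
\]
% Share of the summed subtest gaps attributable to g:
\[
  P_g \;=\; \frac{\sum_j \lambda_{gj}\,\Delta\kappa_g}{\sum_j \Delta\mu_j},
  \qquad 0 \le P_g \le 1,
\]
% with P_g = 1 being the strong form of SH and P_g = 0 the contra hypothesis.
```

Reporting P_g (or the per-subtest proportions) would make the results directly usable in a later meta-analysis.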
Millsap & Olivera-Aguilar (2012, p. 388) provides an illustration: the inclusion of a math test having a mixture of multiple-choice items and problem-solving items, with the latter being story problems, may introduce bias against foreign language speakers due to the verbal content of this test. If the math factor is not supposed to measure verbal skill, then such a test should be discarded.
I think it would be better to say that such a test should not be used with foreign language users, or at least, should be scored differently in order to compensate for this language bias.
Another comes from Benson et al. (2020) who analyzed the UNIT2 norming sample and found that scalar invariance was rejected not only for race (Whites, Blacks, Asians) and ethnicity (Hispanic) groups but also for gender groups. Metric invariance was also rejected for age and gender groups, suggesting that the UNIT2 overall is somewhat biased with respect to any group.3
These are in stark contrast to most findings. What is special about this study? Why does it give so different results? Can one reanalyze the data using the same methods as most papers use?
So far the evidence of strong measurement bias in race differences comes mainly from studies conducted in African countries. Dolan et al. (2004) compared the Junior Aptitude Test (JAT) scores of South African Black and White students and found that both metric and scalar invariance are violated.4 Lasker (2021) re-analyzed Cockroft et al. (2015) and compared the WAIS-III scores of undergraduate South African students enrolled at an English medium University to undergraduate UK university students, and found that metric and scalar invariance are rejected. Warne (2023) compared the WISC-III scores of Kenyan Grade 8 students in Nairobi schools to the American norm and the WAIS-IV scores of Ghanaian students who showed English fluency at high school or university to the American norm. While measurement equivalence was established for Ghanaian students, it was rejected for Kenyan students.5
Many readers will not be familiar with the exact meanings of metric, scalar etc. invariance, and what it means when it is rejected/violated. Can you explain these? Usually, one can attain partial invariance allowing score comparisons, was this done?
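For example, a short summary along these lines would help (these are the standard MGCFA definitions, not anything specific to this paper):

```latex
% Measurement model for subtest x_j of person i in group k:
\[
  x_{ijk} \;=\; \tau_{jk} \;+\; \lambda_{jk}\,\eta_{ik} \;+\; \varepsilon_{ijk}
\]
% Configural invariance: the same factor structure holds in every group.
% Metric (weak) invariance:  \lambda_{jk} = \lambda_j for all k (equal loadings);
%   needed before comparing factor (co)variances and loadings across groups.
% Scalar (strong) invariance: additionally \tau_{jk} = \tau_j (equal intercepts);
%   needed before comparing latent (and observed) means across groups.
% Strict invariance: additionally equal residual variances.
% Partial invariance: a minority of loadings/intercepts are freed; latent mean
%   comparisons remain defensible if enough parameters stay invariant.
```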
Yet research employing MGCFA showed mixed evidence of gender fairness. Some studies reported small or no measurement bias (Chen et al., 2015;6 Dombrowski et al., 2021; Irwing, 2012; Keith et al., 2011; Palejwala & Fine, 2015;7 Reynolds et al., 2008; van der Sluis et al., 2006) while others reported non-trivial bias, intercepts almost always being the common source of bias (Arribas-Aguila et al., 2019; Dolan et al., 2006; Lemos et al., 2013; Pauls et al., 2020; Pezzuti et al., 2020; Saggino et al., 2014; Van der Sluis et al., 2008; Walter et al., 2021).
I would add that it is trivially easy to get intercept bias for group differences if they differ in non-g factors as well as g. If the number of tests isn't sufficiently large to model the non-g factors, differences in them will show up as intercept bias. A good example of this is:
https://www.researchgate.net/publication/236597029_Decoding_the_Meaning_of_Factorial_Invariance_and_Updating_the_Practice_of_Multi-group_Confirmatory_Factor_Analysis_A_Demonstration_With_TIMSS_Data
Which used a g-only model for 5 math subtests from TIMSS.
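The algebra behind this point is simple; a sketch, assuming a single unmodeled broad factor s on which the groups differ:

```latex
% True data-generating model for the mean of subtest j in group k:
\[
  \mathrm{E}[x_{jk}] \;=\; \tau_j \;+\; \lambda_{gj}\,\kappa_{gk} \;+\; \lambda_{sj}\,\kappa_{sk}
\]
% If only g is modeled, the term \lambda_{sj}\,\kappa_{sk} has nowhere to go:
% with loadings constrained equal, it is absorbed into group-specific intercepts,
\[
  \mathrm{E}[x_{jk}] \;=\; \tilde{\tau}_{jk} \;+\; \lambda_{gj}\,\kappa_{gk},
  \qquad \tilde{\tau}_{jk} \;=\; \tau_j + \lambda_{sj}\,\kappa_{sk},
\]
% so a real group difference on the unmodeled factor s surfaces as
% "intercept bias" on the subtests loading on s, even though nothing is
% functioning differently at the item or test level.
```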
After fitting a parsimonious weak SH model, they discovered that the equality constraint on the g factor mean did not worsen model fit in 3 of the 4 age subgroups. In the end, there is no compelling evidence that g is the main source of the subtest differences between sex groups.
Given how small the g difference is between sexes, and the limited power of MGCFA methods, it is not surprising that the null-g models are often not rejected. This is an inherent issue with hypothesis testing. Jensen's method has similar issues, because it is biased downward by sampling errors in the estimates. When the true difference is relatively small, the bias will often overwhelm the signal. This was shown by simulation in:
https://www.researchgate.net/publication/353465907_The_Negative_Religiousness-IQ_Nexus_is_a_Jensen_Effect_on_Individual-Level_Data_A_Refutation_of_Dutton_et_al's_'The_Myth_of_the_Stupid_Believer'
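A minimal simulation of this downward bias might look like the following (the numbers are illustrative, not taken from the paper): even when the true vector correlation is 1 by construction, sampling error in the estimated loadings and gaps pulls the observed correlation well below 1, and the smaller the true gaps, the worse the attenuation.

```python
import numpy as np

rng = np.random.default_rng(1)

n_tests = 12      # subtests in the battery
n_reps = 2000     # simulation replications
obs_r = []

for _ in range(n_reps):
    # True g-loadings and true group gaps that are perfectly collinear,
    # i.e. a "pure" Jensen effect with a true vector correlation of 1.
    loadings = rng.uniform(0.3, 0.8, n_tests)
    gaps = 0.5 * loadings                       # small true gaps, e.g. sex-sized

    # Add sampling error to both vectors, as happens with finite samples.
    est_loadings = loadings + rng.normal(0, 0.05, n_tests)
    est_gaps = gaps + rng.normal(0, 0.08, n_tests)

    obs_r.append(np.corrcoef(est_loadings, est_gaps)[0, 1])

print("true vector correlation: 1.00 by construction")
print("mean observed correlation:", round(float(np.mean(obs_r)), 2))  # well below 1
```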
The Project Talent is the largest study ever conducted in the United States involving 377,016 9th-12th grade students during 1960 and drawn from all of the 50 states (Flanagan et al., 1962).
Largest what study? It is probably not the largest study ever of any social science in the USA. For instance, the current MVP (Million Veteran Program) is larger. https://www.research.va.gov/mvp/ It is probably the largest dataset from the USA concerning detailed measures of intelligence.
Three tests have been removed in the present analysis: memory for sentences (S17), memory for words (S18), and creativity (S27). The memory tests are highly correlated with each other but are poorly correlated with all other variables (between r=.10 and r=.20), which makes them unsuitable for CFA. Creativity has moderate correlations with other variables, has no main loading and its loadings are of modest or small size. Thus, a total of 34 aptitude/cognitive tests are used.
Why are they so low in g-loading? It's a general problem of Project Talent that it has too many tests with small item counts. The item data are not available for most of the sample, so one cannot easily adjust for this.
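The mechanism is presumably just short-test unreliability; in symbols (standard psychometric formulas, not taken from the paper):

```latex
% Spearman–Brown: reliability of a test lengthened by a factor of m,
\[
  \rho_m \;=\; \frac{m\,\rho_1}{1 + (m-1)\,\rho_1},
\]
% and attenuation of correlations by unreliability,
\[
  r_{xy}^{\mathrm{obs}} \;=\; r_{xy}^{\mathrm{true}}\,\sqrt{\rho_{xx}\,\rho_{yy}},
\]
% so very short subtests with few items correlate weakly with everything
% else even when their true relation to g is substantial.
```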
In their study, the VPR-g model fitted much better than the CHC-based HOF g model.
I think this is the first time "CHC" (Cattell-Horn-Carroll) is used, but it is not explained.
Figure 1 images are fairly ugly. Maybe draw some nicer ones using e.g. https://app.diagrams.net/
To evaluate and compare model specifications, fit indices such as CFI, RMSEA, RMSEAD, SRMR and McDonald’s Noncentrality Index (Mc) are used to assess model fit, along with the traditional χ2. Higher values of CFI and Mc indicate better fit, while lower values of χ2, RMSEA, RMSEAD, SRMR indicate better fit. Simulation studies established the strength of these indices to detect misspecification (Chen, 2007; Cheung & Rensvold, 2002; Khojasteh & Lo, 2015; Meade et al., 2008). However, with respect to ∆RMSEA, doubts about its sensitivity to detect worse fit among nested models were raised quite often. Savalei et al. (2023) provided the best illustration of its shortcomings. According to them, this was expected because the initial Model A often has large degrees of freedom (dfA) relative to the degrees of freedom introduced by the constraints in Model B (dfB), resulting in very similar values of RMSEAB and RMSEAA, hence a very small ΔRMSEA. For evaluating nested models, including constrained ones, their proposed RMSEAD solves this issue. RMSEAD is based on the same metric as RMSEA and is interpreted exactly the same way: a value of .08 suggests fair fit while a value of .10 suggests poor fit.
Is there some reason people in this area don't use cross-validation? It seems to me that this is a better way to deal with overfitting issues relating to all of these model fit statistics. I see that some people use split-half cross-validation as a measure of overfitting. https://www.tandfonline.com/doi/abs/10.1080/1091367X.2014.952370 In general, though, k-fold cross-validation is a much better approach.
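A rough sketch of what k-fold cross-validation could look like here, using exploratory factor models as a stand-in since there is no off-the-shelf MGCFA in sklearn (so this only illustrates the principle of scoring competing specifications on held-out data, not the exact models in the paper):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.model_selection import KFold

def cv_loglik(X, n_factors, n_splits=5, seed=0):
    """Mean held-out log-likelihood of an EFA model with n_factors factors."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in kf.split(X):
        fa = FactorAnalysis(n_components=n_factors).fit(X[train_idx])
        scores.append(fa.score(X[test_idx]))   # average log-likelihood per case
    return float(np.mean(scores))

# Toy demonstration with simulated one-factor data; with real data X would be
# the n_subjects x n_subtests matrix of standardized subtest scores.
rng = np.random.default_rng(0)
g = rng.normal(size=(2000, 1))
X = g @ rng.uniform(0.4, 0.8, (1, 9)) + rng.normal(0, 0.7, (2000, 9))
for k in (1, 2, 3):
    print(k, round(cv_loglik(X, k), 4))
# The specification with the highest held-out log-likelihood generalizes best,
# which guards against overfitting in a way in-sample fit indices do not.
```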
Because the Predictive Mean Matching (PMM) method of imputation calculates the predicted value of target variable Y according to the specified imputation model, the imputation was conducted within race and within gender groups, totaling four imputations. It is inappropriate to impute the entire sample because it implies that the correlation pattern is identical across groups, an assumption that may not be true and may eventually conceal measurement non-invariance.
Alternatively, different imputation models may create spurious non-invariance, because they fill in the missing data using different model parameters.
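For what it's worth, the within-group strategy is easy to express in code; a sketch, using sklearn's IterativeImputer as a generic stand-in for PMM (which is usually done with mice in R), with hypothetical column names:

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def impute_within_groups(df, group_cols, test_cols, seed=0):
    """Impute the test columns separately within each race x gender cell,
    so that each cell's own covariance structure drives the imputation."""
    parts = []
    for _, cell in df.groupby(group_cols):
        imputer = IterativeImputer(random_state=seed)
        filled = cell.copy()
        filled[test_cols] = imputer.fit_transform(cell[test_cols])
        parts.append(filled)
    return pd.concat(parts).sort_index()

# e.g. impute_within_groups(df, ["race", "gender"], test_cols)
# Pooled imputation (one model for everyone) would instead borrow each group's
# correlation pattern from the others, which can mask real non-invariance;
# separate models, conversely, can manufacture spurious non-invariance, so the
# choice should be stated and justified.
```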
The model specification is displayed as follows:
Use code formatting (a monospaced font) for the model specification code.
Table 2. Is "RMSEAD [CI]" the difference in RMSEA? The width of the confidence interval is not specified, but one can assume it is 95%. I don't see where this value comes from. The differences in RMSEA are not that large between the models; they are all between 0.042 and 0.044.
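If it is the Savalei et al. index, it should be computable directly from the chi-squares and degrees of freedom reported in the table; a sketch, assuming it is simply the standard RMSEA formula applied to the chi-square difference test (check the exact variant, including the multi-group scaling, against Savalei et al., 2023):

```python
import math

def rmsea_d(chisq_a, df_a, chisq_b, df_b, n, n_groups=1):
    """RMSEA of the chi-square difference test between nested models A and B,
    where B is the more constrained model. Assumes the usual RMSEA formula
    applied to the difference test; n is the total sample size."""
    d_chisq = chisq_b - chisq_a
    d_df = df_b - df_a
    numerator = max(d_chisq - d_df, 0.0)
    return math.sqrt(n_groups * numerator / (d_df * (n - 1)))

# e.g. rmsea_d(chisq_a=..., df_a=..., chisq_b=..., df_b=..., n=...)
```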
Table 3. Black-White differences among males using Higher Order Factor
Better captions would be e.g. "Table 3. Black-White differences among males using the Higher Order Factor model"
What do the bold rows in the tables mean?
Finally, as an additional robustness analysis, all models for both the Black-White male and female groups were rerun after removing multivariate outliers with the Minimum Covariance Determinant (MCD) proposed by Leys et al. (2018) who argued that the basic Mahalanobis Distance was not a robust method. Although the multivariate normality was barely acceptable, the number of outliers was large: MCD removed 1,948 White males and 338 Black males, and 1,005 White females and 372 Black females.
Is there some information about why these were outliers? In what sense are they outliers?
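For reference, MCD flags cases whose robust Mahalanobis distance from the bulk of the data exceeds a chi-square cutoff; something like the sketch below (using sklearn; the cutoff choice is a judgment call). An "outlier" here is simply a profile of 34 subtest scores that is very unusual given the robust covariance structure, e.g. an extreme or internally inconsistent response pattern, so it would help to describe what the flagged profiles look like.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def mcd_outliers(X, alpha=0.975, seed=0):
    """Boolean mask of multivariate outliers: cases whose robust (MCD-based)
    squared Mahalanobis distance exceeds the chi-square quantile with
    df = number of variables."""
    mcd = MinCovDet(random_state=seed).fit(X)
    d2 = mcd.mahalanobis(X)                  # robust squared distances
    cutoff = chi2.ppf(alpha, df=X.shape[1])
    return d2 > cutoff
```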
Robustness analysis was conducted for the gender difference in the White group because the multivariate normality was non-normal
I don't understand this sentence. "The multivariate normality was non-normal" reads oddly; do you mean that the data were not multivariate normal? Please rephrase.
Table 14. d gaps (with their S.E.) from the best fitting g models per group analysis
Are these standard errors really accurate? You could bootstrap the data to find out. It is hard to believe these estimates are that precise.
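A sketch of the kind of check I have in mind, resampling each group separately; for the latent gaps, the statistic would refit the MGCFA model on each resample and return the estimated g gap, but the same scheme applied to the observed d already gives a lower bound on how precise these estimates can plausibly be:

```python
import numpy as np

def cohens_d(x, y):
    """Standardized mean difference with a pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

def bootstrap_se(x, y, stat=cohens_d, n_boot=2000, seed=0):
    """Bootstrap standard error of a two-group statistic."""
    rng = np.random.default_rng(seed)
    reps = []
    for _ in range(n_boot):
        xb = rng.choice(x, size=len(x), replace=True)
        yb = rng.choice(y, size=len(y), replace=True)
        reps.append(stat(xb, yb))
    return float(np.std(reps, ddof=1))
```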
Note: Negative values indicate advantage for Whites (or males).
It might be more sensible to code these gaps as positive values, since gaps favoring Whites (or males) are the more numerous and usually the values of interest.
The average proportion is .43 for the sex group among Whites and .50 for the sex group among Blacks. If SH explicitly states that g is the main source of the group difference, it seems that even the weak SH model does not explain well the pattern of sex differences.
On the contrary, these seem like high values. They say that the g gap explains quite a bit of the variation in subtest differences. Given the popular view that the sex g gap is 0, these kinds of findings are unexpected.
Table 15. Proportions of subtest group differences due to g based on Bifactor model
It is hard to understand the table when subtests are not given by their names, but by e.g. S15. The reader then has to scroll back to a much earlier page to look up each value.
The g-loadings correlate highly with Black-White d gaps but not with sex d gaps. After correction for unreliability, the correlations (g*d) for the Black-White male, Black-White female, male-female White, and male-female Black groups are, respectively, .79, .79, -.06, -.12. If the reliability for speed subtests is assumed to be .70 instead of .60, the correlations are .80, .80, -.05, -.12.
I think this is a good example of how Jensen's method gives misleading results compared to the proportions calculated above. There it was found that g explained about half of the variation in subtest differences, whereas Jensen's method suggests it is about 0%.
Figures 2-5 also show the problem with using simple regression here. Since tests load on multiple factors, you should use multiple regression, with each test's loadings on all the factors as the predictors. When this is done, the results (unstandardized slopes) tell you the estimated gap sizes for tests that measure each factor perfectly and no other factor. Do these values align with the MGCFA results? They should.
You don't seem to discuss the gaps implied by the regression models in the figures, but we can read them off at the loading = 1 line. They are about 1.5 d for Black-White males, about 1.6 d for Black-White females, and quite small for the two male-female gaps. The values for the Black-White comparisons are close to the ones from MGCFA in Table 14.
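Concretely, with L the subtests-by-factors loading matrix and d the vector of subtest gaps, the multiple-regression version is just the following (the loadings and gaps here are made up for illustration, not taken from the paper):

```python
import numpy as np

# Hypothetical inputs: replace with the battery's estimated loadings and gaps.
# L: (n_subtests x n_factors) loadings on g and the group factors.
# d: (n_subtests,) standardized subtest mean differences.
rng = np.random.default_rng(0)
L = np.column_stack([rng.uniform(0.4, 0.8, 34),    # g loadings
                     rng.uniform(0.0, 0.5, 34)])   # one broad factor, say verbal
d = L @ np.array([1.1, 0.2]) + rng.normal(0, 0.05, 34)

# Unstandardized slopes: the implied gap on a hypothetical test measuring
# each factor perfectly (loading = 1) and nothing else.
X = np.column_stack([np.ones(len(d)), L])          # intercept + loadings
coefs, *_ = np.linalg.lstsq(X, d, rcond=None)
print("intercept:", round(coefs[0], 3))
print("implied gaps per factor:", np.round(coefs[1:], 3))  # should track the MGCFA latent gaps
```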
After establishing partial invariance, SH was tested in all subgroups. This was validated in the Black-White analyses based on two findings: 1) non-g models fit worse than g-models and 2) the proportion of the subtests’ mean differences due to g is very large.
Maybe it is larger than usual because this sample was large enough that single-sex samples could be used. When mixed-sex samples are used, some of the variation in the data will be due to unmodeled sex differences. I think this may make the g proportions lower.
SH is supported through the examination of Forward and Backward Digit Span, showing a BDS Black-White gap that is larger (d=.50) than the FDS gap (Jensen, 1998, p. 370).
See https://www.sciencedirect.com/science/article/abs/pii/S104160801530056X
The great majority of the studies confirms the cross-cultural comparability of IQ tests, the exception mainly comes from South African samples (Dolan et al., 2004; Lasker, 2021). Due to the omnipresent force of the mass-market culture in developed countries, it is not surprising that culture bias is rarely noticeable (Rowe et al. 1994, 1995).
"The great majority of the studies"? I thought there were almost no studies examining cross-cultural, that is, between-country, comparability of IQ tests.
Attempts to reduce the racial IQ gap using alternative cognitive tests have always been proposed (Jensen, 1973, pp. 299-301; 1980, pp. 518, 522). The most recent, but unconvincing, attempt at reducing the cognitive gap comes from Goldstein et al. (2023). They devised a reasoning test composed of novel tasks that do not require previously learned language and quantitative skills. Because they found a Black-White d gap ranging between 0.35 and 0.48 across their 6 independent samples, far below the typically found d gap of 1.00, they concluded that traditional IQ tests are biased. First, they carefully ignore measurement invariance studies. Second, traditional IQ tests were not administered alongside to serve as benchmarks. Third, their analysis adjusted for socio-economic status because they compare Blacks and Whites who had the same jobs (police officers, deputy sheriffs, firefighters) within the same cities. This study reflects the traditional view that IQ tests are invalid as long as they contain even the slightest cultural component.
Murray's recent book also provides gap sizes within occupations, though not within the same cities. They are generally smaller, of course.
The Project Talent administered aptitude tests. They serve as a proxy for cognitive tests, but they are not cognitive tests. For instance, most of the information test items require specific knowledge: asking who was the hero of the Odyssey, or what a female horse is called, etc. They do not call for relation eduction. Jensen (1985) himself has been highly critical of the Project Talent test battery: “Many of these tests are very short, relatively unreliable, and designed to assess such narrow and highly culture-loaded content as knowledge about domestic science, farming, fishing, hunting, and mechanics” (p. 218).
It is strange to call aptitude tests not cognitive tests. All of these are cognitive or mental tests per Jensen's definition in The g Factor, because performance on them depends mainly on cognition, as opposed to e.g. dexterity. Every cognitive test has some flavor; at least, no one has made a pure g test so far. I don't think these aptitude or knowledge tests are any worse than many other commonly used tests. In fact, knowledge tests usually have higher g-loadings due to higher reliability, despite involving nothing but recall of previously learned facts and some guesswork in cases of uncertainty.