Back to [Archive] Withdrawn submissions

[ODP] Sex differences in g and chronometric tests
Males average about 100 g more brain mass than females, and brain size is known to correlate with general intelligence (\textit{g}), raising the possibility that males average somewhat higher in \textit{g} than females. Chronometric tests are known to correlate with \textit{g}, and there is strong evidence that males do better on chronometric tests. Furthermore, there is evidence that adding chronometric tests to standard batteries yields a better measurement of \textit{g}. However, chronometric tests are not found in standard IQ batteries, and thus relying on standard batteries to test for a sex difference may lead one to underestimate it.

Keywords: sex differences; intelligence; IQ; g-factor; reaction time; elementary cognitive test

Attached files: Layout PDF, source PDF.

If you're talking about e.g. simple reaction time, then the correlation with g is low (about .2-.3) because the test is not very g-loaded. I think that if one uses a battery with more g-loaded chronometric tests, then one will see Spearman's law confirmed, and the racial differences on the highly g-loaded tests will be close to the usual d of 1-1.1.

Yes, this is because the g-loading of individual ECTs is very low as well, as discussed by Jensen (1993). Spearman's law was confirmed in his study too (r's of about .75). It would be more interesting to have a study with a large number of ECTs varying widely in their g-loadings and at least two groups with different means present (of whatever type), to see how the ECT d's match up to conventional IQ test d's.

Jensen, Arthur R. "Spearman's hypothesis tested with chronometric information-processing tasks." Intelligence 17.1 (1993): 47-77.
Discuss this, then publish

New version attached.
Can reviewers comment on this submission? It has been 17 days since the last post.
I approve its publication.

Your paper gives me no sense of the vast amount of literature on the topic. Also, you fail to mention a number of conflicting reports. For example:

Sex differences in latent cognitive abilities ages 5 to 17: Evidence from the Differential Ability Scales—Second Edition. Intelligence, 39(5), 389-404.
Sex differences in brain volume are related to specific skills, not to general intelligence. Intelligence, 40(1), 60-68.
Null sex differences in general intelligence among elderly. Personality and Individual Differences, 63, 53-57.

Also, this issue has already been discussed some. See, for example:

Sex differences on elementary cognitive tasks despite no differences on the Wonderlic Personnel Test. Personality and Individual Differences, 45(5), 429-431.

I'm not sure what your discussion adds.
Good find, Chuck. I was not aware of Pesta et al. In that case, my paper here serves no purpose. The idea was mainly to get someone to look at sex differences using ECTs. I will withdraw the paper.
Concerning the discussion on RT, I would recommend:

Neural transmission and general mental ability
Margaret McRorie, Colin Cooper

Jensen’s (1987) summary of 33 reaction time (RT)/IQ studies has reported average increasing correlations in the one-, two-, four-, and eight-choice conditions (r = -.19, -.21, -.24, and -.26, respectively).

So there does indeed seem to be some evidence for it. However, Jensen's meta-analysis appeared in Jensen, A. R. (1987). Individual differences in the Hick paradigm. In P. A. Vernon (Ed.), Speed of information processing and intelligence. Norwood, NJ: Ablex. The book, however, is too costly for most people to buy.

But if you want a good graph, read Jensen (2006), Clocking the Mind (p. 180). Mean reaction time increases with the number of bits, and mean RT-SD increases with the number of alternative responses. The total number of subjects in his meta-analysis was around 1,400.
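The pattern Jensen plots (mean RT rising with the number of bits of stimulus information) is the classic Hick's law relation. A minimal sketch in Python; the intercept and slope values here are purely illustrative, not estimates from Jensen's meta-analysis:

```python
import math

def hick_rt(n_alternatives, a=0.18, b=0.06):
    """Predicted mean RT (seconds) under Hick's law: RT = a + b * log2(n).

    a (baseline simple RT) and b (seconds per bit of information) are
    illustrative values, not Jensen's actual parameter estimates.
    """
    return a + b * math.log2(n_alternatives)

# RT increases linearly with bits: 1-, 2-, 4-, and 8-choice conditions
# correspond to 0, 1, 2, and 3 bits respectively.
for n in (1, 2, 4, 8):
    print(f"{n}-choice RT: {hick_rt(n) * 1000:.0f} ms")
```

Each doubling of the number of response alternatives adds one bit, hence a constant increment b to the predicted mean RT.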

Processing of Simple and Choice Reaction Time Tasks by Chicano Adolescents

Two reasons why I recommend that (old) article. First, because no one else (aside from the authors themselves) has cited it. Second, because they are aware of something a lot of researchers may not know: sometimes, when subjects perform RT tasks, they tend to anticipate (i.e., guess) the moment they have to move and react. Specifically, Jensen (1998) writes:

“Outlier” trials are usually eliminated. Response times less than about 150 milliseconds are considered outliers. Such outliers are excluded from analysis because they are faster than humans’ “physiological limit” for the time required for the transduction of the stimulus by the sense organs, through the sensory nerves to the brain, then through the efferent nerves to the arm and hand muscles. These fast outliers most often result from “anticipatory errors,” that is, the subject’s initiating the response just before the onset of the reaction stimulus. At the other extreme, slow response times that are more than three standard deviations slower than the subject’s median response time are also considered outliers. They usually result from a momentary distraction or lapse of attention. As outliers are essentially flukes that contribute to error variance, omitting them from the subject’s total score improves the reliability of measurement.

But now what? If you read most of the literature, you'll see that researchers usually don't mention this, as if they were totally unaware of the problem. And it's another plausible explanation for why RT studies are so controversial, with lots of conflicting results: there are a lot of bad studies. You should report the lowest RTs in your sample, and if they are lower than 150 ms, you should clean the data, e.g., with trimming or winsorizing procedures.
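Jensen's two trimming rules quoted above can be sketched in a few lines of Python; the session data here are made up for illustration:

```python
import statistics

def clean_rts(rts_ms, floor=150.0, sd_mult=3.0):
    """Apply Jensen's (1998) trimming rules: first drop anticipatory
    responses (faster than the ~150 ms physiological limit), then drop
    responses more than sd_mult SDs above the subject's median RT."""
    kept = [rt for rt in rts_ms if rt >= floor]
    med = statistics.median(kept)
    sd = statistics.stdev(kept)
    return [rt for rt in kept if rt <= med + sd_mult * sd]

# Illustrative session: 20 normal trials around 300 ms, plus one
# anticipation (120 ms) and one attention lapse (980 ms); both get trimmed.
trials = [295 + (i % 11) for i in range(20)] + [120, 980]
cleaned = clean_rts(trials)
```

The fast cutoff is applied first, so the anticipation errors do not inflate the SD used for the slow cutoff.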

Some researchers also seem unaware of the utility of RT-SD or IT-SD, because you rarely see these incorporated in their analyses, either as separate variables or through factor analyses or a latent variable approach. The fact that they always use student samples, and small samples, only worsens their outcomes. But even so, you can deal somewhat with measurement error if the test is lengthy enough and if you succeed in building a latent g. I'm not talking about a first-order g, but a second-order g. That's possible. RT-SD has different properties, and IT also measures something that RT does not measure. So each has some specificity.

Look at this study, for example.

Individual alpha peak frequency is related to latent factors of general cognitive abilities (Grandy 2013)

It says the same thing. Some older studies produced conflicting results: IAF either was or was not linked to IQ. They use SEM with 9 tests, extract 3 first-order (group) factors, and build their second-order g above these factors; the correlation is r = 0.40.

There is also another thing you should know about the RT-IQ correlation, one that is again almost never discussed. In Jensen's Clocking the Mind (p. 184), you'll see an inverted-U curve showing that the highest RT-IQ correlation is obtained when the complexity of the ECTs is at a medium level. Too difficult or too easy, and you'll kill your correlations. Thus Jensen summarizes:

The correlation is influenced by two conditions: (1) test complexity and (2) the mean and range of IQ in the subject sample, as the peak of the complexity function shifts to longer RTs as the mean IQ declines. Therefore, the significant RT–IQ correlations fall within a relatively narrow range of task complexity for various groups selected from different regions of the whole spectrum of ability in the population. Hence, when it comes to measuring general intelligence by means of RT there is probably no possibility of finding any single RT task with a level of task complexity that is optimally applicable to different samples that range widely in ability. The average RT–IQ correlation in the general population on any single task, therefore, represents an average of mostly suboptimal complexity levels (hence lower RT–IQ correlations) for most of the ability strata within the whole population.

The optimum level of task complexity for the IQ–RT correlation is hypothesized to occur near the RT threshold between error-free responses and error responses. This is the point on the complexity continuum beyond which RT becomes less correlated (negatively) with IQ and errors become increasingly correlated (negatively) with IQ.

I rarely see this particular point discussed in RT/IQ studies. Again, I don't like that, because it strongly suggests that a lot of researchers have poor knowledge of RT.

For race differences, I'm not surprised by the lower IQ differences. Psychometric tests usually measure a "knowledge component" which you don't have in RT/IT. In general, the RT difference is low because of range restriction and measurement error. Researchers usually correct for these artifacts, but I can't say I find the corrections optimal, because those two artifacts can not only lower the correlation but also change its sign. No one discusses this.
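The two standard corrections at issue can be written down explicitly. A sketch in Python; the observed correlation, the reliabilities, and the range-restriction ratio u are illustrative assumptions, not values from any particular study:

```python
import math

def disattenuate(r, rxx, ryy):
    """Spearman's correction for attenuation: the correlation between
    true scores, given the reliabilities of both measures."""
    return r / math.sqrt(rxx * ryy)

def correct_range_restriction(r, u):
    """Thorndike Case II correction for direct range restriction,
    where u = restricted SD / unrestricted SD."""
    return (r / u) / math.sqrt(1 + r**2 * (1 / u**2 - 1))

r_obs = -0.25   # illustrative observed RT-IQ correlation
r_dis = disattenuate(r_obs, rxx=0.70, ryy=0.90)     # ~ -0.31
r_unr = correct_range_restriction(r_obs, u=0.60)    # ~ -0.40
```

Note that neither formula can flip the sign of r, and a negative reliability estimate makes the square root undefined, so bad reliability data silently breaks these corrections.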

For example, Daniel Metzen (2012) conducted a meta-analysis of g and group differences in RT.

But as you'd expect, the reliabilities are not trustworthy: half of them are negative. I'm also not surprised that the meta-analytic correlation is so bad. Normally, for such meta-analyses to be strong, the effect sizes must not be too heterogeneous. Whether you look at SDr or %VE, it's the same: they tell you that the disparity in the correlations is too large to have much confidence in the results.
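What SDr and %VE boil down to can be shown with a bare-bones Hunter-Schmidt computation. A sketch in Python, with made-up study correlations and sample sizes (not Metzen's data):

```python
def hunter_schmidt_ve(rs, ns):
    """Percent of observed variance in correlations attributable to
    sampling error (%VE), plus the N-weighted mean r and residual SDr.
    Low %VE / high SDr means the effect sizes are too heterogeneous to
    put much confidence in the meta-analytic mean."""
    total_n = sum(ns)
    r_bar = sum(r * n for r, n in zip(rs, ns)) / total_n
    var_obs = sum(n * (r - r_bar) ** 2 for r, n in zip(rs, ns)) / total_n
    n_bar = total_n / len(ns)
    var_err = (1 - r_bar ** 2) ** 2 / (n_bar - 1)
    sd_res = max(var_obs - var_err, 0.0) ** 0.5   # residual SDr
    pct_ve = 100 * var_err / var_obs
    return r_bar, sd_res, pct_ve

# Four hypothetical studies with wildly different r's:
r_bar, sd_res, pct_ve = hunter_schmidt_ve(
    rs=[-0.10, -0.45, 0.05, -0.30], ns=[50, 120, 40, 80])
```

With effect sizes this scattered, sampling error accounts for well under half the observed variance, which is exactly the "disparity too large" situation described above.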

And someone here says that the race difference in CRT is also too low. But according to Jensen, the odd-man-out RT task is the most complex.

Concerning the article.

I have nothing to say against it. Some suggestions: it seems to me you have missed some other studies. Both failed to confirm a sex difference in g.

Sex differences on the WISC-R in Belgium and The Netherlands (van der Sluis 2008)
Multi-group covariance and mean structure modeling of the relationship between the WAIS-III common factors and sex and educational attainment in Spain (Dolan 2006)

John already said something like that, but I disagree with the idea that having conflicting results means all of them are equally relevant. The kind of method matters. Irwing also said it's due to methods. Here's what he says, specifically.

It has been shown by Molenaar, Dolan, and Wicherts (2009) that large samples are required to attain sufficient power in order to detect a mean difference in MGCFA models. Here, we have such a large sample, and in order to ensure sufficient power we carry out the analysis in the entire sample aged 23 years and older. A more profound difficulty is that most analyses have failed to separate out measurement issues from structural analyses. In doing so, authors have simply followed recommended practice (Chen, Sousa, & West, 2005). The problem is that for cross group comparisons to be valid scalar invariance must hold (Widaman & Reise, 1997). To establish scalar invariance multiple congeneric measures at the first order factor level are required (Widaman & Reise, 1997), but to date, no study including the current one, has had access to multiple measures. However, we adopt a somewhat novel solution by simply recognizing that testing of metric invariance is the most that we can achieve with only one measure for each construct.

Probably the most serious problem in validly testing for mean differences in MGCFA models is that factors are correlated, and therefore order of testing influences the conclusion. The problem is closely analogous to that presented by post hoc testing in multivariate analysis of variance. Here, in order to achieve an unambiguous conclusion, we present two solutions to this problem. The first followed the practice in stepdown analysis of prioritizing the order of testing according to a mixture of theoretical and practical criteria. We then used a Bonferroni correction in order to control for type 1 error. In the second, we used a Bi-factor model which removes the problem of correlated factors by orthogonalizing them.

I'm not entirely sure what he means after the first sentence of the first paragraph. But anyway, it's true that there still seems to be no consensus. However, looking at John's references and mine, you don't see this method applied. Irwing has not been replicated yet, because practitioners don't use the method he recommends.