
Meta-analysis of sex differences in intelligence

Submission status
Reviewing

Submission Editor
Noah Carl

Authors
Leonardo Parra
Emil O. W. Kirkegaard

Title
Meta-analysis of sex differences in intelligence

Abstract

There is no consensus within the field of psychology on whether there are sex differences in intelligence. To test whether there are, 2,092 effect sizes were gathered that measured differences in mental ability between men and women, representing 15,981,672 individuals. Men scored 2.58 IQ points (95% CI [1.91, 3.25], I^2 = 99.2%, k = 47) above women on general ability tests within adults. Whether this difference is due to general intelligence (g) is not clear, though it is likely. Two of the three methods used to test the developmental theory of sex differences suggested that the male advantage in ability increases with age.

 

Keywords
intelligence, IQ, gender, sex

Supplemental materials link
https://theuntangler.wordpress.com/2024/10/27/the-files/

Reviewers ( 0 / 1 / 0 )
Reviewer 1: Considering / Revise
Public Note
correcting minor typographic error

Tue 22 Oct 2024 21:26

Reviewer | Admin

I have to be brutally honest. I regularly read high-quality blogs from people with great skill and knowledge (some of them are even academics), and this article would make for a very good blog article. But as a paper, this is just horrible.

1. Introduction

The introduction has zero structure. A typical introduction must explain the core problem of the subject you are dealing with, the history, and perhaps the past and current debate, but it is absolutely necessary that you explain what the purpose of the present paper is and how it will contribute to (perhaps) narrowing the current debate about sex differences in IQ. Given that other researchers have conducted meta-analyses too, why not tell us in what way the new meta-analysis is useful, perhaps by using new and more moderators?

Since you mentioned different tests without explaining to the readers what these tests are (Wordsum comes to mind first), I suggest you discuss a little bit how sex differences can vary depending on the measurement. After all, you mentioned this point multiple times. This would help introduce these different tests and explain their specificities to the readers (eg, reliability, magnitude of group differences, measurement invariance, etc).

In the introduction, do not talk about methods and their results. Even if it's a simulation, move it to the results section or the discussion section. About that, and although it's beside the point, I disagree that the simple fact that sex differences in specific abilities are large makes it hard for MCV to test Spearman's hypothesis. Remember there are large race differences in specific abilities too. The key element here is that if the pattern of sex differences does not follow the expected gap (ie, the regression line) based on g loadings, then you don't expect g to be the driving factor of sex differences.
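To make the mechanics concrete, here is a minimal MCV sketch (the two vectors are made up purely for illustration):

    # Minimal MCV sketch: correlate subtest g-loadings with subtest sex differences.
    # The vectors below are hypothetical, purely for illustration.
    import numpy as np
    from scipy.stats import pearsonr

    g_loadings = np.array([0.78, 0.72, 0.69, 0.65, 0.60, 0.55, 0.50, 0.45])
    sex_d      = np.array([0.10, 0.05, 0.20, -0.15, 0.25, -0.30, 0.05, 0.00])

    r, p = pearsonr(g_loadings, sex_d)
    print(f"MCV: r = {r:.2f}, p = {p:.3f}")
    # If the differences were driven by g, they should rise with the g-loadings
    # (points close to the regression line, positive r); a near-zero r argues
    # against g being the driving factor.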

2. Method (I suggest you collapse materials and methodology under a single section Method, e.g., 2.1 Materials, 2.2. Methodology).

Since this is a meta-analysis, you need to follow the guidelines quite strictly. Make all of these details very explicit: the years covered by your search, the search engine you used, whether you included unpublished papers/dissertations, what keywords were used, etc. The exclusion of studies due to having poor cognitive tests needs to be explained better (you mentioned Wordsum but not why it's bad). Explain what a bad test is. Also say whether you checked the references cited within the included papers (if yes, mention it; if not, it's fine). I regularly write study reviews and I have conducted/participated in a few meta-analyses, and one regular task I do is check the references cited in the introduction and discussion of the papers included in the review or meta-analysis. This is because sometimes, even with a systematic search, I notice that I would miss a few (typically very rarely cited) papers which, fortunately, have been picked up by some authors.

You may want to include a flowchart of study inclusion and remove Table 2 in this case. 

You did not clarify which effect size metric you are using: Cohen’s d, Hedges’ g, or another? Were you using only descriptive statistics to compute effect sizes? I see you are converting effect sizes into IQ-metric. This needs to be explained in detail. 
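For what it's worth, if the conversion is the usual one (an IQ scale with SD = 15), it is simply IQ difference = 15 × d, so d = .17 corresponds to about 2.6 IQ points; but state this explicitly, along with which standard deviation (pooled, male, female, or total) the d values are standardized on.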

In meta-analysis, it's crucial to explain why you include this or that moderator. It has to be theoretically driven, hypothesized a priori. Each moderator has to be explained this way.
The test quality classification is not very convincing. What is the content of those academic achievement tests (reading, math, both, and how many of each)? Why was Raven given a rating of 4 and not lower or higher (also, Raven should be capitalized in your text)? And which ones? There are many Raven tests, and some are very short forms. If test quality is taken into account, it's hard to ignore test reliability, but somehow this is not mentioned here. You also did not mention that military tests are typically heavily crystallized. These tests are known to be unrepresentative of the cognitive abilities being assessed, and even if the correlation with other good tests (Wechsler, WJ, Kaufman, etc) is high, this doesn't in itself fix the problem of unrepresentative cognitive domains. Given that your classification is crude, I highly recommend that you use sensitivity analyses, perhaps by moving some of the middle categories up or down, because those in the middle are the ones most at risk of misclassification. Another possibility is to simply rerun the analysis with test quality as a moderator but with the Raven category removed, since the test is qualitatively different from the others. Typically, nearly all tests are somewhat biased toward crystallized ability, so Raven is the real outlier here.

Another detail you need to check is the variability of effect sizes (ie, sex differences) within each category of test quality, with and without controlling for test type.

What is the meaning of "tests testing one subtest exhaustively"? Did you mean cognitive dimension, rather than subtest?

As I said above, I am surprised there is no mention about measurement error. It is customary to correct for measurement error in meta-analysis of test scores. 
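For reference, the standard attenuation correction divides the observed effect by the square root of the reliability, d_corrected = d_observed / sqrt(r_xx); with r_xx = .80, an observed d of .17 would become roughly .19.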

Fig. 2 Total sample size by country, all abilities

What do you mean by "all abilities"?

First, a meta-analysis of sex differences in specific cognitive abilities was made within adults and children separately. This tests both the developmental theory of sex differences in intelligence (Lynn, 2021) and whether there are sex differences in specific abilities.

I would avoid starting the new section with "First,..." You should explain the overall strategic plan. Since you have 2 sets of analyses, why not explain why it is important to make that distinction and what your expectation is (ie, your working hypothesis).

Also, you need to tell us the age range of these children and adults. 

To avoid spurious findings, specific abilities where not enough samples (500) or effect sizes (2) within age subgroups were excluded

I would not remove small samples with a small number of effect sizes. If the analysis employs a weighting method, I don't see why you would exclude these samples. You could, however, use the weighting method and then compare its results with your trimming. You could also incorporate both sample size and the number of effect sizes as moderators and see if one shows a relatively stronger impact.

A second meta-analysis was conducted to test whether there is a sex difference in full scale ability using only the highest quality samples; these exclusionary criteria are available in Table 2, which reduced the number of effect sizes to 48 

In your table 2 though, I read 119 effect sizes for high quality tests.

If a study reported multiple effect sizes, these effect sizes were averaged into one effect size.

First, and this is true every time you mention "multiple effect sizes", you need to clarify that this means multiple effect sizes of the same cognitive domain (e.g., spatial, knowledge, technical knowledge, etc).

Second, averaging multiple effect sizes creates bias, at least theoretically. There are plenty of papers advising against this strategy. The best method is to use a hierarchical (ie, multilevel) meta-analytic model. However, this best method is not ideal if the number of studies with multiple effect sizes is very small. Make it very explicit why you wouldn't use multilevel models. How many studies report multiple effect sizes? It looks like it's 48 but I can't be sure.
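For concreteness, the multilevel model being suggested can be written (as a sketch) as y_ij = μ + u_j + w_ij + e_ij, where y_ij is the i-th effect size from study j, u_j ~ N(0, τ²_between) captures between-study heterogeneity, w_ij ~ N(0, τ²_within) captures within-study heterogeneity, and e_ij ~ N(0, v_ij) is the known sampling error. This keeps every effect size in the model while accounting for their dependence, rather than collapsing them by averaging.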

I'll let you decide whether to include such a moderator, but I believe it may be important: year of data collection. The reason is that there seems to be a tendency over time for test makers to erase anything that they suspect is group biased (race or gender). So you can check if there is any systematic fluctuation in the sex difference.

First, a meta-analysis was conducted within studies that tested full scale ability and the average age of the samples was examined as a moderator. Then, a second meta-analysis was conducted within all effect sizes that tested the effect of the average age of the sample on male advantages on tests, independent of the sex ratio, type of test, and the year that the study was conducted in.

Why not test for a possible nonlinear effect of age?

A big issue here is the wording: "within studies" and "within all effect sizes" could be more clearly distinguished to explain what makes these analyses distinct. Also, be consistent with your wording: "type of test" I suspect is test quality, but how do I know? It's confusing. Check every sentence in this paragraph (and earlier ones) and try using the same wording. Especially the second sentence which you need to rework entirely; I can't tell what you are trying to do here.

This wording is awkward: "tested the effect of the average age of the sample on male advantages on tests". It should be, e.g., "sex difference in test scores". Because otherwise it gives me the impression you have included only the studies that show an advantage for men.
Given that you use 3 meta-analyses (each using different conditions), you may want to explain how the conditions differ from one another and what kind of information (eg, advantages) they offer individually.

In the results, you said that the regression shows no publication bias. Again, if only you had clearly mentioned the techniques you were going to use, I wouldn't ask the question. I think I know what you did; most likely you used Egger's regression test (or Pustejovsky's version), or Peters' regression test. To sum up, all methods and techniques must be described and referenced properly in the method section. Try to add as much detail as possible (eg, packages used, textbooks if any). This is to ensure you keep a high level of rigour in your work. Don't make it look like it's a blog article, because it really feels like one.
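For what it's worth, the classic Egger test is just a regression of the effect sizes against their standard errors; a minimal sketch (the numbers are made up, and in practice one would use the implementation in the meta-analysis package rather than rolling one's own):

    # Egger's regression test, minimal sketch with hypothetical data:
    # regress the standardized effect (d / SE) on precision (1 / SE);
    # an intercept far from zero indicates funnel-plot asymmetry.
    import numpy as np
    import statsmodels.api as sm

    d  = np.array([0.10, 0.22, 0.15, 0.30, 0.05, 0.18])   # effect sizes
    se = np.array([0.04, 0.12, 0.06, 0.20, 0.03, 0.09])   # their standard errors

    X = sm.add_constant(1.0 / se)       # intercept + precision
    fit = sm.OLS(d / se, X).fit()
    print(fit.params[0], fit.pvalues[0])  # intercept estimate and its p value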

What about heterogeneity? I see no report on I² statistic or tau-squared τ². Those are highly recommended (if not mandatory) in meta-analyses. High heterogeneity may suggest unmeasured confounding.
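As a reminder of the definitions: I² = max(0, (Q − (k − 1)) / Q) × 100%, where Q is Cochran's Q and k is the number of effect sizes, i.e. the share of total variability attributable to between-study heterogeneity rather than sampling error; τ² is the estimated variance of the true effects (e.g. via REML). Both are reported by any standard random-effects fit.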

Adult men scored slightly higher in full scale ability (d = .17, p < .001) when all of the adult samples were pooled together, this difference remained within higher quality tests as well

This sentence should be split into two; I find it very hard to read. For instance, I don't know whether "when all of the adult samples were pooled together" refers to the earlier or the following statement.

Lastly, I would appreciate it if you gave more details on the models you have specified to account for moderators.

4. Discussion

Just like the introduction, it's lacking. Tons of papers are not discussed. One very important and quite recent is Reynolds' review. He concluded that sex differences exist in some specific abilities but not g. You need to discuss this.

Reynolds, M. R., Hajovsky, D. B., & Caemmerer, J. M. (2022). The sexes do not differ in general intelligence, but they do in some specifics. Intelligence, 92, 101651. doi: 10.1016/j.intell.2022.101651

sex differences in intelligence could also be influenced by other variables which causes the difference to go in different directions.

This sentence requires some references.

(Keith et al., 2008; Arribas-Aguila et al., 2019): .... Whether these differences would remain after using more sophisticated methods, such as correcting for violations of measurement invariance using latent models, has yet to be seen.

Actually both of these papers tested for measurement invariance (MI) in their MGCFA models. MI was tenable in Keith (only for their bifactor model though), but not in Arribas-Aguila.

One should also question whether a test can be created that does not have a sex bias due to the large group-factor differences.

Measurement bias has nothing to do with the magnitude of group differences in group factors. Intercept bias occurs whenever the subtest mean score differs between groups despite being perfectly equated on the latent factor score. Loading bias occurs whenever the importance of subtests within a latent factor differs between groups. Those are the two main common biases. In the large literature on sex differences in g using MGCFA, about half shows MI and about half rejects MI. In the large literature on black-white differences, with large gaps in specific abilities as well, MI is never rejected except in some underdeveloped countries.
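In CFA notation (just a sketch): for subtest j in group g, x_jg = τ_jg + λ_jg·η + ε_jg. Scalar invariance requires the τ_j and λ_j to be equal across groups; intercept bias is τ_j differing across groups with λ_j equal, and loading bias is λ_j differing. The latent means themselves are free to differ by any amount without violating either condition, which is why the size of the group-factor gaps is beside the point.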

Think of a classical regression, or IRT (or logistic regression). You use any subtest you wish as the dependent var, with sex, total test score, and their interaction as predictors. If the interaction is different from zero, there is bias. In logistic regression and IRT, you do the same thing but the item is the dependent var. The main idea is: conditioning on the latent score, you should not expect any "residual" group difference in any of the subtests that make up this latent factor. If there is one, you have the equivalent of "intercept" bias.
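A minimal sketch of that screen (the file and column names are hypothetical):

    # Regression screen for subtest bias, as described above.
    # Hypothetical data: one row per examinee with subtest_score, sex, total_score.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("battery.csv")
    m = smf.ols("subtest_score ~ C(sex) * total_score", data=df).fit()
    print(m.summary())
    # A non-zero sex main effect means a residual group difference remains after
    # conditioning on the total score (the analogue of intercept bias); a non-zero
    # sex-by-total interaction means the subtest tracks the total differently in
    # the two groups (the analogue of loading bias).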

Even if latent methods are used, it’s not clear whether it’s even possible to adjust for the effect of the group factors on the estimation of the overall difference

The bifactor model does exactly this, by specifying g as being independent of the group factors. In the hierarchical second-order g model, the relationship between g and the subtests is mediated by the first-order group factors. Keith et al (2008) used both models. In the hierarchical model, females outscored males by 1.21 points, and in the bifactor model by 3.51 points. It's not the first time I've seen bifactor models magnifying the difference instead of reducing it.

Moreover, when MGCFA is applied to black-White data, it's fully consistent with Spearman's hypothesis, and g explains most of the variance of any given subtest. But not when applied to data on sex differences. There is as yet no evidence that the sex IQ gap is mainly driven by differences in g. This is because not many papers have actually tested Spearman's hypothesis using MGCFA. 90% (and I'm being generous here) of researchers who used MGCFA to assess the g gap between the sexes never tested Spearman's hypothesis.

General remarks:

I strongly suggest you spend some time analyzing the structure of academic papers, in particular meta-analyses. Pick 10 or 20 papers and see what's common in every single one of them, and see how they present their result section, and do the same. That's what I do.

I appreciate the contribution here, as the subject is quite important, and so these analyses are extremely meaningful. But the presentation, once again, should emulate academic papers, not blog articles.

Everything in the glossary could (and should) be put in the main text or in a footnote.

The tables in the appendix appear important enough to be mentioned and discussed in the results section. All of these figures must be discussed and described in detail in the results section. As for the violin plot, explain what it is and the reasoning behind it in the method section. I have some concern about Table A9, since you never mentioned anywhere that you were also collecting g-loading data. This should be explained, once again, in precise detail. There are plots on other survey data such as the PNC, PT, etc, but how did you carry out those analyses? Did you use sampling weights? There are too many unanswered questions. Every single survey dataset used requires careful presentation.

Do not forget to share the supplemental materials by, e.g., uploading these files at OSF.

A few typos:

Test quality was was classified using seven categories

Author
Replying to Reviewer 1

I have to be brutally honest. I regularly read high-quality blogs from people with great skill and knowledge (some of them are even academics), and this article would make for a very good blog article. But as a paper, this is just horrible.

Sure...? I guess some things were not in the right order or not explained explicitly, but the order in which the results are presented is consistent with an academic paper. And I'm not sure how that makes it "horrible", in any case. Are these standards and regulations really that valuable?

1. Introduction

The introduction has zero structure. A typical introduction must explain the core problem of the subject you are dealing with, the history, and perhaps the past and current debate, but it is absolutely necessary that you explain what the purpose of the present paper is and how it will contribute to (perhaps) narrowing the current debate about sex differences in IQ. Given that other researchers have conducted meta-analyses too, why not tell us in what way the new meta-analysis is useful, perhaps by using new and more moderators?

Since you mentioned different tests without explaining to the readers what these tests are (Wordsum comes to mind first), I suggest you discuss a little bit how sex differences can vary depending on the measurement. After all, you mentioned this point multiple times. This would help introduce these different tests and explain their specificities to the readers (eg, reliability, magnitude of group differences, measurement invariance, etc).

The issue lies more with the differences in group factors being large, which makes the choice of which tests to include a determining factor in how large the composite difference is. I'll be more explicit about this in the introduction or the methodology.

In the introduction, do not talk about methods and their results. Even if it's a simulation, move it to the results section or the discussion section. About that, and although it's beside the point,

Fair enough. I'll shove the MCV simulations into the appendix.

I disagree that the simple fact that sex differences in specific abilities are large makes it hard for MCV to test Spearman's hypothesis. Remember there are large race differences in specific abilities too. The key element here is that if the pattern of sex differences does not follow the expected gap (ie, the regression line) based on g loadings, then you don't expect g to be the driving factor of sex differences.

I do not disagree that the MCV results imply that g does not drive the sex differences in mental ability. It does, however, mean that if a sex difference in g does exist, it will be much harder to test for.

2. Method (I suggest you collapse materials and methodology under a single section Method, e.g., 2.1 Materials, 2.2. Methodology).

Since this is a meta-analysis, you need to follow the guidelines quite strictly. Make all of these details very explicit: the years covered by your search, the search engine you used, whether you included unpublished papers/dissertations, what keywords were used, etc. The exclusion of studies due to having poor cognitive tests needs to be explained better (you mentioned Wordsum but not why it's bad). Explain what a bad test is. Also say whether you checked the references cited within the included papers (if yes, mention it; if not, it's fine). I regularly write study reviews and I have conducted/participated in a few meta-analyses, and one regular task I do is check the references cited in the introduction and discussion of the papers included in the review or meta-analysis. This is because sometimes, even with a systematic search, I notice that I would miss a few (typically very rarely cited) papers which, fortunately, have been picked up by some authors.

This will not be difficult to do. 

You may want to include a flowchart of study inclusion and remove Table 2 in this case. 

You did not clarify which effect size metric you are using: Cohen’s d, Hedges’ g, or another? Were you using only descriptive statistics to compute effect sizes? I see you are converting effect sizes into IQ-metric. This needs to be explained in detail. 

Both. But the vast majority of them are Cohen's d.

In meta-analysis, it's crucial to explain why you include this or that moderator. It has to be theoretically driven, hypothesized a priori. Each moderator has to be explained this way.
The test quality classification is not very convincing. What is the content of those academic achievement tests (reading, math, both, and how many of each)? Why was Raven given a rating of 4 and not lower or higher (also, Raven should be capitalized in your text)? And which ones? There are many Raven tests, and some are very short forms. If test quality is taken into account, it's hard to ignore test reliability, but somehow this is not mentioned here. You also did not mention that military tests are typically heavily crystallized. These tests are known to be unrepresentative of the cognitive abilities being assessed, and even if the correlation with other good tests (Wechsler, WJ, Kaufman, etc) is high, this doesn't in itself fix the problem of unrepresentative cognitive domains. Given that your classification is crude, I highly recommend that you use sensitivity analyses, perhaps by moving some of the middle categories up or down, because those in the middle are the ones most at risk of misclassification. Another possibility is to simply rerun the analysis with test quality as a moderator but with the Raven category removed, since the test is qualitatively different from the others. Typically, nearly all tests are somewhat biased toward crystallized ability, so Raven is the real outlier here.

I will, in fact, remove the classification of test quality altogether. It's not used as a continuous variable anywhere, and only serves to confuse the readers. It's only used as the inclusion criteria for the main meta-analysis.

Fig. 2 Total sample size by country, all abilities

What do you mean by "all abilities"?

"All abilities" meaning no test was excluded based on what it tested. I will remove the phrase as it will confuse the readers.

First, a meta-analysis of sex differences in specific cognitive abilities was made within adults and children separately. This tests both the developmental theory of sex differences in intelligence (Lynn, 2021) and whether there are sex differences in specific abilities.

I would avoid starting the new section with "First,..." You should explain the overall strategic plan. Since you have 2 sets of analyses, why not explain why it is important to make that distinction and what your expectation is (ie, your working hypothesis).

I'll mention the developmental theory in the introduction. I thought that would be too elementary, though I guess a lot of people don't know about it.

Also, you need to tell us the age range of these children and adults. 

Added. Adults are explicitly said to be 16 and over.

To avoid spurious findings, specific abilities where not enough samples (500) or effect sizes (2) within age subgroups were excluded

I would not remove small samples with a small number of effect sizes. If the analysis employs a weighting method, I don't see why you would exclude these samples. You could, however, use the weighting method and then compare its results with your trimming. You could also incorporate both sample size and the number of effect sizes as moderators and see if one shows a relatively stronger impact.

No, that's not what I did. I did not bother posting the sex differences in specific abilities where there was not enough data to make an accurate estimate of what the difference was. I'll try to make this more explicit in the second version.

A second meta-analysis was conducted to test whether there is a sex difference in full scale ability using only the highest quality samples; these exclusionary criteria are available in Table 2, which reduced the number of effect sizes to 48 

In your table 2 though, I read 119 effect sizes for high quality tests.

I think it was 119 effect sizes from 48 different studies. 

If a study reported multiple effect sizes, these effect sizes were averaged into one effect size.

First, and this is true every time you mention "multiple effect sizes", you need to clarify that this means multiple effect sizes of the same cognitive domain (e.g., spatial, knowledge, technical knowledge, etc).

Not in this particular case. I only averaged the effect sizes when they were measuring the same ability. 

I'll let you decide whether to include such a moderator, but I believe it may be important: year of data collection. The reason is that there seems to be a tendency over time for test makers to erase anything that they suspect is group biased (race or gender). So you can check if there is any systematic fluctuation in the sex difference.

I saw in the moderator regression that there was either no or little evidence of a time trend. I could post it in the appendix.

First, a meta-analysis was conducted within studies that tested full scale ability and the average age of the samples was examined as a moderator. Then, a second meta-analysis was conducted within all effect sizes that tested the effect of the average age of the sample on male advantages on tests, independent of the sex ratio, type of test, and the year that the study was conducted in.

Why not test for a possible nonlinear effect of age?

I think it's unnecessary, but I could explicitly test this with an ANOVA. 

A big issue here is the wording: "within studies" and "within all effect sizes" could be more clearly distinguished to explain what makes these analyses distinct. Also, be consistent with your wording: "type of test" I suspect is test quality, but how do I know? It's confusing. Check every sentence in this paragraph (and earlier ones) and try using the same wording. Especially the second sentence which you need to rework entirely; I can't tell what you are trying to do here.

It's not the test quality. 

This wording is awkward: "tested the effect of the average age of the sample on male advantages on tests". It should be, e.g., "sex difference in test scores". Because otherwise it gives me the impression you have included only the studies that show an advantage for men.

Sure.


Given that you use 3 meta-analyses (each using different conditions), you may want to explain how the conditions differ from one another and what kind of information (eg, advantages) they offer individually.

I genuinely don't think it matters. I posted different methods of testing the developmental hypothesis out of honesty, not inquiry.

In the results, you said that the regression shows no publication bias. Again, if only you had clearly mentioned the techniques you were going to use, I wouldn't ask the question. I think I know what you did; most likely you used Egger's regression test (or Pustejovsky's version), or Peters' regression test. To sum up, all methods and techniques must be described and referenced properly in the method section. Try to add as much detail as possible (eg, packages used, textbooks if any). This is to ensure you keep a high level of rigour in your work. Don't make it look like it's a blog article, because it really feels like one.

It was Egger's.

What about heterogeneity? I see no report on I² statistic or tau-squared τ². Those are highly recommended (if not mandatory) in meta-analyses.

From the abstract: "There is no consensus within the field of psychology on whether there are sex differences in intelligence. To test this hypothesis, 2,089 effect sizes were compiled, representing 15,976,369 individuals that tested sex differences in ability. Men scored 2.58 IQ points (95% CI [1.93, 3.23], I^2 = 99.2%) "

The results section: "Adult men scored slightly higher in full scale ability (d = .17, p < .001) when all of the adult samples were pooled together, this difference remained within higher quality tests as well (d = .17, 95% CI [.13, .22], I^2 = 99.2%, p < .00001)"

High heterogeneity may suggest unmeasured confounding.

Well, yes. It could also reflect the use of a large number of samples. The CIs are not indicative of large amounts of error.

Lastly, I would appreciate it if you gave more details on the models you have specified to account for moderators.

4. Discussion

Just like the introduction, it's lacking. Tons of papers are not discussed. One very important and quite recent is Reynolds' review. He concluded that sex differences exist in some specific abilities but not g. You need to discuss this.

Reynolds, M. R., Hajovsky, D. B., & Caemmerer, J. M. (2022). The sexes do not differ in general intelligence, but they do in some specifics. Intelligence, 92, 101651. doi: 10.1016/j.intell.2022.101651

I will say that little value will come from regurgitating the issue, but if you insist. 

(Keith et al., 2008; Arribas-Aguila et al., 2019): .... Whether these differences would remain after using more sophisticated methods, such as correcting for violations of measurement invariance using latent models, has yet to be seen.

Actually both of these papers tested for measurement invariance (MI) in their MGCFA models. MI was tenable in Keith (only for their bifactor model though), but not in Arribas-Aguila.

One should also question whether a test can be created that does not have a sex bias due to the large group-factor differences.

Measurement bias has nothing to do with the magnitude of group differences in group factors. Intercept bias occurs whenever the subtest mean score differs between groups despite being perfectly equated on the latent factor score. Loading bias occurs whenever the importance of subtests within a latent factor differs between groups. Those are the two main common biases. In the large literature on sex differences in g using MGCFA, about half shows MI and about half rejects MI. In the large literature on black-white differences, with large gaps in specific abilities as well, MI is never rejected except in some underdeveloped countries.

I am aware that there is a difference between intercept and loading bias. My contention was that if there are large sex differences in specific abilities, the general sum score difference will be biased by the specific differences. The magnitude of this bias will, of course, depend on the features of the battery, but it is still an inevitability.

Think of a classical regression, or IRT (or logistic regression). You use any subtest you wish as the dependent var, with sex, total test score, and their interaction as predictors. If the interaction is different from zero, there is bias. In logistic regression and IRT, you do the same thing but the item is the dependent var. The main idea is: conditioning on the latent score, you should not expect any "residual" group difference in any of the subtests that make up this latent factor. If there is one, you have the equivalent of "intercept" bias.

I don't think IRT bias testing works, but I'm not interested in arguing about it in a review.

Even if latent methods are used, it’s not clear whether it’s even possible to adjust for the effect of the group factors on the estimation of the overall difference

The bifactor model does exactly this, by specifying g as being independent of the group factors. In the hierarchical second-order g model, the relationship between g and the subtests is mediated by the first-order group factors. Keith et al (2008) used both models. In the hierarchical model, females outscored males by 1.21 points, and in the bifactor model by 3.51 points. It's not the first time I've seen bifactor models magnifying the difference instead of reducing it.

Moreover, when MGCFA is applied to black-White data, it's fully consistent with Spearman's hypothesis, and g explains most of the variance of any given subtest. But not when applied to data on sex differences. There is as yet no evidence that the sex IQ gap is mainly driven by differences in g. This is because not many papers have actually tested Spearman's hypothesis using MGCFA. 90% (and I'm being generous here) of researchers who used MGCFA to assess the g gap between the sexes never tested Spearman's hypothesis.

Honestly, I thought I was going to get pushback for being too agnostic on the issue.

Everything in the glossary could (and should) be put in the main text or in a footnote.

The tables in the appendix appear important enough to be mentioned and discussed in the results section. All of these figures must be discussed and described in detail in the results section. As for the violin plot, explain what it is and the reasoning behind it in the method section. I have some concern about Table A9, since you never mentioned anywhere that you were also collecting g-loading data. This should be explained, once again, in precise detail. There are plots on other survey data such as the PNC, PT, etc, but how did you carry out those analyses? Did you use sampling weights? There are too many unanswered questions. Every single survey dataset used requires careful presentation.

Do not forget to share the supplemental materials by, e.g., uploading these files at OSF.

Sure.

 

Reviewer | Admin

Are these standards and regulations really that valuable?

Of course, at least if you want to be taken more seriously by scholars. If you don't see why the paper is horrible in its presentation, again, analyze carefully how other such papers are organized and presented. There is a huge gap in quality compared with what is expected from an academic paper.

It does, however, mean that if a sex difference in g does exist, that it will be much harder to test for.

Your point was that the sex differences in group factors are large, yet the Black-White differences in these factors are as large, if not larger, and Spearman's hypothesis always holds for Black-White data but not for sex differences, because in the former comparison the differences vary as a function of g, but not in the latter. Why would the magnitude of group differences in group factor means preclude any relationship between g-loadings and group differences?

It's not used as a continuous variable anywhere, and only serves to confuse the readers.

That is a good decision, because reading that paragraph I expected you to use it as a continuous moderator (which I thought could be the "test type" mentioned further down in the text), but again I wasn't completely sure if that was the real intent.

I thought that would be too elementary, though I guess a lot of people don't know about it.

Yes, do this. And make sure you elaborate on the very first sentence of the introduction, because it does a very poor job of explaining the source of the problem. Saying that one group of researchers agrees and another group disagrees helps nothing. You need to show us whether the present meta-analysis could help narrow the debate on sex differences and solve some of the ongoing questions.

Speaking of the intro: "Analysis that employs the method of correlated vectors suggests". You should use plural, i.e., "analyses", "employ" and "suggest". And make sure you italicize g properly, because sometimes italics are used, sometimes not.

I genuinely don't think it matters.

I won't insist since that is not a crucial point with respect to the analysis. Yet it is customary to explain why you use a particular method. At least for a professionally written paper.

Mea culpa for the I². In my defense, though, the reason I didn't notice it (perhaps because it was sandwiched between several other statistics) is that I focused on the method section and saw that there was no mention of these tests being employed. Make sure you add references that describe why this is important in meta-analysis. Generally, references are sorely lacking.

Here's a recommendation: Display every important result in tables/figures. If you think there are too many Tables/figures, just ignore the least important ones, eg, robustness tests if they are not showing a result different from your main analysis. If on the other hand, some of your additional (including robustness) tests show an important pattern, definitely include them in your tables/figures output. Results that aren't important (such as, not providing additional information on top of your main analyses) can be restricted to the main text or perhaps supplementary files. This is to ensure that readers can quickly look through your figures and tables and check the important output, without having to carefully check the wall of text.

My contention was that if there are large sex differences in specific abilities, that the general sum score difference will be biased by the specific differences.

As I explained, the detection of MI has nothing to do with the magnitude of group differences. To quote myself: "Intercept bias occurs whenever the subtest mean score differs between groups despite being perfectly equated on the latent factor score." You can have quite varying patterns of group differences, and that is what typically happens in real data. But MI is established as long as a given score on any subtest reflects the same level of the latent factor for both groups. Intercept differences only reflect systematic influences that are unrelated to the common factors. Intercept differences are not differences in latent factor means or subtest means; they are unrelated.

As long as the test battery is not affected by "sampling bias" which typically results from unrepresentative sampling of tests, the latent g difference should be valid.

With respect to g, I recommend you read Molenaar et al. (2009). They showed that the main factors that influence the power to detect a g difference are the size of the g difference (which is obvious), the sample size, and the strength of the positive manifold. And although the residualized mean differences in the first-order latent factors were found to negatively affect power, their impact clearly diminished as the strength of the positive manifold increased (after all, you don't want to employ MGCFA if the subtest correlations are small). Note that they talk about the residual factor means in the higher-order g model (ie, the differences not accounted for by g, not the size of the latent factor means). Moreover, intercept differences didn't impact the power to detect g at all.

Molenaar, D., Dolan, C. V., & Wicherts, J. M. (2009). The power to detect sex differences in IQ test scores using multi-group covariance and means structure analyses. Intelligence, 37(4), 396-404.

Author

Changelog:

- Latent vs observed differences in sex ability discussed in intro/discussion.

- Most results in the appendix were removed to avoid clog

- Project Talent g-loading / sex*age results were discussed in the discussion section

- Publication bias test specified to be Egger's.

- Tests excluded for being low quality (WORDSUM, UKBIOBANK fluid intelligence tests) were identified

- The Wechsler 'comprehension' test was mistakenly labelled as being a test of reading comprehension, when in reality it is closer to a "common sense" or social cognition test. One more sample (Keith analysis of WJIII) was added, which increased the total number of effect sizes to 2092.

- Abstract shortened

- Meta-analytic search procedure was described in more detail

- 'metafor' package was cited

- Terms in the glossary were explained in the text and the glossary itself was removed.

- Adults specified to be 16+

- Linear vs nonlinear models of the developmental theory of sex differences were explicitly tested.

 

Bot

Authors have updated the submission to version #2

Reviewer | Admin

 

There is some improvement, especially regarding the clarity of some sentences in the method section, but that’s about it. Most of my concerns have not been addressed. There is still a big step before I can consider this submission a quality one. In its current state, it’s still far from being acceptable.

representing 15,900,000 individuals that tested

This is obviously not the exact number, so you should write this correctly: "about 15,900,000 individuals".

there was no sex difference in intelligence in a large sample of Scottish children born between 1922 and 1932 (Deary et al., 2003). This consensus was then contested by Richard Lynn, who noted that the sex difference in intelligence was a function of age, with there being no difference at the age of 12 and small male advantage of 3-4 points in adulthood (Lynn, 1994),

The statement “was then contested” implies an answer to the earlier statement, which is not possible since Lynn’s 1994 study was published before Deary 2003, not after.

Most subsequent work was able to replicate the developmental effect (Nyborg, 2005; Colom & Lynn, 2004), with a few exceptions that measured intelligence at a latent level (Reynolds et al 2008; Keith et al, 2008).

By “Most subsequent work” I expected more than just 2 studies, and especially more recent ones as well. Moreover, how can you justify the word “most” on the basis of only two studies, when the following statement, “with a few exceptions”, also relies on two studies? It’s not very logical.

which tests whether the g-loadings (loading on the first general factor of mental ability) of the subtests are correlated with the associations the individual subtests have with another variable

That’s incorrect. MCV tests the relationship between the group difference on any given subtest and its g-loading.

if this is the most powerful way to test for a difference, as there are large sex differences in specific abilities that are not related to general ability 

As I have told you before, you still have to explain why MCV would fail to detect a relationship between g-loading and group difference if there is a large difference in specific abilities. Why does the black-white difference always support SH but the sex difference does not, when the black-white difference is even larger?

the results of such studies were summarized in Reynolds et al (2002)

2022

that 5/7 found a small female advantage

Write instead: 5 out of 7

 In some cases, this holds (Keith et al., 2008) 

MI holds only for their bifactor model, not their higher-order factor model. And when I said it held for the bifactor, there were in reality a few biased intercepts. But just a few, which warrants the conclusion that the test is generally fair.

This is relevant as violations of scalar measurement invariance imply that the subtests are biased measurements of general ability between groups, leading to mismeasured group differences.

Psychometricians always employ partial scalar models whenever MI doesn't hold. In this case, the equality constraints on the biased subtests are released, which leaves the latent g difference unaffected.

This meta-analysis will primarily serve to test the developmental theory of sex differences, which proposes that there is no sex difference in mental ability within younger teenagers, but that a difference emerges as all of the subjects finish developing.

This needs to connect better with the previous paragraphs. You summarized the disagreement about sex differences, but you still have to explain why your study is important in light of the current debate, e.g., “Given the aforementioned controversy, it is necessary to detect any systematic pattern by disentangling possible confounding…”.

sex differences in raven’s matrices

Raven should be capitalized. Always. Make sure you check all instances where it happened. Because it's all over the place.

When calculating the differences in these batteries, factor scores were used to measure intelligence and sampling weights were not used.

Which dataset had a cognitive battery? Certainly not the GSS. As I've asked before, give a description of these datasets and the cognitive variables used, because the description is still lacking. You can, of course, give the details of the variables in a supplementary file, but at least say in the main text what tests are used in each dataset, and if batteries are used, how many subtests they have and what kinds of abilities they primarily measure.

 

Furthermore, how are these factor scores obtained? Unrotated first factor, I assume? Which extraction method? Be more explicit. And what about the Wordsum in the GSS? It’s a single test (composed of 10 items). Did you also use factor scores? Which procedure did you use then? In my earlier comment, I observed that you used the IQ metric in some of your figures, but factor scores typically have a mean/SD of 0/1, so you must have converted to the IQ metric afterwards. Why is this, yet again, not explained clearly?

Intersex and transgender individuals were removed from the analyses when possible. If there was no sex variable, self-reported gender identity was used as a proxy for it.

I assume this has to do with your own analysis of the datasets mentioned, but make sure this is made explicit in the text.

 Given that IQ tests are typically highly reliable (Rinaldi & Karmiloff-Smith, 2017), correcting for test unreliability was deemed unnecessary.

That is not an acceptable answer. The paper actually reports reliability by age for cognitive tests such as the Binet, WISC and WAIS. This is not what you are using. Test reliabilities must be dealt with case by case. Let me illustrate why this is important. For instance, you used the Project Talent (PT) data, but if you look closely at this paper:

Major, J. T., Johnson, W., & Deary, I. J. (2012). Comparing models of intelligence in Project TALENT: The VPR model fits better than the CHC and extended Gf–Gc models. Intelligence, 40(6), 543-559.

You will observe that the reliabilities are quite low, very low. This is less of a problem for latent models, but that's not what you are using. So, when calculating the observed total score of the PT battery, you cannot make the assumption that the sex difference is not affected at all by differential reliability issues, just because some other tests are reliable. So, report the reliability for all of these subtests in each dataset. This can be reported in a supplementary file but in the main text, at the very least, report whether the reliabilities are high or not. Also, if you still decide not to correct for unreliability, make a clear statement about this limitation.

Again with respect to the Project Talent data, you will notice that Major et al. used “only” 37 subtests that they deem “cognitive” tests. Yet in your appendix, you show 59 subtests. This is curious, as I would like to know what you used, and why you would consider these additional tests as cognitive tests, unlike Major et al (2012). Moreover, get the numbers right. In your appendix figure for the PT, it shows r = .61 and n = 59, but in your main text you said it’s 61 subtests.

Included in the analysis were scholastic tests (e.g. PISA), achievement tests (e.g. SweSAT, SAT), and IQ tests (e.g. WAIS). 

There is a strong disconnect between this paragraph and the first (as well as second) paragraph of this section. I do not see, for instance, how this sentence relates to the first paragraph, which mentioned that you analyzed a variety of datasets such as PISA, GSS, NLSY, Add Health. Speaking of the latter, did you use the public or the private data? (The public version has a much smaller sample.) And which wave of this data?

 

I notice you are still not using a flowchart instead of this unconventional Table 1 (used as inclusion criteria). Is there any reason for not using a flowchart? Have you examined other meta-analyses, as I have suggested? They almost all use a flowchart. This has become strongly recommended, especially by PRISMA standards.

 the nonquantitative ones are the sample type (e.g. college students), country, test type (e.g. WAIS-IV), and ability (e.g. spatial reasoning).

You have to describe in detail the categories of your moderators, instead of just naming one of them. What are the other sample types, what are the other tests? For countries, since there are so many, you can refer to your figure 1 (which now has disappeared in the new version).

Both tests were excluded due to their brevity and lack of items.

A short test does not always imply that the reliability is low. If the test employs a computerized adaptive testing procedure, it can accurately reflect participants’ true scores across all ranges even with a short form. As I’ve mentioned before, report the reliability estimates for these tests.

 

Can you report the means for the moderators displayed in your table 2?

Studies that tested the mental abilities of both adults and children and did not report the effect sizes

It should be “but did not report the effect sizes”

The differences in specific abilities where there were not enough samples (500) or studies (2) were not posted

Use whichever word you want, but remove “posted”. 

which takes heterogeneity when calculating the mean differences.

which takes heterogeneity into account when calculating the mean differences. 

Speaking of heterogeneity, you still haven’t described I² in your method section: what it does and what it tells us.

To avoid age and country segregation from biasing the results, effect sizes from individual samples that separated results by age and country were combined into one effect size

...

If a study reported multiple effect sizes, these effect sizes were averaged into one effect size, a process which was only done if said effect sizes were testing the same abilities. This is to avoid spurious publication bias that could arise from studies with smaller or larger differences reporting more effect sizes than the average study.

You still ignored my previous comment about this method. Please answer it.

and a regression test was used to assess whether there was publication bias in the meta-analysis.

Specify this is Egger's test.

Besides this, the developmental theory of sex differences was formally tested using several different methods. First, a meta-analysis was conducted only within studies that tested full scale ability and the average age of the samples was used as a moderator. Then, a second meta-analysis was conducted within all samples that tested the effect of the average age of the sample on the sex difference in intelligence independent of the ability it was testing. Last, a meta-analysis of studies that reported effect sizes for separate age groups was conducted to test for whether the effect existed within the same sample. 

Although the clarity has greatly improved here, you still ignored my earlier request. What particular strengths/weaknesses does each of these methods have? Because they certainly do not overlap. More generally, as I’ve explained before, each type of moderator analysis is supposed to test a particular model or theory. You never attempted to explain any of this.

 

To improve the readability of Figure 4, I suggest either removing “(children)” from or adding “(adult)” to these rows. Removing is probably best, as this is already described in the title of Figure 4. Still regarding Figure 4, describe what this “common sense” ability is. It may be better to label it “social cognition” if that is close to the intended meaning, because “common sense” is vague.

An age effect was found within the samples that tested full scale ability (b = .0021, p < .001), and when all tests were analyzed, an age effect was found even when the ability tested was controlled for (b = .0031, p < .001). Studies that explicitly tested the developmental theory by comparing sex differences within age groups also had an age effect, though it did not pass significance testing (b = .0007, p = .19).

I assume age is in its original scale, but what about the dependent variable? This would help in interpreting these unstandardized coefficients. I would also not interpret the latter result as an effect, as the coefficient seems extremely small (regardless of the p value), likely not even worth considering.
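For example, if the outcome is on the d scale and age is in years, b = .0031 would amount to roughly .0031 × 15 ≈ .05 d (under one IQ point) over a 15-year age span, and b = .0007 to about .01 d; but without the scales being stated, this is only a guess, which is exactly the problem.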

according to the egger’s regression

Egger, not egger. This problem needs to be handled very carefully, because it’s widespread throughout the paper.

Fig. 6 Male advantage in mental ability by age group. Samples with ages of above 20 were set to 20. The 95% CI is shaded in grey.

Is there a good rationale for constraining the age range at 20? This is a bad methodology. 

Prior literature which found variance in sex differences in international scholastic test scores by country were replicated in this study.

I don’t like the term variance employed here. Why not heterogeneity?

Analyses done on the project talent battery, which tested about 370,000 adolescents using 61 different cognitive tests, finds that tests with a high baseline male advantage are also the ones that come to favour them more as they mature.

First, it’s Project Talent, not project talent. Second, why didn't you mention PT in the method section at all? You likely refer to your appendix figure A2, but why don’t you mention it, along with the numbers? Speaking of A2, are these tests expressed in z scores (ie, standardized)? Also, the description in the title is not clear.

What is Table A1 about? You still didn’t provide any information. Is this for the PT? Discuss the results of Tables A1 and A2 in the main text (results section).

The sex difference in each individual subtest that was administered was calculated, and then the interaction between sex and age was calculated, and placed on the x axis. Individuals who did not have data on age were excluded from the analysis, and scores on the subtests were controlled for age within the whole cohort.

Using age as a predictor means you are controlling for age, so the phrase “were controlled for age” in the subsequent sentence strikes me as odd, because it could mean you are referring to something else entirely. Try to rephrase this.

they may tend to systematically bias their test items so as to minimize sex differences

If this is a horse, call it a horse, but don’t call it a deer. Bias in the jargon typically means a nuisance ability not intended to be measured by the psychometric test. So you should use another word, but not “bias” in this context, because it’s misleading. Here’s a suggestion: “they tend to select their test items so as to minimize sex differences”.

 

Regarding the appendix, what is the violin plot? I’ve asked before to introduce it in the method section.

 

Make sure you italicize g consistently (or not at all if you prefer), because it’s extremely inconsistent throughout your text. There are too many cases of non-italic g.

 

Finally, I asked for the supplementals to be uploaded on OSF, not on your blog. This is not very professional. Do you imagine scholars reading your paper, expecting the supplementals to be uploaded at OSF, as is the standard, but instead being directed to your personal blog? You might think “This is not very professional” is a bit harsh, but I can tell you: others would think no less. When uploading to OSF, can you also upload the code used for the analysis? The guideline at OP, if you haven’t forgotten, is that you upload your materials (supplementary analyses and outputs, code; and data if possible) at OSF, and only at OSF.

Moreover, you should give the link to the supplementals in the text. Readers would assume, since you haven’t mentioned any supplementals in the article, that there are none. They won’t bother checking the OP journal.

 

In your supplementals, you should display the link for the GATB, PIAAC, PISA, PT, etc. that you used. There are some additional datasets that you have used, though they require a link. The absence of a link suggests you haven’t relied upon other authors’ papers (which all have a link to them), but perhaps you did your own analysis. If this is truly the case, it means you have access to these datasets (not named here), and you may need to provide more information and explanation about that.

 

In the supplementals, a lot of information is lacking. What is the “weight” column all about? Why is there no description and explanation? What do the values of test quality really mean, now that your revised submission has removed the information about these values? Are the values of “6” those labeled “high quality” in your text? You need to display, either in the supplementary files or (probably better) in the main text, the descriptive statistics of these columns. Display the mean, median and SD for year of publication, age of sample, sex ratio, and sample size, as well as the proportion or number of each sample type (ie, how many “college student” samples do you have, how many “general population”, etc), test type (how many are Raven matrices, and which ones: SPM or APM, short form or not, and does it matter for your moderation; how many are WAIS-IV, how many WAIS-R, how many are “composite” scores, etc), and ability type (how many are scholastic, nonverbal, full scale, etc). Recall that this is what I asked for in my earlier comment. Why has this not been done?

 

In your supplementals, some values are missing, eg, for the Latent column. Especially toward the end.

 

Sowers et al 2001 mentioned in your reference section doesn’t appear in the main text. Same with Lokeshwar et al 2021, Haier 2023, Fawns-Ritchie & Deary 2020, Dykiert et al 2009. Perhaps even a few more I’ve missed. Make sure you scrutinize very carefully your entire list of references.

 

Author
Replying to Reviewer 1

 

There is some improvement, especially about clarity in some sentences in the method section, but that’s about it. Most of my concerns have not been addressed. There is still a big step before I can consider this submission a quality one. At this current state, it’s still far from being acceptable. 

representing 15,900,000 individuals that tested

Fixed.

 

there was no sex difference in intelligence in a large sample of Scottish children born between 1922 and 1932 (Deary et al., 2003). This consensus was then contested by Richard Lynn, who noted that the sex difference in intelligence was a function of age, with there being no difference at the age of 12 and small male advantage of 3-4 points in adulthood (Lynn, 1994),

The lack of a difference on the test was known for some time, and I cited Deary since it was the source that was the easiest to search for. I didn't bother looking for other citations and just used different ones.

By “Most subsequent work” I expected more than just 2 studies and, especially, more recent ones as well. Moreover, how can you justify the first statement, “most”, when it relies on only two studies, while the following statement, “with a few exceptions”, also relies on only two studies? It's not very logical.

Sure. Dug up a few more studies that supported the developmental hypothesis.

That’s incorrect. MCV tests the relationship between the group difference on any given subtest and its g-loading.

No. The method of correlated vectors can also test whether the association between IQ and a continuous variable (e.g. income) is on g.

As I have told you before, you still have to explain why MCV would fail to detect a relationship between g-loading and group difference if there is a large difference in specific abilities. Why does the black-white difference always support SH but not sex difference,

If you take Blacks and Whites and test their general reading comprehension, the observed difference in the group factor will be about .9 SD or so. But this is largely because Blacks and Whites differ in general intelligence, and general intelligence causes the ability to comprehend written language. Why this is thought to be the case is secondary. There are some small race differences in specific abilities independent of this; if I recall, Black people overperform on memory tests and underperform on abstract reasoning tests relative to what would be expected from how well those tasks correlate with intelligence.

Sex is a bit more complicated. Women perform better on processing speed, social cognition, and reading ability tests, while men perform better on spatial tasks, mathematical ability, and general knowledge. Because the sex differences in observed group factors of intelligence go in different directions depending on the ability in question, this means that there must be (relevant) sex differences in group factors of intelligence independent of the general difference. Even if there is a small difference in general intelligence, it will be harder to detect because the relationship between the g-loading vector and the sex differences will be attenuated by the presence of the differences in group factors.

when the black-white difference is even larger?

I don't understand your logic here. A larger difference is easier to detect than a smaller one.
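To illustrate the attenuation point above, here is a minimal simulation sketch (Python; the number of subtests, the loadings, the size of the g gap, and the sizes of the specific differences are all hypothetical). With large, offsetting group-factor differences, the correlation between g-loadings and observed sex differences is close to zero on average even though a small g gap exists; with small specific differences, the same g gap yields a much larger correlation.

import numpy as np

rng = np.random.default_rng(0)
n_subtests, true_g_gap, reps = 40, 0.15, 2000   # hypothetical values

def mean_mcv_correlation(specific_sd):
    """Average MCV correlation (g-loadings vs. observed differences) over many simulated batteries."""
    rs = []
    for _ in range(reps):
        loadings = rng.uniform(0.3, 0.9, n_subtests)                             # hypothetical g-loadings
        diffs = true_g_gap * loadings + rng.normal(0, specific_sd, n_subtests)   # g-driven part + specific differences
        rs.append(np.corrcoef(loadings, diffs)[0, 1])
    return np.mean(rs)

print(mean_mcv_correlation(0.30))  # large offsetting specific differences: near zero on average
print(mean_mcv_correlation(0.02))  # small specific differences: much larger on average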

2022

Fixed.

Write instead: 5 out of 7

Fixed.

MI holds only for their bifactor model, not their higher order factor model. And when I said it held for the bifactor model, there were in reality a few biased intercepts. But just a few, which warrants the conclusion that the test is generally fair.

Fixed.

Psychometricians always employ partial scalar models whenever MI doesn't hold. In that case, the biased subtests are no longer constrained to equality across groups, which leaves the latent g difference unaffected.

Sentence deleted.

This needs to connect better with the previous paragraphs. You summarized the disagreement about sex differences, but you still have to explain why your study is important in light of the current debate, e.g., “Given the aforementioned controversy, it is necessary to detect any systematic pattern by disentangling possible confoundings…”.

Shifted a few paragraphs around.

Raven should be capitalized. Always. Make sure you check all instances where it happened. Because it's all over the place.

... It happened twice.

Which dataset had a cognitive battery? Certainly not the GSS. As I've asked before, give a description of these datasets and the cognitive variables used, because the description is still lacking. You can, of course, give the details of the variables in a supplementary file, but at least say in the main text what tests are used in each dataset, and, if using batteries, how many subtests there are and what kinds of abilities they primarily measure.

No, the GSS did not. Sure, I'll add a description.

 

Furthermore, how are these factor scores obtained? Unrotated first factor, I assume? Which extraction method? Be more explicit. And what about the Wordsum in the GSS? It’s a single test (composed of 10 items). Did you also use a factor score for it? Which procedure did you use then? In my earlier comment, I observed that you used an IQ metric in some of your figures, but factor scores typically have a mean/SD of 0/1, so you must have converted them to the IQ metric afterwards. Why is this still not explained clearly?

I assume this has to do with your own analysis of the datasets mentioned, but make sure this is made explicit in the text.

Sure.
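For concreteness, one common way to produce such scores (not necessarily the procedure used in the paper; the library, extraction settings, and column names are illustrative) is to extract a single unrotated factor and rescale the scores to the IQ metric afterwards:

import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

def iq_factor_scores(subtests: pd.DataFrame) -> np.ndarray:
    """Extract one unrotated factor from a table of subtest scores and
    rescale the factor scores to the IQ metric (mean 100, SD 15)."""
    fa = FactorAnalysis(n_components=1)
    scores = fa.fit_transform(subtests.to_numpy()).ravel()
    z = (scores - scores.mean()) / scores.std(ddof=0)
    return 100 + 15 * z

# e.g. iq = iq_factor_scores(df[["vocab", "matrices", "arithmetic"]])  # hypothetical subtest columns

Whatever the actual procedure was, spelling it out at this level of detail in the method section would remove the ambiguity.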

That is not an acceptable answer. The paper actually reports reliability by age for cognitive tests such as the Binet, WISC and WAIS. This is not what you are using. Test reliabilities must be dealt with case by case. Let me illustrate why this is important. For instance, you used the Project Talent (PT) data, but if you look closely at this paper:

Major, J. T., Johnson, W., & Deary, I. J. (2012). Comparing models of intelligence in Project TALENT: The VPR model fits better than the CHC and extended Gf–Gc models. Intelligence, 40(6), 543-559.

You will observe that the reliabilities are quite low, very low. This is less of a problem for latent models, but that's not what you are using. So, when calculating the observed total score of the PT battery, you cannot make the assumption that the sex difference is not affected at all by differential reliability issues, just because some other tests are reliable. So, report the reliability for all of these subtests in each dataset. This can be reported in a supplementary file but in the main text, at the very least, report whether the reliabilities are high or not. Also, if you still decide not to correct for unreliability, make a clear statement about this limitation.

Again with respect to the Project Talent data, you will notice that Major et al. used “only” 37 subtests that they deemed “cognitive” tests. Yet in your appendix, you show 59 subtests. This is curious, as I would like to know what you used, and why you would consider these additional tests cognitive tests, unlike Major et al. (2012). Moreover, get the numbers right. In your appendix figure for the PT, it shows r=.61 and n=59, but in your main text you said it’s 61 subtests.

I grouped the PT tests into composites. These were the omega total reliabilities for each:

General knowledge: .96

Memory: .53

Processing speed: .70

Verbal ability: .91

Mathematical ability: .81

Spatial ability: .82

Composite score: .88

Only two are particularly low; fortunately, in the meta-analysis they're dwarfed by performances on the Wechsler subtests. I'll clarify that this will lead to the differences in group factors of intelligence being attenuated, while the difference in the full scale scores will not be.
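If a correction for unreliability were reported (or at least acknowledged as a limitation), the standard classical-test-theory disattenuation for a standardized mean difference is

d_\text{corrected} = d_\text{observed} / \sqrt{r_{xx}}.

As a purely illustrative example using the memory composite's omega of .53: an observed d of .20 would correspond to roughly .20 / \sqrt{.53} \approx .27.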

There is a strong disconnect between this paragraph and the first (as well as the second) paragraph of this section. I do not see, for instance, how this sentence relates to the first paragraph, which mentioned that you analyzed a variety of datasets such as PISA, GSS, NLSY, Add Health. Speaking of the latter, did you use the public or the private data? (The public version has a much smaller sample.) And which wave of the data?

Private.

 

I notice you are still not using a flowchart instead of this unconventional Table 1 (used as criteria). Is there any reason for not using a flowchart? Have you examined other meta-analyses, as I have suggested? They almost all use a flowchart. This is now strongly recommended, especially by PRISMA standards.

Convenience and efficiency; it's easier to create.

You have to describe in detail the categories of your moderators, instead of just naming one of them. What are the other sample types, what are the other tests? For countries, since there are so many, you can refer to your figure 1 (which now has disappeared in the new version).

Sure.

A short test does not always imply the reliability is low. If the test employed a computerized adaptive testing procedure, the test can accurately reflect participants’ true score across all ranges even with a short form. Like I’ve mentioned before, report the reliability estimates for these tests.

Well, it will not necessarily have low reliability. I do not know how strongly the reliability of a test correlates with the number of items it has; if you could refer me to a peer reviewed study on the matter then maybe it could be of use.

I managed to track down somebody else's estimate of the test-retest reliability of the UK Biobank Fluid intelligence test. It was .6 to .69.

https://www.researchgate.net/publication/329332353_Are_Bigger_Brains_Smarter_Evidence_From_a_Large-Scale_Preregistered_Study see page 48
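For what it's worth, the classical link between test length and reliability is the Spearman-Brown formula: if a test with reliability \rho is lengthened by a factor k with parallel items, the expected reliability is

\rho_k = \frac{k\rho}{1 + (k - 1)\rho}.

Purely as an illustration, doubling a test with reliability .60 would be expected to yield (2)(.60) / (1 + .60) = .75, other things being equal; adaptive tests, as noted above, are a separate case.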

 

Can you report the means for the moderators displayed in your table 2?

Studies that tested the mental abilities of both adults and children and did not report the effect sizes

It should be “but did not report the effect sizes”

Sure.

Use whichever word you want, but remove “posted”. 

Sure.

which takes heterogeneity into account when calculating the mean differences. 

Speaking of heterogeneity, you still haven’t described I² in your method section somehow, what it does, what it tells.

Sure.
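For the method section, the usual definition, based on Cochran's Q, is

I^2 = \max\!\left(0,\ \frac{Q - (k - 1)}{Q}\right) \times 100\%,

where k is the number of effect sizes; it estimates the proportion of the observed variation in effect sizes that reflects between-study heterogeneity rather than sampling error.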

You still ignored my previous comment about this method. Please answer it.

The reason I averaged the effect sizes that came from the same study was to avoid the test for publication bias being confounded by some studies reporting separate effect sizes by age. If I recall correctly, the test for publication bias showed a bias against reporting male advantages in ability. This disappeared when I averaged each study's effect sizes into one "larger" effect size. I then did the same when analyzing the differences in the group factors of ability, because the estimation of standard errors was biased by the fact that some studies/datasets reported a massive number of effect sizes (e.g. the PISA and PIRLS datasets).

There are plenty of papers advising against this strategy

In this particular case, not averaging together the effect sizes from each individual study would be very misleading. I refuse to do otherwise, as that would reflect a lack of integrity.

If necessary, I could post a version of the main results that does not use this averaging method.

Specify this is Egger's test.

I didn't notice that I mentioned the test for publication bias in the methods section.

Although the clarity has greatly improved here, you still ignored my earlier request. What particular strengths/weaknesses does each of these methods have? Because they certainly do not overlap. More generally, as I’ve explained before, each type of moderating analysis is supposed to test a particular model or theory. You never attempted to explain any of this.

I guess I could add a post-hoc explanation as to why the results of each method could differ. If the developmental effect is a result of attrition bias within the same studies, then the effect will show up consistently within the same study, but not between studies. If the developmental effect varies by subtest/ability, then it would be best to ignore the results that do not exclude effect sizes that do not measure full scale differences.

 

For improving the readability of Figure 4, I suggest either removing “(children)” or adding “(adult)” to the remaining rows. Removing is probably best, as you already describe this in the title of Figure 4. Still on Figure 4, describe what this “common sense” ability is. Maybe label it “social cognition test” if that is close to its meaning, because “common sense” is vague.

"Common sense" was chosen as the best label to apply to the comprehension test of the Weschler, as what ability the test measures isn't obvious; even if it is measuring social cognition to some degree, men still tend to score higher on the test, so that it cannot be what it is measuring in totality.

I guess the colours do work well enough to allow the reader to distinguish between what tests are administered to children and to adults.

I assume age is in its original scale, but what about the dependent variable? This would help in interpreting these unstandardized coefficients. I would also not interpret the latter result as an effect, as the coefficient seems extremely small (regardless of the p value), likely not even worth considering.

SD

Egger, not egger. This problem needs to be handled very carefully, because it’s widespread throughout the paper.

Sure.

Is there a good rationale for constraining the age range at 20? This is a bad methodology. 

There isn't. It's an arbitrary decision. I could set it at 18. If the comment is referring to constraining at all, the issue is that some of the older samples also include some children in their analyses, which could lead to an artificial increase in the male advantage in general ability as the sample means go from 20 to 30. If it's absolutely necessary I could post a version with no constraints in the appendix.

I don’t like the term variance employed here. Why not heterogeneity?

I must have half-rewritten that sentence.

First, it’s Project Talent, not project talent. Second, why didn't you mention PT in the method section at all? You likely refer to your appendix figure A2, but why don’t you mention it, along with the numbers? Speaking of A2, are these tests expressed in z scores (ie, standardized)? Also, the description in the title is not clear.

Forgot to mention A2 in the body.

What is Table A1 about? You still didn’t provide any information. Is this for the PT? Discuss the results of Tables A1 and A2 in the main text (results section).

Clarified it was PT

Using age as a predictor means you are controlling for age, so when saying “were controlled for age” in the subsequent sentence, it strikes me as odd because it could mean you are referring to something else entirely. Try to rephrase this.

It is odd. But even if age is controlled for, the developmental theory can still be tested because it refers to the difference in the growth in intelligence between men and women.

If this is a horse, call it a horse, but don’t call it a deer. Bias in the jargon typically means a nuisance ability not intended to be measured by the psychometric test. So you should use another word, but not “bias” in this context. Because it’s misleading. Here’s a proposition: “they tend to select their test items so as to minimize sex differences”.

Sure.

Regarding the appendix, what is the violin plot? I’ve asked before to introduce it in the method section.

It provides the reader an intuitive sense of how spread out the sex differences are between and within subtests. 

Make sure you put italics on g consistently (or not all if you don’t want). Because it’s extremely inconsistent throughout your text. There are too many cases with non-italics g.

Sure.

Finally, I have asked for the supplementals to be uploaded on OSF, not on your blog. This is not very professional. Do you imagine scholars reading your paper and expecting the supplementals to be on OSF, as is the standard, but instead being directed toward your personal blog? You might think “This is not very professional” is a bit harsh, but I can tell you: others would think no less. When uploading to OSF, can you also upload the code used for the analysis? The guidelines at OP, if you haven’t forgotten, are that you upload your materials (supplementary analyses and outputs, code, and data if possible) to OSF, and only to OSF.

I got banned from OSF, so I uploaded it to my side blog.

Moreover, you should give the link to the supplementals in the text. Readers would assume that, since you haven’t mentioned any supplementals in the article, there are none. They won’t bother checking the OP journal.

Sure. I guess I'll make a new account then.

 

In your supplementals, you should display the link for the GATB, PIAAC, PISA, PT, etc. that you used. There are some additional datasets that you have used, though they require a link. The absence of a link suggests you haven’t relied upon other authors’ papers (which all have a link to them), but perhaps you did your own analysis. If this is truly the case, it means you have access to these datasets (not named here), and you may need to provide more information and explanation about that.

The supplement reveals all datasets and studies that I used; the methodology I used to analyze the datasets was also fairly uniform. 

Note: I am being intentionally vague here. 

In the supplementals, tons of information are lacking. What is the “weight” column all about? Why is there no description and explanation? What do the values of test quality really mean, now that your revised submission has removed the information about these values? Are these values “6” those labeled “high quality” in your text? You need to display either in some of the supplementary files (or in the main text, probably better) the descriptive statistics of these columns. Display the mean, median and SD for year of publication, age of sample, sex ratio, sample size, as well as the proportion or number of each sample type (ie, how many “college student” do you have, how many “general population” do you have, etc), test type (how many are Raven matrices, which one, SPM, or APM, short form or not and does it matter for your moderation; how many are WAIS-IV, how many WAIS-R, how many are “composite” scores, etc), ability type (how many are scholastic, nonverbal, full scale etc). Recall this is what I have asked in my earlier comment. Why is this not done?

I'll post a different version that explains all of these things in a separate sheet.

 

In your supplementals, some values are missing, eg, for the Latent column. Especially toward the end.

In some cases I forgot to label samples as measuring observed ability (0) and not latent. These were relabelled in the meta-analysis.

 

Sowers et al 2001 mentioned in your reference section doesn’t appear in the main text. Same with Lokeshwar et al 2021, Haier 2023, Fawns-Ritchie & Deary 2020, Dykiert et al 2009. Perhaps even a few more I’ve missed. Make sure you scrutinize very carefully your entire list of references.

 

Very well.

Bot

Authors have updated the submission to version #3

Author

Changelog:

- Uncited papers removed

- Abstract reworded

- Introduction paragraphs were shuffled around

- Analysis procedure identified for most datasets

- Clerical errors in file corrected. Results of the study were largely the same.

- Maps of sample size by country were re-added.

- Reasoning for testing the developmental hypothesis using multiple methods of analysis was elaborated on.

- Dot size now scales with sample size in Fig 5

- Analyses in appendix now discussed in discussion section.

- Limitations section added.

- Reasoning for making the violin plot placed in the methodology section.

- Still banned from OSF, so will not bother resubmitting the effect sizes spreadsheet

- Wording of most of the article changed for increased clarity.

 

Bot

Authors have updated the submission to version #4

Reviewer | Admin

I am generally OK with the structure of the introduction though it’s still brief.

The lack of a difference on the test was known for some time, and I cited Deary since it was the source that was the easiest to search for. I didn't bother looking for other citations and just used different ones.

If you don't change the references, at least change the sentence, because now it's very awkward. Your text currently looks like this: “some intelligence researchers have now contested this consensus with the developmental theory of sex differences in intelligence (Lynn, 1994).” I propose you remove the “now” as it’s still awkward. And if you use “researchers” in the plural, you should add other authors, maybe Irwing (2012), because again, this is awkward otherwise.

No. The method of correlated vectors can also test whether the association between IQ and a continuous variable (e.g. income) is on g. 

You are correct that MCV has some other applications, but our discussion revolved around the test of Spearman’s hypothesis, i.e., whether group differences are due to g, since in your text you are still saying “An early method of testing this hypothesis was the method of correlated vectors”. In this case, it’s about correlating group differences and g loadings.

Because the sex differences in observed group factors of intelligence go in different directions depending on the ability in question, this means that there must be (relevant) sex differences in group factors of intelligence independent of the general difference. 

Yes you have said this before. But as I argued, you still need to show that a positive relationship between group gaps and g-loadings cannot be achieved when there are large group differences in subtests that tend to balance out in their direction (e.g., near-zero total test score differences in the case of sex gaps due to male and female advantages canceling each other out).

Jensen’s discussion about MCV in The g Factor made it very clear that restriction in g loading range can attenuate the correlation. When you think about it, this observation can be extended to group differences. If you think smaller sex group differences across subtests lead to correlations deviating more from zero, I would say this is not obvious and it may well lead to the opposite outcome. You cannot expect a large correlation using MCV when the sex group differences don’t vary much across subtests. 

You won’t want to cite what I am about to say here, but I have tested SH before (and even recently) by using the absolute value of sex differences rather than the signed ones. It didn’t lead to a much different outcome: for the couple of analyses I’ve tried, it didn’t produce a large (or even modest) positive relationship between g loading and group differences. The difference in interpretation here would be “as g loading increases so does the sex difference” rather than “as g loading increases so does the male advantage”. Your criticism is more relevant to the second statement, much less so with regard to the first statement.

MGCFA can help you see through this intricate mess. Look again at Keith et al. (2008) and compare their Tables 6 and 8. I told you earlier the g gaps were 1.21 and 3.51 IQ points for the HOF and BF models respectively. I probably should have told you this as well: the sex differences in group factors in the BF model were much, much bigger than in the HOF model. In other words, the gender gap in g was larger even though the gender gap in group factors was also larger. Regardless, the discrepancy can be explained by the fact that the BF model orthogonalizes the latent factors, whereas the HOF model does not. The implication here is that, when properly modeled, you can easily find a g gap between sexes despite large differences across group factors.

The only artifact that may cause g not being detected is selection bias in items, which seems to occur in traditional IQ tests, which is why it is best to test sex differences in aptitude tests as well, because they don’t account for sex differences by removing any items displaying large sex differences.

Regarding this matter, I would like to make this clear. I merely want to see if we can reach an agreement here. But regardless of your conclusion, whatever you say will not affect whether I accept or decline the submission. In the past, I have accepted papers at OP despite disagreeing with authors on theoretical grounds. I accept disagreement, unless the matter is extremely serious (such as depicting the main analysis in a very misleading way). I think the most important element should be methodological.

I don't understand your logic here. A larger difference is easier to detect than a smaller one. 

We are talking about effect size. Your argument is that the sex gap in g must be small if specific factors show a large sex gap. My counterpoint was: the BW gaps in specific factors are also very large.

These were the omega total reliabilities for each

It is good you showed me these values, but why are they not reported in the paper, along with whether they impact the results or not? For instance, you said “I'll clarify that this will lead to the differences in group factors of intelligence being attenuated, while the difference in the full scale scores will not be” but you forgot to mention this in the text.

if you could refer me to a peer reviewed study on the matter then maybe it could be of use.

This paper (p 1013) showed that the reliability of Wordsum is improved a bit by merely changing the composition of the item set (without changing test length). 

Cor, M. K., Haertel, E., Krosnick, J. A., & Malhotra, N. (2012). Improving ability measurement in surveys by following the principles of IRT: The Wordsum vocabulary test in the General Social Survey. Social science research, 41(5), 1003-1016.

The reason I averaged the effect sizes that came from the same study was to avoid the test for publication bias being confounded by some studies reporting separate effect sizes by age. If I recall correctly, the test for publication bias showed a bias against reporting male advantages in ability. This disappeared when I averaged each study's effect sizes into one "larger" effect size. I then did the same when analyzing the differences in the group factors of ability, because the estimation of standard errors was biased by the fact that some studies/datasets reported a massive number of effect sizes (e.g. the PISA and PIRLS datasets).

Your first statement is fair, and you can make this statement in the discussion section. But I need to say this: by using averaging, the variation within studies is lost, and the true relationship between effect size and standard error will likely be obscured. You can run both the multilevel and the averaging approaches and compare the results, including with respect to publication bias. I would say that if the bias disappears only by using averaging and not after using a multilevel model, perhaps it means that your method conceals publication bias rather than addressing it. The reason I recommend multilevel is that it’s a much more accurate method. If bias disappears after using a more accurate method, I see no problem. But if bias disappears by using a method that distorts standard errors, then that’s an issue. Your second statement also looks fair. But multilevel can also address this problem, with the added benefit of providing accurate standard errors.

I eyeballed your spreadsheet and my first impression is that you likely have enough studies with multiple effects to conduct such a superior analysis. Again, I am not forcing you here. As I’ve said, you may decide not to use multilevel, but the problem with aggregating effects must be acknowledged.

Here are a few articles I highly recommend reading about biases introduced by averaging:

Cheung, M. W. L. (2014). Modeling dependent effect sizes with three-level meta-analyses: a structural equation modeling approach. Psychological Methods, 19(2), 211–229.

Moeyaert, M., Ugille, M., Natasha Beretvas, S., Ferron, J., Bunuan, R., & Van den Noortgate, W. (2017). Methods for dealing with multiple outcomes in meta-analysis: A comparison between averaging effect sizes, robust variance estimation and multilevel meta-analysis. International Journal of Social Research Methodology, 20(6), 559–572.
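To make the standard-error point concrete, here is a minimal sketch (Python; the equal standard errors and the single within-study correlation are hypothetical) of the variance of an averaged effect when the component effect sizes are correlated. Treating them as independent understates it, which is exactly what distorts the effect-size/standard-error relationship that publication-bias tests rely on.

import numpy as np

def var_of_averaged_effects(se, r):
    """Variance of the mean of k within-study effect sizes whose sampling
    errors share a common pairwise correlation r (equal-variance case)."""
    se = np.asarray(se, dtype=float)
    k = len(se)
    v = np.mean(se**2)
    return (v / k) * (1 + (k - 1) * r)

se = [0.10, 0.10, 0.10, 0.10]             # four effect sizes from one hypothetical study
print(var_of_averaged_effects(se, 0.0))   # ~0.0025: what treating them as independent assumes
print(var_of_averaged_effects(se, 0.7))   # ~0.0078: what correlated outcomes actually imply

A three-level model handles this dependence directly instead of requiring a correlation to be assumed for each study.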

I could add a post-hoc explanation as to why the results of each method could differ

I don’t recommend resorting to post-hoc explanation, since hypotheses should be made clear at the start, not after seeing the results, but in your situation, it’s better than nothing.

If the comment is referring to constraining at all, the issue is that some of the older samples also include some children in their analyses, which could lead to an artificial increase in the male advantage in general ability as the sample means go from 20 to 30

This detail is not trivial but it wasn’t mentioned in the main text. I told you before to take great care of the details of your method. All details. Regarding the age restriction range, in this case, use different specifications as sensitivity analyses. You said age 18. Maybe add age 22 or 24 as well, but most importantly use a specification with no constraints. If it's not biasing the results, the estimates should be consistent.

It is odd. But even if age is controlled for, the developmental theory can still be tested because it refers to the difference in the growth in intelligence between men and women. 

Yes, I don’t doubt it, but my point is that your sentence was confusing.

Sure. I guess I’ll make a new account then.

You can also ask your co-author to store everything in his account. 

You can decide not to do it, or not to create another account (which you said in your last response) because it’s not convenient, but in this case, I find it very hard to accept your submission. I see papers as professional work, and in this case, nothing can be taken lightly just because it’s “convenient” not to comply with the rules of the game. (This also applies to the flowchart as well). Yes, I know paper submission is hard work and time consuming, but that’s how it is.

Presentation is crucially important. Otherwise, scholars will never bother to cite your study or even take it seriously, and in that case, why should I bother reviewing it, and why bother publishing it? The more you deviate from the “norm”, the more negatively your paper will be viewed among scholars (regardless of where the paper is published).

To be (brutally) honest, the initial submission would be a desk rejection in nearly all other journals, because it deviates so much from what a professional work typically looks like, even more so given the high standards of meta-analyses. If any editor accepted it, then I think he should be fired. This is to show you how much the paper has to improve to meet standard quality, and why I was so concerned by the initial submission. It’s clearly better now and I’m somewhat happy with the direction, but there is still room for improvement.

In any case, once you upload the additional material (code + data + data description) I’ll examine these materials and I will provide you with what is likely my final advice.

The supplement reveals all datasets and studies that I used

What I said is that some data lack a source and links, i.e., the cells are empty within the “author & year” column. I am thinking of the GATB in particular. Links are missing for many of the tests employed. The same goes for the study name, e.g., for PISA the study name should be PISA, etc.

 

Regarding your updated paper:

Within the NLSY79, the ASVAB was administered to 11,914 respondents in 1981.

You mentioned the number of participants here who took the test, but not for the other datasets. You need to be consistent. Either you display it in the main text for all studies (probably most recommended), or not at all.

Within the NLSY97, the same methodology was used, but the differences in performance within each ability and age group (12, 13, 14, 15, 16-18) were calculated.

The problem is that you mentioned that in the NLSY79 there are 10 subtests, but you know there are 12 in the NLSY97, so you should mention this. Also, why is the difference calculated within separate age groups in the NLSY97 but not the NLSY79? You should use the same method.

The scores on all times for both tests into one composite score.

There must be some missing element in this unfinished sentence.

Within wave 1, the ages of the participants were segregated

I prefer separated, as segregated sounds weird to me even though it means the same thing.

The Programme for the International Assessment of Adult Competencies administered tests…

You should add PIAAC in parentheses. There are some concerns with the method described in that same paragraph:

- Regarding PIAAC, your method strikes me as odd. Although this concerns only a few countries, their general ability was measured differently (“the standardized difference in the composite of numeracy and literacy was calculated instead”), while for many other countries, which had also completed the problem solving test, the three tests were used together to produce a g factor score. But this means the scores between these countries are no longer comparable, as one is a factor score of 3 tests and the other is a composite score of 2 tests. I suggest you use a 2-subtest factor score for all countries as a robustness analysis. Or consider whether imputation is reasonable given your missingness pattern, also checking whether the relationships between the three tests are comparable across countries.

- Regarding PISA, you mentioned you are not using g scores. Why the inconsistency?

On average, men performed better than women (d = .039, n= 31,950, p < .001). 

I don't know why you report these results in the method section, and only for the Wordsum test, which is even more weird.

A dataset of 28,699 employees who took the GATB was privately sent to the authors

In your data spreadsheet, there is no source (ie, author or link) mentioned for lots of tests (ie, they have empty cells), including this GATB. Anyone looking at this file will immediately think this is suspicious.

In the Project Talent, the 61 subtests … Then, the sex difference in each ability at the ages of 13, 13, 15, 16, 17, and 18 was calculated.

You still haven’t answered my earlier comment. Why 61 subtests? You need to justify this. Look at Major et al.; they used 37. Regarding the computation of the sex difference by age group, you again need to keep consistency. And provide an explanation as to why you use such a method (eg, testing Lynn’s hypothesis). There is also a typo, with 13 being repeated.

If there was no sex variable, self-reported gender identity was used as a proxy for it.

This is convoluted. The sex variable is typically self-reported. If there are different measures of sex, you should mention and explain this.

the WORDSUM, a 10 item multiple choice vocabulary test, and …

The Wordsum and its data were presented properly earlier in the method section, but not the cognitive test and data mentioned afterwards in the sentence I quoted above. If there are other unmentioned datasets, mention those too.

I think I told you earlier about Figure 3. The description of each row must keep a degree of clarity and consistency. Since you already mention in the figure’s title that red/black stands for children/adults, you don’t need to specify (C) for children in each row; or, if you do, then also add (A) for adults to the remaining rows.

I see you are now mentioning Egger’s test, but your text needs to be properly referenced, and this is especially important with respect to statistical methods.

Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. BMJ, 315(7109), 629–634.
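For reference, Egger's test amounts to a simple regression of the standardized effect on precision; a minimal sketch (Python via statsmodels; the arrays d and se stand in for the meta-analytic effect sizes and their standard errors):

import numpy as np
import statsmodels.api as sm

def eggers_test(d, se):
    """Egger's regression test: regress d/se on 1/se; an intercept that
    differs from zero indicates funnel-plot asymmetry."""
    d, se = np.asarray(d, dtype=float), np.asarray(se, dtype=float)
    fit = sm.OLS(d / se, sm.add_constant(1.0 / se)).fit()
    return fit.params[0], fit.pvalues[0]   # intercept and its p-value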

The next sentence mentions the moderator analyses, displayed in the Appendix, but I’ve told you to discuss these results, and it’s still not done. Either present these results in the results section or present them in the appendix, under/above the relevant figures/tables. Also, with respect to figures/tables, Figures A1-2 are put in bold, but not Tables A1-2. You need to be consistent.

Adult men scored slightly higher in full scale ability … Publication bias in favor … no visual signs of publication bias.

This entire paragraph should be split into two, as it currently combines two distinct analyses: one on sex gaps in full-scale IQ and another on publication bias. Proper presentation requires that each analysis be discussed in its own paragraph to maintain clarity and focus. 

Regarding Figure 4, I would note in the main text that although there’s no observable publication bias, the data points show no funnel-shaped pattern. It kind of matters, because a more decisive conclusion regarding publication bias requires not only symmetry but also an appropriate funnel-shaped pattern in the data points. To me, it suggests there is some heterogeneity.

Figure 4 displays some labels I haven’t seen mentioned anywhere: “No ID”, “IST”, “NIZ IQ test”, “DRT-B”, “GAMA”, etc. Those are too numerous to list. So give a proper description either in the main text or in a supplementary file. As I’ve said before, details are very important.

Then, restricted cubic splines were used to calculate the non-linear relationship between the two variables.

What do you mean by “between the two variables”? Clarify, because I fail to see what you’re referring to, even considering the sentence before that one. I assume it must be sex gap and test score.
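Assuming the two variables are indeed the sex gap and sample age, a minimal sketch of such a spline fit (Python; patsy's cr() gives natural cubic splines, which are essentially restricted cubic splines; the data frame and column names are hypothetical, and the actual analysis would presumably be a precision-weighted meta-regression rather than plain OLS):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.uniform(5, 60, 200),     # mean sample age per effect size (hypothetical)
                   "d": rng.normal(0.1, 0.2, 200)})    # sex difference in d units (hypothetical)

# natural (restricted) cubic spline of age with 4 degrees of freedom
fit = smf.ols("d ~ cr(age, df=4)", data=df).fit()
print(fit.summary())

Stating explicitly which variables enter the spline, and with how many knots or degrees of freedom, would remove the ambiguity.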

You discussed Hyde’s earlier meta-analyses 1988 & 1990. In the first half of this paragraph, you seem to suggest that their 1988 meta-analysis did not concur with your own results, apparently due to verbal ability being so broadly defined, such that “heterogeneity in results is not unexpected as there is no reason to think that the sex difference within these subtypes of verbal ability tests should be the same”. But for this answer to be fully convincing you need to identify which tests are the cause of this discrepancy between their studies and yours. Let’s say you measure verbal with A, B, C, D and they measure verbal with A, B, E, F, the differences in results would likely stem from the inclusion of E and F. But if their sex gap estimates of A and B differ widely from yours, then this is not solely due to verbal ability being measured with non-overlapping sets of tests.

In the second half, you said that Hyde et al. 1990 found support for Lynn’s developmental theory because “independent of selectivity, age still had an association with the sex difference in mathematical ability”. However I didn’t see where they did such an analysis. Their Table 4 indeed displays sex gaps by age groups, but selectivity was not accounted for. Instead, each table accounts for one moderator effect at a time. For instance, table 4 accounts for age as moderator, table 4 for ethnicity as moderator, table 6 for selectivity as moderator. Moreover, their finding of an age*sex interaction is at best a very weak support for Lynn’s hypothesis because of huge heterogeneity (explained in their text and displayed in their Table 4). Computation and Concept showed no male advantage for all groups but it is also true these tests lack samples for ages 19-25 and 26+ whereas Problem Solving showed a sex*age interaction consistent with Lynn’s expectations, yet there is no difference in effect sizes between ages 15-18 and 19+ which could be suggestive that the lack of sex difference in Computation and Concept at age 15-18 could be true for age group 19+ as well. The only pattern that is undeniably consistent with Lynn’s is the age*sex interaction for “All studies” instead of cognitive domains separately. So considering the entire bulk of results, I think the heterogeneity of effects and some missing data at later ages prevent strong inferences. Your conclusion might be somewhat right, but it should be more balanced by considering the heterogeneity across math domains.

This meta-analysis found that men score higher in tests of mathematical ability by about .3 SD, which does not corroborate results from a previous meta-analysis (Hyde et al., 1990).

“This meta-analysis” must refer to Hyde & Linn (1988), but when you say “from a previous meta-analysis” while referencing Hyde et al. (1990), which was published later, not before, you should remove “previous” because it’s very awkward.

I would recommend putting a dot right after .3 SD, and then write something like “However, another meta-analysis (Hyde et al., 1990) found a different result and argued that the gender difference…”. But it’s up to you.

This meta-analysis argued that the gender difference in mathematical abilities were a result of selective samples […]

Typo: it should be “was” if the subject “gender difference” is singular.

In some of these cases, such as Pezzuti & Orsini (2016), the observed difference in intelligence is of roughly the same magnitude as the latent difference, so it would be misleading to say that the use of latent methods is responsible for the discrepancy in results.

This would imply that observed total scores and latent g scores don’t have to be different. Yet they should be different. This is because if you assume that the gender gap reflects true intelligence, then the gap at the latent g level should be magnified because it estimates only the true variance, unlike observed scores. 
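One way to make this explicit: under a single-factor model in which the entire group difference lies on the factor, the expected observed composite difference is attenuated roughly as

d_\text{observed} \approx d_\text{latent} \times \sqrt{\omega},

where \omega is the proportion of the composite's variance attributable to the factor. So if the observed and latent gaps come out nearly equal, either \omega is close to 1 or part of the observed gap does not run through the factor.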

I don’t have anything more to say about the discussion section, since the other paragraphs look fine.

Reviewer | Admin

Additional notes (previous message was too long to be posted fully).

 

Throughout the text, your referencing is inconsistent, here’s an example:

Arribas-Aguila et al., 2019; Bakhiet et al., 2015), with a few exceptions that could not (Reynolds et al 2008; Keith et al, 2008).

in Reynolds et al (2022)

Keith et al (2008)

DeCarli et al., 2023; Eliot et al., 2021; Ritchie et al., 2018

(Pluck et al., 2012) 

So sometimes it’s et al. and sometimes it’s et al

 

In your Excel file, Column C "Latent" has empty cells from line 2075 through 2149. This is the same problem as in your old file. In row 2149 there is a typo in the author’s name: “Spinpath et al. 2008” should be Spinath.

Author

Let me be clear.

Generally, when two people disagree on a matter, it is either because one of them has made a mistake in reasoning or because one of them is motivated to reason in a way that brings them to a conclusion they already hold. Because of that underlying issue, I am deliberately trying to signal indirectly that I am not doing the latter by revealing parts of the process that would normally remain at my discretion -- e.g. posting that the WORDSUM test that I excluded had a male advantage; or even making decisions that do not inflate the male advantage (or even deliberately deflate it) -- e.g. choosing to average the effect sizes because it does not reveal publication bias against male advantages.

As for the whole latent level debate, I think it is unimportant and said that it could be an avenue of future research for professional purposes. Despite (admittedly?) not being an expert in statistics, Hanania (https://www.richardhanania.com/p/are-men-smarter-than-women) has a nice criticism of the latent and MCV literature -- it's possible for there to be a null MCV finding despite the existence of a group difference, and the findings in the latent studies depend on the method of analysis.

If you don't change the references, at least change the sentence, because now it's very awkward. Your text currently looks like this: “some intelligence researchers have now contested this consensus with the developmental theory of sex differences in intelligence (Lynn, 1994).” I propose you remove the “now” as it’s still awkward. And if you use “researchers” in the plural, you should add other authors, maybe Irwing (2012), because again, this is awkward otherwise.

Fixed.

It is good you showed me these values, but why are they not reported in the paper, along with whether they impact the results or not? For instance, you said “I'll clarify that this will lead to the differences in group factors of intelligence being attenuated, while the difference in the full scale scores will not be” but you forgot to mention this in the text.

Fixed.

This detail is not trivial but it wasn’t mentioned in the main text. I told you before to take great care of the details of your method. All details. Regarding the age restriction range, in this case, use different specifications as sensitivity analyses. You said age 18. Maybe add age 22 or 24 as well, but most importantly use a specification with no constraints. If it's not biasing the results, the estimates should be consistent.

I'll just add a version that is unwinsorised in the Appendix.

You can decide not to do it, or not to create another account (which you said in your last response) because it’s not convenient, but in this case, I find it very hard to accept your submission. I see papers as professional work, and in this case, nothing can be taken lightly just because it’s “convenient” not to comply with the rules of the game. (This also applies to the flowchart as well). Yes, I know paper submission is hard work and time consuming, but that’s how it is.

I did say I would create another account...

What I said is that some data lack a source and links, i.e., the cells are empty within the “author & year” column. I am thinking of the GATB in particular. Links are missing for many of the tests employed. The same goes for the study name, e.g., for PISA the study name should be PISA, etc.

In many cases it is intentional, e.g. there is no study name for the PISA differences since they were self-calculated.

You mentioned the number of participants here who took the test, but not for the other datasets. You need to be consistent. Either you display it in the main text for all studies (probably most recommended), or not at all.

Fixed.

This is convoluted. The sex variable is typically self-reported. If there are different measures of sex, you should mention and explain this.

I can't recall why that section was added, to be frank.

I see you are now mentioning Egger’s test, but your text needs to be properly referenced, and this is especially important with respect to statistical methods.

Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. BMJ, 315(7109), 629–634.

Seems rather unnecessary, but it was added anyway.

Regarding Figure 4, I would note in the main text that although there’s no observable publication bias, the data points show no funnel-shaped pattern. It kind of matters, because a more decisive conclusion regarding publication bias requires not only symmetry but also an appropriate funnel-shaped pattern in the data points. To me, it suggests there is some heterogeneity.

>I^2 = 99.2%

Figure 4 displays some labels I haven’t seen mentioned anywhere: “No ID”, “IST”, “NIZ IQ test”, “DRT-B”, “GAMA”, etc. Those are too numerous to list. So give a proper description either in the main text or in a supplementary file. As I’ve said before, details are very important.

Labels added.

What do you mean by “between the two variables”? Clarify, because I fail to see what you’re referring to, even considering the sentence before that one. I assume it must be sex gap and test score.

Probably a carryover or a sentence got deleted. Either way, fixed.

You discussed Hyde’s earlier meta-analyses 1988 & 1990. In the first half of this paragraph, you seem to suggest that their 1988 meta-analysis did not concur with your own results, apparently due to verbal ability being so broadly defined, such that “heterogeneity in results is not unexpected as there is no reason to think that the sex difference within these subtypes of verbal ability tests should be the same”. But for this answer to be fully convincing you need to identify which tests are the cause of this discrepancy between their studies and yours. Let’s say you measure verbal with A, B, C, D and they measure verbal with A, B, E, F, the differences in results would likely stem from the inclusion of E and F. But if their sex gap estimates of A and B differ widely from yours, then this is not solely due to verbal ability being measured with non-overlapping sets of tests.

It was a theory, not a committed statement. I'll rephrase the discussion section to clarify that.

In the second half, you said that Hyde et al. 1990 found support for Lynn’s developmental theory because “independent of selectivity, age still had an association with the sex difference in mathematical ability”. However I didn’t see where they did such an analysis.

"The result was a simple, well-defined equation in which 87% of the variance in d was predicted by three variables: subjects' age, selectivity of the sample, and cognitive level of the test. All three were significant predictors; Age was the strongest predictor, F(l, 232) = 1,171.04, p < .0001, followed by sample selectivity, F(3, 232) = 113.22, p < .0001, which was followed by cognitive level, F\3, 232) = 7.88, p < .0001. (Sample selectivity and cognitive level were coded as class variables.)"

“This meta-analysis” must refer to Hyde & Lynn 1988 but when you say “from a previous meta-analysis” by referencing Hyde et al. 1990 which was published later, and not before, you should remove “previous” because it’s very awkward.

Fixed.

This would imply that observed total scores and latent g scores don’t have to be different. Yet they should be different. This is because if you assume that the gender gap reflects true intelligence, then the gap at the latent g level should be magnified because it estimates only the true variance, unlike observed scores. 

Well, not exactly different.

In your Excel file, Column C "Latent" has empty cells from line 2075 through 2149. This is the same problem as in your old file. In row 2149 there is a typo in the author’s name: “Spinpath et al. 2008” should be Spinath.

I think I may have commented on this earlier, but missing implies 0. It's fixed in the code.