I am generally OK with the structure of the introduction though it’s still brief.
The lack of a difference on the test was known for some time, and I cited Deary since it was the source that was the easiest to search for. I didn't bother looking for other citations and just used different ones.
If you don't change the references, at least change the sentence, because right now it is very awkward. Your text currently reads: “some intelligence researchers have now contested this consensus with the developmental theory of sex differences in intelligence (Lynn, 1994).” I propose you remove the “now,” as it is awkward. And if you use “researchers” in the plural, you should add other authors, perhaps Irwing (2012), because otherwise this is awkward as well.
No. The method of correlated vectors can also test whether the association between IQ and a continuous variable (e.g. income) is on g.
You are correct that MCV has some other applications, but our discussion revolved around the test of Spearman’s hypothesis, i.e., whether group differences are due to g, since in your text you still say “An early method of testing this hypothesis was the method of correlated vectors”. In this case, it’s about correlating group differences with g loadings.
Because the sex differences in observed group factors of intelligence go in different directions depending on the ability in question, there must be (relevant) sex differences in group factors of intelligence independent of the general difference.
Yes, you have said this before. But as I argued, you still need to show that a positive relationship between group gaps and g loadings cannot arise when there are large group differences on subtests that tend to balance out in direction (e.g., near-zero total test score differences in the case of sex gaps, because male and female advantages cancel each other out).
Jensen’s discussion of MCV in The g Factor made it very clear that restriction of range in g loadings can attenuate the correlation. When you think about it, this observation extends to group differences as well. If you think smaller sex differences across subtests lead to correlations that deviate more from zero, I would say this is not obvious; it may well lead to the opposite outcome. You cannot expect a large MCV correlation when the sex differences do not vary much across subtests.
You won’t want to cite what I am about to say here, but I have tested SH before (and even recently) using the absolute value of the sex differences rather than the signed ones. It did not lead to a much different outcome: for the couple of analyses I tried, there was no large (or even modest) positive relationship between g loadings and group differences. The difference in interpretation here would be “as the g loading increases, so does the sex difference” rather than “as the g loading increases, so does the male advantage.” Your criticism is more relevant to the second statement, much less so to the first.
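To make the distinction concrete, here is a minimal sketch of the two correlations; the loadings and differences below are purely illustrative numbers, not taken from any dataset in your paper.

```python
import numpy as np

# Illustrative values only: one g loading and one signed sex difference
# (male minus female, in d units) per subtest.
g_loadings = np.array([0.45, 0.55, 0.60, 0.68, 0.72, 0.78, 0.81, 0.85])
sex_d      = np.array([-0.30, 0.25, -0.15, 0.10, 0.20, -0.05, 0.15, 0.22])

# MCV as usually applied to Spearman's hypothesis: correlate g loadings with
# the signed group differences ("as g loading increases, so does the male advantage").
r_signed = np.corrcoef(g_loadings, sex_d)[0, 1]

# Alternative reading: correlate g loadings with the absolute differences
# ("as g loading increases, so does the size of the sex difference, whatever its direction").
r_absolute = np.corrcoef(g_loadings, np.abs(sex_d))[0, 1]

print(f"MCV with signed differences:   r = {r_signed:.2f}")
print(f"MCV with absolute differences: r = {r_absolute:.2f}")
```

The point is simply that the two versions can diverge sharply when male and female advantages cancel out across subtests.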
MGCFA can help you see through this intricate mess. Look again at Keith et al. (2008) and compare their Tables 6 and 8. I told you earlier that the g gaps were 1.21 and 3.51 IQ points for the HOF and BF models respectively. I probably should have told you this as well: the sex differences in the group factors were much, much bigger in the BF model than in the HOF model. In other words, the gender gap in g was larger even though the gender gaps in the group factors were also larger. The discrepancy can be explained by the fact that the BF model orthogonalizes the latent factors, whereas the HOF model does not. The implication is that, when properly modeled, you can easily find a sex gap in g despite large differences across group factors.
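If it helps, here is a schematic of why the two parameterizations apportion the same observed subtest gaps differently (generic notation, not Keith et al.'s actual estimates). In the higher-order model, x_j = \lambda_j F_k + e_j with F_k = \gamma_k g + u_k, so a subtest mean gap decomposes as

\Delta \bar{x}_j = \lambda_j \gamma_k \, \Delta g + \lambda_j \, \Delta u_k,

whereas in the bifactor model, with g and the group factors orthogonal, x_j = b_j g + s_j F_k^{*} + e_j, so

\Delta \bar{x}_j = b_j \, \Delta g + s_j \, \Delta F_k^{*}.

Because F_k^{*} is residualized of g in the second case, the same set of observed gaps can be split into a larger \Delta g together with larger group-factor gaps, which is exactly the pattern you see across Tables 6 and 8.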
The only artifact that may prevent a g difference from being detected is selection bias in items, which seems to occur in traditional IQ tests. This is why it is best to also test sex differences in aptitude tests, because those are not constructed to cancel out sex differences by removing items that display large sex differences.
Regarding this matter, I would like to make this clear. I merely want to see if we can reach an agreement here. But regardless of your conclusion, whatever you say will not affect whether I accept or decline the submission. In the past, I have accepted papers at OP despite disagreeing with authors on theoretical grounds. I accept disagreement, unless the matter is extremely serious (such as depicting the main analysis in a very misleading way). I think the most important element should be methodological.
I don't understand your logic here. A larger difference is easier to detect than a smaller one.
We are talking about effect size. Your argument is that the sex gap in g must be small if the specific factors show a large sex gap. My counterpoint was that the BW gaps in specific factors are also very large.
There were the omega total reliabilities for each
It is good that you showed me these values, but why are they not reported in the paper, along with whether they impact the results or not? For instance, you said “I'll clarify that this will lead to the differences in group factors of intelligence being attenuated, while the difference in the full scale scores will not be,” but you forgot to mention this in the text.
If you could refer me to a peer-reviewed study on the matter, then maybe it could be of use.
This paper (p. 1013) showed that the reliability of the Wordsum is improved a bit merely by changing the composition of the item set (without changing test length).
Cor, M. K., Haertel, E., Krosnick, J. A., & Malhotra, N. (2012). Improving ability measurement in surveys by following the principles of IRT: The Wordsum vocabulary test in the General Social Survey. Social Science Research, 41(5), 1003–1016.
The reason why I averaged the effect sizes that came from the same study was to avoid the test for publication bias being confounded by some studies reporting separate effect sizes by age group. If I recall correctly, the test for publication bias showed a bias against reporting male advantages in ability. This disappeared when I averaged the effect sizes within each study into one "larger" effect size. I then did the same when analyzing the differences in the group factors of ability, because the estimation of standard errors was biased by the fact that some studies/datasets reported a massive number of effect sizes (e.g., the PISA and PIRLS datasets).
Your first statement is fair, and you can make it in the discussion section. But I need to say this: with averaging, the within-study variation is lost, and the true relationship between effect size and standard error will likely be obscured. You can run both the multilevel model and the averaging approach and compare the results, including with respect to publication bias. I would say that if the bias disappears only under averaging and not under a multilevel model, perhaps that means your method conceals publication bias rather than addressing it. The reason I recommend the multilevel approach is that it is a much more accurate method. If the bias disappears after using a more accurate method, I see no problem. But if it disappears because of a method that distorts standard errors, then that is an issue. Your second statement also looks fair, but multilevel modeling can address this problem as well, with the added benefit of providing accurate standard errors.
I eyeballed your spreadsheet and my first impression is that you likely have enough studies with multiple effects to conduct such a superior analysis. Again, I am not forcing you here. As I said, you may decide not to use a multilevel model, but the problem with aggregating effects must be acknowledged.
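To illustrate the trade-off (with made-up numbers, not your data), here is a minimal sketch contrasting the averaging approach with keeping every effect and clustering the standard errors by study, a robust-variance-estimation-style correction, which is one of the alternatives compared in Moeyaert et al. (2017).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical long-format data: several effect sizes per study.
dat = pd.DataFrame({
    "study": ["A", "A", "A", "B", "B", "C", "D", "D"],
    "d":     [0.05, 0.12, -0.02, 0.30, 0.25, 0.10, -0.05, 0.02],
    "v":     [0.010, 0.012, 0.011, 0.020, 0.018, 0.015, 0.009, 0.010],
})

# (1) Averaging within studies: within-study variation is discarded, and the
#     variance of the averaged effect is approximated assuming independence.
agg = dat.groupby("study").agg(d=("d", "mean"), v=("v", "mean"),
                               k=("d", "size")).reset_index()
agg["v"] = agg["v"] / agg["k"]
fit_avg = sm.WLS(agg["d"].to_numpy(), np.ones(len(agg)),
                 weights=1 / agg["v"].to_numpy()).fit()

# (2) Keeping every effect size and clustering standard errors by study.
fit_cl = sm.WLS(dat["d"].to_numpy(), np.ones(len(dat)),
                weights=1 / dat["v"].to_numpy()).fit(
    cov_type="cluster", cov_kwds={"groups": pd.factorize(dat["study"])[0]})

print(f"Averaged within studies: d = {fit_avg.params[0]:.3f}, SE = {fit_avg.bse[0]:.3f}")
print(f"All effects, clustered:  d = {fit_cl.params[0]:.3f}, SE = {fit_cl.bse[0]:.3f}")
```

This sketch only addresses the variance side of the problem; the three-level model described in Cheung (2014) additionally separates within-study from between-study heterogeneity, which is why I keep recommending it.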
Here are a few articles I highly recommend reading about the biases introduced by averaging:
Cheung, M. W. L. (2014). Modeling dependent effect sizes with three-level meta-analyses: a structural equation modeling approach. Psychological Methods, 19(2), 211–229.
Moeyaert, M., Ugille, M., Natasha Beretvas, S., Ferron, J., Bunuan, R., & Van den Noortgate, W. (2017). Methods for dealing with multiple outcomes in meta-analysis: A comparison between averaging effect sizes, robust variance estimation and multilevel meta-analysis. International Journal of Social Research Methodology, 20(6), 559–572.
I could add a post-hoc explanation as to why the results of each method could differ
I don’t recommend resorting to post-hoc explanation, since hypotheses should be made clear at the start, not after seeing the results, but in your situation, it’s better than nothing.
If the comment is referring to constraining at all, the issue is that some of the older samples also include some children in their analyses, which could lead to an artificial increase in the male advantage in general ability as the sample mean age goes from 20 to 30.
This detail is not trivial, yet it wasn't mentioned in the main text. I told you before to take great care with the details of your method. All details. Regarding the age restriction, use different cutoffs as sensitivity analyses. You said age 18; maybe add age 22 or 24 as well, but most importantly include a specification with no constraint. If the restriction is not biasing the results, the estimates should be consistent.
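To be clear about what I mean by sensitivity analyses, here is a minimal sketch (hypothetical file and column names, and simple DerSimonian-Laird pooling rather than whatever estimator you actually used):

```python
import numpy as np
import pandas as pd

def dersimonian_laird(d, v):
    """Random-effects pooled estimate with a DerSimonian-Laird tau^2."""
    w = 1 / v
    theta_fe = np.sum(w * d) / np.sum(w)
    q = np.sum(w * (d - theta_fe) ** 2)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(d) - 1)) / c)
    w_re = 1 / (v + tau2)
    return np.sum(w_re * d) / np.sum(w_re), np.sqrt(1 / np.sum(w_re))

# One row per effect size; min_age is the youngest age in the sample
# (all names here are hypothetical).
dat = pd.read_csv("effects.csv")  # columns: min_age, d, v

for cutoff in (18, 22, 24, None):
    sub = dat if cutoff is None else dat[dat["min_age"] >= cutoff]
    est, se = dersimonian_laird(sub["d"].to_numpy(), sub["v"].to_numpy())
    label = "no restriction" if cutoff is None else f"min age >= {cutoff}"
    print(f"{label}: d = {est:.3f} (SE = {se:.3f}, k = {len(sub)})")
```

If the estimates line up across cutoffs, the inclusion of younger samples is not driving the result.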
It is odd. But even if age is controlled for, the developmental theory can still be tested because it refers to the difference in the growth in intelligence between men and women.
Yes, I don't doubt that, but my point is that your sentence was confusing.
Sure. I guess I’ll make a new account then.
You can also ask your co-author to store everything in his account.
You can decide not to do it, or not to create another account (as you said in your last response) because it's not convenient, but in that case I find it very hard to accept your submission. I see papers as professional work, and nothing can be taken lightly just because it is “convenient” not to comply with the rules of the game. (This applies to the flowchart as well.) Yes, I know that submitting a paper is hard work and time consuming, but that's how it is.
Presentation is crucially important. Otherwise, scholars will never bother to cite your study or even take it seriously, and in that case, why should I bother reviewing it, and why bother publishing it at all? The more you deviate from the “norm,” the more negatively your paper will be viewed among scholars (regardless of where it is published).
To be (brutally) honest, the initial submission would have been desk rejected at nearly all other journals, because it deviated so much from what professional work typically looks like, even more so given the high standards expected of meta-analyses. If any editor had accepted it, I think he should be fired. I say this to show you how much the paper has to improve to meet standard quality, and why I was so concerned by the initial submission. It is clearly better now and I am somewhat happy with the direction, but there is still room for improvement.
In any case, once you upload the additional material (code + data + data description), I will examine it and provide you with what is likely my final advice.
The supplement reveals all datasets and studies that I used
What I said is that some data lack a source and links, i.e., the cells in the “author & year” column are empty. I am thinking of the GATB in particular. Links are missing for many of the tests employed. The same goes for the study name: for PISA, for example, the study name should be PISA, and so on.
Regarding your updated paper:
Within the NLSY79, the ASVAB was administered to 11,914 respondents in 1981.
You mention here the number of participants who took the test, but not for the other datasets. You need to be consistent: either report it in the main text for all studies (probably the best option) or not at all.
Within the NLSY97, the same methodology was used, but the differences in performance within each ability and age group (12, 13, 14, 15, 16-18) were calculated.
The problem is that you mention that in the NLSY79 there are 10 subtests, but there are 12 in the NLSY97, so you should mention this too. Also, why is the difference calculated within separate age groups in the NLSY97 but not in the NLSY79? You should use the same method for both.
The scores on all times for both tests into one composite score.
There must be some missing element in this unfinished sentence.
Within wave 1, the ages of the participants were segregated
I prefer “separated,” as “segregated” sounds odd to me even though it means the same thing.
The Programme for the International Assessment of Adult Competencies administered tests…
You should add PIAAC in parentheses. There are some concerns with the method described in that same paragraph:
- Regarding PIAAC, your method strikes me as odd. Although this concerns only a few countries, for them “the standardized difference in the composite of numeracy and literacy was calculated instead,” while for many other countries, which had also completed the problem-solving test, all three tests were used to produce a g factor score. But this means the scores are no longer comparable across countries, as one is a factor score based on 3 tests and the other a composite score based on 2 tests. I suggest you use a 2-subtest factor score for all countries as a robustness analysis (see the sketch after these two points). Or consider whether imputation is reasonable given your missingness pattern, also checking whether the relationships among the three tests are comparable across countries.
- Regarding PISA, you mentioned you are not using g scores. Why the inconsistency?
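On the robustness suggestion above: a minimal sketch of a uniform two-subtest score, assuming a person-level extract with country, sex, numeracy, and literacy columns (all names hypothetical).

```python
import pandas as pd

# Hypothetical person-level PIAAC extract: country, sex (1 = male, 0 = female),
# and the numeracy and literacy scores.
piaac = pd.read_csv("piaac.csv")  # columns: country, sex, numeracy, literacy

def cohens_d(male, female):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(male), len(female)
    pooled_sd = (((n1 - 1) * male.std(ddof=1) ** 2 +
                  (n2 - 1) * female.std(ddof=1) ** 2) / (n1 + n2 - 2)) ** 0.5
    return (male.mean() - female.mean()) / pooled_sd

rows = []
for country, grp in piaac.groupby("country"):
    # Standardize each subtest within country, then average into a
    # two-subtest composite so every country is scored the same way.
    z = grp[["numeracy", "literacy"]].apply(lambda s: (s - s.mean()) / s.std(ddof=1))
    composite = z.mean(axis=1)
    rows.append({"country": country,
                 "d": cohens_d(composite[grp["sex"] == 1],
                               composite[grp["sex"] == 0])})

print(pd.DataFrame(rows))
```

Whether you use a composite like this or a 2-subtest factor score, the key point is that every country should be scored the same way.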
On average, men performed better than women (d = .039, n= 31,950, p < .001).
I don't know why you report these results in the method section, and only for the Wordsum test, which is even odder.
A dataset of 28,699 employees who took the GATB was privately sent to the authors
In your data spreadsheet, there is no source (i.e., author or link) for many tests (the cells are empty), including this GATB. Anyone looking at this file will immediately find this suspicious.
In the Project Talent, the 61 subtests … Then, the sex difference in each ability at the ages of 13, 13, 15, 16, 17, and 18 was calculated.
You still haven't answered my earlier comment. Why 61 subtests? You need to justify this. Look at Major et al.: they used 37. Regarding the computation of sex differences by age group, you again need to be consistent, and provide an explanation as to why you use such a method (e.g., testing Lynn's hypothesis). There is also a typo, with 13 being repeated.
If there was no sex variable, self-reported gender identity was used as a proxy for it.
This is convoluted. The sex variable is typically self-reported. If there are different measures of sex, you should mention and explain them.
the WORDSUM, a 10 item multiple choice vocabulary test, and …
The Wordsum and its data were presented properly earlier in the method section, but not the cognitive test and data mentioned afterwards, in the sentence I quoted above. If there are other unmentioned datasets, mention those too.
I think I told you earlier about Figure 3. The description of each row must maintain a degree of clarity and consistency. Since you already state in the figure's title that red/black stands for children/adults, you do not need to specify (C) for children in each row; or, if you do, then also add (A) for adults in the remaining rows.
I see you are now mentioning Egger’s test, but your text needs to be properly referenced, and this is especially important with respect to statistical methods.
Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. BMJ, 315(7109), 629–634.
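For what it is worth, the test itself is just a regression of the standardized effect on precision, with the intercept serving as the asymmetry test; a minimal sketch (hypothetical column names):

```python
import pandas as pd
import statsmodels.api as sm

# One row per effect size: d and its standard error (hypothetical names).
dat = pd.read_csv("effects.csv")  # columns: d, se

# Egger et al. (1997): regress the standardized effect (d / SE) on precision
# (1 / SE); an intercept significantly different from zero indicates
# funnel-plot asymmetry.
y = dat["d"] / dat["se"]
X = sm.add_constant(1 / dat["se"])
fit = sm.OLS(y, X).fit()

print(f"Egger intercept = {fit.params['const']:.3f}, p = {fit.pvalues['const']:.3f}")
```

Citing the reference and stating which form of the regression you ran would be enough for the reader.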
The next sentence mentions the moderator analyses displayed in the Appendix, but I told you to discuss these results, and this is still not done. Either present them in the results section or discuss them in the appendix directly above or below the relevant Figures/Tables. Also, with respect to Figures/Tables, Figures A1-2 are set in bold but Tables A1-2 are not. You need to be consistent.
Adult men scored slightly higher in full scale ability … Publication bias in favor … no visual signs of publication bias.
This entire paragraph should be split into two, as it currently combines two distinct analyses: one on sex gaps in full-scale IQ and another on publication bias. Proper presentation requires that each analysis be discussed in its own paragraph to maintain clarity and focus.
Regarding Figure 4, I would note in the main text that although there is no observable publication bias, the data points show no funnel-shaped pattern. This matters, because a more decisive conclusion regarding publication bias requires not only symmetry but also an appropriate funnel-shaped pattern in the data points. To me, it suggests there is some heterogeneity.
Figure 4 displays some labels I have not seen mentioned anywhere: “No ID”, “IST”, “NIZ IQ test”, “DRT-B”, “GAMA”, etc. They are too numerous to list. So give a proper description either in the main text or in a supplement. As I have said before, details are very important.
Then, restricted cubic splines were used to calculate the non-linear relationship between the two variables.
What do you mean by “between the two variables”? Clarify, because I fail to see what you’re referring to, even considering the sentence before that one. I assume it must be sex gap and test score.
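Whatever the second variable turns out to be, the spline fit itself is easy to make explicit; a minimal sketch using patsy's natural cubic spline basis (natural splines are the restricted cubic splines), with purely hypothetical column names:

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per effect size: the sex difference d, its sampling variance v, and
# the predictor x (whatever the second variable is, e.g. mean age or test score).
dat = pd.read_csv("effects.csv")  # columns: d, v, x

# cr() builds a natural (i.e. restricted) cubic spline basis; df=4 controls the
# number of basis functions and hence the knots. Weighting by inverse variance
# keeps the fit meta-analytic in spirit.
fit = smf.wls("d ~ cr(x, df=4)", data=dat, weights=1 / dat["v"]).fit()
print(fit.summary())
```

Spelling out what x is, and how many knots were used, is all the text needs.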
You discussed Hyde's earlier meta-analyses (1988 & 1990). In the first half of this paragraph, you seem to suggest that the 1988 meta-analysis did not concur with your own results, apparently because verbal ability was so broadly defined, such that “heterogeneity in results is not unexpected as there is no reason to think that the sex difference within these subtypes of verbal ability tests should be the same”. But for this answer to be fully convincing, you need to identify which tests cause the discrepancy between their study and yours. Say you measure verbal ability with tests A, B, C, D and they measure it with A, B, E, F; the difference in results would then likely stem from the inclusion of E and F. But if their sex gap estimates for A and B differ widely from yours, then the discrepancy is not solely due to verbal ability being measured with non-overlapping sets of tests.
In the second half, you say that Hyde et al. (1990) found support for Lynn's developmental theory because “independent of selectivity, age still had an association with the sex difference in mathematical ability”. However, I did not see where they did such an analysis. Their Table 4 indeed displays sex gaps by age group, but selectivity was not accounted for; instead, each table handles one moderator at a time (Table 4 uses age as a moderator, Table 5 ethnicity, Table 6 selectivity). Moreover, their finding of an age*sex interaction is at best very weak support for Lynn's hypothesis because of the huge heterogeneity (explained in their text and displayed in their Table 4). Computation and Concepts showed no male advantage in any age group, though it is also true that these domains lack samples for ages 19-25 and 26+. Problem Solving showed a sex*age interaction consistent with Lynn's expectations, yet there is no difference in effect sizes between ages 15-18 and 19+, which could suggest that the lack of a sex difference in Computation and Concepts at ages 15-18 may hold for the 19+ age group as well. The only pattern that is undeniably consistent with Lynn's theory is the age*sex interaction for “All studies” rather than for the cognitive domains separately. So, considering the entire bulk of results, I think the heterogeneity of effects and the missing data at later ages prevent strong inferences. Your conclusion might be somewhat right, but it should be balanced by considering the heterogeneity across math domains.
This meta-analysis found that men score higher in tests of mathematical ability by about .3 SD, which does not corroborate results from a previous meta-analysis (Hyde et al., 1990).
“This meta-analysis” must refer to Hyde & Linn (1988), but then you say “from a previous meta-analysis” while referencing Hyde et al. (1990), which was published later, not before. You should remove “previous” because it is very awkward.
I would recommend putting a period right after “.3 SD” and then writing something like “However, another meta-analysis (Hyde et al., 1990) found a different result and argued that the gender difference…”. But it is up to you.
This meta-analysis argued that the gender difference in mathematical abilities were a result of selective samples […]
Typo: it should be “was,” since the subject “gender difference” is singular.
In some of these cases, such as Pezzuti & Orsini (2016), the observed difference in intelligence is of roughly the same magnitude as the latent difference, so it would be misleading to say that the use of latent methods is responsible for the discrepancy in results.
This would imply that observed total scores and latent g scores do not have to differ. Yet they should differ: if you assume that the gender gap reflects true intelligence, then the gap at the latent g level should be magnified, because the latent factor captures only the true-score variance, unlike observed scores.
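A one-line way to see this, under the usual assumption that the entire gap sits on the true score: d_latent = \Delta\mu / \sigma_{true} = \Delta\mu / (\sigma_{obs}\sqrt{\omega}) = d_{obs} / \sqrt{\omega}. So with, say, \omega = .90, the latent gap should be roughly 5% larger than the observed one (1/\sqrt{.90} \approx 1.05), not equal to it.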
I don’t have anything more to say about the discussion section, since the other paragraphs look fine.