Thank you both for taking the time to review the paper.

Reviewer #1

1. Right now I don't know how to improve the tables. I mainly followed Willerman's presentation of the tables, except I added extra columns (since I looked at age 7 as well). I think they are easy to read but of course I'm open to suggestions. Remember these are all summary statistics (table means) and not regression tables. I have means, N, SD as my columns.

2. See my reply to Reviewer #2.

3. Thanks, I will add it and adjust the write-up accordingly.

4. I just reread Niswander & Gordon (1972) and I believe the CPP was well conducted. They mentioned, for example, that women who dropped out of the study before completion of their pregnancy did not bias estimates of mortality rates etc., but of course it's difficult to know whether these cases were missing at random without either IQ or education variables. As I use listwise deletion for all of my analyses, the problem pertains to all datasets. I'll add a note about that.

5. This review study is interesting. I'll add it in the update. The children were tested at a young age (1, 4 and 6 years old), so we don't really know how much of the gains would be sustained, but it's worth mentioning due to its biological correlates.

6. Yes, I did mention it briefly in the paper. See section 3.2. In the Add Health and HSLS, black mothers who intermarried had higher education levels (and therefore likely higher IQ) than white mothers who intermarried; the only exception was the CPP. I reported results controlling for SES, so some of the IQ effect has been partialled out. But I will add a note about it too.

Reviewer #2

"The p values appear to be incorrect. The sample sizes listed are quite tiny, so p < .001 seems impossible 6 times in a row. Is this because you are using sampling weights, which in some contexts messes up the p values?"

It is true that sampling weights artificially reduce p-values, but reporting p-values for the unweighted results isn't wise either, especially for the Add Health, for which, as explained in a footnote, the result is sensitive to the use or non-use of sampling weights. Since p-values are a function of sample size and effect size (and probably data spread as well), I usually ignore p-values for inflated samples. I will adjust the write-up.
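
To illustrate the point that p-values depend on sample size as well as effect size, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function name `two_sided_p` is hypothetical, the effect size and group sizes are made-up numbers (not values from the paper), both groups are assumed equal in size with SD = 1, and a normal approximation is used instead of an exact t-test.

```python
import math

def two_sided_p(d, n_per_group):
    """Approximate two-sided p-value for a standardized mean difference d
    between two equal-sized groups (normal approximation, SD assumed = 1)."""
    se = math.sqrt(2.0 / n_per_group)  # approximate SE of d, ignoring the d^2 term
    z = d / se
    # erfc(|z| / sqrt(2)) is the two-sided tail area of the standard normal
    return math.erfc(abs(z) / math.sqrt(2.0))

d = 0.30  # hypothetical effect size, held constant
for n in (30, 300, 3000):  # hypothetical group sizes
    print(n, two_sided_p(d, n))
```

With the effect size held fixed, the p-value shrinks as n grows, which is why a sample size inflated by weights can produce very small p-values even when the unweighted N is small.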

"You can test this by repeating the regressions using the full sample as they did."

I cannot access the full Add Health data, but on the public version at least, I failed to replicate their findings.

"There must be newer studies than these, since 2014-16."

I'll look for them.

"The samples are quite small. However, it is possible to integrate the findings with meta-analysis to obtain more precise results."

This is indeed a great idea. However, I think the inverse-variance method isn't well suited to these data. It requires standard errors, and those are affected by sample size. The Add Health and HSLS have inflated Ns compared to the CPP, and it wouldn't be wise to use the SEs from the unweighted results. While the bootstrap fixes the issue of biased SEs owing to the distribution, it doesn't seem to handle biased estimates owing to inflated Ns. Even if it does, the sampling weights in both the Add Health and HSLS have non-integer values, which means I would probably have to round them to use the bootstrap, and some research has reported that this practice is flawed.

Andreis, F., & Mecatti, F. (2015). Rounding Non-integer Weights in Bootstrapping Non-iid Samples: actual problem or harmless practice?. In Advances in complex data modeling and computational methods in statistics (pp. 17-35). Springer, Cham.

https://air.unimi.it/retrieve/handle/2434/250358/341637/Andreis%20-%20Mecatti%20-%20Short%20paper.pdf

Andreis, F., Conti, P. L., & Mecatti, F. (2019). On the role of weights rounding in applications of resampling based on pseudopopulations. Statistica Neerlandica, 73(2), 160-175.

https://dspace.stir.ac.uk/bitstream/1893/27562/1/AndConMec_review_clean.pdf
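
For readers unfamiliar with the inverse-variance method mentioned above, here is a minimal fixed-effect sketch in Python. It is an illustration only: the function name `inverse_variance_pool` is hypothetical, and the estimates and standard errors are made-up numbers, not results from the CPP, Add Health, or HSLS.

```python
import math

def inverse_variance_pool(estimates, std_errors):
    """Fixed-effect pooled estimate: each study is weighted by 1 / SE^2."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se, weights

# Hypothetical inputs: the second and third studies have much smaller SEs,
# as would happen if their Ns were inflated by sampling weights.
estimates = [0.20, 0.10, 0.40]
std_errors = [0.10, 0.02, 0.03]

pooled, pooled_se, weights = inverse_variance_pool(estimates, std_errors)
shares = [w / sum(weights) for w in weights]  # each study's share of the total weight
print(pooled, pooled_se, shares)
```

Because the weights are 1 / SE^2, a study whose SE is artificially small dominates the pooled estimate, which is the concern raised above about mixing datasets with inflated Ns.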

Is there another method? I just looked at alternatives and I believe it's best to weight by the inverse of the sample size instead of the standard errors. I might end up doing the analysis, but remember that the datasets use cognitive tests which aren't comparable: the CPP has the Wechsler, the Add Health a vocabulary test, and the HSLS an achievement test (math) rather than a cognitive test.

Finally, with respect to averaging waves, I believe it's more accurate not to aggregate, especially in the CPP. Willerman's main point was that having a black mother is associated with a very large cognitive deficit for the children at age 4, but I showed this wasn't the case at age 7. And the hereditarian argument is that the effect of the mother's involvement decreases over time.

On the other hand, the Add Health respondents' ages at Wave I and Wave III range from 12-19 and 18-26, respectively, so there might be some rationale for aggregating if one suspects sampling error. Some of these respondents are quite young, but this likely won't affect the result too much. I can aggregate these results for the meta-analysis, however.