U.S. Ethnic/Race Differences in Aptitude by Generation
This is a very good paper. However, I think the abstract could be improved. Abstracts usually do not launch abruptly into technical summaries. An explanation of the aims of this investigation should be added, placed between "We conducted an exploratory meta-analysis using 18 samples for which we were able to decompose scores by sociological race and immigrant generation" and "For NH Blacks and NH Whites of the same generation, the first, second, and third+ generation B/W d-values were 0.78, 0.76, and 0.98". An explanation of what the d-value represents should also be added (it may be obvious to you, but some readers may ask why you decided to use d-values).
Other than this I have no further comments, and I approve publication once the abstract is changed.
I fixed the errors noted by Dalliard and Meng Hu; I added to the abstract as requested by Duxide. I corrected some other typos. I redid the tables, except for #2, per Meng Hu's request. Regarding Meng Hu's table format request, I don't think that my screenshot and paste method (see table #2) is out of line with the standards of this journal (or with APA standards). Perhaps Emil could comment. The problem with Meng Hu's method is that it's very difficult to get the spacing right (especially when the columns are long), the tables either end up splitting across pages or taking up too much space, etc., and I end up spending more time on the table format than on the original analysis.

(Also, two reviewers, Emil and Dalliard, took issue with my calling this an "exploratory meta-analysis". Emil suggested that it wasn't in fact a meta-analysis because it wasn't systematic. Dalliard noted that the term "exploratory" is ambiguous. I agree with Dalliard, and this ambiguity allows me to call it "exploratory". This usage is not just my own idiosyncrasy. For example, Jan te Nijenhuis calls these types of analyses "exploratory". He himself told me that I could slap the qualifier "exploratory" onto such meta-analyses to denote that they were not systematic. Per Emil's comment, it's a meta-analysis because I am computing a statistic based on multiple analyses.)

I cannot open the file. Can you please reupload a readable file format (PDF or docx)?


Fixed.
Ok the abstract is much better. I approve publication.
I have several requests. In table 16, the row

white 752729

is on the same line as "reference". I think it should be below. (EDIT: forget this comment; after examining it carefully, I see that "reference" was not part of the text "reading/math d-value". It means the reference group. But perhaps you could skip a line to help distinguish them.)

Concerning table 14, the letters and numbers are smaller than the rest of the text. You should fix it.

Concerning this sentence, page 10:

This selectivity could account for some of the
difference. Alternatively, the National IQs of Black majority nations could be underestimated. This issue will require more investigation.


That should be:

This selectivity could account for some of the difference. Alternatively, the National IQs of Black majority nations could be underestimated. This issue will require more investigation.


As for this:

Alternatively, the National IQs of Black majority nations could be underestimated.


Are you referring to Wicherts's studies on sub-Saharan African countries? If so, you should cite them, because it's not necessarily clear what you're talking about.

color IQ correlation for second generation Blacks but not for first generation ones.


But that should be "color-IQ correlation".

Because there were relatively few second generation Black individuals who reported being mixed race, using an inclusive definition had a little effect on the overall scores


You have probably forgotten a period somewhere.

In your table 3, where you write "reference", I think it would be better to write "reference group". It's much clearer that way, and you have enough space to write it. Also, how did you get the values for Black-White biracials (e.g., 0.38 for the 2nd generation)? Given the data file I have, I don't see it. Is it an average of the ACT and SAT or something? If so, you should have said so explicitly, because it's impossible to guess.

On page 13 you present the results for Wordsum in the GSS, but have you explained what the "Wordsum" test is?

Self-reported scores are probably not the best index of true English ability and they may not be comparable across groups (i.e., measurement invariance might not hold);


It's not a point that needs to be focused on specifically, but I want to be clear about it. A lot of people don't know what MI is. And I don't understand why Wicherts says that a violation of MI implies incomparability of scores. It's false. When subtests show no MI, that means the true ability of one group is either underestimated or overestimated. Or neither: for example, if the subtest biases cancel out at the test level, then there is only subtest bias, not test bias. If the biases are cumulative and thus mainly one-sided, then there is test bias at the total score as well. Even in this case, the scores can be compared; it's just that they are under- or over-estimated. Nothing more. With either MGCFA or IRT, you can calculate more or less the amount of IQ that is under- or over-estimated.
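
(To spell out the logic, as a sketch of the standard MGCFA setup rather than anything specific to this paper: the observed group difference on subtest j decomposes as

\Delta \bar{X}_j = \lambda_j \Delta\kappa + \Delta\tau_j

where \lambda_j is the factor loading, \Delta\kappa is the latent mean difference, and \Delta\tau_j is the intercept difference. Full MI means \Delta\tau_j = 0 for every subtest; where MI fails, the size and sign of \Delta\tau_j tell you how much the observed gap is under- or over-stated, and whether the biases cancel at the total-score level.)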

when second+ generation scores are adjusted for psychometric bias in the form of differential item function, the differences remain large (see, Richwine, 2009, table 2.11).


What Richwine shows is that the PIAT-Math is not biased in the Hispanic-White comparison. But it's just a math test, whereas in your table 9 the tests are mainly verbal/language tests.

This study should be (a little bit) more relevant to your question.

Schmitt, A. P., & Dorans, N. J. (1990). Differential item functioning for minority examinees on the SAT. Journal of Educational Measurement, 27(1), 67-81.

Because analyses for the SAT-Mathematical test did not demonstrate much DIF for either Hispanic group, Schmitt focused her analyses on the SAT-Verbal test (Schmitt, 1985, 1988). In the first form analyses, 8 and 12 items out of 85 SAT-Verbal items were flagged for DSTD values greater than .05 in absolute value for Mexican Americans and Puerto Ricans, respectively. In the analyses for the second form, the number of flagged items for Mexican Americans and Puerto Ricans were 14 and 16, respectively. It should be noted that most of the flagged items across both Hispanic groups and across both studies had DSTD values whose absolute magnitudes were smaller than .10. Of the four item types on the SAT (antonyms, analogies, sentence completion, and reading comprehension), analogy items exhibited the greatest number of negative DIF items for both Hispanic groups.


Negative DIF here means that the focal groups (the groups for which you suspect there is bias, e.g., minorities) have underestimated scores. 0.10 is the cutoff for large DIF. Given that they do not give the magnitude of the total bias, there is no way for me to draw any conclusion about it. But my impression is that the bias is not trivial. This assumes, however, that all DIFs are one-sided.
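
(For reference, the standardization index behind those DSTD values is, as I understand it, Dorans and Kulick's STD P-DIF:

STD P-DIF = \sum_m w_m (P_{fm} - P_{rm}) / \sum_m w_m

where, at each matched total-score level m, P_{fm} and P_{rm} are the proportions correct in the focal and reference groups and w_m is the focal-group count at that level. Hence the usual reading: values under .05 in absolute magnitude are negligible, .05 to .10 moderate, and above .10 large.)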

Whatever the case, the absence/presence of bias in any given test should not be generalized to other tests. You cite Trundt (2013), who analyzed the DAS-II. But there is no DAS test in any of the samples you analyze.

For example, Hansen et al. (2008) report scores for male children of natives


It's Hansen (2010), not (2008).

---

Sometimes the letter "g" is cut off, that is, we don't see the letter entirely. I don't know why this happened. The zeros in table 2 have the same problem, by the way (particularly in the table for Asians).

Concerning my suggestion for making tables, I proposed another method when I commented on "Genetic and Environmental Determinants of IQ in Black, White, and Hispanic Americans"; you should have already received that mail.

Concerning the second file (167 kb), I cannot open it.

EDIT:

When you write this:

For example, third+ generation English-only speaking Hispanics perform only marginally better than all Hispanics of the same generation. TIMSS (2007) results are shown below in table 11.


I don't understand why you did not even report the d effect size.

=(483-478)/((73+67)/2)=0.07

If you do this:

=(533-483)/((67+73)/2) = 0.71
=(533-478)/((67+67)/2) = 0.82

relative to Whites who always speak English, the d gap is a little bit larger for all Hispanics than for English-only Hispanics, something like 0.11 SD (0.82 vs. 0.71).
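
In case it helps, here are the same computations written out as a minimal Python sketch (the means and SDs are the TIMSS 2007 figures used in the formulas above; the SD pairings are inferred from those formulas):

def d_avg_sd(m1, sd1, m2, sd2):
    # d with the two SDs simply averaged, matching the spreadsheet formulas above
    return (m1 - m2) / ((sd1 + sd2) / 2.0)

# Whites who always speak English: 533 (SD 67)
# Third+ generation English-only Hispanics: 483 (SD 73)
# All Hispanics of the same generation: 478 (SD 67)
print(d_avg_sd(483, 73, 478, 67))  # ~0.07, English-only vs. all Hispanics
print(d_avg_sd(533, 67, 483, 73))  # ~0.71, Whites vs. English-only Hispanics
print(d_avg_sd(533, 67, 478, 67))  # ~0.82, Whites vs. all Hispanics; 0.82 - 0.71 ~ 0.11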

It has been suggested that the African migrant IQ might be on par with that of Whites; if so, the first and second generation B/third+ generation W gaps


Personally, I prefer "Black" and "White" instead of "B" and "W", because "B/third+ generation W" is somewhat confusing.

In table 5, the data are from Capps et al. (2012), but this is mentioned in the text rather than at the bottom of the table. Personally, I prefer to see the reference attached to the table; it's easier to spot.

For tables 8 & 15, you must say explicitly, e.g., at the bottom of the table, what reference the numbers in these columns come from (e.g., Hoeffel et al.). No one can guess that. I can say the same thing for table 12. For your table 16, now that I've looked at it once again, I think the IQ for the Nepalese should be 78.0 and not 78.8. And for your figure 1, if it is based on the data given in table 16, you should say so explicitly.

Globally, Vietnamese, Asian Indians, and Filipinos are estimated to have national IQs, respectively, 0.40, 1.19, and 0.93 SD below the White mean and yet the Californian CAT differences between American Whites and American individuals of these nationalities is, respectively, - 0.13, -0.11, and 0.13.


You must specify what the negative sign means. If you computed White minus Asian and it's negative, it means those Asian groups score higher. But there is no way to guess that unless it is stated explicitly.

The computations are presented in the excel file.


It's just my opinion, but I prefer "presented in the supplementary file".

Table 17 is difficult for me to follow. Perhaps you can improve it. For example, you do not need to put "White" and "Black" on the line below "non-Hispanic"; you have enough space to write them in the same row.

If possible, I would like to see the newer version with the modified text in color or in bold. That would help me a lot, and probably others too.
Admin
I fixed the errors noted by Dalliard and Meng Hu; I added to the abstract as requested by Duxide. I corrected some other typos. I redid the tables, except for #2, per Meng Hu's request. Regarding Meng Hu's table format request, I don't think that my screenshot and paste method (see table #2) is out of line with the standards of this journal (or with APA standards). Perhaps Emil could comment. The problem with Meng Hu's method is that it's very difficult to get the spacing right (especially when the columns are long), the tables either end up splitting across pages or taking up too much space, etc., and I end up spending more time on the table format than on the original analysis.

(Also, two reviewers, Emil and Dalliard, took issue with my calling this an "exploratory meta-analysis". Emil suggested that it wasn't in fact a meta-analysis because it wasn't systematic. Dalliard noted that the term "exploratory" is ambiguous. I agree with Dalliard, and this ambiguity allows me to call it "exploratory". This usage is not just my own idiosyncrasy. For example, Jan te Nijenhuis calls these types of analyses "exploratory". He himself told me that I could slap the qualifier "exploratory" onto such meta-analyses to denote that they were not systematic. Per Emil's comment, it's a meta-analysis because I am computing a statistic based on multiple analyses.)


I never meant to say that it wasn't a meta-analysis. Of course it is. I wrote that it isn't a systematic meta-analysis. If studies are heterogeneous, then it is possible to select subsets of studies so as to bias the effect size upwards or downwards. For this reason it is best to be systematic. All it means to be systematic is that there was some objective literature search and clear inclusion and exclusion criteria. The masters of meta-analysis, the Cochrane Collaboration, recommend this practice.

We might call a non-systematic meta-analysis an "exploratory meta-analysis"; that seems in line with current practice and is fine with me.
In the PDF file, for a reason unknown to me, some of the table border lines don't appear. I'm not going to attempt to fix this. If that's a problem, Meng Hu can diagnose and fix the issue -- since I redid the tables using his format. I am not going to keep monkeying around with table formats.

I made all of the corrections MH requested except:

(1) "Are you referring to Wicherts studies on Ss African countries ? If so, you should cite them, because it's not necessarily clear what you're talking about."

I wasn't referring to Wicherts. I was simply offering an explanation for why migrants from region x don't perform as their region x National IQs would predict.

(2) "Whatever the case, absence/presence of bias in any given test should not be generalized to other tests. You cite Trundt (2013), who has analyzed the DAS-II. But there is no DAS test in none of the samples you analyze."

There is an implied inductive argument here: (a) There are large unbiased gaps in some tests (PIAT and DAS); (b) there are likely large unbiased gaps in general; (c) thus, the found gaps based on the tests used are probably not substantially due to bias. You might deem that my evidence for (a) is weak. But I was unable to find better sources; this is what I have. If you can show me other sources -- that show the magnitude of adjusted or MI scores -- I will add them.

(3) "Sometimes, the letter "g" is cut, that is, we don't see the the letter entirely. I don't know why this happened. The zeros in table 2 has the same problem by the way (and particularly for the table for asians)."

I have no idea what you are referring to.

(4) "I don't understand why you did not even report the d effect size...
the d difference relative to whites who always speak english is a little bit larger, something like 0.11 SD."

I didn't feel like it.

(5) "[b]If possible, I would like to see the newer version with modified text in color or in bold."

No.
Admin
I read the most recent PDF version. Overall good. Soon ready for publication. I have some comments.

Perhaps add "national IQs" to the key words.

In the list of studies on pp. 2-3, the word "publicly" has a grey box. Is that on purpose? What is the purpose?

"Generally, the research which we did find and did not include did not meet one of our inclusion criteria."

p. 3

This seems to imply that you found some research and didn't include it although it met the inclusion criteria. Clarify please.

Is there a list of samples you considered but did not include?

p. 4

"We were unable to compute sample sizes for a number of the studies, as many were analyzed with online statistical tools and as these tools did not provide the necessary statistical options to generate sample sizes; as such, we did not report them in table 2 and we did not weight the survey d-values when computing meta-analytic averages; even if sample sizes were available for all surveys, doing otherwise arguably would have been undesirable given the heterogeneity of the samples, which varied in birth year, age, test type, representativity, and sample size."

Generally people use the median when calculating a central tendency for heterogeneous results (e.g., from many different kinds of methods and samples). Using the median means that outliers have no effect on the result, which they do on the mean. If outliers are skewed in a certain direction for whatever reason, the mean will be a biased estimate.

However, the median does not work well when K is small. It seems like you used the mean. Did you consider using the median? If you feel like it, you could have a look and see if using the median changes things. My hunch is that it won't change much.

As an example of a study using the median: the IPCC used the median (50th centile) result in a literature review of greenhouse gases from different energy sources.

Moomaw, W., P. Burgherr, G. Heath, M. Lenzen, J. Nyboer, A. Verbruggen, 2011: Annex II: Methodology. In IPCC: Special Report on Renewable Energy Sources and Climate Change Mitigation
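
To illustrate the point with made-up numbers (not taken from the paper), here is a quick Python check of how a single outlier moves the mean but not the median:

import statistics

d_values = [0.60, 0.65, 0.70, 0.72, 1.40]  # hypothetical d-values, one outlier
print(statistics.mean(d_values))    # 0.814 -- dragged upward by the outlier
print(statistics.median(d_values))  # 0.70  -- unaffected by it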

"are listed in table 1"

Normally one capitalizes the word "table" in this context because it is a proper name referring to a specific table. The rest of the paper consistently does not capitalize references to tables. I prefer capitalization, but it's a stylistic disagreement.

p. 5:

The table does not seem to be a real table as the text does not follow vertical lines exactly. Did you make it look like a table using spaces instead of tabs?

"We reported results for other studies such as TIMSS 1995, TIMSS 1999, TIMSS 2003, and PIRLS 2001 in the supplementary file. We did not include these results in the meta-analysis because we desired a balanced sample of surveys."

How would including them change results?

In a meta-analysis, including many studies of the same type biases the main result in whatever direction that type of study's methodology biases results, so wanting a balanced sample is not irrational.

"When sample sizes were too small to generate reliable results, scores were left blank in the chart and were not factored into the meta-analytic averages."

What was the threshold for "too small"?

p. 9:

"Relative to third+ generation Whites, the average d-values were 0.98, 0.80, and 0.98 for first, second, and third+ generation Black individuals, 1.02, 0.68, and 0.56 for first, second, and third+ generation Hispanic individuals, 0.10, -0.21, and -0.19 for first, second, and third+ generation Asian individuals, and 0.20 and 0.04 for first and second generation White individuals. For Blacks and Whites of the same generation, the first, second, and third+ generation B/W dvalues were 0.78, 0.76, and 0.98. For Hispanics and Whites of the same generation, the first, second, and third+ generation H/W d-values were 0.78, 0.65, and 0.56. For Asians and Whites 10 of the same generation, the first, second, and third+ generation d-values were -0.10, -0.21, and -0.19."

Is this paragraph necessary? It is just a complete repetition of the results in the tables just presented.

p. 13:

"Table 5. Percent of Black Immigrants to the U.S. by Region of Origin, 1980 to 2008*"

In this table and others the author refers to the numbers as percentages, but they are not multiplied by 100 and are merely parts of 1. It can throw the reader off.

p. 19:

Is there some reason why there are missing values in SD and IQ columns? Presumably IQs are calculated by converting the Score column values. Clarify?

p. 20:

What are theta scores? Is that from IRT? http://en.wikipedia.org/wiki/Item_response_theory

p. 24:

I don't understand how col G works. For Chinese, the prediction based on LV IQ is -.39, while the actual performance is -.46, a difference of |.07|. Very small. Col G says it is .86. Compare with the Japanese below. Predicted -.28, actual -.40, delta |.12|, also small. G says -.01.

What about the three missing values? Presumably the one in F is because the composition of "other Asian" is unknown while "All Asians" uses the estimated proportions from Table 15 to get to .4 ((100-94)/15=.4).

p. 26:

There is a black dot on the right of the regression plot.
I wasn't referring to Wicherts. I was simply offering an explanation for why migrants from region x don't perform as their region x National IQs would predict.


Ok, but it's not easy to understand.

There is an implied inductive argument here: (a) There are large unbiased gaps in some tests (PIAT and DAS); (b) there are likely large unbiased gaps in general; (c) thus, the found gaps based on the tests used are probably not substantially due to bias. You might deem that my evidence for (a) is weak. But I was unable to find better sources; this is what I have. If you can show me other sources -- that show the magnitude of adjusted or MI scores -- I will add them.


The argument is wrong here. Bias has different sources and reasons: differences in speededness, attitude, differential interpretation of words, knowledge, etc. Blacks, for example, can show DIF on both easy and hard items, but these likely have different explanations, such as DIF on easy items owing to differences in interpretation (given that the words are widely known and heard) and on hard items owing to word rarity. Speeded tests can induce people to guess, and more so for members of the group whose mean score is lower. If tests differ in properties, they can differ in the amount of bias and in its direction. You cannot use (a) to say that in general there is no bias. You cannot, for example, predict from the PIAT and DAS that the ASVAB is not biased. And there is proof that the ASVAB is biased, although the author has not made clear the direction of the bias.

Gibson, S. G. (1998). Gender and ethnicity-based differential item functioning on the Armed Services Vocational Aptitude Battery (Doctoral dissertation, Virginia Polytechnic Institute and State University).
http://scholar.lib.vt.edu/theses/available/etd-93098-11430/unrestricted/Sggps2.pdf

---

I cannot open your 2nd and 3rd files.

------

Emil:

What are theta scores? Is that from IRT?


Concerning HSLS2009, it's not the IRT scores. The syntax looks something like:

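* Weighted means of x1txmth (the math theta score) by race and immigrant generation; w1student is the student weight.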
WEIGHT BY w1student.
MEANS TABLES=x1txmth BY x1race BY gen
/CELLS MEAN COUNT STDDEV.

x1txmth is for X1 Mathematics theta score

whereas

x1txmscr is for X1 Mathematics IRT-estimated number right score (of 72 base year items)

See here (pp. 13-14):
http://nces.ed.gov/pubs2014/2014361_AppendixI.pdf
Admin
It seems to be an estimate of the number of items the testee would have gotten right had he attempted every item. Presumably it is used to remove any measurement bias resulting from different inclinations to try all items. If A and B have the same math ability, and A tries half of the items and then gives up, while B tries half of the items and then guesses on the remaining ones, B will get a higher score due to randomly getting some items right in the 2nd half.
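
In equation form (a sketch using the standard 3PL model, which may or may not be the exact scoring model HSLS:09 used), the IRT-estimated number-right score is

\hat{X} = \sum_{i=1}^{n} P_i(\hat{\theta}), with P_i(\theta) = c_i + (1 - c_i) / (1 + e^{-a_i(\theta - b_i)})

i.e., the sum over all n items of the model-implied probability of a correct response at the examinee's estimated theta, so unattempted items contribute their expected value rather than zero.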
When I used theta scores and IRT scores for this blog post:
http://humanvarieties.org/2013/10/20/race-ses-interaction-some-evidence-of-increasing-black-white-iq-differences-with-ses-levels-from-various-survey-data/

I saw no big difference in the HSLS 2009. If anything, the black-white (and perhaps black-asian) gaps increase a little bit when using IRT scores.
The argument is wrong here. Bias has different sources and reasons: differences in speededness, attitude, differential interpretation of words, knowledge, etc. Blacks, for example, can show DIF on both easy and hard items, but these likely have different explanations, such as DIF on easy items owing to differences in interpretation (given that the words are widely known and heard) and on hard items owing to word rarity. Speeded tests can induce people to guess, and more so for members of the group whose mean score is lower. If tests differ in properties, they can differ in the amount of bias and in its direction. You cannot use (a) to say that in general there is no bias. You cannot, for example, predict from the PIAT and DAS that the ASVAB is not biased. And there is proof that the ASVAB is biased, although the author has not made clear the direction of the bias.


The argument is valid. First, what did I say: "There are three reasons to suspect that such bias is minimal by the second and subsequent generations". Since I said "suspect", not e.g., "conclude", my evidence doesn't need to be strong. The argument just must be coherent. And it is. Finding large unbiased differences of magnitude x in samples (a) through (c) evidences the existence of large unbiased latent ability differences of magnitude x between populations. (True or false?) If there are large unbiased latent ability differences of magnitude x between the populations, then it is unlikely that the average difference of magnitude x on tests (e) through (z) is more than minimally due to bias. (True or false?) The argument isn't that tests (e) through (z) are unbiased because tests (a) through (z) are, but that the bias for tests (e) through (z) can't be large because tests (e) through (z) show a magnitude of difference commensurate with the true latent ability one, the existence of which is evidenced by tests (a) through (c).
---
Ok, but it's not easy to understand.


How would you rewrite it?

I cannot open your 2nd and 3rd files.


It opens for me. I'm using Windows 2013, so they're .docx or .odt files. What version are you using? If that's the problem, I will also add a .doc version or whatever.
Admin
I'll see if I can get rid of that pesky replacement algorithm that replaces ( c ) with © and ( r ) with ®.

Fixed! I have no idea why they would put something as annoying as that in to begin with.
The argument is valid. First, what did I say: "There are three reasons to suspect that such bias is minimal by the second and subsequent generations". Since I said "suspect", not e.g., "conclude", my evidence doesn't need to be strong. The argument just must be coherent. And it is. Finding large unbiased differences of magnitude x in samples (a) through (c) evidences the existence of large unbiased latent ability differences of magnitude x between populations. (True or false?) If there are large unbiased latent ability differences of magnitude x between the populations, then it is unlikely that the average difference of magnitude x on tests (e) through (z) is more than minimally due to bias. (True or false?) The argument isn't that tests (e) through (z) are unbiased because tests (a) through (z) are, but that the bias for tests (e) through (z) can't be large because tests (e) through (z) show a magnitude of difference commensurate with the true latent ability one, the existence of which is evidenced by tests (a) through (c).


I underlined the passages I disagree with. But I do not understand the rest. You say first "samples (a) through (c)" and then "tests (a) through (c)". And also this one: "that tests (e) through (z) are unbiased because tests (a) through (z) are" (but more generally it's the entire sentence).

I understand that you did not conclude anything but merely suspect. This is why you made indirect inferences. It's not possible to argue that MI in any given test is generalizable to other tests. You have to prove it first. It's like I said before: if you argue that MI in a specific test is generalizable to others, then, based on the Wechsler, PIAT-Math, and DAS-II, you will wrongly conclude that the ASVAB respects MI. That's not true.

IQ, cognitive, and achievement tests do not have the same items/questions, and are probably not allotted the same amount of time (i.e., minutes). If you were right, then it would have to be true that IQ tests having different properties and administered in different ways will never produce different probabilities of finding DIF (or not). If you can't prove that, the argument is not justified.

In one of my blog posts that hasn't been published yet (for several reasons), I have written:

For instance, Drasgow (1987) analyzed a sample of 3000 examinees taking the (40-item) ACT-Math and a subsample of 1000 examinees taking the (75-item) ACT-English. Some items were found to be biased in both directions, either favoring the minority groups (e.g., Blacks and Hispanics) or the majority (i.e., White) group, according to significance tests. But χ²-based tests are useless if they are overly affected by N; what is of importance is the magnitude of the DIF. And even though a large number of items showed "significant" DIF, the effects were generally of very small magnitude, hence the conclusion that both the ACT-Math and ACT-English were not racially biased. Because the small biases ran in both directions, they tended to cancel out at the (overall) test level, resulting in a relatively fair test. A cumulative DIF effect was investigated by means of the TCC, or Test Characteristic Curve. The procedure involves summing all the individual ICCs (perhaps excluding poorly estimated items, due to large sampling variance) from the free model (i.e., no constrained parameters) in order to give an expected "number-right score" as a function of the latent trait. No cumulative DIF effects were detected; there was DIF cancellation. Drasgow used a procedure known as test purification: removing the DIF items (detected using the original test score in the first round of DIF testing) from the matching test and computing a new (unbiased) total score for a second round of DIF testing. This procedure is thought to minimize methodological artifacts.

Bao (2007, 2009) analyzes the ACT-Reading (40 items nested within 4 testlets). The 4 passages consist of Prose Fiction, Social Science, Humanities, and Natural Science. The ACT was not expected to be unidimensional; for example, the item responses in Humanities are conditioned on the Humanities total score, which is called a subtest, bundle, or testlet. There were two comparisons: minority/Caucasian (1271/3171) and female/male (3078/2875). LR and the unsigned/signed area methods from the IRT approach were used. Given the ICCs provided in Appendices D and E, in Figures 11, 22, 33, and 44 for (respectively) testlets A, B, C, and D, women appeared to be slightly advantaged on A and C. So, when the 4 testlets are combined, the ACT would be slightly biased against men. In the same Figures 11, 22, 33, and 44, but in Appendix E, all of the testlets are biased against minorities, each underestimating the minority probability of success by about 0.10 or 0.12. The DIFs were of the uniform kind, except for testlet A, which showed non-uniform DIF. The bias does not appear to be small, but moderate. Unfortunately, the author never provided the ethnic composition of the "minority" group.

Gibson (1998) conducted an ambitious IRT study on the items of each of the ASVAB subtests, comparing genders (male/female) and races (Whites/Hispanics/Blacks), making 6 subgroups in total. The biases were large in magnitude. Nonetheless, Gibson noted that the bias varied greatly depending on the forms and versions of the ASVAB subtests; some forms were biased while some weren't. Interestingly, Word Knowledge and Mathematics Knowledge seemed to be free of DIF. Also, Coding Speed was the least biased among all subtests and easily passed the DIF test. Electronics Information behaved quite curiously: it performed fairly well for 5 subgroups, but for Black women the IQ was underestimated. The remaining 6 subtests showed inconsistencies in their biases depending on the forms (e.g., 15A, 15B, ...). The impact of the overall DIFs on the racial differences in the total composite score is not reported or even tested. But one is left with the idea that the ASVAB probably needs to be rethought.

Lawrence et al. (1988) analyzed male-female differences on the 85-item SAT-Verbal. They were concerned about the DIF issue since the modifications of the SAT between 1974 and 1978 widened the male advantage. The 4 independent sample sizes were N~13000 in Form 1, N~53000 in Form 2, N~32000 in Form 3, and N~77000 in Form 4; I calculated the d gaps from their Table 1, and they were, respectively, 0.1430, 0.1178, 0.1359, and 0.1665. STD P-DIF was the method used. They found DIFs favoring women on items related to the humanities and DIFs favoring men on items related to science, but in general there was DIF cancellation. Most of the DIFs were small, with P-DIF values below 0.05 in absolute value; almost none were large (P-DIF larger than 0.10 in absolute value). Despite the absence of bias, the fact that those two trends were detected suggests a lack of unidimensionality. Probably, matching on the total SAT-V score was not optimal. Humanities and science "subtests" should have been created inside the SAT-V; if groups were matched on these subtests when analyzing the items belonging to those dimensions, fewer DIFs (in number and in magnitude) would have been found. See, e.g., Bao (2007).

Nandakumar (1993) found, with MH and SIBTEST, some evidence of gender bias against females on the 60-item ACT-Math (males=2115, females=2885) and the 36-item NAEP 1986 history test (males=1225, females=1215). On the ACT, some items favored males, some others females. At the test level there was partial DIF cancellation, with a large Differential Test Functioning (i.e., the sum of DIFs) bias against females (βU=.294). Items favoring males generally required some sort of analytical/geometry knowledge, such as properties of triangles and trapezoids, angles in a circle, and the volume of a box. On the NAEP history test, however, there was no evidence of DTF, with a very weak βU=.018. Items favoring males mainly involved factual knowledge, such as the location of different countries on the world map and the dates of certain historical events, whereas items favoring females involved reasoning about the constitution or entrance to the League of Nations. Nandakumar also examined the 36-item NAEP 1986 history test on 1711 Whites and 447 Blacks, and although the DTF effect size is not reported, the number as well as the magnitude of the DIFs favoring Whites is much larger than of those favoring Blacks. There seems to be a very strong bias against Blacks. The items favoring Whites required some geographical knowledge and facts about World War 2, which is a perfect illustration of differences in exposure to knowledge.


I selected only some portions, because it's too long. But I know full well how difficult it is to generalize from individual studies, especially when they use different methods. Trundt et al. analyzed subtests whereas Richwine analyzed the test's items. You might think that if MGCFA shows MI, then other techniques, e.g., IRT, SIBTEST, Mantel-Haenszel, will also lead to the same conclusion. But unless researchers use different techniques and compare the outcomes, there is no definite proof of it.

How would you rewrite it?


No idea comes to mind. At least, not now.

It opens for me. I'm using Windows 2013, so they're .docx or .odt files. What version are you using? If that's the problem, I will also add a .doc version or whatever.


I use Windows 7, I think; it's a recent computer (2 years old or so).
Admin
What Chuck is saying is that if we find on tests 1-3 that there is a group difference of 1d between X and Y, and we check and there isn't any bias or very little, then this is evidence that when we check tests 4-6 for the same groups and find a difference of about 1d, this difference is likely not due to any large bias either. It is an empirical generalization.
What Chuck is saying is that if we find on tests 1-3 that there is a group difference of 1d between X and Y, and we check and there isn't any bias or very little, then this is evidence that when we check tests 4-6 for the same groups and find a difference of about 1d, this difference is likely not due to any large bias either. It is an empirical generalization.


Ok, it's clearer. But I don't think it's necessarily generalizable, even in this case, because tests 4 through 6 can have different test lengths, different questions, etc. If the Wordsum shows, for example, a d gap of about 0.6-0.7 and no bias, and if other vocabulary tests (the PPVT) or subtests (Wechsler Vocabulary) show a stronger d gap and no bias either, you can't seriously suspect the PPVT and Wechsler Vocabulary of being biased just because the d value is stronger. As for John, he has generalized from non-verbal IQ tests to English/literacy tests (in table 9). These are not only different tests, but tests of different kinds. No one says that the Hispanic-White gap is identical for verbal and non-verbal tests in terms of true latent scores, for example. If they really are different, then the finding of equivalent verbal and non-verbal gaps is evidence that there is a bias somewhere.
I read the most recent PDF version. Overall good. Soon ready for publication. I have some comments. Perhaps add "national IQs" to the key words.


Ok.

In the list of studies on pp. 2-3, the word "publicly" has a grey box. Is that on purpose? What is the purpose?


Ok.

"Generally, the research which we did find and did not include did not meet one of our inclusion criteria." This seems to imply that you found some research and didn't include it although it met the inclusion criteria. Clarify please. Is there a list of samples you considered but did not include?


I came across one study that presented first and second generation scores by nationality and third generation scores by race. I could have tried to group nationalities into races/ethnicities, but I would have had to go beyond the data. I used the term "generally" to signify that there might have been research out there that could have been squeezed in but was not. Since this was an "unsystematic" meta-analysis, I don't feel that I have an obligation to identify and list all studies not included. As it is, 17 out of 18 of the studies were said to be nationally representative; this militates against selection bias.

Generally people use the median when calculating a central tendency for heterogeneous results (e.g., from many different kinds of methods and samples). Using the median means that outliers have no effect on the result, which they do on the mean. If outliers are skewed in a certain direction for whatever reason, the mean will be a biased estimate... However, the median does not work well when K is small. It seems like you used the mean. Did you consider using the median? If you feel like it, you could have a look and see if using the median changes things. My hunch is that it won't change much.


The largest (mean minus median) effect was for first generation Hispanics, at 0.12 (1.02 versus 0.90). With the median, you lose information. This is good when you have extreme scores that are liable to throw off averages, but I only had extreme scores for first generation Hispanics, so I decided to use the mean. I was tempted to weight by total sample size (on the assumption that the % of each subgroup was relatively constant) but decided not to, because this would give too much weight to certain only semi-representative studies, mainly the NPSAS studies, which were representative only of the university population. Overall, I think my method was the least bad. I did consider others, though.

Normally one capitalizes the word "table" in this context because it is a proper name referring to a specific table. The rest of the paper consistently does not capitalize references to tables. I prefer capitalization, but it's a stylistic disagreement.


Ok.

p. 5: The table does not seem to be a real table as the text does not follow vertical lines exactly. Did you make it look like a table using spaces instead of tabs?


Fixed.

"We reported results for other studies such as TIMSS 1995, TIMSS 1999, TIMSS 2003, and PIRLS 2001 in the supplementary file. We did not include these results in the meta-analysis because we desired a balanced sample of surveys." How would including them change results?


I could have added at least 6 more international test studies. To the extent this would have changed the results -- depending on subgroup -- it would have done so by giving them a dominant international-test flavor. Even then, the effects would not have been substantial with regard to my discussion.

"When sample sizes were too small to generate reliable results, scores were left blank in the chart and were not factored into the meta-analytic averages."What was the threshold for "too small"?


I added: "When sample sizes were too small to generate reliable results, scores were left blank in the chart and were not factored into the meta-analytic averages. NAEP’s data explorers only generate values if the sample sizes are 62 or more. For analyses conducted with SPSS, we reported results if individual sample sizes were equal to or greater than 30." Some of the too small values had crept into the meta-analysis (when switching back and forth with MH), so I deleted them and updated the numbers.

"Relative to third+ generation Whites, the average d-values were 0.98, 0.80, and 0.98 for first, second, and third+ generation Black individuals, 1.02, 0.68, and 0.56 for first, second, and third+ generation Hispanic individuals, 0.10, -0.21, and -0.19 for first, second, and third+ generation Asian individuals, and 0.20 and 0.04 for first and second generation White individuals. For Blacks and Whites of the same generation, the first, second, and third+ generation B/W dvalues were 0.78, 0.76, and 0.98. For Hispanics and Whites of the same generation, the first, second, and third+ generation H/W d-values were 0.78, 0.65, and 0.56. For Asians and Whites 10 of the same generation, the first, second, and third+ generation d-values were -0.10, -0.21, and -0.19."
Is this paragraph necessary? It is just a complete repetition of the results in the tables just presented.


The section in bold is not contained in the text. Also, if I delete the whole thing, I imagine that someone else will just come along and ask that I add it.

"Table 5. Percent of Black Immigrants to the U.S. by Region of Origin, 1980 to 2008*"In this table and others the author refers to the numbers as percentages, but they are not multiplied by 100 and are merely parts of 1. It can throw the reader off.


Changed.

p. 19: Is there some reason why there are missing values in SD and IQ columns? Presumably IQs are calculated by converting the Score column values. Clarify?


Because I computed d-values relative to the White 3rd+ gen SD. I don't think I need to add a note.

p. 20: What are theta scores? Is that from IRT? http://en.wikipedia.org/wiki/Item_response_theory


Readers can look it up. Either I had to report theta or IRT values. I reported theta.

p. 24: I don't understand how col G works. For Chinese, the prediction based on LV IQ is -.39, while the actual performance is -.46, a difference of |.07|. Very small. Col G says it is .86. Compare with the Japanese below. Predicted -.28, actual -.40, delta |.12|, also small. G says -.01. What about the three missing values? Presumably the one in F is because the composition of "other Asian" is unknown while "All Asians" uses the estimated proportions from Table 15 to get to .4 ((100-94)/15=.4).


Copy and paste error. It was supposed to be F-E, or Lynn IQ minus CAT AQ.

---
I understand that you did not conclude anything but merely suspect. This is why you made indirect inferences. It's not possible to argue that MI in any given test is generalizable to other tests. You have to prove it first. It's like I said before: if you argue that MI in a specific test is generalizable to others, then, based on the Wechsler, PIAT-Math, and DAS-II, you will wrongly conclude that the ASVAB respects MI. That's not true.


I'm not arguing that the other tests are unbiased. I'm arguing that bias cannot explain a large portion of the difference found on these other tests, because there are commensurate latent ability differences.

Imagine we had 5 NAEP Reading tests, and 2 of them showed no DIF and no measurement non-invariance. Imagine that the H/W difference on these 2 tests was 0.65. This would provide evidence that there was a "true" population-level latent H/W ability difference of 0.65 SD, no? Now imagine we had 3 other NAEP tests for which DIF and measurement non-invariance were found, but adjusted scores were not presented. Imagine that these 3 tests also showed an average score difference of 0.65 SD. Knowing nothing else, we can infer that the psychometric bias on the latter 3 tests does not account for much of the average 0.65 H/W score difference, because the evidence shows that there is, in fact, a 0.65 SD latent ability difference.

I fixed the quote formatting of this post. -Emil
Admin
I took a quick look at the new PDF. One more question and one comment. :)

Because I computed d-values relative to the White 3rd+ gen SD. I don't think I need to add a note.


I don't understand. What I meant is that I don't understand why, in Table 12, Col "IQ", there are no numbers for either of the admixed groups (W-H and H-W).

The section in bold is not contained in the text. Also, if I delete the whole thing, I imagine that someone else will just come along and ask that I add it.


I don't understand. It is right there under the table on page 9.

You decide whether to keep it or not. I would prefer to delete it.
"Because I computed d-values relative to the White 3rd+ gen SD. I don't think I need to add a note."

I don't understand. What I meant is that I don't understand why, in Table 12, Col "IQ", there are no numbers for either of the admixed groups (W-H and H-W).


I will check over all of the tables again later. The format was changed several times, and during the changes some got messed up.

I don't understand. It is right there under the table on page 9.


You originally said:

p. 9:

"Relative to third+ generation Whites, the average d-values were 0.98, 0.80, and 0.98 for first, second, and third+ generation Black individuals, 1.02, 0.68, and 0.56 for first, second, and third+ generation Hispanic individuals, 0.10, -0.21, and -0.19 for first, second, and third+ generation Asian individuals, and 0.20 and 0.04 for first and second generation White individuals. For Blacks and Whites of the same generation, the first, second, and third+ generation B/W dvalues were 0.78, 0.76, and 0.98. For Hispanics and Whites of the same generation, the first, second, and third+ generation H/W d-values were 0.78, 0.65, and 0.56. For Asians and Whites 10 of the same generation, the first, second, and third+ generation d-values were -0.10, -0.21, and -0.19."

Is this paragraph necessary? It is just a complete repetition of the results in the tables just presented.


The bold part is not shown in the table. It was shown in the old version. The bold part compares, e.g., White generation 1 to Black generation 1, or rather, e.g., the White generation 1 d-values to the Black generation 1 d-values, where the respective d-values are computed relative to Whites of the 3rd+ generation. The original table clearly showed the method, but since it was changed, I probably should explain it in the text. A number of these mistakes are due, directly and indirectly (by way of frustration and time consumption), to the table format changes. Maybe you could specify how you would like the tables to be presented so that I don't run into this problem again. For example, can I screenshot and paste tables? Can I use colors?