
U.S. Ethnic/Race Differences in Aptitude by Generation

It seems to be an estimate of the number of items the testee would have gotten right had he attempted every item. Presumably it is used to remove measurement bias resulting from different inclinations to attempt all items. If A and B have the same math ability, and A tries half of the items and then gives up, while B tries half of the items and then guesses on the remainder, B will get a higher score due to randomly getting some items right in the 2nd half.
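A minimal sketch of the standard formula-score correction for guessing (assuming k-alternative multiple-choice items; the survey's exact adjustment may differ):

```python
def corrected_score(n_right, n_wrong, n_choices):
    """Classic formula-score correction: R - W/(k-1).

    Penalizes wrong answers so that blind guessing on a k-choice item
    has an expected net contribution of zero (omitted items are ignored).
    """
    return n_right - n_wrong / (n_choices - 1)

# Hypothetical case: A attempts 50 of 100 four-choice items (40 right,
# 10 wrong) and omits the rest. B matches A on the first half, then
# blindly guesses the remaining 50 items, expecting 12.5 extra raw
# points (50 * 1/4) but no extra corrected points.
a_raw = 40
b_expected_raw = 40 + 50 * (1 / 4)                            # 52.5
a_corr = corrected_score(40, 10, 4)                           # 40 - 10/3
b_expected_corr = corrected_score(40 + 12.5, 10 + 37.5, 4)    # same value
```

The point: B's guessing inflates his raw score but, in expectation, adds nothing to his corrected score, so equally able testees end up tied.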
When I used theta scores and IRT scores for this blog post here
http://humanvarieties.org/2013/10/20/rac...rvey-data/

In the HSLS 2009, I see no big difference. If anything, the black-white (and perhaps black-asian) gaps increase a little bit when using IRT scores.
Quote:The argument is wrong here. Bias has different sources, and reasons. Difference in speededness, attitude, differential interpretation regarding words, knowledge, etc. Blacks, for example, can show DIF for easy and hard items, but those have likely different explanation, such as easy DIF item owing to difference in interpretation (given that the words are widely known and heard) and hard items due to rarity. Speeded tests can induce people to guess, and more so for members for which the mean score is lower than the other group. If tests differ in properties, they can differ in the amount of bias and in its direction. You cannot imply (a) to say in general there is no bias. You cannot, for example, predict from PIAT and DAS that the ASVAB is not biased. And there is proof that ASVAB is biased, although the author has not made clear the direction of bias.

The argument is valid. First, what did I say: "There are three reasons to suspect that such bias is minimal by the second and subsequent generations". Since I said "suspect", not e.g., "conclude", my evidence doesn't need to be strong; the argument just needs to be coherent. And it is. Finding large unbiased differences of magnitude x in samples (a) through (c) evidences the existence of large unbiased latent ability differences of magnitude x between populations. (True or false?) If there are large unbiased latent ability differences of magnitude x between the populations, then it is unlikely that the average differences of magnitude x on tests (e) through (z) are more than minimally due to bias. (True or false?) The argument isn't that tests (e) through (z) are unbiased because tests (a) through (c) are, but that the bias for tests (e) through (z) can't be large, because tests (e) through (z) show a magnitude of difference commensurate with the true latent ability difference, the existence of which is evidenced by tests (a) through (c).

(2014-Jul-17, 15:15:45)menghu1001 Wrote: Ok, but it's not easy to understand.

How would you rewrite it?

Quote:I cannot open your 2nd and 3rd files.

It opens for me. I'm using Windows 2013, so they're .docx or .odt files. What version are you using? If that's the problem, I will also add a .doc version or whatever.

Attached Files
U.S. Ethnic-Race Differences in Aptitude by Generation - An Exploratory Meta-analysis (Appendix) (1).odt (Size: 731.41 KB / Downloads: 570)
U.S. Ethnic-Race Differences in Aptitude by Generation - An Exploratory Meta-analysis (Appendix) (1).doc (Size: 760.5 KB / Downloads: 590)
U.S. Ethnic-Race Differences in Aptitude by Generation - An Exploratory Meta-analysis (Appendix) (1).docx (Size: 740.14 KB / Downloads: 603)
I'll see if I can get rid of that pesky replacement algorithm that replaces ( c ) with © and ( r ) with ®.

Fixed! I have no idea why they would put something as annoying as that in to begin with.
(2014-Jul-17, 19:16:28)Chuck Wrote: The argument is valid. First, what did I say: "There are three reasons to suspect that such bias is minimal by the second and subsequent generations". Since I said "suspect", not e.g., "conclude", my evidence doesn't need to be strong; the argument just needs to be coherent. And it is. Finding large unbiased differences of magnitude x in samples (a) through (c) evidences the existence of large unbiased latent ability differences of magnitude x between populations. (True or false?) If there are large unbiased latent ability differences of magnitude x between the populations, then it is unlikely that the average differences of magnitude x on tests (e) through (z) are more than minimally due to bias. (True or false?) The argument isn't that tests (e) through (z) are unbiased because tests (a) through (c) are, but that the bias for tests (e) through (z) can't be large, because tests (e) through (z) show a magnitude of difference commensurate with the true latent ability difference, the existence of which is evidenced by tests (a) through (c).

I underlined the passages I disagree with. But I do not understand the rest. You say first samples (a) through (c) and then tests (a) through (c). And also this one: "that tests (e) through (z) are unbiased because tests (a) through (z) are" (but more generally it's the entire sentence).

I understand you did not conclude anything but merely suspect. This is why you made indirect inferences. It's not possible to argue that MI in any given test is generalizable to other tests. You have to prove it first. It's like I said before: if you argue MI in a specific test is generalizable to others, then, based on the Wechsler, PIAT math, and DAS-II, you will wrongly conclude the ASVAB respects MI. That's not true.

IQ, cognitive, or achievement tests do not have the same items/questions, and are probably not allotted the same amount of time (i.e., minutes) to take the test. If you say the inference is right, then it must be true that IQ tests having different properties and administered in different ways will never produce different probabilities of showing DIF (or not). If you can't prove that, the argument is not justified.

In one of my blog posts that hasn't been posted yet (for several reasons) I have written :

Quote:For instance, Drasgow (1987) analyzed a sample of 3000 examinees taking the (40-item) ACT-Math and a subsample of 1000 examinees taking the (75-item) ACT-English. Some items were found to be biased in both directions, either favoring the minority (e.g., black and hispanic) or the majority (i.e., white) group, according to significance tests. But χ²-based tests are useless if they are overly affected by N; what matters is the magnitude of the DIF. And even though a large number of items showed "significant" DIF, the DIFs were generally of very small magnitude, hence the conclusion that neither the ACT-Math nor the ACT-English was racially biased. Because the small biases ran in both directions, they tended to cancel out at the (overall) test level, resulting in a relatively fair test. A cumulative DIF effect was investigated by means of the TCC (Test Characteristic Curve). The procedure sums all individual ICCs (perhaps excepting poorly estimated items, due to large sampling variance) from the free model (i.e., no constrained parameters) in order to give an expected "number-right score" as a function of the latent trait. No cumulative DIF effect was detected: there was DIF cancellation. Drasgow also used a procedure known as test purification: removing the DIF items (detected using the original test score in a first round of DIF testing) from the matching test and computing a new (unbiased) total score for a second round of DIF testing. This procedure is thought to minimize methodological artifacts.
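To make the TCC procedure concrete, here is a minimal sketch: summing item characteristic curves to get the expected number-right score. The item parameters are invented, and a 3PL model with the conventional D = 1.7 scaling is assumed.

```python
import numpy as np

def icc_3pl(theta, a, b, c=0.0):
    """3PL item characteristic curve: probability of a correct response
    at latent ability theta, with discrimination a, difficulty b, and
    pseudo-guessing c (D = 1.7 scaling)."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected number-right score, computed
    as the sum of the individual ICCs."""
    return sum(icc_3pl(theta, *it) for it in items)

# Hypothetical (a, b, c) parameters for a 3-item mini-test
items = [(1.2, -0.5, 0.2), (0.8, 0.0, 0.2), (1.5, 1.0, 0.2)]
theta = np.linspace(-3, 3, 121)
expected_scores = tcc(theta, items)  # rises from ~3*c toward 3
```

Comparing the TCCs built from each group's freely estimated parameters is what reveals (or rules out) a cumulative DIF effect at the test level.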

Bao (2007, 2009) analyzed the ACT-Reading (40 items nested within 4 testlets). The 4 passages consist of Prose Fiction, Social Science, Humanities, and Natural Science. The ACT was not expected to be unidimensional, so the item responses in, say, Humanities are conditioned on the Humanities total score; such a unit is called a subtest, bundle, or testlet. Two sample comparisons were used: minority/caucasian (1271/3171) and female/male (3078/2875). The methods were logistic regression (LR) and the unsigned/signed area measures from the IRT approach. Given the ICCs provided in Appendices D and E, in Figures 11, 22, 33, and 44 for (respectively) testlets A, B, C, and D, women appeared to be slightly advantaged on A and C; so, when the 4 testlets are combined, the ACT would be slightly biased against men. In the same Figures 11, 22, 33, and 44, but in Appendix E, all of the testlets are biased against minorities, each under-estimating the minority probability of success by about 0.10 or 0.12. The DIFs were of the uniform kind, except for testlet A, which showed non-uniform DIF. The bias does not appear to be small, but moderate. Unfortunately, the author never provided the ethnic composition of the "minority" group.

Gibson (1998) conducted an ambitious IRT study on the items of each of the ASVAB subtests, comparing genders (male/female) and races (whites/hispanics/blacks), which means 6 subgroups in total. The biases were large in magnitude. Nonetheless, Gibson noted that the bias varied greatly depending on the Forms and versions of the ASVAB subtests: some Forms were biased while others weren't. Interestingly, Word Knowledge and Mathematics Knowledge seemed to be free of DIF. Also, Coding Speed was the least biased of all the subtests and easily passed the DIF test. Electronics Information behaved quite curiously: it performed fairly well for 5 of the 6 subgroups, but under-estimated ability for black women. The remaining 6 subtests showed inconsistencies in their biases, depending on the Forms (e.g., 15A, 15B, ...). The impact of the overall DIFs on the racial differences in the total composite score is not reported or even tested. But one is left with the idea that the ASVAB probably needs to be rethought.

Lawrence et al. (1988) analyzed male-female differences on the 85-item SAT-Verbal. They were concerned about the DIF issue because the modifications of the SAT between 1974 and 1978 had widened the male advantage. The 4 independent sample sizes were N~13000 in Form 1, N~53000 in Form 2, N~32000 in Form 3, and N~77000 in Form 4; I have calculated the d gaps from their Table 1, and they were, respectively, 0.1430, 0.1178, 0.1359, and 0.1665. STD P-DIF was the method used. They found DIF favoring women on items related to humanities and DIF favoring men on items related to science. But in general, there was DIF cancellation. Most of the DIFs were of small size, with P-DIF values below 0.05 in absolute value; almost no items had large DIFs (P-DIF larger than 0.10 in absolute value). Despite the absence of overall bias, the fact that those two trends were detected suggests a lack of unidimensionality. Probably, matching on the total SAT-V score was not optimal: Humanities and Science "subtests" should have been created inside the SAT-V. If groups had been matched on these subtests when analyzing the items belonging to the respective dimensions, fewer and smaller DIFs would have been found. See, e.g., Bao (2007).
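For concreteness, the STD P-DIF statistic (Dorans & Kulick's standardized P-difference) can be sketched as below; the score levels and counts are invented for illustration:

```python
def std_p_dif(ref, foc):
    """Standardized P-difference: at each matched total-score level k,
    take the focal-minus-reference difference in proportion correct on
    the studied item, weighted by the focal group's count at k.

    ref, foc: dicts mapping score level -> (n_correct, n_total).
    Positive values favor the focal group; ETS treats |value| < 0.05
    as negligible and |value| > 0.10 as large.
    """
    num = den = 0.0
    for k, (fc, fn) in foc.items():
        if k not in ref or fn == 0:
            continue
        rc, rn = ref[k]
        if rn == 0:
            continue
        num += fn * (fc / fn - rc / rn)
        den += fn
    return num / den if den else 0.0

# Hypothetical item data at three matched score levels
ref = {0: (30, 100), 1: (60, 100), 2: (90, 100)}
foc = {0: (20, 80), 1: (40, 80), 2: (70, 80)}
dif = std_p_dif(ref, foc)  # negative: item disfavors the focal group
```

Because the weights come from the focal group's score distribution, the statistic reflects the impact of the item where focal examinees actually are.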

Nandakumar (1993) found, with MH and SIBTEST, some evidence of gender bias against females on the 60-item ACT-Math (males=2115, females=2885) and the 36-item NAEP 1986 history test (males=1225, females=1215). On the ACT, some items favored males, others females. At the test level, there was partial DIF cancellation, with a large Differential Test Functioning (i.e., the sum of the DIFs) bias against females (βU=.294). Items favoring males generally required some sort of analytical/geometry knowledge, such as properties of triangles and trapezoids, angles in a circle, and the volume of a box. On the NAEP history test, however, there was no evidence of DTF, with a very weak βU=.018. Items favoring males mainly involved factual knowledge, such as the location of different countries on the world map and the dates of certain historical events, whereas items favoring females involved reasoning about the constitution or entrance into the League of Nations. Nandakumar also examined the 36-item NAEP 1986 history test on 1711 whites and 447 blacks, and although the DTF effect size is not reported, the number as well as the magnitude of DIFs favoring whites is much larger than those favoring blacks. There seems to be a very strong bias against blacks. The items favoring whites required some geographical knowledge and facts about World War 2, which is a perfect illustration of differences in exposure to knowledge.
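The MH procedure mentioned here pools a common odds ratio over matched score levels, conventionally reported on the ETS delta scale; a minimal sketch (with invented counts) follows:

```python
import math

def mh_ddif(tables):
    """Mantel-Haenszel common odds ratio across matched score levels,
    reported on the ETS delta scale: MH D-DIF = -2.35 * ln(alpha_MH).
    Negative values indicate an item that is harder for the focal group.

    tables: list of (A, B, C, D) counts per score level, where A/B are
    the reference group's right/wrong and C/D the focal group's.
    """
    num = den = 0.0
    for a, b, c, d in tables:
        t = a + b + c + d
        if t == 0:
            continue
        num += a * d / t
        den += b * c / t
    alpha = num / den
    return -2.35 * math.log(alpha)

# Invented single-stratum examples
no_dif = mh_ddif([(40, 60, 20, 30)])        # equal odds -> 0.0
against_focal = mh_ddif([(60, 40, 20, 30)]) # reference favored -> negative
```

SIBTEST's βU plays an analogous role but is built from weighted differences in regression-corrected subgroup means rather than odds ratios, which is why the two methods can be cross-checked against each other.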

I selected only some portions, because it's too long. But I know full well how difficult it is to generalize from individual studies, especially when they use different methods. Trundt et al. analyzed subtests whereas Richwine analyzed the test's items. You can think that if MGCFA produces MI, then other techniques, e.g., IRT, SIBTEST, Mantel-Haenszel, will also lead to the same conclusion. But unless researchers use different techniques and compare the outcomes, there is no definite proof of it.

Quote:How would you rewrite it?

No idea comes to mind. At least, not now.

Quote:It opens for me. I'm using Windows 2013, so they're .docx or .odt files. What version are you using? If that's the problem, I will also add a .doc version or whatever.

I should use Windows 7 I think, it's a recent computer (2 years or so).
What Chuck is saying is that if we find that on tests 1-3 there is a group difference of 1 d between X and Y, and we check and there isn't any bias or very little, then this is evidence that when we check tests 4-6 for the same groups and find a difference of about 1 d, this difference is likely not due to any large bias either. It is an empirical generalization.
(2014-Jul-17, 21:10:40)Emil Wrote: What Chuck is saying is that if we find that on tests 1-3 there is a group difference of 1 d between X and Y, and we check and there isn't any bias or very little, then this is evidence that when we check tests 4-6 for the same groups and find a difference of about 1 d, this difference is likely not due to any large bias either. It is an empirical generalization.

Ok, it's clearer. But I don't think it's necessarily generalizable, even in this case, because tests 4 through 6 can have different test lengths, different questions, etc. If the Wordsum shows, for example, a d gap of about 0.6-0.7 and has no bias, and if other vocabulary tests (PPVT) or subtests (Wechsler Vocabulary) show a stronger d gap and no bias either, you can't seriously suspect the PPVT and Wechsler Vocabulary of being biased just because the d value is stronger. As for John, he has generalized from non-verbal IQ tests to english/literacy tests (in Table 9). These are not only different tests, but tests of different kinds. No one says that the hispanic-white gap is identical for verbal and non-verbal tests in terms of true latent scores, for example. If they really are different, then the finding of equivalent verbal and non-verbal gaps is evidence there is a bias somewhere.

Ok.

Quote:In the list of studies on pp. 2-3, the words "publicly" has a grey box. Is that on purpose? What is the purpose?

Ok.

Quote:"Generally, the research which we did find and did not include did not meet one of our inclusion criteria." This seems to imply that you found some research and didn't include it although it met the inclusion criteria. Clarify please. Is there a list of samples you considered but did not include?

I came across one study that presented first and second generation scores by nationality and third generation scores by race. I could have tried to group nationalities into races/ethnicities, but I would have had to go beyond the data. I used the term "generally" to signify that there might have been research out there that could have been squeezed in but was not. Since this was an "unsystematic" meta-analysis, I don't feel that I have an obligation to identify and list all studies not included. As it is, 17 of the 18 studies were said to be nationally representative; this mitigates selection bias.

Quote:Generally people use the median when calculating a central tendency for heterogeneous results (e.g. from many different kinds of methods and samples). Using the median instead means that outliers have no effect on the result which they do on the mean. If outliers are skewed in a certain direction for whatever reason, the mean will be a biased estimate...However, the median does not work well when K is small. It seems like you used the mean. Did you consider using the median? If you feel like it, you could have a look and see if using the median changes things. My hunch is that it won't change much.

The largest (mean minus median) effect was for first generation Hispanics, at 0.12 (1.02 versus 0.90). With the median, you lose information. This is good when you have extreme scores that are liable to throw off averages, but I only had extreme scores for first generation Hispanics, so I decided to use the mean. I was tempted to weight by total sample size (on the assumption that the % of each subgroup was relatively constant) but decided not to, because this would give too much weight to certain only semi-representative studies, mainly the NPSAS studies, which were representative only of the university population. Overall, I think my method was the least bad. I did consider others, though.
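To illustrate the mean-median tradeoff: the individual d-values below are invented for illustration (only the 1.02 versus 0.90 summary comes from the text), with one extreme study standing in for the first generation Hispanic case.

```python
import statistics

# Hypothetical d-values: one outlier pulls the mean but not the median
d_values = [0.75, 0.85, 0.90, 0.95, 1.65]

mean_d = statistics.mean(d_values)      # 1.02
median_d = statistics.median(d_values)  # 0.90
gap = mean_d - median_d                 # 0.12: only the mean moves
```

Dropping the 1.65 outlier would pull the mean down toward 0.86 while leaving the median nearly unchanged, which is the robustness-versus-information tradeoff at issue.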

Quote:Normally one capitalizes the word table because it is a proper name in this context that refers to a specific table. The rest of the paper uses the same practice of not capitalizing references to tables. I prefer capitalization, but it's a stylistic disagreement.

Ok.

Quote:p. 5: The table does not seem to be a real table as the text does not follow vertical lines exactly. Did you make it look like a table using spaces instead of tabs?

Fixed.

Quote:"We reported results for other studies such as TIMSS 1995, TIMSS 1999, TIMSS 2003, and PIRLS 2001 in the supplementary file. We did not include these results in the meta-analysis because we desired a balanced sample of surveys." How would including them change results?

I could have added at least 6 more international test studies. To the extent this would have changed the results -- depending on the subgroup -- it would have done so by giving them a dominant international test flavor. Even then, the effects would not have been substantial with regard to my discussion.

Quote:"When sample sizes were too small to generate reliable results, scores were left blank in the chart and were not factored into the meta-analytic averages."What was the threshold for "too small"?

I added: "When sample sizes were too small to generate reliable results, scores were left blank in the chart and were not factored into the meta-analytic averages. NAEP’s data explorers only generate values if the sample sizes are 62 or more. For analyses conducted with SPSS, we reported results if individual sample sizes were equal to or greater than 30." Some of the too small values had crept into the meta-analysis (when switching back and forth with MH), so I deleted them and updated the numbers.

Quote:"Relative to third+ generation Whites, the average d-values were 0.98, 0.80, and 0.98 for first, second, and third+ generation Black individuals, 1.02, 0.68, and 0.56 for first, second, and third+ generation Hispanic individuals, 0.10, -0.21, and -0.19 for first, second, and third+ generation Asian individuals, and 0.20 and 0.04 for first and second generation White individuals. For Blacks and Whites of the same generation, the first, second, and third+ generation B/W d-values were 0.78, 0.76, and 0.98. For Hispanics and Whites of the same generation, the first, second, and third+ generation H/W d-values were 0.78, 0.65, and 0.56. For Asians and Whites of the same generation, the first, second, and third+ generation d-values were -0.10, -0.21, and -0.19."
Is this paragraph necessary? It is just a complete repetition of the results in the tables just presented.

The section in bold is not contained in the text. Also, if I delete the whole thing, I imagine that someone else will just come along and ask that I add it.

Quote:"Table 5. Percent of Black Immigrants to the U.S. by Region of Origin, 1980 to 2008*"In this table and others the author refers to the numbers as percentages, but they are not multiplied by 100 and are merely parts of 1. It can throw the reader off.

Changed.

Quote:p. 19:Is there some reason why there are missing values in SD and IQ columns? Presumably IQs are calculated by converting the Score column values. Clarify?

Because I computed d-values relative to the White 3rd+ gen SD. I don't think I need to add a note.

Quote:p. 20:What are theta scores? Is that from IRT? http://en.wikipedia.org/wiki/Item_response_theory

Readers can look it up. Either I had to report theta or IRT values. I reported theta.

Quote:p 24:I don't understand how col G works. For Chinese, the prediction based on LV IQ is -.39, while the actual performance is -.46, a difference of |.07|. Very small. Col G says it is .86. Compare with the Japanese below. Predicted -.28, actual -.40, delta |.12|, also small. G says -.01.What about the three missing values? Presumably the one in F is because the composition of "other Asian" is unknown while "All Asians" uses the estimated proportions from Table 15 to get to .4 ((100-94)/15=.4).

Copy and paste error. It was supposed to be F-E or Lynn IQ minus CAT AQ.

(2014-Jul-17, 20:38:07)menghu1001 Wrote: I understand you did not conclude anything but merely suspect. This is why you made indirect inferences. It's not possible to argue that MI in any given test is generalizable to other tests. You have to prove it first. It's like I said before: if you argue MI in a specific test is generalizable to others, then, based on the Wechsler, PIAT math, and DAS-II, you will wrongly conclude the ASVAB respects MI. That's not true.

I'm not arguing that the other tests are unbiased. I'm arguing that bias cannot explain a large portion of the difference found on these other tests, because there are commensurate latent ability differences.

Imagine we had 5 NAEP Reading tests, and 2 of them showed no DIF and no measurement non-invariance. Imagine that the H/W difference on these 2 tests was 0.65. This would provide evidence that there was a "true" population-level latent H/W ability difference of 0.65 SD, no? Now imagine we had 3 other NAEP tests for which DIF and measurement non-invariance were found, but adjusted scores were not presented. Imagine that these 3 tests also showed an average score difference of 0.65 SD. Knowing nothing else, we can infer that the psychometric bias on the latter 3 tests is not accounting for much of the average 0.65 H/W score difference, because the evidence shows that there is, in fact, a 0.65 SD latent ability difference.
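The arithmetic behind this inference, under the simplifying assumption that the observed gap decomposes additively into a latent gap plus psychometric bias:

```python
# Simplifying assumption: observed gap = latent gap + psychometric bias.
latent_gap = 0.65     # pinned down by the 2 invariant tests
observed_gap = 0.65   # average on the 3 non-invariant tests
implied_bias = observed_gap - latent_gap  # leaves no room for large bias
```

Under this decomposition, a large bias on the non-invariant tests would require their observed gap to exceed 0.65 SD, which it does not.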

I fixed the quote formatting of this post. -Emil

Attached Files
U.S. Ethnic-Race Differences in Aptitude by Generation - An Exploratory Meta-analysis (John Fuerst 2014) (07172014) (2). (Size: 903.55 KB / Downloads: 109)
I took a quick look at the new PDF. One more question and one comment. :)

Quote: Because I computed d-values relative to the White 3rd+ gen SD. I don't think I need to add a note.

I don't understand. What I meant is that I don't understand why, in Table 12, column "IQ", there are no numbers for the two admixed groups (W-H and H-W).

Quote: The section in bold is not contained in the text. Also, if I delete the whole thing, I imagine that someone else will just come along and ask that I add it.

I don't understand. It is right there under the table at page 9.

You decide whether to keep it or not. I prefer deletion.
Quote: "Because I computed d-values relative to the White 3rd+ gen SD. I don't think I need to add a note."

I don't understand. What I meant is that I don't understand why, in Table 12, column "IQ", there are no numbers for the two admixed groups (W-H and H-W).

I will check over all of the tables again later. The format was changed several times. During the changes, some got messed up.

Quote:I don't understand. It is right there under the table at page 9.

You originally said:

Quote:9:

"Relative to third+ generation Whites, the average d-values were 0.98, 0.80, and 0.98 for first, second, and third+ generation Black individuals, 1.02, 0.68, and 0.56 for first, second, and third+ generation Hispanic individuals, 0.10, -0.21, and -0.19 for first, second, and third+ generation Asian individuals, and 0.20 and 0.04 for first and second generation White individuals. For Blacks and Whites of the same generation, the first, second, and third+ generation B/W d-values were 0.78, 0.76, and 0.98. For Hispanics and Whites of the same generation, the first, second, and third+ generation H/W d-values were 0.78, 0.65, and 0.56. For Asians and Whites of the same generation, the first, second, and third+ generation d-values were -0.10, -0.21, and -0.19."

Is this paragraph necessary? It is just a complete repetition of the results in the tables just presented.

The bold part is not shown in the table. It was shown in the old version. The bold part compares, e.g., White generation 1 to Black generation 1, or rather the White generation 1 d-values to the Black generation 1 d-values, where the respective d-values are computed relative to Whites of the 3rd+ generation. The original table clearly showed the method, but since it was changed, I probably should explain it in the text. A number of these mistakes are due, directly and indirectly (by way of frustration and time consumption), to the table format changes. Maybe you could specify how you would like the tables to be presented so that I don't run into this problem again. For example, can I screenshot and paste tables? Can I use colors?
