[ODP] An update on the narrowing of the black-white gap in the Wordsum

Tue 30 Sep 2014 08:45

Admin

An update on the secular narrowing of the black-white gap in the Wordsum vocabulary test (1974-2012)

Abstract

The aim of this article is to provide an update to Huang & Hauser (2001) study. I use the General Social Survey (GSS) to analyze the trend in the black-white difference in Wordsum scores by survey years and by cohorts. Tobit regression models show that the black-white difference diminishes over time by cohorts but just slightly by survey years. That is, the gap closing is mainly a cohort effect. The black-white narrowing may have ceased in the last cohorts and periods.

Keywords: IQ, black-white gap

(I have initially tried to attach .rar file that contains the .doc, pdf files as well as XLS but apparently that option is not available for me. Can Emil or Duxide help me with this ? That should prevent google from indexing non-published versions, I believe.)

EDIT.

The article has been uploaded to OSF as well.
https://osf.io/9tgmi/

And here's the link to the xls file.
https://osf.io/2w4h9/

Emil O. W. Kirkegaard

Tue 30 Sep 2014 09:48

Admin

Just upload the project files to http://osf.io/

Davide Piffer

Tue 30 Sep 2014 11:13

I cannot even open the paper because I am on holiday and my old netbook is not well equipped. So I just read the abstract. Genetic admixture with Caucasians has been steadily increasing and reached about 20% in the contemporary African American population but could you find statistics with an estimate of how Euro admixture has gone up over the last century? And maybe correlate it with cognitive gap closing? This of course wouldn't rule out the cultural explanation but would show that the environmental explanation is not the only one available.

Meng Hu

Tue 30 Sep 2014 16:52

Admin

could you find statistics with an estimate of how Euro admixture has gone up over the last century? And maybe correlate it with cognitive gap closing?

In the GSS, there are two variables, racecen1 and racecen2 for first and second race mentioned. I guess if someone answers black for racecen1 and white for racecen2 he could be biracial. The problem is that racecen1 has modest sample size (17480) while the sample size of racecen2 is ridiculous (1033). And I do not even look at those who don't have wordsum data. So, in the GSS itself, it's just impossible to do that.

Of course, there may be other data, but I don't have those.

Emil O. W. Kirkegaard

Tue 30 Sep 2014 17:49

Admin

MH,

If you upload the files to OSF as I proposed, Piffer would also be able to read them. OSF features in-browser reading of PDF files and other standard files, like spreadsheets.

Emil O. W. Kirkegaard

Fri 03 Oct 2014 10:11

Admin

The code for STATA should be in an independent supplementary file, not the appendix. Copying code from PDF files does not work well.

Figure 1 shows that there is a strong ceiling effect for the wordsum variable in the white sample.

Which subset of the data are the figures based on? The text doesn't say, and the ceiling effect is not the same over time.

Can you put the figures in the text near where they are mentioned instead of having them compiled at the end? Keeping the regression tables back there seems okay.

Overall paper seems okay. I'm not familiar with tobit regression, so I can't comment on that.

Meng Hu

Sun 05 Oct 2014 21:59

Admin

In the article, figure 1 obviously shows the histogram for the sample of each racial groups (but excluding people aged 70+). In the section "limitation" it is said that in the white sample, the ceiling effect diminishes over time (when looking at each category of cohort6). I attach the pictures here.

The pictures (fig. 1-3) are too large. I can only put one by page. I can put fig 1 exactly where you ask, but for figures 2-3, it's more difficult to put them exactly where I want. That did not look good. That's why I put everything at the end of the article.

For the syntax, I know that lot of people will not even look at the supplementary files, and I want to show the syntax, this can encourage others to do so.

For tobit regression, I would recommend :

The Uses of Tobit Analysis (McDonald & Moffitt 1980)
Roncek, D. W. (1992). Learning more from tobit coefficients: Extending a comparative analysis of political protest. American Sociological Review, 503-507.
Introductory Econometrics: A Modern Approach (Jeffrey M. Wooldridge 2012; pages 596-601) (you can have it in libgen)
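
For readers unfamiliar with tobit, here is a minimal sketch of the log-likelihood that a tobit model maximizes when the outcome is censored from above, as with a wordsum ceiling at 10. It is written in Python rather than Stata and uses only the standard library. The function names, the single 0/1 predictor, and the scores are illustrative assumptions, not the syntax or data used in the article.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(y, x, intercept, slope, sigma, upper=10.0):
    """Log-likelihood of a one-predictor tobit model, right-censored at `upper`.

    Scores below the ceiling contribute the usual normal density;
    scores at the ceiling contribute P(latent score >= upper).
    """
    total = 0.0
    for yi, xi in zip(y, x):
        mu = intercept + slope * xi
        if yi >= upper:
            total += math.log(1.0 - normal_cdf((upper - mu) / sigma))
        else:
            total += math.log(normal_pdf((yi - mu) / sigma) / sigma)
    return total

# Hypothetical scores (0-10) and a 0/1 group indicator, for illustration only.
scores = [4, 6, 7, 10, 10, 5, 8, 10]
group = [0, 0, 0, 0, 1, 1, 1, 1]
print(tobit_loglik(scores, group, intercept=6.0, slope=1.0, sigma=2.0))
```

An ordinary regression would treat every score of 10 as an exact value. The censored term instead allows the latent score to lie anywhere at or above the ceiling, which is why the tobit coefficients refer to the latent variable.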

Besides, I have detected two errors.

Concerning the variable "sibs" there are two observations for which the number of siblings (55 and 68) shows a dramatic departure from all other respondents. I decided to filter them out. As for age variable, I decided to remove (set to missing data) people aged 70 or more.

Concerns about sample representativeness have sometimes been expressed. Hence, following the recommendations of Hauser & Huang (1999) I weight the data by the variable "weight" which is the interaction of the variables "wtssall" and "oversamp", although this will not change the results.1

It's not "set to missing" because I "dropped" the observations. Also, it's not Hauser & Huang (1999) because there is no 1999 paper. Only the 1996, 1998 (in Neisser's book on the Flynn effect), 2000 or the 2001. I will correct those mistakes later.

By the way, I have contacted Hauser, Huang, Lynn, Flynn, and Dickens, that is, those who have worked on the BW IQ changes over time. They responded quickly, and (almost) the same day (30 sept/1 oct). Lynn said it's interesting. Huang said that their analysis needed an update, and he thanked me for doing so (and hoped the best for my publication). Flynn said it looks fascinating, and we must keep in touch. So, no one has really commented the article. I will try to email other people whose work is close to that topic. I think it's important to have the opinions from scholars.

Emil O. W. Kirkegaard

Sun 05 Oct 2014 22:43

Admin

In the article, figure 1 obviously shows the histogram for the sample of each racial groups (but excluding people aged 70+). In the section "limitation" it is said that in the white sample, the ceiling effect diminishes over time (when looking at each category of cohort6). I attach the pictures here.

I mean, which year? All years combined?

The pictures (fig. 1-3) are too large. I can only put one by page. I can put fig 1 exactly where you ask, but for figures 2-3, it's more difficult to put them exactly where I want. That did not look good. That's why I put everything at the end of the article.

Just shrink them a bit so they aren't too large. They don't need to be that large. Having figures at the end of the article is bad since it interrupts the reading flow (readers have to stop, scroll to the end to read the figures, then jump back to the page with the relevant text).

For the syntax, I know that lot of people will not even look at the supplementary files, and I want to show the syntax, this can encourage others to do so.

Most readers will not read the code in the appendix either, and most readers don't use STATA and even those who do are often not familiar with complex syntax. I maintain that code should be in a supplementary file.

For tobit regression, I would recommend :

The Uses of Tobit Analysis (McDonald & Moffitt 1980)
Roncek, D. W. (1992). Learning more from tobit coefficients: Extending a comparative analysis of political protest. American Sociological Review, 503-507.
Introductory Econometrics: A Modern Approach (Jeffrey M. Wooldridge 2012; pages 596-601) (you can have it in libgen)

Thank you. I will look into the last reference.

By the way, I have contacted Hauser, Huang, Lynn, Flynn, and Dickens, that is, those who have worked on the BW IQ changes over time. They responded quickly, and (almost) the same day (30 sept/1 oct). Lynn said it's interesting. Huang said that their analysis needed an update, and he thanked me for doing so (and hoped the best for my publication). Flynn said it looks fascinating, and we must keep in touch. So, no one has really commented the article. I will try to email other people whose work is close to that topic. I think it's important to have the opinions from scholars.

I agree. Contacting relevant scholars is a good idea.

Meng Hu

Sun 05 Oct 2014 23:34

Admin

The histograms are for all years combined. This is obvious since in the "limitation" I said the pattern looks different when we look at specific periods. I will however modify the sentence as follows "Figure 1 plots the histogram of wordsum for all years combined. It shows that there is a strong ceiling effect for the wordsum variable in the white sample.". Would it be ok ?

As for the size of the pictures, I did not ask that big, it's just how Stata has generated them, and i just saved them as png. I will try to reduce the size of the pictures without deteriorating the quality. However, I maintain it would not be possible to put the tables (notably the regressions) where you want me to put them. As you can see, their size is so large that you need (almost) an entire page. That's one of the reason why I put everything at the end. If you want, however, I will put the pictures where you think they should.

For the syntax, i will think about it, because I'm convinced it's beneficial. Of course, most won't care about it, but those who want to go into the details and are familiar with the software (unfortunately, R works badly for me, which is why I decided to use Stata instead) can see it. I have the feeling it's less likely they will examine my syntax if it's only displayed in the supplementary file. They won't bother to go here at OP forums, check the files, etc. That's what I'm afraid about.

Emil O. W. Kirkegaard

Mon 06 Oct 2014 00:08

Admin

Some more comments.

The histograms are for all years combined. This is obvious since in the "limitation" I said the pattern looks different when we look at specific periods. I will however modify the sentence as follows "Figure 1 plots the histogram of wordsum for all years combined. It shows that there is a strong ceiling effect for the wordsum variable in the white sample.". Would it be ok ?

It wasn't obvious to me. You should probably add it to the figure caption too.

As for the size of the pictures, I did not ask that big, it's just how Stata has generated them, and i just saved them as png. I will try to reduce the size of the pictures without deteriorating the quality. However, I maintain it would not be possible to put the tables (notably the regressions) where you want me to put them. As you can see, their size is so large that you need (almost) an entire page. That's one of the reason why I put everything at the end. If you want, however, I will put the pictures where you think they should.

I didn't say to move the regression tables; I wrote "Keeping the regression tables back there seems okay." You can easily resize the figures using whichever program you use to write in.

For the syntax, i will think about it, because I'm convinced it's beneficial. Of course, most won't care about it, but those who want to go into the details and are familiar with the software (unfortunately, R works badly for me, which is why I decided to use Stata instead) can see it. I have the feeling it's less likely they will examine my syntax if it's only in the supplementary file. They won't bother to go here at OP forums, check the files, etc. That's what I'm afraid about.

I think most readers, say 95%, will not examine the STATA syntax no matter where you put it. Those who want to will likely examine it no matter where you put it. You can add a link to the OSF repository in the paper, so they don't need to go to OP forums for the supplementary files.

-

I found another error:

I build four models. For model 1, I use cohort (or survey year), race, and the interaction of race with cohort (or survey year). For model 2, I add age and gender variables. For model 3, I add the log of real family income (realinc), degree (degree) and years of school completed (educ), and region of residence at age 16 (reg16). For model 3, I add the number of siblings (sibs) as well as "type of place lived at age 16" (res16) and "living with parents at age 16" (family16).

This should be model 4.

-

The variable race "bw1" has a value of 0 for blacks and 1 for whites. Since the year 2000, the GSS begins to ask whether the respondent is hispanic or not. 5 For respondents in survey year 2000+ I have only included the respondents who declared not being hispanic (see appendix). The variable year has values going from 1972 to 2012. The variable cohort has values going from 1883 to 1994. The variable sex has the following values; male=1, female=2. The variable age has values going from 18 to 89. The variable degree has the following values; 0=lower than high school, 1=high school, 2=junior college, 3=bachelor, 4=graduate. The variable educ has values going from 0 to 20. The variable realinc has values going from 245 to 162607, and the respective numbers for log income are 5.5 and 11.99. The variable reg16 has the following values; 0=Foreign, 1=New England, 2=Middle Atlantic, 3=East North Central, 4=West North Central, 5=South Atlantic, 6=East South Atlantic, 7=West South Atlantic, 8=Mountain, 9=Pacific. The variable res16 has the following values; 1= in open country but not on a farm, 2=on a farm, 3=town lower than 50,000, 4=50,000 to 250,000, 5=in a suburb near a big city, 6=city greater than 250,000. The variable family16 has the following values; 0=other arrangement with relatives (e.g., aunt, uncle, grandparents), 1=mother & father, 2=father & stepmother, 3=mother & stepfather, 4=father, 5=mother, 6=male relative, 7=female relative, 8=male & female relatives. The variable sibs has values going from 0 to 37 (apart from two apparent outliers).

Why do you mention the range of numerical variables, and all the possible values of nominal variables?

-

It is unclear how the nominal variables are used in the regression models. Hopefully you have not used them as continuous variables, as that makes no sense at all. Reg16 (region lived) and family16 are clearly not even quasi-continuous variables. Regression on that as they were is clearly nonsense. Res16 is quasi-continuous, so regression with it is okay.
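
To illustrate the point about nominal variables, here is a minimal Python sketch of dummy (indicator) coding, which lets a regression estimate a separate coefficient for each category instead of treating the numeric codes as a scale. The respondent values and the helper name `dummy_code` are hypothetical; only the category labels come from the variable description quoted above.

```python
def dummy_code(values, reference):
    """Return one 0/1 indicator column per category, omitting the reference category."""
    categories = sorted(set(values) - {reference})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Hypothetical reg16 codes for five respondents
# (1 = New England, 5 = South Atlantic, 9 = Pacific).
reg16 = [1, 5, 9, 5, 1]
dummies = dummy_code(reg16, reference=1)
for cat, column in dummies.items():
    print(f"reg16 == {cat}: {column}")
```

Entering reg16 as a single numeric predictor would instead assume that Pacific (9) differs from New England (1) by eight times the difference between New England and Middle Atlantic (2), which has no substantive meaning.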

-

which means the gap has been reduced by an half

stronger gain over time in the model 2

It is merely a vocabulary test, and its reliability is not high (0.71).

As far as I can tell, you use the internal consistency value (cronbach's alpha presumably). This is not optimal for adjusting for measurement error since it doesn't correct for transient error. See Hunter and Schmidt (2004, p. 99). The real reliability is probably somewhat lower.
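
The point about reliability can be made concrete with the classical correction for attenuation, in which an observed standardized gap is divided by the square root of the test's reliability. The sketch below is Python with a hypothetical observed gap of 0.50. Only the 0.71 reliability comes from the paper; the 0.60 value merely stands in for "somewhat lower". It is not a claim about how the article corrects its estimates.

```python
import math

def disattenuate(d_observed, reliability):
    """Classical correction for attenuation: divide by sqrt(reliability)."""
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    return d_observed / math.sqrt(reliability)

d_observed = 0.50  # hypothetical observed gap, for illustration only
for reliability in (0.71, 0.60):
    print(reliability, round(disattenuate(d_observed, reliability), 3))
```

A lower reliability implies a larger corrected gap, so overstating the reliability understates the corrected gap.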

-

The wordsum gap has been reduced by approximately 40% or 50% but this is still coherent with their idea that the black-white IQ gap is (at least) 50% genetic and 50% environmental.

This should be "consistent" I think.

-

The supplementary files are made available at http://openpsych.net/forum/index.php.

You should link to the specific thread (http://www.openpsych.net/forum/showthread.php?tid=168). You should also link to the OSF repository with the supplementary files.

Ref
Hunter, J. E., & Schmidt, F. L. (Eds.). (2004). Methods of meta-analysis: Correcting error and bias in research findings. Sage.

John Fuerst

Mon 06 Oct 2014 17:17

The aim of this article is to provide an update to Huang & Hauser (2001) study

This should be: Huang & Hauser's (2001)

The wordsum correlates at 0.71 with the AGCT aptitude test, and that the wordsum has an internal consistency reliability of 0.71 for whites and 0.63 for blacks (Huang & Hauser, 2001), which is not surprising given the shortness of the test.

This should be: The wordsum correlates at 0.71 with the AGCT aptitude test, and it has an internal reliability of 0.71 for whites and 0.63 for blacks (Huang & Hauser, 2001); these reliabilities are relatively low for cognitive measures, but this is not surprising given the shortness of the test.

The usual operation

This should be: formula.

But it is clear that the d gaps in the period 1988-1993 were clearly smaller than than earlier years. Lynn has also regressed the d gaps on years. The "b" slope was -0.004, which means that over 22 years, the d gap has been reduced by 0.004*22=0.088, given that the linearity assumption holds (which was true according to Lynn). This is indeed not very large.

Try: But it is clear that the d gaps in the period 1988-1993 were smaller than in earlier years. Lynn has also regressed the d gaps on years. The "b" slope was -0.004, which means that, over 22 years, the d gap diminished by 0.004*22=0.088, given that a linearity assumption holds (which was true according to Lynn). This is indeed not very large.

However, their d scores differ from Lynn's only for years 1993 and 1994. But, more importantly, they faulted Lynn for not having used cohort as the variable of time trend (which can be calculated as year minus age).

Try: However, their d scores differ from Lynn's only for years 1993 and 1994. But, more importantly, they faulted Lynn for not having used cohort as the variable for the time trend (which can be calculated as year minus age).

Here is an explanation of the two concepts. With survey year, assuming age is held constant, we are asking how are the 40-year-olds in 1980 different from the 40-year-olds in 1990. The former experienced WWII, but the latter didn't. This is the period effect. With birth cohort, assuming age is held constant, we are asking how are people born in 1950 different from people born in 1960, when they were both 40 years old. The former experienced the sexual revolution in their teenage years, but the latter didn't. This is the cohort effect. The two effects may or may not be the same thing. (I must thank Satoshi Kanazawa for the tip.)

Try: Here is an explanation of the two concepts: With survey year, assuming age is held constant, we are asking, "How are the 40-year-olds in 1980 different from the 40-year-olds in 1990?". The former experienced WWII, but the latter didn't. This is the period effect. With birth cohort, assuming age is held constant, we are asking, "How are 40 year olds born in 1950 different from 40 year olds born in 1960?". The former experienced the sexual revolution in their teenage years, but the latter didn't. This is the cohort effect. The two effects may or may not be the same thing. (I must thank Satoshi Kanazawa for the tip.)
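
The relation between the two time variables can be sketched in a few lines of Python. The respondent tuples are hypothetical; the only thing taken from the paper is the identity cohort = year minus age.

```python
def birth_cohort(survey_year, age):
    """Birth cohort implied by survey year and age at interview."""
    return survey_year - age

# Hypothetical (survey year, age) pairs.
respondents = [(1980, 40), (1990, 40), (1990, 30), (2000, 40)]
for year, age in respondents:
    print(f"year={year} age={age} cohort={birth_cohort(year, age)}")
```

Because of this identity, comparing 40-year-olds across survey years is the same comparison as comparing cohorts ten years apart at age 40. The two trends can only diverge when the age composition of the compared groups differs.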

Given their parameters of 2.641 for intercept, 3.037 for race, 0.024 for the slope of year, and -0.0176 for the interaction, we can predict the changes in the gap over time. This is done by computing the white trend with race*year interaction, 2.641+3.037+(0.024*24)-(0.0176*24)=5.8316, and the white trend without the interaction, 2.641+3.037+(0.024*24)=6.2540, which gives a difference of 0.4224

It's standard to round to the same level of significant digits when dealing with the same sets of numbers. So either 0.024 and -0.018 or 0.024? and -0.0176.
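
The arithmetic in the quoted passage, and the effect of rounding the interaction coefficient, can be checked with a short Python sketch. The coefficients are those quoted from Huang & Hauser; the variable names are only labels chosen for this example.

```python
intercept = 2.641
race = 3.037
year_slope = 0.024
interaction = -0.0176
years = 24

# Predicted white score with and without the race*year interaction term.
without_interaction = intercept + race + year_slope * years
with_interaction = without_interaction + interaction * years
difference = without_interaction - with_interaction

# Same difference if the interaction is rounded to three decimals.
difference_rounded = -round(interaction, 3) * years

print(round(with_interaction, 4), round(without_interaction, 4))
print(round(difference, 4), round(difference_rounded, 4))
```

The difference is simply the interaction coefficient times the number of years, so rounding -0.0176 to -0.018 moves the 24-year figure from 0.4224 to 0.432.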

difference (corrected for censored distribution of wordsum)

What does this mean? Also, add an article ("the") before censored.

Squared and perhaps cubed terms should have been applied to categorical variables of years and their interaction with race rather than using the continuous variable of survey year.

Why "should" they have? Were the relations non-linear? Perhaps you mean:

The authors should have checked whether using squared and perhaps cubed terms produced a better fitting model. Doing so might have generated different results.

The finding of Huang & Hauser (2001) is interesting because it is known that the black-white IQ gap in the U.S. has not declined in the adult samples, only in the children samples (Rushton & Jensen, 2006; Dickens & Flynn, 2006).

Maybe: The finding of Huang & Hauser (2001) is interesting because it is known that the black-white IQ gap in the U.S. has not declined in adult samples but only in child and adolescent ones (Rushton & Jensen, 2006; Dickens & Flynn, 2006).

Use "the adult samples" when referring to a specific set of samples; use "adult samples" when referring to an unspecific set of samples. In this case, I think you are referring to an unspecific set. If not, you should say, e.g.:

It is known that the black-white IQ gap in the U.S. has not declined in the adult samples but only in the child samples discussed by Rushton & Jensen (2006) and Dickens & Flynn (2006).

It is possible, nonetheless, that there was a gap closing before the period analyzed by Dickens and Flynn. See Murray (2007).

Try: It is possible, nonetheless, that there was a gap closing before the period analyzed by Dickens and Flynn (see Murray, 2007).

Before deciding which method to apply, one needs to examine the distribution of the variables we will use.

Try: Before deciding which method to apply, one needs to examine the distribution of the variables one wishes to use.

An important assumption of linear regression is the normality of the data, especially the distribution of the dependent variable.

Try: An important assumption of linear regression is the normality of the data, especially with regard to the distribution of the dependent variable.

The right procedure should be to use a tobit regression (for an introduction, see, McDonald & Moffitt, 1980).

Try: The right procedure should be to use a tobit regression (for an introduction, see McDonald & Moffitt, 1980).

Since the year 2000, the GSS begins to ask whether

Try: Since the year 2000, the GSS began to ask whether

For respondents in survey year 2000+ I have only included the respondents who declared not being hispanic (see appendix).

Use a comma.

The variable cohort has values going from 1883 to 1994. The variable sex has the following values; male=1, female=2. The variable age has values going from 18 to 89. The variable degree has the following values; 0=lower than high school, 1=high school, 2=junior college, 3=bachelor, 4=graduate. The variable educ has values going from 0 to 20. The variable realinc has values going from 245 to 162607, and the respective numbers for log income are 5.5 and 11.99. The variable reg16 has the

For clarity place the variable names in quotes.

The variable "cohort" has values going from 1883 to 1994. The variable "sex" has the following values; male=1, female=2. The variable "age"....

According to the GSS codebook, the "white" category in variable "race" (before the year 2000) includes mexicans, spaniards and puerto ricans "who appear to be white".

Capitalize e.g., Mexican.

As for age variable, I decided to remove (set to missing data) people aged 70 or more

Try: As for the age variable...

Hence, following the recommendations of Hauser & Huang (1999) I weight the data by the variable "weight" which is the interaction of the variables "wtssall" and "oversamp", although this will not change the results.

I would use a comma.

The black-white raw score gap in cohort1 was 2.023 items correct and has become 1.001 item correct in cohort6, which means the gap has been reduced by an half, while the gap was 1.638 items correct in year1 and has become 1.333

This is because the more recent cohorts are younger, and the wordsum correlates positively with age (r=0.1005). In models 3 and 4, the scores among whites have a declining trend.

Who ever reported correlations to the ten-thousandth place? Also try:

This is because the more recent cohorts are younger, and wordsum correlates positively with age (r=0.1005). In models 3 and 4, the scores among whites have a declining trend.

This is still 50% reduction

Try: This is still a 50% reduction

A subsequent analysis is done by computing the d gap (see supplementary file) within each of the category of the dummy variables.

Categories.

I split the variable wordsum into two parts

Try: I split the variable "wordsum" into two parts

Another way to investigate whether or not the improvement occurs at high levels is to conduct logistic regression with wordsum as dependent binary variable (score levels 0-7 coded 0 and score levels 8-10 coded 1)

Try: Another way to investigate whether or not the improvement occurs at high levels is to conduct logistic regression with wordsum as the dependent binary variable (score levels 0-7 coded as 0 and score levels 8-10 coded as 1)

The most notable problem with the wordsum is not to be a measure of general intelligence

Try: The most notable problem with using wordsum, in this context, is that it is not a great measure of general intelligence.

Given Huang & Hauser's (1996, pp. 7-8) discussion, there is no clear answer to this question

Try: Given Huang & Hauser's (1996, pp. 7-8) discussion, there is no clear way to determine if this has occurred.

The affirmation that the test has become harder may be true. To some extent

Try: The affirmation that the test has become harder may be true to some extent.

whites find the wordsum harder over time while the blacks would find it a little bit easier

try: whites find the wordsum harder over time while the blacks find it a little bit easier

Generally, there is some indication that the black-white gap has been under-estimated in early cohorts. And by the same token, the magnitude of the gap narrowing.

Fragment. Try: Generally, there is some indication that the black-white gap has been under-estimated in early cohorts -- and by the same token, the magnitude of the gap narrowing.

But at the same time, the white trend could have been even flatter or turned out to be somewhat dysgenic.

I don't understand this and I would advise against using "dysgenic", since this implies a causal model, the discussion of which is outside the scope of the paper. Maybe just delete the sentence.

Granted the limitation of the wordsum test, one may wonder what is the consequence of the black-white gap decline for the genetic hypothesis proposed by Rushton & Jensen (2010). ...

I wouldn't, in this paper, discuss this. It's not directly relevant to the topic of the paper and it unnecessarily geneticizes the discussion (thus turning off potential readers). I've made the same point regarding many of Emil's discussions: Don't conflate issues e.g., the "spatial transferability hypothesis" with certain global genetic hypotheses. Delete the whole paragraph.

I'll get back to you regarding method later.

John Fuerst

Wed 08 Oct 2014 14:41

I'll get back to you regarding method later.

[Edits made]

I thought over the statistical method; I'm fine with it. I would like to know, though, why the survey year and birth cohort method produce such divergent results. There must be an age x survey year interaction. Could you check for this? I get what's happening as I've looked at the results prior. Basically, in 1975 older (50-75) African Americans perform much worse than mid age and younger (18-50) ones. During later years, the older age gap narrows. Now, one might take this as indicating a (cross age) cohort narrowing, yet another interpretation would be that it represents an older age narrowing i.e., there is less a difference between older people in 2000 than 1975. To determine which, you would need data from same age people in e.g., 1925 and 2000 which you don't have (for the GSS).

Over at HV, I commented on a similar (in methodology) analysis:

"Reardon’s analysis, of course, is deeply flawed by his failure to take into account both age effects and test content effects in addition to his dubious method of deriving early comparison points. As for the latter, he, for example, derives his early points, from the 1940s, from Charles Murray’s analysis of the 1976, 1986, and 1996 Woodcock–Johnson I to III standardizations. Of course, these samples were from the 70s, 80s, and 90s. To derive magnitudes of differences from the 40s, he projects back in time based on Murray’s birth cohort analysis. These differences, based on Full scale IQ — e.g., between 70 year old Blacks and Whites in the 90s who would have been 20 or so in the 40s — are then compared with the average Math and Reading differences between 5 to 7 year olds from the Early Childhood Longitudinal Study in the late 1990s (a study which showed a large effect of age on the magnitude of the math and reading gap — see: sample 48 — and also a large general knowledge gap at very young ages — see III, Chuck (2012c). His analysis, then, is confounded by the three problems and their interactions: (1) His method of deriving early points. (2) His comparison across measures. (3) And his comparison across ages."

For your cohort analysis, you are looking at e.g., age 50 differences in 1975 and age 25 differences in 2000 and finding a large change. But it's not obvious that this is fully a cohort change in the sense of age 18 through 65 people in 1925 versus age 18 through 65 in 2000 as opposed to an age x survey interaction in the sense that older people in 1975 (but less so younger) versus older people in 2000 (which is not the same as e.g., younger people in 1925 versus 1975). Anyways, I think that you should make a note concerning this issue. Generally, it's not clear if your "cohort analysis" is better than the survey year analysis in terms of determining the true cross age cohort effect.

(The proper interpretation should be, "There is a much larger older age gap in 1975 versus 2000" as opposed to, "There was a larger 1925 to 2000 cohort narrowing".)

Meng Hu

Wed 08 Oct 2014 19:37

Admin

I have made a lot of changes, but I will upload later.

One important change is the removal (at least temporarily) of my logistic regression analysis. I know that MacCallum et al. (2002) have already treated the practice of dichotomizing a continuous variable. They say it has problems because it lowers the reliability of the variable and can possibly alter its interpretation. One illustration can help to understand. Imagine you have 4 people with different levels of fear of spiders, A (100%), B (60%), C (40%), D (0%). You dichotomize the variable at the mean or median, so that A and B have a value of 1 (fear) while C and D have a value of 0 (no fear), and yet B and C are more alike than either A and B or C and D. This labeling is totally arbitrary and not justified. However, these authors applied this criticism to correlational, ANOVA and regression analyses. I was using logistic regression, which attempts to estimate the likelihood of having a value of 1 versus 0. But now that I think about it, I'm not so sure about its robustness. One can still argue that my labeling (0-7 vs 8-10) is arbitrary, but at the same time, the categories of my dummy variables must also be arbitrary. So, I will email MacCallum and ask him what he thinks about it. Of course, I know a lot of people in recent papers conducted such dichotomization for logistic regression, but none of them have cited MacCallum et al. In light of this, I have replaced this analysis with another; I computed the d gap of wordlow and wordhigh by dividing the black-white difference by the SD given in Table 4.
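
The spider illustration can be written out as a small Python sketch. The four scores are the ones given above; the cutoff at the mean and the helper name `gap` are assumptions made for the example.

```python
fear = {"A": 100, "B": 60, "C": 40, "D": 0}

# Dichotomize at the mean (50): 1 = "fear", 0 = "no fear".
cutoff = sum(fear.values()) / len(fear)
binary = {person: int(score > cutoff) for person, score in fear.items()}

def gap(scores, p, q):
    """Absolute difference between two people on a given scale."""
    return abs(scores[p] - scores[q])

for p, q in (("A", "B"), ("B", "C"), ("C", "D")):
    print(f"{p}-{q}: continuous gap={gap(fear, p, q)}, dichotomized gap={gap(binary, p, q)}")
```

On the continuous scale B and C are the closest pair, but after dichotomization they are the only adjacent pair that differs, while the larger A-B and C-D differences vanish.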

I have also added an explanation of the tobit coefficient, just in case someone would like to ask me to write it.

Now, the comments.

Emil :

Why do you mention the range of numerical variables, and all the possible values of nominal variables?

For the age variable, you don't need to guess what a range of 18-69 means. But for region, you don't know what values are assigned to each region.

It is unclear how the nominal variables are used in the regression models. Hopefully you have not used them as continuous variables, as that makes no sense at all. Reg16 (region lived) and family16 are clearly not even quasi-continuous variables. Regression on that as they were is clearly nonsense. Res16 is quasi-continuous, so regression with it is okay.

Generally, I have read that people accept that a variable can be treated as continuous when it has at least 5 values. The variables you mentioned all have more than 5 values.

As for the link to the thread, I couldn't add it (but now I can), because when I wrote the article I had not yet created this thread. But of course I will link to the OSF later.

Chuck :

I have made all the modifications you indicated. However, the number of digits after the decimal point differs because that is how the coefficients are presented in Huang & Hauser (0.024 for cohort and -0.0176 for race*cohort). For an interaction variable, in my experience, even a small coefficient can quite often have a meaningful effect, so I find it justified to add one more digit for this variable. Another case where I think more digits are justified (for unstandardized coefficients, but not standardized ones) is when the variable can take on a large number of values, such as age (18-69). As for the correlation of wordsum with age (0.1005), that is how Stata displayed the result. If you insist, I can round it to 0.10. Also, I have removed the word "dysgenic" and replaced it with "negative".

I wouldn't, in this paper, discuss this. It's not directly relevant to the topic of the paper and it unnecessarily geneticizes the discussion (thus turning off potential readers).

This would be unfortunate, I think. When someone discusses the black-white gap (IQ or achievement), he or she always attempts to understand its causes. If I don't attempt to explain the meaning of the gap narrowing in verbal IQ, I don't see the point of such an analysis. For example, Huang & Hauser (2001) don't buy the hereditarian argument, and they show the gap narrowing is due to gains in SES over time (although I think their analyses don't really prove it). But most people do not attempt to weigh the hypotheses. There is little use in testing the environmental hypothesis if you don't think about the predictions that the hereditarian hypothesis can make. I think most researchers should stop committing fallacies such as confirmation bias. It's easy to fall into this trap. I don't remember how many times I have read "environmental variables explain the gap, we have proved this hypothesis to be true". If the BW gap has narrowed (and even if it hadn't), I need to discuss the consequences. During the period studied, blacks have certainly improved in social status, probably more than whites did, so I need to talk about the relevance of the environmental and genetic hypotheses, even if people don't like it. Of course, I can delete the word genetic and replace it with hereditarian. That is a less provocative term, but the idea is still the same.

I get what's happening as I've looked at the results prior. Basically, in 1975 older (50-75) African Americans perform much worse than mid age and younger (18-50) ones. During later years, the older age gap narrows. Now, one might take this as indicating a (cross age) cohort narrowing, yet another interpretation would be that it represents an older age narrowing i.e., there is less a difference between older people in 2000 than 1975.

Tell me if I'm right. You want me to run a tobit regression with the cohort, race, age, cohort*race and cohort*age variables? And you suspect that the cohort*age effect becomes stronger in later cohorts?

The syntax looks something like this :

gen ageC1 = age*cohortdummy1
gen ageC2 = age*cohortdummy2
gen ageC3 = age*cohortdummy3
gen ageC4 = age*cohortdummy4
gen ageC5 = age*cohortdummy5
gen ageC6 = age*cohortdummy6

tobit wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex ageC2 ageC3 ageC4 ageC5 ageC6 [pweight = weight], ll(0) ul(10)

Tobit regression                                  Number of obs   =      22156
                                                  F(  18,  22138) =      92.49
                                                  Prob > F        =     0.0000
Log pseudolikelihood =  -47838.71                 Pseudo R2       =     0.0176

------------------------------------------------------------------------------
             |               Robust
     wordsum |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         bw1 |   2.023368   .1559898    12.97   0.000     1.717617     2.32912
cohortdummy2 |  -.5709338    .564835    -1.01   0.312    -1.678051     .536183
cohortdummy3 |  -.6702743   .5388143    -1.24   0.214    -1.726389    .3858401
cohortdummy4 |  -1.338139   .5381658    -2.49   0.013    -2.392982   -.2832961
cohortdummy5 |  -1.531758   .5444232    -2.81   0.005    -2.598866     -.46465
cohortdummy6 |  -1.503185   .5799227    -2.59   0.010    -2.639874   -.3664949
        bwC2 |  -.2301317   .1922563    -1.20   0.231    -.6069678    .1467043
        bwC3 |  -.5600697   .1797855    -3.12   0.002     -.912462   -.2076774
        bwC4 |  -.6036657   .1770871    -3.41   0.001    -.9507689   -.2565624
        bwC5 |  -1.006898   .1803751    -5.58   0.000    -1.360446   -.6533498
        bwC6 |  -1.004008   .1995938    -5.03   0.000    -1.395226   -.6127898
         age |  -.0131813   .0082283    -1.60   0.109    -.0293094    .0029467
         sex |   .1739637   .0317265     5.48   0.000     .1117776    .2361498
       ageC2 |   .0157154    .009103     1.73   0.084    -.0021271     .033558
       ageC3 |   .0311774   .0087378     3.57   0.000     .0140508    .0483041
       ageC4 |    .042542   .0088985     4.78   0.000     .0251004    .0599837
       ageC5 |   .0591789   .0094609     6.26   0.000     .0406349    .0777228
       ageC6 |   .0695841   .0124641     5.58   0.000     .0451535    .0940146
       _cons |    4.84196   .5199612     9.31   0.000     3.822799     5.86112
-------------+----------------------------------------------------------------
      /sigma |   2.107472   .0135601                      2.080893    2.134051
------------------------------------------------------------------------------
  Obs. summary:        140  left-censored observations at wordsum<=0
                     20698     uncensored observations
                      1318 right-censored observations at wordsum>=10

tobit wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 age sex ageC2 ageC3 ageC4 ageC5 ageC6 [pweight = weight], ll(0) ul(10)

Tobit regression                                  Number of obs   =      22156
                                                  F(  13,  22143) =     123.37
                                                  Prob > F        =     0.0000
Log pseudolikelihood = -47868.274                 Pseudo R2       =     0.0170

------------------------------------------------------------------------------
             |               Robust
     wordsum |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         bw1 |   1.414415     .04254    33.25   0.000     1.331034    1.497797
cohortdummy2 |  -.7754692   .5424091    -1.43   0.153     -1.83863    .2876911
cohortdummy3 |  -1.182844    .519948    -2.27   0.023    -2.201979   -.1637094
cohortdummy4 |  -1.889955   .5191508    -3.64   0.000    -2.907528   -.8723828
cohortdummy5 |  -2.448969   .5250497    -4.66   0.000    -3.478104   -1.419835
cohortdummy6 |  -2.410204   .5602023    -4.30   0.000     -3.50824   -1.312167
         age |  -.0131711   .0082734    -1.59   0.111    -.0293876    .0030455
         sex |   .1779512    .031749     5.60   0.000      .115721    .2401814
       ageC2 |   .0153997   .0091492     1.68   0.092    -.0025334    .0333328
       ageC3 |   .0311452   .0087811     3.55   0.000     .0139337    .0483567
       ageC4 |   .0425241   .0089401     4.76   0.000     .0250008    .0600474
       ageC5 |   .0600193   .0095079     6.31   0.000     .0413832    .0786553
       ageC6 |   .0708083   .0125489     5.64   0.000     .0462115    .0954051
       _cons |   5.392481   .5070554    10.63   0.000     4.398616    6.386345
-------------+----------------------------------------------------------------
      /sigma |   2.110143   .0135794                      2.083527     2.13676
------------------------------------------------------------------------------
  Obs. summary:        140  left-censored observations at wordsum<=0
                     20698     uncensored observations
                      1318 right-censored observations at wordsum>=10

tobit wordsum cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 age sex ageC2 ageC3 ageC4 ageC5 ageC6 [pweight = weight], ll(0) ul(10)

Tobit regression                                  Number of obs   =      23817
                                                  F(  12,  23805) =      40.13
                                                  Prob > F        =     0.0000
Log pseudolikelihood =   -52876.4                 Pseudo R2       =     0.0054

------------------------------------------------------------------------------
             |               Robust
     wordsum |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
cohortdummy2 |  -.6895328   .5584058    -1.23   0.217    -1.784044    .4049781
cohortdummy3 |  -1.159271   .5356257    -2.16   0.030    -2.209131   -.1094102
cohortdummy4 |   -1.81035    .534596    -3.39   0.001    -2.858192   -.7625077
cohortdummy5 |  -2.206941    .538878    -4.10   0.000    -3.263176   -1.150705
cohortdummy6 |  -2.340127   .5706149    -4.10   0.000    -3.458569   -1.221686
         age |  -.0126966    .008511    -1.49   0.136    -.0293787    .0039854
         sex |   .1219685    .031794     3.84   0.000     .0596502    .1842868
       ageC2 |   .0124503   .0094042     1.32   0.186    -.0059825    .0308831
       ageC3 |   .0280788   .0090236     3.11   0.002     .0103921    .0457656
       ageC4 |   .0368479   .0091736     4.02   0.000     .0188671    .0548287
       ageC5 |   .0455797   .0096493     4.72   0.000     .0266664    .0644929
       ageC6 |   .0533736   .0123475     4.32   0.000     .0291717    .0775754
       _cons |   6.736302   .5206905    12.94   0.000     5.715715    7.756888
-------------+----------------------------------------------------------------
      /sigma |   2.197996   .0134352                      2.171662    2.224329
------------------------------------------------------------------------------
  Obs. summary:        184  left-censored observations at wordsum<=0
                     22282     uncensored observations
                      1351 right-censored observations at wordsum>=10

So, how do we interpret this outcome? It seems to me that the age gap becomes larger in later cohorts, because the age*cohort coefficients are positive and become stronger over time. When you control for the age*cohort interaction, the cohort effect is negative. That is, the wordsum score for the entire group diminishes over time. There is still a meaningful black-white narrowing.
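To make the reading of the first model concrete, the interaction terms can be added to their main effects. The sketch below is plain arithmetic in Python on coefficients copied from the first tobit output above. It assumes cohort 1 is the reference category, and the results are on the latent (uncensored) wordsum scale that tobit coefficients refer to. The variable names are only for illustration.

```python
# Coefficients copied from the first tobit output above (model with bwC and ageC terms).
bw1 = 2.023368
bw_by_cohort = {2: -0.2301317, 3: -0.5600697, 4: -0.6036657, 5: -1.006898, 6: -1.004008}
age = -0.0131813
age_by_cohort = {2: 0.0157154, 3: 0.0311774, 4: 0.042542, 5: 0.0591789, 6: 0.0695841}

# Cohort 1 is the reference category, so its interaction terms are zero.
bw_gap = {1: bw1}
age_slope = {1: age}
for c in range(2, 7):
    bw_gap[c] = bw1 + bw_by_cohort[c]        # implied black-white gap in cohort c
    age_slope[c] = age + age_by_cohort[c]    # implied age slope in cohort c

for c in range(1, 7):
    print(f"cohort {c}: implied BW gap = {bw_gap[c]:.3f}, implied age slope = {age_slope[c]:.4f}")
```

On these numbers the implied gap goes from about 2.02 in the first cohort to about 1.02 in each of the last two cohorts, and the implied age slope goes from about -0.013 to about 0.056.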

Emil O. W. Kirkegaard

Wed 08 Oct 2014 20:01

Admin

One important change is the removal (at least temporarily) of my logistic regression analysis. MacCallum et al. (2002) have already examined the practice of dichotomizing a continuous variable. They say it is problematic because it lowers the reliability of the variable and can alter its interpretation. An illustration may help. Imagine 4 people with different levels of fear of spiders: A (100%), B (60%), C (40%), D (0%). You dichotomize the variable at the mean or median, so that A and B have a value of 1 (fear) while C and D have a value of 0 (no fear), and yet B and C are more alike than either A and B or C and D. This labeling is arbitrary and not justified. However, these authors applied this criticism to correlational, ANOVA and regression analyses. I was using logistic regression, which attempts to estimate the likelihood of having a value of 1 versus 0. But now that I think about it, I'm not so sure about its robustness. One can still argue that my labeling (0-7 vs 8-10) is arbitrary, but at the same time, the categories of my dummy variables must also be arbitrary. So, I will email MacCallum and ask him what he thinks about it. Of course, I know that a lot of recent papers dichotomize in this way for logistic regression, but none of them cite MacCallum et al. In light of this, I have replaced this analysis with another: I computed the d gap for wordlow and wordhigh by dividing the black-white difference by the SD given in Table 4.

You could explore the effect of dichotomizing it in different places. You used 0-7 vs. 8-10. You could try 0-6 vs. 7-10, 0-5 vs. 6-10 (the most 'natural' since it is split evenly along the scale), and 0-8 vs. 9-10.
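A minimal sketch of such a sensitivity check, in Python. The two score lists are hypothetical and are NOT GSS data; `share_high` is a made-up helper. It just shows how the group difference in the share of "high" scorers can be recomputed for each cut point.

```python
# Hypothetical 0-10 scores for two groups; NOT GSS data.
group_a = [3, 4, 5, 5, 6, 6, 7, 7, 8, 9]
group_b = [4, 5, 6, 6, 7, 7, 8, 8, 9, 10]

def share_high(scores, cut):
    """Share of scores at or above `cut` (the 'high' category)."""
    return sum(score >= cut for score in scores) / len(scores)

# cut=6 means 0-5 vs 6-10, cut=7 means 0-6 vs 7-10, and so on.
differences = {}
for cut in (6, 7, 8, 9):
    differences[cut] = share_high(group_b, cut) - share_high(group_a, cut)
    print(f"0-{cut - 1} vs {cut}-10: difference in share high = {differences[cut]:.2f}")
```

If the difference moves a lot from one cut point to the next, the conclusion depends on where the split is made.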

Generally, I have read that people accept that a variable can be treated as continuous when it has at least 5 values. The variables you mentioned all have more than 5 values.

Look at the variable for region. It is:

The variable reg16 has the following values; 0=Foreign, 1=New England, 2=Middle Atlantic, 3=East North Central, 4=West North Central, 5=South Atlantic, 6=East South Central, 7=West South Central, 8=Mountain, 9=Pacific.

What does it mean to be higher in this variable? What does it mean to be lower? Nothing. These are not places along a scale of something. It is a nominal variable. Using it as a continuous variable is nonsense.

The variable family16 has the following values; 0=other arrangement with relatives (e.g., aunt, uncle, grandparents), 1=mother & father, 2=father & stepmother, 3=mother & stepfather, 4=father, 5=mother, 6=male relative, 7=female relative, 8=male & female relatives.

Same for this. There is no answer to the question "What does it mean to be higher in family16?". It is because it is a nominal variable.

Finally,

The variable res16 has the following values; 1=in open country but not on a farm, 2=on a farm, 3=town lower than 50,000, 4=50,000 to 250,000, 5=in a suburb near a big city, 6=city greater than 250,000.

Is fine because: 1) there are 5 or more possible values, 2) there is a sensible answer to the question "What does it mean to be higher in res16?" The answer is that the higher one is in res16, the more people live around one. Or conversely, the lower, the fewer people live around one. Or one could answer it with the density of people in the area, etc. There are sensible answers. Variables like this one are called "quasi-continuous" (or "quasi-interval") because they are not quite continuous (in the sense that every real number between the min and max value is possible), but they are sort of continuous because there are a number of possible values in between AND because interpreting them as a scale is sensible.
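The point about nominal variables such as reg16 and family16 can be illustrated with a toy sketch in Python. The data and category names are hypothetical, NOT GSS values. It shows that the slope obtained by regressing an outcome on the numeric codes depends on how the categories happen to be numbered, while the category means do not.

```python
# Hypothetical outcome values grouped by a nominal category; NOT GSS data.
data = {"north": [6.0, 7.0], "south": [4.0, 5.0], "west": [8.0, 9.0]}

def slope_on_codes(coding):
    """OLS slope of the outcome on the numeric codes assigned by `coding`."""
    xs, ys = [], []
    for label, values in data.items():
        for value in values:
            xs.append(coding[label])
            ys.append(value)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

coding_1 = {"north": 0, "south": 1, "west": 2}
coding_2 = {"north": 1, "south": 0, "west": 2}  # same categories, different numbering

slope_1 = slope_on_codes(coding_1)
slope_2 = slope_on_codes(coding_2)

# Category means are what a set of dummy variables would recover; they ignore the codes.
group_means = {label: sum(values) / len(values) for label, values in data.items()}

print("slope under coding 1:", slope_1)
print("slope under coding 2:", slope_2)
print("category means:", group_means)
```

The two slopes differ even though nothing about the data has changed, which is why the coefficient of a nominal variable entered as a single numeric column has no interpretation.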

Scales of measurement are usually discussed in the beginning of introductory statistics books. There is one in this book, section 2.2: http://health.adelaide.edu.au/psychology/ccs/teaching/lsr/

Meng Hu

Wed 08 Oct 2014 20:44

Admin

Emil,

What does it mean to be higher in this variable? What does it mean to be lower? Nothing. These are not places along a scale of something. It is a nominal variable. Using it as a continuous variable is nonsense.

When you control for any variable, two things are going on. One is that you adjust for differences between subgroups. For example, one subgroup (father) can have a lower mean IQ than another subgroup (mother). If you adjust for it, then whatever the order of the values, the interpretation of the other coefficients won't change. However, and this is the second thing, in the variables you mentioned, e.g., res16, reg16, family16, a higher value obviously has no meaning. Thus, their coefficients have no meaning. But I'm not interested in these things. I don't care about what their coefficients are. I care only about BW changes over time.

For the logistic regression, I can obviously use different splits. One problem with my split is that scores of 5-7 are not "low" but medium or high. So, perhaps, I could try 0-4 vs 8-10 or 7-10. If I get similar results, perhaps it would mean that such dichotomization has not distorted the construct of interest. But what if there is a difference? There are so many possible dichotomizations that I don't know if it's relevant anymore. I will see if I can reach some agreement with MacCallum.

Emil O. W. Kirkegaard

Wed 08 Oct 2014 22:02

Admin

I don't think it even works as a control. You need to divide them up into dichotomous dummy variables to control for them in MR I think.
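A minimal sketch of what dividing a nominal variable into dummy variables means, in Python. The five codes are hypothetical family16-style values and `make_dummies` is a made-up helper. In Stata, the `i.` factor-variable prefix (e.g. `i.family16`) builds such indicators automatically.

```python
def make_dummies(codes, reference):
    """Turn nominal codes into 0/1 indicator columns, omitting the reference category."""
    levels = sorted(set(codes) - {reference})
    return {level: [int(code == level) for code in codes] for level in levels}

# Hypothetical family16-style codes for five respondents.
family_codes = [1, 5, 1, 3, 0]
dummies = make_dummies(family_codes, reference=1)

for level, column in dummies.items():
    print(f"family_{level}:", column)
```

Each category other than the reference gets its own coefficient, so the result no longer depends on how the categories are numbered.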

Meng Hu

Wed 08 Oct 2014 23:35

Admin

When you control for a given variable, the other coefficients are expressed at the mean of the controlled variable. Say you control for dichotomized race (1;2). In that case, the other coefficients are expressed as if race were equal to 1.5. If race is coded 0;1, then it's 0.5 (note that this is also how it works in ANCOVA). In most cases, when you recode your variable, you'll likely get similar estimates for the other (non-recoded) coefficients. The thing that may be subject to large change is the intercept, especially if you reverse code the original variable.
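The recoding claim can be checked on a toy case. This is only a sketch in Python with hypothetical data and a small hand-written least-squares routine (`ols` is a made-up helper, not a library call). It covers only a dichotomous control in an additive model, recoded from 0/1 to 1/2 or reverse coded, and says nothing about renumbering a variable with many categories.

```python
def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Build the augmented matrix [X'X | X'y].
    aug = [[sum(r[a] * r[b] for r in rows) for b in range(k)]
           + [sum(r[a] * yi for r, yi in zip(rows, y))] for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

# Hypothetical data: x is a continuous predictor, g a dichotomous one coded 0/1.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
g = [0, 1, 0, 1, 1, 0]
y = [3.5, 7.5, 7.25, 11.75, 14.1, 12.9]

g_12 = [v + 1 for v in g]    # same variable recoded 1/2
g_rev = [1 - v for v in g]   # same variable reverse coded

b_01 = ols([x, g], y)
b_12 = ols([x, g_12], y)
b_rev = ols([x, g_rev], y)

for name, b in (("0/1", b_01), ("1/2", b_12), ("reversed", b_rev)):
    print(f"{name:>8}: intercept={b[0]:.3f}  x={b[1]:.3f}  g={b[2]:.3f}")
```

In this toy case the coefficient of x is the same under all three codings, the coefficient of g only flips sign when reverse coded, and the intercept is the only estimate that shifts.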

I have coded family16 differently (e.g., assigned 0 instead of 5, 1 instead of 2, etc.) but that didn't change the results.

Here's a try.

keep if age<70
gen weight = wtssall*oversamp
gen blackwhite2000after=1 if year>=2000 & race==1 & hispanic==1
replace blackwhite2000after=0 if year>=2000 & race==2 & hispanic==1
gen blackwhite2000before=1 if year<2000 & race==1
replace blackwhite2000before=0 if year<2000 & race==2
gen bw1 = max(blackwhite2000after,blackwhite2000before)
gen income = realinc
replace income = . if income==0
gen logincome = log(income)
replace educ = . if educ>20
replace degree = . if degree>4
replace sibs = . if sibs==-1
replace sibs = . if sibs>37
replace res16 = . if res16==0
replace res16 = . if res16>=8
replace family16 = . if family16==-1
replace family16 = . if family16==9
replace wordsum = . if wordsum<0
replace wordsum = . if wordsum>10
replace cohort = . if cohort==0
replace cohort = . if cohort==9999
recode cohort (1905/1928=1) (1929/1943=2) (1944/1953=3) (1954/1962=4) (1963/1973=5) (1974/1994=6), generate(cohort6)
replace cohort6 = . if cohort6>6
tabulate cohort6, gen(cohortdummy)
gen bwC1 = bw1*cohortdummy1
gen bwC2 = bw1*cohortdummy2
gen bwC3 = bw1*cohortdummy3
gen bwC4 = bw1*cohortdummy4
gen bwC5 = bw1*cohortdummy5
gen bwC6 = bw1*cohortdummy6
gen familyrecode = .
replace familyrecode = 0 if family16==0
replace familyrecode = 1 if family16==8
replace familyrecode = 2 if family16==7
replace familyrecode = 3 if family16==6
replace familyrecode = 4 if family16==5
replace familyrecode = 5 if family16==4
replace familyrecode = 6 if family16==3
replace familyrecode = 7 if family16==2
replace familyrecode = 8 if family16==1
gen familyrecode1 = .
replace familyrecode1 = 0 if family16==5
replace familyrecode1 = 1 if family16==2
replace familyrecode1 = 2 if family16==7
replace familyrecode1 = 3 if family16==3
replace familyrecode1 = 4 if family16==1
replace familyrecode1 = 5 if family16==0
replace familyrecode1 = 6 if family16==8
replace familyrecode1 = 7 if family16==4
replace familyrecode1 = 8 if family16==6
gen familyrecode2 = .
replace familyrecode2 = 0 if family16==0
replace familyrecode2 = 1 if family16==4
replace familyrecode2 = 2 if family16==7
replace familyrecode2 = 3 if family16==3
replace familyrecode2 = 4 if family16==6
replace familyrecode2 = 5 if family16==5
replace familyrecode2 = 6 if family16==8
replace familyrecode2 = 7 if family16==1
replace familyrecode2 = 8 if family16==2
gen reg = .
replace reg = 0 if reg16==3
replace reg = 1 if reg16==5
replace reg = 2 if reg16==7
replace reg = 3 if reg16==6
replace reg = 4 if reg16==8
replace reg = 5 if reg16==4
replace reg = 6 if reg16==0
replace reg = 7 if reg16==2
replace reg = 8 if reg16==1
replace reg = 9 if reg16==9

regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg16 res16 family16 sibs [pweight = weight], beta

Linear regression                                      Number of obs =   20226
                                                       F( 20, 20205) =  364.79
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.3117
                                                       Root MSE      =   1.697

------------------------------------------------------------------------------
             |               Robust
     wordsum |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
         bw1 |   1.095699   .1468845     7.46   0.000                 .1777428
cohortdummy2 |  -.2432754   .1667967    -1.46   0.145                -.0467582
cohortdummy3 |  -.1327666   .1610207    -0.82   0.410                -.0273688
cohortdummy4 |  -.4442166   .1630993    -2.72   0.006                  -.08921
cohortdummy5 |  -.3711671   .1662279    -2.23   0.026                -.0669138
cohortdummy6 |  -.1091146   .1879013    -0.58   0.561                -.0148669
        bwC2 |  -.0214332   .1742682    -0.12   0.902                 -.003947
        bwC3 |  -.1655899   .1658092    -1.00   0.318                -.0327039
        bwC4 |  -.1458145   .1663234    -0.88   0.381                -.0277045
        bwC5 |    -.40005   .1687139    -2.37   0.018                -.0673835
        bwC6 |  -.4935015   .1931291    -2.56   0.011                -.0609301
         age |   .0060567   .0014348     4.22   0.000                 .0408087
         sex |   .2701488   .0266859    10.12   0.000                 .0658305
   logincome |   .1710022   .0158248    10.81   0.000                 .0794625
      degree |    .175303   .0268235     6.54   0.000                   .09789
        educ |   .2571534   .0118184    21.76   0.000                 .3546749
       reg16 |  -.0073669   .0058151    -1.27   0.205                -.0088605
       res16 |    .095045   .0090247    10.53   0.000                 .0711172
    family16 |   .0025916   .0079647     0.33   0.745                 .0022032
        sibs |  -.0465609   .0046251   -10.07   0.000                -.0685357
       _cons |   -.581039   .2460116    -2.36   0.018                        .
------------------------------------------------------------------------------

regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg16 res16 familyrecode sibs [pweight = weight], beta

Linear regression                                      Number of obs =   20226
                                                       F( 20, 20205) =  365.08
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.3118
                                                       Root MSE      =   1.697

------------------------------------------------------------------------------
             |               Robust
     wordsum |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
         bw1 |   1.098148   .1468342     7.48   0.000                   .17814
cohortdummy2 |  -.2428329   .1667948    -1.46   0.145                -.0466731
cohortdummy3 |  -.1326954   .1610526    -0.82   0.410                -.0273542
cohortdummy4 |  -.4447821   .1631311    -2.73   0.006                -.0893236
cohortdummy5 |  -.3724344   .1662453    -2.24   0.025                -.0671423
cohortdummy6 |  -.1116337   .1879246    -0.59   0.552                -.0152101
        bwC2 |  -.0224281    .174277    -0.13   0.898                -.0041302
        bwC3 |  -.1664026   .1658642    -1.00   0.316                -.0328644
        bwC4 |  -.1466234   .1663764    -0.88   0.378                -.0278582
        bwC5 |  -.4007139   .1687489    -2.37   0.018                -.0674953
        bwC6 |  -.4932983   .1931479    -2.55   0.011                 -.060905
         age |   .0060158   .0014359     4.19   0.000                 .0405335
         sex |   .2700135   .0266828    10.12   0.000                 .0657975
   logincome |   .1715282   .0158549    10.82   0.000                  .079707
      degree |   .1755253   .0268157     6.55   0.000                 .0980142
        educ |   .2573497   .0118336    21.75   0.000                 .3549456
       reg16 |   -.007438   .0058213    -1.28   0.201                -.0089461
       res16 |   .0948378   .0090363    10.50   0.000                 .0709622
familyrecode |  -.0051175   .0069678    -0.73   0.463                -.0050225
        sibs |  -.0466206   .0046247   -10.08   0.000                -.0686237
       _cons |  -.5469528   .2459031    -2.22   0.026                        .
------------------------------------------------------------------------------

regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg16 res16 familyrecode1 sibs [pweight = weight], beta

Linear regression                                      Number of obs =   20226
                                                       F( 20, 20205) =  365.04
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.3118
                                                       Root MSE      =  1.6969

-------------------------------------------------------------------------------
              |               Robust
      wordsum |      Coef.   Std. Err.      t    P>|t|                     Beta
--------------+----------------------------------------------------------------
          bw1 |   1.095838   .1465374     7.48   0.000                 .1777654
 cohortdummy2 |  -.2452279   .1667372    -1.47   0.141                -.0471335
 cohortdummy3 |   -.139467   .1609478    -0.87   0.386                -.0287501
 cohortdummy4 |  -.4513169   .1630004    -2.77   0.006                -.0906359
 cohortdummy5 |  -.3816927   .1661727    -2.30   0.022                -.0688114
 cohortdummy6 |  -.1194981   .1878523    -0.64   0.525                -.0162816
         bwC2 |  -.0207212   .1742131    -0.12   0.905                -.0038159
         bwC3 |  -.1597148   .1657395    -0.96   0.335                -.0315436
         bwC4 |  -.1399076   .1662173    -0.84   0.400                -.0265822
         bwC5 |  -.3924364    .168619    -2.33   0.020                 -.066101
         bwC6 |  -.4848992    .193071    -2.51   0.012                 -.059868
          age |   .0060752   .0014345     4.24   0.000                 .0409337
          sex |   .2694222   .0266878    10.10   0.000                 .0656535
    logincome |   .1712699   .0158322    10.82   0.000                 .0795869
       degree |   .1756636   .0268296     6.55   0.000                 .0980914
         educ |   .2570987   .0118139    21.76   0.000                 .3545995
        reg16 |  -.0075259   .0058152    -1.29   0.196                -.0090518
        res16 |   .0943371   .0090397    10.44   0.000                 .0705875
familyrecode1 |  -.0137886   .0090215    -1.53   0.126                -.0102153
         sibs |  -.0466329   .0046277   -10.08   0.000                -.0686417
        _cons |  -.5246691   .2463459    -2.13   0.033                        .
-------------------------------------------------------------------------------

regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg16 res16 familyrecode2 sibs [pweight = weight], beta

Linear regression                                      Number of obs =   20226
                                                       F( 20, 20205) =  365.08
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.3118
                                                       Root MSE      =  1.6969

-------------------------------------------------------------------------------
              |               Robust
      wordsum |      Coef.   Std. Err.      t    P>|t|                     Beta
--------------+----------------------------------------------------------------
          bw1 |   1.097943   .1464629     7.50   0.000                 .1781067
 cohortdummy2 |  -.2424547    .166609    -1.46   0.146                -.0466004
 cohortdummy3 |  -.1341314   .1608393    -0.83   0.404                -.0276502
 cohortdummy4 |  -.4472196   .1629636    -2.74   0.006                -.0898131
 cohortdummy5 |  -.3764029   .1660956    -2.27   0.023                -.0678577
 cohortdummy6 |  -.1174024   .1877553    -0.63   0.532                -.0159961
         bwC2 |  -.0236267   .1741071    -0.14   0.892                -.0043509
         bwC3 |  -.1661975   .1656591    -1.00   0.316                -.0328239
         bwC4 |  -.1462968   .1662056    -0.88   0.379                -.0277961
         bwC5 |  -.4005288   .1685777    -2.38   0.018                -.0674641
         bwC6 |  -.4930299   .1929912    -2.55   0.011                -.0608718
          age |   .0059868    .001435     4.17   0.000                 .0403379
          sex |   .2694145   .0266841    10.10   0.000                 .0656516
    logincome |   .1720981   .0158374    10.87   0.000                 .0799718
       degree |   .1757301   .0268169     6.55   0.000                 .0981285
         educ |   .2576631   .0118325    21.78   0.000                 .3553779
        reg16 |  -.0076389   .0058247    -1.31   0.190                -.0091877
        res16 |   .0944661   .0090305    10.46   0.000                  .070684
familyrecode2 |  -.0113374   .0079373    -1.43   0.153                -.0094531
         sibs |   -.046764   .0046238   -10.11   0.000                -.0688347
        _cons |  -.5154476    .246452    -2.09   0.036                        .
-------------------------------------------------------------------------------

regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg res16 family16 sibs [pweight = weight], beta


Linear regression                                      Number of obs =   20226
                                                       F( 20, 20205) =  365.16
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.3128
                                                       Root MSE      =  1.6957

------------------------------------------------------------------------------
             |               Robust
     wordsum |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
         bw1 |   1.077054   .1463924     7.36   0.000                 .1747182
cohortdummy2 |  -.2437178   .1663023    -1.47   0.143                -.0468432
cohortdummy3 |  -.1299419   .1604554    -0.81   0.418                -.0267866
cohortdummy4 |  -.4435006   .1626352    -2.73   0.006                -.0890662
cohortdummy5 |  -.3727247   .1657568    -2.25   0.025                -.0671946
cohortdummy6 |  -.1158754   .1878084    -0.62   0.537                 -.015788
        bwC2 |   -.024472   .1738017    -0.14   0.888                -.0045066
        bwC3 |  -.1737067   .1652074    -1.05   0.293                 -.034307
        bwC4 |  -.1513894   .1658148    -0.91   0.361                -.0287637
        bwC5 |  -.4049843   .1681489    -2.41   0.016                -.0682146
        bwC6 |   -.491967   .1930388    -2.55   0.011                -.0607406
         age |   .0060645   .0014337     4.23   0.000                 .0408615
         sex |   .2691577   .0266842    10.09   0.000                  .065589
   logincome |   .1697286   .0157905    10.75   0.000                 .0788707
      degree |   .1759572   .0267701     6.57   0.000                 .0982553
        educ |   .2558024    .011793    21.69   0.000                 .3528116
         reg |   .0218362    .004184     5.22   0.000                 .0338935
       res16 |   .0910688   .0090408    10.07   0.000                  .068142
    family16 |   .0012812   .0079736     0.16   0.872                 .0010892
        sibs |  -.0462154   .0046184   -10.01   0.000                -.0680273
       _cons |  -.6298851   .2438412    -2.58   0.010                        .
------------------------------------------------------------------------------

regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg res16 familyrecode1 sibs [pweight = weight], beta

Linear regression                                      Number of obs =   20226
                                                       F( 20, 20205) =  365.47
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.3129
                                                       Root MSE      =  1.6956

-------------------------------------------------------------------------------
              |               Robust
      wordsum |      Coef.   Std. Err.      t    P>|t|                     Beta
--------------+----------------------------------------------------------------
          bw1 |   1.078355   .1460574     7.38   0.000                 .1749292
 cohortdummy2 |  -.2455091   .1662688    -1.48   0.140                -.0471875
 cohortdummy3 |  -.1358898   .1604158    -0.85   0.397                -.0280127
 cohortdummy4 |  -.4500099   .1625733    -2.77   0.006                -.0903734
 cohortdummy5 |  -.3824976    .165738    -2.31   0.021                -.0689565
 cohortdummy6 |  -.1258436   .1877791    -0.67   0.503                -.0171462
         bwC2 |  -.0238513   .1737711    -0.14   0.891                -.0043923
         bwC3 |  -.1685687   .1651688    -1.02   0.307                -.0332922
         bwC4 |  -.1461107   .1657415    -0.88   0.378                -.0277608
         bwC5 |  -.3982269   .1680839    -2.37   0.018                -.0670764
         bwC6 |  -.4841555   .1929902    -2.51   0.012                -.0597762
          age |   .0060783   .0014334     4.24   0.000                 .0409543
          sex |   .2684576   .0266865    10.06   0.000                 .0654184
    logincome |   .1700912   .0157964    10.77   0.000                 .0790392
       degree |   .1764089   .0267762     6.59   0.000                 .0985075
         educ |   .2557707   .0117892    21.70   0.000                 .3527678
          reg |   .0217097   .0041827     5.19   0.000                 .0336972
        res16 |   .0904257   .0090558     9.99   0.000                 .0676608
familyrecode1 |  -.0123615    .009031    -1.37   0.171                 -.009158
         sibs |  -.0462869    .004621   -10.02   0.000                -.0681324
        _cons |  -.5837063   .2442501    -2.39   0.017                        .
-------------------------------------------------------------------------------

regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg res16 familyrecode2 sibs [pweight = weight], beta

Linear regression                                      Number of obs =   20226
                                                       F( 20, 20205) =  365.46
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.3128
                                                       Root MSE      =  1.6956

-------------------------------------------------------------------------------
              |               Robust
      wordsum |      Coef.   Std. Err.      t    P>|t|                     Beta
--------------+----------------------------------------------------------------
          bw1 |   1.080223   .1460067     7.40   0.000                 .1752322
 cohortdummy2 |  -.2430345   .1661583    -1.46   0.144                -.0467119
 cohortdummy3 |  -.1310357   .1603239    -0.82   0.414                 -.027012
 cohortdummy4 |  -.4462279   .1625364    -2.75   0.006                -.0896139
 cohortdummy5 |  -.3776067   .1656619    -2.28   0.023                -.0680748
 cohortdummy6 |  -.1237979   .1877031    -0.66   0.510                -.0168675
         bwC2 |  -.0264528   .1736817    -0.15   0.879                -.0048713
         bwC3 |  -.1744731   .1651051    -1.06   0.291                -.0344583
         bwC4 |    -.15195   .1657303    -0.92   0.359                -.0288702
         bwC5 |  -.4056042   .1680526    -2.41   0.016                 -.068319
         bwC6 |  -.4915503   .1929303    -2.55   0.011                -.0606892
          age |       .006    .001434     4.18   0.000                 .0404265
          sex |   .2684602   .0266825    10.06   0.000                  .065419
    logincome |   .1708242    .015807    10.81   0.000                 .0793798
       degree |   .1764796   .0267638     6.59   0.000                  .098547
         educ |    .256259   .0118057    21.71   0.000                 .3534413
          reg |   .0217382   .0041794     5.20   0.000                 .0337414
        res16 |   .0905556   .0090466    10.01   0.000                  .067758
familyrecode2 |  -.0099752   .0079317    -1.26   0.209                -.0083173
         sibs |  -.0464019   .0046177   -10.05   0.000                -.0683017
        _cons |  -.5769559   .2440902    -2.36   0.018                        .
-------------------------------------------------------------------------------
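To read the interaction terms in the table just above as gaps, here is a minimal sketch. It assumes that cohort 1 is the omitted reference category and that each bwCk is the product of bw1 and cohortdummyk; the variable names suggest this, but the output itself does not confirm it. Under that assumption, the adjusted black-white difference for cohort k is the bw1 coefficient plus the bwCk coefficient, in raw Wordsum points. The function name `implied_gaps` is made up for illustration.

```python
# Unstandardized coefficients copied from the familyrecode2 model above.
bw1 = 1.080223
interactions = {
    2: -0.0264528,  # bwC2
    3: -0.1744731,  # bwC3
    4: -0.15195,    # bwC4
    5: -0.4056042,  # bwC5
    6: -0.4915503,  # bwC6
}


def implied_gaps(main_effect, interaction_coefs):
    """Return the adjusted group difference per cohort.

    Assumes cohort 1 is the reference, so its gap is the main effect alone,
    and every other cohort adds its own interaction coefficient.
    """
    gaps = {1: main_effect}
    for cohort, coef in sorted(interaction_coefs.items()):
        gaps[cohort] = main_effect + coef
    return gaps


gaps = implied_gaps(bw1, interactions)
for cohort, gap in gaps.items():
    print(f"cohort {cohort}: adjusted difference = {gap:.4f} Wordsum points")
```

These are point estimates only; the standard errors in the table still decide which cohort differences can be distinguished from the reference cohort.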

Emil O. W. Kirkegaard

Thu 09 Oct 2014 00:21

Admin

Say, you control for dichotomized race (1;2). In that case, the other coefficients are expressed as if race is equal to 1.5. If race is coded 0;1, then it's 0.5 (note it's also how it works in ANCOVA).

Only if the Ns of the two racial groups are equal, no?

If you look at your estimated betas, they are not the same:

reg16:
family16 .0022032 reg16 -.0088605
familyrecode -.0050225 reg16 -.0089461
familyrecode1 -.0102153 reg16 -.0090518
familyrecode2 -.0094531 reg16 -.0091877

reg:
family16 .0010892 reg .0338935
familyrecode1 reg .0336972
familyrecode2 reg -.0083173
(u forgot familyrecode + reg)

The results are all similar because these variables are probably not very important. However, they are not the same because MR is treating them as interval variables, and setting the 'mean' (also nonsense) to different values.

The overall results do not change much: the R² values are all near .31. You could correlate the betas from each model to see how similar they are. I did it for model1 x model2.

# Standardized betas, predictors in the same order in both vectors
# (model 1: family16 + reg16; model 2: familyrecode + reg16)
model1.betas <- c(.1777428, -.0467582, -.0273688, -.08921, -.0669138, -.0148669, -.003947,
                  -.0327039, -.0277045, -.0673835, -.0609301, .0408087, .0658305, .0794625,
                  .09789, .3546749, -.0088605, .0711172, .0022032, -.0685357)
model2.betas <- c(.17814, -.0466731, -.0273542, -.0893236, -.0671423, -.0152101, -.0041302,
                  -.0328644, -.0278582, -.0674953, -.060905, .0405335, .0657975, .079707,
                  .0980142, .3549456, -.0089461, .0709622, -.0050225, -.0686237)
cor(model1.betas, model2.betas)
[1] 0.9998823

So, yes, it does not seem worth bothering about.
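The same comparison can be run on the three `reg` models posted above (family16, familyrecode1, familyrecode2). The following is only a sketch: the betas are copied by hand from the Beta columns of those tables, in table order (bw1 through sibs, without the constant), and `pearson` is a hand-written helper rather than a library function.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys):
        raise ValueError("both lists must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Standardized betas from the "reg" models, same predictor order in each list.
reg_family16 = [
    .1747182, -.0468432, -.0267866, -.0890662, -.0671946, -.015788, -.0045066,
    -.034307, -.0287637, -.0682146, -.0607406, .0408615, .065589, .0788707,
    .0982553, .3528116, .0338935, .068142, .0010892, -.0680273,
]
reg_familyrecode1 = [
    .1749292, -.0471875, -.0280127, -.0903734, -.0689565, -.0171462, -.0043923,
    -.0332922, -.0277608, -.0670764, -.0597762, .0409543, .0654184, .0790392,
    .0985075, .3527678, .0336972, .0676608, -.009158, -.0681324,
]
reg_familyrecode2 = [
    .1752322, -.0467119, -.027012, -.0896139, -.0680748, -.0168675, -.0048713,
    -.0344583, -.0288702, -.068319, -.0606892, .0404265, .065419, .0793798,
    .098547, .3534413, .0337414, .067758, -.0083173, -.0683017,
]

print(pearson(reg_family16, reg_familyrecode1))
print(pearson(reg_family16, reg_familyrecode2))
print(pearson(reg_familyrecode1, reg_familyrecode2))
```

A high correlation here is driven mostly by the large, stable betas (educ, bw1), so it says little about the family variable itself, whose beta is the one that moves between recodings.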

Meng Hu

Thu 09 Oct 2014 13:53

Admin

Only if the Ns of the two racial groups are equal, no?

I don't think the N is the problem. See here for example:
http://menghublog.wordpress.com/2014/03/25/how-to-remove-the-influence-of-confoundings-on-a-continuous-variable-by-way-of-linear-regression/
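One way to check the coding question without the GSS file is a small simulation. The sketch below uses made-up data with deliberately unequal group sizes, fits the same least-squares model with the group dummy coded 0;1 and then 1;2, and compares the coefficients. Everything in it (the `ols` helper, the variable names, the simulated effect sizes) is an assumption for illustration, not part of the analysis in the paper. In ordinary least squares, adding a constant to a predictor only moves the intercept, so the slopes should agree up to rounding error whatever the group sizes are.

```python
import random


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


random.seed(0)
n = 500
# Unequal groups: roughly 30% of cases fall in group 1 (simulated, not GSS data).
group = [1 if random.random() < 0.3 else 0 for _ in range(n)]
x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 0.5 * xi + 0.8 * gi + random.gauss(0, 1) for xi, gi in zip(x, group)]

# Columns: constant, group dummy, covariate x.
b01 = ols([[1.0, gi, xi] for gi, xi in zip(group, x)], y)      # dummy coded 0;1
b12 = ols([[1.0, gi + 1, xi] for gi, xi in zip(group, x)], y)  # dummy coded 1;2

print("coded 0;1 ->", b01)
print("coded 1;2 ->", b12)
```

This only covers a linear recode of a two-category variable. Reordering the categories of a multi-category nominal variable such as family16 is not a linear transformation, which is why the betas in the posted models shift slightly.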

If you look at your estimated betas, they are not the same:

reg16:
family16 .0022032 reg16 -.0088605
familyrecode -.0050225 reg16 -.0089461
familyrecode1 -.0102153 reg16 -.0090518
familyrecode2 -.0094531 reg16 -.0091877

reg:
family16 .0010892 reg .0338935
familyrecode1 reg .0336972
familyrecode2 reg -.0083173
(u forgot familyrecode + reg)

The results are all similar because these variables are probably not very important. However, they are not the same because MR is treating them as interval variables, and setting the 'mean' (also nonsense) to different values.

I know that family16 and reg16 have very low correlations with wordsum. I chose these variables because they are the ones you had problems with. Fortunately, Stata reports seven decimal places, so the betas can be compared with good precision. They are not perfectly identical, of course, but they are similar. In the numbers you show me, only familyrecode2 differs from the other "family" variables. However, I'm not surprised. I did not say that the coefficients of family16 and reg16 would be (almost) the same; I said that the coefficients of the other variables would remain almost the same, and I see that this is true. Of course, the fact that family16 and reg16 have low correlations with wordsum helped a little bit, as you noted.

John Fuerst

Thu 09 Oct 2014 22:31

What were the gaps for each age group in the first and last survey year?
