An update on the secular narrowing of the black-white gap in the Wordsum vocabulary test (1974-2012)
Abstract
The aim of this article is to provide an update to Huang & Hauser (2001) study. I use the General Social Survey (GSS) to analyze the trend in the black-white difference in Wordsum scores by survey years and by cohorts. Tobit regression models show that the black-white difference diminishes over time by cohorts but just slightly by survey years. That is, the gap closing is mainly a cohort effect. The black-white narrowing may have ceased in the last cohorts and periods.
Keywords: IQ, black-white gap
(I initially tried to attach a .rar file containing the .doc and PDF files as well as the XLS, but apparently that option is not available to me. Can Emil or Duxide help me with this? That should prevent Google from indexing non-published versions, I believe.)
EDIT.
The article has been uploaded to OSF as well.
https://osf.io/9tgmi/
And here's the link to the xls file.
https://osf.io/2w4h9/
Just upload the project files to http://osf.io/
I cannot even open the paper because I am on holiday and my old netbook is not well equipped, so I have only read the abstract. Genetic admixture with Caucasians has been steadily increasing and has reached about 20% in the contemporary African American population, but could you find statistics with an estimate of how Euro admixture has gone up over the last century? And maybe correlate it with cognitive gap closing? This of course wouldn't rule out the cultural explanation but would show that the environmental explanation is not the only one available.
could you find statistics with an estimate of how Euro admixture has gone up over the last century? And maybe correlate it with cognitive gap closing?
In the GSS, there are two variables, racecen1 and racecen2, for the first and second race mentioned. I guess if someone answers black for racecen1 and white for racecen2, he could be biracial. The problem is that racecen1 has a modest sample size (17480) while the sample size of racecen2 is ridiculously small (1033). And that is before even excluding those who don't have wordsum data. So, in the GSS itself, it's just impossible to do that.
Of course, there may be other data, but I don't have those.
MH,
If you upload the files to OSF as I proposed, Piffer would also be able to read them. OSF features in-browser reading of PDF files and other standard files, like spreadsheets.
The code for STATA should be in an independent supplementary file, not the appendix. Copying code from PDF files does not work well.
Figure 1 shows that there is a strong ceiling effect for the wordsum variable in the white sample.
Which subset of the data are the figures based on? The text doesn't say, and the ceiling effect is not similar over time.
Can you put the figures in the text near where they are mentioned instead of having them compiled at the end? Keeping the regression tables back there seems okay.
Overall paper seems okay. I'm not familiar with tobit regression, so I can't comment on that.
In the article, figure 1 obviously shows the histogram for the sample of each racial group (but excluding people aged 70+). In the section "limitation" it is said that in the white sample, the ceiling effect diminishes over time (when looking at each category of cohort6). I attach the pictures here.
The pictures (fig. 1-3) are too large. I can only put one by page. I can put fig 1 exactly where you ask, but for figures 2-3, it's more difficult to put them exactly where I want. That did not look good. That's why I put everything at the end of the article.
For the syntax, I know that lot of people will not even look at the supplementary files, and I want to show the syntax, this can encourage others to do so.
For tobit regression, I would recommend :
The Uses of Tobit Analysis (McDonald & Moffitt 1980)
Roncek, D. W. (1992). Learning more from tobit coefficients: Extending a comparative analysis of political protest. American Sociological Review, 503-507.
Introductory Econometrics: A Modern Approach (Jeffrey M. Wooldridge 2012; pages 596-601) (you can have it in libgen)
Besides, I have detected two errors.
Concerning the variable "sibs" there are two observations for which the number of siblings (55 and 68) shows a dramatic departure from all other respondents. I decided to filter them out. As for age variable, I decided to remove (set to missing data) people aged 70 or more.
Concerns about sample representativeness have sometimes been expressed. Hence, following the recommendations of Hauser & Huang (1999) I weight the data by the variable "weight" which is the interaction of the variables "wtssall" and "oversamp", although this will not change the results.
It's not "set to missing" because I "dropped" the observations. Also, it's not Hauser & Huang (1999) because there is no 1999 paper. Only the 1996, 1998 (in Neisser's book on the Flynn effect), 2000 or the 2001. I will correct those mistakes later.
By the way, I have contacted Hauser, Huang, Lynn, Flynn, and Dickens, that is, those who have worked on the BW IQ changes over time. They responded quickly, and (almost) the same day (30 sept/1 oct). Lynn said it's interesting. Huang said that their analysis needed an update, and he thanked me for doing so (and hoped the best for my publication). Flynn said it looks fascinating, and we must keep in touch. So, no one has really commented the article. I will try to email other people whose work is close to that topic. I think it's important to have the opinions from scholars.
In the article, figure 1 obviously shows the histogram for the sample of each racial group (but excluding people aged 70+). In the section "limitation" it is said that in the white sample, the ceiling effect diminishes over time (when looking at each category of cohort6). I attach the pictures here.
I mean, which year? All years combined?
The pictures (fig. 1-3) are too large. I can only put one by page. I can put fig 1 exactly where you ask, but for figures 2-3, it's more difficult to put them exactly where I want. That did not look good. That's why I put everything at the end of the article.
Just shrink them a bit so they aren't too large. They don't need to be that large. Having figures in the end of the article is bad since it interrupts reading flow (readers have to stop up, scroll to the end and read the figures, then jump back to the page with the relevant text).
For the syntax, I know that lot of people will not even look at the supplementary files, and I want to show the syntax, this can encourage others to do so.
Most readers will not read the code in the appendix either, and most readers don't use STATA and even those who do are often not familiar with complex syntax. I maintain that code should be in a supplementary file.
For tobit regression, I would recommend :
The Uses of Tobit Analysis (McDonald & Moffitt 1980)
Roncek, D. W. (1992). Learning more from tobit coefficients: Extending a comparative analysis of political protest. American Sociological Review, 503-507.
Introductory Econometrics: A Modern Approach (Jeffrey M. Wooldridge 2012; pages 596-601) (you can have it in libgen)
Thank you. I will look into the last reference.
By the way, I have contacted Hauser, Huang, Lynn, Flynn, and Dickens, that is, those who have worked on the BW IQ changes over time. They responded quickly, and (almost) the same day (30 sept/1 oct). Lynn said it's interesting. Huang said that their analysis needed an update, and he thanked me for doing so (and hoped the best for my publication). Flynn said it looks fascinating, and we must keep in touch. So, no one has really commented the article. I will try to email other people whose work is close to that topic. I think it's important to have the opinions from scholars.
I agree. Contacting relevant scholars is a good idea.
The histograms are for all years combined. This is obvious since in the "limitation" I said the pattern looks different when we look at specific periods. I will however modify the sentence as follows "Figure 1 plots the histogram of wordsum for all years combined. It shows that there is a strong ceiling effect for the wordsum variable in the white sample.". Would it be ok ?
As for the size of the pictures, I did not ask that big, it's just how Stata has generated them, and i just saved them as png. I will try to reduce the size of the pictures without deteriorating the quality. However, I maintain it would not be possible to put the tables (notably the regressions) where you want me to put them. As you can see, their size is so large that you need (almost) an entire page. That's one of the reason why I put everything at the end. If you want, however, I will put the pictures where you think they should.
For the syntax, i will think about it, because I'm convinced it's beneficial. Of course, most won't care about it, but those who want to go into the details and are familiar with the software (unfortunately, R works badly for me, which is why I decided to use Stata instead) can see it. I have the feeling it's less likely they will examine my syntax if it's only displayed in the supplementary file. They won't bother to go here at OP forums, check the files, etc. That's what I'm afraid about.
Some more comments.
The histograms are for all years combined. This is obvious since in the "limitation" I said the pattern looks different when we look at specific periods. I will however modify the sentence as follows "Figure 1 plots the histogram of wordsum for all years combined. It shows that there is a strong ceiling effect for the wordsum variable in the white sample.". Would it be ok ?
It wasn't obvious to me. You should probably add it to the figure caption too.
As for the size of the pictures, I did not ask that big, it's just how Stata has generated them, and i just saved them as png. I will try to reduce the size of the pictures without deteriorating the quality. However, I maintain it would not be possible to put the tables (notably the regressions) where you want me to put them. As you can see, their size is so large that you need (almost) an entire page. That's one of the reason why I put everything at the end. If you want, however, I will put the pictures where you think they should.
I didn't say to move the regression tables I wrote "Keeping the regression tables back there seems okay.". You can easily resize the figures using whichever program you use to write in.
For the syntax, I will think about it, because I'm convinced it's beneficial. Of course, most won't care about it, but those who want to go into the details and are familiar with the software (unfortunately, R works badly for me, which is why I decided to use Stata instead) can see it. I have the feeling it's less likely they will examine my syntax if it's only in the supplementary file. They won't bother to go here at OP forums, check the files, etc. That's what I'm afraid about.
I think most readers, say 95%, will not examine the STATA syntax no matter where you put them. Those who want to will likely examine it no matter where you put it. You can add a link to the OSF repository in the paper, so they don't need to go to OP forums for the supplementary files.
-
I found another error:
I build four models. For model 1, I use cohort (or survey year), race, and the interaction of race with cohort (or survey year). For model 2, I add age and gender variables. For model 3, I add the log of real family income (realinc), degree (degree) and years of school completed (educ), and region of residence at age 16 (reg16). For model 3, I add the number of siblings (sibs) as well as "type of place lived at age 16" (res16) and "living with parents at age 16" (family16).
This should be model 4.
-
The variable race "bw1" has a value of 0 for blacks and 1 for whites. Since the year 2000, the GSS begins to ask whether the respondent is hispanic or not. For respondents in survey year 2000+ I have only included the respondents who declared not being hispanic (see appendix). The variable year has values going from 1972 to 2012. The variable cohort has values going from 1883 to 1994. The variable sex has the following values; male=1, female=2. The variable age has values going from 18 to 89. The variable degree has the following values; 0=lower than high school, 1=high school, 2=junior college, 3=bachelor, 4=graduate. The variable educ has values going from 0 to 20. The variable realinc has values going from 245 to 162607, and the respective numbers for log income are 5.5 and 11.99. The variable reg16 has the following values; 0=Foreign, 1=New England, 2=Middle Atlantic, 3=East North Central, 4=West North Central, 5=South Atlantic, 6=East South Atlantic, 7=West South Atlantic, 8=Mountain, 9=Pacific. The variable res16 has the following values; 1= in open country but not on a farm, 2=on a farm, 3=town lower than 50,000, 4=50,000 to 250,000, 5=in a suburb near a big city, 6=city greater than 250,000. The variable family16 has the following values; 0=other arrangement with relatives (e.g., aunt, uncle, grandparents), 1=mother & father, 2=father & stepmother, 3=mother & stepfather, 4=father, 5=mother, 6=male relative, 7=female relative, 8=male & female relatives. The variable sibs has values going from 0 to 37 (apart from two apparent outliers).
Why do you mention the range of numerical variables, and all the possible values of nominal variables?
-
It is unclear how the nominal variables are used in the regression models. Hopefully you have not used them as continuous variables, as that makes no sense at all. Reg16 (region lived) and family16 are clearly not even quasi-continuous variables. Regression on that as they were is clearly nonsense. Res16 is quasi-continuous, so regression with it is okay.
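To make the point about nominal variables concrete, here is a minimal Python sketch of dummy (indicator) coding for a variable such as reg16. It is only an illustration and not the author's Stata syntax; the function name dummy_code, the choice of reference category, and the sample codes are hypothetical.

```python
def dummy_code(values, reference, prefix="reg16"):
    """Turn a nominal variable into 0/1 indicator columns.

    The reference category gets no column of its own, so each
    coefficient is a contrast against that category.
    """
    levels = sorted(set(values) - {reference})
    return [
        {f"{prefix}_{level}": int(value == level) for level in levels}
        for value in values
    ]


# Hypothetical region codes for four respondents (1 = New England, etc.)
rows = dummy_code([1, 5, 9, 5], reference=1)
for row in rows:
    print(row)
```

Entered this way, region 9 is not treated as "nine times" region 1; each category simply gets its own intercept shift. In Stata the same effect is usually obtained with factor-variable notation such as i.reg16.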
-
which means the gap has been reduced by an half
-
stronger gain over time in the model 2
-
It is merely a vocabulary test, and its reliability is not high (0.71).
As far as I can tell, you use the internal consistency value (Cronbach's alpha, presumably). This is not optimal for adjusting for measurement error since it doesn't correct for transient error. See Hunter and Schmidt (2004, p. 99). The real reliability is probably somewhat lower.
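For readers unfamiliar with the correction being discussed, the sketch below shows the standard correction of a standardized gap for unreliability in the outcome, d_corrected = d_observed / sqrt(reliability). It is an illustration only: the observed d of 0.60 and the lower reliability of 0.65 are hypothetical values, and the function name disattenuate is not taken from the paper.

```python
import math


def disattenuate(d_observed: float, reliability: float) -> float:
    """Correct a standardized difference for measurement error in the outcome."""
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    return d_observed / math.sqrt(reliability)


d_obs = 0.60  # hypothetical observed gap, for illustration only
d_alpha = disattenuate(d_obs, 0.71)  # using the quoted internal consistency
d_lower = disattenuate(d_obs, 0.65)  # hypothetical lower "real" reliability
print(round(d_alpha, 3), round(d_lower, 3))
```

The lower the assumed reliability, the larger the corrected gap, which is why the choice between internal consistency and a reliability that also captures transient error matters.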
-
The wordsum gap has been reduced by approximately 40% or 50% but this is still coherent with their idea that the black-white IQ gap is (at least) 50% genetic and 50% environmental.
This should be "consistent" I think.
-
The supplementary files are made available at http://openpsych.net/forum/index.php.
You should link to the specific thread (http://www.openpsych.net/forum/showthread.php?tid=168). You should also link to the OSF repository with the supplementary files.
Ref
Hunter, J. E., & Schmidt, F. L. (Eds.). (2004). Methods of meta-analysis: Correcting error and bias in research findings. Sage.
Try: The most notable problem with using wordsum, in this context, is that it is not a great measure of general intelligence.
Try: Given Huang & Hauser's (1996, pp. 7-8) discussion, there is no clear way to determine if this has occurred.
Try: The affirmation that the test has become harder may be true to some extent.
Try: Whites find the wordsum harder over time, while blacks find it a little bit easier
Fragment. Try: Generally, there is some indication that the black-white gap has been under-estimated in early cohorts -- and by the same token, the magnitude of the gap narrowing.
I don't understand this and I would advise against using "dysgenic", since this implies a causal model, the discussion of which is outside the scope of the paper. Maybe just delete the sentence.
I wouldn't, in this paper, discuss this. It's not directly relevant to the topic of the paper and it unnecessarily geneticizes the discussion (thus turning off potential readers). I've made the same point regarding many of Emil's discussions: Don't conflate issues e.g., the "spatial transferability hypothesis" with certain global genetic hypotheses. Delete the whole paragraph.
I'll get back to you regarding method later.
The aim of this article is to provide an update to Huang & Hauser (2001) study
This should be: Huang & Hauser's (2001)
The wordsum correlates at 0.71 with the AGCT aptitude test, and that the wordsum has an internal consistency reliability of 0.71 for whites and 0.63 for blacks (Huang & Hauser, 2001), which is not surprising given the shortness of the test.
This should be: The wordsum correlates at 0.71 with the AGCT aptitude test, and it has an internal reliability of 0.71 for whites and 0.63 for blacks (Huang & Hauser, 2001); these reliabilities are relatively low for cognitive measures, but this is not surprising given the shortness of the test.
The usual operation
This should be: formula.
But it is clear that the d gaps in the period 1988-1993 were clearly smaller than than earlier years. Lynn has also regressed the d gaps on years. The "b" slope was -0.004, which means that over 22 years, the d gap has been reduced by 0.004*22=0.088, given that the linearity assumption holds (which was true according to Lynn). This is indeed not very large.
Try: But it is clear that the d gaps in the period 1988-1993 were smaller than in earlier years. Lynn has also regressed the d gaps on years. The "b" slope was -0.004, which means that, over 22 years, the d gap diminished by 0.004*22=0.088, given that a linearity assumption holds (which was true according to Lynn). This is indeed not very large.
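The linear-trend arithmetic in that passage can be reproduced in a couple of lines. This sketch uses only the two numbers quoted above (the slope and the 22-year span) and assumes, as the passage does, that the trend is linear.

```python
slope_per_year = -0.004  # "b" slope of the d gap regressed on survey year, as quoted
years = 22               # span of survey years, as quoted

change_in_d = slope_per_year * years
print(round(change_in_d, 3))
```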
However, their d scores differ from Lynn's only for years 1993 and 1994. But, more importantly, they faulted Lynn for not having used cohort as the variable of time trend (which can be calculated as year minus age).
Try: However, their d scores differ from Lynn's only for years 1993 and 1994. But, more importantly, they faulted Lynn for not having used cohort as the variable for the time trend (which can be calculated as year minus age).
Here is an explanation of the two concepts. With survey year, assuming age is held constant, we are asking how are the 40-year-olds in 1980 different from the 40-year-olds in 1990. The former experienced WWII, but the latter didn't. This is the period effect. With birth cohort, assuming age is held constant, we are asking how are people born in 1950 different from people born in 1960, when they were both 40 years old. The former experienced the sexual revolution in their teenage years, but the latter didn't. This is the cohort effect. The two effects may or may not be the same thing. (I must thank Satoshi Kanazawa for the tip.)
Try: Here is an explanation of the two concepts: With survey year, assuming age is held constant, we are asking, "How are the 40-year-olds in 1980 different from the 40-year-olds in 1990?". The former experienced WWII, but the latter didn't. This is the period effect. With birth cohort, assuming age is held constant, we are asking, "How are 40 year olds born in 1950 different from 40 year olds born in 1960?". The former experienced the sexual revolution in their teenage years, but the latter didn't. This is the cohort effect. The two effects may or may not be the same thing. (I must thank Satoshi Kanazawa for the tip.)
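The relation between the three time variables can be sketched as follows. The helper name birth_cohort is hypothetical; the only thing taken from the paper is that cohort is calculated as survey year minus age.

```python
def birth_cohort(survey_year: int, age: int) -> int:
    """Approximate birth year as survey year minus age at interview."""
    return survey_year - age


# Period comparison: same age, different survey years.
period_pair = (birth_cohort(1980, 40), birth_cohort(1990, 40))

# Cohort comparison: same age, different birth years, so the
# respondents are necessarily interviewed in different survey years.
survey_years_for_cohorts = (1950 + 40, 1960 + 40)

print(period_pair, survey_years_for_cohorts)
```

Because cohort is an exact function of year and age, holding age constant makes the two comparisons coincide arithmetically; they differ in which variable is used to describe the trend.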
Given their parameters of 2.641 for intercept, 3.037 for race, 0.024 for the slope of year, and -0.0176 for the interaction, we can predict the changes in the gap over time. This is done by computing the white trend with race*year interaction, 2.641+3.037+(0.024*24)-(0.0176*24)=5.8316, and the white trend without the interaction, 2.641+3.037+(0.024*24)=6.2540, which gives a difference of 0.4224
It's standard to round to the same level of significant digits when dealing with the same sets of numbers. So either 0.024 and -0.018 or 0.024? and -0.0176.
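The computation in the quoted passage can be checked directly. The sketch below uses only the four parameters and the 24-year span given there; the variable names are mine.

```python
intercept = 2.641
race = 3.037           # coefficient for race (white = 1)
year_slope = 0.024     # slope of survey year
interaction = -0.0176  # race*year interaction
t = 24                 # years elapsed, as in the quoted computation

white_with_interaction = intercept + race + year_slope * t + interaction * t
white_without_interaction = intercept + race + year_slope * t
difference = white_without_interaction - white_with_interaction

print(round(white_with_interaction, 4),
      round(white_without_interaction, 4),
      round(difference, 4))
```

The difference is simply the interaction coefficient multiplied by the number of years, 0.0176*24.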
difference (corrected for censored distribution of wordsum)
What does this mean? Also, add an article ("the") before censored.
Squared and perhaps cubed terms should have been applied to categorical variables of years and their interaction with race rather than using the continuous variable of survey year.
Why "should" they have? Was the relation non-linear? Perhaps you mean:
The authors should have checked whether using squared and perhaps cubed terms produced a better-fitting model. Doing so might have generated different results.
The finding of Huang & Hauser (2001) is interesting because it is known that the black-white IQ gap in the U.S. has not declined in the adult samples, only in the children samples (Rushton & Jensen, 2006; Dickens & Flynn, 2006).
Maybe: The finding of Huang & Hauser (2001) is interesting because it is known that the black-white IQ gap in the U.S. has not declined in adult samples but only in child and adolescent ones (Rushton & Jensen, 2006; Dickens & Flynn, 2006).
Use "the adult samples" when referring to a specific set of samples; use "adult samples" when referring to an unspecific set of samples. In this case, I think you are referring to an unspecific set. If not, you should say, e.g.:
It is known that the black-white IQ gap in the U.S. has not declined in the adult samples but only in the child samples discussed by Rushton & Jensen (2006) and Dickens & Flynn (2006).
It is possible, nonetheless, that there was a gap closing before the period analyzed by Dickens and Flynn. See Murray (2007).
Try: It is possible, nonetheless, that there was a gap closing before the period analyzed by Dickens and Flynn (see Murray, 2007).
Before deciding which method to apply, one needs to examine the distribution of the variables we will use.
Try: Before deciding which method to apply, one needs to examine the distribution of the variables one wishes to use.
An important assumption of linear regression is the normality of the data, especially the distribution of the dependent variable.
Try: An important assumption of linear regression is the normality of the data, especially in context to the distribution of the dependent variable.
The right procedure should be to use a tobit regression (for an introduction, see, McDonald & Moffitt, 1980).
Try: The right procedure should be to use a tobit regression (for an introduction, see McDonald & Moffitt, 1980).
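For readers who, like one of the reviewers, are not familiar with tobit regression, the sketch below shows the log-likelihood that a tobit model maximizes when the outcome is censored from above. It is a minimal illustration and not the author's Stata syntax: the function names are hypothetical, and treating only the upper limit of 10 as a censoring point is an assumption made here to match the ceiling effect discussed in the thread.

```python
import math


def norm_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_loglik(y, xb, sigma, upper=10.0):
    """Log-likelihood of a tobit model censored from above at `upper`.

    y     : observed scores
    xb    : linear predictions for the latent (uncensored) score
    sigma : standard deviation of the latent error term
    """
    total = 0.0
    for y_i, mu_i in zip(y, xb):
        if y_i >= upper:
            # Censored case: probability that the latent score is at or above the limit.
            total += math.log(1.0 - norm_cdf((upper - mu_i) / sigma))
        else:
            # Uncensored case: ordinary normal density of the residual.
            total += math.log(norm_pdf((y_i - mu_i) / sigma) / sigma)
    return total


# Two hypothetical respondents: one below the ceiling, one at it.
print(tobit_loglik([5.0, 10.0], [5.0, 10.0], sigma=1.0))
```

Observations at the ceiling contribute a probability rather than a density, which is what distinguishes the tobit likelihood from ordinary least squares.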
Since the year 2000, the GSS begins to ask whether
Try: Since the year 2000, the GSS began to ask whether
For respondents in survey year 2000+ I have only included the respondents who declared not being hispanic (see appendix).
Use a comma.
The variable cohort has values going from 1883 to 1994. The variable sex has the following values; male=1, female=2. The variable age has values going from 18 to 89. The variable degree has the following values; 0=lower than high school, 1=high school, 2=junior college, 3=bachelor, 4=graduate. The variable educ has values going from 0 to 20. The variable realinc has values going from 245 to 162607, and the respective numbers for log income are 5.5 and 11.99. The variable reg16 has the
For clarity place the variable names in quotes.
The variable "cohort" has values going from 1883 to 1994. The variable "sex" has the following values; male=1, female=2. The variable "age"....
According to the GSS codebook, the "white" category in variable "race" (before the year 2000) includes mexicans, spaniards and puerto ricans "who appear to be white".
Capitalize e.g., Mexican.
As for age variable, I decided to remove (set to missing data) people aged 70 or more
Try: As for the age variable...
Hence, following the recommendations of Hauser & Huang (1999) I weight the data by the variable "weight" which is the interaction of the variables "wtssall" and "oversamp", although this will not change the results.
I would use a comma.
The black-white raw score gap in cohort1 was 2.023 items correct and has become 1.001 item correct in cohort6, which means the gap has been reduced by an half, while the gap was 1.638 items correct in year1 and has become 1.333
This is because the more recent cohorts are younger, and the wordsum correlates positively with age (r=0.1005). In models 3 and 4, the scores among whites have a declining trend.
Who ever reported correlations to the ten-thousandth place? Also try:
This is because the more recent cohorts are younger, and wordsum correlates positively with age (r=0.1005). In models 3 and 4, the scores among whites have a declining trend.
This is still 50% reduction
Try: This is still a 50% reduction
A subsequent analysis is done by computing the d gap (see supplementary file) within each of the category of the dummy variables.
Categories.
I split the variable wordsum into two parts
Try: I split the variable "wordsum" into two parts
Another way to investigate whether or not the improvement occurs at high levels is to conduct logistic regression with wordsum as dependent binary variable (score levels 0-7 coded 0 and score levels 8-10 coded 1)
Try: Another way to investigate whether or not the improvement occurs at high levels is to conduct logistic regression with wordsum as the dependent binary variable (score levels 0-7 coded as 0 and score levels 8-10 coded as 1)
The most notable problem with the wordsum is not to be a measure of general intelligence
Try: The most notable problem with using wordsum, in this context, is that it is not a great measure of general intelligence.
Given Huang & Hauser's (1996, pp. 7-8) discussion, there is no clear answer to this question
Try: Given Huang & Hauser's (1996, pp. 7-8) discussion, there is no clear way to determine if this has occurred.
The affirmation that the test has become harder may be true. To some extent
Try: The affirmation that the test has become harder may be true to some extent.
whites find the wordsum harder over time while the blacks would find it a little bit easier
Try: whites find the wordsum harder over time while the blacks find it a little bit easier
Generally, there is some indication that the black-white gap has been under-estimated in early cohorts. And by the same token, the magnitude of the gap narrowing.
Fragment. Try: Generally, there is some indication that the black-white gap has been under-estimated in early cohorts -- and by the same token, the magnitude of the gap narrowing.
But at the same time, the white trend could have been even flatter or turned out to be somewhat dysgenic.
I don't understand this and I would advise against using "dysgenic", since this implies a causal model, the discussion of which is outside the scope of the paper. Maybe just delete the sentence.
Granted the limitation of the wordsum test, one may wonder what is the consequence of the black-white gap decline for the genetic hypothesis proposed by Rushton & Jensen (2010). ...
I wouldn't, in this paper, discuss this. It's not directly relevant to the topic of the paper and it unnecessarily geneticizes the discussion (thus turning off potential readers). I've made the same point regarding many of Emil's discussions: Don't conflate issues e.g., the "spatial transferability hypothesis" with certain global genetic hypotheses. Delete the whole paragraph.
I'll get back to you regarding method later.
[Edits made]
I thought over the statistical method; I'm fine with it. I would like to know, though, why the survey year and birth cohort methods produce such divergent results. There must be an age x survey year interaction. Could you check for this? I get what's happening as I've looked at the results prior. Basically, in 1975 older (50-75) African Americans perform much worse than mid age and younger (18-50) ones. During later years, the older age gap narrows. Now, one might take this as indicating a (cross age) cohort narrowing, yet another interpretation would be that it represents an older age narrowing, i.e., there is less of a difference between older people in 2000 than in 1975. To determine which, you would need data from same-age people in, e.g., 1925 and 2000, which you don't have (for the GSS).
Over at HV, I commented on a similar (in methodology) analysis:
"Reardon’s analysis, of course, is deeply flawed by his failure to take into account both age effects and test content effects in addition to his dubious method of deriving early comparison points. As for the latter, he, for example, derives his early points, from the 1940s, from Charles Murray’s analysis of the 1976, 1986, and 1996 Woodcock–Johnson I to III standardizations. Of course, these samples were from the 70s, 80s, and 90s. To derive magnitudes of differences from the 40s, he projects back in time based on Murray’s birth cohort analysis. These differences, based on Full scale IQ — e.g., between 70 year old Blacks and Whites in the 90s who would have been 20 or so in the 40s — are then compared with the average Math and Reading differences between 5 to 7 year olds from the Early Childhood Longitudinal Study in the late 1990s (a study which showed a large effect of age on the magnitude of the math and reading gap — see: sample 48 — and also a large general knowledge gap at very young ages — see III, Chuck (2012c). His analysis, then, is confounded by the three problems and their interactions: (1) His method of deriving early points. (2) His comparison across measures. (3) And his comparison across ages."
For your cohort analysis, you are looking at, e.g., age 50 differences in 1975 and age 25 differences in 2000, and finding a large change. But it's not obvious that this is fully a cohort change, in the sense of people aged 18 through 65 in 1925 versus people aged 18 through 65 in 2000. It could instead be an age x survey year interaction, in the sense of older people (but less so younger people) in 1975 versus older people in 2000, which is not the same as, e.g., younger people in 1925 versus 1975. Anyway, I think that you should make a note concerning this issue. Generally, it's not clear that your "cohort analysis" is better than the survey year analysis at determining the true cross-age cohort effect.
(The proper interpretation should be, "There is a much larger older age gap in 1975 versus 2000" as opposed to, "There was a larger 1925 to 2000 cohort narrowing".)
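The point above rests on the identity cohort = survey year - age: within the GSS window, a cohort contrast is also an age contrast unless the same ages are observed for both cohorts. Here is a minimal Python sketch of that bookkeeping (not the Stata code used for the paper; the function names are made up for illustration, and 1974 is taken as the first survey year from the article's title):

```python
def birth_cohort(survey_year, age):
    """Birth year implied by survey year and age (ignores birthday timing)."""
    return survey_year - age


def survey_year_needed(cohort, age):
    """Survey year in which a given birth cohort is observed at a given age."""
    return cohort + age


# The two groups mentioned above: age 50 in 1975 and age 25 in 2000.
older_1975 = birth_cohort(1975, 50)
younger_2000 = birth_cohort(2000, 25)

# To compare both cohorts at the same age (say 25), these survey years are needed.
years_needed = [survey_year_needed(c, 25) for c in (older_1975, younger_2000)]

# Assumption: the GSS data analyzed here start in 1974.
gss_start = 1974
available = [year >= gss_start for year in years_needed]
```

Under these assumptions the 1925 cohort would have to be observed at age 25 in a 1950 survey, which does not exist in the GSS, so the cohort contrast cannot be separated from the age contrast with these data alone.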
I have made a lot of changes, but I will upload later.
One important change is the removal (at least temporarily) of my logistic regression analysis. I know that MacCallum et al. (2002) have already examined the practice of dichotomizing a continuous variable. They say it is problematic because it lowers the reliability of the variable and can alter its interpretation. An illustration may help. Imagine you have 4 people with different levels of fear of spiders: A (100%), B (60%), C (40%), D (0%). You dichotomize the variable at the mean or median, so that A and B have a value of 1 (fear) while C and D have a value of 0 (no fear), and yet B and C are more alike than either A and B or C and D. This labeling is totally arbitrary and not justified. However, these authors applied this criticism to correlational, ANOVA and regression analyses. I was using logistic regression, which attempts to estimate the likelihood of having a value of 1 versus 0. But now that I think about it, I'm not so sure about its robustness. One can still argue that my labeling (0-7 vs 8-10) is arbitrary, but at the same time, the categories of my dummy variables must also be arbitrary. So, I will email MacCallum and ask him what he thinks about it. Of course, I know a lot of people in recent papers have conducted such dichotomization for logistic regression, but none of them have cited MacCallum et al. In light of this, I have replaced this analysis with another: I computed the d gap of wordlow and wordhigh by dividing the black-white difference by the SD given in Table 4.
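The spider example can be written out in a few lines of Python. This is only a sketch of the illustration in the paragraph above, with the four hypothetical people and their made-up percentages; it is not part of the paper's analysis.

```python
# Hypothetical fear-of-spiders scores (percent) for four people.
fear = {"A": 100, "B": 60, "C": 40, "D": 0}

# Dichotomize at the mean (which equals the median here).
cut = sum(fear.values()) / len(fear)
binary = {person: int(score > cut) for person, score in fear.items()}


def distance(x, y, scores):
    """Absolute difference between two people on a given scoring."""
    return abs(scores[x] - scores[y])


# On the original scale, B and C are the closest pair...
original = {pair: distance(pair[0], pair[1], fear) for pair in ("AB", "BC", "CD")}
# ...but after the split they are the only pair that differs.
after_split = {pair: distance(pair[0], pair[1], binary) for pair in ("AB", "BC", "CD")}
```

The split turns the smallest original difference (B vs C) into the only difference that remains, which is the loss of information described above.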
I have also added an explanation of the tobit coefficient, just in case someone would like to ask me to write it.
Now, the comments.
Emil :
Why do you mention the range of numerical variables, and all the possible values of nominal variables?
For the age variable, you don't need to guess what a range of 18-69 means. But for region, you don't know what values are assigned to each region.
It is unclear how the nominal variables are used in the regression models. Hopefully you have not used them as continuous variables, as that makes no sense at all. Reg16 (region lived in) and family16 are clearly not even quasi-continuous variables. Regression on them as if they were is clearly nonsense. Res16 is quasi-continuous, so regression with it is okay.
Generally, I read that people accept the idea that a variable can be thought of as continuous when it has at least 5 values. The variables you mentioned have more than 5 values.
As for the link to the thread, I couldn't add it (but now I can), because when I wrote the article, I had not yet created this thread. But of course I will link to the OSF later.
Chuck :
I have made all the modifications you indicated. However, concerning the number of digits after the decimal point, they differ because that is how they are presented in Huang & Hauser (0.024 for cohort and -0.0176 for race*cohort). In the case of an interaction variable, in my experience, even a small coefficient can quite often have a meaningful effect, so I find it justified to add one more digit for this variable. Another case where I think it's justified to add more digits after the decimal point (for unstandardized coefficients, but not standardized ones) is when the variable can take on a large number of values, such as age (18-69). Concerning the correlation of wordsum with age (0.1005), that is how Stata displayed the result. If you insist, I can round it to 0.10. Also, I have removed the word "dysgenic" and replaced it with "negative".
I wouldn't, in this paper, discuss this. It's not directly relevant to the topic of the paper and it unnecessarily geneticizes the discussion (thus turning off potential readers).
This would be unfortunate, I think. When someone discusses the black-white gap (IQ or achievement), he or she always attempts to understand its causes. If I don't attempt to explain the meaning of the gap narrowing in verbal IQ, I don't see the point of such an analysis. For example, Huang & Hauser (2001) don't buy the hereditarian argument, and they show that the gap narrowing is due to gains in SES over time (although I think their analyses don't really prove it). But most people do not attempt to weigh the hypotheses. It's really of no use to test the environmental hypothesis if you don't think about the predictions that the hereditarian hypothesis can make. I think most researchers should stop falling into traps like confirmation bias. It's easy to do. I don't remember how many times I have read "environmental variables explain the gap, we have proved this hypothesis to be true". If the BW gap has narrowed (and even if it hasn't), I need to discuss the consequences. During the period studied, blacks have certainly improved in social status, probably more than whites did, so I need to talk about the relevance of the environmental and genetic hypotheses, even if people don't like it. Of course, I can delete the word genetic and replace it with hereditarian. That is a less provocative term, but the idea is still the same.
I get what's happening as I've looked at the results prior. Basically, in 1975 older (50-75) African Americans perform much worse than mid age and younger (18-50) ones. During later years, the older age gap narrows. Now, one might take this as indicating a (cross age) cohort narrowing, yet another interpretation would be that it represents an older age narrowing, i.e., there is less of a difference between older people in 2000 than in 1975.
Tell me if I'm right. You want me to conduct a tobit regression with cohort, race, age, cohort*race, cohort*age variables ? And you say that you suspect the cohort*age effect to become stronger in later cohorts ?
The syntax looks something like this :
gen ageC1 = age*cohortdummy1
gen ageC2 = age*cohortdummy2
gen ageC3 = age*cohortdummy3
gen ageC4 = age*cohortdummy4
gen ageC5 = age*cohortdummy5
gen ageC6 = age*cohortdummy6
tobit wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex ageC2 ageC3 ageC4 ageC5 ageC6 [pweight = weight], ll(0) ul(10)
Tobit regression                                  Number of obs   =      22156
                                                  F( 18, 22138)   =      92.49
                                                  Prob > F        =     0.0000
Log pseudolikelihood = -47838.71                  Pseudo R2       =     0.0176
------------------------------------------------------------------------------
             |               Robust
     wordsum |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         bw1 |   2.023368   .1559898    12.97   0.000     1.717617     2.32912
cohortdummy2 |  -.5709338    .564835    -1.01   0.312    -1.678051     .536183
cohortdummy3 |  -.6702743   .5388143    -1.24   0.214    -1.726389    .3858401
cohortdummy4 |  -1.338139   .5381658    -2.49   0.013    -2.392982   -.2832961
cohortdummy5 |  -1.531758   .5444232    -2.81   0.005    -2.598866     -.46465
cohortdummy6 |  -1.503185   .5799227    -2.59   0.010    -2.639874   -.3664949
        bwC2 |  -.2301317   .1922563    -1.20   0.231    -.6069678    .1467043
        bwC3 |  -.5600697   .1797855    -3.12   0.002     -.912462   -.2076774
        bwC4 |  -.6036657   .1770871    -3.41   0.001    -.9507689   -.2565624
        bwC5 |  -1.006898   .1803751    -5.58   0.000    -1.360446   -.6533498
        bwC6 |  -1.004008   .1995938    -5.03   0.000    -1.395226   -.6127898
         age |  -.0131813   .0082283    -1.60   0.109    -.0293094    .0029467
         sex |   .1739637   .0317265     5.48   0.000     .1117776    .2361498
       ageC2 |   .0157154    .009103     1.73   0.084    -.0021271     .033558
       ageC3 |   .0311774   .0087378     3.57   0.000     .0140508    .0483041
       ageC4 |    .042542   .0088985     4.78   0.000     .0251004    .0599837
       ageC5 |   .0591789   .0094609     6.26   0.000     .0406349    .0777228
       ageC6 |   .0695841   .0124641     5.58   0.000     .0451535    .0940146
       _cons |    4.84196   .5199612     9.31   0.000     3.822799     5.86112
-------------+----------------------------------------------------------------
      /sigma |   2.107472   .0135601                      2.080893    2.134051
------------------------------------------------------------------------------
  Obs. summary:        140  left-censored observations at wordsum<=0
                     20698     uncensored observations
                      1318 right-censored observations at wordsum>=10
tobit wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 age sex ageC2 ageC3 ageC4 ageC5 ageC6 [pweight = weight], ll(0) ul(10)
Tobit regression                                  Number of obs   =      22156
                                                  F( 13, 22143)   =     123.37
                                                  Prob > F        =     0.0000
Log pseudolikelihood = -47868.274                 Pseudo R2       =     0.0170
------------------------------------------------------------------------------
             |               Robust
     wordsum |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         bw1 |   1.414415     .04254    33.25   0.000     1.331034    1.497797
cohortdummy2 |  -.7754692   .5424091    -1.43   0.153     -1.83863    .2876911
cohortdummy3 |  -1.182844    .519948    -2.27   0.023    -2.201979   -.1637094
cohortdummy4 |  -1.889955   .5191508    -3.64   0.000    -2.907528   -.8723828
cohortdummy5 |  -2.448969   .5250497    -4.66   0.000    -3.478104   -1.419835
cohortdummy6 |  -2.410204   .5602023    -4.30   0.000     -3.50824   -1.312167
         age |  -.0131711   .0082734    -1.59   0.111    -.0293876    .0030455
         sex |   .1779512    .031749     5.60   0.000      .115721    .2401814
       ageC2 |   .0153997   .0091492     1.68   0.092    -.0025334    .0333328
       ageC3 |   .0311452   .0087811     3.55   0.000     .0139337    .0483567
       ageC4 |   .0425241   .0089401     4.76   0.000     .0250008    .0600474
       ageC5 |   .0600193   .0095079     6.31   0.000     .0413832    .0786553
       ageC6 |   .0708083   .0125489     5.64   0.000     .0462115    .0954051
       _cons |   5.392481   .5070554    10.63   0.000     4.398616    6.386345
-------------+----------------------------------------------------------------
      /sigma |   2.110143   .0135794                      2.083527     2.13676
------------------------------------------------------------------------------
  Obs. summary:        140  left-censored observations at wordsum<=0
                     20698     uncensored observations
                      1318 right-censored observations at wordsum>=10
tobit wordsum cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 age sex ageC2 ageC3 ageC4 ageC5 ageC6 [pweight = weight], ll(0) ul(10)
Tobit regression                                  Number of obs   =      23817
                                                  F( 12, 23805)   =      40.13
                                                  Prob > F        =     0.0000
Log pseudolikelihood = -52876.4                   Pseudo R2       =     0.0054
------------------------------------------------------------------------------
             |               Robust
     wordsum |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
cohortdummy2 |  -.6895328   .5584058    -1.23   0.217    -1.784044    .4049781
cohortdummy3 |  -1.159271   .5356257    -2.16   0.030    -2.209131   -.1094102
cohortdummy4 |   -1.81035    .534596    -3.39   0.001    -2.858192   -.7625077
cohortdummy5 |  -2.206941    .538878    -4.10   0.000    -3.263176   -1.150705
cohortdummy6 |  -2.340127   .5706149    -4.10   0.000    -3.458569   -1.221686
         age |  -.0126966    .008511    -1.49   0.136    -.0293787    .0039854
         sex |   .1219685    .031794     3.84   0.000     .0596502    .1842868
       ageC2 |   .0124503   .0094042     1.32   0.186    -.0059825    .0308831
       ageC3 |   .0280788   .0090236     3.11   0.002     .0103921    .0457656
       ageC4 |   .0368479   .0091736     4.02   0.000     .0188671    .0548287
       ageC5 |   .0455797   .0096493     4.72   0.000     .0266664    .0644929
       ageC6 |   .0533736   .0123475     4.32   0.000     .0291717    .0775754
       _cons |   6.736302   .5206905    12.94   0.000     5.715715    7.756888
-------------+----------------------------------------------------------------
      /sigma |   2.197996   .0134352                      2.171662    2.224329
------------------------------------------------------------------------------
  Obs. summary:        184  left-censored observations at wordsum<=0
                     22282     uncensored observations
                      1351 right-censored observations at wordsum>=10
So, how do we interpret this outcome? It seems to me that the age gap becomes larger in later cohorts, because the age*cohort coefficients are positive and become stronger over time. When you control for the age*cohort interaction, the cohort effect is negative. That is, the wordsum score for the entire group diminishes over time. There is still a meaningful black-white narrowing.
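To make that reading concrete, here is a small Python sketch that adds each interaction coefficient to its main effect, using the numbers copied from the first tobit output above. It assumes bw1 is coded 1 for whites and 0 for blacks, so that bw1 + bwCk is the white-black difference in cohort k on the latent tobit scale, and that age + ageCk is the age slope in cohort k. The variable names in the sketch are mine.

```python
# Coefficients copied from the first tobit model (cohort 1 is the reference).
bw1 = 2.023368
bw_by_cohort = {2: -0.2301317, 3: -0.5600697, 4: -0.6036657,
                5: -1.006898, 6: -1.004008}
age = -0.0131813
age_by_cohort = {2: 0.0157154, 3: 0.0311774, 4: 0.042542,
                 5: 0.0591789, 6: 0.0695841}

# Black-white gap within each cohort: main effect plus interaction.
gap = {1: bw1}
gap.update({k: bw1 + b for k, b in bw_by_cohort.items()})

# Age slope within each cohort: main effect plus interaction.
age_slope = {1: age}
age_slope.update({k: age + b for k, b in age_by_cohort.items()})

# Share of the cohort-1 gap that remains in cohort 6.
remaining_share = gap[6] / gap[1]
```

With these inputs the gap goes from about 2.02 in cohort 1 to about 1.02 in cohort 6, and the age slope rises from slightly negative in cohort 1 to about 0.056 per year in cohort 6. Note that, because age is not centered, the cohortdummy coefficients in that model describe cohort differences at age 0, which is outside the observed age range.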
One important change is the removal (at least temporarily) of my logistic regression analysis. I know that MacCallum et al. (2002) have already examined the practice of dichotomizing a continuous variable. They say it is problematic because it lowers the reliability of the variable and can alter its interpretation. An illustration may help. Imagine you have 4 people with different levels of fear of spiders: A (100%), B (60%), C (40%), D (0%). You dichotomize the variable at the mean or median, so that A and B have a value of 1 (fear) while C and D have a value of 0 (no fear), and yet B and C are more alike than either A and B or C and D. This labeling is totally arbitrary and not justified. However, these authors applied this criticism to correlational, ANOVA and regression analyses. I was using logistic regression, which attempts to estimate the likelihood of having a value of 1 versus 0. But now that I think about it, I'm not so sure about its robustness. One can still argue that my labeling (0-7 vs 8-10) is arbitrary, but at the same time, the categories of my dummy variables must also be arbitrary. So, I will email MacCallum and ask him what he thinks about it. Of course, I know a lot of people in recent papers have conducted such dichotomization for logistic regression, but none of them have cited MacCallum et al. In light of this, I have replaced this analysis with another: I computed the d gap of wordlow and wordhigh by dividing the black-white difference by the SD given in Table 4.
You could explore the effect of dichotomizing it in different places. You used 0-7 vs. 8-10. You could try 0-6 vs. 7-10, 0-5 vs. 6-10 (the most 'natural' since it is split evenly along the scale), and 0-8 vs. 9-10.
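A sensitivity check of this kind could be set up as below. This is a hedged Python sketch of the recoding step only (the logistic regressions themselves would still be run in Stata); the function name and the example scores are made up for illustration.

```python
def dichotomize(score, low_max):
    """Code a wordsum score as 0 if it is in 0..low_max and 1 if above."""
    if not 0 <= score <= 10:
        raise ValueError("wordsum ranges from 0 to 10")
    return int(score > low_max)


# low_max = 7 reproduces the 0-7 vs 8-10 split used so far;
# 5, 6 and 8 correspond to the alternative splits suggested above.
cutpoints = [5, 6, 7, 8]

# Made-up example scores, only to show how the coding shifts with the cutpoint.
scores = [0, 4, 6, 7, 8, 10]
coded = {cut: [dichotomize(s, cut) for s in scores] for cut in cutpoints}
```

If the black-white interaction pattern is similar across all four codings, the result is less likely to be an artifact of one particular cutpoint.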
Generally, I read that people accept the idea that a variable can be thought of as continuous when it has at least 5 values. The variables you mentioned have more than 5 values.
Look at the variable for region. It is:
The variable reg16 has the following values; 0=Foreign, 1=New England, 2=Middle Atlantic, 3=East North Central, 4=West North Central, 5=South Atlantic, 6=East South Central, 7=West South Central, 8=Mountain, 9=Pacific.
What does it mean to be higher in this variable? What does it mean to be lower? Nothing. These are not places along a scale of something. It is a nominal variable. Using it as a continuous variable is nonsense.
The variable family16 has the following values; 0=other arrangement with relatives (e.g., aunt, uncle, grandparents), 1=mother & father, 2=father & stepmother, 3=mother & stepfather, 4=father, 5=mother, 6=male relative, 7=female relative, 8=male & female relatives.
Same for this. There is no answer to the question "What does it mean to be higher in family16?". It is because it is a nominal variable.
Finally,
The variable res16 has the following values; 1= in open country but not on a farm, 2=on a farm, 3=town lower than 50,000, 4=50,000 to 250,000, 5=in a suburb near a big city, 6=city greater than 250,000.
Is fine because: 1) there are 5 or more possible values, 2) there is a sensible answer to the question "What does it mean to be higher in res16?" The answer is that the higher one is in res16, the more people live around oneself. Or conversely, the lower, the fewer people live around oneself. Or one could answer it with density of people in the area, etc. There are sensible answers. Variables like this one are called "quasi-continuous" (or "quasi-interval") because they are not quite continuous (where every real number between the min and max value is possible), but they are sort of continuous because there are a number of possible values in between AND because interpreting them as a scale is sensible.
Scales of measurement are usually discussed in the beginning of introductory statistics books. There is one in this book, section 2.2: http://health.adelaide.edu.au/psychology/ccs/teaching/lsr/
Emil,
When you control for any variable, two things happen. One is that you adjust for differences between subgroups. For example, one subgroup (father) can have a lower mean IQ than another subgroup (mother). If you adjust for it, then whatever the order of the values, the interpretation of the other coefficients won't change. However, and this is the second thing, in the variables you mentioned, e.g., res16, reg16, family16, a higher value obviously has no meaning. Thus, their coefficients have no meaning. But I'm not interested in these things. I don't care about what their coefficients are. I care only about BW changes over time.
For the logistic regression, I can obviously use different splits. One problem with my split is that scores of 5-7 are not "low" but medium or high. So, perhaps, I could try 0-4 vs 8-10 or 7-10. If I get similar results, perhaps it would mean that such dichotomization has not distorted the construct of interest. But what if there is a difference? There are so many possible dichotomizations that I don't know if it's relevant anymore. I will see if I can reach some agreement with MacCallum.
I don't think it even works as a control. You need to divide them up into dichotomous dummy variables to control for them in MR I think.
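For reference, dividing a nominal variable into dummy variables looks roughly like this. It is a minimal Python sketch, not the Stata code from the paper: the function name and the six example respondents are made up, and the relabeling map follows the reversed coding tried in this thread (1 becomes 8, 5 becomes 4, 3 becomes 6, 0 stays 0).

```python
def dummy_code(values, reference):
    """Return one 0/1 indicator list per category, omitting the reference category."""
    categories = sorted(set(values) - {reference})
    return {c: [int(v == c) for v in values] for c in categories}


# Made-up family16 codes for six respondents (1 = mother & father, 5 = mother, ...).
family16 = [1, 5, 1, 3, 5, 0]
dummies = dummy_code(family16, reference=1)

# Relabel the same respondents with a different, equally arbitrary numbering.
relabel = {0: 0, 1: 8, 3: 6, 5: 4}
family_recoded = [relabel[v] for v in family16]
dummies_recoded = dummy_code(family_recoded, reference=relabel[1])

# The indicator columns are identical; only their labels changed.
same_columns = sorted(dummies.values()) == sorted(dummies_recoded.values())
```

Because the indicator columns do not depend on which number is attached to which category, a model with dummies gives the same fit under any relabeling, which is not guaranteed when the raw codes are entered as a single numeric control.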
When you control for a given variable, the other coefficients are expressed at the mean of the controlled variable. Say you control for dichotomized race (1;2). In that case, the other coefficients are expressed as if race were equal to 1.5. If race is coded 0;1, then it's 0.5 (note that this is also how it works in ANCOVA). In most cases, when you recode your variable, you'll likely get similar estimates for the other (non-recoded) coefficients. What may change a lot is the intercept, especially if you reverse-code the original variable.
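The last point (recoding mainly moves the intercept) can be checked on a toy case. The Python sketch below fits a one-predictor least-squares line to made-up data with a binary variable coded 0;1 and then 1;2. It only illustrates the intercept shift for a dichotomous variable; it says nothing about entering a nominal variable with more than two categories as a single numeric control.

```python
def simple_ols(x, y):
    """Intercept and slope of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up outcome values for two groups of three people each.
y = [4.0, 5.0, 6.0, 6.0, 7.0, 8.0]
group_01 = [0, 0, 0, 1, 1, 1]          # coded 0;1
group_12 = [g + 1 for g in group_01]   # same groups coded 1;2

b0_01, b1_01 = simple_ols(group_01, y)
b0_12, b1_12 = simple_ols(group_12, y)
```

The slope (the group difference) is the same under both codings, while the intercept drops by exactly one slope when every code is shifted up by 1.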
I have coded family16 differently (e.g., assigned 0 instead of 5, 1 instead of 2, etc.) but that didn't change the results.
Here's a try.
keep if age<70
gen weight = wtssall*oversamp
gen blackwhite2000after=1 if year>=2000 & race==1 & hispanic==1
replace blackwhite2000after=0 if year>=2000 & race==2 & hispanic==1
gen blackwhite2000before=1 if year<2000 & race==1
replace blackwhite2000before=0 if year<2000 & race==2
gen bw1 = max(blackwhite2000after,blackwhite2000before)
gen income = realinc
replace income = . if income==0
gen logincome = log(income)
replace educ = . if educ>20
replace degree = . if degree>4
replace sibs = . if sibs==-1
replace sibs = . if sibs>37
replace res16 = . if res16==0
replace res16 = . if res16>=8
replace family16 = . if family16==-1
replace family16 = . if family16==9
replace wordsum = . if wordsum<0
replace wordsum = . if wordsum>10
replace cohort = . if cohort==0
replace cohort = . if cohort==9999
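* Collapse birth years into six cohorts; years outside the ranges keep their original value, so set them to missing
recode cohort (1905/1928=1) (1929/1943=2) (1944/1953=3) (1954/1962=4) (1963/1973=5) (1974/1994=6), generate(cohort6)
replace cohort6 = . if cohort6>6
* Cohort indicators (cohortdummy1-cohortdummy6) and their interactions with bw1
tabulate cohort6, gen(cohortdummy)
gen bwC1 = bw1*cohortdummy1
gen bwC2 = bw1*cohortdummy2
gen bwC3 = bw1*cohortdummy3
gen bwC4 = bw1*cohortdummy4
gen bwC5 = bw1*cohortdummy5
gen bwC6 = bw1*cohortdummy6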
gen familyrecode = .
replace familyrecode = 0 if family16==0
replace familyrecode = 1 if family16==8
replace familyrecode = 2 if family16==7
replace familyrecode = 3 if family16==6
replace familyrecode = 4 if family16==5
replace familyrecode = 5 if family16==4
replace familyrecode = 6 if family16==3
replace familyrecode = 7 if family16==2
replace familyrecode = 8 if family16==1
gen familyrecode1 = .
replace familyrecode1 = 0 if family16==5
replace familyrecode1 = 1 if family16==2
replace familyrecode1 = 2 if family16==7
replace familyrecode1 = 3 if family16==3
replace familyrecode1 = 4 if family16==1
replace familyrecode1 = 5 if family16==0
replace familyrecode1 = 6 if family16==8
replace familyrecode1 = 7 if family16==4
replace familyrecode1 = 8 if family16==6
gen familyrecode2 = .
replace familyrecode2 = 0 if family16==0
replace familyrecode2 = 1 if family16==4
replace familyrecode2 = 2 if family16==7
replace familyrecode2 = 3 if family16==3
replace familyrecode2 = 4 if family16==6
replace familyrecode2 = 5 if family16==5
replace familyrecode2 = 6 if family16==8
replace familyrecode2 = 7 if family16==1
replace familyrecode2 = 8 if family16==2
gen reg = .
replace reg = 0 if reg16==3
replace reg = 1 if reg16==5
replace reg = 2 if reg16==7
replace reg = 3 if reg16==6
replace reg = 4 if reg16==8
replace reg = 5 if reg16==4
replace reg = 6 if reg16==0
replace reg = 7 if reg16==2
replace reg = 8 if reg16==1
replace reg = 9 if reg16==9
regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg16 res16 family16 sibs [pweight = weight], beta
regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg16 res16 familyrecode sibs [pweight = weight], beta
regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg16 res16 familyrecode1 sibs [pweight = weight], beta
regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg16 res16 familyrecode2 sibs [pweight = weight], beta
regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg res16 family16 sibs [pweight = weight], beta
regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg res16 familyrecode1 sibs [pweight = weight], beta
regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg res16 familyrecode2 sibs [pweight = weight], beta
regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg16 res16 family16 sibs [pweight = weight], beta
Linear regression Number of obs = 20226
F( 20, 20205) = 364.79
Prob > F = 0.0000
R-squared = 0.3117
Root MSE = 1.697
------------------------------------------------------------------------------
| Robust
wordsum | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
bw1 | 1.095699 .1468845 7.46 0.000 .1777428
cohortdummy2 | -.2432754 .1667967 -1.46 0.145 -.0467582
cohortdummy3 | -.1327666 .1610207 -0.82 0.410 -.0273688
cohortdummy4 | -.4442166 .1630993 -2.72 0.006 -.08921
cohortdummy5 | -.3711671 .1662279 -2.23 0.026 -.0669138
cohortdummy6 | -.1091146 .1879013 -0.58 0.561 -.0148669
bwC2 | -.0214332 .1742682 -0.12 0.902 -.003947
bwC3 | -.1655899 .1658092 -1.00 0.318 -.0327039
bwC4 | -.1458145 .1663234 -0.88 0.381 -.0277045
bwC5 | -.40005 .1687139 -2.37 0.018 -.0673835
bwC6 | -.4935015 .1931291 -2.56 0.011 -.0609301
age | .0060567 .0014348 4.22 0.000 .0408087
sex | .2701488 .0266859 10.12 0.000 .0658305
logincome | .1710022 .0158248 10.81 0.000 .0794625
degree | .175303 .0268235 6.54 0.000 .09789
educ | .2571534 .0118184 21.76 0.000 .3546749
reg16 | -.0073669 .0058151 -1.27 0.205 -.0088605
res16 | .095045 .0090247 10.53 0.000 .0711172
family16 | .0025916 .0079647 0.33 0.745 .0022032
sibs | -.0465609 .0046251 -10.07 0.000 -.0685357
_cons | -.581039 .2460116 -2.36 0.018 .
------------------------------------------------------------------------------
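For reading the interaction terms: with cohort 1 as the reference, the gap in cohort k is the bw1 coefficient plus the bwCk coefficient. The sketch below shows how `lincom` returns that sum with a standard error. It runs on simulated data, so the sample size, the made-up gap that shrinks by 0.1 per cohort, and the absence of weights and controls are all assumptions for illustration only.

```stata
* Simulated stand-in for the GSS; all numbers here are made up
clear
set seed 1974
set obs 6000
generate bw1 = runiform() < 0.8
generate cohort6 = ceil(6*runiform())
generate wordsum = 5 + (1.1 - 0.1*(cohort6 - 1))*bw1 + rnormal()

* Same construction as in the do-file: cohort indicators and interactions
tabulate cohort6, generate(cohortdummy)
forvalues k = 2/6 {
    generate bwC`k' = bw1*cohortdummy`k'
}
regress wordsum bw1 cohortdummy2-cohortdummy6 bwC2-bwC6

* Gap in the reference cohort is _b[bw1]; gap in cohort 6 is _b[bw1] + _b[bwC6]
lincom bw1 + bwC6
```

Applied to the point estimates in the table above, the cohort 6 gap would be 1.095699 + (-.4935015) = .6021975, against 1.095699 in the first cohort.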
regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg16 res16 familyrecode sibs [pweight = weight], beta
Linear regression Number of obs = 20226
F( 20, 20205) = 365.08
Prob > F = 0.0000
R-squared = 0.3118
Root MSE = 1.697
------------------------------------------------------------------------------
| Robust
wordsum | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
bw1 | 1.098148 .1468342 7.48 0.000 .17814
cohortdummy2 | -.2428329 .1667948 -1.46 0.145 -.0466731
cohortdummy3 | -.1326954 .1610526 -0.82 0.410 -.0273542
cohortdummy4 | -.4447821 .1631311 -2.73 0.006 -.0893236
cohortdummy5 | -.3724344 .1662453 -2.24 0.025 -.0671423
cohortdummy6 | -.1116337 .1879246 -0.59 0.552 -.0152101
bwC2 | -.0224281 .174277 -0.13 0.898 -.0041302
bwC3 | -.1664026 .1658642 -1.00 0.316 -.0328644
bwC4 | -.1466234 .1663764 -0.88 0.378 -.0278582
bwC5 | -.4007139 .1687489 -2.37 0.018 -.0674953
bwC6 | -.4932983 .1931479 -2.55 0.011 -.060905
age | .0060158 .0014359 4.19 0.000 .0405335
sex | .2700135 .0266828 10.12 0.000 .0657975
logincome | .1715282 .0158549 10.82 0.000 .079707
degree | .1755253 .0268157 6.55 0.000 .0980142
educ | .2573497 .0118336 21.75 0.000 .3549456
reg16 | -.007438 .0058213 -1.28 0.201 -.0089461
res16 | .0948378 .0090363 10.50 0.000 .0709622
familyrecode | -.0051175 .0069678 -0.73 0.463 -.0050225
sibs | -.0466206 .0046247 -10.08 0.000 -.0686237
_cons | -.5469528 .2459031 -2.22 0.026 .
------------------------------------------------------------------------------
regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg16 res16 familyrecode1 sibs [pweight = weight], beta
Linear regression Number of obs = 20226
F( 20, 20205) = 365.04
Prob > F = 0.0000
R-squared = 0.3118
Root MSE = 1.6969
-------------------------------------------------------------------------------
| Robust
wordsum | Coef. Std. Err. t P>|t| Beta
--------------+----------------------------------------------------------------
bw1 | 1.095838 .1465374 7.48 0.000 .1777654
cohortdummy2 | -.2452279 .1667372 -1.47 0.141 -.0471335
cohortdummy3 | -.139467 .1609478 -0.87 0.386 -.0287501
cohortdummy4 | -.4513169 .1630004 -2.77 0.006 -.0906359
cohortdummy5 | -.3816927 .1661727 -2.30 0.022 -.0688114
cohortdummy6 | -.1194981 .1878523 -0.64 0.525 -.0162816
bwC2 | -.0207212 .1742131 -0.12 0.905 -.0038159
bwC3 | -.1597148 .1657395 -0.96 0.335 -.0315436
bwC4 | -.1399076 .1662173 -0.84 0.400 -.0265822
bwC5 | -.3924364 .168619 -2.33 0.020 -.066101
bwC6 | -.4848992 .193071 -2.51 0.012 -.059868
age | .0060752 .0014345 4.24 0.000 .0409337
sex | .2694222 .0266878 10.10 0.000 .0656535
logincome | .1712699 .0158322 10.82 0.000 .0795869
degree | .1756636 .0268296 6.55 0.000 .0980914
educ | .2570987 .0118139 21.76 0.000 .3545995
reg16 | -.0075259 .0058152 -1.29 0.196 -.0090518
res16 | .0943371 .0090397 10.44 0.000 .0705875
familyrecode1 | -.0137886 .0090215 -1.53 0.126 -.0102153
sibs | -.0466329 .0046277 -10.08 0.000 -.0686417
_cons | -.5246691 .2463459 -2.13 0.033 .
-------------------------------------------------------------------------------
regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg16 res16 familyrecode2 sibs [pweight = weight], beta
Linear regression Number of obs = 20226
F( 20, 20205) = 365.08
Prob > F = 0.0000
R-squared = 0.3118
Root MSE = 1.6969
-------------------------------------------------------------------------------
| Robust
wordsum | Coef. Std. Err. t P>|t| Beta
--------------+----------------------------------------------------------------
bw1 | 1.097943 .1464629 7.50 0.000 .1781067
cohortdummy2 | -.2424547 .166609 -1.46 0.146 -.0466004
cohortdummy3 | -.1341314 .1608393 -0.83 0.404 -.0276502
cohortdummy4 | -.4472196 .1629636 -2.74 0.006 -.0898131
cohortdummy5 | -.3764029 .1660956 -2.27 0.023 -.0678577
cohortdummy6 | -.1174024 .1877553 -0.63 0.532 -.0159961
bwC2 | -.0236267 .1741071 -0.14 0.892 -.0043509
bwC3 | -.1661975 .1656591 -1.00 0.316 -.0328239
bwC4 | -.1462968 .1662056 -0.88 0.379 -.0277961
bwC5 | -.4005288 .1685777 -2.38 0.018 -.0674641
bwC6 | -.4930299 .1929912 -2.55 0.011 -.0608718
age | .0059868 .001435 4.17 0.000 .0403379
sex | .2694145 .0266841 10.10 0.000 .0656516
logincome | .1720981 .0158374 10.87 0.000 .0799718
degree | .1757301 .0268169 6.55 0.000 .0981285
educ | .2576631 .0118325 21.78 0.000 .3553779
reg16 | -.0076389 .0058247 -1.31 0.190 -.0091877
res16 | .0944661 .0090305 10.46 0.000 .070684
familyrecode2 | -.0113374 .0079373 -1.43 0.153 -.0094531
sibs | -.046764 .0046238 -10.11 0.000 -.0688347
_cons | -.5154476 .246452 -2.09 0.036 .
-------------------------------------------------------------------------------
regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg res16 family16 sibs [pweight = weight], beta
Linear regression Number of obs = 20226
F( 20, 20205) = 365.16
Prob > F = 0.0000
R-squared = 0.3128
Root MSE = 1.6957
------------------------------------------------------------------------------
| Robust
wordsum | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
bw1 | 1.077054 .1463924 7.36 0.000 .1747182
cohortdummy2 | -.2437178 .1663023 -1.47 0.143 -.0468432
cohortdummy3 | -.1299419 .1604554 -0.81 0.418 -.0267866
cohortdummy4 | -.4435006 .1626352 -2.73 0.006 -.0890662
cohortdummy5 | -.3727247 .1657568 -2.25 0.025 -.0671946
cohortdummy6 | -.1158754 .1878084 -0.62 0.537 -.015788
bwC2 | -.024472 .1738017 -0.14 0.888 -.0045066
bwC3 | -.1737067 .1652074 -1.05 0.293 -.034307
bwC4 | -.1513894 .1658148 -0.91 0.361 -.0287637
bwC5 | -.4049843 .1681489 -2.41 0.016 -.0682146
bwC6 | -.491967 .1930388 -2.55 0.011 -.0607406
age | .0060645 .0014337 4.23 0.000 .0408615
sex | .2691577 .0266842 10.09 0.000 .065589
logincome | .1697286 .0157905 10.75 0.000 .0788707
degree | .1759572 .0267701 6.57 0.000 .0982553
educ | .2558024 .011793 21.69 0.000 .3528116
reg | .0218362 .004184 5.22 0.000 .0338935
res16 | .0910688 .0090408 10.07 0.000 .068142
family16 | .0012812 .0079736 0.16 0.872 .0010892
sibs | -.0462154 .0046184 -10.01 0.000 -.0680273
_cons | -.6298851 .2438412 -2.58 0.010 .
------------------------------------------------------------------------------
regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg res16 familyrecode1 sibs [pweight = weight], beta
Linear regression Number of obs = 20226
F( 20, 20205) = 365.47
Prob > F = 0.0000
R-squared = 0.3129
Root MSE = 1.6956
-------------------------------------------------------------------------------
| Robust
wordsum | Coef. Std. Err. t P>|t| Beta
--------------+----------------------------------------------------------------
bw1 | 1.078355 .1460574 7.38 0.000 .1749292
cohortdummy2 | -.2455091 .1662688 -1.48 0.140 -.0471875
cohortdummy3 | -.1358898 .1604158 -0.85 0.397 -.0280127
cohortdummy4 | -.4500099 .1625733 -2.77 0.006 -.0903734
cohortdummy5 | -.3824976 .165738 -2.31 0.021 -.0689565
cohortdummy6 | -.1258436 .1877791 -0.67 0.503 -.0171462
bwC2 | -.0238513 .1737711 -0.14 0.891 -.0043923
bwC3 | -.1685687 .1651688 -1.02 0.307 -.0332922
bwC4 | -.1461107 .1657415 -0.88 0.378 -.0277608
bwC5 | -.3982269 .1680839 -2.37 0.018 -.0670764
bwC6 | -.4841555 .1929902 -2.51 0.012 -.0597762
age | .0060783 .0014334 4.24 0.000 .0409543
sex | .2684576 .0266865 10.06 0.000 .0654184
logincome | .1700912 .0157964 10.77 0.000 .0790392
degree | .1764089 .0267762 6.59 0.000 .0985075
educ | .2557707 .0117892 21.70 0.000 .3527678
reg | .0217097 .0041827 5.19 0.000 .0336972
res16 | .0904257 .0090558 9.99 0.000 .0676608
familyrecode1 | -.0123615 .009031 -1.37 0.171 -.009158
sibs | -.0462869 .004621 -10.02 0.000 -.0681324
_cons | -.5837063 .2442501 -2.39 0.017 .
-------------------------------------------------------------------------------
regress wordsum bw1 cohortdummy2 cohortdummy3 cohortdummy4 cohortdummy5 cohortdummy6 bwC2 bwC3 bwC4 bwC5 bwC6 age sex logincome degree educ reg res16 familyrecode2 sibs [pweight = weight], beta
Linear regression Number of obs = 20226
F( 20, 20205) = 365.46
Prob > F = 0.0000
R-squared = 0.3128
Root MSE = 1.6956
-------------------------------------------------------------------------------
| Robust
wordsum | Coef. Std. Err. t P>|t| Beta
--------------+----------------------------------------------------------------
bw1 | 1.080223 .1460067 7.40 0.000 .1752322
cohortdummy2 | -.2430345 .1661583 -1.46 0.144 -.0467119
cohortdummy3 | -.1310357 .1603239 -0.82 0.414 -.027012
cohortdummy4 | -.4462279 .1625364 -2.75 0.006 -.0896139
cohortdummy5 | -.3776067 .1656619 -2.28 0.023 -.0680748
cohortdummy6 | -.1237979 .1877031 -0.66 0.510 -.0168675
bwC2 | -.0264528 .1736817 -0.15 0.879 -.0048713
bwC3 | -.1744731 .1651051 -1.06 0.291 -.0344583
bwC4 | -.15195 .1657303 -0.92 0.359 -.0288702
bwC5 | -.4056042 .1680526 -2.41 0.016 -.068319
bwC6 | -.4915503 .1929303 -2.55 0.011 -.0606892
age | .006 .001434 4.18 0.000 .0404265
sex | .2684602 .0266825 10.06 0.000 .065419
logincome | .1708242 .015807 10.81 0.000 .0793798
degree | .1764796 .0267638 6.59 0.000 .098547
educ | .256259 .0118057 21.71 0.000 .3534413
reg | .0217382 .0041794 5.20 0.000 .0337414
res16 | .0905556 .0090466 10.01 0.000 .067758
familyrecode2 | -.0099752 .0079317 -1.26 0.209 -.0083173
sibs | -.0464019 .0046177 -10.05 0.000 -.0683017
_cons | -.5769559 .2440902 -2.36 0.018 .
-------------------------------------------------------------------------------
Say, you control for dichotomized race (1;2). In that case, the other coefficients are expressed as if race is equal to 1.5. If race is coded 0;1, then it's 0.5 (note it's also how it works in ANCOVA).
Only if the Ns of the two racial groups are equal, no?
If you look at your estimated betas, they are not the same:
reg16:
family16 .0022032 reg16 -.0088605
familyrecode -.0050225 reg16 -.0089461
familyrecode1 -.0102153 reg16 -.0090518
familyrecode2 -.0094531 reg16 -.0091877
reg:
family16 .0010892 reg .0338935
familyrecode1 reg .0336972
familyrecode2 reg -.0083173
(You forgot familyrecode + reg.)
The results are all similar because these variables are probably not very important. However, they are not the same because MR is treating them as interval variables, and setting the 'mean' (also nonsense) to different values.
The overall results do not change much: the R2s are all near .31. You could correlate the betas from each model to see how similar they are. I did it for model1 x model2.
model1.betas = c(.1777428,-.0467582,-.0273688,-.08921,-.0669138,-.0148669,-.003947,
-.0327039,-.0277045,-.0673835,-.0609301,.0408087,.0658305,.0794625,
.09789,.3546749,-.0088605,.0711172,.0022032,-.0685357)
model2.betas = c(.17814,-.0466731,-.0273542,-.0893236,-.0671423,-.0152101,-.0041302,
-.0328644,-.0278582,-.0674953,-.060905,.0405335,.0657975,.079707,
.0980142,.3549456,-.0089461,.0709622,-.0050225,-.0686237)
cor(model1.betas,model2.betas)
[1] 0.9998823
So, yes, it seems to be not worth bothering about.
Only if the Ns of the two racial groups are equal, no?
I don't think the N is the problem. See here, for example:
http://menghublog.wordpress.com/2014/03/25/how-to-remove-the-influence-of-confoundings-on-a-continuous-variable-by-way-of-linear-regression/
If you look at your estimated betas, they are not the same:
reg16:
family16 .0022032 reg16 -.0088605
familyrecode -.0050225 reg16 -.0089461
familyrecode1 -.0102153 reg16 -.0090518
familyrecode2 -.0094531 reg16 -.0091877
reg:
family16 .0010892 reg .0338935
familyrecode1 reg .0336972
familyrecode2 reg -.0083173
(You forgot familyrecode + reg.)
The results are all similar because these variables are probably not very important. However, they are not the same because MR is treating them as interval variables, and setting the 'mean' (also nonsense) to different values.
I know that family16 and reg16 have very low correlations with wordsum. I chose these variables because they are the ones you have problems with. Fortunately, Stata reports seven decimal places, so you can compare the betas with high precision. They are not perfectly identical, of course, but they are similar. In the numbers you show me, only familyrecode2 differs from the other "family" variables. That does not surprise me, however. I did not say that the coefficients of family16 and reg16 themselves would be (almost) the same. I said that the coefficients of the other variables would remain almost the same, and I see that this is true. Of course, the fact that family16 and reg16 have low correlations with wordsum helped a little, as you noted.
What were the gaps for each age group in the first and last survey year?