
[ODP] An update on the narrowing of the black-white gap in the Wordsum
What were the gaps for each age group in the first and last survey year?


Cohort or survey year? I thought you would need cohort as well?
I don't know how you want me to calculate the age gap. By way of correlation?
The numbers are displayed below, for each survey-year category. I also display the SD of Wordsum and age for the black and white groups separately; you can see that the SDs of both variables decline over time. This means that the increase in the age gap by survey year is under-estimated. Still, I think the increase in the age gap is clearly large. (The correlations are computed for the entire group; I did not separate blacks and whites.)

correlate wordsum age [aweight = weight] if year4==1
correlate wordsum age [aweight = weight] if year4==2
correlate wordsum age [aweight = weight] if year4==3
correlate wordsum age [aweight = weight] if year4==4

year4   r        SD of wordsum (blacks / whites)   SD of age (blacks / whites)
1       0.0684   2.01 / 2.12                       14.62 / 14.61
2       0.0934   1.95 / 2.09                       14.51 / 14.12
3       0.1174   1.86 / 2.02                       12.81 / 13.29
4       0.1541   1.87 / 1.84                       13.52 / 13.80

The numbers below are the correlations for blacks and whites separately.

correlate wordsum age [aweight = weight] if year4==1 & bw1==0
correlate wordsum age [aweight = weight] if year4==2 & bw1==0
correlate wordsum age [aweight = weight] if year4==3 & bw1==0
correlate wordsum age [aweight = weight] if year4==4 & bw1==0

-0.0480
0.0298
-0.0592
0.0059

correlate wordsum age [aweight = weight] if year4==1 & bw1==1
correlate wordsum age [aweight = weight] if year4==2 & bw1==1
correlate wordsum age [aweight = weight] if year4==3 & bw1==1
correlate wordsum age [aweight = weight] if year4==4 & bw1==1

0.0706
0.0897
0.1318
0.1258

No increase in the age gap among blacks, but a large one for whites.

Now, if you want the results by way of regression, the unstandardized coefficients are shown below, again for each survey-year category: for blacks first, then for whites.

tobit wordsum age if year4==1 & bw1==0 [pweight = weight], ll ul
tobit wordsum age if year4==2 & bw1==0 [pweight = weight], ll ul
tobit wordsum age if year4==3 & bw1==0 [pweight = weight], ll ul
tobit wordsum age if year4==4 & bw1==0 [pweight = weight], ll ul

-.0070517
.0041364
-.0091082
.0007717

tobit wordsum age if year4==1 & bw1==1 [pweight = weight], ll ul
tobit wordsum age if year4==2 & bw1==1 [pweight = weight], ll ul
tobit wordsum age if year4==3 & bw1==1 [pweight = weight], ll ul
tobit wordsum age if year4==4 & bw1==1 [pweight = weight], ll ul

.0113877
.0145335
.0220242
.0179635

There is no tendency for the age gap to increase among blacks, but the tendency is visible for whites. Remember that the unstandardized coefficient is the effect of a one-unit change in the age variable on the Wordsum score. To obtain the gap between, say, 18- and 68-year-olds, you compute 0.0179*50=0.895, i.e., a difference of almost one word correct. The age gap for the first survey-year category is 0.01138*50=0.569. The difference is about 0.33 words correct, which is roughly the same magnitude as the black-white narrowing by survey year, which is about 0.30.
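
To make the arithmetic explicit, the same numbers can be reproduced in Stata (coefficients copied from the tobit output above):

display .0179635*50
display .0113877*50
display (.0179635-.0113877)*50

which give about 0.90, 0.57, and 0.33 words correct, respectively.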

So, what do you think?

----

By the way, I have detected a silly mistake in my syntax. All analyses are weighted, indeed, except my correlation of Wordsum with age. Initially, I reported it to be 0.1005, but the weighted correlation is 0.1085. So I made this modification (in the appendix and in the text) in my latest version.

This mistake is due to the fact that I'm not very used to Stata, even though I know the commands well. In SPSS, when you use the weight option, all of your subsequent analyses are weighted unless you disable the weight option. In Stata, however, each analysis must be weighted individually (or not). That's a big difference.
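
As a minimal illustration of the trap (same command as in my appendix; the r values are the ones reported above): in Stata the weight must be repeated on every command, and omitting it silently gives unweighted results.

correlate wordsum age [aweight = weight]
* r = 0.1085 (weighted)
correlate wordsum age
* r = 0.1005 (unweighted)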
It's difficult for me to make sense of the correlations. Could you just supply B/W d-values, e.g.,

Survey year group 1 (first)

Age group 1
Age group 2
Age group 3
Age group 4

Average:

Survey year group 4 (Last)

Age group 1
Age group 2
Age group 3
Age group 4

Average:
But I thought the regression would be easy to interpret, even in unstandardized coefficients. Anyway, I have attached the file. Generally, the BW gap diminished over time only when you examine the intermediate age groups (2 and 3). But the BW gap is smaller (irrespective of survey years) when the age group is younger.

The syntax I used is:

recode age (18/27=1) (28/40=2) (41/52=3) (53/69=4), generate(agegroup)

by bw1 agegroup year4, sort : summarize wordsum age [aweight = weight] if age<70

Black-White d gap, by age group and survey year

            year1    year2    year3    year4
agegroup1   0.6440   0.5585   0.4721   0.7106
agegroup2   0.6610   0.7089   0.5908   0.5726
agegroup3   0.8831   0.6103   0.7265   0.7368
agegroup4   0.8914   0.6657   0.9255   0.8280

Average     0.7699   0.6358   0.6787   0.7120
(averaged over age groups within each survey year)
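
For reference, a minimal Stata sketch of how one of these d values can be computed from the summarize output, assuming d = (mean white - mean black) / pooled SD within the cell (using the unweighted Ns in the pooling formula is an approximation under weighting):

quietly summarize wordsum [aweight = weight] if year4==1 & agegroup==1 & bw1==1
scalar mW = r(mean)
scalar sW = r(sd)
scalar nW = r(N)
quietly summarize wordsum [aweight = weight] if year4==1 & agegroup==1 & bw1==0
scalar mB = r(mean)
scalar sB = r(sd)
scalar nB = r(N)
display (mW - mB) / sqrt(((nW-1)*sW^2 + (nB-1)*sB^2)/(nW + nB - 2))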
1) Capitalize Wordsum throughout; it's a proper name.

2) "The right procedure should be to use a tobit regression (for an introduction, see, McDonald & Moffitt, 1980)."

Can you add a brief explanation of why tobit regression is the right procedure to use here?

3) "Since the year 2000, the GSS begins to ask whether the respondent is hispanic or not"

Began to ask. Capitalize Hispanic.

4) "The variable educ has values going from 0 to 20."

Are the values years or what?

5) It would be better to describe the variables and their values in a table.

6) "The full result is available"

--> "the full results are available"

7) "with no subsequent convergent since"

--> convergence

8) "the ceiling effect in the white sample suggests that the black-white difference (but not the changes) was under-estimated"

Why is the d-value change correctly estimated in your models despite the ceiling effect? Elaborate.

9) "In early cohorts, there was a strong ceiling effect in the white sample, but this effect progressively disappears in the successive cohorts."

Does it disappear, or just diminish?

10) "Generally, there is some indication that the black-white gap has been underestimated in early cohorts. And by the same token, the magnitude of the gap narrowing. But at the same time, the white trend could have been even flatter or turned out to be somewhat dysgenic."

This passage is obscure for me. Elaborate a bit.

11) You mention that before 2000 Hispanics were not disaggregated from non-Hispanics in the GSS. This seems to be a potentially important confound for black-white differences. What's the effect on the b-w gap of including or not including Hispanics since 2000?

12) In the references list, remove "Notes and Shorter Communications" from the name of Lynn's paper.

13) In Figures 2 and 3, could you draw lines between the data points? At least those two figures should be in the main text. It doesn't matter if they cannot be smoothly embedded in the main text because it's a much greater nuisance to have to scroll to the end of the paper to look at them. Remember that figures showing the main results are THE MOST IMPORTANT THING about a paper because they are what most readers look at first, and the only thing many readers look at.
Admin
MH in fact fixed many of those things. It is just that instead of uploading the revised paper (or revision in progress), he sent it to me by email. This meant that Dalliard spent time reviewing an older draft for no reason. It would be proper for the author to update the revision on OSF as soon as a new one is available, so that reviewers (who work for free, remember) don't waste their time.
Dalliard:

1) Ok

2) I added the paragraph already in my version 2, but I hesitate to post it before I have a reply from Chuck. He wants me to explain why the use of cohort or year makes a difference, but I have no clue, so I have not added a discussion of that. Note that Huang & Hauser didn't attempt to explain such a difference.

3) Chuck asked me to do it. And I capitalized all names.

4) Yes. I will add the information.

5) Should I delete these paragraphs and make a summary table instead? I'm not sure how I should present it. Let me think about it.

6) Ok

7) ok

8) I'm afraid you're right. I don't know why I wrote this. I should have said that the magnitudes of the d gap and of the narrowing are both under-estimated. I will make the changes.

9) It has diminished, but the difference is still meaningful. I have posted the graphs here:
http://openpsych.net/forum/showthread.php?tid=168&pid=1879#pid1879

10) It is the white score distribution that is "censored" at the right tail, i.e., a ceiling effect. I don't see such right censoring for blacks. So I think the white true score is under-estimated. But as I said, the ceiling effect among whites appears mainly in the early cohorts; there is only a modest ceiling effect in the late cohorts. If the ceiling effect diminishes over time, this necessarily implies that the scores of whites in early cohorts have been under-estimated relative to late cohorts. Consequently, the BW narrowing is under-estimated.

11) I did not say that the race variable before 2000 failed to disaggregate whites from Hispanics (perhaps I should add a note about that). In footnote 8, I have written: "According to the GSS codebook, the "white" category in variable "race" (before the year 2000) includes Mexicans, Spaniards and Puerto Ricans "who appear to be white"." Of course, you can think that misclassification has occurred. I believe that too, but I have the impression that it won't make a big difference. If I'm not mistaken, most Mexicans, Spaniards and Puerto Ricans don't appear white at all, although I can't tell for sure; I don't live in the U.S.

That said, I have created two variables, bw1 and bw. You see in the syntax that I use bw1 because it removes all Hispanics from the category "white" (after survey year 2000). When I used bw instead of bw1, I remember that nothing really changed. But you're right that I hadn't said anything about it in the limitations section. I will make the changes.

12) It was like this when I cited it from Google Scholar. I hesitated, but perhaps you're right.

13) Personally, I don't mind scrolling down to the end to look at the graphs, but since you and Emil asked, I will do it, even though I still don't understand the bother. As for the lines in the graphs, I have tried, but I don't know how to do it. I will search more today, and if I find a way, I will make the graphs with the lines.


-------

Emil

I sent you version 2 (although today I have added more things to it) because I wanted to know which version you prefer: the one with the graphs at the end of the paper, or the other version? I couldn't find a way to resize the graphs without deterioration in quality, although I think they are still acceptable.
But the BW gap is smaller (irrespective of survey years) when the age group is younger.


Which explains why you get different results when using (a) survey year controlling for age and (b) survey year minus age. It's not clear which better represents the true cohort effect, which would best be indexed by looking at same age persons at widely different times. Generally, in my opinion, the GSS provides no clear evidence that the B/W difference has substantially diminished across time. The time periods for which you can actually match age groups show only a small decrease in difference. And H&H's method is confounded by the presence of an age x BW d interaction. Add a brief discussion of this.

I don't see any material problems regarding statistical analysis. Post an updated draft so that we can double check for language problems.
I attach the new doc file. The pdf is at OSF. (Look for version 2).

https://osf.io/hiuzk/files/

As well as the new xls file (with the Stata syntax; look for version 2). It contains the analysis with the bw variable instead of the bw1 variable, as requested by Dalliard. No meaningful change (only 0.1 words correct).

I have made a lot of modifications to the text, and notably rewritten the sentences as suggested by Chuck. I fixed language issues and added a reference to Hunter & Schmidt (2004) as requested by Emil, explanations of what the tobit model does, a discussion of the potential effect of racial misclassification (Hispanics and whites together or apart for years 2000+), and a note on the absence of tests of normal and homoskedastic errors for the tobit regression.

But the major change is in tables 5-6. Initially, I used tobit to look at the changes among low- and high-scoring groups. But after reflection, it is preferable to use OLS. Tobit is used to estimate the true population value of Y when the data are censored (either clustering or a ceiling/floor effect), whereas the goal of my analyses of the subsamples (high and low ability scores) is just to examine the behavior of those parts of the entire group. Since I'm not interested here in the true population value of Y (a point I had overlooked), tobit is not the best choice, so I used OLS instead. If you compare the new xls with the old one, you see a difference in the parameters (mean score and size of gap) but no difference in the behavior of the trend lines.

By the way, I have emailed a lot of people, notably Sean Reardon and Derek Neal, concerning the ambiguity between cohort and period effects in my analyses. No one has responded. If you know someone else I can contact, feel free to tell me. Furthermore, I have emailed several econometricians about my use of tobit regression, but I didn't get many comments, except the one from Jeong-han Kang (I'm still debating with him). I have also emailed MacCallum concerning the dichotomization of continuous variables for logistic regression. No answer, but I had already decided to remove that analysis anyway.

I have re-emailed Huang and asked him what SD he used to compute the d values, because what he termed pooled across-years and pooled within-years is not clear to me. He said he used the "SD of the entire sample" (his own words). By this, I think he just did something like "summarize wordsum" in Stata, that is to say, not disaggregated by race, cohort, or year.

Then I asked Paul Sackett for his opinion. He said he prefers the pooled SD and does not recommend the SD of the entire sample or population: you should use either the black-white pooled SD or the majority-group SD (i.e., whites). However, he agrees with Huang that it is better to use a "fixed" SD because it's invariant over time. Instead, I have used mean(white)-mean(black) in cohort1 divided by the SD obtained in cohort1, and did the same for the other categories. Even though Sackett wouldn't recommend it, remember that he did not give me an argument, only his opinion. And I think his opinion is seriously wrong.

Just think about it. The SD of Wordsum declines over time. What has caused this effect? If the gain in Wordsum, especially for blacks, is due to the environment, one can guess that the lowest IQs in the black group have gained more than the highest IQs in the black group (this is true to a lesser extent in the GSS data). Automatically, this reduces the variance in IQ, hence the SD.

If you use the SD of the entire sample (2.11), which is similar to the SD of the earliest years and cohorts, you're expressing the contemporaneous black-white difference in terms of the environments of the past, not the environments of today. Today, there is less overlap in the score distributions because of the reduced SD, and this reduced SD is closely tied to the diminished score difference. If you want to examine the contemporaneous raw difference, you must use the contemporaneous SD as well. That makes sense.
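
A quick numerical illustration with a hypothetical raw gap of 1.5 words (the SDs are taken from the tables earlier in the thread: about 2.11 for the entire sample, and roughly 1.85 in the latest period):

display 1.5/2.11
display 1.5/1.85

The same raw gap yields d = 0.71 against the full-sample SD but d = 0.81 against the contemporaneous SD, so the choice of SD directly drives the apparent trend.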

The choice of SD is crucial. I have uploaded the new xls file, using both my former method and the one preferred by Huang. Using the latter, Huang is correct: there is a larger BW gap closing in both survey year and cohort, although the effect is much stronger with cohort. Using my method, there is (virtually) no gap narrowing in terms of survey year.

For this reason, I added an explanation in the text of why I used the within-year pooled SD instead of the SD for the entire time period (1974-2012).

Also, I want to take this occasion to say that most of the time, econometricians talk about data "clustering" in the dependent variable to justify tobit models. In Wordsum, there is no "clustering" but rather what looks like "truncation": the frequency at the lower end is extremely low while it is much higher at the upper end, i.e., there is no perfect symmetry. But even in that case, you can (and should) use tobit. See below.
http://www.ats.ucla.edu/stat/stata/output/Stata_Tobit.htm
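
A side note on the tobit calls used in this thread: "ll ul" with no arguments tells Stata to censor at the observed minimum and maximum of the dependent variable, so for Wordsum (scored 0-10) they are equivalent to the explicit form:

tobit wordsum age [pweight = weight], ll(0) ul(10)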

Finally...

Chuck, if you think H&H has a problem due to confounding with the age*BW-d interaction, my analysis would also suffer from this bias, because H&H also controlled for age, even though they used dummy variables for the age variable. Their method and mine are very similar. They did, however, use more SES variables, such as parental education, but I didn't include them because I think I have enough, and because, more importantly, the sample size would be greatly reduced. As I said in the article, the main problem with H&H is the low sample size for blacks.

Dalliard, this is off-topic, but in case you're not aware of it, your name does not appear in your paper with John, and your reply to Kaplan is still invisible. See here for the explanation.
http://www.openpsych.net/forum/showthread.php?tid=47&pid=1846#pid1846

EDIT:

I just detected two mistakes in the new version.

A next analysis will be conducted using OLS regression in a subsample having low wordsum scores (0-5) and another subsample having high Wordsum scores (5-10).


The first wordsum is not capitalized.

The tobit model relies heavily on normality and homoscedasticity in the underlying latent variable


The sentence would be clearer if rewritten as "in the underlying latent dependent variable".

I will make the changes in a later version.
Chuck, if you think H&H has a problem due to confounding with the age*BW-d interaction, my analysis would also suffer from this bias, because H&H also controlled for age, even though they used dummy variables for the age variable. Their method and mine are very similar.


Yes, your "cohort" analysis is confounded by an age x d interaction. The d-values which you have provided quite clearly show this.
Yes, your "cohort" analysis is confounded by an age x d interaction. The d-values which you have provided quite clearly show this.


To be clearer: Either (a) explain why I'm wrong about this or (b) briefly acknowledge the issue in your paper.
To be clearer: Either (a) explain why I'm wrong about this or (b) briefly acknowledge the issue in your paper.


Not sure what you mean. Earlier, you said "It's not clear which better represents the true cohort effect, which would best be indexed by looking at same age persons at widely different times.", but if I control for age, the d*cohort interaction is expressed in terms of constant age. When you control for age, it is held fixed for all the other independent variables. So, when using age as a control variable, I'm looking "at same age persons at widely different times", no?

Besides, I also wanted to say that I have fixed a few words (whites, Wordsum) that had not been capitalized. More importantly, I will redo all my regressions (OLS and tobit). A few moments ago I re-read one of my blog drafts, written in Word several months ago. It says that when we use age in a regression (and even in a correlation), we should always add the squared (age^2) and cubed (age^3) terms to the main effect of age, because in many instances age is not linearly related to the outcome. Of course, in my analysis age is not the key variable, but controlling for age with at least the squared and cubed terms is more effective than not controlling for them. Still, it is annoying that I knew about this nonlinearity problem with age and completely forgot to apply it. I look like a fool; this is pathetic.

I attach a picture of the nonlinearity of age with Wordsum, just to show you. The second graph depicts the curves for blacks and whites. There is a positive BW gap interaction with age, because older blacks lose a lot of "IQ" points whereas old and middle-aged whites are stagnating.

The syntax was:

* squared and cubed age terms, and their interactions with race
gen age2 = age^2
gen age3 = age^3
gen bw1age = bw1*age
gen bw1age2 = bw1*age2
gen bw1age3 = bw1*age3
* pooled age curve
tobit wordsum age age2 age3 [pweight = weight], ll ul
predict preda, xb
scatter preda age
* separate age curves for blacks and whites
tobit wordsum age age2 age3 bw1 bw1age bw1age2 bw1age3 [pweight = weight], ll ul
predict predb, xb
graph twoway (scatter predb age if bw1==0, msymbol(Oh)) (scatter predb age if bw1==1, msymbol(O)), legend(label(1 black) label(2 white))

I will update the whole thing when it's done. If someone has other things they would like me to include, just say so.
Admin
I attach a picture of the nonlinearity of age with Wordsum, just to show you. The second graph depicts the curves for blacks and whites. There is a positive BW gap interaction with age, because older blacks lose a lot of "IQ" points whereas old and middle-aged whites are stagnating.


This could potentially be a real effect, not some quirk. There is evidence that higher g protects against diseases of old age such as Alzheimer's. Indeed, one of the replicated g SNPs is one of the Alzheimer's SNPs too.

Or maybe it is something else.
There is evidence that higher g protects against diseases of old age such as Alzheimer's. Indeed, one of the replicated g SNPs is one of the Alzheimer's SNPs too.


I agree it's a potential explanation. That said, for the blacks who lose IQ points, an environmental hypothesis could say that it's due to the cumulative effect of pervasive racism in the job market, which keeps black wages low and causes them to live poorly, etc. They can come up with so many plausible theories that I would prefer not to take the risk of assuming that higher IQ protects against disease.
So, when using age as a control variable, I'm looking "at same age persons at widely different times", no?


I was conflating two issues. Imagine the following scenario, with hypothetical Wordsum data (using the same age distribution that you have):

year1 0.7699
year2 0.6358
year3 0.6787
year4 0.7120

year5 0.7699
year6 0.6358
year7 0.6787
year8 0.7120

year9 0.7699
year10 0.6358
year11 0.6787
year12 0.7120

MH's data

year13 0.7699
year14 0.6358
year15 0.6787
year16 0.7120

Based on your year 13 to 16 data, you couldn't possibly infer the year 1 to 4 data. So, no, you are not looking at same age persons at widely different times. (What I mean is that the results don't imply that the gap in, e.g., 1925 or 1950 was much larger than in 1975 or 2000.) If you do the analysis correctly, though, you can estimate the independent effect of cohort for years 13 to 16. But you have to use a viable APC (age, period, cohort) model, (arguably) such as those discussed in the attached papers. See, for example: http://yangclaireyang.web.unc.edu/age-period-cohort-analysis-new-models-methods-and-empirical-applications/chapter-7/
There are two things in my latest version (still in progress).

First, instead of using the age variable, I use a series of age dummy variables ((18/26=1) (27/35=2) (36/44=3) (45/55=4) (56/69=5)). I think this is better than using age+age^2+age^3, because with dummies the reference category has a direct impact on the intercept. For example, if age dummy 3 is the reference category, the intercept reflects this and should be interpreted as the Wordsum score for race=0, cohortdummy=1, agedummy=3. As the mean age of the entire sample is 41.48, I think it's most appropriate to use dummies and specify agedummy3 as the reference category (a sketch is given below). I did just that, but as expected, the results are unchanged compared to what you have in version 2.
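
A minimal sketch of this step (cutpoints as listed above; ib3. is Stata's factor-variable notation for setting group 3 as the base; the full model also includes the race and cohort dummies, omitted here):

recode age (18/26=1) (27/35=2) (36/44=3) (45/55=4) (56/69=5), generate(agedummy)
tobit wordsum ib3.agedummy [pweight = weight], ll ul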

Second, as there is no recent report of the reliability of the Wordsum, I have attempted to estimate its reliability using Cronbach's alpha in Stata. The reliability is 0.71 for the total sample (with age<70), while for blacks it's 0.63 and for whites it's 0.70. These are comparable with the numbers reported by H&H (0.63 and 0.71). However, for dichotomous items it's better to use the Kuder-Richardson (KR-20) reliability method. But the numbers are identical; see below:

keep if age<70

replace wordsum = . if wordsum<0
replace wordsum = . if wordsum>10

* the ten items: negative codes to missing, code 9 scored as 0
foreach v in worda wordb wordc wordd worde wordf wordg wordh wordi wordj {
    replace `v' = . if `v'<0
    replace `v' = 0 if `v'==9
}

ssc install kr20

kr20 worda wordb wordc wordd worde wordf wordg wordh wordi wordj if !missing(wordsum) & bw1==0

Kuder-Richarson coefficient of reliability (KR-20)

Number of items in the scale = 10
Number of complete observations = 3550

Item Item Item-rest
Item | Obs difficulty variance correlation
---------+------------------------------------------
worda | 3550 0.7854 0.1686 0.2691
wordb | 3550 0.8293 0.1416 0.4023
wordc | 3550 0.1285 0.1120 0.2212
wordd | 3550 0.8408 0.1338 0.3905
worde | 3550 0.6882 0.2146 0.4460
wordf | 3550 0.5645 0.2458 0.4025
wordg | 3550 0.1741 0.1438 0.1130
wordh | 3550 0.1454 0.1242 0.2309
wordi | 3550 0.6530 0.2266 0.3028
wordj | 3550 0.1121 0.0995 0.2410
---------+------------------------------------------
Test | 0.4921 0.3019

KR20 coefficient is 0.6366

kr20 worda wordb wordc wordd worde wordf wordg wordh wordi wordj if !missing(wordsum) & bw1==1

Kuder-Richarson coefficient of reliability (KR-20)

Number of items in the scale = 10
Number of complete observations = 18606

Item Item Item-rest
Item | Obs difficulty variance correlation
---------+------------------------------------------
worda | 18606 0.8369 0.1365 0.2627
wordb | 18606 0.9459 0.0512 0.3144
wordc | 18606 0.2434 0.1842 0.3696
wordd | 18606 0.9561 0.0419 0.2853
worde | 18606 0.7681 0.1781 0.4391
wordf | 18606 0.8296 0.1413 0.4290
wordg | 18606 0.3574 0.2297 0.3974
wordh | 18606 0.3225 0.2185 0.4274
wordi | 18606 0.7986 0.1609 0.2707
wordj | 18606 0.2672 0.1958 0.4444
---------+------------------------------------------
Test | 0.6326 0.3640

KR20 coefficient is 0.7015


The corresponding syntax in SPSS is:

RECODE wordsum (0 thru 10=COPY) (ELSE=SYSMIS) INTO GSSwordsum.
RECODE worda (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO word_a.
RECODE wordb (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO word_b.
RECODE wordc (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO word_c.
RECODE wordd (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO word_d.
RECODE worde (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO word_e.
RECODE wordf (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO word_f.
RECODE wordg (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO word_g.
RECODE wordh (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO word_h.
RECODE wordi (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO word_i.
RECODE wordj (0=0) (1=1) (9=0) (ELSE=SYSMIS) INTO word_j.

SELECT if age<70.
EXECUTE.

SELECT IF(NOT MISSING(GSSwordsum)).
EXECUTE.

RELIABILITY
/VARIABLES=word_a word_b word_c word_d word_e word_f word_g word_h word_i word_j
/SCALE('ALL VARIABLES') ALL
/MODEL=ALPHA
/STATISTICS=DESCRIPTIVE SCALE HOTELLING CORR COV TUKEY
/SUMMARY=TOTAL MEANS VARIANCE COV CORR.


Using the data set here, you get similar results (given rounding). The above syntax is for Cronbach's alpha, not KR-20. I don't know how to compute KR-20 in SPSS, but since the results are identical, it's not really a problem.
http://openpsych.net/datasets/GSSsubset.7z

Now, the problem is what Chuck has said: neither H&H's method nor mine can really answer the question of whether the BW gap has diminished (due to the possible confounding of the age and race gaps over time). The method I should probably have employed is a hierarchical linear mixed regression model. I'm not familiar with the technique, so I need more time before I submit the new version.

Of course, I could instead admit that my method cannot disentangle age, period, and cohort effects, and Chuck said it would be OK that way, but I think it's better to try HLM. I have to say, however, that if I employ HLM in my additional analyses, the HLM is not a tobit and thus does not correct for the censored distribution. But I think that's not too much of a problem: OLS and tobit regressions have produced similar results.
I have been busy asking people around. At least one person has agreed to help me, but he is busy and I have to wait. Just to let you know in advance, my linear mixed effects (LME) model contradicts my tobit regressions.

Here's the syntax and output:

replace wordsum = . if wordsum<0
replace wordsum = . if wordsum>10

recode cohort (1905/1920=1) (1921/1930=2) (1931/1939=3) (1940/1947=4) (1948/1953=5) (1954/1959=6) (1960/1968=7) (1969/1979=8) (1980/1994=9), generate(cohort9)

replace cohort9 = . if cohort9>9

recode age (18/23=1) (24/28=2) (29/33=3) (34/38=4) (39/44=5) (45/50=6) (51/56=7) (57/62=8) (63/69=9), generate(age9)

gen bw1age9 = bw1*age9

. xtmixed wordsum bw1 || cohort9: age9 bw1age9

Performing EM optimization:

Performing gradient-based optimization:

Iteration 0: log likelihood = -46852.106
Iteration 1: log likelihood = -46852.106

Computing standard errors:

Mixed-effects ML regression Number of obs = 22156
Group variable: cohort9 Number of groups = 9

Obs per group: min = 877
avg = 2461.8
max = 3663


Wald chi2(1) = 253.03
Log likelihood = -46852.106 Prob > chi2 = 0.0000

------------------------------------------------------------------------------
wordsum | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bw1 | 1.122804 .0705859 15.91 0.000 .9844581 1.26115
_cons | 4.751633 .134117 35.43 0.000 4.488769 5.014497
------------------------------------------------------------------------------

------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
cohort9: Independent |
sd(age9) | .1354198 .0394182 .0765443 .2395803
sd(bw1age9) | .0628879 .0207408 .0329485 .1200323
sd(_cons) | .3193018 .0947142 .1785296 .571074
-----------------------------+------------------------------------------------
sd(Residual) | 2.001279 .0095137 1.982719 2.020012
------------------------------------------------------------------------------
LR test vs. linear regression: chi2(3) = 401.60 Prob > chi2 = 0.0000

Note: LR test is conservative and provided only for reference.

predict xtfitted9, fitted

twoway (scatter xtfitted9 cohort9, msymbol(Oh) jitter(0)) (lfit xtfitted9 cohort9 if bw1==0, lcolor(red)) (lfit xtfitted9 cohort9 if bw1==1, lcolor(green)), by(age9) legend(cols(4) size(small)) title("Wordsum trend", size(medsmall)) legend(order(2 "black" 3 "white"))


The coefficients of the random effects by themselves tell you nothing at all. I think the most important output is the graph, so I have attached the nine graphs generated by Stata. As you see, the reduction in the BW gap is real in all nine age categories, but it's relatively small.

Now, I should explain the logic of the mixed-effects model (see the annotated call below). Here, the race variable is the fixed-effect variable, in the fixed component of the LME. Cohort9 is a random intercept, i.e., the mean Wordsum score is allowed to vary freely across my cohort categories. Finally, age and its interaction with race are used as random slopes within cohort9; that is, I'm specifying that the age effect varies across cohorts, and that the age*race effect also varies across cohorts.

The reason race should be in the fixed component (although using it as a random intercept does not meaningfully change the result) is that the random component of an LME evaluates deviations across the values of your variables, and with a dichotomous variable like bw1 I don't think it's wise to do that. Generally, I see people, notably in medical or experimental studies, using LME models with groups as fixed variables and individuals (i.e., id) as random intercepts, with random slopes such as doses, tests, scores, etc. varying freely across individuals. On the other hand, I have often seen textbooks (specifically, Stata books) use examples with the US regions (4) and census divisions (9); the number of categories is small, but it is still doable. A good introduction to LME is presented by a certain "Chuck Huber" at the Stata blog.
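
To make the correspondence with the syntax explicit, here is the same xtmixed call again, annotated (comments only, nothing new):

* fixed component: bw1 (race)
* random intercept: the mean wordsum score varies across cohort9 groups
* random slopes: age9 and its race interaction (bw1age9) vary across cohort9
xtmixed wordsum bw1 || cohort9: age9 bw1age9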

What I did is not exactly what Yang & Land (2008) did, but I don't care. I think it does what I want.

In any case, if the LME tells the truth, the controversy raised by H&H about the difference between period and cohort effects is mostly (but not entirely) due to age effects that were wrongly taken as cohort effects. Chuck was probably right.
Admin
Any updates on this submission?
I'm working on it. As you already know, multilevel regression is very complex, probably no less complex than SEM, MGCFA, or IRT. I thought I would have published it two weeks ago. Then I came across the problem of applying sampling weights in multilevel models, which is not straightforward; Brady T. West helped me with that.

Next, I tried to find a way to compare the random intercept model and the random slope model using a likelihood ratio test and a measure of effect size. I contacted many authors. They say they are not aware of any measure other than R² and Cohen's f² (preferably f², not R²) for gauging the effect size of a given (set of) variables. But in many papers and books I have read, f² and R² are applicable only to the fixed-effect component of the multilevel model, not the random component. So I gave up on this.

The other way of comparing model fit is the likelihood ratio test (LRT) for nested models (which I have done). But then I read this page, which says that applying sampling weights in analyses involving maximum likelihood estimation does not allow the LRT, on the grounds that "The “likelihood” for pweighted or clustered MLEs is not a true likelihood" and "Where there are pweights, the “likelihood” does not fully account for the “randomness” of the weighted sampling". Worse, in Stata, when you run a regression with robust standard errors and compute an LRT between two nested models, you get an error saying that the LRT is not suited to models with robust standard errors; and when you use sampling weights in a multilevel regression in Stata, you automatically get robust standard errors instead of ML standard errors (see the sketch below). Of course, I conducted the LRT without sampling weights, for both ML and REML estimation, but I would like to run the same test with the sampling weights applied.

The linked page recommends using a Wald test, but I have heard this is not the best way of doing things, especially when the number of groups (i.e., the 2nd-level variable) is small, which is true in my article: the latest version has only 21 cohort groups, whereas some statisticians recommend at least 30 or 50 groups for unbiased standard errors. Some authors prefer the Hausman test, but I still need to read up on what it is exactly.
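
A sketch of the problem (model names are hypothetical; with pweights, xtmixed reports robust standard errors, and lrtest then refuses the comparison):

xtmixed wordsum bw1 [pweight = weight] || cohort9:
estimates store ri
xtmixed wordsum bw1 [pweight = weight] || cohort9: age9
estimates store rs
lrtest ri rs
* error: the LR test is not appropriate after estimation with robust/pweighted standard errors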

This is the only thing I still need to resolve, because the other points made in my article are solid, and there is no black-white gap narrowing at all. I think it should be ready soon (I hope; I have been saying this for many days now).
Admin
You should probably recruit an external reviewer for this paper. I don't think anyone else here is very familiar with these methods; I am certainly not. Perhaps an econometrician or statistician.
Every claim I make concerning multilevel models and tobit regression is backed by references (books and pages), and they are available (anyone can email me if they want the list of references). Unless I misunderstood the passages of the texts I referred to, it's unlikely I'm mistaken. You'll see when the latest version is available. If you want, I will cite in this thread all of the relevant paragraphs of the books/articles I have referenced, and you'll see for yourself.

I would like to encourage some experts to review it, but given the extremely low response rate to my very modest, short questions (e.g., what is the consequence of ignoring the slope-intercept covariance? can we use an effect size measure for the random component in a multilevel regression?), I don't think they will spend even more time reviewing a 21-page article. If someone agrees to review it, I will keep you informed.