
[ODP] An update on the narrowing of the black-white gap in the Wordsum
I would like to encourage some experts to review it, but given the extremely low response rate to my very modest, short questions (e.g., what is the consequence of ignoring slope-intercept covariance? can we use an effect size measure for the random component in a multilevel regression?), I don't think they will spend even more time reviewing a 21-page article. If someone agrees to review it, I will keep you informed.


The analysis has grown beyond my technical competence, too. Perhaps you should consider submitting it to another journal, one with reviewers who are more familiar with the specific methods. Regardless, I will be happy to concept-check your method when you are done. Just make a list of your procedures and their justifications, with citations added.
I'm very late, for two reasons. The big one: I had to read tons of books (4-5) and articles (20-30? I don't remember) on multilevel regression, and along the way I may have lost my motivation, so I was spacing out all the time. The small one: I dislike R, especially when I'm forced to use it.

I have re-examined all the numbers in my spreadsheet and my Stata syntax and updated them in light of the new analyses. But I will publish them only when I'm certain that my submitted version is going to be the last one, unless you really want to see them now.

Anyway, here's the 3rd version.
https://osf.io/9tgmi/
I don't know what is going on with OSF; I don't see the overview on the project page, but it should still be possible to download the file.

In short, I added multilevel analyses, extended my tobit regressions by using continuous cohort/year variables instead of dummies, and modified the variables family16 and reg16 that Emil found bothersome due to their original coding. I have also added an appendix with the R syntax, so that everyone can reproduce what I did with free (but so damn ugly) software.

As always, to make things easier for you, in yellow are the changes I've made (from 2nd to 3rd version). Thus, it goes without saying that if you haven't looked at the changes made from version 1 to version 2, you won't see them in yellow in this 3rd version.

I'm really exhausted, so in this post I will not write out all the important paragraphs from the books I have read. I have already noted in my article all the pages that were important. Even if you don't own physical copies of these books, you already know where you can get them.
Admin
Meng Hu,

You did a lot of work on this, I can see.

Again, you should not put the R code in the PDF. Copying from a PDF is bad (it screws up formatting), and it inflates the page count even though almost all readers do not care about the syntax. Please put the source code in the OSF repository. The same applies to footnote 1. Put all the data, syntax, files, etc. in the OSF repository, so that it is not necessary to contact the authors. You know how slow they are, if they even reply.

The figures are still not placed in the text. This means that the reader has to jump back and forth between the end of the paper and the body text. You should move them up so they are near the text where they are discussed.

The variable "reg16" has been dichotomized as 0=0, 1/4=1, 5/7=0, 8/9=1, so that the values of the new variable is divided into South(=0)/North(=1) or, equivalently, into Wordsum score below 6 (value=0) or above 6 (value=1).


This doesn't sound right. Those living in the south all received scores below 6 and everybody in the north above 6?

Bold text means there is something that needs attention. Either grammar or wrong word or something.

However, I will focuse on age and


I decide to model cohort as random effects


The respective numbers for Blacks were 58.9, 50.5, 41.5, 36.6, 31.3, 24.8, respectively.


These numbers are also suggestive that the wordsum narrowing has ceased somewhat.


Footnote 14: The Shapiro-Wilk W value should be close to 0.99 to indicate normality.


You mean, should be above .99 to indicate normality.

Is that following my test results? http://emilkirkegaard.dk/en/?p=4452 If so, cite it. If you got it from some other reference, cite that.

homoscedasticity in the underlying latent dependentvariable


The Cronbach's alpha analysis of the ten items of the Wordsum (worda to wordj) yielded an alpha of 0.6365 for Blacks (N=3,550) and 0.7015 for Whites (N=18,606). The alpha for the entire sample is 0.7137 (N=23,817). This is quite comparable to the early estimates.


Odd that merging two subgroups with α's of .70 and .64 increases the merged group α to .71? Increased variation increases α?

Hopefully, Yang (2006) addressed this question


In the source code

m <- vglm(wordsum ~ bw1 + agec + agec2 + agec3 + bw1agec + bw1agec2 + bw1agec3, tobit(Upper=10), data=d, weights=d$weight)
summary(m)
d$wordsumpredictedage <- 5.210807+1.355356*d$bw1+-.0051079*d$agec+-.0016632*d$agec2+.000017*d$agec3+.0169752*d$bw1agec+.0002868*d$bw1agec2+.0000104*d$bw1agec3


You can extract the values from the model object so you don't need to type them manually. This is better as it reduces data errors and increases precision. You can find the attribute you need by using attributes(m).
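For example, something along these lines (a sketch only; check the coefficient names with names(coef(m)) first, since VGAM's tobit fit carries two intercepts):

cf <- coef(m)   # pull the estimates from the model object instead of retyping them
d$wordsumpredictedage <- cf[1] + cf["bw1"]*d$bw1 + cf["agec"]*d$agec + cf["agec2"]*d$agec2 +
  cf["agec3"]*d$agec3 + cf["bw1agec"]*d$bw1agec + cf["bw1agec2"]*d$bw1agec2 + cf["bw1agec3"]*d$bw1agec3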
I didn't know the syntax wouldn't work on pdf. I always copy pasted from my document file.

This doesn't sound right. Those living in the south all received scores below 6 and everybody in the north above 6?


That's the truth. When I do a cross tabulation of wordsum per region, I get wordsum higher than 6 for regions in the North, and less than 6 for regions in the South.

The pictures are not embedded in the text this time, because there are too many of them. In particular, Figure 7 is too big. If I shrink it, it becomes worthless, because the nine graphs in Figure 7 are already small; in that case I should just delete it. If you want me to put it within the text, you get a big empty space across several pages, and I don't like that. Also, I have 4 graphs for the tobit regressions, and the "results" section for tobit has only 2 pages. In other words, I would have to insert the third graph in section 2 and the fourth graph in section 3.2, i.e., the multilevel regression. That doesn't sound right to me.

I attached the file as a document. If you want, you can try to rearrange the graphs as you like. But when I tried, I was not satisfied with the result, especially with Figure 7.

I see no problem in "However, I will focuse on age and". Concerning "I decide to model cohort as random effects", I can tell that most people write random effects with an "s" and rarely without it, because the random effects are the parameter values for each value of your level-2 variable. I have 21 categories in my cohort variable, so 21 random effects, i.e., variations around the average slope estimated in the fixed component of the multilevel regression.

I remove "respectively" and wrote "indicate" instead of "are also suggestive" and I will think about "hopefully" later, although I don't understand the problem. Concerning "homoscedasticity in the underlying latent dependent variable" I don't see the problem.

Odd that merging two subgroups with α's of .70 and .64 increases the merged group α to .71? Increased variation increases α?


I don't know why. Cronbach's alpha is supposed to be affected by two factors: 1) the strength of the correlations and 2) the number of variables. But 2) does not change, so it must be 1).
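For intuition, here is a minimal sketch of the standardized-alpha formula; the average inter-item correlations below are hypothetical values, picked only to show how small shifts in the mean correlation move alpha by roughly these amounts:

# standardized alpha from the average inter-item correlation rbar and the number of items k
alpha_std <- function(rbar, k) k * rbar / (1 + (k - 1) * rbar)
alpha_std(0.15, 10)  # ~0.64
alpha_std(0.19, 10)  # ~0.70
alpha_std(0.20, 10)  # ~0.71

Pooling the two groups adds between-group variance to every item, which presumably raises the inter-item correlations slightly, and hence alpha.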

You mean, should be above .99 to indicate normality.

Is that following my test results? http://emilkirkegaard.dk/en/?p=4452 If so, cite it. If you got it from some other reference, cite that.


I don't think it's above .99. In several instances, some authors even said that >.95 is sufficient to indicate normality. But I didn't cite these references (found on Google Books). I want a good reference, and I couldn't find one. Even the original paper of Shapiro and Wilk (1965) didn't mention which value you should be seeking. If you find a good reference, I will add it, of course.

By the way, regarding this comment

You can extract the values from the model object so you don't need to type them manually. This is better as it reduces data errors and increases precision. You can find the attribute you need by using attributes(m).


You have to tell me what syntax you have in mind. Is this the one you are thinking of?

yhat<-fitted(m)

Because if this is the case, I think it's wrong. If you don't believe me, just try to make a scatterplot of the object yhat against, say, cohort, age, year, or any other variable, and see the result. This object yhat is not an actual variable, so you have to create one yourself. That's why R is pretty stupid. When you save your predicted Y (Y-hat) in SPSS or Stata, you get a new variable, and you can correlate it. But not in R, or at least not with the fitted() function. Concerning possible mistakes in the numbers I have typed: I do not type them myself. I copy-pasted them from Stata (or R). However, I read the lines carefully (twice) to make sure there is no mistake.

Also, because the object yhat is not an actual variable, I don't know how to remove the undesirable values from it. In all the software I have tried (Stata, SPSS, R), the predicted Y and/or residuals are computed even for the observations that are missing from your regression model. If you correlate this yhat, that is a problem.
Admin
I didn't know the syntax wouldn't work on pdf. I always copy pasted from my document file.


Copying from PDF causes problems with copying whitespaces including linebreaks. There is no good reason to put code in PDF unless it is very short. Source code should be in supplementary material on OSF.

That's the truth. When I do a cross tabulation of wordsum per region, I get wordsum higher than 6 for regions in the North, and less than 6 for regions in the South.


EVERYBODY in the south received scores below 6, and EVERYBODY in the north above? That is clearly not true. You must mean something else.

The pictures are not embedded in the text this time, because there are too many of them. In particular, Figure 7 is too big. If I shrink it, it becomes worthless, because the nine graphs in Figure 7 are already small; in that case I should just delete it. If you want me to put it within the text, you get a big empty space across several pages, and I don't like that. Also, I have 4 graphs for the tobit regressions, and the "results" section for tobit has only 2 pages. In other words, I would have to insert the third graph in section 2 and the fourth graph in section 3.2, i.e., the multilevel regression. That doesn't sound right to me.

I attached the file as a document. If you want, you can try to rearrange the graphs as you like. But when I tried, I was not satisfied with the result, especially with Figure 7.


I don't understand what the problem is, aside from the hassle of writing in Word Processors like this. I moved the Figures to the nearest text location. Figure 4 wasn't mentioned in the text at all, so I left it in the end.

Why can't you do it something like what I did?

I see no problem in "However, I will focuse on age and".


"focuse" is a spelling error of "focus".

"I decide to model cohort as random effects"


Congruence error (singular and plural).

I remove "respectively" and wrote "indicate" instead of "are also suggestive" and I will think about "hopefully" later, although I don't understand the problem. Concerning "homoscedasticity in the underlying latent dependent variable" I don't see the problem.


In your (private) email to me just now, you used "hopefully" wrong again. You mean "luckily". Luckily = It was good luck that it happened. Hopefully = hope it will happen in the future.

I don't think it's above .99. In several instances, some authors even said that >.95 is sufficient to indicate normality. But I didn't cite these references (found on Google Books). I want a good reference, and I couldn't find one. Even the original paper of Shapiro and Wilk (1965) didn't mention which value you should be seeking. If you find a good reference, I will add it, of course.


They are clearly wrong. Look at my simulations. Anything below .99 is hopelessly non-normal.

You have to tell me what syntax you have in mind. Is this the one you are thinking of?

yhat<-fitted(m)


No. I simply told you to extract the values from the object instead of typing them off in the code. You can use attributes(object) to see which attributes an object in R has.

For instance, if one uses factor analysis with the fa() function (from psych package), one can extract the scores and loadings from the factor analysis object. E.g.

object.fa = fa(object)       #factor analysis on object
object.fa$loadings           #factor loadings
object.fa$scores             #factor scores


Because if this is the case, I think it's wrong. If you don't believe me, just try to make a scatterplot of the object yhat with, say, cohort, age, or year, or any other variable. And see the result. This object "yhat" is not an actual variable, so you have to create one by yourself. That's why R is pretty stupid. When you save your predicted Y (Y-hat) in SPSS or Stata, you get a new variable, and you can correlate it. But not in R. Or at least, not with fitted() function. Concerning possible mistakes in the numbers I have typed, I do not type them myself. I copy pasted them from Stata (or R). However, I read the lines carefully (twice) so as to make sure there is no mistake.

Also, because the object yhat is not an actual variable, I dont know how to remove the undesirable values from it. Because in all software I have tried (stata, spss, R), the predicted Y and/or residuals are calculated even for the observations that are missing in your regression model. If you correlate this yhat, that is a problem.


This is not a flaw in R. It is because you are not using it correctly...

In the case of extracting model parameters:

>object.lm = lm(variable1 ~ variable2 , data.frame) #fit the model
>attributes(object.lm)
$names
[1] "coefficients" "residuals" "fitted.values" "effects" "weights" "rank"
[7] "assign" "qr" "df.residual" "na.action" "xlevels" "call"
[13] "terms" "model"

$class
[1] "lm"


You can extract various values from it. I don't know which ones you want. From your syntax it looks like you want coefficients? Or fitted values? Coefficients are in

object.lm$coefficients #unstd. coefficients
predict(object.lm) #fitted values
fitted(object.lm) #same
object.lm$fitted.values #same


This also appears to be what your syntax does, but in a silly way.
EVERYBODY in the south received scores below 6, and EVERYBODY in the north above? That is clearly not true. You must mean something else.


Where did I say that? I said the mean of wordsum is <6 in the south and >6 in the north.

I don't understand what the problem is, aside from the hassle of writing in Word Processors like this. I moved the Figures to the nearest text location. Figure 4 wasn't mentioned in the text at all, so I left it in the end.


Figure 4 was mentioned : "Figures 3 and 4 display the White and Black trends in Wordsum score derived from model 2 (i.e., with age and sex partialled out) for birth cohort and survey year".

I sent you the file as a document for a reason. If you can rearrange the pictures and avoid the big empty space caused by Figure 7, I will take it. But you have to make sure that Figures 3-6 are inserted only in the section on the tobit analysis (i.e., section 3.1, which has only 2 pages). I repeat, I tried it before, but that big empty space right in the middle of the page is annoying, not to mention graphs 5-6. Where should I put them? If you want me to put the graphs exactly where you want them, I would have to remove Figures 5, 6 and 7, because they are the source of the trouble.

I will correct the errors in grammar later.

They are clearly wrong. Look at my simulations. Anything below .99 is hopelessly non-normal.


I disagree again. See here. Anyway, I would be surprised if they just came up with this cutoff of >0.95 without knowing what they were talking about. I think they have some references, and that's why they decided on this cutoff. But where are these references mentioning that we can use that cutoff value? I can't find them.

object.lm$coefficients #unstd. coefficients
predict(object.lm) #fitted values
fitted(object.lm) #same
object.lm$fitted.values #same


This also appears to be what your syntax does, but in a silly way.


About how to get fitted values, let me ask you a question. Did you try the syntax I suggested? The answer is probably no, because if you did, you would see that it won't work. Like I said, SPSS and Stata also have the same flaw: residuals and/or predicted Y values are computed even for the observations not included in your model. But in those programs you can remove the values for the observations that are absent from your regression model. In Stata, you just do:

predict predicted_values if e(sample), xb

instead of just :

predict predicted_values, xb

But R is not Stata. I know of no other way to remove these undesirable values other than this :

d$wordsumpredictedage<- 5.210807+1.355356*d$bw1+-.0051079*d$agec+-.0016632*d$agec2+.000017*d$agec3+.0169752*d$bw1agec+.0002868*d$bw1agec2+.0000104*d$bw1agec3

d$wordsumpredictedage<-d$wordsumpredictedage+d$wordsum+d$bw1+d$age
d$wordsumpredictedage<-d$wordsumpredictedage-d$wordsum-d$bw1-d$age


Since age and bw1 have a much larger N (i.e., more non-NA values) than wordsum and wordsumpredictedage, when you do this arithmetic the NAs in bw1 and age become NAs in the variable wordsumpredictedage. It's a weird way of doing it, but I don't care. It works perfectly.
Admin
Mh,

Where did I say that? I said the mean of wordsum is <6 in the south and >6 in the north.


In the paper:

The variable "reg16" has been dichotomized as 0=0, 1/4=1, 5/7=0, 8/9=1, so that the values of the new variable is divided into South(=0)/North(=1) or, equivalently, into Wordsum score below 6 (value=0) or above 6 (value=1).


And just before:

This doesn't sound right. Those living in the south all received scores below 6 and everybody in the north above 6?


That's the truth. When I do a cross tabulation of wordsum per region, I get wordsum higher than 6 for regions in the North, and less than 6 for regions in the South.


Notice, the "all" in my quote above. You seems to have missed this. That's why I made it extra clear with "EVERYBODY".

What I think you meant:

The mean score of the southern regions is below 6. The mean score of the northern regions is above 6.

What you actually stated:

Everybody in the southern regions scored below 6. Everybody in the northern regions scored above 6.

Figure 4 was mentioned : "Figures 3 and 4 display the White and Black trends in Wordsum score derived from model 2 (i.e., with age and sex partialled out) for birth cohort and survey year".


My mistake. I searched for it using "Figure 4", but since you wrote "Figures 3 and 4", it did not turn up in my search.

I sent you the file as a document for a reason. If you can rearrange the pictures and avoid the big empty space caused by Figure 7, I will take it. But you have to make sure that Figures 3-6 are inserted only in the section on the tobit analysis (i.e., section 3.1, which has only 2 pages). I repeat, I tried it before, but that big empty space right in the middle of the page is annoying, not to mention graphs 5-6. Where should I put them? If you want me to put the graphs exactly where you want them, I would have to remove Figures 5, 6 and 7, because they are the source of the trouble.


I made a version of your paper where I moved them around. I forgot to attach it! It is attached now.

I disagree again. See here. Anyway, I would be surprised if they just came up with this cutoff of >0.95 without knowing what they were talking about. I think they have some references, and that's why they decided on this cutoff. But where are these references mentioning that we can use that cutoff value? I can't find them.


Look at what they actually talk about: p values, not the W value. Their W value for the non-normal distribution is .98, above the .95 value you propose. For it to be normal, it should be above .99, as in my tests. Do some more tests in R if you want to see.

About how to get fitted values, let me ask you a question. Did you try the syntax I suggested? The answer is probably no, because if you did, you would see that it won't work. Like I said, SPSS and Stata also have the same flaw: residuals and/or predicted Y values are computed even for the observations not included in your model. But in those programs you can remove the values for the observations that are absent from your regression model. In Stata, you just do:


So, you just need to remove them. Which observations are not included in your model?

Since age and bw1 have a much larger N (i.e., more non-NA values) than wordsum and wordsumpredictedage, when you do this arithmetic the NAs in bw1 and age become NAs in the variable wordsumpredictedage. It's a weird way of doing it, but I don't care. It works perfectly.


In R, whenever you add or multiply variables together and one of them is NA, you get an NA result.
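Incidentally, a more direct way to get that NA propagation (a sketch reusing the same three variables you add and subtract) would be:

# set the prediction to NA wherever any of the model variables is missing
d$wordsumpredictedage[is.na(d$wordsum) | is.na(d$bw1) | is.na(d$age)] <- NA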

As far as I can understand without running the code myself, the trouble is that when you use fitted/predict etc., it will calculate predictions for the observations that have missing values using the rest of the weights. Since you don't want this, you have resorted to calculating them manually in a very bad way. If you wanted to calculate them manually, you should still extract the coefficients from the model object.

In any case, what you want to do is just remove the cases with missing values before fitting the model to them. Then you can use the predict() function. Right?

To remove cases with missing values, you can use the complete.cases() function. I don't know which variables you want to include exactly, so I cannot provide the code for you.
Look at what they actually talk about: p values, not the W value. Their W value for the non-normal distribution is .98, above the .95 value you propose. For it to be normal, it should be above .99, as in my tests. Do some more tests in R if you want to see.


You missed the QQ plot. Perfectly normal. But W is below 0.99. Also, you didn't answer my objection. See here for yet another example. It's not true that below 0.99 the distribution is necessarily non-normal. There is no proof of what you say.

So, you just need to remove them. Which observations are not included in your model?


I said it before. But here's an illustration :

 y   x1   x2   x3
 1    .    .    .
 2    .   11   88
 3    7   28   77
 4    .   22    .
 5    4   12   93
 6    2   17   90
 7    3    .   81
 8    5   19   74
 9    5   19   93
10    5   15   76


A listwise analysis here would involve 6 people out of 10. But when you compute Y_hat, probably all statistical software gets silly and computes Y_hat for all 10 observations (because one variable here has 10 observations). Anyway, I already said that I found a solution. It's just not elegant, because I couldn't find a better way.
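To be concrete, here is the same count in R, with the toy data transcribed from the table above:

toy <- data.frame(
  y  = 1:10,
  x1 = c(NA, NA,  7, NA,  4,  2,  3,  5,  5,  5),
  x2 = c(NA, 11, 28, 22, 12, 17, NA, 19, 19, 15),
  x3 = c(NA, 88, 77, NA, 93, 90, 81, 74, 93, 76)
)
sum(complete.cases(toy))    # 6 of the 10 rows are complete
which(complete.cases(toy))  # rows 3, 5, 6, 8, 9, 10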

In R, whenever you add or multiply variables together and one of them is NA, you get an NA result.


I know. But that's precisely why I did it that way. Concerning the complete.cases() function: I'm aware of it. But how can you use it when you do this:

yhat <- 5.21 + 1.35*x1 + -.0051*x2 + -.00166*x3 + .000017*x4 + .01697*x5 + .0002868*x6 + .0000104*x7

Again, I don't know.

Since you don't want this, you have resorted to calculating them manually in a very bad way. If you wanted to calculate them manually, you should still extract the coefficients from the model object.

In any case, what you want to do is just remove the cases with missing values before fitting the model to them. Then you can use the predict() function. Right?


No. I repeat, you don't understand, and you probably never tried it before. I said that predict() or fitted() will never give you an actual variable. Just do this.

Use this :

http://openpsych.net/datasets/GSSsubset.7z

And type this :

setwd("c:\\Users\\ ............. \\)
entiredata<-read.csv("GSSsubset.csv")
d<- subset(entiredata, age<70)

d$weight<- d$wtssall*d$oversamp # interaction variable

d$bw<- rep(NA) # generate variable with only missing values
d$bw[d$race==1] <- 1 # assign a value of "1" under the condition specified in bracket
d$bw[d$race==2] <- 0 # assign a value of "0" under the condition specified in bracket
d$bw0<- rep(NA)
d$bw0[d$year>=2000 & d$race==1 & d$hispanic==1] <- 1
d$bw0[d$year>=2000 & d$race==2 & d$hispanic==1] <- 0
d$bw00<- rep(NA)
d$bw00[d$year<2000 & d$race==1] <- 1
d$bw00[d$year<2000 & d$race==2] <- 0
d$bw1<-d$bw0 # combine the two vars, by incorporating the first var
d$bw1[is.na(d$bw0)]<-d$bw00[is.na(d$bw0)] # then the second, without Nas

d$agec<- d$age-40.62 # mean-centering of age
d$agec2<- d$agec^2 # squared term of age
d$agec3<- d$agec^3 # cubic term of age
d$bw1agec = d$bw1*d$agec
d$bw1agec2 = d$bw1*d$agec2
d$bw1agec3 = d$bw1*d$agec3

d$wordsum[d$wordsum==-1] <- NA
d$wordsum[d$wordsum==99] <- NA

require(VGAM)

I_hate_you_R<- vglm(wordsum ~ bw1 + agec + agec2 + agec3 + bw1agec + bw1agec2 + bw1agec3, tobit(Upper=10), data=d, weights=d$weight)
summary(I_hate_you_R)
yhat<-fitted(I_hate_you_R)


Now, you have your yhat. If you still don't believe me, then use this :

View(d)

And see... where is your yhat object? There is no column for the yhat object. As I said, it's not an actual variable, and correlating it with an actual variable listed in your data window produces either an error message or weird graphs.

Just compare :

d$wordsumpredictedage<- 5.210807+1.355356*d$bw1+-.0051079*d$agec+-.0016632*d$agec2+.000017*d$agec3+.0169752*d$bw1agec+.0002868*d$bw1agec2+.0000104*d$bw1agec3

require(lattice)

xyplot(yhat~ d$age, data=d, groups=bw1, pch=19, type=c("p"), col = c('red', 'blue'), grid=TRUE, ylab="Wordsum predicted by tobit", xlab="age", key=list(text=list(c("Black", "White")), points=list(pch=c(19,19), col=c("red", "blue")), columns=2))

xyplot(d$wordsumpredictedage ~ d$age, data=d, groups=bw1, pch=19, type=c("p"), col = c('red', 'blue'), grid=TRUE, ylab="Wordsum predicted by tobit", xlab="age", key=list(text=list(c("Black", "White")), points=list(pch=c(19,19), col=c("red", "blue")), columns=2))


Nonetheless, the yhat object has not been corrected by removing the missing data. But my variable d$wordsumpredictedage also has this problem. I didn't remove the NAs. Now let's do this :

d$wordsumpredictedage1<-d$wordsumpredictedage+d$wordsum+d$bw1+d$age
d$wordsumpredictedage1<-d$wordsumpredictedage1-d$wordsum-d$bw1-d$age


Let's redraw the graph :

xyplot(d$wordsumpredictedage1 ~ d$age, data=d, groups=bw1, pch=19, type=c("p"), col = c('red', 'blue'), grid=TRUE, ylab="Wordsum predicted by tobit", xlab="age", key=list(text=list(c("Black", "White")), points=list(pch=c(19,19), col=c("red", "blue")), columns=2))

You see that the graph is pretty much the same. The only difference is that it must have fewer data points (because a lot of observations are dropped). I'm not sure about its effect on the graph. The other day I had the impression that the Y values were smaller (or bigger?) but that the pattern of the curves was the same. Still, you get a graph that is readable, with a nice regression line. You can't do that with the yhat object. The scatter you obtain from the yhat object is similar to what you get if you plot wordsum against age: a cloud of points.

Imagine you do :

wordsum=age+race

You plot Y_hat by computing it the usual way :

new_variable = intercept + beta*age + beta*race

What you will have is a scatter of wordsum versus age that is identical for blacks and whites. But why? Because you did not include the age*race interaction. That is, the age effect on wordsum was not allowed to vary across races.

What I mean is that you should get a neat graph with a "regression line" rather than a "cloud" of points scattered everywhere. This is proof that fitted() or predict() can't be used to obtain a true Y_hat variable, at least not in the usual way. In most software you can save the predicted Y variable and correlate it (after removing the undesirable missing data), but not in R.

But even with Stata, SPSS, or even SAS (which I haven't tried, but maybe I will, because it's free now), most of the time you will have to compute Y_hat by hand. The reason is that the Y_hat you request from your software is computed from all the variables in your model. Imagine I had added income, educ, degree, sibs, region, family, etc. The requested Y_hat would have data points locating each person at each value of sibs, each value of income, each value of educ, etc., around each value of your Y variable. But if I want to plot only the predicted Y based on age and race while controlling for the SES/background variables, I don't need to add the coefficients of these variables into my Y_hat variable, because they would add "variation" around the plotted line. The graph would become impossible to read, because the spread of the data points would no longer have the form of a single regression line.

If my explanation is too abstract: I planned (a month ago) to publish 3 articles in a row once my article is published: 1) how to compute Y_hat and make "predicted plots", and how to avoid the pitfalls in doing so, 2) tobit and truncated regression, 3) multilevel regression. But for now, I have attached a graph based on the same regression as above, with all the SES/background variables added. I requested the Y_hat directly from Stata and plotted it against age. The syntax was:

tobit wordsum bw1 agec agec2 agec3 bw1agec bw1agec2 bw1agec3 logincomec educc degreec gender sibs residence reg family [pweight = weight], ll ul
predict wordsumpredictedage if e(sample), xb
label variable wordsumpredictedage "Predicted Wordsum scores"
graph twoway (scatter wordsumpredictedage age if bw1==0, msymbol(Oh)) (scatter wordsumpredictedage age if bw1==1, msymbol(O)), legend(label(1 black) label(2 white))


You see the difficulty with that graph. Anyway, this discussion is pointless. If anyone here has suspicions and believes I mistyped some of my numbers, just rerun the analysis in R. I checked the numbers at least 3 times, so the probability of misreporting (even a small one) is close to zero. Furthermore, what does it mean to say that I calculate it in a bad way? That this is not how Y_hat should be computed? In that case I disagree, obviously, because it's exactly what all the textbooks say to do. The formula is Y_hat = intercept + beta*(var1) + beta*(var2) ... etc.

Finally, the document you sent has a problem. You have reduced Figure 7. I don't want that; otherwise I'd rather delete it, because there are 9 graphs and all of them were already small. By reducing Figure 7, all 9 graphs are now tiny. I will try to rearrange them as you want, even if the inclusion of Figure 4 will be tedious, but I don't think it's reasonable to reduce the size of Figure 7. A graph is useless if you can't see with precision what's happening in it, so they must be reasonably big.
Admin
MH,

You missed the QQ plot. Perfectly normal. But W is below 0.99. Also, you didn't answer my objection. See here for yet another example. It's not true that below 0.99 the distribution is necessarily non-normal. There is no proof of what you say.


The reason W is below .99 in this case is the small sample size. In my tests N=5k in all cases; in the above, n=20. If you use rnorm with a larger sample, the W metric behaves as in my simulations. With small N's, one can of course get strange results.
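A quick base-R illustration of the sample-size point (just a sketch, not the simulations from my post):

set.seed(42)
shapiro.test(rnorm(20))$statistic     # at n = 20, W bounces around from draw to draw
shapiro.test(rnorm(5000))$statistic   # at n = 5000 (the test's maximum), W sits very close to 1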

In your case, N is very large, so these small-scale simulations are not relevant.

A listwise analysis here would involve 6 people out of 10. But when you compute Y_hat, probably all statistical software gets silly and computes Y_hat for all 10 observations (because one variable here has 10 observations). Anyway, I already said that I found a solution. It's just not elegant, because I couldn't find a better way.


Just remove the observations with missing values before fitting, then you won't have this problem.

I know. But that's precisely why I did it that way. Concerning the complete.cases() function: I'm aware of it. But how can you use it when you do this:
yhat <- 5.21 + 1.35*x1 + -.0051*x2 + -.00166*x3 + .000017*x4 + .01697*x5 + .0002868*x6 + .0000104*x7
Again, I don't know.


You don't use it when one does stuff like that, because the above is a very poor way of doing things. As I said, if you wanted to calculate the predicted values manually per above, then you should use the $coefficients attribute.

In any case,

There are many ways to handle missing values. One simple way is just to make a new data.frame (df) and only add the variables you want.

E.g.
#this makes a new data.frame with the 5 variables of interest
df.complete.data = cbind(myVariable1,myVariable2,myVariable3,myVariable4,myVariable5)
#retain cases with full data only (should use imputation)
df.complete.data = df.complete.data[complete.cases(df.complete.data),]
#now runs stuff on it


Perhaps the easiest way is to use the subset function (see below).

No. I repeat, you don't understand, and you probably never tried it before. I said that predict() or fitted() will never give you an actual variable. Just do this.


Of course I tried it. I tested it prior to telling you. :)

Here's proof that it does return values as I say:



One can plot them vs. the predicted data too if one makes sure there are no missing values:

#predict IQs using all ancestry
#make subset of relevant vars
test.df.for.mh = subset(DF.supermega, select=c(LV2012estimatedIQ,AmergenomeFuerst2014,AfrgenomeFuerst2014))
test.df.for.mh = test.df.for.mh[complete.cases(test.df.for.mh),] #keep only complete cases
#fit model
lm.IQ12.ancestry.wt = lm(LV2012estimatedIQ ~ AmergenomeFuerst2014+AfrgenomeFuerst2014, DF.supermega, weights=sqrt(Population))
lm.IQ12.ancestry.wt.summary = summary(lm.IQ12.ancestry.wt) #summary
lm.beta(lm.IQ12.ancestry.wt) #std betas, = -.54 Amer and -1.23 Afro
test.df.for.mh["fitted"] = predict(lm.IQ12.ancestry.wt) #fitted values
main.title = paste0("Predicted national IQ based on genomic ancestry for the Americas\nAdj.r is ",round(sqrt(lm.IQ12.ancestry.wt.summary$adj.r.squared),2),", n=",nrow(test.df.for.mh),", weighted by log(pop)")
scatterplot(LV2012estimatedIQ ~ fitted, test.df.for.mh,
smoother=FALSE,
labels=rownames(test.df.for.mh),id.n=nrow(test.df.for.mh),
main = main.title, xlab ="Predicted IQ",ylab="National IQ")



(correct, it is actually sqrt(pop))
The reason W is below .99 in this case is the small sample size. In my tests N=5k in all cases; in the above, n=20. If you use rnorm with a larger sample, the W metric behaves as in my simulations. With small N's, one can of course get strange results.

In your case, N is very large, so these small-scale simulations are not relevant.


Your main claim was that anything below .99 is not normal. You did not specify any condition. Now you specify a condition : that N should not be small. Either way, this means your first statement was plain wrong.

Anyway, I tried this in the GSS data :

keep if born==2
keep if age<70
histogram age
swilk age

histogram cohort
swilk cohort


The sample sizes are large. Histogram shows normal distribution, but W values are all below 0.99; age has 0.970 and cohort has 0.984.

I could also have said that there is another thing that is not right with your W>0.99 rule: in my multilevel analysis, the W values for the level-2 residuals were around 0.94-0.95, but the histograms show nearly normal distributions. Same thing with the Q-Q plots. There was just a small-to-modest deviation from normality.

Just remove the observations with missing values before fitting, then you won't have this problem.


I will try that.

But first... why are you so nitpicky? I already said that my own method actually works. If you don't like it, it's just like saying "Meng, you use Times New Roman for your style but I prefer Arial, so unless you switch back to Arial, I must disapprove of your paper." If this were merely advice, I would thank you for it, because you are a fast learner, unlike me, and I may need help with R (sometimes). However, I hope it's not a requirement for publication, because all of your objections (except the "swilk" thing) are about "change the placement of the graphs", "make the syntax look more stylish", "don't type the numbers of your parameters", etc. All of them are unreasonable requirements, in my opinion. It's, after all, just a question of taste, and has nothing to do with the content of the article (method, theory, etc.).

There are many ways to handle missing values. One simple way is just to make a new data.frame (df) and only add the variables you want.


I know that. But I didn't want to use it at first. I wanted to keep only cases with age<70, but not to remove cases with missing values on wordsum. They may contain some useful information (e.g., people having wordsum may have different values on certain variables than people who don't). I prefer to use a function like "do if". In SAS, Stata and SPSS you can easily do that.

Here's proof that it does return values as I say:


It works for you because you removed all the missing data beforehand. I prefer to run the regression using a statement like "do if not missing x1 x2 x3", etc.

However, you use your own data. Not mine.

Let's do it.

d1= subset(d, select=c(wordsum, bw1, age, agec, agec2, agec3, bw1agec, bw1agec2, bw1agec3, weight))
d1 = d1[complete.cases(d1),] #keep only complete cases


Now, look at your data window. How many rows ?

View(d1)

Now, we do this :

library(VGAM) # vglm() function
library(lattice) # xyplot() function

still_hate_R<- vglm(wordsum ~ bw1 + agec + agec2 + agec3 + bw1agec + bw1agec2 + bw1agec3, tobit(Upper=10), data=d1, weights=d1$weight)
summary(still_hate_R)

d1$wordsumpredictedage<- 5.210807+1.355356*d1$bw1+-.0051079*d1$agec+-.0016632*d1$agec2+.000017*d1$agec3+.0169752*d1$bw1agec+.0002868*d1$bw1agec2+.0000104*d1$bw1agec3

d1$fitted = predict(still_hate_R) #fitted values

xyplot(d1$fitted ~ d1$age, data=d1, groups=bw1, pch=19, type=c("p"), col = c('red', 'blue'), grid=TRUE, ylab="Wordsum predicted by tobit", xlab="age", key=list(text=list(c("Black", "White")), points=list(pch=c(19,19), col=c("red", "blue")), columns=2))


The plot is nicer than before. But why are there all these values at zero on the Y axis across all the values of X? If you want the answer, do this:

View(d1)

And compare the columns fitted and wordsumpredictedage. Then scroll down to the end. How many rows do you have? Anyway, even if you can fix that, it's not the problem here, because you have ignored my most important comment. I will not repeat it again; I don't have time for this. I will just ask you to do this:

d$educ[d$educ>20] <- NA

d11 = subset(d, select=c(wordsum, bw1, age, agec, agec2, agec3, bw1agec, bw1agec2, bw1agec3, educ, weight))
d11 = d11[complete.cases(d11),] #keep only complete cases

R_go_to_hell<- vglm(wordsum ~ bw1 + agec + agec2 + agec3 + bw1agec + bw1agec2 + bw1agec3 + educ, tobit(Upper=10), data=d11, weights=d11$weight)
summary(R_go_to_hell)

d11$fitted = predict(R_go_to_hell) #fitted values

xyplot(d11$fitted ~ d11$age, data=d11, groups=bw1, pch=19, type=c("p"), col = c('red', 'blue'), grid=TRUE, ylab="Wordsum predicted by tobit", xlab="age", key=list(text=list(c("Black", "White")), points=list(pch=c(19,19), col=c("red", "blue")), columns=2))


Like I said: a cloud of points. And again, with data points at Y=0 along the X axis? So use View(d11) again and look at the values of the predicted Y. But whatever. The scatterplot illustrates what I'm saying. My method is not wrong; you just don't understand how Y_hat is computed when requested directly from the software. It takes all the variables, but I have already said that they should not be taken into account when computing Y_hat.
Admin
Your main claim was that anything below .99 is not normal. You did not specify any condition. Now you specify a condition : that N should not be small. Either way, this means your first statement was plain wrong.


Not plain wrong, but restricted to large datasets.

The sample sizes are large. Histogram shows normal distribution, but W values are all below 0.99; age has 0.970 and cohort has 0.984.

I could also have said that there is another thing that is not right with your W>0.99 rule: in my multilevel analysis, the W values for the level-2 residuals were around 0.94-0.95, but the histograms show nearly normal distributions. Same thing with the Q-Q plots. There was just a small-to-modest deviation from normality.


There is nothing wrong with the .99 guide. Your judgments based on Q-Q plots do not show much. Subjective judgments are bad in science (and everywhere else). If there is an objective way, use that instead.

Yes, there is deviation from normality. That is what W shows. What are we disagreeing about? First you say .99 doesn't work even for large datasets. Then you agree that the two variables with .94-.95 show small to modest deviation from normality. Which is it?

But first... why are you so nitpicky? I already said that my own method actually works. If you don't like it, it's just like saying "Meng, you use Times New Roman for your style but I prefer Arial, so unless you switch back to Arial, I must disapprove of your paper." If this were merely advice, I would thank you for it, because you are a fast learner, unlike me, and I may need help with R (sometimes). However, I hope it's not a requirement for publication, because all of your objections (except the "swilk" thing) are about "change the placement of the graphs", "make the syntax look more stylish", "don't type the numbers of your parameters", etc. All of them are unreasonable requirements, in my opinion. It's, after all, just a question of taste, and has nothing to do with the content of the article (method, theory, etc.).


Writing bad code makes it hard for others to follow your method if they wish to examine the code at a later point. And when you hard-code values, you risk introducing errors at every step. (Case in point below.)

I have not said that it was a requirement for publication.

As with John, I don't understand multi-level regression, so I'm not competent to judge that part of the method.

But you should fix the figures before publication. It makes no sense for them to be in the back. It's like reading working papers in economics. It is debatable whether the tables should be in the text as well. I am ok with them being in the back.

I know that. But I didn't want to use it at first. I wanted to keep only cases with age<70, but not to remove cases with missing values on wordsum. They may contain some useful information (e.g., people having wordsum may have different values on certain variables than people who don't). I prefer to use a function like "do if". In SAS, Stata and SPSS you can easily do that.


Well, you can of course have both datasets. I don't see the problem here.

It works for you because you removed all the missing data beforehand. I prefer to run the regression using a statement like "do if not missing x1 x2 x3", etc.


I'm not aware of any R command like that. Don't try to force STATA/whatever commands/syntax on R. Use the R commands and syntax to solve the problems.

However, you use your own data. Not mine.


For convenience. I already have my own data loaded and my code is otherwise good.

However, for completeness.

source("merger.R") #my functions
entiredata = read.csv("GSSsubset.csv", stringsAsFactors=FALSE) #load file, not with factors

d = subset(entiredata, age<70)

d$weight<- d$wtssall*d$oversamp # interaction variable

d$bw<- rep(NA) # generate variable with only missing values
d$bw[d$race==1] = 1 # assign a value of "1" under the condition specified in bracket
d$bw[d$race==2] = 0 # assign a value of "0" under the condition specified in bracket
d$bw0 = rep(NA)
d$bw0[d$year>=2000 & d$race==1 & d$hispanic==1] = 1
d$bw0[d$year>=2000 & d$race==2 & d$hispanic==1] = 0
d$bw00 = rep(NA)
d$bw00[d$year<2000 & d$race==1] = 1
d$bw00[d$year<2000 & d$race==2] = 0
d$bw1 = d$bw0 # combine the two vars, by incorporating the first var
d$bw1[is.na(d$bw0)] = d$bw00[is.na(d$bw0)] # then the second, without Nas

d$agec = d$age-mean(d$age) #40.62 # mean-centering of age, mean was wrong
d$agec2 = d$agec^2 # squared term of age
d$agec3 = d$agec^3 # cubic term of age
d$bw1agec = d$bw1*d$agec
d$bw1agec2 = d$bw1*d$agec2
d$bw1agec3 = d$bw1*d$agec3

d$wordsum[d$wordsum==-1] = NA
d$wordsum[d$wordsum==99] = NA

require(VGAM) #for vglm() and the tobit family
library(lattice) #xy plot

#subset it
d1= subset(d, select=c(wordsum, bw1, age, agec, agec2, agec3, bw1agec, bw1agec2, bw1agec3, weight))
d1 = d1[complete.cases(d1),] #keep only complete cases
#22156 rows

R_rocks<- vglm(wordsum ~ bw1 + agec + agec2 + agec3 + bw1agec + bw1agec2 + bw1agec3, tobit(Upper=10), data=d1, weights=d1$weight)
summary(R_rocks)

d1$wordsumpredictedage = 5.210807+1.355356*d1$bw1+-.0051079*d1$agec+-.0016632*d1$agec2+.000017*d1$agec3+.0169752*d1$bw1agec+.0002868*d1$bw1agec2+.0000104*d1$bw1agec3
coefs.mh.manually.typed = c(1.355356,-.0051079,-.0016632,.000017,.0169752,.0002868,.0000104)
coefs.extracted = coefficients(R_rocks)[-c(1,2)] #dont get intercepts
hist(coefs.extracted-coefs.mh.manually.typed) #what are the differences?
#pretty small

d1$fitted = predict(R_rocks)[,1] #fitted values; predict() returns a matrix here, so keep only the first column
plot(d1$fitted,d1$wordsumpredictedage) #plot and compare
cor(d1$fitted,d1$wordsumpredictedage) #correlation
#.9986497 the rest perhaps being due to imprecision in typing the coefficients off above

#fancy plot
xyplot(d1$fitted ~ d1$age, data=d1, groups=bw1, pch=19, type=c("p"), col = c('red', 'blue'), grid=TRUE, ylab="Wordsum predicted by tobit", xlab="age", key=list(text=list(c("Black", "White")), points=list(pch=c(19,19), col=c("red", "blue")), columns=2))


I found some errors in the code. E.g. you had calculated the wrong mean when re-centering. Again, you had inputted values manually instead of using a function. Bad.

The coefficients you typed in are not identical to those in the summary file, perhaps because I fixed the re-centering issue. They are however, very similar.

Plot attached. It seems normal.

The largest problem was loading the datafile correctly. In your code, you load a CSV file. However, you linked to a SAV file (SPSS). I tried loading the SPSS file, which works, but it treats some columns as factors (bad). I then had to open SPSS and export as CSV, and then load it as CSV without stringsAsFactors=FALSE to get rid of the factors.
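For reference, a sketch of one way to read the SAV file directly without getting factors (the filename here is assumed):

library(foreign)
entiredata <- read.spss("GSSsubset.sav", to.data.frame = TRUE, use.value.labels = FALSE)  # keeps labelled variables numeric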
There is nothing wrong with the .99 guide.


Everything is wrong with it if I can show you that I can get a normal distribution with a W value below 0.99. And I did.

Your judgments based on Q-Q plots do not show much. Subjective judgments are bad in science (and everywhere else). If there is an objective way, use that instead.


Your comment assumes that the SW test is objective. But it can't be, and you know that. There is no universal value that tells you what is normal and what is not.

Your comment is weird. I don't need GDP to tell me that Africans are poorer than Europeans, for the exact same reason that I don't need the W value of the SW test to tell me whether my distribution is normal or not. I can separate the rich from the poor by looking at them, exactly the same way I can tell what kind of distribution I have from a quick look at the histogram. And anyone who tells me he needs GDP or any other value to know whether Africans are poor or not is beyond help. There is nothing universal in science. You ask the impossible, or you mischaracterize science. If there is no certainty in subjective judgment ("I recognize a poor person when I see one"), there is no more certainty in the numbers, with their arbitrary cutoffs and the differing suggestions made by scientists based on different models and simulations. It's just a matter of precision. If I have trouble sorting the poor from the rich, that means there is a high probability their wealth is not very different. The same is true if I hesitate between normal and non-normal distributions, because this implies there is certainly some non-normality, maybe slight or moderate, but definitely a little something that is non-normal.

You even knew that the "universal" p-value cutoff of 0.05 is arbitrary, but you are ready to believe that your value of W=0.99 is universal? Probably not. Then your guide is equally subjective. But if you don't believe W=0.99 is universal, why are you telling me that subjective judgments based on Q-Q plots and histograms are bad? They cannot be worse than using your guide, which would certainly prove most dangerous for me. Most researchers, if not all, would kill me if they learned that I used a rule for a statistic recommended by someone who has not published it in a peer-reviewed article. For a cutoff value of 0.99 that is so important in statistics, it's clear that nobody will accept your judgment if it can't pass peer review. So I have every reason to think that your suggestion to cite your blog post and follow your cutoff is extremely dangerous, and not a reasonable request.

Writing bad code makes it hard for others to follow your method if they wish to examine the code at a later point. And when you hard-code values, you risk introducing errors at every step. (Case in point below.)


I don't know how many times I should repeat it. Your comment assumes you don't trust me, that is, you think I must necessarily have made a mistake somewhere. Why not rerun the syntax, if you want to see? Like I've said, I examined the numbers many times before. Yes, too many times.

And it's not bad code. Perhaps in a subjective way, but not in an objective way. Code is objectively "bad" or "wrong" only if it produces erroneous results. All of my results are correct.

As with John, I don't understand multi-level regression, so I'm not competent to judge that part of the method.


It is still possible to comment on this stuff. If, for example, there are claims that are not backed by references, you can mention them, or if you don't understand something, ask for clarification. I have cited everything I think I could. The only change I will make (unless someone points out something else) concerns the paragraph on the intercept-slope correlation. Considering the emails I have received from Snijders and Hox, that correlation can be interpreted as a standard correlation, bounded between -1 and +1, but only when the random-effect variables are not constrained (e.g., to be zero). This is the case in my study, so I will have to rectify this later.
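For readers who want to see where these quantities show up, here is a minimal sketch with lme4; the specification is hypothetical, not necessarily the exact model of the paper, and it assumes the cohort variable from the GSS subset:

library(lme4)
m_ml <- lmer(wordsum ~ agec + bw1 + (1 + bw1 | cohort), data = d)  # random intercept and race slope across cohorts
ranef(m_ml)$cohort   # one intercept deviation and one slope deviation per cohort category (21 in my coding)
VarCorr(m_ml)        # SDs of the random effects and the intercept-slope correlation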

But you should fix the figures before publication. It makes no sense for them to be in the back. It's like reading working papers in economics. It is debatable whether the tables should be in the text as well. I am ok with them being in the back.


In what way it makes no sense ? According to you and Dalliard, the reasons you gave me is clearly a matter of taste, and thus are entirely subjective. Not objective. You two said, basically, that it is a bother and a waste of time to scroll down to the end. Which will only takes 1-2 seconds. Like you've said before, if someone wants to read and check the syntax, he will look at the supplementary files, so there was no problem for having a pdf article and a pdf for supplementary files. But I can tell the same thing for the graphs. Where I have put the graph is not making it more difficult to read. What can make them difficult to read is if there is no title, misleading title, no legend, or insufficient information in the legend in the graph. But no one here has complained about that.

I found some errors in the code. E.g. you had calculated the wrong mean when re-centering. Again, you had inputted values manually instead of using a function. Bad.


What do you mean by re-centering ? I centered, but I did not re-centered. I also don't understand "inputted values". If you mean imputed values, I didn't do anything like this. It seems you said it was the value 40.62 which was the problem. Once again, I said I'm not wrong. In my article, I said I have taken the mean age for people having wordsum score and I have also applied sampling weight for all of my analyses. See below :

. summarize wordsum age logincome educ degree year cohort [aweight = weight] if !missing(wordsum)

    Variable |   Obs      Weight        Mean   Std. Dev.        Min        Max
-------------+-----------------------------------------------------------------
     wordsum | 23817  24541.0623    6.013715    2.102101          0         10
         age | 23817  24541.0623    40.62188    13.87005         18         69
   logincome | 21797   22231.922    10.12833    .9611794   5.501258   11.95208
        educ | 23775  24496.7364    13.09723    2.897454          0         20
      degree | 23779  24501.2668    1.396848    1.142989          0          4
-------------+-----------------------------------------------------------------
        year | 23817  24541.0623    1992.703    11.16641       1974       2012
      cohort | 23817  24541.0623    1952.081     17.3142       1905       1994



d1$fitted = predict(R_rocks)[,1] #fitted values; predict() returns a matrix here, so keep only the first column


This is not what I did. What is the purpose of [,1] here ?

Anyway, you're just showing me that I was right about R. No, R doesn't rock, it sucks. When you request fitted() without [,1], you actually get the weird plot I had before. Like I've said, R is annoying. It sucks when you have missing data. Even when you correct for missing data, you still have a problem, because you first need to know that fitted() must be used in conjunction with [,1]. You knew that, but I didn't. You will never have this sort of problem in other software, where the procedure is straightforward. But R is silly.

The largest problem was loading the datafile correctly


You didn't pay any attention to my syntax. Remember :

entiredata<-read.csv("GSSsubset.csv")

It's a csv, not a sav file. This means I had already converted my sav into a csv file. It has no problem with factor variables because everything in the csv file is numerical.
I have attempted some simulations. See below. Something is definitely odd with the Shapiro-Wilk test. I do not recommend this test.

a<-rnorm(2000,0,2)
b<-rnorm(500,-2,4)
c<-rnorm(1500,4,4)
x<-c(a,b,c)
hist(x, freq=FALSE)
shapiro.test(x)


With the above, almost everyone can easily agree from the histogram that x is normal or roughly normal. But not according to the SW W value, if I use your cutoff. I recommend you repeat the copy-paste in R. You always see the same thing: W is always below 0.99, and the histogram always looks the same, with slight skew and slight kurtosis.

x<-rnbinom(5000,7,.3)
hist(x)
shapiro.test(x)


Repeat the copy-pasting again. You see the distribution is skewed. It's modestly non-normal, and most people will agree with my reading of the histogram. But the W value is always around 0.96.

x <- rlnorm(5000,3,.22)
hist(x)
shapiro.test(x)


I left the best for the end. Try it several times, and if possible, try it many times. Now, I attach two graphs, sw1 and sw2. (EDIT: there are no titles on the graphs. Anyway, sw1 is the graph on the left, attachment 577, and sw2 is on the right, attachment 578.) Most of the time, the above syntax generates something very close to sw1. But sometimes you get something like sw2. You can tell by eyeballing that the two distributions are not alike. You will surely agree that most people can easily accept sw1 as fairly normal, but you will surely also agree that most people won't accept sw2 as normal, because of its high kurtosis.

But wait... look at the W value in the two pictures. They are always the same, always around 0.97. This tells us that SW cannot detect an important source of non-normality, e.g., kurtosis. So the histogram now has three features superior to the SW test:

1. Most (if not all) people use the SW test with the p value, not the W value.
2. A histogram can tell you which distribution you have, and can guide you toward the most appropriate method.
3. The SW test cannot distinguish deviations in the distribution that a histogram definitely can.

Considering the cutoff value of 0.99: because it's a "guide" and a "recommendation", I think it definitely needs to pass peer review. Otherwise, it will be difficult for researchers to agree on that "rule" if eminent experts can show it to be wrong.

But I haven't found anything yet about a cutoff for the W value. If it comes to that, I will perhaps think about removing it. In any case, the SW test adds nothing beyond the histogram and P-P/Q-Q plots.

Anyway,

I said before :

There is nothing universal in science


I should have been explicit : "in social science".

I have also said I will publish a blog article concerning the use of Yhat. But I changed my mind. I've decided to publish it right now.
http://humanvarieties.org/2014/12/18/how-to-calculate-and-use-predicted-y-values-in-multiple-regression/
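For readers who just want the mechanics, the core idea is something like this; a bare sketch with hypothetical variable names, not the exact syntax from the post :

# predicted Wordsum (Yhat) from a multiple regression, for chosen covariate profiles
fit <- lm(wordsum ~ educ + age + race, data = gss)
predict(fit, newdata = data.frame(educ = 12, age = 40, race = c(0, 1)),
        interval = "confidence")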

EDIT2 : I got a response from Kenneth C Land (a nice guy, I can tell). He agrees with my conclusion that the age effect with respect to black-white difference was wrongly taken as cohort effect. So, the tobit/OLS regressions were clearly wrong. And multilevel regression the correct procedure.
EDIT2 : I got a response from Kenneth C Land...


I suspected as much. Conceptually, I now have no problem with the analysis.

Regarding the paper -- you're not going to like this -- I think you should rewrite some of it so as to emphasize the novelty of the analysis. Basically, what you did was tack a multilevel regression analysis onto your original analysis. But you don't make the point of the paper clear.

What you should do is clearly state:

H&H's method was incorrect. Multilevel regression should be used. I will demonstrate the methodological effect. First, I will replicate H&H's findings using OLS, etc. Second, I will analyze the data using the more appropriate multilevel regression. Under this approach, H&H's findings do not replicate: there is no Wordsum gap narrowing. This illustrates the importance of using the most appropriate method.

Edit: If you need help rewriting the paper, let me know.

(Also, since you have all the code, maybe you could write a quick follow-up paper looking at the White-Hispanic gap using generation, cohort, age, and period. I don't think these effects have hitherto been properly disentangled.)
I would like you to tell me what you want me to add, modify, change, or remove (if there is anything). If I read it correctly, you only asked that I modify the abstract. Is there anything else ?
I would like you to tell me what you want me to add, modify, change, or remove...


I would like you to modify the discussion, conclusion, and abstract. I would like you to be clearer about what you are doing and about the novelty and importance of your research. I would like you to make clear that the other methods were sub-optimal. This point does not come across in the HV post or the paper. When I read both, the impression I get is: "Well, different methods, different results and maybe the B/W cohort narrowing wasn't as large as thought." And I think: "This Meng Hu guy is really conscientious (if not excessively so), checking so many different models. Kind of a waste of time, though." What I want to get is: "Hey, the proper way of conducting the analysis is Multi-level regression. I will demonstrate. When we do it, we see no narrowing. Readers, make sure to check your method. " And I think: "This Meng Hu guy really knows his stats."
I would like you to modify the discussion, conclusion, and abstract.


start here:

A. "The aim of this article is to provide an update to Huang & Hauser's (2001) study. I use the General Social Survey (GSS) to analyze the trend in the Black-White difference in Wordsum scores by survey years and by cohorts. Tobit regression models show that the Black-White difference diminishes over time by cohorts but just slightly by survey years. That is, the gap closing is mainly a cohort effect. The Black-White narrowing may have ceased in the last cohorts and periods. A multilevel regression is performed to examine if the result still holds, by modeling age as fixed effects and cohort as random effects. There was no racial gap narrowing. Explanations for these conflicting results are provided."

Try ~

B. "The aim of this article is to demonstrate the importance of using multilevel regression when analyzing cohort data. To show this, I analyse the Black-White difference in Wordsum scores by survey years and by cohorts, using the General Social Survey (GSS). Replicating Huang & Hauser's (2001) findings, I find a substantial narrowing of the difference when using Tobit regression models. However, when using the more appropriate multilevel regression models which disentangles cohort from age effects, I find no such narrowing. An explanation for the difference in results is provided."

(The discussion and conclusion should clearly convey the same points as in B.)
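(For concreteness, the contrast I have in mind might look roughly like the following in R; this is only a sketch with lme4 and hypothetical variable names, not MH's actual syntax.)

library(lme4)
# fixed-effects specification, roughly in the spirit of H&H's approach
ols <- lm(wordsum ~ race + age + cohort, data = gss)
# multilevel specification: age as a fixed effect, cohorts as random
# intercepts, with the Black-White gap allowed to vary across cohorts
mlm <- lmer(wordsum ~ race + age + (1 + race | cohort), data = gss)
summary(mlm)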
Admin
The reason I have not followed up is that MH was getting overly emotional and hostile. I thus decided to let him cool off for a bit before replying.
The reason I have not followed up...


This isn't how his replies read to me; but my theory of mind interpretations are subject to change with my mood.

Let me clarify a point. You said:

"As with John, I don't understand multi-level regression, so I'm not competent to judge that part of the method."

I understand the method conceptually, and I can generally recognize when it's called for. (In contrast, H&H's reviewers didn't.) I don't, though, grasp the fine nuances of the technique, so I won't be able to catch technical errors. This would be like not understanding certain technical aspects of OLS, say regarding normality assumptions, despite being able to make sense of OLS, grasping the basics, and being able to identify clear nonsense. I do trust MH's judgement regarding these technical points; we usually disagree when it comes to more theoretical issues, like the meaning of a Jensen Effect.
Chuck, I will make the modifications. Note that there is a discussion section, but no conclusion section. I don't think I need to add a conclusion section, on top of what will be written in the discussion section.

Also, concerning H&H, I think most people didn't understand this issue either at the time. The use of multilevel regression was recommended only recently, by Yang & Land in 2004 and 2006. There is also an article by Miyazaki & Raudenbush (2000), "Tests for Linkage of Multiple Cohorts in an Accelerated Longitudinal Design", in which they attempt to separate age and cohort effects, although that study uses a longitudinal survey (the NLSY79, if I read it correctly). It seems to me that Hauser is now fully aware of this problem, because I saw his name on a paper on the hierarchical APC model written by Yang and/or Land.

EDIT : Later, I will attach the emails (in a doc file) I received from Land, West and Snijders concerning my questions on multilevel regression. Concerning Land, he had only two comments. The first is that I should use the specific calendar years that define the cohorts. What he has in mind is probably what he did in all his papers on the GSS data, i.e., 1905-1910, 1911-1915, 1916-1920, 1921-1925, etc. I didn't do it that way because I was worried about the sample sizes (extremely small in the first and last cohorts if you split them as he asked). He also said he would prefer to use age, age^2, age^3, race, race*age, race*age^2, race*age^3, rather than a series of dummy variables, because it's simpler to interpret. I responded that I had done this and that the results were the same. He replied that he expected this, and that there was nothing surprising in these patterns.
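For what it's worth, the specification Land suggested would look roughly like this in R; a sketch only, with hypothetical variable names :

library(lme4)
# cubic age trend with race interactions, instead of a series of age dummies,
# on top of the random cohort terms
mlm2 <- lmer(wordsum ~ race*age + race*I(age^2) + race*I(age^3) +
               (1 + race | cohort), data = gss)
summary(mlm2)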

The reason I have not followed up is that MH was getting overly emotional and hostile. I thus decided to let him cool off for a bit before replying.


Maybe a little bit, but at the same time, you requested a lot of very unreasonable things.

I said : "my R syntax maybe is not "elegant" to you, but produce correct results".

You said : "no it's bad I don't like it, and it's error prone, so you should modify it".

I respond : "your answer sound like you suspect me having made errors in typing the numbers; if so, check the numbers by yourself rather than continuing saying that the syntax is wrongdoing and bad".

It's not an exaggeration to say this whole discussion is pointless and time-wasting, and I don't really want to waste time on something like this. Normally, this kind of advice should be given elsewhere, e.g. in "Other Discussions".

Another weird request concerns the figures. When I make a request about the shape and presentation of an article, it's not based on whether I like it or not. The way you usually cite references (i.e., [1], [2], [3], etc. instead of Kirkegaard et al. (2014), etc.) in all of your papers is very annoying to me, but I never said anything about it, because, as I said, it's subjective and so should not be a requirement. On the other hand, if I think the tables and figures would create problems for the credibility of the journal (e.g., using a screenshot from a blog post), I will probably say so, and that is a reasonable request. But concerning where I should put my figures, I decided based on a certain logic : I have never seen anyone else do what you requested. Either figures and tables are included within the text, or they go at the very end of the paper. The only reason you have given is that the figures are more important. I wonder in what way. Figure 8 and table 9 are equally important: figure 8 only plots the parameter estimates given in table 9, where the change in the B-W gap is already clearly visible (see column "Coeff. β1j").

You also said things that are very contradictory, such as that we cannot tell from a histogram whether the distribution is normal or not. If it weren't you, I would say you were joking and trying to play with my mind. In your blog post here, for example, you accompany every Shapiro-Wilk test with a histogram. Why, if you don't need histograms? The reason is that you can't understand the numbers without graphical representations. I have said this many times, but you keep repeating the claim. I also have the feeling you know you're wrong but won't admit it; otherwise you wouldn't pair graphs with the SW test. The only reason we can guess the extent of non-normality from the SW test is that we have graphical representations of what a W value of <0.90, 0.95 or >0.99 might look like. In fact, the SW test cannot be understood without a histogram, but you don't need the SW test to understand a histogram. A single number is too abstract to convey the shape of your data's distribution.

Another contradiction: if eyeballing is silly, I wonder why you asked me to present the figures in a "more visible way" to readers, on the grounds that figures are more important than tables. That doesn't make sense to me, given what you said about eyeballing.

As I said before, if someone insists on these requests, I will probably have to comply, but that doesn't mean I find them reasonable.