Emil Wrote:Look at what they actually talk about. P values, not W value. Their W value for the non-normal distribution is .98. Above the .95 value you propose. For it to be normal, it should be above .99 as in my tests. Do some more tests in R if you want to see.

You missed the QQ plot: it is perfectly normal, yet W is below 0.99. Also, you don't answer my objection.

See here for yet another example. It is not true that a distribution with W below 0.99 is necessarily non-normal. There is no proof of what you say.
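A quick way to check this in R (a simulation sketch I am adding, using only base R's shapiro.test()): draw samples that are normal by construction and look at the W values they produce.

```r
# Sketch: how often does a truly normal sample give Shapiro-Wilk W < 0.99?
set.seed(42)
w <- replicate(1000, shapiro.test(rnorm(20))$statistic)
mean(w < 0.99)  # for small n, a large share of genuinely normal samples fall below 0.99
```

With small samples W varies a lot by chance, so a fixed cutoff like 0.99 cannot by itself establish non-normality.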

Emil Wrote:So, you just need to remove them. Which observations are not included in your model?

I said it before. But here's an illustration :

Code:

 y  x1  x2  x3
 1   .   .   .
 2   .  11  88
 3   7  28  77
 4   .  22   .
 5   4  12  93
 6   2  17  90
 7   3   .  81
 8   5  19  74
 9   5  19  93
10   5  15  76

A listwise analysis would here include 6 people out of 10 (only rows 3, 5, 6, 8, 9 and 10 are complete). But when you compute Y_hat, statistical software will typically compute Y_hat for all 10 observations (because one variable here, y, has 10 observations). Anyway, I said already that I found a solution. It's just that it is not elegant, because I couldn't find a better way.
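For what it's worth, the toy data above can be rebuilt in R to check what a listwise fit actually does (this block is my illustration, not part of the original analysis):

```r
# The toy data from the table above.
d <- data.frame(
  y  = 1:10,
  x1 = c(NA, NA, 7, NA, 4, 2, 3, 5, 5, 5),
  x2 = c(NA, 11, 28, 22, 12, 17, NA, 19, 19, 15),
  x3 = c(NA, 88, 77, NA, 93, 90, 81, 74, 93, 76)
)
fit <- lm(y ~ x1 + x2 + x3, data = d)  # listwise deletion by default (na.omit)
length(fitted(fit))                    # 6 fitted values, not 10
predict(fit, newdata = d)              # NA wherever a predictor is missing
```

So with lm() the fitted values cover only the 6 complete cases, while predict() on the full data returns NA for the dropped rows.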

Emil Wrote:In R, whenever you add or multiple variables together and one of them is NA, you will get a NA result.

I know. But that is precisely why I did it that way. Concerning the complete.cases() function: I'm aware of it. But how can you use it when you do this:

yhat <- 5.21 + 1.35*x1 - .0051*x2 - .00166*x3 + .000017*x4 + .01697*x5 + .0002868*x6 + .0000104*x7

Again, I don't know.
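One possible answer (a sketch I am adding; it assumes the predictors x1 through x7 are columns of a data frame d, which matches the hand-written formula above): compute yhat with the formula, then blank out the rows where any predictor is missing.

```r
# Sketch: hand-computed yhat restricted to complete cases.
# Assumes a data frame d with (hypothetical) columns x1 ... x7.
yhat <- with(d, 5.21 + 1.35*x1 - .0051*x2 - .00166*x3 + .000017*x4 +
                .01697*x5 + .0002868*x6 + .0000104*x7)
ok <- complete.cases(d[, c("x1","x2","x3","x4","x5","x6","x7")])
yhat[!ok] <- NA  # drop predictions for incomplete rows
```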

Emil Wrote:Since you don't want this, you have resorted to calculating them manually in a very bad way. If you wanted to calculate them manually, you should still extract the coefficients from the model object.

In any case, what you want to do is just remove the cases with missing values before fitting the model to them. Then you can use the predict() function. Right?

No. I repeat: you don't understand, and you probably never tried it before. I said that predict() or fitted() will never give you an actual variable. Just do this.

Use this :

http://openpsych.net/datasets/GSSsubset.7z
And type this :

Code:

setwd("c:\\Users\\ ............. \\")
entiredata <- read.csv("GSSsubset.csv")
d <- subset(entiredata, age<70)
d$weight <- d$wtssall*d$oversamp # interaction variable
d$bw <- rep(NA) # generate variable with only missing values
d$bw[d$race==1] <- 1 # assign a value of "1" under the condition in brackets
d$bw[d$race==2] <- 0 # assign a value of "0" under the condition in brackets
d$bw0 <- rep(NA)
d$bw0[d$year>=2000 & d$race==1 & d$hispanic==1] <- 1
d$bw0[d$year>=2000 & d$race==2 & d$hispanic==1] <- 0
d$bw00 <- rep(NA)
d$bw00[d$year<2000 & d$race==1] <- 1
d$bw00[d$year<2000 & d$race==2] <- 0
d$bw1 <- d$bw0 # combine the two vars: start from the first var,
d$bw1[is.na(d$bw0)] <- d$bw00[is.na(d$bw0)] # then fill its NAs from the second
d$agec <- d$age-40.62 # mean-centering of age
d$agec2 <- d$agec^2 # squared term of age
d$agec3 <- d$agec^3 # cubic term of age
d$bw1agec <- d$bw1*d$agec
d$bw1agec2 <- d$bw1*d$agec2
d$bw1agec3 <- d$bw1*d$agec3
d$wordsum[d$wordsum==-1] <- NA # recode missing-data codes to NA
d$wordsum[d$wordsum==99] <- NA
require(VGAM)
I_hate_you_R <- vglm(wordsum ~ bw1 + agec + agec2 + agec3 + bw1agec + bw1agec2 + bw1agec3, tobit(Upper=10), data=d, weights=d$weight)
summary(I_hate_you_R)
yhat <- fitted(I_hate_you_R)

Now you have your yhat. If you still don't believe me, look at your data window and see: where is your yhat object? There is no yhat column. As I said, it's not an actual variable, and correlating it with an actual variable listed in your data window causes either an error message or weird graphs.

Just compare :

Code:

d$wordsumpredictedage <- 5.210807 + 1.355356*d$bw1 - .0051079*d$agec - .0016632*d$agec2 + .000017*d$agec3 + .0169752*d$bw1agec + .0002868*d$bw1agec2 + .0000104*d$bw1agec3

require(lattice)

xyplot(yhat~ d$age, data=d, groups=bw1, pch=19, type=c("p"), col = c('red', 'blue'), grid=TRUE, ylab="Wordsum predicted by tobit", xlab="age", key=list(text=list(c("Black", "White")), points=list(pch=c(19,19), col=c("red", "blue")), columns=2))

xyplot(d$wordsumpredictedage ~ d$age, data=d, groups=bw1, pch=19, type=c("p"), col = c('red', 'blue'), grid=TRUE, ylab="Wordsum predicted by tobit", xlab="age", key=list(text=list(c("Black", "White")), points=list(pch=c(19,19), col=c("red", "blue")), columns=2))

Nonetheless, the yhat object has not been corrected by removing the missing data, and my variable d$wordsumpredictedage has the same problem: I didn't remove the NAs. Now let's do this:

Code:

d$wordsumpredictedage1 <- d$wordsumpredictedage+d$wordsum+d$bw1+d$age # adding these variables propagates their NAs
d$wordsumpredictedage1 <- d$wordsumpredictedage1-d$wordsum-d$bw1-d$age # subtracting them restores the values, keeping the NAs
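A more direct way to get the same result (a sketch I am adding; it avoids the add-then-subtract round trip) is to blank out the affected rows explicitly:

```r
# Sketch: copy the predictions, then set to NA every row where any of the
# variables involved in the plot (wordsum, bw1, age) is missing.
d$wordsumpredictedage1 <- d$wordsumpredictedage
d$wordsumpredictedage1[is.na(d$wordsum) | is.na(d$bw1) | is.na(d$age)] <- NA
```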

Let's redraw the graph :

Code:

xyplot(d$wordsumpredictedage1 ~ d$age, data=d, groups=bw1, pch=19, type=c("p"), col = c('red', 'blue'), grid=TRUE, ylab="Wordsum predicted by tobit", xlab="age", key=list(text=list(c("Black", "White")), points=list(pch=c(19,19), col=c("red", "blue")), columns=2))

You see that the graph is pretty much the same. The only difference is that it has fewer data points (because a lot of observations are dropped). I'm not sure about the effect of this on the graph; I had the impression the other day that the Y values were smaller (or bigger?) but that the pattern of the curves was the same. Still, you get a readable graph with a nice regression line. You can't do that with the yhat object: the scatter you obtain from it is similar to what you get if you plot wordsum against age, a cloud of points.

Imagine you fit:

wordsum = age + race

and you compute Y_hat the usual way:

new_variable = intercept + beta1*age + beta2*race

What you will get is a scatter of predicted wordsum versus age that is identical for blacks and whites. Why? Because you did not include the age*race interaction: the age effect on wordsum was not allowed to vary across races.
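The difference between the two specifications can be sketched in R as follows (my illustration on simulated data; lm() is used for simplicity instead of the tobit model):

```r
# Simulated data where the age slope truly differs by race.
set.seed(1)
d <- data.frame(age = rep(20:60, 2), race = rep(0:1, each = 41))
d$wordsum <- 4 + 0.02*d$age + 1*d$race + 0.03*d$age*d$race + rnorm(82)

fit_flat <- lm(wordsum ~ age + race, data = d)  # one common age slope: parallel predicted lines
fit_int  <- lm(wordsum ~ age * race, data = d)  # age slope allowed to vary by race
coef(fit_int)                                   # includes an age:race term
```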

What I mean is that you want a neat graph which shows a "regression line" rather than a "cloud" of points scattered everywhere. This is proof that fitted() or predict() can't be used to obtain a true Y_hat variable, at least not in the usual way. In most software you can surely save the predicted Y variable and correlate it (after removing the undesirable missing data), but not in R.

But even with Stata, SPSS or SAS (which I haven't tried, but maybe I will, since it's free now), most of the time you will have to compute Y_hat by hand. The reason is that the Y_hat you request from your software is computed from all the variables in your model. Imagine I have added income, educ, degree, sibs, region, family, etc. The requested Y_hat will have data points locating each person at each value of sibs, each value of income, each value of educ, etc., around each value of your Y variable. But if I only want to plot the predicted Y based on age and race while controlling for the SES/background variables, I must not add the coefficients of these variables into my Y_hat variable, because they add "variation" around the plotted line. The graph then becomes impossible to read, because the spread of the data points no longer has the form of a single regression line.

If my explanation is too abstract: I planned (one month ago) that when my article is published, I will publish 3 articles in a row: 1) how to compute Y_hat and make "predicted plots", and how to avoid the pitfalls in doing it; 2) tobit and truncated regression; 3) multilevel regression. But if you want it now, I have attached a graph based on the same regression above, but with all the SES/background variables added. I requested the Y_hat directly from Stata and plotted it against age. The syntax was:

Code:

tobit wordsum bw1 agec agec2 agec3 bw1agec bw1agec2 bw1agec3 logincomec educc degreec gender sibs residence reg family [pweight = weight], ll ul
predict wordsumpredictedage if e(sample), xb
label variable wordsumpredictedage "Predicted Wordsum scores"
graph twoway (scatter wordsumpredictedage age if bw1==0, msymbol(Oh)) (scatter wordsumpredictedage age if bw1==1, msymbol(O)), legend(label(1 black) label(2 white))

You see the difficulty with that graph. Anyway, this discussion is pointless. If anyone here suspects that I mistyped some of my numbers, just run the analysis in R. I checked the numbers at least 3 times, so the probability of misreporting (even a small one) is close to zero. Furthermore, what does it mean to say that I calculated it in a bad way? That this is not how Y_hat should be computed? In that case I disagree, obviously, because it's exactly what all textbooks say to do. The formula is Y_hat = intercept + beta1*var1 + beta2*var2 + ... etc.
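That textbook formula can also be applied without retyping the rounded coefficients, by pulling them out of the model object (a sketch I am adding, shown on simulated data with a plain lm() fit; vglm coefficient names differ slightly):

```r
# Sketch: Y_hat = intercept + b1*var1 + b2*var2 with extracted coefficients.
set.seed(123)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 2 + 0.5*d$x1 - 0.3*d$x2 + rnorm(100)
fit <- lm(y ~ x1 + x2, data = d)
b <- coef(fit)  # named vector: "(Intercept)", "x1", "x2"
yhat_manual <- b["(Intercept)"] + b["x1"]*d$x1 + b["x2"]*d$x2
all.equal(unname(yhat_manual), unname(fitted(fit)))  # TRUE: same values, full precision
```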

Finally, the document you sent has a problem: you have reduced Figure 7. I don't want that; otherwise I would rather delete it, because there are 9 graphs and all of them were already small. By reducing Figure 7, all 9 graphs are now tiny. I will try to rearrange them as you want, even if the inclusion of Figure 4 will be tedious, but I don't think it's reasonable to reduce the size of Figure 7. A graph is useless if you can't see with precision what's happening in it, so they must be reasonably big.