Exploring GSS with R

Exploring GSS with R

Sun 07 Sep 2014 08:41

Admin

MH and I are exploring GSS data in R. I will post the syntax here so others can explore it as well if they are interested.

Dataset: http://openpsych.net/datasets/GSSsubset.7z

library(foreign) #needed to load SPSS
library(plyr) #for easy recode
library(Hmisc) #for rcorr
library(psych) #for stuff
library(gplots) #for plotmeans

data = read.spss("GSSsubset.sav") #read SPSS file
DF = as.data.frame(data) #convert to DF

#fix wordsum
DF$wordsum[DF$wordsum==-1] = NA #recodes -1's as NA
DF$wordsum[DF$wordsum==99] = NA #recodes 99's as NA

#sex and race
describeBy(DF$wordsum,DF$sex) # descrip. stats by sex
describeBy(DF$wordsum,DF$race) # descrip. stats by race

#yearly changes
hist(DF$year) #histogram of years to see distribution
rcorr(DF$year, DF$wordsum) #year x wordsum cor
cor(DF$year, DF$wordsum, use="pairwise.complete.obs") #using cor fun with pairwise complete

year.mean = by(DF$wordsum, INDICES = DF$year, FUN = mean, na.rm=TRUE) #get mean by year, remove missing values
year.mean.matrix = as.matrix(year.mean) #convert to matrix for plotting
plot(year.mean.matrix) #plots the matrix, but it doesnt show the year properly

plotmeans(formula = wordsum ~ year, data=DF, n.label=F) #much easier way of plotting means by year,
#remove labels for sample size. No clear FLynn effect.

John Fuerst

Tue 09 Sep 2014 18:13

MH and I are exploring GSS data in R. I will post the syntax here so others can explore it as well if they are interested.

Dataset: http://openpsych.net/datasets/GSSsubset.7z

library(foreign) #needed to load SPSS library(plyr) #for easy recode library(Hmisc) #for rcorr library(psych) #for stuff library(gplots) #for plotmeans data = read.spss("GSSsubset.sav") #read SPSS file DF = as.data.frame(data) #convert to DF #fix wordsum DF$wordsum[DF$wordsum==-1] = NA #recodes -1's as NA DF$wordsum[DF$wordsum==99] = NA #recodes 99's as NA #sex and race describeBy(DF$wordsum,DF$sex) # descrip. stats by sex describeBy(DF$wordsum,DF$race) # descrip. stats by race #yearly changes hist(DF$year) #histogram of years to see distribution rcorr(DF$year, DF$wordsum) #year x wordsum cor cor(DF$year, DF$wordsum, use="pairwise.complete.obs") #using cor fun with pairwise complete year.mean = by(DF$wordsum, INDICES = DF$year, FUN = mean, na.rm=TRUE) #get mean by year, remove missing values year.mean.matrix = as.matrix(year.mean) #convert to matrix for plotting plot(year.mean.matrix) #plots the matrix, but it doesnt show the year properly plotmeans(formula = wordsum ~ year, data=DF, n.label=F) #much easier way of plotting means by year, #remove labels for sample size. No clear FLynn effect.

I was thinking that you might try to replicate your general SES factor results on the individual level and see what the IQ x g of SES correlation is on that level. Maybe you could start with GSS, since you're playing around with it.

Emil O. W. Kirkegaard

Tue 09 Sep 2014 18:59

Admin

I wanted to do that.

Anyway, FLynn effect is odd here. It is clearly seen in the subset samples by race. It is stronger in blacks than whites, so the BW is diminishing.

#divide DF manually by race
whites = subset(DF, DF$race=="WHITE") 
blacks = subset(DF, DF$race=="BLACK")

year.mean.whites = by(whites$wordsum, INDICES = whites$year, FUN = mean, na.rm=TRUE) #get mean by year, remove missing values
year.mean.blacks = by(blacks$wordsum, INDICES = blacks$year, FUN = mean, na.rm=TRUE) #get mean by year, remove missing values
year.mean.BW = year.mean.whites-year.mean.blacks #get yearly difference
year.mean.BW.vector = as.vector(year.mean.BW) #vector
years = as.numeric(dimnames(year.mean.BW.matrix)[[1]]) #list of years, as numbers

#alternative way of plotting
DF2 = as.data.frame(years) #make data.frame with years
DF2["year.mean.BW"] = as.vector(year.mean.BW) #add BW diffs
DF2["year.mean.whites"] = as.vector(year.mean.whites) #add whites
DF2["year.mean.blacks"] = as.vector(year.mean.blacks) #add blacks
scatterplot(year.mean.BW.vector ~ years, data=DF2, smoother=F) #plot difference over time
scatterplot(year.mean.whites ~ years, data=DF2, smoother=F) #plot whites only
scatterplot(year.mean.blacks ~ years, data=DF2, smoother=F) #plot blacks only

John Fuerst

Tue 09 Sep 2014 19:32

I wanted to do that.

Anyway, FLynn effect is odd here. It is clearly seen in the subset samples by race. It is stronger in blacks than whites, so the BW is diminishing.

In this sample. That's been noted before. Wordsum isn't a great measure of IQ, though. I noted elsewhere:

"Despite being frequently used, GSS WORDSUM isn’t the best measure of cognitive ability. As an alternative, I created a composite score based on GSS WORDSUM (a vocabulary test) and ALIKE (a similarity test). The similarity test was given only in 1994, so this created composite has only limited utility. Below shows the R-matrix for WORDSUM, COMPOSITE, ALIKESUM, EDUC, and INCOME. As it can be seen the composite, theoretically a better measure of g, has slightly better predictive validity than either WORDSUM or ALIKE alone, as one would expect."

The vocabulary (WORDSUM)-similarity (ALIKE) correlation was only 0.41. This is a good deal lower than the typical Vocabulary-Similarity WAIS association, likely because we are dealing with a mere 10 questions.

Emil O. W. Kirkegaard

Tue 09 Sep 2014 20:28

Admin

I looked closer at the data from the first year and the latest year, 1974 and 2012.

Simple desc. stats:

> describe(whites.1974$wordsum)
  vars    n mean   sd median trimmed  mad min max range  skew kurtosis   se
1    1 1277 6.21 2.19      6    6.24 1.48   0  10    10 -0.11    -0.39 0.06
> describe(whites.2012$wordsum)
  vars   n mean sd median trimmed  mad min max range  skew kurtosis   se
1    1 963 6.22  2      6    6.27 1.48   0  10    10 -0.27     0.07 0.06
> describe(blacks.1974$wordsum)
  vars   n mean   sd median trimmed  mad min max range  skew kurtosis   se
1    1 166 4.66 1.87      5    4.69 1.48   0   9     9 -0.12    -0.37 0.14
> describe(blacks.2012$wordsum)
  vars   n mean   sd median trimmed  mad min max range  skew kurtosis   se
1    1 199 5.07 1.92      5    5.11 1.48   0  10    10 -0.26     0.26 0.14

Here are the combined histograms within race. As one can see, there is clearly something strange going on with the whites. The number who got 10 right fall markedly. Also, note that the '74 and '12 white samples got approx. the same mean. It is odd that the means fluctuate so much given the large size of the white samples.

Second, there is clearly a floor effect on the black scores pushing the mean somewhat upward, and a ceiling effect on the white scores, pushing the mean somewhat down.

Shapiro tests reject every sample as normal with p values at least at 0.0015 (blacks, 1972, N=166).

> shapiro.test(whites.1974$wordsum) #test for normality

	Shapiro-Wilk normality test

data:  whites.1974$wordsum
W = 0.9667, p-value < 2.2e-16

> shapiro.test(blacks.1974$wordsum) #test for normality

	Shapiro-Wilk normality test

data:  blacks.1974$wordsum
W = 0.9708, p-value = 0.001408

> shapiro.test(whites.2012$wordsum) #test for normality

	Shapiro-Wilk normality test

data:  whites.2012$wordsum
W = 0.967, p-value = 5.981e-14

> shapiro.test(blacks.2012$wordsum) #test for normality

	Shapiro-Wilk normality test

data:  blacks.2012$wordsum
W = 0.9652, p-value = 7.799e-05

How does one get a floor/ceiling-bias free estimate from the data?

I tried using trimmed means, but that actually made things worse.

> mean(whites.1974$wordsum, na.rm=TRUE)-mean(blacks.1974$wordsum, na.rm=TRUE)
[1] 1.550108
> #10%
> mean(whites.1974$wordsum, trim=.1, na.rm=TRUE)-mean(blacks.1974$wordsum, trim=.1, na.rm=TRUE)
[1] 1.542529
> #20%
> mean(whites.1974$wordsum, trim=.2, na.rm=TRUE)-mean(blacks.1974$wordsum, trim=.2, na.rm=TRUE)
[1] 1.508175
> #30%
> mean(whites.1974$wordsum, trim=.3, na.rm=TRUE)-mean(blacks.1974$wordsum, trim=.3, na.rm=TRUE)
[1] 1.351761
> #trimmed means
> #0%
> mean(whites.2012$wordsum, na.rm=TRUE)-mean(blacks.2012$wordsum, na.rm=TRUE)
[1] 1.156896
> #10%
> mean(whites.2012$wordsum, trim=.1, na.rm=TRUE)-mean(blacks.2012$wordsum, trim=.1, na.rm=TRUE)
[1] 1.159275
> #20%
> mean(whites.2012$wordsum, trim=.2, na.rm=TRUE)-mean(blacks.2012$wordsum, trim=.2, na.rm=TRUE)
[1] 1.138555
> #30%
> mean(whites.2012$wordsum, trim=.3, na.rm=TRUE)-mean(blacks.2012$wordsum, trim=.3, na.rm=TRUE)
[1] 1.060293

Anyone know how to deal with data with ceilings and floors? Maybe get the max. difference using all the possible values of trim.

Someone suggested to me that "white" includes hispanics, so if the hispanic fraction of whites is growing, and hispanics has lower wordsum scores, then the relatively weaker Flynn-Lynn effect on whites seem explained. Does anyone know?

Meng Hu

Tue 09 Sep 2014 21:49

Admin

My article talked about it. There is indeed some dysgenic effect (however, I think you should better use "cohort" variable, and not "year"). When I used logistic regression (dep var: wordsum coded 0 and 1 for raw scores 0-7 and 8-10 respectively; indep var: black-white dichotomized variable, cohort dummy with 6 category, and interaction between the dichotomy and each of the cohort dummy), the decline by cohort over time is much stronger for whites than for blacks. Concerning the question of hispanics, this variable is only available for 2000 and after. That's a good question, I will look at it, and try to finish it, and submit it here. My articles take time, however, because i work on multiple things at the same time. It's bad habit of mine. It's difficult to get rid of it...

By the way Shapiro should not be recommended. It's biased toward highly significance when you get large samples. Better look at histogram + QQ plots. I always do that, and this is largely sufficient.

Emil O. W. Kirkegaard

Wed 10 Sep 2014 16:17

Admin

By the way Shapiro should not be recommended. It's biased toward highly significance when you get large samples. Better look at histogram + QQ plots. I always do that, and this is largely sufficient.

It is a not a bias, just a feature. All null hypothesis tests easily lead to statistically certain results with large samples. The reason to prefer a test over visual inspection is that tests give a specific number for the amount of violation of normality.

Meng Hu

Sat 06 Dec 2014 22:10

Admin

My article is almost finished. I need to add one thing, and I will upload the 3rd version. The thing I want to add is the R² measure for GLMM. And the syntax is given in the appendix of this article.

Article pdf
Appendix

As you can see, however, the syntax is for R. I hate R but (damn!) I have no other choice.

So, my syntax, using the subset of GSS data uploaded by Emil, reads :

Sys.setenv(LANG = "en")

install.packages('car') # recode() function
install.packages('lme4') # multilevel regression ('lme' does not work on R version 3.1.2)
install.packages('MuMIn') # r.squaredGLMM() function

require(car)
require(lme4)
require(MuMIn)

setwd("c:\\location of your file")
dd<-read.csv("GSSsubset.csv")
entiredata<-as.data.frame(dd)
d<- subset(entiredata, age<70)

d$bw<- rep(NA)  # generate variable with only missing values
d$bw[d$race==1] <- 1 # assign a value of "1" under specified condition
d$bw[d$race==2] <- 0 # assign a value of "0" under specified condition
d$bw0<- rep(NA)
d$bw0[d$year>=2000 & d$race==1 & d$hispanic==1] <- 1
d$bw0[d$year>=2000 & d$race==2 & d$hispanic==1] <- 0
d$bw00<- rep(NA)
d$bw00[d$year<2000 & d$race==1] <- 1
d$bw00[d$year<2000 & d$race==2] <- 0
d$bw1<-d$bw0 # combine the two vars, by incorporating the first var
d$bw1[is.na(d$bw0)]<-d$bw00[is.na(d$bw0)] # then the second, without NAs
d$wordsum[d$wordsum==-1] <- NA
d$wordsum[d$wordsum==99] <- NA
d$cohort[d$cohort==9999] <- NA

d$cohort21<-recode(d$cohort, "1905:1915='0'; 1916:1920='1'; 1921:1925='2'; 1926:1929='3'; 1930:1933='4'; 1934:1937='5'; 1938:1941='6'; 1942:1944='7'; 1945:1947='8'; 1948:1950='9'; 1951:1953='10'; 1954:1956='11'; 1957:1959='12'; 1960:1962='13'; 1963:1965='14'; 1966:1968='15'; 1969:1972='16'; 1973:1976='17'; 1977:1980='18'; 1981:1985='19'; 1986:1994='20'")
d$cohort21[d$cohort21>=21] <- NA

d$age9<-recode(d$age, "18:23='1'; 24:28='2'; 29:33='3'; 34:38='4'; 39:44='5'; 45:50='6'; 51:56='7'; 57:62='8'; 63:69='9'")
d$aged1<- as.numeric(d$age>=18 & d$age<=23)
d$aged2<- as.numeric(d$age>=24 & d$age<=28)
d$aged3<- as.numeric(d$age>=29 & d$age<=33)
d$aged4<- as.numeric(d$age>=34 & d$age<=38)
d$aged5<- as.numeric(d$age>=39 & d$age<=44)
d$aged6<- as.numeric(d$age>=45 & d$age<=50)
d$aged7<- as.numeric(d$age>=51 & d$age<=56)
d$aged8<- as.numeric(d$age>=57 & d$age<=62)
d$aged9<- as.numeric(d$age>=63 & d$age<=69)

d$bw1aged1<- d$bw1*d$aged1
d$bw1aged2<- d$bw1*d$aged2
d$bw1aged3<- d$bw1*d$aged3
d$bw1aged4<- d$bw1*d$aged4
d$bw1aged5<- d$bw1*d$aged5
d$bw1aged6<- d$bw1*d$aged6
d$bw1aged7<- d$bw1*d$aged7
d$bw1aged8<- d$bw1*d$aged8
d$bw1aged9<- d$bw1*d$aged9

modelri<- lmer(wordsum ~ bw1 + aged1 + aged2 + aged3 + aged4 + aged6 + aged7 + aged8 + aged9 + bw1aged1 + bw1aged2 + bw1aged3 + bw1aged4 + bw1aged6 + bw1aged7 + bw1aged8 + bw1aged9 + (1 | cohort21), data = d, REML=TRUE)
summary(modelri)
coefs<- data.frame(coef(summary(modelri)))
coefs$p.z <- 2 * (1 - pnorm(abs(coefs$t.value)))
coefs

X <- model.matrix(modelri)
n <- nrow(X)
Beta <- fixef(modelri)
Sf <- var(X %*% Beta)
(Sigma.list <- VarCorr(modelri))
Sl <- 
  sum(
    sapply(Sigma.list,
      function(Sigma)
      {
        Z <-X[,rownames(Sigma)]
        sum(diag(Z %*% Sigma %*% t(Z)))/n
      }))
Se <- attr(Sigma.list, "sc")^2
Sd <- 0
total.var <- Sf + Sl + Se + Sd
(Rsq.m <- Sf / total.var) 
(Rsq.c <- (Sf + Sl) / total.var) 

rm(X, n, Beta, Sf, Sigma.list, Sl, Se, Sd, total.var, Rsq.m, Rsq.c)

modelrs<- lmer(wordsum ~ bw1 + aged1 + aged2 + aged3 + aged4 + aged6 + aged7 + aged8 + aged9 + bw1aged1 + bw1aged2 + bw1aged3 + bw1aged4 + bw1aged6 + bw1aged7 + bw1aged8 + bw1aged9 + (bw1 | cohort21), data = d, REML=TRUE)
summary(modelrs)
coefs<- data.frame(coef(summary(modelrs)))
coefs$p.z <- 2 * (1 - pnorm(abs(coefs$t.value)))
coefs

X <- model.matrix(modelrs)
n <- nrow(X)
Beta <- fixef(modelrs)
Sf <- var(X %*% Beta)
Sigma.list <- VarCorr(modelrs)
Sl <- 
  sum(
    sapply(Sigma.list,
      function(Sigma)
      {
        Z <-X[,rownames(Sigma)]
        sum(diag(Z %*% Sigma %*% t(Z)))/n
      }))
Se <- attr(Sigma.list, "sc")^2
Sd <- 0
total.var <- Sf + Sl + Se + Sd
(Rsq.m <- Sf / total.var) 
(Rsq.c <- (Sf + Sl) / total.var) 

# compute the mixture Chi-Square test
chistat <- anova(modelri, modelrs)[2,"Chisq"] 
0.5 * pchisq(chistat, 1, lower.tail=FALSE) + 0.5 * pchisq(chistat, 2, lower.tail=FALSE)

My error message is :

> Sl <- 
+   sum(
+     sapply(Sigma.list,
+       function(Sigma)
+       {
+         Z <-X[,rownames(Sigma)]
+         sum(diag(Z %*% Sigma %*% t(Z)))/n
+       }))
Warning messages:
1: Reached total allocation of 6052Mb: see help(memory.size) 
2: Reached total allocation of 6052Mb: see help(memory.size) 
3: Reached total allocation of 6052Mb: see help(memory.size) 
4: Reached total allocation of 6052Mb: see help(memory.size) 
Error in diag(Z %*% Sigma %*% t(Z)) : 
  error in evaluating the argument 'x' in selecting a method for function 'diag': Error: cannot allocate vector of size 3.7 Gb
>

And i'm trying desperately to fix it, without any success so far. If I can't go through it, I will publish the article without the R² anyway.

Emil O. W. Kirkegaard

Sat 06 Dec 2014 22:55

Admin

As far as I can tell, it is because you have insufficient memory.

Why post in this thread instead of the submission thread?

Meng Hu

Sat 06 Dec 2014 23:18

Admin

It's more appropriate, because I need help with the syntax.

Concerning insufficient memory, I obviously tried this :

http://r.789695.n4.nabble.com/Memory-allocation-in-64-bit-R-td2952092.html

memory.limit(16000)

But ...

> Sl <- 
+   sum(
+     sapply(Sigma.list,
+       function(Sigma)
+       {
+         Z <-X[,rownames(Sigma)]
+         sum(diag(Z %*% Sigma %*% t(Z)))/n
+       }))
Error: cannot allocate vector of size 3.7 Gb

Emil O. W. Kirkegaard

Sun 07 Dec 2014 10:47

Admin

You don't have 3.7 GB free memory. I will try running your code. I have 16GB of RAM.

Meng Hu

Sun 07 Dec 2014 13:42

Admin

I hope it's not a problem of not having enough RAM. I don't want to buy another computer just for that. My computer, if my memory is correct, has 6.0 GB of RAM. And it's not an old PC (2 years or so).

Emil O. W. Kirkegaard

Sun 07 Dec 2014 13:50

Admin

Are you using 32-bit?

Meng Hu

Sun 07 Dec 2014 15:11

Admin

64-bit, I'm sure of it.

Meng Hu

Mon 08 Dec 2014 13:52

Admin

I emailed Paul Johnson, the author of that article.

Dear Meng Hu,

The problem is that this row

sum(diag(Z %*% Sigma %*% t(Z)))/n

is very wasteful of memory so causes problems with large data sets, which I never noticed before the article and code were published. Replacing it with

sum(rowSums((Z %*% Sigma) * Z))/n

does the same calculation without wasting so much memory. The problem and solution are explained in more detail below. Kamil Barton fixed it in MuMIn a while ago but there is no facility for appending a solution to the code on the journal website (I’ll put it on my own website when I have time).

Hope this works.

Best wishes,
Paul

Thank you a lot (and for the sharing of the other mail as well).
It works properly now.

Emil O. W. Kirkegaard

Mon 08 Dec 2014 14:32

Admin

So it was a memory problem. :)