R - open discussion and help
Admin
Meng Hu has been trying to learn R, but instead of reading a textbook (I read two), he kept sending me questions in emails. Finally he proposed that we post it here in case others also need help. :)

"Btw, you are using the old version of the merger.R file. The new one does not include the libraries that gave you trouble."

You're right. It's my fault for not downloading the 2nd version; now it finally works.

> write.meta()
Error: could not find function "write.meta"


You forgot to run source("merger.R") first, which loads the write.mega() function.

Perhaps you can open a thread for explaining this basic stuff. I'm no good at R when it comes to preparing data. For example, I don't understand the two # comments here:

read.mega = function(filename){
  return(read.csv(filename, sep=";", row.names=1,  #this loads the rownames
                  stringsAsFactors=FALSE,          #FACTORS SUCK AVOID LIKE THE PLAGUE
                  check.names=FALSE))              #avoid prepending X to columns
}


R has a special object type for nominal data called a factor. When you load a file, R attempts to auto-detect strings that are factors. This caused severe problems when working with the megadataset, due to some technical point I won't go into here. The short story is that I spent six hours trying to find the source of the error, and it was that default setting (the default is stringsAsFactors=TRUE).
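A minimal sketch of the difference, using toy data:

```r
# With stringsAsFactors = TRUE, strings are auto-converted to factors
df1 <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)
class(df1$x)  # "factor"

# With stringsAsFactors = FALSE, they stay plain character vectors
df2 <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)
class(df2$x)  # "character"
```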

check.names=TRUE makes R add an X to variable names that begin with a number. I thought it was silly, but then I discovered that many functions in R don't work if a variable name begins with a number. So in the 3rd version of merger.R I removed that part.
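A small sketch of what check.names does, reading from an inline string instead of a file:

```r
csv <- "2000;2001\n10;20"

# Default: check.names = TRUE, so X is prepended to numeric names
df1 <- read.csv(text = csv, sep = ";")
names(df1)  # "X2000" "X2001"

# check.names = FALSE keeps the names, but you then need backticks: df2$`2000`
df2 <- read.csv(text = csv, sep = ";", check.names = FALSE)
names(df2)  # "2000" "2001"
```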

I will try to work with the Rcmdr package next time. I know from having tried it that I didn't understand what to do with it. I may try RStudio later. I have noticed in several YouTube videos that users usually work through RStudio.


You mean R Commander? It's a GUI provided by the Rcmdr package. RStudio is an IDE, like those for other programming languages. I always use RStudio. Great IDE. :)

The reason I need menus is that they ensure you're not making mistakes in your coding, and the data window shows you what your columns look like. When you use menus, the output is displayed along with the syntax. That's how I learned to create variables, display stats with a conditional "do if", or run the same analysis repeatedly by groups. See attachment.


As I mentioned, RStudio lets you view your data objects.

I don't understand why they had to make R so complicated. In Stata/SPSS, when you create variables with missing values, you don't need to do stuff like:

use="complete"
na.rm=TRUE

It is not normal that R forces you to learn so much; it is extremely error-prone. And the fact that authors who work with R do not show their syntax does not make things easier. This is irritating. R is free software, so it would be very helpful if they showed us the syntax.


Many authors show their code. When I blog research, I post the code too so others can replicate my analyses. Many bloggers do that, and many researchers attach syntax as supplementary material. This should be mandatory for anyone using syntax-based software, to ensure that analyses are replicable.

R is somewhat silly in handling missing data. Some functions automatically ignore missing values, some require you to add na.rm=TRUE (remove missing), and others want you to specify exactly how, e.g. use="pairwise.complete.obs" for cor() when you want a correlation matrix.
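A toy illustration of the inconsistency:

```r
x <- c(1, 2, NA, 4)
mean(x)                # NA -- missing values propagate by default
mean(x, na.rm = TRUE)  # 2.333333 -- must opt out explicitly

m <- cbind(a = c(1, 2, NA, 4), b = c(2, 4, 6, 8))
cor(m)                                 # matrix contains NA
cor(m, use = "pairwise.complete.obs")  # NAs dropped pairwise instead
```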

It is true that R allows weights. But it is not really true that you can work with them everywhere. As far as I know, there is one package (survey) you can use to create an object that will weight your analysis:
http://r-survey.r-forge.r-project.org/survey/html/svydesign.html

However, this object only works within the survey package's own functions.

In other words, you usually can't do linear regression, correlation, tobit, etc. with it. You have to use the corresponding functions within the survey package, and in that package I don't see regression, logistic regression, tobit, SEM, factor analysis, etc.

Even if you can create the following object:

dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)

You cannot do this :

model<-lm(wordsum ~ race + cohort + race:cohort, data=GSSsubset, dstrat)

This is bothersome. It's just like AMOS: in that situation, the only thing you can do in R is use an input data matrix instead of raw data. With Stata, SAS, or SPSS, you generate your correlation/covariance matrix using the sampling weights, and then switch to R. But in that case, I'd rather do the entire analysis in Stata/SPSS.
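For the matrix-input route, base R can in fact produce a weighted covariance/correlation matrix itself via cov.wt(), so the Stata/SPSS detour may not be necessary. A sketch with made-up data and arbitrary weights:

```r
set.seed(1)
m <- cbind(x = rnorm(30), y = rnorm(30))
w <- runif(30)  # made-up sampling weights

# Weighted covariance and correlation matrix from base R's stats package
cw <- cov.wt(m, wt = w, cor = TRUE)
cw$cor  # usable as input to matrix-based analyses (e.g. factor analysis)
```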


I think there are some more functions for working with weighted data. Some functions support it natively; for instance lm() (see its help page) supports weights natively. This means that you can do any linear regression, including simple correlations, with weights.
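For instance, a sketch with simulated data (the weights here are arbitrary):

```r
set.seed(42)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)
d$w <- runif(100)  # arbitrary case weights

# Weighted least squares via the weights argument of lm()
fit <- lm(y ~ x, data = d, weights = w)
coef(fit)
```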

There is another package called weights that has a few more generic functions, such as a weighted t-test.

-> "Read the help file for read.csv() if you need to know how to read semi-colon separated files."

Ok, but I asked the question because I wanted to work with AMOS. I think SEM software should provide graphical output of the results, which AMOS does but R does not (at least not the lavaan package). Furthermore, as I think R is error-prone, I prefer to replicate my analysis in another software.


No graphical output as far as I am aware. Maybe there is a package for that. I haven't used SEM extensively, so I don't know.

Concerning the relevant question, the syntax does not display everything I wanted, notably the partial correlations, but I managed to get them by using this:

partial.r(DF.C.PISA.IQ, c(1:18),19)

There are indeed numbers outside the normal range [-1, +1], even a "correlation" of -2.16. As you know, I hate R for lots of stuff. For example, the syntax above partials out var 19, which should be the 19th in your list. The name is not displayed, so it's not easy. What is var 19, then? LV2012IQ?


Yes. Just use colnames(DF.C.PISA.IQ) to get the variable names.
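For example, with a hypothetical small data frame standing in for DF.C.PISA.IQ:

```r
df <- data.frame(a = 1:2, b = 3:4, c = 5:6)
colnames(df)     # all variable names
colnames(df)[3]  # name of the 3rd variable: "c"
```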

By the way, why did you use

lm = lm(formula,DF.C.PISA.IQ.Z)

When I worked with regression before, I sometimes standardized my vars before running the regression, but I don't think it's a good idea, and anyway the regression provides both unstandardized and standardized coefficients. So why use z-scores, then?


Joost de Winter asked me to do it that way.

lm() does not provide standardized coefficients by default (very silly). However, you can get them using lm.beta() from the QuantPsyc package. The standardized betas were identical to the betas obtained by standardizing the variables before running the regression. However, I have seen a case where this was not true (results slightly different).
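The equivalence is easy to check without QuantPsyc by standardizing with scale() (simulated data; in the one-predictor case the standardized slope equals the Pearson correlation):

```r
set.seed(1)
d <- data.frame(x = rnorm(50))
d$y <- 3 * d$x + rnorm(50)

fit_raw <- lm(y ~ x, data = d)                # unstandardized coefficients
fit_z   <- lm(scale(y) ~ scale(x), data = d)  # variables standardized beforehand

coef(fit_z)[2]  # standardized beta
cor(d$x, d$y)   # identical in simple regression
```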
To be sure, what Emil and I were talking about relates to this:

https://osf.io/nauc8/
(R syntax)
https://osf.io/74cew/
(dataset)

I have examined several books you recommended and some others I found, such as:

Lam, L. (2010). An Introduction to R.
Bivand, R. S., Pebesma, E. J., & Gómez-Rubio, V. (2008). Applied Spatial Data Analysis with R. New York: Springer.
Spector, P. (2008). Data Manipulation with R. Springer.
Vinod, H. D. (2009). Advances in Social Science Research Using R. Springer.
Maindonald, J., & Braun, W. J. (2010). Data Analysis and Graphics Using R: An Example-Based Approach. Cambridge University Press.
Navarro, D. Learning Statistics with R: A Tutorial for Psychology Students and Other Beginners (Version 0.4).
Knoblauch, K., & Maloney, L. T. (2012). Modeling Psychophysical Data in R. London: Springer.
Hothorn, T., & Everitt, B. S. (2009). A Handbook of Statistical Analyses Using R. CRC Press.
Verzani, J. (2014). Using R for Introductory Statistics. CRC Press.

But I didn't understand most of what they say. I gave up many times. I learned new functions, but not how to avoid the many warning messages I got. As I said, R is quite complex. For me, a better presentation would be one that compares R with another stats software. I can recommend the following links.

Getting Started in Linear Regression using R (with some examples in Stata)
Getting Started in R~Stata Notes on Exploring Data
Translation Syntax (SPSS, Stata, SAS and R)

For the write.meta(), I don't know what's happening, because I ran your syntax from the beginning, including this:

library(psych)
library(QuantPsyc)
library(Hmisc)
source("merger.R")

DF.mega = read.mega("Megadataset_v1.7.csv")


Concerning weights, you're right. I saw that it's also possible with vglm(), so tobit + weights is possible.

I agree that bloggers usually show how to use R, but it's not true of researchers who publish their papers in most journals. I have seen many times that they used R, but no syntax is provided.

Another odd thing I found in R is that in tobit regression you can compute the residuals. As I said in my article on GSS wordsum, you can't get residuals the same way you do in OLS; a special procedure must be applied, which is why Stata does not give you the residuals. So if R were intelligent, it should not have allowed the computation of residuals.

There was also one problem I ran into very often: the warning message saying "x and y have different length". It can be fixed this way:

model5a<- censReg(wordsum ~ bw1 + yeardummy2 + yeardummy3 + yeardummy4 + bwY2 + bwY3 + bwY4, right=10, data=d)
summary(model5a)
model5<- vglm(wordsum ~ bw1 + yeardummy2 + yeardummy3 + yeardummy4 + bwY2 + bwY3 + bwY4, tobit(Upper=10), data=d, na.action=na.exclude)
summary(model5)
pred5<-fitted(model5)
plot(cbind(d$year4, pred5, group="d$bw1"))

Notice the nasty thing: if you don't set na.action=na.exclude, your plot will fail.
> plot(cbind(d$yeard4, pred5, group="d$bw1"))
Warning message:
In cbind(d$yeard4, pred5, group = "d$bw1") :
number of rows of result is not a multiple of vector length (arg 1)
>
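The difference between the default na.omit and na.exclude can be seen on toy data: na.exclude pads the fitted values back to the original length, which is what keeps cbind() with the full-length data happy.

```r
d <- data.frame(x = c(1, 2, NA, 4), y = c(2, 4, 6, 8))

fit1 <- lm(y ~ x, data = d)  # default: na.omit drops the incomplete row
length(fitted(fit1))         # 3 -- shorter than the data

fit2 <- lm(y ~ x, data = d, na.action = na.exclude)
length(fitted(fit2))         # 4 -- NA is inserted at the missing row
```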


Don't ask me what the error is, I have no idea. And if you don't insert cbind() into the plot() function, your plot will fail.

> plot(d$year4, pred5, group="d$bw1")
Warning messages:
1: In plot.window(...) : "group" is not a graphical parameter
2: In plot.xy(xy, type, ...) : "group" is not a graphical parameter
3: In axis(side = side, at = at, labels = labels, ...) :
"group" is not a graphical parameter
4: In axis(side = side, at = at, labels = labels, ...) :
"group" is not a graphical parameter
5: In box(...) : "group" is not a graphical parameter
6: In title(...) : "group" is not a graphical parameter
>


Again, no idea what it means. R is so fu**** up.

Of course, with R, you always need to refer to the vars through the data frame. If you do:

dd<-read.csv("GSSsubset.csv")
entiredata<-as.data.frame(dd)
d<- subset(entiredata, age<70)

Then every var must be written d$var1, d$var2, etc. Annoying...
Admin
But I didn't understand most of what they say. I gave up many times. I learned new functions, but not how to avoid the many warning messages I got. As I said, R is quite complex. For me, a better presentation would be one that compares R with another stats software. I can recommend the following links.


Did you jump around in them? Read them from the beginning. Learning statistics with R: A tutorial for psychology students and other beginners starts from the very basics, so everybody can learn R from it.

For the write.meta(), I don't know what's happening, because I ran your syntax from the beginning, including this:


Well, I can't tell because you didn't copy the errors in here. For the syntax you quote to work, you must have the merger.R and Megadataset_v1.7.csv files in the R working directory.

Don't ask me what the error is, I have no idea. And if you don't insert cbind() into the plot() function, your plot will fail.


I think it is very strange to create your dataset within the plot function. I always do it on separate lines.

-

I have no idea what you are trying to plot, so I can't help you.

-

R likes to have variables within data frames. They can be accessed in multiple ways: 1) by index number, dataframe[1]; 2) by name, dataframe["name"]. Both can be useful. The name method can be used to select the variable even if you don't know which index it has, or if you keep adding variables to a data frame so that their indices change (such as in the megadataset). The index method is good for quick and dirty solutions, e.g. dataframe[1:10] gets the first 10 variables.
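A quick sketch of the access methods:

```r
d <- data.frame(a = 1:3, b = letters[1:3], c = 4:6, stringsAsFactors = FALSE)

d[1]    # first column, by index, returned as a data frame
d["b"]  # column "b", by name, as a data frame
d$b     # column "b" as a plain vector
d[1:2]  # the first two columns
```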