The interpretation of multiple regression results
Admin
In another thread, Meng Hu referred to his recent post on multiple regression:

http://humanvarieties.org/2014/06/07/multiple-regression-multiple-fallacies/

Thus, in classical regression models, we are left with a big dark hole. That is, the uncertainty about the indirect effects. Then, if the effect of x1 is not moderated the least by any other independent var., e.g., x2 and/or x3, we can safely make the conclusion that the total effect of x1 is only composed of direct (i.e., independent) effect. If, on the contrary, x1 is moderated by x2 and/or x3, then there are 4 possibilities :

A. x1 causes x2.
B. x2 causes x1.
C. each variable causes the other (but not necessarily to the same extent).
D. x1 and x2 are caused by another, omitted variable.


There is another option, namely that the r (A x B) is due to a chance happening.

The more I learn about stats, the more disappointed I am about most research papers. This is why, in situation like this, I hoped I would stayed a complete ignorant. Too late. My mental health has been seriously endangered now. But there is worse to my mind. My expectation is that even the statisticians are not necessarily aware of this problem. If they were, it is impossible they remain silent. No, they will voice against the misuses of regressions. And we won’t see such fallacies today. And yet…


The only solution is to 'become' a researcher and do things better yourself. "Be the change you wish to see" -Unknown.
hmm ... "chance happening" ? Not sure I grasp the idea. I have the feeling you meant "sampling error".

That said, the key point to keep in mind is summarized in this paragraph:

Consider the case of a variable x1 that has a strong effect on Y but all of its effect is indirect (given the set of other predictors). Then, its regression coefficient will be necessarily zero. On the other hand, x2 can have a small total effect on Y but all of its effect is direct (given the set of other predictors). Then, its regression coefficient will be non-zero. Thus, it follows that the strength of the regression coefficient depends solely on the strength of the direct effect of the independent variables.


When you read this, you understand easily what's wrong with multiple regression.
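To make the quoted paragraph concrete, here is a minimal simulation sketch. It assumes a made-up causal chain x1 -> x2 -> y in which x1 has no direct path to y; the path values (0.8 and 0.7), the sample size and the seed are arbitrary choices for illustration, not anything taken from the blog post. With two predictors, the standardized coefficients can be computed directly from the three correlations.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def std_betas(r_y1, r_y2, r_12):
    """Standardized coefficients of y ~ x1 + x2, from the three correlations."""
    det = 1.0 - r_12 ** 2
    return (r_y1 - r_y2 * r_12) / det, (r_y2 - r_y1 * r_12) / det

rng = random.Random(0)  # arbitrary seed, for reproducibility only
n = 20000               # arbitrary sample size

# Hypothetical causal chain: x1 -> x2 -> y, with NO direct x1 -> y path.
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * v + rng.gauss(0, 0.6) for v in x1]
y = [0.7 * v + rng.gauss(0, math.sqrt(0.51)) for v in x2]

r_y1, r_y2, r_12 = pearson(x1, y), pearson(x2, y), pearson(x1, x2)
beta1, beta2 = std_betas(r_y1, r_y2, r_12)

# In the population, r(x1, y) = 0.8 * 0.7 = 0.56, yet beta1 = 0.
print(f"r(x1, y) = {r_y1:.3f}   beta1 = {beta1:.3f}   beta2 = {beta2:.3f}")
```

In this setup x1 is a substantial cause of y (everything x2 does is inherited partly from x1), but its regression coefficient sits near zero because all of its effect runs through x2.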

Besides, I am now emailing several authors and linking them to the cross-post on my personal blog. I do that because there are people actually reading the article on Human Varieties. Since I lost my old blog, I have lost all my readers, and the number of daily views on the new blog is close to zero (whereas it was 200-250 views per day on the old one). So I can easily detect whether they have clicked on the link I sent them.

If you come to my blog and see the post "multiple regression multiple fallacies", I ask you NOT TO TOUCH that link. I want to test a prediction. The authors I am emailing will not

1) even care to respond to me
2) even touch the link

In other words, in the following days and weeks I expect the number of "views" to be extremely low, near zero.

If it turns out to be true, then it's ugly. That means they are not going to have a look at it. That means they are close-minded, stubborn. For the moment, the number of views is still zero. So, happily, my prediction holds. For the moment at least.

Also, Emil, if you remember any authors using multiple regression and saying things like "x1 is stronger than x2, thus x1 is the best predictor", can you show me those papers? I will email the authors.
Admin
I read the post, so there should be at least 1 view. Robots from search engines will also read it and boost the views count (unless they are excluded).

I think John and I wrote something similar in our papers on immigration in Denmark and Norway. There is one sense in which such a statement is true, though: in those studies, the best predictors among those we tried were such and such. No claim is made about the predictors we did not try.

I think emailing authors like this will likely not work. Not just because they are stubborn or close-minded, but because they are very busy and must choose where to spend their time (opportunity cost).

Statisticians have been saying the things you said for decades. There are journals dedicated to these things even. But most scholars do not read statistical journals, so there is little effect of such criticism. Think about how many years many scholars have been making the sociologist's fallacy? It is similar to what you describe.
I read the post, so there should be at least 1 view.


Impossible. In my stats, I have 1 view, and it's from Italy. I know who read it, since I received the following email from Vittorio Daniele:

Many thanks for your very useful article on the possible regression pitfalls and for your interesting blog.
Kind regards,
VD


Sadly, my prediction has already been falsified !

I think emailing authors like this will likely not work. Not just because they are stubborn or close-minded, but because they are very busy and must choose where to spend their time (opportunity cost).


I begin the mail with the title "technical problem with [name of their paper]". If this title does not even get their attention, I'm sorry, they are just despicable.

Statisticians have been saying the things you said for decades.


I would like to see that, because I don't remember having read it. In general, the textbooks I have express the very same idea I've criticized. I'm serious when I say I doubt the statisticians...

Think about how many years many scholars have been making the sociologist's fallacy?


It's something quite different here. In social science, they include several measures of personality among the predictor variables, and yet they conclude that one domain is more important than another just because its regression coefficient is larger. Or they have different measures of SES, e.g., income vs. education, and they conclude that education is stronger than income because of the coefficient. This kind of stuff...
I do not expect anyone to agree with me. It's always safest to side with the majority. Well... if you look at the new paragraph and the 2 new graphs in my post, you will see I have definitely settled the debate. If anyone, any expert, is tempted to use arguments from authority against me, he can still try.

Regardless, now that I have established the proof, you should not expect me to approve any paper that makes statements like "X1 has a larger coefficient than X2, so X1 is more important". If I approved, I would be dishonest with myself. I would like to avoid that.
Admin
The solution is just to have authors avoid over-interpreting the results. If X1 comes out ahead of X2 and X3 in an MR, then it can be said to be the strongest predictor in that analysis. This speaks, of course, only to its predictive ability in that specific analysis alongside X2 and X3. To compare the total predictive ability of different variables, one should calculate their zero-order correlations with the dependent variable, I should think.
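As a small worked example of reporting both quantities side by side, the sketch below takes three correlations and returns the standardized MR coefficients for the two-predictor case (two predictors rather than three, to keep the algebra in closed form). The correlation values are hypothetical numbers picked for illustration, not results from any study.

```python
def std_betas(r_y1, r_y2, r_12):
    """Standardized coefficients of y ~ x1 + x2, from the three correlations."""
    det = 1.0 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / det
    beta2 = (r_y2 - r_y1 * r_12) / det
    return beta1, beta2

# Hypothetical zero-order correlations (illustrative values only).
r_y1 = 0.60  # x1 with the dependent variable
r_y2 = 0.66  # x2 with the dependent variable
r_12 = 0.80  # x1 with x2

beta1, beta2 = std_betas(r_y1, r_y2, r_12)

print(f"zero-order r:      x1 = {r_y1:.2f}, x2 = {r_y2:.2f}")
print(f"standardized beta: x1 = {beta1:.2f}, x2 = {beta2:.2f}")
```

With these inputs the zero-order correlations are close (0.60 vs 0.66) while the betas are far apart (0.20 vs 0.50), so a reader shown only the coefficients would get a much more lopsided picture than one shown both.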
It's easy to request the zero-order correlations in MR in either SPSS or Stata. But they will never tell you the magnitude of the total effect of each predictor. How could they, after all, if even SEM cannot do that?

There are certain circumstances where you can more or less guess which one is under-estimated. For example, when you enter income and education, you have to ask which one causes the other. Can (your) income cause (your) education? Surely not. Can (your) education cause (your) income? Probably a lot of people will say yes.

But my finding now tells me that it's probably safer to avoid using a ton of independent variables, because the more you have, the more indirect paths you need to disentangle. It is best to factor analyze the independent variables that were strongly correlated to begin with (e.g., at 0.40-0.50 or more). This helps to reduce the number of independent variables.
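A rough sketch of that reduction step, with loud caveats: it does not run a real factor analysis. It uses a unit-weighted composite (the mean of z-scores) as a simple stand-in for factor scores, and the data are simulated from a hypothetical latent SES variable. The names `income`, `education` and `latent`, the 0.7 loadings and the 0.40 cutoff are all assumptions made for illustration.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def zscores(values):
    """Standardize a sequence to mean 0 and standard deviation 1."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((u - m) ** 2 for u in values) / n)
    return [(u - m) / sd for u in values]

rng = random.Random(1)  # arbitrary seed
n = 5000

# Hypothetical data: two noisy indicators of one latent variable.
latent = [rng.gauss(0, 1) for _ in range(n)]
noise_sd = math.sqrt(1 - 0.7 ** 2)
income = [0.7 * s + rng.gauss(0, noise_sd) for s in latent]
education = [0.7 * s + rng.gauss(0, noise_sd) for s in latent]

THRESHOLD = 0.40  # lower end of the 0.40-0.50 range mentioned above
r_ie = pearson(income, education)

composite = None
if r_ie >= THRESHOLD:
    # Unit-weighted composite: a crude substitute for factor scores.
    composite = [(a + b) / 2 for a, b in zip(zscores(income), zscores(education))]

print(f"r(income, education) = {r_ie:.3f}")
if composite is not None:
    print(f"r(income, latent)    = {pearson(income, latent):.3f}")
    print(f"r(composite, latent) = {pearson(composite, latent):.3f}")
```

The composite then enters the regression as one predictor in place of the two correlated ones. Whether a single-factor summary is appropriate depends on the variables; when one predictor plausibly causes the other, merging them hides that path rather than estimating it.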
I remember that in Bias in Mental Testing, Jensen said the following :

But the interpretation of such partial correlations is very tricky. They are easily misleading. The high partial correlation of education and occupation, for example, would seem to imply that almost anyone, given the necessary amount of education, could attain the corresponding occupational status more or less regardless of his or her IQ, as the partial correlation of IQ and occupation is quite low. But this would be a false inference, because not everyone can attain the educational thresholds required by the higher occupations. Holding IQ constant statistically, as a partial correlation, only means that, among those whose IQs are above the threshold required for any given occupation, educational attainment then becomes the chief determinant of occupational level. The low partial correlation between IQ and occupation does not contradict the importance of the threshold property of IQ in relation to occupational status. If the true relationship between IQ and occupation were as low as the partial correlations would seem to suggest, we should find every level of IQ in every type of occupation. But of course this is far from true, even in occupations to which entry involves little or no formal education. Moreover, not all high-IQ persons choose to enter the professions or other high-status occupations, but those who do so work to attain the required educational levels; and hence educational level is more highly correlated with occupational level than is IQ per se.


When I read that paragraph (2 or 3 months ago, perhaps), I thought about what had caused the independent variables to be equal in the first place. By equating them, we must also remove some of the causal factors responsible for that equality. As Jensen implied, this operation hides the portion of IQ's effect that was necessary to reach the threshold, assuming of course that IQ had any causal role in this.

Two key things helped me understand the problem with most applications of MR. The first was my beginnings with SEM, when I understood that MR and SEM are the same thing (i.e., regression modeling, the latter being just more sophisticated) and that SEM gives the indirect paths that MR does not give. The second insight came from the above passage. When I combined the two, that's how I discovered the problem.
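The link between the two can be sketched with plain least squares. Assuming a hypothetical causal ordering x -> m -> y (with x also affecting y directly), the coefficient of x in the regression of y on x and m is the direct effect, and the total effect is the direct effect plus the product of the two indirect path coefficients. The variable names, path values and seed below are made up for illustration.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length sequences (divisor n)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / n

rng = random.Random(2)  # arbitrary seed
n = 5000

# Hypothetical model: x -> m (0.5), m -> y (0.6), x -> y directly (0.2).
x = [rng.gauss(0, 1) for _ in range(n)]
m = [0.5 * v + rng.gauss(0, 1) for v in x]
y = [0.2 * u + 0.6 * w + rng.gauss(0, 1) for u, w in zip(x, m)]

s_xx, s_mm, s_xm = cov(x, x), cov(m, m), cov(x, m)
s_xy, s_my = cov(x, y), cov(m, y)

# Simple regressions.
c_total = s_xy / s_xx  # slope of y ~ x
a = s_xm / s_xx        # slope of m ~ x

# Multiple regression y ~ x + m (two-predictor closed form).
det = s_xx * s_mm - s_xm ** 2
c_direct = (s_xy * s_mm - s_my * s_xm) / det  # coefficient of x
b = (s_my * s_xx - s_xy * s_xm) / det         # coefficient of m

indirect = a * b
print(f"direct (MR coefficient of x) = {c_direct:.3f}")
print(f"indirect (a * b)             = {indirect:.3f}")
print(f"total (y ~ x alone)          = {c_total:.3f}")
```

The identity total = direct + indirect holds exactly for least-squares estimates; what the model cannot tell you is whether the arrow really runs from x to m. If the true ordering is reversed or a third variable drives both, the same numbers come out but the "indirect effect" label is wrong.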

Anyway, I don't think Jensen necessarily understands the problem of MR, given what I remember he has written about MR in The g Factor (chapter 9).
Admin
http://www.pnas.org/content/early/2011/01/20/1010076108.abstract

This paper seems to make the mistake MH talks about. It looks like they used MR (and other types of regression, logistic and Poisson) to compare three predictors. They find that each one works when controlling for the others (presumably meaning that they entered all three into the MR). They don't notice that the effects of g can go through self-control. This is hinted at by the fact that self-control and IQ correlated .44, and the variables used to form their self-control measure are known to be g-loaded.
Hmm... This makes me remember something.

It has sometimes been said that longitudinal studies differ from cross-sectional studies in the "implied" causal relationship. Perhaps that's wrong. Imagine they use multiple regression. In cross-sectional data, the only thing estimated is the direct effect, even though researchers interpret it as the total effect. With longitudinal repeated measures, you can get both the direct and the indirect effects. So imagine that in MR the effect of the predictor of interest is zero, while in the longitudinal analysis the direct effect is also zero but the indirect effect is strong. Then it is no longer true that cross-sectional analyses must necessarily differ from longitudinal ones.