[OQSPS] Inequality across US counties: an S factor analysis
Admin
Journal:
Open Quantitative Sociology and Political Science.

Authors:
Emil O. W. Kirkegaard

Title:
Inequality across US counties: an S factor analysis

Abstract:
A dataset of socioeconomic, demographic and geographic data for US counties (N≈3,100) was created by merging data from several sources. A suitable subset of 31 socioeconomic indicators was chosen for analysis, of which 3 were excluded for being redundant with other variables. Factor analysis revealed a clear general socioeconomic factor (S factor) which was stable across extraction methods and different samples of indicators (absolute split-half sampling reliability = .85).

Self-identified race/ethnicity (SIRE) population percentages were strongly but non-linearly related to S. In general, the effects of White% and Asian% were positive while those of Black%, Hispanic% and Amerindian% were negative, and the effect was unclear for Other/mixed%. The best model consisted of White%, Black%, Asian% and Amerindian% and explained about 50% of the variance in S among counties.

SIRE homogeneity had a non-linear relationship to S both with and without taking into account the effects of SIRE variables. Overall, the effect was slightly negative due to low S, high White% areas.

An analysis of the SIRE composition of the top 100 counties showed that Whites and Asians were overrepresented (73.3% and 8.8% in top 100 vs. 64.8% and 4.5% in the total population for Whites and Asians, respectively). Then, a prediction about the expected proportions based on a cognitive meritocratic model was made and compared to the real numbers. The results showed that Blacks and Hispanics were overrepresented by large amounts (53% and 36%, respectively) while Whites and Amerindians were underrepresented (-11% and -16%, respectively). Some possible explanations of this pattern were offered.

Geospatial (latitude and longitude, elevation) and climatological (temperature, precipitation) predictors were tested in models. In linear regression, they added little to the variance explained (delta adjusted R2 = .05). However, there was evidence of non-linear relationships. When a model was fitted that allowed for non-linear effects of the environmental predictors, they were able to add a moderate amount of validity (delta adjusted R2 = .1). LASSO regression, however, suggested that much of this predictive validity was due to overfitting.

Spatial patterns in the data were examined using multiple methods, all of which indicated strong spatial autocorrelation for S and SIRE (k nearest spatial neighbor regression correlations [KNSNR] of .69 to .88). Model residuals were also spatially autocorrelated and for this reason the model was re-fit controlling for spatial autocorrelation using KNSNR-based S residuals and spatial local regression. The results indicated that the effects of SIREs were not due to spatially autocorrelated confounds except possibly for Black% which was about 40% weaker after the controls.

Pseudo-multilevel analyses of both the factor structure of S and the SIRE predictive model showed results consistent with the main analyses. Specifically, the factor structure was similar across levels of analysis (states and counties) and within states.

Key words:
general socioeconomic factor, S factor, Japan, prefectures, inequality, intelligence, IQ, cognitive ability, cognitive sociology

Length:
26 pages, 8288 words, 51000 characters (excluding references).

Files:
https://osf.io/cknjr/

External reviewers:
I have no particular recommendations for external reviewers. I am open to suggestions.

Known issues:
- The tables lack borders and are not centered. I have not fixed this yet because I want to go over the numbers one more time. That may mean repasting all the tables, so finalizing their form now would waste my time. They should be readable enough for peer review as they are.
The paper "Inequality across US counties: an S factor analysis" is competently done, but I have a few notes and comments in the attached file and below, that might help improve the paper.

1. The version of the paper that I reviewed did not have an introduction, so the introduction for the final paper will hopefully indicate what past research in this area has found and discuss the importance of and/or the need for the present research.

2. The Discussion and Conclusion might have more value if that section highlighted what was learned from the new analyses and/or what previous findings were corroborated. The section mentions results from the paper, but a reader unfamiliar with the literature might not appreciate what about the paper was new.

3. The paper is dense in terms of data presentation. It might be worth considering focusing the main paper on fewer figures and/or tables, and placing some figures and/or tables in an appendix, if these figures and tables are not necessary to understand the main points of the paper (e.g., Table 11 perhaps).

4. The Discussion and Conclusion notes that high-white-percentage counties tend to have lower S scores but that this effect is probably not causal and reflects only the history of these areas; this history explanation is used to justify model specification. But if this "only history" explanation can be said about other correlations in the study, then the paper might benefit from a more thorough discussion of inference issues.
Admin
LJ,

Thank you for your review. The version submitted here and on OSF does have an introduction and has several improvements over the earlier version you read. However, most were stylistic.

However, it turns out that RCA found a way to estimate cognitive ability for the counties using school district data. This data was not available to me but should be included. This requires that the entire paper is redone with the cognitive ability variable as well. See details on Twitter:

The data:
https://twitter.com/RCAFDM/status/717736004317286401

Cognitive ability x S:
https://twitter.com/RCAFDM/status/717797530554204160

Note that before he posted it, I predicted an r between .60 and .70. He found .69 with weights. :)

This information is pertinent to the current paper. The cognitive sociology model holds that SIRE x S is mediated almost entirely by cognitive ability. This can now be tested.
Admin
Apparently, I had forgotten to upload the latest version (with the introduction etc.). Sorry about that. It is uploaded now.
Emil,

This is a tricky analysis to do because you're trying to film a moving object. If a U.S. county does well socially and economically, it will experience more job growth and become, in general, a more attractive place to live. As a result, there will be an influx of people who have trouble finding work elsewhere, typically in services and construction. So the current demographic makeup of a county reflects its history of job growth, which ultimately reflects its past demographic makeup. These two variables -- current demographic makeup and past demographic makeup -- can be very different.
Admin
LJ,

I have finished the new revision. This adds a bunch of new analyses using the cognitive ability data, as well as new sections dealing with mediation and Jensen's method. This also increased the length to 50 pages, including references.

I have incorporated the grammatical fixes you suggested.

2. The Discussion and Conclusion might have more value if that section highlighted what was learned from the new analyses and/or what previous findings were corroborated. The section mentions results from the paper, but a reader unfamiliar with the literature might not appreciate what about the paper was new.

3. The paper is dense in terms of data presentation. It might be worth considering focusing the main paper on fewer figures and/or tables, and placing some figures and/or tables in an appendix, if these figures and tables are not necessary to understand the main points of the paper (e.g., Table 11 perhaps).


I dislike it when papers try to do this. In my opinion, they usually do it too soon, e.g. after the first small study. In this case, we are working with a very large dataset but the data are all correlational and measured at approximately the same time. This necessarily makes it somewhat difficult to draw causal conclusions.

I prefer the presentation style where the evidence is generally just presented and the reader can then make up his own mind with regards to how significant the findings are. Show, not tell.

4. The Discussion and Conclusion notes that high-white-percentage counties tend to have lower S scores but that this effect is probably not causal and reflects only the history of these areas; this history explanation is used to justify model specification. But if this "only history" explanation can be said about other correlations in the study, then the paper might benefit from a more thorough discussion of inference issues.


It was not of particular interest to my investigations. However, I had in mind the models presented in Albion's Seed. See e.g. http://slatestarcodex.com/2016/04/27/book-review-albions-seed/

--

Peter,

This is a tricky analysis to do because you're trying to film a moving object. If a U.S. county does well socially and economically, it will experience more job growth and become, in general, a more attractive place to live. As a result, there will be an influx of people who have trouble finding work elsewhere, typically in services and construction. So the current demographic makeup of a county reflects its history of job growth, which ultimately reflects its past demographic makeup. These two variables -- current demographic makeup and past demographic makeup -- can be very different.


You are right: the demographics of counties are constantly changing in response to the social conditions in the same counties. Research-wise, this is good because it makes it possible to do cross-lagged longitudinal studies. Suppose we propose that Asian Americans have a positive influence on S. Then we can check whether counties that increased their share of Asian Americans over a timespan (e.g. 2000 to 2010) also increased their S scores.

It would be possible to take into account the past demographics of a county to see whether that predicts the future S of a county regardless of the future demographics, a kind of enduring demographic effect.

These types of analyses are, however, not possible with the current dataset because it lacks longitudinal S data. Census data has SIRE data for 2000 and 2010, so one could use these years. It may be possible to find cognitive ability data as well. The school-district dataset spans the years 2004 to 2009, which may be a sufficiently long window to do this kind of research, but unfortunately, it does not match up with the census years. The S data is only available for 2009 and 2010, which gives only a 1-year window. My guess is that even if one could obtain SIRE data for 2009 and 2010, a window of only 1 year would probably mean that there is too much noise to detect real effects.
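As a rough illustration of the change-score check described above, here is a minimal Python sketch (not the R code used for the paper). It only correlates the change in Asian% with the change in S between two time points, which is simpler than a full cross-lagged panel model. The field names (`asian_2000`, `s_2010`, etc.) and the four toy counties are hypothetical and exist purely for illustration. No such longitudinal S data is in the current dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def change_score_correlation(counties):
    """Correlate the change in Asian% with the change in S across counties."""
    d_share = [c["asian_2010"] - c["asian_2000"] for c in counties]
    d_s = [c["s_2010"] - c["s_2000"] for c in counties]
    return pearson(d_share, d_s)


# Hypothetical toy data; the numbers are invented for illustration only.
counties = [
    {"asian_2000": 2.0, "asian_2010": 3.0, "s_2000": 0.10, "s_2010": 0.20},
    {"asian_2000": 5.0, "asian_2010": 5.5, "s_2000": 0.50, "s_2010": 0.52},
    {"asian_2000": 1.0, "asian_2010": 0.8, "s_2000": -0.30, "s_2010": -0.35},
    {"asian_2000": 8.0, "asian_2010": 10.0, "s_2000": 1.00, "s_2010": 1.15},
]

print(round(change_score_correlation(counties), 2))
```

A positive correlation here would be consistent with the proposed influence but would not establish it, since both changes could be driven by a third factor such as job growth.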

--

Revision #5
https://osf.io/btnx5/
Admin
Interesting and very thorough paper. I have some minor points and suggestions:

1. I would recommend slightly more aesthetically-minded formatting (e.g., justifying text, centering table columns etc.), but I leave this at Emil's discretion.

2. The Abstract is perhaps slightly too long. It reads more like an abridged introductory section. But again, I leave this at Emil's discretion.

3. The y-axis on Figure 4 says "S", but it should say "CA". In fact, Figure 4 appears to be identical to Figure 5.

4. Why not include a scatterplot showing the relationship between cognitive ability and the S-factor across counties?

5. At the beginning of Section 5:

"Since the top 100 counties have a mean S of 1.43 and S has a correlation to CA of perhaps .60 at the individual level (Strenze, 2007), the top 100 counties group should have a mean CA score of 1.43 * .6 = .86 Z, or about 113 IQ."


Unless I am mistaken, this assumes that there are no spillover effects of cognitive ability on the S-factor. Jones (2015; 'Hive Mind') argues that there are such spillover effects (e.g., high IQ people are more co-operative, and so are more willing to fund public goods; high IQ people vote for more market-oriented policies, which leads to higher incomes). The assumption of no spillover effects should be noted in the text.
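For reference, the arithmetic behind the quoted estimate can be written out as follows. This assumes the conventional IQ scale (mean 100, SD 15), which the quoted passage does not state explicitly, and it takes the .60 individual-level correlation at face value:

```latex
\begin{align*}
\bar{z}_{CA} &= r_{S,CA} \times \bar{z}_{S} = 0.60 \times 1.43 \approx 0.86 \\
\mathrm{IQ} &\approx 100 + 15 \times 0.86 \approx 113
\end{align*}
```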

6. In Section 8, it might be interesting to estimate state fixed-effects models, i.e., multiple OLS models of the form:

county_s-factor_score = county_cognitive_ability + county_race_variables + state_dummies
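To make the suggestion concrete, here is a minimal Python sketch of what state fixed effects do. It is not the paper's code and makes several simplifying assumptions: a single predictor (standing in for county cognitive ability), no population weights, and invented toy data in which the true slope is 0.67. For one predictor, demeaning within states gives the same slope as an OLS model with state dummies.

```python
from collections import defaultdict


def pooled_slope(rows):
    """OLS slope of y on x, ignoring states entirely."""
    xs = [x for _, x, _ in rows]
    ys = [y for _, _, y in rows]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def within_slope(rows):
    """Slope of y on x after subtracting each state's mean from x and y."""
    by_state = defaultdict(list)
    for state, x, y in rows:
        by_state[state].append((x, y))
    sxy = 0.0
    sxx = 0.0
    for pairs in by_state.values():
        mx = sum(x for x, _ in pairs) / len(pairs)
        my = sum(y for _, y in pairs) / len(pairs)
        for x, y in pairs:
            sxy += (x - mx) * (y - my)
            sxx += (x - mx) ** 2
    return sxy / sxx


# Hypothetical toy data: three made-up states whose intercepts (effect)
# are correlated with their typical predictor level (shift).
state_params = {"A": (0.5, 1.0), "B": (-0.4, -1.0), "C": (0.0, 0.0)}
rows = []
for state, (effect, shift) in state_params.items():
    for base in (-1.0, 0.0, 1.0, 2.0):
        x = base + shift
        rows.append((state, x, 0.67 * x + effect))

print("pooled slope:", round(pooled_slope(rows), 3))
print("within-state slope:", round(within_slope(rows), 3))
```

In this toy setup the pooled slope is inflated because the state intercepts are correlated with the predictor, while the within-state slope recovers the value used to generate the data.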
Admin
Noah,

1. I would recommend slightly more aesthetically-minded formatting (e.g., justifying text, centering table columns etc.), but I leave this at Emil's discretion.


When you say "centering table columns", do you mean like this?

[attachment=721]

The second table has centered text, the first uses whatever the default setting was.

2. The Abstract is perhaps slightly too long. It reads more like an abridged introductory section. But again, I leave this at Emil's discretion.


It is my understanding that the point of an abstract is to summarize the main findings. A paper that contains many analyses thus necessitates a longer abstract.

Personally, I often re-read abstracts of papers because I forgot what the main findings of the paper were. When papers do not present these in the abstract, I have to skim the actual paper. Sometimes, I just need a single number. I am trying to avoid giving others this problem by actually presenting the main results in the abstract.

3. The y-axis on Figure 4 says "S", but it should say "CA". In fact, Figure 4 appears to be identical to Figure 5.


You are right. It was the wrong figure. I have put the right one there now.

Unless I am mistaken, this assumes that there are no spillover effects of cognitive ability on the S-factor. Jones (2015; 'Hive Mind') argues that there are such spillover effects (e.g., high IQ people are more co-operative, and so are more willing to fund public goods; high IQ people vote for more market-oriented policies, which leads to higher incomes). The assumption of no spillover effects should be noted in the text.


It assumes a lot of things, both parameter values and causal relationships. In particular, it assumes that positive-feedback aggregation effects are not present (as in Hive Mind, but I haven't read it). I think this is what you call spillover effects. In other words, it assumes that aggregate-level traits are a simple composition of the individual-level effects or traits.

To note, this section was written before I had cognitive ability data. Now that I have this, I checked that this assumption roughly holds. In other words, what is the mean cognitive ability among the top 100 S counties? It turns out, it is 1.36! Only slightly lower than the S which is 1.43. So this would imply a correlation of almost .95 at the individual-level, which is clearly untenable.
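Written out, the implied correlation is just the ratio of the two means, under the same no-aggregation-effect assumption as the original calculation:

```latex
r_{\text{implied}} = \frac{\bar{z}_{CA}}{\bar{z}_{S}} = \frac{1.36}{1.43} \approx 0.95
```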

I have removed this section from the paper (and the abstract).

6. In Section 8, it might be interesting to estimate state fixed-effects models, i.e., multiple OLS models of the form:

county_s-factor_score = county_cognitive_ability + county_race_variables + state_dummies


I ran the model. Output:


> lm("S ~ CA + White + Black + Asian + Amerindian + State", data = d_main, weight = Total.Population) %>%
+ MOD_summary(kfold = F)
$coefs
Beta SE CI.lower CI.upper
CA 0.67 0.02 0.64 0.70
White 0.06 0.02 0.03 0.09
Black -0.13 0.02 -0.17 -0.10
Asian 0.11 0.01 0.10 0.12
Amerindian -0.12 0.02 -0.16 -0.08
State: Alaska 0.67 0.18 0.32 1.02
State: Arizona 0.33 0.11 0.10 0.55
State: Arkansas -0.41 0.10 -0.61 -0.22
State: California 0.47 0.08 0.32 0.62
State: Colorado 0.24 0.09 0.07 0.42
State: Connecticut 0.14 0.10 -0.05 0.33
State: Delaware 0.15 0.15 -0.16 0.45
State: Florida -0.22 0.07 -0.36 -0.08
State: Georgia 0.05 0.08 -0.10 0.20
State: Idaho 0.09 0.13 -0.15 0.34
State: Illinois 0.09 0.07 -0.05 0.24
State: Indiana -0.35 0.08 -0.52 -0.19
State: Iowa 0.34 0.10 0.15 0.54
State: Kansas -0.06 0.10 -0.26 0.14
State: Kentucky -0.75 0.09 -0.92 -0.57
State: Louisiana 0.00 0.09 -0.17 0.18
State: Maine 0.07 0.13 -0.19 0.33
State: Maryland 0.17 0.09 0.01 0.34
State: Massachusetts -0.15 0.09 -0.32 0.01
State: Michigan -0.04 0.07 -0.19 0.11
State: Minnesota 0.27 0.09 0.10 0.44
State: Mississippi -0.25 0.10 -0.44 -0.05
State: Missouri -0.27 0.08 -0.43 -0.11
State: Montana -0.21 0.15 -0.51 0.08
State: Nebraska 0.17 0.12 -0.06 0.40
State: Nevada 0.26 0.11 0.05 0.47
State: New Hampshire 0.26 0.13 0.00 0.52
State: New Jersey -0.21 0.08 -0.38 -0.05
State: New Mexico 0.54 0.12 0.30 0.78
State: New York -0.17 0.07 -0.31 -0.03
State: North Carolina -0.27 0.08 -0.42 -0.12
State: North Dakota 0.06 0.18 -0.29 0.40
State: Ohio -0.52 0.08 -0.66 -0.37
State: Oklahoma -0.17 0.10 -0.35 0.02
State: Oregon 0.20 0.09 0.02 0.39
State: Pennsylvania -0.26 0.07 -0.41 -0.12
State: Rhode Island 0.35 0.14 0.07 0.64
State: South Carolina -0.10 0.09 -0.27 0.08
State: South Dakota -0.01 0.16 -0.33 0.31
State: Tennessee -0.10 0.08 -0.26 0.06
State: Texas -0.44 0.07 -0.58 -0.29
State: Utah 0.79 0.10 0.58 0.99
State: Vermont 0.14 0.18 -0.21 0.49
State: Virginia 0.15 0.08 0.00 0.31
State: Washington 0.04 0.08 -0.13 0.20
State: West Virginia -0.15 0.12 -0.38 0.08
State: Wisconsin 0.18 0.08 0.02 0.35
State: Wyoming 0.10 0.19 -0.28 0.47

$meta
N R2 R2 adj.
3086.00 0.78 0.78


One could also add the environmental variables. Output:

$coefs
Beta SE CI.lower CI.upper
CA 0.68 0.02 0.65 0.72
White 0.06 0.02 0.02 0.09
Black -0.14 0.02 -0.18 -0.10
Asian 0.11 0.01 0.10 0.12
Amerindian -0.12 0.02 -0.17 -0.08
State: Alaska 0.70 0.34 0.04 1.36
State: Arizona 0.75 0.16 0.44 1.07
State: Arkansas -0.39 0.11 -0.61 -0.17
State: California 0.95 0.16 0.64 1.25
State: Colorado 0.43 0.13 0.17 0.69
State: Connecticut -0.29 0.15 -0.57 0.00
State: Delaware -0.17 0.18 -0.52 0.18
State: Florida -0.18 0.09 -0.34 -0.01
State: Georgia 0.12 0.09 -0.05 0.29
State: Idaho 0.26 0.18 -0.10 0.62
State: Illinois -0.07 0.11 -0.28 0.14
State: Indiana -0.55 0.11 -0.76 -0.34
State: Iowa 0.24 0.13 -0.01 0.49
State: Kansas -0.04 0.12 -0.28 0.19
State: Kentucky -0.83 0.11 -1.06 -0.61
State: Louisiana 0.13 0.10 -0.07 0.32
State: Maine -0.46 0.19 -0.83 -0.09
State: Maryland -0.13 0.12 -0.37 0.11
State: Massachusetts -0.61 0.14 -0.89 -0.33
State: Michigan -0.29 0.12 -0.52 -0.06
State: Minnesota 0.09 0.13 -0.17 0.35
State: Mississippi -0.21 0.11 -0.43 0.00
State: Missouri -0.33 0.10 -0.52 -0.13
State: Montana -0.18 0.20 -0.57 0.22
State: Nebraska 0.14 0.14 -0.14 0.43
State: Nevada 0.64 0.17 0.30 0.97
State: New Hampshire -0.20 0.18 -0.56 0.15
State: New Jersey -0.57 0.13 -0.82 -0.31
State: New Mexico 0.88 0.15 0.58 1.18
State: New York -0.57 0.13 -0.82 -0.33
State: North Carolina -0.43 0.09 -0.61 -0.25
State: North Dakota -0.08 0.22 -0.51 0.34
State: Ohio -0.75 0.11 -0.96 -0.54
State: Oklahoma -0.10 0.11 -0.32 0.12
State: Oregon 0.37 0.18 0.02 0.72
State: Pennsylvania -0.58 0.11 -0.81 -0.36
State: Rhode Island -0.07 0.19 -0.45 0.31
State: South Carolina -0.21 0.10 -0.40 -0.01
State: South Dakota -0.05 0.19 -0.43 0.33
State: Tennessee -0.15 0.09 -0.33 0.02
State: Texas -0.23 0.09 -0.41 -0.04
State: Utah 1.04 0.15 0.74 1.33
State: Vermont -0.32 0.22 -0.75 0.12
State: Virginia -0.20 0.11 -0.42 0.03
State: Washington 0.10 0.18 -0.25 0.45
State: West Virginia -0.33 0.14 -0.60 -0.06
State: Wisconsin -0.03 0.12 -0.27 0.21
State: Wyoming 0.23 0.22 -0.21 0.67
lat 0.18 0.05 0.07 0.28
lon 0.19 0.06 0.08 0.30
precip 0.03 0.02 0.00 0.07
temp 0.04 0.04 -0.04 0.12

$meta
N R2 R2 adj.
2682.00 0.79 0.78


The betas change a little, but the R2 is about the same. CA is still the driving force. E.g. if one calculates eta squared, then CA has 47%, Black 10%, Asian 14%, state 8% and everything else rounds to <1%.

I have added a new section with these results.

--

Revision 6 uploaded. https://osf.io/btnx5/

--

Revisions 7-8 uploaded. They have only a few minor cosmetic changes.
Admin
When you say "centering table columns", do you mean like this?


Yes, I would recommend formatting the tables like that.

One more minor point:

In the latest version of the paper, Figure 10 is overlapping with footnote 8, which looks rather unsightly.

Once these two minor points (and a few typos/spelling errors) have been dealt with, I will approve the paper for publication.
Admin
I have changed the table formatting. This is manual labor as there does not appear to be a way to do this automatically in LibreOffice.

I have fixed the figure overlapping.

I have fixed a number of minor cosmetic errors. Furthermore, I have updated the abstract to include the cognitive ability results.

New revision uploaded.
Admin
Publication approved.
I approve the submission, but I think the manuscript could use one more review for spelling and grammar. I was working with the May 2 version, and some of the imperfections have been addressed in the May 4 update, but things that can still be addressed include:

* Because correlations are measures [of] linear associations [association?]

* most previous research on the topic have [has] used correlations

* I don't have [a] hypothesis for why this is the case.

* To investigate, all the environmental predictors were discretized into 10 bins the same was [way] temperature was before and entered into a regression model
Admin
Zigerell,

Thank you. I will have someone with a better eye for this kind of thing go over it. It is hard to find language mistakes in your own work because you already know what all the sentences say, so you don't read them as carefully as you would someone else's. I have fixed the 4 errors pointed out above (revision #10).

With regard to reviewing: Pesta is also reviewing this, and with his approval, there would be 3 approvals.
I read the paper, but did not scrutinize many of the complex statistics presented here, as frankly I am not expert on these. In that sense, my review may not be very helpful (I don’t know if any statisticians have provided feedback above).

The paper is perhaps overly data-driven and theory light, but I see no real issues. It makes several important contributions at a less-well studied level of analysis (i.e., counties in the USA). The results converge nicely with what’s found using higher-level data like states and nations. I’m still not sure statistical analyses of data like these can get at cause and effect (and note that the author doesn’t claim causality anyway).

Section 9 was very clear, and the strongest part of the paper, in my opinion.

Also, whenever I tried publishing aggregate-level data, I was told to mention the ecological fallacy. I think the author does this once, but it might be good to mention it again in the second last paragraph of the “discussion and conclusion.” County-level mediation doesn’t imply the same when looking at individual differences for race, CA, and S.

Finally, there’s still some minor grammar / typo issues in places (e.g., Section 11, “very was”), but I otherwise approve publication.

Bryan
Admin
Bryan,

Thank you for taking the time to review this. It is a long paper.

I read the paper, but did not scrutinize many of the complex statistics presented here, as frankly I am not expert on these. In that sense, my review may not be very helpful (I don’t know if any statisticians have provided feedback above).


Unfortunately, it is very difficult to find persons with expertise in sociology, differential psychology, factor analysis (including the new methods I devised) as well as model selection with LASSO regression.

If you have any suggestions for someone who has the expertise and has the time to review this paper, it would be fine with me if we could recruit a fourth reviewer for this paper (only 3 are mandatory).

The paper is perhaps overly data-driven and theory light, but I see no real issues. It makes several important contributions at a less-well studied level of analysis (i.e., counties in the USA). The results converge nicely with what’s found using higher-level data like states and nations. I’m still not sure statistical analyses of data like these can get at cause and effect (and note that the author doesn’t claim causality anyway).


In general, I prefer to take a show, not tell approach to science and publishing. Lots of tables and figures so that the reader gets a good understanding of the data. The interpretation of the results is mostly up to the reader.

As you note, this kind of cross-sectional data is not very good at deciding between causal models. For this reason, I did not try to draw strong conclusions about causality. This is about the same approach as was taken in the Admixture in the Americas paper. This fits with the show, not tell approach because it lets the reader draw his own causal conclusions.

Also, whenever I tried publishing aggregate-level data, I was told to mention the ecological fallacy. I think the author does this once, but it might be good to mention it again in the second last paragraph of the “discussion and conclusion.” County-level mediation doesn’t imply the same when looking at individual differences for race, CA, and S.


I mentioned the problem in passing, but not by that name. I wrote:

The conflicting results may be due to aggregation effects (lack of ergodicity), that is, analyzing the data at too high a level. If there is an aggregation effect at the state-level that causes the conflicting results for White and S, then it should not be present when the data is analyzed at the county-level since this is one level below the state-level.

Lack of ergodicity is exactly what gives rise to inferential problems between levels of analysis and thus the ecological fallacy. I will add a paragraph about it in the discussion.

Finally, there’s still some minor grammar / typo issues in places (e.g., Section 11, “very was”), but I otherwise approve publication.


You are right. I will go over the paper again.
Hello,


I don't think a stats person will say anything that makes the paper unacceptable. I say approve it.


Bryan
Admin
Moving thread to the post-review forum...
Admin
When I did the study I didn't know how to map the data. However, I have since learned that. Out of curiosity, I mapped some of the data. This revealed a problem with the method used to find the mean temperature for each county. The map shows it all.

There are large areas, even in highly populated regions, with no temperature data. This is clearly a mistake. The way I gathered the temperature data by county was by finding the nearest county for each weather station. If there were multiple stations for a county, I averaged the data. Since there were many more weather stations than counties, I figured this would cover almost all of them, or at least the important ones. This is wrong. Instead, I should proceed the other way, namely finding the nearest weather station for each county and using that. This will always result in the counties having a datapoint, but it may not be a good one if the nearest station is far away. Still, it should be better than the present approach.
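The difference between the two matching directions can be sketched in a few lines of Python (not the code used for the paper). The county centroids, station coordinates and temperatures below are hypothetical, and great-circle distance to a county centroid is an assumed simplification, since the post does not say how "nearest" was measured.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, p)
    lat2, lon2 = map(radians, q)
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def temps_by_station(counties, stations):
    """Original direction: assign each station to its nearest county, then average."""
    collected = {}
    for coords, temp in stations.values():
        nearest = min(counties, key=lambda name: haversine_km(counties[name], coords))
        collected.setdefault(nearest, []).append(temp)
    return {name: sum(temps) / len(temps) for name, temps in collected.items()}


def temps_by_county(counties, stations):
    """Proposed direction: give each county the reading of its nearest station."""
    result = {}
    for name, centroid in counties.items():
        nearest = min(stations.values(), key=lambda s: haversine_km(centroid, s[0]))
        result[name] = nearest[1]
    return result


# Hypothetical county centroids (lat, lon) and stations ((lat, lon), mean temp).
counties = {"A": (40.0, -100.0), "B": (40.0, -95.0), "C": (40.0, -90.0)}
stations = {
    "s1": ((40.1, -100.1), 10.0),
    "s2": ((39.9, -99.8), 12.0),
    "s3": ((40.0, -95.2), 14.0),
}

by_station = temps_by_station(counties, stations)
by_county = temps_by_county(counties, stations)
print("station -> county:", by_station)
print("county -> station:", by_county)
```

In this toy example, county C has no station nearby, so the station-to-county direction leaves it without data, while the county-to-station direction assigns it the reading from a station several hundred kilometers away. A distance cutoff could be used to flag such cases for imputation instead.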

An alternative is to impute the climatic data.