Moved this submission to the submission forum, now that we have a journal and a review team. -Emil
I have another paper that I don't know where to send. I will post it here for now and try to get some reviewers to comment here.
Title
Inequality among 32 London Boroughs: An S factor analysis
Abstract
A dataset of 30 diverse socioeconomic variables was collected covering 32 London boroughs. Factor analysis of the data revealed a general socioeconomic factor. This factor was strongly related to GCSE scores (r's .813 to .819) and had weak to medium-sized negative relationships to demographic variables related to immigrants (r's -.224 to -.489). Jensen's method indicated that these relationships were related to the underlying general factor, especially for GCSE (Jensen coefficients .67 to .84, and -.45 to -.60).
Key words:
general socioeconomic factor, S factor, inequality, London, boroughs, United Kingdom, cognitive ability, IQ, intelligence, scholastic ability, GCSE, immigrants
PDF and files:
https://osf.io/p6fwh/
Publishing tweet:
https://twitter.com/KirkegaardEmil/status/646598876032425984
This paper analyses educational, demographic and socio-economic data from 32 London boroughs. It derives a general socio-economic factor, and confirms that this factor is correlated in the expected direction with the educational and demographic variables. The analyses are appropriate, and appear to support the conclusions enunciated in the text. In addition, the paper is clearly written, and adequately referenced. Therefore, I believe it is ready to be published. I would, however, offer the following suggestions to the author:
1. Consider relabelling variables so that their names are easier to read (e.g., deleting underscores).
2. Consider including a horizontal line in Table 1, in order to separate variable names from the reported correlations.
3. Note that '% 5+ A*-C with Eng. and Math.' is still a somewhat blunt measure of cognitive ability, insofar as: the exact grades are not specified, so a borough with X% getting 5+ Cs could theoretically obtain the same percentage as a borough with X% getting 5+ A*s; three of the five subjects are not specified, so could be either comparatively easy (e.g., Media Studies), or comparatively difficult (e.g., Physics); many children in Britain take 10 or more GCSEs.
4. Note that one cannot draw strong conclusions about the aggregate-level correlation between educational achievement and cognitive ability across London boroughs from the aggregate-level correlation between these two variables across countries.
Noah,
Thanks for the review.
1)
R does not like it when variable names have spaces (even in data.frames), so I have used underscores. One could also use CamelCase instead, but I think underscores are more readable. My preference is to keep the code consistent with the paper (for those who wish to analyze it more closely) over that of a slightly more polished presentation.
2)
I have added borders to the table to make it more readable.
3)
The achievement measure is clearly not optimal, but one has to make do with what there is. Perhaps one can find a better measure. You seem familiar with the data, could you perhaps tell me whether one of these measures is better? http://data.london.gov.uk/dataset/gcse-results-location-pupil-residence-borough
It looks like there are several measures one could perhaps factor analyze or otherwise combine to get a composite measure:
a) All Pupils at the End of KS4 Achieving 5+ A* - C
b) All Pupils at the End of KS4 Achieving 5+ A* - G
c) All Pupils at the End of KS4 Achieving 5+ A* - C Including English and Mathematics
d) All Pupils at the End of KS4 Achieving 5+ A* - G Including English and Mathematics
e) All Pupils at the End of KS4 Achieving the Basics
f) All Pupils at the End of KS4 Entering the English Baccalaureate
g) All pupils at the End of KS4 Achieving the English Baccalaureate
h) Average GCSE and Equivalent Point Score Per Pupil at the End of KS4
i) Average Capped GCSE and Equivalent Point Score Per Pupil at the End of KS4
As far as I can tell, many of these are threshold versions of more continuous variables (cf. http://www.lagriffedulion.f2s.com/adverse.htm). Such variables have somewhat non-linear relationships. Perhaps (h) is the best variable to use? It looks like a mean score type variable, meaning that no threshold transformation has been applied to it.
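To illustrate why threshold variables relate non-linearly to the underlying continuous score, here is a minimal Python sketch (the paper's own analysis was done in R). It assumes, purely for illustration, that scores within each borough are normally distributed with SD 1 and that the pass threshold sits at 0. None of these numbers come from the GCSE data.

```python
from statistics import NormalDist


def pass_rate(mean, threshold=0.0, sd=1.0):
    """Share of a normal(mean, sd) population scoring above the threshold."""
    return 1.0 - NormalDist(mu=mean, sigma=sd).cdf(threshold)


# Hypothetical borough means, equally spaced on the underlying scale.
means = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
rates = [pass_rate(m) for m in means]

for m, r in zip(means, rates):
    print(f"mean = {m:+.1f}  pass rate = {r:.3f}")

# Differences between neighbouring pass rates: equal steps in the mean
# produce unequal steps in the share above the threshold.
steps = [b - a for a, b in zip(rates, rates[1:])]
print([round(s, 3) for s in steps])
```

The steps shrink as the mean moves away from the threshold, which is the ceiling-like behaviour that makes a mean-score variable preferable to a pass-rate variable.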
I did analyze them. The currently used variable (c) has a factor loading of .96, but the loadings of all the variables are in the .69-.98 range, so it would probably not matter so much. The highest loading is (i).
In fact, because we have 9 variables all measuring scholastic ability, one can use Jensen's method. The prediction is that the variables that better measure scholastic ability should show higher correlations with the criterion variable (S). This was in fact found, r's .94-.95 (depending on which S score vector was used).
All correlations between GCSE variables and S were substantial (r's .582 to .886). The strongest correlation is with (h), as one would expect because it is the underlying continuous variable. (i) seems to be some capped (?) version of this, which introduces a ceiling effect.
So it seems to me that one should use (h).
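The Jensen's method check described above can be sketched as follows in Python (the paper itself uses R). The idea is to correlate two vectors across indicators: each indicator's loading on the general scholastic factor, and each indicator's correlation with the criterion (S). The function names and the six example values are hypothetical and only serve to show the mechanics; they are not the values from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def jensen_coefficient(loadings, criterion_cors):
    """Correlate the vector of factor loadings with the vector of
    indicator-by-criterion correlations (one entry per indicator)."""
    return pearson(loadings, criterion_cors)


# Hypothetical values, for illustration only (not taken from the paper).
loadings = [0.70, 0.75, 0.80, 0.85, 0.90, 0.95]
criterion_cors = [0.58, 0.65, 0.66, 0.75, 0.80, 0.88]

coef = jensen_coefficient(loadings, criterion_cors)
print(round(coef, 2))
```

A strongly positive coefficient means that the indicators which measure the general factor best are also the ones most strongly related to the criterion.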
It seems somewhat unnecessary to include all this in the main part of the paper. Perhaps add an appendix discussing the GCSE variables and the above analysis?
Let me know what you would prefer.
4)
I agree. I changed it to:
In line with much other research (15,16), one would expect higher cognitive ability to lead to higher S. The GCSE grades are not exactly an IQ test (17,18), but it has been found that at the national level, scholastic ability and cognitive ability as measured by traditional IQ tests are nearly perfectly correlated (19,20). This suggests that it may also be a useful proxy at the borough level, but this may not be the case. Prior research using similar data has found strong relationships between scholastic/cognitive ability and S, so a correlation in the vicinity of .40 to .90 would be expected here.
--
Files updated.
All changes proposed are fine. In regard to 3), I agree that (h) seems like the best overall measure of cognitive ability among those available. A short appendix discussing the different GCSE variables would suffice.
Noah,
I have uploaded a new version. It now has an appendix dealing with the GCSE stuff, 2 scatter plots of the main findings (GCSE x S, BAME x S), a brief analysis of mediation as well as the other changes discussed above.
Files updated at OSF.
I have asked Kenya Kura to review this.
Having read this manuscript, it seems to me to be ready for publication without further analysis. The statistical analyses are sufficiently sophisticated (e.g., Figure 1) and the results are very robust and consistent with previous findings, such as those for Boston. Below, let me just state two of my impressions of the findings.
1. The title of Figure 4 should be “Scatter plot of S and Pct_BAME”. I found this relationship to be apparently weaker than the correlation in Figure 3, which makes a lot of sense because S and GCSE are stats from all students (or people in fact) including British gentiles. This relation may not be as strong as the international S factor but should be fairly strong, as in the cases of Italy, Spain, or Japan with similar north-south gradients.
On the other hand, as the author acknowledges, the relation between S and Pct_BAME comes from extremely diverse immigrant samples, ranging from Scandinavians, who are very close to British people, to people from Sub-Saharan countries, Pakistan, India and China, who are very distant in terms of Fst, and many different kinds of selection processes/pressures must also have existed. This also seems to be true for the MCV of these two correlations in Figures 5 and 6.
2. The reason why S and the female wage rate have a negative correlation is a puzzle. I suspect that women in affluent districts are not as eager to make money as those in poorer districts. For example, when husbands earn a lot more, their (assortatively mated) wives may prefer to be housewives and/or feel less obliged to work long hours or to work in high-paying jobs, and so on. There may be a non-linear relationship in this case. This is just my guess.
It has been found again and again, in a surprisingly consistent manner, that S exists among human populations as the meta-factor of the socioeconomic variables. I am afraid that these findings may be too inconvenient to acknowledge at the present time, when so many refugees desperately need help.
Ken,
Thank you for taking the time to review this.
I have fixed the error with Figure 4.
Furthermore, I reran the code. The split-half factor reliability was lower in the re-run. Then I reran it with a 10x larger sample size (N=5000), giving a value of .79. I updated the paper accordingly.
https://osf.io/f4uc2/files/
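For readers unfamiliar with the split-half check mentioned above, here is a rough Python sketch of one way such a reliability estimate can be computed. It assumes that the procedure randomly splits the indicators into two halves, scores each half, correlates the two sets of scores across boroughs and averages over many random splits, and that N refers to the number of splits. To stay dependency-free it uses a unit-weighted mean of z-scores in place of factor scores, which is a simplification and not the method used in the paper. The data are simulated, with all indicators loading positively on one latent variable.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def zscore_columns(rows):
    """Return the indicators (columns) of a cases-by-indicators table as z-scores."""
    z_cols = []
    for col in zip(*rows):
        m = sum(col) / len(col)
        sd = sqrt(sum((v - m) ** 2 for v in col) / (len(col) - 1))
        z_cols.append([(v - m) / sd for v in col])
    return z_cols


def composite(z_cols, idx):
    """Unit-weighted score per case: mean z-score over the chosen indicators."""
    n_cases = len(z_cols[0])
    return [sum(z_cols[j][i] for j in idx) / len(idx) for i in range(n_cases)]


def split_half_reliability(rows, n_splits=500, seed=1):
    """Mean correlation between scores from two random halves of the indicators."""
    rng = random.Random(seed)
    z_cols = zscore_columns(rows)
    indices = list(range(len(z_cols)))
    half = len(indices) // 2
    cors = []
    for _ in range(n_splits):
        rng.shuffle(indices)
        a, b = indices[:half], indices[half:]
        cors.append(pearson(composite(z_cols, a), composite(z_cols, b)))
    return sum(cors) / len(cors)


# Simulated data: 32 cases, 30 indicators driven by one latent variable.
sim = random.Random(42)
rows = []
for _ in range(32):
    s = sim.gauss(0, 1)
    rows.append([0.7 * s + sim.gauss(0, 0.7) for _ in range(30)])

reliability = split_half_reliability(rows, n_splits=500)
print(round(reliability, 2))
```

Because the estimate is an average over random splits, it varies from run to run when few splits are used. Raising `n_splits` (or fixing the seed) stabilises it, which is consistent with the value changing between the original run and the re-run.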
I have another dataset covering the same units. The dataset contains crime data for about 30 types of crime, given in an unusable format. I have converted it to a useful format and calculated per capita measures. This results in 60 variables.
I have substantially rewritten the draft because of this. Results with regards to GCSE and immigrant variables were mostly unchanged.
I have also added multiple regression fitting results (best subsets and lasso).
Files updated:
https://osf.io/f4uc2/files/
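The conversion of the crime data to per capita measures mentioned above could look roughly like the following Python sketch (the paper's analysis itself was done in R). The input layout (one record per borough and crime type), the function name `crime_rates`, the per-1,000 scaling and all numbers are assumptions made for illustration. The actual source format is not described in this thread.

```python
from collections import defaultdict


def crime_rates(records, population, per=1000):
    """Pivot (borough, crime_type, count) records into per capita rates.

    Returns {borough: {crime_type: offences per `per` residents}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for borough, crime_type, count in records:
        counts[borough][crime_type] += count
    return {
        borough: {t: per * c / population[borough] for t, c in types.items()}
        for borough, types in counts.items()
    }


# Hypothetical records and populations, for illustration only.
records = [
    ("Borough A", "burglary", 120),
    ("Borough A", "robbery", 30),
    ("Borough B", "burglary", 300),
    ("Borough B", "robbery", 90),
]
population = {"Borough A": 60_000, "Borough B": 150_000}

rates = crime_rates(records, population)
for borough, by_type in rates.items():
    print(borough, by_type)
```

Dividing by population matters here because raw counts mostly track borough size. In this made-up example Borough B has far more burglaries than Borough A but the same burglary rate.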
The manuscript is good. I have attached a file with some grammar edits and comments. I approve the manuscript after those edits are addressed.
A few notes that are not necessary to address for my approval:
1. I'm not sure it's necessary to add the Japan S-factor results, but that's at least something to consider.
2. It might be worth considering the legibility of graphs. The text on the y-axis for Figure 4 overlaps slightly; Figures 5 and 6 have labels that fall off the graph or are on top of each other; and Figures 7 and 8 have a lot of labels on top of each other. If the labels aren't necessary for Figures 7 and 8, it might be best to exclude the labels; otherwise, I'm not sure of a solution.
3. The unexpected result for female pay might reflect a positive outcome. The dataset had a gross annual pay variable that was not disaggregated by sex, so maybe that would be a better measure in the future, if it's not theoretically clear that lower pay for one sex would be a negative outcome (because of, for instance, the assortative mating that Kenya mentioned).
Attachment added.
Thank you. I am pressed for time right now, but will try to get back with a new draft fixing the issues.
New version uploaded, version 6. https://osf.io/f4uc2/files/
Changes:
Abstract:
Added GCSE full name.
Fixed grammar.
Page 2:
Grammar fixes.
Moved a clause.
Page 3:
Added 2 paragraphs of explanatory text below the list.
Page 5:
Added a parenthesis with Grievous bodily harm (GBH).
Page 7:
Removed one “thus”.
Inserted missing “of”.
Page 8:
Changed “one” to “another”.
Added the remaining part of the sentence (don't know what happened there, probably got interrupted).
Page 10:
Added capital to United States.
Added “in society” to make the meaning of “fare worse” more clear.
Page 12:
Fixed typo.
Page 15:
Fixed wrong word (“each country of origin”, not “each country of country”!).
Page 17:
Fixed typo.
Others:
Fixed slight overlap in indicator names in Figure 4 (loadings plot). There has been an update to the ggplot2 package, so the plots now look somewhat different. I could not find a way to increase the line heights, so I had to make the text even smaller (font size 5). Also added more breaks and increased limits to -1 to 1.
For the Jensen's method plots, removed the overlapping names and increased the limits slightly.
Thank you, Emil. I approve publication.
Thanks. I will post it on the journal website unless there are objections.
Fixed and updated.