[OQSPS] Name features and social status
Admin
Journal:
Open Quantitative Sociology & Political Science

Authors:
Emil O. W. Kirkegaard

Title:
Name features and social status: an exploratory study of 1,890 Danish first names

Abstract:
A dataset of the relative social status of 1,890 first names of persons living in Denmark was obtained from a previous study. Linguistic features were generated based mostly on n-grams augmented by regex and each name was scored on each feature. The list of features was then pruned based on their presence in the data such that features that never or almost never appeared in the data were discarded. An initial check using t-tests showed strong signal in the features taken as a whole and that this was due mostly to low status names being more similar. LASSO regression was then used to select a subset of features with above-chance validity. Finally, an OLS model was fit based on the LASSO-chosen features. This model showed a cross-validated R2 of .40, equivalent to a correlation of .63. It was concluded that subtle linguistic features in first names have substantial validity for predicting relative social status in Denmark.


Key words:
first name, social status, linguistics, computational linguistics, variable selection, n-gram

Length:

~1300 words, ~5 pages.

Files:
https://osf.io/adgwd/

Reviewers:
None suggested.
Comments on 22. November version of manuscript

The phrase regular expression should make an appearance somewhere in the paper.

It is not clear if you have information on the distribution of social status for a given name. If for example the standard deviations for the names were available, it could be possible to estimate the R^2 for a model where status(name) = average status of name. This could then be compared to the obtained R^2 of 0.40.
Admin
Hi hvc,

Thanks for looking it over.


The phrase regular expression should make an appearance somewhere in the paper.



You are right. I have added it to section 3.

It is not clear if you have information on the distribution of social status for a given name. If for example the standard deviations for the names were available, it could be possible to estimate the R^2 for a model where status(name) = average status of name. This could then be compared to the obtained R^2 of 0.40.


There are no dispersion values for the names unfortunately. Please see the previous paper for details about the dataset or download a copy yourself. Briefly, the data concerns a few socioeconomic outcomes for each first name and only the mean (for income etc.) or rate (for convictions), no measures of dispersion. There are some more numbers like most common occupations but these were hard to convert to useful comparable numeric data and were not used in this study or the previous.
Admin
Finally had time to revise this paper. It's now a lot better I think. Files updated on OSF.

https://osf.io/adgwd/

This study is an interesting exploration of whether first names communicate information related to socioeconomic status of name bearers. I like this piece, and I think it can be published with some modest revisions:
        I wish there were more links to the literature on naming practices. There is a great deal of literature about naming practices and the information that names communicate (e.g., Abel & Kruger, 2007; Edwards & Caballero, 2008; Fryer & Levitt, 2004; Lieberson & Mikelson, 1995; Varnum & Kitayama, 2011)
        Page 1: “The results showed strong evidence of validity.” This sentence is vague. Do you mean the results showed strong evidence that linguistic characteristics of names correlated with socioeconomic indicators?
        I’m confused by the claim that one can infer approximate ancestry from first names (page 2). Please provide some detail (e.g., percentage of non-immigrants who have a name matching their ethnicity) and a citation. (Perhaps I’m skeptical because none of my children have names matching my or my spouse’s predominant ethnicities.)
        Page 2: The author states that income, criminal convictions, house ownership, and unemployment are all positively correlated. Shouldn’t criminal convictions and unemployment be negatively correlated with the other two variables?
        The biggest problem with the manuscript is that it does not make it clear to a non-expert how the name scoring procedure resulted in a variable value for a name. For example, was a point value assigned to each pattern and then these were summed for each name? Did the process create a unique score for each name? Please add a few sentences to clarify this point.
        Don’t say that a p-value distribution is uniform “by chance” (p. 4). This is vague. Say that this is the expected distribution of p-values if the null hypothesis were perfectly true. I suggest making a similar change on p. 13.
        It’s not clear that, “This can be inferred because only rare names would tend to produce very large effect sizes” (p. 4). What is your logic here? Please explain.
        For Table 1, add two columns that would let your reader know the percentage of adults in Denmark with each of these first names. That would help provide some context. (Ignore this suggestion if this information is not available.)
        Adding a mean, SD, and median S factor score for Danish and non-Danish names (perhaps in the caption of Figure 3) would be helpful.
        Please provide evidence or a citation that high status secular Turks would give their children Turkish names instead of Muslim names. (Sounds plausible, but some supporting evidence is needed.)
        Yes, the paragraph on pp. 11-12 stating that the skew in Figure 3 is caused by the inclusion of non-Danish names is almost certainly true. Figure 4 indicates that three foreign groups of names in particular (Arabic, Polish, and Turkish) scored MUCH lower on the S factor—on average—than the overall mean. This might be worth mentioning.
        Correct a few instances of awkward writing, vague language, or grammatical errors:
  Eliminate the use of “we” in a general sense (p. 2).
  The second-to-last sentence on the first paragraph on p. 2 is very hard to understand.
  The brackets at the end of p. 2 are confusing. At first, I thought that “aeiouæøå” was an example of a “fraction vowel.”
  Whose expectations are you referring to on p. 7?
  Page 8: You use the term “significant.” Please specify whether this is statistical, practical, or clinical significance. (Never use “significant” alone.)
  The phrase “we subset” is awkward because (1) there is only 1 author, and (2) I’m not sure “subset” is a verb.
I have been able to reproduce the statistical results and plots from the paper using the code provided. I have not attempted to retrieve the original data myself.

Comments on methods

  • Given that the features are 3-grams, it would be interesting to see the coefficients on each, to try to pick out the ones most (independently) associated with S. I guess you will probably see "Abd". This would be helpful insofar as there are language-typical letter triplets, which may help pinpoint language-level associations, in addition to the manual testing done with the behindthename data.
  • p 8. Clarify what is meant by performance in that context
  • It would be interesting to show an R2 curve showing the changes in R2 with number of predictors, showing what % of features can be retained with minimal loss in accuracy. This could be done by selecting the best predictor triplet from a bivariate regression, then running N-1 regressions and picking the next one, and so on.
  • There is a slight mismatch between the alphas that are used and the ones that the paper mentions. The ones used are 1, 0.325, 0.55, and 0.775. This does not affect the conclusions.
  • Given that you are willing to run 5k CV iterations, why not do a straight LOOCV? This would get results as good as or better than 5K CVs with fewer model runs. Also, switching to K folds (with K higher than usual, say 100) would also enable faster iteration, possibly allowing for an optimisation of alpha in addition to lambda. Ideally one would use Bayesian optimization here, but this feels excessive. Overall, I don't expect the results to change substantially were these methods tried.
  • Regarding the conclusions and the adjusted R2 vs CV R2, I see as an interesting avenue of research to study precisely the issue you raise: Is the traditional Adjusted R2 formula good enough?

Typos:
  • "The list predictors" -> predictors
  • "The featured" -> features
Admin
Thanks to reviewers for the detailed and constructive criticism. Many changes were made. Specific replies are given below. Files on OSF and Rpubs are updated.

Dr. g,


I wish there were more links to the literature on naming practices. There is a great deal of literature about naming practices and the information that names communicate (e.g., Abel & Kruger, 2007; Edwards & Caballero, 2008; Fryer & Levitt, 2004; Lieberson & Mikelson, 1995; Varnum & Kitayama, 2011)



Both the introduction and the discussion cite a number of other studies. I generally prefer to avoid long literature review sections in papers; the reader can consult the literature for themselves as necessary. Nevertheless, I looked at the ones you suggested:

Abel, E. L., & Kruger, M. L. (2007). Symbolic significance of initials on longevity. Perceptual and motor skills, 104(1), 179-182.

This was a false positive, see reply:

http://journals.sagepub.com/doi/10.2466/05.PMS.112.1.211-216

Edwards, R., & Caballero, C. (2008). What's in a name? An exploration of the significance of personal naming of ‘mixed’ children for parents from different racial, ethnic and faith backgrounds. The Sociological Review, 56(1), 39-60.

This was not a quantitative study, just some interviews.

Fryer Jr, R. G., & Levitt, S. D. (2004). The causes and consequences of distinctively black names. The Quarterly Journal of Economics, 119(3), 767-805.

Was already cited in the discussion.

Lieberson, S., & Mikelson, K. S. (1995). Distinctive African American names: An experimental, historical, and linguistic analysis of innovation. American Sociological Review, 928-946.

Is somewhat similar, but was only related to inferring the sex of offspring with new/rare names among African Americans.

Varnum, M. E., & Kitayama, S. (2011). What’s in a name? Popular names are less common on frontiers. Psychological science, 22(2), 176-183.

Is interesting, but does not involve research on social status, and is based on regional data.

Page 1: “The results showed strong evidence of validity.” This sentence is vague. Do you mean the results showed strong evidence that linguistic characteristics of names correlated with socioeconomic indicators?


I meant to refer to the findings from the t-tests, which were an initial test for signal in the data. I have amended the sentence to “The results showed strong evidence of signal in the data.”

I’m confused by the claim that one can infer approximate ancestry from first names (page 2). Please provide some detail (e.g., percentage of non-immigrants who have a name matching their ethnicity) and a citation. (Perhaps I’m skeptical because none of my children have names matching my or my spouse’s predominant ethnicities.)


Non-immigrants only have one ethnicity (Danish), so one cannot supply such data. The point is that one can look up a given name in a database of first names and see where it comes from. We used behindthename.com. For instance, if we look up my name (https://www.behindthename.com/name/emil), we can see it is tagged as “Swedish, Norwegian, Danish, German, Romanian, Bulgarian, Czech, Slovak, Polish, Russian, Slovene, Serbian, Croatian, Macedonian, Hungarian, Icelandic, English”. I guess they are in order of strength of association, and we can see that all 3 Scandinavian countries come first on the list. Based on my personal impressions, I’d say they are also right that the primary origin is likely to be Sweden. If you check your name (https://www.behindthename.com/name/russell), it is just given as “English”, which is not so informative considering the various countries that trace their population to English settlers (including USA of course).

The results in Figure 5 suggest that this estimation method is highly accurate because the estimates of social status are highly congruent with the known ones from official data (r = .72).

Page 2: The author states that income, criminal convictions, house ownership, and unemployment are all positively correlated. Shouldn’t criminal convictions and unemployment be negatively correlated with the other two variables?


Good catch. I added the clause “when negative outcomes were reversed”.

The biggest problem with the manuscript is that it does not make it clear to a non-expert how the name scoring procedure resulted in a variable value for a name. For example, was a point value assigned to each pattern and then these were summed for each name? Did the process create a unique score for each name? Please add a few sentences to clarify this point.


You are right; an example should have been given. I have added the following example:

“For instance, the name Peter would be scored as having the following n-grams: p, e, t, r, pe, et, te, er, pet, ete, ter, as well as their initial and ending variants. It would furthermore have a vowel fraction of 2/5, stop sound fraction of 2/5, nasal sound fraction of 0, and be negative for presence of a dash. All the other features would be negative. Thus, each name has 1,099 features associated with it, of which 1,995 are binary, and 4 are numeric.”
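The scoring described in that example can be sketched roughly as follows. This is a minimal illustration, not the code from the OSF repository: the function names, the stop and nasal consonant sets, and the trailing "_" marker for ending variants are assumptions (only the leading "_" marker and the vowel set appear in this thread).

```python
VOWELS = set("aeiouæøå")  # vowel set quoted from the paper in this thread
STOPS = set("pbtdkg")     # assumption: a standard set of stop consonants
NASALS = set("mn")        # assumption: a standard set of nasal consonants


def ngrams(name, max_n=3):
    """Return the set of all 1..max_n character n-grams of a lowercased name."""
    s = name.lower()
    return {s[i:i + n] for n in range(1, max_n + 1) for i in range(len(s) - n + 1)}


def name_features(name, max_n=3):
    """Score one name: binary n-gram features plus a few numeric fractions."""
    s = name.lower()
    feats = {gram: 1 for gram in ngrams(s, max_n)}  # n-gram present anywhere
    for n in range(1, min(max_n, len(s)) + 1):
        feats["_" + s[:n]] = 1   # initial variant, e.g. "_pet"
        feats[s[-n:] + "_"] = 1  # ending variant (marker is hypothetical)
    feats["frac_vowel"] = sum(c in VOWELS for c in s) / len(s)
    feats["frac_stop"] = sum(c in STOPS for c in s) / len(s)
    feats["frac_nasal"] = sum(c in NASALS for c in s) / len(s)
    feats["has_dash"] = int("-" in s)
    return feats


peter = name_features("Peter")
print(sorted(ngrams("Peter"), key=lambda g: (len(g), g)))
print(peter["frac_vowel"], peter["frac_stop"], peter["frac_nasal"], peter["has_dash"])
```

Features absent from the returned dictionary count as negative (0) for that name, so stacking the dictionaries for all names over the full feature list gives the name-by-feature matrix used in the later steps.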

Don’t say that a p-value distribution is uniform “by chance” (p. 4). This is vague. Say that this is the expected distribution of p-values if the null hypothesis were perfectly true. I suggest making a similar change on p. 13.


I have added “(i.e. if the null hypothesis of no signal is true)” to each.
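As an illustration of that clause, here is a small simulation sketch of the p-value distribution when the null hypothesis is true. It is not the paper's analysis: it uses a two-sample z-test with known unit variance instead of t-tests, and the sample sizes and seed are arbitrary assumptions.

```python
import random
from statistics import NormalDist, mean


def null_p_values(n_tests=2000, n_per_group=30, seed=1):
    """Two-sided p-values from z-tests where both groups share the same mean."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    se = (2 / n_per_group) ** 0.5  # standard error of the mean difference, unit variance
    p_values = []
    for _ in range(n_tests):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (mean(a) - mean(b)) / se
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values


p = null_p_values()
share_below_05 = sum(v < 0.05 for v in p) / len(p)
share_below_50 = sum(v < 0.50 for v in p) / len(p)
print(round(share_below_05, 3), round(share_below_50, 3))
```

With no signal, roughly 5% of p-values fall below .05 and roughly half fall below .50, which is what a uniform distribution implies. A clear excess of small p-values is then evidence of signal in the features taken as a whole.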

It’s not clear that, “This can be inferred because only rare names would tend to produce very large effect sizes” (p. 4). What is your logic here? Please explain.


The rare names would have rare linguistic patterns because they belong to small immigrant populations. Generally, large effect sizes would tend to involve rare patterns since it is difficult to get a large effect size if the number of persons with such a name in the population is large. The exact reasoning here is somewhat too long for me to elaborate on in the section, as it is a side remark. If desired, I could produce a long footnote with it.

For Table 1, add two columns that would let your reader know the percentage of adults in Denmark with each of these first names. That would help provide some context. (Ignore this suggestion if this information is not available.)


Data for just adults is not available as far as I know, but I have added the number of persons with each name in 2012. The Danish population was about 5.6 million at that time, so the number of males was about 2.75 million. Every name is therefore only a tiny fraction of the total; the most popular one (7279) was about 0.27%.

Adding a mean, SD, and median S factor score for Danish and non-Danish names (perhaps in the caption of Figure 3) would be helpful.


Added “The mean/sd for Danish names was 0.37/0.72, and for non-Danish -0.81/1.04.” to the caption of Figure 4.

Please provide evidence or a citation that high status secular Turks would give their children Turkish names instead of Muslim names. (Sounds plausible, but some supporting evidence is needed.)


This was meant as a hypothesis, not established fact. I searched a bit, and one can find literature in this direction, though not entirely satisfactory: https://www.tandfonline.com/doi/full/10.1080/00263206.2012.703617?mobileUi=0 https://www.cairn-int.info/resume.php?ID_ARTICLE=E_RHMC_602_0018

I have amended the sentence to "e.g. high status, secular Turks might give their children Turkish names while most others give their children Muslim names".

Yes, the paragraph on pp. 11-12 stating that the skew in Figure 3 is caused by the inclusion of non-Danish names is almost certainly true. Figure 4 indicates that three foreign groups of names in particular (Arabic, Polish, and Turkish) scored MUCH lower on the S factor—on average—than the overall mean. This might be worth mentioning.


Not sure exactly what you are suggesting I add. Figure 5 shows the average status of the origin groups whether identified by name or by official data.

Correct a few instances of awkward writing, vague language, or grammatical errors:
Eliminate the use of “we” in a general sense (p. 2).


I replaced the plural first person with singular first person. The paper originally had multiple authors and the text reflected that fact. It now has the somewhat unusual singular seen in economics papers.

The second-to-last sentence on the first paragraph on p. 2 is very hard to understand.


I have split up the long sentence on page 2.

The brackets at the end of p. 2 are confusing. At first, I thought that “aeiouæøå” was an example of a “fraction vowel.”


I have added separating commas to the brackets. The bracket notation is standard in linguistics, but I see why it can be confusing to readers without a background in that field (as I have).

Whose expectations are you referring to on p. 7?


Expectations readers would have based on reading other material about immigrant groups in Denmark, origin countries’ well-being, and the world at large. Generally speaking, Northwest Europeans do best, then other Europeans, then various non-Western groups except for a few Asian countries. I have amended the sentence to “The relative ranking of the top 10 origin groups corresponds fairly well to expectations based on the origin countries’ well-being (Kirkegaard, 2014).”.

Page 8: You use the term “significant.” Please specify whether this is statistical, practical, or clinical significance. (Never use “significant” alone.)


Good catch on the word significant. I generally avoid using it altogether because of these confusions. I have changed it to “substantial”.

The phrase “we subset” is awkward because (1) there is only 1 author, and (2) I’m not sure “subset” is a verb.


Subset is a verb too. https://en.wiktionary.org/wiki/subset#Verb This usage is somewhat unusual, but not at all without precedent, e.g.: http://www2.sas.com/proceedings/sugi22/CODERS/PAPER79.PDF https://www.r-bloggers.com/5-ways-to-subset-a-data-frame-in-r/ http://adv-r.had.co.nz/Subsetting.html

Jose,


Given that the features are 3-grams, it would be interesting to see the coefficients on each, to try to pick out the ones most (independently) associated with S. I guess you will probably see "Abd". This would be helpful insofar as there are language-typical letter triplets, which may help pinpoint language-level associations, in addition to the manual testing done with the behindthename data.


The features are also 1- and 2-grams. There are too many of them to list them in the paper, but one can of course inspect the strongest features from the t-tests. As you guessed, the strongest feature, or rather set of features, are 3 equivalent ones that tag the same set of 7 names (bd, abd, _abd) with a d value of -2.44 (p = 2.35e-5). The most positive feature is _lau (d = 1.70, p = 9.17e-5) as this tags 7 high ranking names, including #1 and #6 seen in Table 1. I did save the results to files, but only in RDS format. I have added XLSX versions as well for the non-R readers.

“p 8. Clarify what is meant by performance in that context”


Amended to “Because the Muslim countries generally perform poorly (correlation between Muslim% in origin country and general social status = -.63 (Kirkegaard & Fuerst, 2014)),”.

It would be interesting to show an R2 curve showing the changes in R2 with number of predictors, showing what % of features can be retained with minimal loss in accuracy. This could be done by selecting the best predictor triplet from a bivariate regression, then running N-1 regressions and picking the next one, and so on.


One could also just increase the penalty in the lasso. However, this was not an objective of the current study and would take a substantial amount of new coding to include.
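For readers curious what the reviewer's greedy procedure would look like, here is a rough sketch. It was not part of the study: the function name and the toy data are hypothetical, and the R2 values are in-sample, so a real curve would need to be cross-validated. Each step picks the feature that most reduces the residual sum of squares, then orthogonalises the remaining features against it, which gives the same R2 as refitting OLS at each step.

```python
import random


def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def forward_r2_curve(columns, y, max_steps):
    """Greedy forward selection; returns [(feature index, cumulative in-sample R2)]."""
    resid = center(y)
    ss_tot = dot(resid, resid)
    remaining = {j: center(col) for j, col in enumerate(columns)}
    curve = []
    for _ in range(max_steps):
        best_j, best_gain = None, 0.0
        for j, col in remaining.items():
            norm = dot(col, col)
            if norm > 1e-12:  # skip features already explained by chosen ones
                gain = dot(col, resid) ** 2 / norm  # drop in residual sum of squares
                if gain > best_gain:
                    best_j, best_gain = j, gain
        if best_j is None:
            break
        chosen = remaining.pop(best_j)
        norm = dot(chosen, chosen)
        beta = dot(chosen, resid) / norm
        resid = [r - beta * c for r, c in zip(resid, chosen)]
        for j in list(remaining):  # remove the chosen direction from the rest
            b = dot(remaining[j], chosen) / norm
            remaining[j] = [x - b * c for x, c in zip(remaining[j], chosen)]
        curve.append((best_j, 1 - dot(resid, resid) / ss_tot))
    return curve


# Toy data (hypothetical): 12 binary features, only the first two carry signal.
rng = random.Random(0)
n = 300
features = [[int(rng.random() < 0.5) for _ in range(n)] for _ in range(12)]
status = [1.0 * features[0][i] + 0.5 * features[1][i] + rng.gauss(0, 0.5) for i in range(n)]
curve = forward_r2_curve(features, status, max_steps=5)
for index, r2 in curve:
    print(index, round(r2, 3))
```

Plotting cumulative R2 against the number of selected features would give the curve the reviewer describes. Tracing the lasso path over increasing penalties, as suggested in the reply above, is the alternative route.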

There is a slight mismatch between the alphas that are used and the ones that the paper mentions. The ones used are 1, 0.325, 0.55, and 0.775. This does not affect the conclusions.


Good catch. Yes, I recoded these at some point after writing the text.

Given that you are willing to run 5k CV iterations, why not do a straight LOOCV? This would get results as good as or better than 5K CVs with fewer model runs. Also, switching to K folds (with K higher than usual, say 100) would also enable faster iteration, possibly allowing for an optimisation of alpha in addition to lambda. Ideally one would use Bayesian optimization here, but this feels excessive. Overall, I don't expect the results to change substantially were these methods tried.


One could probably improve upon the specific scheme used in this paper in a number of ways. As you note, however, this would be very unlikely to affect the conclusions. I don’t recall exactly why CV was chosen over LOOCV, aside from the usual reasons (high variance in LOOCV).
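To make the comparison concrete, here is a minimal sketch of repeated k-fold CV next to LOOCV. It is not the scheme used in the paper: it fits a one-predictor least-squares line on simulated data rather than an OLS model on LASSO-selected features, and all names and settings are assumptions.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def cv_r2(xs, ys, k, rng):
    """Out-of-sample R2 from one random k-fold split; k == len(xs) gives LOOCV."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    preds = [0.0] * len(xs)
    for fold in range(k):
        test = idx[fold::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            preds[i] = a + b * xs[i]  # prediction for a case the model never saw
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(200)]
ys = [0.8 * x + rng.gauss(0, 1) for x in xs]  # population R2 = 0.64 / 1.64, about .39

loocv = cv_r2(xs, ys, k=len(xs), rng=rng)
repeated = [cv_r2(xs, ys, k=10, rng=rng) for _ in range(50)]
mean_repeated = sum(repeated) / len(repeated)
print(round(loocv, 3), round(mean_repeated, 3), round(max(repeated) - min(repeated), 3))
```

LOOCV yields a single deterministic estimate from n model fits, while repeated k-fold averages over random splits and also shows how much the estimate moves between splits. The square root of a cross-validated R2 gives the equivalent correlation, e.g. .40 ** 0.5 is about .63.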



Typos:
  • "The list predictors" -> predictors
  • "The featured" -> features


Fixed.
This paper is clearer and better than the previous version. I have four minor comments:
• Page 2: The sentence that begins with “I did not have stereotypes . . .” would be clearer if it were divided in two, with the second sentence starting at “However . . .”
• Page 4: Eliminate the phrase “by chance” completely and just say “. . . 5% expected if the null hypothesis of no signal is true.”
• Page 4: Yes, add a footnote explaining why only rare names would produce large effect sizes.
• Let me clarify my statement about the paragraph on pp. 11-12 from my previous review. The paragraph says that the skew in the histogram in Figure 3 is likely caused by the inclusion of non-Danish names. I agree with this assessment. But it might be worth mentioning that Figure 4 shows that three non-European foreign groups of names score much lower on the S factor than European foreign names. This almost certainly contributes to the negative skew in Figure 3 mentioned in pp. 11-12. This is a minor issue, though, and if you don’t want to mention it in the text, then it’s fine. (I just think it’s helpful when a conclusion has multiple pieces of supporting evidence from the data.)

There is no need for any more feedback from me.

-Dr. g
I have re-read the last version, and I have no further comments.
Admin
Dr. g,

I have amended all the sentences you mentioned. I have added the following footnote:


Rare name features result mainly from rare patterns in Danish names and from names from other languages. Since most non-Danish languages are below Danish social status (especially the non-European ones) (Kirkegaard & Fuerst, 2014), the patterns in them will generally have a negative effect size and be rare, producing the left tail of effect sizes. Furthermore, by sampling theory, we expect larger effect sizes to come from smaller samples in general and this will be seen in both tails. In fact, every pattern with an absolute effect size above 1 (n = 93) had a sample size of 71 or fewer (see plots in supplementary materials), and the correlation between absolute effect size and sample size is -.13 [95CI: -.19 to -.07].


which includes what you suggested (non-European names particularly low in social status and particularly foreign, resulting in many rare patterns with large negative effect sizes). The whole reasoning here is a bit elaborate and not really worth going into for this paper and this side-finding. The more distant a language is to Danish, the more linguistic patterns will be associated with it, and of course the more genetically distant the speakers will be etc.
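The sampling-theory part of the footnote can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reanalysis of the data: status scores are drawn from a standard normal, features tag names purely at random (so every true effect is zero), and the function names and sizes are hypothetical.

```python
import random
from statistics import mean, stdev


def cohens_d(group, rest):
    """Standardised mean difference using the pooled standard deviation."""
    n1, n2 = len(group), len(rest)
    pooled_var = ((n1 - 1) * stdev(group) ** 2 + (n2 - 1) * stdev(rest) ** 2) / (n1 + n2 - 2)
    return (mean(group) - mean(rest)) / pooled_var ** 0.5


def mean_abs_d(n_tagged, n_names=500, n_features=300, seed=0):
    """Average |d| over random features that each tag n_tagged names (n_tagged >= 2)."""
    rng = random.Random(seed)
    scores = [rng.gauss(0, 1) for _ in range(n_names)]
    total = 0.0
    for _ in range(n_features):
        tagged = set(rng.sample(range(n_names), n_tagged))
        group = [scores[i] for i in tagged]
        rest = [scores[i] for i in range(n_names) if i not in tagged]
        total += abs(cohens_d(group, rest))
    return total / n_features


rare = mean_abs_d(n_tagged=5)
common = mean_abs_d(n_tagged=100)
print(round(rare, 2), round(common, 2))
```

Even with no true effects, features that tag only a handful of names show much larger absolute effect sizes on average than features tagging many names, which is the pattern the footnote attributes to sampling theory.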

Files updated.
I examined the revised version of the paper, and I am satisfied with its current version.

-Dr. g