Linguistic features in names and social status: an exploratory study of 1,890 Danish first names

Open Differential Psychology, Dec. 12, 2018, ISSN: 2446-3884


A dataset of the relative general social status (S factor) of 1,890 first names of persons living in Denmark was obtained from a previous study. A total of 1,100 linguistic features were generated from n-grams augmented by regex, and each name was scored on each feature. An initial check using t-tests showed strong signal in the features taken as a whole (42.5% of p values were < .05), due mostly to low-status names having rarer patterns. OLS and lasso regression were then used to combine the linguistic features into a single model, and the results showed strong evidence of signal in the data. As a control, the main geographic origin of each name was inferred from an external data source. I validated this by comparing social status by origin group with data from official sources, r = .72, n = 28. The main origin of each name was then entered as a covariate and the models were rerun. The results indicated that subtle linguistic features still provide substantial incremental validity, though a precise numerical estimate was difficult to arrive at. I validated this conclusion by training the model only on the subset of data identified as Danish. Out-of-sample predictive validity of the models was substantial in general, r = .75 (including the origin covariate), and r = .46 in the Danish subset (linguistic features only). I conclude that it is possible to train fairly accurate social status predictors from subtle linguistic patterns in names. It is possible that humans pick up on such cues to inform social perception when limited data is available.
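To illustrate what "n-grams augmented by regex" can mean in practice, here is a minimal Python sketch. It is an assumption-laden illustration, not the study's actual feature set or code: the function names (`ngram_patterns`, `build_features`, `score_names`), the choice of start and end anchors as the regex augmentation, the binary scoring, the `min_count` threshold, and the example names are all hypothetical.

```python
import re
from collections import Counter


def ngram_patterns(name, n_max=2):
    """Return regex patterns for each n-gram in `name`.

    Each n-gram yields an unanchored pattern, plus a start-anchored (^)
    or end-anchored ($) variant when it occurs at that position.
    """
    s = name.lower()
    patterns = set()
    for n in range(1, n_max + 1):
        for i in range(len(s) - n + 1):
            gram = re.escape(s[i:i + n])
            patterns.add(gram)
            if i == 0:
                patterns.add("^" + gram)
            if i + n == len(s):
                patterns.add(gram + "$")
    return patterns


def build_features(names, n_max=2, min_count=2):
    """Keep patterns that match at least `min_count` names (assumed cutoff)."""
    counts = Counter()
    for name in names:
        counts.update(ngram_patterns(name, n_max))
    return sorted(p for p, c in counts.items() if c >= min_count)


def score_names(names, features):
    """Score each name on each feature: 1 if the pattern matches, else 0."""
    compiled = [re.compile(p) for p in features]
    return [
        [int(bool(rx.search(name.lower()))) for rx in compiled]
        for name in names
    ]


# Hypothetical example names, chosen only to exercise the functions.
names = ["Anna", "Anne", "Hanne", "Brian", "Christian"]
features = build_features(names)
matrix = score_names(names, features)
```

The resulting name-by-feature matrix is the kind of input that per-feature t-tests or a penalized regression such as lasso could then take, with the social status score as the outcome.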

S factor, Denmark, social inequality, n-gram, penalized regression, social status, lasso, first name, computational linguistics, variable selection, given name


Review time 751 days