No Fair Sex in Academia: Is Hiring to Editorial Boards Gender Biased?

Forum Bot

Mon 12 Jul 2021 20:51

Bot

Authors have updated the submission to version #2

Forum Bot

Sun 18 Jul 2021 20:04

Bot

Authors have updated the submission to version #3

Reviewer 2

Fri 06 Aug 2021 18:51

Reviewer

Hello. What follows is my review of this paper.

Brief Summary

This study is chiefly about sex differences in research output among academics who sit on elite editorial boards in four of the core fields of social science – in particular, anthropology, political science, psychology, and economics.

The authors collect a novel dataset of individual academics positioned on editorial boards from 30 top-ranked journals in each of these fields during the period between March and June of 2020. The dataset contains some basic demographic information (gender and total years spent publishing) as well as several standard measures of research performance (citations, H-Index, etc.).

The authors use this dataset to investigate whether, among academics sitting on editorial boards, there remain sex differences in research performance that favor men (as such differences typically do among academics overall). They argue that the existence (and/or magnitude) of sex differences in performance among academics on editorial boards can be taken as a crude test of the existence of anti-meritocratic bias against one sex or another, although they occasionally note that this test requires several critical assumptions ("other things being equal").

The authors find that, among the population they study, men achieve much higher research output than women. There is some speculative discussion about which controls are appropriate (e.g., years spent publishing) and whether age confounds may affect the interpretation of their results, but broadly the authors conclude that bias against women in editorial board selection likely does not exist, or even that existing bias acts to disfavor men.

Lastly, the authors conduct an online survey on Prolific of 425 individuals with PhDs, which is then restricted to the 231 individuals who say they work in academia or are currently publishing scientific papers. The survey asks respondents several normative (and also a few empirical) questions about whether diversity by age or gender is important to have on academic editorial boards, whether there are age or sex differences in academic ability, and about their own evaluation of the current preferences of those who hire editorial board members. Most importantly, they find that survey respondents are far more willing to reveal a preference for discriminating against men than vice versa. They conclude that this is consistent with their empirical conclusions in the previous part of their paper.

Specific Points of Concern

Organization.

The paper is not very well organized. As just one example among many, why is Nielsen (2015) first discussed only at the end of the paper, and not in the introduction during the discussion of existing literature on bias in academic selection? There also seems to be more emphasis on the literature on sex differences in cognitive ability and academic/vocational interests than on the literature concerning bias in selection and hiring of academics. As this paper is primarily about the latter, the literature review should mostly concern previous research in this domain (i.e., the various methods that other researchers have used to test for such biases, what they have found, how the authors' "simple test" compares to these other methods, what its advantages and disadvantages are, etc.). It doesn't seem to me that the authors provide a full overview of this literature. It also seems that the authors' discussion of the literature on sex differences in intelligence is somewhat slanted. My understanding is that some studies do find that women have an advantage on verbal skills such as reading comprehension and verbal fluency, notwithstanding Lynn's (2021) study of the Wechsler Intelligence Scale, but this is never mentioned. See, e.g., Lynn and Mikk (2009).

The discussion in the introduction of the political leanings of different fields of social science does not seem highly relevant to the main results of the paper. It is odd that Table 1 is the first table presented in the paper. The authors admit they are basically unable to conduct formal tests concerning the issue at hand: "To properly test for any gender bias arising from political opinion between subjects we would need to include more subjects" (p. 12).

The discussion of how the authors' selection of fields might "bias" their results seems confused, and this point is related to the necessity of a clearer discussion of the caveats regarding what conclusions the authors' test allows them to draw. The question of which scientific domains the authors have chosen to study relates more to what specific question they are trying to answer ("Does academia in general have an anti-female bias?" or "Do certain particular academic fields have an anti-female bias?") than to any potential bias in their findings or methodology.

In general, the introduction should focus more explicitly on what exactly the authors are contributing to the existing literature with their empirical investigation, and why the authors believe the empirical tests of their paper will allow them to draw the conclusions they do. The problems and caveats with their central test (which are vaguely gestured at) should be more directly discussed – more on this below.

Concerns about the dataset.

The summary statistics in Table 2 indicate that the authors' dataset contains entries that are obviously erroneous. Who exactly has an H-Index of 356 in any of anthropology, political science, psychology, or economics? How is it possible that anyone has an H-Index Since 2016 of 2455? For each individual researcher, mustn't his H-Index Since 2016 always be no greater than his H-Index? These data seem impossible. The authors may argue that they use robust regression to alleviate concerns about human error in data collection, and that the errors are Google Scholar's fault rather than their own, but the presence of these figures makes the reader skeptical that the dataset provides an accurate representation of academics' research output.
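To illustrate the kind of consistency check I have in mind, here is a minimal sketch in Python. It assumes the recent figure is computed from the same papers using only citations received since 2016, so each paper's recent count cannot exceed its all-time count and the recent H-Index cannot exceed the all-time one (the two can coincide). The field names `h_all` and `h_recent` and all rows are hypothetical; only the value 2455 is taken from Table 2.

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def flag_inconsistent(records):
    """Names of rows whose recent h-index exceeds the all-time h-index."""
    return [r["name"] for r in records if r["h_recent"] > r["h_all"]]


# Hypothetical per-paper citation counts for one researcher.
all_time = [10, 8, 5, 4, 3]
since_2016 = [6, 3, 3, 1, 0]  # each entry <= the all-time entry for that paper
print(h_index(all_time), h_index(since_2016))

# Hypothetical rows; row "D" carries the maximum reported in Table 2.
records = [
    {"name": "A", "h_all": 40, "h_recent": 25},
    {"name": "B", "h_all": 12, "h_recent": 12},  # equality is possible
    {"name": "D", "h_all": 60, "h_recent": 2455},
]
print(flag_inconsistent(records))
```

Rows flagged this way could then be checked by hand against the source profile before any regression is run.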

I also wonder how it can be that the H-Index and the H-Index Since 2016 have a correlation of 0.37, but the H-Index and the Transformed H-Index Since 2016 have a correlation of 0.85? In general, I am surprised by the uniquely low correlations between all of the other variables and the H-Index Since 2016. Perhaps this is an artifact of the log transform and the nature of the H-Index measure, but my intuition goes the other way. What do the authors think?

The authors should provide a list of the 30 academic journals from each field used in the paper. It could go in the appendix. The reader cannot tell, for example, whether the Journal of Political Economy was in fact excluded. It should not have been, as it is an elite "Top 5" journal in economics.

Methodology.

A more natural way to check the results for robustness to errors and outliers (and to some degree to genuine sex differences in variance of ability that especially affect the right tail of output) would be to simply winsorize the dependent variables (at, say, the 98th percentile for all measures) and re-run the analysis on the winsorized dataset.
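A minimal sketch of upper-tail winsorization at the 98th percentile follows. It is written in plain Python; the percentile rule (linear interpolation between order statistics) is my assumption, and other conventions would give slightly different caps. The sample values are hypothetical apart from the 2455 figure from Table 2.

```python
def percentile(values, q):
    """q-th percentile (0-100), interpolating linearly between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def winsorize_upper(values, q=98.0):
    """Cap every value above the q-th percentile at that percentile."""
    cap = percentile(values, q)
    return [min(v, cap) for v in values]


# Hypothetical outcome variable: 99 ordinary values plus one extreme entry.
values = list(range(1, 100)) + [2455]
winsorized = winsorize_upper(values, q=98.0)

print(sum(values) / len(values))          # mean before capping
print(sum(winsorized) / len(winsorized))  # mean after capping
```

Observations are capped rather than dropped, so the sample size and the ordering of researchers below the cap are unchanged.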

The use of robust regression is interesting, but, especially as it does not appear to affect the main conclusions, it could be relegated to the appendix. A more important issue is whether the authors are using heteroskedasticity-robust standard errors (White 1980) in their main tables (Tables 6 and 7). This is important for accurate inference and to ensure that the authors' statistical tests are correctly sized. Furthermore, the authors evidently are clustering their standard errors at the level of the individual researcher (which will tend, although not always, to produce smaller standard errors than clustering at a higher level), but this is not the most conservative approach to inference and ought to be explicitly justified if the authors intend it, as unobserved factors affecting research performance may well be correlated within disciplines or within editorial boards.
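For concreteness, the sketch below compares three standard errors for the slope in a one-regressor model with an intercept: the classical one, the White (1980) HC0 version, and a cluster-robust version. It is not the authors' specification. The data, the female dummy `x`, and the board labels are all hypothetical, and no small-sample corrections are applied.

```python
import math
from collections import defaultdict


def slope_standard_errors(x, y, clusters):
    """OLS slope of y on x (with intercept) and three standard errors for it."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Classical: assumes one common error variance for all observations.
    s2 = sum(e * e for e in resid) / (n - 2)
    se_classical = math.sqrt(s2 / sxx)

    # White (1980) HC0: each observation contributes its own squared residual.
    se_hc0 = math.sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx

    # Cluster-robust: sum the scores within each cluster before squaring.
    scores = defaultdict(float)
    for d, e, g in zip(dx, resid, clusters):
        scores[g] += d * e
    se_cluster = math.sqrt(sum(s * s for s in scores.values())) / sxx

    return {"slope": slope, "se_classical": se_classical,
            "se_hc0": se_hc0, "se_cluster": se_cluster}


# Hypothetical data: x = 1 for women, y = a normalized output measure.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [3.0, 5.0, 4.0, 6.0, 2.0, 3.0, 1.0, 2.0]
boards = ["j1", "j1", "j2", "j2", "j1", "j1", "j2", "j2"]  # hypothetical board ids

print(slope_standard_errors(x, y, boards))
```

When every observation is its own cluster, the cluster-robust figure coincides with HC0, so clustering on the individual researcher adds nothing beyond White's correction if each researcher appears only once. Clustering on the board or the discipline is what allows for correlated unobservables.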

The discussion section about whether years spent publishing is an appropriate control (p. 17) is not very clear. It is hard to follow and largely speculative. The authors speculate about confounding with age, but as they have no variable for age in their dataset, they have no way to test any relevant model.

I may have misunderstood how the authors normalize their dependent variables, but if they normalize at the journal level (and journals are nested within fields), why are there main effects for different fields in columns (9–12) of Tables 6 and 7? (This might be a misunderstanding on my part of what exactly the authors are doing.)

Caveats regarding the authors' "simple test."

The test that the authors propose for whether there is discrimination against women in editorial board selection is not as simple as they at times suggest. For one, it requires that editorial board selection is determined solely on the basis of whether a single latent "merit" variable surpasses a fixed threshold. For another, it requires that there are no mean or variance differences in the sex-specific distributions of this latent "merit" variable. If this second requirement fails, then even under the single threshold model, the mean output of the men and women who surpass the threshold will not necessarily be equal. One gets the sense that the authors are aware of all of this, but it is not discussed with as much emphasis or detail as it should be. For example, the authors admit there are "two competing explanations" for their findings, i.e., "that men are in general higher performing academics than women and that journals are biased in favor of women" (p. 20). This needs more detailed and explicit discussion, and should be discussed earlier in the paper at the point when the basic test is introduced.

Of course, the size of the effect of mean or variance differences in the latent trait(s) on the ultimate differences in mean performance above a single fixed threshold is an empirical question, and the effect may be small in magnitude. But the fact that the authors' test relies on a single threshold model of selection combined with equal means and variances of latent ability should be explicitly stated. (As such, it is evidently not a highly precise test of the existence of bias.) This issue seems especially relevant also given that the authors appear sympathetic to the "higher male variability" hypothesis in general. Given the authors' beliefs about this hypothesis, what effect sizes do they expect for differences above a fixed threshold if selection is meritocratic? How do their empirical findings line up with these expectations? None of these questions is answered explicitly. A second, broader problem – related to reliance on a threshold model of selection – is that optimal editorial board selection may be multi-dimensional, with research performance being just one dimension of qualification. Indeed, it is not entirely clear that filling editorial boards with the "best" researchers "from the top down" would be ideal, as these "service jobs" in academia typically take time away from one's own research.
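To make the threshold-model point concrete, here is a small sketch under the added assumption that latent merit is normally distributed within each group. The mean above a cutoff t is then mu + sigma * pdf(a) / (1 - cdf(a)), with a = (t - mu) / sigma. The parameter values below are purely illustrative and are not estimates from the paper or from the literature.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_sf(z):
    """Standard normal upper-tail probability, 1 - cdf(z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def mean_above_threshold(mu, sigma, threshold):
    """E[X | X > threshold] for X ~ Normal(mu, sigma^2)."""
    a = (threshold - mu) / sigma
    return mu + sigma * normal_pdf(a) / normal_sf(a)


threshold = 2.0  # hypothetical common, unbiased cutoff on the latent scale

baseline = mean_above_threshold(0.0, 1.0, threshold)
wider = mean_above_threshold(0.0, 1.1, threshold)    # same mean, larger variance
shifted = mean_above_threshold(0.2, 1.0, threshold)  # higher mean, same variance

print(baseline, wider, shifted)
print(wider - baseline, shifted - baseline)  # gaps that arise with no bias at all
```

In this illustration both the wider and the shifted distribution yield a higher mean among those selected, even though everyone faces the same cutoff. The authors could report the analogous gaps implied by the variance ratios they consider plausible and compare them with the gaps in Tables 6 and 7.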

Survey results.

I don't have any substantial criticisms to make of the authors' online survey. Perhaps the most relevant results would be better suited to a visual representation rather than the representation in Table 9, so that readers could see the entire distribution of responses.

Conclusion

I think the issues discussed above are weighty enough that the paper needs substantial revision and should not be published in its current state. However, the novel collection of data and the main results presented are of great scientific interest, so with more work I think it could ultimately make for a nice paper.