
[ODP] The OKCupid dataset: A very large public dataset of dating site users
I find the paper OK. I noticed only one thing. You mention that only 0.7% have answered all 14 questions on the cognitive test, but in Table 3 you give a sample size of about 55,000. I guess this should be marked as the maximal sample size, because only about 400 persons have answers on all 14 questions.
Admin
Yes, this is the maximum sample size. I will update the table caption. IRT takes into account the exact amount of missing data to estimate scores of persons who only have partial data and makes a best guess at how they would have answered had they attempted the item. I am not sure exactly how this works, but it seems to be standard practice.
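
For readers wondering how partial data can still yield a score, here is a minimal sketch of one common approach (not necessarily what was done in the paper): expected a posteriori (EAP) scoring under a two-parameter logistic model, where unanswered items are simply left out of the likelihood. The item parameters and the function names `p_correct` and `eap_score` are made up for illustration.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap_score(responses, items, grid_size=81, lo=-4.0, hi=4.0):
    """EAP ability estimate and posterior SD; None marks an unanswered item."""
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # unnormalised standard normal prior
        for x, (a, b) in zip(responses, items):
            if x is None:
                continue  # unanswered item adds nothing to the likelihood
            p = p_correct(theta, a, b)
            w *= p if x == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)

# Hypothetical (discrimination, difficulty) pairs for four items.
ITEMS = [(1.2, -1.0), (1.0, 0.0), (1.5, 0.5), (0.8, 1.0)]

full = eap_score([1, 1, 0, 1], ITEMS)            # answered every item
partial = eap_score([1, None, None, None], ITEMS)  # answered only the first item
nothing = eap_score([None, None, None, None], ITEMS)  # falls back to the prior
```

In this sketch a person with fewer answers still gets a score, but with a larger posterior SD, and a person with no answers is scored at the prior mean.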
Admin
I have added the complete sample sizes for each of the 13 tests. The 55k number is the number of persons who answered the first item.
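
To make the distinction between the maximum and the complete sample size concrete, here is a small sketch with made-up responses (the data and the function name `sample_sizes` are illustrative only, not taken from the dataset):

```python
def sample_sizes(rows):
    """Return (number of answers per item, number of persons with no missing answers)."""
    n_items = len(rows[0])
    per_item = [sum(r[j] is not None for r in rows) for j in range(n_items)]
    complete = sum(all(x is not None for x in r) for r in rows)
    return per_item, complete

# Made-up responses: 1 = correct, 0 = incorrect, None = not answered.
ROWS = [
    [1, 0, 1],
    [1, None, None],
    [0, 1, None],
    [None, None, None],
]

per_item, complete = sample_sizes(ROWS)
max_n = max(per_item)  # the "maximum sample size" in the sense used above
```

Here the first item has the most answers, so it sets the maximum sample size, while only one person counts as a complete case.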
This revision looks OK to me.
Admin
That brings the approval count to 2/3.


I have reviewed it and offer a few comments. These should be quick to resolve:

I reviewed the PDF version identified as 4. October 2016

I found that the paper clearly defined the procedures used and the applied methods. The one thing that was missing, and that I think should be included somewhere, is a clear statement of the objective. The only thing I could find was in the second paragraph: "As an example of the analyses one can do with the dataset, a cognitive ability test is constructed from 14 suitable items." Was this the entire objective? If so, is there a way to give it a section title? My impression of the paper is that it is an exercise that demonstrates that some information can be obtained from scraping and that it can be tested against its own contents to indicate a degree of validity.

Related to the missing objective... The reader can obviously infer, to some degree, what this methodology might be used for in various research efforts.  Even so, I think it would be helpful for the authors to tell us what they see as an end-use value of the approach.

In the prior discussion, Emil wrote:
"The analyses presented in the paper are only presented to showcase what kind of analyses one can do with the dataset and show that one finds known results when doing so (successful calibration).

I don't understand how this was not clear to you. Let me know if you have any suggestions for how to make this more clear if you think it should be."

I still think a few words to convey this thought to the reader would be helpful.

A few typo comments:

Page 2, first paragraph
Change "may simple not have" to "may simply not have."

Page 2, third paragraph
I find the last sentence in the paragraph to be somewhat unclear. In the part that includes this: "linked biological relatives to do a behavioral genetic analysis of the relationship between behavioral problems" it might read better if there were a comma after the word "relatives."

Page 2, item 3 at the bottom
The text reads: "The seemingly decline in uses over time is perhaps..."
I assume the authors meant "seeming."  

Page 3, paragraph 3
The text reads: "This means that unlike questions asked by scientists, the questions concern many odd domains not normally considered by scientists and, reversely, do not concern many domains of interest to scientists."
I was thrown off by "reversely."  Is there a better wording possible?
Admin
Thanks for the review, Bob. I will work on a revision to fix the problems.
Admin
Bob,

Quotes are yours.


I found that the paper clearly defined the procedures used and the applied methods. The one thing that was missing, and that I think should be included somewhere, is a clear statement of the objective. The only thing I could find was in the second paragraph: "As an example of the analyses one can do with the dataset, a cognitive ability test is constructed from 14 suitable items." Was this the entire objective? If so, is there a way to give it a section title? My impression of the paper is that it is an exercise that demonstrates that some information can be obtained from scraping and that it can be tested against its own contents to indicate a degree of validity.

Related to the missing objective... The reader can obviously infer, to some degree, what this methodology might be used for in various research efforts.  Even so, I think it would be helpful for the authors to tell us what they see as an end-use value of the approach.


It seemed clear enough to me. The curse of knowledge, perhaps. Given that multiple people found it hard to understand what the goal was, I have added the following to the end of the Introduction:

The purpose of this article is to describe the data collection process including sampling procedures and present some example analyses done using the dataset to showcase its usefulness for psychological research. Our hope is that others will use the dataset for their own purposes and do so in a transparent way that allows for large-sample, reproducible research.

Hopefully, it should now be clear what the goal was, what the point of the article is and what the long-term/end goal is.

Page 2, first paragraph

Change "may simple not have" to "may simply not have."



Fixed.

Page 2, third paragraph
I find the last sentence in the paragraph to be somewhat unclear. In the part that includes this: "linked biological relatives to do a behavioral genetic analysis of the relationship between behavioral problems" it might read better if there were a comma after the word "relatives."


Changed to:

... linked biological relatives so as to enable one to do a behavioral genetic analysis of the relationship between behavioral problems and the timing of menarche.


Page 2, item 3 at the bottom
The text reads: "The seemingly decline in uses over time is perhaps..."
I assume the authors meant "seeming."


Fixed.


Page 3, paragraph 3
The text reads: "This means that unlike questions asked by scientists, the questions concern many odd domains not normally considered by scientists and, reversely, do not concern many domains of interest to scientists."
I was thrown off by "reversely."  Is there a better wording possible?


Changed to:

This means that unlike questions asked by scientists, the questions concern many odd domains not normally considered by scientists. Likewise, the questions do not include many that scientists would have included such as items from standard personality inventories.

Let me know if this is better.

---

The files have been updated.
The changes were quite helpful, especially in the few areas that I thought were unclear.

I approve for publication.