Ethical questions surrounding the release of OKCupid data

#1
This post was originally made in the review thread for this paper, but it is not a scientific criticism and so belongs elsewhere. I therefore created this thread for people to discuss ethical issues if they wish. All non-scientific questions on this topic go into this thread. -Emil

--

This paper is highly deficient in its approach to ethical questions around the collection of the data.

I myself might be in the dataset, and have consented only to the uses outlined in OKCupid's Privacy Policy.

Were the administrators of the site contacted? Was permission asked? Did the authors get a response? Did the authors ask for any form of external ethical guidance from their universities in Aarhus and Aalborg? Are the authors affiliated with a university? Do they hope to list this publication on their academic CVs? Who are the editors of this journal? Does this journal have any guidelines concerning ethical review requirements? Are the editors affiliated with a university? Do they themselves have ethical guidelines to follow?

I would need to get answers to all of those questions, some of them written in the paper, to consider this article worthy of publication. Lacking such responses, this paper barely scratches the real issues around open data efforts in science when these efforts involve personal data. This paper simply declares the data "public", and circumvents the really thorny ethical issues and the technical challenges of anonymization. Instead, it spends some time discussing the relatively trivial issue of the scraping of the data, and the circumvention of security measures implemented by OkCupid to prevent precisely such efforts.

Ultimately, it would be very damaging to whatever reputation this journal might have if this article were to be published as is.

The end never justifies the means.

[EDIT: For more opinions on the topic, see https://ironholds.org/blog/when-science-...-internet/ and http://emilygorcenski.com/blog/when-open...ata-breach ]
#2
It seems that there are two potential ethical issues here: (1) was the data collection and publication legal; (2) is the research using this data ‘human subject research’, thus requiring ethics board approval. From what I understand, the collection and publication of the data did not violate any Danish laws. Regarding the second point, at least in the U.S., per the Common Rule, IRB approval is only necessary when dealing with human subjects, defined as "living individual[s] about whom an investigator (whether professional or student) conducting research obtains (1) data through intervention or interaction with the individual, or (2) identifiable private information." Since no experimental/interactive research (1) is being conducted, the only issue is with identifiability (2). (Note: Some people have mentioned concerns about consent, but this arises in the context of human subject research and so pertains to situations where (1) and/or (2) hold.)

Could the authors verify that the data collection/publication was legal given the jurisdiction in which they collected it? If so, could the authors anonymize the data, thus mooting human subject research related concerns?
#3
For completeness, I raised similar questions in the first comment of this thread, but they were rejected as "non-scientific criticism" and moved to another thread by the author, while presumably acting as editor-in-chief.

[EDIT: This comment is no longer relevant. It was posted elsewhere and later moved into the same thread]
#4
I also posted a set of questions here about the research ethics variables of the project (also emailed to the author), but that post has been removed without any communication to me. Addressing research ethics is central to peer-review.

[Edited to add] I also notice that the dataset is now password-protected.
#5
(2016-May-12, 22:02:03)MichaelZimmer Wrote: I also posted a set of questions here about the research ethics variables of the project (also emailed to the author), but that post has been removed without any communication to me. Addressing research ethics is central to peer-review.

[Edited to add] I also notice that the dataset is now password-protected.


I was able to find what happened to my post by clicking on my username. But it does seem like in your case the post was deleted.
#6
(2016-May-12, 22:02:03)MichaelZimmer Wrote: I also posted a set of questions here about the research ethics variables of the project (also emailed to the author), but that post has been removed without any communication to me. Addressing research ethics is central to peer-review.


Michael and pdehaye,

Could you detail your concerns in as logical, constructive and pithy of a manner as possible? (I'm busy with other projects and so don't have a whole lot of time to follow blog discussions, search for missing comments, etc.). As noted above, by my understanding, anonymizing would largely allay the ethical concern (granting the legality of the data obtainment). But you two seem to disagree. Could you clarify why, with reference to specific standards?

[Note: Apparently, Emil started a separate thread: http://openpsych.net/forum/showthread.php?tid=281; but, to me, it seems reasonable to discuss the matter here, so long as the conversation concerns compliance with generally accepted research standards.]
#7
(2016-May-12, 23:15:01)Chuck Wrote:
(2016-May-12, 22:02:03)MichaelZimmer Wrote: I also posted a set of questions here about the research ethics variables of the project (also emailed to the author), but that post has been removed without any communication to me. Addressing research ethics is central to peer-review.


Michael and pdehaye,

Could you detail your concerns in as logical, constructive and pithy of a manner as possible? (I'm busy with other projects and so don't have a whole lot of time to follow blog discussions, search for missing comments, etc.). As noted above, by my understanding, anonymizing would largely allay the ethical concern (granting the legality of the data obtainment). But you two seem to disagree. Could you clarify why, with reference to specific standards?


I do not disagree with MichaelZimmer, and rather suspect I actually agree with him. I would however first need to see what he wrote to establish that, so please restore his post.

I personally believe that:

1) the authors' actions were "grossly negligent in the planning, performing and reporting of research results", which precisely qualifies these actions as "scientific misconduct" under the Danish Code of Conduct for Research Integrity (p. 21), as defined by the Danish Committees for Scientific Dishonesty (UVVU). This determination should however be made by the authors' universities, not by me, the referees or editors.

2) The fact that the authors cannot approach the journal with an IRB clearance or equivalent obtained prior to conducting the data collection is a telltale sign of a significant problem regarding the planning of the study, and a basis for outright rejection or at least extreme caution by the editorial committee, as demonstrated in your post.

3) Any suggestion to retroactively anonymise the dataset, after having publicly released it, is a futile attempt to mitigate irreparable harm. This dataset has already been downloaded by dozens of people checking if they were included in the data, and will surface in its original version on many websites.

4) These actions were in fact illegal under Danish law, if Danish law accurately transposed Section III article 8 of the EU Data Protection Directive 95/46/EC, regarding "Special Categories of Processing". This is however mostly a determination for the universities and the Datatilsynet to make, unless an individual sues in Danish court.

5) These actions were a breach of OkCupid's Terms of Service, as they have themselves publicly stated.

6) These actions were a violation of the US Computer Fraud and Abuse Act (which is relevant since OkCupid users agree to New York jurisdiction for any grievance between the company and users). The authors acknowledge in the paper that they have intentionally incriminated a few friends in their scheme, by reusing their accounts.
#8
Quote:1) the authors' actions were "grossly negligent in the planning, performing and reporting of research results", which precisely qualifies these actions as "scientific misconduct" under the Danish Code of Conduct for Research Integrity (p. 21), as defined by the Danish Committees for Scientific Dishonesty (UVVU). This determination should however be made by the authors' universities, not by me, the referees or editors.

2) The fact that the authors cannot approach the journal with an IRB clearance or equivalent obtained prior to conducting the data collection is a telltale sign of a significant problem regarding the planning of the study, and a basis for outright rejection or at least extreme caution by the editorial committee, as demonstrated in your post.

Comment: Part of the question is whether the research is a concern for ethical review boards in the first place. The determination of whether ethical board approval should be sought would need to be made by authors, referees and editors -- since this determination necessarily stands prior to consultation with a review board. As it is, much social science research doesn't require IRB approval because it's not human subject research, so there is no prior expectation for such approval. In the case of open journals this procedure happens to open itself to problems because a researcher informally "publishes" datasets on submission. I don't see an obvious remedy for this situation.

Quote:5) These actions were a breach of OkCupid's Terms of Service, as they have themselves publicly stated.

6) These actions were a violation of the US Computer Fraud and Abuse Act (which is relevant since OkCupid users agree to New York jurisdiction for any grievance between the company and users). The authors acknowledge in the paper that they have intentionally incriminated a few friends in their scheme, by reusing their accounts.

Comment: If (5) doesn't hold, it's not obvious to me why (6) would. As for OkCupid's policy, the relevant section seems to be:

By accessing this Website... You further agree that you will not use personal information about other users of this Website for any reason without the express prior consent of the user that has provided such information to you.

The authors would have violated the policy only if they logged in as a user to scrape data. If they were able to scrape data from the outside, however, there would be no policy violation. How did the scraper work? (Point 1.)

Quote:4) These actions were in fact illegal under Danish law, if Danish law accurately transposed Section III article 8 of the EU Data Protection Directive 95/46/EC, regarding "Special Categories of Processing". This is however mostly a determination for the universities and the Datatilsynet to make, unless an individual sues in Danish court.

You seem to be referring to the following passage:

Member States shall prohibit the processing of personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, trade-union membership, and the processing of data concerning health or sex life.

Personal data means identifiable data. So the downloading and organizing of the data with the identifiers may have indeed been a violation. (Point 2.) Can someone clarify the matter? On the other hand, a reprocessing of the dataset in which the personal IDs were removed and the data was resaved would seem not to be.

Quote:3) Any suggestion to retroactively anonymise the dataset, after having publicly released it, is a futile attempt to mitigate irreparable harm. This dataset has already been downloaded by dozens of people checking if they were included in the data, and will surface in its original version on many websites.

Comment: Assuming that the data was scraped from the outside and that there was no significant EU member-state data use violation, if the database was reprocessed and IDs were removed, we would seem to be left with the above concern. Let's try to get clarity about points 1 and 2 first.
#9
(2016-May-12, 22:02:03)MichaelZimmer Wrote: I also posted a set of questions here about the research ethics variables of the project (also emailed to the author), but that post has been removed without any communication to me. Addressing research ethics is central to peer-review.

[Edited to add] I also notice that the dataset is now password-protected.


You sent the same questions by email and received an answer. However, because you were impatient and decided to duplicate the questions here as well, I removed the forum version.

I recommend some patience in receiving answers and not duplicating content.
#10
I disagree with the details of your description of the mechanics of seeking ethical approval. I limit myself here to answering Points 1 and 2.

Point 1: The authors state (and their code reflects) that they _had_ to log in to get to the data. Footnote 5 in the paper suggests additional steps they could have taken to get to more data, but they seem not to have taken them. They also stated elsewhere (though I can't recollect where) that they used friends' accounts to do some of the scraping, as they needed multiple accounts to avoid the protection mechanisms.

Point 2: Ultimately, given the intense targeting of Danish female users under 30 that the authors explicitly refer to in their paper, the processing of data in Denmark, and the murky relationship between the authors and their respective universities, I think the appropriate institution to ask for clarification on the legality is the Datatilsynet (i.e. the Danish Data Protection Agency, ultimately in charge of such issues for Denmark). If the editorial board is really considering peer-reviewed publication, I would suggest that the editors require the authors to consult with this office in order to obtain retroactive confirmation that these actions were unproblematic.

Quote:Personal data means identifiable data. So the downloading and organizing of the data with the identifiers may have indeed been a violation. (Point 2.) Can someone clarify the matter? On the other hand, a reprocessing of the dataset in which the personal IDs were removed and the data was resaved would seem not to be.

The data has been downloaded 500 to 600 times in its unredacted form, by the authors' own admission. On its own, the irreversible act of public release of the dataset is in violation of any ethical norm. The comments below concern a hypothetical re-release of the dataset in redacted form.

Personal data is data related to an identified or identifiable individual. The individual need not be identifiable from the dataset alone, however; identifiability must be assessed in conjunction with data available to others (even if not publicly). So even simple removal of the personal IDs would not be sufficient.

This reasoning is based on:
  • recital 26 of the EU Data Protection Directive (emphasis mine):
    Quote: (26) Whereas the principles of protection must apply to any information concerning an identified or identifiable person; whereas, to determine whether a person is identifiable, account should be taken of all the means likely reasonably to be used either by the controller or by any other person to identify the said person; whereas the principles of protection shall not apply to data rendered anonymous in such a way that the data subject is no longer identifiable; whereas codes of conduct within the meaning of Article 27 may be a useful instrument for providing guidance as to the ways in which data may be rendered anonymous and retained in a form in which identification of the data subject is no longer possible;

  • the recent EU Court of Justice Rynes case, which concerns the use of CCTV to film a street (without consent of individuals walking in the street), for the purpose of protection of personal property (i.e. a stronger, more justifiable, motive than testing scientific hypotheses without ethical consent). My interpretation of the authors' actions in light of this judgement might however need to be tested in court.

I have limited myself in this particular post to the points 1 and 2 that you have listed, but am even more convinced now of the rest of the six points I had made in my previous post.

I am further appalled by Kirkegaard's reactions to the criticism he has received from the research community. Were I an editor of this journal, I would find it damaging to keep associating myself with this journal while he remains editor-in-chief. For these reasons, I encourage the editorial board to dissociate themselves from Kirkegaard by either asking him to step down as editor-in-chief, or, should he refuse, themselves resigning from their positions.
 
 