Ethical questions surrounding the release of OKCupid data
This post was originally made in the review thread of this paper, but it is not a scientific criticism and so belongs somewhere else. I therefore created this thread for people to discuss ethical issues if they so wish. All non-scientific questions on this topic go into this thread. -Emil

--

This paper is highly deficient in its approach to ethical questions around the collection of the data.

I myself might be in the dataset, and have consented only to the uses outlined in OKCupid's Privacy Policy.

Were the administrators of the site contacted? Was permission asked? Did the authors get a response? Did the authors ask for any form of external ethical guidance from their universities in Aarhus and Aalborg? Are the authors affiliated with a university? Do they hope to list this publication on their academic CVs? Who are the editors of this journal? Does this journal have any guidelines concerning ethical review requirements? Are the editors affiliated with a university? Do they themselves have ethical guidelines to follow?

I would need to get answers to all of those questions, some of them written in the paper, to consider this article worthy of publication. Lacking such responses, this paper barely scratches the real issues around open data efforts in science when these efforts involve personal data. This paper simply declares the data "public", and circumvents the really thorny ethical issues and the technical challenges of anonymization. Instead, it spends some time discussing the relatively trivial issue of the scraping of the data, and the circumvention of security measures implemented by OkCupid to prevent precisely such efforts.

Ultimately, it would be very damaging to whatever reputation this journal might have if this article were to be published as is.

The end never justifies the means.

[EDIT: For more opinions on the topic, see https://ironholds.org/blog/when-science-goes-bad-consent-data-and-doubling-down-on-the-internet/ and http://emilygorcenski.com/blog/when-open-science-isn-t-the-okcupid-data-breach ]
It seems that there are two potential ethical issues here: (1) was the data collection and publication legal; (2) is the research using this data ‘human subject research’, thus requiring ethics board approval. From what I understand, the collection and publication of the data did not violate any Danish laws. Regarding the second point, at least in the U.S., per the Common Rule, IRB approval is only necessary when dealing with human subjects, where that is a "living individual[s] about whom an investigator (whether professional or student) conducting research obtains (1) data through intervention or interaction with the individual, or (2) identifiable private information." Since no experimental/interactive research (1) is being conducted, the only issue is with identifiability (2). (Note: Some people have mentioned concerns about consent, yet this arises in the context of human subject research and so pertains to situations where (1) and/or (2) hold.)

Could the authors verify that the data collection/publication was legal given the jurisdiction in which they collected it? If so, could the authors anonymize the data, thus mooting human subject research related concerns?
For completeness, I raised similar questions in the first comment of this thread, but they were rejected as "non-scientific criticism" and moved to another thread by the author, while presumably acting as editor-in-chief.

[EDIT: This comment is no longer relevant. It was posted elsewhere and later moved into the same thread]
I also posted a set of questions here about the research ethics variables of the project (also emailed to the author), but that post has been removed without any communication to me. Addressing research ethics is central to peer-review.

[Edited to add] I also notice that the dataset is now password-protected.
I was able to find what happened to my post by clicking on my username. But it does seem like in your case the post was deleted.
Michael and pdehaye,

Could you detail your concerns in as logical, constructive and pithy a manner as possible? (I'm busy with other projects and so don't have a whole lot of time to follow blog discussions, search for missing comments, etc.). As noted above, by my understanding, anonymizing would largely allay the ethical concern (granting the legality of the data obtainment). But you two seem to disagree. Could you clarify why, with reference to specific standards?

[Note: Apparently, Emil started a separate thread: http://openpsych.net/forum/showthread.php?tid=281; but, to me, it seems reasonable to discuss the matter here, so long as the conversation concerns compliance with generally accepted research standards.]

I do not disagree with MichaelZimmer, and rather suspect I actually agree with him. I would however first need to see what he wrote to establish that, so please restore his post.

I personally believe that:

1) the authors' actions were "grossly negligent in the planning, performing and reporting of research results", which precisely qualifies these actions as "scientific misconduct" under the Danish Code of Conduct for Research Integrity (p. 21), as defined by the Danish Committees for Scientific Dishonesty (UVVU). This determination should however be made by the authors' universities, not by me, the referees or editors.

2) The fact that the authors cannot approach the journal with an IRB clearance or equivalent obtained prior to conducting the data collection is a telltale sign of a significant problem regarding the planning of the study, and a basis for outright rejection or at least extreme precaution by the editorial committee, as demonstrated in your post.

3) Any suggestion to retroactively anonymise the dataset, after having publicly released it, is a futile attempt to mitigate irreparable harm. This dataset has already been downloaded by dozens of people checking if they were included in the data, and will surface in its original version on many websites.

4) These actions were in fact illegal under Danish law, if Danish law transcribed accurately Section III article 8 of the EU Data Protection Directive 95/46/EC, regarding "Special Categories of Processing". This is however mostly a determination for the universities and the Datatilsynet to make, unless an individual sues in Danish court.

5) These actions were a breach of OkCupid's Terms of Service, as they have themselves publicly stated.

6) These actions were a violation of the US Computer Fraud and Abuse Act (which is relevant since OkCupid users agree to New York jurisdiction for any grievance between the company and users). The authors acknowledge in the paper that they have intentionally incriminated a few friends in their scheme, by reusing their accounts.
1) the authors' actions were "grossly negligent in the planning, performing and reporting of research results", which precisely qualifies these actions as "scientific misconduct" under the Danish Code of Conduct for Research Integrity (p. 21), as defined by the Danish Committees for Scientific Dishonesty (UVVU). This determination should however be made by the authors' universities, not by me, the referees or editors.

2) The fact that the authors cannot approach the journal with an IRB clearance or equivalent obtained prior to conducting the data collection is a telltale sign of a significant problem regarding the planning of the study, and a basis for outright rejection or at least extreme precaution by the editorial committee, as demonstrated in your post.


Comment: Part of the question is whether the research is a concern for ethical review boards in the first place. The determination of whether ethical board approval should be sought would need to be made by authors, referees and editors -- since this determination necessarily stands prior to consultation with a review board. As it is, much social science research doesn't require IRB approval because it's not human subject research, so there is no prior expectation for such approval. In the case of open journals this procedure happens to open itself to problems because a researcher informally "publishes" datasets on submission. I don't see an obvious remedy for this situation.

5) These actions were a breach of OkCupid's Terms of Service, as they have themselves publicly stated.

6) These actions were a violation of the US Computer Fraud and Abuse Act (which is relevant since OkCupid users agree to New York jurisdiction for any grievance between the company and users). The authors acknowledge in the paper that they have intentionally incriminated a few friends in their scheme, by reusing their accounts.


Comment: If (5) doesn't hold, it's not obvious to me why (6) would. As for OKCupid's policy, the relevant section seems to be:

By accessing this Website... You further agree that you will not use personal information about other users of this Website for any reason without the express prior consent of the user that has provided such information to you.

The authors would have violated the policy only if they logged in as a user to scrape data. If they were able to scrape data from the outside, however, there would be no policy violation. How did the scraper work? (Point 1.)

4) These actions were in fact illegal under Danish law, if Danish law transcribed accurately Section III article 8 of the EU Data Protection Directive 95/46/EC, regarding "Special Categories of Processing". This is however mostly a determination for the universities and the Datatilsynet to make, unless an individual sues in Danish court.


You seem to be referring to the following passage:

Member States shall prohibit the processing of personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, trade-union membership, and the processing of data concerning health or sex life.

Personal data means identifiable data. So the downloading and organizing of the data with the identifiers may have indeed been a violation. (Point 2.) Can someone clarify the matter? On the other hand, a reprocessing of the dataset in which the personal IDs were removed and the data was resaved would seem not to be.

3) Any suggestion to retroactively anonymise the dataset, after having publicly released it, is a futile attempt to mitigate irreparable harm. This dataset has already been downloaded by dozens of people checking if they were included in the data, and will surface in its original version on many websites.


Comment: Assuming that the data was scraped from the outside and that there was no significant EU member-state data use violation, if the database was reprocessed and IDs were removed, we would seem to be left with the above concern. Let's try to get clarity about points 1 and 2 first.
Admin
I also posted a set of questions here about the research ethics variables of the project (also emailed to the author), but that post has been removed without any communication to me. Addressing research ethics is central to peer-review.

[Edited to add] I also notice that the dataset is now password-protected.


You sent the same questions per email to which you received an answer. However, because you were impatient and decided to duplicate the questions here as well, I removed the forum version.

I recommend some patience in receiving answers and not duplicating content.
I disagree with the details of your description of the mechanics of seeking ethical approval. I limit myself here to answering Points 1 and 2.

Point 1: The authors state (and their code reflects) that they _had_ to log in to get to the data. Footnote 5 in the paper suggests additional steps they could have taken to get to more data, but they seem not to have taken them. They also stated elsewhere (but I can't recollect where) that they used friends' accounts to do some of the scraping, as they needed multiple accounts to avoid the protection mechanisms.

Point 2: Ultimately, given the intense targeting of Danish female users under 30 that the authors explicitly refer to in their paper, the processing of data in Denmark, and the murky relationship between the authors and their respective universities, I think the appropriate institution to ask for clarification on the legality is the Datatilsynet (i.e. the Danish Data Protection Office, ultimately in charge of such issues for Denmark). If the editorial board is really considering peer-reviewed publication, I would suggest that the editors force the authors to consult with this office in order to obtain retroactive confirmation that these actions were unproblematic.

Personal data means identifiable data. So the downloading and organizing of the data with the identifiers may have indeed been a violation. (Point 2.) Can someone clarify the matter? On the other hand, a reprocessing of the dataset in which the personal IDs were removed and the data was resaved would seem not to be.


The data has been downloaded 500 to 600 times in its unredacted form, by the authors' own admission. On its own, the irreversible act of public release of the dataset is in violation of any ethical norm. The comments below concern a hypothetical re-release of the dataset in redacted form.

Personal data is data related to an identified or identifiable individual. This data need not be identifiable from exclusively within the dataset, however, but needs to be considered in conjunction with data available to others (even if not publicly). So even simple removal of the personal IDs would not be sufficient.

This reasoning is based on:
  • recital 26 of the EU Data Protection Directive (emphasis mine):
  • (26) Whereas the principles of protection must apply to any information concerning an identified or identifiable person; whereas, to determine whether a person is identifiable, account should be taken of all the means likely reasonably to be used either by the controller or by any other person to identify the said person; whereas the principles of protection shall not apply to data rendered anonymous in such a way that the data subject is no longer identifiable; whereas codes of conduct within the meaning of Article 27 may be a useful instrument for providing guidance as to the ways in which data may be rendered anonymous and retained in a form in which identification of the data subject is no longer possible;

  • the recent EU Court of Justice Rynes case, which concerns the use of CCTV to film a street (without consent of individuals walking in the street), for the purpose of protection of personal property (i.e. a stronger, more justifiable, motive than testing scientific hypotheses without ethical consent). My interpretation of the authors' actions in light of this judgement might however need to be tested in court.


I have limited myself in this particular post to the points 1 and 2 that you have listed, but am even more convinced now of the rest of the six points I had made in my previous post.

I am further appalled by Kirkegaard's reactions to the criticism he has received from the research community. Were I an editor of this journal, I would find it damaging to keep associating myself with this journal while he remains editor-in-chief. For these reasons, I encourage the editorial board to dissociate themselves from Kirkegaard by either asking him to step down as editor-in-chief, or, should he refuse, themselves resigning from their positions.
I also posted a set of questions here about the research ethics variables of the project (also emailed to the author), but that post has been removed without any communication to me. Addressing research ethics is central to peer-review.

[Edited to add] I also notice that the dataset is now password-protected.


You sent the same questions per email to which you received an answer. However, because you were impatient and decided to duplicate the questions here as well, I removed the forum version.

I recommend some patience in receiving answers and not duplicating content.


This is active re-structuring of debate, and akin to censorship. The proper reaction to avoid duplication would have been to post your answer to Michael Zimmer on the public forum, not in a private email.
Note: non-scientific discussion of the dataset is moved to this thread. Peer review threads are for just that, actual scientific peer review. Yes, that does mean that your discussion posts about that topic go into that thread.


It appears that the "non-scientific" thread, which is providing peer-review of the research methodology, has been deleted.

While the link you have in "this thread" above doesn't go anywhere, I do now see the "ethical discussions" thread. Was that temporarily deleted?

More to the point, why do you not consider peer-review of the methodology used in this paper appropriate for this "open" peer-review process and forum?
I also posted a set of questions here about the research ethics variables of the project (also emailed to the author), but that post has been removed without any communication to me. Addressing research ethics is central to peer-review.

[Edited to add] I also notice that the dataset is now password-protected.


You sent the same questions per email to which you received an answer. However, because you were impatient and decided to duplicate the questions here as well, I removed the forum version.

I recommend some patience in receiving answers and not duplicating content.


You responded to my email, but you did not answer my questions.
The lead author and some members of the editorial board are presently attending an international conference in London. Discussion of the matter will have to resume next week. I requested that any accessible copies of the dataset be removed from OSF.
More to the point, why do you not consider peer-review of the methodology used in this paper appropriate for this "open" peer-review process and forum?


See here.
I note that I have been banned from posting on this site for a while and am again allowed to post.

Personal data means identifiable data. So the downloading and organizing of the data with the identifiers may have indeed been a violation. (Point 2.) Can someone clarify the matter? On the other hand, a reprocessing of the dataset in which the personal IDs were removed and the data was resaved would seem not to be.


This reasoning is based on:
  • recital 26 of the EU Data Protection Directive (emphasis mine):
  • (26) Whereas the principles of protection must apply to any information concerning an identified or identifiable person; whereas, to determine whether a person is identifiable, account should be taken of all the means likely reasonably to be used either by the controller or by any other person to identify the said person; whereas the principles of protection shall not apply to data rendered anonymous in such a way that the data subject is no longer identifiable; whereas codes of conduct within the meaning of Article 27 may be a useful instrument for providing guidance as to the ways in which data may be rendered anonymous and retained in a form in which identification of the data subject is no longer possible;

  • the recent EU Court of Justice Rynes case, which concerns the use of CCTV to film a street (without consent of individuals walking in the street), for the purpose of protection of personal property (i.e. a stronger, more justifiable, motive than testing scientific hypotheses without ethical consent). My interpretation of the authors' actions in light of this judgement might however need to be tested in court.


I have limited myself in this particular post to the points 1 and 2 that you have listed, but am even more convinced now of the rest of the six points I had made in my previous post.


My reasoning above relied on Danish Data Protection Law faithfully transcribing the EU directive into Danish law. It turns out that the Danish version is even more restrictive. The collection of data was clearly illegal at the outset.

Of relevance are:
  • The Danish Data Protection Act;
  • The guidance of the Danish Data Protection Authority;
  • The standard terms for research projects, which apply for research projects successfully registered with the data protection authority, if no modification to those terms is granted.


Judging from the register of private research projects (click "Fortegnelse"=="list", then search for the authors' names in the form you land on, under "Dataansvarlig"=="data controllers"), the authors have NOT registered their study with the Data Protection Authority, which should have occurred prior to the collection of data. Quoting from the guidance above: "The Act on Processing of Personal Data states that it is punishable by law to refrain from notifying a project to the Danish Data Protection Agency, and that it is punishable by law to violate the conditions stipulated by the Danish Data Protection Agency. The maximum penalty is a fine or imprisonment for up to four months."

Even if they had notified the data protection authority, the authors would most likely have needed to ask for a change in the standard terms, as those explicitly forbid disclosure and transfer to third countries. Both conditions were violated by the original release of data, including usernames, and would still be violated after removal of usernames. While the first disclosure could be argued to be a naive mistake, it becomes harder to argue that in light of the responses the initial release has received, without consulting directly with the Danish Data Protection Authority (they are unfortunately closed until Tuesday).

Technically, the Danish Data Protection Act is also relevant, in that it could be argued that an approval could be obtained outside of the purview of the Danish Data Protection Authority. This is true, but it is reserved for truly exceptional circumstances. That possibility is, I think, merely envisioned for when the government wants to use its executive power to circumvent existing laws transcribing European laws. In those cases, special notification is required to other EU Member States, for instance.

I would argue in consequence that:
  • The original data collection was illegal in Danish law;
  • The original data release was illegal in Danish law;
  • A new data release of non-aggregated data would be illegal in Danish law;
  • The original data collection was illegal under multiple other laws, because the Danish Data Protection Act includes this: "Any rules on the processing of personal data in other legislation which give the data subject a better legal protection shall take precedence over the rules laid down in this Act." Technically this exposes the authors to liability in just about any jurisdiction with strong privacy laws, with the full recognition of Denmark, and extreme obligations of reciprocity towards other EU Member States.


I repeat my call for a thorough investigation by the board of this journal of the situation, the resignation of its Editor-in-Chief (and, should he refuse, the board of the journal).

I add a call for a thorough investigation of other papers by the same authors in this journal, as I spotted a few which failed according to the same legal standards.

Finally, I want to observe that the author has actively used his role as Editor-in-Chief/Forum Administrator/Lead Author to:
  • inject his opinions into the review of this paper, by rejecting criticism of the ethical aspects of his work as "non-scientific" (an argument that has also been criticised by a world expert on that exact topic)
  • actively ban me from the forums for posting in this thread and engaging with the rest of the Editorial Board (even if only temporarily).


Even if my call for permanent resignation by the Editor-in-Chief is not heard, it seems to me that he should recuse himself from his role as Forum Administrator while discussion of his paper is ongoing, if he can't take criticism and has to use censorship.


I invite users to remember that the privilege of posting direct, unfiltered comments is very unusual in academic journals. Comments are usually screened by an editor who has full executive power and can decide whether to publish any complaints or not. My complaints to editors regarding plagiarism or lack of transparency have often been ignored, with no answer received, but since those were closed journals, nobody of course knows this. For someone who cares so much about ethics, some gratitude for being allowed to use this platform is warranted.
As co-founder of this journal and editor of a sub-journal (OBG), I can act as moderator of this thread and the OKC paper thread if both parties (Emil and pdehaye) agree. I am not a lawyer or a legal expert on ethics, so I cannot provide guidance, but can merely take the role of neutral moderator.
Just information...

I was looking at Facebook a few minutes ago, when I got a notice that there was a new video from PC Magazine. I occasionally watch them for news about computers and electronics. Shortly into the video, they began to discuss this paper and called out Emil, with a number of nasty and false comments. Their description of the OKCupid news item (it has been popping up in various places) was inaccurate. Obviously, they have not read the paper.

Meanwhile, I have read the paper twice and wanted to post some comments, but I am unsure of the mechanics others are using. I think I can replicate the format though.