michaelt 2 days ago [-]
> It has given 20,000 researchers around the world access under strict agreements that prohibit sharing data further.
To me it seems rather naive to have done that.
After all, you can't un-leak medical data. So even if the "strict agreement" included huge punishments, there's no getting the toothpaste back in the tube.
If you want to ensure compliance before a leak happens you have to (ugh) audit their compliance. And that isn't something that scales to 20,000 researchers.
Too late to do anything about it now though :(
petterroea 1 day ago [-]
One of my favorite lessons is that anything at scale has to be designed for idiots. I am pretty sure every person reading this has had days where they have done absolutely stupid things without realizing. Now assume there are thousands of users: you could be providing tools to the smartest people in the world and still have people do stupid stuff all the time. This doesn't just apply to UX.
Then there's the question of trust. You probably have friends you know not to tell certain secrets to, because they believe they get to delegate your secrets onwards to people they trust. The further away someone is from you, the less respect they will show. Researchers have been loaning the dataset in good faith to people they trust, but who probably didn't take the whole secrecy thing as seriously.
With 20k researchers this was inevitable. Factors like these need to be weighed when deciding on what grounds such a dataset is to be released.
LeifCarrotson 22 hours ago [-]
"Favorite" in what sense? "Most critical" has a very different meaning; the word usually implies pleasure/happiness.
Internalizing that people are frequently less intelligent and less respectful and less kind and less thoughtful than you previously believed is a tragic loss of innocence and hope, not a favorable experience.
petterroea 22 hours ago [-]
Favourite as in "realizing it was an interesting aha moment". Sure, it's a depressing loss of innocence. But on the other hand I always love learning things where I feel like I'm closer to knowing what I'm doing. Not that I think I'll ever get there; it's an everlasting journey.
ACCount37 1 day ago [-]
Not giving the data to researchers means not getting the scientific benefits from that data. Which was the point of collecting that data in the first place.
Reckless harm prevention is the root of many evils.
nxobject 1 day ago [-]
As a biostatistician who's touched epidemiological studies, I'd argue losing the trust of participants and the public is one of the biggest threats to the viability of the whole research enterprise. It's reckless to jeopardize that as well. Conversely, this dataset will be mined for at least 30-50 years - there are an infinite number of questions that can be asked of this data. Given that timescale, I think a little delay here is acceptable.
Cynddl 1 day ago [-]
It's not a zero-sum game; you can both protect people and reap the benefits of health data. Many countries have much safer approaches. UK Biobank typically leads in the scale of its data, but not in its infrastructure.
bonesss 1 day ago [-]
That’s a false dichotomy.
Sensitive research systems thread that needle by giving researchers remote access, with the data under the control and supervision of the responsible organization. Strong internal data access controls and data siloing, alongside strict verified extraction routines. Specifically: limited project-dedicated DB access, full logging of data interactions, and full lockouts/freezes if something feels off.
‘The five safes’ is a good presentation from the NHS(?) a decade ago covering the approaches.
Data publishing restrictions around health data aren’t reckless. Modern computing and digital permanence mean we have to be extra cautious.
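The controls described above (project-scoped access, full logging, freeze on anomalies) can be sketched in a few lines. A minimal illustration in Python with the stdlib's sqlite3; the class, table names, and freeze-on-denial policy are all hypothetical, and a real enclave would enforce this server-side with an append-only audit store on a separate host.

```python
import sqlite3
import time

class DataGateway:
    """Toy query gateway: project-scoped tables, full audit log, auto-freeze."""

    def __init__(self, db_path, project_tables):
        self.conn = sqlite3.connect(db_path)
        self.project_tables = set(project_tables)  # tables this project may read
        self.audit = []                            # (timestamp, who, table, what)
        self.locked = False

    def query(self, researcher, table, sql, params=()):
        if self.locked:
            raise PermissionError("gateway frozen pending review")
        if table not in self.project_tables:
            # Out-of-scope access attempt: log it and freeze the session.
            self.audit.append((time.time(), researcher, table, "DENIED"))
            self.locked = True
            raise PermissionError(f"{table!r} is outside this project's scope")
        # Legitimate reads are logged too ("full logging of data interactions").
        self.audit.append((time.time(), researcher, table, sql))
        return self.conn.execute(sql, params).fetchall()
```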
ACCount37 1 day ago [-]
No, this is a real tradeoff.
Any friction you add to the data-access process makes it harder for legitimate researchers to access, and benefit from, that data.
So, at what point do stricter data controls start to choke off the research?
blitzar 1 day ago [-]
We have dozens of data / db startups - kinda odd that there isn't one (that I have seen) that focuses on this problem.
Perhaps our future AI overlords will feel it's important to compartmentalise, and log data access more aggressively.
SilverElfin 2 days ago [-]
That’s insane. And what does researcher even mean - some random university student? What would they know about securing that data? I wonder if the people whose data is out there even know this is happening
They clearly do include "some random student", as the data can be shared with others from the eligible research group, who are almost always university students with zero clue about itsec.
globular-toast 1 day ago [-]
I worked in this field. It's not just the students. Hardly anyone seemed to understand how and why you would keep data out of a git repo.
nxobject 23 hours ago [-]
I'm curious – in which context? I've worked on NIH-funded grants in academic medical centers, throughout the research lifecycle, and I've seen both how stringently data management plans are vetted and how annual IRB certification drills the basics into even the oldest tech-phobic investigators.
That being said, I may be as pessimistic as you are: I don't think people right now grasp how standards for deidentification may no longer be enough, and how easy, automated deanonymization changes everything. Unfortunately, cuts to federal science agencies mean that I doubt any well-informed guidance will come soon.
"The charity did not specify the types of data that were included, but Murray stated in the Commons that several markers were included in the listings:
- Gender
- Age
- Month and year of birth
- Assessment center data
- Attendance dates
- Socioeconomic status
- Lifestyle habits
- Measures from biological samples related to haematology, biology, and chemistry
- Sleep, diet, work environment, mental health, and health outcomes data."
fastaguy88 1 day ago [-]
BioBank claims (1) only de-identified data was available and (2) none of the data was actually sold before the datasets were taken down.
john_strinlai 1 day ago [-]
unfortunately for most people, de-identified data is typically a very short analysis away from being re-identified.
the field of de-anonymization is booming.
nxobject 23 hours ago [-]
Especially for a nation-state that's already hoovering up data broker products.
anitil 2 days ago [-]
I've opted in to Australia's version of the biobank knowing that it's inevitable that it will be leaked some day, I think the data is so valuable in perpetuity that it's worth it. I remember Ben Goldacre has been working on how to make data more accessible in a safer way to (in part) avoid this very thing, but I haven't heard much of it since [0]
This is the right mindset. Securing huge piles of heterogeneous data and giving PhD students the freedom to "play" with it are quite conflicting goals.
nxobject 2 days ago [-]
I like their idea of an audit log of analysis runs -- beyond transparency, I'm sure it'll help future researchers know how much iteration is needed to work with the messiness of medical records...
I'm also amused (in a good way) by the fact that SAS isn't supported as an analysis platform...
anitil 2 days ago [-]
It's certainly an interesting idea, I remember he was on a few podcasts talking about it. I might submit it here to see if it gets some conversation going
philipwhiuk 1 day ago [-]
I have huge problems with Goldacre's project because it has never been disclosed to the general public, let alone offered any form of opt-in/opt-out.
harvey9 1 day ago [-]
I think the opt-out process is the same as the one that came to prominence in the care.data days. It is not a very good process, since it is opt-out rather than opt-in, and reliant on motivated people searching for a form.
On the other hand I like opensafely's approach to security: no individual data is ever shared with researchers.
> The following steps require the ukbunpack and ukbconv utilities from the UK Biobank website. The file decrypt_all.sh will run through the following steps on one of the on-prem servers.
> Once the data is downloaded, it needs to be "ukbunpacked" which decrypts it, and then converts it to a file format of choice. Both ukbunpack and ukbconv are available from the UK Biobank's website. The decryption has to happen on a linux system if you download the linux tools, e.g. the Broad's on-prem servers. Note that you need plenty of space to decrypt/unpack, and the programs may fail silently if disk space runs out during the middle.
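The quoted warning about silent failures when disk fills can be guarded with a tiny pre-flight check. A sketch using only Python's stdlib; the 500 GiB figure and the /data/ukb path are assumptions for illustration, not UK Biobank guidance.

```python
import shutil
import sys

def check_free_space(path, required_gib):
    """Abort early instead of letting an unpack die silently mid-run."""
    free_gib = shutil.disk_usage(path).free / 2**30
    if free_gib < required_gib:
        sys.exit(f"only {free_gib:.0f} GiB free at {path}, "
                 f"need ~{required_gib} GiB to unpack safely")
    return free_gib

# e.g. run before invoking ukbunpack/ukbconv:
# check_free_space("/data/ukb", required_gib=500)
```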
Good catch! The data is everywhere, re-uploaded every week.
I am aware of ~30 repositories that UK Biobank has asked GitHub to delete and that can still be found elsewhere online. They know the site, they have managed to delete data from that site before, and yet the files are still there.
philipwhiuk 1 day ago [-]
I don't think either of those links contain actual PII (or anonymised PII).
The irony is, they don’t even provide the data to the participants themselves.
vain 2 days ago [-]
Huh? I got my report over email. I think you have to ask for it.
dariosalvi78 1 day ago [-]
the issue is with Jupyter notebooks, because they keep some of the data in the output (typically a few rows, but still). They should strongly recommend using regular Python scripts, and keeping the Jupyter notebooks just for verification, which is a very sane thing to do from a SW engineering perspective as well.
paulwetzel 1 day ago [-]
I can't really understand why Jupyter notebooks do this in the first place. It (a) makes version control really hard, as there will always be some random blob of non-textual data in the notebook that pops up in a diff and makes it basically unreadable, and (b) has no benefit I can see, as it only stores some part of the data, not the full table, as far as I am aware.
Enforcing Jupytext is a good adaptation: it gives you all the (arguably really nice) comfort of a notebook, plus proper code practice from SW engineering.
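The leak mechanism here is plain JSON: a .ipynb stores each cell's outputs as fields that can simply be cleared before committing. A stdlib-only sketch of what tools like nbstripout (or converting with Jupytext) automate, assuming the standard nbformat-4 layout:

```python
import json

def strip_outputs(ipynb_path):
    """Remove cell outputs and execution counts from a .ipynb file in place.

    Outputs (including any leaked data rows) are just JSON fields in the
    notebook file, so they can be cleared before the file is committed.
    """
    with open(ipynb_path) as f:
        nb = json.load(f)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    with open(ipynb_path, "w") as f:
        json.dump(nb, f, indent=1)
```

In practice this would run as a pre-commit hook or a git clean filter rather than by hand.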
lmc 1 day ago [-]
marimo notebooks give you the best of both worlds (https://marimo.io)
John7878781 2 days ago [-]
What are the pros/cons of just open-sourcing everything for future bio bank projects?
michaelt 2 days ago [-]
It's exceptionally difficult to avoid the data being de-anonymised.
If an 'anonymised' medical record says the person was born 6th September 1969, received treatment for a broken arm on 1 April 2004, and received a course of treatment in 2009 after catching the clap on holiday in Thailand - that's enough bits of information to uniquely identify me.
And medical researchers are usually very big on 'fully informed consent', so they can't gloss over that reality, hide it in fine print or obfuscate it with flowery language. They usually have to make sure the participants really understand what they're agreeing to.
It might still work out fine, of course - 95% of people's medical histories don't contain anything particularly embarrassing, so you might be able to get plenty of participants anyway.
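The "enough bits of information to uniquely identify me" point can be checked mechanically with a k-anonymity count: bucket records by the chosen attributes and look at the smallest bucket size. A sketch with invented data:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Smallest group size when records are bucketed by the given attributes.

    k == 1 means at least one person is uniquely identifiable from those
    attributes alone. The records below are invented for illustration.
    """
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

records = [
    {"birth": "1969-09", "sex": "M", "treatment": "fracture"},
    {"birth": "1969-09", "sex": "M", "treatment": "STI"},
    {"birth": "1969-09", "sex": "F", "treatment": "fracture"},
]

assert k_anonymity(records, ["birth"]) == 3               # one coarse field: safe
assert k_anonymity(records, ["birth", "treatment"]) == 1  # two fields: unique hit
```

Real datasets with dozens of fields per person reach k = 1 for almost everyone, which is the commenter's point.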
yosame 2 days ago [-]
In my experience with health data, the dates are usually offset by a random but constant amount for each person (e.g. id 12345 will have all their dates shifted by +5 weeks) to avoid identification by dates.
Unfortunately the sequence of treatments and locations are usually enough to identify someone, especially if it's a rarer condition.
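The per-person constant shift described above might look like the sketch below (a keyed hash stands in for stored random offsets so the example is self-contained; the secret and the ±26-week window are illustrative assumptions, not any specific project's method). It also demonstrates the weakness noted: intervals between events survive the shift, so event sequences still re-identify.

```python
import hashlib
from datetime import date, timedelta

def shift_date(person_id, d, secret="site-secret", max_weeks=26):
    """Shift a date by a per-person constant offset derived from a keyed hash.

    Every date for the same person moves by the same amount, which preserves
    intervals between events while hiding the true calendar dates.
    """
    h = hashlib.sha256(f"{secret}:{person_id}".encode()).digest()
    weeks = int.from_bytes(h[:4], "big") % (2 * max_weeks + 1) - max_weeks
    return d + timedelta(weeks=weeks)

# Intervals survive the shift, so the *sequence* of treatments is unchanged:
a = shift_date(12345, date(2004, 4, 1))
b = shift_date(12345, date(2004, 6, 1))
assert (b - a) == (date(2004, 6, 1) - date(2004, 4, 1))
```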
cameldrv 2 days ago [-]
Location data is very readily available, so you can easily correlate visits to a health facility with a treatment, and even with an offset, you can probably uniquely identify someone with 4 visits depending on the size of the medical facility.
evandijk70 1 day ago [-]
I had access to several health datasets for my research in the past. Date of birth was rarely given, especially for the bigger projects, where there were more resources to allocate to privacy protection. Neither was date of death, location, or visits to a health facility with a treatment. Typically the relevant variables are age (in years), treatment type and possibly number of cycles. Probably insufficient to identify someone without access to hospital records. But if you have those, you have all these data anyway.
Most researchers likely would want to summarize these data in a similar way anyway, so this works out nicely.
jjgreen 1 day ago [-]
... received a course of treatment in 2009 after catching the clap on holiday in Thailand
Yeah, sorry about that
culi 2 days ago [-]
The people who agreed to contribute their biodata did not consent to that.
If you want such a project you need a new project with a different agreement. I doubt you could get as many volunteers to freely give away such intimate data to anyone who wants it, though.
shellac 2 days ago [-]
'Anonymisation' schemes are a little like encryption, in that they just get monotonically weaker over time as people work out attacks. But the attacks tend to be much worse. I work in academic open data publishing, and the netflix prize (https://arxiv.org/abs/cs/0610105) hangs over our heads.
But what this illustrates to me is that researchers are just really careless, despite everything we make them agree to in data transfer agreements. It seems absurd to have little cubicles like this https://safepodnetwork.ac.uk/ (think Mission Impossible 1) but I do despair.
Cynddl 2 days ago [-]
You mean giving anyone access to the data? Or open-sourcing the code? If the latter, I think that's generally good practice. Security through obscurity is never good for public infrastructure. In this case, UK Biobank has now switched to a remote access platform (not particularly secure, as the data was found for sale on Alibaba today), contracting it out to DNAnexus and Amazon. Private companies have no incentives to open source data, unless mandated to do so.
In the EU, there is a bigger interest in building scalable but also secure platforms for health data. Hopefully good innovation will come from there.
tptacek 2 days ago [-]
One of the most important cons is that without controls, fewer people will allow their data to be included in the data sets.
Cynddl 2 days ago [-]
That's a very important point. The people who opt out first are typically not a random fraction of the population, and this makes analyses on the resulting datasets much harder: it gets very hard to know whether your results are representative of the population.
tptacek 2 days ago [-]
This is why it was such a big deal when that researcher at Cleveland State misappropriated UKBB data for a race-science study with Emil Kirkegaard. After he was fired, people on Twitter were all like "this is just suppression of science", but the reality is that what they did, contravening UKBB rules, constituted potentially an existential threat to the whole program.
ChrisRR 1 day ago [-]
They need to sell the data to fund the project
gdevenyi 1 day ago [-]
This.
renewiltord 2 days ago [-]
Hard to do. The same people with the collection and tracking infrastructure required are infinitely sue-able so you need legal protection if anything goes wrong.
ashley95 2 days ago [-]
Really don't think this is any issue given the post we are commenting on...
nxobject 2 days ago [-]
From the perspective of someone who's worked with (biostatisticians who touch) Medicaid and Medicare billing data...
It looks like they've identified the institutions, at least... but aren't naming them publicly for now. Are there going to be consequences? Are they going to be identified and sanctioned beyond "having their access suspended"?
In the US, HHS wouldn't hesitate to name, shame, and impose a sanction with corrective action plans. Not knowing much about how things work across the pond, I'm sure CMS PII gets used in research far more often without leaks happening left and right.
NGRhodes 1 day ago [-]
Thank you for sharing.
I work in a central RSE team and have raised this topic to the team, with a view of bringing attention to this issue and better educating our researchers (as part of our training offerings and documentation).
khelavastr 20 hours ago [-]
UK government fails to prosecute criminally negligent software developers...
anordal 1 day ago [-]
Have medical researchers not heard of gitignore? Any hypothesis about the mechanism here?
mhh__ 1 day ago [-]
I haven't been paying attention to it but wasn't there some kerfuffle over some people threatening to leak it over not being allowed to publish controversial findings?
All 500,000 participants for sale on Alibaba...
And official response: https://www.ukbiobank.ac.uk/news/a-message-to-our-participan...
[0] https://www.bennett.ox.ac.uk/blog/2025/02/opensafely-in-brie...
And some information on how they were distributing it to researchers: https://github.com/broadinstitute/ml4h/blob/master/ingest/uk...
https://biobank.ctsu.ox.ac.uk/crystal/download.cgi
(The first is a GitHub repo for https://www.weizmann.ac.il/math/tanay/home )