Open Sourcing My Personal Medical Record (hdphealth.com)
103 points by blaurenceclark on May 4, 2017 | 79 comments


Me and many others have done this for a long time. The Harvard Personal Genome Project [1] is a large open database of people's genetic and phenotypic information. Here is my profile: https://my.pgp-hms.org/profile/hu1247AF

You can add your Personal Health Record to it.

[1] http://personalgenomes.org/


The Harvard Personal Genome Project is great (I'm a participant) but there are some other projects that are complementary as well, such as Open Humans [1] and Open SNP [2].

[1] https://www.openhumans.org/

[2] https://opensnp.org/


Ah awesome!!!! Shoot me an email, I would love to chat more about that. The one thing I wanted to offer that's different from those other sources: if I were an engineer who wanted to see what working with EMR data is going to be like, I couldn't find an EMR export (CCDA and raw notes) anywhere online to analyze with ML and NLP.


I have a friend who is a medical researcher and it definitely seems they are stuck in the past.

In order to study something, he has to:

* Come up with a hypothesis that X may cause Y

* Request access to data about that hypothesis

* He is only given the data regarding his hypothesis

* He can then study whether his hypothesis has merit or not

We should be dumping these whole datasets into machine learning and having computers give us potential links to explore. Obviously there will be plenty of things that turn out to be unrelated, but it's also very likely the computer can find links that a human would not have considered.

I don't see it changing any time soon in the US, but I suspect other countries with this data will use it, and we'll find the next generation of medical breakthroughs no longer come from the US.


Every usage of every bit of medical data requires patient consent.

The potential for abuse is not hypothetical.

While I was implementing medical information exchanges, every single participant considered patient data to be their own, to be used as they wish. Our (grand)parent company, a lab, was negotiating with Microsoft, Google, pharmas, etc. Each was trying to figure out how to monetize it. For example, targeted ads.

The C (executive) level players mocked HIPAA and the other (meager) patient and consumer protections the same way they mocked Sarbanes-Oxley, environmental protections, financial reporting requirements, etc. If you think Google and Facebook are bad...

---

My data, all that is known about me, is my identity. It's me.

At the very least, if someone's going to profit from my data, I want my cut.


I agree with kharms

>You're describing P-value hacking

Here's an example of what can happen when you take a huge corpus of data and throw an equally huge number of hypotheses at it to see what sticks: https://io9.gizmodo.com/i-fooled-millions-into-thinking-choc...

tl;dr: he "proved" chocolate causes weight loss by comparing chocolate- and non-chocolate-eaters on a very high number of health indicators.

That also introduces the multiple testing problem: https://www.wikiwand.com/en/Multiple_comparisons_problem

The more statistical tests you run against a set of data (EDIT: the more variables you test against a dataset), the higher the chance you get a statistically significant result from random error alone.
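
A rough sketch of the arithmetic, assuming the tests are independent and that there is no real effect anywhere in the data (both are simplifying assumptions, and the function name is made up for illustration):

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of num_tests independent tests comes out
    'significant' at level alpha when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** num_tests


for n in (1, 5, 20, 100):
    chance = chance_of_false_positive(n)
    print(f"{n:>3} tests -> {chance:.0%} chance of at least one spurious hit")
```

Under those assumptions, 20 tests already give roughly a 64% chance of at least one "significant" result from noise alone.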


The solution to false positives is not to artificially rate-limit testing or blind yourself to potentially useful data. It's to understand that 5% is an insufficient significance threshold when your prior belief in a correlation is low.

There are really three solutions to the problem of multiple comparisons: Either (1) you use a different threshold, (2) you use a different test, and/or (3) you correctly interpret that p=5% does not imply the effect is 95% likely.

There's absolutely nothing wrong with exploring a data set, as long as you are responsible in the conclusions you draw.
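
To make point (3) concrete, here is a minimal Bayes' rule sketch. The 80% power figure and the example priors are assumptions picked for illustration, not numbers from any real study:

```python
def prob_effect_is_real(prior: float, power: float = 0.8, alpha: float = 0.05) -> float:
    """P(effect is real | test came out significant), via Bayes' rule.

    prior: belief that the effect exists before looking at the data
    power: chance the test detects a real effect (assumed value)
    alpha: chance the test fires when there is no effect
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


for prior in (0.5, 0.1, 0.01):
    posterior = prob_effect_is_real(prior)
    print(f"prior {prior:.0%} -> {posterior:.0%} chance a significant result is real")
```

With a 1% prior, a result that clears p < 0.05 is real only about 14% of the time under these assumptions, which is why a low prior calls for a stricter threshold.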


IANAS, but does this mean that a set of raw data loses value, as more information is extracted from it? If I use your old raw data to validate my hypothesis, does that somehow also weaken the statistical evidence for your hypothesis?

I really need to go back and study statistics, this is getting embarrassing.


I worded my comment incorrectly (and edited it accordingly). What I should have said is that when you run a stats test against a dataset, there's a known probability that you'll get a significant correlation simply due to chance. The more variables you examine, the higher that chance becomes.

I just found this on Google but the first page of this paper explains it a little better: http://www.stat.berkeley.edu/~mgoldman/Section0402.pdf


It means that you can't use the same data to confirm a hypothesis as you used to generate the hypothesis. Defensible statistical practice would be to throw anything you like at the original data set, come up with whatever ridiculous idea, and then collect a new data set for the purpose of investigating your ridiculous idea. The original data set provides zero[1] evidence for a hypothesis that it inspired you to think of.

[1] Not really, but this is the cleanest way to sidestep multiple comparisons.


Respectfully, I disagree.

(1) First, you can certainly have confidence in hypotheses based off single data sets. If you have a dataset with 1 million hours of TV watching that show 0 correlation between watching golf and watching Judge Judy, it's fine to suspect there's little correlation. You don't need to run a second study to have an informed opinion.

(2) Second, collecting new data sets (or equivalently blinding yourself to partitions) doesn't 100% fix the problem either. If you test lots of hypotheses against your test set, then the odds that some of them are false rises too. Creating third- and fourth- and fifth-level validation sets just keeps pushing the problem up the ladder. In fact, there's no real difference between the requirement to experimentally validate results and the requirement to have a hypothesis 'work' on both halves of a partitioned dataset. The data doesn't care when you collected it.

Ultimately we just have to admit that tests based on randomness are sometimes randomly wrong. There is no perfect silver bullet solution.


> In fact, there's no real difference between the requirement to experimentally validate results and the requirement to have a hypothesis 'work' on both halves of a partitioned dataset.

This would be correct in the absence of investigator malfeasance. Unfortunately, investigator malfeasance is the problem we're trying to solve, so assuming it away is unwise. The requirement to collect new data imposes pretty strict limits on how many hypotheses you can test. The requirement to find a hypothesis along with a division of your existing data set such that the hypothesis holds in both halves is much more generous; it can be automated just as easily as finding a hypothesis that works in the unified data set can.


Fair, but that's mitigated if you have a rule that requires an ordering of the data points (say, chronologically). Then there should be no difference between two 500-data-point studies and one 1,000-data-point study partitioned in two (uniquely determined) halves.


This is not a solution. It removes one degree of freedom, the ability to draw the "line" dividing one half of the data set from the other. But an evil or naive scientist has limitless other degrees of freedom to choose from, and can make as many comparisons (in the "multiple comparisons" sense) as they like, undetectably to you.

After you, the good guy, have specified which half of the data is the playground and which is the confirmatory test set, Evil Scientist can still run as many hypotheses as he feels like until he finds one that validates in both halves.

Under the rule "you can only validate a hypothesis by collecting a new data set dedicated to that hypothesis", we, the observers, have a way of guaranteeing that multiple comparisons did not occur. We have no such guarantee under the system you describe.

So to sum up: the rule I describe is not necessary in order to practice good statistics for your own benefit. But it is necessary in order to have a good statistical argument for convincing someone who can't directly perceive the contents of your mind. It's an auditing tool.
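
A back-of-the-envelope version of the Evil Scientist scenario, assuming every hypothesis he tries is actually false, the two halves behave like independent samples, and each half is tested at the same threshold (the function name is hypothetical):

```python
def chance_some_null_validates(num_hypotheses: int, alpha: float = 0.05) -> float:
    """Probability that at least one false hypothesis passes at level alpha
    in BOTH the playground half and the confirmatory half."""
    passes_both_halves = alpha * alpha  # independent halves, same threshold
    return 1.0 - (1.0 - passes_both_halves) ** num_hypotheses


for n in (1, 100, 1000):
    chance = chance_some_null_validates(n)
    print(f"{n:>4} hypotheses tried -> {chance:.1%} chance one 'validates'")
```

Splitting the data makes each individual hypothesis much harder to pass (0.25% instead of 5%), but at around a thousand automated attempts the chance of a bogus "validated" finding is above 90%. An outside observer cannot tell how many attempts were made.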


Obviously the data set doesn't become "weaker" or "lose value" -- it's data, and running stats against it doesn't change it.

However, every test for a correlation against a data set has some chance of yielding a false positive or false negative. This chance is called the p-value, and typically .05, or 5%, is the minimum requirement to be considered "significant". But that means that if you test for 20 or so correlations, you would expect one of them to be wrong. And the only thing that can fix that is reproducing the test with a different data set.

Searching for "science reproduction crisis" will give a lot of good results for further reading.

This topic is also what this XKCD is about -- and it's not a coincidence that there are 20 "test" frames with a .05 p-value:

https://www.xkcd.com/882/


That is not the definition of a p-value. :(

A p-value of 5% means that, IF the null hypothesis is true (IF!), then there's a 5% chance of getting results as extreme as measured.

A p-value of 5% does not mean that you should expect a rate of 5% false positives & negatives.


Isn't your second paragraph just a definition of a false positive?

And, it looks like power is the error rate for false negatives:

https://en.m.wikipedia.org/wiki/Statistical_power

Too late to edit my original to fix this.


>We should be dumping these whole datasets into machine learning and having computers give us potential links to explore.

You're describing P-value hacking, thus named because hack scientists use this technique to publish papers about nonsense.


You basically just need to reduce the P-value you need to claim significance (0.1 to 0.001 or less) to account for the probability of finding those correlations even in noise. This is part of why particle physics has such high standards - you can find a lot of things in the TBs of data CERN generates.

See for example: https://en.wikipedia.org/wiki/Genome-wide_association_study

There's a figure in there depicting associations with P-values of 1e-8: https://en.wikipedia.org/wiki/Genome-wide_association_study#...
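
One standard way to lower the threshold (not necessarily the only thing meant above) is a Bonferroni correction: divide the threshold by the number of tests. A minimal sketch, with made-up p-values and hypothetical function names:

```python
def bonferroni_threshold(num_tests: int, family_alpha: float = 0.05) -> float:
    """Per-test significance threshold that keeps the chance of any false
    positive across num_tests tests at or below family_alpha."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return family_alpha / num_tests


def survivors(p_values: list[float], family_alpha: float = 0.05) -> list[float]:
    """Return the p-values that are still significant after correction."""
    cutoff = bonferroni_threshold(len(p_values), family_alpha)
    return [p for p in p_values if p < cutoff]


# Made-up p-values from 5 hypothetical tests; the cutoff becomes 0.05 / 5 = 0.01.
print(survivors([0.04, 0.003, 0.20, 0.011, 0.0001]))
```

For a sense of scale, correcting 0.05 for about a million tests gives 5e-8, which is the conventional genome-wide significance cutoff.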


Sadly, even countries with universal healthcare systems don't have universal health informatics systems (the NHS is a prime example — they spent £12B trying to build an integrated system [1]). Lots of countries attempt, including the US — HIPAA was actually originally about data portability [2], and we just spent another $40B [3]. Thus far only smaller countries have had success with integrated health IT systems [4].

[1] https://en.wikipedia.org/wiki/NHS_Connecting_for_Health [2] https://en.wikipedia.org/wiki/Health_Insurance_Portability_a... [3] https://en.wikipedia.org/wiki/Health_Information_Technology_... [4] https://en.wikipedia.org/wiki/Healthcare_in_Denmark#eHealth


It's frustrating that this problem is more of a political labyrinth than a technology endeavor.


Definitely true. In the US, because fax and phone were carved out in HIPAA as not being ePHI, they have this special protected status that makes it totally cool for providers to fax records around, but sending emails is something that risks jail time :/

I wouldn't underestimate the technological barriers to making interoperable health record systems actually useful. There are a lot of different kinds of medical information (SNOMED CT, the best ontology for healthcare, has >1M concepts!), and the best way to structure that information is an unsolved problem. There are lots of different ways out in the wild (complicated by there being lots of half-assed EHRs that were just made to grab incentive money), and the standards that are out there don't really help things (they are so broad that basically every EHR implements their own "flavor" of the standard).


Well, that's the way it's meant to be. Exploratory analysis doesn't have the same purpose as hypothesis testing.


100% agree I hope we can get there!


Thank you for sharing this. We (I) founded a company to help with several niche aspects of healthcare and the bureaucratic issues faced by administrations. We are finding success with data transactions, and while there are some companies out there who work really hard to make transaction engines, it's not very efficient, very expensive, and doesn't benefit the consumer at all.

My past experience as a software developer was, "Give me all the datum, and tell me what you need, then I'll make it work." I even worked for a very large EMR (probably the biggest on the planet), and getting a patient record out of their system is a nightmare, even though the foundation of their application is the patient record.

I'd love to converse more about what you're building, as we capture many unstructured documents and are now using ML to grab details out of these and match to criteria.


Absolutely please shoot me an email!


We should (as a society) consider open sourcing every medical record.

Medical privacy is ethically tricky. It (1) protects bad doctors, (2) makes it harder to develop treatments, (3) makes it hard for consumers to shop intelligently.

Medical privacy would be useful when negotiating cost of coverage with your insurer, but they have a contractual right to demand your complete medical record.

The best arguments I've heard for medical privacy are (1) you might not get a job if you're sick, (2) shame factor could prevent people from going for treatment and (3) you may not get a date if you have, say, herpes. (#3 is true but not necessarily a strong point from a social standpoint).


Your #1, #2, and #3 are all the same thing, but I think it's hugely important:

Medical records can show all kinds of markers about your past / current behavior that let people paint pretty horrible assumptions about each other.

Type 2 Diabetes? Man you must eat poorly.

Herpes? You must have gotten it from being promiscuous and risky.

Depression? Must not be able to deal with the shit that is real life.

Hormone therapy? Dental issues? Pain killers? Allergies? I mean the list is almost as long as the list of all medical issues that people have.

People paint just about every medical condition with behavioral moral/ethical judgement, which is almost entirely unfair. I think medical privacy is hugely important for society as we currently are; losing it would not change these effects, but would instead make it easier to discriminate against people.


Also possible that we judge each other because medical privacy gives us a false impression of the world. Or because privacy allows people living in glass houses to throw stones.

Either way -- if opening medical records leads to new treatments, it may be worth the shame.


> Either way -- if opening medical records leads to new treatments, it may be worth the shame.

That's incredibly easy to say if you don't have any of the problems listed.


I have seasonal allergies and @codemac for some reason listed allergies (maybe b/c it's a preexisting condition and that's in the news?). But point taken.


There is nothing shameful about seasonal allergies.


I think AIDS is the trickiest one. It's not genetic (except for the very few people who are congenitally immune). It can kill you if you don't live in a rich country. Many people who have it are gay, which is illegal in a lot of places. (Even in the US C&BP keeps messing with that Canadian guy for having grindr on his phone). STDs in general can reveal you've been cheating on / lying to a partner.

If you asked me 'should we publicize AIDS status in a lightly anonymized form', I say no, of course not.

But if you ask me 'do we want public records about AIDS treatments', absolutely.

(AIDS may be a moot point because there's recent CRISPR research about 'excising' AIDS infection in live mice).

My point: I'd like to have both things but to solve problems at a continent-scale we need transparency about disease and treatment.


Herpes. Mental illness. A history of suicide attempts. HIV+ status. The fact that you weren't born with your current gender. The fact that you've miscarried three times, and are currently pregnant. The fact that your child has fetal alcohol syndrome.

All super fun facts that people would love for friends, coworkers and strangers to be able to find out.

I understand that there are good arguments for releasing medical data, but this is just the "if you have nothing to hide, what are you worried about?" argument.


> The fact that your child has fetal alcohol syndrome.

Trust me, people will find this out after it's born, and they'll be plenty judgmental then.

And hey, rightly so.


> Trust me, people will find this out after it's born, and they'll be plenty judgmental then. And hey, rightly so.

What you're describing is human scale judgment. Example: a church music director doesn't allow such a person to join the choir.

With medical privacy out of the picture, what you'd be rationalizing here would be internet scale judgment. E.g., a script kiddie trolls the set of all known people who had children with fetal alcohol syndrome in an attempt to trigger them to kill themselves.


This is overly judgmental.

Accidental pregnancies happen (even with contraceptive use), and can often not be detected until several weeks have passed. Even if alcohol consumption is stopped immediately, fetal alcohol syndrome spectrum disorder can still occur in the child, since development during the early stages of pregnancy is particularly sensitive to alcohol.


> Even if alcohol consumption is stopped immediately [after noticing you're pregnant], fetal alcohol syndrome spectrum disorder can still occur in the child

Technically can, but this is unlikely in the extreme. (Source: my mother, a practicing obstetrician.)


Thank you for succinctly demonstrating why open sourcing medical records is a terrible idea.

To expand, here's a scenario: what if I'm a single father of a FAS child? What if the mother hid the pregnancy from me until birth? What if she hid the drinking from me? You're going to judge me and my hypothetical child for actions we didn't take, couldn't have prevented, and are dealing with the fallout of in the face of this kind of garbage?

All the judgement would sure help.


> All the judgement would sure help.

Being judged isn't supposed to help you any more than being thrown in jail is.


If you want to feel better about your life choices, you're welcome to watch reality television, instead of reading my hypothetical medical record.


I drew the analogy I did for a reason. Do you really think we imprison people so that we can feel better about ourselves?


Medical records can also be used as identification since they contain data that is unique only to you.

http://www.reuters.com/article/us-cybersecurity-hospitals-id...

It is very easy for the typical software engineer to come up with the brilliant idea of open sourcing everything without thinking of any of the consequences. But the real world is much more complex than that.


Simple solution. Make medical records privacy opt-in.

Most people won't bother (strong default effect), so lots of data for research, and those who care can still have their privacy. It won't exactly be a random sample, but it should still be better than what's available currently.


Won't this create a "you have something to hide" effect?


I'm assuming making these records public in an anonymized way. If somebody wants your records directly nothing changes. So he doesn't know whether you're sharing your data or not.

Of course there's good chance he can find you in the anonymized dataset if it's detailed enough. But he can't be 100% sure it's you.


75% sure is bad enough, even if it isn't you.


The current law in the US is that insurers can't consider medical history (they can ask your age and whether you smoke).

Looks like that has a fair chance of changing though.


Sadly it does look like it might change :(


Do you really want to out every trans person who has told a doctor about it?


This is a risky thing to do when the patient's name is attached to it. Insurance companies, salesmen, etc., could do quite a lot with such information.

I whole-heartedly support the general idea, and making a centralised database of things like this would be great. Such a database would probably make it easier to anonymise the data as well.


Even without the patient's name attached it is easy to identify people because of the necessary metadata in the record. If you expect to get useful information about, for instance, lung disease, the record will have to contain information about exposure to likely causes, age, occupation, region of the country (possibly town), sex. It will also contain marital status, whether one has children, drinking and smoking habits, weight, ethnicity.

This is pretty close to unique, just like a browser fingerprint.

See for instance: http://randomwalker.info/publications/no-silver-bullet-de-id...
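
A toy illustration of the fingerprinting effect. The records below are entirely made up, and the field names are assumptions chosen to mirror the metadata listed above:

```python
from collections import Counter

# Made-up records with no names attached, only "harmless" metadata.
RECORDS = [
    {"age": 34, "sex": "F", "town": "Springfield", "occupation": "welder", "smoker": True},
    {"age": 34, "sex": "F", "town": "Springfield", "occupation": "teacher", "smoker": False},
    {"age": 51, "sex": "M", "town": "Springfield", "occupation": "welder", "smoker": True},
    {"age": 51, "sex": "M", "town": "Shelbyville", "occupation": "welder", "smoker": True},
    {"age": 34, "sex": "F", "town": "Shelbyville", "occupation": "teacher", "smoker": False},
    {"age": 51, "sex": "M", "town": "Springfield", "occupation": "welder", "smoker": True},
]


def unique_fraction(records: list[dict], fields: tuple[str, ...]) -> float:
    """Fraction of records whose combination of `fields` appears exactly once."""
    keys = [tuple(record[f] for f in fields) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)


for fields in (("sex",), ("age", "sex", "town"),
               ("age", "sex", "town", "occupation", "smoker")):
    print(f"{len(fields)} field(s): {unique_fraction(RECORDS, fields):.0%} of records are unique")
```

Each extra field splits the population into smaller groups, so the share of people who are alone in their group only goes up as the record gets more detailed.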


Anonymize then let the flood gates open.


De-anonymizing medical records strikes me as a fairly easy problem to solve... the information is literally one large biometric database.


There are actually government mandated methods for deidentification

https://www.hhs.gov/hipaa/for-professionals/privacy/special-...


> government mandated methods

Is that supposed to imply they work?


Actually, if you read that, they aren't government mandated methods (in the technical sense); there's an option of either using a government-specified safe harbor method or getting an "expert determination" that the data is deidentified.


I should have said "there is a government mandated method, but that's not the only way". It's more of a starting point than anything else. Also, if you get HIPAA audited you either have to follow the government way (easiest for broke startups) or go the expert way, but that is a bit more costly to prove out.


Author here, If anyone has any questions about dealing with the medical system or about clinical trials ask away!


I'm guessing you're the author then?

I'm just making my way out of a course called Health Informatics. Most of what we've done is look at HIPAA, and the standards that make sending patient info from one hospital to another possible. In general the whole situation is a mess. I understand the purpose of not sharing identifiable data with the world: it stops people from targeting people because of their conditions. But we have a wealth of information that's been made effectively useless from a research perspective.

this isn't much of a question, just wanted to express my frustration with the whole thing as well. that said I've got a lot of respect for your mission, and the balls required to publish your otherwise HIPAA protected info.


> I'm guessing you're the author then?

The comment literally starts with "Author here"


I added that after he asked that question, we'll give him a pass :)


Thanks a ton for the support!! If you ever wanna connect more on it email me!


Hey Brian, it's really great that you're doing this :)

If there's more to your medical history that you want to track down, or you want to get your data transformed into a structured format, you should reach out to us at PicnicHealth and we'll see what we can do.


I've been in touch with Noga before we def will! :)


What was the actual process of acquiring your entire medical record? My understanding was that this information can be highly fragmented depending on the number of different places one has received medical treatment.


110% true, and I can give multiple examples of this

1) Fortunately I went to just one provider for all my treatment and they make the entire EMR extract available for patient download through their website (Sutter Health in CA). Props to them for doing a great job at this.

2) When dealing with issues w/ other family members and friends we've often only been able to get very minimal data extracts and had to actually fax in requests to get the full medical record sent to us on a CD weeks later.

3) Services are now popping up to do that for you, picnichealth, patientbank, etc. and they should be able to get your full detailed record to view for a cost instead of doing it yourself


Thanks for this! Have you considered sharing DICOM files, too (i.e. the actual images from your MRI, in addition to the reports)? If so, what went into the decision not to include these?


I'll add those in, just didn't download them yet haha


If you're interested in loading the DICOM files into a real viewer, you can try http://www.pacsbin.com, a project I've been working on to keep a personal store of medical images and easily share them on the web. Everything you upload is auto-anonymized, and you can view the images the same way a radiologist or other physician might.


Ahh great work! Will def check it


I'm really confused about what this article's purpose is. You first talk about the issues with clinical trials, then you throw in a tidbit about just feeling like making your medical records public because you couldn't find many open medical records?


Good feedback. When I was initially diagnosed and wanted to start working on this problem, I had no idea what a medical record looked like, so I didn't know what the data I'd be working with looked like, and it can be tricky to do a data project without knowing the data structure :) I just wanted to share mine so that anyone who wants to tackle something medical record related in the future will be able to see what the data sets they'll be working with may look like!

The clinical trial bit is our specific use of that data


You can view the CDA (xml) documents here: http://intelsoft.com.au/challenge/index.htm

CDA documents are XML documents conforming to a schema specified for medical documents.
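
For anyone who hasn't opened one, here is a minimal sketch of reading a CDA-style header with Python's standard library. The XML below is a hand-written, heavily trimmed fragment for illustration (not taken from the linked files or a real record), and the helper name is made up; real documents carry far more sections:

```python
import xml.etree.ElementTree as ET

# Hand-written toy fragment in the shape of a CDA header (illustrative only).
SAMPLE_CDA = """
<ClinicalDocument xmlns="urn:hl7-org:v3">
  <title>Example Summary</title>
  <recordTarget>
    <patientRole>
      <patient>
        <name><given>Jane</given><family>Doe</family></name>
        <birthTime value="19800101"/>
      </patient>
    </patientRole>
  </recordTarget>
</ClinicalDocument>
"""

# CDA elements live in the HL7 v3 namespace, so every lookup must be qualified.
NS = {"hl7": "urn:hl7-org:v3"}


def patient_summary(cda_xml: str) -> dict:
    """Pull the document title and basic patient fields out of a CDA string."""
    root = ET.fromstring(cda_xml)
    patient = root.find("hl7:recordTarget/hl7:patientRole/hl7:patient", NS)
    return {
        "title": root.findtext("hl7:title", namespaces=NS),
        "given": patient.findtext("hl7:name/hl7:given", namespaces=NS),
        "family": patient.findtext("hl7:name/hl7:family", namespaces=NS),
        "birth_time": patient.find("hl7:birthTime", NS).get("value"),
    }


print(patient_summary(SAMPLE_CDA))
```

The namespace is the part that usually trips people up: without the `hl7:` prefix mapping, `find` returns `None` even though the element is plainly there.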


Was there a link to the actual MRI file and not just the JPEG slices of the damaged bone?


Unfortunately my provider doesn't have that for full download for me, I need to drive there and pickup a CD and haven't had the time to do that yet. Plan to get the full file soon!


CTO at ambrahealth.com here, once you get the CD you can create a free account at our site and upload the MRI for cloud viewing and sharing. A lot of providers are starting to use our service to electronically share the images with patients instead of CD's.


I bet this at scale, like a github for medical records, would be revolutionary.


picnichealth could easily add an "opt in" option whereby patients can opt their data into trials. Institutions could pay for access to all this curated data to use for testing or recruitment of patients.



