Here is what you really want: large amounts of curated, quality-controlled data with ground truth that you can aggregate and share. Preferably with multiple studies, time points, and/or follow-up. That is stated in rough order of difficulty to acquire.
Here is what you typically get fed into a learning pipeline: 1-2 orders of magnitude too little data, with all kinds of noise, and no truth data (i.e. at best a bad proxy).
Hand-waving about unsupervised learning won't solve many of the really difficult problems (although it has its uses, obviously). Neither will hand-waving about transfer learning. In some areas, most retrospective data sets will never really be available because of consenting issues. QA is hard - the sheer variability of clinical systems in the field, not to mention differences in protocol and practice, is often astonishing.
So where does that leave us? To make a real dent fast, I suspect you need to focus on data availability, not on the problem. Ask the question:
What are the fastest paths to collecting large volumes of clinically representative data with some QA in place, consented for the ways we want to use it, and with real clinical ground truth or a decent proxy we can get at in an automated or semi-automated fashion? 1,000 bonus points if real outcome data will be available in the future.