Healthcare Data: Fuel for your Machine Learning ‘Rocket Ship’

Feb 20, 2017

Machine learning descends from the not-so-new field of pattern recognition, yet it has recently been gaining fresh momentum. From gigantic corporations to fledgling startups and venture capitalists, no one loses a moment to bombard the audience with buzzwords like deep learning, neural networks, NLP, etc.

“…90% of the investors have very little idea what AI is so if you’re a founder raising money, you should sprinkle some AI into your pitch deck… Then sit back and watch the funding roll in.”

This is funny but nevertheless true, and it applies to almost all wannabe entrepreneurs and inexperienced investors.

As a computer and computational scientist, I find it upsetting to hear entrepreneurs and investors repeat jargon and cite generic use cases of how machine learning will benefit humanity at large. I do not want to be judgmental, but it seems they do not even comprehend what is required to get started in the first place. During our journey, we have gradually understood and started addressing the 4Ps necessary for any machine learning startup. These 4Ps are essential for a well-informed, high-quality healthcare startup. In brief, they are:

i. Problem (i.e. use cases),

ii. Product (i.e. software, hardware),

iii. Petabytes (i.e. colossal amount of data), and

iv. People (i.e. subject matter experts, computer scientists, programmers).

Contacting subject matter experts who work in the relevant areas is perhaps the best way to find use cases or application areas. Tech entrepreneurs most often come from a computer science background, and at times they forget the interdisciplinary nature of the challenge. For example, a startup in the medical imaging or radiology space that is building ML algorithms for computer-aided diagnostics should not neglect to bring in radiologists or physicians as advisors. Working with them will help identify the right use cases and application areas.

One of the most difficult challenges for any ML healthcare startup is the acquisition of data. The major barriers to overcome with healthcare data are privacy and ownership. Data acquisition is precisely the focus of this article. I have tried to compile not only the use cases but also the data sets available on the internet, for a better understanding of the situation.

Prof. Andrew Yan-Tak Ng rightly remarked, “…when I think about building machine learning products I think of building a rocket ship… A rocket ship is a giant engine together with a ton of fuel. Both need to be really big. If you have a lot of fuel and a tiny engine, you won’t get off the ground. If you have a huge engine and a tiny amount of fuel, you can lift up, but you probably won’t make it to orbit. So you need a big engine and a lot of fuel. …giant computers, that’s our rocket engine. And the fuel is the data.”

Presently, Brainpan Innovations is drilling data wells and building data pipelines through various products, initiatives, and collaborations to fill huge reservoirs of this so-called fuel, data, for our space journey.

Here is a list of links to healthcare data sets that could catalyze medical research and trials and so facilitate better healthcare. I would love to share it with the community.


Can we improve lung cancer detection?

Predict whether breast cancer is benign or malignant.

Automate tumor segmentation (Soft Tissue Sarcomas).

The Cancer Imaging Archive (TCIA).

Prevent cervical cancer by identifying at-risk populations.

Contagious Diseases

What contaminant has caused most of the hospitalizations and fatalities?

Weekly case reports for polio, smallpox, and other diseases in the United States.

Analyze the ongoing spread of Zika virus.

Predict West Nile virus in mosquitoes across the city of Chicago.

Predict when, where and how strong the flu will be.

Predict the likelihood that an HIV patient’s infection will become less severe, given a small dataset and limited clinical information.

Clinical & Drug Discovery

Why do 30% of patients miss their scheduled appointments?

Predicting doctor attributes from prescription behavior.

Collection of physical spine data.

Predict the onset of diabetes based on diagnostic measures.

Predict seizures in long-term human intracranial EEG recordings.

Identify nerve structures in ultrasound images of the neck.

Transforming How We Diagnose Heart Disease.

Identify signs of diabetic retinopathy in eye images.

Predict seizures in intracranial EEG recordings I.

Detect seizures in intracranial EEG recordings II.

Predict visual stimuli from MEG recordings of human brain activity.

Diagnose schizophrenia using multimodal features from MRI scans.

Identify patients who will be admitted to a hospital within the next year using historical claims data. [Data no longer available]

Can we objectively measure the symptoms of Parkinson’s disease with a smartphone?

Think it’s possible to make hospital visits hassle-free? [Data no longer available]

Predict future prescription volume.

Help develop safe and effective medicines by predicting molecular activity.

Start digging into electronic health records and submit your creative, insightful, and visually striking analyses.

Identify patients diagnosed with Type 2 Diabetes.

Predict a biological response of molecules from their chemical properties.


Genome Phenotype SNPs Raw Data.

SP1 factor binding and non-binding sites on Ch1 for classification tasks.

Genomes Online Database


FTP site for GenBank.


Government data related to the water quality in India.

Air quality monitoring data from northern Taiwan, 2015.

Which pesticide is used most frequently in the United States?

Predict ocean health, one plankton at a time.

Predict physical and chemical properties of soil using spectral measurements.


American Time Use Survey (ATUS) Eating & Health Module Files from 2014.

State of human health across the world.