Everything About AI in Life Sciences, We Learned in Kindergarten

January 5, 2024, 9:28 AM EST

By Rohit Nambisan



As I prepare to head to the 42nd J.P. Morgan Healthcare Conference, one of the most influential industry events and a great kick-off to the new year, I am reminded of one of the most important issues our industry must address in 2024. It’s not artificial intelligence (AI) itself; rather, it’s what is required to leverage this potentially transformative technology for clinical research.

Everything about implementing AI in life sciences, we already learned in kindergarten: how to share. Sharing data without violating privacy, raising consumer concerns, or putting revenue or proprietary information at risk (counterintuitive as this may seem) is how AI will make its promised impact.

Despite advances in clinical science, 95% of diseases still have no treatment – and there are two primary culprits that can be addressed through data sharing. 

(1) There is not enough accessible data to properly train AI models to help us uncover new treatments. The data exists but is spread out across different entities such as hospitals, payers, drug developers, regulators, and patients. If these groups can share their data in a compliant manner, we can increase the volume of data available, enabling AI to identify new treatments faster.

(2) The industry struggles with clinical trial operational problems, as evidenced by the number of new treatments that failed due to problems with clinical trial data collection rather than efficacy or safety issues. Consider these headlines: “BioVie blames protocol errors at trial sites for phase 3 Alzheimer’s drug fail as stock craters,” “Acelyrin Flags CRO Errors in Izokibep Program After Late-Stage Failure,” and “Pfizer scraps half of participants in Lyme disease drug trial due to quality issues.” Here again, sharing real-time data on study conduct can improve study performance by predicting friction points and identifying operational risks for each trial stakeholder (sites, CROs, sponsors, participants) before they manifest as catastrophic issues.

Training to Identify Umbrellas & the AI Clinical Research Paradox

AI requires tremendous amounts of data, and that data needs to be representative of the population it describes, sampled from multiple sources to reduce bias, which in turn increases data volume requirements. So how much data are we talking about? As an example, OpenAI's GPT-4 was reportedly trained on about one petabyte of written text. That's roughly 1.5 million CD-ROMs. If we stacked them one on top of another, we'd get a pile of CDs roughly 1.2 miles high!
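For readers who want to check the CD-ROM comparison, here is a rough back-of-the-envelope sketch. The disc capacity (700 MB), disc thickness (1.2 mm), and decimal definition of a petabyte are my assumptions, not figures from a specification of the training data; under them the result lands in the same ballpark as the rounded numbers above, at roughly 1.4 million discs and a little over a mile.

```python
# Back-of-the-envelope check of the CD-ROM comparison.
# Assumptions (illustrative only): 700 MB per disc, 1.2 mm per disc,
# and decimal units (1 PB = 10**15 bytes).
PETABYTE_BYTES = 10**15
CD_CAPACITY_BYTES = 700 * 10**6
CD_THICKNESS_M = 1.2e-3
METERS_PER_MILE = 1609.344


def cds_needed(total_bytes, cd_bytes=CD_CAPACITY_BYTES):
    """Number of discs needed to hold total_bytes (rounded up)."""
    return -(-total_bytes // cd_bytes)  # ceiling division on integers


def stack_height_miles(n_discs, thickness_m=CD_THICKNESS_M):
    """Height of a stack of n_discs, in miles."""
    return n_discs * thickness_m / METERS_PER_MILE


discs = cds_needed(PETABYTE_BYTES)
height = stack_height_miles(discs)
print(f"{discs:,} discs, about {height:.2f} miles high")
```

Changing the assumed capacity or thickness shifts the answer, which is why the figures in the text should be read as approximations.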

Let’s get more specific. It takes about 100,000 examples of a single object to train a computer to identify it, even for the simplest thing, like an umbrella. Now consider much more complex “things” like a circulating tumor cell or Parkinson’s disease. Without a data strategy, we are never going to fully realize the incredible potential of AI applied to life sciences. This is the clinical research AI paradox: AI is fueled by data, yet no single organization has enough information to drive the identification of anything more complex than an umbrella. And it is difficult for data scientists to pull together and harmonize data from various companies, each with different data collection methods, sources, lineages, and formats.


Sharing is Caring

The cost of sharing data, and more specifically the work required to share it, often surpasses the open market value and purchase price of that data, given that any single organization may have only a small quantity of data to share. Yet there are good reasons to share data in drug development. For example, companies can drive more revenue by developing more novel therapies with higher probabilities of success and lower risk of failure, which improves the risk/reward equation and can thus mitigate rising drug prices.

Data sharing and cross-industry collaboration, like we witnessed during the pandemic, will power the AI revolution in clinical research in 2024. The COVID pandemic proved that pharmaceutical companies, clinicians, researchers, technology companies, and regulators can work together to drive remarkable global healthcare outcomes.

A current example of how this could work is the MELLODDY consortium. MELLODDY is using federated learning, a data-sharing model that protects companies’ proprietary information while still sharing important research data, to provide much-needed, high-quality small molecule and protein data to help AI/ML models design new therapies faster. Protein drug development is notoriously long, arduous, and costly. But with MELLODDY, the partner organizations can use a federated data model and AI to drive greater efficiency than any individual organization could alone.
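To make the federated idea concrete, here is a minimal toy sketch of federated averaging, one common form of federated learning. It is not MELLODDY's actual implementation; the partner datasets, function names, and the one-parameter model are all hypothetical, chosen only to show that partners exchange model weights rather than raw records.

```python
# Toy federated averaging: each partner fits y ≈ w * x on its own private
# data and shares only the updated weight, never the underlying records.
# All names and data here are hypothetical and for illustration only.

def local_update(w, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps on one partner's private data."""
    for _ in range(epochs):
        # Gradient of mean squared error for the model y = w * x.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(w_global, partners, rounds=50):
    """Each round, partners train locally and a coordinator averages weights."""
    total = sum(len(d) for d in partners)
    for _ in range(rounds):
        local_ws = [local_update(w_global, d) for d in partners]
        # Weight each partner's result by how many examples it holds.
        w_global = sum(w * len(d) for w, d in zip(local_ws, partners)) / total
    return w_global


# Hypothetical private datasets, both generated from y = 3x.
partner_a = [(1.0, 3.0), (2.0, 6.0)]
partner_b = [(3.0, 9.0), (4.0, 12.0), (5.0, 15.0)]

w = federated_average(0.0, [partner_a, partner_b])
print(round(w, 3))
```

The shared model converges toward the slope that fits both partners' data, even though neither partner ever sees the other's records. Real deployments typically add protections such as secure aggregation on top of this basic loop.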

Outside of federated models, we can merge different data sets by tokenizing and anonymizing identifiers, ensuring that customer and consumer privacy is maintained while enabling safer and more efficacious treatments to move through development. Through tokenization and anonymization, we can ease patients’ fears that their data will be weaponized or used against them, while still allowing that data to accelerate the development of better treatments. With novel generative AI approaches, data scientists can also rapidly harmonize data from multiple sources, hastening the preparation of that data for analytical consumption.
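As a rough illustration of tokenization, the sketch below swaps a patient identifier for a keyed hash so that two data sets can be joined without either one exposing the original ID. The secret key, field names, and records are hypothetical, and keyed hashing is only one possible approach; a real program would also need key management and a review of re-identification risk in the remaining fields.

```python
import hashlib
import hmac

# Hypothetical shared secret; in practice a trusted party would manage it.
SECRET_KEY = b"example-key-not-for-production"


def tokenize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed, non-reversible token."""
    normalized = identifier.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(records, id_field="patient_id"):
    """Return copies of the records with the identifier swapped for a token."""
    out = []
    for rec in records:
        clean = {k: v for k, v in rec.items() if k != id_field}
        clean["token"] = tokenize(rec[id_field])
        out.append(clean)
    return out


def merge_on_token(left, right):
    """Join two de-identified data sets on their shared token."""
    index = {r["token"]: r for r in right}
    return [{**l, **index[l["token"]]} for l in left if l["token"] in index]


# Hypothetical records from two organizations describing the same patient.
hospital = [
    {"patient_id": "P-001", "diagnosis": "PD"},
    {"patient_id": "P-002", "diagnosis": "AD"},
]
wearable = [{"patient_id": "p-001 ", "daily_steps": 4200}]

merged = merge_on_token(deidentify(hospital), deidentify(wearable))
print(merged)
```

Because both organizations derive the token from the same normalized identifier and the same key, the records line up on the token alone, and the merged output contains no `patient_id` field.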

Imagine how we could accelerate drug development by pooling data from all known sources for a specific disease. Such collaboration can generate novel, earlier predictive signals of disease onset and progression, and we know that early detection is the best prevention. It also opens up more opportunities for non-invasive data collection, built on algorithms that are validated against disease detection and progression and deployed on everyday devices like our smartphones, our watches, our smartware. Ultimately, it means keeping us out of hospitals and allowing us to live fuller lives with our families and loved ones.

There are so many unmet clinical needs. We need a paradigm shift to generate radical outcomes – and this requires AI because we do not have enough human resource power or brute force to tackle all unmet clinical needs. Yet, to generate a valuable AI strategy, we need a data strategy because data fuels AI models. To create a powerful data strategy, we first need a data sharing strategy.

Before even considering AI projects within your organizations this year, think first of the data that will power these solutions. Then explore how we can work together with different algorithm developers, data sources, data generators, physicians, regulators, and patients to access the best information and generate the best predictions for applied science, benefiting humankind by eradicating disease through the power of AI.


I look forward to talking more about this with you at JPM Health this week. Let’s meet – email me at
