A patient two months into a checkpoint inhibitor calls the triage line short of breath. By the time the CT confirms pneumonitis, they are on high-dose steroids and the immunotherapy is on hold, sometimes for good. Every oncologist who prescribes these drugs has watched some version of this. The toxicity is real, it can be severe, and the way we catch it today is almost entirely reactive: we wait for the patient to feel sick, then we work backward.
The hope is that, in some cases, the warning was already sitting in the chart. Not in a biomarker we hadn't ordered, but in the labs, the prior imaging, the medication history, the comorbidities we already had. We just weren't reading those data to answer the question: who is about to get sick?
The signal is already collected
Over a multi-year Digital Precision Oncology collaboration with GE HealthCare, my group and I built and externally validated machine-learning models that predict both the effectiveness and the toxicities of immune checkpoint inhibitors from real-world patient data, the kind that accumulates in any EHR running an oncology service. We published the core results in JCO Clinical Cancer Informatics in 2024 (Prediction of Effectiveness and Toxicities of Immune Checkpoint Inhibitors Using Real-World Patient Data), and presented validation and extensions at SITC in 2023 and 2024. One result mattered more than the headline accuracy: the models worked off a short medical history captured before the first dose. You don't need years of pristine longitudinal follow-up to stratify risk. You need what's in front of you the day you write the order.
That reframes the clinical question. The point of a toxicity model is not to predict the future for its own sake. It's to change who you watch, how closely, and starting when. A patient flagged high-risk for pneumonitis is a patient whose new cough gets a same-day call back instead of a routine message, whose baseline pulmonary status you actually documented, whose first hint of hypoxia triggers imaging rather than reassurance. The model doesn't make the diagnosis. It re-allocates clinical attention toward the people most likely to need it.
Three toxicities, three different data shapes
Immune-related toxicities are not one problem, and the data that predict them don't come in one shape either. That distinction is what makes this tractable rather than aspirational.
- Pneumonitis turns out to be partly visible on imaging the patient already had. In JCO CCI in 2025 we showed that a radiomics-and-deep-learning model reading a prior thorax CT (a scan ordered for staging, not for any toxicity question) carries signal about who later develops immunotherapy-induced pneumonitis (Radiomics and Deep Learning Prediction of Immunotherapy-Induced Pneumonitis From Computed Tomography). The pixels were sitting in the PACS archive the whole time.
- Hepatitis shows up in the rhythm of routine bloodwork. We built a rolling-window model that watches the ordinary CBC and chemistry panels every patient on therapy already gets, and flags the trajectory toward checkpoint-inhibitor hepatitis before it crosses a clinical threshold. No new test, just reading the labs as a time series instead of one value at a time.
- Colitis is the hardest, because the ground truth is buried in narrative. You can't train or monitor a colitis model without reliably knowing who actually had colitis, and that fact lives in clinic notes, not a structured field. We tackled that with accelerated curation, pulling checkpoint-inhibitor colitis cases out of the EHR semi-automatically rather than by brute-force chart review (Accelerated curation of checkpoint inhibitor-induced colitis cases from electronic health records, JAMIA Open, 2023).
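To make the rolling-window idea from the hepatitis example concrete, here is a minimal Python sketch. It is not the published model. It assumes a single analyte (ALT), a hypothetical `LabDraw` record, a three-draw window, and an arbitrary slope threshold, all chosen purely for illustration. What it shows is the shape of the computation: each new draw is scored against the draws just before it, so a steady climb can be flagged from the trend rather than from any single value.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean
from typing import List, Tuple


@dataclass
class LabDraw:
    """One routine blood draw (hypothetical record, for illustration)."""
    drawn_on: date
    alt_u_per_l: float  # alanine aminotransferase, U/L


def slope_per_day(window: List[LabDraw]) -> float:
    """Least-squares slope of ALT against days since the window's first draw."""
    xs = [(d.drawn_on - window[0].drawn_on).days for d in window]
    ys = [d.alt_u_per_l for d in window]
    x_bar, y_bar = mean(xs), mean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:
        return 0.0  # all draws on the same day: no trend to estimate
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / denom


def rolling_flags(
    draws: List[LabDraw],
    window: int = 3,               # assumed window length, in draws
    slope_threshold: float = 0.5,  # assumed cutoff, U/L per day
) -> List[Tuple[date, float, bool]]:
    """Score every window of consecutive draws; return (date, slope, flagged)."""
    ordered = sorted(draws, key=lambda d: d.drawn_on)
    results = []
    for end in range(window, len(ordered) + 1):
        recent = ordered[end - window:end]
        s = slope_per_day(recent)
        results.append((recent[-1].drawn_on, s, s >= slope_threshold))
    return results


# Made-up values: draws three weeks apart with a slowly rising ALT.
example = [
    LabDraw(date(2024, 1, 1), 22.0),
    LabDraw(date(2024, 1, 22), 24.0),
    LabDraw(date(2024, 2, 12), 31.0),
    LabDraw(date(2024, 3, 4), 52.0),
]
for drawn_on, slope, flagged in rolling_flags(example):
    print(drawn_on, round(slope, 2), flagged)
```

With these made-up values the first window rises by about 0.21 U/L per day and is not flagged, while the second rises by about 0.67 U/L per day and is. A real model would combine many analytes and learn its thresholds from labeled outcomes. The sketch only illustrates reading labs as a time series.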
That last one is the tell. The bottleneck on toxicity prediction is rarely the algorithm. It's whether the outcome you're predicting exists as trustworthy, structured data, which is the same problem I keep writing about elsewhere and the reason I spend so much time on data standards. A model is only as good as the labels you trained it on, and in oncology those labels are expensive precisely because the truth is written in prose.
What it takes to actually act on a prediction
A risk score that nobody acts on is worse than no score, because it spends trust. If we're going to put these models in front of clinicians, the harder work is downstream of the math.
First, the prediction has to land in the workflow, not in a dashboard someone visits on a good day. A pneumonitis risk flag belongs on the treatment plan and the triage protocol; it should change what the nurse does when the patient calls, automatically. If acting on it requires remembering to go look, it won't get acted on.
Second, you have to decide what the action is, per toxicity, in advance. "High risk" is not a plan. Earlier imaging? A lower threshold for holding the next cycle? A scheduled check-in at the window when that toxicity typically emerges? Tighter lab cadence for the hepatitis-flagged patient? The model buys you lead time; the protocol decides what to do with it. Without the protocol, a flag is just anxiety.
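One way to picture "decide the action in advance" is a lookup table that the workflow consults whenever a flag is raised. The sketch below is a hypothetical illustration, not a clinical protocol. The `MonitoringPlan` type, the `PROTOCOL` table, and its entries are assumptions that paraphrase options mentioned in this post. The one deliberate design choice is that a high-risk flag with no agreed action raises an error instead of passing through silently.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MonitoringPlan:
    callback: str             # how quickly triage returns a symptom call
    actions: Tuple[str, ...]  # pre-agreed steps, written down in advance


# Hypothetical protocol table. Entries are placeholders for illustration,
# not clinical guidance.
PROTOCOL: Dict[Tuple[str, str], MonitoringPlan] = {
    ("pneumonitis", "high"): MonitoringPlan(
        "same-day",
        ("document baseline pulmonary status", "image at first sign of hypoxia"),
    ),
    ("hepatitis", "high"): MonitoringPlan(
        "routine",
        ("tighten lab cadence",),
    ),
}

STANDARD = MonitoringPlan("routine", ("standard monitoring",))


def plans_for(flags: Dict[str, str]) -> List[MonitoringPlan]:
    """Translate per-toxicity risk tiers into pre-agreed monitoring plans."""
    plans = []
    for toxicity, tier in flags.items():
        if tier != "high":
            continue
        plan = PROTOCOL.get((toxicity, tier))
        if plan is None:
            # A high-risk flag with no agreed action fails loudly instead of
            # reaching a clinician as an alert nobody knows how to act on.
            raise LookupError(f"no protocol defined for high-risk {toxicity}")
        plans.append(plan)
    # No high-risk flags at all: fall back to standard monitoring.
    return plans or [STANDARD]


print(plans_for({"pneumonitis": "high", "hepatitis": "low"}))
```

In this sketch, a model that starts flagging a toxicity nobody has written a plan for breaks visibly at integration time, which is when that conversation should happen.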
Third, and this is the part payers and health-system leaders should care about: you have to measure the model in your population before you trust it, and keep measuring after you deploy. We externally validated these models on pan-cancer cohorts for a reason. A model tuned on one institution's mix of cancers, drugs, and documentation habits can quietly fail somewhere else. Calibration drifts. Practice changes. The governance question isn't "is the model accurate" once; it's "how will we know when it stops being accurate here." That's not a reason to wait. It's the operating discipline that makes deployment responsible.
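A periodic calibration check on local patients is one simple way to answer "how will we know when it stops being accurate here." The sketch below assumes you can pair each predicted probability with the outcome that was later observed. It reports two standard summaries, the observed-to-expected event ratio and the Brier score. The function names and the tolerance band of 0.8 to 1.25 are hypothetical choices for illustration, not values from the published work.

```python
from typing import Dict, Sequence


def calibration_summary(
    predicted: Sequence[float],  # model probabilities for the toxicity
    observed: Sequence[int],     # 1 if the toxicity occurred, else 0
) -> Dict[str, float]:
    """Summarize how well predicted risk matches what actually happened."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need equal-length, non-empty inputs")
    n = len(predicted)
    expected_events = sum(predicted)
    observed_events = sum(observed)
    brier = sum((p - y) ** 2 for p, y in zip(predicted, observed)) / n
    o_to_e = (
        observed_events / expected_events if expected_events > 0 else float("nan")
    )
    return {
        "n": n,
        "expected_events": expected_events,
        "observed_events": observed_events,
        "o_to_e": o_to_e,  # 1.0 means events match predictions on average
        "brier": brier,    # mean squared error of the probabilities
    }


def needs_review(
    summary: Dict[str, float],
    low: float = 0.8,    # assumed lower tolerance for O/E
    high: float = 1.25,  # assumed upper tolerance for O/E
) -> bool:
    """True when O/E falls outside the tolerance band (or is undefined)."""
    # A NaN ratio fails both comparisons, so it is also sent for review.
    return not (low <= summary["o_to_e"] <= high)


# Made-up quarter: the model expected about one event, two occurred.
quarter = calibration_summary([0.1, 0.2, 0.3, 0.4], [0, 0, 1, 1])
print(quarter, needs_review(quarter))
```

Run on each new cohort or time window, a check like this turns "keep measuring" into a recurring report with an explicit trigger for human review. A real deployment would also want confidence intervals and subgroup breakdowns, which this sketch omits.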
Where this goes
The near-term win is not an autonomous system at all. It's a quieter shift in standard of care: risk-stratified monitoring for everyone starting a checkpoint inhibitor, drawn from data we already collect and pay to store. The patient flagged as high-risk for pneumonitis gets a close watch. The one flagged as low risk across the board gets a lighter touch and fewer false alarms, which is its own clinical and financial good. We move from catching immune-related toxicity after it announces itself to anticipating it, and we do it without ordering a single new test.
For oncologists, that means the toxicity conversation at the start of therapy gets specific instead of generic. For informaticists, it means the work is mostly plumbing and labels, not model architecture. For payers, it means a concrete place where prediction reduces avoidable admissions and steroid courses rather than just generating alerts. The science to do this is largely done and published. What's left is the unglamorous part: wiring it into the day, agreeing on the action, and watching the model the way we'd watch any new diagnostic. That's the work worth doing, because the signal was never the thing we were missing. The willingness to read it was.