Lung Cancer

Active medical-oncology practice in lung cancer paired with informatics research that brought structured smoking history into the EHR - SHAPES and its BERT-era successor SmokeBERT - plus screening implementation, immune checkpoint inhibitor prediction, and bedside research from atypical cases.

Lung cancer is the single largest cause of cancer death in the United States and the cancer for which screening evidence is cleanest: low-dose CT screening of eligible smokers and former smokers reduces mortality by roughly 20% in the published trials. But determining who is eligible depends on accurate pack-year exposure history - and pack-year history is almost always buried in unstructured clinical narratives, not the EHR's structured smoking-status field. That gap between "evidence says screen eligible patients" and "the EHR can't reliably tell you who is eligible" is the problem Dr. Osterman has worked on for more than a decade. It was also the work that got him started.
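To make the eligibility dependency concrete, here is a minimal sketch of the arithmetic involved. Pack-years are packs smoked per day multiplied by years smoked. The thresholds below (ages 50 to 80, at least 20 pack-years, currently smoking or quit within the past 15 years) are assumptions that mirror the 2021 USPSTF recommendation; they are not taken from this page, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


def pack_years(packs_per_day: float, years_smoked: float) -> float:
    """Pack-years = packs smoked per day multiplied by years smoked."""
    return packs_per_day * years_smoked


@dataclass
class SmokingHistory:
    """Hypothetical structured record; field names are illustrative only."""
    age: int
    packs_per_day: float
    years_smoked: float
    years_since_quit: Optional[float]  # None means the patient currently smokes


def is_screening_eligible(
    history: SmokingHistory,
    min_age: int = 50,                   # assumed threshold (USPSTF 2021)
    max_age: int = 80,                   # assumed threshold (USPSTF 2021)
    min_pack_years: float = 20.0,        # assumed threshold (USPSTF 2021)
    max_years_since_quit: float = 15.0,  # assumed threshold (USPSTF 2021)
) -> bool:
    """Return True when all three criteria (age, exposure, recency) are met."""
    in_age_range = min_age <= history.age <= max_age
    enough_exposure = (
        pack_years(history.packs_per_day, history.years_smoked) >= min_pack_years
    )
    smoked_recently = (
        history.years_since_quit is None
        or history.years_since_quit <= max_years_since_quit
    )
    return in_age_range and enough_exposure and smoked_recently
```

Every input to this check except age is a value that typically lives only in narrative notes, which is why the structured smoking-status field alone cannot answer the eligibility question.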

SHAPES (2015-2017) and the YIA. Dr. Osterman built the Smoking History and Pack-Year Extraction System (SHAPES) during his medical-oncology fellowship and biomedical-informatics MS at Vanderbilt - a natural-language-processing pipeline that reads clinical notes and reconstructs granular tobacco exposure (pack-years, duration, quit dates) from free text. The early work was presented at international meetings starting in 2015 (Quantifying Tobacco Exposure Using Clinical Notes and Natural Language Processing, IASLC Targeted Therapies 2016) and earned him the Conquer Cancer Foundation Young Investigator Award in 2016. The first downstream application followed almost immediately: an EHR-wide gene-environment interaction study using SHAPES-extracted smoking data to evaluate lung-cancer risk (Osterman, Wei, Mize & Denny, ASCO 2016 abstract 1524).
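As a rough illustration of the extraction task itself, the sketch below pulls pack-years, packs per day, duration, and quit year out of a sentence with a handful of regular expressions. It is a toy under stated assumptions, not the SHAPES implementation: the patterns, the function name, and the output keys are all hypothetical, and real clinical notes need far more robust handling (negation, ranges, conflicting mentions across notes).

```python
import re
from typing import Dict, Optional, Pattern

# Hypothetical patterns for a few common phrasings; not taken from SHAPES.
NUM = r"(\d+(?:\.\d+)?)"
PACK_YEARS_RE = re.compile(NUM + r"\s*-?\s*pack[- ]?years?", re.IGNORECASE)
PACKS_PER_DAY_RE = re.compile(
    NUM + r"\s*(?:ppd|packs?\s*(?:per|a|/)\s*day)", re.IGNORECASE
)
YEARS_SMOKED_RE = re.compile(
    r"\b(?:for|x)\s*" + NUM + r"\s*(?:years?|yrs?)", re.IGNORECASE
)
QUIT_YEAR_RE = re.compile(
    r"quit\s+(?:smoking\s+)?(?:in\s+)?((?:19|20)\d{2})", re.IGNORECASE
)


def _first_number(pattern: Pattern[str], text: str) -> Optional[float]:
    """Return the first captured number as a float, or None if no match."""
    match = pattern.search(text)
    return float(match.group(1)) if match else None


def extract_smoking_history(note: str) -> Dict[str, Optional[float]]:
    """Extract a few tobacco-exposure fields from one free-text note."""
    packs_per_day = _first_number(PACKS_PER_DAY_RE, note)
    years_smoked = _first_number(YEARS_SMOKED_RE, note)
    stated_pack_years = _first_number(PACK_YEARS_RE, note)
    quit_match = QUIT_YEAR_RE.search(note)

    # Prefer an explicitly stated pack-year figure; otherwise derive it
    # from packs per day and duration when both are present.
    if stated_pack_years is None and None not in (packs_per_day, years_smoked):
        stated_pack_years = packs_per_day * years_smoked

    return {
        "pack_years": stated_pack_years,
        "packs_per_day": packs_per_day,
        "years_smoked": years_smoked,
        "quit_year": int(quit_match.group(1)) if quit_match else None,
    }


example = extract_smoking_history(
    "Former smoker, 1.5 packs per day for 20 years, quit in 2010."
)
```

For the example sentence, the sketch derives 30 pack-years from 1.5 packs per day over 20 years and records 2010 as the quit year. A note that states only "40 pack-year history" would yield the pack-year figure and leave the other fields empty.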

Lung cancer screening implementation. Once an extraction pipeline exists, the question becomes whether it actually moves screening forward. A series of Vanderbilt collaborations with the lung-cancer-screening community examined that translation in practice. Two examples stand out. The first, Women screened for breast cancer are dying from lung cancer: An opportunity to improve lung cancer screening (Sandler, Haddad, Paulson, Osterman & Scott, Journal of Medical Screening, 2021), mined the existing breast-cancer screening cohort for women who would also qualify for lung-cancer screening but weren't being offered it. The second, Identification and Characterization of Avoidable Hospital Admissions in Patients With Lung Cancer (Lander, Li, Huang, Cass, Iams et al., JNCCN, 2023), works on the care-delivery side, identifying admissions that could be prevented with better outpatient management.

SmokeBERT (2025) - the language-model successor. SHAPES was a careful hand-engineered NLP pipeline; ten years later the technique of choice is transformer-based language modeling applied to the same problem. Heng Tan, MD - a Hematology-Oncology fellow Dr. Osterman mentored - led the development of SmokeBERT, a BERT-based smoking-history extraction model that significantly outperforms the original SHAPES on reconstruction accuracy and generalization. The work was published as SmokeBERT and Beyond: Bridging Clinical Narratives and Structured Smoking Data To Improve Lung Cancer Screening (Tan & Osterman, JCO Clinical Cancer Informatics, 2025), with Dr. Osterman as senior author. The arc is exactly the one this site keeps coming back to: a fellow inheriting a problem that's been around for a decade, applying the current generation of tools to it, and pushing the field forward. See AI in oncology for the broader thread on LLMs in clinical contexts, and clinical informatics education for the trainee pipeline that produced the work.

Immune checkpoint inhibitor prediction in lung cancer. Lung cancer is the single largest clinical application of immune checkpoint inhibitors, and is therefore the highest-yield context for the GE HealthCare Digital Precision Oncology work on predicting ICI efficacy and toxicity from real-world data. The cohorts that powered the flagship Lippenszky et al. JCO CCI 2024 paper draw heavily on Vanderbilt lung-cancer patients, and the predictive framework is most clinically useful where it is most clinically needed: at the point of choosing whether to start an ICI in a patient with metastatic NSCLC.

Bedside research from active clinical practice. Dr. Osterman's lung-cancer clinical practice continually surfaces problems beyond screening that the same data infrastructure can address. A notable example: in collaboration with thoracic oncologist Dr. Wade Iams and the maternal-fetal-medicine team at Vanderbilt, Dr. Osterman co-authored Two Uncomplicated Pregnancies on Alectinib in a Woman With Metastatic ALK-Rearranged NSCLC (Weidenbaum, Cann, Osmundson, Iams & Osterman, JTO Clinical and Research Reports, 2022) - a clinical case report documenting a young woman with metastatic ALK-rearranged NSCLC who carried two pregnancies to term while continuing targeted therapy with alectinib. The paper is the kind of bedside-driven contribution that only happens when an active clinician is paying attention to atypical cases and willing to publish them. Adjacent NSCLC-care work continues through the Vanderbilt-Ingram Thoracic Oncology research community.

Clinical practice and appointments. Dr. Osterman maintains an active medical-oncology clinical practice at Vanderbilt-Ingram Cancer Center, with a clinical focus on lung cancer. New-patient appointments and second opinions can be requested through the Vanderbilt Health physician directory.

The through-line: lung cancer is where the data-infrastructure thesis got tested first. If clinical informatics can solve the smoking-history extraction problem cleanly enough to move population-level screening, it can solve almost any structured-data extraction problem in oncology. And the active clinical practice is what keeps the rest of the work honest. See also cancer data standards (mCODE) and clinical genomics in the EHR for the standards-and-substrate work that the lung cancer research connects to.

Named programs

SHAPES (Smoking History and Pack-Year Extraction System) - 2016 Conquer Cancer Foundation Young Investigator Award
SmokeBERT - clinical narrative to structured smoking data (JCO CCI, 2025)
Lung cancer screening implementation research at Vanderbilt-Ingram

Peer-reviewed publications (12)

Heng Tan, Travis J. Osterman. SmokeBERT and Beyond: Bridging Clinical Narratives and Structured Smoking Data To Improve Lung Cancer Screening. JCO Clinical Cancer Informatics Dec 22, 2025
Shelby A. Crants et al. Clonal Hematopoiesis of Indeterminate Potential After Radiation Therapy. International Journal of Radiation Oncology*Biology*Physics Oct 24, 2025
Kai Zhang, Tongtong Huang, Bradley A. Malin, Travis Osterman, Qi Long, Xiaoqian Jiang. Introducing mCODEGPT as a zero-shot information extraction from clinical free text data tool for cancer research. Communications Medicine Oct 15, 2025
Irbaz Bin Riaz, Muhammad Ali Khan, Travis J. Osterman. Artificial intelligence across the cancer care continuum. Cancer Aug 15, 2025
Yanwei Li et al. Minimal Common Oncology Data Elements Genomics Pilot Project: Enhancing Oncology Research Through Electronic Health Record Interoperability at Vanderbilt University Medical Center. JCO Clinical Cancer Informatics Jun 28, 2024
Levente Lippenszky et al. Prediction of Effectiveness and Toxicities of Immune Checkpoint Inhibitors Using Real-World Patient Data. JCO Clinical Cancer Informatics Mar 21, 2024
Rachel S. Goodman et al. Accuracy and Reliability of Chatbot Responses to Physician Questions. JAMA Network Open Oct 2, 2023
Eric M. Lander et al. Identification and Characterization of Avoidable Hospital Admissions in Patients With Lung Cancer. Journal of the National Comprehensive Cancer Network Oct 1, 2023
Protiva Rahman et al. Accelerated curation of checkpoint inhibitor-induced colitis cases from electronic health records. JAMIA Open Apr 1, 2023
Rachel S. Goodman, J. Randall Patrinely, Travis Osterman, Lee Wheless, Douglas B. Johnson. On the cusp: Considering the impact of artificial intelligence language models in healthcare. Med (New York, N.Y.) Mar 10, 2023
Douglas Johnson et al. Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model (under review). Feb 28, 2023
Chloe Weidenbaum, Christopher G. Cann, Sarah Osmundson, Wade T. Iams, Travis Osterman. Two Uncomplicated Pregnancies on Alectinib in a Woman With Metastatic ALK-Rearranged NSCLC: A Case Report. JTO Clinical and Research Reports Jun 18, 2022

Selected talks (7)

Tennessee Oncology Data Analysts Association (Nashville, TN): "Advancing Lung Cancer Treatment in the Era of Precision Oncology". Oct 7, 2022
DBMI Research Forum (Nashville, TN): "EHR-Wide GxE Study Using Smoking Information Extracted From Clinical Notes". May 16, 2016
University of California San Diego, Division of Biomedical Informatics (San Diego, CA): "Extracting and Studying Granular Smoking History from the Electronic Health Record". Mar 8, 2016
AMIA Joint Summit: "Extracting Tobacco Exposure with the Smoking History and Pack-Year Extraction System (SHAPES)". Mar 13, 2018
Conquer Cancer Foundation Scientific and Career Development Retreat (Washington, DC): "Smoking History and Pack Year Extraction System (SHAPES): Supporting Lung Cancer Screening and Tobacco-related Research". Oct 11, 2017
NLM Informatics Training Conference (Columbus, OH): "EHR-Wide GxE Study using Smoking Information Extracted from Clinical Notes". Jun 29, 2016
AMIA Annual Symposium (San Francisco, CA): "Quantifying Tobacco Exposure Using Clinical Notes and Natural Language Processing to Enable Lung Cancer Screening". Nov 18, 2015

Abstracts (8)

Joseph Vento, Lisa Bastarache, Qingxia M. Chen, Travis Osterman. Real-world side effects of targeted therapies: High-throughput association studies leveraging the CancerLinq Discovery lung cancer database. Journal of Clinical Oncology May 28, 2025
David Smith et al. 1246 Prediction of pneumonitis in immunotherapy patients from prior thorax CT. Journal for ImmunoTherapy of Cancer Nov 1, 2024
Zoltan Kiss et al. 1294 External validation of machine learning models to predict efficacy and toxicity of immune checkpoint inhibitors using real-world pan cancer cohorts. Journal for ImmunoTherapy of Cancer Nov 1, 2023
Levente Lippenszky et al. 1300 Prediction of efficacy and toxicities of immune checkpoint inhibitors using real-world patient data. Journal for ImmunoTherapy of Cancer Nov 1, 2023
Eszter Csernai et al. Rolling window-based hepatitis toxicity prediction from routine bloodwork in patients undergoing immune checkpoint inhibitor therapy. Journal of Clinical Oncology Jun 2022
Gergely Horváth et al. Predicting immune checkpoint inhibitor-related hepatitis using electronic health records of patients. Journal of Clinical Oncology Jun 2022
Levente Lippenszky et al. Predicting immune checkpoint inhibitor-related pneumonitis using patient medical information. Journal of Clinical Oncology Jun 2022
Eric Michael Lander et al. Characterization of avoidable hospital admissions in patients with lung cancer in the immunotherapy and targeted therapy era. Journal of Clinical Oncology Jun 2022

In the news (2)

Targeted cancer drug during pregnancy · Vanderbilt University. Aug 2, 2022
Microsoft Investigator Fellow Dr. Travis Osterman uses Azure to support lung cancer treatment protocols · Microsoft Customers Stories. Jul 23, 2021

Frequently asked questions

What is SHAPES?: SHAPES, the Smoking History and Pack-Year Extraction System, is a natural-language-processing pipeline Dr. Osterman built during his medical-oncology fellowship and biomedical-informatics MS at Vanderbilt. It reads clinical notes and reconstructs granular tobacco exposure, including pack-years, duration, and quit dates, from free text. The work earned him the Conquer Cancer Foundation Young Investigator Award in 2016.
What is SmokeBERT?: SmokeBERT is a BERT-based smoking-history extraction model that significantly outperforms the original SHAPES on reconstruction accuracy and generalization. It was developed by Heng Tan, a Hematology-Oncology fellow Dr. Osterman mentored, and published as SmokeBERT and Beyond (Tan and Osterman, JCO Clinical Cancer Informatics, 2025), with Dr. Osterman as senior author.
Why does pack-year extraction matter for lung cancer screening?: Low-dose CT screening reduces lung cancer mortality by roughly 20 percent in published trials, but determining who is eligible depends on accurate pack-year exposure history. That history is almost always buried in unstructured clinical narratives rather than the EHR's structured smoking-status field, which is the gap SHAPES and SmokeBERT address.
Does Dr. Osterman still see patients?: Yes. Dr. Osterman maintains an active medical-oncology clinical practice at Vanderbilt-Ingram Cancer Center with a clinical focus on lung cancer. New-patient appointments and second opinions can be requested through the Vanderbilt Health physician directory.