Evaluation of Large Language Model (LLM)-Based Clinical Abstraction of Electronic Health Records (EHRs) for Non-Small Cell Lung Cancer (NSCLC) Patients

05/22/2025

ASCO 2025

Authors: Kabir Manghnani, Katie Mo, Kunal Nagpal, Xifeng Wang, Kaitlynn Cunnea, Bridget Bax, Michael Bodker, Arpita Saha, Chelsea Kendall Osterman, Riccardo Miotto, Chithra Sangli

Background: Abstraction is a critical step for converting clinical data from unstructured EHRs into a structured format suitable for real-world data analyses. Typically, this is a manual, labor-intensive activity requiring substantial training. While prior work has shown that abstraction by humans is reliable, advances in LLMs may improve the efficiency of abstraction. We aim to measure the performance of LLMs in abstracting a diverse set of oncology data elements.

Methods: Two clinical abstractors independently abstracted unstructured records of 222 advanced or metastatic NSCLC patients (mean: 248 pages per case). A two-stage LLM system balancing cost and comprehensiveness was used to abstract clinical elements for demographics, diagnosis, third-party lab biomarker testing, and first-line (1L) treatment. The first stage extracted 16 documents semantically similar to the abstraction query and input them, along with abstraction rules, into an LLM (GPT-4o). The LLM was instructed to provide both the abstracted field and a completeness assessment of the provided context. If the first stage resulted in a low completeness score, the entire patient record was then input into a long-context LLM (Gemini-Pro-1.5) to re-attempt abstraction. Gwet’s agreement coefficient (AC) was the primary measure of agreement between the LLM and each abstractor. Date agreement was calculated within ±30 days.
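
The two-stage control flow described in the Methods can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Abstraction` type, the `retrieve`, `short_context_llm`, and `long_context_llm` callables, the 0-1 completeness scale, and the 0.5 cutoff are all hypothetical, since the abstract specifies only the 16-document retrieval and the fallback on a low completeness score.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Abstraction:
    value: Optional[str]   # abstracted field value, or None if nothing was found
    completeness: float    # model's self-reported context completeness (0-1 scale is assumed)


# Hypothetical call signatures for illustration; not any vendor's API.
Retriever = Callable[[str, Sequence[str], int], Sequence[str]]
Abstractor = Callable[[str, str, Sequence[str]], Abstraction]


def abstract_element(
    query: str,
    rules: str,
    record: Sequence[str],
    retrieve: Retriever,
    short_context_llm: Abstractor,
    long_context_llm: Abstractor,
    top_k: int = 16,
    min_completeness: float = 0.5,  # assumed cutoff; the abstract does not state one
) -> Abstraction:
    # Stage 1: abstract from the top_k documents most similar to the query.
    context = retrieve(query, record, top_k)
    first_pass = short_context_llm(query, rules, context)
    if first_pass.completeness >= min_completeness:
        return first_pass
    # Stage 2: completeness was low, so retry with the entire patient record.
    return long_context_llm(query, rules, record)
```

Passing the retriever and both models in as callables keeps the routing logic testable with stubs; only elements that fail the first pass incur the cost of a full-record call.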

Results: The LLM system yielded abstracted values for 90% of elements where both abstractors provided non-missing values. In these cases, the LLM also demonstrated high agreement with each abstractor (≥0.81 across all categories). Agreement was highest in the demographic and diagnosis domains and lower for the 1L treatment domain, which requires deeper understanding of a patient’s temporal journey. For elements where neither abstractor provided values, the LLM sometimes provided outputs (frequency: 4.9% for non-biomarker elements; 38.5% for biomarker elements). These discrepancies were primarily driven by nuances in abstraction rules; the LLM often included Tempus-tested biomarkers, while abstractors were more rigorous in abstracting only third-party biomarker results.

Conclusions: LLMs show high completion rates and high agreement with human abstractors across a variety of critical abstraction fields. The use of LLMs may significantly reduce the burden of human abstraction and allow for large-scale curation of oncology records. Challenges in handling nuanced contexts underscore the need for careful refinement and evaluation prior to deployment.

Domain	LLM agreement with abstractors (AC, min-max)
Demographic (birth date, sex, race, smoking status)	0.96-1
Diagnosis (stage, histology, year of diagnosis)	0.92-0.98
Third Party Biomarker (EGFR, ALK, ROS1, PDL1, BRAF, RET, NTRK)	0.87-1
1L Treatment (agents, initiation date)	0.81-0.86
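
For readers unfamiliar with the agreement statistic in the table, the following is a minimal sketch of one common form of Gwet's coefficient. The abstract does not say which variant was used, so this assumes AC1 for two raters on nominal categories, with the category set taken from the observed values. The `dates_agree` helper is a hypothetical illustration of the ±30-day rule, not the authors' code.

```python
from collections import Counter
from datetime import date
from typing import Hashable, Sequence


def gwet_ac1(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Gwet's AC1 for two raters and nominal categories (assumed variant)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    q = len(categories)
    if q < 2:
        raise ValueError("AC1 needs at least two categories")
    # Observed agreement: share of items on which the two raters match.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum of pi_k * (1 - pi_k) over categories, divided by (q - 1),
    # where pi_k is the average of the two raters' marginal proportions.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = 0.0
    for k in categories:
        pi_k = (counts_a[k] + counts_b[k]) / (2 * n)
        p_e += pi_k * (1 - pi_k)
    p_e /= q - 1
    return (p_a - p_e) / (1 - p_e)


def dates_agree(d1: date, d2: date, tolerance_days: int = 30) -> bool:
    # Two dates count as agreeing when they differ by at most tolerance_days.
    return abs((d1 - d2).days) <= tolerance_days
```

In use, `gwet_ac1` would be called once per data element with the LLM's values and one abstractor's values for the same patients; `dates_agree` would decide per-patient matches for date fields such as treatment initiation.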
