A Comparison of Data Processing Methods To Support Patient Matching to Oncology Clinical Trials

ASCO 2026

May 20, 2026

Oncology

Abstract

Tunghi May Pini, Kunal Nagpal, Mark Vance, Caity Moran Rose, Michelle Huang, Michael Singer, Gianna Klonk, Ryan Godart, Tian Kang, Dan Sun, Arpita Saha, Chelsea Kendall Osterman

Background To pre-screen and match patients to trials at scale, technology tools are needed to support human workflows. These tools require numerous data inputs, including clinical concepts such as cancer diagnosis, stage, and histology. The more complete these data are, the more accurate the resulting matches and the more efficient human reviewers can be. While some clinical data are available from structured fields in clinical systems, most reside in unstructured documentation. Methods including human abstraction, natural language processing (NLP) models, and large language model (LLM) agents can improve data completeness and accuracy beyond structured EHR and laboratory information management system (LIMS) data. We compared these methods for obtaining diagnosis and stage data for Tempus Link, a tool that supports patient-trial matching at scale for a national oncology trials network.

Methods We randomly sampled patients with one of 4 previously abstracted cancer diagnoses: lung (LC), prostate (PC), colorectal (CRC), and breast cancer (BC). For LC we also examined histology (NSCLC vs SCLC). For each method (EHR, LIMS, NLP predicted, LLM agent), diagnosis, stage, and histology values were compared to the abstracted value, and accuracy was calculated as the percent of patients with a correct value among patients with an abstracted value available. Since these data are inputs into a tool used by RN screeners who confirm matches before notifying a site, diagnosis accuracy was assessed at the level of broad cancer diagnostic categories (e.g., neoplasm of lung). In addition, we assessed completeness as the presence of a usable value across all patients in a cohort.
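The two metrics defined above can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the exact-match comparison, the use of `None` for a missing value, and the choice to count a missing method value as incorrect are all assumptions made for illustration, and the four-patient cohort is invented.

```python
from typing import Optional, Sequence


def accuracy(method_values: Sequence[Optional[str]],
             abstracted_values: Sequence[Optional[str]]) -> float:
    """Percent of patients with a correct value, among patients who have an
    abstracted (reference) value. Assumption: a missing method value counts
    as incorrect, and correctness is an exact string match."""
    pairs = [(m, a) for m, a in zip(method_values, abstracted_values)
             if a is not None]
    if not pairs:
        raise ValueError("no abstracted values available")
    correct = sum(1 for m, a in pairs if m == a)
    return 100.0 * correct / len(pairs)


def completeness(method_values: Sequence[Optional[str]]) -> float:
    """Percent of all patients in the cohort with a usable (non-missing) value."""
    if not method_values:
        raise ValueError("empty cohort")
    usable = sum(1 for m in method_values if m is not None)
    return 100.0 * usable / len(method_values)


# Hypothetical four-patient cohort (illustrative values only).
abstracted_stage = ["II", "IV", None, "I"]   # third patient has no abstracted stage
llm_stage = ["II", "III", "IV", None]        # fourth patient has no LLM value

print(accuracy(llm_stage, abstracted_stage))  # 1 correct of 3 with an abstracted stage
print(completeness(llm_stage))                # 3 usable of 4 patients
```

Note that the two metrics use different denominators: accuracy is restricted to patients with an abstracted value, which is why the stage rows in Table 1 report their own n, while completeness is computed over the full cohort.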

Results Completeness for EHR, NLP, and LLM was higher for diagnosis compared to stage, ranging from 91.7 – 100% vs 20 – 95.8%. LIMS completeness was lower for all variables across cohorts, ranging from 0 – 87.8%.

Conclusions Variability exists across data sources and processing methods in generating data inputs for trial matching. All methods have high completeness and accuracy for diagnosis, while there are significant gains when applying NLP and LLMs for stage and histology. To balance accuracy, matching efficiency, and cost, the use of enhanced data processing methods is required for certain variables but may not be needed for others.

Table 1: Accuracy by cohort, variable, and data processing method

Cohort	Variable	EHR	LIMS	NLP	LLM
LC, n=48	# diagnosis correct	48 (100%)	33 (68.8%)	48 (100%)	45 (93.8%)
	# NSCLC correct	1 (2.1%)	27 (56.3%)	48 (100%)	39 (81.3%)
	# stage correct (n=41)	16 (39%)	0 (0%)	27 (65.9%)	30 (73.2%)
PC, n=48	# diagnosis correct	48 (100%)	43 (89.6%)	48 (100%)	44 (91.7%)
	# stage correct (n=45)	22 (48.9%)	0 (0%)	15 (33.3%)	38 (84.4%)
CRC, n=49	# diagnosis correct	47 (95.9%)	42 (85.7%)	47 (95.9%)	46 (93.9%)
	# stage correct (n=46)	21 (45.7%)	0 (0%)	36 (78.3%)	42 (91.3%)
BC, n=50	# diagnosis correct	50 (100%)	9 (18%)	47 (94%)	50 (100%)
	# stage correct (n=36)	5 (13.9%)	0 (0%)	23 (63.9%)	35 (97.2%)

