The multimodal imperative: Leveraging integrated data streams to accelerate oncology research and drug development

04/06/2026


Why integrating EHR, claims, and molecular data is essential for modern oncology research, enabling high-definition, longitudinal patient views.

The paradigm of precision oncology is undergoing a fundamental shift. For decades, the industry relied on a unimodal approach in which therapeutic decisions and research cohorts were defined primarily by single-analyte biomarkers or isolated clinical trial endpoints. That approach provides only a snapshot in time and fails to capture the continuous, longitudinal narrative of disease progression. As our understanding of tumor biology and the patient journey deepens, it has become evident that fragmented data often leads to fragmented insights.

To truly de-risk drug development and optimize patient outcomes, researchers and drug developers must move toward a multimodal data architecture. By integrating structured and unstructured electronic health record (EHR) data with Health Information Exchange (HIE) records, medical claims, molecular PDFs, and raw sequencing files, researchers can construct a high-definition, longitudinal view of the patient. This multidimensional approach is no longer just a technical advantage; it is a scientific necessity.

Why molecular data needs clinical context

Molecular profiling has revolutionized oncology, yet a genomic variant in isolation offers limited predictive power. Its clinical utility is realized only when contextualized by the patient’s phenotype—their treatment history, comorbidities, and evolving response to therapy.

Research has demonstrated that clinico-genomic databases (CGDB), which link longitudinal EHR data with comprehensive genomic profiling, significantly enhance the ability to identify real-world endpoints.¹ For life sciences organizations, this integration allows for the discovery of hidden responders—subpopulations that may not meet traditional trial criteria but show significant benefit in a real-world setting. Without this clinical tether, molecular data remains a snapshot in time rather than a continuous narrative of disease progression. Longitudinality is paramount, enabling researchers to track the patient’s therapeutic trajectory, including sequential treatments, evolving toxicities, and long-term outcomes, which is vital for generating robust real-world evidence.
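As a minimal sketch of what this linkage can look like, the Python below merges EHR events and molecular results into one date-ordered timeline per patient. The record layout, field names, patient IDs, and the `build_timelines` helper are all hypothetical and chosen for illustration; real clinico-genomic linkage also involves de-identification, tokenized matching, and data-quality checks that are not shown here.

```python
from collections import defaultdict
from datetime import date

# Hypothetical de-identified records; field names and values are illustrative only.
ehr_events = [
    {"patient_id": "P001", "date": date(2023, 1, 10), "event": "diagnosis: NSCLC"},
    {"patient_id": "P001", "date": date(2023, 2, 1), "event": "start line 1: carboplatin + pemetrexed"},
    {"patient_id": "P001", "date": date(2023, 9, 15), "event": "start line 2: osimertinib"},
    {"patient_id": "P002", "date": date(2023, 3, 5), "event": "diagnosis: CRC"},
]
genomic_results = [
    {"patient_id": "P001", "date": date(2023, 8, 20), "event": "variant reported: EGFR L858R"},
    {"patient_id": "P003", "date": date(2023, 4, 1), "event": "variant reported: KRAS G12C"},
]


def build_timelines(*sources):
    """Merge (source_name, records) pairs into one date-ordered timeline per patient."""
    timelines = defaultdict(list)
    for source_name, records in sources:
        for rec in records:
            timelines[rec["patient_id"]].append((rec["date"], source_name, rec["event"]))
    # Tuples sort by date first, so each timeline comes out in chronological order.
    return {pid: sorted(events) for pid, events in timelines.items()}


timelines = build_timelines(("ehr", ehr_events), ("molecular", genomic_results))

# Patients whose timeline contains both clinical and molecular records.
linked = {
    pid
    for pid, events in timelines.items()
    if {source for _, source, _ in events} == {"ehr", "molecular"}
}

for when, source, event in timelines["P001"]:
    print(when.isoformat(), source, event)
```

In this toy data only one patient has both modalities, which illustrates why linkage rates matter: a variant with no clinical record, or a treatment history with no molecular result, cannot be contextualized.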

The value of unstructured EHR and molecular PDFs

Approximately 80% of clinical data resides in unstructured formats—physician notes, pathology reports, and scanned molecular PDFs.² These documents contain the ground truth of oncology: the nuances of toxicities, the rationale for treatment switches, and the subtleties of radiographic progression that structured fields often miss.

A 2018 ASCO study highlighted that natural language processing (NLP) and machine learning (ML) models are essential for extracting critical variables at scale. By transforming unstructured narratives into discrete, analyzable data points, researchers can capture complex endpoints like real-world progression-free survival (rwPFS) with a level of granularity that structured EHR data alone cannot provide. Furthermore, the ingestion of raw molecular files—rather than just summary reports—enables bioinformaticians to re-analyze data as new signatures emerge, ensuring that legacy datasets remain valuable as the field of computational biology advances.
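To make the rwPFS idea concrete, here is a hedged sketch of how the endpoint might be derived once progression events have already been extracted from notes and reports. The function name `rw_pfs`, its arguments, and the censoring rule are assumptions for illustration; actual rwPFS definitions vary by study, and the NLP extraction step itself is not shown.

```python
from datetime import date


def rw_pfs(start, progression_dates, death_date, last_followup):
    """Return (days, event_observed) for a simplified real-world PFS.

    Assumed rule: the event is the earliest progression or death on or after
    the treatment start date; patients with no event are censored at their
    last follow-up date.
    """
    candidates = [d for d in progression_dates if d >= start]
    if death_date is not None and death_date >= start:
        candidates.append(death_date)
    if candidates:
        return (min(candidates) - start).days, True
    return (last_followup - start).days, False


# Hypothetical patient with a progression documented in a radiology note.
progressed = rw_pfs(
    start=date(2023, 2, 1),
    progression_dates=[date(2023, 9, 1)],
    death_date=None,
    last_followup=date(2023, 12, 1),
)

# Hypothetical patient with no documented progression or death (censored).
censored = rw_pfs(
    start=date(2023, 3, 1),
    progression_dates=[],
    death_date=None,
    last_followup=date(2023, 6, 29),
)

print(progressed, censored)
```

Returning the event flag alongside the duration keeps censored patients in the analysis, which is what a downstream survival method such as Kaplan-Meier estimation expects.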

Bridging the longitudinal gap: HIE and claims data

One of the greatest challenges in oncology research is the leaky bucket of patient data. Patients often receive care across multiple health systems, leading to fragmented records. Integration of HIE data and administrative claims is vital for achieving a pan-longitudinal view.

Claims data serves as a critical source for understanding healthcare utilization and mortality outside the primary treating institution. FDA guidance on the use of real-world data (RWD) suggests that combining EHR and claims data can provide a more robust evidence base for regulatory decision-making.³ In drug development, this comprehensive visibility is instrumental in designing synthetic control arms (SCAs), which can reduce the time and cost of clinical trials by providing a high-fidelity comparator group drawn from real-world populations.
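One simple way to picture how claims can close gaps in an EHR record is a composite follow-up summary per patient. The sketch below is illustrative only: the dictionary keys, the `composite_followup` helper, and the tie-breaking rule (take the earliest reported death date, take the latest activity date) are assumptions, not a validated mortality algorithm.

```python
from datetime import date


def composite_followup(ehr, claims):
    """Combine one patient's EHR and claims summaries into a single follow-up view.

    Assumed rules: if either source reports a death date, use the earliest one;
    the last activity date is the latest encounter or claim seen in either source.
    """
    death_dates = [d for d in (ehr.get("death_date"), claims.get("death_date")) if d]
    activity_dates = [d for d in (ehr.get("last_encounter"), claims.get("last_claim")) if d]
    return {
        "death_date": min(death_dates) if death_dates else None,
        "last_activity": max(activity_dates, default=None),
    }


# Hypothetical patient who left the treating institution in May; claims show
# later utilization elsewhere and a death date the EHR never captured.
ehr_summary = {"death_date": None, "last_encounter": date(2023, 5, 1)}
claims_summary = {"death_date": date(2023, 11, 3), "last_claim": date(2023, 10, 20)}

followup = composite_followup(ehr_summary, claims_summary)
print(followup)
```

With the EHR alone, this patient would appear lost to follow-up in May; with claims added, both the extended observation window and the mortality event become visible.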

Accelerating drug discovery through multimodal AI

The convergence of these diverse data types is the prerequisite for the next generation of artificial intelligence in oncology. Multimodal AI models—those trained simultaneously on clinical, molecular, and imaging data—are proving superior to single-modality models in predicting therapy response.⁴

Research has shown that integrating digital pathology with genomic data can uncover novel spatial biomarkers and tumor microenvironment (TME) interactions that were previously invisible.⁵ By leveraging an AI-enabled platform that harmonizes these disparate streams, researchers can transition from descriptive analytics to predictive modeling, identifying the right patient for the right therapy with unprecedented accuracy.

A foundation for the future

The complexity of oncology demands a data strategy that is as multifaceted as the disease itself. By investing in multimodal datasets—incorporating everything from the deep biological insights of raw sequencing files to the broad longitudinal context of claims and HIE data—the life sciences industry can move beyond traditional data silos.

A seamlessly integrated, de-identified multimodal database does more than support existing research; it acts as an engine for discovery. It allows for the identification of rare mutations, the validation of novel biomarkers, and the streamlining of clinical trial enrollment. As we look toward the future of precision medicine, the ability to structure the unstructured and unify the patient journey will be the hallmark of the most successful therapeutic developments.

To learn more about how intelligent diagnostics and high-fidelity multimodal data are transforming the oncology landscape, explore the Lens Library.

1. Yunyongying P, Rich M, Jokela J. Patient-Centered Performance Metrics. JAMA. 2019;321(18):1829. doi:10.1001/jama.2019.2056
2. Kong HJ. Managing Unstructured Big Data in Healthcare System. Healthc Inform Res. 2019;25(1):1-2. doi:10.4258/hir.2019.25.1.1
3. Center for Drug Evaluation and Research. “Real-World Data: Assessing Electronic Health Records and Medical Claim.” U.S. Food and Drug Administration, FDA, www.fda.gov/regulatory-information/search-fda-guidance-documents/real-world-data-assessing-electronic-health-records-and-medical-claims-data-support-regulatory. Accessed 26 Mar. 2026.
4. Zhang R, Chen Y, Yue W, et al. Multimodal artificial intelligence in medicine: a task-oriented framework for clinical translation. Front Med (Lausanne). 2026;12:1736272. Published 2026 Jan 14. doi:10.3389/fmed.2025.1736272
5. Guedes J, Woldmar N, Szasz M, et al. A perspective on integrating digital pathology, proteomics, clinical data and AI analytics in cancer research. J Proteomics. 2025;320:105493. Published 2025 Jul 18. doi:10.1016/j.jprot.2025.105493