In the world of AI, foundation models are transforming industries. Most of today’s foundation models are trained on the entire internet, learning from a vast but often unstructured and noisy sea of information. But when it comes to healthcare, the stakes are higher, the data is often siloed and not easily accessible, and the need for accuracy is paramount.
Doing the hard work: Building data pipelines for healthcare
Tempus has spent years laying the groundwork for this moment. We’ve built robust, secure data pipelines with over 4,500 hospitals in the U.S., enabling us to ingest, structure, and harmonize multimodal data at an unprecedented scale. These connections serve as the foundation for ensuring our intelligent diagnostics and AI applications are grounded in all data modalities for a harmonized and full picture of patient care.
We were only able to do this because Tempus offered a unique value to providers. With our product suite and AI-enabled platform, we are able to marry molecular data with clinical data pulled directly from a provider’s EMR system, empowering physicians with a clinical test report that is contextualized to each patient. We have replicated this model beyond oncology, moving into disease areas and specialty areas such as cardiology, neurology, radiology, and pathology.
We can’t do this work alone. Patients and health systems we work with have recognized the value of using health data to build new technologies that can help the entire healthcare ecosystem. Where we have rights to use this data to build foundation models, we feel a responsibility to do so.
A real-world dataset like no other
Our efforts have resulted in one of the largest and most comprehensive, real-world, multimodal datasets. Today, Tempus has approximately 38 million research records, including longitudinal follow-up results, and over 7 billion clinical notes. That dataset includes more than a million cancer patients with rich molecular profiling, around 3 million genomic sequences from patients undergoing hereditary cancer testing, and over 7 million digitized pathology slides.
This work becomes critical in areas like neuropsychiatry, in which treatment decisions are still mostly based on trial and error. Tempus’ pharmacogenomic testing and Tempus Pro platform have allowed us to create a rich dataset that has the potential to truly advance therapeutic research and development in this area. That dataset includes:
- 100K patients with depression, anxiety, ADHD, or bipolar disorder
- ~75K with genotyping data
- ~25K with whole exome data
- ~25K with clinical progress notes
- ~10K records with data accumulated by our patient-reported outcome app (Tempus Pro)
Real-time, multimodal, and always growing
What sets Tempus apart is not just the scale, but the real-time nature of our data integrations with hospital systems. This ongoing stream of contextual, up-to-date healthcare data, once de-identified, is the foundation for building the most robust and relevant models in the industry.
Since starting its cardiology work in 2019, Tempus has accumulated:
- ~11 million patients
- ~1 billion clinical notes that include all relevant clinical data
- ~5.9 million echocardiography reports (30,000 with full imaging data)
- ~5.5 million ECGs
Similarly, in radiology, we have collected and curated approximately 1.9 billion radiology images across cardiology, pulmonology, and oncology, spanning approximately 1.6 million unique cases.
And through Ambry Genetics, which we acquired earlier this year, we have expanded our oncology dataset to include approximately 3 million genomic sequences from patients undergoing hereditary cancer testing.
Early returns: Healthcare-specific models outperforming general AI
The promise of healthcare-specific foundation models is already being realized. Recent research from NYU shows that models trained exclusively on healthcare data are outperforming general-purpose models like ChatGPT on clinical tasks.
Paige, another Tempus subsidiary, has open-sourced early versions of its Virchow and PRISM models to demonstrate how the right datasets can be a huge differentiator in building world-class foundation models [Nature].
Our hope is that we can build models that can perform tasks across specialties and disease areas, like cardiology and immunology, while also uncovering novel diagnostic insights only possible from deep machine learning.
With Tempus data, we are beginning this journey by training novel foundation models on top of all of our data. In the near future, our model and others like it will be able to serve as expert medical co-pilots and surface novel diagnostic insights.
A proprietary advantage—and just the beginning
We understand that our proprietary dataset is a big responsibility. We are committed to building better models over time, leveraging our data to improve patient outcomes, accelerate research, and empower physicians. As we continue to expand our data assets and refine our models, we believe Tempus will set the standard for what’s possible in healthcare AI, including:
- Predicting outcomes
- Designing better clinical trials
- Pivoting patient populations for investigational drugs
- Assessing risk to aid in closer patient monitoring
- Integrating patients’ co-morbidities associated with their specific disease so that they can be treated holistically (e.g., depression and cardiac issues linked to cancer)
- Deploying an AI-driven co-pilot across providers’ EHR systems to query patient data, streamline workflows, and create efficiencies for administrative tasks
Conclusion
Building a foundation model demands not only significant capital and talent but, more critically, access to high-quality datasets. At Tempus, we’re embedding AI into all aspects of our platform, allowing us to create high-quality, multimodal datasets for future AI development. There are no shortcuts to solving this data challenge. It requires vision, persistence, and the hard work of building trust and infrastructure with stakeholders across our industry. Tempus has met that challenge head-on, and we are proud to lead the way in transforming healthcare through real-time, data-driven AI.