Background: Cytogenetics drives precision oncology by uncovering genetic abnormalities that contribute to cancers. Karyotypes, essential for cytogenetic analysis, are documented in the International System for Human Cytogenetic Nomenclature (ISCN) and are prevalent in various free-text clinical documents, posing challenges for manual abstraction and computational processing. Furthermore, the optical character recognition (OCR) technology used to digitize these documents often introduces errors that compromise the secondary use of health data. This is especially problematic for ISCN notation, where a single character change can alter meaning. In response, we present a novel natural language processing (NLP) approach to extract and structure karyotype data from clinical notes using automated OCR error correction.
Methods: We developed a cancer-type-agnostic NLP pipeline by training two semi-supervised models on randomly sampled clinical notes (>85% from oncology patients, including breast, lung, and hematopoietic cancers) in the Tempus Database (Tempus AI, Inc., Chicago, IL): a named entity recognition (NER) model to identify karyotype strings in ISCN notation, and T5, a transformer-based model, for OCR error correction in the identified karyotypes. To reduce the need for manual curation, we fine-tuned T5 for OCR error correction in two tiers: first on a public karyotype database with synthetic OCR errors, then on real OCR errors from clinical notes. The pipeline then standardized and structured each karyotype string into an in-house common data model. NLP model performance was evaluated against Gemma-2-27b, a state-of-the-art large language model, on curated labels from a mixture of clinical records.
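The first fine-tuning tier relies on karyotype strings with synthetic OCR errors. A minimal sketch of how such noisy/clean training pairs could be generated is shown below. The confusion table, the error rate, and the function names `inject_ocr_errors` and `make_training_pairs` are illustrative assumptions, not the procedure used in this work.

```python
import random

# Hypothetical confusion table (an assumption for illustration): characters
# that OCR engines commonly mistake for visually similar ones.
CONFUSIONS = {
    "1": ["l", "I"],
    "0": ["O"],
    "5": ["S"],
    "q": ["g"],
    ";": [":"],
    ",": ["."],
}


def inject_ocr_errors(karyotype: str, rate: float, rng: random.Random) -> str:
    """Replace each confusable character with a look-alike with probability `rate`."""
    out = []
    for ch in karyotype:
        if ch in CONFUSIONS and rng.random() < rate:
            out.append(rng.choice(CONFUSIONS[ch]))
        else:
            out.append(ch)
    return "".join(out)


def make_training_pairs(clean_karyotypes, rate=0.1, seed=0):
    """Build (noisy_input, clean_target) pairs for a sequence-to-sequence corrector."""
    rng = random.Random(seed)
    return [(inject_ocr_errors(k, rate, rng), k) for k in clean_karyotypes]


if __name__ == "__main__":
    for noisy, clean in make_training_pairs(["46,XX,t(9;22)(q34;q11.2)"], rate=0.3):
        print(noisy, "->", clean)
```

Substitution-only noise keeps the string length unchanged. Real OCR output also contains insertions and deletions, which is one reason a second tier of fine-tuning on real errors is useful.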
Results: The karyotype extraction model, trained as an NER task, achieved precision, recall, and F1 scores of 0.86, 0.92, and 0.90, respectively, on a test set of 2,800 curated ISCN karyotype strings. Our fine-tuned T5 model significantly outperformed Gemma-2, correcting 95% of synthetic OCR errors in a test set of 44,756 karyotype strings and 84% of real OCR errors from clinical notes in a test set of 2,790 karyotype strings, compared to Gemma-2’s 41% and 43%, respectively. Error analysis showed Gemma-2’s tendency to edit uncommon but correct karyotypes to common ones and to inaccurately extend short karyotypes.
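For readers reproducing this kind of evaluation, the metrics above can be computed as in the following sketch. It assumes that extraction is scored from true-positive, false-positive, and false-negative counts, and that a correction counts as successful only when the corrected string exactly matches the curated karyotype. The abstract does not specify the matching criteria, so both choices, and the function names, are assumptions.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from entity-level counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def correction_rate(predictions, references):
    """Fraction of corrected strings that exactly match the curated reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    # Toy counts and strings for illustration only; not data from this study.
    print(precision_recall_f1(tp=8, fp=2, fn=2))
    print(correction_rate(["46,XY", "47,XX,+21"], ["46,XY", "47,XX,+8"]))
```

Exact-match scoring is a strict choice that suits ISCN notation, since a single wrong character can change the meaning of a karyotype.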
Conclusions: To our knowledge, this is the first NLP-driven method for extracting and structuring karyotypes in clinical notes using fully automated OCR error correction, irrespective of cancer type or document type. The model outperformed a state-of-the-art LLM in OCR error correction, accelerating the abstraction of cytogenetic information from clinical notes at scale. This advancement provides actionable cytogenetic information to oncology healthcare teams, enhancing the delivery of patient care.