NLP for Oncology Trial Eligibility: Extracting Evidence from Clinical Notes

In oncology and rare disease clinical trials, a meaningful portion of eligibility evidence doesn’t exist in structured EHR fields. Disease severity classifications, pathologist assessments of tumor characteristics, prior treatment response narratives, and physician judgment about symptom severity are documented in clinical notes—discharge summaries, pathology reports, progress notes, consultation records. A patient who appears eligible based on ICD-10 codes and lab values may have a disqualifying finding buried in a 14-page oncology progress note.

This is the problem that NLP-based eligibility screening addresses: extracting clinical evidence from unstructured text and applying it to trial eligibility decisions at scale, with enough reliability that coordinators can trust the output.

What Structured Data Misses

The gap between structured EHR data and clinical reality is widest in oncology. Consider a Phase II trial with an inclusion criterion requiring “measurable disease per RECIST 1.1 criteria.” Whether a patient meets this criterion depends on the radiologist’s interpretation of imaging studies—documented in radiology reports, not in a discrete data field. An exclusion criterion like “prior allogeneic stem cell transplant” may appear as a coded procedure for some patients but be documented only in a transplant summary note for others.

Rare disease trials compound this further. Diagnostic criteria for conditions like amyloidosis, hereditary transthyretin amyloidosis, or rare hematologic malignancies often involve clinical judgment documented in specialist consultation notes, not standard structured coding. A screening tool that only queries structured data will systematically fail to identify eligible patients whose diagnoses are documented at the note level.

How the Extraction Works

NLP eligibility extraction uses transformer-based models fine-tuned on clinical text. The models are trained to recognize clinical concepts, their context, negation, and temporal relationships in the prose patterns characteristic of clinical documentation. “Patient denies prior treatment with checkpoint inhibitors” and “patient received nivolumab for 6 cycles ending in November 2023” require different inference paths despite both mentioning checkpoint inhibitors.
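
To make that distinction concrete, below is a minimal rule-based sketch in the spirit of NegEx-style context detection. The cue lists and the ConceptMention structure are illustrative assumptions; a production system would learn these distinctions through the fine-tuned transformer rather than from regular expressions.

```python
import re
from dataclasses import dataclass

# Illustrative cue lists only -- far from exhaustive. They show the kind of
# context distinction (negated vs. historical mention) the model must learn.
NEGATION_CUES = re.compile(r"\b(denies|no history of|without|negative for)\b", re.I)
HISTORICAL_CUES = re.compile(r"\b(received|completed|treated with|status post)\b", re.I)

@dataclass
class ConceptMention:
    concept: str
    sentence: str
    negated: bool      # the concept is explicitly denied
    historical: bool   # the concept refers to past treatment

def classify_mention(concept: str, sentence: str) -> ConceptMention:
    """Label a concept mention with its negation and temporal context."""
    return ConceptMention(
        concept=concept,
        sentence=sentence,
        negated=bool(NEGATION_CUES.search(sentence)),
        historical=bool(HISTORICAL_CUES.search(sentence)),
    )

for s in [
    "Patient denies prior treatment with checkpoint inhibitors.",
    "Patient received nivolumab for 6 cycles ending in November 2023.",
]:
    print(classify_mention("checkpoint inhibitor exposure", s))
```

Run on the two example sentences, the first mention comes back negated and the second historical: the same surface concept, two opposite eligibility implications.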

For each patient who clears the structured-data first pass, the NLP model processes the relevant note types (pathology reports, progress notes, discharge summaries, consultation notes) and scores the likelihood of meeting each note-dependent criterion, inclusion and exclusion alike. The output includes the specific passage that drove the score—the evidence trail that coordinators can verify in under five minutes rather than reading through the full note set.
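
A sketch of what that scoring pass might look like, under assumed interfaces: Criterion, Passage, and EligibilityModel.score are hypothetical names, not any specific product's API. The design point is retaining, for each criterion, the single passage that drove the score.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol

@dataclass
class Criterion:
    id: str      # e.g. "EX-03"
    text: str    # e.g. "prior allogeneic stem cell transplant"

@dataclass
class Passage:
    text: str
    doc_id: str  # source note identifier
    page: int

@dataclass
class CriterionEvidence:
    criterion_id: str
    score: float       # model-estimated probability the criterion is met
    passage: Passage   # the span that drove the score: the evidence trail

class EligibilityModel(Protocol):
    # Hypothetical interface: how strongly a passage supports a criterion.
    def score(self, passage: str, criterion: str) -> float: ...

def score_note_dependent_criteria(
    passages: Iterable[Passage],
    criteria: list[Criterion],
    model: EligibilityModel,
) -> list[CriterionEvidence]:
    """For each note-dependent criterion, keep the highest-scoring passage
    across all relevant notes so coordinators see exactly what to verify."""
    best: dict[str, CriterionEvidence] = {}
    for passage in passages:
        for criterion in criteria:
            s = model.score(passage.text, criterion.text)
            prior = best.get(criterion.id)
            if prior is None or s > prior.score:
                best[criterion.id] = CriterionEvidence(criterion.id, s, passage)
    return list(best.values())
```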

What Coordinators Get

The goal of NLP-assisted screening is not to replace coordinator review of clinical notes. It’s to direct that review to the right candidates and the right passages. A coordinator reviewing a candidate flagged by the NLP model sees the overall eligibility score, which specific criteria are note-dependent, and the extracted passages that support or flag each criterion—with the source document and page reference.
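
Continuing the sketch above (it reuses CriterionEvidence and its Passage), a review-card structure and a plain-text rendering of it might look like this; ReviewCard and render are illustrative names.

```python
from dataclasses import dataclass

@dataclass
class ReviewCard:
    # Reuses CriterionEvidence (and its Passage) from the scoring sketch above.
    patient_id: str
    overall_score: float                       # aggregate eligibility score
    note_dependent: list["CriterionEvidence"]  # passages to verify, with sources

def render(card: ReviewCard) -> str:
    """Plain-text rendering of what a coordinator might see."""
    lines = [f"Patient {card.patient_id}  eligibility score {card.overall_score:.2f}"]
    for ev in card.note_dependent:
        lines.append(
            f'  [{ev.criterion_id}] score {ev.score:.2f}  '
            f'{ev.passage.doc_id} p.{ev.passage.page}: "{ev.passage.text}"'
        )
    return "\n".join(lines)
```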

This changes the coordinator’s task from “read through 30 pages of notes to determine if this patient qualifies” to “verify these three flagged passages and confirm my agreement with the AI’s interpretation.” In practice, that review takes 4–7 minutes per candidate compared to 20–40 minutes for unassisted note review.

Model Limitations Worth Acknowledging

NLP models for clinical notes are accurate, but not infallible. They perform best on the note types they were trained on and at the documentation quality level of the sites in their training data. Sites with inconsistent note templates, heavy use of copy-forward documentation, or non-standard abbreviations will produce more edge cases. Coordinator override capability—the ability to flag a model determination as incorrect—is essential for catching systematic errors specific to a site’s documentation practices. Those override signals should also feed back into model improvement for subsequent screening cycles.
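
One possible shape for that override signal is sketched below; every field name is an assumption about what such a record might contain, not a reference to an existing schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class CoordinatorOverride:
    # Hypothetical record of a coordinator disagreeing with the model.
    patient_id: str
    criterion_id: str
    model_score: float
    coordinator_decision: str   # "met" | "not_met" | "indeterminate"
    note_excerpt: str           # the passage the model cited
    reason: str                 # free-text rationale, useful for error analysis
    recorded_at: datetime

def queue_for_retraining(override: CoordinatorOverride,
                         queue: list["CoordinatorOverride"]) -> None:
    """Append the override to a review queue; batches of these become labeled
    examples for the next fine-tuning cycle, surfacing systematic errors tied
    to a site's documentation practices."""
    queue.append(override)

# Usage with illustrative values:
queue: list[CoordinatorOverride] = []
queue_for_retraining(
    CoordinatorOverride(
        patient_id="PT-0042",
        criterion_id="EX-03",
        model_score=0.81,
        coordinator_decision="not_met",
        note_excerpt="s/p auto SCT 2019; no allogeneic transplant",
        reason="Model conflated autologous with allogeneic transplant.",
        recorded_at=datetime.now(timezone.utc),
    ),
    queue,
)
```

Aggregating these records per site and per criterion is what turns individual disagreements into the systematic-error signal the retraining cycle needs.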