Finding 1-in-10,000: EHR Matching for Ultra-Rare Disease Trials

Cohortbridge Editorial · April 15, 2025 · 8 min read

Rare disease enrollment is a denominator problem. When a condition affects 1 in 10,000 people in the US population, a single academic medical center with 400,000 active patients in its EHR system has approximately 40 patients with that condition, before any trial-specific eligibility criteria are applied. A Phase II program targeting adults aged 18–65 with no prior gene therapy exposure, current expression of a specific biomarker, and no significant comorbidities may reduce that 40 to 4 or 5 trial-eligible patients per site. Across a network of 15 sites, that's 60–75 potentially eligible patients in total. If the enrollment target is 40 and the screen-to-randomize conversion rate is 50–60%, those 60–75 patients yield roughly 30–45 randomized participants, which is extremely tight. And that math doesn't account for patients who are identifiable in the EHR but haven't been seen recently enough to have current biomarker data.
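
The funnel arithmetic above can be sketched in a few lines of Python. The function and variable names are illustrative assumptions, and the inputs are simply the figures quoted in the paragraph, with every site assumed to behave identically.

```python
def expected_randomized(eligible_per_site: int, n_sites: int, conversion: float) -> float:
    """Expected randomized patients for a network of identical sites."""
    return eligible_per_site * n_sites * conversion


# Figures quoted in the text above; the names are illustrative.
site_patients = 400_000
prevalence_one_in = 10_000
with_condition = site_patients / prevalence_one_in  # patients with the condition at one site

# Low end: 4 eligible per site at 50% conversion. High end: 5 per site at 60%.
pessimistic = expected_randomized(eligible_per_site=4, n_sites=15, conversion=0.50)
optimistic = expected_randomized(eligible_per_site=5, n_sites=15, conversion=0.60)

enrollment_target = 40
print(f"Patients with the condition per site: {with_condition:.0f}")
print(f"Expected randomized: {pessimistic:.0f} to {optimistic:.0f} (target {enrollment_target})")
```

Only the optimistic end of the range clears the target, which is why a small error in the per-site eligible count matters so much at this scale.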

The challenge isn't site execution. The challenge is that the denominator is genuinely small, and conventional site selection processes — based on investigator reputation and historical trial experience in adjacent indications — don't have the resolution to find those 4–5 eligible patients per site before the trial activates.

Why Standard Feasibility Methods Fail Rare Disease Programs

The feasibility questionnaire model was designed for trials where eligible patients are numerous enough that estimation works. If a site has 500 type 2 diabetes (T2DM) patients in its general medicine practice, a coordinator's estimate of "80–100 likely eligible for a cardiometabolic Phase II" is crude but workable, because the site has more than enough eligible patients to absorb estimation error. When the same methodology is applied to a rare neurological condition with 40 patients in the EHR, an estimation error of 30–40% can mean the difference between 3 eligible patients and zero.

High-volume rare disease centers — major academic referral centers, disease-specific specialty clinics — are the standard answer to the denominator problem. These sites aggregate the patient population that is too sparse for community or regional centers to find. But this approach has its own limitations: the same handful of major academic centers appear on every rare disease trial, their research coordinators are stretched across multiple concurrent protocols, and their patient populations increasingly overlap across trials in the same indication. For ultra-rare conditions where even the major centers have 8–12 eligible patients each, the standard rare disease site network approach produces enrollment timelines measured in years rather than months.

How EHR Matching Changes the Denominator Search

EHR matching for rare disease programs works on a different scale assumption than it does for common indication trials. The goal isn't to find 500 eligible patients across 10 sites — it's to find every patient in a network of 2–3 million records who meets a narrow set of criteria. At that population scale, even a condition affecting 1 in 10,000 people produces 200–300 potentially eligible individuals in the raw data, before exclusion criteria are applied.

The matching approach for rare diseases requires finding patients coded under the correct diagnostic terminology across multiple coding systems. ICD-10-CM codes for rare conditions are often non-specific or assigned inconsistently — clinicians may use a broader category code when a specific orphan condition code exists, or may document the diagnosis primarily in clinical notes rather than as a coded problem entry. A matching query that relies solely on the primary ICD-10-CM diagnosis code will miss a meaningful fraction of the actual patient population.

More effective rare disease identification typically combines: primary and secondary ICD-10-CM codes associated with the condition and its common differential diagnoses; SNOMED CT concepts that map to the condition and related clinical findings; LOINC-coded lab values or biomarker results that are characteristic of the condition even when the diagnosis code is absent or imprecise; and medication records where specific orphan drug treatments (typically with identifiable RxNorm codes) serve as a proxy indicator for the condition.

A query combining these signals — even without exact diagnostic code precision — substantially improves sensitivity compared to a single-code approach. The cost is reduced precision: the candidate list will include patients who have a characteristic lab abnormality but not the condition, or who are on a medication that's used for multiple conditions. Manual chart review on this candidate list is still required — but the candidate list is drawn from a full EHR network query, not from the subset of patients whose condition was coded precisely enough to surface through a single-code search.
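
A minimal sketch of how such a multi-signal query might be structured is shown below, using in-memory records rather than a real EHR interface. Everything here is an assumption for illustration: the `PatientRecord` fields, the function names, and especially the value sets, which are placeholders and not validated clinical code lists.

```python
from dataclasses import dataclass, field

# Placeholder value sets: illustrative only, NOT validated clinical code lists.
ICD10_PREFIXES = ("E75",)                 # hypothetical ICD-10-CM family of interest
SNOMED_CONCEPTS = {"SNOMED-PLACEHOLDER"}  # stands in for condition concepts
LOINC_ASSAYS = {"LOINC-PLACEHOLDER"}      # stands in for characteristic assays
RXNORM_AGENTS = {"RXNORM-PLACEHOLDER"}    # stands in for orphan drug treatments


@dataclass
class PatientRecord:
    patient_id: str
    icd10: set = field(default_factory=set)           # primary and secondary codes
    snomed: set = field(default_factory=set)
    abnormal_loinc: set = field(default_factory=set)  # assays with an abnormal result
    rxnorm: set = field(default_factory=set)


def matched_signals(rec: PatientRecord) -> list:
    """Return the names of the signal types this record matches."""
    signals = []
    if any(code.startswith(ICD10_PREFIXES) for code in rec.icd10):
        signals.append("icd10")
    if rec.snomed & SNOMED_CONCEPTS:
        signals.append("snomed")
    if rec.abnormal_loinc & LOINC_ASSAYS:
        signals.append("lab")
    if rec.rxnorm & RXNORM_AGENTS:
        signals.append("medication")
    return signals


def candidate_list(records, min_signals: int = 1) -> list:
    """Candidates for chart review, strongest evidence (most signals) first."""
    candidates = [(r.patient_id, matched_signals(r)) for r in records]
    candidates = [c for c in candidates if len(c[1]) >= min_signals]
    candidates.sort(key=lambda c: (-len(c[1]), c[0]))
    return candidates


records = [
    PatientRecord("p1", icd10={"E75.22"}, rxnorm={"RXNORM-PLACEHOLDER"}),
    PatientRecord("p2", abnormal_loinc={"LOINC-PLACEHOLDER"}),  # no diagnosis code
    PatientRecord("p3", icd10={"I10"}),                         # unrelated diagnosis
]
for patient_id, signals in candidate_list(records):
    print(patient_id, signals)
```

Raising `min_signals` trades sensitivity for precision: requiring two signals drops the lab-only patient, who may be exactly the uncoded case the broader query was meant to catch.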

A Phase II Scenario: Identifying Patients for a Lysosomal Storage Disorder Trial

Consider a Phase II program for a lysosomal storage disorder with an estimated US prevalence of approximately 1 in 40,000. The sponsor's target enrollment is 28 patients across a 12-site global network. A regional CRO running feasibility for the North American sites — 8 sites, targeting 18 patients — began with a conventional site selection approach based on investigator network and prior lysosomal storage disease trial experience.

Initial feasibility questionnaires returned estimates suggesting the 8 sites collectively had access to 35–45 potentially eligible patients. Six weeks after site activation, three sites had zero eligible patients in their active caseload — the estimated patients had either been enrolled in a competing study, were no longer in the site's active care, or on closer chart review did not meet the inclusion criteria for prior enzyme replacement therapy duration.

A structured EHR query approach, applied retrospectively to the same 8 sites, identified 62 candidate patients across the network using a multi-code approach combining ICD-10-CM E75 family codes, LOINC enzyme activity assay results, and RxNorm codes for enzyme replacement agents. After chart review of the highest-confidence candidates, 21 patients were identified as potentially eligible — distributed across 7 of the 8 sites, but with 3 sites accounting for 14 of the 21. The query took approximately 4 days from EHR access to preliminary candidate list. The conventional feasibility process had taken 6 weeks and produced less actionable information.
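
The scenario's figures can be turned into a few ratios that are useful at feasibility. This is a back-of-envelope sketch using only the numbers quoted above; the variable names are illustrative, and because only the highest-confidence candidates were chart-reviewed, the yield figure describes the candidate list as reviewed so far rather than the query's true precision.

```python
# Figures from the scenario above; names are illustrative.
candidates = 62            # patients surfaced by the multi-code query
confirmed_eligible = 21    # potentially eligible after chart review
top_three_sites = 14       # eligible patients held by the 3 largest sites
north_america_target = 18  # enrollment target for the 8 North American sites

review_yield = confirmed_eligible / candidates
concentration = top_three_sites / confirmed_eligible
required_conversion = north_america_target / confirmed_eligible

print(f"Share of candidate list confirmed so far: {review_yield:.0%}")
print(f"Share of eligible patients at the top 3 sites: {concentration:.0%}")
print(f"Conversion needed to reach target from this pool alone: {required_conversion:.0%}")
```

The last ratio is the one to watch: if these 21 patients were the only pool, nearly all of them would have to enroll, so the query result also signals how much additional sourcing the program needs.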

This kind of scenario is not universal — it requires EHR access agreements at the sites, structured documentation of the relevant criteria, and sufficient EHR data quality for the multi-code approach to work. But it illustrates what changes when the denominator search is conducted systematically rather than by estimation.

The Data Quality Challenge Specific to Rare Diseases

Rare disease EHR matching faces a data quality problem that common indication matching doesn't encounter at the same scale: diagnostic coding inconsistency. For a condition like type 2 diabetes, ICD-10-CM category E11 is used with high consistency across health systems, specialties, and documentation styles. For an ultra-rare condition, the appropriate code may be unused at some sites (where the condition is documented only in free text), replaced by a broader category code at others (where a more general code is applied as a proxy), or absent from the problem list entirely because the definitive diagnosis was made at a different institution and never formally entered at the current site.

This means rare disease EHR matching strategies need to incorporate NLP on clinical notes as a supplementary layer for patients whose condition is documented in free text but not in structured fields. NLP introduces its own trade-offs — precision depends on note quality, documentation patterns, and the NLP model's training data — but for conditions where structured coding is unreliable, it is often the difference between finding 40% of the eligible population and finding 75%.
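
As a rough illustration of the supplementary layer, the sketch below flags patients whose notes mention a condition without an obvious negation and who were not already found by the structured query. It is a deliberately simple rule-based stand-in, not a trained NLP model; the term list, negation cues, and function names are all assumptions made for the example.

```python
import re

# Illustrative term and negation lists; a real pipeline would need
# condition-specific tuning and validation against chart review.
CONDITION_TERMS = [r"gaucher(?:'s)? disease", r"fabry disease"]
NEGATION_CUES = [r"no evidence of", r"ruled out", r"negative for", r"unlikely"]

TERM_RE = re.compile("|".join(CONDITION_TERMS), re.IGNORECASE)
NEGATION_RE = re.compile("|".join(NEGATION_CUES), re.IGNORECASE)
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")


def note_mentions_condition(note: str) -> bool:
    """True if any sentence names the condition without a negation cue."""
    for sentence in SENTENCE_SPLIT_RE.split(note):
        if TERM_RE.search(sentence) and not NEGATION_RE.search(sentence):
            return True
    return False


def free_text_only_candidates(notes_by_patient: dict, structured_hits: set) -> list:
    """Patients surfaced by notes alone, i.e. missed by the structured query."""
    return sorted(
        pid
        for pid, notes in notes_by_patient.items()
        if pid not in structured_hits
        and any(note_mentions_condition(n) for n in notes)
    )


notes_by_patient = {
    "p1": ["Known Gaucher disease, diagnosed at an outside hospital."],
    "p2": ["Workup negative for Fabry disease. Follow up in 6 months."],
    "p3": ["Fabry disease confirmed by enzyme assay."],
}
print(free_text_only_candidates(notes_by_patient, structured_hits={"p3"}))
```

Sentence-level negation handling like this misses family history, hedged language, and copied-forward text, which is one reason precision depends so heavily on note quality and documentation patterns.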

We're not saying that NLP-supplemented rare disease matching is a solved problem. Precision and recall vary by condition, by health system documentation culture, and by note type. A well-constructed NLP pipeline for a specific condition can substantially improve candidate identification; a generic NLP approach applied without condition-specific tuning will produce more noise. The point is that CROs running rare disease feasibility programs need to understand which identification approach their matching platform uses for a given condition, and what its expected sensitivity and specificity characteristics are for that specific use case.
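
When asking a platform for its sensitivity and specificity on a given condition, it helps to be clear about how those figures are computed. The sketch below derives them from chart-review counts; the counts are made up for illustration, and the false-negative count in particular assumes a reference standard that identifies patients the matching approach missed.

```python
def matching_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard accuracy measures from chart-review confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true patients the match found
        "specificity": tn / (tn + fp),  # share of non-patients correctly excluded
        "ppv": tp / (tp + fp),          # share of flagged candidates who are true patients
    }


# Made-up counts for illustration only.
metrics = matching_metrics(tp=30, fp=20, fn=10, tn=940)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

For rare conditions, specificity can look excellent while the positive predictive value stays modest, because true patients are so heavily outnumbered; the PPV is what determines the chart-review workload.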

Multi-Site Network Strategy for Ultra-Rare Programs

For conditions where even large single-site EHR networks have fewer than 10 eligible patients, the effective strategy is to query across a federated network of health systems simultaneously — treating the cumulative population as the denominator rather than any individual site's panel. This requires EHR access agreements at multiple sites with compatible API architectures, and a matching platform capable of aggregating cohort output across sites while maintaining protocol-level data isolation (results from Site A are visible to Site A's feasibility team but not to Site B, unless the CRO is operating as the data integrator under the terms of their agreements).
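
The aggregation and isolation rule described above can be sketched as follows. This is a simplified model under stated assumptions: the `SiteCohort` type and function names are hypothetical, identifiers are de-identified and site-local, and no patient appears at more than one site (otherwise summing would double count).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteCohort:
    site_id: str
    candidate_ids: tuple  # site-local, de-identified candidate identifiers


def network_denominator(cohorts) -> int:
    """Cumulative candidate count, assuming no patient overlap across sites."""
    return sum(len(c.candidate_ids) for c in cohorts)


def visible_cohorts(cohorts, requester_site: str, is_data_integrator: bool = False) -> list:
    """A site sees only its own cohort; a data integrator sees all of them."""
    if is_data_integrator:
        return list(cohorts)
    return [c for c in cohorts if c.site_id == requester_site]


cohorts = [
    SiteCohort("site_a", ("a-001", "a-002", "a-003")),
    SiteCohort("site_b", ("b-001",)),
    SiteCohort("site_c", ()),
]
print(network_denominator(cohorts))
print([c.site_id for c in visible_cohorts(cohorts, "site_a")])
print([c.site_id for c in visible_cohorts(cohorts, "cro", is_data_integrator=True)])
```

Keeping the visibility rule in one function makes it easy to audit against the access agreements, which is where the real constraints live.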

For sponsors pursuing orphan drug designations under the Orphan Drug Act, the enrollment timeline pressure is acute: rare disease trials frequently run well beyond planned timelines. The ability to identify the full eligible population across a broad network before committing to site selection and an enrollment timeline gives sponsors more accurate data for IND planning and site budget negotiations.

The "how it works" overview covers the technical architecture for multi-site federated queries. For CROs early in the site selection process for a rare disease program, the relevant question to answer at feasibility is not "do these sites have experience with this indication" but "how many patients meeting these criteria exist in these sites' EHR populations." EHR matching answers the second question. The first question matters, but it shouldn't substitute for the second. See also the site selection methodology article for how cohort-level patient data informs the broader site selection decision.
