Site selection for Phase II and III trials in most CROs relies heavily on a combination of three inputs: historical enrollment performance data, the principal investigator's indication-specific publication record, and the site coordinator's questionnaire responses. Each of these inputs measures something real. None of them measures the thing that matters most for the current protocol: how many patients meeting these specific eligibility criteria exist in this site's patient population right now.
Historical performance tells you how well a site executed previous trials. It doesn't tell you whether those trials had similar enough eligibility criteria to make the comparison valid. An oncology site that enrolled 40 patients in a metastatic NSCLC trial three years ago may have an EHR population with very few patients matching the current program's EGFR exon 19 deletion requirement, HbA1c exclusion, and prior anti-VEGF exposure washout period. The site's historical performance is real; its applicability to the current protocol is uncertain.
The Historical Data Problem in Site Selection
Historical enrollment performance as a site selection signal has a well-known confounding issue: it reflects the trials the site ran, not the patients it has. A high-volume oncology site with strong historical enrollment may have achieved those numbers primarily through high patient volume in a specific therapeutic area, or through an especially engaged investigator team that isn't present anymore. A newer site with a shorter enrollment history may have a patient population that is better matched to the current protocol's specific eligibility criteria.
The standard CRO response to this limitation is to supplement historical data with site questionnaire responses. Sites report their current patient volume in the relevant indication, estimate how many patients might meet the general profile, and indicate their resource availability. This is useful information. But questionnaire-based estimation has a systematic upward bias: sites want to be selected, investigators tend toward optimism about their patient population, and the estimate rests on a general impression of the indication rather than on criteria-specific analysis.
A global rare neurological disease program run by a mid-size CRO in mid-2024 selected 14 sites based primarily on investigator network relationships and historical rare disease experience. At the 8-week enrollment review, five of the 14 sites had not randomized a single patient. Post-hoc review found that three of the five had EHR populations with appropriate diagnosis codes, but more than 80% of those patients were disqualified by exclusion criteria, specifically the prior immunotherapy exposure and recent CSF sampling criteria, which the sites' general neurology populations frequently failed. The information needed to identify this before site activation existed in the EHR. It wasn't accessed prospectively.
What Prospective Cohort Analysis Produces Instead
Prospective cohort analysis, for the purposes of site selection, means querying a site's EHR population against the current protocol's specific inclusion and exclusion criteria before site activation, rather than estimating eligible patient volume from historical patterns or questionnaire responses.
The output is a site-level eligible patient count based on structured EHR data (coded diagnoses, lab values, medication history, age ranges) matched against the structured representation of the current protocol's criteria. This count is not a guarantee of enrollment; it is a denominator. A site with 45 potentially eligible patients in its structured EHR data has a meaningfully different enrollment potential than a site with 12, regardless of either site's historical enrollment performance.
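As a rough illustration of what "matching structured EHR data against structured criteria" can look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Patient` and `Criteria` record shapes, the field names, the diagnosis code, the medication, and the lab threshold are hypothetical and are not taken from any real protocol or matching engine. It treats a missing lab value as "not confirmable" and leaves that patient out of the count, which is one reason a structured count tends to understate the true eligible population.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Patient:
    # Hypothetical, simplified view of one patient's structured EHR data.
    patient_id: str
    birth_date: date
    diagnosis_codes: set[str]
    medications: set[str] = field(default_factory=set)
    labs: dict[str, float] = field(default_factory=dict)  # latest value per lab


@dataclass
class Criteria:
    # Hypothetical structured representation of protocol criteria.
    min_age: int
    max_age: int
    required_diagnoses: set[str]        # patient needs at least one of these
    excluded_medications: set[str]      # any match disqualifies
    lab_upper_limits: dict[str, float]  # lab name -> maximum allowed value


def age_on(birth_date: date, as_of: date) -> int:
    """Age in whole years on the given date."""
    years = as_of.year - birth_date.year
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def is_eligible(patient: Patient, criteria: Criteria, as_of: date) -> bool:
    if not criteria.min_age <= age_on(patient.birth_date, as_of) <= criteria.max_age:
        return False
    if not (patient.diagnosis_codes & criteria.required_diagnoses):
        return False
    if patient.medications & criteria.excluded_medications:
        return False
    for lab, limit in criteria.lab_upper_limits.items():
        value = patient.labs.get(lab)
        # A missing lab cannot be confirmed from structured data, so the
        # patient is not counted (assumption of this sketch).
        if value is None or value > limit:
            return False
    return True


def eligible_count(patients: list[Patient], criteria: Criteria, as_of: date) -> int:
    """Site-level denominator: patients passing every structured criterion."""
    return sum(is_eligible(p, criteria, as_of) for p in patients)


# Illustrative data only.
as_of = date(2025, 1, 15)
criteria = Criteria(
    min_age=18,
    max_age=75,
    required_diagnoses={"C34.90"},
    excluded_medications={"bevacizumab"},
    lab_upper_limits={"hba1c": 8.0},
)
site_patients = [
    Patient("p1", date(1960, 5, 1), {"C34.90"}, labs={"hba1c": 7.1}),
    Patient("p2", date(1955, 3, 9), {"C34.90"}, {"bevacizumab"}, {"hba1c": 6.5}),
    Patient("p3", date(1970, 8, 20), {"C34.90"}),  # no current HbA1c on file
    Patient("p4", date(1965, 1, 2), {"E11.9"}, labs={"hba1c": 7.0}),
]
print(eligible_count(site_patients, criteria, as_of))
```

A real query would also need washout windows, date-bounded lab lookups, and code-system mapping; the sketch only shows the shape of the filter that produces the denominator.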
This denominator informs site selection in several ways that historical data cannot. It identifies sites that are strong on historical performance but thin on current protocol-eligible patients — sites that may have excellent investigator teams but genuinely can't enroll this trial. It surfaces sites with strong current patient populations that might be underweighted in a historical-performance-only selection model. And it provides a rational basis for enrollment timeline projections: expected randomization rates applied to an actual eligible cohort produce better-calibrated timelines than expected rates applied to estimated populations.
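To make the timeline point concrete, the following Python sketch applies an assumed monthly randomization rate to each site's eligible pool and counts the months needed to reach a target. The function name, the single uniform rate, and the depletion model (each month a fixed fraction of the remaining pool is randomized) are simplifying assumptions for illustration, not a description of any particular forecasting method.

```python
from typing import Optional


def projected_months_to_target(
    eligible_by_site: dict[str, int],
    monthly_rate: float,
    target: int,
    max_months: int = 60,
) -> Optional[int]:
    """Months until expected randomizations reach the target, or None.

    Assumes every site randomizes the same fraction of its remaining
    eligible pool each month (hypothetical model for illustration).
    """
    if not 0 < monthly_rate <= 1:
        raise ValueError("monthly_rate must be in (0, 1]")
    remaining = {site: float(n) for site, n in eligible_by_site.items()}
    enrolled = 0.0
    for month in range(1, max_months + 1):
        for site in list(remaining):
            randomized = remaining[site] * monthly_rate
            remaining[site] -= randomized  # the pool shrinks as patients enroll
            enrolled += randomized
        if enrolled >= target:
            return month
    return None  # target not reachable within max_months from this cohort


# Illustrative numbers only.
print(projected_months_to_target({"site_a": 45, "site_b": 12}, 0.10, 20))
print(projected_months_to_target({"site_a": 45, "site_b": 12}, 0.10, 60))
```

The second call returns `None` because the target exceeds the combined eligible pool, which is exactly the kind of result a projection based on estimated populations tends to hide.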
The Enrollment Velocity Relationship
Enrollment velocity — the rate at which sites randomize patients after activation — is affected by site execution factors (coordinator staffing, CRA support, protocol complexity). But it is also affected by the size of the eligible population relative to the site's capacity to screen and enroll. A site with 8 eligible patients in its EHR has a ceiling on its enrollment velocity that no amount of additional CRA support can overcome. A site with 35 eligible patients has headroom.
For trial timelines, the distribution of eligible patients across the site network is at least as important as the total. A 10-site network where 2 sites have 70% of the eligible patients will enroll very differently, and more predictably, than a 10-site network where eligible patients are thinly distributed across all sites. Identifying that distribution before site activation allows a CRO to concentrate resources where eligible patients exist and to set site-level enrollment targets that are grounded in actual patient availability rather than evenly distributed optimism.
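One way to look at that distribution is to compute how much of the eligible pool sits in the top few sites, and to split a network-level enrollment target in proportion to each site's eligible count. The Python sketch below does both; the function names, the site labels, and the proportional largest-remainder allocation are assumptions chosen for illustration, and a real target-setting process would also weigh operational factors.

```python
def top_k_share(eligible_by_site: dict[str, int], k: int) -> float:
    """Fraction of all eligible patients held by the k largest sites."""
    total = sum(eligible_by_site.values())
    if total == 0:
        return 0.0
    largest = sorted(eligible_by_site.values(), reverse=True)[:k]
    return sum(largest) / total


def allocate_targets(eligible_by_site: dict[str, int], total_target: int) -> dict[str, int]:
    """Split a network target in proportion to eligible counts.

    Uses largest-remainder rounding so the site targets sum exactly to
    total_target. Targets are not capped at a site's eligible count.
    """
    total = sum(eligible_by_site.values())
    if total == 0:
        raise ValueError("no eligible patients in the network")
    targets = {s: n * total_target // total for s, n in eligible_by_site.items()}
    remainders = {s: n * total_target % total for s, n in eligible_by_site.items()}
    leftover = total_target - sum(targets.values())
    # Hand out the remaining slots to the sites with the largest remainders.
    for site in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        targets[site] += 1
    return targets


# Illustrative 10-site network in which two sites hold 70% of eligible patients.
network = {
    "s01": 35, "s02": 35, "s03": 4, "s04": 4, "s05": 4,
    "s06": 4, "s07": 4, "s08": 4, "s09": 3, "s10": 3,
}
print(top_k_share(network, 2))
print(allocate_targets(network, 20))
```

An even split would ask each of these ten sites for 2 patients; the proportional split instead asks the two large sites for 7 each and expects little or nothing from the smallest ones.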
We're not saying historical performance data should be abandoned in site selection decisions. Investigator commitment, site operational quality, and protocol experience all matter for trial execution. The point is that protocol-specific eligible patient volume is a distinct input from historical performance, and it should be measured directly rather than estimated from a proxy.
Where Prospective Analysis Has Limits
Prospective cohort analysis from EHR data captures the patients whose relevant data is documented in structured fields — coded diagnoses, lab results, medication records. It does not capture patients who have the relevant condition but haven't been seen recently enough to have current lab values, or patients who are managed at a different health system within the same geographic catchment area, or patients who are eligible but identified only through a specialist's clinical note that isn't in structured form.
This means the EHR-based eligible patient count is a floor, not a ceiling. The actual eligible population at a site may be larger than the structured data query returns, because some eligible patients have documentation gaps or have been seen elsewhere. A CRO using prospective cohort analysis for site selection should treat the analysis as a relative ranking tool — identifying which sites have more versus fewer eligible patients in their structured data — rather than as a precise absolute count of available patients.
For indications with high rates of unstructured documentation (psychiatric assessments, functional status determinations, some rare disease diagnoses established by specialist report), the gap between the structured data count and the true eligible population may be significant, and supplemental approaches — investigator interviews, chart sampling, registry linkages — may add meaningful precision. For indications where the core eligibility criteria map cleanly to standard EHR codes (T2DM staging, NYHA classification, cancer staging codes, standard lab ranges), the structured data count is more reliable as a site selection input.
Integrating Cohort Analysis into the Feasibility Process
A practical workflow for incorporating prospective cohort analysis into site selection doesn't require replacing the existing feasibility process — it adds a structured data layer before the site questionnaire phase. A preliminary EHR query across candidate sites, producing relative eligible patient volume estimates, informs which sites receive full feasibility questionnaires and CRA engagement. Sites that clearly lack the current-protocol patient population can be deprioritized early, before the more expensive site qualification activities begin.
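The early triage step described above can be sketched in a few lines of Python. The function name, the site labels, and the use of a simple minimum-count cutoff are assumptions for illustration; in practice the cutoff would be a judgment call, and because structured counts are best read as a relative ranking, a site near the threshold may still merit a closer look.

```python
def triage_sites(
    eligible_by_site: dict[str, int],
    min_eligible: int,
) -> tuple[list[str], list[str]]:
    """Rank candidate sites by preliminary eligible count and split them.

    Returns (advance, deprioritize): sites at or above the hypothetical
    cutoff go on to full feasibility questionnaires; the rest are set aside.
    """
    ranked = sorted(eligible_by_site.items(), key=lambda item: item[1], reverse=True)
    advance = [site for site, count in ranked if count >= min_eligible]
    deprioritize = [site for site, count in ranked if count < min_eligible]
    return advance, deprioritize


# Illustrative preliminary counts from a cross-site query.
preliminary = {"site_a": 45, "site_b": 12, "site_c": 2, "site_d": 0}
advance, deprioritize = triage_sites(preliminary, min_eligible=5)
print("full feasibility:", advance)
print("deprioritized:", deprioritize)
```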
For sites that don't yet have FHIR R4 API access configured for CRO queries, a structured export of de-identified EHR data in HL7 v2 or CCD format, combined with a local query agent, can produce similar cohort estimates without requiring full FHIR infrastructure. The integration timelines vary by EHR system; see the EHR integrations overview for specifics by platform.
The site selection decision remains a judgment call that weighs patient availability alongside operational factors. Prospective cohort analysis makes the patient availability component of that judgment more precise. For CROs running multiple concurrent programs, the ability to consistently apply this approach across site networks — rather than relying on questionnaire estimation — produces enrollment projections that hold up better through the trial lifecycle. For a detailed look at how the matching engine processes eligibility criteria against structured EHR data, the protocol matching overview covers the methodology.