Module 1 — Site & Investigator Intelligence
This document describes how each User Requirements Specification (URS) requirement is implemented in ClinicalOS. Each specification maps 1:1 to a URS requirement and details data flows, algorithms, API contracts, and source file locations.
ClinicalTrials.gov REST API v2 client sends search queries filtered by condition, country, and date range. The response is parsed into Site model fields (name, institution, city, country, specialties, trial counts). Each site is matched by external_id (NCT number) for deduplication. One provenance record is created per run.
Source: services/pipeline/ingestion.py, services/daily_ingestion.py
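The deduplication step can be illustrated with a minimal sketch. It assumes an in-memory dict stands in for the sites table and that a later record for the same external_id overwrites the earlier one; the Site dataclass and upsert_sites name are hypothetical and are not taken from ingestion.py.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    # Hypothetical subset of the Site model fields named in the spec.
    external_id: str
    name: str
    country: str
    specialties: list = field(default_factory=list)


def upsert_sites(store: dict, parsed: list) -> tuple:
    """Insert new sites and update existing ones, keyed by external_id.

    Returns (inserted, updated) counts, which a caller could copy into
    the provenance record for the run.
    """
    inserted = updated = 0
    for site in parsed:
        if site.external_id in store:
            updated += 1
        else:
            inserted += 1
        store[site.external_id] = site  # assumption: last write wins
    return inserted, updated
```

Re-running the same batch yields zero inserts and only updates, which is the behaviour deduplication by external_id is meant to guarantee.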
NCBI E-utilities API queried by investigator name + affiliation. Retrieves publication count and h-index. Results stored in investigators table. Enrichment is idempotent (re-running updates existing records).
Source: scripts/enrich_specialties.py, services/pipeline/entity_resolver.py
Daily mode: queries trials updated in last 24-168 hours, max 10,000 studies. Full mode: broad search by configurable conditions and countries list, includes optional PubMed enrichment (max 50 investigators). Mode selected via POST /api/v1/ingestion/trigger body parameter.
Source: api/v1/ingestion.py, services/daily_ingestion.py
Each ingestion run inserts a record into data_provenance table: source URL, ingestion timestamp, API version, record count, SHA-256 hash of raw response, transformation log (raw > clean > enriched > scored), quality score. Complies with ALCOA+ requirements.
Source: models/data_provenance.py, services/pipeline/ingestion.py
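A provenance record of this shape could be assembled as follows. This is a sketch only: the field names and the build_provenance helper are assumptions, and the quality score is omitted because its calculation is not specified here.

```python
import hashlib
from datetime import datetime, timezone


def build_provenance(source_url: str, api_version: str,
                     raw_response: bytes, record_count: int) -> dict:
    """Build one provenance row for an ingestion run (hypothetical field names)."""
    return {
        "source_url": source_url,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "api_version": api_version,
        "record_count": record_count,
        # Hash the exact bytes received, before any parsing or cleaning.
        "raw_sha256": hashlib.sha256(raw_response).hexdigest(),
        "transformation_log": ["raw", "clean", "enriched", "scored"],
    }
```

Hashing the untouched response bytes lets a later audit confirm that stored data derives from a specific upstream payload.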
Score formula: total = 100 * ((recruitment * 0.30) + (experience * 0.25) + (publications * 0.15) + (infrastructure * 0.15) + (regulatory * 0.15)). The weights sum to 1.00, so total ranges from 0 to 100 when each dimension is in [0, 1]. Recruitment: log(1+total_trials)/log(501) + active bonus. Experience: breadth(30%) + active(40%) + total(30%). Publications: h-index(60%) + pub_count(40%). Infrastructure: capacity tier + institution type bonus. Regulatory: country tier + experience modifier.
Source: services/pipeline/scoring.py
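A worked sketch of the composite score, assuming the factor of 100 applies to the whole weighted sum and each dimension is already normalized to [0, 1]. The clamp to 1.0 in recruitment_base is an assumption, and the active bonus is left out because its size is not given here.

```python
import math

# Default weights from the specification; they sum to 1.0.
WEIGHTS = {
    "recruitment": 0.30,
    "experience": 0.25,
    "publications": 0.15,
    "infrastructure": 0.15,
    "regulatory": 0.15,
}


def recruitment_base(total_trials: int) -> float:
    """log(1 + n) / log(501): 0 for no trials, 1.0 at 500 trials."""
    return min(1.0, math.log(1 + total_trials) / math.log(501))


def composite_score(dimensions: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of dimension values, scaled to 0-100."""
    return 100 * sum(weights[name] * dimensions[name] for name in weights)
```

The logarithm gives diminishing returns: a site's first few trials move the recruitment dimension far more than the difference between 400 and 500 trials.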
API endpoint /api/v1/sites/{id}/score/explain returns: each dimension value with contributing factors (e.g., trial_count: +0.3), data source attribution (ClinicalTrials.gov, PubMed), confidence level, and human-readable explanation text. Frontend renders radar chart + factor table.
Source: api/v1/sites.py (explain endpoint), components/scoring/score-explainability.tsx
POST /api/v1/sites/{id}/score/customize accepts custom weights object (5 floats summing to 1.0). Scoring service recalculates with provided weights. Returns new composite score. Does not persist custom weights (stateless recalculation).
Source: api/v1/sites.py (customize endpoint), services/pipeline/scoring.py
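Validation of the custom weights object might look like the sketch below. The tolerance of 1e-6 and the rejection of negative weights are assumptions; the spec only states that five floats must sum to 1.0. A tolerance is used because floating-point sums rarely equal 1.0 exactly.

```python
import math

DIMENSIONS = ("recruitment", "experience", "publications",
              "infrastructure", "regulatory")


def validate_weights(weights: dict) -> dict:
    """Reject weight objects that are incomplete, negative, or do not sum to 1.0."""
    if set(weights) != set(DIMENSIONS):
        raise ValueError("weights must contain exactly the five dimensions")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-6):
        raise ValueError("weights must sum to 1.0")
    return weights


def rescore(dimensions: dict, weights: dict) -> float:
    """Stateless recalculation: nothing is stored, only the new total is returned."""
    validate_weights(weights)
    return 100 * sum(weights[d] * dimensions[d] for d in DIMENSIONS)
```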
All random operations use seeded Random instances: random.Random(hash(site_id) + offset). No global random state mutation. AEGIS ML scoring uses temperature=0. Rule-based fallback is purely mathematical (no stochastic components).
Source: services/pipeline/scoring.py
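One caveat when deriving seeds: if site_id is a string, Python's built-in hash() is salted per interpreter process unless PYTHONHASHSEED is fixed, so the same site_id can produce different seeds across restarts. The sketch below sidesteps this by deriving the seed from SHA-256 instead. The site_rng helper is hypothetical and is not the code in scoring.py.

```python
import hashlib
import random


def site_rng(site_id: str, offset: int = 0) -> random.Random:
    """Return a private Random instance seeded reproducibly from site_id.

    The seed comes from a SHA-256 digest, so it is identical in every
    process; the global random module state is never touched.
    """
    digest = hashlib.sha256(site_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big") + offset
    return random.Random(seed)
```

Two calls with the same site_id and offset produce the same sequence, while a different offset gives an independent stream for the same site.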
POST /api/v1/sites/search accepts: therapeutic_area, phase, countries[], enrollment_status, min_trials, min_capacity, keyword. SQL query built dynamically with AND logic. Results paginated (page/size or cursor-based). All queries filtered by org_id from JWT.
Source: api/v1/sites.py, services/site_service.py
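Dynamic AND-composed filtering can be sketched with bound parameters, as below. Column names (country, total_trials, name) and the build_site_query helper are assumptions for illustration; only a subset of the accepted filters is shown, and the real service presumably builds SQLAlchemy expressions rather than raw SQL.

```python
def build_site_query(org_id: str, filters: dict) -> tuple:
    """Return (sql, params) with one AND clause per supplied filter.

    Values are always passed as bound parameters, never interpolated.
    The org_id clause is unconditional, so every query is tenant-scoped.
    """
    clauses = ["org_id = ?"]
    params = [org_id]
    if filters.get("countries"):
        placeholders = ", ".join("?" for _ in filters["countries"])
        clauses.append(f"country IN ({placeholders})")
        params.extend(filters["countries"])
    if filters.get("min_trials") is not None:
        clauses.append("total_trials >= ?")
        params.append(filters["min_trials"])
    if filters.get("keyword"):
        clauses.append("name LIKE ?")
        params.append(f"%{filters['keyword']}%")
    sql = "SELECT * FROM sites WHERE " + " AND ".join(clauses)
    return sql, params
```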
User query sent to AEGIS AI Agent API for intent extraction. Agent returns structured filters (name_contains, specialty_contains, countries, institution_contains, min_trials). Synonym expansion maps medical terms (e.g., HCC to hepatocellular, liver_cancer). 18 canonical therapeutic areas with synonym dictionaries. Results cached 5 minutes by query hash.
Source: services/smart_search.py
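Synonym expansion and the 5-minute cache can be sketched together. The dictionary holds only the HCC example from the text; the normalization (strip and lowercase) before hashing, the in-process dict cache, and the function names are assumptions.

```python
import hashlib
import time

# Single illustrative entry; the real service has 18 therapeutic areas.
SYNONYMS = {"hcc": ["hepatocellular", "liver_cancer"]}
CACHE_TTL_SECONDS = 300  # 5 minutes

_cache = {}  # query hash -> (stored_at, results)


def expand_terms(query: str) -> list:
    """Lowercase the query and append known synonyms after each token."""
    terms = []
    for token in query.lower().split():
        terms.append(token)
        terms.extend(SYNONYMS.get(token, []))
    return terms


def cached_search(query: str, search, now=None):
    """Run search(expanded_terms), reusing a result younger than the TTL."""
    now = time.monotonic() if now is None else now
    key = hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()
    hit = _cache.get(key)
    if hit is not None and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    result = search(expand_terms(query))
    _cache[key] = (now, result)
    return result
```

Normalizing before hashing means "HCC" and " hcc " share one cache entry, which raises the hit rate without changing results.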
POST /api/v1/site-agent accepts message, optional session_id, context_sites, and conversation history. Agent maintains session state for multi-turn queries. Returns response text, matching sites list, follow-up suggestions, and insights. Session persists for duration of user interaction.
Source: api/v1/site_agent.py, services/site_agent_service.py
Investigator model stores: name, affiliation, country, city, specialty, h_index, publication_count, trial_count, active_trials, email, bio_summary. Data sourced from ClinicalTrials.gov (trial participation) and PubMed (publications). GET /api/v1/investigators with filters. GET /api/v1/investigators/{id} for detail.
Source: api/v1/investigators.py, models/investigator.py
POST /api/v1/investigators/compare accepts list of 2-5 investigator IDs. Returns full profiles in parallel for side-by-side display. Frontend renders comparison table with all metrics.
Source: api/v1/investigators.py
POST /api/v1/exports/sites/pdf generates formatted PDF with scores, dimensions, and details using ReportLab. POST /api/v1/exports/sites/excel generates XLSX with raw data using openpyxl. Both accept search filters to export matching results. StreamingResponse for large exports.
Source: api/v1/exports.py
CRUD operations on projects table (name, indication, phase, countries, target_patients). POST /api/v1/projects/{id}/shortlist adds/removes sites via project_sites junction table. Each shortlist entry records added_by user, notes, and status (shortlisted/selected/rejected).
Source: api/v1/projects.py, models/project.py, models/project_site.py
POST /api/v1/predictions/recruitment accepts site IDs and study parameters. Returns per-site predictions with optimistic/realistic/pessimistic patient counts, a confidence score (0-1), and an explanatory factors array. Model uses historical trial completion rates and site capacity.
Source: api/v1/predictions.py, models/prediction.py
TenantMiddleware extracts org_id from JWT payload on every request. OrgId dependency injects org_id into all endpoint functions. All SQLAlchemy queries filter by org_id via BaseModel.TenantMixin. Cross-tenant access returns empty results (never 403 to avoid information disclosure).
Source: core/middleware.py, api/deps.py, models/base.py
POST /api/v1/auth/login validates email+password (bcrypt), returns access_token (15min, HS256) and refresh_token (7 days). Tokens stored as httpOnly cookies. POST /api/v1/auth/refresh generates new access token from valid refresh token. Token payload: sub (user_id), org_id, role, type, exp.
Source: api/v1/auth.py, core/security.py, core/cookies.py
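The token layout can be illustrated with a standard-library-only sketch of HS256 signing. This shows the format only: production code would normally rely on a vetted JWT library, and the helper names here are hypothetical rather than those in core/security.py.

```python
import base64
import hashlib
import hmac
import json

ACCESS_TTL_SECONDS = 15 * 60  # 15-minute access token


def _b64url(data: bytes) -> str:
    """Base64url without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def access_token_payload(user_id: str, org_id: str, role: str, now: int) -> dict:
    """Claims listed in the spec: sub, org_id, role, type, exp."""
    return {"sub": user_id, "org_id": org_id, "role": role,
            "type": "access", "exp": now + ACCESS_TTL_SECONDS}


def encode_hs256(payload: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_hs256(token: str, secret: bytes) -> bool:
    """Check the signature only; expiry (exp) must be checked separately."""
    signing_input, _, signature = token.rpartition(".")
    expected = _b64url(
        hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest())
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```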
4 roles: super_admin (all operations), org_admin (org management + all features), user (standard features), read_only (GET only). Role stored in JWT. CurrentUser dependency validates role on protected endpoints. Admin endpoints check role in ('super_admin', 'org_admin').
Source: api/deps.py, core/security.py, models/organization.py
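The role checks can be sketched as plain membership tests. Note that in Python the literal expression role == 'super_admin' or 'org_admin' is always truthy, because the non-empty string 'org_admin' is evaluated on its own; a set-membership test avoids that pitfall. Function names are hypothetical, and per-feature permissions are not modelled.

```python
ROLES = frozenset({"super_admin", "org_admin", "user", "read_only"})
ADMIN_ROLES = frozenset({"super_admin", "org_admin"})


def is_admin(role: str) -> bool:
    """True only for the two admin roles."""
    return role in ADMIN_ROLES


def is_allowed(role: str, method: str) -> bool:
    """Coarse method-level gate: unknown roles get nothing, read_only gets GET only."""
    if role not in ROLES:
        return False
    if role == "read_only":
        return method.upper() == "GET"
    return True
```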
POST /api/v1/auth/2fa/setup generates TOTP secret and QR code URI. POST /api/v1/auth/2fa/verify validates 6-digit TOTP code. Once enabled, login requires code via POST /api/v1/auth/2fa/login. Backup codes generated at setup (10 single-use codes).
Source: api/v1/auth.py, services/totp_service.py
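For reference, 6-digit TOTP validation as defined in RFC 6238 (HMAC-SHA1, 30-second step) can be sketched with the standard library alone. The one-step drift window in verify_totp is an assumption, and totp_service.py may well use a dedicated library instead.

```python
import base64
import hashlib
import hmac
import struct


def totp(secret_b32: str, timestamp: float, digits: int = 6, step: int = 30) -> str:
    """Compute the TOTP code for a base32 secret at a Unix timestamp."""
    key = base64.b32decode(secret_b32, casefold=True)
    counter = struct.pack(">Q", max(0, int(timestamp)) // step)
    digest = hmac.new(key, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify_totp(secret_b32: str, code: str, timestamp: float,
                window: int = 1, step: int = 30) -> bool:
    """Accept the current step plus `window` steps either side (clock drift)."""
    submitted = code.encode("utf-8")
    return any(
        hmac.compare_digest(
            totp(secret_b32, timestamp + drift * step, step=step).encode("utf-8"),
            submitted)
        for drift in range(-window, window + 1)
    )
```

Backup codes are a separate mechanism (random single-use values stored hashed) and are not covered by this sketch.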
AuditMiddleware intercepts POST/PUT/PATCH/DELETE requests. On success (status < 400), inserts record into audit_log: id, org_id, user_id, action (HTTP method), entity_type (URL path), details (JSON), ip_address (SHA-256 hashed), created_at. PostgreSQL trigger prevents UPDATE/DELETE on audit_log table.
Source: core/middleware.py (AuditMiddleware), models/audit.py, migration 011
Each audit record includes record_hash = SHA-256(id + org_id + action + entity_type + ip + timestamp) and prev_hash = record_hash of previous record for same org. Chain integrity verifiable by replaying hashes. Index on record_hash for fast verification.
Source: core/middleware.py (_write_audit_log), migration 011
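Chain construction and verification can be sketched as below. The sketch follows the field list given above using plain string concatenation; the all-zero genesis value for the first record of an org, the dict-based records, and the function names are assumptions.

```python
import hashlib

GENESIS = "0" * 64  # assumed prev_hash for the first record of an org
HASHED_FIELDS = ("id", "org_id", "action", "entity_type", "ip_address", "created_at")


def record_hash(rec: dict) -> str:
    """SHA-256 over the concatenated fields named in the spec."""
    material = "".join(str(rec[name]) for name in HASHED_FIELDS)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def append_record(chain: list, rec: dict) -> dict:
    """Link a new record to the previous one and append it."""
    entry = dict(rec)
    entry["prev_hash"] = chain[-1]["record_hash"] if chain else GENESIS
    entry["record_hash"] = record_hash(entry)
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Replay hashes: every record must match its own hash and its predecessor's."""
    prev = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev or entry["record_hash"] != record_hash(entry):
            return False
        prev = entry["record_hash"]
    return True
```

Because prev_hash is not part of the hashed material in this formula, the link is checked by comparison rather than by the hash itself; including prev_hash in the hash input would bind each record to its predecessor more tightly.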
Transit: TLS 1.3 enforced by Cloud Run + Cloudflare. At rest: PostgreSQL pgcrypto extension enabled (migration 011). Sensitive columns (bio_summary_encrypted, specialties_encrypted, EDC credentials) use pgp_sym_encrypt with key from GCP Secret Manager. SSL connection verified via SHOW ssl.
Source: migration 011, core/config.py (SECRET_MANAGER)
Site search uses indexed queries (ix_sites_org_id, ix_sites_country). Cursor-based pagination keeps per-page cost independent of page depth (an index seek instead of an OFFSET scan). Smart search caches results for 5 minutes by query hash. EXPLAIN ANALYZE confirms index usage. Target: p95 response time < 5 seconds.
Source: api/v1/sites.py, services/site_service.py
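Cursor (keyset) pagination can be sketched with sqlite3 standing in for PostgreSQL. The cursor here is simply the last id seen; the table layout and the fetch_page helper are assumptions, not the code in site_service.py.

```python
import sqlite3


def fetch_page(conn: sqlite3.Connection, org_id: str, after_id: int = 0, size: int = 2):
    """Return (rows, next_cursor) for one tenant-scoped page.

    The WHERE id > ? predicate lets the database seek directly to the
    start of the page instead of scanning and discarding skipped rows.
    """
    rows = conn.execute(
        "SELECT id, name FROM sites "
        "WHERE org_id = ? AND id > ? ORDER BY id LIMIT ?",
        (org_id, after_id, size),
    ).fetchall()
    # A short page means the result set is exhausted.
    next_cursor = rows[-1][0] if len(rows) == size else None
    return rows, next_cursor
```

The caller passes next_cursor back as after_id to fetch the following page and stops when it is None.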
Cloud Run auto-scales (0 to N instances). Cloud SQL HA configuration (regional, automatic failover). Health check endpoint /health monitored. Datadog alerts on error rate > 1%. GCP SLA: 99.95% for Cloud Run, 99.95% for Cloud SQL HA.
Source: main.py (/health), GCP infrastructure
Table ai_models stores: model_id, provider, model_name, version, deployment_date, benchmark_score, status (active/deprecated/retired), change_reason, description. API /api/v1/ai-modules/registry lists all registered models. No 'latest' references: all model calls use explicit versioned IDs.
Source: models/ai_model_registry.py, api/v1/ai_modules.py
Table ai_audit_log stores: id, org_id, user_id, model_id, input_hash (SHA-256 of request), output_hash (SHA-256 of response), latency_ms, token_count, timestamp. Every AEGIS API call logged. API /api/v1/ai-explain/audit-log lists entries with filters.
Source: models/ai_model_registry.py, integrations/aegis/mcp_client.py
Footer component shows 'Decision Support Tool' badge. /legal/disclaimer page with full product classification. Sidebar badges: 'Validated' (green) for Module 1, 'Preview' (amber) for modules 2-7, 'Beta' (red) for modules 8-10. Dashboard layout includes disclaimer banner.
Source: app/legal/disclaimer/page.tsx, components/layout/sidebar.tsx, app/dashboard/layout.tsx
Translation system uses React context + dictionary lookup. lib/i18n/translations.ts contains key-value pairs for EN, FR, DE, ES, IT, PT, JA, ZH, KO. LanguageSwitcher component in header. Fallback: missing key returns English value. All UI labels are translation keys.
Source: lib/i18n/context.tsx, lib/i18n/translations.ts, components/layout/language-switcher.tsx
Sites page offers three view modes: (1) Card grid: responsive grid of SiteCard components with score gauge, key metrics, and action buttons. (2) Table: sortable DataTable with columns for all metrics, click-to-sort. (3) Map: interactive map with markers colored by score (green > 70, amber 40-70, red < 40). View mode persisted in component state.
Source: app/dashboard/sites/page.tsx, components/sites/
Total specifications: 30 (1:1 mapping with URS)
All source paths relative to: src/backend/app/