Status (April 2026): DR procedures are documented (backups, PITR, restore runbook), and the underlying infrastructure has automated daily backups plus continuous WAL archiving. Formal DR test exercises have not yet been executed; the first drill is targeted for Q2 2026. See SOC 2 readiness.

Recovery targets

| Metric | Target |
| --- | --- |
| RPO (recovery point objective) | ≤ 5 minutes (worst-case data loss) |
| RTO (recovery time objective) | ≤ 4 hours (worst-case downtime) |
| Database PITR retention | 7 days |
| Backup retention | 30 days |

Backup strategy

Cloud SQL automated backups

Daily automated snapshots. Point-in-time recovery (PITR) with continuous WAL archiving provides ≤ 5-minute RPO for the last 7 days.
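As a sketch, a point-in-time restore clones the primary instance to a new instance at a chosen timestamp (the instance names and timestamp below are placeholders, not real values):

```shell
# Clone the primary Cloud SQL instance to a new instance at a specific
# timestamp just before the incident. Names and timestamp are illustrative.
gcloud sql instances clone prod-db prod-db-pitr-restore \
  --point-in-time "2026-04-01T11:55:00Z"
```

The clone is a separate instance, so production stays untouched while the restored data is verified.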

Cloud Storage versioning

Document buckets have object versioning enabled with 30-day retention. Deleted objects are recoverable within the retention window.
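A deleted or overwritten object can be recovered from its noncurrent generations; a hedged sketch (bucket and object names are placeholders):

```shell
# List all generations of an object, including noncurrent ones.
gcloud storage ls --all-versions gs://example-docs-bucket/reports/denial.pdf

# Restore a noncurrent generation by copying it back over the live object.
# The generation number after "#" comes from the listing above.
gcloud storage cp \
  "gs://example-docs-bucket/reports/denial.pdf#1712345678901234" \
  gs://example-docs-bucket/reports/denial.pdf
```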

Terraform state

State bucket has object versioning + a separate CMEK key. Access is through Workload Identity Federation only.

Cross-region

Backups are automatically replicated to a CMEK-encrypted GCS bucket in a different GCP region from the primary database.

Scenarios and runbooks

  • Data corruption: Use Cloud SQL point-in-time recovery to a new instance at a timestamp just before the incident. Verify data on the restored instance, then export the affected records and replay them into production. A step-by-step runbook documents this procedure.
  • Database instance loss: Provision a new Cloud SQL instance from the latest automated backup. Update the runtime's DATABASE_URL secret in Secret Manager; Cloud Run re-provisions with the new URL. Target RTO: ≤ 2 hours.
  • Region outage: Provision new infrastructure in the secondary region using Terraform and restore from the cross-region backup bucket. Target RTO: ≤ 4 hours. This scenario has not yet been drilled.
  • Security breach: Isolate affected infrastructure, rotate all secrets, provision clean infrastructure via Terraform, and restore from backups taken before the breach window. The full incident response runbook applies — see Incident response.
  • Document loss: Restore from GCS object versioning (if within the retention window) or from the cross-region backup. Re-index affected documents in the database.
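The DATABASE_URL swap described above might look like the following sketch (the secret name matches the runbook; the service name, image path, region, and connection string are assumptions for illustration):

```shell
# Add a new secret version pointing at the restored instance.
printf 'postgres://app_user:REDACTED@10.0.0.5:5432/app_production' | \
  gcloud secrets versions add DATABASE_URL --data-file=-

# Deploy a new Cloud Run revision so the service reads the new secret version.
gcloud run deploy app \
  --image us-docker.pkg.dev/example-project/app/app:latest \
  --region us-central1
```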

Testing cadence

| Test | Cadence | Last executed |
| --- | --- | --- |
| Backup verification (read-only restore to sandbox) | Monthly | Not yet executed; first run targeted Q2 2026 |
| PITR test (restore to specific timestamp) | Quarterly | Not yet executed; first run targeted Q2 2026 |
| Full DR drill (region failover) | Annually | Not yet executed; first run targeted Q2 2026 |
| Tabletop exercise (response walk-through) | Semi-annually | Not yet executed; first run targeted Q2 2026 |
When each test is completed, this page is updated with the execution date, observed RPO/RTO, and any remediation items.

Data integrity verification

On every restore drill we verify:
  • Row counts match the pre-restore baseline (within the delta allowed by the PITR window).
  • The schema matches the migration head, compared via `rails db:schema:dump`.
  • Sample record checksums verify PHI encryption integrity.
  • Audit logs confirm no tampering within the restore window.
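A minimal sketch of the row-count check, assuming psql access to both the baseline and the restored instance (the connection-URL variables and the table name are placeholders):

```shell
# Compare row counts for one table between baseline and restored databases.
baseline=$(psql "$BASELINE_DATABASE_URL" -tAc 'SELECT count(*) FROM denials;')
restored=$(psql "$RESTORED_DATABASE_URL" -tAc 'SELECT count(*) FROM denials;')
echo "baseline=$baseline restored=$restored delta=$((baseline - restored))"
```

In a real drill this runs per table, and the delta is compared against the writes expected inside the PITR window.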

Customer impact

For a DR event that affects customer service:
  • Status page (status.denialbase.com) updated within 15 minutes.
  • Affected customers notified by email.
  • Post-incident summary within 5 business days.

Dependencies outside our control

| Dependency | What a DR event looks like |
| --- | --- |
| Google Cloud Platform | Google's multi-region architecture makes full-region loss rare. Region-level failover is documented above. |
| Anthropic | LLM outage: denial detection and appeal drafting jobs queue, then process when the API returns. Non-blocking for existing records. |
| Sentry | Error monitoring degrades but customer service is unaffected. |
| Amazon SES | Email delivery degrades; users can still sign in and use the app. |
| Kaiser / payer integrations | Requests during an outage are queued for retry with exponential backoff. |
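The queued-retry behavior for payer integrations can be sketched as a simple exponential-backoff loop. This is illustrative only: the stubbed call_payer_api function, the attempt limit, and the base delay are assumptions, not the production implementation.

```shell
#!/bin/sh
# Exponential-backoff retry sketch. call_payer_api is a stub that fails
# until the third attempt, to exercise the retry loop.
call_payer_api() {
  COUNT=$((COUNT + 1))
  [ "$COUNT" -ge 3 ]
}

COUNT=0
delay=1
for attempt in 1 2 3 4 5; do
  if call_payer_api; then
    echo "succeeded on attempt $attempt"
    break
  fi
  sleep "$delay"          # back off before the next attempt
  delay=$((delay * 2))    # 1s, 2s, 4s, ...
done
```

The production queue additionally caps the maximum delay and dead-letters requests that exhaust their retries.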