## Recovery targets
| Metric | Target |
|---|---|
| RPO (recovery point objective) | ≤ 5 minutes — worst case data loss |
| RTO (recovery time objective) | ≤ 4 hours — worst case downtime |
| Database PITR retention | 7 days |
| Backup retention | 30 days |
## Backup strategy

### Cloud SQL automated backups
Daily automated snapshots. Point-in-time recovery (PITR) with continuous WAL archiving provides ≤ 5-minute RPO for the last 7 days.
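As a sketch, a pre-flight check like the following (hypothetical helper, GNU `date` assumed) can confirm a requested restore point is still inside the 7-day PITR window before a recovery is attempted:

```shell
# Hypothetical pre-flight check: is the requested restore point still inside
# the 7-day PITR window? (GNU date assumed; values are illustrative.)
pitr_window_days=7

within_pitr_window() {
  local restore_point="$1"
  local now_epoch point_epoch age_days
  now_epoch=$(date -u +%s)
  point_epoch=$(date -u -d "$restore_point" +%s)
  age_days=$(( (now_epoch - point_epoch) / 86400 ))
  if (( age_days < pitr_window_days )); then
    echo "within-window"
  else
    echo "outside-window"
  fi
}

# A restore point from two days ago is still recoverable via PITR:
within_pitr_window "$(date -u -d '2 days ago' +%Y-%m-%dT%H:%M:%SZ)"
```

Anything older than the window has to come from the daily snapshots instead, at the cost of a larger RPO.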
### Cloud Storage versioning
Document buckets have object versioning enabled with 30-day retention. Deleted objects are recoverable within the retention window.
### Terraform state

The state bucket has object versioning enabled and its own CMEK key. Access is through Workload Identity Federation only.
### Cross-region

Backups are copied automatically to a CMEK-encrypted GCS bucket in a different GCP region from the primary database.
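The automated copy could be sketched as a nightly export job along these lines; the instance, bucket, and database names are hypothetical placeholders, and the command is printed rather than executed:

```shell
# Printed sketch of a nightly cross-region export job. The instance, bucket,
# and database names are hypothetical placeholders, not the real resources.
export_cmd='gcloud sql export sql denialbase-prod \
  gs://denialbase-dr-us-east1/exports/$(date -u +%F).sql.gz \
  --database=denialbase_production'
echo "$export_cmd"
```

The destination bucket lives in the secondary region, so a primary-region outage leaves these exports intact.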
## Scenarios and runbooks
### Accidental record deletion / corruption
Use Cloud SQL point-in-time recovery to a new instance at a timestamp just before the incident. Verify data on the restored instance, then export the affected records and replay them into production. The runbook documents each step.
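The restore step can be sketched as a printed `gcloud` command builder; the instance names and timestamp below are hypothetical:

```shell
# Build (but do not run) the PITR clone command. Instance names and the
# incident timestamp are hypothetical placeholders.
build_pitr_clone_cmd() {
  local source="$1" target="$2" point_in_time="$3"
  printf 'gcloud sql instances clone %s %s --point-in-time=%s\n' \
    "$source" "$target" "$point_in_time"
}

# Restore to one minute before the (hypothetical) incident timestamp:
build_pitr_clone_cmd denialbase-prod denialbase-prod-restore 2026-03-14T09:29:00Z
```

Picking a timestamp slightly before the incident keeps the corrupted writes out of the restored instance while minimizing lost data.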
### Full database loss (primary instance failure)
Provision a new Cloud SQL instance from the latest automated backup. Update the runtime’s `DATABASE_URL` secret in Secret Manager; Cloud Run re-provisions with the new URL. Target RTO ≤ 2 hours.
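The two runbook steps might look like this printed sketch; the backup ID, instance names, and connection-string variable are placeholders:

```shell
# Printed sketch of the two runbook commands. The backup ID, instance names,
# and connection-string variable are hypothetical placeholders.
restore_cmds='
# 1. Restore the latest automated backup onto a fresh instance
gcloud sql backups restore 1700000000000 \
  --restore-instance=denialbase-prod-new \
  --backup-instance=denialbase-prod
# 2. Point the runtime at the new instance
echo -n "$NEW_DATABASE_URL" | \
  gcloud secrets versions add DATABASE_URL --data-file=-
'
echo "$restore_cmds"
```

Because Cloud Run reads the secret at provision time, step 2 is what actually cuts traffic over to the new instance.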
### Region outage
Provision new infrastructure in the secondary region using Terraform. Restore from the cross-region backup bucket. Target RTO ≤ 4 hours. This scenario has not yet been drilled.
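The provisioning step could be sketched as follows; the Terraform workspace name, region, and variable names are hypothetical:

```shell
# Printed sketch of the failover provisioning step. The workspace name,
# region, and variable names are hypothetical placeholders.
failover_cmds='
terraform workspace select dr-secondary
terraform apply \
  -var="region=us-east1" \
  -var="restore_bucket=denialbase-dr-us-east1"
'
echo "$failover_cmds"
```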
### Security breach / ransomware
Isolate affected infrastructure, rotate all secrets, provision clean infrastructure via Terraform, restore from backups taken before the breach window. Full incident response runbook applies — see Incident response.
### Document storage loss
Restore from GCS object versioning (if within retention window) or cross-region backup. Re-index affected documents in the database.
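Recovering a single deleted object could look like this printed sketch; the bucket, object path, and generation number are hypothetical:

```shell
# Printed sketch of recovering a deleted object from a versioned bucket by
# copying a noncurrent generation back to the live name. The bucket, object
# path, and generation number are hypothetical placeholders.
recover_cmds='
# List every generation, including noncurrent ones
gcloud storage ls -a gs://denialbase-docs/claims/12345.pdf
# Copy the desired generation back over the live object name
gcloud storage cp \
  gs://denialbase-docs/claims/12345.pdf#1700000000000000 \
  gs://denialbase-docs/claims/12345.pdf
'
echo "$recover_cmds"
```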
## Testing cadence
| Test | Cadence | Last executed |
|---|---|---|
| Backup verification (read-only restore to sandbox) | Monthly | Not yet executed — first run targeted Q2 2026 |
| PITR test (restore to specific timestamp) | Quarterly | Not yet executed — first run targeted Q2 2026 |
| Full DR drill (region failover) | Annually | Not yet executed — first run targeted Q2 2026 |
| Tabletop exercise (response walk-through) | Semi-annually | Not yet executed — first run targeted Q2 2026 |
When each test is completed, this page is updated with the execution date, observed RPO/RTO, and any remediation items.
## Data integrity verification
On every restore drill we verify:

- Row counts match the pre-restore baseline (± the allowed delta for the PITR window).
- Schema matches the migration head via a `rails db:schema:dump` comparison.
- Sample record checksums verify PHI encryption integrity.
- Audit logs confirm no tampering in the restore window.
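The first check can be sketched as a small helper; the counts and allowed delta below are illustrative, not drill baselines:

```shell
# Sketch of the row-count check: accept the restore only if the restored
# count is within the allowed delta of the pre-restore baseline.
# Counts and delta below are illustrative.
row_count_ok() {
  local baseline="$1" restored="$2" allowed_delta="$3"
  local diff=$(( baseline - restored ))
  (( diff < 0 )) && diff=$(( -diff ))
  if (( diff <= allowed_delta )); then echo "ok"; else echo "mismatch"; fi
}

row_count_ok 120043 120041 5   # within the PITR-window delta: ok
row_count_ok 120043 119500 5   # too far off: mismatch, fail the drill
```

The allowed delta exists because legitimate writes can land between the baseline snapshot and the chosen restore point.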
## Customer impact
For a DR event that affects customer service:

- Status page (`status.denialbase.com`) updated within 15 minutes.
- Affected customers notified by email.
- Post-incident summary within 5 business days.
## Dependencies outside our control
| Dependency | What a DR event looks like |
|---|---|
| Google Cloud Platform | Google’s multi-region architecture means full-region loss is rare. Region-level failovers documented above. |
| Anthropic | LLM outage → denial detection and appeal drafting are queued, then processed when the API returns. Non-blocking for existing records. |
| Sentry | Error monitoring degrades but does not affect customer service. |
| Amazon SES | Email delivery degrades. Users can still sign in and use the app. |
| Kaiser / payer integrations | Requests during an outage are queued and retried with exponential backoff. |
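The retry schedule for payer-integration outages could be sketched as doubling backoff with a cap; the 30-second base and one-hour cap are illustrative, not the real integration settings:

```shell
# Sketch of exponential backoff for queued payer requests: the delay doubles
# each attempt from a 30 s base and is capped at one hour. Both values are
# illustrative, not the real integration settings.
backoff_seconds() {
  local attempt="$1" base=30 cap=3600
  local delay=$(( base * (1 << (attempt - 1)) ))
  (( delay > cap )) && delay=$cap
  echo "$delay"
}

# Delays for the first five attempts, one per line: 30 60 120 240 480
for attempt in 1 2 3 4 5; do backoff_seconds "$attempt"; done
```

The cap keeps long outages from pushing retries out indefinitely while the queue preserves every request.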