## Recovery targets
| Metric | Target |
|---|---|
| RPO (recovery point objective) | ≤ 5 minutes — worst case data loss |
| RTO (recovery time objective) | ≤ 4 hours — worst case downtime |
| Database PITR retention | 7 days |
| Backup retention | 30 days |
## Backup strategy

### Cloud SQL automated backups
Daily automated snapshots. Point-in-time recovery (PITR) with continuous WAL archiving provides ≤ 5-minute RPO for the last 7 days.
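As a sketch, a pre-flight check like the following (hypothetical helper, GNU `date` assumed) can confirm a requested restore point is still inside the 7-day PITR window before a recovery is attempted:

```shell
# Hypothetical pre-flight check: is the requested restore point still inside
# the 7-day PITR window? (GNU date assumed; values are illustrative.)
pitr_window_days=7

within_pitr_window() {
  local restore_point="$1"
  local now_epoch point_epoch age_days
  now_epoch=$(date -u +%s)
  point_epoch=$(date -u -d "$restore_point" +%s)
  age_days=$(( (now_epoch - point_epoch) / 86400 ))
  if (( age_days < pitr_window_days )); then
    echo "within-window"
  else
    echo "outside-window"
  fi
}

# A restore point from two days ago is still recoverable via PITR:
within_pitr_window "$(date -u -d '2 days ago' +%Y-%m-%dT%H:%M:%SZ)"
```

Anything older than the window has to come from the daily snapshots instead, at the cost of a larger RPO.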
### Cloud Storage versioning
Document buckets have object versioning enabled with 30-day retention. Deleted objects are recoverable within the retention window.
### Terraform state

The state bucket has object versioning enabled and its own CMEK key. Access is through Workload Identity Federation only.
### Cross-region

Backups are copied automatically to a CMEK-encrypted GCS bucket in a different GCP region from the primary database.
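The automated copy could be sketched as a nightly export job along these lines; the instance, bucket, and database names are hypothetical placeholders, and the command is printed rather than executed:

```shell
# Printed sketch of a nightly cross-region export job. The instance, bucket,
# and database names are hypothetical placeholders, not the real resources.
export_cmd='gcloud sql export sql denialbase-prod \
  gs://denialbase-dr-us-east1/exports/$(date -u +%F).sql.gz \
  --database=denialbase_production'
echo "$export_cmd"
```

The destination bucket lives in the secondary region, so a primary-region outage leaves these exports intact.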
## Scenarios and runbooks
### Accidental record deletion / corruption
Use Cloud SQL point-in-time recovery to a new instance at a timestamp just before the incident. Verify data on the restored instance, then export the affected records and replay them into production. The runbook documents each step.
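The restore step can be sketched as a printed `gcloud` command builder; the instance names and timestamp below are hypothetical:

```shell
# Build (but do not run) the PITR clone command. Instance names and the
# incident timestamp are hypothetical placeholders.
build_pitr_clone_cmd() {
  local source="$1" target="$2" point_in_time="$3"
  printf 'gcloud sql instances clone %s %s --point-in-time=%s\n' \
    "$source" "$target" "$point_in_time"
}

# Restore to one minute before the (hypothetical) incident timestamp:
build_pitr_clone_cmd denialbase-prod denialbase-prod-restore 2026-03-14T09:29:00Z
```

Picking a timestamp slightly before the incident keeps the corrupted writes out of the restored instance while minimizing lost data.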
### Full database loss (primary instance failure)
Provision a new Cloud SQL instance from the latest automated backup. Update the runtime’s `DATABASE_URL` secret in Secret Manager; Cloud Run re-provisions with the new URL. Target RTO ≤ 2 hours.
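The two runbook steps might look like this printed sketch; the backup ID, instance names, and connection-string variable are placeholders:

```shell
# Printed sketch of the two runbook commands. The backup ID, instance names,
# and connection-string variable are hypothetical placeholders.
restore_cmds='
# 1. Restore the latest automated backup onto a fresh instance
gcloud sql backups restore 1700000000000 \
  --restore-instance=denialbase-prod-new \
  --backup-instance=denialbase-prod
# 2. Point the runtime at the new instance
echo -n "$NEW_DATABASE_URL" | \
  gcloud secrets versions add DATABASE_URL --data-file=-
'
echo "$restore_cmds"
```

Because Cloud Run reads the secret at provision time, step 2 is what actually cuts traffic over to the new instance.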
### Region outage
Provision new infrastructure in the secondary region using Terraform. Restore from the cross-region backup bucket. Target RTO ≤ 4 hours. This scenario has not yet been drilled.
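The provisioning step could be sketched as follows; the Terraform workspace name, region, and variable names are hypothetical:

```shell
# Printed sketch of the failover provisioning step. The workspace name,
# region, and variable names are hypothetical placeholders.
failover_cmds='
terraform workspace select dr-secondary
terraform apply \
  -var="region=us-east1" \
  -var="restore_bucket=denialbase-dr-us-east1"
'
echo "$failover_cmds"
```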
### Security breach / ransomware
Isolate affected infrastructure, rotate all secrets, provision clean infrastructure via Terraform, restore from backups taken before the breach window. Full incident response runbook applies — see Incident response.
### Document storage loss
Restore from GCS object versioning (if within retention window) or cross-region backup. Re-index affected documents in the database.
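Recovering a single deleted object could look like this printed sketch; the bucket, object path, and generation number are hypothetical:

```shell
# Printed sketch of recovering a deleted object from a versioned bucket by
# copying a noncurrent generation back to the live name. The bucket, object
# path, and generation number are hypothetical placeholders.
recover_cmds='
# List every generation, including noncurrent ones
gcloud storage ls -a gs://denialbase-docs/claims/12345.pdf
# Copy the desired generation back over the live object name
gcloud storage cp \
  gs://denialbase-docs/claims/12345.pdf#1700000000000000 \
  gs://denialbase-docs/claims/12345.pdf
'
echo "$recover_cmds"
```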
## Testing cadence
| Test | Cadence | Last executed |
|---|---|---|
| Backup verification (read-only restore to sandbox) | Monthly | Not yet executed — first run targeted Q2 2026 |
| PITR test (restore to specific timestamp) | Quarterly | Not yet executed — first run targeted Q2 2026 |
| Full DR drill (region failover) | Annually | Not yet executed — first run targeted Q2 2026 |
| Tabletop exercise (response walk-through) | Semi-annually | Not yet executed — first run targeted Q2 2026 |
When each test is completed, this page is updated with the execution date, observed RPO/RTO, and any remediation items.
## Data integrity verification
On every restore drill we verify:

- Row counts match the pre-restore baseline (± the allowed delta for the PITR window).
- Schema matches the migration head via a `rails db:schema:dump` comparison.
- Sample record checksums verify PHI encryption integrity.
- Audit logs confirm no tampering in the restore window.
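The first check can be sketched as a small helper; the counts and allowed delta below are illustrative, not drill baselines:

```shell
# Sketch of the row-count check: accept the restore only if the restored
# count is within the allowed delta of the pre-restore baseline.
# Counts and delta below are illustrative.
row_count_ok() {
  local baseline="$1" restored="$2" allowed_delta="$3"
  local diff=$(( baseline - restored ))
  (( diff < 0 )) && diff=$(( -diff ))
  if (( diff <= allowed_delta )); then echo "ok"; else echo "mismatch"; fi
}

row_count_ok 120043 120041 5   # within the PITR-window delta: ok
row_count_ok 120043 119500 5   # too far off: mismatch, fail the drill
```

The allowed delta exists because legitimate writes can land between the baseline snapshot and the chosen restore point.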
## Customer impact
For a DR event that affects customer service:

- Status page (`status.denialbase.com`) updated within 15 minutes.
- Affected customers notified by email.
- Post-incident summary within 5 business days.
## Dependencies outside our control
| Dependency | What a DR event looks like |
|---|---|
| Google Cloud Platform | Google’s multi-region architecture means full-region loss is rare. Region-level failovers documented above. |
| Anthropic | LLM outage → denial detection and appeal drafting are queued, then processed when the API returns. Non-blocking for existing records. |
| Sentry | Error monitoring degrades but does not affect customer service. |
| Amazon SES | Email delivery degrades. Users can still sign in and use the app. |
| Kaiser / payer integrations | Requests during an outage are queued and retried with exponential backoff. |
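The retry schedule for payer-integration outages could be sketched as doubling backoff with a cap; the 30-second base and one-hour cap are illustrative, not the real integration settings:

```shell
# Sketch of exponential backoff for queued payer requests: the delay doubles
# each attempt from a 30 s base and is capped at one hour. Both values are
# illustrative, not the real integration settings.
backoff_seconds() {
  local attempt="$1" base=30 cap=3600
  local delay=$(( base * (1 << (attempt - 1)) ))
  (( delay > cap )) && delay=$cap
  echo "$delay"
}

# Delays for the first five attempts, one per line: 30 60 120 240 480
for attempt in 1 2 3 4 5; do backoff_seconds "$attempt"; done
```

The cap keeps long outages from pushing retries out indefinitely while the queue preserves every request.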