# M4 Backend Foundation Implementation Plan (Dev First)
Date: 2026-02-24
Owner: Wilfred (Technical Lead)
Primary environment: `krow-workforce-dev`
## 1) Objective
Build a secure, modular, and scalable backend foundation in `dev` without breaking the current frontend while we migrate high-risk writes from direct Data Connect mutations to backend command endpoints.
## 2) First-principles architecture rules
1. Client apps are untrusted for business-critical writes.
2. Backend is the enforcement layer for validation, permissions, and write orchestration.
3. Multi-entity writes must be atomic, idempotent, and observable.
4. Configuration and deployment must be reproducible by automation.
5. Migration must be backward-compatible until each frontend flow is cut over.
## 3) Pre-coding gates (must be true before implementation starts)
## Gate A: Security boundary
1. Frontend sends Firebase token only. No database credentials in client code.
2. Every new backend endpoint validates Firebase token.
3. Data Connect write access strategy is defined:
   - keep simple reads available to client
   - route high-risk writes through backend command endpoints
4. Upload and signed URL paths are server-controlled.
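
As an illustration of the token check in Gate A, the following is a minimal sketch of how a handler might extract and verify a bearer token. The `Actor` shape, the `TokenVerifier` type, and both function names are assumptions for this example; the verifier is injected so that the real Firebase verification call can be supplied by the service.

```typescript
// Hypothetical shape of a verified caller; only `uid` is assumed here.
interface Actor {
  uid: string;
}

// Injected verifier (assumption): the real service would wrap its Firebase token check.
type TokenVerifier = (idToken: string) => Promise<Actor>;

// Pulls the token out of an `Authorization: Bearer <token>` header value.
function extractBearerToken(header: string | undefined): string | null {
  if (!header) return null;
  const match = /^Bearer\s+(\S+)$/i.exec(header.trim());
  return match?.[1] ?? null;
}

// Returns the actor for a valid token, or null so the caller can reject the request.
async function authenticate(
  header: string | undefined,
  verify: TokenVerifier,
): Promise<Actor | null> {
  const token = extractBearerToken(header);
  if (token === null) return null;
  try {
    return await verify(token);
  } catch {
    return null; // verification failed (for example expired or malformed token)
  }
}
```

Keeping the verifier as a parameter lets the API harness run the middleware with a stub instead of a live Firebase project.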
## Gate B: Contract standards
1. Standard error envelope is frozen:
```json
{
  "code": "STRING_CODE",
  "message": "Human readable message",
  "details": {},
  "requestId": "optional-request-id"
}
```
2. Request validation layer is chosen and centralized.
3. Route naming strategy is frozen:
   - canonical routes under `/core` and `/commands`
   - compatibility aliases preserved during migration (`/uploadFile`, `/createSignedUrl`, `/invokeLLM`)
4. Validation standard is locked:
   - library: `zod`
   - schema location: `backend/<service>/src/contracts/` with `core/` and `commands/` subfolders
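
The error envelope and centralized validation in Gate B could be expressed roughly as follows. This is a dependency-free sketch: the `VALIDATION_ERROR` code, the `path` field, and the function names are assumptions, and the hand-written check stands in for what would be a `zod` schema under `src/contracts/`.

```typescript
// Mirrors the frozen envelope: code, message, details, optional requestId.
interface ErrorEnvelope {
  code: string;
  message: string;
  details: Record<string, unknown>;
  requestId?: string;
}

function toErrorEnvelope(
  code: string,
  message: string,
  details: Record<string, unknown> = {},
  requestId?: string,
): ErrorEnvelope {
  const envelope: ErrorEnvelope = { code, message, details };
  if (requestId !== undefined) envelope.requestId = requestId;
  return envelope;
}

// Hypothetical request check; returns null when the body is acceptable.
function validateCreateSignedUrl(body: unknown): ErrorEnvelope | null {
  const path = (body as { path?: unknown } | null)?.path;
  if (typeof path !== "string" || path.length === 0) {
    return toErrorEnvelope("VALIDATION_ERROR", "path must be a non-empty string", {
      field: "path",
    });
  }
  return null;
}
```

Routing every validation failure through one builder keeps the envelope shape identical across `/core` and `/commands` routes.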
## Gate C: Atomicity and reliability
1. Command endpoints support idempotency keys for retry-safe writes.
2. Multi-step write flows are wrapped in a single backend transaction boundary.
3. Domain conflict codes are defined for expected business failures.
4. Idempotency storage is locked:
   - store in Cloud SQL table
   - key scope: `userId + route + idempotencyKey`
   - retain records for 24 hours
   - repeated key returns original response
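
The idempotency rules above can be sketched with an in-memory stand-in for the Cloud SQL table. The class and method names are illustrative assumptions; only the key scope, the 24-hour retention, and the "repeated key returns original response" behavior come from the plan.

```typescript
const RETENTION_MS = 24 * 60 * 60 * 1000; // 24-hour retention from the plan

interface IdempotencyRecord<T> {
  response: T;
  expiresAt: number; // epoch milliseconds
}

// In-memory stand-in for the Cloud SQL table; names here are illustrative.
class IdempotencyStore<T> {
  private readonly records = new Map<string, IdempotencyRecord<T>>();

  // Key scope from the plan: userId + route + idempotencyKey.
  private scopedKey(userId: string, route: string, idempotencyKey: string): string {
    return JSON.stringify([userId, route, idempotencyKey]);
  }

  // Returns the original response for a repeated key, or undefined if none is retained.
  get(userId: string, route: string, idempotencyKey: string, now: number = Date.now()): T | undefined {
    const scoped = this.scopedKey(userId, route, idempotencyKey);
    const record = this.records.get(scoped);
    if (record === undefined) return undefined;
    if (record.expiresAt <= now) {
      this.records.delete(scoped); // retention window has passed
      return undefined;
    }
    return record.response;
  }

  // Stores the first response; an existing unexpired record is never overwritten.
  put(userId: string, route: string, idempotencyKey: string, response: T, now: number = Date.now()): void {
    if (this.get(userId, route, idempotencyKey, now) !== undefined) return;
    const scoped = this.scopedKey(userId, route, idempotencyKey);
    this.records.set(scoped, { response, expiresAt: now + RETENTION_MS });
  }
}
```

A command handler would call `get` first, return the stored response on a hit, and otherwise run the write and `put` the result. This sketch does not handle two concurrent requests with the same key; a database-backed version would presumably need a uniqueness guarantee on the scoped key for that.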
## Gate D: Automation and operability
1. Makefile is source of truth for backend setup and deploy in dev.
2. Core deploy and smoke test commands exist before feature migration.
3. Logging format and request tracing fields are standardized.
## 4) Security baseline for foundation phase
## 4.1 Authentication and authorization
1. Foundation phase is authentication-first.
2. Role-based access control is intentionally deferred.
3. All handlers include a policy hook for future role checks (`can(action, resource, actor)`).
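
A minimal sketch of the `can(action, resource, actor)` hook in its auth-first form is shown below. The `Actor` and `Resource` shapes and the action string are assumptions; the only behavior taken from the plan is that authentication is enforced now and role checks are deferred.

```typescript
// Hypothetical shapes; the plan only fixes the argument order of `can`.
interface Actor {
  uid: string;
}

interface Resource {
  type: string;
  id?: string;
}

// Auth-first placeholder: any authenticated actor is allowed.
function can(action: string, resource: Resource, actor: Actor | null): boolean {
  if (actor === null || actor.uid.length === 0) return false;
  // Future role checks would branch on `action` and `resource` here.
  void action;
  void resource;
  return true;
}
```

Calling the hook from every handler now means the later role map can be introduced by changing this one function rather than each route.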
## 4.2 Data access control model
1. Client retains Data Connect reads required for existing screens.
2. High-risk writes move behind `/commands/*` endpoints.
3. Backend mediates write interactions with Data Connect and Cloud SQL.
## 4.3 File and URL security
1. Validate file type and size server-side.
2. Separate public and private storage behavior.
3. Signed URL creation checks ownership/prefix scope and expiry limits.
4. Bucket policy split is locked:
   - `krow-workforce-dev-public`
   - `krow-workforce-dev-private`
   - private bucket access only through signed URL
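
The ownership/prefix and expiry checks for signed URL creation might look like the sketch below. The `users/<uid>/` path layout, the 15-minute cap, and the two error codes are assumptions made for illustration; the plan does not specify them.

```typescript
const MAX_EXPIRY_SECONDS = 15 * 60; // assumed limit; the plan does not fix a value

type SignedUrlCheck =
  | { ok: true }
  | { ok: false; code: "FORBIDDEN_PATH" | "INVALID_EXPIRY" };

function checkSignedUrlRequest(
  uid: string,
  objectPath: string,
  expiresInSeconds: number,
): SignedUrlCheck {
  // Assumed layout: each user's private objects live under `users/<uid>/`.
  const ownedPrefix = `users/${uid}/`;
  const segments = objectPath.split("/");
  // Reject traversal segments as well as paths outside the caller's prefix.
  if (segments.includes("..") || !objectPath.startsWith(ownedPrefix)) {
    return { ok: false, code: "FORBIDDEN_PATH" };
  }
  if (
    !Number.isInteger(expiresInSeconds) ||
    expiresInSeconds <= 0 ||
    expiresInSeconds > MAX_EXPIRY_SECONDS
  ) {
    return { ok: false, code: "INVALID_EXPIRY" };
  }
  return { ok: true };
}
```

Running this check before any URL is signed keeps the decision server-side, so a client cannot request access to another user's objects by editing the path.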
## 4.4 Model invocation safety
1. Enforce schema-constrained output.
2. Apply per-user rate limits and request timeout.
3. Log model failures with safe redaction (no sensitive prompt leakage in logs).
4. Model provider and timeout defaults are locked:
   - provider: Vertex AI Gemini
   - max route timeout: 20 seconds
   - timeout error code: `MODEL_TIMEOUT`
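
One way to enforce the 20-second limit and surface `MODEL_TIMEOUT` is to race the model call against a timer, as in this sketch. The error class and function name are assumptions; the model call itself is passed in as a promise, so no provider API is shown.

```typescript
const MODEL_TIMEOUT_MS = 20000; // 20-second max route timeout from the plan

class ModelTimeoutError extends Error {
  readonly code = "MODEL_TIMEOUT";
  constructor(timeoutMs: number) {
    super(`Model call exceeded ${timeoutMs} ms`);
    this.name = "ModelTimeoutError";
  }
}

// Rejects with ModelTimeoutError if `call` has not settled within `timeoutMs`.
function withModelTimeout<T>(
  call: Promise<T>,
  timeoutMs: number = MODEL_TIMEOUT_MS,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new ModelTimeoutError(timeoutMs)), timeoutMs);
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([call, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

This wrapper only stops waiting for the result. It does not cancel the underlying request, which would need a separate cancellation mechanism if the provider client offers one.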
## 4.5 Secrets and credentials
1. Runtime secrets come from Secret Manager only.
2. Service accounts use least-privilege roles.
3. No secrets committed in repository files.
## 5) Modularity baseline
## 5.1 Backend module boundaries
1. `core` module: upload, signed URL, model invocation, health.
2. `commands` module: business writes and state transitions.
3. `policy` module: validation and future role checks.
4. `data` module: Data Connect adapters and transaction wrappers.
5. `infra` module: logging, tracing, auth middleware, error mapping.
## 5.2 Contract separation
1. Keep API request/response schemas in one location.
2. Keep domain errors in one registry file.
3. Keep route declarations thin; business logic in services.
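
The single registry file for domain errors could take a form like the one below. Apart from `MODEL_TIMEOUT`, every code, message, and HTTP status here is a hypothetical placeholder, not a value defined by the plan.

```typescript
// Hypothetical registry; only MODEL_TIMEOUT is a code named in the plan.
const ERROR_REGISTRY = {
  UNAUTHENTICATED: { status: 401, message: "Missing or invalid token" },
  VALIDATION_ERROR: { status: 400, message: "Request failed validation" },
  ORDER_ALREADY_CANCELLED: { status: 409, message: "Order is already cancelled" },
  MODEL_TIMEOUT: { status: 504, message: "Model call timed out" },
} as const;

type ErrorCode = keyof typeof ERROR_REGISTRY;

function isKnownErrorCode(code: string): code is ErrorCode {
  // Own-property check avoids matching inherited names such as "toString".
  return Object.prototype.hasOwnProperty.call(ERROR_REGISTRY, code);
}

// Unknown codes fall back to 500 so unmapped failures never leak as success.
function httpStatusFor(code: string): number {
  return isKnownErrorCode(code) ? ERROR_REGISTRY[code].status : 500;
}
```

With the registry in one place, the error mapping in the `infra` module can stay a lookup, and a new conflict code becomes a one-line change.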
## 5.3 Cloud runtime roles
1. Cloud Run is the primary command and core API execution layer.
2. Cloud Functions v2 is worker-only in this phase:
   - upload-related async handlers
   - notification jobs
   - model-related async helpers when needed
## 6) Automation baseline
## 6.1 Makefile requirements
Add `makefiles/backend.mk` and wire it into root `Makefile` with at least:
1. `make backend-enable-apis`
2. `make backend-bootstrap-dev`
3. `make backend-deploy-core`
4. `make backend-deploy-commands`
5. `make backend-deploy-workers`
6. `make backend-smoke-core`
7. `make backend-smoke-commands`
8. `make backend-logs-core`
## 6.2 CI requirements
1. Backend lint
2. Backend tests
3. Build/package
4. Smoke test against deployed dev route(s)
5. Block merge on failed checks
## 6.3 Session hygiene
1. Update `TASKS.md` and `CHANGELOG.md` each working session.
2. If a new service/API is added, a Makefile target must be added in the same change.
## 7) Migration safety contract (no frontend breakage)
1. Backend routes ship first.
2. Frontend migration is per-feature wave, not big bang.
3. Keep compatibility aliases until clients migrate.
4. Keep existing Data Connect reads during foundation.
5. For each migrated write flow:
   - before/after behavior checklist
   - rollback path
   - smoke verification
## 8) Scope for foundation build
1. Backend runtime/deploy foundation in dev.
2. Core endpoints:
   - `POST /core/upload-file`
   - `POST /core/create-signed-url`
   - `POST /core/invoke-llm`
   - `GET /healthz`
3. Compatibility aliases:
   - `POST /uploadFile`
   - `POST /createSignedUrl`
   - `POST /invokeLLM`
4. Command layer scaffold for first migration routes.
5. Initial migration of highest-risk write paths.
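
The compatibility aliases listed above can be resolved to their canonical routes with a small lookup, sketched here. The alias-to-canonical pairing is inferred from the matching names in this plan, and the function name is an assumption.

```typescript
// Alias -> canonical route, paired by name as listed in the scope section.
const COMPATIBILITY_ALIASES = new Map<string, string>([
  ["/uploadFile", "/core/upload-file"],
  ["/createSignedUrl", "/core/create-signed-url"],
  ["/invokeLLM", "/core/invoke-llm"],
]);

// Returns the canonical route for an alias, or the path unchanged otherwise.
function resolveRoute(path: string): string {
  return COMPATIBILITY_ALIASES.get(path) ?? path;
}
```

Resolving aliases before dispatch lets both route names share one handler, so removing an alias later is a one-entry deletion.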
## 9) Implementation phases
## Phase 0: Baseline and contracts
Deliverables:
1. Freeze endpoint naming and compatibility aliases.
2. Freeze error envelope and error code registry.
3. Freeze auth middleware interface and policy hook interface.
4. Publish route inventory from web/mobile direct writes.
Exit criteria:
1. No unresolved contract ambiguity.
2. Team agrees on the auth-first-now, role-map-later approach.
## Phase 1: Backend infra and automation
Deliverables:
1. `makefiles/backend.mk` with bootstrap, deploy, smoke, logs targets.
2. Environment templates for backend runtime config.
3. Secret Manager and service account setup automation.
Exit criteria:
1. A fresh machine can deploy core backend to dev via Make commands.
## Phase 2: Core endpoint implementation
Deliverables:
1. `/core/upload-file`
2. `/core/create-signed-url`
3. `/core/invoke-llm`
4. `/healthz`
5. Compatibility aliases (`/uploadFile`, `/createSignedUrl`, `/invokeLLM`)
Exit criteria:
1. API harness passes for core routes.
2. Error, logging, and auth standards are enforced.
## Phase 3: Command layer scaffold
Deliverables:
1. `/commands/orders/create`
2. `/commands/orders/{orderId}/cancel`
3. `/commands/orders/{orderId}/update`
4. `/commands/shifts/{shiftId}/change-status`
5. `/commands/shifts/{shiftId}/assign-staff`
6. `/commands/shifts/{shiftId}/accept`
Exit criteria:
1. High-risk writes have backend command alternatives ready.
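
To illustrate how the templated command routes above might be matched, here is a dependency-free sketch that extracts `{orderId}` and `{shiftId}` from a request path. The matcher and its return shape are assumptions; a real service would more likely rely on its HTTP framework's router.

```typescript
// Route templates copied from the Phase 3 deliverables.
const COMMAND_ROUTES: readonly string[] = [
  "/commands/orders/create",
  "/commands/orders/{orderId}/cancel",
  "/commands/orders/{orderId}/update",
  "/commands/shifts/{shiftId}/change-status",
  "/commands/shifts/{shiftId}/assign-staff",
  "/commands/shifts/{shiftId}/accept",
];

interface RouteMatch {
  template: string;
  params: Record<string, string>;
}

// Compares the path to each template segment by segment.
function matchCommandRoute(path: string): RouteMatch | null {
  const pathSegments = path.split("/");
  for (const template of COMMAND_ROUTES) {
    const templateSegments = template.split("/");
    if (templateSegments.length !== pathSegments.length) continue;
    const params: Record<string, string> = {};
    let matched = true;
    for (let i = 0; i < templateSegments.length; i++) {
      const expected = templateSegments[i] ?? "";
      const actual = pathSegments[i] ?? "";
      const placeholder = /^\{(\w+)\}$/.exec(expected);
      if (placeholder) {
        // A placeholder matches any non-empty segment and records its value.
        if (actual.length === 0) { matched = false; break; }
        params[placeholder[1] ?? ""] = actual;
      } else if (expected !== actual) {
        matched = false;
        break;
      }
    }
    if (matched) return { template, params };
  }
  return null;
}
```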
## Phase 4: Wave 1 frontend migration
Deliverables:
1. Replace direct writes in selected web/mobile flows.
2. Keep reads stable.
3. Verify no regressions in non-migrated screens.
Exit criteria:
1. Migrated flows run through backend commands only.
2. Rollback instructions validated.
## Phase 5: Hardening and handoff
Deliverables:
1. Runbook for deploy, rollback, and smoke.
2. Backend CI pipeline active.
3. Wave 2 and wave 3 migration task list defined.
Exit criteria:
1. Foundation is reusable for staging/prod with environment changes only.
## 10) Wave 1 migration inventory (real call sites)
Web:
1. `apps/web/src/features/operations/tasks/TaskBoard.tsx:100`
2. `apps/web/src/features/operations/orders/OrderDetail.tsx:145`
3. `apps/web/src/features/operations/orders/EditOrder.tsx:84`
4. `apps/web/src/features/operations/orders/components/CreateOrderDialog.tsx:31`
5. `apps/web/src/features/operations/orders/components/AssignStaffModal.tsx:60`
6. `apps/web/src/features/workforce/documents/DocumentVault.tsx:99`
Mobile:
1. `apps/mobile/packages/features/client/home/lib/src/presentation/widgets/shift_order_form_sheet.dart:232`
2. `apps/mobile/packages/features/client/view_orders/lib/src/presentation/widgets/view_order_card.dart:1195`
3. `apps/mobile/packages/features/client/create_order/lib/src/data/repositories_impl/client_create_order_repository_impl.dart:68`
4. `apps/mobile/packages/features/staff/shifts/lib/src/data/repositories_impl/shifts_repository_impl.dart:446`
5. `apps/mobile/packages/features/client/authentication/lib/src/data/repositories_impl/auth_repository_impl.dart:257`
6. `apps/mobile/packages/features/staff/profile_sections/onboarding/profile_info/lib/src/data/repositories/personal_info_repository_impl.dart:51`
## 11) Definition of done for foundation
1. Core endpoints deployed in dev and validated.
2. Command scaffolding in place for wave 1 writes.
3. Auth-first protection active on all new routes.
4. Idempotency + transaction model defined for command writes.
5. Makefile and CI automation cover bootstrap/deploy/smoke paths.
6. Frontend remains stable during migration.
7. Role-map integration points are documented for next phase.
## 12) Locked defaults (approved)
1. Idempotency key storage strategy:
   - Cloud SQL table, 24-hour retention, keyed by `userId + route + idempotencyKey`.
2. Validation library and schema location:
   - `zod` in `backend/<service>/src/contracts/` (`core/`, `commands/`).
3. Storage bucket naming and split:
   - `krow-workforce-dev-public` and `krow-workforce-dev-private`.
4. Model provider and timeout:
   - Vertex AI Gemini, 20-second max timeout.
5. Target response-time objectives (p95):
   - `/healthz` under 200ms
   - `/core/create-signed-url` under 500ms
   - `/commands/*` under 1500ms
   - `/core/invoke-llm` under 15000ms
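
The p95 objectives above could be checked against collected latency samples with a sketch like the following. The nearest-rank percentile definition and the function names are assumptions; the plan fixes only the target values.

```typescript
// Targets from the locked defaults; `/commands/*` is matched by prefix.
const P95_TARGETS_MS: ReadonlyArray<{ prefix: string; targetMs: number }> = [
  { prefix: "/healthz", targetMs: 200 },
  { prefix: "/core/create-signed-url", targetMs: 500 },
  { prefix: "/commands/", targetMs: 1500 },
  { prefix: "/core/invoke-llm", targetMs: 15000 },
];

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.max(Math.ceil((p * sorted.length) / 100), 1);
  return sorted[rank - 1] as number;
}

// True when the observed p95 is strictly under the route's target.
function meetsP95Target(route: string, samplesMs: number[]): boolean {
  const target = P95_TARGETS_MS.find((t) => route.startsWith(t.prefix));
  if (target === undefined) throw new Error(`no p95 target for ${route}`);
  return percentile(samplesMs, 95) < target.targetMs;
}
```

For example, with ten samples of 10, 20, ..., 100 ms the nearest-rank p95 is the tenth value, 100 ms, which is under the 200 ms target for `/healthz`.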