📱 Research: Flutter Integration Testing Evaluation

Issue: #533
Focus: Maestro vs. Marionette MCP (LeanCode)
Status: ✅ Completed
Target Apps: KROW Client App & KROW Staff App

1. Executive Summary & Recommendation

Following a technical spike that implemented full authentication flows (login and sign-up) for both KROW apps, we recommend Maestro as the integration testing framework.

While Marionette MCP offers an innovative LLM-driven approach for exploratory debugging, it lacks the determinism required for a production-grade CI/CD pipeline. Maestro provides the stability, speed, and native OS interaction necessary to gate our releases effectively.

Why Maestro Wins for KROW:

Low-Flake Execution: Built-in wait-and-retry logic absorbed Firebase Auth latency during the spike without hard-coded sleep() calls.
Platform Parity: Single .yaml definitions drive both iOS and Android build variants.
Non-Invasive: Maestro tests the compiled .apk or .app (Black-box), ensuring we test exactly what the user sees.
System-Level Access: Handles native OS permission dialogs (Camera/Location/Notifications), which sit outside the Flutter widget tree and are therefore invisible to Marionette.

2. Technical Evaluation Matrix

Criteria	Maestro	Marionette MCP	Winner
Test Authoring	High Speed: Declarative YAML; Maestro Studio recorder.	Variable: Requires precise Prompt Engineering.	Maestro
Execution Latency	Low: Near-instant interactions (~5s flows).	High: LLM API round trips (~45s+ flows).	Maestro
Environment	Works on Release/Production builds.	Restricted to Debug/Profile modes.	Maestro
CI/CD Readiness	Native CLI; easy GitHub Actions integration.	High overhead; depends on external AI APIs.	Maestro
Context Awareness	Interacts with Native OS & Bottom Sheets.	Limited to the Flutter Widget Tree.	Maestro

3. Spike Analysis & Findings

Tool A: Maestro (The Standard)

We verified the sign-in and sign-up flows (sign_in.yaml and sign_up.yaml) across both apps. Maestro's automatic waiting absorbed the asynchronous latency of our Data Connect and Firebase backends without explicit delays in the flows.

Pros:
- Semantics Driven: By targeting Semantics(identifier: '...') in our /design_system/, tests remain stable even if the UI text changes for localization.
- Automatic Tolerance: It detects spinning loaders and waits for destination widgets automatically.

Cons:
- Requires strict adherence to adding Semantics wrappers on all interactive components.
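
A minimal sketch of how a flow might target a Semantics identifier and rely on Maestro's waiting rather than fixed sleeps. The appId and the home_screen identifier are hypothetical placeholders; only login_submit_button appears elsewhere in this document.

```yaml
# Hypothetical flow sketch: appId and home_screen are placeholders.
appId: com.example.krow.client
---
- launchApp
# Select by Semantics identifier rather than visible text,
# so the step survives copy and localization changes.
- tapOn:
    id: "login_submit_button"
# Wait for the destination screen with an upper bound (milliseconds)
# instead of sleeping for a fixed duration.
- extendedWaitUntil:
    visible:
      id: "home_screen"
    timeout: 15000
```

The explicit extendedWaitUntil is only needed where a transition may exceed Maestro's default wait; most steps can rely on the implicit retry.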

Tool B: Marionette MCP (The Experiment)

We spiked this using the marionette_flutter binding and executing via Cursor/Claude.

Pros:
- Phenomenal for visual "smoke testing" and live-debugging UI issues via natural language.

Cons:
- Non-Deterministic: Prone to "hallucinations" during heavy network traffic.
- Architecture Blocker: Requires the Dart VM Service to be active, making it impossible to test against hardened production builds.

4. Implementation & Migration Blueprint

Phase 1: Semantics Enforcement

We must enforce this through a lint rule or a PR checklist item: every interactive widget in @krow/design_system must carry a unique Semantics identifier.

// Standardized Implementation
Semantics(
  identifier: 'login_submit_button',
  child: KrowPrimaryButton(
    onPressed: _handleLogin,
    label: 'Sign In',
  ),
)

Phase 2: Repository Structure (Implemented)

Maestro flows are co-located with each app under maestro/auth/:

apps/mobile/apps/client/maestro/auth/sign_in.yaml — Client sign-in
apps/mobile/apps/client/maestro/auth/sign_up.yaml — Client sign-up
apps/mobile/apps/staff/maestro/auth/sign_in.yaml — Staff sign-in (phone + OTP)
apps/mobile/apps/staff/maestro/auth/sign_up.yaml — Staff sign-up (phone + OTP)

Credentials are injected via environment variables (never hardcoded). Run the suite with make test-e2e.
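
As an illustration of the injection pattern, a sign-in flow could reference credentials as variables. This is a sketch, not the contents of the actual sign_in.yaml: the appId, the field identifiers, the variable names TEST_EMAIL and TEST_PASSWORD, and the assumption that client sign-in uses email and password are all hypothetical.

```yaml
# Sketch of a sign-in flow with injected credentials.
# appId, the *_field identifiers, and the variable names are hypothetical.
appId: com.example.krow.client
---
# Start from a clean install state so a cached session does not skip login.
- launchApp:
    clearState: true
- tapOn:
    id: "login_email_field"
# Resolved at run time from the environment, never committed to the repo.
- inputText: ${TEST_EMAIL}
- tapOn:
    id: "login_password_field"
- inputText: ${TEST_PASSWORD}
- tapOn:
    id: "login_submit_button"
```

Values can be supplied on the command line with the Maestro CLI's -e flag, for example maestro test -e TEST_EMAIL=... -e TEST_PASSWORD=... followed by the flow path. How make test-e2e passes them through is not shown here.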

Phase 3: CI/CD Integration

The Maestro CLI will be added to our GitHub Actions workflow to automate quality gates.

Trigger: Every PR targeting main or develop.
Action: Generate a build, execute maestro test, and block merge on failure.
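
One possible shape for this workflow is sketched below. The workflow name, job name, emulator settings, and secret names are assumptions, and the sketch presumes that make test-e2e builds and installs the app before invoking maestro test. If it does not, a Flutter setup and build step would be needed first.

```yaml
# Hypothetical workflow sketch: names, API level, and secrets are assumptions.
name: e2e
on:
  pull_request:
    branches: [main, develop]
jobs:
  maestro:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Maestro CLI
        run: |
          curl -Ls "https://get.maestro.mobile.dev" | bash
          echo "$HOME/.maestro/bin" >> "$GITHUB_PATH"
      # Boots an Android emulator, then runs the script inside it.
      - name: Run flows on an Android emulator
        uses: reactivecircus/android-emulator-runner@v2
        with:
          api-level: 33
          arch: x86_64
          script: make test-e2e
        env:
          # Hypothetical secret names; credentials stay out of the repo.
          TEST_EMAIL: ${{ secrets.E2E_TEST_EMAIL }}
          TEST_PASSWORD: ${{ secrets.E2E_TEST_PASSWORD }}
```

A failing flow fails the job, but the merge is blocked only if this check is marked as required in the branch protection rules for main and develop. iOS flows would need a separate macOS job.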
