Testing strategy (managers and QA)
This document supports decisions about automated testing, CI, and how QA and engineering share responsibility. It is grounded in the current OpenProject Flutter app repository.
If you are new here: read Vision and planned directions first. It states what we want to achieve and why, in plain language. The rest of the document refines that vision with repository facts, priorities, risks, and effort estimates.
Vision and planned directions (read this first)
This section is for readers who were not part of earlier discussions. It captures our intended direction; later sections explain how that maps to this codebase, what already exists, and what we recommend adjusting.
Why automate tests at all
We ship a Flutter client against a living API and rich UI. Bugs show up as wrong data, broken flows, visual regressions, or surprises after server changes. The goal is not “maximum tests,” but predictable quality and earlier feedback, while keeping cost and flake risk under control.
Direction 1: Feature integration tests (device or emulator)
Vision: Run meaningful end-to-end style checks on a real app binding—on a device or emulator—so we exercise navigation, real layout, and timing closer to production.
Constraints we accept: These tests are expensive (time, machines, stability). We do not assume a device farm in CI today. The plan is not to run them on every pull request, but to use them as scheduled smoke (for example once a week) and/or around release candidates, to catch “glue” regressions and reduce repetitive manual load on QA without replacing exploratory testing.
Tooling stance: Prefer Flutter’s **integration_test** where possible; reserve heavier tools (for example Appium) only when the platform cannot be covered otherwise.
Direction 2: Widget tests and golden (screenshot) tests
Vision: While building a feature, add widget tests and, where agreed, golden (reference image) tests for the screens or states that matter—so we get fast feedback in GitHub Actions on layout and critical UI paths.
Open question we acknowledge: For those tests, we may mock BLoC/Cubit state (or fakes at the repository boundary) or push further and mock HTTP (Dio) with static JSON. The document later recommends a split: keep most widget/golden tests free of wire-level JSON; put HTTP parsing and contract behavior in dedicated API / contract tests so failures stay interpretable and maintenance stays sane.
Direction 3: API-related tests without a device
Vision: Add tests that behave like integration checks for the client–API contract but run on Linux CI—no emulator. Typical forms: fixture-backed HTTP tests next to the generated or hand-written client, and/or tests that drive use cases with a fake HTTP port, still asserting realistic responses.
Motivation: We sometimes ship a broken client because the API or HAL changed and we were not aware. Running these suites on every PR (and optionally triggering workflows when the API repository changes) is meant to surface incompatibility early. This only works if the suite is deterministic (fixtures or a stable staging contract)—hooks without a real test body create noise, not safety.
Direction 4: Unit tests, BLoC tests, and small-scope tests
Vision: Keep growing unit tests, **bloc_test**-style tests, and use case tests as the default safety net for rules and state machines. They are the cheapest layer and should remain the bulk of automated coverage.
Optional direction: Full-app flow with JSON fixtures and screenshots
Some features may additionally use the flow snapshot pattern (full or near-full app under flutter_test, static JSON, faked platform APIs, golden milestones). That is optional and expensive—use for a few critical journeys; details appear below.
How we want to work as a team
Sprint planning: We intend to allocate visible time for tests alongside feature work (for example an agreed percentage or explicit tasks). Management may initially see this as overhead; the argument is that unplanned quality work still happens—it is simply deferred, more expensive, and often lands on QA or customers.
Bugs and TDD: When a bug is reproducible and worth preventing again, we want to fix it through a test-first or test-accompanying workflow where that is practical—not as a dogma for every one-line change, but as a default for real regressions.
How the rest of this document is organized
| After this section | You will find |
|---|---|
| Executive summary and decisions | Short decisions and asks for leadership and QA. |
| Glossary | What we mean by each test type. |
| Current baseline | What already exists in this repo (CI, tests, gaps). |
| Later sections | ROI ordering, bootstrap vs ROI, mock boundaries, effort tables, risks, and QA handoff. |
Executive summary
- Goal: Improve release quality and reduce surprise breakages—especially when the API or shared contracts change—without drowning the team in flaky or redundant tests.
- Current constraint: GitHub Actions runs `flutter test` on Ubuntu only. Device and emulator integration tests are not in CI today; they are suitable for a weekly smoke lane or pre-release, not for every PR.
- Highest ROI next step: Add contract or fixture-backed HTTP tests next to the API client layer (`packages/openproject_api_sdk`). That directly targets “the server changed and the client broke without us knowing.” Widget tests with mocked Dio are not a substitute.
- What already works: The repo has substantial unit, BLoC, and use case coverage under `test/`. The strategy is to extend proven layers, not restart from zero.
- Organizational ask: Treat test work as part of delivery in sprint planning (transparent capacity band and Definition of Done). Use TDD-style regression tests for reproducible bugs; avoid dogmatic TDD for trivial changes.
Decisions for management
- Approve a small recurring capacity band (for example 10–20% of engineering time on non-trivial stories) for tests that match the DoD below.
- Approve ownership and cadence for weekly on-device smoke (who runs it, which devices, where results are recorded).
- Approve one CI strategy for golden tests (pinned OS or runner) if goldens are adopted, to avoid endless screenshot churn.
Decisions for QA
- Agree which journeys are in the weekly automated smoke set versus what remains manual exploratory testing.
- Agree how API or environment incidents are distinguished from app regressions when smoke fails.
Audience and how to use this doc
| Audience | Use |
|---|---|
| Managers | Trade-offs (cost, schedule, risk), CI expectations, why sprint capacity for tests is rational. |
| QA | What automation can replace or narrow, weekly smoke scope, where to log gaps when coverage does not exist yet. |
Glossary (short)
| Term | Meaning here |
|---|---|
| Unit test | Fast, no I/O; pure functions, parsers, small helpers. |
| BLoC / Cubit test | State transitions and side-effect ordering with controlled fakes (`bloc_test`). |
| Use case test | Application service orchestration with fake repositories or ports—not real network. |
| Widget test | Pumps widgets under `flutter_test`; no device required. |
| Golden test | Compares rasterized output to a reference image; great for stable design surfaces, sensitive to fonts and OS. |
| Integration test (`integration_test`) | Runs a real app binding; often on device or emulator; higher cost and flake risk. |
| Contract / API test | Asserts HTTP shapes, status codes, and parsing against fixtures or a staging API; catches server drift. |
| Flow snapshot test (optional pattern) | Full-app widget test: pump near-complete app, Dio (or client) returns static JSON from `test/` fixtures, capture golden milestones. |
Current baseline in this repository
Facts you can cite when prioritizing work:
| Area | Baseline |
|---|---|
| CI | GitHub Actions runs `flutter test` on Ubuntu only; no device or emulator lane. |
| Integration tests | `integration_test/` exists (for example `sign_in_test.dart` with `mocks/mock_auth_client.dart`) but is not run in CI. |
| Unit / BLoC / use case | Many tests under `test/`; this is the strongest existing layer. |
| Golden tests | No golden tests yet. |
| API SDK tests | No dedicated contract or fixture suite beside `packages/openproject_api_sdk` yet; this is the highest-ROI gap. |
Feature breadth versus “needs device or OS”
Feature areas: There are 21 top-level feature directories under `lib/app/features/` (for example auth, work_package, home, time_tracking, shared_files, notifications, onboarding, projects, user).
Platform-sensitive dependencies (from `pubspec.yaml`) include OAuth (flutter_appauth), deep links (app_links), share-in (receive_sharing_intent), image_picker, file_picker, video_player, home_widget, live_activities, flutter_local_notifications, flutter_background_service, flutter_inappwebview, gal, permission_handler, Drift/SQLite, secure storage, and others.
Code touchpoints: A search for direct use of the heaviest plugins (for example ImagePicker, FilePicker, home_widget, local notifications, background service, sharing, Gal, flutter_appauth, AppLinks) touches roughly 20–25 Dart files under `lib/`. Many of these are shared services consumed by several features—not 25 separate products.
Interpretation
- Only a minority of files sit on the “hard to fake completely” boundary, but they underpin cross-cutting flows: sign-in and callbacks, attachments, notifications, home screen widgets, deep links, rich editor embeds.
- On-device or weekly smoke remains important for those flows; it is not true that most of the app must be covered by expensive E2E.
- Rule of thumb for stakeholders: ~21 feature areas; several critical journeys benefit from scheduled smoke; the majority of business rules and UI states are cheaper to cover with use case, BLoC, and widget tests if boundaries stay clean.
```mermaid
flowchart TB
  subgraph fastCI [Fast CI - GitHub Actions]
    unit[Unit utils parsers]
    bloc[BLoC Cubit bloc_test]
    usecase[Use cases fake repos]
    widget[Widget tests]
    golden[Golden tests optional]
    contract[API contract or fixture HTTP tests]
  end
  subgraph slowManual [Weekly or manual]
    integ[integration_test on device or emulator]
    smoke[Smoke subset of critical journeys]
  end
  subgraph hardE2E [Rare deep E2E if needed]
    appium[Only if Flutter integration_test insufficient]
  end
  unit --> bloc
  bloc --> usecase
  usecase --> widget
  widget --> golden
  contract --> usecase
  integ --> smoke
```
Planned test types and where they fit
1. Integration testing on device or emulator (weekly smoke)
Intent: Catch integration issues that unit tests miss, and reduce repetitive manual checks for QA—not to duplicate every manual case.
Reality: Expensive in time and infrastructure; no device farm in CI today. Recommendation: Run a small fixed set of integration_test journeys once per week (or per release candidate), with a named owner and published pass/fail.
Suggested smoke scope (5–10 journeys, illustrative):
- Enter instance URL and complete sign-in path (already aligned with `integration_test/sign_in_test.dart`).
- Open a work package from a list into details (navigation + data binding).
- Attachment flow: pick or attach, using fakes at the file/camera boundary where possible.
- Deep link or cold start into a known route (if stable test URLs exist).
- Optional: one path that touches WebView or rich editor, only if flakiness stays acceptable.
QA handoff: Weekly smoke narrows regression risk on “whole app glue”; it does not remove need for exploratory testing on new features or visual polish.
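For illustration, a single smoke journey in `integration_test/` could look like the sketch below. The app import, the `Key` names, and the journey itself are placeholders; the repository's existing `integration_test/sign_in_test.dart` and `integration_test/mocks/mock_auth_client.dart` remain the authoritative starting points.

```dart
// integration_test/work_package_smoke_test.dart — illustrative sketch only.
import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';
import 'package:integration_test/integration_test.dart';

// Hypothetical app entry point; replace with the real package name and main().
import 'package:openproject_app/main.dart' as app;

void main() {
  IntegrationTestWidgetsFlutterBinding.ensureInitialized();

  testWidgets('open a work package from the list into details', (tester) async {
    app.main(); // boots the real app binding on the device or emulator
    await tester.pumpAndSettle();

    // Prefer stable Keys over display strings so localization does not break finders.
    await tester.tap(find.byKey(const Key('workPackageList_item_0')));
    await tester.pumpAndSettle();

    expect(find.byKey(const Key('workPackageDetails_title')), findsOneWidget);
  });
}
```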
2. Widget tests and golden tests (CI-friendly)
Widget tests: Prefer fake repositories or fixed Cubit/BLoC states so failures localize to layout and interaction, not wire format.
Golden tests: Best for design-system widgets and stable screens. Plan for maintenance cost when typography, themes, or locales change. CI must use a single pinned environment for goldens (see Appendix).
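As a sketch of the “fixed state” style: the cubit, state, and view below are minimal stand-ins defined inline (the real feature types live in the app), and the sketch assumes `flutter_bloc` and `bloc_test` are available as dev dependencies.

```dart
// test/features/work_package/work_package_list_view_test.dart — sketch only.
import 'package:bloc_test/bloc_test.dart';
import 'package:flutter/material.dart';
import 'package:flutter_bloc/flutter_bloc.dart';
import 'package:flutter_test/flutter_test.dart';

// Minimal stand-ins for the real cubit, state, and view.
class WorkPackageListState {
  const WorkPackageListState(this.titles);
  final List<String> titles;
}

class WorkPackageListCubit extends Cubit<WorkPackageListState> {
  WorkPackageListCubit() : super(const WorkPackageListState([]));
}

class WorkPackageListView extends StatelessWidget {
  const WorkPackageListView({super.key});

  @override
  Widget build(BuildContext context) {
    return BlocBuilder<WorkPackageListCubit, WorkPackageListState>(
      builder: (context, state) => ListView(
        children: [for (final t in state.titles) ListTile(title: Text(t))],
      ),
    );
  }
}

class MockWorkPackageListCubit extends MockCubit<WorkPackageListState>
    implements WorkPackageListCubit {}

void main() {
  testWidgets('renders one tile per work package', (tester) async {
    final cubit = MockWorkPackageListCubit();
    // Pin the cubit to a fixed state; no HTTP, no JSON, no device.
    whenListen(
      cubit,
      const Stream<WorkPackageListState>.empty(),
      initialState: const WorkPackageListState(['Bug A', 'Feature B']),
    );

    await tester.pumpWidget(
      MaterialApp(
        home: Scaffold(
          body: BlocProvider<WorkPackageListCubit>.value(
            value: cubit,
            child: const WorkPackageListView(),
          ),
        ),
      ),
    );

    expect(find.byType(ListTile), findsNWidgets(2));
    // Optional golden, only once a pinned CI environment is agreed:
    // await expectLater(find.byType(WorkPackageListView),
    //     matchesGoldenFile('goldens/work_package_list.png'));
  });
}
```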
BLoC mocked versus Dio mocked in widget tests
| Approach | When to use | Trade-off |
|---|---|---|
| Mock / fake BLoC or fixed state | Most widget and golden tests | Fast, stable; does not prove HTTP parsing. |
| Fake repository at domain boundary | When the widget depends on orchestration outcomes | More logic than a dumb bloc mock, still avoids JSON in every test. |
| Mock Dio deep in the tree | Rarely | Couples UI tests to serialization; noisy failures; prefer dedicated API tests instead. |
More “logic covered” without device tests: put that logic in use case tests with a fake port (the same idea as `integration_test/mocks/mock_auth_client.dart`, at unit-test speed); a sketch follows.
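A minimal sketch of the use-case-with-fake-port style; the port, fake, and use case here are simplified stand-ins defined inline rather than the repository's real auth types.

```dart
// test/features/auth/sign_in_use_case_test.dart — sketch with stand-in types.
import 'package:flutter_test/flutter_test.dart';

// Stand-in port; the real one would live in the feature's domain layer.
abstract class AuthClient {
  Future<String> exchangeCode(String code); // returns an access token
}

class FakeAuthClient implements AuthClient {
  @override
  Future<String> exchangeCode(String code) async {
    if (code != 'valid') throw const FormatException('unknown code');
    return 'token-123';
  }
}

// Stand-in use case: orchestration only, no HTTP, no platform plugins.
class SignInUseCase {
  SignInUseCase(this._client);
  final AuthClient _client;

  Future<bool> call(String code) async {
    try {
      return (await _client.exchangeCode(code)).isNotEmpty;
    } on FormatException {
      return false;
    }
  }
}

void main() {
  final signIn = SignInUseCase(FakeAuthClient());

  test('valid code signs in', () async {
    expect(await signIn('valid'), isTrue);
  });

  test('invalid code fails gracefully instead of throwing', () async {
    expect(await signIn('nope'), isFalse);
  });
}
```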
3. API tests without device (GitHub Actions)
Intent: Detect client breakage when the API or HAL changes.
Recommendation: Implement fixture-backed or staging-backed tests in or beside `packages/openproject_api_sdk`: status codes, error bodies, parsing, and critical query shapes. Optionally trigger workflows when the API repository changes—only after a minimal runnable suite exists; otherwise hooks create noise without signal.
Do not rely on “run every BLoC” as a proxy for API correctness; BLoCs with mocks prove app logic, not server truth.
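A minimal fixture-backed sketch, assuming Dio is the HTTP client and `package:test` is available; the fixture path, interceptor, and asserted field names are illustrative and should be replaced by the fields the app actually reads from the OpenProject HAL payloads.

```dart
// packages/openproject_api_sdk/test/work_packages_contract_test.dart — sketch.
import 'dart:convert';
import 'dart:io';

import 'package:dio/dio.dart';
import 'package:test/test.dart';

/// Resolves every request from a checked-in fixture instead of the network.
class FixtureInterceptor extends Interceptor {
  FixtureInterceptor(this.fixturePath);
  final String fixturePath;

  @override
  void onRequest(RequestOptions options, RequestInterceptorHandler handler) {
    final body = File(fixturePath).readAsStringSync();
    handler.resolve(Response(
      requestOptions: options,
      statusCode: 200,
      data: jsonDecode(body),
    ));
  }
}

void main() {
  test('work package list parses the current HAL shape', () async {
    final dio = Dio()
      ..interceptors.add(
        // Hypothetical fixture file; keep it in sync with the real API.
        FixtureInterceptor('test/fixtures/work_packages_list.json'),
      );

    final response = await dio.get('/api/v3/work_packages');
    final elements = response.data['_embedded']['elements'] as List;

    // Assert the fields the app actually reads, not the whole payload.
    expect(elements, isNotEmpty);
    expect(elements.first['subject'], isA<String>());
    expect(elements.first['_links']['self']['href'], isA<String>());
  });
}
```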
4. Unit tests and bloc tests
Continue as the default for new rules and regressions. This layer already exists; extend it for every non-trivial state machine and use case branch.
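For the shape of a `bloc_test`-style test, a minimal sketch with an inline stand-in cubit (real tests target the app's own BLoCs and Cubits):

```dart
// test/features/time_tracking/timer_cubit_test.dart — sketch with a stand-in cubit.
import 'package:bloc/bloc.dart';
import 'package:bloc_test/bloc_test.dart';

// Minimal stand-in for a real feature cubit.
class TimerCubit extends Cubit<int> {
  TimerCubit() : super(0);
  void tick() => emit(state + 1);
  void reset() => emit(0);
}

void main() {
  blocTest<TimerCubit, int>(
    'tick then reset returns to zero',
    build: TimerCubit.new,
    act: (cubit) => cubit
      ..tick()
      ..tick()
      ..reset(),
    expect: () => [1, 2, 0],
  );
}
```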
Option: Full-app fixture-driven flow tests with screenshots
This is an optional pattern—not the default for every feature—when the team wants a single automated “snapshot” of a flow (navigation + parsing + UI) under CI without a device.
Shape of the approach
- Pump most or all of the app (real `MaterialApp` / router / `GetIt` where practical).
- Replace HTTP with Dio (or HTTP client) adapters that return bodies from static JSON files under `test/` (or a shared fixture package).
- Fake or mock platform plugins (auth browser, file picker, notifications, etc.) so the test stays on `flutter test`.
- Drive the same user actions as in a manual flow (`tap`, `enterText`, `scroll`) with stable finders or `Key`s.
- Capture golden images (or equivalent screenshots) at milestone states to form a visual snapshot of the feature (a combined sketch follows this list).
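As a compressed sketch of the pattern: the fixture payload, routes, and the tiny inline app below are stand-ins; a real flow test would boot the production widget tree with `GetIt` overridden to inject the fixture-backed Dio.

```dart
// test/flows/work_package_flow_test.dart — flow snapshot sketch, stand-in app.
import 'package:dio/dio.dart';
import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';

void main() {
  testWidgets('list -> details flow against fixture JSON', (tester) async {
    // 1. Replace HTTP: every request resolves to checked-in fixture data.
    final dio = Dio()
      ..interceptors.add(InterceptorsWrapper(
        onRequest: (options, handler) => handler.resolve(Response(
          requestOptions: options,
          statusCode: 200,
          data: {
            '_embedded': {
              'elements': [
                {'id': 1, 'subject': 'Fix login crash'},
              ],
            },
          },
        )),
      ));

    // 2. Pump a near-complete app (here a tiny stand-in).
    await tester.pumpWidget(_DemoApp(dio: dio));
    await tester.pumpAndSettle();

    // 3. Drive the same actions a user would perform.
    await tester.tap(find.text('Fix login crash'));
    await tester.pumpAndSettle();

    // 4. Milestone screenshot (create once with `flutter test --update-goldens`).
    await expectLater(
      find.byType(MaterialApp),
      matchesGoldenFile('goldens/work_package_details.png'),
    );
  });
}

class _DemoApp extends StatefulWidget {
  const _DemoApp({required this.dio});
  final Dio dio;

  @override
  State<_DemoApp> createState() => _DemoAppState();
}

class _DemoAppState extends State<_DemoApp> {
  // Created once so FutureBuilder does not restart the request on rebuilds.
  late final Future<Response<dynamic>> _list =
      widget.dio.get('/api/v3/work_packages');

  @override
  Widget build(BuildContext context) {
    return MaterialApp(
      home: FutureBuilder<Response<dynamic>>(
        future: _list,
        builder: (context, snapshot) {
          if (!snapshot.hasData) {
            return const Scaffold(body: Center(child: Text('Loading')));
          }
          final items =
              snapshot.data!.data['_embedded']['elements'] as List<dynamic>;
          return Scaffold(
            body: ListView(
              children: [
                for (final wp in items)
                  ListTile(
                    title: Text(wp['subject'] as String),
                    onTap: () => Navigator.of(context).push(
                      MaterialPageRoute<void>(
                        builder: (_) => Scaffold(
                          appBar: AppBar(title: Text(wp['subject'] as String)),
                        ),
                      ),
                    ),
                  ),
              ],
            ),
          );
        },
      ),
    );
  }
}
```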
When it helps
- Regressions in wired flows you scripted (navigation, bloc orchestration, mapping JSON → UI).
- Unintended UI changes at captured steps (strong signal if goldens are reviewed).
- Parsing and field usage for the exact JSON you checked in—only as accurate as the fixtures are kept.
Costs and risks
- High authoring cost (DI, routing, splash/onboarding, timing). High maintenance when routes, l10n, theme, or fixtures change. Risk of duplicate truth if the same JSON is not aligned with SDK contract tests—prefer one fixture source or generate from OpenAPI where possible.
- Does not replace contract tests at the API layer for unknown server changes; fixtures that nobody updates create false confidence.
- Does not catch real OS or plugin behavior (everything is mocked).
Use this pattern for a few high-value journeys (for example one happy path per critical feature), not as the only test type for every screen.
Protection against breaking changes (by test category)
“Breaking change” here means anything that could ship a bad build: wrong data, crash, wrong UI, broken navigation, incompatible API, or environment-specific failure.
| Category | API / JSON contract drift | App logic & state | UI layout & visuals | Navigation & deep links | Real device / OS / plugins |
|---|---|---|---|---|---|
| Unit | Low | Low (local only) | None | None | None |
| BLoC / Cubit | Low (unless bloc parses raw JSON) | High for covered transitions | Low | Low | None |
| Use case + fake repo | Low at HTTP edge | High for orchestration | None | Low | None |
| Widget (scoped, fake state) | Low | Medium (binding only) | High for pumped screens | Medium | None |
| Golden (selective) | Low | Low | High at snapshot points | Low | None |
| Contract / API tests (fixtures or staging) | High | Low | None | Low | Low (staging deps) |
| Flow snapshot (full app + JSON + goldens) | Medium (only if fixtures match reality) | High for scripted paths | High at milestones | High for scripted paths | Low |
| Integration test (`integration_test`, device or emulator) | Medium–high (if real backend) | High | Medium | High | High for covered plugins |
Reading the table: No single layer scores high everywhere. Combine contract tests (server truth), BLoC/use case (rules), scoped widget or goldens (UI), and a small set of device smokes or flow snapshots for glue.
What gives the most benefit without “testing for testing’s sake”
Ordered by defect prevention per minute of maintenance for this codebase:
1. Contract or fixture-backed API tests — addresses unaware API-side breaks; lives next to HTTP/DTO code.
2. BLoC and use case tests — fast feedback on business rules and orchestration.
3. Widget tests — non-trivial UI, errors, empty states, accessibility fixes where applicable.
4. Golden tests — selective; design system and stable screens; needs CI discipline.
5. Weekly **integration_test** smoke — small journey set; human-process ownership.
6. Flow snapshot tests (optional) — use sparingly for critical end-to-end UI stories under `flutter test`; pair with API contract tests so JSON fixtures do not drift silently.
Suggestions: ROI order, bootstrap order, limited time
Why flow snapshot is last in the numbered list — That list ranks return per minute of maintenance (and CI fit), not “business unimportance.” Flow snapshots combine DI, routing, fixtures, and goldens and overlap API and visual checks unless limited to very few journeys. Prefer them after cheaper layers, or only where one scripted story is worth that cost.
Two orderings (do not confuse them)
- ROI / next-hour investment: Favor contract/API, BLoC/use case, scoped widget, then smoke, then optional flow snapshot—sensible when choosing one extra layer.
- Bootstrap / delivery sequence: Starting with wider tests (for example weekly smoke, a thin `integration_test` slice) can help momentum, learning, and visible wins; it is not the long-term inverted pyramid. Tighten with smaller tests around code that actually breaks.
If there is no time for “everything” (digest)
- Fixture or contract tests at the HTTP client — server drift.
- BLoC and use case tests — rules and branches.
- A small device smoke set — glue.
- Scoped widget tests — risky UI states.
- Goldens — only with pinned CI and agreed scope.
- Flow snapshots — few critical stories, not a duplicate of the whole pyramid.
Bug-driven deepening (works with “big tests first”)
You may prioritize wide tests early; still add unit / BLoC / use case (or a small widget test) when fixing a bug or touching non-trivial logic—cheap regression nets grow on real pain. Use bugs and meaningful edits as triggers, not vague “when we have time.” For new API or domain behavior, add at least a contract or use case happy path—not only tests born from defects—so happy paths do not stay untested.
Effort and maintenance (qualitative scale)
| Layer | Effort to write | Effort to maintain | Flake risk | CI fit today |
|---|---|---|---|---|
| Unit | Low | Low | Very low | Excellent |
| BLoC | Low–medium | Low | Low | Excellent |
| Use case + fake repo | Medium | Low–medium | Low | Excellent |
| Widget | Medium | Medium | Low | Good |
| Golden | Medium–high | High (design/locale) | Medium | Good if OS pinned |
| Integration test (`integration_test`, device or emulator) | High | Medium | Medium–high | Poor without device CI |
| Cross-repo API hook | Medium setup | Low per change if automated from OpenAPI/fixtures | Low | Excellent once suite exists |
| Flow snapshot (full app + fixtures + goldens) | Very high | High | Medium–high | Good if OS pinned for goldens |
Authoring time: manual versus AI-assisted (Cursor, Claude)
The ranges below are indicative engineer-time for one medium-complexity feature in this codebase (significant UI + state + a few API calls). They are not estimates for legal or procurement use—teams vary with familiarity, DI shape, and how stable finders are.
Assumptions: AI = strong prompting + human review and correction; restricted environments may add time for redaction and offline iteration.
Initial coverage (first time you add tests for that feature)
| Test category | Manual (typical range) | AI-assisted (typical range) | Notes |
|---|---|---|---|
| Unit (helpers, parsers) | 0.25–1.5 h | 0.15–0.75 h | AI excels at table-driven cases. |
| BLoC / Cubit | 1–4 h | 0.5–2 h | Depends on event/side-effect complexity. |
| Use case + fake repo | 1.5–5 h | 1–3 h | Fakes must match real ports; AI speeds scaffolding. |
| Widget (scoped subtree, fake bloc/repo) | 2–8 h | 1–4 h | Finder stability drives variance. |
| Golden only (a few stable widgets or screens) | 2–6 h | 1–3 h | First-time CI + font pinning not included. |
| Contract / API (first endpoints + fixtures) | 3–12 h | 2–8 h | Faster if OpenAPI or samples exist. |
| Integration smoke (`integration_test`, device or emulator) | 4–16 h | 2.5–10 h | Flakiness tuning often dominates. |
| Flow snapshot (full app + Dio JSON + platform fakes + flow + goldens) | 1–3 days | 0.5–2 days | DI + routing + async + baseline images; AI helps but does not remove integration pain. |
Ongoing cost (one sprint later: small product or API tweak)
| Test category | Manual (typical range) | AI-assisted (typical range) |
|---|---|---|
| Unit / BLoC / use case | 0.25–2 h | 0.15–1.25 h |
| Widget (scoped) | 0.5–3 h | 0.25–2 h |
| Golden | 0.5–4 h | 0.25–3 h (often re-baselining images) |
| Contract / API | 0.5–3 h | 0.25–2 h |
| Integration smoke (`integration_test`) | 1–6 h | 0.5–4 h |
| Flow snapshot | 0.5–2 days | 0.25–1.5 days |
What AI usually does not compress much
- Deciding what to assert, reviewing goldens, diagnosing flakes, aligning fixtures with production or staging truth, and CI parity (fonts, goldens on Linux vs macOS).
Classic authoring versus AI-assisted (Claude, Cursor)
| Topic | Classic | With AI assistance |
|---|---|---|
| First draft of tests and mocks | Slower | Faster boilerplate and case enumeration |
| Flaky test diagnosis | Engineer-led | Still engineer-led; AI may suggest hypotheses |
| Golden review | Human judgment | Human judgment; AI should not “approve” pixels |
| Secrets and restricted env | Controlled manually | Still require policy: no credentials in prompts, offline constraints respected |
| Wrong green tests | Rare if review is strict | Higher risk if assertions are weak—review stays in Definition of Done |
Bottom line for management: AI reduces typing and scaffolding time; it does not remove review, CI stability, or product choices about what must be covered. Use the authoring time tables above when negotiating sprint capacity.
Organizational recommendations
Sprint planning
- Reserve a transparent band of capacity for tests on non-trivial work (for example 10–20%, tuned by team).
- Definition of Done (example bullets):
  - New domain rule or branchy logic → use case or BLoC test.
  - New or risky UI states → widget test (and golden only if agreed).
  - New or changed API usage → contract or fixture test at the SDK/client layer.
  - Bug fix with reproduction → regression test where feasible.

TDD for bugs
- Encourage: regression test first for reproducible bugs in logic or state machines (documents intent, prevents return of the defect).
- Avoid mandating TDD for trivial copy, one-line layout, or pure asset changes—credibility with engineers matters.
Appendix: risks, open decisions, and environment limits
This appendix is not a list of blockers; it is a list of places where strategy meets reality. Each item mixes risk (what goes wrong if ignored), open decision (what leadership or the team must choose), and environment limit (what GitHub Actions, emulators, or OS vendors do not guarantee). Use it when estimating cost, assigning owners, or explaining to stakeholders why a test type “works in principle” but needs guardrails.
Weekly integration without rigor
Context: A policy of “run integration tests weekly” sounds simple. In practice it competes with releases, support, and PTO. If nobody is accountable, the habit dies and confidence drops back to manual-only without anyone updating the strategy doc.
Risk: Silent gaps—you believe smoke runs, but it has not run for weeks; regressions ship; QA is asked to compensate with broader manual passes under time pressure.
What to decide explicitly
- Owner: One named person or rotating role responsible for kick-off, triage of failures, and escalation—not “the team.”
- Device matrix: At minimum, over time, both Android and iOS (or two representative OS versions). Different OEMs still surface different WebView, keyboard, and permission quirks.
- Artifacts: Store logs and failure screenshots (or video) in CI artifacts, a shared drive, or a ticket template so failures are debuggable offline and comparable week to week.
Environment limit: Hosted device farms cost money; local devices depend on who is in the office. Budget and access are management inputs, not engineering-only details.
Golden tests on GitHub Actions
Context: Golden tests rasterize widgets to pixels. Pixel output depends on font metrics, subpixel rendering, theme, device pixel ratio, and Skia/Impeller behavior. GitHub’s ubuntu-latest runners are not pixel-identical to macOS, iOS, or Android.
Risk: PRs fail only because the runner image changed, or because a designer updated a token that was never meant to block merge. Teams then disable goldens or stop updating baselines—losing the benefit entirely.
What to decide explicitly
- Single source of truth: For example, run goldens only on `macos-latest`, or use a pinned Docker image with bundled fonts and a fixed Flutter version—documented in the repo.
- Update policy: Who may approve baseline image updates, and whether large visual diffs require design or QA sign-off.
- Scope: Goldens for design tokens and stable shells first; avoid goldening every screen until the process is trusted.
- Localization: Large translation drops (for example Crowdin OTA or mass `l10n` updates) can invalidate many baselines at once—decide whether goldens run under a fixed test locale and which strings are allowed to appear in golden-covered widgets.
Environment limit: Linux vs macOS font shaping differs; emulator vs physical device differs. “Looks the same to a human” is not the same as “byte-identical PNG.”
API repository hooks
Context: The desirable story is: “API repo changed → client tests run → we know before merge.” That only works if the client repo contains a test suite that speaks the same contract as production (or staging), and if credentials and rate limits are handled.
Risk: A hook that runs a smoke URL against production creates flaky jobs, rate limits, and compliance issues. A hook that runs nothing meaningful trains everyone to ignore red builds.
What to decide explicitly
- Input artifact: Prefer recorded fixtures, OpenAPI snapshots, or a dedicated staging environment with stable seed data over calling arbitrary production URLs from CI.
- Failure semantics: Red means “client must adapt or API must roll back”—agree with backend owners what “breaking” means (response shape, status codes, deprecations).
- Secrets: Where instance URLs and tokens live (GitHub secrets, vault); restricted environments may forbid outbound calls entirely—fixtures-only mode then becomes mandatory.
Environment limit: CI runners may not reach internal APIs without VPN or self-hosted runners; that is an infrastructure decision, not a test style decision.
Notifications, background work, Firebase
Context: The app uses platform channels, local notifications, background services, and Firebase-related tooling for stability and diagnostics. CI is a headless, often non-Google Play Services Linux environment. It does not replicate APNs delivery, exact alarm policies on newer Android, or user permission flows the way a phone does.
Risk: Assuming “green CI means push works” creates false confidence. Real failures appear only in dogfood, staging, or store review.
What to decide explicitly
- Split responsibilities: Automate pure logic (payload JSON parsing, mapping to domain models, idempotency) with unit or contract tests; keep OS integration on a short manual or staging checklist per release (or per notification feature).
- Version matrix: Android notification permission and background execution rules change by OS version—QA and product should know which versions are in support.
Environment limit: Simulating “user dismissed notification” or “app killed then opened from tap” in CI is brittle; budget for targeted manual or device-farm runs for those stories when they change.
Appium (or other external UI drivers)
Context: Flutter’s **integration_test** package drives the app from Dart, shares the same isolate model, and is the default in the ecosystem. Appium (or similar) drives the UI from outside, often across a bridge, and shines when non-Flutter surfaces, system dialogs, or multi-app scenarios dominate.
Risk: Two stacks mean double maintenance, slower feedback, and harder debugging for the majority of flows that Flutter already covers.
What to decide explicitly
- Use **integration_test** until a concrete gap is written down (for example “must validate OS share sheet with a specific third-party app”).
- If Appium is adopted, cap scope to those gaps and fund training and stable device inventory.
Environment limit: For this codebase, the share of UI that requires Appium-style control is small relative to total features; default to **integration_test** first.
Flow snapshot tests (full app + JSON fixtures + goldens)
Context: This pattern gives a cinematic record of a feature under **flutter_test**, but it stacks DI setup, routing, fixture curation, and golden churn in one place.
Risk: The team maintains two sources of truth for JSON—fixtures in widget tests and reality on the server—unless contract tests own the canonical fixtures.
What to decide explicitly
- Cap the number of flow-snapshot suites (for example “at most N active flows”).
- Fixture ownership: Same owners as API client changes, or enforce generation from OpenAPI where possible.
Environment limit: Same as goldens: OS and fonts must be pinned for comparable screenshots.
Local persistence, time, and locale (Drift, preferences, clocks)
Context: The app uses local storage (for example Drift/SQLite, preferences). Tests must control schema migrations, seed data, and clock (DateTime.now(), time zones) or become order-dependent and flaky.
Risk: Tests pass locally and fail in CI at midnight UTC, or fail only when run in parallel.
What to decide explicitly
- Prefer in-memory databases or isolated temp directories per test where supported.
- Inject clock and locale in tests that assert formatting or deadlines.
Environment limit: CI default locale may differ from a developer’s laptop—document or fix locale in tests that format dates and numbers.
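A minimal sketch of clock injection, assuming `package:clock` is added as a dependency and that the hypothetical helper reads `clock.now()` instead of `DateTime.now()`; Drift tests can follow the same spirit by constructing the database with an in-memory executor.

```dart
// test/utils/due_date_test.dart — sketch; isOverdue is a hypothetical helper.
import 'package:clock/clock.dart';
import 'package:flutter_test/flutter_test.dart';

/// Production code reads `clock.now()` so tests can pin the current time.
bool isOverdue(DateTime dueDate) => clock.now().isAfter(dueDate);

void main() {
  test('overdue check does not depend on the wall clock of the CI runner', () {
    withClock(Clock.fixed(DateTime.utc(2024, 1, 15)), () {
      expect(isOverdue(DateTime.utc(2024, 1, 10)), isTrue);
      expect(isOverdue(DateTime.utc(2024, 1, 20)), isFalse);
    });
  });
}
```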
Secrets, compliance, and “restricted environment” work
Context: Some organizations forbid pasting production URLs or tokens into AI tools, block arbitrary outbound network from CI, or require audit trails for test data.
Risk: Engineers skip writing API tests because “CI cannot reach the server,” or leak secrets into logs and golden names.
What to decide explicitly
- A fixture-only path for CI and a staging path for scheduled jobs, both documented.
- Redaction rules for logs and for AI-assisted authoring (no credentials in prompts).
Environment limit: If outbound network is disallowed, contract tests must use checked-in HTTP recordings or static OpenAPI examples—there is no alternative in-process.
Document maintenance
Revisit this strategy after major changes: introduction of golden CI, addition of API contract suite, or onboarding of a device farm. Revisit the appendix decisions (owners, OS pins, fixtures, secrets) when those change. Update the smoke journey list with QA when major flows ship or retire.