Testing strategy (managers and QA)
This document supports decisions about automated testing, CI, and how QA and engineering share responsibility. It is grounded in the current OpenProject Flutter app repository.
If you are new here: read Vision and planned directions first. It states what we want to achieve and why, in plain language. The rest of the document refines that vision with repository facts, priorities, risks, and effort estimates.
Vision and planned directions (read this first)
This section is for readers who were not part of earlier discussions. It captures our intended direction; later sections explain how that maps to this codebase, what already exists, and what we recommend adjusting.
Why automate tests at all
We ship a Flutter client against a living API and rich UI. Bugs show up as wrong data, broken flows, visual regressions, or surprises after server changes. The goal is not “maximum tests,” but predictable quality and earlier feedback, while keeping cost and flake risk under control.
Direction 1: Feature integration tests (device or emulator)
Vision: Run meaningful end-to-end style checks on a real app binding—on a device or emulator—so we exercise navigation, real layout, and timing closer to production.
Constraints we accept: These tests are expensive (time, machines, stability). We do not assume a device farm in CI today. The plan is not to run them on every pull request, but to use them as scheduled smoke (for example once a week) and/or around release candidates, to catch “glue” regressions and reduce repetitive manual load on QA without replacing exploratory testing.
Tooling stance: Prefer Flutter’s **integration_test** where possible; reserve heavier tools (for example Appium) only when the platform cannot be covered otherwise.
Direction 2: Widget tests and golden (screenshot) tests
Vision: While building a feature, add widget tests and, where agreed, golden (reference image) tests for the screens or states that matter—so we get fast feedback in GitHub Actions on layout and critical UI paths.
Open question we acknowledge: For those tests, we may mock BLoC/Cubit state (or fakes at the repository boundary) or push further and mock HTTP (Dio) with static JSON. The document later recommends a split: keep most widget/golden tests free of wire-level JSON; put HTTP parsing and contract behavior in dedicated API / contract tests so failures stay interpretable and maintenance stays sane.
Direction 3: API-related tests without a device
Vision: Add tests that behave like integration checks for the client–API contract but run on Linux CI—no emulator. Typical forms: fixture-backed HTTP tests next to the generated or hand-written client, and/or tests that drive use cases with a fake HTTP port, still asserting realistic responses.
Motivation: We sometimes ship a broken client because the API or HAL changed and we were not aware. Running these suites on every PR (and optionally triggering workflows when the API repository changes) is meant to surface incompatibility early. This only works if the suite is deterministic (fixtures or a stable staging contract)—hooks without a real test body create noise, not safety.
Direction 4: Unit tests, BLoC tests, and small-scope tests
Vision: Keep growing unit tests, **bloc_test**-style tests, and use case tests as the default safety net for rules and state machines. They are the cheapest layer and should remain the bulk of automated coverage.
Optional direction: Full-app flow with JSON fixtures and screenshots
Some features may additionally use the flow snapshot pattern (full or near-full app under flutter_test, static JSON, faked platform APIs, golden milestones). That is optional and expensive—use for a few critical journeys; details appear below.
How we want to work as a team
Sprint planning: We intend to allocate visible time for tests alongside feature work (for example an agreed percentage or explicit tasks). Management may initially see this as overhead; the argument is that unplanned quality work still happens—it is simply deferred, more expensive, and often lands on QA or customers.
Bugs and TDD: When a bug is reproducible and worth preventing again, we want to fix it through a test-first or test-accompanying workflow where that is practical—not as a dogma for every one-line change, but as a default for real regressions.
How the rest of this document is organized
| After this section | You will find |
|---|---|
| Executive summary and decisions | Short decisions and asks for leadership and QA. |
| Glossary | What we mean by each test type. |
| Current baseline | What already exists in this repo (CI, tests, gaps). |
| Later sections | ROI ordering, bootstrap vs ROI, mock boundaries, effort tables, risks, and QA handoff. |
Executive summary
- Goal: Improve release quality and reduce surprise breakages—especially when the API or shared contracts change—without drowning the team in flaky or redundant tests.
- Current constraint: GitHub Actions runs `flutter test` on Ubuntu only. Device and emulator integration tests are not in CI today; they are suitable for a weekly smoke lane or pre-release, not for every PR.
- Highest ROI next step: Add contract or fixture-backed HTTP tests next to the API client layer (`packages/openproject_api_sdk`). That directly targets “the server changed and the client broke without us knowing.” Widget tests with mocked Dio are not a substitute.
- What already works: The repo has substantial unit, BLoC, and use case coverage under `test/`. The strategy is to extend proven layers, not restart from zero.
- Organizational ask: Treat test work as part of delivery in sprint planning (transparent capacity band and Definition of Done). Use TDD-style regression tests for reproducible bugs; avoid dogmatic TDD for trivial changes.
Decisions for management
- Approve a small recurring capacity band (for example 10–20% of engineering time on non-trivial stories) for tests that match the DoD below.
- Approve ownership and cadence for weekly on-device smoke (who runs it, which devices, where results are recorded).
- Approve one CI strategy for golden tests (pinned OS or runner) if goldens are adopted, to avoid endless screenshot churn.
Decisions for QA
- Agree which journeys are in the weekly automated smoke set versus what remains manual exploratory testing.
- Agree how API or environment incidents are distinguished from app regressions when smoke fails.
Audience and how to use this doc
| Audience | Use |
|---|---|
| Managers | Trade-offs (cost, schedule, risk), CI expectations, why sprint capacity for tests is rational. |
| QA | What automation can replace or narrow, weekly smoke scope, where to log gaps when coverage does not exist yet. |
Glossary (short)
| Term | Meaning here |
|---|---|
| Unit test | Fast, no I/O; pure functions, parsers, small helpers. |
| BLoC / Cubit test | State transitions and side-effect ordering with controlled fakes (`bloc_test`). |
| Use case test | Application service orchestration with fake repositories or ports—not real network. |
| Widget test | Pumps widgets under `flutter_test`; no device required. |
| Golden test | Compares rasterized output to a reference image; great for stable design surfaces, sensitive to fonts and OS. |
| Integration test (`integration_test`) | Runs a real app binding; often on device or emulator; higher cost and flake risk. |
| Contract / API test | Asserts HTTP shapes, status codes, and parsing against fixtures or a staging API; catches server drift. |
| Flow snapshot test (optional pattern) | Full-app widget test: pump near-complete app, Dio (or client) returns static JSON from `test/` fixtures, capture golden milestones. |
Current baseline in this repository
Facts you can cite when prioritizing work:
| Area | Baseline |
|---|---|
| CI | GitHub Actions runs `flutter test` on Ubuntu only; no device or emulator lane. |
| Integration tests | `integration_test/` exists (for example `sign_in_test.dart` with `mocks/mock_auth_client.dart`) but is not run in CI. |
| Unit / BLoC / use case | Many tests under `test/`; this is the strongest existing layer. |
| Golden tests | No golden tests yet. |
| API SDK tests | No dedicated contract or fixture suite beside `packages/openproject_api_sdk` yet; this is the highest-ROI gap. |
Feature breadth versus “needs device or OS”
Feature areas: There are 21 top-level feature directories under `lib/app/features/` (for example auth, work_package, home, time_tracking, shared_files, notifications, onboarding, projects, user).
Platform-sensitive dependencies (from `pubspec.yaml`) include OAuth (flutter_appauth), deep links (app_links), share-in (receive_sharing_intent), image_picker, file_picker, video_player, home_widget, live_activities, flutter_local_notifications, flutter_background_service, flutter_inappwebview, gal, permission_handler, Drift/SQLite, secure storage, and others.
Code touchpoints: A search for direct use of the heaviest plugins (for example ImagePicker, FilePicker, home_widget, local notifications, background service, sharing, Gal, flutter_appauth, AppLinks) touches roughly 20–25 Dart files under `lib/`. Many of these are shared services consumed by several features—not 25 separate products.
Interpretation
- Only a minority of files sit on the “hard to fake completely” boundary, but they underpin cross-cutting flows: sign-in and callbacks, attachments, notifications, home screen widgets, deep links, rich editor embeds.
- On-device or weekly smoke remains important for those flows; it is not true that most of the app must be covered by expensive E2E.
- Rule of thumb for stakeholders: ~21 feature areas; several critical journeys benefit from scheduled smoke; the majority of business rules and UI states are cheaper to cover with use case, BLoC, and widget tests if boundaries stay clean.
```mermaid
flowchart TB
  subgraph fastCI [Fast CI - GitHub Actions]
    unit[Unit utils parsers]
    bloc[BLoC Cubit bloc_test]
    usecase[Use cases fake repos]
    widget[Widget tests]
    golden[Golden tests optional]
    contract[API contract or fixture HTTP tests]
  end
  subgraph slowManual [Weekly or manual]
    integ[integration_test on device or emulator]
    smoke[Smoke subset of critical journeys]
  end
  subgraph hardE2E [Rare deep E2E if needed]
    appium[Only if Flutter integration_test insufficient]
  end
  unit --> bloc
  bloc --> usecase
  usecase --> widget
  widget --> golden
  contract --> usecase
  integ --> smoke
```
Planned test types and where they fit
1. Integration testing on device or emulator (weekly smoke)
Intent: Catch integration issues that unit tests miss, and reduce repetitive manual checks for QA—not to duplicate every manual case.
Reality: Expensive in time and infrastructure; no device farm in CI today. Recommendation: Run a small fixed set of integration_test journeys once per week (or per release candidate), with a named owner and published pass/fail.
Suggested smoke scope (5–10 journeys, illustrative):
- Enter instance URL and complete sign-in path (already aligned with `integration_test/sign_in_test.dart`).
- Open a work package from a list into details (navigation + data binding).
- Attachment flow: pick or attach, using fakes at the file/camera boundary where possible.
- Deep link or cold start into a known route (if stable test URLs exist).
- Optional: one path that touches WebView or rich editor, only if flakiness stays acceptable.
QA handoff: Weekly smoke narrows regression risk on “whole app glue”; it does not remove need for exploratory testing on new features or visual polish.
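For illustration, a single smoke journey in `integration_test/` could look like the sketch below. The app import, the `Key` names, and the journey itself are placeholders; the repository's existing `integration_test/sign_in_test.dart` and `integration_test/mocks/mock_auth_client.dart` remain the authoritative starting points.

```dart
// integration_test/work_package_smoke_test.dart — illustrative sketch only.
import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';
import 'package:integration_test/integration_test.dart';

// Hypothetical app entry point; replace with the real package name and main().
import 'package:openproject_app/main.dart' as app;

void main() {
  IntegrationTestWidgetsFlutterBinding.ensureInitialized();

  testWidgets('open a work package from the list into details', (tester) async {
    app.main(); // boots the real app binding on the device or emulator
    await tester.pumpAndSettle();

    // Prefer stable Keys over display strings so localization does not break finders.
    await tester.tap(find.byKey(const Key('workPackageList_item_0')));
    await tester.pumpAndSettle();

    expect(find.byKey(const Key('workPackageDetails_title')), findsOneWidget);
  });
}
```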
2. Widget tests and golden tests (CI-friendly)
Widget tests: Prefer fake repositories or fixed Cubit/BLoC states so failures localize to layout and interaction, not wire format.
Golden tests: Best for design-system widgets and stable screens. Plan for maintenance cost when typography, themes, or locales change. CI must use a single pinned environment for goldens (see Appendix).
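As a sketch of the “fixed state” style: the cubit, state, and view below are minimal stand-ins defined inline (the real feature types live in the app), and the sketch assumes `flutter_bloc` and `bloc_test` are available as dev dependencies.

```dart
// test/features/work_package/work_package_list_view_test.dart — sketch only.
import 'package:bloc_test/bloc_test.dart';
import 'package:flutter/material.dart';
import 'package:flutter_bloc/flutter_bloc.dart';
import 'package:flutter_test/flutter_test.dart';

// Minimal stand-ins for the real cubit, state, and view.
class WorkPackageListState {
  const WorkPackageListState(this.titles);
  final List<String> titles;
}

class WorkPackageListCubit extends Cubit<WorkPackageListState> {
  WorkPackageListCubit() : super(const WorkPackageListState([]));
}

class WorkPackageListView extends StatelessWidget {
  const WorkPackageListView({super.key});

  @override
  Widget build(BuildContext context) {
    return BlocBuilder<WorkPackageListCubit, WorkPackageListState>(
      builder: (context, state) => ListView(
        children: [for (final t in state.titles) ListTile(title: Text(t))],
      ),
    );
  }
}

class MockWorkPackageListCubit extends MockCubit<WorkPackageListState>
    implements WorkPackageListCubit {}

void main() {
  testWidgets('renders one tile per work package', (tester) async {
    final cubit = MockWorkPackageListCubit();
    // Pin the cubit to a fixed state; no HTTP, no JSON, no device.
    whenListen(
      cubit,
      const Stream<WorkPackageListState>.empty(),
      initialState: const WorkPackageListState(['Bug A', 'Feature B']),
    );

    await tester.pumpWidget(
      MaterialApp(
        home: Scaffold(
          body: BlocProvider<WorkPackageListCubit>.value(
            value: cubit,
            child: const WorkPackageListView(),
          ),
        ),
      ),
    );

    expect(find.byType(ListTile), findsNWidgets(2));
    // Optional golden, only once a pinned CI environment is agreed:
    // await expectLater(find.byType(WorkPackageListView),
    //     matchesGoldenFile('goldens/work_package_list.png'));
  });
}
```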
BLoC mocked versus Dio mocked in widget tests
| Approach | When to use | Trade-off |
|---|---|---|
| Mock / fake BLoC or fixed state | Most widget and golden tests | Fast, stable; does not prove HTTP parsing. |
| Fake repository at domain boundary | When the widget depends on orchestration outcomes | More logic than a dumb bloc mock, still avoids JSON in every test. |
| Mock Dio deep in the tree | Rarely | Couples UI tests to serialization; noisy failures; prefer dedicated API tests instead. |
More “logic covered” without device tests: put that logic in use case tests with a fake port (the same idea as `integration_test/mocks/mock_auth_client.dart`, at unit-test speed); a sketch follows.
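A minimal sketch of the use-case-with-fake-port style; the port, fake, and use case here are simplified stand-ins defined inline rather than the repository's real auth types.

```dart
// test/features/auth/sign_in_use_case_test.dart — sketch with stand-in types.
import 'package:flutter_test/flutter_test.dart';

// Stand-in port; the real one would live in the feature's domain layer.
abstract class AuthClient {
  Future<String> exchangeCode(String code); // returns an access token
}

class FakeAuthClient implements AuthClient {
  @override
  Future<String> exchangeCode(String code) async {
    if (code != 'valid') throw const FormatException('unknown code');
    return 'token-123';
  }
}

// Stand-in use case: orchestration only, no HTTP, no platform plugins.
class SignInUseCase {
  SignInUseCase(this._client);
  final AuthClient _client;

  Future<bool> call(String code) async {
    try {
      return (await _client.exchangeCode(code)).isNotEmpty;
    } on FormatException {
      return false;
    }
  }
}

void main() {
  final signIn = SignInUseCase(FakeAuthClient());

  test('valid code signs in', () async {
    expect(await signIn('valid'), isTrue);
  });

  test('invalid code fails gracefully instead of throwing', () async {
    expect(await signIn('nope'), isFalse);
  });
}
```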
3. API tests without device (GitHub Actions)
Intent: Detect client breakage when the API or HAL changes.
Recommendation: Implement fixture-backed or staging-backed tests in or beside `packages/openproject_api_sdk`: status codes, error bodies, parsing, and critical query shapes. Optionally trigger workflows when the API repository changes—only after a minimal runnable suite exists; otherwise hooks create noise without signal.
Do not rely on “run every BLoC” as a proxy for API correctness; BLoCs with mocks prove app logic, not server truth.
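A minimal fixture-backed sketch, assuming Dio is the HTTP client and `package:test` is available; the fixture path, interceptor, and asserted field names are illustrative and should be replaced by the fields the app actually reads from the OpenProject HAL payloads.

```dart
// packages/openproject_api_sdk/test/work_packages_contract_test.dart — sketch.
import 'dart:convert';
import 'dart:io';

import 'package:dio/dio.dart';
import 'package:test/test.dart';

/// Resolves every request from a checked-in fixture instead of the network.
class FixtureInterceptor extends Interceptor {
  FixtureInterceptor(this.fixturePath);
  final String fixturePath;

  @override
  void onRequest(RequestOptions options, RequestInterceptorHandler handler) {
    final body = File(fixturePath).readAsStringSync();
    handler.resolve(Response(
      requestOptions: options,
      statusCode: 200,
      data: jsonDecode(body),
    ));
  }
}

void main() {
  test('work package list parses the current HAL shape', () async {
    final dio = Dio()
      ..interceptors.add(
        // Hypothetical fixture file; keep it in sync with the real API.
        FixtureInterceptor('test/fixtures/work_packages_list.json'),
      );

    final response = await dio.get('/api/v3/work_packages');
    final elements = response.data['_embedded']['elements'] as List;

    // Assert the fields the app actually reads, not the whole payload.
    expect(elements, isNotEmpty);
    expect(elements.first['subject'], isA<String>());
    expect(elements.first['_links']['self']['href'], isA<String>());
  });
}
```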
4. Unit tests and bloc tests
Continue as the default for new rules and regressions. This layer already exists; extend it for every non-trivial state machine and use case branch.
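For the shape of a `bloc_test`-style test, a minimal sketch with an inline stand-in cubit (real tests target the app's own BLoCs and Cubits):

```dart
// test/features/time_tracking/timer_cubit_test.dart — sketch with a stand-in cubit.
import 'package:bloc/bloc.dart';
import 'package:bloc_test/bloc_test.dart';

// Minimal stand-in for a real feature cubit.
class TimerCubit extends Cubit<int> {
  TimerCubit() : super(0);
  void tick() => emit(state + 1);
  void reset() => emit(0);
}

void main() {
  blocTest<TimerCubit, int>(
    'tick then reset returns to zero',
    build: TimerCubit.new,
    act: (cubit) => cubit
      ..tick()
      ..tick()
      ..reset(),
    expect: () => [1, 2, 0],
  );
}
```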
Option: Full-app fixture-driven flow tests with screenshots
This is an optional pattern—not the default for every feature—when the team wants a single automated “snapshot” of a flow (navigation + parsing + UI) under CI without a device.
Shape of the approach
- Pump most or all of the app (real `MaterialApp` / router / `GetIt` where practical).
- Replace HTTP with Dio (or HTTP client) adapters that return bodies from static JSON files under `test/` (or a shared fixture package).
- Fake or mock platform plugins (auth browser, file picker, notifications, etc.) so the test stays on `flutter test`.
- Drive the same user actions as in a manual flow (`tap`, `enterText`, `scroll`) with stable finders or `Key`s.
- Capture golden images (or equivalent screenshots) at milestone states to form a visual snapshot of the feature (a combined sketch follows this list).
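As a compressed sketch of the pattern: the fixture payload, routes, and the tiny inline app below are stand-ins; a real flow test would boot the production widget tree with `GetIt` overridden to inject the fixture-backed Dio.

```dart
// test/flows/work_package_flow_test.dart — flow snapshot sketch, stand-in app.
import 'package:dio/dio.dart';
import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';

void main() {
  testWidgets('list -> details flow against fixture JSON', (tester) async {
    // 1. Replace HTTP: every request resolves to checked-in fixture data.
    final dio = Dio()
      ..interceptors.add(InterceptorsWrapper(
        onRequest: (options, handler) => handler.resolve(Response(
          requestOptions: options,
          statusCode: 200,
          data: {
            '_embedded': {
              'elements': [
                {'id': 1, 'subject': 'Fix login crash'},
              ],
            },
          },
        )),
      ));

    // 2. Pump a near-complete app (here a tiny stand-in).
    await tester.pumpWidget(_DemoApp(dio: dio));
    await tester.pumpAndSettle();

    // 3. Drive the same actions a user would perform.
    await tester.tap(find.text('Fix login crash'));
    await tester.pumpAndSettle();

    // 4. Milestone screenshot (create once with `flutter test --update-goldens`).
    await expectLater(
      find.byType(MaterialApp),
      matchesGoldenFile('goldens/work_package_details.png'),
    );
  });
}

class _DemoApp extends StatefulWidget {
  const _DemoApp({required this.dio});
  final Dio dio;

  @override
  State<_DemoApp> createState() => _DemoAppState();
}

class _DemoAppState extends State<_DemoApp> {
  // Created once so FutureBuilder does not restart the request on rebuilds.
  late final Future<Response<dynamic>> _list =
      widget.dio.get('/api/v3/work_packages');

  @override
  Widget build(BuildContext context) {
    return MaterialApp(
      home: FutureBuilder<Response<dynamic>>(
        future: _list,
        builder: (context, snapshot) {
          if (!snapshot.hasData) {
            return const Scaffold(body: Center(child: Text('Loading')));
          }
          final items =
              snapshot.data!.data['_embedded']['elements'] as List<dynamic>;
          return Scaffold(
            body: ListView(
              children: [
                for (final wp in items)
                  ListTile(
                    title: Text(wp['subject'] as String),
                    onTap: () => Navigator.of(context).push(
                      MaterialPageRoute<void>(
                        builder: (_) => Scaffold(
                          appBar: AppBar(title: Text(wp['subject'] as String)),
                        ),
                      ),
                    ),
                  ),
              ],
            ),
          );
        },
      ),
    );
  }
}
```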
When it helps
- Regressions in wired flows you scripted (navigation, bloc orchestration, mapping JSON → UI).
- Unintended UI changes at captured steps (strong signal if goldens are reviewed).
- Parsing and field usage for the exact JSON you checked in—only as accurate as the fixtures are kept.
Costs and risks
- High authoring cost (DI, routing, splash/onboarding, timing). High maintenance when routes, l10n, theme, or fixtures change. Risk of duplicate truth if the same JSON is not aligned with SDK contract tests—prefer one fixture source or generate from OpenAPI where possible.
- Does not replace contract tests at the API layer for unknown server changes; fixtures that nobody updates create false confidence.
- Does not catch real OS or plugin behavior (everything is mocked).
Use this pattern for a few high-value journeys (for example one happy path per critical feature), not as the only test type for every screen.
Protection against breaking changes (by test category)
“Breaking change” here means anything that could ship a bad build: wrong data, crash, wrong UI, broken navigation, incompatible API, or environment-specific failure.
| Category | API / JSON contract drift | App logic & state | UI layout & visuals | Navigation & deep links | Real device / OS / plugins |
|---|---|---|---|---|---|
| Unit | Low | Low (local only) | None | None | None |
| BLoC / Cubit | Low (unless bloc parses raw JSON) | High for covered transitions | Low | Low | None |
| Use case + fake repo | Low at HTTP edge | High for orchestration | None | Low | None |
| Widget (scoped, fake state) | Low | Medium (binding only) | High for pumped screens | Medium | None |
| Golden (selective) | Low | Low | High at snapshot points | Low | None |
| Contract / API tests (fixtures or staging) | High | Low | None | Low | Low (staging deps) |
| Flow snapshot (full app + JSON + goldens) | Medium (only if fixtures match reality) | High for scripted paths | High at milestones | High for scripted paths | Low |
| Integration test (`integration_test`, device or emulator) | Medium–high (if real backend) | High | Medium | High | High for covered plugins |
Reading the table: No single layer scores high everywhere. Combine contract tests (server truth), BLoC/use case (rules), scoped widget or goldens (UI), and a small set of device smokes or flow snapshots for glue.
What gives the most benefit without “testing for testing’s sake”
Ordered by defect prevention per minute of maintenance for this codebase:
1. Contract or fixture-backed API tests — addresses unaware API-side breaks; lives next to HTTP/DTO code.
2. BLoC and use case tests — fast feedback on business rules and orchestration.
3. Widget tests — non-trivial UI, errors, empty states, accessibility fixes where applicable.
4. Golden tests — selective; design system and stable screens; needs CI discipline.
5. Weekly **integration_test** smoke — small journey set; human-process ownership.
6. Flow snapshot tests (optional) — use sparingly for critical end-to-end UI stories under `flutter test`; pair with API contract tests so JSON fixtures do not drift silently.
Suggestions: ROI order, bootstrap order, limited time
Why flow snapshot is last in the numbered list — That list ranks return per minute of maintenance (and CI fit), not “business unimportance.” Flow snapshots combine DI, routing, fixtures, and goldens and overlap API and visual checks unless limited to very few journeys. Prefer them after cheaper layers, or only where one scripted story is worth that cost.
Two orderings (do not confuse them)
- ROI / next-hour investment: Favor contract/API, BLoC/use case, scoped widget, then smoke, then optional flow snapshot—sensible when choosing one extra layer.
- Bootstrap / delivery sequence: Starting with wider tests (for example weekly smoke, a thin `integration_test` slice) can help momentum, learning, and visible wins; it is not the long-term inverted pyramid. Tighten with smaller tests around code that actually breaks.
If there is no time for “everything” (digest)
- Fixture or contract tests at the HTTP client — server drift.
- BLoC and use case tests — rules and branches.
- A small device smoke set — glue.
- Scoped widget tests — risky UI states.
- Goldens — only with pinned CI and agreed scope.
- Flow snapshots — few critical stories, not a duplicate of the whole pyramid.
Bug-driven deepening (works with “big tests first”)
You may prioritize wide tests early; still add unit / BLoC / use case (or a small widget test) when fixing a bug or touching non-trivial logic—cheap regression nets grow on real pain. Use bugs and meaningful edits as triggers, not vague “when we have time.” For new API or domain behavior, add at least a contract or use case happy path—not only tests born from defects—so happy paths do not stay untested.
Effort and maintenance (qualitative scale)
| Layer | Effort to write | Effort to maintain | Flake risk | CI fit today |
|---|---|---|---|---|
| Unit | Low | Low | Very low | Excellent |
| BLoC | Low–medium | Low | Low | Excellent |
| Use case + fake repo | Medium | Low–medium | Low | Excellent |
| Widget | Medium | Medium | Low | Good |
| Golden | Medium–high | High (design/locale) | Medium | Good if OS pinned |
| Integration test (`integration_test`, device or emulator) | High | Medium | Medium–high | Poor without device CI |
| Cross-repo API hook | Medium setup | Low per change if automated from OpenAPI/fixtures | Low | Excellent once suite exists |
| Flow snapshot (full app + fixtures + goldens) | Very high | High | Medium–high | Good if OS pinned for goldens |
Authoring time: manual versus AI-assisted (Cursor, Claude)
The ranges below are indicative engineer-time for one medium-complexity feature in this codebase (significant UI + state + a few API calls). They are not estimates for legal or procurement use—teams vary with familiarity, DI shape, and how stable finders are.
Assumptions: AI = strong prompting + human review and correction; restricted environments may add time for redaction and offline iteration.
Initial coverage (first time you add tests for that feature)
| Test category | Manual (typical range) | AI-assisted (typical range) | Notes |
|---|---|---|---|
| Unit (helpers, parsers) | 0.25–1.5 h | 0.15–0.75 h | AI excels at table-driven cases. |
| BLoC / Cubit | 1–4 h | 0.5–2 h | Depends on event/side-effect complexity. |
| Use case + fake repo | 1.5–5 h | 1–3 h | Fakes must match real ports; AI speeds scaffolding. |
| Widget (scoped subtree, fake bloc/repo) | 2–8 h | 1–4 h | Finder stability drives variance. |
| Golden only (a few stable widgets or screens) | 2–6 h | 1–3 h | First-time CI + font pinning not included. |
| Contract / API (first endpoints + fixtures) | 3–12 h | 2–8 h | Faster if OpenAPI or samples exist. |
| Integration smoke (`integration_test`, device or emulator) | 4–16 h | 2.5–10 h | Flakiness tuning often dominates. |
| Flow snapshot (full app + Dio JSON + platform fakes + flow + goldens) | 1–3 days | 0.5–2 days | DI + routing + async + baseline images; AI helps but does not remove integration pain. |
Ongoing cost (one sprint later: small product or API tweak)
| Test category | Manual (typical range) | AI-assisted (typical range) |
|---|---|---|
| Unit / BLoC / use case | 0.25–2 h | 0.15–1.25 h |
| Widget (scoped) | 0.5–3 h | 0.25–2 h |
| Golden | 0.5–4 h | 0.25–3 h (often re-baselining images) |
| Contract / API | 0.5–3 h | 0.25–2 h |
| Integration smoke (`integration_test`) | 1–6 h | 0.5–4 h |
| Flow snapshot | 0.5–2 days | 0.25–1.5 days |
What AI usually does not compress much
- Deciding what to assert, reviewing goldens, diagnosing flakes, aligning fixtures with production or staging truth, and CI parity (fonts, goldens on Linux vs macOS).
Classic authoring versus AI-assisted (Claude, Cursor)
| Topic | Classic | With AI assistance |
|---|---|---|
| First draft of tests and mocks | Slower | Faster boilerplate and case enumeration |
| Flaky test diagnosis | Engineer-led | Still engineer-led; AI may suggest hypotheses |
| Golden review | Human judgment | Human judgment; AI should not “approve” pixels |
| Secrets and restricted env | Controlled manually | Still require policy: no credentials in prompts, offline constraints respected |
| Wrong green tests | Rare if review is strict | Higher risk if assertions are weak—review stays in Definition of Done |
Bottom line for management: AI reduces typing and scaffolding time; it does not remove review, CI stability, or product choices about what must be covered. Use the authoring time tables above when negotiating sprint capacity.
Organizational recommendations
Sprint planning
- Reserve a transparent band of capacity for tests on non-trivial work (for example 10–20%, tuned by team).
- Definition of Done (example bullets):
  - New domain rule or branchy logic → use case or BLoC test.
  - New or risky UI states → widget test (and golden only if agreed).
  - New or changed API usage → contract or fixture test at the SDK/client layer.
  - Bug fix with reproduction → regression test where feasible.

TDD for bugs
- Encourage: regression test first for reproducible bugs in logic or state machines (documents intent, prevents return of the defect).
- Avoid mandating TDD for trivial copy, one-line layout, or pure asset changes—credibility with engineers matters.
Appendix: risks, open decisions, and environment limits
This appendix is not a list of blockers; it is a list of places where strategy meets reality. Each item mixes risk (what goes wrong if ignored), open decision (what leadership or the team must choose), and environment limit (what GitHub Actions, emulators, or OS vendors do not guarantee). Use it when estimating cost, assigning owners, or explaining to stakeholders why a test type “works in principle” but needs guardrails.
Weekly integration without rigor
Context: A policy of “run integration tests weekly” sounds simple. In practice it competes with releases, support, and PTO. If nobody is accountable, the habit dies and confidence drops back to manual-only without anyone updating the strategy doc.
Risk: Silent gaps—you believe smoke runs, but it has not run for weeks; regressions ship; QA is asked to compensate with broader manual passes under time pressure.
What to decide explicitly
- Owner: One named person or rotating role responsible for kick-off, triage of failures, and escalation—not “the team.”
- Device matrix: At minimum, over time, both Android and iOS (or two representative OS versions). Different OEMs still surface different WebView, keyboard, and permission quirks.
- Artifacts: Store logs and failure screenshots (or video) in CI artifacts, a shared drive, or a ticket template so failures are debuggable offline and comparable week to week.
Environment limit: Hosted device farms cost money; local devices depend on who is in the office. Budget and access are management inputs, not engineering-only details.
Golden tests on GitHub Actions
Context: Golden tests rasterize widgets to pixels. Pixel output depends on font metrics, subpixel rendering, theme, device pixel ratio, and Skia/Impeller behavior. GitHub’s ubuntu-latest runners are not pixel-identical to macOS, iOS, or Android.
Risk: PRs fail only because the runner image changed, or because a designer updated a token that was never meant to block merge. Teams then disable goldens or stop updating baselines—losing the benefit entirely.
What to decide explicitly
- Single source of truth: For example, run goldens only on `macos-latest`, or use a pinned Docker image with bundled fonts and a fixed Flutter version—documented in the repo.
- Update policy: Who may approve baseline image updates, and whether large visual diffs require design or QA sign-off.
- Scope: Goldens for design tokens and stable shells first; avoid goldening every screen until the process is trusted.
- Localization: Large translation drops (for example Crowdin OTA or mass `l10n` updates) can invalidate many baselines at once—decide whether goldens run under a fixed test locale and which strings are allowed to appear in golden-covered widgets.
Environment limit: Linux vs macOS font shaping differs; emulator vs physical device differs. “Looks the same to a human” is not the same as “byte-identical PNG.”
API repository hooks
Context: The desirable story is: “API repo changed → client tests run → we know before merge.” That only works if the client repo contains a test suite that speaks the same contract as production (or staging), and if credentials and rate limits are handled.
Risk: A hook that runs a smoke URL against production creates flaky jobs, rate limits, and compliance issues. A hook that runs nothing meaningful trains everyone to ignore red builds.
What to decide explicitly
- Input artifact: Prefer recorded fixtures, OpenAPI snapshots, or a dedicated staging environment with stable seed data over calling arbitrary production URLs from CI.
- Failure semantics: Red means “client must adapt or API must roll back”—agree with backend owners what “breaking” means (response shape, status codes, deprecations).
- Secrets: Where instance URLs and tokens live (GitHub secrets, vault); restricted environments may forbid outbound calls entirely—fixtures-only mode then becomes mandatory.
Environment limit: CI runners may not reach internal APIs without VPN or self-hosted runners; that is an infrastructure decision, not a test style decision.
Notifications, background work, Firebase
Context: The app uses platform channels, local notifications, background services, and Firebase-related tooling for stability and diagnostics. CI is a headless, often non-Google Play Services Linux environment. It does not replicate APNs delivery, exact alarm policies on newer Android, or user permission flows the way a phone does.
Risk: Assuming “green CI means push works” creates false confidence. Real failures appear only in dogfood, staging, or store review.
What to decide explicitly
- Split responsibilities: Automate pure logic (payload JSON parsing, mapping to domain models, idempotency) with unit or contract tests; keep OS integration on a short manual or staging checklist per release (or per notification feature).
- Version matrix: Android notification permission and background execution rules change by OS version—QA and product should know which versions are in support.
Environment limit: Simulating “user dismissed notification” or “app killed then opened from tap” in CI is brittle; budget for targeted manual or device-farm runs for those stories when they change.
Appium (or other external UI drivers)
Context: Flutter’s **integration_test** package drives the app from Dart, shares the same isolate model, and is the default in the ecosystem. Appium (or similar) drives the UI from outside, often across a bridge, and shines when non-Flutter surfaces, system dialogs, or multi-app scenarios dominate.
Risk: Two stacks mean double maintenance, slower feedback, and harder debugging for the majority of flows that Flutter already covers.
What to decide explicitly
- Use **integration_test** until a concrete gap is written down (for example “must validate OS share sheet with a specific third-party app”).
- If Appium is adopted, cap scope to those gaps and fund training and stable device inventory.
Environment limit: For this codebase, the share of UI that requires Appium-style control is small relative to total features; default to **integration_test** first.
Flow snapshot tests (full app + JSON fixtures + goldens)
Context: This pattern gives a cinematic record of a feature under **flutter_test**, but it stacks DI setup, routing, fixture curation, and golden churn in one place.
Risk: The team maintains two sources of truth for JSON—fixtures in widget tests and reality on the server—unless contract tests own the canonical fixtures.
What to decide explicitly
- Cap the number of flow-snapshot suites (for example “at most N active flows”).
- Fixture ownership: Same owners as API client changes, or enforce generation from OpenAPI where possible.
Environment limit: Same as goldens: OS and fonts must be pinned for comparable screenshots.
Local persistence, time, and locale (Drift, preferences, clocks)
Context: The app uses local storage (for example Drift/SQLite, preferences). Tests must control schema migrations, seed data, and clock (DateTime.now(), time zones) or become order-dependent and flaky.
Risk: Tests pass locally and fail in CI at midnight UTC, or fail only when run in parallel.
What to decide explicitly
- Prefer in-memory databases or isolated temp directories per test where supported.
- Inject clock and locale in tests that assert formatting or deadlines.
Environment limit: CI default locale may differ from a developer’s laptop—document or fix locale in tests that format dates and numbers.
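A minimal sketch of clock injection, assuming `package:clock` is added as a dependency and that the hypothetical helper reads `clock.now()` instead of `DateTime.now()`; Drift tests can follow the same spirit by constructing the database with an in-memory executor.

```dart
// test/utils/due_date_test.dart — sketch; isOverdue is a hypothetical helper.
import 'package:clock/clock.dart';
import 'package:flutter_test/flutter_test.dart';

/// Production code reads `clock.now()` so tests can pin the current time.
bool isOverdue(DateTime dueDate) => clock.now().isAfter(dueDate);

void main() {
  test('overdue check does not depend on the wall clock of the CI runner', () {
    withClock(Clock.fixed(DateTime.utc(2024, 1, 15)), () {
      expect(isOverdue(DateTime.utc(2024, 1, 10)), isTrue);
      expect(isOverdue(DateTime.utc(2024, 1, 20)), isFalse);
    });
  });
}
```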
Secrets, compliance, and “restricted environment” work
Context: Some organizations forbid pasting production URLs or tokens into AI tools, block arbitrary outbound network from CI, or require audit trails for test data.
Risk: Engineers skip writing API tests because “CI cannot reach the server,” or leak secrets into logs and golden names.
What to decide explicitly
- A fixture-only path for CI and a staging path for scheduled jobs, both documented.
- Redaction rules for logs and for AI-assisted authoring (no credentials in prompts).
Environment limit: If outbound network is disallowed, contract tests must use checked-in HTTP recordings or static OpenAPI examples—there is no alternative in-process.
Document maintenance
Revisit this strategy after major changes: introduction of golden CI, addition of API contract suite, or onboarding of a device farm. Revisit the appendix decisions (owners, OS pins, fixtures, secrets) when those change. Update the smoke journey list with QA when major flows ship or retire.