# Testing strategy (managers and QA)

This document supports decisions about automated testing, CI, and how QA and engineering share responsibility. It is grounded in the current OpenProject Flutter app repository.

**If you are new here:** read Vision and planned directions first. It states what we want to achieve and why, in plain language. The rest of the document refines that vision with repository facts, priorities, risks, and effort estimates.

## Vision and planned directions (read this first)

This section is for readers who were not part of earlier discussions. It captures **our intended direction**; later sections explain **how that maps to this codebase**, what already exists, and what we recommend adjusting.

### Why automate tests at all

We ship a Flutter client against a **living API** and rich UI. Bugs show up as wrong data, broken flows, visual regressions, or surprises after server changes. The goal is not “maximum tests,” but **predictable quality** and **earlier feedback**, while keeping cost and flake risk under control.

### Direction 1: Feature integration tests (device or emulator)

**Vision:** Run meaningful **end-to-end style** checks on a real app binding, on a **device or emulator**, so we exercise navigation, real layout, and timing closer to production.

**Constraints we accept:** These tests are **expensive** (time, machines, stability). We **do not** assume a device farm in CI today. The plan is **not** to run them on every pull request, but to use them as **scheduled smoke** (for example **once a week**) and/or around release candidates, to catch “glue” regressions and **reduce repetitive manual load on QA** without replacing exploratory testing.

**Tooling stance:** Prefer Flutter’s `integration_test` where possible; reserve heavier tools (for example Appium) for cases where the platform cannot be covered otherwise.
### Direction 2: Widget tests and golden (screenshot) tests

**Vision:** While building a feature, add **widget tests** and, where agreed, **golden (reference image) tests** for the screens or states that matter, so we get **fast** feedback in **GitHub Actions** on layout and critical UI paths.

**Open question we acknowledge:** For those tests, we may **mock BLoC/Cubit state** (or use fakes at the repository boundary) **or** push further and **mock HTTP (Dio) with static JSON**. The document later recommends a **split**: keep most widget/golden tests **free of wire-level JSON**; put **HTTP parsing and contract behavior** in dedicated **API / contract tests** so failures stay interpretable and maintenance stays sane.

### Direction 3: API-related tests without a device

**Vision:** Add tests that behave like **integration checks for the client–API contract** but run on **Linux CI**, with no emulator. Typical forms: **fixture-backed HTTP** tests next to the generated or hand-written client, and/or tests that drive **use cases** with a fake HTTP port, still asserting realistic responses.

**Motivation:** We sometimes ship a broken client because the **API or HAL changed** and we were not aware. Running these suites on every PR (and optionally **triggering workflows when the API repository changes**) is meant to **surface incompatibility early**. This only works if the suite is **deterministic** (fixtures or a stable staging contract); hooks without a real test body create noise, not safety.

### Direction 4: Unit tests, BLoC tests, and small-scope tests

**Vision:** Keep growing **unit tests**, **`bloc_test`-style tests**, and **use case tests** as the default safety net for rules and state machines. They are the cheapest layer and should remain the **bulk** of automated coverage.
### Optional direction: Full-app flow with JSON fixtures and screenshots

Some features may additionally use the **flow snapshot** pattern (full or near-full app under `flutter_test`, static JSON, faked platform APIs, golden milestones). That is **optional and expensive**; use it for a **few** critical journeys. Details appear [below](https://community.openproject.org/#option-full-app-fixture-driven-flow-tests-with-screenshots).

### How we want to work as a team

**Sprint planning:** We intend to **allocate visible time for tests** alongside feature work (for example an agreed percentage or explicit tasks). Management may initially see this as overhead; the argument is that **unplanned quality work** still happens. It is simply deferred, more expensive, and often lands on QA or customers.

**Bugs and TDD:** When a bug is **reproducible** and worth preventing again, we want to **fix it through a test-first or test-accompanying workflow** where that is practical. This is not a dogma for every one-line change, but a **default for real regressions**.

### How the rest of this document is organized

| After this section | You will find |
| --- | --- |
| Executive summary | Short decisions and asks for leadership and QA. |
| Glossary | What we mean by each test type. |
| Current baseline | What already exists in this repo (CI, tests, gaps). |
| Later sections | ROI ordering, bootstrap vs ROI, mock boundaries, effort tables, risks, and QA handoff. |

## Executive summary

1. **Goal:** Improve release quality and reduce surprise breakages, especially when the API or shared contracts change, without drowning the team in flaky or redundant tests.
2. **Current constraint:** GitHub Actions runs `flutter test` on Ubuntu only. Device and emulator integration tests are **not** in CI today; they are suitable for a **weekly smoke** lane or pre-release, not for every PR.
3. **Highest ROI next step:** Add **contract or fixture-backed HTTP tests** next to the API client layer (`packages/openproject_api_sdk`). That directly targets “the server changed and the client broke without us knowing.” Widget tests with mocked Dio are **not** a substitute.
4. **What already works:** The repo has substantial **unit, BLoC, and use case** coverage under `test/`. The strategy is to **extend** proven layers, not restart from zero.
5. **Organizational ask:** Treat test work as **part of delivery** in sprint planning (transparent capacity band and Definition of Done). Use **TDD-style regression tests** for reproducible bugs; avoid dogmatic TDD for trivial changes.

**Decisions for management**

* Approve a **small recurring capacity band** (for example 10–20% of engineering time on non-trivial stories) for tests that match the DoD below.
* Approve **ownership and cadence** for weekly on-device smoke (who runs it, which devices, where results are recorded).
* Approve **one CI strategy for golden tests** (pinned OS or runner) if goldens are adopted, to avoid endless screenshot churn.

**Decisions for QA**

* Agree which **journeys** are in the weekly automated smoke set versus what remains **manual exploratory** testing.
* Agree how **API or environment incidents** are distinguished from **app regressions** when smoke fails.

## Audience and how to use this doc

| Audience | Use |
| --- | --- |
| Managers | Trade-offs (cost, schedule, risk), CI expectations, why sprint capacity for tests is rational. |
| QA | What automation can replace or narrow, weekly smoke scope, where to log gaps when coverage does not exist yet. |

## Glossary (short)

| Term | Meaning here |
| --- | --- |
| Unit test | Fast, no I/O; pure functions, parsers, small helpers. |
| BLoC / Cubit test | State transitions and side-effect ordering with controlled fakes (`bloc_test`). |
| Use case test | Application service orchestration with fake repositories or ports, not real network. |
| Widget test | Pumps widgets under `flutter_test`; usually with injected fakes or fixed bloc states. |
| Golden test | Compares rasterized output to a reference image; great for stable design surfaces, sensitive to fonts and OS. |
| Integration test (`integration_test`) | Runs a real app binding; often on device or emulator; higher cost and flake risk. |
| Contract / API test | Asserts HTTP shapes, status codes, and parsing against fixtures or a staging API; catches server drift. |
| Flow snapshot test (optional pattern) | Full-app widget test: pump near-complete app, Dio (or client) returns static JSON from `test/`, platform dependencies faked, scripted user actions, golden screenshots at key states. See Option: Full-app fixture-driven flow tests with screenshots. |

## Current baseline in this repository

Facts you can cite when prioritizing work:

| Area | Baseline |
| --- | --- |
| CI | `.github/workflows/ci-cd.yml` runs `flutter test` on `ubuntu-latest`. There is no `integration_test` job and no golden-specific job. iOS jobs build IPA but do not add extra automated test layers beyond the shared Flutter test job pattern. |
| Integration tests | `integration_test/sign_in_test.dart` plus mocks under `integration_test/mocks/` show a viable pattern: real widget integration with mocked auth boundaries, good for smoke without Appium. |
| Unit / BLoC / use case | Many tests under `test/` (blocs, use cases, forms, utils). This is an existing strength. |
| Golden tests | No `matchesGoldenFile` usage observed in `test/` or `lib/` at the time of writing. Goldens would be a new investment with CI policy implications. |
| API SDK tests | `packages/openproject_api_sdk/test/openproject_api_sdk_test.dart` is effectively a stub: a high-leverage gap for API drift detection. |
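
To make the API SDK gap concrete, here is a minimal sketch of what a fixture-backed parsing check could look like. It is an illustration under stated assumptions, not the SDK's actual code: `WorkPackageDto`, `parseWorkPackage`, `check`, and the inline JSON payload are hypothetical, and a real suite would more likely load fixtures from files and use the team's chosen test runner. It uses only `dart:convert`, so it runs on Linux CI without an emulator.

```dart
import 'dart:convert';

// Hypothetical fixture. In a real suite this would probably be a JSON file
// checked in next to the SDK tests; the payload shape here is illustrative.
const workPackageFixture = '''
{
  "_type": "WorkPackage",
  "id": 42,
  "subject": "Fix login redirect",
  "_links": {
    "status": {"href": "/api/v3/statuses/1", "title": "New"}
  }
}
''';

/// Hypothetical DTO, not the SDK's real model.
class WorkPackageDto {
  WorkPackageDto({
    required this.id,
    required this.subject,
    required this.statusTitle,
  });

  final int id;
  final String subject;
  final String statusTitle;

  /// Fails loudly when a field the client depends on is missing or retyped.
  factory WorkPackageDto.fromJson(Map<String, dynamic> json) {
    final links = json['_links'];
    if (json['id'] is! int || json['subject'] is! String || links is! Map) {
      throw const FormatException('Work package payload lacks required fields');
    }
    final status = links['status'];
    if (status is! Map || status['title'] is! String) {
      throw const FormatException('Work package payload lacks a status title');
    }
    return WorkPackageDto(
      id: json['id'] as int,
      subject: json['subject'] as String,
      statusTitle: status['title'] as String,
    );
  }
}

WorkPackageDto parseWorkPackage(String body) =>
    WorkPackageDto.fromJson(jsonDecode(body) as Map<String, dynamic>);

/// Tiny assertion helper so the sketch runs without a test package.
void check(bool condition, String message) {
  if (!condition) throw StateError('Contract check failed: $message');
}

void main() {
  // Happy path: the recorded fixture still parses into the fields we use.
  final dto = parseWorkPackage(workPackageFixture);
  check(dto.id == 42, 'id');
  check(dto.subject == 'Fix login redirect', 'subject');
  check(dto.statusTitle == 'New', 'status title');

  // Drift case: a retyped id and a missing status link must be rejected,
  // so server-side changes surface in CI instead of in production.
  var driftDetected = false;
  try {
    parseWorkPackage('{"id": "42", "subject": "x", "_links": {}}');
  } on FormatException {
    driftDetected = true;
  }
  check(driftDetected, 'drifted payload should not parse');
}
```

The design choice worth noting is the explicit drift case: a suite that only asserts the happy path passes for as long as the fixture is never refreshed, whereas a negative case documents which server changes the client treats as breaking.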

## Feature breadth versus “needs device or OS”

**Feature areas:** There are **21** top-level feature directories under `lib/app/features/` (for example auth, work\_package, home, time\_tracking, shared\_files, notifications, onboarding, projects, user).

**Platform-sensitive dependencies** (from `pubspec.yaml`) include OAuth (`flutter_appauth`), deep links (`app_links`), share-in (`receive_sharing_intent`), `image_picker`, `file_picker`, `video_player`, `home_widget`, `live_activities`, `flutter_local_notifications`, `flutter_background_service`, `flutter_inappwebview`, `gal`, `permission_handler`, Drift/SQLite, secure storage, and others.

**Code touchpoints:** A search for direct use of the heaviest plugins (for example `ImagePicker`, `FilePicker`, `home_widget`, local notifications, background service, sharing, `Gal`, `flutter_appauth`, `AppLinks`) lands on **roughly 20–25 Dart files** under `lib/`. Many of these are **shared services** consumed by several features, not 25 separate products.

**Interpretation**

* Only a **minority of files** sit on the “hard to fake completely” boundary, but they underpin **cross-cutting flows**: sign-in and callbacks, attachments, notifications, home screen widgets, deep links, rich editor embeds.
* **On-device or weekly smoke** remains important for those flows; it is **not** true that most of the app must be covered by expensive E2E.
* **Rule of thumb for stakeholders:** ~21 feature **areas**; **several** critical journeys benefit from scheduled smoke; the **majority** of business rules and UI states are cheaper to cover with use case, BLoC, and widget tests **if** boundaries stay clean.
```mermaid
flowchart TB
  subgraph fastCI [Fast CI - GitHub Actions]
    unit[Unit utils parsers]
    bloc[BLoC Cubit bloc_test]
    usecase[Use cases fake repos]
    widget[Widget tests]
    golden[Golden tests optional]
    contract[API contract or fixture HTTP tests]
  end
  subgraph slowManual [Weekly or manual]
    integ[integration_test on device or emulator]
    smoke[Smoke subset of critical journeys]
  end
  subgraph hardE2E [Rare deep E2E if needed]
    appium[Only if Flutter integration_test insufficient]
  end
  unit --> bloc
  bloc --> usecase
  usecase --> widget
  widget --> golden
  contract --> usecase
  integ --> smoke
```

## Planned test types and where they fit

### 1. Integration testing on device or emulator (weekly smoke)

**Intent:** Catch integration issues that unit tests miss, and reduce repetitive manual checks for QA, not duplicate every manual case.

**Reality:** Expensive in time and infrastructure; **no device farm in CI** today.

**Recommendation:** Run a **small fixed set** of `integration_test` journeys **once per week** (or per release candidate), with a named owner and published pass/fail.

**Suggested smoke scope (5–10 journeys, illustrative):**

* Enter instance URL and complete sign-in path (already aligned with `integration_test/sign_in_test.dart`).
* Open a work package from a list into details (navigation + data binding).
* Attachment flow: pick or attach, using **fakes** at the file/camera boundary where possible.
* Deep link or cold start into a known route (if stable test URLs exist).
* Optional: one path that touches **WebView** or rich editor only if flakiness stays acceptable.

**QA handoff:** Weekly smoke **narrows** regression risk on “whole app glue”; it does **not** remove the need for exploratory testing on new features or visual polish.

### 2. Widget tests and golden tests (CI-friendly)

**Widget tests:** Prefer **fake repositories** or **fixed Cubit/BLoC states** so failures localize to layout and interaction, not wire format.
**Golden tests:** Best for **design-system** widgets and **stable** screens. Plan for **maintenance cost** when typography, themes, or locales change. **CI must use a single pinned environment** for goldens (see Appendix).

**BLoC mocked versus Dio mocked in widget tests**

| Approach | When to use | Trade-off |
| --- | --- | --- |
| Mock / fake BLoC or fixed state | Most widget and golden tests | Fast, stable; does not prove HTTP parsing. |
| Fake repository at domain boundary | When the widget depends on orchestration outcomes | More logic than a dumb bloc mock, still avoids JSON in every test. |
| Mock Dio deep in the tree | Rarely | Couples UI tests to serialization; noisy failures; prefer dedicated API tests instead. |
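
As an illustration of the “fake repository at domain boundary” row, the following sketch drives a use case through a hand-written fake port. All names (`WorkPackage`, `WorkPackageRepository`, `FakeWorkPackageRepository`, `LoadOpenWorkPackages`, `check`) are hypothetical and chosen for the example; they are not taken from this repository. The sketch uses only core Dart so it runs at unit speed, with no JSON and no HTTP client involved.

```dart
/// Hypothetical domain model for the example.
class WorkPackage {
  const WorkPackage(this.id, this.subject, {required this.isClosed});

  final int id;
  final String subject;
  final bool isClosed;
}

/// Hypothetical port that the real app would implement on top of HTTP.
abstract class WorkPackageRepository {
  Future<List<WorkPackage>> fetchAssignedToMe();
}

/// Fake at the domain boundary: returns canned domain objects and records
/// how often it was called, without touching serialization.
class FakeWorkPackageRepository implements WorkPackageRepository {
  FakeWorkPackageRepository(this._items);

  final List<WorkPackage> _items;
  int calls = 0;

  @override
  Future<List<WorkPackage>> fetchAssignedToMe() async {
    calls++;
    return _items;
  }
}

/// Hypothetical use case: keep open work packages, sorted by subject.
class LoadOpenWorkPackages {
  LoadOpenWorkPackages(this._repository);

  final WorkPackageRepository _repository;

  Future<List<WorkPackage>> call() async {
    final all = await _repository.fetchAssignedToMe();
    return all.where((wp) => !wp.isClosed).toList()
      ..sort((a, b) => a.subject.compareTo(b.subject));
  }
}

/// Tiny assertion helper so the sketch runs without a test package.
void check(bool condition, String message) {
  if (!condition) throw StateError('Use case check failed: $message');
}

Future<void> main() async {
  final repository = FakeWorkPackageRepository(const [
    WorkPackage(1, 'Write release notes', isClosed: false),
    WorkPackage(2, 'Archive old sprint', isClosed: true),
    WorkPackage(3, 'Fix login redirect', isClosed: false),
  ]);

  final result = await LoadOpenWorkPackages(repository)();

  // Closed items are filtered out and the rest are ordered by subject.
  check(result.length == 2, 'closed work packages are excluded');
  check(result[0].id == 3 && result[1].id == 1, 'sorted by subject');
  // The use case hits the port exactly once per invocation.
  check(repository.calls == 1, 'repository called once');
}
```

Because the fake speaks in domain objects, a failure here points at orchestration logic rather than wire format, which is the separation the table argues for.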

**More “logic covered” without device tests:** Put that in **use case tests** with a **fake port** (same idea as `integration_test/mocks/mock_auth_client.dart` at unit speed).

### 3. API tests without device (GitHub Actions)

**Intent:** Detect client breakage when the **API or HAL** changes.

**Recommendation:** Implement **fixture-backed** or **staging-backed** tests in or beside `packages/openproject_api_sdk`: status codes, error bodies, parsing, and critical query shapes. Optionally trigger workflows when the API repository changes, but **only after** a minimal runnable suite exists; otherwise hooks create noise without signal.

**Do not** rely on “run every BLoC” as a proxy for API correctness; BLoCs with mocks prove **app logic**, not **server truth**.

### 4. Unit tests and bloc tests

Continue as the **default** for new rules and regressions. This layer already exists; extend it for every **non-trivial** state machine and use case branch.

### Option: Full-app fixture-driven flow tests with screenshots

This is an **optional** pattern, not the default for every feature, for when the team wants a **single automated “snapshot”** of a flow (navigation + parsing + UI) under CI without a device.

**Shape of the approach**

* Pump **most or all of the app** (real `MaterialApp` / router / `GetIt` where practical).
* Replace HTTP with **Dio (or HTTP client) adapters** that return bodies from **static JSON files** under `test/` (or a shared fixture package).
* **Fake or mock platform plugins** (auth browser, file picker, notifications, etc.) so the test stays on `flutter test`.
* Drive the **same user actions** as in a manual flow (`tap`, `enterText`, `scroll`) with stable finders or `Key`s.
* Capture **golden images** (or equivalent screenshots) at **milestone states** to form a visual snapshot of the feature.

**When it helps**

* Regressions in **wired flows** you scripted (navigation, bloc orchestration, mapping JSON → UI).
* **Unintended UI changes** at captured steps (strong signal if goldens are reviewed).
* **Parsing and field usage** for the **exact JSON** you checked in; only as accurate as fixtures are kept.

**Costs and risks**

* **High** authoring cost (DI, routing, splash/onboarding, timing). **High** maintenance when routes, l10n, theme, or **fixtures** change. Risk of **duplicate truth** if the same JSON is not aligned with **SDK contract tests**; prefer **one fixture source** or generate from OpenAPI where possible.
* Does **not** replace **contract tests at the API layer** for unknown server changes; fixtures that nobody updates create **false confidence**.
* Does **not** catch **real OS or plugin** behavior (everything is mocked).

Use this pattern for **few high-value journeys** (for example one happy path per critical feature), not as the only test type for every screen.

## Protection against breaking changes (by test category)

“Breaking change” here means anything that could ship a bad build: wrong data, crash, wrong UI, broken navigation, incompatible API, or environment-specific failure.

| Category | API / JSON contract drift | App logic & state | UI layout & visuals | Navigation & deep links | Real device / OS / plugins |
| --- | --- | --- | --- | --- | --- |
| Unit | Low | Low (local only) | None | None | None |
| BLoC / Cubit | Low (unless bloc parses raw JSON) | High for covered transitions | Low | Low | None |
| Use case + fake repo | Low at HTTP edge | High for orchestration | None | Low | None |
| Widget (scoped, fake state) | Low | Medium (binding only) | High for pumped screens | Medium | None |
| Golden (selective) | Low | Low | High at snapshot points | Low | None |
| Contract / API tests (fixtures or staging) | High | Low | None | Low | Low (staging deps) |
| Flow snapshot (full app + JSON + goldens) | Medium (only if fixtures match reality) | High for scripted paths | High at milestones | High for scripted paths | Low |
| `integration_test` on device | Medium–high (if real backend) | High | Medium | High | High for covered plugins |

**Reading the table:** No single layer scores high everywhere. **Combine** contract tests (server truth), BLoC/use case (rules), scoped widget or goldens (UI), and a **small** set of device smokes or flow snapshots for glue.

## What gives the most benefit without “testing for testing’s sake”

Ordered by **defect prevention per minute of maintenance** for this codebase:

1. **Contract or fixture-backed API tests:** addresses unaware API-side breaks; lives next to HTTP/DTO code.
2. **BLoC and use case tests:** fast feedback on business rules and orchestration.
3. **Widget tests:** non-trivial UI, errors, empty states, accessibility fixes where applicable.
4. **Golden tests:** selective; design system and stable screens; needs CI discipline.
5. **Weekly `integration_test` smoke:** small journey set; human-process ownership.
6. **Flow snapshot tests (optional):** use sparingly for **critical** end-to-end UI stories under `flutter test`; pair with **API contract tests** so JSON fixtures do not drift silently.

## Suggestions: ROI order, bootstrap order, limited time

**Why flow snapshot is last in the numbered list:** That list ranks **return per minute of maintenance** (and CI fit), not “business unimportance.” Flow snapshots combine **DI, routing, fixtures, and goldens** and overlap **API** and **visual** checks unless limited to very few journeys. Prefer them **after** cheaper layers, or only where one scripted story is worth that cost.

**Two orderings (do not confuse them)**

* **ROI / next-hour investment:** Favor contract/API, BLoC/use case, scoped widget, then smoke, then optional flow snapshot. This is sensible when choosing **one** extra layer.
* **Bootstrap / delivery sequence:** Starting with **wider** tests (for example weekly smoke, a thin `integration_test` slice) can help **momentum, learning, and visible wins**; it is **not** the long-term inverted pyramid. **Tighten** with smaller tests around code that actually breaks.
**If there is no time for “everything” (digest)**

1. Fixture or contract tests at the HTTP client: **server drift**.
2. BLoC and use case tests: **rules and branches**.
3. A **small** device smoke set: **glue**.
4. Scoped widget tests: **risky UI states**.
5. Goldens: **only** with pinned CI and agreed scope.
6. Flow snapshots: **few** critical stories, not a duplicate of the whole pyramid.

**Bug-driven deepening (works with “big tests first”)**

You may prioritize **wide** tests early; still add **unit / BLoC / use case** (or a small widget test) **when fixing a bug or touching non-trivial logic**, so that cheap regression nets grow on real pain. Use **bugs and meaningful edits** as triggers, not vague “when we have time.” For **new** API or domain behavior, add at least a **contract or use case** happy path (not only tests born from defects) so happy paths do not stay untested.

## Effort and maintenance (qualitative scale)

| Layer | Effort to write | Effort to maintain | Flake risk | CI fit today |
| --- | --- | --- | --- | --- |
| Unit | Low | Low | Very low | Excellent |
| BLoC | Low–medium | Low | Low | Excellent |
| Use case + fake repo | Medium | Low–medium | Low | Excellent |
| Widget | Medium | Medium | Low | Good |
| Golden | Medium–high | High (design/locale) | Medium | Good if OS pinned |
| `integration_test` on device | High | Medium | Medium–high | Poor without device CI |
| Cross-repo API hook | Medium setup | Low per change if automated from OpenAPI/fixtures | Low | Excellent once suite exists |
| Flow snapshot (full app + fixtures + goldens) | Very high | High | Medium–high | Good if OS pinned for goldens |

## Authoring time: manual versus AI-assisted (Cursor, Claude)

The ranges below are **indicative engineer-time** for **one medium-complexity feature** in this codebase (significant UI + state + a few API calls). They are not estimates for legal or procurement use; teams vary with familiarity, DI shape, and how stable finders are.

**Assumptions:** AI = strong prompting + human review and correction; restricted environments may **add** time for redaction and offline iteration.

### Initial coverage (first time you add tests for that feature)

| Test category | Manual (typical range) | AI-assisted (typical range) | Notes |
| --- | --- | --- | --- |
| Unit (helpers, parsers) | 0.25–1.5 h | 0.15–0.75 h | AI excels at table-driven cases. |
| BLoC / Cubit | 1–4 h | 0.5–2 h | Depends on event/side-effect complexity. |
| Use case + fake repo | 1.5–5 h | 1–3 h | Fakes must match real ports; AI speeds scaffolding. |
| Widget (scoped subtree, fake bloc/repo) | 2–8 h | 1–4 h | Finder stability drives variance. |
| Golden only (a few stable widgets or screens) | 2–6 h | 1–3 h | First-time CI + font pinning not included. |
| Contract / API (first endpoints + fixtures) | 3–12 h | 2–8 h | Faster if OpenAPI or samples exist. |
| `integration_test` journey (device/emulator) | 4–16 h | 2.5–10 h | Flakiness tuning often dominates. |
| Flow snapshot (full app + Dio JSON + platform fakes + flow + goldens) | 1–3 days | 0.5–2 days | DI + routing + async + baseline images; AI helps but does not remove integration pain. |

### Ongoing cost (one sprint later: small product or API tweak)

| Test category | Manual (typical range) | AI-assisted (typical range) |
| --- | --- | --- |
| Unit / BLoC / use case | 0.25–2 h | 0.15–1.25 h |
| Widget (scoped) | 0.5–3 h | 0.25–2 h |
| Golden | 0.5–4 h | 0.25–3 h (often re-baselining images) |
| Contract / API | 0.5–3 h | 0.25–2 h |
| `integration_test` | 1–6 h | 0.5–4 h |
| Flow snapshot | 0.5–2 days | 0.25–1.5 days |

**What AI usually does not compress much**

* Deciding **what** to assert, **reviewing** goldens, **diagnosing** flakes, aligning **fixtures** with production or staging truth, and **CI** parity (fonts, goldens on Linux vs macOS).

## Classic authoring versus AI-assisted (Claude, Cursor)

| Topic | Classic | With AI assistance |
| --- | --- | --- |
| First draft of tests and mocks | Slower | Faster boilerplate and case enumeration |
| Flaky test diagnosis | Engineer-led | Still engineer-led; AI may suggest hypotheses |
| Golden review | Human judgment | Human judgment; AI should not “approve” pixels |
| Secrets and restricted env | Controlled manually | Still require policy: no credentials in prompts, offline constraints respected |
| Wrong green tests | Rare if review is strict | Higher risk if assertions are weak; review stays in Definition of Done |

**Bottom line for management:** AI reduces **typing and scaffolding** time; it does **not** remove **review**, **CI stability**, or **product choices** about what must be covered. Use the [authoring time tables](https://community.openproject.org/#authoring-time-manual-versus-ai-assisted-cursor-claude) above when negotiating sprint capacity.

## Organizational recommendations

**Sprint planning**

* Reserve a **transparent band** of capacity for tests on non-trivial work (for example **10–20%**, tuned by team).
* **Definition of Done (example bullets):**
  * New domain rule or branchy logic → use case or BLoC test.
  * New or risky UI states → widget test (and golden only if agreed).
  * New or changed API usage → contract or fixture test at the SDK/client layer.
  * Bug fix with reproduction → regression test where feasible.

**TDD for bugs**

* **Encourage:** regression test **first** for reproducible bugs in logic or state machines (documents intent, prevents return of the defect).
* **Avoid mandating** TDD for trivial copy, one-line layout, or pure asset changes; credibility with engineers matters.

## Appendix: risks, open decisions, and environment limits

This appendix is not a list of blockers; it is a list of **places where strategy meets reality**. Each item mixes **risk** (what goes wrong if ignored), **open decision** (what leadership or the team must choose), and **environment limit** (what GitHub Actions, emulators, or OS vendors do not guarantee). Use it when estimating cost, assigning owners, or explaining to stakeholders why a test type “works in principle” but needs guardrails.

### Weekly integration without rigor

**Context:** A policy of “run integration tests weekly” sounds simple. In practice it competes with releases, support, and PTO. If nobody is accountable, the habit dies and **confidence drops back to manual-only** without anyone updating the strategy doc.
**Risk:** Silent gaps: you believe smoke runs, but it has not run for weeks; regressions ship; QA is asked to compensate with broader manual passes under time pressure.

**What to decide explicitly**

* **Owner:** One named person or rotating role responsible for kick-off, triage of failures, and escalation, not “the team.”
* **Device matrix:** At minimum, over time, both **Android** and **iOS** (or two representative OS versions). Different OEMs still surface different WebView, keyboard, and permission quirks.
* **Artifacts:** Store logs and failure screenshots (or video) in CI artifacts, a shared drive, or a ticket template so failures are **debuggable offline** and comparable week to week.

**Environment limit:** Hosted device farms cost money; local devices depend on who is in the office. Budget and access are management inputs, not engineering-only details.

### Golden tests on GitHub Actions

**Context:** Golden tests rasterize widgets to pixels. Pixel output depends on **font metrics**, **subpixel rendering**, **theme**, **device pixel ratio**, and **Skia/Impeller** behavior. GitHub’s `ubuntu-latest` runners are not pixel-identical to macOS, iOS, or Android.

**Risk:** PRs fail only because the runner image changed, or because a designer updated a token that was never meant to block merge. Teams then **disable goldens** or stop updating baselines, losing the benefit entirely.

**What to decide explicitly**

* **Single source of truth:** For example run goldens only on `macos-latest`, or use a **pinned** Docker image with **bundled fonts** and a fixed Flutter version, documented in the repo.
* **Update policy:** Who may approve baseline image updates, and whether large visual diffs require **design or QA** sign-off.
* **Scope:** Goldens for **design tokens and stable shells** first; avoid goldening every screen until the process is trusted.
* **Localization:** Large translation drops (for example Crowdin OTA or mass `l10n` updates) can invalidate **many** baselines at once. Decide whether goldens run under a **fixed test locale** and which strings are allowed to appear in golden-covered widgets.

**Environment limit:** Linux vs macOS font shaping differs; emulator vs physical device differs. “Looks the same to a human” is not the same as “byte-identical PNG.”

### API repository hooks

**Context:** The desirable story is: “API repo changed → client tests run → we know before merge.” That only works if the client repo contains a **test suite that speaks the same contract** as production (or staging), and if credentials and rate limits are handled.

**Risk:** A hook that runs a **smoke URL against production** creates flaky jobs, rate limits, and compliance issues. A hook that runs **nothing meaningful** trains everyone to ignore red builds.

**What to decide explicitly**

* **Input artifact:** Prefer **recorded fixtures**, **OpenAPI snapshots**, or a **dedicated staging** with stable seed data over calling arbitrary production URLs from CI.
* **Failure semantics:** Red means “client must adapt or API must roll back.” Agree with backend owners what **breaking** means (response shape, status codes, deprecations).
* **Secrets:** Where instance URLs and tokens live (GitHub secrets, vault); **restricted environments** may forbid outbound calls entirely, in which case fixtures-only mode becomes mandatory.

**Environment limit:** CI runners may not reach internal APIs without VPN or self-hosted runners; that is an infrastructure decision, not a test style decision.

### Notifications, background work, Firebase

**Context:** The app uses platform channels, local notifications, background services, and Firebase-related tooling for stability and diagnostics. CI is a **headless, often non-Google Play Services** Linux environment.
It does not replicate **APNs** delivery, **exact alarm** policies on newer Android, or user permission flows the way a phone does.

**Risk:** Assuming “green CI means push works” creates false confidence. Real failures appear only in **dogfood**, **staging**, or **store review**.

**What to decide explicitly**

* **Split responsibilities:** Automate **pure logic** (payload JSON parsing, mapping to domain models, idempotency) with unit or contract tests; keep **OS integration** on a **short manual or staging checklist** per release (or per notification feature).
* **Version matrix:** Android notification permission and background execution rules change by OS version; QA and product should know which versions are **in support**.

**Environment limit:** Simulating “user dismissed notification” or “app killed then opened from tap” in CI is brittle; budget for targeted manual or device-farm runs for those stories when they change.

### Appium (or other external UI drivers)

**Context:** Flutter’s `integration_test` package drives the app from Dart, shares the same isolate model, and is the default in the ecosystem. **Appium** (or similar) drives the UI from outside, often across a bridge, and shines when **non-Flutter surfaces**, **system dialogs**, or **multi-app** scenarios dominate.

**Risk:** Two stacks mean **double maintenance**, slower feedback, and harder debugging for the majority of flows that Flutter already covers.

**What to decide explicitly**

* Use `integration_test` until a concrete gap is written down (for example “must validate OS share sheet with a specific third-party app”).
* If Appium is adopted, **cap scope** to those gaps and fund training and stable device inventory.

**Environment limit:** For this codebase, the share of UI that **requires** Appium-style control is **small** relative to total features; default to `integration_test` first.
### Flow snapshot tests (full app + JSON fixtures + goldens)

**Context:** This pattern gives a cinematic record of a feature under `flutter_test`, but it stacks **DI setup**, **routing**, **fixture curation**, and **golden churn** in one place.

**Risk:** The team maintains **two sources of truth** for JSON (fixtures in widget tests and reality on the server) unless contract tests own the canonical fixtures.

**What to decide explicitly**

* **Cap** the number of flow-snapshot suites (for example “at most N active flows”).
* **Fixture ownership:** Same owners as API client changes, or enforce generation from OpenAPI where possible.

**Environment limit:** Same as goldens: OS and fonts must be pinned for comparable screenshots.

### Local persistence, time, and locale (Drift, preferences, clocks)

**Context:** The app uses **local storage** (for example Drift/SQLite, preferences). Tests must control **schema migrations**, **seed data**, and **clock** (`DateTime.now()`, time zones) or become order-dependent and flaky.

**Risk:** Tests pass locally and fail in CI at midnight UTC, or fail only when run in parallel.

**What to decide explicitly**

* Prefer **in-memory** databases or isolated temp directories per test where supported.
* Inject **clock** and **locale** in tests that assert formatting or deadlines.

**Environment limit:** CI default locale may differ from a developer’s laptop; document or fix locale in tests that format dates and numbers.

### Secrets, compliance, and “restricted environment” work

**Context:** Some organizations forbid pasting production URLs or tokens into AI tools, block arbitrary outbound network from CI, or require audit trails for test data.

**Risk:** Engineers skip writing API tests because “CI cannot reach the server,” or leak secrets into logs and golden names.

**What to decide explicitly**

* A **fixture-only** path for CI and a **staging** path for scheduled jobs, both documented.
* Redaction rules for logs and for AI-assisted authoring (no credentials in prompts).

**Environment limit:** If outbound network is disallowed, **contract tests must use checked-in HTTP recordings** or static OpenAPI examples; there is no alternative in-process.

## Document maintenance

Revisit this strategy after major changes: introduction of golden CI, addition of API contract suite, or onboarding of a device farm. Revisit the [appendix decisions](https://community.openproject.org/#appendix-risks-open-decisions-and-environment-limits) (owners, OS pins, fixtures, secrets) when those change. Update the **smoke journey list** with QA when major flows ship or retire.