Testing and CI for Android Engineers (2026)
In short
Android testing at modern teams in 2026 is a layered pyramid: a wide JVM-based base of JUnit5 unit tests (Mockk for mocking, Turbine for Flow assertions, Truth for fluent matchers), a thick middle of Compose UI tests using createComposeRule with semantics matchers and Robolectric for shadowed Android dependencies, a screenshot-test layer using Paparazzi (pure JVM, no emulator) or Roborazzi (Robolectric-driven Compose snapshots) gated in CI, and a thin top of instrumented end-to-end tests on real devices via Firebase Test Lab. CI runs on GitHub Actions with a matrix of API levels, ABIs, and Compose previews, caching Gradle and the AVD snapshot for fast feedback. The senior bar: every PR runs the unit + screenshot suite in under 10 minutes on emulator-free runners, and a nightly Test Lab run validates a curated device matrix before release.
Key takeaways
- JUnit5 (Jupiter) is the modern unit-test runtime on Android in 2026. It supports parameterized tests, dynamic tests, nested classes, and lifecycle extensions cleanly. The official Android testing docs (developer.android.com/training/testing) cover the general Gradle test wiring; JUnit5 itself is wired into the JVM (local unit test) source set on AGP 8.x, typically through the community de.mannodermaus.android-junit5 plugin, and needs no instrumentation.
- Mockk (github.com/mockk/mockk) is the canonical Kotlin mocking library: idiomatic DSL, coroutine-aware (coEvery, coVerify), supports object/static/final mocking out of the box, and integrates with relaxed mocks so you only stub what matters. Mockito-Kotlin is acceptable but Mockk is the senior default.
- Turbine (github.com/cashapp/turbine) is the canonical way to assert on Kotlin Flows in tests. flow.test { awaitItem() } gives you a deterministic, sequential assertion API for cold and hot flows, replacing flaky collect-with-delay patterns. Pair it with runTest from kotlinx-coroutines-test for virtual time control.
- Compose UI testing uses createComposeRule (or createAndroidComposeRule when you need an Activity). Assert with semantics: onNodeWithText, onNodeWithTag, onNodeWithContentDescription. Use semantics modifiers (Modifier.semantics, Modifier.testTag) to expose state without leaking implementation. Full reference: developer.android.com/jetpack/compose/testing.
- Robolectric runs Android instrumentation-style tests on the JVM by shadowing framework classes. In 2026, Robolectric 4.14+ supports Compose, AndroidX Test, and SDK 35. It's the bridge between pure JUnit and emulator-based tests: fast like the JVM, real enough like instrumentation.
- Screenshot testing has two production-ready options: Paparazzi (cashapp/paparazzi) renders Compose to PNG on the JVM with no emulator using LayoutLib, and Roborazzi (takahirom/roborazzi) leverages Robolectric to snapshot Compose with full semantics support. Paparazzi is faster; Roborazzi is more accurate for animation/state-dependent UI.
- Firebase Test Lab (firebase.google.com/docs/test-lab) runs instrumented tests on a curated matrix of physical and virtual devices in the cloud. The senior pattern: PR-tier runs Paparazzi + Robolectric on GitHub Actions (no emulator needed, <10 min); nightly/release tier runs Espresso + UI Automator on Test Lab against a 6-device matrix.
- GitHub Actions for Android: cache ~/.gradle/caches and ~/.gradle/wrapper aggressively, use reactivecircus/android-emulator-runner only when you actually need an emulator, run unit + screenshot tests on ubuntu-latest in parallel matrix jobs, and gate merges on the screenshot diff artifact. Bitrise is the alternative for teams that want managed Mac/Android infrastructure with deeper iOS overlap.
Unit testing modern Android: JUnit5 + Mockk + Turbine
The unit-test layer is where 70-80% of an Android codebase's tests should live: pure JVM, no Android framework dependencies, runs in milliseconds. The canonical 2026 stack is JUnit5 (Jupiter) for the runtime, Mockk for mocking, Turbine for Flow assertions, Truth or AssertK for fluent matchers, and kotlinx-coroutines-test for deterministic virtual time. The official Android testing fundamentals page (developer.android.com/training/testing) covers the Gradle wiring; everything below assumes AGP 8.x with the JUnit5 plugin applied.
JUnit5 over JUnit4 buys you parameterized tests without external libraries, lifecycle extensions (instead of fragile @Rule chains), nested test classes for grouping setup, and dynamic tests for table-driven cases. The migration cost is small — most JUnit4 tests work with @Rule converted to @ExtendWith — and the readability win is large.
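As a rough illustration of the parameterized and nested features mentioned above, here is a minimal sketch. The formatSalary function and the test class name are hypothetical, invented only for this example, and the sketch assumes the junit-jupiter-params artifact is on the test classpath.

```kotlin
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Nested
import org.junit.jupiter.api.Test
import org.junit.jupiter.params.ParameterizedTest
import org.junit.jupiter.params.provider.CsvSource

// Hypothetical function under test: formats a salary range (in thousands) for display.
fun formatSalary(minK: Int, maxK: Int): String =
    if (minK == maxK) "${minK}k" else "${minK}k-${maxK}k"

class FormatSalaryTest {

    // Each CSV row becomes one test invocation; JUnit converts the columns to Int and String.
    @ParameterizedTest(name = "{0}..{1} -> {2}")
    @CsvSource("120, 150, 120k-150k", "90, 90, 90k")
    fun `formats range`(minK: Int, maxK: Int, expected: String) {
        assertEquals(expected, formatSalary(minK, maxK))
    }

    // Nested classes must be declared `inner` in Kotlin so JUnit can instantiate them.
    @Nested
    inner class WhenBoundsAreEqual {
        @Test
        fun `collapses to a single value`() {
            assertEquals("90k", formatSalary(90, 90))
        }
    }
}
```

The table-driven rows replace what would be several near-identical JUnit4 test methods, and the nested class groups a scenario without a separate file.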
Mockk (github.com/mockk/mockk) is the canonical Kotlin mocking library. It mocks final classes out of the box (Mockito needed an opt-in for this until version 5 made the inline mock maker its default), supports object/static mocking, and has first-class coroutine support via coEvery/coVerify. Relaxed mocks (mockk(relaxed = true)) return sensible defaults for unstubbed calls so tests stay focused on the behavior under test. The more verbose Mockito-Kotlin alternative is acceptable, but Mockk is the senior default.
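To make the difference between a strict and a relaxed mock concrete, here is a small sketch. Analytics, JobApi, and JobCounter are hypothetical types defined only for this example; the Mockk calls (mockk, coEvery, coVerify, verify) are the library's own.

```kotlin
import io.mockk.coEvery
import io.mockk.coVerify
import io.mockk.mockk
import io.mockk.verify
import kotlinx.coroutines.test.runTest
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test

// Hypothetical collaborators, defined here so the sketch is self-contained.
interface Analytics { fun track(event: String) }
interface JobApi { suspend fun count(query: String): Int }

class JobCounter(private val api: JobApi, private val analytics: Analytics) {
    suspend fun count(query: String): Int {
        analytics.track("count:$query")
        return api.count(query)
    }
}

class JobCounterTest {
    // Relaxed: unstubbed calls such as track() return defaults instead of throwing.
    private val analytics: Analytics = mockk(relaxed = true)

    // Strict: any call that has not been stubbed fails the test.
    private val api: JobApi = mockk()

    @Test
    fun `returns the count reported by the api`() = runTest {
        coEvery { api.count("android") } returns 3

        assertEquals(3, JobCounter(api, analytics).count("android"))

        coVerify(exactly = 1) { api.count("android") }
        verify { analytics.track("count:android") }
    }
}
```

One reasonable split, under these assumptions, is to keep the dependency whose result matters strict and relax the incidental ones, so a missing stub on the important path still fails loudly.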
Turbine (github.com/cashapp/turbine) is the canonical way to test Kotlin Flows. The pattern flow.test { awaitItem(); awaitComplete() } gives you a sequential, deterministic API that replaces flaky collect-with-delay patterns (a launched collector plus delays or shared mutable state). Pair Turbine with runTest from kotlinx-coroutines-test to get virtual time control: advanceTimeBy and advanceUntilIdle let you fast-forward without real-clock waits.
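A minimal sketch of the Turbine plus runTest pairing on a cold flow follows. The countdown function is hypothetical; the point is that the one-second delays are expected to run on the test scheduler's virtual time rather than the wall clock.

```kotlin
import app.cash.turbine.test
import kotlinx.coroutines.delay
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.test.runTest
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test

// Hypothetical cold flow: emits a countdown with one second between values.
fun countdown(from: Int): Flow<Int> = flow {
    for (i in from downTo 1) {
        emit(i)
        delay(1_000)
    }
}

class CountdownTest {
    @Test
    fun `emits values in order then completes`() = runTest {
        countdown(3).test {
            // Each awaitItem() suspends until the next emission arrives.
            assertEquals(3, awaitItem())
            assertEquals(2, awaitItem())
            assertEquals(1, awaitItem())
            // Fails if the flow emits again or errors instead of completing.
            awaitComplete()
        }
    }
}
```

Because every emission has to be consumed explicitly, an unexpected extra value surfaces as a test failure instead of being silently dropped.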
A canonical ViewModel test exercising the whole stack (JUnit5, Mockk, Turbine, Truth, and runTest from kotlinx-coroutines-test) looks like this:
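import app.cash.turbine.test
import com.google.common.truth.Truth.assertThat
import io.mockk.coEvery
import io.mockk.coVerify
import io.mockk.mockk
import kotlinx.coroutines.test.runTest
import org.junit.jupiter.api.BeforeEach
import org.junit.jupiter.api.Test
import org.junit.jupiter.api.extension.ExtendWith

// MainDispatcherExtension, JobRepository, JobListViewModel, Job, and JobUiState are project classes.
@ExtendWith(MainDispatcherExtension::class)
class JobListViewModelTest {
    private val repo: JobRepository = mockk()
    private lateinit var vm: JobListViewModel

    @BeforeEach fun setUp() {
        vm = JobListViewModel(repo)
    }

    @Test fun `loads jobs and emits Success state`() = runTest {
        // Stub the suspend call before the ViewModel triggers it.
        coEvery { repo.fetchJobs("android") } returns listOf(Job("1", "Sr Android"))

        vm.uiState.test {
            assertThat(awaitItem()).isEqualTo(JobUiState.Idle)
            vm.search("android")
            assertThat(awaitItem()).isEqualTo(JobUiState.Loading)
            val success = awaitItem() as JobUiState.Success
            assertThat(success.jobs).hasSize(1)
            cancelAndIgnoreRemainingEvents()
        }

        coVerify(exactly = 1) { repo.fetchJobs("android") }
    }
}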
What this test gets right: MainDispatcherExtension swaps Dispatchers.Main for a TestDispatcher so viewModelScope runs deterministically; coEvery stubs the suspend function without ceremony; Turbine's test { } block makes Flow emissions sequential and assertable; coVerify confirms the repository was called exactly once with the expected argument; cancelAndIgnoreRemainingEvents prevents the test from hanging on an open Flow. The whole suite runs in milliseconds on the JVM with zero Android framework dependencies — the cheapest, fastest feedback you can give an Android engineer.
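MainDispatcherExtension is not shipped by JUnit or kotlinx-coroutines-test; it is a small class each project writes for itself. One plausible shape, assuming an UnconfinedTestDispatcher is an acceptable default for the ViewModel under test, is sketched here.

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.test.TestDispatcher
import kotlinx.coroutines.test.UnconfinedTestDispatcher
import kotlinx.coroutines.test.resetMain
import kotlinx.coroutines.test.setMain
import org.junit.jupiter.api.extension.AfterEachCallback
import org.junit.jupiter.api.extension.BeforeEachCallback
import org.junit.jupiter.api.extension.ExtensionContext

// Sketch of a project-local JUnit5 extension; the default dispatcher choice is an assumption.
@OptIn(ExperimentalCoroutinesApi::class)
class MainDispatcherExtension(
    private val dispatcher: TestDispatcher = UnconfinedTestDispatcher(),
) : BeforeEachCallback, AfterEachCallback {

    // Route Dispatchers.Main (and therefore viewModelScope) to the test dispatcher.
    override fun beforeEach(context: ExtensionContext) {
        Dispatchers.setMain(dispatcher)
    }

    // Restore the real Main dispatcher so state does not leak between tests.
    override fun afterEach(context: ExtensionContext) {
        Dispatchers.resetMain()
    }
}
```

The default argument gives the class a no-argument constructor on the JVM, which @ExtendWith needs in order to instantiate it. Teams that want emissions to queue until the scheduler is advanced may prefer StandardTestDispatcher; that choice affects which intermediate states a test observes.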
Compose UI testing: createComposeRule + semantics
Compose UI testing replaces the Espresso ViewMatchers world with a semantics-tree query API. Instead of finding views by ID and asserting on widget state, you query the merged semantics tree (the same one TalkBack uses) and assert on what the user can perceive. The canonical reference is developer.android.com/jetpack/compose/testing, which documents the full createComposeRule API, semantics matchers, and synchronization model.
There are two test-rule factories. createComposeRule() is for tests of pure Composables with no Activity dependency — the fastest path, runs under Robolectric or on-device. createAndroidComposeRule<ActivityT>() is for tests that need a real Activity (system intents, Activity-scoped DI, Activity result contracts). Default to createComposeRule; reach for the Activity variant only when the test genuinely depends on the Activity lifecycle.
The query API has three primary entry points: onNodeWithText (matches displayed text), onNodeWithTag (matches Modifier.testTag), and onNodeWithContentDescription (matches accessibility labels). Test tags are the senior pattern for testing — they're explicit, stable, and don't break when copy changes. The rule against testing internal implementation: never assert on Composable internals like raw padding values; always assert on what a user (or accessibility service) would perceive.
Synchronization is automatic. The compose rule waits for the recomposition queue and animation clock to be idle before each assertion. When you need to control time explicitly — testing a debounced search field, a fade-in animation — use composeTestRule.mainClock.autoAdvance = false plus mainClock.advanceTimeBy(millis). This gives you the same virtual-time pattern as kotlinx-coroutines-test, but for the Compose frame clock.
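The manual-clock pattern described above might look like the following sketch. The content is a hypothetical composable that shows a label after a 500 ms delay, and the sketch assumes the delay inside LaunchedEffect is driven by the test's main clock, which is how the Compose test rule is expected to behave.

```kotlin
import androidx.compose.foundation.text.BasicText
import androidx.compose.runtime.LaunchedEffect
import androidx.compose.runtime.getValue
import androidx.compose.runtime.mutableStateOf
import androidx.compose.runtime.remember
import androidx.compose.runtime.setValue
import androidx.compose.ui.test.junit4.createComposeRule
import androidx.compose.ui.test.onNodeWithText
import kotlinx.coroutines.delay
import org.junit.Rule
import org.junit.Test

class DelayedBannerTest {
    @get:Rule val rule = createComposeRule()

    @Test
    fun bannerAppearsOnlyAfterTheDelay() {
        // Take manual control of the frame clock before any content is set.
        rule.mainClock.autoAdvance = false

        rule.setContent {
            var visible by remember { mutableStateOf(false) }
            LaunchedEffect(Unit) {
                delay(500)
                visible = true
            }
            if (visible) BasicText("Saved")
        }

        // No time has been advanced yet, so the banner should be absent.
        rule.onNodeWithText("Saved").assertDoesNotExist()

        // Advance well past the delay; the clock moves in whole frames.
        rule.mainClock.advanceTimeBy(1_000)
        rule.onNodeWithText("Saved").assertExists()
    }
}
```

Advancing by a generous margin rather than exactly 500 ms avoids depending on how the clock rounds to frame boundaries.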
A canonical Compose UI test exercising createComposeRule and the semantics API:
import androidx.compose.ui.test.*
import androidx.compose.ui.test.junit4.createComposeRule
import org.junit.Rule
import org.junit.Test

class LoginScreenTest {
    @get:Rule val rule = createComposeRule()

    @Test fun `submit button enables when email and password are valid`() {
        rule.setContent { LoginScreen(onSubmit = {}) }

        // Precondition: nothing entered yet, so submit is disabled.
        rule.onNodeWithTag("submit").assertIsNotEnabled()

        rule.onNodeWithTag("email").performTextInput("user@example.com") // placeholder address
        rule.onNodeWithTag("password").performTextInput("hunter2!")

        rule.onNodeWithTag("submit")
            .assertIsEnabled()
            .assertHasClickAction()
            .performClick()
    }
}
What this test gets right: tags on the testable affordances (Modifier.testTag("submit")) instead of brittle text matching; assertions chained on the same node for readability; explicit pre/post state assertions that document the contract; performClick injects a click gesture on the node, so it exercises the same Modifier.clickable input path a user's tap would.
For the testTag pattern to work, Compose-only tests need nothing more than Modifier.testTag("submit") on the leaf Composable. If you also want UI Automator interop, add Modifier.semantics { testTagsAsResourceId = true } at a parent boundary so the tags are exposed as resource IDs. The Android docs cover both patterns in the testing guide linked above.
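On the production side, the two tagging options could be combined as in this sketch. LoginActions is a hypothetical composable; testTagsAsResourceId is an experimental Compose API, hence the opt-in.

```kotlin
import androidx.compose.foundation.clickable
import androidx.compose.foundation.layout.Column
import androidx.compose.foundation.text.BasicText
import androidx.compose.runtime.Composable
import androidx.compose.ui.ExperimentalComposeUiApi
import androidx.compose.ui.Modifier
import androidx.compose.ui.platform.testTag
import androidx.compose.ui.semantics.semantics
import androidx.compose.ui.semantics.testTagsAsResourceId

// Hypothetical production composable, shown only to illustrate where the tags go.
@OptIn(ExperimentalComposeUiApi::class)
@Composable
fun LoginActions(onSubmit: () -> Unit) {
    // Parent boundary: expose descendants' test tags as resource IDs for UI Automator.
    Column(Modifier.semantics { testTagsAsResourceId = true }) {
        BasicText(
            text = "Sign in",
            // Leaf tag: this alone is enough for rule.onNodeWithTag("submit").
            modifier = Modifier
                .testTag("submit")
                .clickable(onClick = onSubmit),
        )
    }
}
```

Setting the flag once near the root of a screen is usually enough, since it applies to the tagged nodes beneath it.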
Screenshot testing: Paparazzi + Roborazzi
Screenshot testing catches visual regressions that no semantic test can: a misplaced padding, a wrong dark-theme color, a text-overflow ellipsis. In 2026 the Android ecosystem has two production-ready options, and serious teams use both for different layers of the pyramid.
Paparazzi (github.com/cashapp/paparazzi) is Cash App's pure-JVM Compose/View renderer built on LayoutLib (the same renderer Android Studio's preview pane uses). It runs as a JUnit rule on the JVM with no emulator, no Robolectric, no instrumentation. A single Paparazzi test renders in 50-200ms; a 500-test suite runs in under a minute. The trade-off: LayoutLib doesn't run animations or process input; Paparazzi snapshots a single frame.
The Paparazzi pattern: declare @get:Rule val paparazzi = Paparazzi(...), then in each @Test call paparazzi.snapshot { MyComposable() }. Running the record task (./gradlew recordPaparazziDebug) writes baseline PNGs under src/test/snapshots/; the verify task (./gradlew verifyPaparazziDebug) re-renders, compares against those baselines, and fails on a diff above a configurable threshold. CI uploads the diff PNGs as artifacts so reviewers can see exactly what changed.
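A minimal sketch of that pattern is shown here. The test class name and the snapshotted text are hypothetical, and the device and threshold values are illustrative choices rather than recommendations.

```kotlin
import androidx.compose.foundation.text.BasicText
import app.cash.paparazzi.DeviceConfig
import app.cash.paparazzi.Paparazzi
import org.junit.Rule
import org.junit.Test

class JobCardSnapshotTest {
    // Paparazzi is a JUnit4 rule, so this class runs on the JUnit4 engine.
    @get:Rule
    val paparazzi = Paparazzi(
        deviceConfig = DeviceConfig.PIXEL_5, // pinned so size and density stay stable
        maxPercentDifference = 0.1,          // allowed pixel difference before a failure
    )

    @Test
    fun default() {
        // Renders one frame of the composable and records or verifies it,
        // depending on which Gradle task is running.
        paparazzi.snapshot {
            BasicText("Sr Android Engineer")
        }
    }
}
```

In a module that otherwise uses JUnit5, a rule-based test like this generally needs the JUnit Vintage engine (or a JUnit4-only test task) to be picked up.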
Roborazzi (github.com/takahirom/roborazzi) takes a different approach: it hooks into Robolectric's rendering pipeline and lets you snapshot a Compose tree at any point during a Robolectric test. This means Roborazzi can capture animations mid-frame, post-input states, gesture-driven UI, and anything else that requires the full Android runtime. The cost is speed (Robolectric setup is heavier than Paparazzi) and occasional shadow-class quirks.
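To show what a post-input capture might look like, here is a hedged sketch. FavoriteButton is a hypothetical composable written for this example; the sketch assumes the roborazzi-compose artifact, Robolectric's native graphics mode, and the Compose ui-test-manifest dependency in the debug variant.

```kotlin
import androidx.compose.foundation.clickable
import androidx.compose.foundation.text.BasicText
import androidx.compose.runtime.Composable
import androidx.compose.runtime.getValue
import androidx.compose.runtime.mutableStateOf
import androidx.compose.runtime.remember
import androidx.compose.runtime.setValue
import androidx.compose.ui.Modifier
import androidx.compose.ui.platform.testTag
import androidx.compose.ui.test.junit4.createComposeRule
import androidx.compose.ui.test.onNodeWithTag
import androidx.compose.ui.test.performClick
import com.github.takahirom.roborazzi.captureRoboImage
import org.junit.Rule
import org.junit.Test
import org.junit.runner.RunWith
import org.robolectric.RobolectricTestRunner
import org.robolectric.annotation.GraphicsMode

// Hypothetical stateful composable: toggles its label when tapped.
@Composable
fun FavoriteButton() {
    var saved by remember { mutableStateOf(false) }
    BasicText(
        text = if (saved) "Saved" else "Save",
        modifier = Modifier.testTag("favorite").clickable { saved = !saved },
    )
}

@RunWith(RobolectricTestRunner::class)
@GraphicsMode(GraphicsMode.Mode.NATIVE) // real pixel rendering under Robolectric
class FavoriteButtonSnapshotTest {
    @get:Rule val rule = createComposeRule()

    @Test
    fun afterTap() {
        rule.setContent { FavoriteButton() }
        // Drive the UI into the state of interest, then capture that node.
        rule.onNodeWithTag("favorite").performClick()
        rule.onNodeWithTag("favorite").captureRoboImage()
    }
}
```

The capture happens after the tap, which is the part a single-frame renderer cannot reproduce.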
The senior pattern in 2026: Paparazzi for the static design-system layer (every reusable component, every theme, every locale variant); Roborazzi for the interactive layer (post-tap states, scroll positions, animation midpoints). Both render to PNG, both diff against baselines, both gate the merge on visual approval.
Baseline management is the operational reality of screenshot tests. The canonical workflow: developer runs ./gradlew recordPaparazziDebug locally to update baselines, commits them in the same PR as the code change, reviewer compares the baseline diff in GitHub. Storing PNGs in git is fine for typical Compose components (4-40 KB each); repos with thousands of snapshots use Git LFS.
Flakiness is the failure mode to design against. Common sources: device and font drift (pin Paparazzi's deviceConfig, including font scale, and render with fonts bundled in the app rather than whatever the host machine provides), date/time-dependent UI (inject a Clock and pass a fixed instant), random IDs (seed any RNG in your design system), and dark-mode animations (snapshot the steady state, not the transition). The teams that succeed with screenshot testing treat determinism as a first-class API contract, not an afterthought.
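The injected-Clock advice can be illustrated with plain java.time, no Android dependencies. In this sketch, postedLabel is a hypothetical helper standing in for whatever date-dependent text a component renders.

```kotlin
import java.time.Clock
import java.time.Duration
import java.time.Instant
import java.time.ZoneOffset

// Hypothetical UI helper: the label a job card would render.
// Taking a Clock parameter (instead of calling Instant.now()) makes the output reproducible.
fun postedLabel(postedAt: Instant, clock: Clock): String {
    val days = Duration.between(postedAt, clock.instant()).toDays()
    return when {
        days <= 0L -> "Posted today"
        days == 1L -> "Posted 1 day ago"
        else -> "Posted $days days ago"
    }
}

fun main() {
    // A fixed clock: every snapshot run sees the same "now".
    val fixed = Clock.fixed(Instant.parse("2026-01-15T12:00:00Z"), ZoneOffset.UTC)

    println(postedLabel(Instant.parse("2026-01-12T12:00:00Z"), fixed)) // Posted 3 days ago
    println(postedLabel(Instant.parse("2026-01-15T09:00:00Z"), fixed)) // Posted today
}
```

Production code would pass Clock.systemUTC() (or a DI-provided clock), while snapshot tests pass the fixed one, so the rendered text never changes between runs.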
CI on GitHub Actions: matrix builds + Firebase Test Lab
The 2026 senior CI pipeline for Android has two tiers. Tier 1 (every PR) runs on GitHub Actions ubuntu-latest with no emulator: unit tests (JUnit5), Robolectric tests, Paparazzi/Roborazzi screenshots, ktlint/detekt, and a release build. The whole of tier 1 should finish in under 10 minutes on a properly cached pipeline. Tier 2 (nightly + release branches) runs instrumented Espresso/UI Automator tests on Firebase Test Lab against a curated device matrix: Pixel 7/8/9 on API 30/33/35, plus one low-end device and one tablet. The split keeps PR feedback fast while still catching device-specific regressions before release.
The Gradle cache is the single biggest CI win. actions/cache for ~/.gradle/caches and ~/.gradle/wrapper, keyed on the hash of *.gradle*, gradle-wrapper.properties, and gradle/libs.versions.toml, typically takes a cold build from around 8 minutes to under 2. The Gradle build cache (--build-cache plus a remote node such as Develocity) further cuts incremental work.
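For the remote build-cache part, a settings.gradle.kts fragment along these lines is one possible starting point. The cache URL is a placeholder, and the rule that only CI builds on main push to the cache is an assumption chosen to mirror the read-only policy for PRs.

```kotlin
// settings.gradle.kts (sketch; the URL below is a placeholder, not a real endpoint)

// GitHub Actions sets CI=true and GITHUB_REF for every job.
val isCi = System.getenv("CI") != null
val isMainBranch = System.getenv("GITHUB_REF") == "refs/heads/main"

buildCache {
    local {
        // Developer machines keep the local cache; CI relies on the remote node.
        isEnabled = !isCi
    }
    remote<HttpBuildCache> {
        url = uri("https://gradle-cache.example.com/cache/")
        // Everyone reads; only trusted main-branch CI builds write.
        isPush = isCi && isMainBranch
    }
}
```

Task output caching still has to be switched on, either with --build-cache on the command line or org.gradle.caching=true in gradle.properties.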
For PRs that genuinely need an emulator (deep-link intent tests, system-UI permission flows), reactivecircus/android-emulator-runner is the canonical action. It boots an AVD with hardware acceleration on Linux runners (KVM available on ubuntu-latest since late 2023) and caches the AVD snapshot so subsequent runs skip the cold boot. Without snapshot caching, AVD boot is 2-3 minutes; with it, 20-30 seconds.
Firebase Test Lab (firebase.google.com/docs/test-lab) is the senior choice for nightly/release-tier coverage. It runs your instrumented APK + test APK on real and virtual devices in Google's data center, captures logcat, video, and screenshots, and surfaces the results via gcloud or the Firebase console. The pricing model (per-device-minute) makes it impractical to run on every PR for a large suite, but it's the right tool for release gating against the device matrix you actually ship to.
A canonical GitHub Actions workflow for tier 1 covers checkout, JDK setup, Gradle cache, and the test matrix:
name: Android CI
on: [pull_request, push]
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        api-level: [30, 33, 35]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with: { distribution: temurin, java-version: 21 }
      - uses: gradle/actions/setup-gradle@v3
        with:
          # Block style: a ${{ }} expression cannot sit unquoted inside a { } flow mapping.
          cache-read-only: ${{ github.ref != 'refs/heads/main' }}
      - run: ./gradlew testDebugUnitTest --stacktrace
      - run: ./gradlew verifyPaparazziDebug
      - run: ./gradlew :app:verifyRoborazziDebug
      - uses: actions/upload-artifact@v4
        if: failure()
        with:
          name: screenshot-diffs-${{ matrix.api-level }}
          path: '**/build/paparazzi/failures/**'
What this workflow gets right: JDK 21 (comfortably above the JDK 17 minimum that AGP 8.x requires), gradle/actions/setup-gradle (handles caches and dependency-graph submission), conditional cache-read-only so only the main branch writes the cache (avoids cache poisoning from PRs), explicit screenshot-diff artifact upload on failure, and a matrix on api-level. The matrix only exercises code that branches on Build.VERSION.SDK_INT if the build forwards matrix.api-level to the tests, for example into Robolectric's SDK selection; without that wiring, the three jobs run the same JVM suite.
What this workflow does NOT include: the Test Lab tier 2 (a separate workflow on a schedule trigger using google-github-actions/auth and gcloud firebase test android run), retry-on-flake (the Gradle test-retry plugin can rerun failed tests, but use it sparingly; flaky tests should be quarantined and fixed, not papered over), and SonarQube/Codecov upload (additive, not load-bearing). The senior+ implementation composes all of these but keeps the PR-tier surface minimal so feedback stays fast.
Bitrise is the alternative platform for Android CI in 2026. Its strengths are managed macOS for iOS+Android orgs, a step library with first-party Android steps, and Apple-Google chain-of-trust signing. The trade-off versus GitHub Actions: vendor lock-in on the workflow YAML format and per-build-minute pricing that gets expensive at scale. For Android-only orgs already on GitHub, GHA is the default; for orgs with deep iOS investment, Bitrise pays back the lock-in cost.
Frequently asked questions
- Should I migrate from JUnit4 to JUnit5 for Android unit tests?
- Yes for new modules; gradually for existing ones. JUnit5 in 2026 has first-class Gradle support via the de.mannodermaus.android-junit5 plugin, and the productivity wins (parameterized tests, lifecycle extensions, nested classes) compound across hundreds of tests. Instrumented tests on the Android runtime are still on JUnit4 because AndroidX Test ships with that runner; the migration is JVM-side only.
- Mockk or Mockito-Kotlin?
- Mockk is the senior default in 2026. Kotlin-idiomatic DSL, native coroutine support (coEvery/coVerify), final-class mocking out of the box, and object/static mocking without bytecode-rewriting opt-ins. Mockito-Kotlin is acceptable on legacy codebases or teams that share mocking infrastructure with Java services. New Android modules should reach for Mockk first.
- When should I use Turbine vs collecting a Flow manually?
- Use Turbine for any Flow assertion in a unit test. Manual collection — launch { flow.collect { ... } } with delays or shared state — is the most common source of flaky tests in modern Android codebases. Turbine's awaitItem/awaitComplete API is sequential and deterministic, integrates with runTest virtual time, and fails fast on unexpected emissions. There is no scenario in 2026 where a manual collector is preferable in a test.
- createComposeRule or createAndroidComposeRule?
- createComposeRule for tests of Composables with no Activity dependency — the fast default, works under Robolectric or on-device. createAndroidComposeRule<Activity>() when the test needs a real Activity for system intents, Activity-scoped DI, or Activity result contracts. Most Composable tests should use createComposeRule; reach for the Android variant only when the test genuinely needs the Activity lifecycle.
- Paparazzi or Roborazzi for screenshot tests?
- Both. Paparazzi for the static design-system layer (every reusable component, every theme, every locale) — it's pure JVM, renders in 50-200ms per test, no emulator. Roborazzi for the interactive layer (post-tap states, animation midpoints, scroll positions) — it leverages Robolectric so it can capture state-dependent UI. Senior teams run Paparazzi on every PR and Roborazzi on the integration suite.
- How do I keep screenshot tests from flaking?
- Treat determinism as a first-class API. Pin the device config (Paparazzi.deviceConfig) so screen size, density, and font scale don't drift. Inject a Clock and pass a fixed Instant for any date/time UI. Seed any RNG in the design system. Snapshot steady states, not animation transitions. When a test does flake, fix the source of nondeterminism; never increase the diff threshold to mask it. Threshold inflation is how screenshot suites become useless.
- Robolectric or instrumented tests for UI logic?
- Robolectric for the inner loop (PR tier) — runs on the JVM in seconds, no emulator, supports Compose and AndroidX Test. Instrumented tests on a real device or emulator for the outer loop (release tier) — catches device-specific bugs Robolectric's shadows miss (graphics drivers, vendor framework forks, hardware sensors). The pyramid is wide on Robolectric, narrow on instrumented.
- When does Firebase Test Lab pay off?
- When you ship to a meaningfully diverse device matrix and need real-device coverage before release. Test Lab runs on a curated set of physical and virtual devices in Google's data center, captures logcat/video/screenshots, and integrates with gcloud and CI. The pricing model (per-device-minute) makes it impractical for every PR but exactly right for nightly + release-tier matrix runs against Pixel 7/8/9 on API 30/33/35 plus one low-end device and one tablet.
- What's the right Gradle cache strategy on GitHub Actions?
- Use gradle/actions/setup-gradle, which handles ~/.gradle/caches and ~/.gradle/wrapper plus dependency-graph submission. Set cache-read-only to true for non-main branches so PRs read but don't write the cache (avoids cache poisoning from forks). For larger orgs, layer a remote Gradle build-cache node (Develocity or a self-hosted instance) on top — that gets you cross-PR task reuse, not just dependency reuse. The cold-to-warm delta is typically 8 minutes to under 2.
- GitHub Actions or Bitrise for Android CI?
- GitHub Actions is the default in 2026 for Android-only orgs already on GitHub: KVM-accelerated emulator runners on ubuntu-latest, mature Gradle caching, and free minutes for OSS. Bitrise is the right choice for orgs with deep iOS+Android overlap who want managed macOS, Apple-Google signing chain of trust, and a step library purpose-built for mobile. Most Android-only teams will be happier on GHA; mixed-platform mobile shops often prefer Bitrise.
Sources
- Android Developers — Testing fundamentals. Canonical reference for the Android testing pyramid, AndroidX Test, and Gradle test wiring.
- Android Developers — Testing your Compose layout. Canonical reference for createComposeRule, the semantics tree, and Compose synchronization.
- Cash App Paparazzi — Pure-JVM Compose/View screenshot rendering via LayoutLib. Canonical fast-tier screenshot testing library.
- Roborazzi — Robolectric-driven Compose screenshot testing with full state and animation support. Canonical interactive-tier screenshot library.
- Mockk — Kotlin-idiomatic mocking library with native coroutine, final-class, and object/static support. The senior default for Android unit tests in 2026.
- Firebase Test Lab — Cloud-hosted device matrix for instrumented Android tests. Canonical platform for nightly/release-tier real-device coverage.
About the author. Blake Crosley founded ResumeGeni and writes about Android engineering, hiring technology, and ATS optimization. More writing at blakecrosley.com.