How we ship without a QA team: the internal stack behind Rolo
Three engineers. No dedicated QA. No release manager. And yet every PR gets a static analysis pass, a size audit, a Linear issue check, and a structured reviewer scorecard before any human reads a line of diff. Here is how we built it.
Jack Luo
Founder, Agent School
We are not doing this manually. We built an internal stack that enforces quality automatically, keeps everyone aligned without stand-ups, and lets each engineer work like they have leverage beyond their headcount. This is the first post describing how that stack actually works.
The problem with small teams moving fast
The common failure mode for small engineering teams is the same everywhere. You move fast for a few months, ship a lot, and then the codebase becomes a place where every change feels risky. You find a `window.confirm()` in production. A `process.env` read outside the validated environment layer. A hardcoded z-index that breaks a modal on mobile. Nothing catastrophic, just the slow accumulation of shortcuts that compounds into real friction.
The usual answer is to hire a QA engineer or slow down the release cadence. We did neither. Instead we automated the things a good senior engineer would catch on the first read of a diff.
The PR audit: a robot reviewer that never misses
Every pull request we open gets a CI job that posts a structured comment before any human reviews it. Not a linter output. A document.
The audit starts with diff stats and a size classification. PRs smaller than 250 lines are labeled `S`. Over 3000 lines, the job hard-fails and blocks the merge entirely. That upper limit is not arbitrary: an XXL PR is almost always a scoping failure, not a legitimate feature boundary.
After sizing, the job detects the PR type from the branch prefix or title. A `feat` branch has a 1000-line limit. A `fix` branch has 500 lines. A `story` branch implementing a full planned feature has 3000 lines. These limits enforce discipline before the reviewer ever opens the diff.
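The sizing and per-type gate can be sketched as a small pure function. This is an illustrative sketch, not our actual CI code: only the numbers stated above (the 250-line `S` cutoff, the 3000-line hard cap, and the 1000/500/3000 per-type limits) are real; the intermediate `M+` label and the helper names are hypothetical.

```typescript
// Hypothetical sketch of the size/type gate. Only S < 250, the 3000-line
// hard cap, and the feat/fix/story limits come from our rules; the rest
// is illustrative.
type PrType = "feat" | "fix" | "story";

const LINE_LIMITS: Record<PrType, number> = {
  feat: 1000,  // feature branches
  fix: 500,    // bug fixes
  story: 3000, // full planned stories
};

const HARD_CAP = 3000; // above this, the job fails regardless of type

function detectType(branch: string): PrType | null {
  const prefix = branch.split("/")[0];
  return prefix === "feat" || prefix === "fix" || prefix === "story"
    ? prefix
    : null;
}

function audit(branch: string, changedLines: number) {
  const type = detectType(branch);
  const limit = type ? LINE_LIMITS[type] : HARD_CAP;
  return {
    type,
    sizeLabel: changedLines < 250 ? "S" : changedLines > HARD_CAP ? "XXL" : "M+",
    blocked: changedLines > HARD_CAP || changedLines > limit,
  };
}
```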
The most useful section is the change tour. The audit parses the diff and groups every changed file by its architecture layer: API routes, UI components, DB schema, DB migrations, services, repositories, hooks, lib/infra, and pages. Each entry shows the file, the delta, and a description extracted from the hunk headers and the first meaningful added lines. A reviewer can scan the tour in thirty seconds and understand whether the PR is doing what its title says it is doing.
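The grouping step reduces to a first-match path classifier. The layer names are the ones listed above; the globs are assumptions about a typical Next.js repository layout, not our exact tree.

```typescript
// Illustrative path-to-layer mapping for the change tour. Layer names are
// real; the path patterns are assumptions about a typical Next.js layout.
const LAYERS: [RegExp, string][] = [
  [/^src\/app\/api\//, "API routes"],
  [/^src\/components\//, "UI components"],
  [/schema\.(ts|prisma)$/, "DB schema"],
  [/migrations\//, "DB migrations"],
  [/^src\/services\//, "services"],
  [/^src\/repositories\//, "repositories"],
  [/^src\/hooks\//, "hooks"],
  [/^src\/lib\//, "lib/infra"],
  [/^src\/app\//, "pages"], // checked last so app/api wins above
];

function layerOf(path: string): string {
  for (const [pattern, layer] of LAYERS) {
    if (pattern.test(path)) return layer;
  }
  return "other";
}
```

Ordering matters: the patterns are tried top to bottom, so the more specific `src/app/api/` entry must precede the catch-all `src/app/` pages entry.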
Then come the static pattern checks. Eight of them, running on the raw diff:
- Banned browser dialogs. Any use of `window.confirm`, `window.alert`, or `window.prompt` is flagged immediately. We use custom modal components instead, which have proper keyboard navigation, ARIA support, and mobile behavior.
- Direct environment access. Reading `process.env.SOME_VAR` outside the validated env layer gets caught. Everything goes through `@/lib/env`, which is Zod-validated at startup.
- Z-index violations. Our mobile layout uses a fixed z-index scale with seven specific values. Any arbitrary value outside the approved set is flagged. This stopped three separate mobile layering bugs before they reached review.
- Pixel breakpoints. We use rem-based media queries throughout the app. A `768px` anywhere in the diff is a signal that someone pulled a breakpoint from memory instead of following the convention. rem and px desync at non-default font sizes - we learned this the hard way.
- Overly permissive Zod schemas. `z.any()` gets flagged. It is almost always a shortcut taken under pressure that turns into a validation gap in production.
- Console calls in server code. We use a structured logger in server-side code. A `console.log` in a route handler or service is a signal that the error handling is not following our observability patterns.
- Hardcoded hex colors. Everything should go through Tailwind tokens or CSS variables. A raw hex in a JSX attribute is a design system leak.
- Missing `use client` guard. When a file imports `useState`, `useEffect`, or similar hooks, the audit checks whether `"use client"` is present at the top of the file. React Server Components will throw at runtime without it.
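Most of these checks are line-level regexes over the added side of the diff. A minimal sketch, assuming a unified diff as input: the rule names mirror the list above, but the exact regexes are illustrative, and the two file-scoped checks (the z-index allowlist and the `"use client"` guard) are omitted because they need more context than a single line.

```typescript
// Sketch of diff-based pattern checks. Scans only added lines (prefixed
// "+", excluding the "+++" file header). Regexes are illustrative.
const RULES: { name: string; pattern: RegExp }[] = [
  { name: "banned browser dialog", pattern: /window\.(confirm|alert|prompt)\s*\(/ },
  { name: "direct env access", pattern: /process\.env\.\w+/ },
  { name: "pixel breakpoint", pattern: /@media[^{]*\b\d+px/ },
  { name: "permissive zod schema", pattern: /z\.any\s*\(\)/ },
  { name: "console call in server code", pattern: /console\.(log|warn|error)\s*\(/ },
  { name: "hardcoded hex color", pattern: /#[0-9a-fA-F]{3,8}\b/ },
];

function findings(diff: string): { name: string; line: string }[] {
  const added = diff
    .split("\n")
    .filter((l) => l.startsWith("+") && !l.startsWith("+++"));
  const hits: { name: string; line: string }[] = [];
  for (const line of added) {
    for (const rule of RULES) {
      if (rule.pattern.test(line)) hits.push({ name: rule.name, line });
    }
  }
  return hits;
}
```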
Finally, the comment includes a reviewer scorecard template. Seven dimensions: scope adherence, code correctness, security, test coverage, architecture alignment, observability, and UI/theme consistency. Each scored 1-5. Total 35 points. The approval threshold is 28. Reviewers copy the table into their review comment and fill in the scores. This sounds bureaucratic until you realize it produces a consistent record of why PRs were approved and what tradeoffs were accepted.
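The scorecard arithmetic itself is trivial, which is the point: the value is in the record, not the math. A sketch, with the dimension names and the 28/35 threshold taken from above and everything else illustrative:

```typescript
// Reviewer scorecard: seven dimensions, each scored 1-5, approval at 28/35.
const DIMENSIONS = [
  "scope adherence", "code correctness", "security", "test coverage",
  "architecture alignment", "observability", "ui/theme consistency",
] as const;

type Scorecard = Record<(typeof DIMENSIONS)[number], number>;

function verdict(scores: Scorecard): { total: number; approved: boolean } {
  const total = DIMENSIONS.reduce((sum, d) => sum + scores[d], 0);
  return { total, approved: total >= 28 };
}
```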
Linear everywhere
Every PR is expected to reference a Linear issue. The audit checks for an `AGE-NNN` identifier in the PR title, body, or branch name, and warns visibly if it is missing.
For story branches, the audit also detects the epic number from the branch slug and automatically assigns the PR to the corresponding GitHub milestone. A PR on `story/3-AGE-92-profile-redesign` gets assigned to the Epic 3 milestone without anyone touching it manually. Milestone progress in GitHub tracks alongside our planning state in Linear.
Our branch naming is strict by convention: `story/<epic>.<story>-<linear-id>-<short-title>` for full story implementations, `feat/<linear-id>-<short-title>` for features, `fix/<linear-id>-<short-title>` for bugs. The audit validates the prefix and infers context from the shape. This means the branch name is not just a label - it is a routing key that drives automation downstream.
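Parsing that routing key is one regex. This sketch follows the stated `story/<epic>.<story>-<linear-id>-<short-title>` convention but treats the `.<story>` segment as optional, which is an assumption made so that slugs like `story/3-AGE-92-profile-redesign` also parse; the function name and return shape are illustrative.

```typescript
// Hypothetical branch-name parser. The prefix set, AGE-NNN id format, and
// epic-number extraction are from our convention; the optional .<story>
// segment is an assumption.
const BRANCH = /^(story|feat|fix)\/(?:(\d+)(?:\.(\d+))?-)?(AGE-\d+)-([a-z0-9-]+)$/;

function parseBranch(name: string) {
  const m = name.match(BRANCH);
  if (!m) return null;
  const [, prefix, epic, storyNum, linearId, slug] = m;
  return {
    prefix,
    epic: epic ? Number(epic) : null,     // drives milestone assignment
    story: storyNum ? Number(storyNum) : null,
    linearId,                             // also checked in PR title/body
    slug,
  };
}
```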
Autonomous agents with a notification layer
We run autonomous coding agents via OpenCode. An agent can take a story specification, read the codebase, implement the change, run the linter, and commit the work in modular groups organized by concern. One engineer can have multiple agents working in parallel on independent stories.
The coordination problem with autonomous agents is knowing when they need you. An agent that is blocked on an ambiguous requirement or needs permission to proceed with a potentially destructive change should not require you to be watching a terminal.
We solve this with brrr, a minimal HTTP notification API. Every agent response fires a webhook to the relevant engineer's phone. Finished a task: `Done:` ping. Asking a clarifying question: `Question:` ping. Needs approval before a schema change or auth flow modification: `Permission:` ping. The engineer is not watching the terminal. They are doing something else. The phone tells them when their attention is needed.
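Wiring an agent hook to brrr takes a few lines. brrr is described here only as a minimal HTTP API, so the endpoint path, payload shape, and auth scheme below are all assumptions; only the three `Done:`/`Question:`/`Permission:` prefixes come from how we actually use it.

```typescript
// Hypothetical brrr client. Endpoint, payload, and bearer auth are
// assumptions; the three notification kinds are the ones we use.
type Kind = "Done" | "Question" | "Permission";

function buildNotification(kind: Kind, message: string, engineer: string) {
  return { to: engineer, title: `${kind}:`, message };
}

async function notify(
  baseUrl: string,
  token: string,
  kind: Kind,
  message: string,
  engineer: string,
): Promise<void> {
  await fetch(`${baseUrl}/notify`, {
    method: "POST",
    headers: {
      "content-type": "application/json",
      authorization: `Bearer ${token}`,
    },
    body: JSON.stringify(buildNotification(kind, message, engineer)),
  });
}

// e.g. an agent finishing a story might fire:
// await notify(BRRR_URL, BRRR_TOKEN, "Done", "AGE-92 implemented, lint clean", "jack");
```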
This pattern - autonomous work with targeted interrupts - is what makes agent-assisted development actually workable on a small team. Without the notification layer, you are either babysitting the terminal or discovering blocked work hours later.
Remote builds that know what changed
Our CI pipeline runs on self-hosted runners. Every job that runs is a job we are responsible for, which means we notice when jobs run unnecessarily.
The first job in every CI run is a change detection step. It uses path filters to classify whether the changes are marketing-only. A change touching only `src/app/(marketing)/`, `src/components/marketing/`, or `public/marketing/**` has no risk of breaking the application layer. When that condition is true, the build job is skipped entirely.
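The classification reduces to a prefix check over the changed file list. The three path prefixes are the real ones from above; the helper itself is an illustrative sketch of what the CI path filter computes.

```typescript
// Marketing-only change detection. The path prefixes are real; the
// function is an illustrative sketch of the CI path-filter logic.
const MARKETING_PATHS = [
  "src/app/(marketing)/",
  "src/components/marketing/",
  "public/marketing/",
];

function isMarketingOnly(changedFiles: string[]): boolean {
  return (
    changedFiles.length > 0 &&
    changedFiles.every((f) => MARKETING_PATHS.some((p) => f.startsWith(p)))
  );
}
```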
The pipeline is otherwise sequential in a deliberate order: lint first, then type check, then build. The type check does not start until lint passes. The build does not start until types are clean. This catches the cheap failures early and avoids wasting build time on code that would fail for a simpler reason.
For mobile and PWA coverage, a parallel job matrix runs Playwright tests across three configurations: mobile Chrome, mobile Safari, and installed PWA mode. These catch the class of bugs that only appear in the mobile layout - touch event handling, iOS scroll behavior, safe area insets, bottom navigation overlap.
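The matrix maps naturally onto Playwright's `projects`. This is a sketch of what such a `playwright.config.ts` could look like, not our actual config: the specific device descriptors and the way "installed PWA mode" is simulated here are assumptions.

```typescript
// Illustrative playwright.config.ts project matrix. Device picks and the
// PWA launch flag are assumptions, not our exact setup.
import { defineConfig, devices } from "@playwright/test";

export default defineConfig({
  projects: [
    { name: "mobile-chrome", use: { ...devices["Pixel 7"] } },
    { name: "mobile-safari", use: { ...devices["iPhone 14"] } },
    {
      name: "installed-pwa",
      use: {
        ...devices["Pixel 7"],
        // Launch Chromium in app mode to approximate an installed PWA
        // (assumption: the dev server is reachable on localhost:3000).
        launchOptions: { args: ["--app=https://localhost:3000"] },
      },
    },
  ],
});
```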
How the pieces connect
None of these systems is particularly novel on its own. Automated PR analysis, Linear issue tracking, self-hosted CI, phone notifications - all solved problems individually. The value is in how they compose.
A story gets created in Linear. A BMAD story file captures the full implementation context. An agent picks it up, implements it, and commits in logical groups. A PR opens, automatically tagged with the story's AGE identifier and epic milestone. The PR audit fires and posts the change tour, static analysis findings, and reviewer scorecard. A human reviewer fills in the scorecard and approves or requests changes. The engineer who owns the story gets a brrr ping when the review is posted.
The feedback loop from "story in planning" to "reviewed and mergeable" runs in a few hours. No status meetings. No Slack thread asking if the build passed. No manual label application. No reviewer wondering whether the z-index is right.
For a team of three, that compression is the difference between shipping and not shipping.
This is a post about our internal engineering stack. If you want to talk about any of these systems, find us on the site or on the waitlist.
Want to try Rolo?
Join the waitlist and be among the first to use Rolo.