
VertaaUX Articles

What Automated UX Scanning Can Spot Before Humans Get There

Set a realistic boundary between detectable UX smells, WCAG violations, and the issues that still need human judgment.

Petri Lahdelma · 5 min read

Last updated March 9, 2026

UX · Accessibility · Testing · WCAG

Automated testing is useful, but it is not a verdict. It can catch a meaningful share of accessibility and UX risk early, yet it still leaves important gaps that only context, task analysis, and human review can close.

That distinction matters because the wrong mental model creates false confidence. A clean scan can still hide a confusing flow, weak copy, inaccessible edge states, or interaction debt that only shows up when someone actually tries to use the product.

The right question is not whether automation works. It is where it works, where it fails, and how teams should use it responsibly before release.

Why this matters now

The current numbers still make the case for automation and against overclaiming at the same time.

  • WebAIM's 2025 Million report found 50,960,288 distinct accessibility errors across the top million home pages, averaging 51 errors per page.
  • Deque's coverage research found that automated testing detected 57.38% of total issues on average in its sample.

So the practical conclusion is not "automation is weak." It is "automation is valuable but incomplete."

If you treat it as a magic verdict, you will mislead the team.

If you treat it as a fast evidence layer, you will catch a large amount of preventable debt before it reaches customers.

The three buckets teams should use

Most of the confusion disappears if teams sort findings into three buckets instead of one generic issue list.

| Bucket | What it contains | Good examples | What to do next |
| --- | --- | --- | --- |
| Machine-detectable | Explicit, rule-backed failures | Missing labels, contrast failures, broken structure | Fix quickly and prevent regressions |
| Machine-augmented | Risk signals that need interpretation | Dense forms, weak CTA clarity, likely cognitive load | Review with design and product context |
| Human-only | Questions automation cannot answer well | Trust, comprehension, emotional safety, assistive-tech experience | Test directly with humans |

That taxonomy is more useful than a single score because it tells the team what kind of decision is required.
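The sorting itself can be mechanical. A minimal sketch, assuming a scanner that tags each finding with a rule ID (the rule names and the mapping below are illustrative, not any real tool's vocabulary):

```python
# Illustrative rule-to-bucket mapping; real scanners expose their own
# rule IDs and confidence metadata.
DETERMINISTIC_RULES = {"missing-label", "contrast", "duplicate-id"}
RISK_SIGNAL_RULES = {"dense-form", "weak-cta", "cognitive-load"}

def bucket(finding: dict) -> str:
    """Classify one finding by how much interpretation it needs."""
    rule = finding["rule"]
    if rule in DETERMINISTIC_RULES:
        return "machine_detectable"
    if rule in RISK_SIGNAL_RULES:
        return "machine_augmented"
    return "human_only"

findings = [
    {"rule": "missing-label", "page": "/checkout"},
    {"rule": "dense-form", "page": "/settings"},
    {"rule": "trust-review", "page": "/billing"},
]
triaged: dict[str, list[dict]] = {}
for f in findings:
    triaged.setdefault(bucket(f), []).append(f)
```

Once findings land in `triaged`, each bucket can be routed to the decision path described in the table: fix, review, or test with humans.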

What automation can catch with confidence

The highest-value early checks are still structural:

  • missing labels and names
  • contrast failures
  • broken heading hierarchy
  • weak landmarks
  • obvious target-size issues
  • duplicate IDs
  • malformed control relationships

These checks matter because they are high-confidence, fast to route, and often easy to fix before the release train moves on.

For many teams, this category alone pays for the workflow. It removes preventable issues from the product before a designer, researcher, or accessibility specialist has to spend time rediscovering them manually.
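To make "missing labels" concrete, here is a deliberately minimal check using only the Python standard library. Real engines such as axe-core cover far more cases; this sketch only flags `<input>` elements with no `aria-label`, no `aria-labelledby`, and no `id` that a `<label for=…>` points at:

```python
from html.parser import HTMLParser

class LabelCheck(HTMLParser):
    """Collect inputs and the ids that <label for=...> elements target."""
    def __init__(self):
        super().__init__()
        self.inputs = []
        self.label_targets = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") != "hidden":
            self.inputs.append(a)
        if tag == "label" and "for" in a:
            self.label_targets.add(a["for"])

def unlabeled_inputs(html: str) -> list[dict]:
    p = LabelCheck()
    p.feed(html)
    return [a for a in p.inputs
            if "aria-label" not in a
            and "aria-labelledby" not in a
            and a.get("id") not in p.label_targets]

html = """
<label for="email">Email</label><input id="email" type="text">
<input name="promo" type="text" placeholder="Promo code">
"""
# The promo field has only a placeholder, so it is flagged.
```

Note the flagged field matches the failure mode in the audit excerpt later in this article: a visible placeholder with no programmatic label.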

What automation can only estimate

The second bucket is where modern tooling gets more interesting.

Automated systems can increasingly estimate risks such as:

  • overly dense screens
  • repetitive error messaging
  • weak information scent
  • confusing CTA wording
  • poor hierarchy in long settings or dashboard views

These are useful signals, but they are not proof.

A system can suggest that a screen is likely cognitively heavy. It cannot prove whether the audience for that screen will actually understand it under realistic conditions. That still depends on context, task, and sometimes real user validation.
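A density estimate of this kind is a weighted heuristic, not a measurement. The weights and cut-off below are invented for illustration, not a validated cognitive-load model; the point is the output shape, a signal plus an explicit "review this" note:

```python
def density_signal(field_count: int, word_count: int,
                   distinct_actions: int) -> dict:
    # Weights and threshold are illustrative assumptions.
    score = 0.4 * field_count + 0.01 * word_count + 0.6 * distinct_actions
    return {
        "score": round(score, 2),
        "flag": score > 12,  # illustrative cut-off
        "note": "risk signal only; confirm with design and product context",
    }
```

A settings screen with 18 fields, 600 words of copy, and 9 distinct actions would score well above the cut-off, while a short form would not; either way, the `note` field keeps the interpretation obligation attached to the finding.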

What still needs human judgment

Some of the most important product questions remain hard to automate well:

  • Does this billing or consent flow feel trustworthy?
  • Is the language understandable for the real audience?
  • Can someone using assistive technology complete the task confidently, not just technically?
  • Does the sequence still make sense when the user is interrupted, zoomed in, or recovering from an error?

Automation can point. Humans still have to decide.

A four-step pre-release triage flow

Advice this broad needs a concrete shape, so here is a four-step triage flow that applies the evidence model described above.

Audit the release candidate

Audit the exact page, form, or journey that is about to ship. Do not rely on an old report from a different build or from a design-review environment.

Split findings by certainty

Create three buckets immediately: deterministic failures, machine-augmented risk signals, and manual-review items. That alone prevents most false-confidence reporting.

Inspect one evidence-rich example

Open one representative issue and make sure the report includes the page, the screenshot, the selector or component reference, and a short explanation of why the finding matters.

Convert the output into a release decision

Block on the explicit failures that affect the journey, review the top risk signals with a human, and document what still needs manual verification after ship.
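The final step can be sketched as a small decision function. The field names and decision labels are assumptions for illustration, not VertaaUX's actual API:

```python
def release_decision(machine_detectable: list[dict],
                     machine_augmented: list[dict]) -> dict:
    """Turn triaged findings into one of three release outcomes."""
    blocking = [f for f in machine_detectable
                if f.get("severity") == "critical"]
    if blocking:
        return {"decision": "block", "reasons": blocking}
    if machine_augmented:
        return {"decision": "human_review", "reasons": machine_augmented}
    # Nothing deterministic or risk-flagged: ship, but keep the
    # documented manual-verification list from step four.
    return {"decision": "ship_with_manual_follow_up", "reasons": []}
```

The asymmetry is deliberate: only deterministic critical failures block outright, while risk signals route to a person rather than a gate.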

What a useful audit excerpt looks like

A good report is not just a score. It is an evidence bundle.

JSON
{
  "page": "/checkout",
  "finding_type": "machine_detectable",
  "severity": "critical",
  "summary": "Form field has no accessible name",
  "selector": "input[name='email']",
  "wcag": ["4.1.2", "3.3.2"],
  "evidence": {
    "screenshot": "checkout-email-field.png",
    "notes": "Visible placeholder exists, but no programmatic label is exposed."
  },
  "manual_follow_up": "Verify full checkout flow with screen reader after remediation."
}

The useful shape is always the same:

  • what was found
  • where it was found
  • why it matters
  • what type of evidence supports it
  • what still needs manual review

If a report cannot answer those questions, it is too thin to support release decisions.
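That thinness test can itself be automated. A sketch that checks a finding against the five questions, using the field names from the sample excerpt (the minimum-shape list is our assumption, not a formal schema):

```python
# Fields that answer: what, where, why, evidence, and manual follow-up.
REQUIRED_FIELDS = ["summary", "page", "selector", "severity",
                   "evidence", "manual_follow_up"]

def is_decision_grade(finding: dict) -> bool:
    """True only if every required field is present and non-empty."""
    return all(finding.get(k) for k in REQUIRED_FIELDS)

sample = {
    "page": "/checkout",
    "severity": "critical",
    "summary": "Form field has no accessible name",
    "selector": "input[name='email']",
    "evidence": {"screenshot": "checkout-email-field.png"},
    "manual_follow_up": "Verify full checkout flow with screen reader.",
}
```

Findings that fail this check can still be logged, but they should not feed the release decision directly.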

A simple PR gate example

One practical way to use this in an engineering workflow is to keep the automated part strict and the interpretive part explicit.

YAML
audit_pr_preview:
  script:
    - vertaa audit -u $PREVIEW_URL --format json > audit.json
    - vertaa diff --baseline main --input audit.json --fail-on critical
  rules:
    - if: new_critical_machine_detectable_issues > 0
      then: block_merge
    - if: new_machine_augmented_risk_signals > 0
      then: require_human_review

That is a much better workflow than treating all findings as blockers or, worse, treating all findings as optional suggestions.
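The diff-and-gate rule in that pipeline reduces to a few lines of logic. This reimplementation is an assumption about how such a gate could behave, not the `vertaa` CLI's actual source:

```python
def gate(baseline: list[dict], current: list[dict]) -> str:
    """Compare findings against the baseline branch and pick an outcome."""
    def key(f: dict) -> tuple:
        return (f["page"], f["selector"], f["rule"])

    known = {key(f) for f in baseline}
    new = [f for f in current if key(f) not in known]

    if any(f["bucket"] == "machine_detectable"
           and f["severity"] == "critical" for f in new):
        return "block_merge"
    if any(f["bucket"] == "machine_augmented" for f in new):
        return "require_human_review"
    return "pass"
```

Diffing against a baseline matters: it keeps pre-existing debt from blocking every merge while still stopping new regressions at the door.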

Checklist before you trust a clean scan

  • The report scope matches the actual release candidate, not an old environment.
  • Deterministic findings are separated from lower-confidence risk signals.
  • The highest-risk journey was manually reviewed at least once.
  • The report includes at least one representative example with screenshot and selector-level evidence.
  • The team can explain what the scan did not cover.
  • Any customer-facing or compliance-facing summary avoids language like "fully accessible" unless stronger evidence exists.

Where VertaaUX fits

VertaaUX is most credible when it helps teams separate rule-backed findings, AI-assisted risk signals, and explicit "needs review" areas instead of pretending one score can explain everything.

That is also what makes the output more usable. Teams do not need another vague quality score. They need a clearer decision surface:

  • what machines can know with confidence
  • what machines can only suggest
  • what humans still need to judge

For a concrete example of the report shape this article is describing, see the sample audit report.


Automation can point. Models can estimate. Only humans can finally decide whether an experience is genuinely usable for the people it is meant to serve.
