Evidence-first · Versioned rules · Confidence-scored · Deterministic core

Accuracy & Credibility

VertaaUX audits are designed to be verifiable. We don't just produce findings; we also publish what makes them trustworthy: which parts are deterministic vs. LLM-assisted, how confidence is assigned, and how each finding links back to evidence you can review.

Deterministic Where Possible
Core checks are computed from the page snapshot (DOM/CSS/layout) and produce consistent results for the same input and ruleset.
Evidence Attached
Findings include selectors and other traceability fields where available, so humans can reproduce and review them quickly.
Measured & Reported
We track accuracy signals (precision/recall, false positives, stability) and publish methodology so numbers are interpretable.

What Accuracy Means Here

“Accuracy” is not a single score. We publish multiple signals so teams (and investors) can judge credibility from different angles: correctness, repeatability, and traceability.

Correctness (Precision / Recall)
We aim to quantify how often findings are true (precision) and how often real issues are caught (recall). Where a labeled dataset is not yet available, we label metrics explicitly as “target” vs. “measured”.
Stability (Rerun Variance)
We measure whether repeated audits of the same snapshot produce the same outputs. Deterministic checks should be stable; any probabilistic outputs are confidence-scored.
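As an illustration of what a rerun-stability check can look like, here is a minimal sketch that fingerprints the deterministic fields of each run's findings and compares reruns. It assumes findings are exported as JSON-like objects with fields such as `rule_id` and `selector`; the function names are hypothetical, not VertaaUX APIs.

```python
import hashlib
import json

def findings_fingerprint(findings):
    """Hash the deterministic fields of a findings list, ignoring order."""
    keys = ("rule_id", "ruleset_version", "selector", "severity")
    canonical = sorted(
        json.dumps({k: f.get(k) for k in keys}, sort_keys=True)
        for f in findings
    )
    return hashlib.sha256("\n".join(canonical).encode()).hexdigest()

def is_stable(runs):
    """True if every rerun produced an identical set of deterministic findings."""
    return len({findings_fingerprint(run) for run in runs}) == 1
```

Because the fingerprint sorts findings before hashing, two runs that report the same issues in a different order still count as stable; only a change in the findings themselves breaks stability.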

Validation & Methodology

This is how a VertaaUX audit is produced and how to interpret the results.

Deterministic vs. LLM-assisted

Most detection is deterministic (DOM/CSS/interaction simulation). Optional enhancement layers may use ML/LLMs for summarization, prioritization, or remediation suggestions.

  • Deterministic: snapshot parsing, semantic structure, keyboard operability, scoring.
  • LLM-assisted (when enabled): remediation suggestions and other higher-level recommendations.
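To make "deterministic" concrete, a semantic-structure check like heading-hierarchy validation can be computed purely from the DOM with no model in the loop. The toy sketch below uses Python's stdlib HTML parser and is illustrative only, not the production implementation:

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect h1..h6 levels in document order."""
    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self.levels.append(int(tag[1]))

def heading_skips(html):
    """Return (previous, current) pairs where a heading level is skipped."""
    parser = HeadingCollector()
    parser.feed(html)
    return [
        (prev, cur)
        for prev, cur in zip(parser.levels, parser.levels[1:])
        if cur > prev + 1
    ]
```

Given the same snapshot, this check always returns the same skips: identical input, identical output, which is what makes the result repeatable and directly verifiable against the DOM.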
Evidence Fields per Finding

Findings are designed to be reviewable. When available, each finding includes:

  • rule_id and ruleset_version for traceability and repeatability.
  • confidence for probabilistic outputs (LLM/ML-assisted).
  • selector and/or element reference for reproduction.
  • Evidence such as DOM snippet, WCAG references, and links to the audited page context.
Expected false positives / false negatives

Common false positives

  • Intentional design choices (e.g., visually subtle focus indicators) that still satisfy internal standards.
  • Highly customized components where semantics are correct but hard to infer from DOM alone.
  • Dynamic UI states that differ between initial load and real-user interaction flows.

Common false negatives

  • Content behind authentication, feature flags, or geo blocks.
  • Issues that only appear after long sessions, complex multi-step flows, or user-generated data.
  • Late-loading elements that appear after network-idle snapshots.
Example: evidence-backed finding schema

This example shows the shape we use for traceability. Not every finding includes every field yet; missing fields are treated as “needs human review” rather than forced certainty.

{
  "category": "semantic",
  "severity": "warning",
  "rule_id": "semantic.heading_hierarchy",
  "ruleset_version": "1.0.0",
  "confidence": 0.92,
  "selector": "main h3:nth-of-type(2)",
  "dom_snippet": "<h3>Pricing</h3>",
  "wcag_reference": "WCAG 2.2 — 1.3.1",
  "evidence": {
    "why": "Heading level skips from H1 to H3, which can confuse assistive tech.",
    "how_to_verify": "Inspect the DOM and confirm heading order matches visual hierarchy."
  }
}

You can export raw JSON from the audit results view (Export JSON) and validate selectors and rules against the page DOM.
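One way to consume that export is to triage findings by completeness, mirroring the "missing fields mean needs human review" policy above. This is a sketch against the example schema; the required-field list and function name are assumptions for illustration:

```python
# Fields assumed mandatory for automated review; adjust to the actual schema.
REQUIRED = ("rule_id", "ruleset_version", "severity", "category")

def triage(findings):
    """Split exported findings into machine-reviewable vs. needs-human-review."""
    reviewable, needs_review = [], []
    for finding in findings:
        has_required = all(key in finding for key in REQUIRED)
        has_locator = "selector" in finding or "dom_snippet" in finding
        (reviewable if has_required and has_locator else needs_review).append(finding)
    return reviewable, needs_review
```

A finding without a selector or DOM snippet cannot be reproduced mechanically, so it lands in the human-review bucket rather than being trusted as-is.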

Trust Markers

Credibility is also operational. Beyond methodology, we provide signals that the system is run and maintained responsibly.

Versioned Rules
Changes are tracked and shipped with a ruleset version so teams can compare results over time.
Signed Webhooks
Webhook payloads are signed (HMAC) and deliveries are logged with retries for auditability.
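On the receiving side, an HMAC-signed webhook is verified by recomputing the signature over the raw request body with the shared secret and comparing in constant time. The sketch below assumes HMAC-SHA256 with a hex-encoded signature; the actual algorithm and signature header are VertaaUX configuration details not specified here:

```python
import hashlib
import hmac

def verify_signature(payload: bytes, received_sig: str, secret: bytes) -> bool:
    """Recompute HMAC-SHA256 over the raw body and compare in constant time."""
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # compare_digest avoids timing side channels on the comparison.
    return hmac.compare_digest(expected, received_sig)
```

Verify against the raw bytes before any JSON parsing; re-serializing the body can change whitespace or key order and invalidate an otherwise correct signature.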
Security Posture
Security headers, rate limiting, and SSRF protections are applied to reduce misuse and protect infrastructure.