
Outset UX Evals: A How To Guide

Jan 22, 2026

Christopher Monnier


For years, teams have relied on benchmarks, win rates, and tightly‑scoped usability tests to decide whether an AI product is “good enough.” And yet, many of those same products still leave users confused, mistrustful, or quietly dissatisfied.

UX Evals exist to close that gap.

This guide is a practical companion to Introducing UX Evals, offering a step-by-step approach to actually running UX Evals in the real world.

What you’re actually evaluating

UX Evals don’t ask whether a model passes a rubric. They ask something simpler—and harder: Did this interaction deliver value to the person using it?

Instead of synthetic prompts and third‑party judges, UX Evals center first‑person, multi‑turn experiences with real people. Participants bring their own goals, imperfect prompts, shifting intent, and lived context, and then evaluate the experience themselves.

When UX Evals are the right tool

UX Evals are most useful when:

  • The experience unfolds over multiple turns or interactions

  • User intent is ambiguous or evolves over time

  • The system plays a role in decisions, judgment, or sensemaking

  • Trust, tone, and clarity matter as much as correctness

If the question you’re trying to answer is “Which experience actually helps people move forward?”, you’re in the right place.

How to run Outset UX Evals

We’ve worked with customers like Microsoft Copilot to analyze UX Evals they’ve run on our platform and templatize the setup, making it easier for researchers to get started.

Here’s a step-by-step guide to the process if you’re running one manually.

Step 1: Human-Moderated Qual to Define the Experience Dimensions

Before running a large-scale AI-moderated eval, spend time in a small number of 1:1, human-moderated interviews related to the specific type of experience you’re evaluating. Pay close attention to how users describe good and bad interactions in their own words.

From this work, define 8–12 experience dimensions that matter to users in this particular context. These are the criteria users naturally use to judge whether an interaction worked for them.

These should be:

  • Grounded in the language users actually use

  • Stable enough to compare across versions

  • Experiential (not technical or model-internal)

Common examples include:

  • Usefulness

  • Trust

  • Clarity

  • Helpfulness

  • Credibility

These dimensions become the backbone of the eval.

Step 2: Select the Right Participants

UX Evals work best when participants reflect real use, not hypothetical use.

Use demographic criteria and behavioral screeners to recruit people who:

  • Actually rely on AI tools

  • Are in relevant contexts or situations

  • Would plausibly use the product you’re evaluating

A well-chosen audience produces cleaner signal than a broad but generic one.

Step 3: Set Up the Comparison

Most UX Evals compare two experiences:

  • Two versions of your product, or

  • Your product versus a competitor

Participants will try both experiences in a counterbalanced order to reduce ordering effects.

This structure allows you to measure not just absolute ratings, but relative differences—which are often more meaningful.
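Counterbalancing can be handled with a simple random assignment. As an illustrative sketch (names are hypothetical, not part of the Outset platform), here is one way to split a participant pool evenly between the two orderings in Python:

```python
import random

def assign_orders(participant_ids, seed=0):
    """Randomly split participants between A-first and B-first orders,
    keeping the two orders balanced across the sample."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    # First half of the shuffled list tries experience A first
    return {pid: ("A", "B") if i < half else ("B", "A")
            for i, pid in enumerate(ids)}
```

With an even sample size this puts exactly half of participants in each order, so ordering effects wash out when you compare the two experiences.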

Step 4: Define the Interaction Space

Rather than scripted prompts, define a broad interaction category that mirrors real use.

For example:

  • Planning something

  • Getting advice

  • Working through a real problem

Participants are instructed to:

  • Choose their own goal

  • Use their own wording

  • Interact for as long as they normally would

This keeps the experience realistic while still bounding the eval to a comparable space.

Step 5: Collect Ratings and Rationale

After each interaction, collect three things:

  1. Quantitative ratings
    Participants rate the experience across your predefined dimensions using Likert scales. Present dimensions in randomized order.

  2. Qualitative rationale
    AI-moderated follow-up questions capture why participants gave each rating, while the interaction is still fresh.

  3. Preference
    After completing both experiences, participants choose which one they preferred overall.

This combination ensures you get both directional signal and interpretability.
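The randomized presentation of dimensions is easy to get right with a per-participant seed. A minimal sketch, assuming the example dimensions from Step 1 (your own study’s list will differ):

```python
import random

# Example dimensions from Step 1; substitute your own study's list.
DIMENSIONS = ["usefulness", "trust", "clarity", "helpfulness", "credibility"]

def rating_questions(participant_id, dimensions=DIMENSIONS):
    """Build 7-point Likert questions for one participant, presenting
    the dimensions in a per-participant randomized order."""
    rng = random.Random(f"ratings-{participant_id}")  # deterministic per participant
    order = list(dimensions)
    rng.shuffle(order)
    return [f"On a scale of 1 to 7, how would you rate the {d} of this experience?"
            for d in order]
```

Seeding on the participant ID keeps each person’s question order stable across both experiences, so their two ratings of the same dimension are directly comparable.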

Step 6: Analyze for Action

With sufficient responses, analyze:

  • Mean scores by dimension

  • Relative differences between experiences

  • Which dimensions most strongly predict preference

Then layer in qualitative artifacts — screen recordings, video clips, and follow-ups — to understand what’s driving the numbers.
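To make the quantitative side concrete, here is a minimal sketch of two of those analyses: mean rating gaps by dimension, and a crude check of how often each dimension tracks overall preference. The data structures are hypothetical toy examples, and a real analysis would use larger samples plus proper significance tests or a regression.

```python
from statistics import mean

# Toy responses: per participant, a 1-7 Likert rating of each dimension
# for experiences A and B, plus an overall preference. Illustrative only.
responses = [
    {"ratings": {"A": {"trust": 6, "clarity": 5},
                 "B": {"trust": 4, "clarity": 5}}, "preferred": "A"},
    {"ratings": {"A": {"trust": 5, "clarity": 4},
                 "B": {"trust": 3, "clarity": 6}}, "preferred": "A"},
    {"ratings": {"A": {"trust": 3, "clarity": 5},
                 "B": {"trust": 6, "clarity": 4}}, "preferred": "B"},
]

def dimension_gaps(responses, dimensions):
    """Mean per-participant rating gap (A minus B) for each dimension."""
    return {d: mean(r["ratings"]["A"][d] - r["ratings"]["B"][d]
                    for r in responses)
            for d in dimensions}

def preference_alignment(responses, dimensions):
    """Fraction of participants whose overall preference matches the
    experience they rated higher on a dimension (ties excluded) --
    a rough proxy for how strongly that dimension predicts preference."""
    out = {}
    for d in dimensions:
        decided = [r for r in responses
                   if r["ratings"]["A"][d] != r["ratings"]["B"][d]]
        hits = sum(1 for r in decided
                   if (r["ratings"]["A"][d] > r["ratings"]["B"][d])
                   == (r["preferred"] == "A"))
        out[d] = hits / len(decided) if decided else None
    return out
```

Dimensions whose alignment sits well above chance are the ones most worth investing in, since improving them is most likely to move overall preference.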

The output should be a short list of:

  • Clear strengths to preserve

  • Specific breakdowns to address

  • Dimensions where improvements will matter most to users

What you should expect to learn

A well-run UX Eval produces:

  • Statistically meaningful comparisons of subjective experience

  • Clear evidence of which aspects of the experience matter most to users

  • Concrete guidance for product and model iteration

UX Evals don’t necessarily replace benchmarks or traditional UX research. They complement them by capturing something other methods miss: how it actually feels to use the system.

Final note

As AI products move further away from deterministic flows and toward open-ended interaction, evaluation has to evolve alongside them.

UX Evals provide structure without flattening real use, and signal without losing human context.

About the author
Christopher Monnier

Principal UX Researcher, Microsoft Copilot

Chris Monnier is a Principal UX Researcher on the Microsoft Copilot team, where he and his team are redefining how we understand users in the new AI era. With a background at companies like Airbnb and Opendoor prior to Microsoft, Chris brings true curiosity to research and loves helping builders figure out the world.

Interested in learning more? Book a personalized demo today!


Subscribe to our newsletter

Enter your contact details to get the latest tips and stories to help boost your business. 
