Outset UX Evals: A How-To Guide
Jan 22, 2026

Christopher Monnier
For years, teams have relied on benchmarks, win rates, and tightly‑scoped usability tests to decide whether an AI product is “good enough.” And yet, many of those same products still leave users confused, mistrustful, or quietly dissatisfied.
UX Evals exist to close that gap.
This guide is a practical companion to Introducing UX Evals, offering a step-by-step approach to running UX Evals in the real world.
What you’re actually evaluating
UX Evals don’t ask whether a model passes a rubric. They ask something simpler—and harder: Did this interaction deliver value to the person using it?
Instead of synthetic prompts and third‑party judges, UX Evals center first‑person, multi‑turn experiences with real people. Participants bring their own goals, imperfect prompts, shifting intent, and lived context, and then evaluate the experience themselves.
When UX Evals are the right tool
UX Evals are most useful when:
The experience unfolds over multiple turns or interactions
User intent is ambiguous or evolves over time
The system plays a role in decisions, judgment, or sensemaking
Trust, tone, and clarity matter as much as correctness
If the question you’re trying to answer is “Which experience actually helps people move forward?”, you’re in the right place.
How to run Outset UX Evals
We’ve worked with customers like Microsoft Copilot to analyze UX Evals they’ve run on our platform and templatize the setup, making it easier for researchers to get started.
Here’s a step-by-step guide to the process if you’re running one manually.

Step 1: Human-Moderated Qual to Define the Experience Dimensions
Before running a large-scale AI-moderated eval, spend time in a small number of 1:1, human-moderated interviews related to the specific type of experience you’re evaluating. Pay close attention to how users describe good and bad interactions in their own words.
From this work, define 8–12 experience dimensions that matter to users in this particular context. These are the criteria users naturally use to judge whether an interaction worked for them.
These should be:
Grounded in the language users actually use
Stable enough to compare across versions
Experiential (not technical or model-internal)
Common examples include:
Usefulness
Trust
Clarity
Helpfulness
Credibility
These dimensions become the backbone of the eval.
Step 2: Select the Right Participants
UX Evals work best when participants reflect real use, not hypothetical use.
Use demographic criteria and behavioral screeners to recruit people who:
Actually rely on AI tools
Are in relevant contexts or situations
Would plausibly use the product you’re evaluating
A well-chosen audience produces cleaner signal than a broad but generic one.
Step 3: Set Up the Comparison
Most UX Evals compare two experiences:
Two versions of your product, or
Your product versus a competitor
Participants will try both experiences in a counterbalanced order to reduce ordering effects.
This structure allows you to measure not just absolute ratings, but relative differences—which are often more meaningful.
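If you’re handling the counterbalancing yourself rather than letting your research platform assign orders, a minimal sketch like the one below is enough to keep the two orders balanced. The participant IDs, experience labels, and seed are placeholders, not part of any particular tool’s API.

```python
import random

# Minimal sketch: counterbalance which experience each participant sees first.
# "A" and "B" stand for the two experiences being compared; IDs are placeholders.
EXPERIENCES = ("A", "B")

def assign_orders(participant_ids, seed=42):
    """Give half the participants A-then-B and the other half B-then-A."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)  # randomize who lands in which half
    return {
        pid: EXPERIENCES if i % 2 == 0 else EXPERIENCES[::-1]
        for i, pid in enumerate(ids)
    }

print(assign_orders(["p01", "p02", "p03", "p04"]))
# e.g. {'p03': ('A', 'B'), 'p01': ('B', 'A'), ...}
```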
Step 4: Define the Interaction Space
Rather than scripted prompts, define a broad interaction category that mirrors real use.
For example:
Planning something
Getting advice
Working through a real problem
Participants are instructed to:
Choose their own goal
Use their own wording
Interact for as long as they normally would
This keeps the experience realistic while still bounding the eval to a comparable space.
Step 5: Collect Ratings and Rationale
After each interaction, collect three things:
Quantitative ratings
Participants rate the experience across your predefined dimensions using Likert scales. Present dimensions in randomized order.
Qualitative rationale
AI-moderated follow-up questions capture why participants gave each rating, while the interaction is still fresh.
Preference
After completing both experiences, participants choose which one they preferred overall.
This combination ensures you get both directional signal and interpretability.
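As a rough illustration of how these three pieces fit together, here’s a sketch of the record you might capture per interaction. The dimension names and the 1–5 Likert scale are assumptions carried over from Step 1, not a required schema.

```python
import random

# Rough sketch of what gets captured after each interaction.
# Dimension names and the 1-5 scale are assumptions, not a required schema.
DIMENSIONS = ["usefulness", "trust", "clarity", "helpfulness", "credibility"]

def make_rating_form(participant_id, experience):
    """Present the dimensions in a per-participant, per-experience randomized order."""
    order = DIMENSIONS[:]
    random.Random(f"{participant_id}-{experience}").shuffle(order)
    return {
        "participant": participant_id,
        "experience": experience,
        "dimension_order": order,  # randomized presentation order
        "ratings": {},             # dimension -> 1-5 Likert rating
        "rationale": {},           # dimension -> follow-up answer, captured while fresh
    }

# Overall preference is recorded once per participant, after both experiences:
# {"participant": "p01", "preferred": "A"}
```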

Step 6: Analyze for Action
With sufficient responses, analyze:
Mean scores by dimension
Relative differences between experiences
Which dimensions most strongly predict preference
Then layer in qualitative artifacts — screen recordings, video clips, and follow-ups — to understand what’s driving the numbers.
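If your responses land in a spreadsheet or CSV, a short sketch along these lines (using pandas, with made-up toy ratings) covers the three comparisons above: means by dimension, the gap between experiences, and which dimension gaps track overall preference. The simple correlation at the end stands in for a fuller driver analysis, such as a regression, once you have enough responses.

```python
import pandas as pd

# Toy data: one row per participant per experience, Likert ratings 1-5,
# plus the participant's overall preference repeated on each row.
df = pd.DataFrame({
    "participant": ["p01", "p01", "p02", "p02", "p03", "p03"],
    "experience":  ["A", "B", "A", "B", "A", "B"],
    "usefulness":  [4, 3, 5, 4, 3, 4],
    "trust":       [3, 4, 4, 2, 3, 4],
    "clarity":     [5, 3, 4, 4, 4, 5],
    "preferred":   ["A", "A", "A", "A", "B", "B"],
})
dimensions = ["usefulness", "trust", "clarity"]

# Mean scores by dimension for each experience
means = df.groupby("experience")[dimensions].mean()
print(means)

# Relative differences between experiences (A minus B)
print(means.loc["A"] - means.loc["B"])

# Which dimensions track preference: correlate each participant's
# A-minus-B gap with whether they preferred A overall
wide = df.pivot(index="participant", columns="experience", values=dimensions)
gaps = wide.xs("A", axis=1, level="experience") - wide.xs("B", axis=1, level="experience")
preferred_a = df.groupby("participant")["preferred"].first().eq("A").astype(int)
print(gaps.corrwith(preferred_a))
```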
The output should be a short list of:
Clear strengths to preserve
Specific breakdowns to address
Dimensions where improvements will matter most to users
What you should expect to learn
A well-run UX Eval produces:
Statistically meaningful comparisons of subjective experience
Clear evidence of which aspects of the experience matter most to users
Concrete guidance for product and model iteration
UX Evals don’t necessarily replace benchmarks or traditional UX research. They complement them by capturing something other methods miss: how it actually feels to use the system.
Final note
As AI products move further away from deterministic flows and toward open-ended interaction, evaluation has to evolve alongside them.
UX Evals provide structure without flattening real use, and signal without losing human context.
About the author

Christopher Monnier
Principal UX Researcher, Microsoft Copilot
Chris Monnier is a Principal UX Researcher on the Microsoft Copilot team, where he and his team are redefining how we understand users in the new AI era. With a background at companies like Airbnb and Opendoor prior to Microsoft, Chris brings genuine curiosity to research and loves helping builders figure out the world.
Interested in learning more? Book a personalized demo today!






