Survey Design and Quantitative UX Research Methods (2026)
In short
Survey and quantitative research is the UXR craft of designing instruments and reading numeric signal without lying with statistics. The senior bar in 2026: you write non-leading questions, you pick scales (Likert, semantic differential, SUS, NPS, CSAT) deliberately, you reason about sample size and power before fielding, you triangulate quant with qual, you know when behavioral metrics replace self-report, and you partner with data science on causal claims rather than smuggling them in. Strong UXRs treat numbers as evidence, not decoration.
Key takeaways
- Question wording beats sample size for survey validity. Leading questions, double-barreled items, and unbalanced scales contaminate data no matter how many people respond. Erika Hall's 'Just Enough Research' (muleshq.com) and Caroline Jarrett's survey-design work at NN/g cover the canonical traps.
- Pick a scale on purpose. Likert (agreement), semantic differential (paired adjectives), SUS (10-item usability standard), CSAT (single-item satisfaction), NPS (likelihood-to-recommend) each measure different constructs. Jeff Sauro's measuringu.com is the canonical reference for choosing among them.
- Sample size depends on the decision, not a folk rule. Detecting a 5-point SUS difference between two designs needs roughly 30+ users per cell with reasonable power; ranking five tasks by completion needs different math. NN/g's quantitative-methods article (nngroup.com/articles/quantitative-research-methods) is a practical entry point.
- Behavioral metrics (clicks, scrolls, time-on-task, completion rate, error rate) are research data, not just analytics. They escape self-report bias and answer questions surveys can't (what did people actually do, how long did it take, where did they fail).
- Mixed methods is the senior default. Surveys tell you what and how many; interviews tell you why and how. Tomer Sharon's work (tomersharon.com) frames this as 'quantitative for breadth, qualitative for depth' — neither alone is enough for a confident claim.
- A/B test interpretation is a UXR skill, not just a data-science one. UXRs read effect sizes and confidence intervals, ask whether the metric is the right metric, and challenge novelty effects, sample-ratio mismatch, and peeking. The UXR contribution: framing what the experiment is actually measuring.
- Know when to hand off to data science. Causal inference, multivariate models, segmentation at scale, and longitudinal analysis are DS work. UXR owns instrument design, construct validity, and interpretation; DS owns the model. The partnership produces better claims than either does alone.
Survey design that produces actionable signal
Most UX surveys fail before a single response arrives because the questions are wrong. The fix is not more questions or more respondents — it is rigorous question design. The standard traps:
- Leading questions tell respondents what answer to give. "How helpful did you find our improved checkout?" presupposes the checkout is improved and that it is helpful. The neutral form: "How would you describe your checkout experience?" with a balanced scale.
- Double-barreled questions ask two things at once. "How easy and fast was the signup?" mixes two constructs (ease, speed) and produces uninterpretable data. Split into two items.
- Unbalanced scales push responses one direction. A 5-point scale running "Excellent / Very Good / Good / Fair / Poor" stacks three clearly positive options against one middling and one negative, biasing the mean upward. Balance it: "Very Good / Good / Neither good nor poor / Poor / Very Poor."
- Vague quantifiers ("often," "sometimes," "frequently") mean different things to different people. Replace with concrete frequencies ("more than once a week," "once a month or less").
- Recall bias happens when questions ask about distant past behavior. Anchor recall ("In the last 7 days...") or measure behavior directly via product analytics.
Pick the scale on purpose. The senior UXR repertoire:
- Likert (5- or 7-point agreement scales) — for attitudes and perceptions ("This product is easy to use"). Treat as ordinal; report medians or top-2-box, not means, when the audience is non-statistical.
- Semantic differential — paired opposite adjectives ("Confusing 1 — 7 Clear") with respondents marking a point on the continuum. Strong for brand and aesthetic perception.
- System Usability Scale (SUS) — 10-item standardized instrument, output 0-100. The most widely benchmarked usability measure; comparable across studies and products. Calculator and benchmarks at measuringu.com/sus; a scoring sketch follows this list.
- Customer Satisfaction (CSAT) — usually a single item ("How satisfied were you with X?") on a 1-5 or 1-7 scale. Cheap, fast, weak as a stand-alone signal.
- Net Promoter Score (NPS) — a single item ("How likely are you to recommend...?") on 0-10. Industry-standard but heavily critiqued in UXR circles for over-collapsing the data into a -100/+100 score and conflating loyalty with experience. Use it because stakeholders ask for it; do not let it crowd out richer measures.
- Open-ended items — one or two short text fields per survey for the "why" behind ratings. Code these into themes in a structured pass; do not skip them.
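SUS scoring is mechanical once the ten answers are in: odd-numbered (positively worded) items contribute their answer minus 1, even-numbered (negatively worded) items contribute 5 minus their answer, and the sum is scaled to 0-100. A minimal sketch of that arithmetic (the respondent data is illustrative):

```python
def sus_score(responses):
    """Score one completed SUS questionnaire.

    `responses` is a list of ten answers on a 1-5 scale, in item order.
    Odd-numbered items (1, 3, 5, ...) are positively worded and contribute
    (answer - 1); even-numbered items are negatively worded and contribute
    (5 - answer). The 0-40 sum of contributions is scaled to 0-100.
    """
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten answers, each between 1 and 5")
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # 0-based index: even index = odd-numbered item
        for i, r in enumerate(responses)
    ]
    return sum(contributions) * 2.5

# One respondent who leans positive on every item:
print(sus_score([4, 2, 4, 1, 5, 2, 4, 1, 4, 2]))  # 82.5
```

Average the per-respondent scores for the study-level SUS, and compare that average against published benchmarks rather than reading it as a percentage.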
Distribution and recruitment determine who actually responds, and recruitment bias often dwarfs measurement error. In-product surveys reach engaged users disproportionately; email panels reach people willing to do surveys for incentives; intercept surveys catch people in the moment but at low response rates. A confident claim names the sample and acknowledges what it cannot generalize to.
Sample-size and power thinking for UXRs
The folk rule "n=5 finds 85% of usability issues" (Nielsen, qualitative testing) is correct in its narrow original context and dangerously wrong as a universal heuristic. Quantitative survey work has different math, and applying the n=5 rule to a survey is the most common stats mistake junior UXRs make. The senior UXR thinks about sample size in terms of the decision being made:
- Are you describing one population (one design, one user group)? Confidence intervals get tighter as n grows, but with diminishing returns. n=30 gives reasonable precision for proportions in the 20-80% range; n=100+ is comfortable for finer slicing into segments. A quick precision check appears after this list.
- Are you comparing two designs (A vs B)? Power depends on the effect size you want to detect. A 5-point SUS difference is roughly a "medium" effect; detecting it with 80% power and α=0.05 needs roughly 30-50 per cell. Smaller effects need larger samples — quickly. A 2-point SUS difference may need n=200+ per cell to detect with confidence.
- Are you ranking many things (which task is hardest, which feature is most valued)? You need enough responses per cell that confidence intervals do not overlap. With 5 tasks and n=20 per task, your rank ordering is noisy and a different sample would produce a different ranking.
- Are you measuring rare events (catastrophic errors, drop-off at a specific step)? Rare events need very large samples. If 2% of users hit an error, n=100 will catch roughly 0-4 of them — not enough to characterize the failure or trust the rate estimate.
- Are you doing segment comparisons (mobile vs desktop, new vs returning)? The relevant n is per-segment, not total. A study with n=300 split 80/20 by platform has n=60 mobile users — adequate for description, marginal for comparison.
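The precision framing above is cheap to sanity-check before fielding. The sketch below (illustrative counts; statsmodels' Wilson intervals) shows how the confidence interval around an ordinary proportion and around a rare event tightens with n; for two-group comparisons, statsmodels' power module (TTestIndPower, NormalIndPower) does the equivalent arithmetic for a target effect size.

```python
# Sketch: how much precision does a given n actually buy?
# Counts below are illustrative assumptions, not study data.
from statsmodels.stats.proportion import proportion_confint

# Describing one design: a 60% task-completion rate at different sample sizes.
for n in (30, 100, 300):
    successes = round(0.60 * n)
    lo, hi = proportion_confint(successes, n, alpha=0.05, method="wilson")
    print(f"n={n:>3}: completion {successes / n:.0%}, 95% CI {lo:.0%}-{hi:.0%}")

# Rare events: a 2% error rate observed in n=100 vs n=1,000.
for errors, n in ((2, 100), (20, 1000)):
    lo, hi = proportion_confint(errors, n, alpha=0.05, method="wilson")
    print(f"n={n:>4}: error rate {errors / n:.0%}, 95% CI {lo:.1%}-{hi:.1%}")
```

At n=30 the interval around "60% completed" spans roughly 40 to 75 percent; at n=100 it is still about 20 points wide. That is the concrete version of "sample size depends on the decision": decide how wide an interval you can act on, then field enough responses to get it.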
Common UXR stats traps:
- Treating ordinal scales as interval. Likert means are convenient but assume equal spacing between scale points (that the gap between "Strongly Disagree" and "Disagree" equals the gap between "Disagree" and "Neither"), which ordinal data cannot guarantee. For non-statistical audiences, prefer top-2-box percentages or medians.
- Reporting averages for skewed distributions. Time-on-task is almost always right-skewed; the mean is pulled by a few slow users. Report the median and the IQR (interquartile range), not the mean and SD.
- p-hacking by re-slicing. If you run 20 segment comparisons, one will be "significant" at p<0.05 by chance. Pre-register the analysis plan or apply Bonferroni / Benjamini-Hochberg correction.
- Confusing statistical significance with practical significance. With huge samples, trivial differences become statistically significant. Report effect sizes (Cohen's d, odds ratios, percentage-point differences), not just p-values.
- Confidence intervals only for the metric, not the comparison. Two means with overlapping CIs may still differ significantly; the CI of the difference is the right reference for the comparison claim (a worked sketch follows this list).
- Survivorship bias in the sample. If you only survey current users, you cannot make claims about churned users. Name the sample frame in every readout.
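Two of those traps are cheap to demonstrate in a few lines: the CI of the difference (not just each arm's CI), and median/IQR for skewed time-on-task. A minimal sketch with illustrative numbers, using the simple Wald approximation for the difference between two proportions:

```python
import math
import statistics

# CI of the difference between two completion rates (Wald approximation).
# Rates and sample sizes are illustrative, not real study data.
p_a, n_a = 0.58, 200   # design A: 58% completion
p_b, n_b = 0.70, 200   # design B: 70% completion
diff = p_b - p_a
se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"difference {diff:+.0%}, 95% CI [{lo:+.1%}, {hi:+.1%}]")
# If this interval includes 0, the comparison claim is not supported,
# even when the two arms' individual CIs barely overlap.

# Skewed time-on-task: report median and IQR, not mean and SD.
times = [21, 24, 25, 27, 28, 30, 31, 33, 38, 95]  # seconds; one slow outlier
q1, median, q3 = statistics.quantiles(times, n=4)
print(f"median {median:.0f}s, IQR {q1:.0f}-{q3:.0f}s "
      f"(mean {statistics.mean(times):.0f}s is pulled up by the outlier)")
```

The same framing carries into the readout: report the difference with its interval, and describe task time by its middle and spread rather than an average that one slow participant can move.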
NN/g's quantitative-research-methods overview (nngroup.com/articles/quantitative-research-methods) and Jeff Sauro's body of work at measuringu.com are the canonical references for sample-size thinking and stats traps in UX work. The senior bar: when a stakeholder asks "is the difference real?", you answer with a confidence interval, not a p-value.
Mixed methods: when to combine survey + interview
Surveys answer "what" and "how many." Interviews answer "why" and "how." Behavioral analytics answer "what did people actually do." A confident research claim usually needs at least two of the three. The senior UXR routinely combines them:
- Sequential explanatory (quant → qual). Run a survey to find the pattern; run interviews with a subsample to explain it. "Mobile users rate the dashboard 1 point lower on SUS — interviews with 8 of them surfaced that the filter sidebar collapses on small screens and disappears."
- Sequential exploratory (qual → quant). Interview to discover the dimensions; survey to size them. "Interviews surfaced four reasons people abandon onboarding; survey of 800 churned users showed reason #2 accounts for 47%, reasons #1, #3, #4 each under 20%."
- Concurrent triangulation. Run quant and qual in parallel and look for convergence or divergence. Convergent signals are strong; divergent signals are interesting (the qual is often closer to the truth when survey items are weak).
- Survey-then-diary (or diary-then-survey). A diary study captures longitudinal qualitative texture; a survey at either end sizes it. Combining the two reveals patterns that one-shot research cannot see.
Behavioral metrics — clicks, scrolls, time-on-task, completion rate, error rate, drop-off, return rate — are quantitative research data and should be treated that way. They escape three big self-report problems: people misremember their behavior, people answer aspirationally, and people cannot describe sub-second interactions. The senior UXR pairs survey constructs with behavioral observations: a high SUS score combined with a 22% task-completion rate is a contradiction worth investigating, and the contradiction is usually the finding.
Practical mixed-methods patterns that hold up in product work:
- Survey at scale + 5-8 follow-up interviews. The cheapest mixed-methods design. The survey gives you the distribution; the interviews give you the mechanism. Recruit interview participants from the survey itself by adding a "may we contact you?" item.
- Behavioral cohort + intercept survey. Identify a behavioral cohort (e.g., users who abandoned at step 3) from analytics; intercept a sample with a short survey or interview invitation. Aligns the qual to the exact behavior rather than the user's general impression.
- Usability test (qual) + SUS at the end (quant). Standard mid-scale practice. The qualitative observations explain the SUS score; the SUS makes the readout comparable across studies and over time.
Tomer Sharon's framing (tomersharon.com) — "quantitative for breadth, qualitative for depth" — is the working principle. Erika Hall's "Just Enough Research" (muleshq.com) makes the case that mixed methods at small scale beat single-method studies at large scale, because the evidence triangulates rather than stacking on a single weak instrument. The strongest research readouts always cite at least two independent data streams pointing at the same finding.
When to partner with data science vs do quant in-house
UXR and data science overlap in the middle and diverge at the edges. The senior UXR knows where the line is and crosses it deliberately, with a partner. The clean way to think about it: UXR owns the question and the instrument; DS owns the model and the math; both own interpretation.
UXR owns:
- Instrument design — what to measure, how to ask, which scale, what construct.
- Construct validity — does this metric actually capture the thing we care about?
- Survey fielding and recruitment — sampling frame, distribution, response-bias awareness.
- Descriptive analysis — frequencies, medians, IQRs, top-2-box, simple comparisons.
- Interpretation — what the number means in product context, what the qualitative data layered on top of it suggests.
Data science owns:
- Causal inference — A/B test design, instrumental variables, regression discontinuity, difference-in-differences. UXR contributes to the question; DS owns the model.
- Multivariate analysis — driver analysis, factor analysis, cluster/segmentation at scale, structural equation modeling.
- Longitudinal analysis — cohort tracking, churn modeling, time-series, survival analysis.
- Production analytics infrastructure — event schema, data warehouse, dashboards, experimentation platform.
The handoff signals: when a question requires controlling for confounders, when the analysis involves modeling rather than describing, when the data lives in the warehouse rather than in a survey tool, or when the claim is causal ("does X cause Y") rather than correlational ("X and Y co-occur"). At that point the UXR brings the question, the construct definition, and the qualitative context; the DS brings the model. The reverse handoff also matters: when DS finds a quantitative pattern they cannot explain, UXR brings interviews to surface the mechanism.
A/B test interpretation sits squarely on the boundary and is the place where UXR-DS partnership produces the most lift. UXR contributions: framing what the experiment actually measures (is the success metric the right metric, or just the easy metric?); challenging novelty effects (was the lift real, or a one-week curiosity bump that decayed?); checking sample-ratio mismatch (did the assignment work as intended?); reading qualitative tickets, support contacts, and session replays from the test period to add texture; questioning whether a "neutral" result is genuinely neutral or hides offsetting subgroup effects. DS contributions: the statistical test, the confidence interval, the corrections for multiple comparisons, the power analysis, the segment cuts. The strongest experiment readouts come from both disciplines reading the same data together and writing the recommendation jointly.
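The SRM check itself is small enough to sketch. A minimal version, assuming a planned 50/50 split and using a chi-square goodness-of-fit test on the assignment counts (the counts and the alert threshold below are illustrative):

```python
# Sketch: sample-ratio-mismatch check before reading any experiment metric.
from scipy.stats import chisquare

observed = [50_312, 48_705]        # users actually assigned to arms A and B
planned_split = [0.5, 0.5]         # the experiment was designed as 50/50
total = sum(observed)
expected = [total * share for share in planned_split]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:                # strict thresholds are conventional for SRM checks
    print(f"Possible SRM (p={p_value:.2g}): fix assignment or logging before trusting the metrics.")
else:
    print(f"Arm split is consistent with the plan (p={p_value:.2g}).")
```

The point is not the test itself but the habit: look at the per-arm counts before the metric, and treat a broken split as a blocked readout rather than a caveat.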
Three patterns to avoid. First, the UXR who runs regression in a survey tool because they took a stats class once — the math may be technically correct but lacks the validation a DS partnership provides, and stakeholders will (rightly) discount it. Second, the DS who reports significance without UXR context — the number is real but the meaning is missing. Third, the silent handoff where one team produces a result and the other team is asked to bless it post-hoc; this almost always produces a worse claim than co-authoring from the start.
The senior UXR who builds a real partnership with DS produces better claims than either does alone. That is the whole job in three words: better claims, together.
Frequently asked questions
- What is the right scale for a usability survey — Likert, SUS, or NPS?
- SUS for benchmarked, comparable usability scores (10 items, 0-100 output). Likert for specific attitudes about features ('the dashboard is easy to scan'). NPS for stakeholder reporting and trend tracking; do not use it as your only experience measure because it conflates loyalty with usability. The senior pattern: SUS plus 2-3 Likert items targeting the specific design questions, plus one open-ended 'why' item.
- How big does my sample need to be for a UX survey?
- It depends on the decision. Describing one design with reasonable precision: roughly n=30-100 depending on the metric. Comparing two designs at a 5-point SUS difference with 80% power: roughly 30-50 per cell. Detecting rare events (2-5% error rates): hundreds to thousands. The wrong question is 'is n=X enough'; the right question is 'what decision am I making and what precision do I need to make it confidently?' NN/g and Jeff Sauro's measuringu.com have practical sample-size guides.
- Should I use NPS as my main UX metric?
- No. NPS is fine as one input in a stakeholder dashboard, but as a sole experience measure it is over-collapsed (the -100/+100 score throws away most of the data) and conflates loyalty with usability. The senior UXR treats NPS as a tracking metric for trend and uses SUS, CSAT, task-success rate, and qualitative feedback for actual experience signal.
- Are click and scroll data 'real' research data?
- Yes — and often more honest than self-report. Behavioral metrics escape recall bias, social-desirability bias, and aspirational answers. They cannot tell you why someone did what they did, which is where qualitative research comes in. The senior UXR pairs behavioral signal (what happened) with self-report and interview signal (why it happened) and calls out contradictions between them as findings.
- How do I avoid leading questions when I have a strong hypothesis?
- Write three versions of every question — one that leads toward your hypothesis, one that leads away, and one that is neutral — then field the neutral version. Have a colleague read the survey blind without knowing your hypothesis and predict the modal answer; if they can, the question is leading. Pre-test surveys with 5-10 people in a think-aloud before fielding to catch wording issues. Caroline Jarrett's NN/g survey-design articles are the canonical reference.
- When should I hand off analysis to a data scientist?
- When the question is causal ('does this change cause the lift?'), when the analysis requires multivariate modeling (driver analysis, segmentation at scale, longitudinal cohorts), when the data lives in the warehouse rather than a survey tool, or when an A/B test needs designing rather than just reading. UXR keeps instrument design, construct validity, descriptive analysis, and interpretation. The partnership produces stronger claims than either role alone.
- What is sample-ratio mismatch and why does it matter for UXRs reading A/B tests?
- Sample-ratio mismatch is when the actual split between A and B differs significantly from the planned split (a planned 50/50 split should come out close to 50/50, within sampling noise). When it is off, something is wrong with assignment or logging — the test is broken and the result is unreliable. UXRs reading test results should always check the sample sizes per arm before reading the metric; a 'significant' result on a broken test is misleading. DS partners typically flag SRM, but UXRs should know to ask.
- How do I report quantitative results to non-statistical stakeholders?
- Lead with the decision, not the statistic. 'We should keep version B; users completed the task 18% more often (52% vs 44%, n=200 per arm).' Use top-2-box percentages or medians, not means with standard deviations, when the audience is non-technical. Always include a confidence interval or range on key numbers. Pair every chart with one qualitative quote that illustrates the finding so the number has texture, not just precision.
Sources
- Nielsen Norman Group — Which UX Research Methods to Use. Canonical decision framework for choosing among quant and qual methods.
- Nielsen Norman Group — Quantitative Research Methods overview. Practical entry point for sample size, scales, and stats traps in UXR.
- MeasuringU (Jeff Sauro) — canonical practitioner site for quantitative UX, statistics, and benchmarking. Deep coverage of sample size and effect sizes.
- MeasuringU — System Usability Scale calculator and benchmarks. Canonical reference for SUS scoring, percentile interpretation, and comparison data.
- Tomer Sharon — researcher and author of 'Validating Product Ideas' and 'It's Our Research'. Canonical for mixed-methods UXR practice.
- Mule Design Studio (Erika Hall) — author of 'Just Enough Research'. Canonical perspective on right-sized research and survey-design pitfalls.
About the author. Blake Crosley founded ResumeGeni and writes about UX research, hiring technology, and ATS optimization. More writing at blakecrosley.com.