Case & Behavioral | Hard | Asked at: Amazon, Etsy, Google, Walmart

How would you design a metric to measure the quality of a search feature inside an e-commerce app?

For: Data Scientist, ML Engineer, Data Analyst

The short answer

Search quality has two sides: relevance (did the results match the query's intent?) and utility (did the user accomplish their goal?). A good metric system combines an offline relevance signal, such as NDCG computed against human-labelled queries, with online behavioural signals, such as click-through rate at rank 1 and zero-result rate, and ties both to a downstream business outcome like add-to-cart rate.

How to think about it

Two tracks: offline and online metrics

Offline relevance metrics (model evaluation)

These use a labelled dataset of query-result pairs, where the relevance labels come either from human raters or from implicit signals in historical data.

NDCG@k (Normalised Discounted Cumulative Gain at rank k): measures how close the ordering of the top-k results is to the ideal ordering, using graded relevance labels and discounting gains at lower positions. Use k=5 or k=10 for typical result pages.
MRR (Mean Reciprocal Rank): average of 1/rank of the first relevant result. Good for navigational queries where there is one right answer.
Precision@k: fraction of the top-k results that are relevant.

For e-commerce search, NDCG@5 is usually preferred because several relevant products can appear at different positions: MRR only looks at the first relevant result, and Precision@k ignores the order within the top k.
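
The three offline metrics above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes each query comes with a list of graded relevance labels in the order the results were served, that the list contains every labelled result for that query (so sorting it gives the ideal ranking), and that a label above 0 counts as "relevant". It uses the linear-gain form of DCG; some teams use 2^label - 1 instead. The queries and labels at the bottom are made up.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k graded labels."""
    # 0-based position i is discounted by log2(i + 2), so rank 1 is undiscounted.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """DCG of the served order divided by DCG of the ideal (sorted) order."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(gains: Sequence[float]) -> float:
    """1 / rank of the first relevant result (label > 0), or 0 if there is none."""
    for i, g in enumerate(gains):
        if g > 0:
            return 1.0 / (i + 1)
    return 0.0


def precision_at_k(gains: Sequence[float], k: int) -> float:
    """Fraction of the top-k slots that hold a relevant result."""
    return sum(1 for g in gains[:k] if g > 0) / k


# Hypothetical labelled set: query -> labels (0 = irrelevant ... 3 = perfect),
# listed in the order the search engine returned the results.
labelled = {
    "running shoes": [3, 0, 2, 1, 0],
    "iphone 15 case": [0, 1, 0, 0, 0],
}

mean_ndcg_5 = sum(ndcg_at_k(g, 5) for g in labelled.values()) / len(labelled)
mrr = sum(reciprocal_rank(g) for g in labelled.values()) / len(labelled)
mean_p_5 = sum(precision_at_k(g, 5) for g in labelled.values()) / len(labelled)
print(f"NDCG@5={mean_ndcg_5:.3f}  MRR={mrr:.3f}  P@5={mean_p_5:.3f}")
```

Averaging per query, as done here, weights every labelled query equally; weighting by query frequency is a reasonable alternative if the labelled set is not sampled in proportion to traffic.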

Online behavioural metrics (production A/B test)

Behavioural signals from real users are noisier than labelled data, but they are more valid proxies for user value.

Metric	What it captures	Caution / note
CTR at rank 1	Did the best result attract clicks?	Can be gamed by thumbnail quality
Zero-result rate	Queries returning no results	Must segment by query type
Reformulation rate	User changed the query within 30 s	Signal of dissatisfaction
Add-to-cart rate post-search	Did the user find something to buy?	True downstream goal
Search abandonment rate	Left after search, no click	Combines relevance and UX
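
As a rough sketch of how the metrics in this table could be computed from a search log, the Python below assumes a hypothetical `SearchEvent` record with one row per search; the field names, the sample events, and the exact definitions are illustrative choices, not a standard. In particular, every rate here uses all searches as the denominator, "reformulation" means a different query in the same session within 30 s, and "abandonment" means the last search of a session had no click.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SearchEvent:  # hypothetical log schema, one row per search
    session_id: str
    ts: float                      # seconds since session start
    query: str
    n_results: int
    clicked_ranks: list = field(default_factory=list)  # 1-based ranks clicked
    added_to_cart: bool = False    # add-to-cart attributed to this search


def online_metrics(events, reformulation_window_s=30.0):
    """Return the five table metrics as fractions of all searches."""
    if not events:
        raise ValueError("no search events")
    by_session = defaultdict(list)
    for e in events:
        by_session[e.session_id].append(e)

    zero = rank1 = cart = reformulated = abandoned = 0
    for session in by_session.values():
        session.sort(key=lambda e: e.ts)
        for i, e in enumerate(session):
            nxt = session[i + 1] if i + 1 < len(session) else None
            zero += int(e.n_results == 0)
            rank1 += int(1 in e.clicked_ranks)
            cart += int(e.added_to_cart)
            # Reformulation: a different query follows within the window.
            if nxt and nxt.query != e.query and nxt.ts - e.ts <= reformulation_window_s:
                reformulated += 1
            # Abandonment: session ends on this search without any click.
            if nxt is None and not e.clicked_ranks:
                abandoned += 1

    n = len(events)
    return {
        "zero_result_rate": zero / n,
        "ctr_at_rank_1": rank1 / n,
        "add_to_cart_rate": cart / n,
        "reformulation_rate": reformulated / n,
        "abandonment_rate": abandoned / n,
    }


# Made-up events for illustration only.
events = [
    SearchEvent("s1", 0, "nike shoes", 40, [1], True),
    SearchEvent("s2", 0, "blu tooth speaker", 0),
    SearchEvent("s2", 12, "bluetooth speaker", 25, [3]),
    SearchEvent("s3", 0, "desk lamp", 18),
]
metrics = online_metrics(events)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The table's caution about segmenting zero-result rate applies here too: in practice one would compute these per query type rather than over all traffic at once.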

Connecting them

Use post-search add-to-cart rate as the primary (north-star) metric. NDCG@5 is its offline proxy, used to screen candidate models before they reach an A/B test. Guardrails: zero-result rate and reformulation rate must not increase.

Worked example. A new retrieval model improves NDCG@5 from 0.68 to 0.74 on the labelled set. In an A/B test on 10% of traffic for 2 weeks, add-to-cart rate changes by +3.1% (p < 0.01), reformulation rate by -1.4%, and zero-result rate is unchanged. The primary metric improves significantly and both guardrails hold, so ship.
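
One way the ship decision could be checked is sketched below: a pooled two-proportion z-test on the primary metric, plus a simple "guardrails did not increase" rule. The function names and the conversion counts are hypothetical (the counts are not the data behind the worked example), and the guardrail check compares raw deltas only; a stricter analysis would test each guardrail for non-inferiority.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-sided z-test for the difference between two rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def ship_decision(primary_z, primary_p, guardrail_deltas, alpha=0.05):
    """Ship only if the primary metric is significantly up and no guardrail rose."""
    primary_ok = primary_z > 0 and primary_p < alpha
    guardrails_ok = all(delta <= 0 for delta in guardrail_deltas.values())
    return primary_ok and guardrails_ok


# Made-up counts: searches and post-search add-to-carts per arm.
z, p = two_proportion_z_test(conv_a=24_000, n_a=200_000, conv_b=24_744, n_b=200_000)

# Guardrail deltas (treatment minus control); only the sign matters for this rule.
guardrails = {"zero_result_rate": 0.0, "reformulation_rate": -0.014}

print(f"z={z:.2f}, p={p:.4f}, ship={ship_decision(z, p, guardrails)}")
```

Keeping the guardrail rule separate from the significance test makes the trade-off explicit: a statistically significant lift in add-to-cart rate would still be blocked if zero-result rate or reformulation rate went up.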
