datarekha
Case & Behavioral · Hard · Asked at Amazon, Etsy, Google, Walmart

How would you design a metric to measure the quality of a search feature inside an e-commerce app?

The short answer

Search quality has two sides: relevance (did results match intent?) and utility (did the user accomplish their goal?). A good metric system combines an offline relevance signal — such as NDCG computed against human-labelled queries — with an online behavioural signal — such as click-through rate at rank 1 and zero-result rate — tied to a downstream business outcome like add-to-cart rate.

How to think about it

Two tracks: offline and online metrics

Offline relevance metrics (model evaluation)

These use a labelled dataset of query-result pairs, with relevance scores assigned by human raters or inferred from implicit signals in historical data.

  • NDCG@k (Normalised Discounted Cumulative Gain at rank k): measures how close the ranking of the top-k results is to the ideal ordering, on a scale from 0 to 1. Use k=5 or k=10 for typical result pages.
  • MRR (Mean Reciprocal Rank): average of 1/rank of the first relevant result. Good for navigational queries where there is one right answer.
  • Precision@k: fraction of the top-k results that are relevant.
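The three offline metrics above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: each query is represented by a list of graded relevance labels in ranked order, DCG uses the linear-gain variant (the label itself, rather than 2^label - 1), the ideal ordering is built only from the labels present in that list, and a label of 1 or more counts as "relevant" for MRR and Precision@k. All function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k labels (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the same labels sorted best-first."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(relevances: Sequence[float], threshold: float = 1) -> float:
    """1/rank of the first result whose label meets the threshold, else 0."""
    for rank, rel in enumerate(relevances, start=1):
        if rel >= threshold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: Sequence[Sequence[float]]) -> float:
    """Average reciprocal rank across a non-empty set of queries."""
    return sum(reciprocal_rank(q) for q in queries) / len(queries)


def precision_at_k(relevances: Sequence[float], k: int, threshold: float = 1) -> float:
    """Fraction of the top-k positions holding a relevant result."""
    return sum(1 for rel in relevances[:k] if rel >= threshold) / k
```

In practice the ideal ordering is often computed over every labelled product for the query, not only the ones the engine returned; otherwise a ranker that retrieves nothing relevant beyond its top few results is not penalised for what it missed.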

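For e-commerce search, NDCG@5 is usually preferred: several products can be relevant to the same query, to different degrees, and NDCG rewards placing the most relevant ones highest. MRR, by contrast, only looks at the first relevant result.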

Online behavioural metrics (production A/B test)

Behavioural signals from real users are noisier but more valid as proxies for user value.

Metric                       | What it captures                     | Caution / note
-----------------------------|--------------------------------------|----------------------------------
CTR at rank 1                | Did the best result attract clicks?  | Can be gamed by thumbnail quality
Zero-result rate             | Queries returning no results         | Must segment by query type
Reformulation rate           | User changed the query within 30 s   | Signal of dissatisfaction
Add-to-cart rate post-search | Did the user find something to buy?  | True downstream goal
Search abandonment rate      | Left after search, no click          | Combines relevance and UX
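To make the definitions in the table concrete, here is a rough sketch of how the five rates could be computed from a search log. The `SearchEvent` schema and its field names are hypothetical, as are the simplifications: one row per search, at most one recorded click per search, "reformulation" meaning a different query in the same session within 30 seconds, and "abandonment" meaning the last search of a session with no click. A real pipeline would need its own event model and sessionisation rules.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

REFORMULATION_WINDOW_S = 30.0  # window taken from the table above


@dataclass
class SearchEvent:
    """One search request (hypothetical schema, for illustration only)."""
    session_id: str
    timestamp: float             # seconds since some epoch
    query: str
    num_results: int
    clicked_rank: Optional[int]  # rank of the clicked result, None if no click
    added_to_cart: bool          # an add-to-cart followed this search


def online_metrics(events: List[SearchEvent]) -> Dict[str, float]:
    """Compute per-search rates; every rate uses all searches as denominator."""
    if not events:
        raise ValueError("need at least one search event")
    ordered = sorted(events, key=lambda e: (e.session_id, e.timestamp))
    n = len(ordered)
    zero = ctr1 = reform = abandon = carts = 0
    for i, ev in enumerate(ordered):
        # Next search in the same session, if any.
        nxt = None
        if i + 1 < n and ordered[i + 1].session_id == ev.session_id:
            nxt = ordered[i + 1]
        zero += int(ev.num_results == 0)
        ctr1 += int(ev.clicked_rank == 1)
        carts += int(ev.added_to_cart)
        if (nxt is not None and nxt.query != ev.query
                and nxt.timestamp - ev.timestamp <= REFORMULATION_WINDOW_S):
            reform += 1
        if nxt is None and ev.clicked_rank is None:
            abandon += 1
    return {
        "zero_result_rate": zero / n,
        "ctr_at_rank_1": ctr1 / n,
        "reformulation_rate": reform / n,
        "abandonment_rate": abandon / n,
        "add_to_cart_rate": carts / n,
    }
```

Using all searches as the denominator keeps the rates comparable with each other; a team might reasonably prefer to compute CTR at rank 1 only over searches that returned at least one result.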

Connecting them

Use add-to-cart rate post-search as the primary north-star. NDCG@5 is the offline proxy. Guardrails: zero-result rate and reformulation rate must not increase.
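The north-star-plus-guardrails rule can be written down as an explicit decision function, which helps avoid arguing about the criteria after the results are in. The sketch below is one possible encoding: the `ExperimentResult` fields, the significance level `alpha`, and the zero tolerance on guardrails are all assumptions chosen for illustration, not thresholds stated in the text.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    """Summary of an A/B test (hypothetical fields, treatment minus control)."""
    add_to_cart_lift: float          # change in add-to-cart rate post-search
    add_to_cart_p_value: float       # p-value for that change
    zero_result_rate_delta: float    # change in zero-result rate
    reformulation_rate_delta: float  # change in reformulation rate


def should_ship(result: ExperimentResult,
                alpha: float = 0.05,
                guardrail_tolerance: float = 0.0) -> bool:
    """Ship only if the north-star improves significantly and no guardrail regresses."""
    primary_wins = (result.add_to_cart_lift > 0
                    and result.add_to_cart_p_value < alpha)
    guardrails_hold = (
        result.zero_result_rate_delta <= guardrail_tolerance
        and result.reformulation_rate_delta <= guardrail_tolerance
    )
    return primary_wins and guardrails_hold
```

A small positive `guardrail_tolerance` may be more realistic than zero, since guardrail metrics fluctuate from noise alone; the right value depends on how much variance the team observes between identical variants.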

Worked example. A new retrieval model improves NDCG@5 from 0.68 to 0.74 on the labelled set. In an A/B test on 10% of traffic for 2 weeks: add-to-cart rate +3.1% (p < 0.01), reformulation rate -1.4%, zero-result rate unchanged. Ship.
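A p-value for a change in add-to-cart rate is commonly obtained with a two-proportion z-test. The sketch below shows that test using only the standard library; it is an assumption that this is the test behind the worked example, and the function name and the counts in the usage note are hypothetical, not taken from the experiment described.

```python
import math
from typing import Tuple


def two_proportion_z_test(conv_a: int, n_a: int,
                          conv_b: int, n_b: int) -> Tuple[float, float]:
    """Two-sided z-test for a difference in conversion rate (B minus A).

    conv_* are counts of searches followed by an add-to-cart,
    n_* are total searches in each arm.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z_test(1000, 10000, 1100, 10000)` compares a 10% control rate against an 11% treatment rate on made-up counts. The test treats each search as independent, so when one user issues many searches it tends to understate the variance; randomising and analysing at the user level is the usual remedy.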
