How would you design a metric to measure the quality of a search feature inside an e-commerce app?
Search quality has two sides: relevance (did the results match the user's intent?) and utility (did the user accomplish their goal?). A good metric system combines an offline relevance signal, such as NDCG computed against human-labelled queries, with online behavioural signals, such as click-through rate at rank 1 and zero-result rate, and ties both to a downstream business outcome like add-to-cart rate.
How to think about it
Two tracks: offline and online metrics
Offline relevance metrics (model evaluation)
These use a labelled dataset of query-result pairs, with relevance scores assigned by human raters or inferred from implicit signals in historical data.
- NDCG@k (Normalised Discounted Cumulative Gain at rank k): measures how close the ordering of the top-k results is to the ideal ordering, using graded relevance labels. Use k=5 or k=10 for typical result pages.
- MRR (Mean Reciprocal Rank): the average, over queries, of 1/rank of the first relevant result. Good for navigational queries where there is one right answer.
- Precision@k: fraction of the top-k results that are relevant.
For e-commerce search, NDCG@5 is usually preferred because several products can be relevant to the same query, and NDCG rewards rankings that place the most relevant ones highest.
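The three offline metrics above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes graded labels on a 0 to 3 scale, treats any label above 0 as "relevant", uses linear gain in the DCG sum, and by default builds the ideal ordering from the returned results only (pass `all_labels` to use every labelled product for the query). The queries and labels are invented for the example.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded labels."""
    # Position i is 0-based, so rank 1 gets discount log2(2) = 1.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k, all_labels=None):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal_source = gains if all_labels is None else all_labels
    ideal = dcg_at_k(sorted(ideal_source, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

def reciprocal_rank(gains):
    """1/rank of the first relevant result, or 0 if none is relevant."""
    for rank, g in enumerate(gains, start=1):
        if g > 0:
            return 1.0 / rank
    return 0.0

def precision_at_k(gains, k):
    """Fraction of the top-k slots holding a relevant result (k must be > 0)."""
    return sum(1 for g in gains[:k] if g > 0) / k

# Hypothetical labelled set: query -> labels of the results in ranked order.
labelled = {
    "red running shoes": [3, 0, 2, 1, 0],
    "iphone 15 case": [0, 3, 0, 0, 0],
}

mean_ndcg5 = sum(ndcg_at_k(g, 5) for g in labelled.values()) / len(labelled)
mrr = sum(reciprocal_rank(g) for g in labelled.values()) / len(labelled)
mean_p5 = sum(precision_at_k(g, 5) for g in labelled.values()) / len(labelled)

print(f"NDCG@5={mean_ndcg5:.3f}  MRR={mrr:.3f}  P@5={mean_p5:.3f}")
```

Precision@5 treats the first query's three relevant products the same wherever they appear in the top five, while NDCG@5 penalises the ranking for putting a non-relevant product at rank 2.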
Online behavioural metrics (production A/B test)
Behavioural signals from real users are noisier than labelled judgements, but they are more valid proxies for user value.
| Metric | What it captures | Note or caution |
|---|---|---|
| CTR at rank 1 | Did the top-ranked result attract clicks? | Can be inflated by thumbnail quality rather than relevance |
| Zero-result rate | Queries returning no results | Must segment by query type |
| Reformulation rate | User changed the query within 30 s | Signal of dissatisfaction |
| Add-to-cart rate post-search | Did the user find something to buy? | True downstream goal |
| Search abandonment rate | Left after search, no click | Combines relevance and UX |
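As a rough sketch, the metrics in the table can be computed from a search event log. The `SearchEvent` schema and its field names are hypothetical, and the definitions are one reasonable reading of the table rather than a standard: CTR at rank 1 is taken over searches that returned results, a reformulation is a different query in the same session within 30 s, and abandonment is a final search in a session with no click.

```python
from dataclasses import dataclass, field

@dataclass
class SearchEvent:
    # Hypothetical log schema, for illustration only.
    session_id: str
    query: str
    ts: float                      # seconds, comparable within a session
    n_results: int
    clicked_ranks: list = field(default_factory=list)  # 1-based ranks
    added_to_cart: bool = False

def online_metrics(events, reformulation_window_s=30.0):
    if not events:
        raise ValueError("no search events")
    sessions = {}
    for e in sorted(events, key=lambda e: (e.session_id, e.ts)):
        sessions.setdefault(e.session_id, []).append(e)

    with_results = rank1_clicks = zero = reform = carts = abandoned = 0
    for session in sessions.values():
        for i, e in enumerate(session):
            nxt = session[i + 1] if i + 1 < len(session) else None
            if e.n_results == 0:
                zero += 1
            else:
                with_results += 1
                rank1_clicks += 1 in e.clicked_ranks
            # Reformulation: a different query soon after this one.
            if (nxt is not None and nxt.query != e.query
                    and nxt.ts - e.ts <= reformulation_window_s):
                reform += 1
            # Abandonment: last search of the session, nothing clicked.
            if nxt is None and not e.clicked_ranks:
                abandoned += 1
            carts += e.added_to_cart

    n = len(events)
    return {
        "ctr_at_1": rank1_clicks / with_results if with_results else 0.0,
        "zero_result_rate": zero / n,
        "reformulation_rate": reform / n,
        "add_to_cart_rate": carts / n,
        "abandonment_rate": abandoned / n,
    }

# Invented events: a misspelt query that is corrected, a click at rank 3,
# and a search that ends the session without a click.
events = [
    SearchEvent("s1", "runing shoes", 0.0, 0),
    SearchEvent("s1", "running shoes", 12.0, 40, [1], True),
    SearchEvent("s2", "usb c cable", 0.0, 25, [3]),
    SearchEvent("s3", "garden hose", 0.0, 18),
]
metrics = online_metrics(events)
print(metrics)
```

In an A/B test the same function would be run separately on control and treatment events, ideally segmented by query type as the table suggests for zero-result rate.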
Connecting them
Use post-search add-to-cart rate as the primary north-star metric, with NDCG@5 as its offline proxy. Guardrails: zero-result rate and reformulation rate must not increase.
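One way to make this rule explicit is a small decision function. The sketch below is an assumption about how such a gate might look: the function name, the `alpha` threshold of 0.05 and the zero `tolerance` on guardrails are illustrative choices, and the input numbers are invented.

```python
def ship_decision(primary_lift, primary_p_value, guardrail_deltas,
                  alpha=0.05, tolerance=0.0):
    """Gate a launch on the primary metric and the guardrails.

    primary_lift     -- change in post-search add-to-cart rate (treatment
                        minus control); positive is good
    primary_p_value  -- p-value of that change from the A/B test
    guardrail_deltas -- {metric name: change}; positive means the guardrail
                        metric increased, which counts as a regression
    """
    breached = sorted(name for name, delta in guardrail_deltas.items()
                      if delta > tolerance)
    if breached:
        return "hold: guardrail regression in " + ", ".join(breached)
    if primary_lift > 0 and primary_p_value < alpha:
        return "ship"
    return "hold: no significant lift in the primary metric"

# Illustrative inputs only.
print(ship_decision(0.004, 0.003,
                    {"zero_result_rate": 0.0, "reformulation_rate": -0.002}))
print(ship_decision(0.004, 0.003,
                    {"zero_result_rate": 0.001, "reformulation_rate": -0.002}))
```

Checking guardrails before the primary metric means a change that lifts add-to-cart rate but raises the zero-result rate is held rather than shipped. A stricter version would test each guardrail for statistical non-inferiority instead of comparing raw deltas.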
Worked example. A new retrieval model improves NDCG@5 from 0.68 to 0.74 on the labelled set. In an A/B test on 10% of traffic for 2 weeks, add-to-cart rate rises by 3.1% (p < 0.01), reformulation rate falls by 1.4%, and zero-result rate is unchanged. Decision: ship, because the primary metric improved significantly and neither guardrail regressed.
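A p-value like the one in the worked example could come from a two-proportion z-test on add-to-cart conversions. The sketch below shows that test using only the standard library. The counts are hypothetical, since the example does not state sample sizes; they are chosen to give a 3.1% relative lift, and reading the 3.1% as relative rather than absolute is itself an assumption.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    Returns (z, p_value, relative_lift), with B measured against A.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0 or p_a == 0:
        raise ValueError("rates are degenerate; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value, (p_b - p_a) / p_a

# Hypothetical counts: searches and searches followed by an add-to-cart.
control_searches, control_carts = 200_000, 24_000      # 12.000%
treatment_searches, treatment_carts = 200_000, 24_744  # 12.372%

z, p_value, lift = two_proportion_z_test(
    control_carts, control_searches, treatment_carts, treatment_searches
)
print(f"relative lift={lift:.1%}  z={z:.2f}  p={p_value:.4f}")
```

This treats every search as an independent trial. If one user issues many searches, a real analysis would aggregate per user or use a variance estimate that accounts for that clustering.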