Pandas & Data Wrangling | Easy | Asked at Meta, Netflix, Airbnb

What is the difference between wide and long (tidy) data formats, and why does it matter for analysis?

For: Data Analyst, Data Scientist, ML Engineer

The short answer

Wide format stores multiple measurements as separate columns per subject; long (tidy) format stores one measurement per row, with one column naming the variable and one column holding the value. Long format is what many statistical and visualization libraries expect as input, it lets you add new variables or time points as rows instead of columns, and it is the shape that groupby and merge operations work with most naturally.

How to think about it

The trap in this question is that wide format looks like the natural one: it is exactly how a spreadsheet is laid out, one row per subject and a column per measurement. The interviewer wants to hear that you know when that breaks. The moment you want to do anything analytical (group by time period, plot one line per patient, fit a model over repeated measurements), most libraries expect long (tidy) format: one measurement per row, a column naming the variable, a column holding the value. Recognizing which shape you have and how to convert is the difference between two clean lines and a script that repeats itself for every new column.

The visual difference

WIDE (one row per patient)
patient  bp_2022  bp_2023  bp_2024
Alice    120      118      122
Bob      135      130      128

LONG / TIDY (one row per measurement)
patient  year   bp
Alice    2022   120
Alice    2023   118
Alice    2024   122
Bob      2022   135
Bob      2023   130
Bob      2024   128

Adding a new year in wide format means adding a column — and every analysis that lists the year columns by hand now has to be edited. In long format it means adding rows, which never touches existing code.
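A minimal sketch of that maintenance cost, using the two patients from the tables above and treating 2024 as the newly arrived year. The names year_cols, long_df, and mean_bp_per_year are illustrative choices, not part of any library.

```python
import pandas as pd

# Wide: the analysis code names the year columns by hand.
wide = pd.DataFrame({
    "patient": ["Alice", "Bob"],
    "bp_2022": [120, 135],
    "bp_2023": [118, 130],
})
year_cols = ["bp_2022", "bp_2023"]      # has to be edited for every new year

# Adding 2024 in wide format is a new column, and year_cols is now stale.
wide["bp_2024"] = [122, 128]
stale_means = wide[year_cols].mean()    # still covers only 2022 and 2023

# Long: the same data, one row per (patient, year) measurement.
long_df = pd.DataFrame({
    "patient": ["Alice", "Alice", "Bob", "Bob"],
    "year": [2022, 2023, 2022, 2023],
    "bp": [120, 118, 135, 130],
})

def mean_bp_per_year(df):
    """Mean blood pressure per year; never mentions a specific year."""
    return df.groupby("year")["bp"].mean()

# Adding 2024 in long format is new rows; the analysis function is untouched.
new_rows = pd.DataFrame({
    "patient": ["Alice", "Bob"],
    "year": [2024, 2024],
    "bp": [122, 128],
})
long_df = pd.concat([long_df, new_rows], ignore_index=True)
long_means = mean_bp_per_year(long_df)

print(stale_means)
print(long_means)
```

The wide result silently omits 2024 until someone remembers to update year_cols, while the long result picks it up with no code change.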

melt and pivot, round-trip

Here is the conversion both ways on a small frame, with Carol deliberately missing her 2024 reading so you can see how each format carries the gap.

import pandas as pd

wide = pd.DataFrame({
    "patient": ["Alice", "Bob", "Carol"],
    "bp_2022": [120, 135, 110],
    "bp_2023": [118, 130, 115],
    "bp_2024": [122, 128, float("nan")],   # Carol has no 2024 reading yet
})

# Wide -> Long with melt
long = wide.melt(id_vars="patient", var_name="year_col", value_name="bp")
# expand=False makes str.extract return a Series instead of a one-column DataFrame
long["year"] = long["year_col"].str.extract(r"(\d{4})", expand=False).astype(int)
long = long.drop(columns="year_col").sort_values(["patient", "year"]).reset_index(drop=True)
print("Long format after melt:")
print(long)
print()

# Analysis is now one groupby, no column-by-column repetition
print("Mean BP per year:")
print(long.groupby("year")["bp"].mean().round(1))
print()

# Long -> Wide with pivot
back = long.pivot(index="patient", columns="year", values="bp")
back.columns.name = None
print("Back to wide via pivot:")
print(back)

Long format after melt:
  patient     bp  year
0   Alice  120.0  2022
1   Alice  118.0  2023
2   Alice  122.0  2024
3     Bob  135.0  2022
4     Bob  130.0  2023
5     Bob  128.0  2024
6   Carol  110.0  2022
7   Carol  115.0  2023
8   Carol    NaN  2024

Mean BP per year:
year
2022    121.7
2023    121.0
2024    125.0
Name: bp, dtype: float64

Back to wide via pivot:
          2022   2023   2024
patient                     
Alice    120.0  118.0  122.0
Bob      135.0  130.0  128.0
Carol    110.0  115.0    NaN

melt turned three year columns into nine rows with a single bp column, so the per-year mean is one groupby that works no matter how many years exist. Notice that the 2024 mean of 125.0 skips Carol's NaN rather than failing on it, because pandas aggregations ignore missing values by default. Her missing reading is now one explicit NaN in one row, not an awkward gap in a column. The pivot at the end reverses the reshape and recovers the wide layout for presentation, though not the original frame exactly: patient is now the index, the columns are the integers 2022 to 2024 rather than bp_2022 to bp_2024, and every value is a float.
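If an exact round trip matters, the column names and the patient column can be restored after the pivot. The following is a minimal sketch on the same three-patient frame; long_df and restored are illustrative names, and chaining add_prefix with reset_index is one way to do it, not the only one.

```python
import pandas as pd

wide = pd.DataFrame({
    "patient": ["Alice", "Bob", "Carol"],
    "bp_2022": [120, 135, 110],
    "bp_2023": [118, 130, 115],
    "bp_2024": [122, 128, float("nan")],
})

# Wide -> long, extracting the year from the old column names.
long_df = wide.melt(id_vars="patient", var_name="year_col", value_name="bp")
long_df["year"] = long_df["year_col"].str.extract(r"(\d{4})", expand=False).astype(int)
long_df = long_df.drop(columns="year_col")

# Long -> wide, then undo the cosmetic differences left by pivot.
restored = (
    long_df.pivot(index="patient", columns="year", values="bp")
    .add_prefix("bp_")      # 2022 -> "bp_2022", matching the original names
    .reset_index()          # move patient from the index back to a column
)
restored.columns.name = None    # drop the leftover "year" label on the columns

print(restored)
print(restored.dtypes)
```

The values and labels now match the starting frame. The one remaining difference is dtype: melt stacks all readings into a single column, so the NaN forces every bp column to float, including the two that started as integers.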

The general workflow: store and analyze in long format, then pivot to wide only at the last step, when a human reader or a model needs side-by-side columns. Wide is genuinely the right shape for a few things — correlation matrices, scikit-learn feature matrices (one column per feature), and cross-tab reports people read directly — but those are endpoints, not working formats.
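As a rough sketch of that last step, the long table can be pivoted only at the point where a wide consumer needs it, here a year-by-year correlation matrix and a plain NumPy array of the kind a feature matrix would be. Treating each year's reading as a feature is an assumption made for illustration, and with three patients the correlations themselves carry no meaning.

```python
import pandas as pd

# Working format: long, one row per (patient, year) measurement.
long_df = pd.DataFrame({
    "patient": ["Alice", "Alice", "Alice", "Bob", "Bob", "Bob",
                "Carol", "Carol", "Carol"],
    "year": [2022, 2023, 2024] * 3,
    "bp": [120, 118, 122, 135, 130, 128, 110, 115, float("nan")],
})

# Endpoint: pivot to wide only when a side-by-side layout is required.
feature_table = long_df.pivot(index="patient", columns="year", values="bp")

# Correlation matrix between the year columns (pairwise, skipping NaN).
year_corr = feature_table.corr()

# Feature-matrix style array: one row per patient, one column per year.
# Carol's missing 2024 value is still NaN here and would need handling
# before most modelling steps.
X = feature_table.to_numpy()

print(year_corr.shape)
print(X.shape)
```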
