datarekha

File Organization & Indexing

Heap vs sorted files, primary vs secondary indexes, hash vs B+-tree — the four choices behind every fast lookup, and the one GATE tests.

7 min read Intermediate GATE DA Lesson 73 of 122

What you'll learn

  • Heap vs sorted files; why most tables sit in heaps until you index them
  • Primary/clustering vs secondary indexes — which one reorders the file
  • Dense vs sparse: one entry per row vs one per block
  • Hash (O(1) equality only) vs B+-tree (O(log n) equality AND range)

Before you start

Imagine a phonebook with a million names in random order. Finding “Sharma” means flipping every page. An index is the alphabetical tab at the edge — a small side structure that says “Sharma starts on page 743.” Disk pages are slow to read, so anything that lets you jump straight to the right page is a massive win.

Databases keep the rows themselves in a file, and then build one or more indexes on top. Picking the right kind for the queries you actually run is what this lesson is about — and it’s the same choice GATE keeps asking.

Files first — heap or sorted

Before any index exists, your rows live in a file. Two basic layouts:

  • Heap file — rows in insertion order. Inserts are cheap (append to the end). Lookups scan the whole thing.
  • Sorted file — rows kept ordered by some column. Lookups use binary search. Inserts are painful: keeping order means shifting rows around.

Most real tables sit in heaps and lean on indexes for speed.

Primary, clustering, secondary — does the file move?

This is the bit that confuses people. Two questions to ask of any index:

  1. Does the file itself reorder around it?
  2. Is the indexed column the primary key, or some other column?
Primary / clusteringfile sorted by indexed column102540index10, 12, 1825, 27, 3340, 44, 51data file (sorted)Secondaryfile order is unrelated (heap)ABCindex (sorted by key)C, A, …B, …, A…, B, Cdata file (heap)Only one primary index per table (it controls the file order). Many secondaries are fine.
Primary/clustering reorders the file; secondary indexes point in from the side.
  • Primary index — built on the primary key, and the file is sorted by it. One per table.
  • Clustering index — same idea (file sorted by it), but the column doesn’t have to be a key. Still at most one per table.
  • Secondary index — built on any other column. The data file is left alone. You can have as many as you like.

And one more axis, mostly mentioned with primary/clustering indexes:

  • Dense index — one entry per row. Bigger, but you can find any row directly.
  • Sparse index — one entry per block (page). Smaller, but you have to scan inside the block. Only works on a sorted file (primary or clustering).

Hash vs B+-tree — the real exam question

Once you’ve chosen what to index, you choose how to store the index. Two big families:

Hash indexequality only · O(1)h₀h₁h₂h₃42, 71936, 88, 411WHERE x = 19 ✓WHERE x BETWEEN 10 AND 30 ✗ORDER BY x ✗B+-tree indexequality and range · O(log n)2510 1830 444, 711, 19222836, 4251, 88WHERE x = 19 ✓WHERE x BETWEEN 10 AND 30 ✓ORDER BY x ✓
Hash buckets give O(1) for equality but cannot scan ranges. B+-tree leaves are sorted and chained — perfect for range scans and ORDER BY.
  • Hash index. Hash the key, jump to a bucket, fetch. Expected O(1) for WHERE x = 19. But buckets are not in any order, so for WHERE x BETWEEN 10 AND 30 you would have to probe every value in that range one by one — that is no better than no index at all. Same for ORDER BY x.
  • B+-tree index. A balanced multi-way search tree whose leaves hold the keys in sorted order and are chained left-to-right. WHERE x = 19 is O(log n). For a range, you descend once to the bottom of the range and then walk the leaf chain. ORDER BY x becomes free — the leaves are already sorted. This is why almost every default database index is a B+-tree.

Worked example — GATE DA 2024 Q45

A query you have seen and will see again:

SELECT * FROM T WHERE x BETWEEN 10 AND 20 ORDER BY x;

You can build one index on column x. Hash or B+-tree?

  • A hash index on x lets you probe equality, but for BETWEEN 10 AND 20 the engine would have to either hash every value from 10 to 20 (if x is integer) or, more generally, skip the index entirely. And ORDER BY x cannot use a hash index — buckets have no order. So hash is useless here.
  • A B+-tree index on x descends to the leaf containing 10, then walks right through 11, 12, …, 20 along the sorted leaf chain. Each value’s row is read in x order, so the ORDER BY adds no work at all.

B+-tree wins. Any query mixing range or ordering with the indexed column gives the same answer. This is the design call GATE DA 2024 Q45 asked about — and it’s the design call every senior engineer makes on the job, too.

How GATE asks this

A short MCQ or MSQ. Format: “given query X, which index is best?” or “which of the following statements about [primary / secondary / dense / sparse] indexes are true?” The decision tree is short — equality only → hash is fine; range or ordering → B+-tree; reordering the file → primary or clustering; column other than the key → secondary. Walk that tree and you’ll be right.

Quick check

Quick check

0/5
Q1You can build ONE index on column `salary` of `Employees`. Queries are `SELECT * FROM Employees WHERE salary BETWEEN 50000 AND 80000 ORDER BY salary`. Which index type wins?
Q2Your only query is `SELECT * FROM Users WHERE user_id = ?` (always exact equality, never range). Which is the best fit?
Q3Which statements about indexes are TRUE? (select all that apply)select all that apply
Q4A `Books(book_id PK, author, title, year)` table holds 1 million rows in a heap. The query `SELECT title FROM Books WHERE author = 'Tagore' AND year = 1913` runs a hundred times per second. Which choice is the best fit?
Q5Which statements about primary vs secondary, and dense vs sparse, are TRUE? (select all that apply)select all that apply

Practice this in an interview

All questions
How does a B-tree index work, and when does the database choose not to use it?

A B-tree index stores key values in a balanced tree of sorted nodes, allowing the engine to reach any value in O(log n) page reads instead of scanning every row. The optimizer skips the index when the estimated cost of random I/O exceeds a full-table scan, when a function wraps the indexed column, or when the query returns such a large fraction of rows that a sequential scan is cheaper.

What is a covering index and how does it eliminate heap fetches?

A covering index includes every column a query needs — both filter and select columns — so the engine can answer the query entirely from the index pages without touching the main table heap. This removes the costliest part of an index scan: the random I/O for each individual row fetch.

How does columnar storage work, and how does partitioning improve query performance in a data warehouse?

Columnar storage colocates values from the same column on disk, so aggregation queries read only the columns they need rather than full rows — dramatically reducing I/O on wide tables. Partitioning physically separates data into subdirectories (e.g., by date), allowing the query engine to skip entire partitions whose predicate cannot match, cutting scan volume from the full table to just the relevant slice.

What makes a predicate sargable, and what are the most common ways to accidentally make a predicate non-sargable?

A sargable predicate (Search ARGument ABLE) is one the engine can evaluate using an index seek — a direct traversal to the matching key range. Predicates become non-sargable when a function or implicit cast is applied to the indexed column, forcing the engine to compute a derived value for every row before comparing.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content