Relational Algebra I
Filter rows, pick columns, mix tables with set ops — the small toolkit that builds every SQL query underneath. The single most-tested DBMS sub-topic in GATE DA.
What you'll learn
- Selection σ keeps rows that match a predicate; projection π keeps columns and drops duplicates
- Cross product × pairs every row with every row; union, intersection, difference need union-compatible relations
- Rename ρ renames a relation or its attributes — the only operator that does no real work on data
- The A − B = ∅ idiom: how relational algebra encodes a 'for-all' subset check
Before you start
You want every student whose marks are above 80, then only their names — no IDs, no roll numbers, no clutter. SQL spells it `SELECT name FROM Student WHERE marks
80`. Underneath, the database engine is doing something simpler: it runs two tiny operators back-to-back. That underneath layer is relational algebra — six little symbols that combine to build every query you write. (It’s not a museum piece either: every query planner, from Postgres to Spark SQL, rewrites your SQL into exactly these operators before deciding how to run it.)
Each operator takes one or two relations (think tables) and returns another relation. So you can chain them freely, like Lego.
The six basic operators
We’ll meet them with one table: Student(id, name, marks) with rows
(1, Asha, 85), (2, Ravi, 70), (3, Maya, 92).
Selection σ picks rows that satisfy a predicate (a true/false test on a row,
like marks ≥ 80). Projection π picks columns. Two operators, two halves of
every basic SQL query.
- σ_predicate(R) — keep the rows of R where the predicate is true. Predicate
can use
=, comparisons, AND, OR, NOT on attributes. - π_attrs(R) — keep only the listed columns. Set semantics: if dropping columns creates duplicate rows, they collapse into one.
- R × S — Cartesian product. Every row of R paired with every row of S.
If R has
mrows and S hasn, the result hasm·nrows. - R ∪ S, R ∩ S, R − S — set operations. R and S must be union-compatible: same number of columns, same domains in the same order.
- ρ_S(R) or ρ_S(a,b,c)(R) — rename the relation (and optionally its attributes). Useful before joining a table to itself.
That’s it — six operators. SQL is mostly sugar over them.
The subset idiom — encoding “for-all”
Plain SQL doesn’t have a tidy way to say “every row of A appears in B.” In relational algebra, set difference makes it tidy:
A − B = ∅ ⇔ A ⊆ B ⇔ every tuple of A is also in B
If A minus B is empty, nothing in A was missing from B — so A is a subset. That tiny trick is how RA encodes “for-all” questions before you reach for division.
How GATE asks this
A favourite MCQ pattern: give you four candidate expressions and ask which
one checks a “for-all” condition. The answer hinges on whether you can read the
A − B = ∅ shape. GATE DA 2024 Q16 did exactly that — see the worked example.
Worked example — GATE DA 2024 Q16
Relations are
Team(name),Defender(name),Forward(name). Which expression evaluates to true if and only if every player in Team is BOTH a defender AND a forward?
The condition is “every name in Team is in π_name(Defender) ∩ π_name(Forward)”
— a subset check. Apply the idiom:
π_name(Team) − ( π_name(Defender) ∩ π_name(Forward) ) = ∅
Reading it left to right: take all the names in Team; subtract those who are in both Defender and Forward; if nothing is left over, every Team member is both — which was the question. That’s the 2024 answer (option C).
Watch the direction. The expression
( π_name(Defender) ∩ π_name(Forward) ) − π_name(Team) = ∅
says the OPPOSITE — every dual-role player is in Team — a different claim.
A second sanity rule: π drops duplicates (set semantics). So
π_name(σ_marks ≥ 80(Student)) on a table where two students happen to share a
name returns one row, not two. SQL is bag-semantics by default; pure RA is
set-semantics. GATE expects set-semantics unless told otherwise.
Quick check
Quick check
Practice this in an interview
All questionsEvery core SQL clause — SELECT, WHERE, GROUP BY, HAVING, JOIN, ORDER BY, LIMIT — has a direct pandas equivalent, but SQL executes inside a database engine with optimized query planning and disk-backed storage, while pandas requires all data to fit in RAM. Use SQL for large persistent datasets and pandas for in-memory transformation, feature engineering, and integration with the Python ML ecosystem.
OLTP systems handle many small, latency-sensitive transactions that read and write a few rows at a time, so they are optimized for fast point lookups and row-level locking. OLAP systems run infrequent but wide analytical queries over millions of rows, so they benefit from columnar storage, bulk scans, and denormalized schemas that minimize joins.
You can always GROUP BY a derived expression written directly. Whether you can reference a SELECT alias in GROUP BY depends on the database: MySQL and BigQuery allow it, while PostgreSQL and SQL Server do not because aliases are not resolved until after GROUP BY in the logical order.
Raw-column filters belong in WHERE so the engine scans fewer rows before grouping. Aggregate filters must go in HAVING. Applying a filter in HAVING that could have been in WHERE forces the engine to aggregate more rows than necessary.