datarekha

Relational Algebra I

Filter rows, pick columns, mix tables with set ops — the small toolkit that builds every SQL query underneath. The single most-tested DBMS sub-topic in GATE DA.

8 min read Intermediate GATE DA Lesson 65 of 122

What you'll learn

  • Selection σ keeps rows that match a predicate; projection π keeps columns and drops duplicates
  • Cross product × pairs every row with every row; union, intersection, difference need union-compatible relations
  • Rename ρ renames a relation or its attributes — the only operator that does no real work on data
  • The A − B = ∅ idiom: how relational algebra encodes a 'for-all' subset check

Before you start

You want every student whose marks are above 80, then only their names — no IDs, no roll numbers, no clutter. SQL spells it `SELECT name FROM Student WHERE marks

80`. Underneath, the database engine is doing something simpler: it runs two tiny operators back-to-back. That underneath layer is relational algebra — six little symbols that combine to build every query you write. (It’s not a museum piece either: every query planner, from Postgres to Spark SQL, rewrites your SQL into exactly these operators before deciding how to run it.)

Each operator takes one or two relations (think tables) and returns another relation. So you can chain them freely, like Lego.

The six basic operators

We’ll meet them with one table: Student(id, name, marks) with rows (1, Asha, 85), (2, Ravi, 70), (3, Maya, 92).

Selection σ picks rows that satisfy a predicate (a true/false test on a row, like marks ≥ 80). Projection π picks columns. Two operators, two halves of every basic SQL query.

σ — selection (rows)σ_marks ≥ 80 (Student)idnamemarks1Asha852Ravi703Maya92Asha and Maya kept; Ravi dropped.π — projection (columns)π_name (Student)nameAshaRaviMayaOther columns gone. Duplicates would be removed.
σ filters along the row axis; π filters along the column axis.
  • σ_predicate(R) — keep the rows of R where the predicate is true. Predicate can use =, comparisons, AND, OR, NOT on attributes.
  • π_attrs(R) — keep only the listed columns. Set semantics: if dropping columns creates duplicate rows, they collapse into one.
  • R × S — Cartesian product. Every row of R paired with every row of S. If R has m rows and S has n, the result has m·n rows.
  • R ∪ S, R ∩ S, R − S — set operations. R and S must be union-compatible: same number of columns, same domains in the same order.
  • ρ_S(R) or ρ_S(a,b,c)(R) — rename the relation (and optionally its attributes). Useful before joining a table to itself.

That’s it — six operators. SQL is mostly sugar over them.

The subset idiom — encoding “for-all”

Plain SQL doesn’t have a tidy way to say “every row of A appears in B.” In relational algebra, set difference makes it tidy:

A − B = ∅   ⇔   A ⊆ B   ⇔   every tuple of A is also in B

If A minus B is empty, nothing in A was missing from B — so A is a subset. That tiny trick is how RA encodes “for-all” questions before you reach for division.

How GATE asks this

A favourite MCQ pattern: give you four candidate expressions and ask which one checks a “for-all” condition. The answer hinges on whether you can read the A − B = ∅ shape. GATE DA 2024 Q16 did exactly that — see the worked example.

Worked example — GATE DA 2024 Q16

Relations are Team(name), Defender(name), Forward(name). Which expression evaluates to true if and only if every player in Team is BOTH a defender AND a forward?

The condition is “every name in Team is in π_name(Defender) ∩ π_name(Forward)” — a subset check. Apply the idiom:

π_name(Team) − ( π_name(Defender) ∩ π_name(Forward) )  =  ∅

Reading it left to right: take all the names in Team; subtract those who are in both Defender and Forward; if nothing is left over, every Team member is both — which was the question. That’s the 2024 answer (option C).

Watch the direction. The expression ( π_name(Defender) ∩ π_name(Forward) ) − π_name(Team) = ∅ says the OPPOSITE — every dual-role player is in Team — a different claim.

A second sanity rule: π drops duplicates (set semantics). So π_name(σ_marks ≥ 80(Student)) on a table where two students happen to share a name returns one row, not two. SQL is bag-semantics by default; pure RA is set-semantics. GATE expects set-semantics unless told otherwise.

Quick check

Quick check

0/6
Q1Relation R has 5 rows, S has 4 rows. How many rows does R × S have?numerical answer — type a number
Q2Student(id, name, marks) has 6 rows. Two students are named 'Asha' and two are named 'Ravi'. How many rows does π_name(Student) return under pure relational algebra?numerical answer — type a number
Q3Relations A(x) and B(x) are union-compatible. Which expression evaluates to ∅ exactly when every tuple of A is also in B?
Q4Which statements about the basic operators are TRUE? (select all that apply)select all that apply
Q5Given Team(name) = {Asha, Ravi, Maya} and Defender(name) = {Asha, Ravi, Tara}, what does π_name(Team) − π_name(Defender) return?
Q6Which expressions return 'names of students with marks ≥ 80' from Student(id, name, marks)? (select all that apply)select all that apply

Practice this in an interview

All questions
How do common SQL operations map to pandas, and when should you use SQL instead of pandas?

Every core SQL clause — SELECT, WHERE, GROUP BY, HAVING, JOIN, ORDER BY, LIMIT — has a direct pandas equivalent, but SQL executes inside a database engine with optimized query planning and disk-backed storage, while pandas requires all data to fit in RAM. Use SQL for large persistent datasets and pandas for in-memory transformation, feature engineering, and integration with the Python ML ecosystem.

What is the difference between OLTP and OLAP workloads, and how does that drive database design choices?

OLTP systems handle many small, latency-sensitive transactions that read and write a few rows at a time, so they are optimized for fast point lookups and row-level locking. OLAP systems run infrequent but wide analytical queries over millions of rows, so they benefit from columnar storage, bulk scans, and denormalized schemas that minimize joins.

Can you GROUP BY a derived expression or a SELECT alias, and how does this differ across databases?

You can always GROUP BY a derived expression written directly. Whether you can reference a SELECT alias in GROUP BY depends on the database: MySQL and BigQuery allow it, while PostgreSQL and SQL Server do not because aliases are not resolved until after GROUP BY in the logical order.

Given a query that filters on both a raw column and an aggregate result, how do you structure it for correctness and performance?

Raw-column filters belong in WHERE so the engine scans fewer rows before grouping. Aggregate filters must go in HAVING. Applying a filter in HAVING that could have been in WHERE forces the engine to aggregate more rows than necessary.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content