datarekha
FAQ 91 questions · 19 subjects

Frequently asked questions

Quick, accurate answers to the things learners ask most — from “do I need math for Python?” to “when does Spark beat Pandas?”. Each subject links through to the full lessons.

Python

Do I need to know math to start learning Python?

No. Python's core syntax — variables, loops, functions, lists, and dictionaries — needs nothing beyond basic arithmetic. Math only matters later for specific domains like data science, and even then the Python itself stays simple. Start with the syntax and add math when a project demands it.

What's the difference between a list and a tuple in Python?

A list is mutable (you can add, remove, or change items) and uses square brackets; a tuple is immutable (fixed once created) and uses parentheses. Use a list when the collection will change, and a tuple for fixed records or as dictionary keys, which must be hashable (a tuple qualifies as long as every item inside it is hashable).
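As a minimal illustration of the mutability difference (the variable names here are made up for the example), the sketch below appends to a list, shows that assigning into a tuple raises `TypeError`, and uses tuples as dictionary keys:

```python
# A list can change in place.
scores = [90, 85]
scores.append(70)

# A tuple cannot: item assignment raises TypeError.
point = (3, 4)
try:
    point[0] = 10
except TypeError as err:
    tuple_error = str(err)

# A tuple of hashable items works as a dictionary key; a list does not.
distances = {(0, 0): 0.0, (3, 4): 5.0}
try:
    bad = {[3, 4]: 5.0}
except TypeError as err:
    list_key_error = str(err)
```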

Why is my Python code slow — is the GIL to blame?

For CPU-bound work, the Global Interpreter Lock (GIL) stops threads from running Python bytecode in true parallel, so threading won't help; use multiprocessing or vectorised libraries like NumPy. For I/O-bound work the GIL is released during waits, so threads or asyncio do help. Most 'slow Python' comes from unvectorised loops or an inefficient algorithm, not from the GIL.

When should I use a list comprehension instead of a for loop?

Use a comprehension when you're building a new list by transforming or filtering an iterable — it's more concise and usually faster. Stick with a regular for loop when the body has side effects, multiple statements, or complex logic, where a comprehension would hurt readability.
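A small side-by-side sketch of the same filter-and-transform task, written first as a loop and then as a comprehension (the data is invented for the example):

```python
numbers = range(10)

# Loop version: explicit append inside a condition.
squares_loop = []
for n in numbers:
    if n % 2 == 0:
        squares_loop.append(n * n)

# Comprehension version: same transform and filter in one expression.
squares_comp = [n * n for n in numbers if n % 2 == 0]
```

Both produce the same list; the comprehension stays readable here because the body is a single expression with no side effects.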

What's the difference between == and is in Python?

`==` checks whether two values are equal; `is` checks whether two names point to the exact same object in memory. Use `==` for value comparison (the common case) and reserve `is` for identity checks like `x is None`.
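A quick sketch of the distinction, using two lists with equal contents and one alias (names chosen only for illustration):

```python
a = [1, 2, 3]
b = [1, 2, 3]   # equal contents, but a separate object
c = a           # a second name for the same object

equal_values = a == b   # True: the contents match
same_object = a is b    # False: two distinct lists
alias = a is c          # True: both names refer to one list

def describe(x=None):
    # Identity check against the None singleton.
    return "missing" if x is None else "present"
```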

NumPy

Why use NumPy instead of plain Python lists?

NumPy stores data in contiguous, typed arrays and runs operations in optimised C, so element-wise math on large arrays is often 10–100× faster than Python loops and uses far less memory. It's the foundation Pandas, scikit-learn, and PyTorch are built on.

What is broadcasting in NumPy?

Broadcasting is how NumPy applies an operation between arrays of different shapes without copying data — it virtually stretches the smaller array to match the larger one. For example, subtracting a 1D row of column means from a 2D matrix works per-column without a loop, as long as the trailing dimensions are compatible.
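The column-means case can be sketched as follows, assuming NumPy is installed; the matrix values are arbitrary:

```python
import numpy as np

matrix = np.array([[1.0, 10.0],
                   [3.0, 20.0],
                   [5.0, 30.0]])   # shape (3, 2)

col_means = matrix.mean(axis=0)    # shape (2,): one mean per column
centered = matrix - col_means      # (3, 2) minus (2,) broadcasts across rows
```

The shapes are compared from the trailing dimension backward: 2 matches 2, so the 1D array is reused for every row.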

What does the axis argument mean in NumPy?

`axis` names the dimension the operation collapses along. For a 2D array, `axis=0` reduces down the rows (one result per column) and `axis=1` reduces across the columns (one result per row). The common mistake is reading 'axis=0' as 'rows' when it actually aggregates over them.
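A tiny sketch on a 2×3 array (assuming NumPy is available) makes the collapse direction concrete:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])       # 2 rows, 3 columns

down_rows = grid.sum(axis=0)       # collapses the row dimension: one value per column
across_cols = grid.sum(axis=1)     # collapses the column dimension: one value per row
```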

What's the difference between a NumPy view and a copy?

Basic slicing returns a view — a window into the same memory — so changing the slice changes the original array. Fancy indexing (with a list or boolean mask) returns a copy. Call `.copy()` when you need an independent array to avoid surprising in-place mutations.
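The sketch below (assuming NumPy is available) contrasts a basic slice with fancy indexing and uses `np.shares_memory` to confirm which one aliases the original:

```python
import numpy as np

original = np.arange(5)        # [0, 1, 2, 3, 4]

view = original[1:4]           # basic slice: shares memory with original
view[0] = 99                   # also changes original[1]

picked = original[[0, 2]]      # fancy indexing: a new, independent array
picked[0] = -1                 # original is not affected

shares = np.shares_memory(original, view)
independent = not np.shares_memory(original, picked)
```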

Should I ever use a Python loop over a NumPy array?

Rarely. Prefer vectorised operations, broadcasting, and built-in functions, which run in C and are far faster. Reach for a loop only when the logic genuinely can't be vectorised, and even then consider tools like Numba.

Pandas

What's the difference between loc and iloc in Pandas?

`loc` selects by label (index and column names); `iloc` selects by integer position. Use `loc` when you know the row index or column name, and `iloc` when you want the Nth row or column regardless of its label.
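A short sketch, assuming pandas is installed, with a deliberately unsorted string index so that label and position disagree:

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 20, 30]}, index=["c", "a", "b"])

by_label = df.loc["a", "price"]   # the row labelled "a"
by_position = df.iloc[0, 0]       # the first row and first column, whatever the label
```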

How do I avoid the SettingWithCopyWarning?

That warning means you may be assigning to a copy of a slice, so the change might not stick. Do the selection and assignment in a single `.loc` call — e.g. `df.loc[df.x > 0, 'y'] = 1` — or take an explicit `.copy()` first if you intend to work on a separate frame.

When should I use apply versus a vectorised operation?

Prefer vectorised operations and built-in methods — they run in optimised C and are far faster than `apply`, which loops in Python. Use `apply` only for genuinely custom row/column logic that can't be expressed with vectorised functions.

What's the difference between merge, join, and concat?

`merge` combines frames by matching key columns, like a SQL join; `join` does the same but matches on the index by default. `concat` stacks frames along an axis (rows on top of each other or columns side by side) without matching keys. Use merge for relational joins and concat for appending or aligning by index.

Why did my groupby explode the number of rows?

The groupby itself is rarely the cause; the usual culprit is a merge done before aggregating. A one-to-many join multiplies rows, and a later sum double-counts. Aggregate to the right grain first, or confirm your join keys are unique, before combining tables.
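A minimal reproduction of the double-counting problem, assuming pandas is installed; the `orders` and `items` tables are invented for the example. The `validate` argument of `merge` is one way to make pandas check the key relationship for you:

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2], "amount": [100, 50]})
items = pd.DataFrame({"order_id": [1, 1, 1, 2], "sku": ["a", "b", "c", "d"]})

# One-to-many join: each order row repeats once per item.
joined = orders.merge(items, on="order_id")
inflated_total = joined["amount"].sum()   # order 1 is counted three times
true_total = orders["amount"].sum()

# validate= raises MergeError if the keys do not match the stated relationship.
checked = orders.merge(items, on="order_id", validate="one_to_many")
```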

Storytelling with Visualisation

When should I use Matplotlib versus Seaborn?

Seaborn is built on Matplotlib and is faster for common statistical charts (distributions, categories, correlations) with attractive defaults. Drop to Matplotlib when you need fine-grained control over a custom figure. In practice you plot with Seaborn and tweak with Matplotlib.

What's the difference between a figure and axes in Matplotlib?

The figure is the whole canvas; axes are the individual plot areas inside it, each with its own x/y coordinates. One figure can hold many axes (subplots). Understanding this split is the key to building multi-panel charts cleanly.

Which chart should I use for my data?

Match the chart to the question: line for trends over time, bar for comparing categories, scatter for relationships between two numeric variables, histogram for one variable's distribution, and box or violin for comparing distributions across groups. Avoid pie charts beyond a few categories.

How do I make charts that work in both light and dark mode?

Avoid hard-coded colors; use a palette tied to your theme and keep sufficient contrast against either background. Test the figure on both backgrounds, and choose colors that stay distinguishable for color-blind readers.

Why does my plot look cramped or get cut off when saved?

Labels often overflow the default bounding box. Call `plt.tight_layout()` (or save with `bbox_inches='tight'`) to fit everything, and set an explicit figure size for the medium you're targeting.

Business Analytics

What's the difference between revenue, profit, and margin?

Revenue is total money from sales; profit is what's left after costs; margin is profit as a percentage of revenue. A business can have high revenue and still lose money if costs exceed it — which is why margin, not revenue, tells you whether the model actually works.

What are CAC and LTV, and why do they matter together?

CAC (Customer Acquisition Cost) is what you spend to win a customer; LTV (Lifetime Value) is the total profit that customer brings. The business is healthy when LTV comfortably exceeds CAC — a common benchmark is roughly 3:1 — otherwise you lose money on every customer you acquire.

What is a break-even analysis?

Break-even is the point where total revenue equals total cost, so profit is zero. It tells you how many units you must sell, or what price you must charge, to cover fixed and variable costs before you start making money — essential for pricing and go/no-go decisions.
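As a worked sketch of the unit calculation, with made-up figures and a hypothetical helper name: break-even units equal fixed costs divided by the contribution per unit (price minus variable cost).

```python
import math

def break_even_units(fixed_costs, price, variable_cost):
    """Units needed so revenue covers fixed plus variable costs."""
    contribution = price - variable_cost   # what each unit contributes toward fixed costs
    if contribution <= 0:
        raise ValueError("price must exceed variable cost per unit")
    return math.ceil(fixed_costs / contribution)

units = break_even_units(fixed_costs=10_000, price=50, variable_cost=30)
```

With these numbers each unit contributes 20, so 500 units cover the 10,000 of fixed costs.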

What is RFM segmentation?

RFM scores customers on Recency (how recently they bought), Frequency (how often), and Monetary value (how much) to group them into actionable segments like loyal, at-risk, or new. It's a simple, powerful way to target retention and marketing without a complex model.

Why can the average customer be misleading?

Averages hide skew. If a few large customers dominate, the mean misrepresents the typical one, and decisions based on it misfire. Look at the median and the full distribution — segments and percentiles usually tell a truer story than a single average.
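A small illustration with invented spend figures, where one large customer drags the mean far from the typical value:

```python
import statistics

monthly_spend = [20, 25, 30, 35, 40, 5000]   # one very large customer

mean_spend = statistics.mean(monthly_spend)      # pulled up by the outlier
median_spend = statistics.median(monthly_spend)  # close to the typical customer
```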

SQL

What's the difference between WHERE and HAVING?

`WHERE` filters individual rows before grouping; `HAVING` filters groups after `GROUP BY` has aggregated them. Use `WHERE` for row conditions and `HAVING` for conditions on aggregates like `COUNT(*) > 5`.
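The sketch below runs one query through Python's built-in `sqlite3` module so both filters appear together; the `orders` table and its rows are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("ana", 50, "paid"), ("ana", 70, "paid"), ("ana", 20, "refunded"),
    ("ben", 30, "paid"), ("cy", 90, "paid"), ("cy", 40, "paid"),
])

rows = conn.execute("""
    SELECT customer, COUNT(*) AS n, SUM(amount) AS total
    FROM orders
    WHERE status = 'paid'      -- row filter, applied before grouping
    GROUP BY customer
    HAVING COUNT(*) >= 2       -- group filter, applied after aggregation
    ORDER BY customer
""").fetchall()
```

The refunded row is dropped by `WHERE` before counting, and `ben` is dropped by `HAVING` because only one paid order remains in that group.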

When should I use a window function instead of GROUP BY?

Use a window function when you need an aggregate alongside the original rows — a running total, a rank within each group, or each row's share of its group's total — without collapsing the rows. `GROUP BY` is for when you only want one summary row per group.

What's the difference between an INNER JOIN and a LEFT JOIN?

An INNER JOIN keeps only rows that match in both tables; a LEFT JOIN keeps every row from the left table and fills NULLs where the right has no match. Use LEFT JOIN when you want to keep all records from your primary table even when related data is missing.

Why is my SQL query slow?

Usually it's missing indexes on join or filter columns, functions wrapped around indexed columns (which disable the index), or returning far more rows than needed. Read the query plan with EXPLAIN, index the columns you filter and join on, and avoid `SELECT *` on wide tables.

What is a CTE and when should I use one?

A CTE (Common Table Expression, the `WITH` clause) names a subquery so you can reference it like a temporary table. Use it to break a complex query into readable steps, or recursively to walk hierarchies like org charts. It mainly improves readability.

PySpark

When do I actually need Spark instead of Pandas?

Reach for Spark when your data is too large for one machine's memory or you need a cluster to process it in parallel — typically tens of gigabytes and up. For data that fits comfortably in RAM, Pandas (or Polars) is simpler and faster; Spark's distributed overhead only pays off at scale.

What does lazy evaluation mean in Spark?

Spark doesn't run transformations as you write them — it builds a plan and only executes when an action (like `count`, `collect`, or `write`) is called. This lets its Catalyst optimiser reorder and combine steps, but it also means errors can surface only at the action.

What's the difference between a transformation and an action in Spark?

Transformations (`select`, `filter`, `join`, `groupBy`) describe what to compute and return a new DataFrame lazily; actions (`count`, `show`, `collect`, `write`) trigger execution and return results. Nothing runs until an action is called.

Why is my Spark job slow or running out of memory?

The usual cause is data skew or wide shuffles — joins and groupBy operations that move data across the cluster. Check for skewed keys, avoid `collect()` on large data, cache reused DataFrames, and let Adaptive Query Execution (AQE) tune partitions.

What's the difference between the Spark driver and executors?

The driver runs your program and builds the execution plan; executors are the worker processes across the cluster that run tasks on partitions of the data. Pulling too much data back to the driver (e.g. `collect()`) is a common source of out-of-memory errors.

Math for ML

How much math do I really need for machine learning?

Enough to reason, not to derive everything from scratch: linear algebra (vectors, matrices, dot products), the basics of calculus (gradients and the chain rule), and probability and statistics (distributions, expectation, Bayes). You can start applying models with less and deepen the math as you go.

Why is linear algebra so important for ML?

Data is represented as vectors and matrices, and nearly every model — from linear regression to neural networks — is built on matrix multiplication. Linear algebra is the language that makes these operations fast and lets you reason about transformations, projections, and dimensionality.

What is a gradient, intuitively?

A gradient is the vector of partial derivatives that points in the direction of steepest increase of a function. Training follows the negative gradient downhill to reduce the loss — that's gradient descent — and the gradient's size tells you how steep the slope is at the current point.
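A one-dimensional sketch of gradient descent on a toy loss, (x - 3)², whose minimum is at x = 3. The learning rate and step count are arbitrary choices for the example:

```python
def loss(x):
    return (x - 3.0) ** 2

def grad(x):
    # Derivative of the loss; positive when x is to the right of the minimum.
    return 2.0 * (x - 3.0)

x = 0.0
learning_rate = 0.1
for _ in range(100):
    x -= learning_rate * grad(x)   # step against the gradient, i.e. downhill
```

Each step shrinks the distance to the minimum by a constant factor here, so `x` ends up very close to 3.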

What's the difference between probability and statistics?

Probability reasons forward — given a known model, how likely is an outcome. Statistics reasons backward — given observed data, what model or parameters likely produced it. ML uses both: probability to define models, statistics to fit and evaluate them.

What is Bayes' theorem used for?

Bayes' theorem updates a prior belief into a posterior after seeing evidence, by weighing how likely the evidence is under each hypothesis. It underpins spam filters and medical-test interpretation, and it explains why a positive result on a rare-condition test can still mean low actual risk.
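The rare-condition case can be worked through numerically. The prevalence, sensitivity, and false-positive rate below are assumed values for illustration, not figures for any real test:

```python
prevalence = 0.01            # P(condition)
sensitivity = 0.99           # P(positive | condition)
false_positive_rate = 0.05   # P(positive | no condition)

# Total probability of a positive result across both hypotheses.
p_positive = (sensitivity * prevalence
              + false_positive_rate * (1 - prevalence))

# Bayes' theorem: P(condition | positive).
posterior = sensitivity * prevalence / p_positive
```

Under these assumptions the posterior is about 17%: most positives come from the much larger healthy group.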

Machine Learning

Do neural networks always beat traditional ML?

No. On most tabular data, gradient-boosted trees (XGBoost, LightGBM, CatBoost) match or beat neural networks with less tuning and far less data. Deep learning dominates for images, text, and audio, but for structured business data, classical models are usually the stronger and simpler choice.

What's the difference between training, validation, and test sets?

You fit the model on the training set, tune hyperparameters on the validation set, and report final performance once on the untouched test set. Keeping the test set truly held out is what makes your reported accuracy honest rather than optimistic.

Why is my model 99% accurate but useless?

Accuracy is misleading on imbalanced data — a model that always predicts the majority class can score 99% while catching none of the rare cases that matter. Use metrics suited to the problem, like precision, recall, F1, or AUC, and read the confusion matrix.
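A plain-Python sketch of the majority-class trap, using an invented dataset with 10 positives in 1,000 rows and a "model" that always predicts the negative class:

```python
y_true = [1] * 10 + [0] * 990    # 1% positive class
y_pred = [0] * 1000              # always predict the majority class

pairs = list(zip(y_true, y_pred))
true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
recall = true_pos / (true_pos + false_neg)   # share of real positives caught
```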

What is data leakage and why is it dangerous?

Data leakage is when information unavailable at prediction time sneaks into training — like scaling using the whole dataset before splitting, or a feature derived from the target. It produces great validation scores that collapse in production, making it one of the most costly silent bugs in ML.
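The scaling example can be sketched with the standard library alone. The values and the `standardise` helper are hypothetical; the point is only where the statistics come from:

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = values[:4], values[4:]   # split first

def standardise(xs, mean, std):
    return [(x - mean) / std for x in xs]

# Leaky: the mean is computed on all rows, so the test row influences it.
leaky_mean = statistics.mean(values)

# Clean: statistics come from the training rows only and are reused on test.
train_mean = statistics.mean(train)
train_std = statistics.pstdev(train)
test_scaled = standardise(test, train_mean, train_std)
```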

When should I use XGBoost versus a random forest?

Random forests are robust and need little tuning, so they make a strong baseline. Gradient boosting (XGBoost and friends) usually reaches higher accuracy but is more sensitive to hyperparameters and overfitting. Start with a random forest, then try boosting to squeeze out more performance.

Deep Learning

What does PyTorch's autograd actually do?

Autograd records the operations in your forward pass as a graph, then automatically computes gradients of the loss with respect to every parameter by applying the chain rule backward. That's what lets you train with `loss.backward()` without deriving gradients by hand.

Why do neural networks need activation functions?

Without a non-linear activation, stacking layers just composes linear functions, which collapses to a single linear model no matter how deep. Activations like ReLU add non-linearity, letting the network approximate complex, curved decision boundaries.
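The collapse can be checked numerically. This sketch assumes NumPy is available and uses random weights with no biases to keep it short:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first layer: 3 inputs -> 4 hidden units
W2 = rng.normal(size=(2, 4))   # second layer: 4 hidden units -> 2 outputs
x = rng.normal(size=(3,))

two_linear_layers = W2 @ (W1 @ x)
one_linear_layer = (W2 @ W1) @ x          # a single matrix gives the same output

# With ReLU between the layers the composition is no longer one matrix in general.
with_relu = W2 @ np.maximum(W1 @ x, 0.0)
```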

What's the difference between SGD, Adam, and AdamW?

SGD updates weights using the raw gradient (optionally with momentum). Adam adapts the step size per parameter using running averages of the gradient and of its square (the first and second moments), often training faster. AdamW decouples weight decay from the gradient update instead of folding it in as an L2 penalty, which regularises better, and is the common default for transformers.

What loss function should I use?

Use cross-entropy for classification — it heavily penalises confident wrong predictions — and mean squared error (or a robust variant like Huber) for regression. The loss must match the output layer, for example softmax outputs paired with cross-entropy.

What is a transformer and why did it change everything?

A transformer is an architecture built on self-attention, which lets every token directly weigh every other token regardless of distance, and processes a whole sequence in parallel rather than step by step like an RNN. That parallelism and long-range modeling are what made modern large language models possible.

MLOps

What is MLOps, in one sentence?

MLOps is the practice of reliably deploying, monitoring, and maintaining machine-learning models in production — applying DevOps discipline (version control, CI/CD, automation, monitoring) to the messier reality of data and models that drift over time.

Why do models that work in a notebook fail in production?

Common causes are training/serving skew (production data or preprocessing differs from training), data drift over time, environment and dependency differences, and no monitoring to catch the decline. Reproducible pipelines and Docker containers close most of these gaps.

What is model drift and how do I detect it?

Drift is when the live data (data drift) or the input–target relationship (concept drift) changes from what the model trained on, quietly degrading accuracy. Detect it by monitoring input distributions and prediction quality over time, and retrain or alert when they shift beyond a threshold.

Why use Docker for machine learning?

Docker packages your code, libraries, and system dependencies into one image that runs identically on your laptop, in CI, and in production. That eliminates 'it works on my machine' failures, which are especially common in ML because of heavy, version-sensitive dependencies.

What does MLflow do?

MLflow tracks experiments — logging parameters, metrics, and artifacts so you can compare runs — and provides a model registry to version and stage models for deployment. It's the system of record that keeps ML work reproducible instead of scattered across notebooks.

Generative AI

What actually is a large language model?

An LLM is a neural network (a transformer) trained to predict the next token in text. From that single objective, at scale, it learns grammar, facts, reasoning patterns, and style, then generates by predicting one token at a time. It doesn't look things up — it produces statistically likely continuations.

What is RAG and why use it?

RAG (Retrieval-Augmented Generation) retrieves relevant documents and feeds them into the model's context so it answers from your data instead of only its training. It reduces hallucination, lets you use private or up-to-date information, and avoids the cost of retraining the model.

What do temperature, top-k, and top-p control?

They control randomness in generation. Temperature scales how sharply the model favors high-probability tokens — low is focused and near-deterministic, high is creative and riskier. Top-k and top-p (nucleus) limit sampling to the most likely tokens. Use low temperature for factual tasks, higher for brainstorming.
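A toy sketch of temperature and top-k on three made-up logits. The helper names are hypothetical, and real inference libraries implement these steps differently; top-p would instead keep the smallest set of tokens whose cumulative probability reaches p:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise; the rest get probability 0."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
top2 = top_k_filter(softmax_with_temperature(logits), k=2)
```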

Why does an LLM hallucinate, and how do I reduce it?

Because it generates plausible continuations rather than retrieving facts, it can state confident falsehoods, especially outside its training data. Reduce it by grounding answers with RAG, asking for citations, lowering temperature, and constraining the task — but you can't eliminate it, so verify critical outputs.

What's the difference between fine-tuning and RAG?

Fine-tuning adjusts the model's weights to change its style or specialise its behavior; RAG leaves the model unchanged and supplies knowledge at query time. Use RAG for facts that change or are private, and fine-tuning for consistent format, tone, or task behavior — they're often combined.

Agentic AI

What makes an AI agent different from a chatbot?

An agent doesn't just answer — it plans and acts in a loop: it calls tools, observes the results, and decides the next step until a goal is met. A chatbot produces one reply; an agent can search, run code, query a database, and chain those steps autonomously.

What is the Model Context Protocol (MCP)?

MCP is an open standard for connecting AI models to tools and data sources through a uniform interface, so any MCP-compatible client can use any MCP server. It acts like a universal adapter that replaces bespoke, per-integration glue code for agent tooling.

What's the difference between LangChain and LangGraph?

LangChain offers building blocks and chains for composing LLM apps; LangGraph models an agent as an explicit graph of nodes and edges with shared state, which makes loops, branching, retries, and human-in-the-loop control far easier to reason about. Use LangGraph when the control flow gets complex.

Why do multi-agent systems often fail?

Common failure modes are agents losing shared context, compounding each other's errors over long chains, looping without progress, and ballooning cost and latency. Reliability usually comes from constraining each agent's scope, adding verification steps, and keeping the control flow explicit rather than fully open-ended.

What are the main agent design patterns?

The core patterns are reflection (the model critiques and revises its own output), tool use (calling external functions), planning (decomposing a goal into steps), and multi-agent collaboration (specialised agents working together). Most production agents combine a few of these.

Data Structures & Algorithms

How much DSA do I need for data science and ML?

The practical core: Big-O intuition to reason about cost, hash maps and sets for fast lookups and dedup, sorting and binary search, and a feel for when an O(n²) approach won't scale. You rarely implement exotic algorithms, but understanding complexity is what keeps data pipelines fast.

What is Big-O notation, simply?

Big-O describes how an algorithm's time or memory grows as the input grows, ignoring constants. O(n) doubles when the input doubles; O(n²) quadruples; O(log n) barely grows. It's the tool for predicting whether code will still be fast at 10× or 1000× the data.

When should I use a hash table?

Use a hash table (dict or set in Python) whenever you need fast lookups, membership tests, counting, or deduplication — it offers average O(1) access. It's the workhorse behind grouping, joins, and 'have I seen this before' checks across data work.

What's the difference between O(n log n) and O(n²)?

O(n log n) is the speed of good sorting algorithms and scales to millions of items; O(n²) compares every pair and becomes painfully slow past a few thousand. Turning a nested-loop O(n²) approach into a sort- or hash-based one is one of the highest-leverage optimisations.

What are Bloom filters and HyperLogLog used for?

They're probabilistic structures that trade a little accuracy for huge memory savings at scale. A Bloom filter answers 'have I probably seen this?' without storing every item; HyperLogLog estimates the count of distinct items in a stream using tiny memory. Both power real-world dedup and analytics.
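A deliberately simple Bloom filter sketch, not a production design: it derives positions from seeded SHA-256 digests and stores one byte per bit for clarity. The class name, sizes, and hashing scheme are all assumptions made for the example:

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)   # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(self.bits[pos] for pos in self._positions(item))

seen = BloomFilter()
seen.add("user-42")
probably_seen = "user-42" in seen
```

An added item is always reported as present; an item that was never added is usually reported as absent, with a small chance of a false positive that grows as the filter fills.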

GATE DA

What is the GATE DA exam?

GATE DA (Data Science and Artificial Intelligence) is a paper in GATE, India's graduate-entrance exam, introduced in 2024 and covering probability and statistics, linear algebra, calculus, programming and data structures, databases, machine learning, and AI. It tests conceptual understanding and problem-solving rather than rote memorisation.

What subjects does GATE DA cover?

The syllabus spans probability and statistics, linear algebra, calculus and optimization, programming and data structures, algorithms, database management, data warehousing, machine learning, and AI (search, logic, reasoning), plus the common General Aptitude section. Probability, linear algebra, and ML carry significant weight.

How should I prepare for GATE DA?

Build concepts first, then drill previous-year-style problems under timed conditions, and review with spaced retrieval so material sticks. Prioritise high-weight topics like probability, linear algebra, and ML, and practice General Aptitude, which is high-return for the time invested.

How is GATE DA different from a typical ML course?

GATE DA is exam-oriented — it rewards precise definitions, derivations, and fast, accurate problem-solving against the official syllabus, where an ML course is project-oriented. The concepts overlap heavily, but the exam demands speed and rigor on paper rather than building systems.

How accurate should my study answers be?

Always verify solutions against official answer keys and primary sources, since small differences in convention or rounding can change a multiple-choice answer. Reputable preparation material checks every previous-year answer against the official key for exactly this reason.

Git

What is the difference between Git and GitHub?

Git is the version-control tool that runs on your computer and tracks changes to your files; GitHub is a website that hosts Git repositories online for sharing and collaboration. You can use Git entirely offline, while GitHub (like GitLab or Bitbucket) adds remote backup, pull requests, and team workflows on top.

What is the difference between git merge and git rebase?

Merge combines two branches by creating a new merge commit that ties their histories together and preserves exactly what happened; rebase instead replays your commits on top of the target branch, giving a clean linear history but rewriting those commits with new IDs. Use merge to preserve true history on shared branches, and rebase only to tidy a local branch before sharing — never rebase commits others have already pulled.

How do I undo the last Git commit?

Use `git reset` to move your branch back: `git reset --soft HEAD~1` undoes the commit but keeps your changes staged, while `git reset --hard HEAD~1` discards the changes entirely. If the commit was already pushed and shared, use `git revert` instead, which adds a new commit that reverses it without rewriting history.

What does the staging area do in Git?

The staging area (also called the index) is where you assemble exactly which changes go into your next commit. You edit files in your working directory, run `git add` to stage the specific changes you want, then `git commit` to snapshot them. This two-step flow lets you craft focused, meaningful commits instead of dumping every change at once.

Command Line

What is the difference between the terminal, the shell, and bash?

The terminal is the application window that shows text, the shell is the program running inside it that reads your commands and executes them, and bash and zsh are specific shells (zsh is the macOS default). In short, the terminal is the screen, the shell is the interpreter, and bash or zsh are particular brands of that interpreter.

What is the difference between grep and find?

`grep` searches inside files for lines matching a text pattern, while `find` locates the files themselves by name, size, type, or modification time. Use `grep` to answer 'which files contain this text?' and `find` to answer 'where are the files with these properties?'. The two are often combined in one pipeline.

What is the difference between a single and double redirect in the shell?

A single greater-than sign (`>`) redirects a command's output to a file and overwrites whatever was there, while a double greater-than sign (`>>`) appends the output to the end of the file instead. Use append when adding to a log you want to keep, and overwrite only when you intend to replace the file's contents.

How do I make a shell script executable?

Add a shebang line such as `#!/usr/bin/env bash` at the top, run `chmod +x` on the file to give it execute permission, then run it with `./script.sh`. The shebang tells the system which interpreter to use, and the execute bit is what lets you run the file directly.

Time Series

Why can't I use a normal train/test split for time series?

Because time series observations are ordered and correlated with their own past, so randomly shuffling them leaks future information into training and inflates your scores. Always split by time — train on earlier data and test on later data — and validate with walk-forward backtesting rather than random k-fold cross-validation.
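One way to sketch walk-forward splits is an expanding training window followed by a fixed-size test block. The function name and parameters below are hypothetical, and libraries such as scikit-learn ship their own time-series splitters:

```python
def walk_forward_splits(n_samples, n_folds, test_size):
    """Yield (train_indices, test_indices); training always precedes the test block."""
    for fold in range(n_folds):
        test_end = n_samples - (n_folds - 1 - fold) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough samples for this many folds")
        yield list(range(test_start)), list(range(test_start, test_end))

splits = list(walk_forward_splits(n_samples=10, n_folds=3, test_size=2))
```

Each fold trains on everything before its test block, so no future observation ever reaches the model being evaluated.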

What is the difference between ARIMA and SARIMA?

ARIMA models trend and short-term structure through autoregression, differencing, and moving-average terms but assumes no repeating seasonal pattern; SARIMA adds a seasonal set of those same terms at the season's period (for example every 12 months) to capture cycles ARIMA misses. Use SARIMA when your data has a clear, fixed-length seasonal pattern.

What does it mean for a time series to be stationary?

A stationary series has statistical properties — its mean, variance, and autocorrelation — that stay constant over time, which is exactly what models like ARIMA assume. You check it with a plot plus the Augmented Dickey-Fuller test, and usually achieve it by differencing the series to remove trend and seasonality.

When should I use Prophet instead of ARIMA?

Reach for Prophet when you want strong seasonality, holiday effects, and robustness to missing data and outliers with minimal tuning, since it is designed for business forecasting by analysts. Prefer ARIMA or SARIMA when you want a well-understood statistical model and are willing to identify orders from the ACF and PACF — and check either one against a naive baseline.

Recommender Systems

What is the difference between content-based and collaborative filtering?

Content-based filtering recommends items similar to what you already liked using item features such as genre, tags, or text, while collaborative filtering uses the behavior of many users — 'people like you also liked' — without needing item features. Content-based handles brand-new items well, collaborative filtering captures taste patterns features can't, and hybrid systems combine both.

What is the cold-start problem in recommender systems?

Cold start is when you can't make good recommendations because there is no interaction history yet — for a new user, a new item, or a brand-new system. The fixes are content-based filtering and metadata for new items, onboarding preferences and popularity fallbacks for new users, and deliberate exploration to gather data.

What is matrix factorization in recommender systems?

Matrix factorization approximates the sparse user-item rating matrix as the product of two smaller matrices of latent factors, one for users and one for items, so a predicted rating is the dot product of a user vector and an item vector. These learned factors capture hidden taste dimensions and powered the Netflix Prize-winning approaches.
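The prediction step can be sketched with hand-picked factors (assuming NumPy is available). In a real system these matrices are learned from observed ratings; the numbers below are invented purely to show the dot product:

```python
import numpy as np

# 2 users x 2 latent factors (hypothetical values, not learned).
user_factors = np.array([[1.0, 0.0],
                         [0.5, 0.5]])

# 3 items x 2 latent factors (hypothetical values, not learned).
item_factors = np.array([[4.0, 1.0],
                         [1.0, 5.0],
                         [3.0, 3.0]])

# Every predicted rating is the dot product of a user row and an item row.
predicted = user_factors @ item_factors.T   # shape (2 users, 3 items)
```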

Why is RMSE a poor metric for recommender systems?

Because recommendation is a ranking problem — users only ever see the top few items — so a rating-prediction error like RMSE doesn't measure whether the right items reached the top. Use top-k ranking metrics such as precision@k, recall@k, and NDCG, and always compare against a simple popularity baseline.
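A minimal sketch of two top-k metrics for a single user, with an invented ranked list and relevance set; the function names are illustrative rather than taken from any library:

```python
def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Share of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

recommended = ["a", "b", "c", "d", "e"]   # ranked, best first
relevant = {"b", "e", "z"}                # items the user actually liked

p_at_3 = precision_at_k(recommended, relevant, 3)
r_at_3 = recall_at_k(recommended, relevant, 3)
r_at_5 = recall_at_k(recommended, relevant, 5)
```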
