Data Engineering Interview Questions: pipeline judgment patterns

StrongYes tip

Most data engineering interview questions are not asking whether you memorized Spark APIs or every warehouse buzzword. They are asking whether you can move data from source to serving without losing correctness, observability, or operational sanity.

Data engineering interviews sit in the gap between SQL, databases, and system design. Across data engineering loops at Databricks, Snowflake, and Scale AI, the research shows a consistent pattern: candidates who prepared only with LeetCode get blindsided by the pipeline design round. StrataScratch's data engineer guide confirms it: "SQL is bread and butter for any Data Engineer," but the interview also tests architecture, ETL judgment, and system trade-offs that pure coding prep never touches.

As of April 9, 2026, LeetCode still treats Pandas as a dedicated prep lane, with an Introduction to Pandas study plan that explicitly drills data cleaning and column operations before 30 Days of Pandas. Public data-engineering prep guides still cluster around the same core areas: SQL, Python, database design, ETL, and end-to-end data-platform reasoning. The StrongYes company corpus shows the same pattern at the company level: Databricks, Scale AI, and Snowflake all push candidates toward data pipelines, query engines, storage trade-offs, or distributed data systems.

The good news is that these interviews are more repetitive than they look.

What data-engineering interviewers are actually testing

A strong answer usually proves five things:

You can model the data at the right grain.
You can explain how raw data gets extracted, transformed, and loaded reliably.
You understand the difference between batch, streaming, and serving needs.
You know how to protect quality with validation, idempotency, and observability.
You can make performance and cost trade-offs without breaking correctness.

That is the real shape of most data engineering loops. They are less about one tool and more about system judgment across the pipeline.

The five data-engineering question families that keep repeating

1. SQL and data-modeling questions

Most data-engineering interviews still start here.

Expect prompts like:

write the query for daily active users
dedupe events to the latest row per customer
model orders, payments, and refunds
explain how you would index or partition a hot table

The key skill is connecting the data model to the read pattern. If you jump straight into syntax, you usually miss the grain and produce the wrong answer faster.

Use the same sequence every time:

define the entity and grain
name the important query path
write the query or schema
explain correctness risks like duplicates, late data, and nulls
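As a rough illustration of that sequence, here is a minimal sketch of two of the prompts above (daily active users and latest row per key), run through Python's built-in sqlite3 module. The `events` table, its columns, and the sample rows are all hypothetical, and the window function assumes a SQLite build of 3.25 or newer. The grain is one row per event, and the sample data deliberately includes a duplicate delivery and a null `user_id`.

```python
import sqlite3

# Hypothetical schema: one row per event, so the grain is event_id.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_id TEXT, user_id TEXT, event_time TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("e1", "u1", "2026-01-01T09:00:00"),
        ("e2", "u1", "2026-01-01T17:30:00"),
        ("e3", "u2", "2026-01-01T11:00:00"),
        ("e4", "u1", "2026-01-02T08:00:00"),
        ("e4", "u1", "2026-01-02T08:00:00"),  # duplicate delivery
        ("e5", None, "2026-01-02T10:00:00"),  # null user_id
    ],
)

# Daily active users: COUNT(DISTINCT ...) absorbs duplicates,
# and the WHERE clause makes the null handling explicit.
DAU_SQL = """
SELECT substr(event_time, 1, 10) AS activity_date,
       COUNT(DISTINCT user_id)   AS dau
FROM events
WHERE user_id IS NOT NULL
GROUP BY activity_date
ORDER BY activity_date
"""

# Latest row per user: rank rows within each user, newest first.
# event_id acts as a tie-breaker so the ranking is deterministic.
LATEST_SQL = """
SELECT user_id, event_id, event_time
FROM (
    SELECT e.*,
           ROW_NUMBER() OVER (
               PARTITION BY user_id
               ORDER BY event_time DESC, event_id DESC
           ) AS rn
    FROM events AS e
    WHERE user_id IS NOT NULL
)
WHERE rn = 1
ORDER BY user_id
"""

dau = conn.execute(DAU_SQL).fetchall()
latest = conn.execute(LATEST_SQL).fetchall()
print(dau)
print(latest)
```

The point to narrate in an interview is the order of decisions: the grain came first, then the query, then the explicit handling of duplicates and nulls.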

If this layer still feels shaky, start with SQL Interview Questions and Database Interview Questions before trying to brute-force more pipeline questions.

2. ETL and batch-pipeline questions

This is the classic data-engineering family:

design a daily ingestion pipeline
move data from an OLTP source into an analytics warehouse
rebuild a failed backfill safely
explain how you would transform messy source data

The interviewers are usually checking whether you understand the basic flow:

extract from one or more sources
validate and normalize the input
transform to the target grain
load into the destination
verify the output and alert on failures

Good answers name the operational details that make ETL real. Start Data Engineering's pipeline design patterns frames the key principle I repeat in every coaching session: "idempotency in data pipelines means that one can run a data pipeline multiple times with the same input, and the output will not change" — and achieving that requires both replayable sources and overwritable sinks.
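One way to picture the flow and the idempotency principle together is the sketch below. It is a toy built on sqlite3 and an in-memory dictionary, not a production pattern: the `RAW` source, the `orders` table, and every function name are assumptions made for illustration. The source is replayable because it returns the same rows for the same run date, and the sink is overwritable because each run replaces its whole date partition inside one transaction.

```python
import sqlite3

# Hypothetical raw extract for one run date. A replayable source
# returns the same rows every time the same date is requested.
RAW = {
    "2026-01-01": [
        {"order_id": "o1", "amount": "10.50"},
        {"order_id": "o2", "amount": "4.00"},
        {"order_id": None, "amount": "9.99"},  # fails validation
        {"order_id": "o3", "amount": "oops"},  # fails validation
    ]
}


def extract(run_date):
    return list(RAW[run_date])


def validate(rows):
    """Keep rows with a key and a numeric amount; normalize the type."""
    good = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue
        if row["order_id"] is not None:
            good.append({"order_id": row["order_id"], "amount": amount})
    return good


def load(conn, run_date, rows):
    # Overwritable sink: replace the whole partition in one transaction,
    # so a rerun cannot append a second copy of the same day.
    with conn:
        conn.execute("DELETE FROM orders WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO orders VALUES (?, ?, ?)",
            [(run_date, r["order_id"], r["amount"]) for r in rows],
        )


def run(conn, run_date):
    rows = validate(extract(run_date))
    load(conn, run_date, rows)
    # Verify: the loaded row count must match what was validated.
    (loaded,) = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE run_date = ?", (run_date,)
    ).fetchone()
    if loaded != len(rows):
        raise RuntimeError(f"row count mismatch for {run_date}")
    return loaded


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (run_date TEXT, order_id TEXT, amount REAL)")

first = run(conn, "2026-01-01")
second = run(conn, "2026-01-01")  # simulated retry of the same run
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
print(first, second, total)
```

Running the same date twice leaves the table unchanged, which is the behavior the quoted definition describes.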

The operational details that separate a strong answer from a hand-wave:

scheduling
retries
idempotency
late-arriving data
schema changes
backfills

If you describe ETL as "move data from A to B," you sound underprepared. I have watched candidates lose the round at exactly this point — not because their SQL was wrong, but because they never mentioned what happens when the pipeline fails at 3 AM and needs to backfill. If you say "I would make the load idempotent, track a watermark, and separate incremental runs from backfills," you sound like someone who has actually operated pipelines.
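The "watermark plus separate backfill" sentence can be made concrete with a small sketch. Everything here is hypothetical: the `SOURCE` dictionary stands in for a real system, the watermark is a plain date held in a dict, and daily partitions are an assumed granularity. The design choice it illustrates is that incremental runs advance the watermark while backfills reprocess a fixed range and leave it alone.

```python
from datetime import date, timedelta

# Hypothetical source rows keyed by the date they belong to.
SOURCE = {
    date(2026, 1, 1): ["a", "b"],
    date(2026, 1, 2): ["c"],
    date(2026, 1, 3): ["d", "e"],
}


def load_partition(target, day):
    # Overwrite, never append, so any day can be rerun safely.
    target[day] = list(SOURCE.get(day, []))


def incremental_run(target, state, today):
    """Process every day after the stored watermark, up to today."""
    day = state["watermark"] + timedelta(days=1)
    while day <= today:
        load_partition(target, day)
        state["watermark"] = day  # advance only after the load succeeds
        day += timedelta(days=1)


def backfill(target, start, end):
    """Reprocess a fixed date range without touching the watermark."""
    day = start
    while day <= end:
        load_partition(target, day)
        day += timedelta(days=1)


target = {}
state = {"watermark": date(2025, 12, 31)}

incremental_run(target, state, today=date(2026, 1, 2))
SOURCE[date(2026, 1, 1)].append("late")  # source corrected after the fact
backfill(target, date(2026, 1, 1), date(2026, 1, 1))
incremental_run(target, state, today=date(2026, 1, 3))
print(target, state)
```

Because the load overwrites each partition, the backfill picks up the corrected row for January 1 without duplicating anything, and the next incremental run resumes from where the watermark was left.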

3. Warehouse, storage, and query-performance questions

These questions usually show up as design trade-offs:

row store versus column store
partitioning by date or customer
lake versus warehouse versus lakehouse
why a query got slower as data volume grew

You do not need a warehouse manifesto. You do need a grounded explanation of what the system is optimizing for.

Useful ingredients:

scan volume
filter selectivity
file size and partition shape
update pattern
latency expectation for downstream users

Candidates often get lost by naming tools instead of naming pressures. Striim's warehouse-vs-lake-vs-lakehouse comparison frames the real trade-off: warehouses are expensive but BI-optimized, lakes are cheap but suffer from poor query performance without proper management, and lakehouses bridge the gap with transaction guarantees on lake storage. "Use Snowflake" or "use Delta Lake" is not an answer. "This is append-heavy analytics data with large scans and time-based filters, so columnar storage and date partitioning fit the read pattern better than row-oriented OLTP storage" is the kind of sentence interviewers trust.
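To see why date partitioning fits time-filtered scans, consider the toy model below. It is not how any particular warehouse works internally: the row layout, the in-memory "partitions", and the function names are all assumptions, and the only thing it measures is how many rows each layout has to touch for the same date filter.

```python
from collections import defaultdict

# Hypothetical event rows: (event_date, customer_id, amount).
ROWS = [
    (f"2026-01-{day:02d}", f"c{n % 5}", float(n))
    for day in range(1, 31)
    for n in range(100)
]

# Unpartitioned layout: one big list, so every query scans everything.
flat = list(ROWS)

# Date-partitioned layout: one bucket per event_date.
by_date = defaultdict(list)
for row in ROWS:
    by_date[row[0]].append(row)


def scan_flat(start, end):
    scanned = len(flat)
    hits = [r for r in flat if start <= r[0] <= end]
    return scanned, len(hits)


def scan_partitioned(start, end):
    # Partition pruning: skip buckets whose key is outside the filter.
    scanned = 0
    hits = 0
    for key, bucket in by_date.items():
        if start <= key <= end:
            scanned += len(bucket)
            hits += len(bucket)
    return scanned, hits


flat_scanned, flat_hits = scan_flat("2026-01-10", "2026-01-12")
part_scanned, part_hits = scan_partitioned("2026-01-10", "2026-01-12")
print(flat_scanned, flat_hits)
print(part_scanned, part_hits)
```

Both layouts return the same 300 matching rows, but the flat layout reads all 3,000 rows to find them. The same model also shows the downside: a filter on `customer_id` alone gets no pruning from date partitions, which is why the read pattern has to come before the layout.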

4. Streaming and real-time pipeline questions

This is where the interview starts to look like distributed systems.

Common versions:

design a clickstream pipeline
process events from Kafka in near real time
handle out-of-order or duplicate messages
serve metrics with low latency while keeping the warehouse correct

The core ideas that repeat are:

at-least-once versus exactly-once semantics
watermarking and late data
idempotent consumers
dead-letter queues or replay paths
separation between raw events and curated outputs
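Watermarking and late data are easier to discuss with a concrete model. The sketch below is a simplified, single-process illustration, not the API of any streaming framework: the 60-second tumbling window, the 30-second allowed lateness, and the rule "watermark = largest event time seen minus allowed lateness" are all assumptions chosen for the example.

```python
from collections import defaultdict

WINDOW = 60            # tumbling window size in seconds (assumed)
ALLOWED_LATENESS = 30  # how far behind the max event time we still wait


def process(events):
    """events: iterable of (event_time_seconds, value) in arrival order."""
    open_windows = defaultdict(int)
    closed = {}
    late = []
    max_seen = 0
    for event_time, value in events:
        max_seen = max(max_seen, event_time)
        watermark = max_seen - ALLOWED_LATENESS
        window_start = event_time - event_time % WINDOW
        if window_start in closed:
            # Too late for the live result: route to a replay path.
            late.append((event_time, value))
        else:
            open_windows[window_start] += value
        # Finalize every window that ends at or before the watermark.
        for start in sorted(open_windows):
            if start + WINDOW <= watermark:
                closed[start] = open_windows.pop(start)
    return closed, dict(open_windows), late


arrivals = [(5, 1), (50, 1), (70, 1), (40, 1), (95, 1), (10, 1), (130, 1)]
closed, still_open, late = process(arrivals)
print(closed, still_open, late)
```

The event at time 40 arrives out of order but before its window closes, so it is counted. The event at time 10 arrives after the watermark has passed its window, so it lands in the late list instead of silently changing a finalized result.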

You do not need to promise impossible guarantees. Confluent's exactly-once semantics deep dive explains how Kafka achieves it — "processing of any input record is considered completed if and only if state is updated accordingly and output records are successfully produced once" — but in an interview, the better answer is usually honest:

"I would assume at-least-once delivery, make the downstream write idempotent, and keep enough raw history to replay if the transform logic changes."

I tell every candidate: that sounds much stronger than hand-waving about real-time magic, because it shows you understand the operational reality.
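That answer can be sketched in a few lines. This is an in-memory illustration only: `KeyedSink`, both transform functions, the message shape, and the default currency are hypothetical, and a real system would use a durable store and a real message log. It shows the three pieces of the sentence: duplicate deliveries, an idempotent keyed write, and a raw log that can be replayed through changed logic.

```python
class KeyedSink:
    """Hypothetical sink that upserts by event_id instead of appending."""

    def __init__(self):
        self.rows = {}

    def write(self, row):
        # Same key, same row: a redelivered message overwrites itself.
        self.rows[row["event_id"]] = row


def transform_v1(event):
    return {"event_id": event["event_id"], "amount": event["amount_cents"] / 100}


def transform_v2(event):
    # Later logic change: also carry a currency (the default is assumed).
    row = transform_v1(event)
    row["currency"] = event.get("currency", "USD")
    return row


def consume(messages, sink, transform, raw_log=None):
    for message in messages:
        if raw_log is not None:
            raw_log.append(message)  # keep raw history before transforming
        sink.write(transform(message))


deliveries = [
    {"event_id": "e1", "amount_cents": 1250},
    {"event_id": "e2", "amount_cents": 300},
    {"event_id": "e1", "amount_cents": 1250},  # at-least-once redelivery
]

raw_log = []
sink = KeyedSink()
consume(deliveries, sink, transform_v1, raw_log)
total = sum(row["amount"] for row in sink.rows.values())

# Replay the raw history through the new transform into a fresh sink.
replayed = KeyedSink()
consume(raw_log, replayed, transform_v2)
print(len(sink.rows), total, len(replayed.rows))
```

The redelivered message does not inflate the total, and the replay produces the new shape without asking the source to resend anything.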

5. Python and Pandas questions

Not every data-engineering loop asks Pandas, but enough do that you should be ready for light data-cleaning or reshaping questions.

LeetCode's current Pandas study plan still frames this as beginner-level work: creating columns, cleaning data, and preparing for larger practice sets. That is the useful signal. Interview Pandas questions are usually not research-grade analysis. They are practical manipulation questions:

filter rows
fill or drop missing values
normalize timestamps
dedupe records
group and aggregate

PYTHON

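import pandas as pd


def clean_events(df: pd.DataFrame) -> pd.DataFrame:
    # Work on a copy so the caller's frame is never mutated.
    cleaned = df.copy()
    # Normalize timestamps to UTC; unparseable values become NaT.
    cleaned["event_time"] = pd.to_datetime(
        cleaned["event_time"],
        errors="coerce",
        utc=True,
    )
    # Drop rows missing a critical field (including coerced NaT).
    cleaned = cleaned.dropna(subset=["user_id", "event_time"])
    # Stable sort so rows with equal timestamps keep their input order.
    cleaned = cleaned.sort_values("event_time", kind="mergesort")
    # Keep the most recent row for each event_id.
    cleaned = cleaned.drop_duplicates(
        subset=["event_id"],
        keep="last",
    )
    return cleaned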

The important talk-track is not "I know Pandas." It is:

I normalize the time field early
I drop invalid critical rows explicitly
I dedupe on the stable key
I preserve deterministic ordering before downstream logic

Real Python's Pandas combining guide is worth reviewing if you are rusty on merge vs join vs concat. Understanding that an inner join drops unmatched rows, while an outer join keeps every row from both sides and fills the missing columns with NaN, is exactly the kind of detail that separates data engineering from notebook theater.
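A quick way to check that inner-versus-outer behavior is the sketch below. The `orders` and `payments` frames and their columns are made up for illustration. The `how` and `indicator` parameters are standard `DataFrame.merge` options.

```python
import pandas as pd

# Hypothetical inputs: order 1 has no payment, payment 4 has no order.
orders = pd.DataFrame(
    {"order_id": [1, 2, 3], "customer_id": ["a", "b", "c"]}
)
payments = pd.DataFrame(
    {"order_id": [2, 3, 4], "paid_cents": [500, 700, 900]}
)

# Inner join: only keys present on both sides survive.
inner = orders.merge(payments, on="order_id", how="inner")

# Outer join: every key survives; missing columns are filled with NaN.
# indicator=True adds a _merge column saying where each row came from.
outer = orders.merge(payments, on="order_id", how="outer", indicator=True)

# Rows that an inner join would have dropped silently.
unmatched = outer[outer["_merge"] != "both"]

print(len(inner), len(outer))
print(unmatched[["order_id", "_merge"]])
```

In a pipeline, counting the unmatched rows before choosing the join type is a cheap quality check: it shows how much data an inner join would discard.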

Common mistakes that cost easy points

Codecademy's data engineer interview guide names the two things interviews assess: technical proficiency and the ability to communicate conceptual understanding. Most of the mistakes below fail on the communication side:

Treating the interview like pure SQL when the prompt is really about pipeline design.
Naming tools without naming the pressure that justifies them.
Ignoring idempotency, backfills, and schema evolution.
Optimizing for speed before proving correctness.
Talking about streaming as if duplicates and late events do not exist.
Using Pandas or Spark vocabulary without explaining the transformation logic.

The answer shape that works

For most data-engineering prompts, use this order:

define the source and target
define the grain and contract
define the transform logic
define failure modes and quality checks
define scaling or latency trade-offs

That sequence keeps you from rambling and makes the pipeline legible to the interviewer.

A short prep plan for data-engineering loops

Drill SQL until joins, windows, dedupe, and retention queries feel boring.
Review one clean ETL design and one streaming design end to end.
Practice explaining partitioning, file layout, and performance trade-offs in plain English.
Do a small Pandas cleanup exercise so row filtering, missing-data handling, and grouping do not feel rusty.
Run one mock that mixes SQL, pipeline design, and one failure-analysis or quality question in the same session.