The Great Escape: Migrating from Databricks to EMR Without Burning Your Pipeline Down

TL;DR: Your CFO saw the Databricks bill and nearly choked on their coffee. Now you’ve been tasked with migrating your cozy, scheduled PySpark notebooks, which read from Amazon MSK (Kafka), over to the wild west of Amazon EMR. Will you save money? Yes. Will you question your life choices while configuring YARN memory overhead? Also yes. Here is the Staff Engineer’s guide to migrating your ingestion pipeline, handling late CDC arrivals, and surviving the architecture shift.


Act 1: The Architecture Shift (What is actually happening?)

Let’s be honest. Databricks is like staying at an all-inclusive resort. You write some code in a notebook, hit "Schedule," and Databricks provisions the cluster, manages the Delta Lake magic, and handles the orchestration.

Moving to EMR is like going camping. You bring your own tent (infrastructure management), start your own fire (Spark tuning), and occasionally fight a bear (Bootstrap Actions).

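The Current State vs. Future State

Current state: scheduled PySpark notebooks on Databricks reading from MSK. Future state: the same ingestion packaged as Spark jobs on EMR, with orchestration you run yourself.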


Act 2: Will it actually benefit you? (Cost vs. Sanity)

The Financial Win: You will most likely save a boatload of cash. By dropping the Databricks Unit (DBU) licensing fee, you pay only for EC2 compute plus the per-instance EMR fee. If you run your task nodes on EC2 Spot Instances, you can slash the compute portion of the bill by up to 70%, depending on instance type, region, and how often AWS reclaims your capacity.
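To make the comparison above concrete, here is a back-of-the-envelope sketch. Every rate in it is a made-up placeholder, not a real price; plug in your own numbers from your bills before showing anything to the CFO.

```python
from dataclasses import dataclass


@dataclass
class HourlyRates:
    """All values are hypothetical placeholders, in USD per instance-hour."""
    ec2_on_demand: float  # EC2 On-Demand price
    dbu_fee: float        # Databricks DBU charge attributed to one instance
    emr_fee: float        # EMR surcharge on top of EC2
    spot_discount: float  # fraction off On-Demand when using Spot (0.0 to 1.0)


def databricks_monthly(rates: HourlyRates, nodes: int, hours: float) -> float:
    # On Databricks you pay EC2 plus the DBU fee for every node-hour.
    return nodes * hours * (rates.ec2_on_demand + rates.dbu_fee)


def emr_monthly(rates: HourlyRates, on_demand_nodes: int,
                spot_nodes: int, hours: float) -> float:
    # On EMR you pay EC2 plus the EMR fee; only the EC2 part is discounted on Spot.
    on_demand = on_demand_nodes * hours * (rates.ec2_on_demand + rates.emr_fee)
    spot_price = rates.ec2_on_demand * (1 - rates.spot_discount)
    spot = spot_nodes * hours * (spot_price + rates.emr_fee)
    return on_demand + spot


# Hypothetical scenario: 10 nodes, 200 hours a month, 6 of them on Spot.
rates = HourlyRates(ec2_on_demand=1.00, dbu_fee=0.60, emr_fee=0.25, spot_discount=0.70)
before = databricks_monthly(rates, nodes=10, hours=200)
after = emr_monthly(rates, on_demand_nodes=4, spot_nodes=6, hours=200)
print(f"Databricks: ${before:,.2f}  EMR: ${after:,.2f}")
```

Note that the EMR fee is not discounted on Spot, so the saving on the total bill is always smaller than the Spot discount itself.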

The Hidden Cost: You pay for it in engineering hours. Databricks' built-in optimizations (the Photon engine, optimized Delta reads/writes) are gone. You are now running the EMR build of Apache Spark with OSS table formats, and the tuning is on you. If your jobs suddenly run 30% slower on EMR, you have to dig into the Spark UI yourself and work out whether the culprit is spark.sql.shuffle.partitions, skewed keys, or undersized executors.
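Two bits of arithmetic come up constantly once the tuning is yours. The sketch below shows how big a YARN container an executor actually asks for (Spark's default overhead is 10% of executor memory with a 384 MiB floor) and a rule-of-thumb starting point for shuffle partitions. The 128 MiB target per partition is an assumption for illustration, not a Spark default, and both helper names are hypothetical.

```python
import math

MIN_OVERHEAD_MIB = 384   # Spark's floor for spark.executor.memoryOverhead
OVERHEAD_FACTOR = 0.10   # Spark's default overhead fraction on YARN


def yarn_container_mib(executor_memory_mib, overhead_mib=None):
    """Memory YARN must grant per executor: heap plus off-heap overhead."""
    if overhead_mib is None:
        overhead_mib = max(MIN_OVERHEAD_MIB, int(executor_memory_mib * OVERHEAD_FACTOR))
    return executor_memory_mib + overhead_mib


def suggested_shuffle_partitions(shuffle_bytes, total_cores,
                                 target_partition_bytes=128 * 1024**2):
    """Rule of thumb: roughly 128 MiB per partition, never fewer than the core count."""
    by_size = math.ceil(shuffle_bytes / target_partition_bytes)
    return max(by_size, total_cores)


# An 8 GiB executor needs more than 8 GiB from YARN.
print(yarn_container_mib(8192))
# A 50 GiB shuffle on a 64-core cluster.
print(suggested_shuffle_partitions(50 * 1024**3, total_cores=64))
```

PySpark workers live in that overhead space, so Python-heavy jobs often need an explicit, larger spark.executor.memoryOverhead than the default. Treat the partition count as a first guess to refine in the Spark UI.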


Act 3: Production Edge Cases (Where it all goes wrong)

When you move a streaming/batch CDC pipeline from Databricks to EMR, here are the edge cases that will haunt your on-call shifts:

1. The Spot Instance Massacre

You want to save money, so you run your EMR Task nodes on AWS Spot Instances. Mid-ingestion, AWS reclaims your nodes.

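  • The Danger: Structured Streaming recovers from its checkpoint. If that checkpoint lives on cluster-local storage (HDFS on a node that just got reclaimed) instead of S3, or your sink is not idempotent, a restart will double-process MSK offsets or drop CDC events.

  • The Fix: Keep the checkpoint location on S3, keep the trigger interval short so each micro-batch commits a small range of offsets, use EMR's graceful decommissioning feature, and never put your Spark Driver on a Spot instance.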
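The "Spot for task nodes only" layout can be sketched as the instance-group and configuration structures that boto3's EMR run_job_flow call accepts. This is a minimal sketch under assumptions: the instance type, group names, node counts, and timeout are placeholders, and the helper names are hypothetical. It only builds the dictionaries; it does not call AWS.

```python
def build_instance_groups(core_count, task_count, instance_type="m5.xlarge"):
    """Primary and core nodes stay On-Demand; only task nodes ride on Spot."""
    def group(name, role, market, count):
        return {
            "Name": name,
            "InstanceRole": role,      # MASTER, CORE, or TASK
            "Market": market,          # ON_DEMAND or SPOT
            "InstanceType": instance_type,
            "InstanceCount": count,
        }

    return [
        group("primary", "MASTER", "ON_DEMAND", 1),
        group("core", "CORE", "ON_DEMAND", core_count),  # HDFS lives here
        group("task-spot", "TASK", "SPOT", task_count),  # safe to lose
    ]


def build_configurations(decommission_timeout_secs=600):
    """Give YARN time to drain containers before a node is removed."""
    return [
        {
            "Classification": "yarn-site",
            "Properties": {
                "yarn.resourcemanager.nodemanager-graceful-decommission-timeout-secs":
                    str(decommission_timeout_secs),
            },
        }
    ]


instance_groups = build_instance_groups(core_count=2, task_count=6)
configurations = build_configurations()
```

The two results would go into run_job_flow as Instances={"InstanceGroups": instance_groups, ...} and Configurations=configurations. Where the driver actually runs depends on your deploy mode and YARN settings, so verify it against your own cluster rather than trusting this layout blindly.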

2. CDC & Late Arrivals without "Magic"

In Databricks, handling late-arriving data with Delta Live Tables or optimized MERGE statements is heavily abstracted. On EMR, you are using OSS Delta Lake or Apache Iceberg.

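  • The Danger: A CDC UPDATE record arrives 4 hours late. If a newer update for the same key has already been merged, a naive MERGE overwrites fresh data with stale data, and a watermark that is too tight silently drops the record instead.

  • The Fix: You must manually write rock-solid withWatermark() logic in your PySpark structured streams, with a lateness threshold that covers your worst realistic delay. If you merge in micro-batches, your MERGE INTO logic needs strict partition pruning, or Spark will full-scan the table on S3 and cost you more than Databricks ever did.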
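A minimal sketch of both halves of that fix follows. The first helper applies a watermark wide enough to keep a 4-hour-late record and drops exact replays; the second builds a MERGE statement that prunes on partition and refuses to let an older change overwrite a newer one. The column names (event_ts, updated_at, dt), table names, and the 6-hour threshold are all assumptions for illustration, and the sketch assumes a key never moves between partitions.

```python
def dedupe_late_events(events, key_cols, event_time_col="event_ts",
                       max_lateness="6 hours"):
    """Apply a watermark to a streaming DataFrame and drop replayed events.

    Including the event-time column in the dedup key lets Spark expire
    old state once the watermark passes it.
    """
    return (
        events
        .withWatermark(event_time_col, max_lateness)
        .dropDuplicates([*key_cols, event_time_col])
    )


def build_merge_sql(target, source_view, key_col, partition_col,
                    partitions, version_col="updated_at"):
    """Build a MERGE that only touches the partitions present in the batch.

    `partitions` must come from trusted code (for example, the distinct
    partition values of the current micro-batch), not from user input.
    """
    quoted = ", ".join(f"'{p}'" for p in sorted(partitions))
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source_view} AS s\n"
        f"ON t.{key_col} = s.{key_col}\n"
        f"   AND t.{partition_col} IN ({quoted})\n"           # partition pruning
        f"WHEN MATCHED AND s.{version_col} > t.{version_col} "
        f"THEN UPDATE SET *\n"                                # late rows cannot clobber newer ones
        f"WHEN NOT MATCHED THEN INSERT *"
    )


print(build_merge_sql("lake.orders", "updates", "order_id", "dt",
                      {"2024-01-02", "2024-01-01"}))
```

In a streaming job you would typically call build_merge_sql inside a foreachBatch function, register the micro-batch as the source view, and run the statement with spark.sql. The version comparison is what actually protects you from late arrivals; the watermark only bounds how much state Spark keeps.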

3. The "Notebook Lift and Shift" Disaster

  • The Danger: Exporting your .ipynb notebooks and trying to run them directly on EMR using Papermill or a hacky script.

  • The Fix: Don't do this. Grow up, refactor your code into modular .py files, package them as wheels (.whl) or zip files, and submit them via spark-submit.
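Once the code is packaged, submitting it as an EMR step looks roughly like this. The sketch builds the step dictionary that boto3's add_job_flow_steps accepts, using command-runner.jar to invoke spark-submit. The bucket, file names, and job arguments are placeholders, and the helper name is hypothetical. A zip is used for the dependencies here because that is the archive format spark-submit documents for --py-files.

```python
def build_spark_submit_step(name, entry_point, py_files=(), job_args=()):
    """Return an EMR step that runs spark-submit in cluster deploy mode."""
    args = ["spark-submit", "--deploy-mode", "cluster"]
    if py_files:
        # Dependencies shipped alongside the entry point, comma-separated.
        args += ["--py-files", ",".join(py_files)]
    args.append(entry_point)
    args += list(job_args)
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": args,
        },
    }


# Placeholder S3 paths; replace with your own artifacts.
step = build_spark_submit_step(
    name="msk-cdc-ingest",
    entry_point="s3://my-artifacts/jobs/ingest_main.py",
    py_files=["s3://my-artifacts/jobs/ingest_lib.zip"],
    job_args=["--topic", "orders-cdc"],
)
```

You would pass it along as emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step]). The entry point stays a thin script that parses arguments and calls into the packaged library, which is exactly the structure a notebook export never gives you.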


Act 4: The Pros and Cons

| Feature | Databricks (The Resort) | AWS EMR (The Wilderness) |
| --- | --- | --- |
| Cost | 💸💸💸 (AWS Compute + DBU Tax) | 💸 (AWS Compute + a small EMR fee) |
| Orchestration | Built-in Databricks Workflows | DIY (Airflow, Step Functions) |
| Code format | Notebooks (Easy, but messy) | spark-submit Python scripts (Proper SWE) |
| Performance Tuning | Auto-magical (Photon engine) | Manual (Welcome to spark-defaults.conf) |
| Spark UI | Click a button, it's there forever | Dig through the Spark History Server / S3 logs |

Act 5: The Golden Rules (Do's and Don'ts)

  • DO consider EMR Serverless. If you hate tuning EC2 instances and managing YARN queues, EMR Serverless is a good middle ground between Databricks' ease of use and standard EMR's low cost.

  • DON'T skip the orchestrator. The Databricks job scheduler (Workflows) is gone. Invest time in setting up Amazon Managed Workflows for Apache Airflow (MWAA) to orchestrate your MSK ingestion steps.

  • DO consider switching to Apache Iceberg (optional). If you are leaving the Databricks ecosystem, OSS Delta Lake still works fine, but Apache Iceberg has tighter native AWS integration (Athena, Glue Data Catalog) for CDC tables.

  • DON'T complain about the Spark UI. You are saving the company $200k a year. Put on your big kid pants and read the logs.
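For the EMR Serverless rule above, job submission shrinks to a single API call. The sketch below builds the keyword arguments for the start_job_run operation of boto3's emr-serverless client. The application ID, role ARN, S3 path, and Spark parameters are placeholders, and the helper name is hypothetical. It only assembles the request; it does not call AWS.

```python
def build_serverless_job_request(application_id, execution_role_arn,
                                 entry_point, entry_point_args=(),
                                 spark_params=""):
    """Assemble kwargs for emr-serverless start_job_run (Spark job)."""
    spark_submit = {
        "entryPoint": entry_point,
        "entryPointArguments": list(entry_point_args),
    }
    if spark_params:
        # Plain spark-submit flags, passed as one string.
        spark_submit["sparkSubmitParameters"] = spark_params
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {"sparkSubmit": spark_submit},
    }


# Placeholder identifiers; substitute your own.
request = build_serverless_job_request(
    application_id="00example123",
    execution_role_arn="arn:aws:iam::123456789012:role/example-emr-serverless-role",
    entry_point="s3://my-artifacts/jobs/ingest_main.py",
    entry_point_args=["--topic", "orders-cdc"],
    spark_params="--conf spark.executor.memory=8g",
)
```

With a real client this becomes boto3.client("emr-serverless").start_job_run(**request). The same dictionary is also a natural payload for whatever orchestrator you pick, so the MWAA rule and this one fit together.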

Migrating from Databricks to EMR is the ultimate test of a Data Engineer. Anyone can write PySpark in a notebook. It takes a true engineer to package it, orchestrate it, and keep it running on Spot instances while the CFO sleeps soundly on a bed of saved cash.

#DataEngineering #AWSEMR #Databricks #CloudFinOps #ApacheSpark #BigData #PySpark #Kafka #SystemArchitecture
