Course Outline
Introduction, Objectives, and Migration Strategy
- Course goals, participant profile alignment, and success criteria
- High-level migration approaches and risk considerations
- Setting up workspaces, repositories, and lab datasets
Day 1 — Migration Fundamentals and Architecture
- Lakehouse concepts, Delta Lake overview, and Databricks architecture
- SMP vs MPP differences and implications for migration
- Medallion (Bronze→Silver→Gold) design and Unity Catalog overview (Bronze→Silver sketch below)
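As a minimal illustration of the Medallion flow listed above, here is a hedged Bronze→Silver sketch in PySpark. It assumes a Databricks notebook where `spark` is predefined; the `bronze.orders` / `silver.orders` tables and their columns are invented for illustration.

```python
from pyspark.sql import functions as F

# Read raw data from the (assumed) Bronze layer
bronze_df = spark.read.table("bronze.orders")

# Apply basic cleansing rules on the way to Silver
silver_df = (
    bronze_df
    .dropDuplicates(["order_id"])              # remove duplicate ingests
    .filter(F.col("order_ts").isNotNull())     # simple quality rule
    .withColumn("order_date", F.to_date("order_ts"))
)

# Persist the cleansed data as a Delta table in the Silver layer
silver_df.write.format("delta").mode("overwrite").saveAsTable("silver.orders")
```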
Day 1 Lab — Translating a Stored Procedure
- Hands-on migration of a sample stored procedure to a notebook
- Mapping temp tables and cursors to DataFrame transformations (see the sketch after this list)
- Validation and comparison with original output
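As a sketch of the temp-table and cursor mapping exercised in this lab: a cursor that accumulated per-customer totals row by row becomes a single set-based aggregation. The `silver.orders` table and its columns are assumptions, not the actual lab dataset.

```python
from pyspark.sql import functions as F

orders = spark.read.table("silver.orders")   # assumed input table

# The cursor's row-by-row accumulation becomes one declarative aggregation
totals = (
    orders
    .groupBy("customer_id")
    .agg(
        F.sum("amount").alias("total_amount"),
        F.count("*").alias("order_count"),
    )
)

# Where the stored procedure used a #temp table, a temporary view works
totals.createOrReplaceTempView("customer_totals")
```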
Day 2 — Advanced Delta Lake & Incremental Loading
- ACID transactions, the Delta transaction log, versioning, and time travel
- Auto Loader, MERGE INTO patterns, upserts, and schema evolution (MERGE sketch below)
- OPTIMIZE, VACUUM, Z-ORDER, partitioning, and storage tuning
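A minimal MERGE INTO (upsert) sketch using the Delta Lake Python API; the `silver.customers` target, the `bronze.customer_updates` source, and the join key are placeholders.

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "silver.customers")
updates_df = spark.read.table("bronze.customer_updates")

(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()       # update rows that already exist
    .whenNotMatchedInsertAll()    # insert rows that are new
    .execute()
)
```

The same upsert can also be written in SQL with MERGE INTO ... USING ... ON ..., which is covered alongside the Python API.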
Day 2 Lab — Incremental Ingestion & Optimization
- Implementing Auto Loader ingestion and MERGE workflows (ingestion sketch below)
- Applying OPTIMIZE, Z-ORDER, and VACUUM; validating results
- Measuring read/write performance improvements
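A sketch of the lab's ingestion-and-optimization flow, assuming a Databricks runtime with Auto Loader available; the paths, table names, and Z-ORDER column are illustrative only.

```python
# Incrementally ingest new files from a landing area with Auto Loader
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/orders")
    .load("/mnt/landing/orders")
)

query = (
    stream.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/orders")
    .trigger(availableNow=True)    # process what is available, then stop
    .toTable("bronze.orders")
)
query.awaitTermination()           # wait for this batch of files to finish

# Compact small files, co-locate data by a frequent filter column,
# and remove files no longer referenced by the table
spark.sql("OPTIMIZE bronze.orders ZORDER BY (order_id)")
spark.sql("VACUUM bronze.orders RETAIN 168 HOURS")
```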
Day 3 — SQL in Databricks, Performance & Debugging
- Analytical SQL features: window functions, higher-order functions, JSON/array handling
- Reading the Spark UI, DAGs, shuffles, stages, tasks, and bottleneck diagnosis
- Query tuning patterns: broadcast joins, hints, caching, and spill reduction (sketch below)
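Two of the patterns above, a window function and an explicit broadcast join, shown as a short PySpark sketch; the tables and columns are assumed.

```python
from pyspark.sql import functions as F, Window
from pyspark.sql.functions import broadcast

orders = spark.read.table("silver.orders")        # assumed fact table
customers = spark.read.table("silver.customers")  # assumed small dimension

# Window function: rank each customer's orders by amount
w = Window.partitionBy("customer_id").orderBy(F.col("amount").desc())
ranked = orders.withColumn("order_rank", F.row_number().over(w))

# Broadcasting the small dimension avoids a shuffle on the large side
joined = ranked.join(broadcast(customers), "customer_id")
```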
Day 3 Lab — SQL Refactoring & Performance Tuning
- Refactor a heavy SQL process into optimized Spark SQL
- Use Spark UI traces to identify and fix skew and shuffle issues (see the settings sketch below)
- Benchmark before/after and document tuning steps
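As a starting point for the skew and shuffle fixes in this lab, a sketch of standard Spark settings worth checking before deeper rewrites; the values shown are illustrative, not recommendations.

```python
# Adaptive Query Execution re-plans at runtime and can split skewed partitions
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# The shuffle partition count usually needs tuning to the actual data volume
spark.conf.set("spark.sql.shuffle.partitions", "200")
```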
Day 4 — Tactical PySpark: Replacing Procedural Logic
- Spark execution model: driver, executors, lazy evaluation, and partitioning strategies
- Transforming loops and cursors into vectorized DataFrame operations
- Modularization, UDFs/pandas UDFs, widgets, and reusable libraries (pandas UDF sketch below)
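A sketch of replacing row-by-row logic with a vectorized pandas UDF; the scoring formula and column names are invented for illustration.

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def risk_score(amount: pd.Series, days_late: pd.Series) -> pd.Series:
    # Operates on whole column batches rather than one row at a time
    return amount * 0.01 + days_late * 0.5

orders = spark.read.table("silver.orders")   # assumed input
scored = orders.withColumn("risk_score", risk_score("amount", "days_late"))
```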
Day 4 Lab — Refactoring Procedural Scripts
- Refactor a procedural ETL script into modular PySpark notebooks
- Introduce parametrization, unit-style tests, and reusable functions (widget sketch below)
- Code review and best-practice checklist application
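A parametrization sketch with notebook widgets, assuming `dbutils` as provided in Databricks notebooks; widget names and defaults are illustrative.

```python
# Expose run parameters as widgets so jobs and users can override them
dbutils.widgets.text("source_table", "silver.orders")
dbutils.widgets.text("run_date", "2024-01-01")

source_table = dbutils.widgets.get("source_table")
run_date = dbutils.widgets.get("run_date")

df = spark.read.table(source_table).filter(f"order_date = '{run_date}'")
```

Keeping transformation logic in plain functions that take DataFrames in and return DataFrames out makes it straightforward to wrap unit-style tests around them.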
Day 5 — Orchestration, End-to-End Pipeline & Best Practices
- Databricks Workflows: job design, task dependencies, triggers, and error handling
- Designing incremental Medallion pipelines with quality rules and schema validation (validation sketch below)
- Integration with Git (GitHub/Azure DevOps), CI, and testing strategies for PySpark logic
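A minimal sketch of a quality/schema validation step that a Workflows task could run before promoting data; the expected columns and rules are assumptions.

```python
from pyspark.sql import functions as F

expected_columns = {"order_id", "customer_id", "amount", "order_ts"}
df = spark.read.table("bronze.orders")   # assumed input

missing = expected_columns - set(df.columns)
if missing:
    raise ValueError(f"Schema validation failed; missing columns: {missing}")

null_keys = df.filter(F.col("order_id").isNull()).count()
if null_keys:
    raise ValueError(f"Quality rule failed: {null_keys} rows with null order_id")
```

A raised exception fails the task, so Workflows retry policies and alerts can take over.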
Day 5 Lab — Build a Complete End-to-End Pipeline
- Assemble a Bronze→Silver→Gold pipeline orchestrated with Workflows
- Implement logging, auditing, retries, and automated validations (audit sketch below)
- Run full pipeline, validate outputs, and prepare deployment notes
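A simple audit-logging sketch that appends one row per pipeline run; the `ops.pipeline_audit` table, the `gold.orders_daily` output, and their columns are hypothetical.

```python
from datetime import datetime, timezone

run_ts = datetime.now(timezone.utc).isoformat()
row_count = spark.read.table("gold.orders_daily").count()   # assumed output table

audit_row = [("orders_pipeline", run_ts, row_count)]
(
    spark.createDataFrame(audit_row, "pipeline string, run_ts string, row_count long")
    .write.mode("append")
    .saveAsTable("ops.pipeline_audit")
)
```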
Operationalization, Governance, and Production Readiness
- Unity Catalog governance, lineage, and access controls best practices
- Cost, cluster sizing, autoscaling, and job concurrency patterns
- Deployment checklists, rollback strategies, and runbook creation
Final Review, Knowledge Transfer, and Next Steps
- Participant presentations of migration work and lessons learned
- Gap analysis, recommended follow-up activities, and training materials handoff
- References, further learning paths, and support options
Requirements
- An understanding of data engineering concepts
- Experience with SQL and stored procedures (Synapse / SQL Server)
- Familiarity with ETL orchestration concepts (ADF or similar)
Audience
- Technology managers with a data engineering background
- Data engineers transitioning procedural OLAP logic to Lakehouse patterns
- Platform engineers responsible for Databricks adoption
Testimonials (1)
All the topics covered, even though many went by very quickly, give us an idea of what we will need to dig into further. I also liked that we got to do hands-on practice, although, as I said, I think the course deserves more.
Sandra Mariela Lopez Bernal - Kueski
Course - Databricks