Course Outline
Introduction, Objectives, and Migration Strategy
- Course goals, alignment with participant profiles, and success criteria.
- High-level migration approaches and risk considerations.
- Setup of workspaces, repositories, and lab datasets.
Day 1 — Migration Fundamentals and Architecture
- Lakehouse concepts, Delta Lake overview, and Databricks architecture.
- SMP versus MPP: key differences and their implications for migration.
- Medallion (Bronze→Silver→Gold) design and Unity Catalog overview.
Day 1 Lab — Translating a Stored Procedure
- Hands-on migration of a sample stored procedure to a notebook.
- Mapping temporary tables and cursors to DataFrame transformations.
- Validation and comparison with original output.
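One possible approach to the validation step is an order-insensitive comparison of the rows returned by the original stored procedure and by the migrated notebook. The sketch below is plain Python and assumes both outputs have already been collected as lists of tuples. The function name `compare_outputs` and the sample rows are hypothetical, and a real lab would more likely compare DataFrames directly.

```python
from collections import Counter


def compare_outputs(original_rows, migrated_rows):
    """Compare two result sets as multisets, ignoring row order.

    Returns (missing, unexpected): rows found only in the original
    output and rows found only in the migrated output, with counts.
    """
    original = Counter(original_rows)
    migrated = Counter(migrated_rows)
    missing = original - migrated      # in original, absent from migrated
    unexpected = migrated - original   # in migrated, absent from original
    return dict(missing), dict(unexpected)


# Hypothetical sample data: (customer_id, total) rows from each system.
original_rows = [(1, 100.0), (2, 250.5), (3, 75.0)]
migrated_rows = [(3, 75.0), (1, 100.0), (2, 250.0)]

missing, unexpected = compare_outputs(original_rows, migrated_rows)
```

Comparing as multisets rather than sets means duplicate rows are counted, which matters when a cursor-based procedure and its DataFrame rewrite could differ only in how many times a row is emitted.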
Day 2 — Advanced Delta Lake & Incremental Loading
- ACID transactions, commit logs, versioning, and time travel capabilities.
- Auto Loader, MERGE INTO patterns, upserts, and schema evolution.
- OPTIMIZE, VACUUM, Z-ordering (ZORDER BY), partitioning, and storage tuning.
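The upsert behaviour behind a MERGE can be illustrated without a cluster. The following is a minimal plain-Python sketch of the semantics only, not the Delta Lake API: rows are dictionaries, and `merge_upsert`, `target`, and `updates` are hypothetical names. It also lets the last duplicate key in a batch win, whereas a real MERGE generally expects the source to be deduplicated first.

```python
def merge_upsert(target, updates, key):
    """Return a new list of rows after applying MERGE-style upsert semantics.

    A row in `updates` whose key matches a target row replaces it
    (the WHEN MATCHED THEN UPDATE branch); any other row is appended
    (the WHEN NOT MATCHED THEN INSERT branch).
    """
    merged = {row[key]: row for row in target}
    for row in updates:
        merged[row[key]] = row  # update if present, insert otherwise
    return list(merged.values())


# Hypothetical target table and incoming batch.
target = [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}]
updates = [{"id": 2, "status": "closed"}, {"id": 3, "status": "open"}]

result = merge_upsert(target, updates, key="id")
```

Applying the same batch twice leaves the result unchanged, which is the property that makes upserts suitable for re-runnable incremental loads.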
Day 2 Lab — Incremental Ingestion & Optimization
- Implementation of Auto Loader ingestion and MERGE workflows.
- Application of OPTIMIZE, Z-ordering (ZORDER BY), and VACUUM; validation of results.
- Measurement of read/write performance improvements.
Day 3 — SQL in Databricks, Performance & Debugging
- Analytical SQL features: window functions, higher-order functions, and JSON/array handling.
- Reading the Spark UI, DAGs, shuffles, stages, tasks, and diagnosing bottlenecks.
- Query tuning patterns: broadcast joins, hints, caching, and spill reduction.
Day 3 Lab — SQL Refactoring & Performance Tuning
- Refactoring a complex SQL process into optimized Spark SQL.
- Using Spark UI traces to identify and resolve skew and shuffle issues.
- Benchmarking before/after scenarios and documenting tuning steps.
Day 4 — Tactical PySpark: Replacing Procedural Logic
- Spark execution model: driver, executors, lazy evaluation, and partitioning strategies.
- Transforming loops and cursors into vectorized DataFrame operations.
- Modularization, UDFs/pandas UDFs, widgets, and reusable libraries.
Day 4 Lab — Refactoring Procedural Scripts
- Refactoring a procedural ETL script into modular PySpark notebooks.
- Introduction of parametrization, unit-style tests, and reusable functions.
- Code review and application of best-practice checklists.
Day 5 — Orchestration, End-to-End Pipeline & Best Practices
- Databricks Workflows: job design, task dependencies, triggers, and error handling.
- Designing incremental Medallion pipelines with quality rules and schema validation.
- Integration with Git (GitHub/Azure DevOps), CI, and testing strategies for PySpark logic.
Day 5 Lab — Build a Complete End-to-End Pipeline
- Assembly of a Bronze→Silver→Gold pipeline orchestrated with Workflows.
- Implementation of logging, auditing, retries, and automated validations.
- Running the full pipeline, validating outputs, and preparing deployment notes.
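As an illustration of the logging and retry items above, the sketch below wraps a pipeline step in a retry loop with exponential backoff. It is plain Python under stated assumptions: `run_with_retries`, its parameters, and the `"pipeline"` logger name are hypothetical, and Databricks Workflows also offers task-level retry settings, so an in-code helper like this is only one option.

```python
import logging
import time

logger = logging.getLogger("pipeline")  # hypothetical logger name


def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `task()` until it succeeds or `max_attempts` is reached.

    Waits base_delay * 2**(attempt - 1) seconds between attempts and
    logs every failure. The last exception is re-raised if all
    attempts fail. `sleep` is injectable so tests can avoid waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = task()
            logger.info("task succeeded on attempt %d", attempt)
            return result
        except Exception as exc:
            logger.warning("attempt %d/%d failed: %s", attempt, max_attempts, exc)
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

A step would be passed as a zero-argument callable, for example `run_with_retries(lambda: load_silver_table())`, where `load_silver_table` stands in for whatever function the lab defines.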
Operationalization, Governance, and Production Readiness
- Best practices for Unity Catalog governance, lineage, and access control.
- Cost management, cluster sizing, autoscaling, and job concurrency patterns.
- Deployment checklists, rollback strategies, and runbook creation.
Final Review, Knowledge Transfer, and Next Steps
- Participant presentations of migration work and lessons learned.
- Gap analysis, recommended follow-up activities, and handoff of training materials.
- References, further learning paths, and support options.
Requirements
- Foundational understanding of data engineering concepts.
- Practical experience with SQL and stored procedures (Synapse / SQL Server).
- Familiarity with ETL orchestration concepts (such as ADF or similar tools).
Target Audience
- Technology managers with a background in data engineering.
- Data engineers transitioning from procedural OLAP logic to Lakehouse patterns.
- Platform engineers responsible for driving Databricks adoption.