Course Outline

Introduction:

  • Apache Spark within the Hadoop Ecosystem
  • Overview of Python and Scala

Foundational Theory:

  • Architecture
  • RDDs
  • Transformations and Actions
  • Stages, Tasks, and Dependencies

Hands-on Workshop: Understanding the Basics in the Databricks Environment:

  • Exercises with the RDD API
  • Core action and transformation functions
  • PairRDDs
  • Joins
  • Caching strategies
  • Exercises with the DataFrame API
  • Spark SQL
  • DataFrame operations: select, filter, group, sort
  • User-Defined Functions (UDFs)
  • Introduction to the Dataset API
  • Streaming

Hands-on Workshop: Understanding Deployment in the AWS Environment:

  • Basics of AWS Glue
  • Comparing Amazon EMR and AWS Glue
  • Example jobs in both environments
  • Evaluating advantages and disadvantages

Additional Topics:

  • Introduction to Apache Airflow orchestration

Requirements

Programming skills (preferably in Python and Scala)

Basic knowledge of SQL

Duration: 21 Hours
