Course Outline

Section 1: Introduction to Hadoop

  • History and core concepts of Hadoop
  • Ecosystem overview
  • Available distributions
  • High-level architecture
  • Common myths about Hadoop
  • Key challenges in Hadoop implementation
  • Hardware and software considerations
  • Lab: Initial exploration of Hadoop

Section 2: HDFS

  • Design principles and architecture
  • Core concepts (horizontal scaling, replication, data locality, rack awareness)
  • Daemons: NameNode, Secondary NameNode, DataNode
  • Communication mechanisms and heartbeats
  • Data integrity management
  • Read and write paths
  • NameNode High Availability (HA) and Federation
  • Labs: Interacting with HDFS
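
The block and replication concepts above can be sketched with simple arithmetic. The sketch below assumes the common defaults of a 128 MB block size and a replication factor of 3; both are configurable in a real cluster (`dfs.blocksize`, `dfs.replication`), and the class and method names here are illustrative:

```java
// Sketch: how HDFS splits a file into blocks and replicates them.
// Assumes the common defaults of a 128 MB block size and a replication
// factor of 3; both are cluster-configurable.
public class HdfsBlockMath {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB
    static final int REPLICATION = 3;

    // Number of blocks needed to store a file of the given size.
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    // Total raw bytes consumed across the cluster, including replicas.
    static long rawBytesStored(long fileSizeBytes) {
        return fileSizeBytes * REPLICATION;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGb));      // 8 blocks
        System.out.println(rawBytesStored(oneGb));  // 3221225472 (3 GB raw)
    }
}
```

This is why HDFS favors a small number of large files over many small ones: each block, however small, costs NameNode metadata and three replicas.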

Section 3: MapReduce

  • Concepts and architecture
  • Daemons (MRv1): JobTracker and TaskTracker
  • Processing phases: driver, mapper, shuffle/sort, reducer
  • MapReduce Version 1 and Version 2 (YARN)
  • Deep dive into MapReduce internals
  • Introduction to Java-based MapReduce programming
  • Labs: Executing a sample MapReduce program
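
The driver, mapper, shuffle/sort, and reducer phases listed above can be simulated with plain Java collections, using word count as the canonical example. This is a sketch of the data flow only, not the Hadoop API (which the labs cover); all names here are illustrative:

```java
import java.util.*;
import java.util.stream.*;

// Word count simulated with plain Java collections to show the
// mapper -> shuffle/sort -> reducer flow. A data-flow sketch only,
// not the actual Hadoop MapReduce interfaces.
public class WordCountFlow {

    // Map phase: turn each input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Shuffle/sort phase: group all values by key, keys in sorted order.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the grouped counts for each word.
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    // "Driver": wires the phases together over the input.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = lines.stream()
                .flatMap(line -> map(line).stream())
                .collect(Collectors.toList());
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("the cat", "the dog"))); // {cat=1, dog=1, the=2}
    }
}
```

In real Hadoop the same three phases run distributed: mappers execute next to the data blocks, the framework performs the shuffle/sort over the network, and reducers write their output back to HDFS.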

Section 4: Pig

  • Pig compared to Java MapReduce
  • Pig job workflow
  • Pig Latin language fundamentals
  • Data processing with Pig
  • Transformations and joins
  • User-Defined Functions (UDFs)
  • Labs: Writing Pig scripts for data analysis

Section 5: Hive

  • Architecture and design
  • Data types
  • SQL capabilities in Hive
  • Creating Hive tables and executing queries
  • Data partitioning
  • Joins
  • Text processing techniques
  • Labs: Hands-on data processing exercises with Hive
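
Hive implements the partitioning listed above by storing each partition in its own HDFS directory (e.g. `.../logs/dt=2024-01-01`), so a query that filters on the partition column only scans the matching directories. The sketch below illustrates that pruning idea with plain Java; the table name, column, and warehouse paths are illustrative, not a real layout:

```java
import java.util.*;
import java.util.stream.*;

// Sketch of Hive-style partition pruning: each partition of a table is
// a separate directory (e.g. /warehouse/logs/dt=2024-01-01), and a
// filter on the partition column selects only matching directories.
// Paths and names are illustrative.
public class PartitionPruning {

    // Directory backing one partition of a table.
    static String partitionDir(String table, String col, String value) {
        return "/warehouse/" + table + "/" + col + "=" + value;
    }

    // Directories a query with `col = wanted` actually has to scan.
    static List<String> prune(String table, String col,
                              List<String> partitionValues, String wanted) {
        return partitionValues.stream()
                .filter(v -> v.equals(wanted))
                .map(v -> partitionDir(table, col, v))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> toScan = prune("logs", "dt",
                List.of("2024-01-01", "2024-01-02", "2024-01-03"), "2024-01-02");
        System.out.println(toScan); // [/warehouse/logs/dt=2024-01-02]
    }
}
```

The payoff is that a day's worth of queries over a date-partitioned table reads one directory instead of the whole table, which is the main reason partitioning is a core Hive design topic.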

Section 6: HBase

  • Concepts and architecture
  • Comparison of HBase, RDBMS, and Cassandra
  • HBase Java API
  • Handling time series data in HBase
  • Schema design strategies
  • Labs: Interacting with HBase via the shell; programming with the HBase Java API; schema design exercises
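
HBase sorts rows lexicographically by row key, which is why time-series schema design (covered above) centers on key construction. One common pattern is `<entityId>#<reversedTimestamp>`, which groups readings per entity and places the newest reading first within each group. The sketch below shows the key-building logic only; the names (`buildRowKey`, `entityId`) are illustrative, not part of the HBase API:

```java
// Sketch: a common HBase row-key pattern for time-series data.
// HBase sorts rows lexicographically, so a zero-padded reversed
// timestamp (Long.MAX_VALUE - t) makes newer events sort first.
// Names here are illustrative, not from the HBase client API.
public class TimeSeriesRowKey {

    // Row key: entity prefix plus reversed, zero-padded timestamp.
    static String buildRowKey(String entityId, long epochMillis) {
        long reversed = Long.MAX_VALUE - epochMillis;
        return String.format("%s#%019d", entityId, reversed);
    }

    public static void main(String[] args) {
        String older = buildRowKey("sensor-42", 1_000L);
        String newer = buildRowKey("sensor-42", 2_000L);
        // Lexicographic order matches reverse-chronological order:
        System.out.println(newer.compareTo(older) < 0); // true
    }
}
```

Prefixing with the entity ID also spreads writes for different entities across regions, avoiding the hotspotting that a purely timestamp-leading key would cause.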

Requirements

  • Proficiency in the Java programming language is required, as most practical exercises will be conducted in Java.
  • Familiarity with the Linux environment is essential, including the ability to navigate the command line and edit files using tools like vi or nano.

Lab environment

Zero Install: Participants do not need to install Hadoop software on their own devices. A fully functional Hadoop cluster will be provided for use during the course.

Students will need the following tools:

  • An SSH client (Linux and macOS come with built-in SSH clients; for Windows, PuTTY is recommended)
  • A web browser to access the cluster (Firefox is recommended)
Duration: 28 hours
