Key Highlights

Bigdata Apache Hadoop Spark Scala Course Topics to be Covered:

Related Courses

Master in Data Science

Expert in Artificial Intelligence

Job Opportunities

What you’ll learn

Learning Outcomes

Who this course is for:

This Bigdata Apache Hadoop Spark Scala course from SMEClabs prepares you to switch careers into big data with Hadoop and Spark. After completing it, you will understand Hadoop, HDFS, YARN, MapReduce, Python, Pig, Hive, Oozie, Sqoop, Flume, HBase, NoSQL, Spark, Spark SQL, and Spark Streaming.

Why Spark?

Apache Spark is an open-source cluster computing framework that runs on Hadoop clusters as well as on its own. Its speed, ease of use, and sophisticated analytics make it one of the best engines for processing and analysing large-scale data. The following advantages and features make Apache Spark a strong fit for operational as well as investigative analytics:
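
For example, Spark's functional API lets a word count be written as a short transformation chain. The sketch below runs the same chain on plain Scala collections so it needs no cluster; against a real RDD you would start from `sc.textFile(...)` and finish with `reduceByKey(_ + _)`:

```scala
object WordCountSketch {
  // Count words across lines: the same flatMap -> map -> reduce-by-key
  // chain you would run on a Spark RDD, here on local Seqs.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))                            // split lines into words
      .map(w => (w, 1))                                    // emit (word, 1) pairs
      .groupBy(_._1)                                       // "shuffle": group by word
      .map { case (w, ps) => (w, ps.map(_._2).sum) }       // sum the counts per word
}
```

The conciseness of this style, combined with in-memory execution, is a large part of why Spark outperforms classic MapReduce for iterative workloads.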

Bigdata Apache Hadoop Spark Scala Training Syllabus:
  • Linux (Ubuntu/Centos) – Tips and Tricks
  • Basic (Core) Java Programming Concepts – OOP
  • Learning Objectives: In this module, you will understand what Big Data is, the limitations of the traditional solutions for Big Data problems, how Hadoop solves those Big Data problems, Hadoop Ecosystem, Hadoop Architecture, HDFS, Anatomy of File Read and Write & how MapReduce works.
  • Topics:
    • Introduction to Big Data & Big Data Challenges
    • Limitations & Solutions of Big Data Architecture
    • Hadoop & its Features
    • Hadoop Ecosystem
    • Hadoop 2.x Core Components
    • Hadoop Storage: HDFS (Hadoop Distributed File System)
    • Hadoop Processing: MapReduce Framework
    • Different Hadoop Distributions
  • Hadoop 2.x Architecture
  • Typical workflow
  • HDFS Commands
  • Writing files to HDFS
  • Reading files from HDFS
  • Rack awareness
  • Hadoop daemons
  • Before MapReduce
  • MapReduce overview
  • Word count problem
  • Word count flow and solution
  • MapReduce flow
  • Data Types
  • File Formats
  • Explain the Driver, Mapper and Reducer code
  • Configuring development environment – Eclipse
  • Writing unit test
  • Running locally
  • Running on cluster
  • Hands on exercises
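
The Driver/Mapper/Reducer flow above can be tied together with a local sketch. The toy driver below (names are illustrative, not Hadoop's actual API) runs a mapper over the input, groups and sorts by key the way the shuffle does, then applies a reducer:

```scala
object MiniMapReduce {
  // Mapper: one input record -> zero or more (key, value) pairs
  type Mapper[I, K, V]  = I => Seq[(K, V)]
  // Reducer: a key plus all of its shuffled values -> one output value
  type Reducer[K, V, O] = (K, Seq[V]) => O

  // A local driver mimicking the map -> shuffle/sort -> reduce phases.
  def run[I, K: Ordering, V, O](input: Seq[I],
                                mapper: Mapper[I, K, V],
                                reducer: Reducer[K, V, O]): Seq[(K, O)] = {
    val mapped = input.flatMap(mapper)                      // map phase
    val shuffled = mapped.groupBy(_._1).toSeq               // shuffle: group by key
      .map { case (k, ps) => (k, ps.map(_._2)) }
      .sortBy(_._1)                                         // sort phase
    shuffled.map { case (k, vs) => (k, reducer(k, vs)) }    // reduce phase
  }
}
```

On a real cluster the map and reduce phases run as distributed tasks and the shuffle moves data between nodes; the shape of the computation is the same.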
  • Anatomy of MapReduce job run
  • Job submission
  • Job initialization
  • Task assignment
  • Job completion
  • Job scheduling
  • Job failures
  • Shuffle and sort
  • Hands on exercises
  • File Formats – Sequence Files
  • Compression Techniques
  • Input Formats – Input splits & records, text input, binary input
  • Output Formats – text output, binary output, lazy output
  • Hands on exercises
  • Counters
  • Side data distribution
  • MapReduce combiner
  • MapReduce partitioner
  • MapReduce distributed cache
  • Hands-on exercises
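
As a sketch of two of these topics: Hadoop's default HashPartitioner routes a key to a reducer as `(key.hashCode & Integer.MAX_VALUE) % numReduceTasks`, and a combiner pre-aggregates map output locally so fewer pairs cross the network during the shuffle. A minimal local illustration (helper names are ours, not Hadoop's):

```scala
object PartitionerCombinerSketch {
  // Same logic as Hadoop's default HashPartitioner: mask off the sign
  // bit, then take the key's hash modulo the number of reducers.
  def partitionFor(key: String, numReducers: Int): Int =
    (key.hashCode & Int.MaxValue) % numReducers

  // Combiner: pre-sum (word, 1) pairs on the map side before the
  // shuffle, shrinking the data sent to reducers.
  def combine(mapOutput: Seq[(String, Int)]): Seq[(String, Int)] =
    mapOutput.groupBy(_._1).map { case (k, ps) => (k, ps.map(_._2).sum) }.toSeq
}
```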
  • Hive Architecture
  • Types of Metastore
  • Hive Data Types
  • HiveQL
  • File Formats – Parquet, ORC, Sequence and Avro Files Comparison
  • Partitioning & Bucketing
  • Hive JDBC Client
  • Hive UDFs
  • Hive Serdes
  • Hive on Tez
  • Hands-on exercises
  • Integration with Tableau
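
To illustrate bucketing from the module above: Hive assigns each row to a bucket by hashing the clustering column modulo the bucket count. The sketch below mimics that idea with Scala's `hashCode` (Hive uses its own hash function, so real bucket ids will differ; this only shows the shape of the assignment):

```scala
object BucketingSketch {
  // Illustrative only: bucket id = hash(clustering column) mod numBuckets,
  // normalised to be non-negative. Hive's actual hash differs.
  def bucketFor(columnValue: String, numBuckets: Int): Int =
    ((columnValue.hashCode % numBuckets) + numBuckets) % numBuckets
}
```

Because the assignment is deterministic, equal column values always land in the same bucket, which is what makes bucketed map-side joins and sampling efficient.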
  • Flume Architecture
  • Flume Agent Setup
  • Types of sources, channels, and sinks; multi-agent flow
  • Hands-on exercises
  • Introduction to Apache Pig
  • MapReduce vs Pig
  • Pig Components & Pig Execution
  • Pig Data Types & Data Models in Pig
  • Pig Latin Programs
  • Shell and Utility Commands
  • Pig UDF & Pig Streaming
  • Testing Pig scripts with PigUnit
  • Aviation use case in Pig
  • Pig Demo of Healthcare Dataset
  • HBase Data Model
  • HBase Shell
  • HBase Client API
  • HBase Data Loading Techniques
  • Apache Zookeeper Introduction
  • ZooKeeper Data Model
  • Zookeeper Service
  • HBase Bulk Loading
  • Getting and Inserting Data
  • HBase Filters
  • Sqoop Architecture
  • Sqoop Import Command Arguments, Incremental Import
  • Sqoop Export
  • Sqoop Jobs
  • Hands-on exercises
  • Spark Basics
  • What is Apache Spark?
  • Spark Installation
  • Spark Configuration
  • Spark Context
  • Using Spark Shell
  • Resilient Distributed Datasets (RDDs) – Features, Partitions, Tuning Parallelism
  • Functional Programming with Spark
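
A quick local sketch of the partitioning idea from this module: an RDD's data is split into partitions that are aggregated independently and then merged, which is why tuning the partition count tunes parallelism. Here the same shape is mimicked on a plain Seq, so no Spark installation is needed:

```scala
object RddPartitionSketch {
  // Split the data into numPartitions chunks, aggregate each chunk
  // independently (done in parallel on a cluster), then merge the
  // partial results -- the same shape as repartition(n) + reduce.
  def partitionedSum(data: Seq[Int], numPartitions: Int): Int = {
    val chunkSize = math.max(1, math.ceil(data.size.toDouble / numPartitions).toInt)
    val partitions: Seq[Seq[Int]] = data.grouped(chunkSize).toSeq
    val partials = partitions.map(_.sum)   // per-partition work
    partials.sum                           // merge partial results
  }
}
```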
  • Oozie
  • Oozie Components
  • Oozie Workflow
  • Scheduling Jobs with Oozie Scheduler
  • Demo of Oozie Workflow
  • Oozie Coordinator
  • Oozie Commands
  • Oozie Web Console
  • Oozie for MapReduce
  • Combining flow of MapReduce Jobs
  • Hive in Oozie
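
A minimal workflow.xml sketch (all names, scripts, and properties are placeholders) showing how an Oozie workflow chains a Hive action between start and end nodes, with a kill node for failures:

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="hive-step"/>
  <action name="hive-step">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>load_report.hql</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Hive step failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

A coordinator definition can then trigger this workflow on a schedule or when input data arrives.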
  • Hadoop Project Demo
  • Hadoop Talend Integration
  • Log File Analysis covering Flume, HDFS, MR/Pig, Hive, Tableau
  • Crime Data Analysis covering Oozie, Sqoop, HDFS, Hive, HBase, RESTful client
  • Hadoop Use Cases in Insurance Domain
  • Hadoop Use Cases in Retail Domain

Detailed Syllabus

Best-in-class content by leading faculty and industry leaders in the form of videos, cases and projects


    Enquiry for Batch & Seat Availability

      Attend a 30-minute FREE class with our Top Trainers

      Our trainers are industry-experienced experts who simplify complex
      concepts visually through real examples




          Everything You Need is Here

          Our Certification & Accreditation
          NSDC IISC SMEC Certifications

          The SMEClabs Advantage

          Strong hand-holding with dedicated support to help you master Bigdata Apache Hadoop Spark Scala.

          ©  2001-2021 SMEClabs.  All Rights Reserved.