
Data Engineering Course Syllabus
Module 1: Hadoop
- What is Big Data? The four V's: Volume, Variety, Velocity, Veracity
- Data storage and processing: batch vs. real-time processing
- Hadoop Distributed File System (HDFS) architecture
- Reading from and writing to HDFS; basic HDFS shell commands
- YARN & its components, YARN schedulers
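Running a Hadoop cluster is beyond a syllabus page, but the basic HDFS commands above can be sketched from Python. A minimal sketch, assuming the standard hdfs dfs CLI; the paths and file names are hypothetical, and actually executing the commands requires a cluster, so this example only builds (and prints) the command lines.

```python
import subprocess

def hdfs_cmd(*args):
    """Build an `hdfs dfs` command line (standard Hadoop CLI syntax)."""
    return ["hdfs", "dfs", *args]

# Typical operations covered in this module (paths are hypothetical):
put_cmd = hdfs_cmd("-put", "sales.csv", "/data/raw/sales.csv")  # local -> HDFS
ls_cmd  = hdfs_cmd("-ls", "/data/raw")                          # list a directory
cat_cmd = hdfs_cmd("-cat", "/data/raw/sales.csv")               # read a file

# On a machine with Hadoop installed you would run, for example:
# subprocess.run(put_cmd, check=True)
print(" ".join(put_cmd))
```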
Module 2: Hive
- What is Hive? Use cases and role in the Hadoop ecosystem
- Hive Architecture
- Introduction to HiveQL (Hive Query Language)
- Creating Tables in Hive
- Partitioning and Bucketing in Hive
- File formats: Parquet, ORC, Avro, and others
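Partitioning, bucketing, and storage formats all come together in a table's DDL. A hedged sketch, held in a Python string as a course exercise might submit it via the Hive CLI or spark.sql: the table name, columns, and bucket count are hypothetical, but the PARTITIONED BY / CLUSTERED BY / STORED AS clauses follow standard HiveQL.

```python
# Hypothetical table: sales records partitioned by date, bucketed by store,
# stored as Parquet (one of the columnar formats covered in this module).
create_sales_ddl = """
CREATE TABLE IF NOT EXISTS sales (
    order_id  BIGINT,
    store_id  INT,
    amount    DOUBLE
)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (store_id) INTO 8 BUCKETS
STORED AS PARQUET
"""
print(create_sales_ddl)
```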
Module 3: Python
- Introduction to the Python programming language and basic coding
- Python functions, loops, and if-else statements
- Data structures: lists, dictionaries, sets, and tuples
- Comprehensions and lambda functions
- Python generators and decorators
- pandas basics and DataFrame operations
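The advanced Python features above fit in a few lines. A small self-contained sketch of a comprehension, a lambda, a generator, and a decorator, using only the standard library; the function names are illustrative.

```python
import functools

# List comprehension, and a lambda used as a sort key
squares = [n * n for n in range(5)]                     # [0, 1, 4, 9, 16]
words = sorted(["spark", "hdfs", "s3"], key=lambda w: len(w))

# Generator: yields values lazily instead of building the whole list
def countdown(n):
    while n > 0:
        yield n
        n -= 1

# Decorator: wraps a function to add behavior (here, call counting)
def count_calls(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@count_calls
def greet(name):
    return f"hello, {name}"

greet("data")
greet("engineer")
print(squares, list(countdown(3)), greet.calls)  # [0, 1, 4, 9, 16] [3, 2, 1] 2
```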
Module 4: PySpark
- What is Apache Spark and why is it important for Data Engineering?
- Understanding Spark Internals
- PySpark setup on Databricks
- RDDs and DataFrames in PySpark
- DataFrame vs. Dataset; DAGs and lineage
- SparkSQL for querying structured data
- PySpark transformations (using pyspark.sql functions) and actions
- Shuffling, performance tuning, and memory management in PySpark
- repartition() vs. coalesce()
- Handling Large-Scale Joins
- Spark Structured Streaming: reading from Kafka and real-time aggregations
Module 5: AWS Fundamentals
- Introduction to Cloud Computing
- Understanding AWS Core Services
- Introduction to IAM (Identity and Access Management)
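IAM permissions are expressed as JSON policy documents. A hedged sketch of a read-only S3 policy, built as a plain Python dict; the bucket name is hypothetical, while the Version string, Effect/Action/Resource fields, and ARN format follow the standard IAM policy schema.

```python
import json

# Hypothetical read-only policy for one S3 bucket (standard IAM policy layout)
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-course-bucket",
                "arn:aws:s3:::example-course-bucket/*",
            ],
        }
    ],
}
print(json.dumps(policy, indent=2))
```

This is the shape you attach to a user, group, or role in the IAM console or CLI.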
Module 6: AWS S3
- S3 Basics: Buckets and Objects
- Data versioning
- Storage classes & Lifecycle Policies
- Cross-region replication
- S3 operations using the AWS CLI
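Storage classes and lifecycle policies meet in a lifecycle configuration: a JSON rule set attached to a bucket. A sketch of one such rule, built in Python; the rule ID, prefix, and day counts are hypothetical, while the Rules/Transitions/Expiration structure follows the S3 lifecycle configuration schema (applied with boto3 or aws s3api put-bucket-lifecycle-configuration).

```python
import json

# Hypothetical rule: move raw data to Glacier after 90 days, delete after 365
lifecycle = {
    "Rules": [
        {
            "ID": "archive-raw-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }
    ]
}
print(json.dumps(lifecycle, indent=2))
```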
Module 7: AWS EMR
- EMR Architecture and Features
- Setting up and Managing EMR Clusters
- Instance types & their utilities
- Running PySpark Jobs on EMR
- Adding steps as tasks on EMR
- Autoscaling policies for EMR cluster
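An EMR step is described as a small JSON document; the usual pattern for running a PySpark job is command-runner.jar invoking spark-submit. A hedged sketch of one step definition; the step name and S3 script path are hypothetical, and the commented boto3 call shows how it would be submitted to a running cluster.

```python
# Hypothetical EMR step that runs a PySpark script via spark-submit
# (command-runner.jar is EMR's standard way to run commands as steps)
step = {
    "Name": "daily-etl",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit",
            "--deploy-mode", "cluster",
            "s3://example-course-bucket/jobs/etl_job.py",
        ],
    },
}
# With boto3, against a real cluster ID, this would be submitted as:
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXX", Steps=[step])
print(step["Name"])
```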
Module 8: Data Transformation and ETL with AWS Glue
- Introduction to Glue Dynamic Data Frame
- Glue Dynamic Frame vs PySpark Data Frame
Module 9: Capstone Project
- Extract data from an external source and load it to S3
- Transform data using PySpark
- Process large-scale data on AWS EMR
- Store processed data in S3
- Catalog the metadata using Glue Crawler & view the results in Athena
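The capstone stages above can be sketched in miniature with plain Python. A toy sketch only: the records and schema are hypothetical, extract() stands in for the external source, transform() for the PySpark job on EMR, and load() writes to an in-memory sink in place of S3.

```python
import io
import json

def extract():
    # Stand-in for pulling records from an external source or API
    return [
        {"order_id": 1, "amount": "19.99"},
        {"order_id": 2, "amount": "5.00"},
    ]

def transform(records):
    # Clean and type-convert, as the PySpark job would at scale
    return [{"order_id": r["order_id"], "amount": float(r["amount"])}
            for r in records]

def load(records, sink):
    # Stand-in for writing processed output to S3 (one JSON line per record)
    for r in records:
        sink.write(json.dumps(r) + "\n")

sink = io.StringIO()
rows = transform(extract())
load(rows, sink)
print(sink.getvalue())
```

In the real project, the loaded S3 data is then catalogued by a Glue Crawler and queried from Athena.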
Module 10: Case Studies and Real-world Projects
- Designing and implementing end-to-end data pipelines (hands-on)
- Handling diverse data types
- Addressing scalability and performance challenges
Module 11: Best Practices and Future Trends
- Version control for data pipelines
- Testing and monitoring ETL processes
- Emerging technologies (AI/ML in data engineering)
- Trends in data privacy and security