
Data Engineering Course Syllabus
Module 1: Hadoop
- What is Big Data? The four V's: Volume, Variety, Velocity, Veracity
- Data storage and processing: batch vs. real-time processing
- Hadoop Distributed File System (HDFS) architecture
- Reading from and writing to HDFS; basic HDFS shell commands
- YARN & its components, YARN schedulers
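Running a Hadoop cluster is beyond a syllabus page, but the basic HDFS commands above can be sketched from Python. A minimal sketch, assuming the standard hdfs dfs CLI; the paths and file names are hypothetical, and actually executing the commands requires a cluster, so this example only builds (and prints) the command lines.

```python
import subprocess

def hdfs_cmd(*args):
    """Build an `hdfs dfs` command line (standard Hadoop CLI syntax)."""
    return ["hdfs", "dfs", *args]

# Typical operations covered in this module (paths are hypothetical):
put_cmd = hdfs_cmd("-put", "sales.csv", "/data/raw/sales.csv")  # local -> HDFS
ls_cmd  = hdfs_cmd("-ls", "/data/raw")                          # list a directory
cat_cmd = hdfs_cmd("-cat", "/data/raw/sales.csv")               # read a file

# On a machine with Hadoop installed you would run, for example:
# subprocess.run(put_cmd, check=True)
print(" ".join(put_cmd))
```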
Module 2: Hive
- What is Hive? Use cases and role in the Hadoop ecosystem
- Hive Architecture
- Introduction to HiveQL (Hive Query Language)
- Creating Tables in Hive
- Partitioning and Bucketing in Hive
- File formats: Parquet, ORC, Avro, and others
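Partitioning, bucketing, and storage formats all come together in a table's DDL. A hedged sketch, held in a Python string as a course exercise might submit it via the Hive CLI or spark.sql: the table name, columns, and bucket count are hypothetical, but the PARTITIONED BY / CLUSTERED BY / STORED AS clauses follow standard HiveQL.

```python
# Hypothetical table: sales records partitioned by date, bucketed by store,
# stored as Parquet (one of the columnar formats covered in this module).
create_sales_ddl = """
CREATE TABLE IF NOT EXISTS sales (
    order_id  BIGINT,
    store_id  INT,
    amount    DOUBLE
)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (store_id) INTO 8 BUCKETS
STORED AS PARQUET
"""
print(create_sales_ddl)
```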
Module 3: Python
- Introduction to the Python programming language and basic coding
- Python functions, loops, and if-else statements
- Data structures: lists, dictionaries, sets, and tuples
- Comprehensions and lambda functions
- Python generators and decorators
- pandas basics and DataFrame operations
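The advanced Python features above fit in a few lines. A small self-contained sketch of a comprehension, a lambda, a generator, and a decorator, using only the standard library; the function names are illustrative.

```python
import functools

# List comprehension, and a lambda used as a sort key
squares = [n * n for n in range(5)]                     # [0, 1, 4, 9, 16]
words = sorted(["spark", "hdfs", "s3"], key=lambda w: len(w))

# Generator: yields values lazily instead of building the whole list
def countdown(n):
    while n > 0:
        yield n
        n -= 1

# Decorator: wraps a function to add behavior (here, call counting)
def count_calls(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@count_calls
def greet(name):
    return f"hello, {name}"

greet("data")
greet("engineer")
print(squares, list(countdown(3)), greet.calls)  # [0, 1, 4, 9, 16] [3, 2, 1] 2
```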
Module 4: PySpark
- What is Apache Spark and why is it important for Data Engineering?
- Understanding Spark Internals
- PySpark setup on Databricks
- RDDs and DataFrames in PySpark
- DataFrame vs. Dataset; DAGs and lineage
- SparkSQL for querying structured data
- PySpark transformations (using pyspark.sql functions) and actions
- Shuffling, performance tuning, and memory management in PySpark
- repartition() vs. coalesce()
- Handling Large-Scale Joins
- Spark Structured Streaming: reading from Kafka and real-time aggregations
Module 5: AWS Fundamentals
- Introduction to Cloud Computing
- Understanding AWS Core Services
- Introduction to IAM (Identity and Access Management)
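IAM permissions are expressed as JSON policy documents. A hedged sketch of a read-only S3 policy, built as a plain Python dict; the bucket name is hypothetical, while the Version string, Effect/Action/Resource fields, and ARN format follow the standard IAM policy schema.

```python
import json

# Hypothetical read-only policy for one S3 bucket (standard IAM policy layout)
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-course-bucket",
                "arn:aws:s3:::example-course-bucket/*",
            ],
        }
    ],
}
print(json.dumps(policy, indent=2))
```

This is the shape you attach to a user, group, or role in the IAM console or CLI.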
Module 6: AWS S3
- S3 Basics: Buckets and Objects
- Data versioning
- Storage classes & Lifecycle Policies
- Cross-region replication
- S3 operations using the AWS CLI
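Storage classes and lifecycle policies meet in a lifecycle configuration: a JSON rule set attached to a bucket. A sketch of one such rule, built in Python; the rule ID, prefix, and day counts are hypothetical, while the Rules/Transitions/Expiration structure follows the S3 lifecycle configuration schema (applied with boto3 or aws s3api put-bucket-lifecycle-configuration).

```python
import json

# Hypothetical rule: move raw data to Glacier after 90 days, delete after 365
lifecycle = {
    "Rules": [
        {
            "ID": "archive-raw-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }
    ]
}
print(json.dumps(lifecycle, indent=2))
```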
Module 7: AWS EMR
- EMR Architecture and Features
- Setting up and Managing EMR Clusters
- Instance types & their utilities
- Running PySpark Jobs on EMR
- Adding steps as tasks on EMR
- Autoscaling policies for EMR cluster
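An EMR step is described as a small JSON document; the usual pattern for running a PySpark job is command-runner.jar invoking spark-submit. A hedged sketch of one step definition; the step name and S3 script path are hypothetical, and the commented boto3 call shows how it would be submitted to a running cluster.

```python
# Hypothetical EMR step that runs a PySpark script via spark-submit
# (command-runner.jar is EMR's standard way to run commands as steps)
step = {
    "Name": "daily-etl",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit",
            "--deploy-mode", "cluster",
            "s3://example-course-bucket/jobs/etl_job.py",
        ],
    },
}
# With boto3, against a real cluster ID, this would be submitted as:
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXX", Steps=[step])
print(step["Name"])
```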
Module 8: Data Transformation and ETL with AWS Glue
- Introduction to Glue Dynamic Data Frame
- Glue Dynamic Frame vs PySpark Data Frame
Module 9: Capstone Project
- Extract data from an external source and load it to S3
- Transform data using PySpark
- Process large-scale data on AWS EMR
- Store processed data in S3
- Catalog the metadata using Glue Crawler & view the results in Athena
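The capstone stages above can be sketched in miniature with plain Python. A toy sketch only: the records and schema are hypothetical, extract() stands in for the external source, transform() for the PySpark job on EMR, and load() writes to an in-memory sink in place of S3.

```python
import io
import json

def extract():
    # Stand-in for pulling records from an external source or API
    return [
        {"order_id": 1, "amount": "19.99"},
        {"order_id": 2, "amount": "5.00"},
    ]

def transform(records):
    # Clean and type-convert, as the PySpark job would at scale
    return [{"order_id": r["order_id"], "amount": float(r["amount"])}
            for r in records]

def load(records, sink):
    # Stand-in for writing processed output to S3 (one JSON line per record)
    for r in records:
        sink.write(json.dumps(r) + "\n")

sink = io.StringIO()
rows = transform(extract())
load(rows, sink)
print(sink.getvalue())
```

In the real project, the loaded S3 data is then catalogued by a Glue Crawler and queried from Athena.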
Module 10: Case Studies and Real-world Projects
- Designing and implementing end-to-end data pipelines (hands-on)
- Handling diverse data types
- Addressing scalability and performance challenges
Module 11: Best Practices and Future Trends
- Version control for data pipelines
- Testing and monitoring ETL processes
- Emerging technologies (AI/ML in data engineering)
- Trends in data privacy and security