AWS Glue - AWS Study Guide

What is AWS Glue?

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. It is primarily an ETL (Extract, Transform, Load) service.

Key Features

1. Data Catalog

A centralized repository to store structural and operational metadata for all your data assets.
Compatible with Apache Hive Metastore.
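
To make the idea of "structural metadata" concrete, here is a minimal sketch of reading table schemas back out of the Data Catalog. It assumes a boto3 Glue client is passed in; the function name list_table_schemas and the database name sales_db are hypothetical, chosen only for illustration. The get_tables call and its NextToken pagination are part of the Glue API.

```python
from typing import Any, Dict, List


def list_table_schemas(glue_client: Any, database: str) -> Dict[str, List[str]]:
    """Return {table_name: ["column:type", ...]} for one Data Catalog database.

    `glue_client` is assumed to be boto3.client("glue") or any object
    exposing a compatible get_tables method.
    """
    schemas: Dict[str, List[str]] = {}
    kwargs: Dict[str, Any] = {"DatabaseName": database}
    while True:
        page = glue_client.get_tables(**kwargs)
        for table in page.get("TableList", []):
            # Column definitions live under the table's StorageDescriptor.
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            schemas[table["Name"]] = [
                f"{col['Name']}:{col.get('Type', '')}" for col in columns
            ]
        token = page.get("NextToken")
        if not token:
            return schemas
        # More tables remain; request the next page.
        kwargs["NextToken"] = token


# Usage (requires boto3, AWS credentials, and an existing database;
# "sales_db" is a hypothetical name):
#   import boto3
#   print(list_table_schemas(boto3.client("glue"), "sales_db"))
```

Note that the catalog returns only metadata (names, types, locations); the data itself stays in the underlying store.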

2. Crawlers

Programs that scan your data stores (e.g., S3, RDS, DynamoDB), infer the schema automatically, and create or update the matching table definitions in the Data Catalog.
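
As a rough sketch of how a crawler might be defined and started from code, the example below builds a create_crawler request for an S3 path and then starts the crawler. It assumes a boto3 Glue client is passed in; the helper names and every value in the usage comment (crawler name, role ARN, database, bucket path) are hypothetical placeholders.

```python
from typing import Any, Dict


def build_crawler_request(
    name: str, role_arn: str, database: str, s3_path: str
) -> Dict[str, Any]:
    """Build keyword arguments for the Glue create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler assumes
        "DatabaseName": database,    # Data Catalog database to populate
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(glue_client: Any, request: Dict[str, Any]) -> None:
    """Create the crawler, then start one run of it.

    `glue_client` is assumed to be boto3.client("glue").
    """
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])


# Usage (all values are hypothetical placeholders):
#   import boto3
#   req = build_crawler_request(
#       "logs-crawler",
#       "arn:aws:iam::123456789012:role/GlueCrawlerRole",
#       "logs_db",
#       "s3://example-bucket/raw/logs/",
#   )
#   create_and_start_crawler(boto3.client("glue"), req)
```

Once a run finishes, the tables it discovered appear in the named Data Catalog database.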

3. ETL Jobs

Run Apache Spark (or Python shell) scripts that extract data from a source, transform it (e.g., convert CSV to Parquet, clean data), and load it into a destination (e.g., Redshift, S3).
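
The extract, transform, load shape of such a job can be sketched in plain Python, roughly the kind of logic a small Python shell job might run. This is an illustration only: it uses the standard library instead of Spark, writes JSON lines instead of Parquet, and the column names user and amount are made up for the example.

```python
import csv
import io
import json
from typing import Dict, Iterable, Iterator, List


def extract(csv_text: str) -> Iterator[Dict[str, str]]:
    """Extract: parse raw CSV text into one dict per row."""
    return csv.DictReader(io.StringIO(csv_text))


def transform(rows: Iterable[Dict[str, str]]) -> List[Dict[str, object]]:
    """Transform: drop incomplete or malformed rows, normalize the rest."""
    cleaned: List[Dict[str, object]] = []
    for row in rows:
        user = (row.get("user") or "").strip()
        amount = (row.get("amount") or "").strip()
        if not user or not amount:
            continue  # skip rows with missing values
        try:
            cleaned.append({"user": user.lower(), "amount": float(amount)})
        except ValueError:
            continue  # skip rows whose amount is not numeric
    return cleaned


def load(records: Iterable[Dict[str, object]]) -> str:
    """Load: serialize records as JSON lines (stand-in for writing to S3)."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


raw = "user,amount\nAlice,10.5\n,3\nBob,abc\n carol ,2\n"
print(load(transform(extract(raw))))
```

A real Glue Spark job would follow the same three steps but read from and write to actual data stores, usually in a columnar format such as Parquet.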

Exam Tips

[!IMPORTANT] Crawler & Data Catalog: If a question mentions "Discovering schema automatically" or "Populating a Data Catalog", the answer is AWS Glue Crawler.

[!NOTE] ETL: Glue is the managed ETL service. Use it to prepare data before analyzing it with services such as Redshift or Athena.

Common Use Cases

Data Lakes: Building and managing a data lake on S3.
Data Preparation: Cleaning and transforming raw log data before loading it into Amazon Redshift.