
AWS Glue

"Serverless data integration service for ETL (Extract, Transform, Load)."

What is AWS Glue?

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. It is primarily an ETL (Extract, Transform, Load) service.

Key Features

1. Data Catalog

  • A centralized repository to store structural and operational metadata for all your data assets.
  • Compatible with the Apache Hive Metastore, so query engines such as Athena, EMR, and Redshift Spectrum can use it as their metastore.
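
To make the "structural metadata" idea concrete, here is a minimal sketch that reads column names and types from a catalog table definition. It assumes the response shape of boto3's `get_table` call; the helper names `describe_columns` and `fetch_columns`, the table `orders`, and the bucket path are hypothetical and used only for illustration.

```python
def describe_columns(get_table_response):
    """Return (name, type) pairs for the data columns and partition keys of a table."""
    table = get_table_response["Table"]
    columns = table.get("StorageDescriptor", {}).get("Columns", [])
    partition_keys = table.get("PartitionKeys", [])
    return [(col["Name"], col.get("Type", "")) for col in columns + partition_keys]


def fetch_columns(database, table_name):
    """Look up a table in the Glue Data Catalog (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so describe_columns works without AWS access

    glue = boto3.client("glue")
    response = glue.get_table(DatabaseName=database, Name=table_name)
    return describe_columns(response)


# Hypothetical response, trimmed to the fields the helper reads.
sample_response = {
    "Table": {
        "Name": "orders",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "amount", "Type": "double"},
            ],
            "Location": "s3://example-bucket/orders/",
        },
        "PartitionKeys": [{"Name": "order_date", "Type": "string"}],
    }
}

for name, col_type in describe_columns(sample_response):
    print(f"{name}: {col_type}")
```

Only `fetch_columns` talks to AWS; the parsing helper can be exercised offline against a saved response.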

2. Crawlers

  • Programs that scan your data sources (S3, RDS, DynamoDB) to automatically discover the schema and populate the Data Catalog.
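
As a rough sketch of how a crawler might be set up programmatically, the snippet below builds the request for boto3's `create_crawler` call and then starts the crawler with `start_crawler`. The crawler name, role ARN, database name, and S3 path are placeholder assumptions, and `build_crawler_request` is a hypothetical helper rather than part of any AWS library.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Assemble keyword arguments for the Glue create_crawler API call."""
    return {
        "Name": name,
        "Role": role_arn,          # IAM role the crawler assumes to read the data
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request):
    """Create the crawler and start one run (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without AWS access

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Placeholder values for illustration only.
crawler_request = build_crawler_request(
    name="raw-logs-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="raw_logs_db",
    s3_path="s3://example-bucket/raw-logs/",
)
print(crawler_request["Targets"])
```

Once a run finishes, the discovered tables appear in the named Data Catalog database.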

3. ETL Jobs

  • Jobs run Spark (or Python shell) scripts that transform data (e.g., convert CSV to Parquet, clean records) and load it into a destination (e.g., Redshift, S3).
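
The Spark versus Python shell choice shows up in the job definition itself. The sketch below builds a request for boto3's `create_job` call, where the command name `glueetl` selects a Spark job and `pythonshell` selects a Python shell job. The job name, role ARN, script location, the `--input_path` argument, and the Glue version `4.0` are illustrative assumptions, and `build_job_request` is a hypothetical helper.

```python
def build_job_request(name, role_arn, script_location, use_spark=True):
    """Assemble keyword arguments for the Glue create_job API call."""
    request = {
        "Name": name,
        "Role": role_arn,
        "Command": {
            # "glueetl" runs the script on Spark; "pythonshell" runs plain Python.
            "Name": "glueetl" if use_spark else "pythonshell",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
    }
    if use_spark:
        request["GlueVersion"] = "4.0"  # assumed version, adjust to your environment
    return request


def create_and_run_job(request, arguments):
    """Create the job and start one run (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without AWS access

    glue = boto3.client("glue")
    glue.create_job(**request)
    run = glue.start_job_run(JobName=request["Name"], Arguments=arguments)
    return run["JobRunId"]


# Placeholder values for illustration only.
spark_job = build_job_request(
    name="csv-to-parquet",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_location="s3://example-bucket/scripts/csv_to_parquet.py",
)
shell_job = build_job_request(
    name="small-cleanup",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_location="s3://example-bucket/scripts/cleanup.py",
    use_spark=False,
)
job_arguments = {"--input_path": "s3://example-bucket/raw-logs/"}
print(spark_job["Command"]["Name"], shell_job["Command"]["Name"])
```

The transformation logic (for example, reading CSV and writing Parquet) lives in the script at `ScriptLocation`; the definition above only tells Glue how to run it.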

Exam Tips

[!IMPORTANT] Crawler & Data Catalog: If a question mentions "Discovering schema automatically" or "Populating a Data Catalog", the answer is usually an AWS Glue crawler.

[!NOTE] ETL: Glue is the managed ETL service. Use it to prepare data before analyzing it with services such as Redshift or Athena.

Common Use Cases

  • Data Lakes: Building and managing a data lake on S3.
  • Data Preparation: Cleaning and transforming raw log data before loading it into Amazon Redshift.