AWS Glue

Overview

AWS Glue is a serverless data integration (ETL) service with three components that work together. The Data Catalogue stores metadata (table definitions, schema information, and data source connections) and serves as the central metadata registry for an AWS account in a given Region. Crawlers discover data sources, infer schemas, and populate the Catalogue automatically. Jobs execute the transformation logic as Spark scripts, Python Shell scripts, or visual workflows built in Glue Studio.

Architecture

%%{init:{'theme':'base','themeVariables':{'fontSize':'14px','fontFamily':'Inter, system-ui, sans-serif','primaryColor':'#DBEAFE','primaryTextColor':'#1e3a5f','primaryBorderColor':'#2563EB','lineColor':'#374151','clusterBkg':'#F9FAFB','clusterBorder':'#D1D5DB','edgeLabelBackground':'#FFFFFF'},'flowchart':{'curve':'orthogonal','padding':30,'nodeSpacing':65,'rankSpacing':75,'useMaxWidth':true}}}%%
flowchart TD
    START([Data in Source Systems])

    START --> CRAWL[Glue Crawler\nDiscovers schema\nPopulates Data Catalogue]
    CRAWL --> CATALOGUE[Glue Data Catalogue\nCentral metadata registry\nShared with Athena and Redshift]

    CATALOGUE --> JOB_TYPE{Job Type}
    JOB_TYPE -->|Visual no-code| STUDIO[Glue Studio\nDrag-and-drop ETL\nAuto-generates PySpark]
    JOB_TYPE -->|Custom code| SPARK[Glue Spark Job\nPySpark or Scala\nDistributed processing]
    JOB_TYPE -->|Lightweight Python| PYTHON[Python Shell Job\nSingle node\nAPI calls and light transforms]

    STUDIO & SPARK & PYTHON --> TRANSFORM_G[Transform\nFilter, join, aggregate\nSchema mapping, deduplication]
    TRANSFORM_G --> BOOKMARK{Job Bookmark\nEnabled?}
    BOOKMARK -->|Yes| INCREMENTAL[Process only new data\nSince last successful run]
    BOOKMARK -->|No| FULL[Full dataset reprocessing\nHigher cost and duration]

    INCREMENTAL & FULL --> TARGET[Write to Target\nS3 Parquet, Redshift\nRDS, DynamoDB, Glue Catalogue]
    TARGET --> DONE_G([Data Available for Analytics])

    style START fill:#4f8ef7,color:#fff
    style DONE_G fill:#10b981,color:#fff
    style CATALOGUE fill:#e0f2fe
    style BOOKMARK fill:#fef3c7

When to Use

Batch ETL pipelines from operational databases to an S3 data lake
Schema discovery and catalogue management for a data lake on S3
Transformation of raw data into analytics-ready Parquet format
Incremental data ingestion from RDS, DynamoDB, JDBC sources, or S3

When Not to Use

Real-time or near-real-time streaming — use Kinesis Data Streams or Kafka with Flink
Simple single-table copy with no transformation — use AWS DMS
Lightweight API-to-API integration — use Lambda or Step Functions
Sub-minute latency requirements — Glue Spark jobs have a startup overhead of 1–3 minutes

Implementation Guide

Job Bookmarks

Enable job bookmarks for incremental processing. Bookmarks track the last successfully processed position: for S3 sources, the object modification time; for JDBC sources, a watermark on one or more bookmark key columns such as a primary key or timestamp. On the next run, the job processes only data that arrived since the last successful run. Without bookmarks, every run reprocesses the full dataset.
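The watermark idea behind bookmarks can be sketched in plain Python. This is an illustration only, not Glue's internal implementation: the `Bookmark` class and `run_incremental` function are hypothetical names, and the rows are assumed to carry a monotonically increasing integer key. In a Glue job itself, bookmarks are switched on through the `--job-bookmark-option` job argument rather than hand-written logic like this.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Bookmark:
    """Hypothetical stand-in for the state a bookmark persists between runs."""
    last_key: Optional[int] = None  # highest key handled by the last successful run


def run_incremental(rows: Iterable[Tuple[int, str]], bookmark: Bookmark) -> List[Tuple[int, str]]:
    """Return only rows whose key is beyond the bookmark, then advance it."""
    new_rows = [r for r in rows if bookmark.last_key is None or r[0] > bookmark.last_key]
    if new_rows:
        # Advance the watermark only after the batch has been selected,
        # mirroring the "since the last successful run" semantics.
        bookmark.last_key = max(key for key, _ in new_rows)
    return new_rows


source = [(1, "a"), (2, "b"), (3, "c")]
state = Bookmark()
first = run_incremental(source, state)   # first run sees every row
source.append((4, "d"))                  # new data arrives in the source
second = run_incremental(source, state)  # second run sees only the new row
```

The second run returns a single row even though the source now holds four, which is the behaviour that keeps run time tied to new data rather than total data.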

Output Format

Write Glue job output as Parquet with Snappy compression. Parquet is a columnar format — Athena and Redshift Spectrum read only the columns referenced in a query, dramatically reducing query time and cost. Snappy balances compression ratio with Spark decompression speed. Avoid CSV output for analytics workloads — column scans read the entire file.
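A rough estimate shows why column pruning matters for cost. The sketch below assumes equally sized columns, ignores compression, and uses an assumed price of 5 USD per TB scanned; check current Athena pricing for your Region before relying on any figure. The function names are hypothetical.

```python
PRICE_PER_TB_USD = 5.0  # assumption for illustration, not a quoted price


def scanned_gb(total_gb: float, total_columns: int, columns_read: int, columnar: bool) -> float:
    """Estimate GB scanned by a query, assuming equally sized columns."""
    if not columnar:
        # Row-oriented text such as CSV: the whole file is read.
        return total_gb
    # Columnar: only the referenced columns are read.
    return total_gb * columns_read / total_columns


def query_cost_usd(gb_scanned: float) -> float:
    """Convert scanned GB to an approximate cost at the assumed price."""
    return gb_scanned / 1024 * PRICE_PER_TB_USD


# A 100 GB dataset with 50 columns, queried for 2 of them.
csv_gb = scanned_gb(100, 50, 2, columnar=False)
parquet_gb = scanned_gb(100, 50, 2, columnar=True)
print(f"CSV:     {csv_gb:.1f} GB scanned, ~${query_cost_usd(csv_gb):.3f}")
print(f"Parquet: {parquet_gb:.1f} GB scanned, ~${query_cost_usd(parquet_gb):.3f}")
```

Real Parquet files are also compressed, so the columnar figure here is likely an overestimate of what a query would actually scan.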

Partition Strategy

Partition S3 output by date — year=2025/month=06/day=10/ — so queries that filter by date only scan the relevant partitions. Athena's partition pruning skips unrelated data entirely. Partition by the most common query filter — date for time-series data, region for geographic data, entity ID for domain-specific lookups.
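The date layout above can be generated with a small helper. This is a minimal sketch: the bucket name `example-bucket` and both function names are hypothetical, and the `prune` function only imitates what partition pruning achieves by matching on the prefix.

```python
from datetime import date
from typing import List


def partition_prefix(base: str, day: date) -> str:
    """Build a Hive-style year=/month=/day= prefix under a base S3 path."""
    return (
        f"{base.rstrip('/')}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


def prune(prefixes: List[str], year: int, month: int) -> List[str]:
    """Keep only the prefixes for one month, as a date filter would."""
    wanted = f"/year={year:04d}/month={month:02d}/"
    return [p for p in prefixes if wanted in p]


# "s3://example-bucket/sales" is a hypothetical location.
days = [date(2025, 5, 31), date(2025, 6, 9), date(2025, 6, 10)]
prefixes = [partition_prefix("s3://example-bucket/sales", d) for d in days]
june = prune(prefixes, 2025, 6)  # the May prefix is skipped entirely
```

Zero-padding the month and day keeps the prefixes lexically sortable, which makes range listings over S3 keys predictable.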

Decision Criteria

Use Case	Glue	Lambda ETL	Kinesis + Flink
Batch processing — hourly or daily	Preferred	Possible for small data	Overkill
Real-time streaming	Not suitable	Not suitable	Preferred
Large dataset — hundreds of GB	Preferred — Spark scales	Impractical — memory limits	Depends on throughput
Light transformation, small data	Spark job is overkill; use a Python Shell job	Preferred	Not suitable
Schema discovery for S3 lake	Preferred — Crawler	Manual	Not applicable

Anti-Patterns to Avoid

⚠ 1. Full Reprocess on Every Run Without Job Bookmarks

Running a Glue job that reads all data from the source on every execution. At day one, this is fast. At month six, the job reads six months of data to produce one day of new output. Duration and cost grow linearly with data volume.

↺ Correct Approach

Enable job bookmarks. The job then processes only the data that arrived since the last successful run, so duration tracks the volume of new data rather than the size of the full dataset.
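The growth described in this anti-pattern can be put into numbers. The sketch below assumes one run per day and an equal volume of new data each day, and counts how many "days of data" are read in total; both function names are hypothetical.

```python
def days_read_full(n_days: int) -> int:
    """Total day-partitions read over n_days when every run rereads everything.

    Run k reads k days of data, so the total is 1 + 2 + ... + n.
    """
    return n_days * (n_days + 1) // 2


def days_read_incremental(n_days: int) -> int:
    """Total day-partitions read when each run reads only the newest day."""
    return n_days


six_months = 180
full = days_read_full(six_months)
incremental = days_read_incremental(six_months)
print(f"Full reprocess: {full} day-partitions read over {six_months} runs")
print(f"Incremental:    {incremental} day-partitions read over {six_months} runs")
```

Each individual run grows linearly with the dataset, but the cumulative volume read grows quadratically, which is why the cost gap widens the longer the job runs without bookmarks.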

⚠ 2. Writing CSV Output to S3 for Analytics

Glue jobs that output CSV files to S3 because CSV is familiar. Athena queries over CSV scan the entire file for every query regardless of which columns are needed. A query that needs only two columns from a 100 GB CSV dataset is still billed for scanning all 100 GB, whereas the same data in Parquet is smaller on disk and only the two referenced columns are read.

↺ Correct Approach

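Output Parquet with Snappy compression, partitioned by the primary query filter. Athena then reads only the referenced columns and matching partitions, so queries can run 10–100× faster and cost up to 90% less; the actual saving depends on how many columns and partitions each query touches.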

Flowchart

%%{init:{'theme':'base','themeVariables':{'fontSize':'14px','fontFamily':'Inter, system-ui, sans-serif','primaryColor':'#DBEAFE','primaryTextColor':'#1e3a5f','primaryBorderColor':'#2563EB','lineColor':'#374151','clusterBkg':'#F9FAFB','clusterBorder':'#D1D5DB','edgeLabelBackground':'#FFFFFF'},'flowchart':{'curve':'orthogonal','padding':30,'nodeSpacing':65,'rankSpacing':75,'useMaxWidth':true}}}%%
flowchart TD
    START([Raw Data in Source])

    START --> CRAWL_G[Glue Crawler\nInfer schema automatically\nScheduled or on-demand]
    CRAWL_G --> CAT_G[Glue Data Catalogue\nTable definitions and partitions\nShared with Athena and Redshift]
    CAT_G --> JOB_G[Glue Job\nSpark, Python Shell\nor Glue Studio visual]
    JOB_G --> BOOK_G{Job Bookmark\nenabled?}
    BOOK_G -->|No| WARN_G[Full reprocess every run\nCost and duration grow\nover time]
    BOOK_G -->|Yes| INC_G[Incremental processing\nOnly new data since\nlast successful run]

    WARN_G & INC_G --> WRITE_G[Write to S3\nParquet plus Snappy compression\nPartitioned by date or entity]
    WRITE_G --> REGISTER_G[Update Glue Catalogue\nNew partitions registered\nAthena queries see new data]
    REGISTER_G --> DONE_GL([Data Lake Updated\nQueryable via Athena])

    style START fill:#4f8ef7,color:#fff
    style DONE_GL fill:#10b981,color:#fff
    style WARN_G fill:#fef3c7
    style CAT_G fill:#e0f2fe

References

AWS — AWS Glue Developer Guide. docs.aws.amazon.com/glue
AWS — Glue Data Catalogue. docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler
AWS — Glue Studio User Guide. docs.aws.amazon.com/glue/latest/ug/what-is-glue-studio
Apache Parquet — Columnar storage format. parquet.apache.org

Ascendion Engineering Knowledge Base ← Integration Architecture