
Overview

AWS Glue has three components that work together. The Data Catalogue stores metadata (table definitions, schema information, and data source connections) and serves as the central metadata registry for an AWS account in a given Region. Crawlers discover data sources, infer schemas, and populate the Catalogue automatically. Jobs execute the transformation logic as Spark scripts, Python Shell scripts, or visual workflows built in Glue Studio.

Architecture

%%{init:{'theme':'base','themeVariables':{'fontSize':'14px','fontFamily':'IBM Plex Sans, system-ui, sans-serif','primaryColor':'#DBEAFE','primaryTextColor':'#1e3a5f','primaryBorderColor':'#2563EB','lineColor':'#374151','clusterBkg':'#F9FAFB','clusterBorder':'#D1D5DB','edgeLabelBackground':'#FFFFFF'},'flowchart':{'curve':'orthogonal','padding':30,'nodeSpacing':65,'rankSpacing':75,'useMaxWidth':true}}}%%
flowchart TD
    START([Data in Source Systems])
    START --> CRAWL[Glue Crawler\nDiscovers schema\nPopulates Data Catalogue]
    CRAWL --> CATALOGUE[Glue Data Catalogue\nCentral metadata registry\nShared with Athena and Redshift]
    CATALOGUE --> JOB_TYPE{Job Type}
    JOB_TYPE -->|Visual no-code| STUDIO[Glue Studio\nDrag-and-drop ETL\nAuto-generates PySpark]
    JOB_TYPE -->|Custom code| SPARK[Glue Spark Job\nPySpark or Scala\nDistributed processing]
    JOB_TYPE -->|Lightweight Python| PYTHON[Python Shell Job\nSingle node\nAPI calls and light transforms]
    STUDIO & SPARK & PYTHON --> TRANSFORM_G[Transform\nFilter, join, aggregate\nSchema mapping, deduplication]
    TRANSFORM_G --> BOOKMARK{Job Bookmark\nEnabled?}
    BOOKMARK -->|Yes| INCREMENTAL[Process only new data\nSince last successful run]
    BOOKMARK -->|No| FULL[Full dataset reprocessing\nHigher cost and duration]
    INCREMENTAL & FULL --> TARGET[Write to Target\nS3 Parquet, Redshift\nRDS, DynamoDB, Glue Catalogue]
    TARGET --> DONE_G([Data Available for Analytics])
    style START fill:#4f8ef7,color:#fff
    style DONE_G fill:#10b981,color:#fff
    style CATALOGUE fill:#e0f2fe
    style BOOKMARK fill:#fef3c7

When to Use

  • Batch ETL pipelines from operational databases to S3 data lake
  • Schema discovery and catalogue management for a data lake on S3
  • Transformation of raw data into analytics-ready Parquet format
  • Incremental data ingestion from RDS, DynamoDB, JDBC sources, or S3

When Not to Use

  • Real-time or near-real-time streaming — use Kinesis Data Streams or Kafka with Flink
  • Simple single-table copy with no transformation — use AWS DMS
  • Lightweight API-to-API integration — use Lambda or Step Functions
  • Sub-minute latency requirements — Glue Spark jobs have a startup overhead of 1–3 minutes

Implementation Guide

Job Bookmarks

Enable job bookmarks for incremental processing. A bookmark records the last successfully processed position: a timestamp, a primary key watermark, or an S3 object modification time. On the next run, the job processes only data that arrived after that position. Without bookmarks, every run reprocesses the full dataset.
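
The bookmark idea can be sketched in plain Python. This is a conceptual model only, not Glue's internal implementation: `SourceObject`, `select_new_objects`, `utc`, and the sample keys are hypothetical names introduced for illustration. In an actual Glue job, bookmarks are switched on with the job parameter `--job-bookmark-option` set to `job-bookmark-enable`, and the script records the new position by calling `job.commit()` at the end of a successful run.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SourceObject:
    """Hypothetical stand-in for an S3 object listed by a job run."""
    key: str
    last_modified: datetime


def select_new_objects(
    objects: List[SourceObject],
    bookmark: Optional[datetime],
) -> Tuple[List[SourceObject], Optional[datetime]]:
    """Return objects modified after the bookmark, plus the advanced bookmark.

    A bookmark of None means no successful run yet, so everything is new.
    """
    new = [o for o in objects if bookmark is None or o.last_modified > bookmark]
    # Advance the watermark only as far as the data actually selected.
    advanced = max((o.last_modified for o in new), default=bookmark)
    return new, advanced


def utc(day: int) -> datetime:
    """Helper for sample timestamps in June 2025 (illustrative data only)."""
    return datetime(2025, 6, day, tzinfo=timezone.utc)


objects = [
    SourceObject("orders/part-0001.json", utc(8)),
    SourceObject("orders/part-0002.json", utc(9)),
    SourceObject("orders/part-0003.json", utc(10)),
]

# First run: no bookmark yet, so all three objects are selected.
first_batch, bookmark = select_new_objects(objects, None)

# Second run: one new object has arrived; only that object is selected.
objects.append(SourceObject("orders/part-0004.json", utc(11)))
second_batch, bookmark = select_new_objects(objects, bookmark)
print([o.key for o in second_batch])
```

In this sketch the caller is expected to store the returned bookmark only after the run succeeds, which mirrors the "last successful run" semantics: a failed run leaves the old position in place, and the next run picks up the same data again.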

Output Format

Write Glue job output as Parquet with Snappy compression. Parquet is a columnar format, so Athena and Redshift Spectrum read only the columns referenced in a query, which reduces both query time and the amount of data scanned. Snappy balances compression ratio with Spark decompression speed. Avoid CSV output for analytics workloads: CSV is row-oriented, so a query that needs a single column still reads the entire file.

Partition Strategy

Partition S3 output by the most common query filter: date for time-series data, region for geographic data, entity ID for domain-specific lookups. With date partitioning, for example, output lands under prefixes such as year=2025/month=06/day=10/, and queries that filter by date scan only the relevant partitions. Athena's partition pruning skips unrelated data entirely.
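
A minimal sketch of the date layout described above, in plain Python. The function names `partition_prefix` and `matches_filter` and the `s3://example-lake/orders` base path are hypothetical; the second function only imitates the effect of partition pruning on a list of prefixes and is not how Athena implements it.

```python
from datetime import date


def partition_prefix(base: str, day: date) -> str:
    """Build a Hive-style year=/month=/day= prefix with zero-padded values."""
    return (
        f"{base.rstrip('/')}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


def matches_filter(prefix: str, year: int, month: int) -> bool:
    """Imitate partition pruning: keep only prefixes for the requested month."""
    return f"/year={year:04d}/month={month:02d}/" in prefix


prefixes = [
    partition_prefix("s3://example-lake/orders", date(2025, 5, 31)),
    partition_prefix("s3://example-lake/orders", date(2025, 6, 10)),
    partition_prefix("s3://example-lake/orders", date(2025, 6, 11)),
]

# A query filtered to June 2025 only needs the two June prefixes.
june = [p for p in prefixes if matches_filter(p, 2025, 6)]
print(june)
```

Zero-padding the month and day keeps prefixes in chronological order when sorted as strings, which is one reason the year=2025/month=06/day=10/ form is a common choice.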

Decision Criteria

| Use Case | Glue | Lambda ETL | Kinesis + Flink |
|---|---|---|---|
| Batch processing (hourly or daily) | Preferred | Possible for small data | Overkill |
| Real-time streaming | Not suitable | Not suitable | Preferred |
| Large dataset (hundreds of GB) | Preferred (Spark scales) | Impractical (memory limits) | Depends on throughput |
| Light transformation, small data | Overkill (use Python Shell) | Preferred | Not suitable |
| Schema discovery for S3 lake | Preferred (Crawler) | Manual | Not applicable |

Anti-Patterns to Avoid

⚠ 1. Full Reprocess on Every Run Without Job Bookmarks

Running a Glue job that reads all data from the source on every execution. On day one, this is fast. By month six, each run reads six months of data to produce one day of new output. Per-run duration and cost grow linearly with the accumulated data volume.

Correct Approach

Enable job bookmarks. The job processes only the data that arrived since the last successful run, so duration tracks the volume of new data rather than the size of the full dataset.
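
The growth described in this anti-pattern can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that exactly one unit of data arrives per day and that the job runs once a day; the function names are hypothetical.

```python
def units_read_full(day: int) -> int:
    """Full reprocess: the run on day N reads all N days of data."""
    return day


def units_read_incremental(day: int) -> int:
    """Bookmarked run: reads only the single new day of data."""
    return 1


def cumulative(read_fn, days: int) -> int:
    """Total data read across all daily runs up to and including `days`."""
    return sum(read_fn(d) for d in range(1, days + 1))


days = 180  # roughly six months of daily runs

# Data read by the single run on day 180.
print(units_read_full(days), units_read_incremental(days))

# Data read in total over all 180 runs.
print(cumulative(units_read_full, days), cumulative(units_read_incremental, days))
```

Under these assumptions the day-180 run reads 180 units instead of 1, and the total read over six months is 16,290 units instead of 180. Per-run cost grows linearly, but the cumulative cost of full reprocessing grows quadratically.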

⚠ 2. Writing CSV Output to S3 for Analytics

Glue jobs that output CSV files to S3 because CSV is familiar. Athena queries over CSV scan the entire file for every query regardless of which columns are needed. A query that needs only two columns of a 100 GB CSV still pays to scan all 100 GB, whereas the same data stored as Parquet is billed only for the columns actually read.

Correct Approach

Output Parquet with Snappy compression, partitioned by the primary query filter. On columnar formats Athena reads only the referenced columns, so queries can run 10–100× faster and cost up to 90% less, depending on how many columns each query touches.
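
A rough estimate of the scan difference for the 100 GB, two-column case. This is a simplified sketch under stated assumptions: the table has a hypothetical 20 columns of equal size, the Parquet data is treated as the same total size as the CSV (in practice compression usually makes it smaller), and the default price of 5.00 USD per TB scanned is an assumed figure that should be checked against current Athena pricing.

```python
def scanned_gb(total_gb: float, columns_total: int, columns_read: int, columnar: bool) -> float:
    """Estimate data scanned, assuming all columns are the same size."""
    if not columnar:
        # Row-oriented text formats such as CSV are read in full.
        return total_gb
    return total_gb * columns_read / columns_total


def scan_cost_usd(gb: float, usd_per_tb: float = 5.0) -> float:
    """Cost of a scan at an assumed per-terabyte price (1 TB = 1024 GB)."""
    return gb / 1024 * usd_per_tb


csv_gb = scanned_gb(100, columns_total=20, columns_read=2, columnar=False)
parquet_gb = scanned_gb(100, columns_total=20, columns_read=2, columnar=True)

print(csv_gb, parquet_gb)
print(round(scan_cost_usd(csv_gb), 4), round(scan_cost_usd(parquet_gb), 4))
```

With these inputs the CSV query scans the full 100 GB while the columnar query scans about 10 GB. The actual ratio depends on the width of the columns a query reads and on how well the data compresses.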

Flowchart

%%{init:{'theme':'base','themeVariables':{'fontSize':'14px','fontFamily':'IBM Plex Sans, system-ui, sans-serif','primaryColor':'#DBEAFE','primaryTextColor':'#1e3a5f','primaryBorderColor':'#2563EB','lineColor':'#374151','clusterBkg':'#F9FAFB','clusterBorder':'#D1D5DB','edgeLabelBackground':'#FFFFFF'},'flowchart':{'curve':'orthogonal','padding':30,'nodeSpacing':65,'rankSpacing':75,'useMaxWidth':true}}}%%
flowchart TD
    START([Raw Data in Source])
    START --> CRAWL_G[Glue Crawler\nInfer schema automatically\nScheduled or on-demand]
    CRAWL_G --> CAT_G[Glue Data Catalogue\nTable definitions and partitions\nShared with Athena and Redshift]
    CAT_G --> JOB_G[Glue Job\nSpark, Python Shell\nor Glue Studio visual]
    JOB_G --> BOOK_G{Job Bookmark\nenabled?}
    BOOK_G -->|No| WARN_G[Full reprocess every run\nCost and duration grow\nover time]
    BOOK_G -->|Yes| INC_G[Incremental processing\nOnly new data since\nlast successful run]
    WARN_G & INC_G --> WRITE_G[Write to S3\nParquet plus Snappy compression\nPartitioned by date or entity]
    WRITE_G --> REGISTER_G[Update Glue Catalogue\nNew partitions registered\nAthena queries see new data]
    REGISTER_G --> DONE_GL([Data Lake Updated\nQueryable via Athena])
    style START fill:#4f8ef7,color:#fff
    style DONE_GL fill:#10b981,color:#fff
    style WARN_G fill:#fef3c7
    style CAT_G fill:#e0f2fe

Ascendion Engineering Knowledge Base