| 1 | Overview | 2 | Architecture |
| 3 | When to Use | 4 | When Not to Use |
| 5 | Implementation Guide | 6 | Decision Criteria |
| 7 | Anti-Patterns to Avoid | 8 | References |
Overview
Glue has three components that work together. The Data Catalogue stores metadata — table definitions, schema information, and data source connections — and serves as a central metadata registry for the AWS account. Crawlers discover data sources, infer schemas, and populate the Catalogue automatically. Jobs execute the transformation logic — as Spark scripts, Python Shell scripts, or visual workflows built in Glue Studio.
Architecture
%%{init:{'theme':'base','themeVariables':{'fontSize':'14px','fontFamily':'IBM Plex Sans, system-ui, sans-serif','primaryColor':'#DBEAFE','primaryTextColor':'#1e3a5f','primaryBorderColor':'#2563EB','lineColor':'#374151','clusterBkg':'#F9FAFB','clusterBorder':'#D1D5DB','edgeLabelBackground':'#FFFFFF'},'flowchart':{'curve':'orthogonal','padding':30,'nodeSpacing':65,'rankSpacing':75,'useMaxWidth':true}}}%%
flowchart TD
    START([Data in Source Systems])
    START --> CRAWL[Glue Crawler\nDiscovers schema\nPopulates Data Catalogue]
    CRAWL --> CATALOGUE[Glue Data Catalogue\nCentral metadata registry\nShared with Athena and Redshift]
    CATALOGUE --> JOB_TYPE{Job Type}
    JOB_TYPE -->|Visual no-code| STUDIO[Glue Studio\nDrag-and-drop ETL\nAuto-generates PySpark]
    JOB_TYPE -->|Custom code| SPARK[Glue Spark Job\nPySpark or Scala\nDistributed processing]
    JOB_TYPE -->|Lightweight Python| PYTHON[Python Shell Job\nSingle node\nAPI calls and light transforms]
    STUDIO & SPARK & PYTHON --> TRANSFORM_G[Transform\nFilter, join, aggregate\nSchema mapping, deduplication]
    TRANSFORM_G --> BOOKMARK{Job Bookmark\nEnabled?}
    BOOKMARK -->|Yes| INCREMENTAL[Process only new data\nSince last successful run]
    BOOKMARK -->|No| FULL[Full dataset reprocessing\nHigher cost and duration]
    INCREMENTAL & FULL --> TARGET[Write to Target\nS3 Parquet, Redshift\nRDS, DynamoDB, Glue Catalogue]
    TARGET --> DONE_G([Data Available for Analytics])
    style START fill:#4f8ef7,color:#fff
    style DONE_G fill:#10b981,color:#fff
    style CATALOGUE fill:#e0f2fe
    style BOOKMARK fill:#fef3c7
When to Use
- Batch ETL pipelines from operational databases to S3 data lake
- Schema discovery and catalogue management for a data lake on S3
- Transformation of raw data into analytics-ready Parquet format
- Incremental data ingestion from RDS, DynamoDB, JDBC sources, or S3
When Not to Use
- Real-time or near-real-time streaming — use Kinesis Data Streams or Kafka with Flink
- Simple single-table copy with no transformation — use AWS DMS
- Lightweight API-to-API integration — use Lambda or Step Functions
- Sub-minute latency requirements — Glue Spark jobs have a startup overhead of 1–3 minutes
Implementation Guide
Job Bookmarks
Enable job bookmarks for incremental processing. A bookmark records how far the last successful run got: a timestamp, a primary key watermark, or an S3 object modification time. On the next run, the job processes only data that arrived after that position. Without bookmarks, every run reprocesses the full dataset.
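The bookmark idea can be modelled in a few lines of plain Python. This is a conceptual sketch only, not Glue's internal implementation: `BookmarkStore`, `run_incremental`, and the `modified_at` field are hypothetical names chosen for illustration. It shows the two properties described above: only records past the stored position are processed, and the position advances only after the run succeeds. In an actual Glue script the same behaviour is driven by giving each source a `transformation_ctx` and calling `job.commit()` at the end of the run.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class BookmarkStore:
    """Hypothetical stand-in for persisted bookmark state, keyed by job name."""
    positions: Dict[str, int] = field(default_factory=dict)

    def get(self, job_name: str) -> int:
        # No bookmark yet means "start from the beginning".
        return self.positions.get(job_name, 0)

    def commit(self, job_name: str, position: int) -> None:
        self.positions[job_name] = position


def run_incremental(
    job_name: str,
    records: List[Dict[str, Any]],
    store: BookmarkStore,
    transform: Callable[[Dict[str, Any]], Any],
) -> List[Any]:
    """Process only records newer than the bookmark, then advance it."""
    last_position = store.get(job_name)
    new_records = [r for r in records if r["modified_at"] > last_position]
    if not new_records:
        return []
    # If transform raises, the bookmark is left untouched and the same
    # records are picked up again on the next run.
    output = [transform(r) for r in new_records]
    store.commit(job_name, max(r["modified_at"] for r in new_records))
    return output


store = BookmarkStore()
source = [{"id": 1, "modified_at": 100}, {"id": 2, "modified_at": 200}]
first_run = run_incremental("orders_etl", source, store, lambda r: r["id"])

source.append({"id": 3, "modified_at": 300})
second_run = run_incremental("orders_etl", source, store, lambda r: r["id"])
```

The first run processes both existing records; the second run sees only the record added afterwards, which is the behaviour that keeps run time tied to new data rather than total data.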
Output Format
Write Glue job output as Parquet with Snappy compression. Parquet is a columnar format — Athena and Redshift Spectrum read only the columns referenced in a query, dramatically reducing query time and cost. Snappy balances compression ratio with Spark decompression speed. Avoid CSV output for analytics workloads — column scans read the entire file.
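As a minimal sketch of this recommendation, the helper below writes a Spark DataFrame as Snappy-compressed Parquet using the standard `DataFrameWriter` calls (`mode`, `option`, `partitionBy`, `parquet`). The function name `write_parquet_snappy`, the `append` write mode, and the example bucket path are assumptions for illustration; a job that rewrites whole partitions may want a different mode.

```python
from typing import Any, Sequence


def write_parquet_snappy(
    df: Any, target_path: str, partition_cols: Sequence[str] = ()
) -> None:
    """Write a Spark DataFrame as Snappy-compressed Parquet.

    `df` is expected to be a pyspark.sql.DataFrame. It is typed as Any so
    this helper can be imported without Spark installed.
    """
    # "append" is an assumption that suits incremental runs.
    writer = df.write.mode("append").option("compression", "snappy")
    if partition_cols:
        writer = writer.partitionBy(*partition_cols)
    writer.parquet(target_path)
```

Inside a Glue Spark job this might be called as `write_parquet_snappy(df, "s3://example-bucket/curated/orders/", ["year", "month", "day"])`, where the bucket and column names are placeholders.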
Partition Strategy
Partition S3 output by date — year=2025/month=06/day=10/ — so queries that filter by date only scan the relevant partitions. Athena's partition pruning skips unrelated data entirely. Partition by the most common query filter — date for time-series data, region for geographic data, entity ID for domain-specific lookups.
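The date layout above can be generated and filtered with a short helper. This is an illustrative sketch: `partition_prefix` and `prune_partitions` are hypothetical names, and `prune_partitions` is only a toy model of what partition pruning achieves, not how Athena implements it.

```python
from datetime import date
from typing import Iterable, List


def partition_prefix(day: date) -> str:
    """Build a Hive-style key prefix such as year=2025/month=06/day=10/."""
    return f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


def prune_partitions(prefixes: Iterable[str], start: date, end: date) -> List[str]:
    """Keep only the prefixes whose date falls within [start, end]."""
    selected = []
    for prefix in prefixes:
        # "year=2025/month=06/day=10/" -> {"year": "2025", "month": "06", "day": "10"}
        parts = dict(seg.split("=", 1) for seg in prefix.strip("/").split("/"))
        day = date(int(parts["year"]), int(parts["month"]), int(parts["day"]))
        if start <= day <= end:
            selected.append(prefix)
    return selected


prefixes = [partition_prefix(date(2025, 6, d)) for d in (9, 10, 11)]
relevant = prune_partitions(prefixes, date(2025, 6, 10), date(2025, 6, 11))
```

A query filtered to 10 and 11 June touches two of the three prefixes, and everything under the third is never read.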
Decision Criteria
| Use Case | Glue | Lambda ETL | Kinesis + Flink |
|---|---|---|---|
| Batch processing — hourly or daily | Preferred | Possible for small data | Overkill |
| Real-time streaming | Not suitable | Not suitable | Preferred |
| Large dataset — hundreds of GB | Preferred — Spark scales | Impractical — memory limits | Depends on throughput |
| Light transformation, small data | Overkill — use Python Shell | Preferred | Not suitable |
| Schema discovery for S3 lake | Preferred — Crawler | Manual | Not applicable |
Anti-Patterns to Avoid
The first anti-pattern is a Glue job that reads all data from the source on every execution. On day one, this is fast. By month six, the job reads six months of data to produce one day of new output, so duration and cost grow linearly with data volume.
The fix is to enable job bookmarks. The job then processes only the data that arrived since the last successful run, so duration tracks the volume of new data rather than the size of the whole dataset.
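Bookmarks are switched on through the `--job-bookmark-option` job argument. The sketch below builds a job definition in the shape accepted by boto3's Glue `create_job` call, without actually calling AWS. The helper name `glue_job_definition`, the job name, the role ARN, and the script location are all placeholders invented for illustration.

```python
from typing import Any, Dict


def glue_job_definition(name: str, role_arn: str, script_location: str) -> Dict[str, Any]:
    """Build keyword arguments for a Glue Spark job with bookmarks enabled."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "DefaultArguments": {
            # Other accepted values: job-bookmark-disable, job-bookmark-pause
            "--job-bookmark-option": "job-bookmark-enable",
        },
    }


# Placeholder values; substitute real resources before use.
job_definition = glue_job_definition(
    name="orders-incremental-etl",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_location="s3://example-bucket/scripts/orders_etl.py",
)
# With credentials configured, this could be submitted with:
#   boto3.client("glue").create_job(**job_definition)
```

Keeping the definition as plain data makes it easy to review in version control and to check that bookmarks have not been silently disabled.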
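The second anti-pattern is a Glue job that outputs CSV files to S3 because CSV is familiar. CSV is row-oriented, so Athena scans the entire file for every query regardless of which columns are needed. A query that reads two columns from a 100 GB CSV dataset is billed for all 100 GB, whereas the same query over Parquet is billed only for the columns it reads.
The fix is to output Parquet with Snappy compression, partitioned by the primary query filter. Because Athena bills by bytes scanned, queries can run 10–100× faster and cost around 90% less on columnar formats, depending on how many columns and partitions each query reads.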
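The size of the saving can be estimated with back-of-envelope arithmetic. The sketch below is a rough model under stated assumptions, not a measurement: it assumes all columns are the same width, and the 20-column table and 2:1 compression ratio are invented inputs. Real results depend on the data.

```python
def csv_bytes_scanned(total_bytes: float) -> float:
    """Row-oriented text: every query reads the whole dataset."""
    return total_bytes


def columnar_bytes_scanned(
    uncompressed_bytes: float,
    columns_read: int,
    total_columns: int,
    compression_ratio: float,
) -> float:
    """Columnar estimate: only the requested columns, after compression.

    Assumes every column contributes an equal share of the data.
    """
    return uncompressed_bytes * columns_read / total_columns / compression_ratio


GB = 1e9
csv_scan = csv_bytes_scanned(100 * GB)
parquet_scan = columnar_bytes_scanned(
    100 * GB, columns_read=2, total_columns=20, compression_ratio=2.0
)
reduction = 1 - parquet_scan / csv_scan

print(f"CSV scan:     {csv_scan / GB:.1f} GB")
print(f"Parquet scan: {parquet_scan / GB:.1f} GB")
print(f"Reduction:    {reduction:.0%}")
```

Under these assumed inputs the columnar scan is about 5 GB against 100 GB for CSV. Since cost is proportional to bytes scanned, the bill shrinks by the same proportion.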
References
- AWS — AWS Glue Developer Guide. docs.aws.amazon.com/glue
- AWS — Glue Data Catalogue. docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler
- AWS — Glue Studio User Guide. docs.aws.amazon.com/glue/latest/ug/what-is-glue-studio
- Apache Parquet — Columnar storage format. parquet.apache.org