| 1 | Overview | 2 | Architecture Overview |
| 3 | AWS Service Topology | 4 | Implementation Guide |
| 5 | Decision Criteria | 6 | Cost Model |
| 7 | Anti-Patterns to Avoid | 8 | References |
Overview
When batch processing is too slow, streaming is the architecture. Typical cases are financial transactions that must be screened for fraud within milliseconds, telemetry that must trigger auto-scaling within seconds, and clickstreams that must personalise a page before the next request. AWS provides two parallel tracks: the Kinesis family for AWS-native streaming and MSK for Kafka-protocol compatibility with existing tooling.
Architecture Overview
%%{init:{'theme':'base','themeVariables':{'fontSize':'14px','fontFamily':'IBM Plex Sans, system-ui, sans-serif','primaryColor':'#DBEAFE','primaryTextColor':'#1e3a5f','primaryBorderColor':'#2563EB','lineColor':'#374151','clusterBkg':'#F9FAFB','clusterBorder':'#D1D5DB','edgeLabelBackground':'#FFFFFF'},'flowchart':{'curve':'orthogonal','padding':30,'nodeSpacing':65,'rankSpacing':75,'useMaxWidth':true}}}%%
flowchart TD
  START([High-velocity Data Source])
  START --> INGEST{Ingestion Layer}
  INGEST -->|AWS-native streaming| KDS[Kinesis Data Streams\nReal-time, millisecond latency\nShard-based throughput]
  INGEST -->|Kafka protocol needed| MSK[Amazon MSK\nManaged Apache Kafka\nExisting Kafka tooling]
  INGEST -->|Delivery to storage only| KDF[Kinesis Data Firehose\nZero administration\nDirect to S3 or Redshift]
  KDS --> PROCESS{Processing Layer}
  MSK --> PROCESS
  PROCESS -->|Real-time analytics| KDA[Kinesis Data Analytics\nApache Flink\nSQL on streaming data]
  PROCESS -->|Event processing| LAMBDA_S[Lambda\nTrigger per shard\nBatch processing]
  KDA & LAMBDA_S --> STORE{Storage Layer}
  KDF --> STORE
  STORE -->|Analytics and ML| S3_S[S3 Data Lake\nParquet format\nGlue catalogue]
  STORE -->|Search and dashboards| OS[OpenSearch Service\nKibana dashboards\nAnomaly detection]
  STORE -->|Low-latency queries| DDB_S[DynamoDB\nPre-aggregated state\nReal-time lookups]
  S3_S & OS & DDB_S --> DONE([Insights Available - Real Time and Historical])
  style START fill:#4f8ef7,color:#fff
  style DONE fill:#10b981,color:#fff
  style KDA fill:#e0f2fe
  style S3_S fill:#e0f2fe
AWS Service Topology
%%{init:{'theme':'base','themeVariables':{'fontSize':'14px','fontFamily':'IBM Plex Sans, system-ui, sans-serif','primaryColor':'#DBEAFE','primaryTextColor':'#1e3a5f','primaryBorderColor':'#2563EB','lineColor':'#374151','clusterBkg':'#F9FAFB','clusterBorder':'#D1D5DB','edgeLabelBackground':'#FFFFFF'},'flowchart':{'curve':'orthogonal','padding':30,'nodeSpacing':65,'rankSpacing':75,'useMaxWidth':true}}}%%
flowchart TD
  SOURCES[Data Sources\nApplications, IoT, Clickstreams\nDatabase CDC via DMS]
  subgraph KINESIS["Kinesis Family"]
    KDS_T[Kinesis Data Streams\n1 shard = 1 MB/s in, 2 MB/s out\nRetention: 24h to 365 days]
    KDF_T[Kinesis Data Firehose\nBuffer: 1-15 minutes or 1-128 MB\nAutomatic format conversion]
    KDA_T[Kinesis Data Analytics\nApache Flink application\nWindowed aggregations]
  end
  subgraph MSK_BLOCK["Amazon MSK"]
    BROKER[Kafka Brokers\nManaged upgrades\nAutoscale storage]
    CONNECT[MSK Connect\nKafka Connect managed\nSource and sink connectors]
    SCHEMA_S[Glue Schema Registry\nAvro, JSON, Protobuf\nSchema evolution governance]
  end
  subgraph DESTINATIONS["Data Destinations"]
    S3_T[S3 Data Lake\nGlue Crawler then Data Catalogue\nAthena SQL queries]
    RS[Amazon Redshift\nFirehose direct load\nBI and reporting]
    OS_T[OpenSearch\nReal-time indexing\nKibana visualisation]
  end
  SOURCES --> KDS_T & KDF_T & BROKER
  KDS_T --> KDA_T
  KDS_T --> KDF_T
  KDA_T --> OS_T
  KDF_T --> S3_T & RS
  BROKER --> CONNECT
  CONNECT --> S3_T
  SCHEMA_S -.->|Schema enforcement| BROKER & KDS_T
  style SOURCES fill:#4f8ef7,color:#fff
  style SCHEMA_S fill:#e0f2fe
Implementation Guide
CDK Stack — Kinesis Data Stream with Lambda Consumer
import * as cdk from 'aws-cdk-lib';
import * as kinesis from 'aws-cdk-lib/aws-kinesis';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as lambdaEventSources from 'aws-cdk-lib/aws-lambda-event-sources';
import * as firehose from '@aws-cdk/aws-kinesisfirehose-alpha';
import * as destinations from '@aws-cdk/aws-kinesisfirehose-destinations-alpha';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Duration, RemovalPolicy } from 'aws-cdk-lib';

export class DataStreamingStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Kinesis stream. Shard count formula:
    //   shards = ceil(max(ingest MB/s, consume MB/s / 2))
    // because each shard accepts 1 MB/s of writes and serves 2 MB/s of reads.
    const stream = new kinesis.Stream(this, 'AppStream', {
      shardCount: 2, // 2 MB/s ingest, 4 MB/s consume
      retentionPeriod: Duration.hours(24),
      encryption: kinesis.StreamEncryption.KMS,
      streamMode: kinesis.StreamMode.PROVISIONED,
    });

    // Lambda consumer. This event source polls the shards with standard
    // (shared-throughput) reads; enhanced fan-out is not configured here.
    const processor = new lambda.Function(this, 'StreamProcessor', {
      runtime: lambda.Runtime.NODEJS_20_X,
      architecture: lambda.Architecture.ARM_64,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/stream-processor'),
      timeout: Duration.minutes(5),
    });

    processor.addEventSource(
      new lambdaEventSources.KinesisEventSource(stream, {
        startingPosition: lambda.StartingPosition.TRIM_HORIZON, // start from the oldest retained record
        batchSize: 100,
        bisectBatchOnError: true, // split a failing batch to isolate the bad record
        maxBatchingWindow: Duration.seconds(5),
        reportBatchItemFailures: true, // handler must return batchItemFailures
      }),
    );

    // Firehose: archive all events to S3
    const archiveBucket = new s3.Bucket(this, 'ArchiveBucket', {
      versioned: true,
      removalPolicy: RemovalPolicy.RETAIN,
    });

    new firehose.DeliveryStream(this, 'ArchiveStream', {
      sourceStream: stream,
      destinations: [
        new destinations.S3Bucket(archiveBucket, {
          // Hive-style date partitions so Athena and Glue can prune by date
          dataOutputPrefix: 'events/year=!{timestamp:yyyy}/'
            + 'month=!{timestamp:MM}/day=!{timestamp:dd}/',
          // Explicit error prefix keeps failed records out of the data partitions
          errorOutputPrefix: 'errors/!{firehose:error-output-type}/',
          bufferingInterval: Duration.minutes(5),
          bufferingSize: cdk.Size.mebibytes(64),
        }),
      ],
    });
  }
}
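Because the event source above sets `reportBatchItemFailures: true`, the function behind `index.handler` is expected to return a `batchItemFailures` list of sequence numbers. The stack does not show that handler, so the following is only a minimal sketch of its core logic. The names `processBatch` and `processRecord` and the trimmed-down record type are hypothetical, and the sketch assumes the common approach of stopping at the first failed record so that per-shard ordering is preserved on retry.

```typescript
// Minimal local types: only the fields of a Kinesis event record that this
// sketch touches. The real Lambda event carries more fields.
interface KinesisRecord {
  kinesis: {
    sequenceNumber: string;
    partitionKey: string;
    data: string; // base64-encoded payload in a real event
  };
}

interface BatchResponse {
  batchItemFailures: { itemIdentifier: string }[];
}

// Hypothetical helper: runs `processRecord` over the batch in order and
// reports the first record that throws. Lambda then retries from that
// sequence number instead of replaying the whole batch.
function processBatch(
  records: KinesisRecord[],
  processRecord: (record: KinesisRecord) => void,
): BatchResponse {
  for (const record of records) {
    try {
      processRecord(record);
    } catch {
      // Stop at the first failure so later records are not processed
      // ahead of an earlier one that still needs a retry.
      return {
        batchItemFailures: [
          { itemIdentifier: record.kinesis.sequenceNumber },
        ],
      };
    }
  }
  // An empty list tells Lambda the whole batch succeeded.
  return { batchItemFailures: [] };
}
```

In a deployed function, an async `handler` would decode each `record.kinesis.data` from base64 and return the result of a call like `processBatch(event.Records, ...)`.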
Decision Criteria
| Factor | Kinesis Data Streams | Amazon MSK | Kinesis Data Firehose |
|---|---|---|---|
| Protocol | AWS proprietary | Apache Kafka | AWS proprietary |
| Latency | Sub-second | Sub-second | 60–900 seconds |
| Administration | Minimal | Medium | None |
| Consumer groups | Shard iterators | Kafka consumer groups | Not applicable |
| Replay | 24h–365 days | Configurable retention | Not supported |
| Use when | New AWS-native streaming | Kafka migration or ecosystem | Delivery to S3/Redshift only |
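As a rough illustration, the table can be read as a short decision function. This is a simplified sketch, not a complete selection procedure: the `Requirements` fields and the order of the checks are assumptions made for the example, and real choices also depend on team skills, cost, and existing tooling.

```typescript
type StreamingService =
  | 'Kinesis Data Streams'
  | 'Amazon MSK'
  | 'Kinesis Data Firehose';

// Hypothetical requirement flags, one per row of the table that
// separates the three services.
interface Requirements {
  needsKafkaProtocol: boolean;    // existing Kafka clients or ecosystem
  needsReplay: boolean;           // consumers must re-read past records
  needsSubSecondLatency: boolean; // 60 s or more of buffering is too slow
}

function chooseService(req: Requirements): StreamingService {
  // Kafka compatibility is the one requirement only MSK satisfies.
  if (req.needsKafkaProtocol) {
    return 'Amazon MSK';
  }
  // Firehose neither replays nor delivers sub-second, so either need
  // points to Kinesis Data Streams.
  if (req.needsReplay || req.needsSubSecondLatency) {
    return 'Kinesis Data Streams';
  }
  // What remains is delivery to storage only.
  return 'Kinesis Data Firehose';
}
```

For example, a workload that only lands events in S3 for later analysis sets all three flags to `false` and resolves to Firehose.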
Cost Model
Kinesis Data Streams costs $0.015 per shard-hour plus $0.014 per million PUT payload units. Two shards running continuously cost approximately $22/month (2 shards × $0.015 × 730 hours) plus data charges. MSK is billed per broker instance: a three-broker cluster with kafka.m5.large instances costs approximately $200/month plus storage.
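The Kinesis arithmetic can be sketched as a small calculator. It uses the two prices quoted above and the shard formula from the stack comment. The 730-hour month and the function names are assumptions made for illustration. The 25 KB size of a PUT payload unit is not stated in this article; it comes from AWS's provisioned-mode pricing definition. Actual prices vary by region.

```typescript
// Prices quoted in this article; check the AWS pricing page for your region.
const SHARD_HOUR_USD = 0.015;
const PUT_UNITS_PER_MILLION_USD = 0.014;

// Assumption: an average month of 730 hours.
const HOURS_PER_MONTH = 730;
// A PUT payload unit is a 25 KB chunk; a record is billed in whole units.
const PUT_UNIT_KB = 25;

// Shard formula from the stack comment: 1 MB/s in and 2 MB/s out per shard.
function requiredShards(ingestMBps: number, consumeMBps: number): number {
  return Math.max(1, Math.ceil(Math.max(ingestMBps, consumeMBps / 2)));
}

function monthlyStreamCostUsd(
  shards: number,
  recordsPerSecond: number,
  avgRecordKB: number,
): { shardCost: number; putCost: number; total: number } {
  const shardCost = shards * SHARD_HOUR_USD * HOURS_PER_MONTH;
  const unitsPerRecord = Math.ceil(avgRecordKB / PUT_UNIT_KB);
  const unitsPerMonth =
    recordsPerSecond * unitsPerRecord * 3600 * HOURS_PER_MONTH;
  const putCost = (unitsPerMonth / 1e6) * PUT_UNITS_PER_MILLION_USD;
  return { shardCost, putCost, total: shardCost + putCost };
}
```

With these inputs, `requiredShards(2, 4)` gives 2 shards, and `monthlyStreamCostUsd(2, 1000, 1)` gives a shard cost of about $21.90 and a PUT cost of about $36.79, so at sustained record rates the per-record charge can exceed the shard charge.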
Anti-Patterns to Avoid
The first anti-pattern is adding a Kinesis stream for simple task-queue use cases: sending emails, resizing images, processing orders one at a time. Kinesis is designed for high-throughput ordered streams, not task queues. Shard management, enhanced fan-out, and iterator tracking add complexity for no benefit.
Use SQS for task-queue patterns instead. Use Kinesis when you need ordered, high-throughput streaming with multiple independent consumers reading the same stream at different positions.
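The second anti-pattern is distributing records across shards using a low-cardinality partition key such as event type or region. All records of type OrderCreated hash to the same shard, which hits its 1 MB/s write limit and rejects further writes with ProvisionedThroughputExceededException while other shards sit idle. Producers that do not retry those throttled writes lose the records.
Use a high-cardinality partition key instead, such as customer ID, order ID, or device ID, so that records spread evenly across all shards. Monitor the WriteProvisionedThroughputExceeded metric, per shard where enhanced shard-level monitoring is enabled.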
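The hot-shard effect can be reproduced with a small simulation. Kinesis hashes each partition key with MD5 and routes the record to the shard whose hash-key range contains the result. To stay dependency-free, this sketch substitutes a 32-bit FNV-1a hash and equal-width ranges, so the exact shard numbers will differ from a real stream. The function names and sample keys are hypothetical.

```typescript
// ASSUMPTION: 32-bit FNV-1a stands in for the MD5 hash that Kinesis uses.
function fnv1a32(key: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    // Multiply by the FNV prime 16777619 modulo 2^32, written as shifts
    // so the intermediate value stays exactly representable.
    hash =
      (hash + (hash << 1) + (hash << 4) + (hash << 7) +
        (hash << 8) + (hash << 24)) >>> 0;
  }
  return hash;
}

// Maps a key to a shard by splitting the 32-bit hash space into
// `shardCount` contiguous, equal-width ranges.
function shardFor(partitionKey: string, shardCount: number): number {
  return Math.floor((fnv1a32(partitionKey) / 4294967296) * shardCount);
}

// Counts how many records land on each shard for a list of keys.
function recordsPerShard(keys: string[], shardCount: number): number[] {
  const counts: number[] = [];
  for (let s = 0; s < shardCount; s++) {
    counts.push(0);
  }
  for (const key of keys) {
    counts[shardFor(key, shardCount)] += 1;
  }
  return counts;
}
```

Feeding `recordsPerShard` a list drawn from three event-type keys can touch at most three of eight shards, however many records are sent. A list keyed by device ID has one hash per device, so it can reach every shard.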
References
- AWS — Amazon Kinesis Data Streams Developer Guide. docs.aws.amazon.com/streams
- AWS — Amazon MSK Developer Guide. docs.aws.amazon.com/msk
- AWS — Kinesis Data Firehose Developer Guide. docs.aws.amazon.com/firehose
- Apache Kafka — Documentation. kafka.apache.org/documentation