Data Streaming on AWS

On This Page

1	Overview	2	Architecture Overview
3	AWS Service Topology	4	Implementation Guide
5	Decision Criteria	6	Cost Model
7	Anti-Patterns to Avoid	8	References

Overview

When batch processing is too slow — financial transactions that must be detected as fraudulent within milliseconds, telemetry that must trigger auto-scaling within seconds, clickstreams that must personalise a page before the next request — streaming is the architecture. AWS provides two parallel tracks: the Kinesis family for AWS-native streaming and MSK for Kafka-protocol compatibility with existing tooling.

Architecture Overview

%%{init:{'theme':'base','themeVariables':{'fontSize':'14px','fontFamily':'Inter, system-ui, sans-serif','primaryColor':'#DBEAFE','primaryTextColor':'#1e3a5f','primaryBorderColor':'#2563EB','lineColor':'#374151','clusterBkg':'#F9FAFB','clusterBorder':'#D1D5DB','edgeLabelBackground':'#FFFFFF'},'flowchart':{'curve':'orthogonal','padding':30,'nodeSpacing':65,'rankSpacing':75,'useMaxWidth':true}}}%%
flowchart TD
    START([High-velocity Data Source])

    START --> INGEST{Ingestion Layer}
    INGEST -->|AWS-native streaming| KDS[Kinesis Data Streams\nReal-time, millisecond latency\nShard-based throughput]
    INGEST -->|Kafka protocol needed| MSK[Amazon MSK\nManaged Apache Kafka\nExisting Kafka tooling]
    INGEST -->|Delivery to storage only| KDF[Kinesis Data Firehose\nZero administration\nDirect to S3 or Redshift]

    KDS --> PROCESS{Processing Layer}
    MSK --> PROCESS
    PROCESS -->|Real-time analytics| KDA[Kinesis Data Analytics\nApache Flink\nSQL on streaming data]
    PROCESS -->|Event processing| LAMBDA_S[Lambda\nTrigger per shard\nBatch processing]

    KDA & LAMBDA_S --> STORE{Storage Layer}
    KDF --> STORE
    STORE -->|Analytics and ML| S3_S[S3 Data Lake\nParquet format\nGlue catalogue]
    STORE -->|Search and dashboards| OS[OpenSearch Service\nKibana dashboards\nAnomaly detection]
    STORE -->|Low-latency queries| DDB_S[DynamoDB\nPre-aggregated state\nReal-time lookups]

    S3_S & OS & DDB_S --> DONE([Insights Available — Real Time and Historical])

    style START fill:#4f8ef7,color:#fff
    style DONE fill:#10b981,color:#fff
    style KDA fill:#e0f2fe
    style S3_S fill:#e0f2fe

AWS Service Topology

%%{init:{'theme':'base','themeVariables':{'fontSize':'14px','fontFamily':'Inter, system-ui, sans-serif','primaryColor':'#DBEAFE','primaryTextColor':'#1e3a5f','primaryBorderColor':'#2563EB','lineColor':'#374151','clusterBkg':'#F9FAFB','clusterBorder':'#D1D5DB','edgeLabelBackground':'#FFFFFF'},'flowchart':{'curve':'orthogonal','padding':30,'nodeSpacing':65,'rankSpacing':75,'useMaxWidth':true}}}%%
flowchart TD
    SOURCES[Data Sources\nApplications, IoT, Clickstreams\nDatabase CDC via DMS]

    subgraph KINESIS["Kinesis Family"]
        KDS_T[Kinesis Data Streams\n1 shard = 1 MB/s in, 2 MB/s out\nRetention: 24h to 365 days]
        KDF_T[Kinesis Data Firehose\nBuffer: 1-15 minutes or 1-128 MB\nAutomatic format conversion]
        KDA_T[Kinesis Data Analytics\nApache Flink application\nWindowed aggregations]
    end

    subgraph MSK_BLOCK["Amazon MSK"]
        BROKER[Kafka Brokers\nManaged upgrades\nAutoscale storage]
        CONNECT[MSK Connect\nKafka Connect managed\nSource and sink connectors]
        SCHEMA_S[Glue Schema Registry\nAvro, JSON, Protobuf\nSchema evolution governance]
    end

    subgraph DESTINATIONS["Data Destinations"]
        S3_T[S3 Data Lake\nGlue Crawler then Data Catalogue\nAthena SQL queries]
        RS[Amazon Redshift\nFirehose direct load\nBI and reporting]
        OS_T[OpenSearch\nReal-time indexing\nKibana visualisation]
    end

    SOURCES --> KDS_T & KDF_T & BROKER
    KDS_T --> KDA_T
    KDS_T --> KDF_T
    KDA_T --> OS_T
    KDF_T --> S3_T & RS
    BROKER --> CONNECT
    CONNECT --> S3_T
    SCHEMA_S -.->|Schema enforcement| BROKER & KDS_T

    style SOURCES fill:#4f8ef7,color:#fff
    style SCHEMA_S fill:#e0f2fe

Implementation Guide

CDK Stack — Kinesis Data Stream with Lambda Consumer

import * as cdk from 'aws-cdk-lib';
import * as kinesis from 'aws-cdk-lib/aws-kinesis';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as lambdaEventSources from 'aws-cdk-lib/aws-lambda-event-sources';
import * as firehose from '@aws-cdk/aws-kinesisfirehose-alpha';
import * as destinations from '@aws-cdk/aws-kinesisfirehose-destinations-alpha';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Duration, RemovalPolicy } from 'aws-cdk-lib';

export class DataStreamingStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Kinesis stream — shard count formula:
    //   shards = max(ingest MB/s, consume MB/s / 2)
    const stream = new kinesis.Stream(this, 'AppStream', {
      shardCount: 2,         // 2 MB/s ingest, 4 MB/s consume
      retentionPeriod: Duration.hours(24),
      encryption: kinesis.StreamEncryption.KMS,
      streamMode: kinesis.StreamMode.PROVISIONED,
    });

    // Lambda consumer with enhanced fan-out
    const processor = new lambda.Function(this, 'StreamProcessor', {
      runtime: lambda.Runtime.NODEJS_20_X,
      architecture: lambda.Architecture.ARM_64,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/stream-processor'),
      timeout: Duration.minutes(5),
    });

    processor.addEventSource(
      new lambdaEventSources.KinesisEventSource(stream, {
        startingPosition: lambda.StartingPosition.TRIM_HORIZON,
        batchSize: 100,
        bisectBatchOnError: true,
        maxBatchingWindow: Duration.seconds(5),
        reportBatchItemFailures: true,
      }));

    // Firehose — archive all events to S3
    const archiveBucket = new s3.Bucket(this, 'ArchiveBucket', {
      versioned: true,
      removalPolicy: RemovalPolicy.RETAIN,
    });

    new firehose.DeliveryStream(this, 'ArchiveStream', {
      sourceStream: stream,
      destinations: [
        new destinations.S3Bucket(archiveBucket, {
          dataOutputPrefix: 'events/year=!{timestamp:yyyy}/'
            + 'month=!{timestamp:MM}/day=!{timestamp:dd}/',
          bufferingInterval: Duration.minutes(5),
          bufferingSize: cdk.Size.mebibytes(64),
        }),
      ],
    });
  }
}

Decision Criteria

Factor	Kinesis Data Streams	Amazon MSK	Kinesis Firehose
Protocol	AWS proprietary	Apache Kafka	AWS proprietary
Latency	Sub-second	Sub-second	60–900 seconds
Administration	Minimal	Medium	None
Consumer groups	Shard iterators	Kafka consumer groups	Not applicable
Replay	24h–365 days	Configurable retention	Not supported
Use when	New AWS-native streaming	Kafka migration or ecosystem	Delivery to S3/Redshift only

Cost Model

Kinesis Data Streams costs $0.015 per shard hour plus $0.014 per million PUT payload units. Two shards running continuously costs approximately $22/month plus data charges. MSK costs are EC2-based — a three-broker cluster with kafka.m5.large instances costs approximately $200/month plus storage.

Anti-Patterns to Avoid

⚠ 1. Using Kinesis When SQS Is Sufficient

Adding a Kinesis stream for simple task queue use cases — sending emails, resizing images, processing orders one at a time. Kinesis is designed for high-throughput ordered streams, not task queues. The shard management, enhanced fan-out, and iterator tracking adds complexity for no benefit.

Hover to see the fix ↻

↺ Correct Approach

Use SQS for task queue patterns. Use Kinesis when you need ordered, high-throughput streaming with multiple independent consumers reading the same stream at different positions.

⚠ 2. Ignoring Shard Hot Spots

Distributing records across shards using a low-cardinality partition key such as event type or region. All records of type OrderCreated go to shard 1, which hits its 1 MB/s write limit and drops records under load while other shards sit idle.

Hover to see the fix ↻

↺ Correct Approach

Use a high-cardinality partition key — customer ID, order ID, or device ID. Distribute evenly across all shards. Monitor the MaxPutRecordsThrottled metric per shard.

Flowchart

%%{init:{'theme':'base','themeVariables':{'fontSize':'14px','fontFamily':'Inter, system-ui, sans-serif','primaryColor':'#DBEAFE','primaryTextColor':'#1e3a5f','primaryBorderColor':'#2563EB','lineColor':'#374151','clusterBkg':'#F9FAFB','clusterBorder':'#D1D5DB','edgeLabelBackground':'#FFFFFF'},'flowchart':{'curve':'orthogonal','padding':30,'nodeSpacing':65,'rankSpacing':75,'useMaxWidth':true}}}%% flowchart TD START([High-Velocity Data Event]) START --> CHOOSE{Select Streaming Service} CHOOSE -->|New AWS streaming, low ops| KDS[Kinesis Data Streams\nProvisioned or on-demand shards\n1 MB per second per shard] CHOOSE -->|Kafka ecosystem required| MSK[Amazon MSK\nManaged Kafka brokers\nExisting consumer groups] CHOOSE -->|Delivery to S3 or Redshift only| KDF[Kinesis Firehose\nZero administration\nBuffered delivery] KDS --> CONSUME{Consumer Type} CONSUME -->|Real-time analytics| KDA[Kinesis Analytics\nApache Flink\nWindowed SQL] CONSUME -->|Event processing| FN_S[Lambda Function\nBatch size 100\nBisect on error] MSK --> CONNECT_S[MSK Connect\nKafka Connect\nSink to S3 or OpenSearch] KDA --> STORE_S[OpenSearch\nDashboards\nAnomaly detection] FN_S --> DDB_S2[DynamoDB\nAggregated state\nReal-time queries] KDF --> S3_D[S3 Data Lake\nParquet via Glue\nAthena queryable] CONNECT_S --> S3_D S3_D & STORE_S & DDB_S2 --> DONE([Streaming Data Available for Insights]) style START fill:#4f8ef7,color:#fff style DONE fill:#10b981,color:#fff style KDA fill:#e0f2fe style CONNECT_S fill:#e0f2fe

References

AWS — Amazon Kinesis Data Streams Developer Guide. docs.aws.amazon.com/streams
AWS — Amazon MSK Developer Guide. docs.aws.amazon.com/msk
AWS — Kinesis Data Firehose Developer Guide. docs.aws.amazon.com/firehose
Apache Kafka — Documentation. kafka.apache.org/documentation

Ascendion Engineering Knowledge Base ← Cloud