Introduction
Real-time data processing has become essential for modern applications. Whether it's fraud detection, user behavior analytics, or IoT monitoring, the ability to process data as it arrives is critical. In this post, I'll share our experience building a real-time analytics platform using Apache Kafka and Spark Streaming.
Architecture Overview
Our architecture consists of several key components:
- Data Sources: Multiple applications publishing events to Kafka
- Kafka Cluster: Message broker handling event streaming
- Spark Streaming: Real-time processing engine
- Data Sinks: Databases, data lakes, and dashboards
Why Kafka?
Apache Kafka excels at high-throughput message streaming:
- Scalability: Handles millions of messages per second
- Durability: Persistent message storage with replication
- Fault Tolerance: Built-in replication and failover
- Decoupling: Producers and consumers are independent
Kafka Best Practices
Topic Design
Proper topic organization is crucial:
- Use clear naming conventions (e.g., domain.entity.action)
- Configure appropriate partition counts based on throughput needs
- Set retention policies based on use case
- Use compacted topics for stateful data
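To make the naming and sizing rules concrete, here is a minimal sketch. The helper names and the per-partition capacity figure are illustrative assumptions, not values from our platform:

```python
# Hypothetical helpers illustrating the topic-design rules above.

def topic_name(domain: str, entity: str, action: str) -> str:
    """Build a topic name using the domain.entity.action convention."""
    return f"{domain}.{entity}.{action}"

def partition_count(peak_msgs_per_sec: int, per_partition_capacity: int = 10_000) -> int:
    """Size partitions from expected peak throughput (assumed capacity per partition)."""
    # Round up so the topic can absorb the peak rate.
    return max(1, -(-peak_msgs_per_sec // per_partition_capacity))

print(topic_name("payments", "transaction", "created"))  # payments.transaction.created
print(partition_count(250_000))  # 25
```

In practice you would also benchmark a single partition's real throughput before fixing the count, since repartitioning later breaks key ordering.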
Producer Configuration
- acks=all: Wait for all replicas to acknowledge
- compression.type=snappy: Reduce network bandwidth
- enable.idempotence=true: Prevent duplicates
- Batch messages for better throughput
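The settings above map directly onto client configuration. A sketch using confluent-kafka (librdkafka) key names; the broker address is an assumption:

```python
# Producer settings for reliability and throughput (confluent-kafka key names;
# other clients spell these slightly differently).
producer_config = {
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "acks": "all",                  # wait for all in-sync replicas
    "compression.type": "snappy",   # reduce network bandwidth
    "enable.idempotence": True,     # prevent duplicates on retry
    "linger.ms": 10,                # small delay to batch messages
    "batch.size": 64 * 1024,        # batch up to 64 KiB per partition

}

# Usage (requires a running broker):
# from confluent_kafka import Producer
# producer = Producer(producer_config)
```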
Consumer Groups
Leverage consumer groups for parallel processing:
- Partitions are distributed among group members
- Scale by adding more consumers (up to partition count)
- Handle rebalancing gracefully
- Monitor consumer lag closely
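To see why scaling past the partition count buys nothing, here is a toy model of how a group spreads partitions across members (round-robin for illustration; the real assignment protocol is negotiated by the clients and broker):

```python
# Toy model: distribute partitions across consumer group members.

def assign_partitions(partitions: list, consumers: list) -> dict:
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        # Extra consumers beyond the partition count receive nothing and sit idle.
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 6 partitions, 7 consumers: the seventh consumer gets no work.
result = assign_partitions(list(range(6)), [f"c{i}" for i in range(1, 8)])
print(result)  # {'c1': [0], 'c2': [1], ..., 'c6': [5], 'c7': []}
```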
Spark Streaming Deep Dive
Structured Streaming
We use Spark's Structured Streaming API for its advantages:
- Unified batch and streaming code
- Built-in support for event time processing
- Exactly-once semantics
- Integration with Spark SQL and DataFrames
Processing Models
Micro-Batch Processing
Default mode in Structured Streaming:
- Processes data in small batches (seconds)
- Lower resource overhead
- Good for most use cases
- Latency in the order of seconds
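Conceptually, micro-batching just buffers the stream into small chunks and processes each chunk as a unit. A toy sketch (Structured Streaming does this internally at each trigger):

```python
# Toy illustration of micro-batching: consume a stream in fixed-size chunks.
from itertools import islice

def micro_batches(stream, batch_size):
    it = iter(stream)
    while batch := list(islice(it, batch_size)):
        yield batch  # each yielded list is one "trigger" of work

events = range(10)
batches = list(micro_batches(events, 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```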
Continuous Processing
For ultra-low latency requirements:
- Processes events continuously
- Millisecond latencies
- Higher resource usage
- Limited operation support
Exactly-Once Semantics
Achieving exactly-once processing is critical but challenging. Our approach:
Idempotent Writes
- Enable idempotent producer in Kafka
- Use transactional writes for multi-sink scenarios
- Implement idempotent operations in processing logic
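The core idea behind idempotent processing logic is that replaying a delivered message must leave the sink unchanged. A minimal sketch with an in-memory store standing in for a real database:

```python
# Minimal idempotent sink: keyed upserts plus a processed-ID set make
# replayed deliveries harmless.

store = {}
processed_ids = set()

def idempotent_write(event_id: str, key: str, value: int) -> None:
    if event_id in processed_ids:   # duplicate delivery: skip
        return
    store[key] = value              # upsert, not append
    processed_ids.add(event_id)

idempotent_write("e1", "user:42", 7)
idempotent_write("e1", "user:42", 7)  # replayed after a retry; no effect
print(store)  # {'user:42': 7}
```

In a real sink the processed-ID check is usually a unique constraint or a conditional write rather than an in-memory set.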
Checkpointing
- Enable checkpointing to HDFS/S3
- Spark tracks offset commits automatically
- Ensures recovery to exact position after failures
- Monitor checkpoint size and clean old checkpoints
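The essence of checkpointing is persisting the last processed offset per partition so a restart resumes from the exact position. A sketch with a local JSON file standing in for the HDFS/S3 checkpoint location Spark would manage:

```python
# Offset checkpointing sketch: persist and restore per-partition positions.
import json, os, tempfile

def save_checkpoint(path: str, offsets: dict) -> None:
    with open(path, "w") as f:
        json.dump(offsets, f)

def load_checkpoint(path: str) -> dict:
    if not os.path.exists(path):
        return {}  # first run: no saved position yet
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.gettempdir(), "offsets.json")
save_checkpoint(path, {"events-0": 1500, "events-1": 1498})
print(load_checkpoint(path))  # {'events-0': 1500, 'events-1': 1498}
```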
Handling Backpressure
When processing can't keep up with incoming data:
Spark-Side Solutions
- maxOffsetsPerTrigger: Limit records per batch
- minOffsetsPerTrigger: Ensure minimum batch size
- Scale cluster resources (add executors)
- Optimize processing logic
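To illustrate what capping offsets per trigger does, here is a toy model: each trigger reads at most a fixed number of records past the last committed position, which bounds per-batch work even when the producer surges:

```python
# Toy model of maxOffsetsPerTrigger-style rate limiting.

def next_batch(log: list, committed: int, max_offsets: int):
    end = min(len(log), committed + max_offsets)
    return log[committed:end], end  # records plus the new committed position

log = list(range(25))  # 25 pending records
committed = 0
sizes = []
while committed < len(log):
    batch, committed = next_batch(log, committed, 10)
    sizes.append(len(batch))
print(sizes)  # [10, 10, 5] -- three bounded batches instead of one huge one
```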
Kafka-Side Solutions
- Increase partition count for better parallelism
- Add more consumer instances
- Implement rate limiting at producers
- Use separate topics for different priorities
State Management
Stateful Operations
Many use cases require maintaining state:
- Aggregations: Running counts, sums, averages
- Windows: Tumbling, sliding, or session windows
- Joins: Stream-to-stream or stream-to-static joins
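As a concrete picture of the state such aggregations maintain, here is a tumbling-window count in plain Python: each event lands in exactly one fixed-size window based on its event time, and the per-window counters are the state:

```python
# Tumbling-window counts keyed by window start time.
from collections import Counter

def tumbling_window_counts(events, window_sec):
    counts = Counter()
    for event_time, _payload in events:
        window_start = (event_time // window_sec) * window_sec
        counts[window_start] += 1
    return dict(counts)

events = [(1, "a"), (4, "b"), (12, "c"), (13, "d"), (25, "e")]
print(tumbling_window_counts(events, 10))  # {0: 2, 10: 2, 20: 1}
```

A sliding window differs only in that each event can fall into several overlapping windows, which multiplies the state accordingly.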
State Store Optimization
- Use RocksDB for large state (instead of in-memory)
- Configure state TTL to prevent unbounded growth
- Use watermarks for event-time processing
- Monitor state store size and performance
Monitoring and Observability
Key Metrics to Monitor
Kafka Metrics
- Consumer lag per partition
- Message throughput (messages/sec)
- Broker CPU and disk usage
- Under-replicated partitions
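Consumer lag, the first metric above, is simply the gap between the newest offset in each partition and the offset the group has committed:

```python
# Per-partition consumer lag: log end offset minus committed offset.

def consumer_lag(log_end_offsets: dict, committed: dict) -> dict:
    # An uncommitted partition counts its full length as lag.
    return {p: log_end_offsets[p] - committed.get(p, 0) for p in log_end_offsets}

lag = consumer_lag({0: 1_000, 1: 980}, {0: 940, 1: 980})
print(lag)  # {0: 60, 1: 0} -- partition 0 is 60 records behind
```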
Spark Streaming Metrics
- Processing time per batch
- Scheduling delay
- Number of active queries
- Executor memory usage
Alerting Strategy
- Alert on consumer lag exceeding thresholds
- Monitor processing delays
- Track failure rates
- Set up dead letter queues for failed messages
Performance Optimization
Spark Tuning
- Parallelism: Match or exceed Kafka partition count
- Memory: Allocate sufficient executor memory
- Caching: Cache frequently accessed reference data
- Shuffle: Minimize shuffles in processing logic
Kafka Tuning
- Increase num.network.threads for high throughput
- Tune replica.fetch.max.bytes for replication
- Use appropriate compression codec
- Configure log.segment.bytes based on retention
Testing Strategy
Unit Testing
- Test processing logic with static data
- Use Spark's built-in testing utilities
- Mock external dependencies
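The key to testing streaming logic with static data is factoring the per-record transformation into a pure function. A hypothetical example (the enrich function and field names are illustrative):

```python
# Pure transformation extracted from the streaming job so it can be unit
# tested with static data; the same function can later run inside the stream.

def enrich(event: dict) -> dict:
    return {**event, "amount_usd": round(event["amount_cents"] / 100, 2)}

def test_enrich():
    out = enrich({"user": "u1", "amount_cents": 1999})
    assert out["amount_usd"] == 19.99
    assert out["user"] == "u1"   # original fields preserved

test_enrich()
print("ok")
```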
Integration Testing
- Use embedded Kafka for tests
- Test end-to-end with representative data
- Verify exactly-once semantics
- Test failure recovery scenarios
Production Lessons
What Worked Well
- Structured Streaming's exactly-once guarantees
- Kafka's reliability and scalability
- Watermarks for handling late data
- Comprehensive monitoring setup
Challenges Faced
- State store size management with long windows
- Handling schema evolution in streaming data
- Balancing latency vs. throughput
- Coordinating upgrades without downtime
Results and Impact
- Processing 10M+ events per minute
- Average end-to-end latency under 2 seconds
- 99.99% exactly-once delivery
- Zero data loss during failures
- Enabled real-time business decisions
Conclusion
Building a real-time streaming platform with Kafka and Spark requires careful consideration of architecture, performance, and reliability. By following best practices for both technologies, implementing proper monitoring, and learning from production experience, you can build a robust system that processes data at scale with low latency and high reliability.