Introduction
Real-time data processing has become essential for modern applications. Whether it's fraud detection, user behavior analytics, or IoT monitoring, the ability to process data as it arrives is critical. In this post, I'll share our experience building a real-time analytics platform using Apache Kafka and Spark Streaming.
Architecture Overview
Our architecture consists of several key components:
- Data Sources: Multiple applications publishing events to Kafka
- Kafka Cluster: Message broker handling event streaming
- Spark Streaming: Real-time processing engine
- Data Sinks: Databases, data lakes, and dashboards
Why Kafka?
Apache Kafka excels at high-throughput message streaming:
- Scalability: Handles millions of messages per second
- Durability: Persistent message storage with replication
- Fault Tolerance: Built-in replication and failover
- Decoupling: Producers and consumers are independent
Kafka Best Practices
Topic Design
Proper topic organization is crucial:
- Use clear naming conventions (e.g., domain.entity.action)
- Configure appropriate partition counts based on throughput needs
- Set retention policies based on use case
- Use compacted topics for stateful data
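To make the naming and sizing rules concrete, here is a minimal sketch. The helper names and the per-partition capacity figure are illustrative assumptions, not values from our platform:

```python
# Hypothetical helpers illustrating the topic-design rules above.

def topic_name(domain: str, entity: str, action: str) -> str:
    """Build a topic name using the domain.entity.action convention."""
    return f"{domain}.{entity}.{action}"

def partition_count(peak_msgs_per_sec: int, per_partition_capacity: int = 10_000) -> int:
    """Size partitions from expected peak throughput (assumed capacity per partition)."""
    # Round up so the topic can absorb the peak rate.
    return max(1, -(-peak_msgs_per_sec // per_partition_capacity))

print(topic_name("payments", "transaction", "created"))  # payments.transaction.created
print(partition_count(250_000))  # 25
```

In practice you would also benchmark a single partition's real throughput before fixing the count, since repartitioning later breaks key ordering.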
Producer Configuration
- acks=all: Wait for all replicas to acknowledge
- compression.type=snappy: Reduce network bandwidth
- enable.idempotence=true: Prevent duplicates
- Batch messages for better throughput
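The settings above map directly onto client configuration. A sketch using confluent-kafka (librdkafka) key names; the broker address is an assumption:

```python
# Producer settings for reliability and throughput (confluent-kafka key names;
# other clients spell these slightly differently).
producer_config = {
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "acks": "all",                  # wait for all in-sync replicas
    "compression.type": "snappy",   # reduce network bandwidth
    "enable.idempotence": True,     # prevent duplicates on retry
    "linger.ms": 10,                # small delay to batch messages
    "batch.size": 64 * 1024,        # batch up to 64 KiB per partition

}

# Usage (requires a running broker):
# from confluent_kafka import Producer
# producer = Producer(producer_config)
```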
Consumer Groups
Leverage consumer groups for parallel processing:
- Partitions are distributed among group members
- Scale by adding more consumers (up to partition count)
- Handle rebalancing gracefully
- Monitor consumer lag closely
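To see why scaling past the partition count buys nothing, here is a toy model of how a group spreads partitions across members (round-robin for illustration; the real assignment protocol is negotiated by the clients and broker):

```python
# Toy model: distribute partitions across consumer group members.

def assign_partitions(partitions: list, consumers: list) -> dict:
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        # Extra consumers beyond the partition count receive nothing and sit idle.
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 6 partitions, 7 consumers: the seventh consumer gets no work.
result = assign_partitions(list(range(6)), [f"c{i}" for i in range(1, 8)])
print(result)  # {'c1': [0], 'c2': [1], ..., 'c6': [5], 'c7': []}
```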
Spark Streaming Deep Dive
Structured Streaming
We use Spark's Structured Streaming API for its advantages:
- Unified batch and streaming code
- Built-in support for event time processing
- Exactly-once semantics
- Integration with Spark SQL and DataFrames
Processing Models
Micro-Batch Processing
Default mode in Structured Streaming:
- Processes data in small batches (seconds)
- Lower resource overhead
- Good for most use cases
- Latency in the order of seconds
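Conceptually, micro-batching just buffers the stream into small chunks and processes each chunk as a unit. A toy sketch (Structured Streaming does this internally at each trigger):

```python
# Toy illustration of micro-batching: consume a stream in fixed-size chunks.
from itertools import islice

def micro_batches(stream, batch_size):
    it = iter(stream)
    while batch := list(islice(it, batch_size)):
        yield batch  # each yielded list is one "trigger" of work

events = range(10)
batches = list(micro_batches(events, 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```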
Continuous Processing
For ultra-low latency requirements:
- Processes events continuously
- Millisecond latencies
- Higher resource usage
- Limited operation support
Exactly-Once Semantics
Achieving exactly-once processing is critical but challenging. Our approach:
Idempotent Writes
- Enable idempotent producer in Kafka
- Use transactional writes for multi-sink scenarios
- Implement idempotent operations in processing logic
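The core idea behind idempotent processing logic is that replaying a delivered message must leave the sink unchanged. A minimal sketch with an in-memory store standing in for a real database:

```python
# Minimal idempotent sink: keyed upserts plus a processed-ID set make
# replayed deliveries harmless.

store = {}
processed_ids = set()

def idempotent_write(event_id: str, key: str, value: int) -> None:
    if event_id in processed_ids:   # duplicate delivery: skip
        return
    store[key] = value              # upsert, not append
    processed_ids.add(event_id)

idempotent_write("e1", "user:42", 7)
idempotent_write("e1", "user:42", 7)  # replayed after a retry; no effect
print(store)  # {'user:42': 7}
```

In a real sink the processed-ID check is usually a unique constraint or a conditional write rather than an in-memory set.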
Checkpointing
- Enable checkpointing to HDFS/S3
- Spark tracks offset commits automatically
- Ensures recovery to exact position after failures
- Monitor checkpoint size and clean old checkpoints
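The essence of checkpointing is persisting the last processed offset per partition so a restart resumes from the exact position. A sketch with a local JSON file standing in for the HDFS/S3 checkpoint location Spark would manage:

```python
# Offset checkpointing sketch: persist and restore per-partition positions.
import json, os, tempfile

def save_checkpoint(path: str, offsets: dict) -> None:
    with open(path, "w") as f:
        json.dump(offsets, f)

def load_checkpoint(path: str) -> dict:
    if not os.path.exists(path):
        return {}  # first run: no saved position yet
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.gettempdir(), "offsets.json")
save_checkpoint(path, {"events-0": 1500, "events-1": 1498})
print(load_checkpoint(path))  # {'events-0': 1500, 'events-1': 1498}
```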
Handling Backpressure
When processing can't keep up with incoming data:
Spark-Side Solutions
- maxOffsetsPerTrigger: Limit records per batch
- minOffsetsPerTrigger: Ensure minimum batch size
- Scale cluster resources (add executors)
- Optimize processing logic
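To illustrate what capping offsets per trigger does, here is a toy model: each trigger reads at most a fixed number of records past the last committed position, which bounds per-batch work even when the producer surges:

```python
# Toy model of maxOffsetsPerTrigger-style rate limiting.

def next_batch(log: list, committed: int, max_offsets: int):
    end = min(len(log), committed + max_offsets)
    return log[committed:end], end  # records plus the new committed position

log = list(range(25))  # 25 pending records
committed = 0
sizes = []
while committed < len(log):
    batch, committed = next_batch(log, committed, 10)
    sizes.append(len(batch))
print(sizes)  # [10, 10, 5] -- three bounded batches instead of one huge one
```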
Kafka-Side Solutions
- Increase partition count for better parallelism
- Add more consumer instances
- Implement rate limiting at producers
- Use separate topics for different priorities
State Management
Stateful Operations
Many use cases require maintaining state:
- Aggregations: Running counts, sums, averages
- Windows: Tumbling, sliding, or session windows
- Joins: Stream-to-stream or stream-to-static joins
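As a concrete picture of the state such aggregations maintain, here is a tumbling-window count in plain Python: each event lands in exactly one fixed-size window based on its event time, and the per-window counters are the state:

```python
# Tumbling-window counts keyed by window start time.
from collections import Counter

def tumbling_window_counts(events, window_sec):
    counts = Counter()
    for event_time, _payload in events:
        window_start = (event_time // window_sec) * window_sec
        counts[window_start] += 1
    return dict(counts)

events = [(1, "a"), (4, "b"), (12, "c"), (13, "d"), (25, "e")]
print(tumbling_window_counts(events, 10))  # {0: 2, 10: 2, 20: 1}
```

A sliding window differs only in that each event can fall into several overlapping windows, which multiplies the state accordingly.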
State Store Optimization
- Use RocksDB for large state (instead of in-memory)
- Configure state TTL to prevent unbounded growth
- Use watermarks for event-time processing
- Monitor state store size and performance
Monitoring and Observability
Key Metrics to Monitor
Kafka Metrics
- Consumer lag per partition
- Message throughput (messages/sec)
- Broker CPU and disk usage
- Under-replicated partitions
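Consumer lag, the first metric above, is simply the gap between the newest offset in each partition and the offset the group has committed:

```python
# Per-partition consumer lag: log end offset minus committed offset.

def consumer_lag(log_end_offsets: dict, committed: dict) -> dict:
    # An uncommitted partition counts its full length as lag.
    return {p: log_end_offsets[p] - committed.get(p, 0) for p in log_end_offsets}

lag = consumer_lag({0: 1_000, 1: 980}, {0: 940, 1: 980})
print(lag)  # {0: 60, 1: 0} -- partition 0 is 60 records behind
```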
Spark Streaming Metrics
- Processing time per batch
- Scheduling delay
- Number of active queries
- Executor memory usage
Alerting Strategy
- Alert on consumer lag exceeding thresholds
- Monitor processing delays
- Track failure rates
- Set up dead letter queues for failed messages
Performance Optimization
Spark Tuning
- Parallelism: Match or exceed Kafka partition count
- Memory: Allocate sufficient executor memory
- Caching: Cache frequently accessed reference data
- Shuffle: Minimize shuffles in processing logic
Kafka Tuning
- Increase num.network.threads for high throughput
- Tune replica.fetch.max.bytes for replication
- Use appropriate compression codec
- Configure log.segment.bytes based on retention
Testing Strategy
Unit Testing
- Test processing logic with static data
- Use Spark's built-in testing utilities
- Mock external dependencies
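The key to testing streaming logic with static data is factoring the per-record transformation into a pure function. A hypothetical example (the enrich function and field names are illustrative):

```python
# Pure transformation extracted from the streaming job so it can be unit
# tested with static data; the same function can later run inside the stream.

def enrich(event: dict) -> dict:
    return {**event, "amount_usd": round(event["amount_cents"] / 100, 2)}

def test_enrich():
    out = enrich({"user": "u1", "amount_cents": 1999})
    assert out["amount_usd"] == 19.99
    assert out["user"] == "u1"   # original fields preserved

test_enrich()
print("ok")
```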
Integration Testing
- Use embedded Kafka for tests
- Test end-to-end with representative data
- Verify exactly-once semantics
- Test failure recovery scenarios
Production Lessons
What Worked Well
- Structured Streaming's exactly-once guarantees
- Kafka's reliability and scalability
- Watermarks for handling late data
- Comprehensive monitoring setup
Challenges Faced
- State store size management with long windows
- Handling schema evolution in streaming data
- Balancing latency vs. throughput
- Coordinating upgrades without downtime
Results and Impact
- Processing 10M+ events per minute
- Average end-to-end latency under 2 seconds
- 99.99% exactly-once delivery
- Zero data loss during failures
- Enabled real-time business decisions
Conclusion
Building a real-time streaming platform with Kafka and Spark requires careful consideration of architecture, performance, and reliability. By following best practices for both technologies, implementing proper monitoring, and learning from production experience, you can build a robust system that processes data at scale with low latency and high reliability.