3.4 Big Data Processing
Many modern systems generate or need to analyze massive datasets—"Big Data"—that are too large to be handled by a single machine or a traditional database. This requires specialized distributed processing frameworks. The two primary models for processing Big Data are Batch Processing and Stream Processing.
3.4.a Batch Processing
Batch processing is a method where a large volume of data is collected over time, grouped into a "batch," and then processed all at once in a single, large job.
- Analogy: Doing all of your laundry for the week on Sunday. You collect a large batch of clothes and run a big job to wash them all.
- Key Characteristics:
- High Throughput: It's optimized to process massive amounts of data efficiently.
- High Latency: There is a significant delay between when data is generated and when the results are available because the job runs on a schedule (e.g., hourly, daily). It is not real-time.
- Bounded Data: It operates on a finite, static dataset (e.g., "all the sales from yesterday").
- Use Cases:
- Generating end-of-day financial reports.
- Large-scale ETL (Extract, Transform, Load) jobs to populate a data warehouse.
- Calculating complex analytics over a large historical dataset.
- Training machine learning models.
- Common Technologies:
- Hadoop MapReduce: The original, foundational framework for distributed batch processing.
- Apache Spark: A more modern, faster, and more general-purpose framework that has largely superseded MapReduce for most batch tasks due to its in-memory processing capabilities.
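The map/shuffle/reduce shape that MapReduce popularized can be sketched in plain Python. This is not a real distributed job, just the three phases on a single machine; the sales records and field names (`region`, `amount`) are invented for illustration.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical bounded dataset: yesterday's sales, fully collected
# before the batch job starts.
sales = [
    {"region": "EU", "amount": 120.0},
    {"region": "US", "amount": 75.5},
    {"region": "EU", "amount": 30.0},
]

def map_phase(record):
    # Emit (key, value) pairs — here, one pair per sale record.
    yield (record["region"], record["amount"])

def shuffle(pairs):
    # Group all values by key, as the framework's shuffle step would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate each group — here, total sales per region.
    return (key, sum(values))

pairs = chain.from_iterable(map_phase(r) for r in sales)
result = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
# result maps each region to its total: {"EU": 150.0, "US": 75.5}
```

The defining property of the batch model is visible here: every phase assumes the whole dataset is available up front, and no result exists until the final reduce completes.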
3.4.b Stream Processing
Stream processing is a method where data is processed continuously, event-by-event, as it arrives. It operates on data that is in motion.
- Analogy: Washing each dish immediately after you finish eating. You process each item (data event) as it comes, in near real-time.
- Key Characteristics:
- Low Latency: It's designed to provide results within milliseconds or seconds of the data arriving.
- Real-Time: It enables immediate reaction to events.
- Unbounded Data: It operates on a potentially infinite stream of data with no defined end (e.g., a continuous stream of sensor data or website clicks).
- Use Cases:
- Real-time fraud detection in credit card transactions.
- Live monitoring of user activity on a website (clickstream analysis).
- Real-time analytics dashboards.
- Alerting on anomalies from system logs or IoT sensor data.
- Common Technologies:
- Apache Flink: A powerful stream processing framework known for its performance and robust state management.
- Apache Spark Streaming / Structured Streaming: Spark's modules for processing live data streams (often using a "micro-batch" approach).
- Kafka Streams: A client library for building real-time applications and microservices that process data directly from Kafka topics.
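The contrast with batch is easiest to see in code. Below is a framework-free sketch of the stream model in plain Python: events are processed one at a time as they arrive, running state is kept between events, and alerts are emitted immediately. The transaction fields (`card`, `amount`) and the fraud threshold are invented for illustration.

```python
def process_stream(events, threshold=1000.0):
    """Process each transaction as it arrives; react immediately to anomalies."""
    totals = {}  # running state per card, maintained across events
    for event in events:  # in a real system this stream never ends
        card = event["card"]
        totals[card] = totals.get(card, 0.0) + event["amount"]
        if event["amount"] > threshold:
            # Emit the alert now, not at the end of a scheduled job.
            yield ("ALERT", card, event["amount"])

alerts = list(process_stream([
    {"card": "A", "amount": 20.0},
    {"card": "A", "amount": 1500.0},
    {"card": "B", "amount": 5.0},
]))
# alerts -> [("ALERT", "A", 1500.0)]
```

Frameworks like Flink add what this sketch lacks: distributed execution, fault-tolerant state, and event-time windowing — but the per-event, stateful loop is the core idea.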
Comparison: Batch vs. Stream
| Feature | Batch Processing | Stream Processing |
|---|---|---|
| Data Scope | Bounded, large, static datasets | Unbounded, continuous data streams |
| Latency | High (minutes to hours) | Low (milliseconds to seconds) |
| Throughput | Optimized for high bulk throughput | Optimized for per-event latency over bulk throughput |
| Analysis | Complex, deep analysis over full history | Incremental analysis (filtering, aggregations, windowing) |
Summary for an Interview
- Understand the fundamental difference: Batch processes large, finite datasets with high latency, while Stream processes unbounded data continuously with low latency.
- Be able to choose the right model for a given task. For example, "To generate our weekly user engagement report, we'll run a nightly batch job using Apache Spark. To detect fraudulent transactions as they happen, we need a stream processing solution using Apache Flink."
- Acknowledge that many complex systems use both models. For example, a streaming pipeline for real-time dashboards and a separate batch pipeline for more accurate, in-depth historical analysis (this is often called a "Lambda Architecture").
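The Lambda Architecture mentioned above can be sketched minimally: a speed layer keeps an incrementally updated view, a batch layer periodically recomputes an authoritative view from the full history, and a serving layer merges them. The event shape (`user` clicks) and both views are invented for illustration; in practice the two layers run on different infrastructure (e.g., Flink and Spark).

```python
# Hypothetical event log, shared by both layers.
events = [{"user": "u1"}, {"user": "u2"}, {"user": "u1"}]

# Speed layer: approximate, incrementally updated counts, one event at a time.
speed_view = {}
for e in events:
    speed_view[e["user"]] = speed_view.get(e["user"], 0) + 1

# Batch layer: periodically recomputes the exact view from the full history.
def batch_recompute(history):
    view = {}
    for e in history:
        view[e["user"]] = view.get(e["user"], 0) + 1
    return view

batch_view = batch_recompute(events)

# Serving layer: the authoritative batch view overrides the speed view,
# which fills in only the keys the last batch run has not yet covered.
merged = {**speed_view, **batch_view}
```

The trade-off this structure buys is exactly the one in the summary: the speed layer gives freshness, the batch layer gives accuracy, and the serving layer hides the seam from queries.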