3.4 Big Data Processing
Many modern systems generate or need to analyze massive datasets—"Big Data"—that are too large to be handled by a single machine or a traditional database. This requires specialized distributed processing frameworks. The two primary models for processing Big Data are Batch Processing and Stream Processing.
3.4.a Batch Processing
Batch processing is a method where a large volume of data is collected over time, grouped into a "batch," and then processed all at once in a single, large job.
- Analogy: Doing all of your laundry for the week on Sunday. You collect a large batch of clothes and run a big job to wash them all.
- Key Characteristics:
- High Throughput: It's optimized to process massive amounts of data efficiently.
- High Latency: There is a significant delay between when data is generated and when the results are available because the job runs on a schedule (e.g., hourly, daily). It is not real-time.
- Bounded Data: It operates on a finite, static dataset (e.g., "all the sales from yesterday").
- Use Cases:
- Generating end-of-day financial reports.
- Large-scale ETL (Extract, Transform, Load) jobs to populate a data warehouse.
- Calculating complex analytics over a large historical dataset.
- Training machine learning models.
- Common Technologies:
- Hadoop MapReduce: The original, foundational framework for distributed batch processing.
- Apache Spark: A more modern, faster, and more general-purpose framework that has largely superseded MapReduce for most batch tasks due to its in-memory processing capabilities.
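The map/shuffle/reduce shape that MapReduce popularized can be sketched in plain Python. This is not a real distributed job, just the three phases on a single machine; the sales records and field names (`region`, `amount`) are invented for illustration.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical bounded dataset: yesterday's sales, fully collected
# before the batch job starts.
sales = [
    {"region": "EU", "amount": 120.0},
    {"region": "US", "amount": 75.5},
    {"region": "EU", "amount": 30.0},
]

def map_phase(record):
    # Emit (key, value) pairs — here, one pair per sale record.
    yield (record["region"], record["amount"])

def shuffle(pairs):
    # Group all values by key, as the framework's shuffle step would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate each group — here, total sales per region.
    return (key, sum(values))

pairs = chain.from_iterable(map_phase(r) for r in sales)
result = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
# result maps each region to its total: {"EU": 150.0, "US": 75.5}
```

The defining property of the batch model is visible here: every phase assumes the whole dataset is available up front, and no result exists until the final reduce completes.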
3.4.b Stream Processing
Stream processing is a method where data is processed continuously, event-by-event, as it arrives. It operates on data that is in motion.
- Analogy: Washing each dish immediately after you finish eating. You process each item (data event) as it comes, in near real-time.
- Key Characteristics:
- Low Latency: It's designed to provide results within milliseconds or seconds of the data arriving.
- Real-Time: It enables immediate reaction to events.
- Unbounded Data: It operates on a potentially infinite stream of data with no defined end (e.g., a continuous stream of sensor data or website clicks).
- Use Cases:
- Real-time fraud detection in credit card transactions.
- Live monitoring of user activity on a website (clickstream analysis).
- Real-time analytics dashboards.
- Alerting on anomalies from system logs or IoT sensor data.
- Common Technologies:
- Apache Flink: A powerful stream processing framework known for its performance and robust state management.
- Apache Spark Streaming / Structured Streaming: Spark's modules for processing live data streams (often using a "micro-batch" approach).
- Kafka Streams: A client library for building real-time applications and microservices that process data directly from Kafka topics.
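The contrast with batch is easiest to see in code. Below is a framework-free sketch of the stream model in plain Python: events are processed one at a time as they arrive, running state is kept between events, and alerts are emitted immediately. The transaction fields (`card`, `amount`) and the fraud threshold are invented for illustration.

```python
def process_stream(events, threshold=1000.0):
    """Process each transaction as it arrives; react immediately to anomalies."""
    totals = {}  # running state per card, maintained across events
    for event in events:  # in a real system this stream never ends
        card = event["card"]
        totals[card] = totals.get(card, 0.0) + event["amount"]
        if event["amount"] > threshold:
            # Emit the alert now, not at the end of a scheduled job.
            yield ("ALERT", card, event["amount"])

alerts = list(process_stream([
    {"card": "A", "amount": 20.0},
    {"card": "A", "amount": 1500.0},
    {"card": "B", "amount": 5.0},
]))
# alerts -> [("ALERT", "A", 1500.0)]
```

Frameworks like Flink add what this sketch lacks: distributed execution, fault-tolerant state, and event-time windowing — but the per-event, stateful loop is the core idea.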
Comparison: Batch vs. Stream
| Feature | Batch Processing | Stream Processing |
|---|---|---|
| Data Scope | Bounded, large, static datasets | Unbounded, continuous data streams |
| Latency | High (minutes to hours) | Low (milliseconds to seconds) |
| Throughput | Optimized for high bulk throughput | Optimized for per-event latency over bulk throughput |
| Analysis | Complex, deep analysis over full history | Incremental analysis (filtering, aggregations, windowing) |
Summary for an Interview
- Understand the fundamental difference: Batch processes large, finite datasets with high latency, while Stream processes unbounded data continuously with low latency.
- Be able to choose the right model for a given task. For example, "To generate our weekly user engagement report, we'll run a nightly batch job using Apache Spark. To detect fraudulent transactions as they happen, we need a stream processing solution using Apache Flink."
- Acknowledge that many complex systems use both models. For example, a streaming pipeline for real-time dashboards and a separate batch pipeline for more accurate, in-depth historical analysis (this is often called a "Lambda Architecture").
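The Lambda Architecture mentioned above can be sketched minimally: a speed layer keeps an incrementally updated view, a batch layer periodically recomputes an authoritative view from the full history, and a serving layer merges them. The event shape (`user` clicks) and both views are invented for illustration; in practice the two layers run on different infrastructure (e.g., Flink and Spark).

```python
# Hypothetical event log, shared by both layers.
events = [{"user": "u1"}, {"user": "u2"}, {"user": "u1"}]

# Speed layer: approximate, incrementally updated counts, one event at a time.
speed_view = {}
for e in events:
    speed_view[e["user"]] = speed_view.get(e["user"], 0) + 1

# Batch layer: periodically recomputes the exact view from the full history.
def batch_recompute(history):
    view = {}
    for e in history:
        view[e["user"]] = view.get(e["user"], 0) + 1
    return view

batch_view = batch_recompute(events)

# Serving layer: the authoritative batch view overrides the speed view,
# which fills in only the keys the last batch run has not yet covered.
merged = {**speed_view, **batch_view}
```

The trade-off this structure buys is exactly the one in the summary: the speed layer gives freshness, the batch layer gives accuracy, and the serving layer hides the seam from queries.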