3.4 Big Data Processing

Many modern systems generate or need to analyze massive datasets—"Big Data"—that are too large to be handled by a single machine or a traditional database. This requires specialized distributed processing frameworks. The two primary models for processing Big Data are Batch Processing and Stream Processing.


3.4.a Batch Processing

Batch processing is a method where a large volume of data is collected over time, grouped into a "batch," and then processed all at once in a single, large job.

  • Analogy: Doing all of your laundry for the week on Sunday. You collect a large batch of clothes and run a big job to wash them all.
  • Key Characteristics:
    • High Throughput: It's optimized to process massive amounts of data efficiently.
    • High Latency: There is a significant delay between when data is generated and when the results are available because the job runs on a schedule (e.g., hourly, daily). It is not real-time.
    • Bounded Data: It operates on a finite, static dataset (e.g., "all the sales from yesterday").
  • Use Cases:
    • Generating end-of-day financial reports.
    • Large-scale ETL (Extract, Transform, Load) jobs to populate a data warehouse.
    • Calculating complex analytics over a large historical dataset.
    • Training machine learning models.
  • Common Technologies:
    • Hadoop MapReduce: The original, foundational framework for distributed batch processing.
    • Apache Spark: A more modern, faster, and more general-purpose framework that has largely superseded MapReduce for most batch tasks due to its in-memory processing capabilities.
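The batch model described above can be sketched in plain Python. This is a toy, single-machine illustration of the MapReduce shape (map, shuffle, reduce) that frameworks like Hadoop or Spark distribute across a cluster; the dataset and function names are invented for the example.

```python
from collections import defaultdict

# Toy sketch of the MapReduce batch model (single machine, no cluster):
# a bounded dataset is mapped to key/value pairs, shuffled by key,
# then reduced to per-key totals.

def map_phase(records):
    # Emit (product, amount) pairs from raw sales lines like "widget,9.50".
    for line in records:
        product, amount = line.split(",")
        yield product, float(amount)

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each key's values into a final result.
    return {key: sum(values) for key, values in groups.items()}

# A finite, static dataset: "all the sales from yesterday."
yesterdays_sales = ["widget,9.50", "gadget,12.00", "widget,3.25"]
report = reduce_phase(shuffle(map_phase(yesterdays_sales)))
print(report)  # {'widget': 12.75, 'gadget': 12.0}
```

Note that the entire dataset exists before the job starts and the job produces one final answer — the defining traits of batch processing.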

3.4.b Stream Processing

Stream processing is a method where data is processed continuously, event-by-event, as it arrives. It operates on data that is in motion.

  • Analogy: Washing each dish immediately after you finish eating. You process each item (data event) as it comes, in near real-time.
  • Key Characteristics:
    • Low Latency: It's designed to provide results within milliseconds or seconds of the data arriving.
    • Real-Time: It enables immediate reaction to events.
    • Unbounded Data: It operates on a potentially infinite stream of data with no defined end (e.g., a continuous stream of sensor data or website clicks).
  • Use Cases:
    • Real-time fraud detection in credit card transactions.
    • Live monitoring of user activity on a website (clickstream analysis).
    • Real-time analytics dashboards.
    • Alerting on anomalies from system logs or IoT sensor data.
  • Common Technologies:
    • Apache Flink: A powerful stream processing framework known for its performance and robust state management.
    • Apache Spark Streaming / Structured Streaming: Spark's modules for processing live data streams (often using a "micro-batch" approach).
    • Kafka Streams: A client library for building real-time applications and microservices that process data directly from Kafka topics.
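Event-at-a-time processing with windowed state — the core pattern behind Flink and Kafka Streams — can be sketched in plain Python. This is a conceptual illustration, not a real framework API; the window size and alert threshold are invented for the example.

```python
from collections import deque

# Minimal sketch of per-event stream processing: each transaction is
# handled the moment it "arrives," and a sliding window of recent
# amounts per card drives a simple fraud-style alert.

WINDOW_SIZE = 3          # keep the last 3 events per card
ALERT_THRESHOLD = 500.0  # alert if windowed spend exceeds this

windows = {}  # card_id -> deque of recent amounts (the operator's state)
alerts = []

def on_event(card_id, amount):
    # Called once per incoming event -- the heart of stream processing.
    window = windows.setdefault(card_id, deque(maxlen=WINDOW_SIZE))
    window.append(amount)
    if sum(window) > ALERT_THRESHOLD:
        alerts.append((card_id, sum(window)))

# A finite sample standing in for an unbounded stream of transactions.
for card, amt in [("c1", 100.0), ("c1", 250.0), ("c2", 40.0), ("c1", 300.0)]:
    on_event(card, amt)

print(alerts)  # [('c1', 650.0)]
```

Unlike the batch example, there is no "end of the data": state (the per-card windows) lives across events, and results are emitted as soon as a condition is met.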

Comparison: Batch vs. Stream

| Feature | Batch Processing | Stream Processing |
|---|---|---|
| Data Scope | Bounded, large, static datasets | Unbounded, continuous data streams |
| Latency | High (minutes to hours) | Low (milliseconds to seconds) |
| Throughput | Optimized for high bulk throughput | Optimized for low-latency event processing |
| Analysis | Complex, deep analysis | Simple analysis, aggregations, filtering |

Summary for an Interview

  • Understand the fundamental difference: Batch processes large, finite datasets with high latency, while Stream processes unbounded data continuously with low latency.
  • Be able to choose the right model for a given task. For example, "To generate our weekly user engagement report, we'll run a nightly batch job using Apache Spark. To detect fraudulent transactions as they happen, we need a stream processing solution using Apache Flink."
  • Acknowledge that many complex systems use both models. For example, a streaming pipeline for real-time dashboards and a separate batch pipeline for more accurate, in-depth historical analysis (this is often called a "Lambda Architecture").
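The Lambda Architecture's serving layer can be sketched as a merge of two views. This is a hedged, illustrative sketch: a precomputed batch view (accurate but stale) is combined with a real-time view (fresh but covering only events since the last batch run); the metric name and numbers are invented.

```python
# Lambda Architecture serving-layer sketch: answer queries by merging
# the nightly batch view with the real-time (speed-layer) view.

batch_view = {"page_views": 10_000}  # recomputed nightly over all history
realtime_view = {"page_views": 42}   # events seen since the last batch run

def query(metric):
    # Merge both views to serve an up-to-date answer.
    return batch_view.get(metric, 0) + realtime_view.get(metric, 0)

print(query("page_views"))  # 10042
```

Each nightly batch run replaces the batch view and resets the real-time view, so long-term accuracy comes from the batch layer while freshness comes from the stream layer.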