
When you think of Big Data, you probably think of massive databases holding millions or even billions of records. This certainly is one side of Big Data, but all that data is just collecting cyber dust if you don’t have the resources and expertise to put it to work for you.

Batch processing is an efficient way to generate insights when working with a high volume of data. Hadoop’s MapReduce paradigm, which divides massive datasets across a cluster of commodity hardware, is the classic example of batch processing. Depending on the size of the job, processing can take anywhere from minutes to hours. But what if you need insights in or near real time? When your most important consideration is extracting near real-time insights from massive amounts of data, you need streaming data.

In this article, we’ll look at what streaming data is, how it compares with traditional batch processing, what a streaming data pipeline looks like, and some common applications of streaming data.

Understanding the Difference Between Batches and Streams

How you turn raw data into valuable insights comes down to the kind of question you’re asking. Do you want data to inform your long-term growth strategy? Or are you trying to make sure your existing operations run as efficiently as possible?

Let’s say, for example, that you’re an online video service that wants to get into producing original programming. You could use Big Data to look at the viewing data of your millions of users to help identify niches that aren’t being served and use those insights to inform which new projects you decide to greenlight. Batch processing is ideal for generating this kind of strategic insight when there’s no particular advantage or need to process the data in real time.

But what if you’re trying to answer a more prosaic question, one that’s not as complicated but requires a response in seconds or even milliseconds? For example, let’s say you run a factory that builds widgets and you want to keep all of its widget-making robots running smoothly. To do that, it would be helpful to know in advance when one of the robots requires maintenance. In this case, looking at real-time sensor data from the production line could identify potential problems before they result in a shutdown. But there might be thousands of different sensors, each one generating hundreds of kilobytes of data every minute. To get the insights you need from that data and avoid production slowdowns, you’ll need a system that can process a lot of data very, very quickly. Streaming data is ideal for this kind of situation.

How Streaming Works

As the name implies, streaming data processing is meant to handle and analyze large volumes of data that are constantly in motion. There are a few different ways to accomplish this, each with its own advantages and disadvantages. The first approach is called native stream processing (or tuple-at-a-time processing). In this approach, every event is processed as it comes in, one after the other, resulting in the lowest possible latency. Unfortunately, processing every incoming event individually is also computationally expensive. Micro-batch processing makes the opposite tradeoff, dividing incoming events into batches based either on arrival time or on a maximum batch size. This reduces the computational cost of processing but can also introduce more latency.
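To make the tradeoff concrete, here is a minimal Python sketch of the two approaches. The `event_source` iterator and `handle` function are hypothetical placeholders standing in for an incoming event stream and whatever processing you apply; they are not part of any particular framework.

```python
import time

def native_stream(event_source, handle):
    """Tuple-at-a-time: process every event as soon as it arrives."""
    for event in event_source:
        handle([event])          # lowest latency, but one processing call per event

def micro_batch(event_source, handle, max_size=100, max_wait=1.0):
    """Micro-batching: group events by size or by an arrival-time window."""
    batch, deadline = [], time.monotonic() + max_wait
    for event in event_source:
        batch.append(event)
        if len(batch) >= max_size or time.monotonic() >= deadline:
            handle(batch)        # cheaper per event, but adds latency
            batch, deadline = [], time.monotonic() + max_wait
    if batch:
        handle(batch)            # flush whatever is left at the end

# Toy usage: print every event individually, then print batch sizes.
native_stream(range(5), print)
micro_batch(range(1000), lambda b: print(len(b)), max_size=250)
```

Real frameworks layer distribution, fault tolerance, and backpressure on top of this basic loop, but the latency-versus-throughput tradeoff is the same one shown here.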

As with batch processing, both of these streaming approaches divide jobs up across multiple machines, but they operate on much smaller chunks of data at a time. The other major difference is that with streaming, the data is processed and analyzed before it ever lands in a data warehouse.

Ingestion

Before you can start processing streaming data, you have to get it into the system. If you’re already using the Hadoop ecosystem for your storage and analytics, one option is Kafka, a highly scalable, fault-tolerant publish-subscribe messaging system that integrates naturally with the most popular open-source stream processing platforms. Another popular ingestion tool, Flume, can be configured to work with streaming data, but may not scale as easily.
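On the producer side, pushing events into Kafka can be as simple as the following sketch. It assumes the kafka-python client, a broker at localhost:9092, and a made-up `factory-sensors` topic; the `sensor_readings()` generator is a hypothetical stand-in for the factory sensors described earlier.

```python
import json
import random
import time

from kafka import KafkaProducer  # kafka-python client

def sensor_readings():
    """Hypothetical stand-in for a real stream of sensor events."""
    for i in range(10):
        yield {"sensor_id": i % 3, "value": random.random(), "ts": time.time()}

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",            # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for reading in sensor_readings():
    producer.send("factory-sensors", reading)      # assumed topic name

producer.flush()                                   # make sure everything is delivered
```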

If your team is already using Amazon Web Services, then you might want to consider Kinesis, which is very similar to Kafka but has the advantage of integrating nicely with other AWS tools, including Elasticsearch, EC2, and Lambda. That said, being a managed cloud service, Kinesis (like other AWS services) runs the risk of higher latency than a self-managed Kafka or Flume deployment.
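The equivalent ingestion step on Kinesis looks much the same with boto3. The stream name and region are placeholders, and `sensor_readings()` is the same hypothetical generator used in the Kafka sketch above.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # assumed region

for reading in sensor_readings():                  # hypothetical event generator
    kinesis.put_record(
        StreamName="factory-sensors",              # assumed stream name
        Data=json.dumps(reading).encode("utf-8"),
        PartitionKey=str(reading["sensor_id"]),    # controls shard assignment
    )
```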

Processing

Once you have the data, you need a tool to process it. There are a number of open-source and proprietary streaming platforms, and which you go with will depend on your specific needs. Two of the most popular open-source options are Storm and Spark Streaming. Both are top-level Apache projects that can work with Hadoop and Amazon Web Services.

Of the two, Storm is generally considered better suited to pure stream processing. Storm is task parallel, meaning it can spread its tasks across many machines in order to efficiently execute multiple operations at once. One disadvantage, though, is that Storm doesn’t run natively on Hadoop clusters; instead, it runs as its own cluster and relies on ZooKeeper for coordination.

If you’re already operating a Hadoop cluster and need the ability to perform batch processing in addition to streaming, then Spark Streaming might be a more appealing option. While Storm is a true streaming system, Spark Streaming relies on micro-batches. Spark Streaming is data parallel, meaning it can use multiple machines to perform the same operation simultaneously on different partitions of a distributed dataset. In effect, it applies the same map-and-reduce style of computation as Hadoop’s MapReduce, but over small, in-memory batches processed in rapid succession.
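As a rough illustration of what that looks like in practice, here is a minimal Spark Streaming (DStream) sketch in Python that counts events per sensor in 10-second micro-batches. The socket source and the comma-separated message format are assumptions made for the example, not a prescribed setup.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "SensorCounts")
ssc = StreamingContext(sc, 10)                     # 10-second micro-batches

# Assumed source: a TCP socket emitting lines like "sensor_42,0.73"
lines = ssc.socketTextStream("localhost", 9999)

counts = (lines.map(lambda line: (line.split(",")[0], 1))
               .reduceByKey(lambda a, b: a + b))   # events per sensor, per batch
counts.pprint()                                    # print each batch's counts

ssc.start()
ssc.awaitTermination()
```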

Applications

If you’re still trying to sort out whether your project is a good fit for streaming data, ask yourself if the value of the data decreases over time. If the answer is yes, then you might be looking at a case where real-time processing and analysis is beneficial. Otherwise, you’re probably better off sticking with traditional batch processing.

Another exciting application of streaming data is in the field of machine learning. While most organizations that use machine learning periodically retrain their models on updated datasets, streaming data enables incremental (online) learning algorithms that update themselves continuously as new data arrives.
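As a small illustration of the idea (not a production pipeline), scikit-learn’s `SGDClassifier` supports incremental updates through `partial_fit`, so a model can keep learning as labeled events arrive. The `labeled_event_batches()` generator below is a synthetic stand-in for a real labeled stream.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def labeled_event_batches(n_batches=20, batch_size=50):
    """Hypothetical stand-in for labeled events arriving off a stream."""
    rng = np.random.default_rng(0)
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 4))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)    # synthetic labels
        yield X, y

model = SGDClassifier()
classes = np.array([0, 1])          # all classes must be declared on the first call

for X_batch, y_batch in labeled_event_batches():
    model.partial_fit(X_batch, y_batch, classes=classes)
    # The model can serve predictions at any point while it keeps learning.
```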

Here are a few use cases to illustrate the value of streaming data processing.

  1. An e-commerce platform uses customer browsing behavior to suggest additional products or services the user might be interested in during the same session.
  2. A power company analyzes usage data from millions of customers to proactively reroute power from different sources to prevent outages.
  3. A hedge fund analyzes the movements of thousands of stocks to identify arbitrage opportunities.
  4. A web security firm looks at millions of events to proactively identify hacking attempts or DDoS attacks.
  5. A music streaming service looks at user-listening data to automatically improve its user recommendations.

Notice that in some of these cases, the same data can serve both a real-time use case and a separate batch processing use case. That’s why many organizations have adopted a hybrid approach, wherein data first flows through a streaming platform and is then collected and analyzed in larger batches later on.

Are You Ready to Stream?

Taking full advantage of streaming data is a significant technical undertaking, requiring expertise across multiple areas. You’ll need database pros with experience building complex systems that can interact with Hadoop or AWS, as well as data scientists who know how to build a pipeline that turns a torrent of data into valuable insights. Explore freelancers on Upwork today.
