In an earlier blog post I discussed how more and more digital marketing organizations depend on Big Data analyses of billions, or even trillions, of pieces of data to generate business intelligence and drive research-based decision making (see Big Data and the New Marketing Paradigm). Computing on the scale required to handle Big Data comes with a number of challenges. To meet them, Big Data systems usually break the work into a series of stages, each of which is individually scalable:

  1. Collect
  2. Store
  3. Analyze
  4. Share

Big Data and Cloud Services

Amazon recently conducted a Big Data event in Boston where its web services division (AWS) shared with the community how customers are leveraging its cloud platform to process large data and calculation sets.

Traditionally, scientific or marketing organizations with complex computational data sets incurred large infrastructure costs maintaining the number of computers required to handle them. Now, with third-party cloud computing services, a customer can spin up hundreds of computing or storage instances on demand, use them in parallel to process its large problems in a short time, and then release the resources, paying only for the hours of computing power that were consumed. In short, with major third-party players like Amazon’s AWS or Microsoft Azure, scientific and industrial firms now have access to massive computational power, available in minutes, at a fraction of the cost of hosting their own machines.

Barriers to Entry

According to the experts, the hardest step in distributed computing is scaling from one machine to two. Once you surpass the two-machine threshold, adding more machines is not as complex. The main reason for this is that your big data analysis application architecture needs to be (re)designed to take advantage of the scaling capabilities that cloud computing offers. There may be costs associated with re-architecting your software, but once it is done, scaling your computing capability becomes a simple matter of calling on more remote boxes (as opposed to buying more and staffing their maintenance). That reduces the need for hiring and is a tangible resource savings.

What is a cloud computing framework?

Cloud computing is not just about being able to create and use large numbers of machines but about architecting systems that scale easily, so they can leverage the ability to add more machines and achieve big data computing more efficiently. To facilitate the ability to scale processing power, software frameworks like Hadoop have emerged. Hadoop is an open-source framework for distributing storage and processing across many machines, and Amazon Elastic MapReduce (EMR) runs it as a hosted service on AWS. Alongside such frameworks, cloud platforms provide rich functionality that allows engineers and architects to dynamically manage their infrastructure: servers, databases, load balancers, switches and firewalls. These tools allow for complex programs that can automatically create additional server images or start other resources to utilize in a distributed processing application. Many large enterprises, researchers and analysts have used these frameworks to conduct big data projects like analyzing the human genome, collecting data about the cosmos, or financial modeling and analysis, among many others.
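To make the idea behind MapReduce-style frameworks a little more concrete, here is a minimal sketch of the map, shuffle and reduce steps in plain Python. It is only an illustration of the programming model, not Hadoop's or EMR's actual API: the function names (map_chunk, shuffle, reduce_counts, word_count) are made up for this example, and a local thread pool stands in for the separate machines a real cluster would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle step: group the emitted values by key across all chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks, workers=4):
    """Run the three steps; each chunk is mapped independently of the others.

    The thread pool is a stand-in (an assumption of this sketch) for the
    worker machines that a framework would schedule the map tasks onto.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data in the cloud", "Big Data at scale"]))
```

The point of the structure is that map_chunk never looks at more than one chunk, so adding more workers (or machines) simply means handing out more chunks at once. That independence is what lets this style of program scale out.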

The Changing Economics of Big Data Computing

Distributed computing is changing the economics of processing large data sets. Because cloud services like AWS charge on a per-hour basis, you can enlist tens or hundreds of processing cores or machines to process your complex data and then shut them down when you’re done. This means you only incur cost during the period in which the machines are being utilized. Before the cloud phenomenon, companies requiring big data capabilities needed to purchase the machines that comprised the processing power they required, even if that capacity was only fully utilized some of the time. The cost proposition is now very different.
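The pay-per-hour argument above comes down to simple arithmetic, sketched below. Every figure in it is a made-up placeholder chosen only to show the shape of the comparison, not a real AWS price or hardware quote, and the function names are hypothetical.

```python
def on_demand_cost(machines, hours, rate_per_machine_hour):
    """Cost of renting machines only for the hours they are actually used."""
    return machines * hours * rate_per_machine_hour


def owned_cost(machines, purchase_price, yearly_upkeep, years):
    """Cost of buying machines and maintaining them, busy or idle."""
    return machines * (purchase_price + yearly_upkeep * years)


if __name__ == "__main__":
    # Hypothetical workload: 100 machines, one 10-hour job per month for a year.
    hours_per_year = 10 * 12

    # Hypothetical prices: 0.50 per machine-hour rented, or 2000 to buy a
    # machine plus 500 per year to power and maintain it.
    rented = on_demand_cost(100, hours_per_year, 0.50)
    owned = owned_cost(100, 2000, 500, years=1)

    print(f"rented for the year: {rented:,.0f}")
    print(f"owned for the year:  {owned:,.0f}")
```

With these placeholder numbers the rented cluster is far cheaper, because the machines sit idle most of the year. The gap narrows as utilization rises, so the comparison is worth rerunning with your own workload and current prices.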

I hope this overview helps folks better understand the challenges of big data and how new technologies and architectures are emerging to address them.