Lots of machines, bulk of data and thousands of files to be operated for a specific task. Work is all mishmash. Then, there arises issues of inter-process communication, network failure and scalability. Distributed world seems a mess. A big thanks to the MapReduce paradigm for preventing this situation to occur anymore. It has simplified the processes to a great extent through its distinct features. Basically, it is a programming model for computing larger dataset on a cluster and transforming into aggregated results. Its key features include:
- Redundancy and fault tolerance,
- Automatic parallelization and distribution,
- Language independence, Management of inter-process communication,
- Failure detection, and so on.
Now there is only business logic which should be considered, the complexities of business environment will be handled by MapReduce itself. MapReduce comprises of two functions, which are:
MAP FUNCTION: Map() requires a “key-value” pair as an input and it produces an intermediate key-value pair that will be used in next step. It is applied in parallel to every pair of data given in the input.
Related Resources from B2C
» Free Webcast: Blogging in the Age of Modern Marketers
REDUCE FUNCTION: Reduce() takes list of values for each key and then applies the aggregation logic on that list. The output is further sorted by key and there appears a desired result.
Suppose we have to find out the best scores made in the ODIs that were played from 1971 to 2000.
Let the following be the data of one file:
Assuming that we have around 3000 such files, each of which contains the runs that were scored in the ODIs by different players. Now when we use this model, the Map function will go through the data of these files and return the maximum score given by every player. The remaining set of data will be:
In the same way, data of 10 other files has gone through the Map function. So, following is the data which we are having from different files:
Now, this set will work as an input for Reduce function. Reduce() will generate an output for the maximum score of each player, thus giving us the highest scores of the given duration as:
It would have been a very hectic task to perform this operation in parallel over such a huge data by using normal programming techniques and nearly impossible to get the desired result without a pile of issues. MapReduce framework is a blessing in this regard, as it has made the computation far easier and simpler. Being a powerful paradigm, it has catered many such issues which were headache for the developers otherwise. If you want a detailed Hadoop MapReduce tutorial, you can watch this MapReduce video for a detailed technical explanation about Hadoop MapReduce. Even you are looking forward to be an expert in Hadoop Technology, you can join a Hadoop training program within your budget. Below are some options which you can explore:
- Horton Works: http://hortonworks.com/
- Intellipaat Software Solutions: http://intellipaat.com/
- Intell Hadoop:http://hadoop.intel.com/training/