Mastering Big Data with Distributed Processing


Distributed processing is the backbone of Big Data. How to efficiently store and process data with a scalable solution is the focus in this part of my series “Learning Big Data the right way”.

We will talk about data locality with HDFS. And to top it off, I will show you with a real-world example how distributed processing with MapReduce works.

In part one of the series, I ranted about the extract, transform, load (ETL) problem of SQL databases and how hard it is to scale such systems. If you missed it, you can find it here: Link

Finally, distributed processing explained

Distributed processing means that you are not doing the analytics on a single server. You do it in parallel on multiple machines.

A distributed setup usually consists of a master, which manages the whole process, and slaves, which do the actual processing.

To process data in parallel, the master needs to load the data, chop it up into pieces and distribute them to the slaves for computation. Each slave then does the analytics (calculations) you programmed.

Another ETL nightmare

Distributing data from the master to the slaves has one major flaw: because the needed data is stored on some source system, the master would first have to load it from that source and store it temporarily.

This approach basically creates the same problems as the ETL process of SQL databases. For very large datasets, loading data to a single master and distributing it to the slaves is very inefficient.

Another extract-bottleneck. 🙁

Waking up from the nightmare with data locality through HDFS

The ETL problem can be avoided by not using a central data store. Storing pieces of the data locally on every processing slave is the key.

The master only manages which piece of data is stored on which slave. For processing, the master then only needs to tell each slave which block of data stored on that node has to be processed.

This perfectly brings us to the Hadoop Distributed File System, HDFS.

The data locality described above is exactly what HDFS does. It is the foundation of Hadoop and the first thing you need to learn and understand.

You can think of HDFS as a file system that sits on top of the file system on each server. The difference between a normal file system like NTFS or ext* and HDFS is that HDFS is a distributed file system.

This means it spans all connected slaves, which are called data nodes. Stored files automatically get chopped up into 128 MB blocks.

These blocks are then automatically distributed over the network to the data nodes. Every node only gets some pieces of the file, depending on the size of the cluster.

To avoid losing data when a server fails, every block is automatically replicated twice to other nodes, so the cluster holds three copies of each block.
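If you want to see this block placement with your own eyes, here is a minimal sketch using the Hadoop FileSystem Java API. It asks the name node where the blocks of a file live and on which data nodes the replicas sit. The path /data/sensor-readings.csv is a made-up example, so point it at a file that actually exists in your cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS etc. from core-site.xml / hdfs-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file; replace with a path that exists in your cluster
        Path file = new Path("/data/sensor-readings.csv");
        FileStatus status = fs.getFileStatus(file);

        // Ask for the block locations of the whole file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block is stored on several data nodes because of replication
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```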

Boom done, that’s it, right? All problems solved? Well, not quite.

Very often you need to further process the distributed results to create a desired output. One solution would be to send all the results from the nodes back to the master for further processing.

But, sending results of the nodes back to a single master is a bad idea. It creates another bottleneck.

While the results from the nodes should be a lot smaller than the raw input, they can still be quite large (gigabytes).

This problem can be solved by using MapReduce.

Distributed processing with MapReduce

Since the early days of the Hadoop ecosystem, the MapReduce framework has been one of the main components of Hadoop, alongside the Hadoop Distributed File System, HDFS.

Google implemented MapReduce to analyse the stored HTML content of websites by counting all the HTML tags, all the words and combinations of them (for instance headlines). The output was then used to create the page ranking for Google Search.

That was when everybody started to optimise their websites for Google Search. Serious search engine optimisation was born. That was in 2004.

Back to our problem. What would be the solution for the bottleneck created by sending data back to the master? Well, MapReduce processes the data in two stages:

The map phase and the reduce phase.

In the map phase, the framework reads data from HDFS. Each dataset is called an input record.

Then there is the reduce phase. In the reduce phase, the actual computation is done and the results are stored. The storage target can be a database, HDFS again, or something else.

After all, it's Java, so you can implement whatever you like.

The magic of MapReduce is how the map and reduce phases are implemented and how the two phases work together.

The map and reduce phases are parallelised. That means you can have multiple map phases (mappers) and reduce phases (reducers) running in parallel on your cluster machines.

How MapReduce actually works

First of all, the whole map and reduce process relies heavily on using key/value pairs. That’s what the mappers are for.

In the map phase, input data, for instance a file, gets loaded and transformed into key/value pairs.

When each map phase is done, it sends the created key/value pairs to the reducers, where they get sorted by key. This means that an input record for the reduce phase is a key together with the list of all values from the mappers that share that key.

The reduce phase then does the computation on that key and its values and outputs the results.

How many mappers and reducers can you use in parallel? The number of parallel map and reduce processes depends on how many CPU cores you have in your cluster. Every mapper and every reducer uses one core.

This means that the more CPU cores you have, the more mappers you can use and the faster the extraction gets done. The more reducers you use, the faster the actual computation finishes.

To make this more clear, I have prepared an example.

A real-world Internet of Things example

As I said before, MapReduce works in two stages: map and reduce. Often these stages are explained with a word count task.

Personally, I hate this example because counting stuff is too trivial and does not really show you what you can do with MapReduce. Therefore, we are going to use a more realistic use case from the world of the Internet of Things (IoT).

IoT applications create an enormous amount of data that has to be processed. This data is generated by physical sensors that take measurements, like the room temperature at 8:00 in the morning.

Every measurement consists of a key (the timestamp when the measurement has been taken) and a value (the actual value measured by the sensor).

Because you usually have more than one sensor on your machine, or connected to your system, the key has to be a compound key. In addition to the measurement time, compound keys contain information about the source of the signal.

But let's forget about compound keys for now. Today we have only one sensor. Each measurement outputs a key/value pair like: timestamp-value.

The goal of this exercise is to create average daily values of that sensor’s data.

How MapReduce deals with IoT data

The image below shows how the map and reduce process works.

First, the map stage loads the unsorted data (input records) from the source (e.g. HDFS) as key/value pairs (key: 2016-05-01 01:02:03, value: 1).

Then, because the goal is to get daily averages, the hour:minute:second information is cut from the timestamp.

That is all that happens in the map phase, nothing more.
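To make this concrete, here is a rough sketch of such a mapper in Java, the language classic Hadoop MapReduce jobs are written in. I am assuming the input records are plain text lines of the form "2016-05-01 01:02:03,1" (timestamp and value separated by a comma); the class name DailyAverageMapper is made up.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Turns "2016-05-01 01:02:03,1" into the pair ("2016-05-01", 1.0)
public class DailyAverageMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    private final Text day = new Text();
    private final DoubleWritable reading = new DoubleWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split(",");
        if (parts.length != 2) {
            return; // skip malformed records
        }
        // Cut the hour:minute:second part off the timestamp,
        // keeping only the date "2016-05-01" as the key
        day.set(parts[0].trim().substring(0, 10));
        reading.set(Double.parseDouble(parts[1].trim()));
        context.write(day, reading);
    }
}
```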

After all parallel map phases are done, each key/value pair gets sent to the one reducer that handles all the values for this particular key.

Every reducer input record then has a list of values, and you can calculate (1+5+9)/3, (2+6+7)/3 and (3+4+8)/3. That's all.
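The matching reducer is just as simple. Again a hedged sketch rather than production code: it receives one day as the key, loops over all of that day's readings and writes out their average.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives ("2016-05-01", [1.0, 5.0, 9.0]) and writes ("2016-05-01", 5.0)
public class DailyAverageReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    private final DoubleWritable average = new DoubleWritable();

    @Override
    protected void reduce(Text day, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        long count = 0;
        for (DoubleWritable value : values) {
            sum += value.get();
            count++;
        }
        if (count > 0) {
            average.set(sum / count);
            context.write(day, average);
        }
    }
}
```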

What do you think you need to do to generate minute averages?

Yes, you need to cut the key differently. You would cut it like this: "2016-05-01 01:02", keeping the hour and minute information in the key.
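In the mapper sketch above, that change is literally one line. Keeping 16 characters instead of 10 is my assumption, based on the yyyy-MM-dd HH:mm:ss timestamp format used in this example:

```java
// Daily averages:  key = "2016-05-01"
day.set(parts[0].trim().substring(0, 10));

// Minute averages: key = "2016-05-01 01:02"
day.set(parts[0].trim().substring(0, 16));
```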

This also shows why MapReduce is so great for parallel work. In this case, the map stage could be done by nine mappers in parallel, because each map is independent of all the others.

The reduce stage could still be done by three tasks in parallel: one for orange, one for blue and one for green.

That means if your dataset were 10 times as big and you had 10 times the machines, the time to do the calculation would stay the same.

How to scale big data systems

Scaling such a Big Data system is quite easy. Because storage and processing are distributed, all you have to do is add more servers to the cluster.

By adding servers, you increase the number of available disks and hence the storage capacity. You also get more CPUs, which increases the processing capability of the cluster.

Is MapReduce the right thing for you? There must be a catch, right?

I chose to explain distributed processing through MapReduce because it is easy to understand. But MapReduce is not synonymous with distributed processing. It is just one implementation.

Should you use it or look for something else?

MapReduce has been invented for tasks like:

  • Analysing server logs to find out what has been going on
  • Analysing network traffic logs to identify network problems
  • Basically any problem that involves counting stuff in very large files

Here’s the problem with MapReduce:

MapReduce has one big restriction: it is a two-step process, map and reduce.

Once data is processed through the map and reduce phase, it has to be stored again.

If your computation cannot be done in those two steps, you need to run the whole process again: load the previous results from disk and do another map and reduce run, because results cannot be kept in memory and reused later.
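To make that concrete, here is a hedged sketch of a standard Hadoop driver that wires the mapper and reducer from above into a single job. If your computation needed a second map/reduce pass, you would have to point a second job at the output path of the first one. The paths and class names here are made up.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DailyAverageDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // One map/reduce pass: raw readings -> daily averages
        Job job = Job.getInstance(conf, "daily averages");
        job.setJarByClass(DailyAverageDriver.class);
        job.setMapperClass(DailyAverageMapper.class);
        job.setReducerClass(DailyAverageReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/sensor-readings"));
        FileOutputFormat.setOutputPath(job, new Path("/data/daily-averages"));

        // Anything beyond one map + reduce means: persist this output
        // and launch another job that reads /data/daily-averages as its input.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```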

This is where it does not make sense to use MapReduce in my opinion.

Distributed in-memory processing with Apache Spark

If you want to do complex stuff like machine learning, MapReduce is not the weapon of choice.

That's why I'll show you how distributed in-memory processing with Apache Spark works. We will go over a Spark example where we analyse Twitch.tv chat and talk about stream and batch processing.

I will also show you a neat trick for prototyping Spark jobs and directly visualising the results.

Follow this link to view the post: https://andreaskretz.com/2016/08/08/how-everybody-can-harvest-the-power-of-data-mining/

The best way to make sure you don't miss my next posts is to subscribe to my newsletter. Just put in your e-mail address right here and hit subscribe.

This way I will be able to send you an e-mail when I have uploaded the next article in this series.

You can also follow me on Twitter, where I share stuff I find throughout the day.

Wanna do more? Please follow this Link to LinkedIn and hit the like button to share this article with your professional network. That would be super awesome!

See you next time!

Andreas