Some time ago I created a simple and modular big data platform blueprint for myself. It is based on what I have seen in the field and read in tech blogs all over the internet.
Today I am going to share it with you.
Why do I believe it will be super useful to you?
Because, unlike other blueprints, it is not focused on technology. It is built around four common big data platform design patterns.
Following my blueprint will allow you to create a big data platform that fits your needs exactly. Building the right platform will allow data scientists to discover new insights.
It will enable you to handle big data reliably and to make data-driven decisions.
The blueprint is focused on four key areas: ingest, store, analyse and display.
Splitting the platform like this makes it modular, with loosely coupled interfaces.
Why is it so important to have a modular platform?
If you have a platform that is not modular, you end up with something fixed and hard to modify. This means you cannot adjust the platform to the changing requirements of the company.
Because of modularity, it is possible to swap out any component if you need to.
Now, let's talk about each key area in more detail.
Ingestion is all about getting the data in from the source and making it available to later stages. Sources can be anything from tweets and server logs to IoT sensor data, for example from cars.
Sources send data to your API Services. The API is going to push the data into a temporary storage.
The temporary storage gives the later stages simple and fast access to incoming data.
A great solution is to use a messaging queue system like Apache Kafka, RabbitMQ or AWS Kinesis. For specialised applications, people sometimes use caches like Redis instead.
A good practice is to have the temporary storage follow the publish/subscribe pattern. This way the APIs can publish messages and the analytics stage can quickly consume them.
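To make the publish/subscribe idea concrete, here is a toy in-memory sketch of the pattern in Python. It only illustrates the mechanics; a real platform would of course use Kafka, RabbitMQ or Kinesis, and the topic and message names here are made up:

```python
from collections import defaultdict, deque

class MessageBus:
    """Toy publish/subscribe bus: every subscriber of a topic gets its own queue."""

    def __init__(self):
        # topic name -> list of per-subscriber queues
        self.topics = defaultdict(list)

    def subscribe(self, topic):
        queue = deque()
        self.topics[topic].append(queue)
        return queue

    def publish(self, topic, message):
        # fan the message out to every subscriber of the topic
        for queue in self.topics[topic]:
            queue.append(message)

bus = MessageBus()
analytics = bus.subscribe("server-logs")   # the analytics stage subscribes
archive = bus.subscribe("server-logs")     # so does the storage stage

bus.publish("server-logs", {"level": "error", "msg": "disk full"})

print(analytics.popleft())  # both subscribers receive the same message
print(archive.popleft())
```

The important property is the decoupling: the API that publishes does not know or care who consumes, which is exactly what makes the ingest stage swappable.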
The store stage is the typical big data storage where you just keep everything. It enables you to analyse the big picture.
Most of the data might seem useless for now, but it is of the utmost importance to keep it. Throwing data away is a big no-no.
Why not throw data away when it seems useless?
Because, although it seems useless for now, data scientists can still work with it. They might find new ways to analyse the data and generate valuable insights from it.
What kind of systems can be used to store big data?
Systems like Hadoop HDFS, HBase, Amazon S3 or DynamoDB are a perfect fit for storing big data.
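One small but important detail when storing everything: raw data is usually laid out in date-partitioned paths so that later batch jobs only read the slice they need. Here is a minimal sketch of such a key scheme, in the `year=/month=/day=` style commonly seen on HDFS and S3 (the `raw/tweets` prefix is just an example):

```python
from datetime import datetime, timezone

def storage_key(source: str, event_time: datetime) -> str:
    """Build a date-partitioned storage key for a raw-data file."""
    return (
        f"raw/{source}/"
        f"year={event_time.year}/"
        f"month={event_time.month:02d}/"
        f"day={event_time.day:02d}/events.json"
    )

key = storage_key("tweets", datetime(2017, 5, 3, tzinfo=timezone.utc))
print(key)  # raw/tweets/year=2017/month=05/day=03/events.json
```

With a layout like this, a batch job that only needs May's tweets can skip everything else, which matters a lot once the storage grows.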
The analyse stage is where the actual analytics happens, in the form of stream and batch processing.
Streaming data is taken from ingest and fed into analytics. Stream processing analyses the "live" data and therefore generates fast results.
As the central and most important stage, analytics also has access to the big data storage. Because of that connection, analytics can take a big chunk of data and analyse it.
This type of analysis is called batch processing. It delivers the answers to your big questions.
To learn more about stream and batch processing read my blog post: How to Create New and Exciting Big Data Aided Products
The analytics process, batch or streaming, is not a one-way street. Analytics can also write data back to the big data storage.
Writing data back to the storage often makes sense. It allows you to combine previous analytics outputs with the raw data.
Analytics insights give meaning to the raw data when you combine the two. This combination often allows you to create even more useful insights.
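Here is a tiny sketch of what "combining analytics output with raw data" can look like. The numbers and field names (`avg_session_min`, the users, the pages) are purely hypothetical; the point is only the enrichment step:

```python
# Output of an earlier batch job, previously written back to storage
# (hypothetical numbers).
user_stats = {"alice": {"avg_session_min": 42.0}}

# Raw events as they arrived through ingest.
raw_events = [
    {"user": "alice", "page": "/home"},
    {"user": "bob", "page": "/pricing"},
]

# Enrich each raw event with the stats computed for that user, if any.
enriched = [
    {**event, **user_stats.get(event["user"], {})}
    for event in raw_events
]
print(enriched)
```

The enriched records now carry context the raw events alone never had, which is exactly the extra insight the write-back loop buys you.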
A wide variety of analytics tools are available, ranging from MapReduce and AWS Elastic MapReduce to Apache Spark and AWS Lambda.
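To give you a feel for the batch-processing model these tools share, here is the classic MapReduce-style count, written in plain Python. Real frameworks distribute the map and reduce steps across a cluster; this sketch just shows the shape of the computation, with made-up log lines as input:

```python
from collections import Counter
from itertools import chain

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in a log line."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    """Reduce step: sum the counts per word."""
    counts = Counter()
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

log_lines = ["ERROR timeout", "ERROR disk full", "WARN disk"]
pairs = chain.from_iterable(map_phase(line) for line in log_lines)
print(reduce_phase(pairs))
# {'error': 2, 'timeout': 1, 'disk': 2, 'full': 1, 'warn': 1}
```

Because the map step treats every record independently, a framework can run it on thousands of machines in parallel, which is why this model scales to big data in the first place.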
Displaying data is as important as ingesting, storing and analysing it. People need to be able to make data driven decisions.
This is why it is important to have a good visual presentation of the data. Sometimes a lot of different use cases or projects use the platform.
It might not be possible for you to build one perfect UI that fits everyone. In that case, you should enable others to build the perfect UI themselves.
How do you do that? By creating APIs to access the data and making them available to developers.
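As a rough illustration, a data-access API can be as small as an HTTP endpoint that serves analytics results as JSON. This sketch uses only Python's standard library; the endpoint path, the metric name and the value are all invented for the example:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical analytics result the display stage would read from the cluster.
RESULTS = {"daily_active_users": 1042}

class DataAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the current analytics results as JSON to any UI developer.
        body = json.dumps(RESULTS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo output clean

# Port 0 lets the OS pick a free port for this demo.
server = HTTPServer(("127.0.0.1", 0), DataAPI)
threading.Thread(target=server.serve_forever, daemon=True).start()

# A "developer" consuming the API:
with urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/metrics") as resp:
    data = json.loads(resp.read())
server.shutdown()
print(data)  # {'daily_active_users': 1042}
```

Anyone who can speak HTTP and JSON can now build their own dashboard on top, which is the whole point of exposing the data through an API instead of one fixed UI.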
Either way, UI or API, the trick is to give the display stage direct access to the data in the big data cluster. This access allows developers to use analytics results as well as raw data to build the perfect application.
What kind of systems can you use and where?
Here is how the blueprint looks if you replace the symbols with software:
Unfortunately, it is too much to explain every single software platform in this post. That is why, over the following weeks, I will go through each of them with you.
Starting next week, I will show you what Hadoop is and what makes it so popular.
Jump directly to the post: What Is Hadoop And Why Is It So Freakishly Popular?
Make sure not to miss it by subscribing to the newsletter. That way I can send you an e-mail when I publish the post 🙂
In the meantime, I recommend reading up on how stream and batch processing work.
Do you have comments or questions about this article? Please drop a line in the comment section to get in touch with me and the community.