Data-driven companies like Netflix, Facebook, and Twitter dominate their markets. They process incredible amounts of user data, every minute of the day.
How this processing actually works is something they keep quite secret. Today, I am going to show you two simple strategies every data-driven company uses:
Batch processing and stream processing.
Batching and streaming will help you predict what users want. As a result, you can create new products and services based on data instead of gut feeling.
It will almost feel like cheating, because you already know which products and services will work, even before you create them.
Netflix: The prime example for a data driven company
Netflix revolutionised how we watch movies and TV. Currently, over 75 million users watch 125 million hours of Netflix content every day!
Netflix’s revenue comes from a monthly subscription service. So, the goal for Netflix is to keep you subscribed and to get new subscribers.
To achieve this, Netflix is licensing movies from studios as well as creating its own original movies and tv series.
But offering new content is not everything. It is also very important to keep you watching the content that already exists.
To be able to recommend you content, Netflix collects data from its users. And it collects a lot.
Currently, Netflix analyses about 500 billion user events per day. That results in a stunning 1.3 petabytes every day.
All this data allows Netflix to build recommender systems for you. The recommenders show you content that you might like, based on your viewing habits or on what is currently trending.
Ask the big questions – batch processing
Remember your last yearly tax statement?
You break out the folders. You run around the house searching for the receipts.
All that fun stuff.
When you have finally found everything, you fill out the form and send it on its way.
Doing the tax statement is a prime example of a batch process.
Data comes in and gets stored, analytics loads the data from storage and creates an output (insight):
Batch processing is something you do either on demand or on a schedule (like the tax statement). It is used to ask the big questions and gain insights by looking at the big picture.
To do so, batch processing jobs use large amounts of data. This data is provided by storage systems like Hadoop HDFS, which can store lots of data (petabytes) without a problem.
Results from batch jobs are very useful, but the execution time is long, because the amount of data processed is large.
It can take minutes or sometimes hours until you get your results.
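To make the pattern concrete, here is a minimal Python sketch of a batch job: all events already sit in storage, the job loads the complete data set, and it aggregates everything into one insight. The event fields and values are made up for illustration.

```python
from collections import Counter

# Batch pattern sketch: the data is fully stored first, then the job
# runs over the complete data set to produce an aggregated insight.
# The event structure here is an illustrative assumption.
stored_events = [
    {"user": "alice", "title": "House of Cards", "hours": 2.0},
    {"user": "bob",   "title": "House of Cards", "hours": 1.5},
    {"user": "alice", "title": "Narcos",         "hours": 0.5},
]

def batch_job(events):
    """Aggregate total watch hours per title over the whole data set."""
    hours_per_title = Counter()
    for event in events:
        hours_per_title[event["title"]] += event["hours"]
    return hours_per_title

insight = batch_job(stored_events)
print(insight.most_common(1))  # the "big picture" answer
```

The key point is that `batch_job` only runs once all data is available, which is why the results lag behind reality.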
The Netflix batch processing pipeline
When Netflix started out, they had a very simple batch processing system architecture.
The key components were Chukwa, a scalable data collection system, Amazon S3, and Elastic MapReduce.
Chukwa wrote incoming messages into Hadoop sequence files, stored in Amazon S3. These files could then be analysed by Elastic MapReduce jobs.
Jobs were executed regularly, on an hourly and daily basis. As a result, Netflix could learn how people used the services every hour or once a day.
Know what customers want and if they like what you give them
Because you are looking at the big picture, you can create new products. Netflix uses insight from big data to create new TV shows and movies.
They created House of Cards based on data. There is a very interesting TED talk about this that you should watch:
How to use data to make a hit TV show | Sebastian Wernicke:
Batch processing also helps Netflix know the exact episode of a TV show that gets you hooked, not only globally but for every country where Netflix is available.
Check out the article from The Verge.
They know exactly what show works in what country and what show does not.
It helps them create shows that work everywhere, or select which shows to license in different countries. Germany, for instance, does not have the full library that Americans have 🙁
We have to put up with only a small portion of the TV shows and movies. If you have to select, why not select those that work best?
Batch processing is not enough
As a data platform for generating insight, the Chukwa pipeline was a good start. Being able to create hourly and daily aggregated views of user behaviour is very important.
To this day Netflix is still doing a lot of batch processing jobs.
The only problem is: With batch processing you are basically looking into the past.
For Netflix, and data driven companies in general, looking into the past is not enough. They want a live view of what is happening.
Gain instant insight into your data – stream processing
Streaming allows users to make quick decisions and take actions based on “real-time” insight. Contrary to batch processing, streaming processes data on the fly, as it comes in.
With streaming you don’t have to wait minutes or hours to get results. You gain instant insight into your data.
In the batch processing pipeline, the analytics ran after the data storage. It had access to all the available data.
Stream processing creates insight before the data is stored. It only has access to fragments of the data as they come in.
As a result, the scope of the produced insight is also limited, because the big picture is missing.
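For contrast with the batch sketch, here is a minimal Python sketch of the streaming pattern: the insight is updated incrementally with every incoming event, before anything reaches long-term storage. The event structure is again made up for illustration.

```python
from collections import Counter

# Streaming pattern sketch: each event is processed the moment it
# arrives, and the insight (current top title) is available instantly.
# The event structure is an illustrative assumption.
class TrendingCounter:
    def __init__(self):
        self.plays = Counter()

    def on_event(self, event):
        """Process one event on the fly and return the current top title."""
        self.plays[event["title"]] += 1
        return self.plays.most_common(1)[0]

stream = [{"title": "Narcos"}, {"title": "House of Cards"},
          {"title": "House of Cards"}]

counter = TrendingCounter()
for event in stream:
    trending = counter.on_event(event)  # instant insight per event
```

Note that the counter only ever sees the events that have arrived so far, which is exactly the "fragments of data" limitation described above.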
Only with streaming analytics can you create advanced services for the customer. That is why Netflix incorporated stream processing into Chukwa V2.0 and the new Keystone pipeline.
One example of advanced services through stream processing is the Netflix “Trending Now” feature.
Netflix Trending Now feature
One of the newer Netflix features is “Trending Now”. To the average user, it looks like “Trending Now” means “currently most watched”.
This is what I see as trending while I am writing this on a Saturday morning at 8:00 in Germany. But it is so much more.
What is currently being watched is only a part of the data that is used to generate “Trending Now”.
“Trending Now” is created based on two types of data sources: play events and impression events.
What messages those two types actually include is not really communicated by Netflix. I did some research on the Netflix Techblog, and this is what I found out:
Play events include which title you watched last, where you stopped watching, where you used the 30-second rewind, and more.
Impression events are collected as you browse the Netflix library: scrolling up and down, scrolling left or right, clicking on a movie, and so on.
Basically, play events log what you do while you are watching. Impression events capture what you do on Netflix while you are not watching something.
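To make the distinction concrete, the two event types could be sketched like this. Netflix does not publish the real message schemas, so every field name here is an assumption based on the behaviour described above.

```python
from dataclasses import dataclass

# Sketch of the two event types. The real Netflix schemas are not
# public; all field names below are illustrative assumptions.
@dataclass
class PlayEvent:
    user_id: str
    title: str           # what you watched last
    stopped_at_s: int    # where you stopped watching (seconds)
    used_30s_rewind: bool

@dataclass
class ImpressionEvent:
    user_id: str
    action: str          # e.g. "scroll_down", "scroll_right", "click"
    title: str           # the title the action happened on
```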
Real-time streaming architecture
Netflix uses three internet-facing services to exchange data with the client’s browser or mobile app. These services are simple Apache Tomcat-based web services.
The service for receiving play events is called “Viewing History”. Impression events are collected with the “Beacon” service.
The “Recommender Service” makes recommendations based on trend data available to clients.
Messages from the Beacon and Viewing History services are put into Apache Kafka.
It acts as a buffer between the data services and the analytics.
Beacon and Viewing History publish messages to Kafka topics. The analytics subscribes to the topics and gets the messages delivered automatically, in a first-in, first-out fashion.
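The pub/sub pattern can be sketched with a toy in-memory broker. The real pipeline uses Apache Kafka; this stand-in only illustrates how producers publish to named topics and a subscriber receives the messages in first-in, first-out order.

```python
from collections import defaultdict, deque

# Toy in-memory stand-in for Kafka topics, for illustration only.
# Producers (Beacon, Viewing History) publish to named topics; a
# subscriber consumes messages in first-in, first-out order.
class ToyBroker:
    def __init__(self):
        self.topics = defaultdict(deque)

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, topic):
        """Deliver the oldest unconsumed message (FIFO), or None."""
        queue = self.topics[topic]
        return queue.popleft() if queue else None

broker = ToyBroker()
broker.publish("play-events", {"title": "Narcos"})        # Viewing History
broker.publish("impression-events", {"action": "click"})  # Beacon
first = broker.consume("play-events")  # the analytics subscriber
```

A real Kafka cluster adds persistence, partitioning, and replication on top of this basic idea, which is what makes it a safe buffer between the data services and the analytics.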
After the analytics, the workflow is straightforward. The trending data is stored in a Cassandra key-value store. The recommender service has access to Cassandra and makes the data available to the Netflix client.
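This last step can be sketched like this, with a plain dict standing in for the Cassandra key-value store. The key layout and the per-country values are assumptions for illustration.

```python
# Sketch of the final step: the analytics writes trending results into
# a key-value store (Cassandra at Netflix; a plain dict here), and the
# recommender service reads from it to serve clients. The key layout
# and values are illustrative assumptions.
trending_store = {}  # stands in for the Cassandra key-value store

def store_trending(country, titles):
    """Analytics side: persist the current trending list per country."""
    trending_store[f"trending:{country}"] = titles

def recommend(country):
    """Recommender service side: serve trending data to the client."""
    return trending_store.get(f"trending:{country}", [])

store_trending("DE", ["House of Cards", "Narcos"])
german_trending = recommend("DE")
```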
How exactly the analytics system processes all this data is not known to the public. It is a trade secret of Netflix.
What is known is which analytics tool they use. Back in February 2015, they wrote on the tech blog that they use a custom-made tool.
They also stated that Netflix is going to replace the custom-made analytics tool with Apache Spark Streaming in the future. My guess is that they made the switch to Spark some time ago, because their post is more than a year old.
Should you do batch or stream processing?
It is a good idea to start with batch processing. Batch processing is the foundation of every good big data platform.
A batch processing architecture is simple and therefore quick to set up. Platform simplicity also means it will be relatively cheap to run.
A batch processing platform will enable you to quickly ask the big questions. It will give you invaluable insight into your data and your customers.
When the time comes and you also need to do analytics on the fly, then add a streaming pipeline to your batch processing big data platform.
Share your experience with the community
Do you already know some ingenious things that people do with batch or stream processing? If you do, please share it with our community and write a comment.
If you have a question regarding this post, please also write a comment. I am here to help.
You know what would also be super awesome?
Head over to LinkedIn and like this post to get it some traction: Link to Post (Big Data and Analytics group members only).
My big data platform blueprint that does it all
What does a big data platform for batch and stream processing look like?
Next time I will give you my big data platform blueprint. With the blueprint you will be able to instantly set up your own big data platform.
The blueprint will help you and your company to start making data driven decisions.
All you have to do to make sure you don’t miss the blueprint is subscribe to my newsletter. Just put in your e-mail address right here and hit subscribe.
This way I will be able to send you an e-mail when I have uploaded the blueprint.
Thanks a lot!