Today I am going to tell you how you can explore your data through data mining like a data scientist. The cool thing is: To start, you only need your everyday computer and curiosity.
Using a real-world example, I will show you how to analyse comma-separated text files. Along the way you will learn how to easily code and prototype Apache Spark jobs.
Beginner or expert, if you are prototyping like this, it will help you put your code into production faster.
What kind of system are we using and why?
The increasing data traffic from sensors of the internet of things (IoT) changed the requirements for IT platforms a lot. In the past, it was totally ok to take in data, store it in a SQL database and display it on a front end.
If you needed to do analytics, you mostly had two choices:
For a relatively simple analysis, someone programmed it as a stored procedure on the SQL database. Otherwise, you extracted data from the database and ran your analytics with external tools like Matlab or R.
Today, in the era of big data, it is no longer enough to only do analytics by hand. Modern big data platforms are required to be capable of doing analytics on the fly as the stream of data is coming in.
Realtime analytics, or streaming analytics, that is what customers demand and that is what modern platforms have to deliver.
In 2009, guys at the University of California, Berkeley started working on a framework to tackle these issues. Their idea was to create a distributed system that uses a cluster of machines to do in-memory data analytics. Spark was born.
Spark has three major advantages in comparison to other big data analytics frameworks like MapReduce:
- It can do streaming analytics
- It is ultra fast because data is held in memory
- It enables complex analytics like iterative processes, because intermediate results do not have to be written to disk between steps as in MapReduce
After being open sourced in 2010 and donated to the Apache Software Foundation in 2013, Spark became a top-level project in 2014. Then its popularity went through the roof.
Today, Spark is one of the most adopted frameworks for big data analytics.
You can and should profit from Spark analytics in your daily life! Here’s how you do that.
How to prototype Spark code with Apache Zeppelin!
To do analytics with Spark you usually need to do the following:
- Code a Spark job
- Upload it to a running Spark cluster
- Run it and store the results somewhere
- Visualise results with external tools like Excel or Tableau
A neat trick to start programming Spark jobs without a cluster or a development environment like Eclipse is to use Apache Zeppelin.
Apache Zeppelin gives you the opportunity to code directly in the browser, run the code on Spark and display the results. It not only displays the results – you can also dynamically explore your results by using SQL statements through the Spark SQL API.
This makes it very easy to learn Spark by simply trying stuff. If you make coding errors, Zeppelin will tell you in the browser where your error is.
Because Zeppelin allows you to display intermediate results, you can easily spot data transformation mistakes. To do that, you just have to display the result in tabular form.
How to get going with Zeppelin
All you need is this:
- A Unix-like operating system such as CentOS or Mac OS, or a virtual Linux machine on Windows (with VirtualBox)
- Some RAM: 8 GB and a quad-core CPU are fine.
- Download the Zeppelin binary with all interpreters and unpack it: http://mirrors.ae-online.de/apache/zeppelin/zeppelin-0.6.0/zeppelin-0.6.0-bin-all.tgz
- Run Zeppelin (bin/zeppelin-daemon.sh start) and open your browser at localhost:8080
Personally I use a MacBook Pro 13in Retina with 8GB RAM from 2015.
Data mining and exploration use-case with Twitch.tv chat data
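Twitch.tv is a live-streaming platform where people broadcast themselves playing video games. Twitch also casts e-sports championships for games like Counter-Strike or Dota, where hundreds of thousands or sometimes over a million people are watching.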
The audience can interact with the host and everyone else through a built-in live chat.
Here’s an example screenshot of one of my favourite streamers MANvsGAME playing his and also my favourite game Bloodborne:
The source data
In 2015 I wrote myself a chat logger for various channels I like. Just for fun, because I wanted to play around with Spark and I had no interesting data lying around.
This example has data from one of the biggest streamers on Twitch: Lirik. He consistently pulls about 20 thousand viewers every day.
I stored the data in comma-separated form inside a txt file. Every line is a chat event where someone has written a message, in the form: Timestamp,Channel,User,Written Text
To not piss people off, I anonymised the names: only the first three letters are shown, followed by “xxx”.
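As a rough illustration of that anonymisation rule, here is a minimal Scala sketch. It is not the chat logger's actual code; the function name anonymise and the sample names are made up for this example.

```scala
// Hypothetical helper: keep the first three characters of a user name
// and append "xxx". Names shorter than three characters are kept whole.
def anonymise(user: String): String = user.take(3) + "xxx"

// Made-up sample names, not taken from the real chat log.
val sample = Seq("lirikfan42", "bob", "al")
val masked = sample.map(anonymise)

println(masked.mkString(", "))  // lirxxx, bobxxx, alxxx
```

Note that such a scheme only hides names from casual readers: two users whose names share the first three letters end up with the same anonymised name.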
You can download this file as well as all the Spark code on my Github page.
The user interface and the code
In Zeppelin you can code and explore data directly in the browser. I use Chrome for that, because I had some issues displaying data with Safari.
Here’s a view of my Zeppelin notebook for this example. In the middle there is the complete source code of this example you can find on my Github page.
To execute your code, you can use the buttons on the top and on the right. The one on the top refreshes everything including your graphs. The one on the right only refreshes the current section.
The code is very simple. There are only five steps of data transformation that you have to do. Then you can start exploring the data.
1. This line loads the text file from your local hard drive:
val sourceText = sc.textFile("PathToTheFile")
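2. Then you split the lines by comma:
Because every line is stored in comma-separated form, you map (copy) the columns you need, addressed as “s(column index)” with the index starting at 0, into an object of the class I named Allcolumns. The expression s(0).drop(11).take(5) keeps only the hour and minute of the timestamp (for example 18:23), which is what lets you count per minute later on.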
val fullchat = sourceText.map(s=>s.split(",")).map(s=>Allcolumns(s(0).drop(11).take(5),s(2),s(3)))
3. After that, split the text in column four into one dataset per word:
val wordcount = fullchat.flatMap(line => line.mysplit(line))
4. Next, use this code to create data frames so you can use Spark SQL to explore the data:
5. One more thing:
To be able to access the columns by name you need to define the Allcolumns class. The function mysplit will split the text in column number four into single datasets for every word.
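Here is a minimal sketch of what such a class and its mysplit function could look like, written in plain Scala so you can try it without Spark. Several things are assumptions: the field name User, the timestamp layout in the sample line, the helper parseLine and the sample data are all made up, and split(",", 4) is a small variation on the split(",") used above so that commas inside a chat message do not cut the text short. The closing comments show how step 4 is commonly written with Spark 1.x; this too is an assumption, not the notebook's actual code.

```scala
// The field names Time and Text match the SQL queries in this post;
// the field name User is an assumption.
case class Allcolumns(Time: String, User: String, Text: String) {
  // Returns one Allcolumns per whitespace-separated word of the message,
  // keeping Time and User on every word.
  def mysplit(line: Allcolumns): Seq[Allcolumns] =
    line.Text.split("\\s+").filter(_.nonEmpty)
      .map(word => Allcolumns(line.Time, line.User, word)).toSeq
}

// Assumed layout: Timestamp,Channel,User,Written Text with a timestamp like
// "2015-10-21 18:23:05", so that drop(11).take(5) yields "18:23".
// The limit 4 stops splitting after the third comma.
def parseLine(raw: String): Allcolumns = {
  val s = raw.split(",", 4)
  Allcolumns(s(0).drop(11).take(5), s(2), s(3))
}

// Made-up sample line, not taken from the real data file.
val sampleLine = "2015-10-21 18:23:05,lirik,lirxxx,lirikMLG that was close, wow"
val message = parseLine(sampleLine)
val words = message.mysplit(message)

println(message)
println(words.map(_.Text).mkString(" | "))

// In the notebook the same functions would run on the RDDs, roughly:
//   val fullchat  = sourceText.map(parseLine)
//   val wordcount = fullchat.flatMap(line => line.mysplit(line))
// With Spark 1.x, data frames for Spark SQL are commonly registered via:
//   fullchat.toDF().registerTempTable("fullchat")
//   wordcount.toDF().registerTempTable("wordcount")
// (Spark 2.x replaced registerTempTable with createOrReplaceTempView.)
```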
When you are finished, the variables fullchat and wordcount look like this:
As you can see, every line (dataset) in fullchat has been split into multiple datasets in wordcount.
Data exploration with Spark SQL
Ok, coding is boring. Now to the fun part: Exploring the data 🙂
In Zeppelin, you can create your charts below the code area. To analyse the data you have to use a very basic set of five SQL statements.
Don’t be afraid – in the following examples I will explain every statement. You will see it is simple.
Just for reference: if you want to know more about the statements, use the links to w3schools below. They have excellent explanations and examples to try out.
The five statements you have to know:
Discovering stream highlights
One thing you can do with the data is to find highlights. A highlight is when something cool happens during the stream and chat gets excited.
You can do that by analysing how many messages get written per minute. Use the fullchat variable for this: it has one line for every message, so you can easily count the messages per minute.
What this analysis tells you:
The chart below shows you that there are highlights at the beginning and at the end of the stream as well as multiple highlights in between.
You do this analysis by using this SQL statement:
%sql Select Time, count(*) as Messages from fullchat group by Time order by Time asc
Show me the time and a count of all messages from the fullchat data frame.
Group this by time so you will count the messages per minute.
Then order it by time ascending.
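If you want to see what this query computes without starting Spark, the same logic can be sketched on a plain Scala collection. The class ChatMessage and the sample rows are made up for illustration; the real fullchat data frame has more columns and far more rows.

```scala
// Made-up sample: one row per chat message, as in fullchat.
case class ChatMessage(Time: String, Text: String)

val chat = Seq(
  ChatMessage("18:22", "hello"),
  ChatMessage("18:23", "lirikMLG"),
  ChatMessage("18:23", "lirikTEN lirikTEN"),
  ChatMessage("18:23", "what a play"),
  ChatMessage("18:24", "gg")
)

// group by Time, count the rows in each group, order by Time ascending
val messagesPerMinute: Seq[(String, Int)] =
  chat.groupBy(_.Time)
    .map { case (time, msgs) => (time, msgs.size) }
    .toSeq
    .sortBy { case (time, _) => time }

messagesPerMinute.foreach { case (time, n) => println(s"$time $n") }
```

A minute with an unusually high count compared to its neighbours, here 18:23, is a candidate highlight.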
What happened at those highlights?
Twitch has a simple way of expressing emotions in chat, by displaying emotes. These emotes are translated from text. Every channel has its own emotes.
Check out Lirik’s emotes at twitchemotes.com.
You can discover what happened by analysing the viewers' emotions. This can be done by looking at the chat emotes that have been most used in a particular minute.
At 18:23 the most used emotes were positive ones, like lirikMLG (MLG stands for Major League Gaming, www.majorleaguegaming.com/) or lirikTEN, a picture of a cat holding a 10 out of 10 sign.
These emotes indicate something good happened. Basically winning.
At 19:02 the chart looks very different. Emotes like lirikFEELS or lirikRip, a crying cat with the text RIP on top, indicate something bad happened. Losing.
Spark SQL Statement:
%sql select Time, count(Text) as ct, Text from wordcount where Time like '18:23' group by Text,Time order by ct desc limit 5
Show me the time, a count of the words and the words from the wordcount variable.
But, only for the time “18:23”.
Group this selection by the word in order to count the words, and by time (needed because Time appears in the select list without being aggregated).
Order it from the most used word to the least used word.
Limit the output to five words (because of the ordering, these are the five most used).
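The same steps can be sketched on a plain Scala collection, which makes it easy to follow each part of the statement. The class ChatWord and the sample rows are made up for illustration; in the notebook the rows come from the wordcount data frame.

```scala
// Made-up sample: one row per word, as in wordcount.
case class ChatWord(Time: String, Text: String)

val wordRows = Seq(
  ChatWord("18:23", "lirikMLG"), ChatWord("18:23", "lirikMLG"),
  ChatWord("18:23", "lirikMLG"), ChatWord("18:23", "lirikTEN"),
  ChatWord("18:23", "lirikTEN"), ChatWord("18:23", "wow"),
  ChatWord("19:02", "lirikRip")
)

val topWords: Seq[(String, Int)] =
  wordRows
    .filter(_.Time == "18:23")                         // only the minute 18:23
    .groupBy(_.Text)                                   // one group per word
    .map { case (text, rows) => (text, rows.size) }    // count per word
    .toSeq
    .sortBy { case (_, ct) => -ct }                    // most used first
    .take(5)                                           // keep at most five

topWords.foreach { case (text, ct) => println(s"$text $ct") }
```

Changing the minute in the filter to "19:02" gives the picture for the second highlight.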
Go explore the data yourself!
The chat data has a lot more knowledge hidden for you to find. That’s why I uploaded the data file and the examples to my Github page.
Go, download the data and play around with it. Then mail your coolest findings to me: email@example.com.
I will select the most interesting three and add them to this post, with your name on it if you like.
A small hint: These additional analytics results are also on Github:
- How to find the most used emotes in chat of the day
- Pinpointing the most active users
- Analysing when a specific user has written messages
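To give a feel for the kind of logic behind the last two items, here is a minimal sketch on a plain Scala collection. It is not the code from the Github page; the class UserMessage, the function minutesOf and the sample rows are made up for illustration.

```scala
// Made-up sample: one row per chat message with its minute and user.
case class UserMessage(Time: String, User: String)

val log = Seq(
  UserMessage("18:22", "bobxxx"),
  UserMessage("18:23", "lirxxx"),
  UserMessage("18:23", "bobxxx"),
  UserMessage("19:02", "bobxxx"),
  UserMessage("19:02", "annxxx"),
  UserMessage("19:02", "lirxxx")
)

// Most active users: message count per user, highest count first,
// ties broken alphabetically by user name.
val mostActive: Seq[(String, Int)] =
  log.groupBy(_.User)
    .map { case (user, msgs) => (user, msgs.size) }
    .toSeq
    .sortBy { case (user, n) => (-n, user) }

// Minutes in which one specific user wrote at least one message.
def minutesOf(user: String): Seq[String] =
  log.filter(_.User == user).map(_.Time).distinct.sorted

mostActive.foreach { case (user, n) => println(s"$user $n") }
println(minutesOf("lirxxx").mkString(", "))
```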
Remember when I told you at the beginning of this article that Spark is one of the most adopted frameworks for big data analytics out there?
It’s because data scientists and solution architects love it. These two groups of professionals benefit the most from Apache Spark.
Next time we will go over what data scientists and solution architects do and which features of Spark matter the most to them. I will also give you some insider advice on who you should hire to make your data-heavy startup or project a success.
Jump directly to the post: Spark a Success Guarantee For Data Scientists
Thanks a lot!