My Data Engineering Cookbook

What do you actually need to learn to become an awesome data engineer?
Look no further, you will find it here.

How to use this page: This is not a training course! It's a collection of skills that I value highly in my daily work as a data engineer. It's intended to be a starting point for you to find the topics to look into.

This project is a work in progress!
Over the next weeks I am going to share my thoughts on why each topic is important. I will also try to include links to useful resources.

How to find out what is new?
I am going to talk about new content on my podcast and YouTube channel first, then add it to this document.

Help make this collection awesome!
Write me an email at andreaskayy@gmail.com. Sharing your thoughts on what you value and what you think should be included helps a lot.

In this podcast I talked about why I love to work as a data engineer and not a data scientist:

This is everything you need to know:

It's the combination of multiple skills that is important. The technique is called talent stacking.

I talked about why Companies Badly Need Data Engineers

I talked about talent stacks for data engineers in this podcast:

Learn to Write Code

Why this is important: Without coding you cannot do much in data engineering. I cannot count the number of times I needed a quick Java hack.

The possibilities are endless:

  • Writing data into or quickly getting some data out of a SQL DB (see the JDBC sketch below)
  • Producing test messages to a Kafka topic
  • Understanding the source code of a Java web service
  • Reading counter statistics out of an HBase key-value store
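
For example, here is a minimal Java sketch for the first item on the list: quickly reading a few rows out of a SQL database over JDBC. The connection URL, credentials, and table name are placeholders, and it assumes a JDBC driver (here PostgreSQL) is on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: quickly read a few rows out of a SQL DB via JDBC.
// URL, credentials, and table name are placeholders for illustration.
public class QuickSqlRead {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/mydb";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM customers LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getLong("id") + " " + rs.getString("name"));
            }
        }
    }
}
```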

So, which language do I recommend then?

I highly recommend Java. It’s everywhere!

When you are getting into data processing with Spark, you should use Scala. But after learning Java, this is easy to do.

Python is also a great choice. It is super versatile.

Personally, however, I am not that big into Python, but I am going to look into it.

Where to Learn?
There’s a Java Course on Udemy you could look at: https://www.udemy.com/java-programming-tutorial-for-beginners

Coding Basics

OOP: object-oriented programming
Unit tests: how to make sure the code you write is working (see the JUnit sketch below)
Functional programming
How to use build management tools like Maven
Resilience testing (?)
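
To give you an idea of the unit test item, here is a minimal sketch assuming JUnit 5 is on the classpath; the Calculator class is a made-up example.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

// Minimal unit test sketch with JUnit 5. The Calculator class is a
// made-up example; the point is that every piece of logic gets a test.
class CalculatorTest {

    // A tiny class under test, just for illustration
    static class Calculator {
        int add(int a, int b) { return a + b; }
    }

    @Test
    void addsTwoNumbers() {
        assertEquals(5, new Calculator().add(2, 3));
    }
}
```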

I talked about the importance of learning by doing in this podcast:

Learn To Use GitHub

Why this is important: One of the major problems with coding is keeping track of changes. It is also almost impossible to maintain a program that exists in multiple versions.

Another is collaboration and documentation, which is super important.

Let’s say you work on a Spark application and your colleagues need to make changes while you are on holiday. Without some code management they are in huge trouble:

Where is the code? What did you change last? Where is the documentation? How do we mark what we have changed?

But if you put your code on GitHub, your colleagues can find it. They can understand it through your documentation (please also add in-line comments).

Developers can pull your code, create a new branch, and make their changes. After your holiday you can inspect what they have done and merge it with your original code, so you end up with only one application.

Where to learn:
Check out the GitHub Guides page where you can learn all the basics: https://guides.github.com/introduction/flow/


This great Git commands cheat sheet saved my butt multiple times: https://www.atlassian.com/git/tutorials/atlassian-git-cheatsheet

Pull, Push, Branching, Forking

Also talked about it in this podcast:

Agile Development

Scrum
OKR

I talked about this in this Podcast:

Computer Science Basics

Learn how a Computer Works

CPU, RAM, GPU, HDD
Differences between PCs and Servers

I talked about computer hardware and GPU processing in this podcast:

Computer Networking

ISO/OSI Model
IP Subnetting (see the sketch below)
Switch, Layer 3 Switch
Router
Firewalls
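
As a small illustration of the IP subnetting item, this sketch computes the network address of an IPv4 address for a given prefix length. The address 192.168.10.57 and the /24 prefix are just example values.

```java
import java.net.InetAddress;
import java.nio.ByteBuffer;

// Sketch: derive the network address of 192.168.10.57/24 by applying
// the subnet mask. Address and prefix length are example values.
public class SubnetDemo {
    public static void main(String[] args) throws Exception {
        int ip = ByteBuffer.wrap(InetAddress.getByName("192.168.10.57").getAddress()).getInt();
        int prefixLength = 24;
        int mask = prefixLength == 0 ? 0 : -1 << (32 - prefixLength); // 0xFFFFFF00 for /24
        int network = ip & mask;
        byte[] bytes = ByteBuffer.allocate(4).putInt(network).array();
        System.out.println(InetAddress.getByAddress(bytes).getHostAddress()); // prints 192.168.10.0
    }
}
```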

I talked about Network Infrastructure and Techniques in this podcast:

Security & Privacy

SSL public & private key certificates
What is a certificate authority?
JSON Web Tokens (JWT)
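
To make the JWT item concrete, here is a small sketch of the token structure using only the JDK's Base64 classes. The header and payload contents are made-up example values, and real signature creation/verification (e.g. HMAC-SHA256) is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch of the JWT structure: header.payload.signature, each part
// Base64URL-encoded JSON. Signing and verification are omitted here.
public class JwtStructureDemo {
    public static void main(String[] args) {
        Base64.Encoder enc = Base64.getUrlEncoder().withoutPadding();
        String header  = enc.encodeToString("{\"alg\":\"HS256\",\"typ\":\"JWT\"}".getBytes(StandardCharsets.UTF_8));
        String payload = enc.encodeToString("{\"sub\":\"user123\",\"role\":\"data-engineer\"}".getBytes(StandardCharsets.UTF_8));
        String token = header + "." + payload + ".signature-goes-here"; // placeholder signature
        System.out.println("Token:   " + token);

        // Decode the payload back, like a service would before verifying the signature
        String[] parts = token.split("\\.");
        String decodedPayload = new String(Base64.getUrlDecoder().decode(parts[1]), StandardCharsets.UTF_8);
        System.out.println("Payload: " + decodedPayload);
    }
}
```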

GDPR regulations
Privacy by design

Linux

OS Basics
Shell scripting
Cron jobs
Package management

Linux Tips are the second part of this podcast:

How to Set Up a Data Science Platform

Security Zone Design

How to secure a multi-layered application (UI in a different zone than the SQL DB)
Cluster security with Kerberos

I talked about security zone design and lambda architecture in this podcast:

Lambda Architecture

Stream and Batch processing
Storage

Big Data

What is big data and how is it different from data science and data analytics?

I talked about the difference in this podcast:

The 4Vs of Big data: I talked about the 4Vs in this podcast:

When do you have Big Data?
What are the associated tools?

What does a Hadoop-based platform look like?

What does a Hadoop system architecture look like?
How to select Hadoop cluster hardware
What tools usually come with a Hadoop cluster: YARN, ZooKeeper, HDFS, Oozie, Flume, Hive

Is ETL still relevant for analytics?

I talked about this in this podcast:

The Cloud

AWS, Azure, IBM, Google Cloud basics
Cloud vs. on-premise
Upsides & downsides
Security

Listen to a few thoughts about the cloud in this podcast:

Docker

What is Docker and what do you use it for?
How to create, start, and stop a container
Docker microservices?
Kubernetes

Why and how to do Docker container orchestration

Podcast about how data science learners use Docker (for data scientists):

How to Ingest Data

Application programming interfaces

REST APIs
HTTP Post/Get
API Design
Implementation
OAuth security
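
A quick sketch of calling a REST API over HTTP GET with the HTTP client built into Java 11+; the URL and endpoint are placeholders for whatever API you are testing.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: a simple HTTP GET against a REST API using the Java 11+ HttpClient.
// The URL is a placeholder for illustration.
public class RestGetDemo {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://api.example.com/v1/users/42"))
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```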

Check out my podcast about how APIs rule the world:

JSON

Super important for REST APIs
How is it used for logging and processing logs
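
Here is a small sketch of reading a JSON log line in Java, assuming the Jackson library (jackson-databind) is on the classpath; the log line itself is a made-up example.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Sketch: parse a JSON log line with Jackson (assumes jackson-databind
// on the classpath). The log line is a made-up example.
public class JsonLogDemo {
    public static void main(String[] args) throws Exception {
        String logLine = "{\"timestamp\":\"2019-01-01T12:00:00Z\",\"level\":\"ERROR\",\"message\":\"job failed\"}";
        ObjectMapper mapper = new ObjectMapper();
        JsonNode node = mapper.readTree(logLine);
        System.out.println(node.get("timestamp").asText() + " " + node.get("level").asText()
                + ": " + node.get("message").asText());
    }
}
```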

How to Process Data: Distributed Processing

I talked about why distributed processing is so super important in this Podcast:

MapReduce

Why was MapReduce invented?
How does it work? (see the sketch below)
What are the limitations of MapReduce?
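
To make the "how does it work" question concrete, here is a condensed word-count mapper and reducer using the Hadoop MapReduce Java API. The map phase emits (word, 1) pairs, the framework shuffles them by key, and the reduce phase sums the counts; the job/driver setup is left out to keep the sketch short.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Condensed word-count sketch with the Hadoop MapReduce Java API.
// Job configuration and the main() driver are omitted.
public class WordCount {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum)); // total count per word
        }
    }
}
```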

Apache Spark

Spark Basics
What is the difference from MapReduce?
How to do stream processing
How to do batch processing
How does Spark use data from Hadoop?
What is an RDD and what is a DataFrame?
Spark coding with Scala
Spark coding with Python
How and why to use SparkSQL?
Machine Learning on Spark? (TensorFlow)
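
As a small taste of the DataFrame API, here is a batch job sketch, written in Java to keep all examples in one language (the Scala version looks very similar). The input path and column names are placeholders for illustration.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

// Sketch: a small Spark batch job in Java. Reads a JSON file into a
// DataFrame, filters it, and aggregates. Path and columns are placeholders.
public class SparkBatchDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-batch-demo")
                .master("local[*]") // run locally for testing
                .getOrCreate();

        Dataset<Row> events = spark.read().json("/tmp/events.json");

        Dataset<Row> errorsPerUser = events
                .filter(col("level").equalTo("ERROR"))
                .groupBy(col("user"))
                .count();

        errorsPerUser.show();
        spark.stop();
    }
}
```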

I talked about the three methods of data streaming in this podcast:

Message queues with Apache Kafka

Why a message queue tool?
Kafka architecture
What are topics?
What does Zookeeper have to do with Kafka?
How to produce and consume messages
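
Here is a minimal Java producer sketch, assuming the kafka-clients dependency is available and a broker is running on localhost:9092; the topic name is a placeholder.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Sketch: produce a single message to a Kafka topic. Assumes the
// kafka-clients dependency and a broker on localhost:9092; the topic
// name is a placeholder.
public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key1", "hello kafka"));
            producer.flush();
        }
    }
}
```
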

My YouTube video how to set up Kafka at home: https://youtu.be/7F9tBwTUSeY

My YouTube video how to write to Kafka: https://youtu.be/RboQBZvZCh0

Machine Learning

Training and Applying models
What is deep learning
How to do Machine Learning in production

My podcast about how to do machine learning in production:

How to Store Data

Data Modeling

How to find out how you need to store data for the business case
How to decide what kind of storage you need to use

Check out my podcast on how to decide between SQL and NoSQL:

SQL

Database Design
SQL Queries
Stored Procedures
ODBC/JDBC Server Connections

NoSQL

Key-value stores (HBase); see the sketch below
Document Stores (HDFS, MongoDB)
Time Series Databases (?)
MPP Databases (Greenplum)
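
For the key-value store item, a short sketch of reading one value out of HBase with the Java client API. The table, column family, qualifier, and row key are made-up names, and the cluster configuration is expected in an hbase-site.xml on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: read a single counter value out of an HBase table via the
// Java client API. Table, column family, and row key are placeholders.
public class HBaseReadDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("counters"))) {
            Get get = new Get(Bytes.toBytes("sensor-42"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("stats"), Bytes.toBytes("count"));
            System.out.println("count = " + Bytes.toLong(value));
        }
    }
}
```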

DevOps

Hadoop

Hadoop Cluster setup and management with Cloudera Manager (for example)
Spark code from coding to production
How to monitor and manage data processing pipelines

Airflow Application management
Creating Statistics with Spark and Kafka

Listen to an introduction to Hadoop for Data Scientists:

How to Visualize Data

Mobile Apps

Android & iOS basics
How to design APIs for mobile apps

How to use web servers to display content

Tomcat, Jetty, NodeRED, React

Business Intelligence Tools

Tableau
PowerBI
Qlik Sense

Identity & Device Management

What is a digital twin?
Active Directory