What do you actually need to learn to become an awesome data engineer?
Look no further, you'll find it here.
How to use this page: This is not a training! It's a collection of skills that I value highly in my daily work as a data engineer. It's intended to be a starting point for you to find the topics to look into.
This project is a work in progress!
Over the next weeks I am going to share with you my thoughts on why each topic is important. I also try to include links to useful resources.
How to find out what is new?
I am going to talk about new stuff on my podcast and YouTube channel first, then add it to this document.
Help make this collection awesome!
Write me an email to firstname.lastname@example.org. Sharing your thoughts, what you value and what you think should be included helps a lot.
In this podcast I talked about why I love to work as a data engineer and not a data scientist:
This is everything you need to know:
It's the combination of multiple skills that is important. The technique is called talent stacking.
I talked about why Companies Badly Need Data Engineers
I talked about talent stacks for data engineers in this podcast:
Learn to Write Code
Why this is important: Without coding you cannot do much in data engineering. I cannot count the number of times I needed a quick Java hack.
The possibilities are endless:
- Writing data to or quickly getting some data out of a SQL DB
- Testing how to produce messages to a Kafka topic
- Understanding the source code of a Java web service
- Reading counter statistics out of an HBase key-value store
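As a minimal sketch of the first item, here is what quickly getting some data out of a SQL DB can look like. It is written in Python with the standard-library sqlite3 module to keep it short. The in-memory database, the `events` table and its columns are made up for illustration; a real job would connect to your actual database instead.

```python
import sqlite3


def build_demo_db():
    """Create a throwaway in-memory DB with a hypothetical events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_name TEXT, action TEXT)")
    conn.executemany(
        "INSERT INTO events (user_name, action) VALUES (?, ?)",
        [("anna", "login"), ("ben", "login"), ("anna", "click")],
    )
    conn.commit()
    return conn


def count_events_per_user(conn):
    """Return (user_name, event_count) rows, busiest user first."""
    cursor = conn.execute(
        "SELECT user_name, COUNT(*) AS n FROM events "
        "GROUP BY user_name ORDER BY n DESC, user_name ASC"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for user_name, n in count_events_per_user(conn):
        print(user_name, n)
    conn.close()
```

The same pattern (connect, run a query, fetch the rows) carries over to other databases; only the connection part changes.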
So, which language do I recommend then?
I highly recommend Java. It’s everywhere!
When you are getting into data processing with Spark you should use Scala. After learning Java, this is easy to do.
Python is also a great choice. It is super versatile.
Personally, however, I am not that big into Python. But I am going to look into it.
Where to Learn?
There’s a Java Course on Udemy you could look at: https://www.udemy.com/java-programming-tutorial-for-beginners
OOP (object-oriented programming)
What unit tests are and how they make sure your code is working
How to use build management tools like Maven
Resilient testing (?)
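To show what a unit test looks like in practice, here is a small sketch using Python's built-in unittest module. The `parse_reading` function and its `sensor_id,value` line format are hypothetical; the point is only that each test checks one expected behavior of the code.

```python
import unittest


def parse_reading(line):
    """Parse a hypothetical 'sensor_id,value' line into (str, float)."""
    sensor_id, raw_value = line.strip().split(",")
    return sensor_id, float(raw_value)


class ParseReadingTest(unittest.TestCase):
    def test_parses_valid_line(self):
        self.assertEqual(parse_reading("s1,21.5\n"), ("s1", 21.5))

    def test_rejects_malformed_line(self):
        # A line without a comma cannot be unpacked into two fields.
        with self.assertRaises(ValueError):
            parse_reading("garbage")
```

Saved to a file, the tests can be run with Python's unittest runner. In a Java project, a build tool like Maven typically runs the equivalent tests on every build.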
I talked about the importance of learning by doing in this podcast:
Learn To Use GitHub
Why this is important: One of the major problems with coding is keeping track of changes. It is also almost impossible to maintain a program when you have multiple versions of it.
Another problem is collaboration and documentation, which are super important.
Let’s say you work on a Spark application and your colleagues need to make changes while you are on holiday. Without some code management they are in huge trouble:
Where is the code? What did you change last? Where is the documentation? How do we mark what we have changed?
But if you put your code on GitHub, your colleagues can find it. They can understand it through your documentation (please also add in-line comments).
Developers can pull your code, make a new branch and make their changes. After your holiday you can inspect what they have done and merge it with your original code, so you end up with only one application.
Where to learn:
Check out the GitHub Guides page where you can learn all the basics: https://guides.github.com/introduction/flow/
This great Git commands cheat sheet saved my butt multiple times: https://www.atlassian.com/git/tutorials/atlassian-git-cheatsheet
Pull, Push, Branching, Forking
I also talked about it in this podcast:
I talked about this in this Podcast:
Computer Science Basics
Learn how a Computer Works
Differences between PCs and Servers
I talked about computer hardware and GPU processing in this podcast:
Switch, Level 3 Switch
I talked about Network Infrastructure and Techniques in this podcast:
Security & Privacy
SSL Public & Private Key Certificates
What is a certificate authority
JSON Web Tokens
Privacy by design
Linux Tips are the second part of this podcast:
How to Set Up a Data Science Platform
Security Zone Design
How to secure a multi-layered application (UI in a different zone than the SQL DB)
Cluster security with Kerberos
I talked about security zone design and lambda architecture in this podcast:
Stream and Batch processing
What is big data, and how does it differ from data science and data analytics?
I talked about the difference in this podcast:
The 4Vs of Big data: I talked about the 4Vs in this podcast:
When do you have Big Data?
What are the tools associated?
What does a Hadoop-based platform look like?
What does a Hadoop system architecture look like?
How to select Hadoop cluster hardware
What tools are usually in a Hadoop cluster: YARN, ZooKeeper, HDFS, Oozie, Flume, Hive
ETL still relevant for Analytics
I talked about this in this podcast:
AWS, Azure, IBM, Google Cloud basics
cloud vs on premise
up & downsides
Listen to a few thoughts about the cloud in this podcast:
What is Docker and what do you use it for
How to create, start and stop a container
Docker micro services?
Why and how to do Docker container orchestration
Podcast about how data science learners use Docker (for data scientists):
How to Ingest Data
Application programming interfaces
Check out my podcast about how APIs rule the world:
Super important for REST APIs
How is it used for logging and processing logs
How to Process Data: Distributed Processing
I talked about why distributed processing is so super important in this Podcast:
Why was MapReduce Invented
How does that work
What is the limitation of MapReduce?
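To give a first feel for how MapReduce works, here is a tiny single-process sketch of the classic word count in plain Python. It is not Hadoop code; the function names are my own. It only shows the three steps: map emits key-value pairs, shuffle groups them by key, and reduce combines each group.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce one after another in a single process."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

On a real cluster the map and reduce steps run in parallel on many machines, and the shuffle moves data between them over the network.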
How does Spark differ from MapReduce?
How to do stream processing
How to do batch processing
How does Spark use data from Hadoop
What is an RDD and what is a DataFrame?
Spark coding with Scala
Spark coding with Python
How and why to use SparkSQL?
Machine Learning on Spark? (TensorFlow)
I talked about the three methods of data streaming in this podcast:
Message queues with Apache Kafka
Why a message queue tool?
What are topics
What does Zookeeper have to do with Kafka
How to produce and consume messages
My YouTube video on how to set up Kafka at home: https://youtu.be/7F9tBwTUSeY
My YouTube video on how to write to Kafka: https://youtu.be/RboQBZvZCh0
Training and Applying models
What is deep learning
How to do Machine Learning in production
My podcast about how to do machine learning in production:
How to Store Data
How to find out how you need to store data for the business case
How to decide what kind of storage you need to use
Check out my podcast on how to decide between SQL and NoSQL:
ODBC/JDBC Server Connections
Key-Value Stores (HBase)
Document Stores (HDFS, MongoDB)
Time Series Databases (?)
MPP Databases (Greenplum)
Hadoop Cluster setup and management with Cloudera Manager (for example)
Spark code from coding to production
How to monitor and manage data processing pipelines
Airflow Application management
Creating Statistics with Spark and Kafka
Listen to an introduction to Hadoop for Data Scientists:
How to Visualize Data
Android & iOS basics
How to design APIs for mobile apps
How to use Webservers to display content
Tomcat, Jetty, Node-RED, React
Business Intelligence Tools
Identity & Device Management
What is a digital twin?