Important Data Used for Actually Good Machine Learning

Andreas Kretz Blog

In the previous post we talked about how the machine learning pipeline works.

Now that's all very nice.

In this one we are going to talk about the types of data used in this process. After reading it, it will be totally clear to you what a data engineer does.

The types of data

When you look at it, there are two very important places where you have data: training and production.

In the training phase you have three types of data:

  • Data that you use for the training.
  • Metadata with information about the training data.
  • Data that basically configures the model: the hyperparameter configuration.

Once you're in production you have the live data that is streaming in. Data that is coming in from an app, from an IoT device, from logs, or whatever.

All different types of data.

Now, here comes the engineering part:

The data engineer's part is making this data available: available to the data scientist and to the machine learning process.

Hyperparameters

So when you look at the model, on the left side you have your hyperparameter configuration. You need to store and manage these configurations somehow.

The hyperparameters are used to configure the algorithm that trains the model.
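To make that concrete, here is a minimal sketch of how such a configuration could be stored as a simple versioned JSON file. The model type, parameter names, and values are just an example, not from a specific project:

```python
import json

# Hypothetical hyperparameter configuration for one training run.
# Model type, parameter names, and values are only an example.
hyperparameters = {
    "model": "gradient_boosting",
    "learning_rate": 0.1,
    "max_depth": 6,
    "n_estimators": 200,
}

# Store the configuration next to the training run so the run can be
# reproduced and compared with other runs later.
with open("run_042_hyperparameters.json", "w") as f:
    json.dump(hyperparameters, f, indent=2)
```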

Training Data & Live Data

Then you have the actual training data.

There's a lot going on with the training data:

Where does it come from? Who owns it? That is basically data governance.

What's the lineage? Have you modified this data? What did you do to it, and what was the basis, the raw data?

A data catalog is important. It explains which features are available and how different data sets are labeled.
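As an illustration, a single catalog entry could capture governance, lineage, features, and labels in one small record. The dataset, owner, and field names below are made up:

```python
# Hypothetical entry in a simple data catalog.
catalog_entry = {
    "dataset": "customer_churn_training_v3",
    "owner": "crm-team",                                  # governance: who owns it
    "source": "s3://raw/crm/events/",                     # lineage: where the raw data lives
    "transformations": ["deduplicated", "sessionized"],   # lineage: what was done to it
    "features": ["age", "tenure_months", "num_support_tickets"],
    "label": "churned_within_90_days",                    # how the data set is labeled
}
```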

You need to access all this data somehow, in training and in production.

In production you need to have access to the live data.
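In practice that often means reading from a message queue. As a rough sketch, assuming the live events arrive on a Kafka topic (the topic name and broker address below are made up), a consumer could look like this:

```python
import json
from kafka import KafkaConsumer  # kafka-python client

# Connect to the (hypothetical) broker and subscribe to the live topic.
consumer = KafkaConsumer(
    "app-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

# Every message is one live event that the model in production can score.
for message in consumer:
    event = message.value
    print(event)
```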

All this is the data engineer's job: making the data available.

Data Engineering Process

First an architect needs to build the platform. A good data engineer can also fill this role.

Then the data engineer needs to build the pipelines: how is the data coming in, and how does the platform connect to other systems?

How is that data then put into storage? Which storage is best?

Is preprocessing necessary for the algorithms? The data engineer does that too.
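To give a sense of what such preprocessing can look like, here is a small sketch using pandas. The file name and column names are invented for the example:

```python
import pandas as pd

# Load the raw events (path and columns are hypothetical).
raw = pd.read_csv("raw_events.csv")

# Typical cleanup before handing the data to the data scientist:
clean = (
    raw.drop_duplicates()
       .dropna(subset=["user_id", "event_type"])  # drop incomplete rows
)

# Encode a categorical column as one-hot features.
features = pd.get_dummies(clean, columns=["event_type"])

# Store the result in an analytics-friendly format.
features.to_parquet("training_data.parquet")
```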

Once the data and the systems are available, it's time for the machine learning part.

The data is ready for processing, basically ready for the data scientist.

Once the analytics is done, the data engineer needs to build pipelines that make the results accessible again, for instance for other analytics processes, for APIs, for front ends, and so on.
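As a rough idea of that last step, assuming the results are exposed through a small REST API (the endpoint, port, and payload below are made up), a minimal Flask sketch could look like this:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical results produced by the analytics step.
results = {"customer_42": {"churn_probability": 0.17}}

@app.route("/results/<customer_id>")
def get_result(customer_id):
    # Make the analytics result available to front ends or other services.
    return jsonify(results.get(customer_id, {}))

if __name__ == "__main__":
    app.run(port=8080)
```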

All in all, the data engineer's part is the computer science part.

That's why I love it so much 🙂

Want to know more about Data Engineering?

Check out my free Data Engineering Cookbook on Github:
https://github.com/andkret/Cookbook
