
Transforming Data into Knowledge: Concepts, Types, and Applications

A comprehensive guide covering the foundational concepts of machine learning and its three main types: supervised, unsupervised, and reinforcement learning.

Python · AI
Ala GARBAA 🚀 Full-Stack & DevOps Engineer

Introduction

I believe that machine learning, the art and science of enabling computers to learn from data, is the most exciting field in computer science right now! Why? Because we are living in an age where data abounds, and through self-learning algorithms, we can convert this raw information into useful knowledge. And thankfully, open source libraries have matured, making it easier than ever to break into machine learning and wield its power.

In this guide, I will show you not only how these algorithms are built and how they work, but how you can make use of them in practice. I want this article to serve as a foundation for your journey into transforming data into actionable insights. Here's what we'll cover:

  • The core concepts of machine learning

  • The three types of learning, and which problems they are best for

  • Building blocks of machine learning systems

  • How to install Python with the most essential data analysis and machine learning packages

So buckle up; it's time to transform data into knowledge, and make some data-driven decisions!

The Rise of Machine Learning

In an age of modern technology, there's one thing we have in abundance: data. Machine learning evolved as a subfield of artificial intelligence (AI), focusing on self-learning algorithms that derive knowledge from data to make predictions.

Instead of manually crafting rules and models, machine learning efficiently captures knowledge in data to improve predictive model performance and make data-driven decisions.

Machine learning's increasing importance in computer science research is matched by its ever-greater role in our daily lives. We rely on machine learning for everything from email spam filters to web search, product recommendations, and mobile check deposits. And while the dream of self-driving cars is still evolving, we are increasingly witnessing the practical use of machine learning in medicine.

For instance, deep learning models can now detect skin cancer almost as accurately as humans. And researchers at DeepMind recently used deep learning to predict 3D protein structures, outperforming prior physics-based approaches by a wide margin (https://deepmind.google/technologies/alphafold/).

These are just a few examples; machine learning is becoming more important in healthcare as we speak. And it's not just healthcare: AI is also tackling climate change (https://aws.amazon.com/blogs/startups/how-climate-tech-startups-use-generative-ai-to-address-the-climate-crisis) and powering precision agriculture.

Three Ways to Learn: A Tour of ML Landscapes

Now, let’s get down to business and explore the three types of machine learning: supervised, unsupervised, and reinforcement learning. I’ll explain the differences, and hopefully you'll start to appreciate where each type might be useful.

Figure 1: The three types of machine learning: supervised, unsupervised, and reinforcement learning.

Supervised Learning: Teaching a Model with Labels

The goal of supervised learning is to learn a model from labeled training data, which enables us to make predictions about new, unseen data. The word "supervised" means that we already know the desired output, or "label," for each training example in our dataset. Think of it as "label learning." In a nutshell, we teach a model to connect the input data with their corresponding labels.

Figure 2: Simple diagram illustrating a supervised learning workflow. Labeled training data is input for a model, which is then used to make predictions about new, unlabeled data.

Consider email spam filtering: we train a model on a dataset of emails, correctly marked as spam or not-spam, and it uses that knowledge to decide where a new email belongs. A supervised learning task with discrete class labels is known as classification.
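
To make the spam example concrete, here is a minimal sketch of "learning from labels" in pure Python: a toy word-count classifier trained on a handful of invented emails. The training texts and labels below are made up for illustration; a real filter would use a proper library such as scikit-learn and far more data.

```python
from collections import Counter

# Toy labeled training set (texts and labels are made-up examples).
train = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for tomorrow", "ham"),
    ("project status and meeting notes", "ham"),
]

def train_counts(data):
    # "Training" here is just counting which words appear under each label.
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in data:
        counts[label].update(text.split())
    return counts

def classify(counts, text):
    # Score each class by how often the email's words appeared
    # in that class's training emails; pick the higher score.
    scores = {label: sum(c[w] for w in text.split())
              for label, c in counts.items()}
    return max(scores, key=scores.get)

counts = train_counts(train)
print(classify(counts, "claim your free prize"))  # spam
print(classify(counts, "meeting notes"))          # ham
```

Crude as it is, this captures the supervised-learning loop: labeled examples in, a decision rule out.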

Figure 3: Diagram showing how a binary classification task separates data points of class A and class B, with a decision boundary drawn between them.

But it's not always just two categories (like spam or not-spam). A classification model can also be used for multiclass classification, assigning new data points to one of several classes. Think of handwritten character recognition: given enough training data, a machine learning system can classify new handwritten characters as one of the letters of the alphabet. But if you asked it to classify digits (0 through 9), it would simply fail, because it has never seen digits in the training data.

Another kind of supervised learning is regression. Instead of categorical labels, we predict continuous outcome values. For example, if we want to predict SAT (Scholastic Assessment Test) scores based on study time, we could fit a line to past students' data and use it to predict the scores of students planning to take the SAT in the future.
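
The line-fitting itself is ordinary least squares. Here is a small pure-Python sketch; the study hours and scores below are made-up numbers, chosen to lie exactly on a line so the result is easy to check.

```python
# Hypothetical data: hours studied vs. test score (made up, exactly linear).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [520.0, 560.0, 600.0, 640.0, 680.0]

def fit_line(xs, ys):
    # Ordinary least squares for y = slope * x + intercept.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 40.0 480.0 for this exactly-linear data
```

With the fitted line, a new student's predicted score is just `slope * hours + intercept`.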

Figure 4: Diagram showing the fit of a regression line.

Reinforcement Learning: Learning Through Interaction

Reinforcement learning focuses on developing systems (agents) that improve performance based on interacting with their environment. If that sounds like supervised learning to you, that's because they are related! Reinforcement learning also involves a feedback mechanism. However, this feedback isn't the correct ground truth label or value; it's a measure of how good an action was, as judged by a reward function. The agent can then learn to maximize this reward by exploring different actions and strategies.

Figure 5: Simple diagram illustrating a reinforcement learning workflow.

Chess is a popular example: the agent decides on a move (action) based on the state of the board (environment), and it receives a reward (win or lose) at the end of the game. Now, not every action immediately brings a positive or negative reward. For example, by sacrificing a pawn, a chess player might not get a direct reward right away; only at the end of the game will the outcome and reward be clear. Thus, reinforcement learning is concerned with making decisions based on delayed feedback to maximize the total reward, which can be both immediate and delayed.
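
The reward-maximization loop can be sketched with the simplest reinforcement learning setting, a two-armed bandit: the agent does not know the payout rates below and must discover by trial and error which action is better. The payout probabilities are invented, and epsilon-greedy is just one standard exploration strategy among several.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Two slot machines (actions); the second pays off more often.
true_means = [0.3, 0.7]

def pull(arm):
    # Environment: reward 1 with the arm's hidden probability, else 0.
    return 1.0 if random.random() < true_means[arm] else 0.0

# Epsilon-greedy agent: mostly exploit the best current estimate,
# but explore a random arm 10% of the time.
estimates = [0.0, 0.0]
counts = [0, 0]
epsilon = 0.1
for _ in range(2000):
    if random.random() < epsilon:
        arm = random.randrange(2)
    else:
        arm = max(range(2), key=lambda a: estimates[a])
    reward = pull(arm)
    counts[arm] += 1
    # Incremental running mean of observed rewards for this arm.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(estimates)  # estimate for arm 1 ends up clearly higher
```

After enough pulls, the agent's estimates approach the hidden payout rates, and it spends most of its pulls on the better arm, purely from reward feedback.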

Unsupervised Learning: Discovering Hidden Structures

Unlike supervised learning (where we have labels) and reinforcement learning (where we have rewards), unsupervised learning helps us to explore the structure of data without a guiding signal.

Clustering helps us to organize data into subgroups based on similarity, while dimensionality reduction allows us to reduce the complexity of our data.

Figure 6: Visual representation of clustering and dimensionality reduction.

Clustering is particularly good for organizing a pile of information into subgroups, such as discovering customer groups based on their purchasing patterns. For example, a cluster might be “people who buy a lot of camping gear” or “people who often buy diapers.” Dimensionality reduction, on the other hand, is often used for preprocessing data with a lot of features by compressing the data to a smaller, more usable form. It's often used to eliminate noisy data, which can make it hard to get good results with some machine learning algorithms.
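
As a sketch of how clustering discovers such groups, here is a tiny one-dimensional k-means. The "annual spend" numbers and starting centroids are made up; in practice you would use something like scikit-learn's KMeans on multi-dimensional data.

```python
# Made-up 1D "annual spend" values: two obvious customer groups.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]

def kmeans_1d(points, c0, c1, iters=10):
    # Alternate: assign each point to its nearest centroid,
    # then move each centroid to the mean of its assigned points.
    for _ in range(iters):
        g0 = [p for p in points if abs(p - c0) <= abs(p - c1)]
        g1 = [p for p in points if abs(p - c0) > abs(p - c1)]
        if g0:
            c0 = sum(g0) / len(g0)
        if g1:
            c1 = sum(g1) / len(g1)
    return c0, c1

c0, c1 = kmeans_1d(points, 0.0, 5.0)
print(c0, c1)  # centroids settle near 1.0 and 9.5
```

No labels were used: the two groups emerge from the data's own structure, which is exactly the unsupervised setting.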

Figure 7: A 3D Swiss roll dataset projected into 2D space (3D-to-2D dimensionality reduction with PCA).
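
In the 2D case the principal direction of PCA has a closed form, which makes the idea easy to see in pure Python. The points below are made up and lie roughly along the line y = x, so the leading direction should come out near 45°.

```python
import math

# Made-up 2D points lying roughly along y = x.
pts = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.0)]

def first_principal_direction(pts):
    # Build the 2x2 covariance matrix of the centered data.
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mx) ** 2 for p in pts) / n
    syy = sum((p[1] - my) ** 2 for p in pts) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / n
    # Closed-form angle of the leading eigenvector of a 2x2 covariance.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)

ux, uy = first_principal_direction(pts)
# 2D -> 1D: each point becomes its coordinate along that direction.
projected = [p[0] * ux + p[1] * uy for p in pts]
print(ux, uy)  # both components close to cos(45°) ≈ 0.707
```

Each 2D point is reduced to a single number along the direction of greatest variance, which is the essence of PCA's compression.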

A Quick Reference for All Your Terminology Needs

Now that I have introduced you to the three broad categories of machine learning, let’s go over some of the essential jargon (with synonyms):

  • Training example: Synonymous with an observation, record, instance, or sample. Each row of a dataset.

  • Training: Equivalent to model fitting or parameter estimation.

  • Feature (x): Column in a dataset. Synonymous with predictor, variable, input, attribute, or covariate.

  • Target (y): The outcome we are trying to predict. Synonymous with response variable, dependent variable, (class) label, and ground truth.

  • Loss function: Measurement of how well our model is performing. Synonymous with cost function or error function.
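
As a concrete example of the last item, mean squared error (MSE) is a common loss function for regression; a minimal sketch:

```python
def mse(y_true, y_pred):
    # Average of squared differences: 0 for a perfect fit,
    # growing as predictions drift from the targets.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(mse([3.0, 5.0], [3.0, 5.0]))  # 0.0
print(mse([0.0, 0.0], [1.0, 1.0]))  # 1.0
```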

The Roadmap for Building Machine Learning Systems

Now that you've got a grasp of the core concepts of machine learning and its various flavors, let's take a step back and consider all the steps that are essential for actually building a machine learning system that does something useful.

Figure 8: Simple diagram showing the steps of a machine learning workflow.

Preprocessing => Training/Selecting a Model => Evaluate/Predict.

Here's a high-level overview that you can keep in mind:

  • Preprocessing: Data rarely comes in the form a learning algorithm needs, which is why data preprocessing is a crucial first step. This usually involves cleaning the data, removing inconsistencies, and transforming the features. For example, many algorithms perform better when features are on the same scale, and dimensionality reduction can compress the data while retaining most of the relevant information.

  • Training and selecting a model: Now that you've got the data ready to go, it's time to choose and train a predictive model. Since different algorithms make different assumptions, it's usually recommended to evaluate multiple models. And since default hyperparameter settings are rarely optimal, you should use tuning techniques to optimize the model's parameters and hyperparameters and squeeze out the best possible performance.

  • Evaluating models and predicting data: The final step is to evaluate how well your model generalizes to new, unseen data. If satisfied, you can use that model to make actual predictions. (Make sure to reuse the parameters and transformations fitted on the training data here, to avoid overly optimistic estimates.)
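
The feature-scaling mentioned in the preprocessing step can be sketched as z-score standardization, which rescales a column to zero mean and unit variance. This is a minimal pure-Python version; in practice scikit-learn's StandardScaler does the same job and remembers the fitted mean and scale for reuse on new data.

```python
def standardize(column):
    # Rescale a numeric column to zero mean and unit variance (z-scores).
    mean = sum(column) / len(column)
    std = (sum((x - mean) ** 2 for x in column) / len(column)) ** 0.5
    return [(x - mean) / std for x in column]

scaled = standardize([10.0, 20.0, 30.0, 40.0, 50.0])
print(scaled)  # symmetric values centered on 0
```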

Python for Machine Learning: A Step-by-Step Environment Setup Guide
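
As a starting point, a recent Python 3 plus `pip install numpy scipy matplotlib pandas scikit-learn` (or the Anaconda distribution, which bundles them) covers everything used in introductory machine learning. Here is a small sketch to check which of those packages your environment already has (note that `sklearn` is scikit-learn's import name):

```python
import importlib.util

# The standard scientific Python stack for machine learning.
packages = ["numpy", "scipy", "matplotlib", "pandas", "sklearn"]

def is_installed(name):
    # find_spec returns None when the package can't be located.
    return importlib.util.find_spec(name) is not None

for name in packages:
    print(f"{name}: {'ok' if is_installed(name) else 'missing'}")
```

Anything reported missing can be installed with `pip install <package>` (using `scikit-learn`, not `sklearn`, as the pip package name).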

Released under the MIT License. Ala GARBAA © 2009-2025.

Built & designed by Ala GARBAA.