I’ve played around with a couple of small machine learning projects before, including using Genetic Algorithms for a fun video game project. However, I decided to take a step back and learn the fundamentals—even if this particular project isn’t the most exciting one. I followed a wonderful guide on YouTube by Niam Yaraghi, which helped me understand the basics of classification using the Iris dataset.
The Dataset: Iris Flower Data
The Iris dataset is a well-known dataset in machine learning, consisting of 150 samples of iris flowers, divided into three species: Setosa, Versicolor, and Virginica. Each sample has four features:
- Sepal Length (cm)
- Sepal Width (cm)
- Petal Length (cm)
- Petal Width (cm)
Our goal is to build a model that can classify a given flower into one of these three species based on its measurements.
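As an aside (my own addition, not part of the guide): you don't strictly need a CSV file for this, because scikit-learn bundles the same 150 samples, which is handy if you just want to poke at the data.
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)  # returns the data as pandas objects
print(iris.frame.head())         # four measurement columns plus a numeric 'target' column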
Steps in Building the Model
1. Importing Required Libraries
We start by loading essential Python libraries for data manipulation, visualization, and machine learning:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
2. Loading and Exploring the Data
We load the Iris dataset into a Pandas DataFrame:
iris_data = pd.read_csv('Iris.csv')
To get an initial understanding of the dataset, we can use:
iris_data.head() # View the first five rows
iris_data.info() # Summary of dataset
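A couple of extra calls I find handy for a quick sanity check (the column names below assume the Kaggle-style Iris.csv used throughout this post):
iris_data["Species"].value_counts()  # should report 50 samples for each of the three species
iris_data.describe()                 # summary statistics for the numeric columns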
3. Data Visualization
Before diving into modeling, we visualize the data to understand its distribution.
sns.lmplot(x="SepalLengthCm", y="SepalWidthCm", data=iris_data, hue="Species", fit_reg=False)
plt.show()
This scatter plot gives us insights into how the different species are distributed in terms of sepal length and width.
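If you want the same picture for every pair of features at once, seaborn's pairplot does it in one call. I drop the Id column first (the Kaggle download includes one) since it isn't a real measurement; errors="ignore" keeps the line safe if your file doesn't have that column.
sns.pairplot(iris_data.drop(columns=["Id"], errors="ignore"), hue="Species")
plt.show()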
4. Preparing the Data for Training
Since machine learning models work with numerical values, we need to encode our categorical target variable (Species). We use OrdinalEncoder for this:
ord_enc = OrdinalEncoder()
iris_data["Species_code"] = ord_enc.fit_transform(iris_data[["Species"]])
Next, we split the dataset into features (X) and target labels (y):
X = iris_data.drop(columns=["Species", "Species_code"])
y = iris_data["Species_code"]
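One thing to watch here: if your copy of Iris.csv has the Id column mentioned earlier, the line above keeps it as a feature. A row number carries no botanical information, so it is worth dropping as well:
X = iris_data.drop(columns=["Id", "Species", "Species_code"], errors="ignore")  # keep only the four measurements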
We then divide the data into training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1984)
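A small variation I sometimes use (not in the original guide) is to stratify the split so each species keeps the same proportion in the training and test sets; on a dataset of only 150 rows this can make the test accuracy a bit less noisy:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1984, stratify=y
)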
5. Training the k-NN Model
We use the k-Nearest Neighbors (k-NN) algorithm with k=1 to train our model:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
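With k=1, the model simply labels each new flower with the species of its single closest training sample (by Euclidean distance on the measurements). Out of curiosity, and as an extra step beyond the guide, we can sweep a few values of k and compare accuracy on the held-out set:
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    print(f"k={k:2d}  test accuracy={model.score(X_test, y_test):.3f}")  # mean accuracy on the test set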
6. Making Predictions and Evaluating the Model
After training, we predict the test set and evaluate the accuracy:
y_pred = knn.predict(X_test)
print(np.mean(y_pred == y_test))
The printed value is the model’s accuracy on the held-out test set, i.e. the fraction of test flowers it classified correctly.
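Beyond raw accuracy, scikit-learn's metrics module gives a per-species breakdown, which is useful for seeing which species (typically Versicolor vs. Virginica) get confused with one another:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))  # rows = true species, columns = predicted species
print(classification_report(y_test, y_pred, target_names=ord_enc.categories_[0]))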
Final Thoughts
While this is a relatively simple and perhaps unexciting project, gaining a deeper understanding of how it works provides a strong foundation for implementing more advanced machine learning algorithms. In the coming months, I will be studying my Machine Learning module at university, so building my knowledge now is crucial—especially as I prepare to work on my thesis project, which involves applying machine learning to cyber threat analysis.