This is an introductory example of machine learning and pattern recognition. A Python program is trained to predict the species of iris plants.
The Iris dataset is used for this, and a decision tree classifies the data. This tutorial was written with Python 3.6; Python 3.5 or later is required. It shows how to use machine learning to teach a program to find patterns in existing data and to make predictions from them.
What is the Iris Dataset?
The Iris dataset is a multivariate dataset containing 150 samples: 50 for each of three iris plant species. Each sample consists of four measurements (sepal length, sepal width, petal length and petal width). With the help of machine learning you can identify patterns in this dataset. It is often used by beginners for machine learning projects.
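The figures above can be checked against the copy of the dataset that ships with scikit-learn. This is a minimal sketch; the variable name samples_per_species is chosen here for illustration only.

```python
from collections import Counter

from sklearn.datasets import load_iris

iris = load_iris()

# One row per sample, one column per measured attribute
print(iris.data.shape)

# Names of the measured attributes
print(iris.feature_names)

# Count the samples per species by translating each class ID to its name
samples_per_species = Counter(str(iris.target_names[i]) for i in iris.target)
print(dict(samples_per_species))
```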
What is a Decision Tree?
A "decision tree" is used to make decisions. It is similar to a flowchart: it consists of nodes, and at each node a binary decision (yes or no) is made. A decision tree is well suited to data with few attributes and requires little data preparation. For larger or more complex datasets, other algorithms can often make more accurate predictions.
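To make the yes/no structure visible, a small tree can be trained and printed as text. This is only a sketch: the depth limit of 2 and the fixed random_state are illustrative choices, and export_text is assumed to be available (it was added to scikit-learn in version 0.21).

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree (max_depth=2 is an illustrative choice) keeps the printout short
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Each "<=" line is one node asking a yes/no question about a single attribute;
# each "class:" line is a leaf holding the predicted species ID
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Reading the printout from top to bottom follows the same path the classifier takes for a single sample.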
The following packages must be installed:
- NumPy (>= 1.11.0)
- SciPy (>= 0.17.0)
- joblib (>= 0.11)
scikit-learn itself can be installed via the package manager pip, which also installs the packages listed above if they are missing:
pip install scikit-learn
Installation from the Windows command prompt (CMD):
python -m pip install scikit-learn
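After the installation, a short script can confirm that the packages are importable and show which versions were installed. This is a sketch; the dictionary name installed_versions is made up for this example.

```python
import joblib
import numpy
import scipy
import sklearn

# Each package exposes its version string as __version__
installed_versions = {
    "NumPy": numpy.__version__,
    "SciPy": scipy.__version__,
    "joblib": joblib.__version__,
    "scikit-learn": sklearn.__version__,
}

for name in sorted(installed_versions):
    print("{}: {}".format(name, installed_versions[name]))
```

If one of the imports fails with an ImportError, that package is missing from the Python environment in use.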
Now a Python program is created that learns from the existing dataset and finds patterns in it. The package "NumPy" is used to store the dataset in arrays. NumPy is commonly used when working with datasets, e.g. in machine learning.
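The array handling this tutorial relies on can be tried on a toy array first. The sketch below shuffles the row indexes and splits off the last three shuffled rows; the values and the names train and held_out are illustrative only.

```python
import numpy

# Ten toy values standing in for the rows of a dataset
values = numpy.arange(10) * 10

# Shuffled row indexes: every index from 0 to 9 appears exactly once
shuffled_ids = numpy.random.permutation(len(values))

# Indexing an array with an index array picks those rows in that order
train = values[shuffled_ids[:-3]]     # all but the last three shuffled rows
held_out = values[shuffled_ids[-3:]]  # the last three shuffled rows

print(train)
print(held_out)
```

Because the indexes are shuffled randomly, the two printed arrays differ from run to run, but together they always contain every value exactly once.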
The package "scikit-learn" is used for machine learning. The module "tree" (for using a decision tree) and the function "accuracy_score" are imported from this package. The Iris dataset is loaded with the function "load_iris" from the module "sklearn.datasets".
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
import numpy

# Prepare the dataset:
# - iris.data holds the measurements
# - iris.target holds the labels (plant species) of the data
# The names of the plant species can be retrieved via "iris.target_names".
# In "iris.target" the species are stored as IDs (numbers).
iris = load_iris()
x_coordinate = iris.data
y_coordinate = iris.target
plant_names = iris.target_names

# Create random indexes used to retrieve the data in the iris dataset
array_ids = numpy.random.permutation(len(x_coordinate))

# "train" holds the data the machine learning program learns from.
# "real" holds the actual data used to check the predictions.
# The last 15 shuffled values are used for "real", the rest for "train".
x_coordinate_train = x_coordinate[array_ids[:-15]]
x_coordinate_real = x_coordinate[array_ids[-15:]]
y_coordinate_train = y_coordinate[array_ids[:-15]]
y_coordinate_real = y_coordinate[array_ids[-15:]]

# Create a decision tree classifier and train it with the training data
data_classification = tree.DecisionTreeClassifier()
data_classification.fit(x_coordinate_train, y_coordinate_train)

# Create predictions for the held-out data (dataset "real")
prediction = data_classification.predict(x_coordinate_real)

# Display the predicted species IDs
print(prediction)

# Display the actual species IDs
print(y_coordinate_real)

# Calculate the accuracy of the predictions:
# accuracy_score() takes the actual values and the predicted values
# and returns the fraction of correct predictions
print("Accuracy in percent: %.2f" % (accuracy_score(y_coordinate_real, prediction) * 100))
When this program is executed, it prints output like the following. The output varies between runs because the data is shuffled randomly. The plant species are output as arrays of IDs.
[1 1 0 2 2 0 1 2 1 0 2 1 2 0 2]
[2 1 0 2 2 0 1 2 1 0 2 1 2 0 2]
Accuracy in percent: 93.33
The IDs of the iris plant species: 0 is Iris setosa, 1 is Iris versicolor, 2 is Iris virginica.
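The IDs can also be translated back into names in code. A minimal sketch, using the predicted IDs from the sample output above as example input (a real run will produce different IDs):

```python
import numpy
from sklearn.datasets import load_iris

iris = load_iris()

# Predicted class IDs copied from the sample output above
prediction = numpy.array([1, 1, 0, 2, 2, 0, 1, 2, 1, 0, 2, 1, 2, 0, 2])

# Indexing target_names with the ID array translates every ID to its name
predicted_names = iris.target_names[prediction]
print(predicted_names)
```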
The first line contains the predictions calculated by the trained decision tree.
The second line contains the actual values used to verify the predictions. As you can see here, 14 of the 15 plant species were predicted correctly, an accuracy of about 93%. The accuracy can change from run to run and depends on the amount of data used.
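The accuracy figure can be reproduced by hand from the two printed arrays. The sketch below counts the matching positions in the sample output and compares the result with accuracy_score; the names predicted, actual and correct are chosen for this example.

```python
import numpy
from sklearn.metrics import accuracy_score

# The two arrays from the sample output
predicted = numpy.array([1, 1, 0, 2, 2, 0, 1, 2, 1, 0, 2, 1, 2, 0, 2])
actual = numpy.array([2, 1, 0, 2, 2, 0, 1, 2, 1, 0, 2, 1, 2, 0, 2])

# Accuracy is the share of positions where prediction and actual value agree
correct = int((predicted == actual).sum())
accuracy = correct / len(actual)

print("%d of %d correct" % (correct, len(actual)))
print("Accuracy in percent: %.2f" % (accuracy * 100))

# accuracy_score(y_true, y_pred) from scikit-learn computes the same value
assert abs(accuracy - accuracy_score(actual, predicted)) < 1e-12
```

Only the first position differs (predicted 1, actual 2), which gives 14 / 15 = 93.33%.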
Also try this program with test set sizes other than the 15 samples used here, and with larger datasets. In general, the more training data you supply to the program, the better it can recognize patterns and make predictions from them. Machine learning, as shown in this introductory example, is also used in logistics, for example, to forecast the number of goods required in the future from existing data on the number of goods orders.
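One way to experiment with different split sizes is sketched below. It swaps the manual index shuffling for train_test_split from scikit-learn and fixes random_state so that repeated runs give the same result. The list of test sizes and the seed value 0 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Hypothetical numbers of held-out samples to compare
test_sizes = [15, 30, 60]
results = {}

for test_size in test_sizes:
    # An integer test_size is the absolute number of held-out samples;
    # random_state fixes the shuffle so repeated runs give the same split
    x_train, x_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=test_size, random_state=0
    )
    classifier = DecisionTreeClassifier(random_state=0)
    classifier.fit(x_train, y_train)
    results[test_size] = accuracy_score(y_test, classifier.predict(x_test))

for test_size in test_sizes:
    print("%3d test samples: %.2f%% accuracy" % (test_size, results[test_size] * 100))
```

A larger test set leaves fewer samples for training, so the accuracies need not rise or fall in a regular way on a dataset this small.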