How to test the performance of an algorithm
February 7, 2019
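Testing algorithm performance
In this section we implement a way to measure how well our algorithm performs.
To do that, we need data that was never used during training, so that we can check how good the model's predictions are on unseen data.
We can therefore randomly divide all of our data into two parts, one for training and one for testing, namely:
train test split
First, load the iris dataset that we will use for testing.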
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
Use shape to inspect the dimensions of the data
X.shape
(150, 4)
y.shape
(150,)
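Implementing the train test split algorithm
First we shuffle the data. Because each sample in X corresponds to the label in y at the same index, we must not shuffle the two arrays independently, otherwise samples would be paired with the wrong labels.
Instead, we generate a shuffled array of indices and then reorder both X and y with that same array.
- Generate the random indices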
shuffle_indexes = np.random.permutation(len(X))
- Choose the proportion of data used for testing and compute the size of the test set
test_ratio = 0.2
test_size = int(len(X) * test_ratio)
We can check the size of the test set
test_size
30
- Split the shuffled indices into test indices and training indices
test_indexes = shuffle_indexes[:test_size]
train_indexes = shuffle_indexes[test_size:]
- Build the training set
X_train = X[train_indexes]
y_train = y[train_indexes]
- Build the test set
X_test = X[test_indexes]
y_test = y[test_indexes]
Check the shapes of the training samples and training labels
print(X_train.shape)
print(y_train.shape)
(120, 4)
(120,)
Check the shapes of the test samples and test labels
print(X_test.shape)
print(y_test.shape)
(30, 4)
(30,)
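The reason for shuffling indices rather than the arrays themselves can be checked with a small sketch. The toy arrays below are hypothetical (they are not the iris data): row i of X is [i, i] and its label is i, so a sample is still paired with its own label exactly when the first column of X equals y. The seed passed to the generator is arbitrary.

```python
import numpy as np

# Toy data (hypothetical): row i of X is [i, i] and its label is i,
# so a pair is intact exactly when X[j, 0] == y[j].
X_toy = np.arange(10).repeat(2).reshape(10, 2)
y_toy = np.arange(10)

rng = np.random.default_rng(0)  # arbitrary seed, only for the demo

# One shared permutation, applied to both arrays.
shuffle_indexes = rng.permutation(len(X_toy))
X_shuffled = X_toy[shuffle_indexes]
y_shuffled = y_toy[shuffle_indexes]

# Every sample still sits next to its own label.
pairs_intact = bool(np.all(X_shuffled[:, 0] == y_shuffled))
print(pairs_intact)

# Two independent permutations would generally break the pairing.
X_wrong = X_toy[rng.permutation(len(X_toy))]
y_wrong = y_toy[rng.permutation(len(y_toy))]
print(int(np.sum(X_wrong[:, 0] != y_wrong)), "mismatched pairs")
```

Indexing both arrays with the same `shuffle_indexes` keeps every pair intact, while two separate permutations generally do not.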
Wrapping the algorithm in a function
def train_test_split(X, y, test_ratio=0.2, seed=None):
    """Split X and y into training and test sets according to test_ratio."""
    assert X.shape[0] == y.shape[0], \
        "the size of X must be equal to the size of y"
    assert 0.0 <= test_ratio <= 1.0, \
        "test_ratio must be between 0.0 and 1.0"
    # Compare against None so that seed=0 is also honoured
    if seed is not None:
        np.random.seed(seed)
    # One shared permutation keeps every sample paired with its label
    shuffle_indexes = np.random.permutation(len(X))
    test_size = int(len(X) * test_ratio)
    test_indexes = shuffle_indexes[:test_size]
    train_indexes = shuffle_indexes[test_size:]
    X_train = X[train_indexes]
    y_train = y[train_indexes]
    X_test = X[test_indexes]
    y_test = y[test_indexes]
    return X_train, X_test, y_train, y_test
Using the wrapped function
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(120, 4)
(120,)
(30, 4)
(30,)
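The seed parameter exists so that a split can be reproduced. The sketch below restates the function compactly and calls it twice with the same seed. Two things are assumptions: it uses a synthetic stand-in for the iris arrays so that it runs without sklearn, and it tests `seed is not None` so that a seed of 0 is honoured as well.

```python
import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=None):
    """Shuffle X and y with one shared permutation, then split off a test set."""
    assert X.shape[0] == y.shape[0], "X and y must have the same number of samples"
    assert 0.0 <= test_ratio <= 1.0, "test_ratio must be between 0.0 and 1.0"
    if seed is not None:  # `is not None` so that seed=0 works too
        np.random.seed(seed)
    shuffle_indexes = np.random.permutation(len(X))
    test_size = int(len(X) * test_ratio)
    test_indexes = shuffle_indexes[:test_size]
    train_indexes = shuffle_indexes[test_size:]
    return X[train_indexes], X[test_indexes], y[train_indexes], y[test_indexes]

# Synthetic stand-in for the iris arrays (assumption): 150 samples, 3 classes.
X_demo = np.arange(300).reshape(150, 2)
y_demo = np.arange(150) % 3

first = train_test_split(X_demo, y_demo, seed=666)
second = train_test_split(X_demo, y_demo, seed=666)

# The same seed produces the same four arrays.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
print(first[0].shape, first[1].shape)
```

Note that `np.random.seed` sets NumPy's global random state, so seeding here also affects any later calls to `np.random` elsewhere in the program.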
Applying the split to the kNN algorithm from the previous section
my_kNN_classifier = kNNClassifier(3)
my_kNN_classifier.fit(X_train=X_train, y_train=y_train)
kNN(k=3)
y_predict = my_kNN_classifier.predict(X_predict=X_test)
y_predict
array([2, 0, 0, 1, 0, 0, 0, 2, 1, 1, 0, 0, 0, 2, 1, 0, 0, 2, 0, 1, 2, 1,
0, 1, 2, 0, 0, 2, 1, 2])
y_test
array([2, 0, 0, 1, 0, 0, 0, 2, 1, 1, 0, 0, 0, 2, 1, 0, 0, 2, 0, 1, 2, 2,
0, 1, 2, 0, 0, 2, 1, 2])
Measure accuracy by computing the proportion of correct predictions
accuracy = sum(y_predict == y_test) / len(y_test)
accuracy
0.9666666666666667
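The same computation can be packaged as a small helper. This is a minimal sketch: the name `compute_accuracy` is hypothetical, and the two arrays are copied from the outputs shown above so that the block runs on its own.

```python
import numpy as np

def compute_accuracy(y_true, y_predict):
    """Return the fraction of predictions that equal the true labels."""
    y_true = np.asarray(y_true)
    y_predict = np.asarray(y_predict)
    assert y_true.shape == y_predict.shape, \
        "y_true and y_predict must have the same shape"
    # The mean of a boolean array is the proportion of True entries.
    return float(np.mean(y_true == y_predict))

# Arrays copied from the outputs shown above.
y_predict = np.array([2, 0, 0, 1, 0, 0, 0, 2, 1, 1, 0, 0, 0, 2, 1, 0, 0, 2, 0, 1, 2, 1,
                      0, 1, 2, 0, 0, 2, 1, 2])
y_test = np.array([2, 0, 0, 1, 0, 0, 0, 2, 1, 1, 0, 0, 0, 2, 1, 0, 0, 2, 0, 1, 2, 2,
                   0, 1, 2, 0, 0, 2, 1, 2])

acc = compute_accuracy(y_test, y_predict)
wrong = np.flatnonzero(y_predict != y_test)  # positions of the misclassified samples
print(acc)    # 29 of 30 correct
print(wrong)  # the single mismatch is at index 21
```

Listing the mismatched positions alongside the score makes it easy to inspect which test samples the classifier got wrong.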
train_test_split in sklearn
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(120, 4)
(120,)
(30, 4)
(30,)
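For completeness, the whole workflow can be sketched with sklearn alone. This is an illustration under assumptions: sklearn's `KNeighborsClassifier` stands in for the `kNNClassifier` written in the previous section, and `accuracy_score` from `sklearn.metrics` replaces the manual accuracy computation. The resulting score depends on the split, so no specific value is claimed here.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Hold out 20% of the samples; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=666)

# sklearn's kNN classifier is used as a stand-in for the hand-written one.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_predict = knn.predict(X_test)

# Proportion of test samples that were classified correctly.
acc = accuracy_score(y_test, y_predict)
print(X_train.shape, X_test.shape)
print(acc)
```

The classifier only ever sees `X_train` and `y_train` during fitting, so the score on `X_test` reflects how it handles data it has not seen.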