How to Test the Performance of an Algorithm

February 7, 2019

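Testing Algorithm Performance

In this section, we implement a way to measure how well our algorithm performs.

To do that, we need data that was never used during training, so that we can check how good the model's predictions are on unseen data.

We can therefore randomly split all of our data into two parts, one for training and one for testing. This procedure is known as the train test split.

First, load the iris data set that we will use for testing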

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()

X = iris.data
y = iris.target

Check the shape of the data with shape

X.shape

(150, 4)

y.shape

(150,)

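Implementing the train test split algorithm

First we shuffle the data. Each sample in X corresponds to the label in y at the same index, so we cannot shuffle the two arrays independently, or samples would end up paired with the wrong labels.

Instead, we generate a shuffled array of indexes and then reorder both X and y by that same array.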
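To see why a shared index array keeps samples and labels paired, here is a minimal sketch on toy data. The names X_demo and y_demo are made up for this illustration and are not part of the iris example.

```python
import numpy as np

# Toy data: row i is [2*i, 2*i + 1] and its label is simply i,
# so the pairing between a row and its label is easy to verify.
X_demo = np.arange(10).reshape(5, 2)
y_demo = np.arange(5)

# One shuffled index array, applied to both arrays
idx = np.random.permutation(len(X_demo))
X_shuffled = X_demo[idx]
y_shuffled = y_demo[idx]

# The pairing is intact if every row still starts with twice its label
aligned = bool(np.all(X_shuffled[:, 0] == 2 * y_shuffled))
print(aligned)
```

Shuffling X_demo and y_demo with two separate permutations would almost always break this check.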

Generate the random indexes

shuffle_indexes = np.random.permutation(len(X))

Choose the proportion of data used for testing and compute the size of the test set

test_ratio = 0.2
test_size = int(len(X) * test_ratio)

We can check the size of the test set

test_size

Build the index arrays for the training set and the test set

test_indexes = shuffle_indexes[:test_size]
train_indexes = shuffle_indexes[test_size:]

Build the training set

X_train = X[train_indexes]
y_train = y[train_indexes]

Build the test set

X_test = X[test_indexes]
y_test = y[test_indexes]

Check the shapes of the training samples and labels

print(X_train.shape)
print(y_train.shape)

(120, 4)
(120,)

Check the shapes of the test samples and labels

print(X_test.shape)
print(y_test.shape)

(30, 4)
(30,)
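Beyond the shapes, it can be worth confirming that the two index sets really partition the data. The following is a small sanity check, sketched on bare indexes with the same sample count as iris; the variable names overlap and covered are introduced here only for illustration.

```python
import numpy as np

n_samples = 150      # same number of samples as the iris data set
test_ratio = 0.2

shuffle_indexes = np.random.permutation(n_samples)
test_size = int(n_samples * test_ratio)
test_indexes = shuffle_indexes[:test_size]
train_indexes = shuffle_indexes[test_size:]

# No index may appear in both sets
overlap = np.intersect1d(train_indexes, test_indexes)
# Together the two sets must contain every index exactly once
covered = np.sort(np.concatenate([train_indexes, test_indexes]))

print(len(overlap))
print(np.array_equal(covered, np.arange(n_samples)))
```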

Wrapping the algorithm in a function

def train_test_split(X, y, test_ratio=0.2, seed=None):
    """Randomly split X and y into training and test sets by test_ratio."""
    assert X.shape[0] == y.shape[0], \
        "the size of X must be equal to the size of y"
    assert 0.0 <= test_ratio <= 1.0, \
        "test_ratio must be between 0.0 and 1.0"
    # Compare against None so that seed=0 is honored as well
    if seed is not None:
        np.random.seed(seed)
    # Shuffle the indexes, not the data, so X and y stay aligned
    shuffle_indexes = np.random.permutation(len(X))

    test_size = int(len(X) * test_ratio)
    test_indexes = shuffle_indexes[:test_size]
    train_indexes = shuffle_indexes[test_size:]

    X_train = X[train_indexes]
    y_train = y[train_indexes]

    X_test = X[test_indexes]
    y_test = y[test_indexes]

    return X_train, X_test, y_train, y_test

Using the wrapped function

X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(120, 4)
(120,)
(30, 4)
(30,)
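The seed parameter exists so that a split can be reproduced. One possible variation, shown here only as a sketch, draws the permutation from a local generator instead of reseeding NumPy's global state. It assumes NumPy 1.17 or newer for np.random.default_rng, and the name split_with_generator is hypothetical.

```python
import numpy as np

def split_with_generator(X, y, test_ratio=0.2, seed=None):
    """Hypothetical variant of train_test_split that uses a local Generator."""
    if X.shape[0] != y.shape[0]:
        raise ValueError("X and y must have the same number of samples")
    if not 0.0 <= test_ratio <= 1.0:
        raise ValueError("test_ratio must be between 0.0 and 1.0")
    # A local generator leaves the global np.random state untouched;
    # seed=None makes the generator draw fresh entropy.
    rng = np.random.default_rng(seed)
    shuffle_indexes = rng.permutation(len(X))
    test_size = int(len(X) * test_ratio)
    test_indexes = shuffle_indexes[:test_size]
    train_indexes = shuffle_indexes[test_size:]
    return X[train_indexes], X[test_indexes], y[train_indexes], y[test_indexes]

# Toy data, only for the demonstration
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

first = split_with_generator(X_demo, y_demo, seed=0)
second = split_with_generator(X_demo, y_demo, seed=0)

# The same seed should give exactly the same four arrays
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
print(first[0].shape, first[1].shape)
```

With the same seed the two calls return identical arrays, which makes it possible to compare models on exactly the same split.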

Applying the result to the kNN algorithm from the previous section

my_kNN_classifier = kNNClassifier(3)
my_kNN_classifier.fit(X_train=X_train, y_train=y_train)

kNN(k=3)

y_predict = my_kNN_classifier.predict(X_predict=X_test)

y_predict

array([2, 0, 0, 1, 0, 0, 0, 2, 1, 1, 0, 0, 0, 2, 1, 0, 0, 2, 0, 1, 2, 1,
           0, 1, 2, 0, 0, 2, 1, 2])

y_test

array([2, 0, 0, 1, 0, 0, 0, 2, 1, 1, 0, 0, 0, 2, 1, 0, 0, 2, 0, 1, 2, 2,
           0, 1, 2, 0, 0, 2, 1, 2])

Measure performance by computing the proportion of correct predictions

accuracy = sum(y_predict == y_test) / len(y_test)

accuracy

0.9666666666666667
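The accuracy computation can also be wrapped in a small helper. This is a minimal sketch; compute_accuracy is a name chosen here for illustration, and the two arrays are copied from the outputs shown above.

```python
import numpy as np

def compute_accuracy(y_true, y_predict):
    """Hypothetical helper: fraction of predictions equal to the true labels."""
    y_true = np.asarray(y_true)
    y_predict = np.asarray(y_predict)
    if y_true.shape[0] != y_predict.shape[0]:
        raise ValueError("y_true and y_predict must have the same length")
    # The mean of a boolean array is the share of True entries
    return float(np.mean(y_true == y_predict))

# Predictions and true labels as printed above
y_predict_demo = np.array([2, 0, 0, 1, 0, 0, 0, 2, 1, 1, 0, 0, 0, 2, 1,
                           0, 0, 2, 0, 1, 2, 1, 0, 1, 2, 0, 0, 2, 1, 2])
y_test_demo = np.array([2, 0, 0, 1, 0, 0, 0, 2, 1, 1, 0, 0, 0, 2, 1,
                        0, 0, 2, 0, 1, 2, 2, 0, 1, 2, 0, 0, 2, 1, 2])

wrong = int(np.sum(y_predict_demo != y_test_demo))
score = compute_accuracy(y_test_demo, y_predict_demo)
print(wrong)
print(score)
```

Exactly one of the 30 predictions differs from its label, so the score is 29/30, which matches the value computed above.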

train_test_split in sklearn

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(120, 4)
(120,)
(30, 4)
(30,)