I. Introduction to the KNN Algorithm
The nearest neighbor algorithm, or K-Nearest Neighbor (KNN) classification algorithm, is one of the simplest classification techniques in data mining. "K nearest neighbors" means exactly that: the K closest neighbors, the idea being that every sample can be represented by the K samples closest to it. The nearest neighbor algorithm classifies every record in a dataset on this basis.
The k-nearest neighbor method is a basic approach to classification and regression and a commonly used supervised learning method. It assumes a training dataset in which the class of every instance is already known. To classify a new instance, it predicts the class from the instance's k nearest training instances, typically by majority vote.
II. Core Idea of the KNN Algorithm
The core idea of KNN is this: if most of the K samples nearest to a given sample in feature space belong to one class, then that sample belongs to the same class and shares the characteristics of that class. The method bases its classification decision only on the class of the nearest one or few samples, so only a very small number of neighbors is involved. Because KNN relies on a limited set of nearby samples rather than on discriminating class regions, it is better suited than other methods to sample sets whose class regions intersect or overlap heavily.
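A minimal sketch of this idea on invented toy data (NumPy only; the points and labels below are hypothetical, chosen just to make the vote visible):

import numpy as np
from collections import Counter

# Hypothetical training set: 2-D points, labels 0 and 1.
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [0.1, 0.2], [0.0, 0.1]])
y_train = np.array([1, 1, 0, 0])

x = np.array([1.0, 1.0])                     # query sample
dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training sample
nearest = np.argsort(dists)[:3]              # indices of the K = 3 closest samples
print(Counter(y_train[nearest]).most_common(1)[0][0])  # -> 1 (majority vote)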
III. The Three Elements of KNN
The three elements of the k-nearest neighbor method are the distance metric, the choice of k, and the classification decision rule. The usual distance metric is the Euclidean distance or, more generally, the Lp (Minkowski) distance. When k is small, the model is more complex and overfits easily; when k is large, the model is simpler and underfits easily. The choice of k therefore has a major effect on the classification results: it reflects the trade-off between approximation error and estimation error, and the optimal k is usually selected by cross-validation. Both points are sketched in code below.
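A minimal sketch of both ideas, assuming NumPy and (for the cross-validation part only) scikit-learn, which the article's own code does not depend on:

import numpy as np

def lp_distance(a, b, p=2):
    """Lp (Minkowski) distance; p = 2 is Euclidean, p = 1 is Manhattan."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

print(lp_distance(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # 5.0

# Selecting k by 5-fold cross-validation on a sample dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 11)}
print(max(scores, key=scores.get))  # k with the best mean validation accuracy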
IV. Strengths and Weaknesses of KNN
Strengths
1. Simple, easy to understand and implement; no parameters to estimate and no training required;
2. Well suited to classifying rare events;
3. Especially suitable for multi-class problems (multi-modal, where objects carry multiple class labels), on which kNN tends to perform better than SVM.
Weaknesses
1. A major shortcoming at classification time is sample imbalance: when one class has a very large sample size and the others are small, the K neighbors of a new input sample may be dominated by the large class.
2. Another shortcoming is the computational cost: for every sample to be classified, the distances to all known samples must be computed before its K nearest neighbors can be found. Standard mitigations for both issues are sketched below.
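Distance-weighted voting counters class imbalance, and tree-based indexes speed up the neighbor search; neither appears in the article's code, but in scikit-learn (used here purely for illustration) both are constructor options:

from sklearn.neighbors import KNeighborsClassifier

# weights='distance': closer neighbors get larger votes, softening the
# dominance of a large class; algorithm='kd_tree': finds neighbors via a
# KD-tree instead of scanning every training sample per query.
clf = KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='kd_tree')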
V. A Simple Implementation from Scratch
1. Imports
import numpy as np
from math import sqrt
from collections import Counter
from metrics import accuracy_score  # local helper module, not a standard library (see below)
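The metrics module above is the author's own helper file, not part of the standard library. A minimal metrics.py that satisfies this import could look like the following (an assumed reconstruction, not the author's original file):

# metrics.py
import numpy as np

def accuracy_score(y_true, y_predict):
    """Fraction of predictions that match the true labels."""
    assert y_true.shape[0] == y_predict.shape[0], \
        "the size of y_true must be equal to the size of y_predict"
    return np.sum(y_true == y_predict) / len(y_true)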
2. Initializing the kNN Classifier
class KNNClassifier:

    def __init__(self, k):
        """Initialize the kNN classifier."""
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._y_train = None
3. Fitting the Training Data
def fit(self, X_train, y_train):
    """Train the kNN classifier on the training set X_train, y_train."""
    assert X_train.shape[0] == y_train.shape[0], \
        "the size of X_train must be equal to the size of y_train"
    assert self.k <= X_train.shape[0], \
        "the size of X_train must be at least k."
    # kNN is a lazy learner: "training" just stores the dataset.
    self._X_train = X_train
    self._y_train = y_train
    return self
4. Predicting Multiple Samples
def predict(self, X_predict):
    """Given a dataset X_predict to classify, return a vector of predicted labels."""
    assert self._X_train is not None and self._y_train is not None, \
        "must fit before predict!"
    assert X_predict.shape[1] == self._X_train.shape[1], \
        "the feature number of X_predict must be equal to X_train"
    # Classify each row independently with the single-sample helper.
    y_predict = [self._predict(x) for x in X_predict]
    return np.array(y_predict)
5. Predicting a Single Sample
def _predict(self, x):
    """Given a single sample x to classify, return its predicted label."""
    assert x.shape[0] == self._X_train.shape[1], \
        "the feature number of x must be equal to X_train"
    # Euclidean distance from x to every training sample.
    distances = [sqrt(np.sum((x_train - x) ** 2))
                 for x_train in self._X_train]
    # Training-sample indices sorted by distance, nearest first.
    nearest = np.argsort(distances)
    # Labels of the k nearest neighbors.
    topK_y = [self._y_train[i] for i in nearest[:self.k]]
    # Majority vote decides the predicted label.
    votes = Counter(topK_y)
    return votes.most_common(1)[0][0]
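To see how the vote resolves: Counter tallies the k neighbor labels, and most_common(1) returns a one-element list holding the (label, count) pair with the highest count. For example, with five hypothetical neighbor labels:

from collections import Counter

votes = Counter([1, 1, 0, 2, 1])  # labels of k = 5 hypothetical neighbors
print(votes.most_common(1))       # [(1, 3)] -> the predicted label is 1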
6. Model Accuracy
def score(self, X_test, y_test):
    """Measure the model's accuracy on the test set X_test and y_test."""
    y_predict = self.predict(X_test)
    return accuracy_score(y_test, y_predict)
VI. Complete Code
import numpy as np
from math import sqrt
from collections import Counter
from metrics import accuracy_score  # local helper module (see Section V)


class KNNClassifier:

    def __init__(self, k):
        """Initialize the kNN classifier."""
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        """Train the kNN classifier on the training set X_train, y_train."""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        assert self.k <= X_train.shape[0], \
            "the size of X_train must be at least k."
        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        """Given a dataset X_predict to classify, return a vector of predicted labels."""
        assert self._X_train is not None and self._y_train is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == self._X_train.shape[1], \
            "the feature number of X_predict must be equal to X_train"
        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        """Given a single sample x to classify, return its predicted label."""
        assert x.shape[0] == self._X_train.shape[1], \
            "the feature number of x must be equal to X_train"
        distances = [sqrt(np.sum((x_train - x) ** 2))
                     for x_train in self._X_train]
        nearest = np.argsort(distances)
        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topK_y)
        return votes.most_common(1)[0][0]

    def score(self, X_test, y_test):
        """Measure the model's accuracy on the test set X_test and y_test."""
        y_predict = self.predict(X_test)
        return accuracy_score(y_test, y_predict)

    def __repr__(self):
        return "KNN(k=%d)" % self.k