Which Machine Learning Datasets Are the Best? 16 Datasets Built into Sklearn That You Shouldn't Miss - 六虎

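Data is the fuel that powers machine learning algorithms, and scikit-learn (sklearn) ships with a number of high-quality datasets. Scikit-learn is a Python machine learning package built on top of SciPy. It stands out for its large collection of algorithms, its ease of use, and its ability to integrate with other Python libraries.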

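What are “Sklearn datasets”?

Sklearn datasets are part of the scikit-learn (sklearn) library. The small ones (the load_* functions) are bundled with the package, so we can access and load them without downloading anything separately; the larger ones (the fetch_* functions) are downloaded on first use and cached locally.

To use a particular dataset, simply import the matching function from the sklearn.datasets module and call it to load the data into your program.

These datasets are usually preprocessed and ready to use, which saves a lot of time and effort for data scientists who need to experiment with different machine learning models and algorithms.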
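As a rough illustration of the loading pattern described above, the sketch below calls a few of the bundled loaders and looks at what comes back. It assumes a reasonably recent scikit-learn; the `loaders` and `shapes` names exist only for this example.

```python
from sklearn.datasets import load_digits, load_iris, load_wine

# Hypothetical helper mapping for this example: dataset name -> loader function
loaders = {"iris": load_iris, "wine": load_wine, "digits": load_digits}

shapes = {}
for name, loader in loaders.items():
    bunch = loader()  # a Bunch: dict-like object that also allows attribute access
    shapes[name] = bunch.data.shape
    print(name, bunch.data.shape, bunch.target.shape)

# return_X_y=True skips the Bunch and returns the (features, target) pair directly
X, y = load_wine(return_X_y=True)
print(X.shape, y.shape)
```

The Bunch form is handy when you also want metadata such as `feature_names` or `DESCR`; `return_X_y=True` is shorter when you only need the arrays.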

1. Iris

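This dataset contains measurements of sepal length, sepal width, petal length, and petal width for 150 iris flowers belonging to three species: setosa, versicolor, and virginica. In scikit-learn the features are stored as a 150 × 4 array with the species in a separate target array; loaded as a DataFrame (as_frame=True), it has 150 rows and 5 columns.

Sepal.Length – the sepal length, in centimeters.
Sepal.Width – the sepal width, in centimeters.
Petal.Length – the petal length, in centimeters.
Petal.Width – the petal width, in centimeters.
Species – the species of the iris, with three possible values: setosa, versicolor, and virginica.

The column names above follow the classic presentation of the dataset; scikit-learn calls the features sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm), and stores the species in target.

You can load the Iris dataset directly from sklearn with the load_iris function of the sklearn.datasets module.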

# To install sklearn, run this in a shell (not in Python): pip install scikit-learn
# To import sklearn
from sklearn.datasets import load_iris
# Load the iris dataset
iris = load_iris()
# Print the dataset description (the Bunch has a DESCR attribute, not a describe() method)
print(iris.DESCR)

The code above loads the Iris dataset with sklearn. Retrieved on March 27, 2023 from scikit-learn.org/stable/modu…
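To make the structure of the loaded object a little more concrete, here is a small sketch that maps the integer labels back to species names and counts the flowers per species. The variable names `species` and `counts` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# target holds integer class labels; target_names holds the matching species names
species = iris.target_names[iris.target]
counts = dict(zip(iris.target_names.tolist(), np.bincount(iris.target).tolist()))

print(iris.data.shape)     # feature matrix: one row per flower
print(iris.feature_names)  # names of the four measurements
print(species[:5])         # species of the first five flowers
print(counts)              # number of flowers per species
```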

2. Diabetes

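This sklearn dataset contains information on 442 diabetes patients, including personal data and clinical measurements:

Age
Sex
Body mass index (BMI)
Average blood pressure
Six blood serum measurements (such as total cholesterol, low-density lipoprotein (LDL) cholesterol, and high-density lipoprotein (HDL) cholesterol).
A quantitative measure of disease progression one year after baseline (the target).

The diabetes dataset can be loaded with the load_diabetes() function of the sklearn.datasets module.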

from sklearn.datasets import load_diabetes
# Load the diabetes dataset
diabetes = load_diabetes()
# Print some information about the dataset
print(diabetes.DESCR)

The code above loads the diabetes dataset with sklearn. Retrieved on March 28, 2023 from scikit-learn.org/stable/data…
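One detail that is easy to miss: by default the ten feature columns returned by load_diabetes are already mean-centered and scaled so that each column's sum of squares is 1. The sketch below simply checks that property; `col_means` and `col_sq_sums` are names made up for this example.

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

col_means = X.mean(axis=0)          # expected to be close to 0 for every feature
col_sq_sums = (X ** 2).sum(axis=0)  # expected to be close to 1 for every feature

print(X.shape, y.shape)
print(np.allclose(col_means, 0.0, atol=1e-6))
print(np.allclose(col_sq_sums, 1.0, atol=1e-3))
print(y[:5])  # the target is a continuous disease-progression score
```

Because the features are pre-scaled, their raw values are not directly interpretable as ages or blood pressures.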

3. Digits

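This sklearn dataset is a collection of handwritten digits from 0 to 9, stored as grayscale images. It contains 1,797 samples in total, each a two-dimensional array of shape (8, 8). The Digits dataset has 64 variables (features), one for each of the 64 pixels in a digit image.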

from sklearn.datasets import load_digits
# Load the digits dataset
digits = load_digits()
# Print the features and target data
print(digits.data)
print(digits.target)

The code above loads the Digits dataset with sklearn. Retrieved on March 29, 2023 from scikit-learn.org/stable/data…
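The relationship between the 64 features and the 8×8 images can be sketched as follows: each row of `data` is just the flattened version of the matching entry in `images`. The names `first_image` and `same` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Turn the first 64-value feature row back into an 8x8 image
first_image = digits.data[0].reshape(8, 8)

# The loader also exposes the unflattened images directly
same = np.array_equal(first_image, digits.images[0])

print(digits.images.shape)  # (n_samples, 8, 8)
print(same)                 # True when data is the flattened images array
print(first_image)          # pixel intensities of the first digit
print(digits.target[0])     # label of the first digit
```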

4. Linnerud

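The Linnerud dataset contains exercise and physiological measurements for 20 middle-aged men at a fitness club.

The dataset includes the following variables:

Three exercise variables – chin-ups (Chins), sit-ups (Situps), and jumps (Jumps).
Three physiological variables – weight (Weight), waist circumference (Waist), and pulse (Pulse).

Load the Linnerud dataset in Python with sklearn: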

from sklearn.datasets import load_linnerud
linnerud = load_linnerud()

The code above loads the Linnerud dataset with sklearn. Retrieved on March 27, 2023 from scikit-learn.org/stable/modu…
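Since Linnerud has three targets rather than one, it is a convenient toy case for multi-output regression. The sketch below inspects the variable names and fits a plain linear model to all three targets at once; using LinearRegression here is an assumption made for illustration, not something the dataset prescribes.

```python
from sklearn.datasets import load_linnerud
from sklearn.linear_model import LinearRegression

linnerud = load_linnerud()

print(linnerud.feature_names)  # the three exercise variables
print(linnerud.target_names)   # the three physiological variables
print(linnerud.data.shape, linnerud.target.shape)

# LinearRegression accepts a 2-D target, fitting one linear model per target column
model = LinearRegression().fit(linnerud.data, linnerud.target)
print(model.coef_.shape)  # one row of coefficients per target
```

With only 20 samples the fitted coefficients are not meaningful on their own; the point is the shape of the problem.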

5. Wine

This sklearn dataset contains the results of a chemical analysis of wines grown in a specific region of Italy. Some of the variables in the dataset:

Alcohol
Malic acid
Ash
Alkalinity of ash
Magnesium
Total phenols
Flavanoids

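These are all technical terms, so I have left them as they are; anyone who needs this dataset probably understands them better than I do.

The wine dataset can be loaded with the load_wine() function of the sklearn.datasets module.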

from sklearn.datasets import load_wine
# Load the Wine dataset
wine_data = load_wine()
# Access the features and targets of the dataset
X = wine_data.data  # Features
y = wine_data.target  # Targets
# Access the feature names and target names of the dataset
feature_names = wine_data.feature_names
target_names = wine_data.target_names

The code above loads the Wine dataset with sklearn. Retrieved on March 28, 2023 from scikit-learn.org/stable/data…
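The chemical measurements in this dataset live on very different numeric scales, which matters for distance-based models. As a hedged illustration, the sketch below compares a k-nearest-neighbors classifier on the raw features against the same classifier behind a StandardScaler; the choice of model and of 5-fold cross-validation is an assumption for this example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Baseline: k-NN on the raw features, where large-valued columns dominate the distance
raw_score = cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean()

# Same model, but every feature is standardized inside each cross-validation fold
scaled_model = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_score = cross_val_score(scaled_model, X, y, cv=5).mean()

print(f"raw features:    {raw_score:.3f}")
print(f"scaled features: {scaled_score:.3f}")
```

Putting the scaler inside the pipeline keeps the scaling statistics from leaking out of the training folds.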

6. Breast Cancer Wisconsin Dataset

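This sklearn dataset consists of information about breast cancer tumors and was originally created by Dr. William H. Wolberg. It was created to help researchers and machine learning practitioners classify tumors as malignant (cancerous) or benign (non-cancerous).

The variables recorded in the original dataset (in scikit-learn the ID number is dropped and the diagnosis is the target rather than a feature):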

ID number
Diagnosis (M = malignant, B = benign).
Radius (the mean of distances from the centre to points on the perimeter).
Texture (the standard deviation of gray-scale values).
Perimeter
Area
Smoothness (the local variation in radius lengths).
Compactness (the perimeter^2 / area – 1.0).
Concavity (the severity of concave portions of the contour).
Concave points (the number of concave portions of the contour).
Symmetry
Fractal dimension (“coastline approximation” – 1).

You can load the breast cancer dataset directly from sklearn with the load_breast_cancer function of the sklearn.datasets module.

from sklearn.datasets import load_breast_cancer
# Load the Breast Cancer Wisconsin dataset
cancer = load_breast_cancer()
# Print the dataset description
print(cancer.DESCR)

The code above loads the Breast Cancer Wisconsin dataset with sklearn. Retrieved on March 28, 2023 from scikit-learn.org/stable/modu…
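Because the diagnosis is stored as integers, it helps to check which label means what before training anything. A minimal sketch (the `counts` dictionary is only for this example):

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# target_names[i] is the meaning of label i in cancer.target
counts = {
    str(name): int((cancer.target == label).sum())
    for label, name in enumerate(cancer.target_names)
}

print(cancer.data.shape)  # tumors x features
print(counts)             # number of tumors per diagnosis
```

Note that in scikit-learn's encoding label 0 is malignant and label 1 is benign, which is the opposite of what many people assume.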

7. Boston Housing

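The Boston Housing dataset contains housing information for the Boston, Massachusetts area. It has 506 rows and 14 columns.

Some of the variables in the dataset:

CRIM – per capita crime rate by town.
INDUS – proportion of non-retail business acres per town.
CHAS – Charles River dummy variable (1 if the tract bounds the river; 0 otherwise).
NOX – nitric oxides concentration (parts per 10 million).
RM – average number of rooms per dwelling.
AGE – proportion of owner-occupied units built before 1940.
DIS – weighted distances to five Boston employment centers.
RAD – index of accessibility to radial highways.
TAX – full-value property-tax rate per $10,000.
PTRATIO – pupil-teacher ratio by town.
B – 1000(Bk – 0.63)^2, where Bk is the proportion of Black residents by town.
LSTAT – percentage of the population of lower status.
MEDV – median value of owner-occupied homes, in $1000s.

In scikit-learn versions before 1.2, you can load the Boston Housing dataset directly with the load_boston function of the sklearn.datasets module. The function was deprecated in version 1.0 and removed in version 1.2.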

# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston
# Load the Boston Housing dataset
boston = load_boston()
# Print the dataset description
print(boston.DESCR)

The code above loads the Boston Housing dataset with sklearn. Retrieved on March 29, 2023 from scikit-learn.org/0.15/module…

8. Olivetti Faces

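The Olivetti faces dataset is a collection of grayscale face images taken between April 1992 and April 1994 at AT&T Laboratories Cambridge. It contains 400 images of 40 distinct people, with 10 images per person taken at different times and under varying lighting conditions and facial expressions.

You can load the Olivetti faces dataset in sklearn with the fetch_olivetti_faces function of the datasets module. The data is downloaded the first time the function is called.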

from sklearn.datasets import fetch_olivetti_faces
# Load the dataset
faces = fetch_olivetti_faces()
# Get the data and target labels
X = faces.data
y = faces.target

The code above loads the Olivetti Faces dataset with sklearn. Retrieved on March 29, 2023 from scikit-learn.org/stable/modu…

9. California Housing

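This sklearn dataset contains the median house value for census block groups in California, together with characteristics of those block groups. It has 20,640 instances and 8 features.

The variables in the dataset:

MedInc – median income in the block group.
HouseAge – median house age in the block group.
AveRooms – average number of rooms per household.
AveBedrms – average number of bedrooms per household.
Population – population of the block group.
AveOccup – average number of household members.
Latitude – latitude of the block group, in decimal degrees.
Longitude – longitude of the block group, in decimal degrees.

You can load the California Housing dataset with sklearn's fetch_california_housing function. The data is downloaded the first time the function is called.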

from sklearn.datasets import fetch_california_housing
# Load the dataset
california_housing = fetch_california_housing()
# Get the features and target variable
X = california_housing.data
y = california_housing.target

The code above loads the California Housing dataset with sklearn. Retrieved on March 29, 2023 from scikit-learn.org/stable/modu…

10. MNIST

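The MNIST dataset is popular and widely used in machine learning and computer vision. It consists of 70,000 grayscale images of the handwritten digits 0-9, of which 60,000 are used for training and 10,000 for testing. Each image is 28×28 pixels and comes with a label indicating the digit it represents.

You can load the MNIST dataset through sklearn, which downloads it from OpenML, with the following code: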

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')

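Note: MNIST is not the same as the Digits dataset above. Digits holds 1,797 images of 8×8 pixels, while MNIST holds 70,000 images of 28×28 pixels.

The code above loads the MNIST dataset with sklearn. Retrieved on March 30, 2023 from scikit-learn.org/stable/modu…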

11. Fashion-MNIST

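The Fashion-MNIST dataset was created by Zalando Research as a replacement for the original MNIST dataset. It consists of 70,000 grayscale images (60,000 in the training set and 10,000 in the test set), all of clothing items.

The images are 28×28 pixels and represent 10 different categories of clothing: T-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. It resembles the original MNIST dataset, but the classification task is more challenging because the clothing categories are more complex and more varied.

You can load this dataset with the fetch_openml function.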

from sklearn.datasets import fetch_openml
fmnist = fetch_openml(name='Fashion-MNIST')

The code above loads the Fashion-MNIST dataset with sklearn. Taken from scikit-learn.org/stable/modu…

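Generated Sklearn datasets

Generated Sklearn datasets are synthetic datasets produced by functions in Python's sklearn library. They are used for testing, benchmarking, and developing machine learning algorithms and models.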

12. make_classification

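This function generates a random n-class classification dataset with a specified number of samples, features, and informative features.

Below is sample code that generates such a dataset with 100 samples, 5 features, and 3 classes: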

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=5, n_informative=3, n_classes=3, random_state=42)

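This code generates a dataset with 100 samples and 5 features, with 3 classes and 3 informative features. With the default settings, the remaining 2 features are redundant, that is, random linear combinations of the informative ones.

The code above generates a dataset with make_classification. Retrieved on March 30, 2023 from scikit-learn.org/stable/modu…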
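To see what "informative" and "redundant" mean in practice, the sketch below generates the same kind of dataset with shuffle=False, so the columns stay in their documented order (informative first, then redundant), and checks that the redundant columns are linear combinations of the informative ones. Setting n_redundant=2 explicitly and using a least-squares fit for the check are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=100, n_features=5, n_informative=3, n_redundant=2,
    n_classes=3, shuffle=False, random_state=42,
)

# Without shuffling, columns 0-2 are informative and columns 3-4 are redundant
informative, redundant = X[:, :3], X[:, 3:]

# Solve informative @ weights = redundant in the least-squares sense
weights, *_ = np.linalg.lstsq(informative, redundant, rcond=None)
is_linear_combo = np.allclose(informative @ weights, redundant)

print(X.shape, np.unique(y))
print(is_linear_combo)  # True when the redundant columns carry no new information
```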

13. make_regression

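This function generates a random regression dataset with a specified number of samples, features, and amount of noise.

Below is sample code that generates such a dataset with 100 samples, 5 features, and a noise level of 0.1: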

from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

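This code generates a dataset with 100 samples and 5 features, where 0.1 is the standard deviation of the Gaussian noise added to the output. The target variable y is continuous.

The code above generates a dataset with make_regression. Retrieved on March 30, 2023 from scikit-learn.org/stable/modu…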
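A nice property of generated regression data is that the true coefficients are known. As a sketch, passing coef=True makes make_regression return them as well, so a fitted model can be checked against the ground truth; the use of LinearRegression and the name `max_abs_error` are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# coef=True additionally returns the coefficients of the underlying linear model
X, y, true_coef = make_regression(
    n_samples=100, n_features=5, noise=0.1, coef=True, random_state=42
)

model = LinearRegression().fit(X, y)

# With this little noise the estimated coefficients should sit close to the true ones
max_abs_error = float(np.max(np.abs(model.coef_ - true_coef)))

print(true_coef)
print(model.coef_)
print(max_abs_error)
```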

14. make_blobs

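This function generates a random dataset with a specified number of samples and clusters.

Below is sample code that generates a dataset with 100 samples and 3 clusters: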

from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=3, random_state=42)

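This code generates a dataset with 100 samples and 2 features (x and y coordinates), with 3 clusters centered at random locations. Each cluster is an isotropic Gaussian blob with the default standard deviation of 1.0.

The code above generates a dataset with make_blobs. Retrieved on March 30, 2023 from scikit-learn.org/stable/modu…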
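If you want to know where the clusters were actually placed, make_blobs can return the centers too. The sketch below assumes a scikit-learn version that supports return_centers (added in 0.23) and compares the empirical cluster means with the true centers; `cluster_sizes` and `cluster_means` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_blobs

# return_centers=True also returns the coordinates of the cluster centers
X, y, centers = make_blobs(
    n_samples=100, centers=3, cluster_std=1.0,
    random_state=42, return_centers=True,
)

cluster_sizes = np.bincount(y)  # samples per cluster
cluster_means = np.array([X[y == k].mean(axis=0) for k in range(3)])

print(cluster_sizes)
print(centers)        # true centers used by the generator
print(cluster_means)  # empirical means, close to the true centers
```

Raising cluster_std makes the blobs overlap more, which is a simple way to make a clustering task harder.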

15. make_moons and make_circles

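These functions generate datasets with non-linear class boundaries, which makes them useful for testing non-linear classification algorithms.

Below is sample code that generates the make_moons dataset: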

from sklearn.datasets import make_moons
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)

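This code generates a dataset with 1,000 samples and 2 features (x and y coordinates), with a non-linear boundary between the two classes and Gaussian noise with a standard deviation of 0.2 added to the data.

The code above generates a dataset with make_moons. Retrieved on March 30, 2023 from scikit-learn.org/stable/modu…

Below is sample code that generates the make_circles dataset: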

from sklearn.datasets import make_circles
X, y = make_circles(n_samples=1000, noise=0.05, random_state=42)

The code above generates a dataset with make_circles. Retrieved on March 30, 2023 from scikit-learn.org/stable/modu…
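To show why these datasets are used to test non-linear classifiers, here is a small comparison of a linear model and an RBF-kernel SVM on concentric circles. The models, the 75/25 split, and factor=0.5 (which pulls the inner circle further from the outer one than the default) are all assumptions picked for this sketch.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# factor=0.5 is an illustrative choice: the inner circle has half the outer radius
X, y = make_circles(n_samples=1000, noise=0.05, factor=0.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# A straight-line decision boundary cannot separate one circle from another
linear_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)

# An RBF kernel can learn the circular boundary
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"logistic regression: {linear_acc:.3f}")
print(f"RBF SVM:             {rbf_acc:.3f}")
```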

16. make_sparse_coded_signal

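This function generates a sparse coded signal dataset, which is useful for testing compressed sensing algorithms.

Below is sample code that generates this dataset: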

from sklearn.datasets import make_sparse_coded_signal
# Returns the encoded signal, the dictionary, and the sparse code, in that order
data, dictionary, code = make_sparse_coded_signal(n_samples=100, n_components=10, n_features=50, n_nonzero_coefs=3, random_state=42)

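This code generates a sparse coded signal dataset with 100 samples, 50 features, and a dictionary of 10 atoms; each sample is a combination of 3 atoms.

The code above generates a dataset with make_sparse_coded_signal. Retrieved on March 30, 2023 from scikit-learn.org/stable/modu…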
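The three returned arrays are related by a matrix product: the signal is the sparse code multiplied by the dictionary. The orientation of the arrays has changed between scikit-learn releases, so the sketch below checks the shape before multiplying; that branch and the variable names are assumptions made to keep the example version-tolerant.

```python
import numpy as np
from sklearn.datasets import make_sparse_coded_signal

n_samples, n_components, n_features, n_nonzero = 100, 10, 50, 3

data, dictionary, code = make_sparse_coded_signal(
    n_samples=n_samples, n_components=n_components,
    n_features=n_features, n_nonzero_coefs=n_nonzero, random_state=42,
)

# Newer releases return samples as rows, older ones as columns; handle both
if code.shape[0] == n_samples:
    reconstructed = code @ dictionary
else:
    reconstructed = dictionary @ code

print(data.shape, dictionary.shape, code.shape)
print(np.allclose(reconstructed, data))  # the signal is exactly code times dictionary
print(np.count_nonzero(code))            # n_samples * n_nonzero_coefs non-zero entries
```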

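Common use cases for these Sklearn datasets

Iris – usually used for classification tasks, and as a benchmark dataset for testing classification algorithms.
Diabetes – contains medical information on diabetes patients and is used for regression tasks in medical analysis.
Digits – contains images of handwritten digits and is usually used for image classification and pattern recognition tasks.
Linnerud – contains exercise and physiological data for 20 middle-aged men and is usually used for multivariate regression analysis.
Wine – contains chemical analyses of wines and is usually used for classification and clustering tasks.
Breast Cancer Wisconsin – contains medical information on breast cancer tumors and is usually used for classification tasks in medical analysis.
Boston Housing – contains information about housing in Boston and is usually used for regression tasks.
Olivetti Faces – contains grayscale images of faces and is usually used for image classification and face recognition tasks.
California Housing – contains information about housing in California and is usually used for regression tasks.
MNIST – contains images of handwritten digits and is usually used for image classification and pattern recognition tasks.
Fashion-MNIST – contains images of clothing items and is usually used for image classification and pattern recognition tasks.
make_classification – a randomly generated dataset for binary and multi-class classification tasks.
make_regression – a randomly generated dataset for regression tasks.
make_blobs – a randomly generated dataset for clustering tasks.
make_moons and make_circles – randomly generated datasets for classification tasks, usually used to test non-linear classifiers.
make_sparse_coded_signal – a randomly generated dataset for sparse coding tasks in signal processing.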
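As a quick, non-authoritative sketch of two of these use cases, the code below runs a baseline classifier on Iris and a baseline regressor on Diabetes. The specific models (k-nearest neighbors and ordinary least squares), split sizes, and random seeds are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Classification: predict the iris species from the four measurements
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
iris_accuracy = clf.score(X_test, y_test)  # fraction of correct predictions

# Regression: predict disease progression from the ten baseline variables
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
reg = LinearRegression().fit(X_train, y_train)
diabetes_r2 = reg.score(X_test, y_test)  # coefficient of determination (R^2)

print(f"Iris accuracy: {iris_accuracy:.3f}")
print(f"Diabetes R^2:  {diabetes_r2:.3f}")
```

Scores like these are useful as sanity checks for a new pipeline rather than as results in their own right.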
