Machine Learning Basics: Linear Models
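In lab exercise 08 we used data on the number of shared bikes rented per hour in two cities under various conditions. We now use this dataset to build a regression model that predicts the number of bikes rented in a given hour from some of the attributes. All values in the table are numeric (in the preview below, the two temperature columns are floating-point and the rest are integers). The table structure is:
id: record number, carries no other meaning
city: city code, 0 for Beijing, 1 for Shanghai
hour: hour of the day
is_workday: whether the day is a workday, 0 for no, 1 for yes
temp_air: air temperature, in degrees Celsius
temp_body: apparent (feels-like) temperature, in degrees Celsius
weather: weather code, 1 for sunny, 2 for cloudy or overcast, 3 for rain or snow
wind: wind level, a larger value means a higher wind speed
y: number of shared bikes rented during that hour
Complete the following tasks:
1. The dataset has been uploaded to the data folder as bike.csv. Read the file with the pandas library.
2. The id attribute does not help in building the regression model. Drop this column.
3. For now we ignore the effect of the city on bike rentals. Select all rows for Shanghai, then drop the city column.
4. To simplify the data, map hours 6 to 18 in the hour column to 1, and hours 19 to 5 (of the next day) to 0.
5. The y column is the number of bikes rented and is our prediction target (label). Extract it, convert it to a numpy column vector, and drop the original y column.
6. Convert the DataFrame object to a Numpy array for the later steps.
7. Split the original dataset into a training set and a test set at a ratio of 8:2.
8. Normalize the training data, training labels, test data, and test labels.
9. Build a linear regression model (a multivariate first-degree function), then train it on the training set.
10. Evaluate the trained model on the test set. Hint: pass the test set to the predict(data_array) method, which returns the model's predictions.
11. Model evaluation: use the root mean squared error (RMSE) as the metric and print its value.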
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
# 1. The dataset has been uploaded to the data folder as bike.csv. Read the file with pandas.
import pandas as pd
df = pd.read_csv('data/bike.csv')
df.head()
| | id | city | hour | is_workday | weather | temp_air | temp_body | wind | y |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 22 | 1 | 2 | 3.0 | 0.7 | 0 | 15 |
| 1 | 2 | 0 | 10 | 1 | 1 | 21.0 | 24.9 | 3 | 48 |
| 2 | 3 | 0 | 0 | 1 | 1 | 25.3 | 27.4 | 0 | 21 |
| 3 | 4 | 0 | 7 | 0 | 1 | 15.7 | 16.2 | 0 | 11 |
| 4 | 5 | 1 | 10 | 1 | 1 | 21.1 | 25.0 | 2 | 39 |
# 2. The id attribute does not help in building the regression model. Drop this column.
df = df.drop('id', axis=1)
df.head()
| | city | hour | is_workday | weather | temp_air | temp_body | wind | y |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22 | 1 | 2 | 3.0 | 0.7 | 0 | 15 |
| 1 | 0 | 10 | 1 | 1 | 21.0 | 24.9 | 3 | 48 |
| 2 | 0 | 0 | 1 | 1 | 25.3 | 27.4 | 0 | 21 |
| 3 | 0 | 7 | 0 | 1 | 15.7 | 16.2 | 0 | 11 |
| 4 | 1 | 10 | 1 | 1 | 21.1 | 25.0 | 2 | 39 |
# 3. Ignore the effect of the city for now: select all rows for Shanghai (city == 1), then drop the city column.
df = df[df['city'] == 1]
df = df.drop('city', axis=1)
df.head()
| | hour | is_workday | weather | temp_air | temp_body | wind | y |
|---|---|---|---|---|---|---|---|
| 4 | 10 | 1 | 1 | 21.1 | 25.0 | 2 | 39 |
| 5 | 0 | 1 | 1 | 20.4 | 18.2 | 0 | 12 |
| 9 | 4 | 1 | 3 | 17.4 | 18.0 | 3 | 2 |
| 10 | 0 | 1 | 1 | 14.9 | 15.3 | 2 | 6 |
| 11 | 8 | 0 | 1 | 25.0 | 28.1 | 0 | 25 |
# 4. To simplify the data, map hours 6 to 18 in the hour column to 1, and hours 19 to 5 (of the next day) to 0.
df['hour'] = df['hour'].apply(lambda x: 1 if 6 <= x <= 18 else 0)
df.head()
| | hour | is_workday | weather | temp_air | temp_body | wind | y |
|---|---|---|---|---|---|---|---|
| 4 | 1 | 1 | 1 | 21.1 | 25.0 | 2 | 39 |
| 5 | 0 | 1 | 1 | 20.4 | 18.2 | 0 | 12 |
| 9 | 0 | 1 | 3 | 17.4 | 18.0 | 3 | 2 |
| 10 | 0 | 1 | 1 | 14.9 | 15.3 | 2 | 6 |
| 11 | 1 | 0 | 1 | 25.0 | 28.1 | 0 | 25 |
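The same mapping can also be written without a Python-level lambda. The sketch below is one possible vectorized alternative, shown on a small made-up frame (`toy` and `is_daytime` are illustrative names, not part of the exercise); it assumes the hour values are integers from 0 to 23.

```python
import pandas as pd

# Hypothetical toy frame standing in for the bike data (values are made up).
toy = pd.DataFrame({"hour": [0, 5, 6, 12, 18, 19, 23]})

# Series.between is inclusive on both ends by default, matching 6 <= x <= 18.
toy["is_daytime"] = toy["hour"].between(6, 18).astype(int)

daytime_flags = toy["is_daytime"].tolist()
print(daytime_flags)
```

Both forms give the same result; the vectorized one tends to be faster on large frames because it avoids calling a Python function per row.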
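# 5. The y column is the number of bikes rented, our prediction target (label). Extract it as a numpy array and drop the original y column.
# Note: .values returns a 1-D array of shape (n,); it is reshaped into a column vector with reshape(-1, 1) in step 8.
y = df['y'].values
df = df.drop('y', axis=1)
print(y[:5])
[39 12 2 6 25]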
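The difference between a 1-D array and a column vector matters for the scaler used later, which expects 2-D input. A minimal sketch of the conversion, using the five label values printed above (`labels` and `column` are illustrative names):

```python
import numpy as np

# The first five labels from the output above.
labels = np.array([39, 12, 2, 6, 25])

# reshape(-1, 1) keeps all values and arranges them as n rows and 1 column.
column = labels.reshape(-1, 1)

print(labels.shape)  # 1-D array
print(column.shape)  # column vector
```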
# 6. Convert the DataFrame object to a Numpy array for the later steps.
X = df.values
print(X[:5])
[[ 1. 1. 1. 21.1 25. 2. ]
 [ 0. 1. 1. 20.4 18.2 0. ]
 [ 0. 1. 3. 17.4 18. 3. ]
 [ 0. 1. 1. 14.9 15.3 2. ]
 [ 1. 0. 1. 25. 28.1 0. ]]
# 7. Split the original dataset into a training set and a test set at a ratio of 8:2.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print("Training data shape:", X_train.shape)
print("Training label shape:", y_train.shape)
print("Test data shape:", X_test.shape)
print("Test label shape:", y_test.shape)
Training data shape: (3998, 6)
Training label shape: (3998,)
Test data shape: (1000, 6)
Test label shape: (1000,)
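The split above is drawn at random on every run, so the numbers in the following cells will change from run to run. If reproducible results are wanted, one option is to pass `random_state`. The sketch below illustrates this on made-up arrays (`X_demo`, `y_demo`, and the seed 42 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two splits with the same seed select the same rows.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same)
print("test set shape:", split_a[1].shape)
```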
# 8. Normalize the training data, training labels, test data, and test labels.
# Separate scalers are used for features and labels so that each keeps its own fitted range.
# Both are fitted on the training set only and then applied to the test set.
x_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()
X_train = x_scaler.fit_transform(X_train)
X_test = x_scaler.transform(X_test)
y_train = y_scaler.fit_transform(y_train.reshape(-1, 1))
y_test = y_scaler.transform(y_test.reshape(-1, 1))
print("Normalized training data:", X_train[:5])
print("Normalized training labels:", y_train[:5])
print("Normalized test data:", X_test[:5])
print("Normalized test labels:", y_test[:5])
Normalized training data: [[0. 1. 0. 0.47544643 0.48848684 0.42857143]
 [1. 0. 0. 0.42857143 0.45230263 0.71428571]
 [1. 1. 0. 0.64955357 0.64144737 0.28571429]
 [0. 0. 0. 0.22767857 0.29440789 0. ]
 [0. 1. 0.5 0.41071429 0.4375 0. ]]
Normalized training labels: [[0.46596859]
 [0.16230366]
 [0.42408377]
 [0.06282723]
 [0.06282723]]
Normalized test data: [[1. 1. 1. 0.68973214 0.67269737 0.14285714]
 [0. 0. 0. 0.65625 0.59703947 0.14285714]
 [1. 0. 0. 0.17633929 0.20394737 0.14285714]
 [0. 1. 0. 0.22767857 0.22861842 0.28571429]
 [0. 1. 1. 0.29910714 0.28453947 0.42857143]]
Normalized test labels: [[0.83769634]
 [0.15706806]
 [0.05235602]
 [0.06282723]
 [0.07853403]]
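Min-max normalization maps each column to (x - min) / (max - min), with min and max taken from the data the scaler was fitted on. The sketch below checks this by hand on made-up numbers (`train`, `test`, and `scaler` are illustrative names). It also shows that test values outside the training range fall outside [0, 1], and that `inverse_transform` recovers the original values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data; the scaler only ever sees the training part.
train = np.array([[1.0], [3.0], [5.0]])
test = np.array([[2.0], [7.0]])

scaler = MinMaxScaler().fit(train)
scaled = scaler.transform(test)

# Manual version of the same formula, using the training min and max.
col_min = train.min(axis=0)
col_max = train.max(axis=0)
manual = (test - col_min) / (col_max - col_min)

# inverse_transform maps scaled values back to the original units.
restored = scaler.inverse_transform(scaled)

print(scaled.ravel())
print(restored.ravel())
```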
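# 9. Build a linear regression model (a multivariate first-degree function), then train it on the training set.
model = LinearRegression()
model.fit(X_train, y_train)
# Model parameters: one coefficient per feature, plus the intercept
print("Model parameters:")
print(model.coef_, model.intercept_)
Model parameters:
[[ 0.1741928 -0.00035667 -0.08475134 0.16933327 0.24873477 0.02942741]] [-0.06703018]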
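The printed parameters define the fitted function: a prediction is the dot product of the feature row with the coefficients, plus the intercept. The sketch below verifies this relation on synthetic, noise-free data (`X_demo`, `true_w`, and the offset 0.3 are made up for illustration and are unrelated to the bike data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known linear function without noise.
rng = np.random.default_rng(0)
X_demo = rng.random((50, 3))
true_w = np.array([[0.5], [-0.2], [0.1]])
y_demo = X_demo @ true_w + 0.3

model_demo = LinearRegression().fit(X_demo, y_demo)

# With 2-D labels, coef_ has shape (1, n_features) and intercept_ has shape (1,).
manual_pred = X_demo @ model_demo.coef_.T + model_demo.intercept_

print(np.allclose(manual_pred, model_demo.predict(X_demo)))
```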
# 10. Evaluate the trained model on the test set. Hint: pass the test set to predict(data_array), which returns the model's predictions.
y_pred = model.predict(X_test)
# Show the first few predicted and actual values
print("Predicted:", y_pred[:5])
print("Actual:", y_test[:5])
Predicted: [[ 0.31037635]
 [ 0.19680316]
 [ 0.19195544]
 [ 0.03643989]
 [-0.01810279]]
Actual: [[0.83769634]
 [0.15706806]
 [0.05235602]
 [0.06282723]
 [0.07853403]]
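# 11. Model evaluation: use the root mean squared error (RMSE) as the metric and print its value.
# Note: the labels were normalized, so this RMSE is in normalized units, not in bikes per hour.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root mean squared error (RMSE):", rmse)
Root mean squared error (RMSE): 0.15576456611631956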
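Because the labels were min-max scaled, the RMSE above is expressed on the normalized scale. To report the error in bikes per hour, one option is to map both predictions and labels back with the label scaler's `inverse_transform` before computing the RMSE. The sketch below uses made-up numbers (`y_true_raw` and the prediction offsets are illustrative) and shows that the result equals the scaled RMSE multiplied by the fitted label range.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler

# Made-up labels in original units; the fitted range is 90 - 10 = 80.
y_true_raw = np.array([[10.0], [50.0], [90.0], [30.0]])
y_scaler = MinMaxScaler().fit(y_true_raw)
y_true_scaled = y_scaler.transform(y_true_raw)

# Pretend predictions on the normalized scale (made-up offsets).
y_pred_scaled = y_true_scaled + np.array([[0.1], [-0.1], [0.05], [0.0]])
rmse_scaled = np.sqrt(mean_squared_error(y_true_scaled, y_pred_scaled))

# Map predictions back to original units, then compute the RMSE there.
y_pred_raw = y_scaler.inverse_transform(y_pred_scaled)
rmse_raw = np.sqrt(mean_squared_error(y_true_raw, y_pred_raw))

print(rmse_scaled, rmse_raw)
```

In the exercise itself the label scaler is fitted on the training labels, so the conversion factor would be the training label range rather than the range of the demo array used here.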
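Learning the Iris Dataset with the K-Nearest Neighbors Algorithm
K-nearest neighbors (KNN) is a basic method for classification and regression, and one of the simplest and most intuitive supervised learning methods.
The algorithm rests on a simple idea: if most of a sample's k nearest neighbors in feature space belong to a given class, the sample is assigned to that class as well.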
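The majority-vote idea can be sketched in a few lines of NumPy. This is a minimal illustration under simple assumptions (Euclidean distance, unweighted votes, made-up 2-D points); `knn_predict` is a hypothetical helper and not how scikit-learn implements its classifier.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the class of x_query by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training sample.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training samples.
    nearest = np.argsort(distances)[:k]
    # Count the class labels among those neighbors and return the most common one.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Made-up toy data: two well-separated clusters.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.3, 0.3]))
pred_b = knn_predict(X_toy, y_toy, np.array([4.8, 5.0]))
print(pred_a, pred_b)
```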
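The experiment uses the Iris dataset, which can be loaded from the sklearn library.
- Load the data and split the Iris dataset, with a test-set ratio of 0.2 and random seed 42
- Draw a scatter plot of the data with sepal length on the x-axis and sepal width on the y-axis
- Reduce the Iris features with PCA and visualize the result, with principal component 1 on the x-axis and principal component 2 on the y-axis
- Train a classifier with K=3
- Evaluate on the test set and print the classification accuracy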
# 1. Load the data and split the Iris dataset, with a test-set ratio of 0.2 and random seed 42
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
iris = load_iris()
# Print the description of the Iris dataset
print(iris.DESCR)
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
|details-start|
**References**
|details-split|
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
|details-end|
# 2. Draw a scatter plot of the data with sepal length on the x-axis and sepal width on the y-axis
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='plasma', edgecolor='k')
plt.xlabel('sepal length (cm)')
plt.ylabel('sepal width (cm)')
plt.title('Iris sepal length vs. sepal width')
plt.colorbar(label='class')
plt.show()
# 3. Reduce the Iris features with PCA and visualize the result, with principal component 1 on the x-axis and principal component 2 on the y-axis
from sklearn.decomposition import PCA
# PCA dimensionality reduction from 4 features to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
plt.figure(figsize=(10, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='plasma', edgecolor='k')
plt.xlabel('principal component 1')
plt.ylabel('principal component 2')
plt.title('PCA of the Iris dataset')
plt.colorbar(label='class')
plt.show()
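A common follow-up check after PCA is how much of the total variance the retained components preserve. The sketch below reads this from the `explained_variance_ratio_` attribute of a fitted `PCA` object; `pca_demo` and `ratios` are illustrative names, and the data is loaded again so that the block runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris = load_iris().data

# Fit a 2-component PCA on the four original features.
pca_demo = PCA(n_components=2).fit(X_iris)

# Fraction of the total variance captured by each component, in descending order.
ratios = pca_demo.explained_variance_ratio_
total = ratios.sum()

print("per component:", ratios)
print("total retained:", total)
```

A total close to 1 suggests that the 2-D plot loses little of the spread present in the original 4-D data.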
# 4. Train a classifier with K=3
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=3)
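# 5. Evaluate on the test set and print the classification accuracy
y_pred = knn.predict(X_test)
# Accuracy: the fraction of test samples classified correctly
accuracy = accuracy_score(y_test, y_pred)
print("KNN accuracy:", accuracy)
KNN accuracy: 1.0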
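With a test ratio of 0.2 the test set holds only 30 of the 150 samples, so a perfect score from a single split may be optimistic. One way to get a steadier estimate is cross-validation. The sketch below compares a few values of K with 5-fold cross-validation; the candidate values (1, 3, 5, 7) and the fold count are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for each candidate K.
mean_scores = {}
for k in (1, 3, 5, 7):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_iris, y_iris, cv=5)
    mean_scores[k] = scores.mean()

for k, score in mean_scores.items():
    print(f"K={k}: mean accuracy {score:.3f}")
```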