Homework8
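Data Exploration and Visualization
Objectives
- Learn how to explore and visualize data with Python.
- Understand and apply the basic methods of data exploration, including data cleaning, formatting, and descriptive statistics.
- Use visualization tools such as Matplotlib and Seaborn to draw different chart types and analyze data distributions and the relationships between features.
- Use pandas_profiling to generate an interactive report that gives a full picture of the data's distributions, missing values, and outliers.
Requirements
- Read the dataset github_bot_processed_data.csv with pandas. Explore the pandas display options so that more rows and columns can be viewed.
- Use head() to view the first few rows and info() to view each column's data type, and note how the two differ. Use describe() to generate descriptive statistics.
- Format the data and show the effect of different formats (for example dates, currency, and percentages).
- Apply a log transform to the columns public_repos, public_gists, followers, and following, and examine its effect.
- Draw charts with Matplotlib
  - Bar chart: show the class distribution of the label column.
  - Stacked bar chart: show the distribution of several boolean features (such as site_admin and company).
  - Histogram: show the distribution of the log-transformed log_public_repos.
  - Scatter plot: show the relationship between public_repos and followers.
  - Scatter matrix: show the pairwise relationships between several numeric features.
- Draw charts with Seaborn
  - Box plot: show the distribution of log_followers for each label class.
  - Pair plot: show the pairwise relationships between features, colored by label.
  - Heatmap: show the correlations between log_public_repos, log_public_gists, log_followers, and log_following.
  - Violin plot: show how the distribution of log_followers differs between labels.
- Use pandas_profiling.ProfileReport() to generate an interactive data analysis report and analyze the data's statistical distributions, missing values, outliers, and so on.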
- Read the dataset github_bot_processed_data.csv with pandas. Explore the pandas display options so that more rows and columns can be viewed.
In [1]:
import pandas as pd
# Load the processed GitHub account dataset
df = pd.read_csv('data/github_bot_processed_data.csv')
print("Shape of df:")
print(df.shape)
print("df:")
print(df)
print("First five rows of df:")
print(df.head())
print("Summary statistics of df:")
print(df.describe())
Shape of df:
(19768, 15)
df:
label type site_admin company blog location hireable \
0 Human True False False False False False
1 Human True False False True False True
2 Human True False True True True True
3 Bot True False False False True False
4 Human True False False False False True
... ... ... ... ... ... ... ...
19763 Bot True False True True True False
19764 Human True False False False False False
19765 Human True False True False True False
19766 Human True False True False False False
19767 Bot True False False False True False
bio public_repos \
0 NaN 26
1 I just press the buttons randomly, and the pro... 30
2 Time is unimportant,\nonly life important. 103
3 NaN 49
4 NaN 11
... ... ...
19763 Tony came to Linux in 1994 and has never looke... 36
19764 NaN 16
19765 Software engineer at RealTracs. 13
19766 NaN 7
19767 NaN 10
public_gists followers following created_at \
0 1 5 1 2011-09-26 17:27:03+00:00
1 3 9 6 2015-06-29 10:12:46+00:00
2 49 1212 221 2008-08-29 16:20:03+00:00
3 0 84 2 2014-05-20 18:43:09+00:00
4 1 6 2 2012-08-16 14:19:13+00:00
... ... ... ... ...
19763 16 11 4 2014-07-02 23:27:34+00:00
19764 0 3 0 2017-12-06 21:56:31+00:00
19765 0 10 1 2015-11-14 14:44:05+00:00
19766 0 2 0 2021-11-23 18:55:29+00:00
19767 0 1 0 2016-04-22 22:11:59+00:00
updated_at text_bot_count
0 2023-10-13 11:21:10+00:00 0
1 2023-10-07 06:26:14+00:00 0
2 2023-10-02 02:11:21+00:00 0
3 2023-10-12 12:54:59+00:00 0
4 2023-10-06 11:58:41+00:00 0
... ... ...
19763 2023-08-15 16:38:34+00:00 0
19764 2023-07-26 18:32:25+00:00 0
19765 2022-08-23 21:09:49+00:00 0
19766 2023-10-06 22:50:45+00:00 0
19767 2022-07-07 19:48:21+00:00 0
[19768 rows x 15 columns]
First five rows of df:
label type site_admin company blog location hireable \
0 Human True False False False False False
1 Human True False False True False True
2 Human True False True True True True
3 Bot True False False False True False
4 Human True False False False False True
bio public_repos \
0 NaN 26
1 I just press the buttons randomly, and the pro... 30
2 Time is unimportant,\nonly life important. 103
3 NaN 49
4 NaN 11
public_gists followers following created_at \
0 1 5 1 2011-09-26 17:27:03+00:00
1 3 9 6 2015-06-29 10:12:46+00:00
2 49 1212 221 2008-08-29 16:20:03+00:00
3 0 84 2 2014-05-20 18:43:09+00:00
4 1 6 2 2012-08-16 14:19:13+00:00
updated_at text_bot_count
0 2023-10-13 11:21:10+00:00 0
1 2023-10-07 06:26:14+00:00 0
2 2023-10-02 02:11:21+00:00 0
3 2023-10-12 12:54:59+00:00 0
4 2023-10-06 11:58:41+00:00 0
Summary statistics of df:
public_repos public_gists followers following text_bot_count
count 19768.000000 19768.000000 19768.000000 19768.000000 19768.000000
mean 84.139215 25.214083 245.497015 44.520741 0.061362
std 574.750217 635.690142 1535.939961 366.793439 0.341003
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 11.000000 0.000000 7.000000 0.000000 0.000000
50% 35.000000 2.000000 33.000000 4.000000 0.000000
75% 83.000000 10.000000 125.000000 22.000000 0.000000
max 50000.000000 55781.000000 95752.000000 27775.000000 5.000000
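The cell above prints df with the default display options, which is why the 15 columns wrap across several blocks and the middle rows collapse into `...`. The requirement also asks for an exploration of the pandas display options. A minimal sketch of that step follows; it uses a small made-up frame (`demo`, not the real dataset) and `pd.option_context`, so the settings only apply inside the `with` block.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: 3 rows, 30 columns.
demo = pd.DataFrame({f"col_{i}": range(3) for i in range(30)})

# With a low column limit, pandas elides the middle columns with "...".
with pd.option_context("display.max_columns", 5):
    truncated = repr(demo)

# Lifting the limits shows every column on one (very wide) line per row.
with pd.option_context(
    "display.max_columns", None,  # None means "no limit"
    "display.max_rows", 100,
    "display.width", 1000,
):
    full = repr(demo)

print(truncated)
print(full)
```

To keep a setting for the rest of the session instead, `pd.set_option("display.max_columns", None)` can be called once near the top of the notebook.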
- Use head() to view the first few rows and info() to view each column's data type, and note how the two differ. Use describe() to generate descriptive statistics.
In [2]:
print("First few rows of the dataset:")
print(df.head())
First few rows of the dataset:
label type site_admin company blog location hireable \
0 Human True False False False False False
1 Human True False False True False True
2 Human True False True True True True
3 Bot True False False False True False
4 Human True False False False False True
bio public_repos \
0 NaN 26
1 I just press the buttons randomly, and the pro... 30
2 Time is unimportant,\nonly life important. 103
3 NaN 49
4 NaN 11
public_gists followers following created_at \
0 1 5 1 2011-09-26 17:27:03+00:00
1 3 9 6 2015-06-29 10:12:46+00:00
2 49 1212 221 2008-08-29 16:20:03+00:00
3 0 84 2 2014-05-20 18:43:09+00:00
4 1 6 2 2012-08-16 14:19:13+00:00
updated_at text_bot_count
0 2023-10-13 11:21:10+00:00 0
1 2023-10-07 06:26:14+00:00 0
2 2023-10-02 02:11:21+00:00 0
3 2023-10-12 12:54:59+00:00 0
4 2023-10-06 11:58:41+00:00 0
In [3]:
print("Basic information about the dataset:")
df.info()  # info() prints its summary itself and returns None, so it is not wrapped in print()
Basic information about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19768 entries, 0 to 19767
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   label           19768 non-null  object
 1   type            19768 non-null  bool
 2   site_admin      19768 non-null  bool
 3   company         19768 non-null  bool
 4   blog            19768 non-null  bool
 5   location        19768 non-null  bool
 6   hireable        19768 non-null  bool
 7   bio             8839 non-null   object
 8   public_repos    19768 non-null  int64
 9   public_gists    19768 non-null  int64
 10  followers       19768 non-null  int64
 11  following       19768 non-null  int64
 12  created_at      19768 non-null  object
 13  updated_at      19768 non-null  object
 14  text_bot_count  19768 non-null  int64
dtypes: bool(6), int64(5), object(4)
memory usage: 1.5+ MB
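Difference between the two:
head() shows the first few rows of the dataset (5 by default), that is, the actual values.
info() shows the structure of the dataset: the index range, each column's name, non-null count, and dtype, and the memory usage. It does not compute statistics; describe() does that.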
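To make the contrast concrete, the pieces that info() reports can also be retrieved one by one. The sketch below does this on three made-up rows (the values in `demo` are illustrative only; the column names mirror three columns of the dataset).

```python
import numpy as np
import pandas as pd

# Illustrative rows only, not taken from the real file.
demo = pd.DataFrame({
    "label": ["Human", "Bot", "Human"],
    "bio": [np.nan, "builds things", np.nan],
    "followers": [5, 84, 1212],
})

first_rows = demo.head(2)      # actual values of the first 2 rows
dtypes = demo.dtypes           # one dtype per column, as listed by info()
non_null = demo.notna().sum()  # the "Non-Null Count" column of info()

print(first_rows)
print(dtypes)
print(non_null)
```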
In [4]:
print("Summary statistics:")
print(df.describe())
Summary statistics:
public_repos public_gists followers following text_bot_count
count 19768.000000 19768.000000 19768.000000 19768.000000 19768.000000
mean 84.139215 25.214083 245.497015 44.520741 0.061362
std 574.750217 635.690142 1535.939961 366.793439 0.341003
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 11.000000 0.000000 7.000000 0.000000 0.000000
50% 35.000000 2.000000 33.000000 4.000000 0.000000
75% 83.000000 10.000000 125.000000 22.000000 0.000000
max 50000.000000 55781.000000 95752.000000 27775.000000 5.000000
- Format the data and show the effect of different formats (for example dates, currency, and percentages).
In [5]:
# Date formatting: keep only the calendar date
df['created_at'] = pd.to_datetime(df['created_at']).dt.strftime('%Y-%m-%d')
df['updated_at'] = pd.to_datetime(df['updated_at']).dt.strftime('%Y-%m-%d')
# Thousands-separator formatting (this turns the numeric columns into strings)
df['followers'] = df['followers'].apply(lambda x: f"{x:,}")
df['following'] = df['following'].apply(lambda x: f"{x:,}")
df['public_repos'] = df['public_repos'].apply(lambda x: f"{x:,}")
# Percentage formatting: followers / following, shown as a percentage
df['engagement_rate'] = (
    df['followers'].str.replace(",", "").astype(float) /
    df['following'].str.replace(",", "").astype(float)
).replace([float('inf'), -float('inf')], 0)  # x / 0 gives inf; map it to 0
# 0 / 0 gives NaN; map it to 0 as well before formatting
df['engagement_rate'] = df['engagement_rate'].fillna(0).apply(lambda x: f"{x:.2%}")
print(df.head())
label type site_admin company blog location hireable \
0 Human True False False False False False
1 Human True False False True False True
2 Human True False True True True True
3 Bot True False False False True False
4 Human True False False False False True
bio public_repos \
0 NaN 26
1 I just press the buttons randomly, and the pro... 30
2 Time is unimportant,\nonly life important. 103
3 NaN 49
4 NaN 11
public_gists followers following created_at updated_at text_bot_count \
0 1 5 1 2011-09-26 2023-10-13 0
1 3 9 6 2015-06-29 2023-10-07 0
2 49 1,212 221 2008-08-29 2023-10-02 0
3 0 84 2 2014-05-20 2023-10-12 0
4 1 6 2 2012-08-16 2023-10-06 0
engagement_rate
0 500.00%
1 150.00%
2 548.42%
3 4200.00%
4 300.00%
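One side effect of the cell above is that followers, following, and public_repos become strings, which later cells have to undo. A possible alternative, sketched here under the assumption that formatting is only needed for display, leaves the numeric columns untouched and builds a separate formatted frame. The `sponsorship_usd` column is hypothetical (the dataset has no monetary column) and is included only to illustrate a currency format; `follower_ratio` is likewise an illustrative name.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "followers": [5, 1212, 84, 3],
    "following": [1, 221, 2, 0],
    "created_at": [
        "2011-09-26 17:27:03+00:00",
        "2008-08-29 16:20:03+00:00",
        "2014-05-20 18:43:09+00:00",
        "2017-12-06 21:56:31+00:00",
    ],
    "sponsorship_usd": [0.0, 1234.5, 20.0, 7.25],  # hypothetical column
})

# Avoid division by zero: a zero denominator becomes NaN instead of inf.
ratio = demo["followers"] / demo["following"].replace(0, np.nan)

# The formatted strings live in their own frame; demo stays numeric.
formatted = pd.DataFrame({
    "created_at": pd.to_datetime(demo["created_at"]).dt.strftime("%Y-%m-%d"),
    "followers": demo["followers"].map("{:,}".format),
    "sponsorship_usd": demo["sponsorship_usd"].map("${:,.2f}".format),
    "follower_ratio": ratio.map(lambda r: "n/a" if pd.isna(r) else f"{r:.2%}"),
})

print(formatted)
```

Because `demo` keeps its integer columns, later steps such as the log transform would not need to strip commas first.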
- Apply a log transform to the columns public_repos, public_gists, followers, and following, and examine its effect.
In [6]:
import numpy as np
from IPython.display import display
log_transformed_data = df.copy()
# Convert the columns to be transformed back to numeric (strip the thousands separators added above)
columns_to_transform = ['public_repos', 'public_gists', 'followers', 'following']
for col in columns_to_transform:
    # Columns that are already numeric are left unchanged
    if log_transformed_data[col].dtype == 'object':
        log_transformed_data[col] = log_transformed_data[col].str.replace(",", "").astype(float)
# log1p(x) = log(1 + x), so a count of 0 maps to 0 instead of -inf
for col in columns_to_transform:
    log_transformed_data[f'log_{col}'] = np.log1p(log_transformed_data[col])
transformed_columns = [f'log_{col}' for col in columns_to_transform]
log_transformed_subset = log_transformed_data[transformed_columns]
# Display in the notebook
display(log_transformed_subset.head())
| | log_public_repos | log_public_gists | log_followers | log_following |
|---|---|---|---|---|
| 0 | 3.295837 | 0.693147 | 1.791759 | 0.693147 |
| 1 | 3.433987 | 1.386294 | 2.302585 | 1.945910 |
| 2 | 4.644391 | 3.912023 | 7.100852 | 5.402677 |
| 3 | 3.912023 | 0.000000 | 4.442651 | 1.098612 |
| 4 | 2.484907 | 0.693147 | 1.945910 | 1.098612 |
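The requirement asks to examine the effect of the transform, which the table alone does not show. As a rough illustration, the sketch below applies `np.log1p` to the five-number summary of public_repos from the describe() output (min, 25%, 50%, 75%, max) and compares the skewness before and after; five points are far too few for a real analysis, so this only demonstrates the mechanics.

```python
import numpy as np
import pandas as pd

# min, 25%, 50%, 75%, max of public_repos, as printed by describe().
raw = pd.Series([0, 11, 35, 83, 50000], dtype=float)

logged = np.log1p(raw)       # log(1 + x); defined at x = 0
restored = np.expm1(logged)  # inverse transform, exp(y) - 1

print(logged.round(3).tolist())
print("skew before:", raw.skew(), "after:", logged.skew())
```

The maximum of 50000 dominates the raw scale, while after the transform it sits only a few units above the quartiles, so the skewness drops.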
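- Draw charts with Matplotlib
  - Bar chart: show the class distribution of the label column.
  - Stacked bar chart: show the distribution of several boolean features (such as site_admin and company).
  - Histogram: show the distribution of the log-transformed log_public_repos.
  - Scatter plot: show the relationship between public_repos and followers.
  - Scatter matrix: show the pairwise relationships between several numeric features.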
In [7]:
# Bar chart: class distribution of the label column
import matplotlib.pyplot as plt
plt.figure(figsize=(6, 4))
log_transformed_data['label'].value_counts().plot(kind='bar', color=['skyblue', 'orange'])
plt.title('Label Distribution')
plt.xlabel('Label')
plt.ylabel('Count')
plt.show()
In [8]:
# Stacked bar chart: distribution of several boolean features (such as site_admin and company)
bool_columns = ['site_admin', 'company', 'blog', 'location', 'hireable']
stacked_data = log_transformed_data[bool_columns].astype(int)
# sum() counts the True values per feature; the result is a single Series,
# so each feature gets one bar and stacked=True has no visible effect
stacked_data.sum().plot(kind='bar', stacked=True, figsize=(8, 6), color=['orange', 'green', 'blue', 'purple', 'red'])
plt.title('Boolean Features')
plt.xlabel('Features')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
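Summing the boolean columns yields one bar per feature (the number of True values). One way to obtain bars that are actually stacked, sketched here on made-up data, is to count the True and False values per feature and stack the two counts. The frame `demo` and the colors are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up boolean flags; the real columns would come from the dataset.
demo = pd.DataFrame({
    "site_admin": [False, False, False, False],
    "company": [False, True, True, False],
    "blog": [True, True, False, True],
})

# One row per feature, one column per value; every row sums to len(demo).
counts = pd.DataFrame({"True": demo.sum(), "False": (~demo).sum()})

ax = counts.plot(kind="bar", stacked=True, figsize=(8, 6), color=["orange", "lightgray"])
ax.set_title("Boolean Features (True vs. False)")
ax.set_xlabel("Features")
ax.set_ylabel("Count")
plt.xticks(rotation=45)
plt.tight_layout()
```

Inside a notebook the `matplotlib.use("Agg")` line can be dropped and `plt.show()` added at the end.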
In [9]:
# Histogram: distribution of the log-transformed log_public_repos
log_transformed_data['log_public_repos'] = np.log1p(log_transformed_data['public_repos'])
plt.figure(figsize=(8, 6))
plt.hist(log_transformed_data['log_public_repos'], bins=20, color='teal', alpha=0.7)
plt.title('Log-transformed Public Repos Distribution')
plt.xlabel('log_public_repos')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
In [10]:
# Scatter plot: relationship between public_repos and followers
plt.figure(figsize=(8, 6))
plt.scatter(
    log_transformed_data['public_repos'],
    log_transformed_data['followers'],
    alpha=0.5, c='purple'
)
plt.title('Public Repos vs. Followers')
plt.xlabel('Public Repos')
plt.ylabel('Followers')
plt.grid(True)
plt.show()
In [11]:
# Scatter matrix: pairwise relationships between several numeric features
from pandas.plotting import scatter_matrix
numeric_cols = ['public_repos', 'public_gists', 'followers', 'following']
for col in numeric_cols:
    # Strip thousands separators in case a column is still stored as text
    log_transformed_data[col] = log_transformed_data[col].replace(",", "", regex=True).astype(float)
for col in numeric_cols:
    log_transformed_data[f'log_{col}'] = np.log1p(log_transformed_data[col])
# scatter_matrix creates its own figure, so the size is passed here;
# a separate plt.figure() call would only leave an empty figure behind
scatter_matrix(log_transformed_data[numeric_cols], alpha=0.5, diagonal='kde', color='green', figsize=(10, 8))
plt.suptitle('Scatter Matrix of Numeric Features', fontsize=16)
plt.show()
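- Draw charts with Seaborn
  - Box plot: show the distribution of log_followers for each label class.
  - Pair plot: show the pairwise relationships between features, colored by label.
  - Heatmap: show the correlations between log_public_repos, log_public_gists, log_followers, and log_following.
  - Violin plot: show how the distribution of log_followers differs between labels.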
In [12]:
# Box plot: distribution of log_followers for each label class
import seaborn as sns
plt.figure(figsize=(8, 6))
sns.boxplot(
    x='label',
    y='log_followers',
    data=log_transformed_data,
    hue='label',     # set hue to 'label', as Seaborn expects when a palette is given
    palette='Set2',
    dodge=False      # turn off dodging so each label keeps a single box
)
plt.title('Boxplot of Log Followers by Label')
plt.xlabel('Label')
plt.ylabel('Log Followers')
plt.show()
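The numbers behind a box plot can be checked directly with a group-wise quantile computation. The sketch below uses made-up follower counts (not the real data) and computes, per label, the quartiles that form the box and the upper fence implied by Seaborn's default whisker length of 1.5 × IQR.

```python
import numpy as np
import pandas as pd

# Made-up follower counts for two labels.
demo = pd.DataFrame({
    "label": ["Human"] * 5 + ["Bot"] * 5,
    "followers": [5, 9, 1212, 6, 3, 84, 11, 1, 0, 2],
})
demo["log_followers"] = np.log1p(demo["followers"])

# Quartiles per label: the box spans q25..q75 and the line inside is the median.
summary = demo.groupby("label")["log_followers"].quantile([0.25, 0.5, 0.75]).unstack()
summary.columns = ["q25", "median", "q75"]
summary["iqr"] = summary["q75"] - summary["q25"]
# Points above this fence would be drawn as individual outlier markers.
summary["upper_fence"] = summary["q75"] + 1.5 * summary["iqr"]

print(summary)
```

In this toy sample the account with 1212 followers lies above the Human fence even on the log scale, so it would appear as an outlier point.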
In [13]:
# Pair plot: pairwise relationships between features, colored by label
pair_features = ['log_public_repos', 'log_public_gists', 'log_followers', 'log_following']
sns.pairplot(
    log_transformed_data[pair_features + ['label']],
    hue='label',
    palette='Set1',
    diag_kind='kde',
    markers=["o", "s"]
)
plt.suptitle('Pairplot of Features by Label', y=1.02)
plt.show()
In [14]:
# Heatmap: correlations between log_public_repos, log_public_gists, log_followers, and log_following
correlation_matrix = log_transformed_data[pair_features].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', square=True)
plt.title('Correlation Heatmap of Log Features')
plt.show()
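The matrix passed to the heatmap comes from `DataFrame.corr()`, which computes Pearson correlations by default. As a small sanity check, the sketch below rebuilds such a matrix from the first five rows shown in the head() output and compares one entry against `np.corrcoef`; five rows say nothing about the full dataset, so only the mechanics matter here.

```python
import numpy as np
import pandas as pd

# First five rows of three columns, as printed by head().
demo = pd.DataFrame({
    "public_repos": [26, 30, 103, 49, 11],
    "followers": [5, 9, 1212, 84, 6],
    "following": [1, 6, 221, 2, 2],
})

log_demo = np.log1p(demo).add_prefix("log_")
corr = log_demo.corr()  # Pearson correlation by default

# The same coefficient computed directly with NumPy.
manual = np.corrcoef(log_demo["log_public_repos"], log_demo["log_followers"])[0, 1]

print(corr.round(2))
```

The matrix is symmetric with ones on the diagonal, which is why the heatmap mirrors itself across the diagonal.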
In [18]:
# Violin plot: how the distribution of log_followers differs between labels
plt.figure(figsize=(8, 6))
sns.violinplot(
    x='label',
    y='log_followers',
    data=log_transformed_data,
    hue='label',  # color by class
    palette='muted',
)
plt.title('Violin Plot of Log Followers by Label')
plt.xlabel('Label')
plt.ylabel('Log Followers')
plt.show()
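- Use pandas_profiling.ProfileReport() to generate an interactive data analysis report and analyze the data's statistical distributions, missing values, outliers, and so on. (The package has since been renamed ydata_profiling, which is what the cell below imports.)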
In [19]:
from ydata_profiling import ProfileReport
# Build the profiling report and export it as a standalone HTML file.
# Note: df still holds the string-formatted columns created in In [5].
profile = ProfileReport(df, title='hw8result', explorative=True)
profile.to_file('result.html')
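Two of the things the report is meant to cover, missing values and outliers, can also be approximated by hand with pandas alone. The sketch below does so on made-up data and uses the common 1.5 × IQR rule as an assumed outlier definition; it is not how ydata_profiling itself builds its report.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names mirror the dataset.
demo = pd.DataFrame({
    "bio": [np.nan, "dev", np.nan, "ops", np.nan, "ml"],
    "followers": [5, 9, 1212, 84, 6, 3],
    "following": [1, 6, 221, 2, 2, 0],
})

# Fraction of missing values per column.
missing_share = demo.isna().mean()

# Count values outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR] per numeric column.
numeric = demo.select_dtypes("number")
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
outlier_counts = outside.sum()

print(missing_share)
print(outlier_counts)
```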