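Homework 7

Data preprocessing

Data quality: handling missing values, outliers, and duplicate records
Data structure: format conversion and data merging

1. Remove duplicate rows and print the row count before and after deduplication

2. Handle missing values

First, drop the gravatar_id column and check how many values are missing in each column.
Next, convert the columns that can be expressed as booleans into boolean variables (this makes missing fields easier to handle, e.g. whether a company or location is present), and fill missing text values with empty strings, etc.
Finally, check each column for missing values again.

3. Data transformation: convert created_at and updated_at to timestamps

4. Data visualization

4.1 Visualize the distribution of bot and human accounts (choose any chart type; in the report, explain the choice, analyze the results, and summarize the insights)
4.2 Visualize created_at for bot accounts (choose any chart type; in the report, explain the choice, analyze the results, and summarize the insights)
4.3 Visualize created_at for human accounts (choose any chart type; in the report, explain the choice, analyze the results, and summarize the insights)
4.4 Visualize followers and following for bot accounts (choose any chart type; in the report, explain the choice, analyze the results, and summarize the insights)
4.5 Visualize followers and following for human accounts (choose any chart type; in the report, explain the choice, analyze the results, and summarize the insights)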

In [13]:

import pandas as pd
raw_data = pd.read_csv('data/github_bot_raw_data.csv')
# Column reference
columns = [
    'actor_id',  # GitHub user ID (example: 1081405)
    'label',  # User label, "Human" or "Bot" (example: Human)
    'login',  # GitHub login name (example: dlazesz)
    'id',  # GitHub ID of the user (example: 1081405)
    'node_id',  # GitHub node ID of the user (example: MDQ6VXNlcjEwODE0MDU=)
    'avatar_url',  # GitHub avatar URL (example: https://avatars.githubusercontent.com/u/1081405?v=4)
    'gravatar_id',  # Gravatar ID (example: None)
    'url',  # GitHub API URL of the user (example: https://api.github.com/users/dlazesz)
    'html_url',  # GitHub HTML URL of the user (example: https://github.com/dlazesz)
    'followers_url',  # Followers URL (example: https://api.github.com/users/dlazesz/followers)
    'following_url',  # Following URL (example: https://api.github.com/users/dlazesz/following{/other_user})
    'gists_url',  # Gists URL (example: https://api.github.com/users/dlazesz/gists{/gist_id})
    'starred_url',  # Starred URL (example: https://api.github.com/users/dlazesz/starred{/owner}{/repo})
    'subscriptions_url',  # Subscriptions URL (example: https://api.github.com/users/dlazesz/subscriptions)
    'organizations_url',  # Organizations URL (example: https://api.github.com/users/dlazesz/orgs)
    'repos_url',  # Repositories URL (example: https://api.github.com/users/dlazesz/repos)
    'events_url',  # Events URL (example: https://api.github.com/users/dlazesz/events{/privacy})
    'received_events_url',  # Received-events URL (example: https://api.github.com/users/dlazesz/received_events)
    'type',  # Account type, usually "User" (example: User)
    'site_admin',  # Whether the user is a GitHub site administrator (example: False)
    'name',  # Display name of the user (example: Indig Balázs)
    'company',  # Company of the user (example: None)
    'blog',  # Blog of the user (example: None)
    'location',  # Location of the user (example: None)
    'email',  # Email address of the user (example: None)
    'hireable',  # Whether the user is open to being hired (example: None)
    'bio',  # Self-description from the user's GitHub profile (example: None)
    'twitter_username',  # Twitter username of the user (example: None)
    'public_repos',  # Number of public repositories (example: 26)
    'public_gists',  # Number of public gists (example: 1)
    'followers',  # Number of GitHub users following this user (example: 5)
    'following',  # Number of GitHub users this user follows (example: 1)
    'created_at',  # Account creation date (example: 2011-09-26T17:27:03Z)
    'updated_at',  # Date of the last account update (example: 2023-10-13T11:21:10Z)
]
# Copy the selection so later column edits do not act on a view of raw_data
data = raw_data[columns].copy()

1. Remove duplicate rows and print the row count before and after deduplication

In [14]:

initial_size = data.shape[0]
# Reassign rather than use inplace=True, so a frame derived from a slice is not modified in place
data = data.drop_duplicates()
final_size = data.shape[0]
print(f"Rows before: {initial_size}, rows after deduplication: {final_size}")

Rows before: 20358, rows after deduplication: 19779
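
As a sanity check on the deduplication step, the sketch below repeats it on a small hand-made frame. The toy rows, and the idea of treating `actor_id` as a key, are illustrative assumptions and are not taken from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical rows): the second and third rows are exact duplicates.
toy = pd.DataFrame({
    "actor_id": [1, 2, 2, 3],
    "label": ["Human", "Bot", "Bot", "Human"],
})

before = len(toy)
deduped = toy.drop_duplicates()   # drops rows that are identical in every column
after = len(deduped)

# Rows that share a key but differ in other columns survive drop_duplicates();
# duplicated(subset=...) is one way to count repeats of a key.
same_key = int(toy.duplicated(subset=["actor_id"]).sum())

print(f"before: {before}, after: {after}, repeated actor_id rows: {same_key}")
```

Comparing the full-row count with the per-key count is a quick way to see whether the removed rows were exact copies or merely repeated IDs.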

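2. Handle missing values

First, drop the gravatar_id column and check how many values are missing in each column.
Next, convert the columns that can be expressed as booleans into boolean variables (this makes missing fields easier to handle, e.g. whether a company or location is present), and fill missing text values with empty strings, etc.
Finally, check each column for missing values again.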

In [15]:

# Drop the gravatar_id column
data = data.drop(columns=['gravatar_id'])
# Count the missing values in each column
print("Missing values per column:")
print(data.isnull().sum())

Missing values per column:
actor_id                   0
label                      0
login                      0
id                         0
node_id                    0
avatar_url                 0
url                        0
html_url                   0
followers_url              0
following_url              0
gists_url                  0
starred_url                0
subscriptions_url          0
organizations_url          0
repos_url                  0
events_url                 0
received_events_url        0
type                       0
site_admin                 0
name                    2589
company                 8976
blog                   11262
location                7079
email                  11739
hireable               16481
bio                    10930
twitter_username       14859
public_repos               0
public_gists               0
followers                  0
following                  0
created_at                 0
updated_at                 0
dtype: int64

In [16]:

data = data.copy()
# Convert the columns that can be treated as flags into booleans.
# notna() records whether a value is present; astype(bool) would map NaN to True.
for col in ['company', 'location', 'email', 'twitter_username']:
    data[col] = data[col].notna()
# hireable is itself a flag: eq(True) keeps True and maps everything else, including NaN, to False
data['hireable'] = data['hireable'].eq(True)

# Fill the remaining missing values (text columns) with empty strings
data = data.fillna('')

print("Missing values after processing:")
print(data.isnull().sum())

Missing values after processing:
actor_id               0
label                  0
login                  0
id                     0
node_id                0
avatar_url             0
url                    0
html_url               0
followers_url          0
following_url          0
gists_url              0
starred_url            0
subscriptions_url      0
organizations_url      0
repos_url              0
events_url             0
received_events_url    0
type                   0
site_admin             0
name                   0
company                0
blog                   0
location               0
email                  0
hireable               0
bio                    0
twitter_username       0
public_repos           0
public_gists           0
followers              0
following              0
created_at             0
updated_at             0
dtype: int64
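
When nullable text columns are turned into presence flags, it helps to know how pandas treats NaN during a boolean cast. The sketch below uses a hypothetical four-value series standing in for a column such as `company`; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values mimicking a nullable text column such as `company`.
company = pd.Series(["ACME", np.nan, "", "Initech"], dtype=object)

# Element-wise truthiness: NaN is truthy, so a missing value becomes True.
as_bool = company.astype(bool)

# Presence test: True only where a value exists (an empty string counts as present).
present = company.notna()

print(as_bool.tolist())
print(present.tolist())
```

If the goal is "does this profile list a company", `notna()` matches that intent more closely. Whether empty strings should also count as missing is a separate decision.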

3. Data transformation: convert created_at and updated_at to timestamps

In [17]:

data['created_at'] = pd.to_datetime(data['created_at'], errors='coerce')
data['updated_at'] = pd.to_datetime(data['updated_at'], errors='coerce')

In [18]:

# Print the first two rows to check the converted timestamps
print("First two timestamps:")
print(data['created_at'].head(n=2))
print(data['updated_at'].head(n=2))

First two timestamps:
0   2011-09-26 17:27:03+00:00
1   2015-06-29 10:12:46+00:00
Name: created_at, dtype: datetime64[ns, UTC]
0   2023-10-13 11:21:10+00:00
1   2023-10-07 06:26:14+00:00
Name: updated_at, dtype: datetime64[ns, UTC]

In [19]:

print(data[['created_at', 'updated_at']].dtypes)

created_at    datetime64[ns, UTC]
updated_at    datetime64[ns, UTC]
dtype: object
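
If a numeric timestamp is needed rather than a datetime object, the timezone-aware values can be turned into Unix epoch seconds. The sketch below is one way to do that, using the two `created_at` sample values shown in the output above; the variable names are illustrative.

```python
import pandas as pd

# Two sample created_at values, parsed as timezone-aware UTC datetimes.
created = pd.to_datetime(
    pd.Series(["2011-09-26T17:27:03Z", "2015-06-29T10:12:46Z"]), utc=True
)

# Seconds elapsed since the Unix epoch, as integers.
epoch = pd.Timestamp("1970-01-01", tz="UTC")
unix_seconds = (created - epoch) // pd.Timedelta(seconds=1)

# Round trip back to datetimes to confirm that nothing was lost.
restored = pd.to_datetime(unix_seconds, unit="s", utc=True)

print(unix_seconds)
print(restored)
```

Keeping the column as `datetime64[ns, UTC]` is usually more convenient for plotting and for the `.dt` accessor. Epoch seconds are mainly useful as a numeric model feature.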

4. Data visualization

4.1 Visualize the distribution of bot and human accounts (choose any chart type; in the report, explain the choice, analyze the results, and summarize the insights)

In [20]:

import matplotlib.pyplot as plt

# SimHei is only needed when labels contain Chinese characters
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Number of Bot and Human accounts
label_counts = data['label'].value_counts()

# Draw a bar chart
plt.figure(figsize=(8, 6))
label_counts.plot(kind='bar', color=['skyblue', 'orange'])
plt.title("Bot vs. Human account counts")
plt.xticks(rotation=0)
plt.show()

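A bar chart places the two account counts side by side, which makes the class balance easy to read. Human accounts far outnumber Bot accounts in this dataset, suggesting that most of the labelled accounts belong to people. A plausible reading is that bot accounts serve narrower purposes, such as automation or data collection, and are not the main user group.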
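
To put a number on "far outnumber", the counts can be accompanied by shares. The sketch below uses a hypothetical 80/20 label series, not the real data, to show the pattern.

```python
import pandas as pd

# Hypothetical labels with an 80/20 split, for illustration only.
labels = pd.Series(["Human"] * 8 + ["Bot"] * 2, name="label")

counts = labels.value_counts()                 # absolute counts per label
shares = labels.value_counts(normalize=True)   # fraction of all rows per label

summary = pd.DataFrame({"count": counts, "share": shares.round(3)})
print(summary)
```

Reporting the share alongside the bar chart makes the imbalance explicit, which matters if the labels are later used to train a classifier.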

In [22]:

import seaborn as sns

data['year'] = data['created_at'].dt.year  # account creation year

# New accounts per year for each label
bot_counts = data[data['label'] == 'Bot'].groupby('year').size()
human_counts = data[data['label'] == 'Human'].groupby('year').size()

plt.figure(figsize=(8, 6))
sns.set(style='whitegrid')

# Bot line, with each point annotated by its count
sns.lineplot(x=bot_counts.index, y=bot_counts.values, label='Bot', color='red', marker='o')
for x, y in zip(bot_counts.index, bot_counts.values):
    plt.text(x, y + 10, str(y), color='black', ha='center', fontsize=9)

# Human line, with each point annotated by its count
sns.lineplot(x=human_counts.index, y=human_counts.values, label='Human', color='blue', marker='o')
for x, y in zip(human_counts.index, human_counts.values):
    plt.text(x, y + 10, str(y), color='black', ha='center', fontsize=9)

plt.title('New Accounts', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Number of New Accounts', fontsize=12)
plt.legend(title='Account Type', fontsize=10)
plt.xticks(rotation=45)

plt.show()

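A line chart shows how many Bot and Human accounts were created each year, so the two trends can be compared directly. In every year, far more new Human accounts than Bot accounts appear, which again indicates that human users make up the bulk of the platform.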
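
The per-year counts can also be built in a single table, which keeps both labels on the same year index and fills years without accounts with zero. The sketch below assumes a toy frame with invented dates; only the column names `label` and `created_at` come from the dataset.

```python
import pandas as pd

# Toy frame with invented creation dates.
toy = pd.DataFrame({
    "label": ["Bot", "Human", "Human", "Human", "Bot", "Human"],
    "created_at": pd.to_datetime(
        ["2015-03-01", "2015-07-12", "2016-01-05",
         "2016-09-30", "2017-02-14", "2017-11-11"],
        utc=True,
    ),
})
toy["year"] = toy["created_at"].dt.year

# One row per year, one column per label; missing combinations become 0.
per_year = toy.groupby(["year", "label"]).size().unstack(fill_value=0)
print(per_year)
```

A table in this shape can be passed to `per_year.plot()` directly, and a year with no bot accounts shows up as an explicit zero rather than a gap in the line.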

4.2 Visualize created_at for bot accounts (choose any chart type; in the report, explain the choice, analyze the results, and summarize the insights)

In [7]:

# Creation timestamps of Bot accounts
bot_created_at = data[data['label'] == 'Bot']['created_at']

plt.figure(figsize=(10, 6))
bot_created_at.hist(bins=30, color='red')
plt.title("Distribution of bot account creation dates")
plt.xlabel("Creation date")
plt.ylabel("Count")
plt.show()

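A histogram shows how creation dates are distributed, which makes it easy to see when bot account creation rose or fell. A peak in a given period means more bot accounts were created then. The bot creation dates are concentrated in particular years, and the peak years hint at periods of higher demand for bots (for example after a new feature launch or during an event).

In the early period (2008-2012), the number of new bot accounts grew slowly, possibly because GitHub had only just been founded and developers had little need for bot tooling. In the middle period (2013-2018), growth was pronounced and peaked around 2018. This may be because 2017 and 2018 were peak years for open-source projects on GitHub: events such as the launch of GitHub Marketplace and the acquisition by Microsoft may have prompted the creation of new bot tools. In the later period (after 2019), the number of new bot accounts fell, possibly because GitHub tightened its management of bot accounts (for example by requiring verification or authentication) or because alternative tools became available.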

4.3 Visualize created_at for human accounts (choose any chart type; in the report, explain the choice, analyze the results, and summarize the insights)

In [8]:

# Creation timestamps of Human accounts
human_created_at = data[data['label'] == 'Human']['created_at']

plt.figure(figsize=(10, 6))
human_created_at.hist(bins=30, color='blue')
plt.title("Distribution of human account creation dates")
plt.xlabel("Creation date")
plt.ylabel("Count")
plt.show()

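As with the bot accounts, a histogram makes the growth trend easy to see. A marked increase in account creation in certain years means the GitHub community attracted many new users then. Such peaks may stem from a growing developer population or from promotion campaigns, and they also reflect the rising popularity of open-source development.

From 2008 to 2010, new users were few and growth was slow, probably because GitHub was a new platform that had not yet been widely adopted. From 2010 to 2012, the number of users began to grow noticeably, which may be linked to GitHub's gradual spread through the technical community, in particular the growing acceptance of its early project-hosting features. From 2013 to 2015, new registrations peaked. Open-source culture may have drawn more developers to GitHub, and the arrival of enterprise users and a large influx of open-source projects may also have contributed. From 2016 onward, new registrations gradually declined, possibly because the user base was approaching saturation and growth slowed. In 2021, new human accounts reached a low point, possibly because the data for that year are incomplete or because of the pandemic.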

4.4 Visualize followers and following for bot accounts (choose any chart type; in the report, explain the choice, analyze the results, and summarize the insights)

In [26]:

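# Keep only the followers and following columns of Bot accounts
bot_data = data[data['label'] == 'Bot']
bot_data = bot_data[['followers', 'following']]

plt.figure(figsize=(10, 6))
sns.scatterplot(x='followers', y='following', data=bot_data, s=5)
plt.title('Bot Followers and Following')
# Zoom in on the dense region; accounts above 500 on either axis fall outside the view
plt.xlim(0, 500)
plt.ylim(0, 500)
plt.show()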

In [28]:

plt.figure(figsize=(10, 6))
plt.scatter(bot_data['followers'], bot_data['following'], alpha=0.7, color='purple')
plt.title("Bot Followers and Following")
plt.xlabel("followers")
plt.ylabel("following")
plt.show()

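A scatter plot is well suited to showing the relationship between two variables, here whether the number of followers and the number of accounts followed (following) are correlated. The points are scattered fairly randomly and most of them lie in the region where both values are low, so visually there is no clear correlation between followers and following for bot accounts.

Most bot accounts have almost no followers or follow only a handful of users, which reflects that they mainly perform specific automated tasks and interact little with other accounts.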
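
The visual impression of "no clear correlation" can be backed by a correlation coefficient. The sketch below is a minimal example on invented numbers: it computes Pearson's coefficient and a rank-based (Spearman-style) coefficient, obtained here as the Pearson correlation of the ranks. The toy values are assumptions, not results from the dataset.

```python
import pandas as pd

# Invented followers/following pairs with one extreme account.
bots = pd.DataFrame({
    "followers": [0, 1, 2, 5, 40, 3000],
    "following": [0, 0, 1, 0, 2, 1],
})

# Pearson is sensitive to the extreme value.
pearson = bots["followers"].corr(bots["following"])

# Rank correlation: Pearson applied to ranks, which dampens the effect of outliers.
spearman = bots["followers"].rank().corr(bots["following"].rank())

print(f"Pearson: {pearson:.3f}, rank-based: {spearman:.3f}")
```

Because follower counts are heavy-tailed, the rank-based value is usually the more informative of the two for this kind of data.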

4.5 Visualize followers and following for human accounts (choose any chart type; in the report, explain the choice, analyze the results, and summarize the insights)

In [29]:

# Select the Human accounts
human_data = data[data['label'] == 'Human']
human_data = human_data[['followers', 'following']]
plt.figure(figsize=(5, 3))
sns.scatterplot(x='followers', y='following', data=human_data, s=5)
plt.title('Human Followers and Following')
# Set the x-axis and y-axis ranges
plt.xlim(0, 5000)
plt.ylim(0, 5000)
plt.show()

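A scatter plot is well suited to showing the relationship between two variables, here whether followers and following are correlated. For most human accounts both values are low and the points cluster in the lower-left corner, which suggests that most GitHub users are not socially active or are ordinary users with limited numbers of followers and followed accounts. The few points far to the right and close to the horizontal axis are accounts with many followers but few followed accounts. These are typically well-known developers or organizations, the kind of accounts that carry more influence in the GitHub community.

For most human accounts, followers and following both fall in a low range: followers are below 1000 and following is also mostly below 1000, so the majority of human users do not have a large following. A few accounts have very high followers and/or following values and may belong to celebrities, organizations, or other influential members of the community. There is no obvious linear correlation between followers and following.

The low follower and following counts of most human users reflect the social pattern on GitHub. The platform is centered on projects and code, so compared with other social platforms, interaction between users leans toward technical collaboration rather than broad social connection.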
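
Statements such as "most accounts have fewer than 1000 followers" can be quantified with a few summary numbers. The sketch below is written under the assumption of a toy frame with invented values. It computes the share of accounts under a threshold, the median, and a `log1p` transform that is often used to make heavy-tailed counts easier to plot.

```python
import numpy as np
import pandas as pd

# Invented followers/following values with one very popular account.
humans = pd.DataFrame({
    "followers": [0, 3, 12, 45, 150, 980, 25000],
    "following": [1, 0, 20, 5, 300, 10, 2],
})

# Share of accounts with fewer than 1000 followers.
below_1000 = (humans["followers"] < 1000).mean()

# The median is robust to the single extreme account.
median_followers = humans["followers"].median()

# log1p(x) = log(1 + x) compresses the long tail and is defined at zero.
log_followers = np.log1p(humans["followers"])

print(f"share below 1000: {below_1000:.2%}, median: {median_followers}")
print(log_followers.round(2).tolist())
```

Plotting the log-transformed values, or using logarithmic axes, keeps the dense low-count cluster and the few very large accounts visible in the same figure without clipping the axes.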