Lab3 MongoDB 的聚合管道¶
何为聚合操作¶
聚合操作主要是通过对数据进行分组后做出一些简单的运算,例如平均,求和,最值等。MongoDB 中的聚合运算主要通过 aggregate() 方法实现
Python 中aggregate()方式实现了利用聚合管道对文档进行变化计算和展示。文档进入聚合管道会依次经过筛选 (filtering),分组 (grouping) 并聚合,排序 (sorting),投射 (projecting),限制 (limiting) (或者跳过 (skipping) ) 变化。
| 符号 | 含义 | 描述 |
|---|---|---|
| $match | 筛选 | 按照一定条件筛选出特定的文档记录 |
| $project | 选择 | 修改输入文档的结构。可以用来重命名、增加或删除域,也可以用于创建计算结果以及嵌套文档 |
| $group | 分组 | 对文档按照字段分组,以便做一些聚合运算 |
| $sort | 排序 | 对文档按照字段排序 |
| $limit | 限制 | 限制MongoDB聚合管道返回的文档数 |
| $skip | 跳过 | 在聚合管道中跳过指定数量的文档,并返回余下的文档 |
简单的聚合¶
首先连接数据库创建数据表并且插入数据
# 连接Mongodb数据库
import pymongo
client = pymongo.MongoClient("mongodb://ecnu10211900416:ECNU10211900416@172.16.14.60:27017/")
db = client["ecnu10211900416"]
users_col = db["users"]
# 插入数据
users_col.delete_many({})
users = [
{
"name": "Joe",
"gender": "m",
"age": 23,
"birthdate": {"day": 15, "month": 3, "year": 1997},
"hobby": ["football", "basketball", "reading"],
"city": "Beijing",
"time": [9, 18],
},
{
"name": "Kate",
"gender": "f",
"age": 22,
"birthdate": {"day": 25, "month": 7, "year": 1998},
"hobby": ["reading", "piano"],
"city": "Hangzhou",
"time": [8, 17],
},
{
"name": "Rose",
"gender": "f",
"age": 24,
"birthdate": {"day": 3, "month": 3, "year": 1996},
"hobby": ["basketball", "running", "traveling"],
"city": "Shanghai",
"time": [9, 19],
},
{
"name": "Jason",
"gender": "m",
"age": 21,
"birthdate": {"day": 17, "month": 12, "year": 1999},
"hobby": ["cooking", "photography"],
"city": "Chengdu",
"time": [8, 20],
},
{
"name": "Grace",
"gender": "f",
"age": 22,
"birthdate": {"day": 10, "month": 6, "year": 1998},
"hobby": ["photography", "cooking", "drama"],
"city": "Nanjing",
"time": [9, 18],
},
{
"name": "Jessica",
"gender": "f",
"age": 22,
"birthdate": {"day": 21, "month": 3, "year": 1998},
"hobby": ["cooking", "piano"],
"city": "Shanghai",
"time": [10, 19],
},
{
"name": "Donna",
"gender": "f",
"age": 22,
"birthdate": {"day": 24, "month": 9, "year": 1998},
"hobby": ["violin", "drama"],
"city": "Shanghai",
"time": [9, 20],
},
{
"name": "Apple",
"gender": "m",
"age": 23,
"birthdate": {"day": 20, "month": 9, "year": 1997},
"hobby": ["violin", "running"],
"city": "Chengdu",
"time": [9, 19],
},
{
"name": "Baba",
"gender": "f",
"age": 25,
"birthdate": {"day": 20, "month": 9, "year": 1995},
"hobby": ["violin", "basketball"],
"city": "Chengdu",
"time": [10, 19],
},
]
users_col.insert_many(users)
content = users_col.find()
for each in content:
print(each)
{'_id': ObjectId('651536c99eef77e68c9c8699'), 'name': 'Joe', 'gender': 'm', 'age': 23, 'birthdate': {'day': 15, 'month': 3, 'year': 1997}, 'hobby': ['football', 'basketball', 'reading'], 'city': 'Beijing', 'time': [9, 18]}
{'_id': ObjectId('651536c99eef77e68c9c869a'), 'name': 'Kate', 'gender': 'f', 'age': 22, 'birthdate': {'day': 25, 'month': 7, 'year': 1998}, 'hobby': ['reading', 'piano'], 'city': 'Hangzhou', 'time': [8, 17]}
{'_id': ObjectId('651536c99eef77e68c9c869b'), 'name': 'Rose', 'gender': 'f', 'age': 24, 'birthdate': {'day': 3, 'month': 3, 'year': 1996}, 'hobby': ['basketball', 'running', 'traveling'], 'city': 'Shanghai', 'time': [9, 19]}
{'_id': ObjectId('651536c99eef77e68c9c869c'), 'name': 'Jason', 'gender': 'm', 'age': 21, 'birthdate': {'day': 17, 'month': 12, 'year': 1999}, 'hobby': ['cooking', 'photography'], 'city': 'Chengdu', 'time': [8, 20]}
{'_id': ObjectId('651536c99eef77e68c9c869d'), 'name': 'Grace', 'gender': 'f', 'age': 22, 'birthdate': {'day': 10, 'month': 6, 'year': 1998}, 'hobby': ['photography', 'cooking', 'drama'], 'city': 'Nanjing', 'time': [9, 18]}
{'_id': ObjectId('651536c99eef77e68c9c869e'), 'name': 'Jessica', 'gender': 'f', 'age': 22, 'birthdate': {'day': 21, 'month': 3, 'year': 1998}, 'hobby': ['cooking', 'piano'], 'city': 'Shanghai', 'time': [10, 19]}
{'_id': ObjectId('651536c99eef77e68c9c869f'), 'name': 'Donna', 'gender': 'f', 'age': 22, 'birthdate': {'day': 24, 'month': 9, 'year': 1998}, 'hobby': ['violin', 'drama'], 'city': 'Shanghai', 'time': [9, 20]}
{'_id': ObjectId('651536c99eef77e68c9c86a0'), 'name': 'Apple', 'gender': 'm', 'age': 23, 'birthdate': {'day': 20, 'month': 9, 'year': 1997}, 'hobby': ['violin', 'running'], 'city': 'Chengdu', 'time': [9, 19]}
{'_id': ObjectId('651536c99eef77e68c9c86a1'), 'name': 'Baba', 'gender': 'f', 'age': 25, 'birthdate': {'day': 20, 'month': 9, 'year': 1995}, 'hobby': ['violin', 'basketball'], 'city': 'Chengdu', 'time': [10, 19]}
按照城市分组计数
result = users_col.aggregate([{"$group": {"_id": "$city", "count": {"$sum": 1}}}])
for each in result:
print(each)
{'_id': 'Nanjing', 'count': 1}
{'_id': 'Shanghai', 'count': 3}
{'_id': 'Beijing', 'count': 1}
{'_id': 'Chengdu', 'count': 3}
{'_id': 'Hangzhou', 'count': 1}
从以上代码中,聚合管道中只有 group 一个操作。在 group 中,可以看到是按照 city 字段进行分组,最后通过“加一”聚合来实现分组计数的。在group中指定了两个字段,第一个是主键 '_id',来源于 city,第二个是 count,来源于求和。当然也可以根据我们的需要,修改/增减字段。
以下列出了常用的聚集运算
| 符号 | 含义 |
|---|---|
| $sum | 求和 |
| $avg | 求平均 |
| $min | 最小值 |
| $max | 最大值 |
| $push | 聚合成数组 |
| $addToSet | 聚合成几何 |
| $first | 排序取第一个 |
| $last | 排序取最后一个 |
练习¶
Task 1 计算不同性别用户的平均年龄,最大年龄,最小年龄并且输出
# todo
pipeline = [
{
"$group": {
"_id": "$gender",
"avg_age": {"$avg": "$age"},
"max_age": {"$max": "$age"},
"min_age": {"$min": "$age"}
}
}
]
result = users_col.aggregate(pipeline)
for each in result:
print(each)
"""
目标结果
{'_id': 'f', 'avg_age': 22.833333333333332, 'max_age': 25, 'min_age': 22}
{'_id': 'm', 'avg_age': 22.333333333333332, 'max_age': 23, 'min_age': 21}
"""
{'_id': 'm', 'avg_age': 22.333333333333332, 'max_age': 23, 'min_age': 21}
{'_id': 'f', 'avg_age': 22.833333333333332, 'max_age': 25, 'min_age': 22}
"\n目标结果\n{'_id': 'f', 'avg_age': 22.833333333333332, 'max_age': 25, 'min_age': 22}\n{'_id': 'm', 'avg_age': 22.333333333333332, 'max_age': 23, 'min_age': 21}\n"
Task 2 列出不同性别的同学名单 (提示:$push)
# todo
pipeline = [
{
"$group": {
"_id": "$gender",
"students": {"$push": "$name"}
}
}
]
result = users_col.aggregate(pipeline)
for each in result:
print(each)
"""
目标结果
{'_id': 'm', 'list': ['Joe', 'Jason', 'Apple']}
{'_id': 'f', 'list': ['Kate', 'Rose', 'Grace', 'Jessica', 'Donna', 'Baba']}
"""
{'_id': 'm', 'students': ['Joe', 'Jason', 'Apple']}
{'_id': 'f', 'students': ['Kate', 'Rose', 'Grace', 'Jessica', 'Donna', 'Baba']}
"\n目标结果\n{'_id': 'm', 'list': ['Joe', 'Jason', 'Apple']}\n{'_id': 'f', 'list': ['Kate', 'Rose', 'Grace', 'Jessica', 'Donna', 'Baba']}\n"
复杂查询¶
一个完整的查询一般需要经过:
- 通过条件筛选文档记录(选择文档的行记录)
- 分组并聚合
- 对文档按照某些字段排序
- 调整文档的键值对形式(调整文档的列)
- 通过limit或者skip展示特定数量的记录
例如:筛选年龄大于等于20岁的同学,并将这些学生的按照城市分组计算平均年龄后升序排列,显示城市和平均年龄
match = {"$match": {"age": {"$gte": 20}}}
group = {"$group": {"_id": "$city", "avg_age": {"$avg": "$age"}}}
sort = {"$sort": {"avg_age": 1}} # 1代表升序,-1代表降序
project = {"$project": {"avg_age": 1}}
result = users_col.aggregate([match, group, sort, project])
for each in result:
print(each)
{'_id': 'Hangzhou', 'avg_age': 22.0}
{'_id': 'Nanjing', 'avg_age': 22.0}
{'_id': 'Shanghai', 'avg_age': 22.666666666666668}
{'_id': 'Chengdu', 'avg_age': 23.0}
{'_id': 'Beijing', 'avg_age': 23.0}
格式化输出每个同学的生日日期,按照生日日期排序
sort = {"$sort": {"birthdate.year": -1, "birthdate.month": -1, "birthdate.day": -1}}
project = {
"$project": {
"_id": 0,
"name": 1,
"birthday": {
"$concat": [
{"$toString": "$birthdate.year"},
"-",
{"$toString": "$birthdate.month"},
"-",
{"$toString": "$birthdate.month"},
]
},
}
} # 0代表不显示该字段,1代表显示该字段
result = users_col.aggregate([sort, project])
for each in result:
print(each)
{'name': 'Jason', 'birthday': '1999-12-12'}
{'name': 'Donna', 'birthday': '1998-9-9'}
{'name': 'Kate', 'birthday': '1998-7-7'}
{'name': 'Grace', 'birthday': '1998-6-6'}
{'name': 'Jessica', 'birthday': '1998-3-3'}
{'name': 'Apple', 'birthday': '1997-9-9'}
{'name': 'Joe', 'birthday': '1997-3-3'}
{'name': 'Rose', 'birthday': '1996-3-3'}
{'name': 'Baba', 'birthday': '1995-9-9'}
以上的‘toString’的作用是将数字转换成字符串,更多函数可以参见https://blog.csdn.net/weixin_43632687/article/details/104201185
按照城市,性别分组计数
group = {
"$group": {"_id": {"city": "$city", "gender": "$gender"}, "count": {"$sum": 1}}
}
project = {
"$project": {
"_id": 0,
"city": "$_id.city",
"gender": "$_id.gender",
"count": "$count",
}
}
result = users_col.aggregate([group, project])
for each in result:
print(each)
{'city': 'Hangzhou', 'gender': 'f', 'count': 1}
{'city': 'Nanjing', 'gender': 'f', 'count': 1}
{'city': 'Chengdu', 'gender': 'f', 'count': 1}
{'city': 'Beijing', 'gender': 'm', 'count': 1}
{'city': 'Chengdu', 'gender': 'm', 'count': 2}
{'city': 'Shanghai', 'gender': 'f', 'count': 3}
以上用到的重命名方法和多字段分组聚合的方法需要好好体会
练习¶
Task 3 找出喜欢'violin'的人数(提示:$in)
# todo
match={'$match':{'hobby':{'$in':['violin']}}}
project={'$project':{'_id':'like_violin','count':'$count'}}
group={'$group':{'_id':'like_violin','count':{'$sum':1}}}
result = users_col.aggregate([match,group,project])
for each in result:
print(each)
"""
目标结果
{'_id': 'like_violin', 'count': 3}
"""
{'_id': 'like_violin', 'count': 3}
"\n目标结果\n{'_id': 'like_violin', 'count': 3}\n"
$unwind 拆分数组,查询拥有各个爱好的学生人数
# 先通过数组拆分将一条记录拆分成多条记录
unwind = {"$unwind": "$hobby"}
group = {"$group": {"_id": "$hobby", "count": {"$sum": 1}}}
result = users_col.aggregate([unwind, group])
for each in result:
print(each)
{'_id': 'basketball', 'count': 3}
{'_id': 'reading', 'count': 2}
{'_id': 'running', 'count': 2}
{'_id': 'cooking', 'count': 3}
{'_id': 'photography', 'count': 2}
{'_id': 'violin', 'count': 3}
{'_id': 'piano', 'count': 2}
{'_id': 'football', 'count': 1}
{'_id': 'traveling', 'count': 1}
{'_id': 'drama', 'count': 2}
练习¶
Task 4 找出爱好个数为3的同学,展示姓名,年龄与爱好(不使用 \$size 来求长度,要求使用 \$unwind 来拆分数组和 \$push 来合并数组)
# todo
pipeline = [
{'$unwind': '$hobby'},
{
'$group': {
'_id':{
'name': '$name',
'age': '$age'
},
'count': {'$sum': 1},
'hobbies': {'$push': '$hobby'}
}
},
{'$match': {'count': {'$in':[3,3]} }},
{
'$project': {
'_id': 0,
'name': '$_id.name',
'age': '$_id.age',
'hobby': '$hobbies'
}
}
]
result = users_col.aggregate(pipeline)
for each in result:
print(each)
"""
目标结果
{'name': 'Rose', 'age': 24, 'hobby': ['basketball', 'running', 'traveling']}
{'name': 'Grace', 'age': 22, 'hobby': ['photography', 'cooking', 'drama']}
{'name': 'Joe', 'age': 23, 'hobby': ['football', 'basketball', 'reading']}
"""
{'name': 'Rose', 'age': 24, 'hobby': ['basketball', 'running', 'traveling']}
{'name': 'Joe', 'age': 23, 'hobby': ['football', 'basketball', 'reading']}
{'name': 'Grace', 'age': 22, 'hobby': ['photography', 'cooking', 'drama']}
"\n目标结果\n{'name': 'Rose', 'age': 24, 'hobby': ['basketball', 'running', 'traveling']}\n{'name': 'Grace', 'age': 22, 'hobby': ['photography', 'cooking', 'drama']}\n{'name': 'Joe', 'age': 23, 'hobby': ['football', 'basketball', 'reading']}\n"
索引:对 Lab2 的一些补充¶
单条索引:按照年龄生序¶
users_col.create_index([("name", 1)], unique=True)
'name_1'
复合索引:创建复合索引,按照姓名生序,按照年龄降序¶
users_col.create_index([("name", 1), ("age", -1)], unique=True)
'name_1_age_-1'
删除索引¶
print("删除前索引信息\n", users_col.index_information())
users_col.drop_index("name_1") # 括号里面的参数是索引名
users_col.drop_index("name_1_age_-1")
print("删除后索引信息\n", users_col.index_information())
删除前索引信息
{'_id_': {'v': 2, 'key': [('_id', 1)]}, 'name_1': {'v': 2, 'key': [('name', 1)], 'unique': True}, 'name_1_age_-1': {'v': 2, 'key': [('name', 1), ('age', -1)], 'unique': True}}
删除后索引信息
{'_id_': {'v': 2, 'key': [('_id', 1)]}}
性能测试¶
索引的价值在于提高基于索引字进行段数据查询的效率,我们可以通过构造一批数据,对比建索引前的查询时间来体会索引的价值
# 构造3000000条数据
import random
# 先清空一下数据库
users_col.delete_many({})
batch_users = []
sex = ["f", "m"]
for i in range(300000):
user = {
"name": "xxx" + str(i),
"age": random.randint(20, 55), # 产生20,55之间的随机数
"gender": sex[random.randint(0, 1)],
}
batch_users.append(user)
users_col.insert_many(batch_users)
<pymongo.results.InsertManyResult at 0x7f3e60ae3850>
# 直接查询用时
import datetime
starttime = datetime.datetime.now()
result = users_col.find(
{
"$or": [
{"name": "xxx10000"},
{"name": "xxx140000"},
{"name": "xxx9000"},
{"name": "xxx23000"},
{"name": "xxx24050"},
{"name": "xxx12000"},
{"name": "xxx14300"},
{"name": "xxx9300"},
{"name": "xxx23300"},
{"name": "xxx24350"},
{"name": "xxx11100"},
{"name": "xxx15200"},
{"name": "xxx8100"},
{"name": "xxx22100"},
{"name": "xxx26150"},
{"name": "xxx10200"},
{"name": "xxx14020"},
{"name": "xxx9020"},
{"name": "xxx23020"},
{"name": "xxx24070"},
{"name": "xxx10300"},
{"name": "xxx14030"},
{"name": "xxx9030"},
{"name": "xxx23030"},
{"name": "xxx24080"},
]
}
)
for each in result:
print(each)
endtime = datetime.datetime.now()
print("开始时间:", starttime)
print("结束时间:", endtime)
print("时间差(微秒):", (endtime - starttime).microseconds)
{'_id': ObjectId('6515384d9eef77e68c9ca646'), 'name': 'xxx8100', 'age': 37, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9ca9ca'), 'name': 'xxx9000', 'age': 37, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9ca9de'), 'name': 'xxx9020', 'age': 41, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9ca9e8'), 'name': 'xxx9030', 'age': 33, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9caaf6'), 'name': 'xxx9300', 'age': 37, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9cadb2'), 'name': 'xxx10000', 'age': 38, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9cae7a'), 'name': 'xxx10200', 'age': 38, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9caede'), 'name': 'xxx10300', 'age': 24, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9cb1fe'), 'name': 'xxx11100', 'age': 22, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9cb582'), 'name': 'xxx12000', 'age': 26, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9cbd66'), 'name': 'xxx14020', 'age': 48, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9cbd70'), 'name': 'xxx14030', 'age': 50, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9cbe7e'), 'name': 'xxx14300', 'age': 21, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9cc202'), 'name': 'xxx15200', 'age': 52, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9cdcf6'), 'name': 'xxx22100', 'age': 26, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9ce07a'), 'name': 'xxx23000', 'age': 54, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9ce08e'), 'name': 'xxx23020', 'age': 27, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9ce098'), 'name': 'xxx23030', 'age': 53, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9ce1a6'), 'name': 'xxx23300', 'age': 22, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9ce494'), 'name': 'xxx24050', 'age': 25, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9ce4a8'), 'name': 'xxx24070', 'age': 53, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9ce4b2'), 'name': 'xxx24080', 'age': 38, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9ce5c0'), 'name': 'xxx24350', 'age': 49, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9cecc8'), 'name': 'xxx26150', 'age': 25, 'gender': 'f'}
{'_id': ObjectId('6515384e9eef77e68c9ea982'), 'name': 'xxx140000', 'age': 35, 'gender': 'f'}
开始时间: 2023-09-28 08:24:49.589881
结束时间: 2023-09-28 08:24:49.748082
时间差(微秒): 158201
# 创建索引查询用时间
users_col.create_index([("name", 1)], unique=True)
starttime = datetime.datetime.now()
result = users_col.find(
{
"$or": [
{"name": "xxx10000"},
{"name": "xxx140000"},
{"name": "xxx9000"},
{"name": "xxx23000"},
{"name": "xxx24050"},
{"name": "xxx12000"},
{"name": "xxx14300"},
{"name": "xxx9300"},
{"name": "xxx23300"},
{"name": "xxx24350"},
{"name": "xxx11100"},
{"name": "xxx15200"},
{"name": "xxx8100"},
{"name": "xxx22100"},
{"name": "xxx26150"},
{"name": "xxx10200"},
{"name": "xxx14020"},
{"name": "xxx9020"},
{"name": "xxx23020"},
{"name": "xxx24070"},
{"name": "xxx10300"},
{"name": "xxx14030"},
{"name": "xxx9030"},
{"name": "xxx23030"},
{"name": "xxx24080"},
]
}
)
for each in result:
print(each)
endtime = datetime.datetime.now()
print("开始时间:", starttime)
print("结束时间:", endtime)
print("时间差(微秒):", (endtime - starttime).microseconds)
users_col.drop_index("name_1") # 结束后删除索引以防之后忘记删除
{'_id': ObjectId('6515384d9eef77e68c9cadb2'), 'name': 'xxx10000', 'age': 38, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9cae7a'), 'name': 'xxx10200', 'age': 38, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9caede'), 'name': 'xxx10300', 'age': 24, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9cb1fe'), 'name': 'xxx11100', 'age': 22, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9cb582'), 'name': 'xxx12000', 'age': 26, 'gender': 'f'}
{'_id': ObjectId('6515384e9eef77e68c9ea982'), 'name': 'xxx140000', 'age': 35, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9cbd66'), 'name': 'xxx14020', 'age': 48, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9cbd70'), 'name': 'xxx14030', 'age': 50, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9cbe7e'), 'name': 'xxx14300', 'age': 21, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9cc202'), 'name': 'xxx15200', 'age': 52, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9cdcf6'), 'name': 'xxx22100', 'age': 26, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9ce07a'), 'name': 'xxx23000', 'age': 54, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9ce08e'), 'name': 'xxx23020', 'age': 27, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9ce098'), 'name': 'xxx23030', 'age': 53, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9ce1a6'), 'name': 'xxx23300', 'age': 22, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9ce494'), 'name': 'xxx24050', 'age': 25, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9ce4a8'), 'name': 'xxx24070', 'age': 53, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9ce4b2'), 'name': 'xxx24080', 'age': 38, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9ce5c0'), 'name': 'xxx24350', 'age': 49, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9cecc8'), 'name': 'xxx26150', 'age': 25, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9ca646'), 'name': 'xxx8100', 'age': 37, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9ca9ca'), 'name': 'xxx9000', 'age': 37, 'gender': 'm'}
{'_id': ObjectId('6515384d9eef77e68c9ca9de'), 'name': 'xxx9020', 'age': 41, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9ca9e8'), 'name': 'xxx9030', 'age': 33, 'gender': 'f'}
{'_id': ObjectId('6515384d9eef77e68c9caaf6'), 'name': 'xxx9300', 'age': 37, 'gender': 'm'}
开始时间: 2023-09-28 08:24:54.889493
结束时间: 2023-09-28 08:24:55.063970
时间差(微秒): 174477
在我的机器上运行后发现建立索引前后明显的查询时间分别为307881微秒和11248,创建索引之后的查询耗时有了明显的降低
users_col.delete_many({})
# 插入无关数据后记得删除,保持良好习惯,以免学院服务器崩坏
<pymongo.results.DeleteResult at 0x7f3e60ae36d0>
练习¶
Task 5 下面需要同学们探索对比,创建索引对插入数据的影响
# todo
# 无索引
users_col.delete_many({})
users = []
sex = ["f", "m"]
for i in range(300000):
user = {
"name": "user" + str(i),
"age": random.randint(20, 55),
"gender": sex[random.randint(0, 1)],
}
users.append(user)
starttime = datetime.datetime.now()
users_col.insert_many(users)
endtime = datetime.datetime.now()
print("开始时间:", starttime)
print("结束时间:", endtime)
print("无索引插入数据时间差(微秒):", (endtime - starttime).microseconds)
# 有索引
users_col.delete_many({})
users_col.create_index([("name", 1)], unique=True)
starttime = datetime.datetime.now()
users_col.insert_many(users)
endtime = datetime.datetime.now()
print("开始时间:", starttime)
print("结束时间:", endtime)
print("有索引插入数据时间差(微秒):", (endtime - starttime).microseconds)
users_col.drop_index("name_1")
users_col.delete_many({})
开始时间: 2023-09-28 08:55:14.357789 结束时间: 2023-09-28 08:55:18.568521 无索引插入数据时间差(微秒): 210732 开始时间: 2023-09-28 08:55:21.919259 结束时间: 2023-09-28 08:55:26.374543 有索引插入数据时间差(微秒): 455284
<pymongo.results.DeleteResult at 0x7f3e60ae0130>
代码的思路: 先清空数据库,然后创建一堆数据,在没有索引的情况下插入数据并记录用时 再清空数据库,建立索引之后插入数据并记录用时
在shuishan运行结果: 开始时间: 2023-09-28 08:55:14.357789 结束时间: 2023-09-28 08:55:18.568521 无索引插入数据时间差(微秒): 210732 开始时间: 2023-09-28 08:55:21.919259 结束时间: 2023-09-28 08:55:26.374543 有索引插入数据时间差(微秒): 455284
可发现创建索引后插入时间不减反增。 为什么会出现这样的情况呢? 插入数据最效率高的方法就是无索引直接插入。建立索引之后,每次插入都要更新索引,把它插入到索引中的适当位置。 这个额外的操作会有额外的开销,使得插入操作变慢。 设计数据库时,我们可以通过建立索引来提高查询性能,但不能滥用这一特性,还是要根据实际情况去考虑