In this article, I will describe how to use aggregate (aggregate function in SQL) after connecting to mongodb with Python. For information on how to start mongodb and install pymongo, please see the following articles. https://qiita.com/bc_yuuuuuki/items/2b92598434f6cc320112
For the preparation data, we will use the article information of Qiita that plunged into mongoDB in the following article [Python] Qiita article information is pushed into mongoDB
How to use aggregate of mongoDB does not come to me if you are accustomed to SQL. The table below is a comparison table of SQL and aggregate.
SQL | aggregate |
---|---|
WHERE | $match |
GROUP BY | $group |
HAVING | $match |
SELECT | $project |
ORDER BY | $sort |
LIMIT | $limit |
SUM() | $sum |
COUNT() | $sum |
I am making various classes that use mongoDB using pymongo.
mongo_sample.py
from pymongo import MongoClient
class MongoSample(object):
def __init__(self, dbName, collectionName):
self.client = MongoClient()
self.db = self.client[dbName] #Set DB name
self.collection = self.db.get_collection(collectionName)
def aggregate(self, filter, **keyword):
return self.collection.aggregate(filter, keyword)
I'm just creating a function for calling aggregate.
First is the code.
aggregate_sample.py
from mongo_sample import MongoSample
import pprint
# arg1:DB Name
# arg2:Collection Name
mongo = MongoSample("db", "qiita")
#Maximum value
pipeline = [
{"$group":{ "_id":"title","page_max_view":{"$max":"$page_views_count"}}}
]
results = mongo.aggregate(pipeline)
print("------------------------Maximum value-----------------------------")
pprint.pprint(list(results))
#minimum value
pipeline = [
{"$group":{ "_id":"title","page_min_view":{"$min":"$page_views_count"}}}
]
results = mongo.aggregate(pipeline)
print("------------------------minimum value-----------------------------")
pprint.pprint(list(results))
#Average value
pipeline = [
{"$group":{ "_id":"average","page_average_view":{"$avg":"$page_views_count"}}}
]
#total
pipeline = [
{"$group":{"_id":"page_total_count","total":{"$sum":"$page_views_count"}}}
]
results = mongo.aggregate(pipeline)
print("------------------------Average value-----------------------------")
pprint.pprint(list(results))
#Count the number of occurrences for each tag
pipeline = [
{ "$unwind": "$tag_list"},
{ "$group": { "_id": "$tag_list", "count": { "$sum":1}}},
{ "$sort": {"count": -1, "_id":1}}
]
results = mongo.aggregate(pipeline)
print("------------------------Aggregate value-----------------------------")
pprint.pprint(list(results))
What you are doing is not a big deal. The maximum value, minimum value, average value, and count for each tag are acquired.
pprint needs to be installed.
pip install pprint
We will compare each with the operation method of mongoDB.
First, mongoDB commands The example is only the maximum value. If you change max to min, avg, or sum, it will be the minimum / average / maximum.
db.qiita.aggregate([{$group:{_id:"page_max_views",total:{$max:"$page_views_count"}}}])
pipeline = [
{"$group":{ "_id":"title","page_max_view":{"$max":"$page_views_count"}}}
]
Execution result
[{'_id': 'title', 'page_max_view': 2461}]
This way, the result is that "_id" is fixed to "title" and the maximum value in all records is obtained.
However, I want to display the title of the article because I want to know which article is read the most.
mongoDB command
> db.qiita.aggregate([{$project:{title:1,page_views_count:1}},{$group:{_id:"$title", total:{$max:"$page_views_count"}}},{$sort:{total:-1}}])
{"_id": "Operating mongodb with Python-Part 2: find-", "total": 2461}
{"_id": "Operating mongodb with Python-Part 3: update-", "total": 1137}
{"_id": "Operating mongodb with Python-Part 4: insert-", "total": 1102}
{"_id": "Various search conditions using pymongo (AND / OR / partial match / range search)", "total": 1019}
(Omitted)
With this command, I could see the title of the article and the number of times the page was viewed. Obviously, it doesn't make much sense because it is grouped by article name. .. If it is the maximum value that does not need grouping, it seems good to sort with find and set limit.
Try to get the maximum value for each tag1.
> db.qiita.aggregate([{$group:{_id:"$tag1", total:{$max:"$page_views_count"}}},{$sort:{total:-1}}])
{ "_id" : "Python", "total" : 2461 }
{ "_id" : "Vagrant", "total" : 946 }
{ "_id" : "Java", "total" : 617 }
{ "_id" : "Hyperledger", "total" : 598 }
{ "_id" : "solidity", "total" : 363 }
{ "_id" : "Ethereum", "total" : 347 }
{"_id": "blockchain", "total": 232}
{ "_id" : "Blockchain", "total" : 201 }
{ "_id" : "coverage", "total" : 199 }
Yes. I got it with a good feeling.
For the time being, I will change the python code as well.
# Maximum value
pipeline = [
{"$group":{ "_id":"$tag1","page_max_view":{"$max":"$page_views_count"}}}
]
results = mongo.aggregate(pipeline)
print ("------------------------ Maximum value --------------------- -------- ")
pprint.pprint(list(results))
I would like to count how many articles are written for each tag. The aggregation uses an item called tag_list, which looks like this data:
> db.qiita.find({},{_id:0,tag_list:1})
{ "tag_list" : [ "Python", "MongoDB", "Python3", "pymongo" ] }
{ "tag_list" : [ "Python", "Python3" ] }
{"tag_list": ["Python", "Python3", "Blockchain", "Blockchain", "Hyperledger-Iroha"]}
{"tag_list": ["Blockchain", "Blockchain", "Hyperledger-Iroha"]}
{ "tag_list" : [ "Blockchain", "Ethereum", "Hyperledger", "Hyperledger-sawtooth" ] }
{"tag_list": ["Blockchain", "Hyperledger", "Hyperledger-sawtooth"]}
{"tag_list": ["Java", "blockchain", "Hyperledger", "Hyperledger-Iroha"]}
{"tag_list": ["Blockchain", "Hyperledger", "Hyperledger-Iroha"]}
{"tag_list": ["Java", "Ethereum", "Blockchain", "Hyperledger", "Hyperledger-Iroha"]}
{"tag_list": ["Java", "blockchain", "Hyperledger", "Hyperledger-Iroha"]}
{ "tag_list" : [ "Hyperledger", "Hyperledger-Iroha", "Hyperledger-burrow", "Hyperledger-sawtooth", "Hyperledger-besu" ] }
{ "tag_list" : [ "Vagrant", "VirtualBox", "Hyper-V" ] }
{"tag_list": ["Java", "Ethereum", "solidity", "blockchain", "web3j"]}
{"tag_list": ["Java", "Ethereum", "blockchain", "web3j"]}
{"tag_list": ["Java", "Ethereum", "blockchain", "web3j"]}
{"tag_list": ["Java", "Ethereum", "blockchain", "web3j"]}
{"tag_list": ["Java", "Ethereum", "solidity", "blockchain", "web3j"]}
{"tag_list": ["Java", "Ethereum", "blockchain", "web3j"]}
{"tag_list": ["Java", "Ethereum", "blockchain", "web3j"]}
{"tag_list": ["Ethereum", "blockchain"]}
It's quite annoying to aggregate the data stored in this format in SQL. ..
In mongoDB, by using something called unwind, it is possible to divide and aggregate LIST format data.
> db.qiita.aggregate( { $project:{tag_list:1}}, { $unwind: "$tag_list"}, { $group: { _id: "$tag_list", count: { $sum:1}}},{ $sort: {"count": -1, "_id":1}} )
{"_id": "blockchain", "count": 16}
{ "_id" : "Ethereum", "count" : 11 }
{ "_id" : "Java", "count" : 10 }
{ "_id" : "Python", "count" : 9 }
{ "_id" : "Python3", "count" : 9 }
{ "_id" : "Hyperledger", "count" : 7 }
{ "_id" : "Hyperledger-Iroha", "count" : 7 }
{ "_id" : "MongoDB", "count" : 7 }
{ "_id" : "web3j", "count" : 7 }
{ "_id" : "solidity", "count" : 4 }
{ "_id" : "Blockchain", "count" : 3 }
{ "_id" : "Hyperledger-sawtooth", "count" : 3 }
{ "_id" : "Hyper-V", "count" : 1 }
{ "_id" : "Hyperledger-besu", "count" : 1 }
{ "_id" : "Hyperledger-burrow", "count" : 1 }
{ "_id" : "Vagrant", "count" : 1 }
{ "_id" : "VirtualBox", "count" : 1 }
{ "_id" : "coverage", "count" : 1 }
{ "_id" : "pymongo", "count" : 1 }
{ "_id" : "truffle", "count" : 1 }
The python code does not include "{" $ project ": {" tag_list ": 1}}". The result did not change with or without it. I'm not sure how to use this project.
There are many parts that are difficult to understand if you are accustomed to SQL, but it seems that flexible aggregation can be done by using unwind etc.
-Operating mongodb with Python-Part 1- -Operating mongodb with Python-Part 2: find- -Operating mongodb with Python-Part 3: update- -Operating mongodb with Python-Part 4: insert- -Operating mongodb with Python-Part 5: delete- -Various search conditions using pymongo (AND / OR / partial match / range search)
Recommended Posts