The other day I wrote an article about drawing a correlation diagram of follow relationships on Twitter.
[Python] I tried to visualize the follow relationship of Twitter
In the above article, I fetched the followed-account information with the Twitter API and registered it in mongoDB. The logic then read the data back from mongoDB and drew it on a graph while checking whether the accounts follow each other.
I learned that a graph database is convenient for network analysis, so this time I tried the same thing with a graph database.
python:3.7 gremlinpython:3.4.6 gremlin:3.4.6
I built the environment on Windows. The tools can be downloaded from the following URL.
https://downloads.apache.org/tinkerpop/3.4.6/
The package to download is "server". The "console" package is also nice to have, but I won't use it in this article.
After downloading, just unzip the ZIP file, place it in any folder, and run the bat file under the bin folder.
You can install gremlinpython with the pip command.
pip install gremlinpython
Now that the environment is ready, let's move on to the implementation.
Gremlin is the graph traversal language of Apache TinkerPop, and Gremlin Server lets us store data as a graph data model. Since the mongoDB data did not manage the relations between documents, we will register the relations while registering the data in Gremlin.
The mongoDB data looks like the following. Each document holds a Twitter account and the list of accounts it follows (stored under followers_info). Many documents of this shape are registered.
{
    "_id" : ObjectId("5e6c52a475646eb49cfbd62b"),
    "screen_name" : "yurinaNECOPLA",
    "followers_info" : [
        {
            "screen_name" : "Task_fuuka",
            "id" : NumberLong("784604847710605312")
        },
        (Omitted)
        {
            "screen_name" : "nemui_oyasumi_y",
            "id" : NumberLong("811491671560974336")
        }
    ]
}
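To make the shape of this data concrete, here is a minimal sketch that flattens such documents into (account, followed account) pairs. The sample documents and the function name to_follow_pairs are made up for illustration, and the ObjectId and NumberLong wrappers are replaced by plain values so that the snippet runs without mongoDB.

```python
# Hypothetical sample documents shaped like the mongoDB data above.
sample_docs = [
    {
        "screen_name": "account_a",
        "followers_info": [
            {"screen_name": "account_b", "id": 1},
            {"screen_name": "account_c", "id": 2},
        ],
    },
    {
        "screen_name": "account_b",
        "followers_info": [{"screen_name": "account_a", "id": 3}],
    },
]


def to_follow_pairs(docs):
    """Return (account, followed_account) pairs from the documents."""
    pairs = []
    for doc in docs:
        for followed in doc.get("followers_info", []):
            pairs.append((doc["screen_name"], followed["screen_name"]))
    return pairs


follow_pairs = to_follow_pairs(sample_docs)
print(follow_pairs)
```

Each pair corresponds to one 'follow' edge that will later be registered in Gremlin.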
from gremlin_python.structure.graph import Graph
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from mongo_dao import MongoDAO

# mongoDB access helper
mongo = MongoDAO("db", "followers_info")
graph = Graph()
# Create the Gremlin connection
g = graph.traversal().withRemote(DriverRemoteConnection('ws://localhost:8182/gremlin','g'))
start_name = 'yurinaNECOPLA'

def addValueEdge(parent_name, depth):
    # Stop when the requested depth has been reached
    if depth == 0:
        return False
    print(parent_name)
    result = mongo.find_one(filter={'screen_name': parent_name})
    # No document for this account in mongoDB
    if not result:
        return False
    # Add a vertex (the screen name is also used as the vertex label)
    g.addV(parent_name).property('screen_name', parent_name).iterate()
    p = g.V().has('screen_name', parent_name).toList()[0]
    for follower in result['followers_info']:
        # Register the followed account first, then connect it
        if addValueEdge(follower['screen_name'], depth-1):
            cList = g.V().has('screen_name', follower['screen_name']).toList()
            if len(cList) != 0:
                # Add an edge
                g.addE('follow').from_(p).to(cList[0]).iterate()
    return True

addValueEdge(start_name, 3)
Data and edges are registered recursively in this way.
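To see the order in which vertices and edges get registered, here is a minimal sketch of the same recursion with in-memory stand-ins. The FOLLOWS dict (in place of mongoDB), the vertices and edges lists (in place of Gremlin), and the name add_value_edge are all assumptions made for illustration.

```python
# Hypothetical stand-in for the mongoDB collection: account -> followed accounts.
FOLLOWS = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": [],
}

vertices = []  # stand-in for g.addV(...)
edges = []     # stand-in for g.addE('follow')...


def add_value_edge(parent_name, depth):
    """Same control flow as addValueEdge, using lists instead of Gremlin."""
    if depth == 0:
        return False
    if parent_name not in FOLLOWS:  # same role as find_one returning nothing
        return False
    vertices.append(parent_name)
    for followed in FOLLOWS[parent_name]:
        # The followed account is registered first, then the edge is added.
        if add_value_edge(followed, depth - 1):
            edges.append((parent_name, followed))
    return True


add_value_edge("a", 3)
print(vertices)  # ['a', 'b', 'a', 'c']
print(edges)     # [('b', 'a'), ('b', 'c'), ('a', 'b')]
```

Note that "a" is registered twice in this sketch, because nothing stops an account from being visited again when two accounts follow each other.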
Next is the drawing code. Its basic structure is the same as when the data was fetched from mongoDB.
import networkx as nx
import matplotlib.pyplot as plt
from gremlin_python.structure.graph import Graph
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

start_screen_name = 'yurinaNECOPLA'
graph = Graph()
# Create the Gremlin connection
g = graph.traversal().withRemote(DriverRemoteConnection('ws://localhost:8182/gremlin','g'))
# Create a new graph
G = nx.Graph()
# Add the starting node
G.add_node(start_screen_name)

def add_edge(screen_name, depth):
    if depth == 0:
        return
    # Vertex of this account
    name = g.V().has('screen_name', screen_name).toList()[0]
    # Adjacent vertices (in either edge direction) with their properties
    follows_list = g.V(name).both().valueMap().toList()
    for follow in follows_list:
        print(follow['screen_name'][0])
        G.add_edge(screen_name, follow['screen_name'][0])
        add_edge(follow['screen_name'][0], depth-1)

add_edge(start_screen_name, 3)
# Create the figure. figsize is the size of the figure
plt.figure(figsize=(10, 8))
# Determine the layout of the figure. The smaller the value of k, the denser the figure
pos = nx.spring_layout(G, k=0.8)
# Draw nodes and edges
# _color: specify the color
# alpha: specify the transparency
nx.draw_networkx_edges(G, pos, edge_color='y')
nx.draw_networkx_nodes(G, pos, node_color='r', alpha=0.5)
# Add node names
nx.draw_networkx_labels(G, pos, font_size=10)
# Do not display the X-axis and Y-axis
plt.axis('off')
plt.savefig("mutual_follow.png")
# Draw the diagram
plt.show()
The key points are as follows.
name = g.V().has('screen_name', screen_name).toList()[0]
follows_list = g.V(name).both().valueMap().toList()
for follow in follows_list:
    print(follow['screen_name'][0])
    G.add_edge(screen_name, follow['screen_name'][0])
    add_edge(follow['screen_name'][0], depth-1)
The first line gets the vertex of the account from Gremlin. The second line gets the vertices connected to that vertex by an edge, together with their properties. Since this data is a list of dicts whose values are themselves lists, you can get each account name by taking the items one by one and reading follow['screen_name'][0].
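The shape of the valueMap() result is easy to trip over, so here is a small sketch with hand-written data. The list below is a hypothetical result shaped like what valueMap().toList() returns (one dict per vertex, each property value wrapped in a list); it is not real output from the server.

```python
# Hypothetical result shaped like g.V(...).both().valueMap().toList():
# one dict per adjacent vertex, each property value wrapped in a list.
follows_list = [
    {"screen_name": ["account_b"]},
    {"screen_name": ["account_c"]},
]

# [0] unwraps the single value stored under the 'screen_name' property.
names = [follow["screen_name"][0] for follow in follows_list]
print(names)  # ['account_b', 'account_c']
```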
Execution result
The result is rather cluttered, but I was able to create a correlation diagram.
The above correlation diagram also shows cyclic relationships such as account A → account B, account B → account C, and account C → account A.
If you want to prevent such cycles, add a check at registration time so that data already registered in Gremlin is not added again.
def registCheck(screen_name):
    # True if a vertex with this screen_name is already registered
    check = g.V().has('screen_name', screen_name).toList()
    return len(check) != 0

def addValueEdge(parent_name, depth):
    # Stop at the depth limit or when the account is already registered
    if depth == 0 or registCheck(parent_name):
        return False
    print(parent_name)
    result = mongo.find_one(filter={'screen_name': parent_name})
    # No document for this account in mongoDB
    if not result:
        return False
    # Add a vertex
    g.addV(parent_name).property('screen_name', parent_name).iterate()
    p = g.V().has('screen_name', parent_name).toList()[0]
    for follower in result['followers_info']:
        if addValueEdge(follower['screen_name'], depth-1):
            cList = g.V().has('screen_name', follower['screen_name']).toList()
            if len(cList) != 0:
                # Add an edge
                g.addE('follow').from_(p).to(cList[0]).iterate()
    return True
By adding registCheck, which checks whether the data is already registered in Gremlin, the cyclic relationships could be excluded.
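The effect of the already-registered check can be sketched without a server. In the snippet below, the FOLLOWS dict stands in for mongoDB and the registered set stands in for the vertices stored in Gremlin; both, along with the name add_value_edge, are assumptions made for illustration.

```python
# Hypothetical stand-in for the mongoDB collection: account -> followed accounts.
FOLLOWS = {"a": ["b"], "b": ["a", "c"], "c": []}

registered = set()  # plays the role of the vertices already stored in Gremlin
edges = []          # plays the role of the 'follow' edges


def add_value_edge(parent_name, depth):
    # Skip accounts that are already registered, like the Gremlin lookup does.
    if depth == 0 or parent_name in registered:
        return False
    if parent_name not in FOLLOWS:
        return False
    registered.add(parent_name)
    for followed in FOLLOWS[parent_name]:
        if add_value_edge(followed, depth - 1):
            edges.append((parent_name, followed))
    return True


add_value_edge("a", 3)
print(sorted(registered))  # ['a', 'b', 'c']
print(edges)               # [('b', 'c'), ('a', 'b')]
```

Each account is now registered only once, but the edge from "b" back to "a" is no longer added, because the recursive call for an already registered account returns False.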
In the figure with cycles, accounts that are closely related and follow each other are drawn close together. In the figure without cycles, the accounts that the starting account follows are drawn nearby, but because the logic is built recursively, some accounts that mutually follow account A end up at a distant position. It seems that how the edges are set still needs more thought.
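One possible direction, which I have not tried against Gremlin, is to separate the two steps: first collect each account once, then add an edge for every follow relationship between collected accounts. The sketch below only illustrates the idea with in-memory data; FOLLOWS, collect_accounts, and collect_edges are hypothetical names.

```python
# Hypothetical stand-in for the mongoDB collection: account -> followed accounts.
FOLLOWS = {"a": ["b"], "b": ["a", "c"], "c": []}


def collect_accounts(start, depth):
    """Pass 1: collect accounts reachable within `depth` levels, each once."""
    seen = []

    def visit(name, remaining):
        if remaining == 0 or name in seen or name not in FOLLOWS:
            return
        seen.append(name)
        for followed in FOLLOWS[name]:
            visit(followed, remaining - 1)

    visit(start, depth)
    return seen


def collect_edges(accounts):
    """Pass 2: keep every follow edge whose two ends were both collected."""
    known = set(accounts)
    return [(src, dst) for src in accounts for dst in FOLLOWS[src] if dst in known]


accounts = collect_accounts("a", 3)
edges = collect_edges(accounts)
print(accounts)  # ['a', 'b', 'c']
print(edges)     # [('a', 'b'), ('b', 'a'), ('b', 'c')]
```

With this split, no account is duplicated and the mutual follow between "a" and "b" is kept as two edges.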
Registering relationships between data is similar to an RDB, but I got the impression that it is hard to handle intuitively with gremlinpython. Once you have read the documentation and understand the mechanism to some extent, it should be useful for network analysis.