The link structure of the web is a large-scale network that you can easily play. Network analysis is a link structure on the web ① We will actually analyze the network acquired. This may take more than 3 hours to spare, so please try it with your heart. Source code etc. are available on Author GitHub.
networkx
from an adjacency matrixNetworkX Network analysis module. --Multiple information can be given to the node edge of the graph. --You can easily perform network analysis such as central / strongly connected component decomposition. --Beautiful visualization.
~~ I couldn't afford to put together the recipes ~~, so I will put together the websites that I referred to.
NetworkX Official Documentation Qiita: [Python] Summary of basic usage of NetworkX 2.0 PyQ: Graph Theory and NetworkX [DATUM STUDIO: I tried to visualize the national surname network with networkx](https://datumstudio.jp/blog/networkx%E3%81%A7%E5%85%A8%E5%9B%BD%E5%90% 8D% E5% AD% 97% E3% 83% 8D% E3% 83% 83% E3% 83% 88% E3% 83% AF% E3% 83% BC% E3% 82% AF% E3% 82% 92% E5% 8F% AF% E8% A6% 96% E5% 8C% 96% E3% 81% 97% E3% 81% A6% E3% 81% BF% E3% 81% 9F)
import.py
from urllib.request import urlopen
from bs4 import BeautifulSoup
import networkx as nx
from tqdm import tqdm_notebook as tqdm
import numpy as np
from scipy import stats
import pandas as pd
pd.options.display.max_colwidth = 500
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import re
Network analysis is a link structure on the web ①.
make_graph.py
G = nx.DiGraph()
for i in tqdm(range(len(adj))):
for j in range(len(adj)):
if adj[i][j]==1:
G.add_edge(i,j)
nx.DiGraph()
Create a directed graph.
G.add_edge(i,j)
Add an edge to the graph G
.
~~ There is a smarter way to add it. Yup. ~~
cal_centrality.py
degree = nx.degree_centrality(G)
print("degree completed")
closeness = nx.closeness_centrality(G)
print("closeness completed")
betweenness = nx.betweenness_centrality(G)
print("betweenness completed")
pagerank = nx.pagerank(G)
print("pagerank completed")
strong_connection = sorted(nx.strongly_connected_components(G), key=lambda x: len(x), reverse=True)
print("strongly connected components completed")
nx.degree_centrality(G)
Degree centrality.
The sum of the out-degree and in-degree of each node.
nx.degree_centrality (G)
returns the degree centrality value divided by $ n-1
nx.closeness_centrality(G)
Proximity centrality.
The reciprocal of the average shortest path length from each node to all other nodes.
nx.betweenness_centrality(G)
Mediation centrality.
Percentage of the number of times the shortest path from all other nodes to reachable nodes passes through each node.
nx.strongly_connected_components(G)
Strongly linked component decomposition.
sorted(nx.strongly_connected_components(G), key=lambda x: len(x), reverse=True)
By doing so, an array sorted in descending order of the number of vertices of the strongly connected component is returned.
DataFrame
for pandas
make_df.py
df = pd.DataFrame({"ID": range(len(url_list)), "URL": url_list)
df["category"] = df["URL"].apply(lambda s: s[s.index(":")+3:s.index("/", 8)])
df.loc[df['URL'] ==start_url, 'category'] = "start_page"
df["degree_centrality"] = df["ID"].map(degree)
df["closeness_centrality"] = df["ID"].map(closeness)
df["betweenness_centrality"] = df["ID"].map(betweenness)
df["Pagerank"] = df["ID"].map(pagerank)
df = df.assign(Pagerank_rank=len(df)-stats.mstats.rankdata(df["Pagerank"])+1)
df = df.assign(degree_centrality_rank=len(df)-stats.mstats.rankdata(df["degree_centrality"])+1)
df = df.assign(closeness_centrality_rank=len(df)-stats.mstats.rankdata(df["closeness_centrality"])+1)
df = df.assign(betweenness_centrality_rank=len(df)-stats.mstats.rankdata(df["betweenness_centrality"])+1)
df.to_csv(f"./{fname}.csv")
df["category"]
Classify each page of the network by top page.
[Network analysis is a link structure on the web ①](https://qiita.com/takahiro_hasegawa/items/2d2a979ead0522c112a2#%E3%83%AA%E3%83%B3%E3%82%AF%E6%A7%8B % E9% 80% A0% E8% A7% A3% E6% 9E% 90% E3% 81% AE% E9% 96% A2% E6% 95% B0), and in ʻurl_list, enter the URL that starts with http. Stored. Classify these URLs by top page and call them
category`.
Useful for visualization.
df.assign(Pagerank_rank=len(df)-stats.mstats.rankdata(df["###"])+1)
Add rankings of various centralities as new columns.
stats.mstats.rankdata
takes the rank of the same value as their average value.
stats.mstats.rankdata
ranks from 0 in ascending order. Take the reverse order of them and add the order from the 1st place to the DataFrame.
draw_graph.py
pos = nx.spring_layout(G)
nx.spring_layout(G)
Calculate the position to plot the graph nicely.
If you do not define pos
outside the function, a graph with a different shape will be drawn each time.
draw_graph.py
def draw_char_graph(G, pos, title, node_type):
plt.figure(figsize=(15, 15))
nx.draw_networkx_edges(G,
pos=pos,
edge_color="gray",
edge_cmap=plt.cm.Greys,
edge_vmin=-3e4,
width=0.3,
alpha=0.2,
arrows=False)
if node_type=="centrality":
node1=nx.draw_networkx_nodes(G,
pos=pos,
node_color="blue",
alpha=0.8,
node_size=[ d["closeness"]*300 for (n,d) in G.nodes(data=True)])
node2=nx.draw_networkx_nodes(G,
pos=pos,
node_color="green",
alpha=0.8,
node_size=[ d["degree"]*2000 for (n,d) in G.nodes(data=True)])
node3=nx.draw_networkx_nodes(G,
pos=pos,
node_color="yellow",
alpha=0.8,
node_size=[ d["betweenness"]*5000 for (n,d) in G.nodes(data=True)])
node4=nx.draw_networkx_nodes(G,
pos=pos,
node_color="red",
alpha=0.8,
node_size=[ d["pagerank"]*10000 for (n,d) in G.nodes(data=True)])
plt.legend([node1, node2, node3,node4], ["closeness", "degree","betweenness","pagerank"],markerscale=1,fontsize=18)
plt.title(f"centrality: {start_url}\n {nx.number_of_nodes(G)} nodes,{nx.number_of_edges(G)} edges",fontsize=25)
elif node_type=="simple":
nx.draw_networkx_nodes(G,
pos=pos,
node_color="blue",
node_size=5)
plt.title(f"{start_url}\n {nx.number_of_nodes(G)} nodes,{nx.number_of_edges(G)} edges",fontsize=25)
elif node_type=="strong_connection":
nx.draw_networkx_nodes(G,
pos=pos,
node_color="black",
node_size=10)
node1=nx.draw_networkx_nodes(G,
pos=pos,
node_color="blue",
nodelist=strong_connection[0],
node_size=30)
node2=nx.draw_networkx_nodes(G,
pos=pos,
node_color="yellow",
nodelist=strong_connection[1],
node_size=30)
node3=nx.draw_networkx_nodes(G,
pos=pos,
node_color="red",
nodelist=strong_connection[2],
node_size=30)
plt.title(f"strongly connected nodes: {title}\n {nx.number_of_nodes(G)} nodes,{nx.number_of_edges(G)} edges",fontsize=25)
plt.legend([node1, node2, node3], [f"elements: {len(strong_connection[0])} ({round(len(strong_connection[0])/nx.number_of_nodes(G)*100,2)}%)",
f"elements: {len(strong_connection[1])} ({round(len(strong_connection[1])/nx.number_of_nodes(G)*100,2)}%)",
f"elements: {len(strong_connection[2])} ({round(len(strong_connection[2])/nx.number_of_nodes(G)*100,2)}%)"],markerscale=1,fontsize=18)
plt.axis('off')
plt.savefig(f"{title}_graph_{node_type}", dpi=300)
plt.show()
draw_char_graph(G, pos, title, node_type)
A function that draws a graph.
title
The title of the graph to draw and the file name to save
node_type
node_type | Graph drawn |
---|---|
"simple" |
Normal graph |
"centrality" |
Graph with each centrality painted separately |
"strong_connection" |
Top 3 vertices of strongly connected components |
visualize.py
sns.pairplot(df.drop("ID", axis=1), hue='category')
plt.savefig(f'{fname}_pairplot.png')
Everyone loves pair plot
.
visualize.py
draw_char_graph(G, pos, fname, node_type="simple")
visualize.py
draw_char_graph(G, pos, fname, node_type="centrality")
visualize.py
draw_char_graph(G, pos, fname, node_type="strong_connection")
I'm not sure because it's too large. If it is a small network, it will be drawn neatly.
visualize.py
df_important = df[(df["Pagerank_rank"]<=10) | (df["degree_centrality_rank"]<=10) | (df["closeness_centrality_rank"]<=10) | (df["betweenness_centrality_rank"]<=10)]
df_important = df_important[["URL", "Pagerank_rank", "degree_centrality_rank", "closeness_centrality_rank", "betweenness_centrality_rank"]]
df_important.to_csv(f"./{fname}_important.csv")
## | Pagerank | Order centrality | Proximity centrality | Mediation centrality | URL |
---|---|---|---|---|---|
2167 | 1.0 | 3.0 | 1.0 | 1.0 | https://zozo.jp/ |
7043 | 2.0 | 1.0 | 2.0 | 2.0 | https://zozo.jp/brand/ |
6492 | 3.0 | 2.0 | 3.0 | 3.0 | https://zozo.jp/shop/ |
2612 | 4.0 | 4.0 | 4.0 | 4.0 | https://zozo.jp/category/ |
4618 | 5.0 | 39.5 | 14.5 | 16.5 | https://zozo.jp/category/jacket-outerwear/no-collar-jacket/ |
801 | 8.0 | 39.5 | 14.5 | 16.5 | https://zozo.jp/category/jacket-outerwear/mods-coat/ |
2803 | 8.0 | 39.5 | 14.5 | 16.5 | https://zozo.jp/category/jacket-outerwear/jacket/ |
19707 | 8.0 | 39.5 | 14.5 | 16.5 | https://zozo.jp/category/jacket-outerwear/ma1/ |
20142 | 8.0 | 39.5 | 14.5 | 16.5 | https://zozo.jp/category/jacket-outerwear/pea-coat/ |
27308 | 8.0 | 39.5 | 14.5 | 16.5 | https://zozo.jp/category/jacket-outerwear/riders-jacket/ |
15378 | 198.0 | 6.0 | 27029.0 | 5.0 | https://wear.jp/ |
3828 | 313.0 | 7.0 | 315.0 | 52.0 | https://zozo.jp/fashionnews/ |
23684 | 315.0 | 8.0 | 317.0 | 319.0 | https://zozo.jp/coordinate/ |
19611 | 316.0 | 9.0 | 318.0 | 320.0 | https://zozo.jp/welcome/ |
21862 | 330.0 | 5.0 | 27030.0 | 6.0 | https://corp.zozo.com/ir/ |
11315 | 334.0 | 10.0 | 27033.0 | 315.0 | https://corp.zozo.com/about/profile/ |
It's fun! !! !!
Recommended Posts