WHY
The demand to get only arbitrary information from structured data (HTML, Json, XML) is also needed for anomaly detection, chlorler, and data analysis. There are also methods of manually tuning and acquiring by supervised machine learning, but the method introduced this time is a method of focusing on the structure of structured data, calculating the difference in the structure, and acquiring only arbitrary information. I'd like to introduce_______
The following points can be achieved by this method.
--Highly versatile information extraction is possible --Since teacher data is not required, the creation cost is low. --Can also be applied to preprocessing for teacher data creation
WHAT
This section describes the specific procedure.
--Prepare the source of information (HTML, JSON, etc.) that you want to get. --Compare the page with each source to calculate the distance. --Get the ones that are close together.
This is the procedure. For distance calculation, describe from Tree Edit Distance, which is intuitively easy to understand, and then describe PQ Edit Distance, which is an approximate calculation method.
"PQ Edit Distance" is good when dealing with a large amount of data or when you want to make a comparison focusing on the overall structural difference, and "Tree Edit Distance" is good when you want to compare individual elements and when you want to handle a small amount of data.
In the author's environment, there is a difference of about 100 times in calculation time, so when dealing with a large amount of data (it depends on the environment, it cannot be said unconditionally), "PQ Edit Distance" is selected.
Tree Edit Distance
Structured data works well with tree structures. At this time, the Tree Edit Distance comes to mind immediately. Tree Edit Distance is a technique mainly used to calculate the distance between tree structures.
There are three operations required to make a tree structure of structural data and measure the difference in the structure.
add to
Delete
Replace
First, assume a simple array S and think as follows.
gamma is the replacement cost H as the first element T for the rest of the elements
Operation from above Replacement: The replacement cost between the first elements and the remaining elements are recursively calculated and the total value of the distances is returned. Delete: Recursively calculate the cost of deleting the first element and the cost of the array to be compared with the remaining elements, and return the total value of the distance. Insert: The additional cost between the first elements and the cost calculation other than the additional cost of the array to be compared with the original array are recursively returned.
It is done recursively because d is inside.
d(S_1, S_2) = min(\gamma(H(S_1),H(S_2)) + d(T(S_1),T(S_2)), \\
\gamma(H(S_1),\vec{0}) + d(T(S_1),S_2), \\
\gamma(\vec{0},H(S_2)) + d(S_1,T(S_2)),
)
Next, let's think about trees. In the case of a tree, you should think about the root and child nodes, but if there are multiple child nodes, multiple trees will be created when the root is removed, and the above description cannot be made. Think of a tree as a collection of ordered trees. If you delete the root node, you can decompose it into the forest of the child node of the rightmost tree and the forest of other nodes.
Replacement: The total value of the replacement distance cost calculation of the rightmost tree, the forest distance calculation of the child node of the rightmost tree, and the distance calculation of the other forests. Delete: The total value of the deletion distance cost calculation of the rightmost tree and the distance between the forest of the child node of the rightmost tree after deletion and the forest to be compared with other forests Addition: The total value of the additional distance cost calculation of the rightmost tree and the distance between the original forest and the forest of the child node of the rightmost tree to be compared and the other forests to be compared.
Figure
d(F_1, F_2) = min(\gamma(R(F_1),R(F_2)) + d(C(F_1),C(F_2))+ d(L(F_1),L(F_2)), \\
\gamma(R(F_1),\vec{0}) + d(L(F_1) + C(F_1),F_2), \\
\gamma(\vec{0},R(F_2)) + d(F_1, L(F_2) + C(F_2)),
)
When comparing and calculating all nodes, the calculation cost of ʻO (n ^ 2) `is incurred. Even if the distance cost is memorized and calculated by dynamic programming, the distance calculation cost will be high. Therefore, it is conceivable to use an approximate method without performing a strict distance calculation.
PQ Edit Distance
PQ Edit Distance is a method in which the calculation cost converges at ʻO (nlogn) and the calculation space is also suppressed to ʻO (n)
.
n is the number of nodes in the tree.
PQ Edit Distance creates what is called a profile from a tree structure without considering the replace, insert, and delete operations. This profile covers the patterns that the tree structure can take and is a method to measure the degree of matching of the patterns. Profile creation depends on P and Q parameters.
There are two essential parts.
--Creating a PQ profile --Distance calculation
For creating PQ profiles, pq-gram treats trees and subtrees with a special structure.
Normal tree
Trees handled by PQ (P: 2, Q: 3)
Since the values of P and Q depend on the structure of the tree, the values of P and Q are currently manually tuned to select the best one. I would like to know if this can be set automatically.
Create a profile by performing recursive processing on roots and leaves as shown below.
Definition 6: Define the distance of pq grams as follows
\Delta^{p,q}(\vec{T_1}, \vec{T_2}) = 1 - 2\frac{P^{p,q}(\vec{T_1}) \cap P^{p,q}(\vec{T_2})}{P^{p,q}(\vec{T_1}) \cup P^{p,q}(\vec{T_2})}
The reason for doubling is that the maximum match rate is half that of the union of two trees.
In the case of PQ, the difference is grasped by the distance between profiles. Although it is an approximate method, it has some advantages over Tree Edit Distance.
In the figure, in the case of Tree Edit Distance, the edit distance is 2 for both T'and T''. This is because T'has not enough "g" and "k" elements, and T "has not enough" c "and" e "elements, so the editing distance is 2 compared to T. However, T'and T'' are very different in structure. It is not very good that these distances are treated as the same. In the case of PQ, the difference is clear. This is the difference between Tree Edit Distance, which measures only individual differences, and PQ Edit Distance, which measures the overall difference.
HOW
We will introduce the algorithm using the above definition and introduce an implementation example.
Consider applying it to the following tree structures.
First, prepare two shift registers.
anc: p (for parent node) sib: q (for child nodes)
The register process is to take out the old one and insert the new one.
Example
shift((a,b,c),x) return (b,c,x)
Create a pq-gram by combining anc and sib.
Based on these, the profile of PQ gram is returned.
anc step down sib shifts from left to right
The flow of the specific algorithm is shown in the figure.
The target trees for this processing are as follows.
What you want to do in this process is to create a profile. Profile processing can be described recursively, so it will be as follows.
pg-GRAM-PROFILE(T, p, q)
P: empty relation with schema(labels)
anc: shift register of size p (filled with *)
P = PROFILE(T, p, q, P, root(T), anc)
return P
Initialize the profile first. Fill the anc with "*". Pass the tree you want to create a profile for this time, the set p and q values, the profile, the root node of the tree, and anc to the PROFILE function.
PROFILE(T, p, q, P, r, anc)
anc:= shift(anc, l(r))
sib: shift register of size q (filled with *)
if r is a leaf then
P := P U (anc ○ sib)
else
for each child c (from left to right) of r do
sib := shift(sib, l(c))
P := P U (anc ○ sib)
P := PROFILE(T, p, q, P, c, anc)
for k := 1 to q-1
sib := shift(sib, *)
P := P U (anc ○ sib)
return P
Insert the label r in anc. (R is the root node at first, and the node changes due to recursive processing) Initialize sib. If r is a leaf, create a union of profiles concatenated with anc and sib If not, processing is performed for each child node of the node. In the case of the first processing, processing is performed for each child node (a, b, c) with a as the root node.
The first anc is fixed at *, a
Insert the second into sib t *, *, a
.
The part we are paying attention to here is the following part.
Concatenate anc and sib to create *, a, *, *, a
and add it to PROFILE to create a union.
Since it is a recursive process, if there is a child node, we will focus on the following part and process it.
When processing for ʻe with anc as ʻa, a
anc:a, a
sib:*, *, e
When anc is processed as ʻa, e`
anc:a, e
sib:*, *, *
Since it finally reaches the leaves, the union is added so far. Repeat this recursively to create a PQ profile.
Measure the distance of the tree structure between this profile and the distance defined earlier.
The values of p and q are hyperparameters and depend on the tree structure. It is necessary to set the most appropriate parameters according to the structure of the tree.
--Creating a Node for tree structure data --Shift Register for editing --PROFILE for edit distance calculation
This can be achieved with a simple implementation method.
Structured data needs to be converted to a tree structure for handling. I created it simply with the code below
Initialization process to set a label on Node Processing to add a child node to Node
class NodePQ(object):
def __init__(self, label):
self.label = label
self.children = list()
def addkid(self, node, before=False):
if before:
self.children.insert(0, node)
else:
self.children.append(node)
return node
Enter the description of Shift Register
.
--Prepare a list initialized with the size of the argument * --The list is combined with a conch. --The oldest element is removed by shift processing and the element given by the argument is added.
class ShiftRegister(object):
def __init__(self, size):
self.register = list()
for i in range(size):
self.register.append("*")
def concatenate(self, reg):
temp = list(self.register)
temp.extend(reg.register)
return temp
def shift(self, el):
self.register.pop(0)
self.register.append(el)
PROFILE
This is the part that creates anc and sib and recursively creates PROFILE.
--Set ShiftRegister for ancestors in the initialization part. --Create a PROFILE using the set ancestors. --Sort the created PROFILE
class ProfilePQ(object):
def __init__(self, root, p=2, q=3):
super(ProfilePQ, self).__init__()
ancestors = ShiftRegister(p)
self.list = list()
self.profile(root, p, q, ancestors)
self.sort()
This part is an important process for PQ distance.
--Set the root label given as an argument to ancestors. --Set shift registers for q size in siblings. --When the number of children becomes 0, concatenate ancestors and siblings to create a profile. --Add an element to siblings for root children and click. This process is recursive until there are no more children. --When you go to the youngest child, fill in the siblings under it with "*"
def profile(self, root, p, q, ancestors):
ancestors.shift(root.label)
siblings = ShiftRegister(q)
if(len(root.children) == 0):
self.append(ancestors.concatenate(siblings))
else:
for child in root.children:
siblings.shift(child.label)
self.append(ancestors.concatenate(siblings))
self.profile(child, p, q, copy.deepcopy(ancestors))
for i in range(q-1):
siblings.shift("*")
self.append(ancestors.concatenate(siblings))
It will be the distance calculation part.
This ʻedit_distance calculates the distance between PROFILE and the other given as an argument. The match rate of the Node element created by ʻintersection
is calculated.
def edit_distance(self, other):
with Bench():
union = len(self) + len(other)
return 1.0 - 2.0*(self.intersection(other)/union)
def intersection(self, other):
intersect = 0.0
i = j = 0
while i < len(self) and j < len(other):
intersect += self.gram_edit_distance(self[i], other[j])
if self[i] == other[j]:
i += 1
j += 1
elif self[i] < other[j]:
i += 1
else:
j += 1
return intersect
def gram_edit_distance(self, gram1, gram2):
distance = 0.0
if gram1 == gram2:
distance = 1.0
return distance
This is the end of the implementation, but you need to test it to make sure it works properly.
The following process checks if the Profiles are equal.
--Check length and return False
if different, or False
if the first element is different
--Get elements from profile and return False
if each element does not match at least one
def checkProfileEquality(self, profile1, profile2):
if len(profile1) != len(profile2) or len(profile1[0]) != len(profile2[0]):
return False
for gram1 in profile1:
contains = False
for gram2 in profile2:
if gram1 == gram2:
contains = True
break
if contains == False:
return False
return True
Initial settings for p and q. Random value setting
def setUp(self):
self.p = 2
self.q = 3
num_random = 10
self.trees = list()
self.profiles = list()
Create a tree for testing
# Construct one-node trees
self.small_tree1 = NodePQ("a")
self.small_tree2 = NodePQ("b")
self.trees.append(self.small_tree1)
self.trees.append(self.small_tree2)
self.small_profile1 = [['*','a','*','*','*']]
self.small_profile2 = [['*','b','*','*','*']]
Create a random tree and create a profile from that tree.
# Construct a few randomly generated trees
for i in range(0, num_random):
depth = random.randint(1, 10)
width = random.randint(1, 5)
self.trees.append(randtree(depth=depth, width=width, repeat=4))
# Generate Profiles
for tree1 in self.trees:
self.profiles.append(ProfilePQ(tree1, self.p, self.q))
Create a random tree by the following process.
def randtree(depth=2, alpha='abcdefghijklmnopqrstuvwxyz', width=2, repeat=2):
labels = [''.join(x) for x in itertools.product(alpha, repeat=repeat)]
random.shuffle(labels)
labels = (x for x in labels)
root = NodePQ("root")
p = [root]
c = list()
for x in range(depth-1):
for y in p:
for z in range(random.randint(1,1+width)):
n = NodePQ(next(labels))
y.addkid(n)
c.append(n)
p = c
c = list()
return root
I am calculating if the distance calculation is correct
def testProfileCreation(self):
small_tree1_equality = self.checkProfileEquality(self.profiles[0], self.small_profile1)
small_tree2_equality = self.checkProfileEquality(self.profiles[1], self.small_profile2)
self.assertEqual(small_tree1_equality, True)
self.assertEqual(small_tree2_equality, True)
We check whether the calculation is correct even if the distance between A and B and the distance between B and A are reversed.
def testSymmetry(self):
"""x.edit_distance(y) should be the same as y.edit_distance(x)"""
for profile1 in self.profiles:
for profile2 in self.profiles:
self.assertEqual(profile1.edit_distance(profile2), profile2.edit_distance(profile1))
Check if the distance calculation is between 0 and 1
def testEditDistanceBoundaries(self):
for profile1 in self.profiles:
for profile2 in self.profiles:
self.assertTrue(profile1.edit_distance(profile2) <= 1.0 and profile1.edit_distance(profile2) >= 0
The advantage of this method is that it can be used universally for structured data. Since it is a method that can be used if there is a data set that can be converted to a tree structure such as HTML, json, XML, etc., it has a wide range of applications such as extraction of necessary data and abnormality detection, so I would appreciate it if you could try it.
Tree Edit Distance and application to natural language processing Approximate Matching of Hierarc hical Data Using pq -Grams
Recommended Posts