This article explains **PointNet**, the most basic deep learning model for **point cloud data**. Understanding PointNet requires understanding point cloud data itself, so we first explain point cloud data, then the theory behind PointNet, and finally implement it with **PyTorch**.
In addition, as a simple experiment with PointNet, we perform a **binary classification task**: sample 3D points from either a uniform distribution or a normal distribution and guess which distribution each sample was drawn from.
The PointNet paper is here. The implemented code is posted on GitHub.
Point cloud data has been attracting attention in recent years, particularly in autonomous driving, because the data obtained from the LiDAR sensors used in autonomous driving is point cloud data. It is also used in 3D surveying in the construction industry and in chemical calculations on molecules, so its range of applications is wide.
Although point cloud data has such a wide range of applications, it has **three major properties**, and the following three properties must be considered when handling point cloud data with machine learning.
<img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/302831/9e731cee-f997-6f85-b7c1-ac37a025ba5c.png" width="35%">
Invariance here means permutation invariance: the property that the output does not change even if the order of the points is permuted before being input to the machine learning model.
For example, in the case of an image, the pixels can be ordered from the upper left to the lower right. Point cloud data, however, has no natural ordering of its points, so each time it is input to a machine learning model, the points may arrive in a different order. The model is therefore required to output the same value every time (to be invariant) for point clouds input in different orders.
In other words, satisfying the following equation is the condition for invariance.
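Here $f$ is the model, $x_1, \dots, x_n$ are the input points, and $\sigma$ is an arbitrary permutation of the indices:

```math
f(x_1, x_2, \dots, x_n) = f(x_{\sigma(1)}, x_{\sigma(2)}, \dots, x_{\sigma(n)})
```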
Movement invariance is the property that the output is unchanged even when the input point cloud is translated or rotated before being fed to the machine learning model.
Actually, this property is not unique to point cloud data; images have the same property.
Translation and rotation invariance are expressed by the following equations.
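Here $t$ is an arbitrary translation vector and $R$ is an arbitrary rotation matrix:

```math
f(x_1 + t, \dots, x_n + t) = f(x_1, \dots, x_n)
```

```math
f(R x_1, \dots, R x_n) = f(x_1, \dots, x_n)
```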
You can see that the output is invariant with respect to translation and rotation of the input data.
As will be explained in detail later, PointNet approximately acquires movement invariance by applying an affine transformation (translation / rotation) to the input point cloud. However, **movement invariance is not strictly satisfied**. To satisfy it strictly, one can, for example, use the distance between two points as a feature: even if two points are translated or rotated, the distance between them is constant, so features built from pairwise distances are strictly movement invariant. This approach is used in chemistry papers such as SchNet and [HIP-NN](https://aip.scitation.org/doi/abs/10.1063/1.5011181).
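As a minimal sanity check of this distance-based idea in PyTorch (the rotation angle and translation vector below are arbitrary choices for illustration):

```python
import math

import torch

x = torch.randn(16, 3)  # 16 points in 3D

# an arbitrary rotation about the z-axis and an arbitrary translation
c, s = math.cos(0.5), math.sin(0.5)
R = torch.tensor([[c, -s, 0.], [s, c, 0.], [0., 0., 1.]])
t = torch.tensor([1., 2., 3.])
x_moved = x @ R.T + t

# the pairwise distance matrix is unchanged by the translation + rotation
print(torch.allclose(torch.cdist(x, x), torch.cdist(x_moved, x_moved), atol=1e-5))  # True
```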
Locality is the property that **points that are spatially close to each other are closely related, while points that are spatially distant are less related**. This property is not unique to point clouds; images and the like have it as well. For images, locality can be exploited by using convolution layers. (Actually, PointNet does not satisfy locality. The model that overcomes this, the biggest drawback of PointNet, is PointNet++ ("Deep Hierarchical Feature Learning on Point Sets in a Metric Space").)
First, here is the architecture of PointNet. The blue part is the Classification Network and the yellow part is the Segmentation Network; as the names suggest, each is used depending on whether the purpose is classification or segmentation. This time I will only explain the blue Classification Network, but that is where the essence of PointNet lies. <img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/302831/01a21a0e-8582-98d4-e902-fb348b9482cb.png" width="100%"> The flow of the Classification Network is as follows. First, the input point cloud is affine-transformed by the **input transform**, which approximately acquires movement invariance. Next, the affine-transformed point cloud is processed by a neural network, and the **feature transform** applies another affine transformation. The result is processed by a further neural network, and finally **Max Pooling** acquires permutation invariance and produces the output.
Max Pooling
The **most important part of PointNet is Max Pooling**. Max Pooling is a very simple function: **it outputs the largest of its input elements**. For example, if the input elements are {0, 1, 2, 3}, the output of Max Pooling is the maximum element, 3. Since the maximum does not depend on the order in which the elements are given, Max Pooling is what gives PointNet its permutation invariance.
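For instance, a quick check in PyTorch that the result of max pooling does not depend on the input order:

```python
import torch

points = torch.tensor([0., 1., 2., 3.])
shuffled = points[torch.randperm(points.size(0))]  # same elements, different order

print(torch.max(points))    # tensor(3.)
print(torch.max(shuffled))  # tensor(3.), the same output for any ordering
```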
Input transform
The input transform (and likewise the feature transform) translates and rotates the input point cloud by applying an **affine matrix** to it, thereby approximately acquiring **movement invariance**. This affine matrix is obtained as the output of **T-Net**. T-Net has a structure like a miniature PointNet and consists of a combination of neural network layers and Max Pooling. When a 3D point cloud is input to T-Net, an affine matrix is obtained as the output.
As explained earlier, T-Net is a network that takes a 3D point cloud as input and outputs an affine matrix.
As shown below, non-linear transformations by the neural network (NonLinear) are applied repeatedly, with Max Pooling sandwiched in the middle, and finally a Tensor of size (9 × 1) is output. This output is reshaped to (3 × 3) to obtain the affine matrix. The matrix product of the obtained affine matrix and the input data is then computed and passed to the next layer.
Note that the feature transform, which applies an affine transformation to the features in the middle of PointNet, is almost identical (it outputs a 64 × 64 matrix instead), so its explanation is omitted.
```python:model.py
import torch
import torch.nn as nn


class InputTNet(nn.Module):
    def __init__(self, num_points):
        super(InputTNet, self).__init__()
        self.num_points = num_points
        self.main = nn.Sequential(
            NonLinear(3, 64),
            NonLinear(64, 128),
            NonLinear(128, 1024),
            MaxPool(1024, self.num_points),
            NonLinear(1024, 512),
            NonLinear(512, 256),
            nn.Linear(256, 9)
        )

    # shape of input_data is (batchsize x num_points, channel)
    def forward(self, input_data):
        matrix = self.main(input_data).view(-1, 3, 3)
        out = torch.matmul(input_data.view(-1, self.num_points, 3), matrix)
        out = out.view(-1, 3)
        return out
```
By the way, NonLinear is a self-made module that bundles together a Dense (Linear) layer, ReLU, and Batch Normalization.
```python:model.py
class NonLinear(nn.Module):
    def __init__(self, input_channels, output_channels):
        super(NonLinear, self).__init__()
        self.input_channels = input_channels
        self.output_channels = output_channels
        self.main = nn.Sequential(
            nn.Linear(self.input_channels, self.output_channels),
            nn.ReLU(inplace=True),
            nn.BatchNorm1d(self.output_channels))

    def forward(self, input_data):
        return self.main(input_data)
```
Using these modules, the whole PointNet is implemented as follows.

```python:model.py
class PointNet(nn.Module):
    def __init__(self, num_points, num_labels):
        super(PointNet, self).__init__()
        self.num_points = num_points
        self.num_labels = num_labels
        self.main = nn.Sequential(
            InputTNet(self.num_points),
            NonLinear(3, 64),
            NonLinear(64, 64),
            # FeatureTNet is analogous to InputTNet but outputs a 64x64 matrix
            # (its definition is in the full code on GitHub)
            FeatureTNet(self.num_points),
            NonLinear(64, 64),
            NonLinear(64, 128),
            NonLinear(128, 1024),
            MaxPool(1024, self.num_points),
            NonLinear(1024, 512),
            nn.Dropout(p=0.3),
            NonLinear(512, 256),
            nn.Dropout(p=0.3),
            NonLinear(256, self.num_labels),
        )

    def forward(self, input_data):
        return self.main(input_data)
```
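The custom MaxPool module used above is not included in this excerpt (the full version is on GitHub), but a minimal sketch consistent with the flattened (batchsize × num_points, channel) shape used throughout might look like this:

```python
# A sketch, not necessarily the repository's exact implementation.
class MaxPool(nn.Module):
    def __init__(self, num_channels, num_points):
        super(MaxPool, self).__init__()
        self.num_channels = num_channels
        self.num_points = num_points

    def forward(self, input_data):
        # (batchsize * num_points, C) -> (batchsize, num_points, C)
        out = input_data.view(-1, self.num_points, self.num_channels)
        # max over the points axis -> (batchsize, C); order-independent
        out = torch.max(out, dim=1)[0]
        return out
```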
As a simple experiment, I randomly sample 3D points from a uniform distribution and from a normal distribution, and use PointNet to predict which distribution each sample was drawn from.
The sampling function is implemented as follows.
```python:sampler.py
import torch


def data_sampler(batch_size, num_points):
    half_batch_size = int(batch_size / 2)
    normal_sampled = torch.randn(half_batch_size, num_points, 3)
    uniform_sampled = torch.rand(half_batch_size, num_points, 3)
    normal_labels = torch.ones(half_batch_size)
    uniform_labels = torch.zeros(half_batch_size)

    input_data = torch.cat((normal_sampled, uniform_sampled), dim=0)
    labels = torch.cat((normal_labels, uniform_labels), dim=0)

    data_shuffle = torch.randperm(batch_size)

    return input_data[data_shuffle].view(-1, 3), labels[data_shuffle].view(-1, 1)
```
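For example, with the batch size of 64 and 16 points per sample used below, the returned shapes are:

```python
input_data, labels = data_sampler(64, 16)
print(input_data.shape)  # torch.Size([1024, 3]), i.e. (batch_size * num_points, 3)
print(labels.shape)      # torch.Size([64, 1])
```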
Using the PointNet implemented above and this sampling function, training and evaluation are performed as follows. The batch size is 64 and the number of points per sample is 16.
As for `new_param`, it sets the initial bias of the final layer of each T-Net so that the predicted affine matrix starts out as the (flattened) identity matrix. This initialization is recommended in the paper.
```python:main.py
import torch
import torch.nn as nn
import torch.optim as optim

from model import PointNet
from sampler import data_sampler

batch_size = 64
num_points = 16
num_labels = 1

pointnet = PointNet(num_points, num_labels)

# initialize the final bias of each T-Net so that the predicted
# affine matrix starts out as the (flattened) identity matrix
new_param = pointnet.state_dict()
new_param['main.0.main.6.bias'] = torch.eye(3, 3).view(-1)
new_param['main.3.main.6.bias'] = torch.eye(64, 64).view(-1)
pointnet.load_state_dict(new_param)

criterion = nn.BCELoss()
optimizer = optim.Adam(pointnet.parameters(), lr=0.001)

loss_list = []
accuracy_list = []

for iteration in range(100+1):
    pointnet.zero_grad()

    input_data, labels = data_sampler(batch_size, num_points)

    output = pointnet(input_data)
    output = nn.Sigmoid()(output)

    error = criterion(output, labels)
    error.backward()
    optimizer.step()

    if iteration % 10 == 0:
        with torch.no_grad():
            # threshold the sigmoid outputs at 0.5 to get hard predictions
            output[output > 0.5] = 1
            output[output < 0.5] = 0
            accuracy = (output == labels).sum().item() / batch_size

        loss_list.append(error.item())
        accuracy_list.append(accuracy)

        print('Iteration : {}   Loss : {}'.format(iteration, error.item()))
        print('Iteration : {}   Accuracy : {}'.format(iteration, accuracy))
```
The result is as follows:
I'm not sure whether the task is too easy or PointNet is just that good, but you can see that the data is classified well. The code shown here is only part of the whole; if you want to see all of it, please see GitHub here.