I tried comparing the accuracy of Japanese BERT and Japanese DistilBERT for sentence classification with PyTorch, plus an introduction to a BERT accuracy-improvement technique

Introduction

In a previous article I tried using Japanese BERT with huggingface/transformers, and with huggingface/transformers it is also easy to work with other pre-trained BERT models.

Looking at the list of available models, there seem to be other Japanese models as well, such as DistilBERT and ALBERT. Both are positioned as lightweight versions of BERT.

This time, I briefly introduce the DistilBERT released by Bandai Namco, which can also be used from huggingface/transformers, and compare its accuracy with regular BERT. Finally, I introduce one technique for improving accuracy when classifying sentences with BERT.

What is DistilBERT?

The following is quoted directly from the README on Bandai Namco's GitHub.

DistilBERT is a model released by Hugging Face at NeurIPS 2019, and its name is short for "Distilled BERT". Please refer to the paper for details.

DistilBERT is a small, fast, and light Transformer model based on the BERT architecture. It is said to have 40% fewer parameters than BERT-base, run 60% faster, and retain 97% of BERT's performance as measured on the GLUE benchmark.

DistilBERT is trained using knowledge distillation, a technique that compresses a large model, called the teacher, into a smaller model, called the student. By distilling BERT, you obtain a Transformer model that shares many similarities with the original BERT model while being lighter and faster to run.

In short, it is a lightweight, faster version of BERT-base (and therefore its accuracy is expected to be slightly lower). Let's actually use it and see how fast it is and how much accuracy it gives up.
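
To make the idea of knowledge distillation concrete, here is a minimal sketch of a generic Hinton-style distillation loss: a mix of soft targets from the teacher's temperature-softened output and ordinary cross-entropy on the hard labels. This is only an illustration of the concept, not the actual training objective of DistilBERT (which combines distillation, masked-language-modeling, and cosine-embedding losses).

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between the temperature-softened
    # teacher and student distributions (scaled by T^2, as is conventional)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard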

How to use DistilBERT

As described in the [official GitHub guide](https://github.com/BandaiNamcoResearchInc/DistilBERT-base-jp/blob/master/docs/GUIDE.md), you can easily load it from huggingface/transformers.

- For the tokenizer, an error will occur unless you use the cl-tohoku/ prefixed model name as shown below.

from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
distil_model = AutoModel.from_pretrained("bandainamco-mirai/distilbert-base-japanese")  
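
As a quick sanity check, the following sketch (the example sentence and variable names are mine, not from the original guide) tokenizes one Japanese sentence and runs it through DistilBERT; the first element of the output is the last hidden state with shape (batch_size, sequence_length, 768).

import torch

text = "バンダイナムコ研究所が公開した日本語DistilBERTを試してみる。"
input_ids = torch.tensor([tokenizer.encode(text, add_special_tokens=True)])
with torch.no_grad():
    outputs = distil_model(input_ids)
# (batch_size, sequence_length, 768)
print(outputs[0].shape)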

Basically, it can be used in the same way as the Japanese BERT-base introduced in the last article, but because the internal network structure differs, the fine-tuning code needs to be changed slightly.

First, let's check the structure of DistilBERT.

print(distil_model)

The output is long, but here it is in full.

DistilBERT model structure
DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(32000, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0): TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Linear(in_features=3072, out_features=768, bias=True)
        )
        (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (1): TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Linear(in_features=3072, out_features=768, bias=True)
        )
        (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (2): TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Linear(in_features=3072, out_features=768, bias=True)
        )
        (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (3): TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Linear(in_features=3072, out_features=768, bias=True)
        )
        (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (4): TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Linear(in_features=3072, out_features=768, bias=True)
        )
        (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (5): TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Linear(in_features=3072, out_features=768, bias=True)
        )
        (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
    )
  )
)

The difference from BERT-base is that BERT-base has 12 Transformer blocks, while DistilBERT has only 6. You can also see that the layer names differ slightly from those in BERT-base.
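
To put rough numbers on the size difference, you can simply count parameters. This is a small sketch of mine; bert_model below is assumed to be the cl-tohoku BERT-base model, loaded here only for the comparison.

from transformers import AutoModel

bert_model = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")

def count_params(model):
    # Total number of parameters in the model
    return sum(p.numel() for p in model.parameters())

print("BERT-base parameters :", count_params(bert_model))
print("DistilBERT parameters:", count_params(distil_model))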

Therefore, the fine-tuning code is written as follows. An example classification model declaration is also shown.

Although it does not matter much this time, it is worth checking the DistilBERT reference as well; the model's return values are also slightly different.

(As in the previous article, the task is title classification on the livedoor news corpus.)

Model declaration

import torch
from torch import nn
import torch.nn.functional as F
from transformers import *

class DistilBertClassifier(nn.Module):
  def __init__(self):
    super(DistilBertClassifier, self).__init__()
    # This is the only part that differs from the BERT-base version
    self.distil_bert = AutoModel.from_pretrained("bandainamco-mirai/distilbert-base-japanese")
    # DistilBERT's hidden size is 768; there are 9 livedoor news categories
    self.linear = nn.Linear(768, 9)
    # Weight initialization
    nn.init.normal_(self.linear.weight, std=0.02)
    nn.init.normal_(self.linear.bias, 0)

  def forward(self, input_ids):
    vec, _ = self.distil_bert(input_ids)
    # Take only the vector of the leading [CLS] token
    vec = vec[:,0,:]
    vec = vec.view(-1, 768)
    # Project to the number of classes with the fully connected layer
    out = self.linear(vec)
    return F.log_softmax(out)

# Instantiate the classification model
distil_classifier = DistilBertClassifier()

Fine tuning

# First, turn off gradient updates for all parameters
for param in distil_classifier.parameters():
    param.requires_grad = False

# Turn updates back on only for the last layer of DistilBERT
# For BERT-base this was .encoder.layer[-1];
# for DistilBERT, as confirmed in the structure above, it is .transformer.layer[-1]
for param in distil_classifier.distil_bert.transformer.layer[-1].parameters():
    param.requires_grad = True

# Also turn on the classification head
for param in distil_classifier.linear.parameters():
    param.requires_grad = True

import torch.optim as optim

# Use a small learning rate for the pre-trained part and a larger one for the final fully connected layer
# Don't forget to change the parameter path for DistilBERT
optimizer = optim.Adam([
    {'params': distil_classifier.distil_bert.transformer.layer[-1].parameters(), 'lr': 5e-5},
    {'params': distil_classifier.linear.parameters(), 'lr': 1e-4}
])

Comparison of BERT-base and DistilBERT

As last time, the task is title classification on the livedoor news corpus.

BERT-base

- The following source code is almost the same as last time.

Model definition & fine tuning

class BertClassifier(nn.Module):
  def __init__(self):
    super(BertClassifier, self).__init__()
    self.bert = BertModel.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
    # BERT's hidden size is 768; there are 9 livedoor news categories
    self.linear = nn.Linear(768, 9)
    # Weight initialization
    nn.init.normal_(self.linear.weight, std=0.02)
    nn.init.normal_(self.linear.bias, 0)

  def forward(self, input_ids):
    # Take only last_hidden_state (the pooler output is discarded)
    vec, _ = self.bert(input_ids)
    # Take only the vector of the leading [CLS] token
    vec = vec[:,0,:]
    vec = vec.view(-1, 768)
    # Project to the number of classes with the fully connected layer
    out = self.linear(vec)
    return F.log_softmax(out)

# Instantiate the classification model
bert_classifier = BertClassifier()

# Fine-tuning settings
# First, turn off gradient updates for all parameters
for param in bert_classifier.parameters():
    param.requires_grad = False

# Turn updates back on only for the last layer of BERT
for param in bert_classifier.bert.encoder.layer[-1].parameters():
    param.requires_grad = True

# Also turn on the classification head
for param in bert_classifier.linear.parameters():
    param.requires_grad = True

import torch.optim as optim

# Use a small learning rate for the pre-trained part and a larger one for the final fully connected layer
optimizer = optim.Adam([
    {'params': bert_classifier.bert.encoder.layer[-1].parameters(), 'lr': 5e-5},
    {'params': bert_classifier.linear.parameters(), 'lr': 1e-4}
])

# Loss function
loss_function = nn.NLLLoss()

Training & inference

# Measure the training time
import time

start = time.time()
# GPU setup
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Move the network to the GPU
bert_classifier.to(device)
losses = []

# Train for 10 epochs
for epoch in range(10):
  all_loss = 0
  for idx, batch in enumerate(train_iter):
    batch_loss = 0
    bert_classifier.zero_grad()
    input_ids = batch.Text[0].to(device)
    label_ids = batch.Label.to(device)
    out = bert_classifier(input_ids)
    batch_loss = loss_function(out, label_ids)
    batch_loss.backward()
    optimizer.step()
    all_loss += batch_loss.item()
  print("epoch", epoch, "\t" , "loss", all_loss)

end = time.time()

print ("time : ", end - start)
#epoch 0 	 loss 251.19750046730042
#epoch 1 	 loss 110.7038831859827
#epoch 2 	 loss 82.88570280373096
#epoch 3 	 loss 67.0771074667573
#epoch 4 	 loss 56.24497305601835
#epoch 5 	 loss 42.61423560976982
#epoch 6 	 loss 35.98485875874758
#epoch 7 	 loss 25.728398952633142
#epoch 8 	 loss 20.40780107676983
#epoch 9 	 loss 16.567239843308926
#time :  101.97362518310547

# Inference
from sklearn.metrics import classification_report

answer = []
prediction = []
with torch.no_grad():
    for batch in test_iter:

        text_tensor = batch.Text[0].to(device)
        label_tensor = batch.Label.to(device)

        score = bert_classifier(text_tensor)
        _, pred = torch.max(score, 1)

        prediction += list(pred.cpu().numpy())
        answer += list(label_tensor.cpu().numpy())
print(classification_report(prediction, answer, target_names=categories))
#                precision    recall  f1-score   support

# kaden-channel       0.94      0.92      0.93       172
#dokujo-tsushin       0.75      0.86      0.80       156
#        peachy       0.81      0.68      0.74       211
#   movie-enter       0.78      0.81      0.80       171
#          smax       0.98      0.91      0.94       176
#livedoor-homme       0.68      0.83      0.75        83
#  it-life-hack       0.79      0.94      0.86       150
#    topic-news       0.81      0.76      0.78       172
#  sports-watch       0.89      0.82      0.85       185

#      accuracy                           0.83      1476
#     macro avg       0.83      0.84      0.83      1476
#  weighted avg       0.84      0.83      0.83      1476

Training for 10 epochs took about 102 seconds, and the accuracy was 0.83 (F1 score).

DistilBERT

Training & inference

- Training & inference are performed as follows, based on the model definition and fine-tuning settings above.
- The procedure is no different from BERT-base, but it is shown here just in case.

import time

# GPU setup
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Move the network to the GPU
distil_classifier.to(device)
losses = []

start = time.time()
# Train for 10 epochs
for epoch in range(10):
  all_loss = 0
  for idx, batch in enumerate(train_iter):
    batch_loss = 0
    distil_classifier.zero_grad()
    input_ids = batch.Text[0].to(device)
    label_ids = batch.Label.to(device)
    out = distil_classifier(input_ids)
    batch_loss = loss_function(out, label_ids)
    batch_loss.backward()
    optimizer.step()
    all_loss += batch_loss.item()
  print("epoch", epoch, "\t" , "loss", all_loss)

end = time.time()
print ("time : ", end - start)
#/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:26: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
#epoch 0 	 loss 450.1027842760086
#epoch 1 	 loss 317.39041769504547
#epoch 2 	 loss 211.34138756990433
#epoch 3 	 loss 144.4813650548458
#epoch 4 	 loss 106.24609130620956
#epoch 5 	 loss 83.87273170053959
#epoch 6 	 loss 68.9661111086607
#epoch 7 	 loss 59.31868125498295
#epoch 8 	 loss 49.874382212758064
#epoch 9 	 loss 41.56027300283313
#time :  60.22182369232178


from sklearn.metrics import classification_report

answer = []
prediction = []
with torch.no_grad():
    for batch in test_iter:

        text_tensor = batch.Text[0].to(device)
        label_tensor = batch.Label.to(device)

        score = distil_classifier(text_tensor)
        _, pred = torch.max(score, 1)

        prediction += list(pred.cpu().numpy())
        answer += list(label_tensor.cpu().numpy())
print(classification_report(prediction, answer, target_names=categories))

#/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:26: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
#                precision    recall  f1-score   support

# kaden-channel       0.93      0.96      0.95       163
#dokujo-tsushin       0.88      0.88      0.88       178
#        peachy       0.86      0.75      0.80       202
#   movie-enter       0.86      0.84      0.85       183
#          smax       0.96      0.95      0.95       165
#livedoor-homme       0.67      0.71      0.69        96
#  it-life-hack       0.91      0.91      0.91       178
#    topic-news       0.80      0.86      0.83       148
#  sports-watch       0.88      0.91      0.89       163

#      accuracy                           0.87      1476
#     macro avg       0.86      0.86      0.86      1476
#  weighted avg       0.87      0.87      0.87      1476

- Training for 10 epochs took about 60 seconds, and the accuracy was 0.87 (F1 score).
- The shorter training time is welcome, but the accuracy actually improved as well.
- Originally, DistilBERT was supposed to be slightly less accurate than BERT-base, but apparently it can turn out higher.
- Perhaps the livedoor news corpus title-classification task that I always use for experiments is just not a very good benchmark...

Try to improve the accuracy of BERT-base

From here on, setting aside the comparison with DistilBERT, I will introduce one technique for improving accuracy when classifying sentences with Japanese BERT.

(Ideally you should first thoroughly consider task-specific preprocessing, but this accuracy-improvement technique does not seem to depend much on the task, so I introduce it here.)

The technique is mentioned in Section 5.3, "Feature-based Approach with BERT", of the BERT paper, and it also appears to have been used in the 1st-place solution of the Jigsaw Unintended Bias in Toxicity Classification NLP competition previously held on Kaggle.

Please refer to the following article for details of the technique.

- [Kaggle Competition Review] Google QUEST Q&A

The point is that, of BERT-base's 12 encoder layers, it is apparently better to concatenate the CLS-token vectors from the final four layers than to use only the CLS-token vector from the final layer. (I don't know why...)

The idea is simple enough, so let's try it on the livedoor news corpus title-classification task.

Implementation

Model declaration

class BertClassifierRevised(nn.Module):
  def __init__(self):
    super(BertClassifierRevised, self).__init__()
    self.bert = BertModel.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
    # BERT's hidden size is 768, but since the vectors of the last 4 layers
    # are concatenated, the classifier input is 768 * 4 dimensions
    self.linear = nn.Linear(768*4, 9)
    # Weight initialization
    nn.init.normal_(self.linear.weight, std=0.02)
    nn.init.normal_(self.linear.bias, 0)

  # Helper to extract the [CLS] token vector
  def _get_cls_vec(self, vec):
    return vec[:,0,:].view(-1, 768)

  def forward(self, input_ids):
    # The first return value, last_hidden_state, only gives the final layer,
    # so pass output_hidden_states=True to get the vectors of all hidden layers
    # and take the third return value (the states of all hidden layers)
    _, _, hidden_states = self.bert(input_ids, output_hidden_states=True)

    # Get the [CLS] token vector from each of the last 4 hidden layers
    vec1 = self._get_cls_vec(hidden_states[-1])
    vec2 = self._get_cls_vec(hidden_states[-2])
    vec3 = self._get_cls_vec(hidden_states[-3])
    vec4 = self._get_cls_vec(hidden_states[-4])

    # Concatenate the four [CLS] vectors into one vector
    vec = torch.cat([vec1, vec2, vec3, vec4], dim=1)

    # Project to the number of classes with the fully connected layer
    out = self.linear(vec)
    return F.log_softmax(out)

# Instantiate
bert_classifier_revised = BertClassifierRevised()

Fine tuning

# First, turn off gradient updates for all parameters
for param in bert_classifier_revised.parameters():
    param.requires_grad = False

# Turn updates back on for the last 4 layers of BERT
for param in bert_classifier_revised.bert.encoder.layer[-1].parameters():
    param.requires_grad = True

for param in bert_classifier_revised.bert.encoder.layer[-2].parameters():
    param.requires_grad = True

for param in bert_classifier_revised.bert.encoder.layer[-3].parameters():
    param.requires_grad = True

for param in bert_classifier_revised.bert.encoder.layer[-4].parameters():
    param.requires_grad = True

# Also turn on the classification head
for param in bert_classifier_revised.linear.parameters():
    param.requires_grad = True

import torch.optim as optim

# Use a small learning rate for the pre-trained part and a larger one for the final fully connected layer
optimizer = optim.Adam([
    {'params': bert_classifier_revised.bert.encoder.layer[-1].parameters(), 'lr': 5e-5},
    {'params': bert_classifier_revised.bert.encoder.layer[-2].parameters(), 'lr': 5e-5},
    {'params': bert_classifier_revised.bert.encoder.layer[-3].parameters(), 'lr': 5e-5},
    {'params': bert_classifier_revised.bert.encoder.layer[-4].parameters(), 'lr': 5e-5},
    {'params': bert_classifier_revised.linear.parameters(), 'lr': 1e-4}
])

# Loss function
loss_function = nn.NLLLoss()

Training & inference


import time

start = time.time()
# GPU setup
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Move the network to the GPU
bert_classifier_revised.to(device)
losses = []

# Train for 10 epochs
for epoch in range(10):
  all_loss = 0
  for idx, batch in enumerate(train_iter):
    batch_loss = 0
    bert_classifier_revised.zero_grad()
    input_ids = batch.Text[0].to(device)
    label_ids = batch.Label.to(device)
    out = bert_classifier_revised(input_ids)
    batch_loss = loss_function(out, label_ids)
    batch_loss.backward()
    optimizer.step()
    all_loss += batch_loss.item()
  print("epoch", epoch, "\t" , "loss", all_loss)

end = time.time()

print ("time : ", end - start)
#epoch 0 	 loss 196.0047192275524
#epoch 1 	 loss 75.8067753687501
#epoch 2 	 loss 42.30751228891313
#epoch 3 	 loss 16.470114511903375
#epoch 4 	 loss 7.427484432584606
#epoch 5 	 loss 2.9392087209271267
#epoch 6 	 loss 1.5984382012393326
#epoch 7 	 loss 1.7370687873335555
#epoch 8 	 loss 0.9278695838729618
#epoch 9 	 loss 1.499190401067608
#time :  149.01919651031494

# Inference
answer = []
prediction = []
with torch.no_grad():
    for batch in test_iter:

        text_tensor = batch.Text[0].to(device)
        label_tensor = batch.Label.to(device)

        score = bert_classifier_revised(text_tensor)
        _, pred = torch.max(score, 1)

        prediction += list(pred.cpu().numpy())
        answer += list(label_tensor.cpu().numpy())
print(classification_report(prediction, answer, target_names=categories))
#                precision    recall  f1-score   support

# kaden-channel       0.80      0.99      0.89       137
#dokujo-tsushin       0.89      0.86      0.88       183
#        peachy       0.78      0.82      0.80       168
#   movie-enter       0.87      0.88      0.87       176
#          smax       0.95      0.93      0.94       168
#livedoor-homme       0.72      0.83      0.77        88
#  it-life-hack       0.95      0.79      0.86       215
#    topic-news       0.83      0.84      0.83       159
#  sports-watch       0.92      0.86      0.89       182

#      accuracy                           0.86      1476
#     macro avg       0.86      0.87      0.86      1476
#  weighted avg       0.87      0.86      0.86      1476

- The loss decreases faster than with BERT-base.
- Training took about 150 seconds, a little longer than BERT-base.
- The accuracy is 0.86, improved from BERT-base's 0.83. Great.

In conclusion

- I compared BERT-base and DistilBERT. The result, with DistilBERT coming out ahead in both speed and accuracy, is a little puzzling, but I feel I now have some grasp of how to use DistilBERT.
- In the second half, I introduced using the CLS tokens of the final four layers as a way to improve BERT's accuracy. It certainly seems to contribute to better accuracy compared to plain BERT-base. From now on, when implementing a classification model with BERT, I will use the final four layers as a starting point.

end
