dHash is one of several algorithms for similar-image search. For the details of the algorithm itself, see "Calculate the similarity of images using Perceptual Hash", which explains it clearly. In short, it is a perceptual hash: visually similar images produce hash values that are close to each other.
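As a rough sketch of the idea (my own toy illustration, not code from the referenced article; the actual hashing below uses the ImageHash package, whose bit ordering may differ): dHash shrinks the image to a tiny grayscale grid, compares each pixel with its right-hand neighbor, and packs the comparison results into a 64-bit hash, so visually similar images yield hashes that differ in only a few bits.

from PIL import Image

def dhash_sketch(image, hash_size=8):
    # Shrink to a (hash_size+1) x hash_size grayscale grid so that each row
    # yields hash_size left/right comparisons.
    small = image.convert('L').resize((hash_size + 1, hash_size))
    pixels = list(small.getdata())
    bits = 0
    for row in range(hash_size):
        for col in range(hash_size):
            left = pixels[row * (hash_size + 1) + col]
            right = pixels[row * (hash_size + 1) + col + 1]
            # One bit per comparison: does brightness increase to the right?
            bits = (bits << 1) | (1 if right > left else 0)
    # Return as a 16-digit hex string (64 bits / 4 bits per hex digit)
    return format(bits, '016x')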
So, using dHash, I tried to see whether I could identify the position on the course from a single scene of an onboard play video of the racing game Assetto Corsa.
Specifically, the flow is as follows.
① First, extract all frames from a play video (onboard video) of a full lap of the course and save them as PNG images.
② Calculate the dHash value for every saved frame image. Since the telemetry data recorded while shooting the play video makes it possible to identify the vehicle's position at any given time, save each hash value together with its position information in a search CSV file.
③ Separately, pick one scene from another play video whose position on the course you want to identify.
④ Calculate the dHash value for that one-scene image.
⑤ Search the CSV file for the image whose hash value is closest to the one calculated in ④. Since the image found by the search is linked to position information in ②, treat that position as the position on the course of the selected scene.
Even when two scenes of a racing game are taken at nearly the same point on the course, there will be slight misalignment because the driving line differs from play to play. The point of this experiment is whether dHash can absorb such differences and still find the similar image.
I implemented the above process in Python.
To extract the images of all frames from the play video file (an mp4 file), I used OpenCV, referring to an existing article for the approach.
Each frame image is saved with the file name "(frame number).png".
01_extract_frames.py
import cv2
import sys

def extract_frame(video_file, save_dir):
    capture = cv2.VideoCapture(video_file)
    frame_no = 0
    while True:
        # read() returns (success flag, frame); the flag becomes False at end of video
        retval, frame = capture.read()
        if retval:
            # Save each frame as a zero-padded PNG, e.g. 000000.png
            cv2.imwrite(r'{}\{:06d}.png'.format(save_dir, frame_no), frame)
            frame_no = frame_no + 1
        else:
            break
    capture.release()

if __name__ == '__main__':
    video_file = sys.argv[1]
    save_dir = sys.argv[2]
    extract_frame(video_file, save_dir)
This script is also used in ③.
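For reference, a hypothetical invocation might look like this (the video file name and output directory are my own examples):

> python.exe 01_extract_frames.py play_video.mp4 frames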
For the extracted frame images corresponding to a certain period (one specific lap), the dHash value is calculated using the dhash function of the ImageHash package. Each hash is then linked to the telemetry data recorded by an in-game app while shooting (with only the relevant part extracted in advance) and written to the search CSV file.
02_calc_dhash.py
from PIL import Image
import imagehash
import csv
import sys

frame_width = 1280
frame_height = 720
trim_lr = 140   # pixels trimmed from the left and right edges
trim_tb = 100   # pixels trimmed from the top and bottom edges
dhash_size = 8  # 8x8 comparisons -> 64-bit hash

def calc_dhash(frame_dir, frame_no_from, frame_no_to, telemetry_file, output_file):
    # Read the telemetry data file (tab-separated)
    position_data = [row for row in csv.reader(open(telemetry_file), delimiter='\t')]
    writer = csv.writer(open(output_file, mode='w', newline=''))
    for i in range(frame_no_from, frame_no_to + 1):
        # Read the extracted frame image and crop it
        # (to remove the time display etc. at the edges of the image)
        frame = Image.open(r'{}\{:06d}.png'.format(frame_dir, i))
        trimmed_frame = frame.crop((
            trim_lr,
            trim_tb,
            frame_width - trim_lr,
            frame_height - trim_tb))
        # Calculate the dHash value
        dhash_value = str(imagehash.dhash(trimmed_frame, hash_size=dhash_size))
        # Link with the telemetry data: both frames and telemetry rows are
        # recorded at regular intervals, so map the frame index to a telemetry
        # row proportionally.
        position_no = round((len(position_data) - 1) * (i - frame_no_from) / (frame_no_to - frame_no_from))
        writer.writerow([
            i,
            dhash_value,
            position_data[position_no][9],   # position coordinate [m]
            position_data[position_no][10]   # position coordinate [m]
        ])

if __name__ == '__main__':
    frame_dir = sys.argv[1]
    frame_no_from = int(sys.argv[2])
    frame_no_to = int(sys.argv[3])
    telemetry_file = sys.argv[4]
    output_file = sys.argv[5]
    calc_dhash(frame_dir, frame_no_from, frame_no_to, telemetry_file, output_file)
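For reference, a hypothetical invocation (the directory, frame range, and file names here are my own examples) might look like this:

> python.exe 02_calc_dhash.py frames 731 14000 telemetry.tsv dhash_TOYOTA_86.csv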
This script outputs lines of (frame number), (hash value), and (2D position coordinates in meters) to the CSV file, like this:
731,070b126ee741c080,-520.11,139.89
732,070b126ee7c1c080,-520.47,139.90
733,070b126ee7c1c480,-520.84,139.92
This script is also used in ④.
This script searches the search CSV file produced by 02_calc_dhash.py for the image whose hash value is closest to a specified hash value.
The Hamming distance is used as the measure of closeness between hash values. I use popcount from the gmpy2 package to compute it, since it is reportedly very fast.
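If gmpy2 is not available, the same Hamming distance can be computed in pure Python (a fallback of my own, slower than gmpy2.popcount but giving identical results):

def hamming_distance(hash_a, hash_b):
    # XOR leaves a 1 bit at every position where the two hashes differ;
    # counting those bits gives the Hamming distance.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count('1')

# e.g. hamming_distance('070b126ee741c080', '070b126ee7c1c080') -> 1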
03_match_frame.py
import csv
import gmpy2
import sys

def match_frame(base_file, search_hash):
    base_data = [row for row in csv.reader(open(base_file))]
    min_distance = 64  # maximum possible distance for a 64-bit hash
    results = []
    for base_line in base_data:
        # Hamming distance = number of 1 bits in the XOR of the two hashes
        distance = gmpy2.popcount(
            int(base_line[1], 16) ^
            int(search_hash, 16)
        )
        if distance < min_distance:
            min_distance = distance
            results = [base_line]
        elif distance == min_distance:
            # Collect ties at the current minimum distance
            results.append(base_line)
    print("Distance = {}".format(min_distance))
    for result_line in results:
        print(result_line)

if __name__ == '__main__':
    base_file = sys.argv[1]
    search_hash = sys.argv[2]
    match_frame(base_file, search_hash)
As shown below, the position information linked to the image with the closest hash value is printed. (If several images tie at the same distance, all of them are printed.)
> python.exe 03_match_frame.py dhash_TOYOTA_86.csv cdc9cebc688f3f47
Distance = 8
['13330', 'c9cb4cb8688f3f7f', '-1415.73', '-58.39']
['13331', 'c9eb4cbc688f3f7f', '-1415.39', '-58.44']
For the experiment, I extracted all frames from a play video of a lap of the Nürburgring Nordschleife (total length 20.81 km) driven in a TOYOTA 86 GT and calculated their hash values, then searched them for the images closest to several scenes from another play video driven in a BMW Z4.
First, let's check whether images of three famous corners can be found correctly.
| | Image used for search | Hit image |
| --- | --- | --- |
| Image | (image) | (image) |
| Hash value | ced06061edcf9f2d | 0c90e064ed8f1f3d |
| Location information | (-2388.29, 69.74) | (-2416.50, 66.67) |
Hash value distance = 10, location deviation = 28.4 m
With dHash, a hash distance of 10 or less is commonly said to indicate the same image, so this result only just qualifies. Looking at the images, the shape of the corner is similar, but the positions of the surrounding trees differ slightly, making it hard to judge whether they should count as similar.
| | Image used for search | Hit image |
| --- | --- | --- |
| Image | (image) | (image) |
| Hash value | 7c5450640c73198c | 7c7c50642d361b0a |
| Location information | (317.58, -121.52) | (316.18, -121.45) |
Hash value distance = 11, location deviation = 1.4 m
Visually these are very close. However, the hash value distance, 11, is larger than in the previous case.
| | Image used for search | Hit image |
| --- | --- | --- |
| Image | (image) | (image) |
| Hash value | 665d1d056078cde6 | 665c1d856050da8d |
| Location information | (2071.48, 77.01) | (2071.23, 77.12) |
Hash value distance = 13, location deviation = 0.27 m
The vehicle positions are very close and the images look quite similar, but on close inspection the orientation differs slightly. The hash value distance of 13 is fairly large.
So, across the three famous corners, the search was able to identify the closest position in each case.
I also tried it on 10 randomly selected scenes.
The one case where only a wrong position was hit is shown below.
| | Image used for search | Hit image |
| --- | --- | --- |
| Image | (image) | (image) |
| Hash value | b7b630b24c1e1f1e | b7b43839481e3f1f |
| Location information | (1439.61, -18.69) | (2059.41, 37.44) |
Hash value distance = 9, location deviation = 622.34 m
The way the trees grow on the left and right and the course ahead (a left curve versus a straight) look quite different, yet the hash value distance is a relatively close 9.
For reference, the image I wanted the search to hit is the following.
| Image used for search | Image I wanted to hit |
| --- | --- |
| (image) | (image) |
Hash value distance = 13
At first glance the images look similar, but on closer inspection they are slightly misaligned, which I think is why the distance ends up relatively large.
In this article, I tried similar-image search with dHash on racing game scenes.
The accuracy was relatively good for distinctive scenes such as famous corners, but for randomly selected scenes the hit rate was 70% (please forgive the small sample of only 10 cases).
I don't have a rigorous interpretation, only personal impressions.
There are several measures one could consider to improve the accuracy.