[SWIFT] Copying an object with ARKit + CoreML + LiDAR

In the previous article "Geometry of scribbles with ARKit + Vision + iOS14" I copied scribbles, so this time I looked into the processing needed to copy a physical object. In outline, Vision + CoreML (DeepLabV3) is used to cut the object out of the camera image, and ARKit's depthMap (iOS 14 and later) is used to give it a three-dimensional shape.


Completed image
demo.png demo.gif

Since the depth information is captured only from the single frame at the moment the copy button is tapped, the result is a signboard-like 3D shape rather than a full model. ↓ is the result of copying a car and displaying it in an SCNView. demo.gif This article mainly describes some of the difficulties encountered when using depthMap.

Before getting into the main subject, a few notes on the depth information that can be obtained from ARKit and on its confidence values.

depthMap

On LiDAR-equipped devices such as the iPhone 12 Pro, depth information can be acquired in real time from the rear camera. demo.gif In this example the depth of the middle part of the screen is visualized in grayscale, but the actual depth values are distances from the camera in meters (checked on iPhone 12 Pro, the type is Float32; here 0 to 5 m is mapped to 0 to 255 gradations). The depth in this gif looks slightly delayed, but that is because it was captured with iOS screen recording; without recording, the delay is barely noticeable. How to acquire and process the depth information is described later.
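For reference, a minimal sketch of producing such a grayscale array (this helper is my own illustration and not part of the sample; the 0 to 5 m range matches the description above):

import ARKit

// A minimal sketch (assumed helper, not in the sample): read the Float32 depth values
// from sceneDepth.depthMap and map 0-5 m to 0-255 grayscale values.
func depthToGrayscale(_ frame: ARFrame, maxDepth: Float = 5.0) -> [UInt8] {
    guard let depthMap = frame.sceneDepth?.depthMap else { return [] }
    CVPixelBufferLockBaseAddress(depthMap, .readOnly)
    defer { CVPixelBufferUnlockBaseAddress(depthMap, .readOnly) }
    let width = CVPixelBufferGetWidth(depthMap)        // 256 on iPhone 12 Pro
    let height = CVPixelBufferGetHeight(depthMap)      // 192 on iPhone 12 Pro
    let bytesPerRow = CVPixelBufferGetBytesPerRow(depthMap)
    guard let base = CVPixelBufferGetBaseAddress(depthMap) else { return [] }
    var gray: [UInt8] = []
    gray.reserveCapacity(width * height)
    for y in 0 ..< height {
        let row = (base + y * bytesPerRow).assumingMemoryBound(to: Float32.self)
        for x in 0 ..< width {
            let meters = min(max(row[x], 0.0), maxDepth)   // distance from the camera in meters
            gray.append(UInt8(meters / maxDepth * 255.0))
        }
    }
    return gray
}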

confidenceMap

From ARKit, confidence information can be obtained together with the depth information, at the same resolution as the depth map. The confidence is defined by ARConfidenceLevel and has three levels: low, medium, and high, where high is the most reliable. This is the confidence information rendered as low = black, medium = gray, high = white. demo.gif You can see that the confidence of the contour parts of objects is relatively poor, and that it degrades when small objects such as fingers are lined up along the camera axis. Presumably the surface facing the camera gets high confidence and the confidence decreases toward the sides; since there is only one LiDAR sensor, it makes sense that the accuracy of surfaces the sensor can barely see is lower.
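The confidenceMap stores the raw value of ARConfidenceLevel as one UInt8 per pixel, so the read-out loop is the same as for the depth map above; only the per-value mapping changes. A minimal sketch (my own illustration, not part of the sample):

import ARKit

// Map one confidence value to the low = black / medium = gray / high = white visualization.
func confidenceToGray(_ value: UInt8) -> UInt8 {
    switch ARConfidenceLevel(rawValue: Int(value)) {
    case .high?:   return 255   // most reliable
    case .medium?: return 128
    default:       return 0     // low or unknown
    }
}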

[Main subject] How to copy an object

In this article, copying an object is realized by the following procedure.

① Cut out the central part of the captured image with 513 x 513 pixels
② Cut out the depth information at the position / aspect ratio of ①
③ Cut out the depth-confidence information in the same way as ②
④ Run segmentation on ① using Vision + CoreML
⑤ Acquire the coordinates that belong to the object recognized in ④ and have high confidence in ③
⑥ Convert the depth information (2D) of ⑤ to 3D coordinates
⑦ Turn ⑥ into a 3D model and add it to the scene

Step ② was the most troublesome: the conversion needed to extract the depth information to match the position and size cut out in ①.

The confidence obtained in ③ is used in the judgment of ⑤. This is because the depth values at the boundary between an object and the background, or between two objects, are often suspicious (extremely far away), which would break the 3D model, so those points are excluded.

The procedure is explained below.

① Cut out the central part of the captured image with 513 x 513 pixels

Vision + CoreML is used to cut out the object, and the Core ML model is the DeepLabV3 model published by Apple. Since the image size this CoreML model can handle is 513 x 513, the central part of the captured image is cropped to this size. The captured image passed from ARKit does not take the device orientation or screen size into account; for how to crop to the displayed portion, please refer to the previously written "ARKit + Vision + iOS14 Geometry ① [Outline Detection]".
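In this sample the crop is done with the cropCenterSquareImage extension of ARFrame listed in the full source at the end of this article. A usage sketch (assuming portrait orientation, and that scnView and the current ARFrame are in scope):

// Crop the center of the captured image to 513x513 for DeepLabV3
// (cropCenterSquareImage is the ARFrame extension shown in the full source below).
let aspectRatio = scnView.bounds.height / scnView.bounds.width
let inputImage = frame.cropCenterSquareImage(fullWidthScale: 513.0,
                                             aspectRatio: aspectRatio,
                                             orientation: .portrait)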

② Cut out the depth information at the position / aspect ratio of ①

Depth information is provided as a **Float32 array**, so CIImage/CGImage cannot be used to crop or process it. When I started on this sample I thought it could simply be treated as an image, but since the values are distances in meters that does not work. I also tried splitting each Float32 into 8-bit units and assigning them to RGBA, but during image processing (color space handling, edge processing, and so on) the values may not convert back to the original Float32, so I decided to handle the raw values properly.

The depth information depthMap can be obtained via sceneDepth of ARFrame. In this sample it is obtained in an extension of ARFrame, so it looks like this:

guard let pixelBuffer = self.sceneDepth?.depthMap else { return ([], 0) }

When checked on iPhone 12 Pro + iOS 14.2, the size of the depth information obtained from depthMap is 256 x 192. On the other hand, the resolution of the image obtained from capturedImage of ARFrame is 1920 x 1440. Both are 1.333:1, so it should be possible to crop the depth map using displayTransform of ARFrame, just as in ①.

var displayTransform = self.displayTransform(for: .portrait, viewportSize: viewPortSize)

The first question is what to pass as viewportSize when getting displayTransform. The API documentation says "The size, in points, of the view intended for rendering the camera image.", so it should be the drawing size, but the depth information has a different size from the screen. The documentation also states "The affine transform does not scale to the viewport's pixel size.", so perhaps the pixel size does not need to be given at all? The result of actually checking (on iPhone 12 Pro + iOS 14.2) is as follows.

Giving the aspect ratio: CGSize(width: 1.0, height: 844/390)

▿ CGAffineTransform

Giving the screen size (pt): CGSize(width: 390, height: 844)

▿ CGAffineTransform

The two transforms match, so I decided to pass the aspect ratio of the screen. From here, the depthMap data for the on-screen range that CoreML can analyze is extracted using this affine transform.

displayTransform is a CGAffineTransform, and its contents form the following matrix.

\begin{bmatrix}
a & b & 0 \\
c & d & 0 \\
t_{x} & t_{y} & 1 \\
\end{bmatrix}

In the case of portrait, both the X axis and Y axis of the captured image (and of the depth information) are inverted, so that flip is added as well.

\begin{bmatrix}
-1 & 0 & 0 \\
0 & -1 & 0 \\
1 & 1 & 1 \\
\end{bmatrix}
\begin{bmatrix}
a & b & 0 \\
c & d & 0 \\
t_{x} & t_{y} & 1 \\
\end{bmatrix}
=
\begin{bmatrix}
-a & -b & 0 \\
-c & -d & 0 \\
a+c+t_{x} & b+d+t_{y} & 1 \\
\end{bmatrix}
\begin{bmatrix}
x' & y' & 1 \\
\end{bmatrix}
=
\begin{bmatrix}
x & y & 1 \\
\end{bmatrix}
×
\begin{bmatrix}
-a & -b & 0 \\
-c & -d & 0 \\
a+c+t_{x} & b+d+t_{y} & 1 \\
\end{bmatrix}
x' = -ax - cy + a + c + t_{x} \\
   = -a(x-1) - c(y-1) + t_{x} \\
y' = -bx - dy + b + d + t_{y} \\
   = -b(x-1) - d(y-1) + t_{y} \\

Applying the values actually obtained:

x' = 1.62(y-1)+ 1.31 \\
y' = -(x-1) \\

The following can be seen from this result.

- The x axis and y axis are swapped.
- The converted x takes values between -0.31 and 1.31. In other words, only the central 1/1.62 portion of the pre-conversion y range is displayed (as far as I could confirm, the 0 to 1 range is what appears on screen).
- The converted y takes values between 1.0 and 0.0. That is, the whole x direction before conversion is displayed.

Since both sides of the converted x axis need to be cut off, find that size. Because we know the central part is being cut out, it can be calculated from c of the transform and the height of the depth information.

let sideCutoff = Int((1.0 - (1.0 / displayTransform.c)) / 2.0 * CGFloat(pixelBuffer.height))

This evaluates to 36: (1.0 − 1/1.62) / 2 × 192 ≈ 36. The acquired depth information is **256 x 192**, so cutting 36 off both sides gives **120 x 256** after conversion (vertical and horizontal are swapped).

After that, since the data passed to CoreML is square, a square region is cut out to match, which results in a size of 120 x 120.

func cropPortraitCenterData<T>(sideCutoff: Int) -> ([T], Int) {
    CVPixelBufferLockBaseAddress(self, CVPixelBufferLockFlags(rawValue: CVOptionFlags(0)))
    defer { CVPixelBufferUnlockBaseAddress(self, CVPixelBufferLockFlags(rawValue: CVOptionFlags(0))) }
    guard let baseAddress = CVPixelBufferGetBaseAddress(self) else { return ([], 0) }
    let pointer = UnsafeMutableBufferPointer<T>(start: baseAddress.assumingMemoryBound(to: T.self),
                                                count: width * height)
    var dataArray: [T] = []
    //Calculate the cropped width on the screen. The cut-off amount is removed from both ends.
    //* Confusingly, the data obtained from ARKit while in portrait is landscape, so this is calculated from height.
    let size = height - sideCutoff * 2
    //Walk the vertically centered part of the screen. The acquisition order is reversed top to bottom.
    for x in (Int((width / 2) - (size / 2)) ..< Int((width / 2) + (size / 2))).reversed() {
        //Walk the horizontally centered part of the screen. The acquisition order is reversed left to right.
        for y in (sideCutoff ..< (height - sideCutoff)).reversed() {
            let index = y * width + x
            dataArray.append(pointer[index])
        }
    }
    return (dataArray, size)
}

The cropping process was written as an extension of CVPixelBuffer. Since the depth-confidence data described later is stored in its CVPixelBuffer as UInt8, the element type is handled with generics. I could not find anything that applies an affine transform to the contents of such an array in one shot, so I implemented the x/y axis swap and the vertical / horizontal reversal myself (portrait only). Once the data can be converted this far, it is easy to composite it with the camera capture, as in the gif videos of depthMap and confidenceMap at the beginning of this article (those were converted to an MTLTexture with Metal and then recorded).
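Because the element type is generic, the same extension handles both maps; the type is inferred from the declared result. A usage sketch (depthBuffer and confidenceBuffer stand for the depthMap and confidenceMap CVPixelBuffers, and 36 is the sideCutoff computed above):

// Sketch: the element type is inferred from the declared result type, so the same
// generic crop works for both depth (Float32) and confidence (UInt8) buffers.
let (depthValues, depthSide): ([Float32], Int) = depthBuffer.cropPortraitCenterData(sideCutoff: 36)
let (confidenceValues, _): ([UInt8], Int) = confidenceBuffer.cropPortraitCenterData(sideCutoff: 36)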

③ Cut out the depth-confidence information in the same way as ②

The processing is the same as ②. The confidenceMap can be obtained via sceneDepth of ARFrame.

guard let pixelBuffer = self.sceneDepth?.confidenceMap else { return ([], 0) }

Note that while the depth information is Float32, the confidence information is UInt8.

④ Run segmentation on ① using Vision + CoreML

Object segmentation is performed using Vision + CoreML (DeepLabV3). The method is the same as in the reference article "Simple Semantic Image Segmentation in an iOS Application — DeepLabV3 Implementation".

The types of objects that DeepLabV3 can segment can be seen in the model's Metadata in Xcode. xcode-2.png
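As far as I can tell these are the standard 21 PASCAL VOC classes; they are listed below for reference (please double-check against the metadata in Xcode).

// The 21 labels reported by DeepLabV3 (the standard PASCAL VOC classes);
// the index of each entry is the value written into the segmentation map.
let deepLabV3Labels = ["background", "aeroplane", "bicycle", "bird", "boat",
                       "bottle", "bus", "car", "cat", "chair", "cow",
                       "diningtable", "dog", "horse", "motorbike", "person",
                       "pottedplant", "sheep", "sofa", "train", "tvmonitor"]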

When a hand is actually recognized, the label value 15 (the position of "person" in the labels) is obtained. demo.gif This is an example of pointing the camera at a hand and overlaying the segmentation result on the captured image. You can see that the recognition of the hand lags slightly. This time I wanted to recognize not only people but also cars and flower pots, so I used DeepLabV3; if you only need to recognize people, ARKit's People Occlusion is faster. Reference: Everyone is a Super Saiyan with ARKit + Metal

The point is how to access the MLMultiArray that stores the segmentation result. MLMultiArray has a property called dataPointer, which can be accessed as an array through assumingMemoryBound(to:).

guard let observations = request.results as? [VNCoreMLFeatureValueObservation],
      let segmentationmap = observations.first?.featureValue.multiArrayValue else { return }
//The segmentation result is an array of Int32
let labels = segmentationmap.dataPointer.assumingMemoryBound(to: Int32.self)
let centerLabel = labels[centerIndex]

The result is accessed as an array of Int32 because that is how the output is defined when you check DeepLabV3.mlmodel in Xcode, and the values do indeed come out correctly as Int32. スクリーンショット_2020-12-27_7_52_55.png

⑤ Acquire the coordinates that belong to the object recognized in ④ and have high confidence in ③

Decide whether to draw according to the following rules.

- Of the segmentation results, only the object that matches the label at the center of the screen is displayed.
- Only the parts whose depth information has high confidence are displayed.

The segmentation result is judged by the following isInSegment part.

let depthDeeplabV3ScaleFactor = Float(self.detectSize) / Float(self.depthSize)    //Ratio of detection resolution to depth resolution
let isInSegment: (Int, Int) -> Bool = { (x, y) in
    guard x < self.depthSize && y < self.depthSize else { return false }
    let segmentX = Int(Float(x) * depthDeeplabV3ScaleFactor + depthDeeplabV3ScaleFactor / 2)
    let segmentY = Int(Float(y) * depthDeeplabV3ScaleFactor + depthDeeplabV3ScaleFactor / 2)
    let segmentIndex = (segmentedImageSize - segmentY - 1) * segmentedImageSize + segmentX
    //Model if it matches the label in the center
    return labels[segmentIndex] == centerLabel
}

The arguments are coordinates in the depth information. When this sample runs on iPhone 12 Pro + iOS 14.2 the depth information is 120 x 120, so each depth coordinate is judged against the corresponding position in the segmentation result (513 x 513). For example, with a scale factor of 513 / 120 ≈ 4.28, depth coordinate x = 10 is checked against segmentation column Int(10 × 4.28 + 4.28 / 2) = 44.

The confidence is judged by the following isConfidentDepth part.

//Only reliable depth information is 3D modeled
let isConfidentDepth: (Int, Int) -> Bool = { (x, y) in
    guard x < self.depthSize && y < self.depthSize else { return false }
    return self.depthArray[y * self.depthSize + x] >= 0.0
}

Before this judgment runs, depth values with low confidence have already been set to -1 in depthArray, so anything 0.0 or greater is judged valid.

By the way, this is where -1 is set.

guard depthArray.count == depthConfidenceArray.count else { return }
self.depthArray = depthConfidenceArray.enumerated().map {
    //If the confidence is less than high, rewrite the depth to -1
    return $0.element >= UInt8(ARConfidenceLevel.high.rawValue) ? depthArray[$0.offset] : -1
}

Each confidence value is checked and the corresponding depth value is rewritten.

⑥ Convert the depth information (2D) of ⑤ to 3D coordinates

self.cameraIntrinsicsInversed = frame.camera.intrinsics.inverse
//(Omitted)
let depth = self.depthArray[y * self.depthSize + x]
let x_px = Float(x) * depthScreenScaleFactor
let y_px = Float(y) * depthScreenScaleFactor
//Convert 2D depth information to 3D
let localPoint = cameraIntrinsicsInversed * simd_float3(x_px, y_px, 1) * depth

The intrinsics of ARCamera contains the matrix that projects 3D camera-space coordinates to 2D image coordinates. By using its inverse, the 2D depth information can be expanded back to 3D. The code for this part is based on Apple's LiDAR sample "Visualizing a Point Cloud Using Scene Depth".

For an intuition about the contents of intrinsics, the article "I tried to superimpose the evaluation value on a real Othello board with ARKit" was helpful.
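Written out, with f_x and f_y the focal lengths in pixels and o_x, o_y the principal point, the intrinsics matrix and the unprojection performed by the code above are:

K = \begin{bmatrix}
f_x & 0 & o_x \\
0 & f_y & o_y \\
0 & 0 & 1 \\
\end{bmatrix}
,\qquad
K^{-1}
\begin{bmatrix}
x_{px} \\
y_{px} \\
1 \\
\end{bmatrix}
\times depth
=
\begin{bmatrix}
(x_{px} - o_x) \cdot depth / f_x \\
(y_{px} - o_y) \cdot depth / f_y \\
depth \\
\end{bmatrix}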

Now that we have the information to display in 3D (the 3D coordinates of the points that belong to the segmented object and have highly reliable depth), we turn it into a model.

⑦ Turn ⑥ into a 3D model and add it to the scene

The 3D model is built from the layout of the depth array. At each position in the depth map, polygons are created by the rule that the current point, the point to its right, and the point directly below form one triangle, and the current point, the point directly below, and the point below-left form another.

//Triangle (downward)
if isInSegment(x, y), isInSegment(x, y + 1), isInSegment(x + 1, y),
   isConfidentDepth(x, y), isConfidentDepth(x, y + 1), isConfidentDepth(x + 1, y) {
    //Add polygon index within segment and if depth information is reliable
    indices.append(Int32(y * self.depthSize + x))
    indices.append(Int32(y * self.depthSize + x + 1))
    indices.append(Int32((y + 1) * self.depthSize + x))
    
    if localPoint.y > yMax { yMax = localPoint.y }
    if localPoint.y < yMin { yMin = localPoint.y }
}
//Triangle (upward)
if isInSegment(x, y), isInSegment(x - 1, y + 1), isInSegment(x, y + 1),
   isConfidentDepth(x, y), isConfidentDepth(x - 1, y + 1), isConfidentDepth(x, y + 1){
    //Add polygon index within segment and if depth information is reliable
    indices.append(Int32(y * self.depthSize + x))
    indices.append(Int32((y + 1) * self.depthSize + x))
    indices.append(Int32((y + 1) * self.depthSize + x - 1))
    
    if localPoint.y > yMax { yMax = localPoint.y }
    if localPoint.y < yMin { yMin = localPoint.y }
}

With this rule alone, jaggies are noticeable wherever the accuracy of the boundary is poor. There is room for improvement. demo_giza.gif

For details on how to create geometry & nodes, see this article "How to create custom geometry with SceneKit + bonus".

Once the geometry is created, it can be made to collide with the floor.

let bodyGeometry = SCNBox(width: 5.0,
                          height: CGFloat(yMax - yMin),
                          length: 5.0,
                          chamferRadius: 0.0)
let bodyShape = SCNPhysicsShape(geometry: bodyGeometry, options: nil)
node.physicsBody = SCNPhysicsBody(type: .dynamic, shape: bodyShape)
//Drop from 3m above
node.simdWorldPosition = SIMD3<Float>(0.0, 3.0, 0.0)

DispatchQueue.main.async {
    self.scnView.scene.rootNode.addChildNode(node)
}

Since the shape of the created geometry is based on depth information, it spans from the camera out to the distance of the target object, with the geometry origin at the camera position. The SCNPhysicsShape used for collision detection should really match the size of the geometry, but this time it only needs to collide with the floor, so only the height is taken from the geometry, and the width and depth are set to a size (5 m) that is sure to hit the floor.

Finally

In this sample, the geometry is created with the camera as the origin and the direction the camera was facing as the reference, so neither the horizontal plane nor the orientation is taken into account. const.png In the example above, the car was shot looking down at an angle, so the geometry is created in that orientation. If the horizontal were handled properly, the geometry could be made level with the outer white bounding box. I left this for another time, since Apple's LiDAR sample "Visualizing a Point Cloud Using Scene Depth" appears to convert to world coordinates properly at drawing time.
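If you do want world coordinates, the conversion should look roughly like the untested sketch below. The axis handling is my own assumption: the unprojected point uses image axes (x right, y down, z forward), while ARKit's camera space has y up and z pointing backward, so Y and Z are flipped before applying the camera transform of the frame captured at copy time.

// Untested sketch: convert one unprojected point to world coordinates.
// localPoint comes from cameraIntrinsicsInversed * simd_float3(x_px, y_px, 1) * depth as above;
// Y and Z are flipped to move from image axes into ARKit's camera space (assumption).
let cameraSpacePoint = simd_float4(localPoint.x, -localPoint.y, -localPoint.z, 1.0)
let worldPoint = frame.camera.transform * cameraSpacePoint
// The x, y, z of worldPoint could then be used as a vertex in world space.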

That's all for the explanation. I built this by trial and error, so there are probably mistakes and things that should be done differently. I would appreciate it if you could point them out.

Whole source code

ViewController.swift


import ARKit
import Vision
import UIKit
import SceneKit

class ViewController: UIViewController, ARSessionDelegate, ARSCNViewDelegate {

    @IBOutlet weak var scnView: ARSCNView!
    
    //Segmentation size. Matches the input image size of DeepLabV3.mlmodel
    private let detectSize: CGFloat = 513.0
    // Vision Model
    private var visonRequest: VNCoreMLRequest?
    //Depth processing result
    private var depthArray: [Float32] = []
    private var depthSize = 0
    private var cameraIntrinsicsInversed: simd_float3x3?
    //Texture image for the 3D model. Set to the captured image when the copy button is pressed.
    private var texutreImage: CGImage?
    //Copy button pressed
    private var isButtonPressed = false
    //Floor thickness(m)
    private let floorThickness: CGFloat = 0.5
    //Local coordinates of the floor. Lower the Y coordinate by the thickness of the floor
    private lazy var floorLocalPosition = SCNVector3(0.0, -floorThickness/2, 0.0)
    //First recognized anchor
    private var firstAnchorUUID: UUID?
    
    override func viewDidLoad() {
        super.viewDidLoad()
        
        //CoreML settings
        setupVison()
        //Start the AR session
        self.scnView.delegate = self
        self.scnView.session.delegate = self
        let configuration = ARWorldTrackingConfiguration()
        if ARWorldTrackingConfiguration.supportsFrameSemantics(.sceneDepth) {
            configuration.planeDetection = [.horizontal]
            configuration.frameSemantics = [.sceneDepth]
            self.scnView.session.run(configuration, options: [.removeExistingAnchors, .resetTracking])
        } else {
            print("Does not work on this terminal")
        }
    }

    //Anchor added
    func renderer(_: SCNSceneRenderer, didAdd node: SCNNode, for anchor: ARAnchor) {
        guard anchor is ARPlaneAnchor, self.firstAnchorUUID == nil else { return }

        self.firstAnchorUUID = anchor.identifier
        //Add floor node
        let floorNode = SCNScene.makeFloorNode(width: 10.0, height: self.floorThickness, length: 10.0)
        floorNode.position = floorLocalPosition
        DispatchQueue.main.async {
            node.addChildNode(floorNode)
        }
    }

    //Anchor updated
    func renderer(_: SCNSceneRenderer, didUpdate node: SCNNode, for anchor: ARAnchor) {
        guard anchor is ARPlaneAnchor else { return }

        if let childNode = node.childNodes.first {
            DispatchQueue.main.async {
                //Reposition floor nodes
                childNode.position = self.floorLocalPosition
            }
        }
    }

    //AR frame updated
    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        guard self.isButtonPressed else { return }
        self.isButtonPressed = false

        let aspectRatio = self.scnView.bounds.height / self.scnView.bounds.width
        //Crop the center of the captured image to the DeepLabV3 size (513x513)
        let image = frame.cropCenterSquareImage(fullWidthScale: self.detectSize,
                                                aspectRatio: aspectRatio,
                                                orientation: self.scnView.window!.windowScene!.interfaceOrientation)
        let context = CIContext(options: nil)
        self.texutreImage = context.createCGImage(image, from: image.extent)

        //Get depth information
        let (depthArray, depthSize) = frame.cropPortraitCenterSquareDepth(aspectRatio: aspectRatio)
        //Get depth confidence information
        let (depthConfidenceArray, _) = frame.cropPortraitCenterSquareDepthConfidence(aspectRatio: aspectRatio)
        //Extract only highly reliable depth information
        guard depthArray.count == depthConfidenceArray.count else { return }
        self.depthArray = depthConfidenceArray.enumerated().map {
            //If the confidence is less than high, rewrite the depth to -1
            return $0.element >= UInt8(ARConfidenceLevel.high.rawValue) ? depthArray[$0.offset] : -1
        }
        self.depthSize = depthSize
        //Inverse matrix of "camera focal length and center point offset information". Prepared to extend 2D depth information to 3D. Reference: https://qiita.com/tanaka-a/items/042fdbd3da6d6332e7e2
        self.cameraIntrinsicsInversed = frame.camera.intrinsics.inverse
        
        //Perform segmentation
        let handler = VNImageRequestHandler(ciImage: image, options: [:])
        try? handler.perform([self.visonRequest!])
    }
    
    //The copy button was pressed
    @IBAction func pressButton(_ sender: Any) {
        isButtonPressed = true
    }
}

// MARK: -

extension ViewController {
    
    private func setupVison() {
        
        guard let visionModel = try? VNCoreMLModel(for: DeepLabV3(configuration: MLModelConfiguration()).model) else { return }
        let request = VNCoreMLRequest(model: visionModel) { request, error in
            //Receive segmentation results
            guard let observations = request.results as? [VNCoreMLFeatureValueObservation],
                  let segmentationmap = observations.first?.featureValue.multiArrayValue else { return }
            //3D model generation of mask part
            self.draw3DModel(segmentationmap: segmentationmap)
        }
        
        request.imageCropAndScaleOption = .centerCrop
        self.visonRequest = request
    }
    
    private func draw3DModel(segmentationmap: MLMultiArray) {
        
        guard !self.depthArray.isEmpty, let cameraIntrinsicsInversed = self.cameraIntrinsicsInversed else { return }

        //The segmentation result is an array of Int32
        let labels = segmentationmap.dataPointer.assumingMemoryBound(to: Int32.self)

        //Get the label in the center of the screen
        let segmentedImageSize = Int(self.detectSize)
        let centerIndex = (segmentedImageSize / 2) * segmentedImageSize + (segmentedImageSize / 2)
        let centerLabel = labels[centerIndex]
        print("Label value in the center of the screen[\(centerLabel)]")
        
        //Decide whether each depth coordinate is a 3D-model target by referring to the segmentation result (513x513) at the corresponding position (the depth map is 120x120 when this sample runs on iPhone 12 Pro + iOS 14.2)
        let depthDeeplabV3ScaleFactor = Float(self.detectSize) / Float(self.depthSize)    //Ratio of detection resolution to depth resolution
        let isInSegment: (Int, Int) -> Bool = { (x, y) in
            guard x < self.depthSize && y < self.depthSize else { return false }
            let segmentX = Int(Float(x) * depthDeeplabV3ScaleFactor + depthDeeplabV3ScaleFactor / 2)
            let segmentY = Int(Float(y) * depthDeeplabV3ScaleFactor + depthDeeplabV3ScaleFactor / 2)
            let segmentIndex = (segmentedImageSize - segmentY - 1) * segmentedImageSize + segmentX
            //Model if it matches the label in the center
            return labels[segmentIndex] == centerLabel
        }
        
        //Only reliable depth information is 3D modeled
        let isConfidentDepth: (Int, Int) -> Bool = { (x, y) in
            guard x < self.depthSize && y < self.depthSize else { return false }
            return self.depthArray[y * self.depthSize + x] >= 0.0
        }
        
        //Generate polygon vertex coordinates and texture coordinates
        var vertices: [SCNVector3] = []
        var texcoords: [CGPoint] = []
        var indices: [Int32] = []
        var yMax: Float = 0.0
        var yMin: Float = 0.0
        let depthScreenScaleFactor = Float(self.scnView.bounds.width * UIScreen.screens.first!.scale / CGFloat(self.depthSize))
        for y in 0 ..< self.depthSize {
            for x in 0 ..< self.depthSize {
                //Create vertex coordinates (including points that will not be displayed in the end)
                let depth = self.depthArray[y * self.depthSize + x]
                let x_px = Float(x) * depthScreenScaleFactor
                let y_px = Float(y) * depthScreenScaleFactor
                //Convert 2D depth information to 3D
                let localPoint = cameraIntrinsicsInversed * simd_float3(x_px, y_px, 1) * depth
                
                //Depth values are positive toward the back, while SceneKit's -Z points into the screen, so the sign of Z is inverted
                vertices.append(SCNVector3(localPoint.x, localPoint.y, -localPoint.z))
                
                //Use the coordinates on the captured image as the texture coordinates
                let x_coord = CGFloat(x) * CGFloat(depthDeeplabV3ScaleFactor) / self.detectSize
                let y_coord = CGFloat(y) * CGFloat(depthDeeplabV3ScaleFactor) / self.detectSize
                texcoords.append(CGPoint(x: x_coord, y: 1 - y_coord))
                
                //Triangle (downward)
                if isInSegment(x, y), isInSegment(x, y + 1), isInSegment(x + 1, y),
                   isConfidentDepth(x, y), isConfidentDepth(x, y + 1), isConfidentDepth(x + 1, y) {
                    //Add polygon index within segment and if depth information is reliable
                    indices.append(Int32(y * self.depthSize + x))
                    indices.append(Int32(y * self.depthSize + x + 1))
                    indices.append(Int32((y + 1) * self.depthSize + x))
                    
                    if localPoint.y > yMax { yMax = localPoint.y }
                    if localPoint.y < yMin { yMin = localPoint.y }
                }
                //Triangle (upward)
                if isInSegment(x, y), isInSegment(x - 1, y + 1), isInSegment(x, y + 1),
                   isConfidentDepth(x, y), isConfidentDepth(x - 1, y + 1), isConfidentDepth(x, y + 1){
                    //Add polygon index within segment and if depth information is reliable
                    indices.append(Int32(y * self.depthSize + x))
                    indices.append(Int32((y + 1) * self.depthSize + x))
                    indices.append(Int32((y + 1) * self.depthSize + x - 1))
                    
                    if localPoint.y > yMax { yMax = localPoint.y }
                    if localPoint.y < yMin { yMin = localPoint.y }
                }
            }
        }
        
        //Geometry creation
        let vertexSource = SCNGeometrySource(vertices: vertices)
        let texcoordSource = SCNGeometrySource(textureCoordinates: texcoords)
        let geometryElement = SCNGeometryElement(indices: indices, primitiveType: .triangles)
        let geometry = SCNGeometry(sources: [vertexSource, texcoordSource], elements: [geometryElement])
        
        //Material creation
        let material = SCNMaterial()
        material.lightingModel = .constant
        material.diffuse.contents = self.texutreImage
        geometry.materials = [material]
        //Node creation
        let node = SCNNode(geometry: geometry)
        //The collision shape only needs to hit the floor, so a rough size is fine
        let bodyGeometry = SCNBox(width: 5.0,
                                  height: CGFloat(yMax - yMin),
                                  length: 5.0,
                                  chamferRadius: 0.0)
        let bodyShape = SCNPhysicsShape(geometry: bodyGeometry, options: nil)
        node.physicsBody = SCNPhysicsBody(type: .dynamic, shape: bodyShape)
        //Drop from 3m above
        node.simdWorldPosition = SIMD3<Float>(0.0, 3.0, 0.0)

        DispatchQueue.main.async {
            self.scnView.scene.rootNode.addChildNode(node)
        }
    }
}

SwiftExtensions.swift


import UIKit
import ARKit

extension CVPixelBuffer {
    
    var width: Int { CVPixelBufferGetWidth(self) }
    var height: Int { CVPixelBufferGetHeight(self) }
    
    func cropPortraitCenterData<T>(sideCutoff: Int) -> ([T], Int) {
        CVPixelBufferLockBaseAddress(self, CVPixelBufferLockFlags(rawValue: CVOptionFlags(0)))
        defer { CVPixelBufferUnlockBaseAddress(self, CVPixelBufferLockFlags(rawValue: CVOptionFlags(0))) }
        guard let baseAddress = CVPixelBufferGetBaseAddress(self) else { return ([], 0) }
        let pointer = UnsafeMutableBufferPointer<T>(start: baseAddress.assumingMemoryBound(to: T.self),
                                                    count: width * height)
        var dataArray: [T] = []
        //Calculate the cropped width on the screen. The cut-off amount is removed from both ends.
        //* Confusingly, the data obtained from ARKit while in portrait is landscape, so this is calculated from height.
        let size = height - sideCutoff * 2
        //Walk the vertically centered part of the screen. The acquisition order is reversed top to bottom.
        for x in (Int((width / 2) - (size / 2)) ..< Int((width / 2) + (size / 2))).reversed() {
            //Walk the horizontally centered part of the screen. The acquisition order is reversed left to right.
            for y in (sideCutoff ..< (height - sideCutoff)).reversed() {
                let index = y * width + x
                dataArray.append(pointer[index])
            }
        }
        return (dataArray, size)
    }
}

extension ARFrame {
    
    func cropCenterSquareImage(fullWidthScale: CGFloat, aspectRatio: CGFloat, orientation: UIInterfaceOrientation) -> CIImage {
        let pixelBuffer = self.capturedImage
        
        //Convert input image to screen size
        let imageSize = CGSize(width: pixelBuffer.width, height: pixelBuffer.height)
        let image = CIImage(cvImageBuffer: pixelBuffer)
        // 1) Normalize the input image to 0.0-1.0 coordinates
        let normalizeTransform = CGAffineTransform(scaleX: 1.0/imageSize.width, y: 1.0/imageSize.height)
        // 2)For portraits, flip the X and Y axes
        var flipTransform = CGAffineTransform.identity
        if orientation.isPortrait {
            //Invert both X and Y axes
            flipTransform = CGAffineTransform(scaleX: -1, y: -1)
            //Since both the X and Y axes move to the minus side, move to the plus side.
            flipTransform = flipTransform.concatenating(CGAffineTransform(translationX: 1, y: 1))
        }
        // 3)Move to the orientation / position of the screen on the input image
        let viewPortSize = CGSize(width: fullWidthScale, height: fullWidthScale * aspectRatio)
        let displayTransform = self.displayTransform(for: orientation, viewportSize: viewPortSize)
        // 4) Convert from the 0.0-1.0 coordinate system to the screen coordinate system
        let toViewPortTransform = CGAffineTransform(scaleX: viewPortSize.width, y: viewPortSize.height)
        // 5) Apply 1) to 4) and crop the converted image to the specified size
        let transformedImage = image
            .transformed(by: normalizeTransform
                            .concatenating(flipTransform)
                            .concatenating(displayTransform)
                            .concatenating(toViewPortTransform))
            .cropped(to: CGRect(x: 0,
                                y: CGFloat(Int(viewPortSize.height / 2.0 - fullWidthScale / 2.0)),
                                width: fullWidthScale,
                                height: fullWidthScale))
        
        return transformedImage
    }
    
    func cropPortraitCenterSquareDepth(aspectRatio: CGFloat) -> ([Float32], Int) {
        guard let pixelBuffer = self.sceneDepth?.depthMap else { return ([], 0) }
        return cropPortraitCenterSquareMap(pixelBuffer, aspectRatio)
    }
    
    func cropPortraitCenterSquareDepthConfidence(aspectRatio: CGFloat) -> ([UInt8], Int) {
        guard let pixelBuffer = self.sceneDepth?.confidenceMap else { return ([], 0) }
        return cropPortraitCenterSquareMap(pixelBuffer, aspectRatio)
    }
    
    private func cropPortraitCenterSquareMap<T>(_ pixelBuffer: CVPixelBuffer, _ aspectRatio: CGFloat) -> ([T], Int) {
        
        let viewPortSize = CGSize(width: 1.0, height: aspectRatio)
        var displayTransform = self.displayTransform(for: .portrait, viewportSize: viewPortSize)
        //In the case of portrait, both X-axis and Y-axis are inverted
        var flipTransform =  CGAffineTransform(scaleX: -1, y: -1)
        //Since both the X and Y axes move to the minus side, move to the plus side.
        flipTransform = flipTransform.concatenating(CGAffineTransform(translationX: 1, y: 1))
        
        displayTransform = displayTransform.concatenating(flipTransform)
        let sideCutoff = Int((1.0 - (1.0 / displayTransform.c)) / 2.0 * CGFloat(pixelBuffer.height))
        
        return pixelBuffer.cropPortraitCenterData(sideCutoff: sideCutoff)
    }
}

extension SCNScene {
    static func makeFloorNode(width: CGFloat, height: CGFloat, length: CGFloat) -> SCNNode {
        let geometry = SCNBox(width: width, height: height, length: length, chamferRadius: 0.0)
        let material = SCNMaterial()
        material.lightingModel = .shadowOnly
        geometry.materials = [material]
        let node = SCNNode(geometry: geometry)
        node.castsShadow = false
        node.physicsBody = SCNPhysicsBody.static()
        return node
    }
}
