[SWIFT] I made an iPhone Theremin with Vision framework + AudioKit

What I made

** "iPhone Pseudo Theremin" ** When you hold your hand toward the iPhone, it depends on the position of your hand (to be exact, the position of your index finger). The pitch/loudness of the sound that sounds changes. Since it's Christmas, I'm trying to play something like "Silent Night". .. .. difficult. .. ..

Why I made this

I don't have an instrument at home and **I'm lonely** (there were plenty of pianos, guitars, and drums at my parents' house) → if my iPhone could make sounds, that would be **fun** ❓ → according to this article you can turn an iPhone into an instrument, but **what is a theremin**? → watched a theremin performance video → the sound is so beautiful, and it's amazing that it plays without being touched → **I want to make a pseudo theremin**

How to make a pseudo theremin

According to this article, **"a theremin's pitch rises as you move your hand closer to the vertical antenna, and its volume drops as you move your hand closer to the horizontal antenna."** If I wanted to make something theremin-like on an iPhone, could I determine the sound from the distance between the iPhone screen and my hand? At first I thought so, but **I gave up because it seemed difficult**. While I was wondering what to do, I remembered that WWDC20 had a session on detecting hand poses with the Vision framework. Why not use this framework and **make the Y coordinate of the right index finger on the screen correspond to the pitch of the sound**? Tracking both hands seemed difficult, so instead of adjusting the volume with the other hand, why not have the **X coordinate correspond to the loudness of the sound**? And it looked like AudioKit could be used to output the sound from the iPhone.
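In other words: normalized Y position → frequency, normalized X position → amplitude. As a rough sketch of the mapping I had in mind (the helper name and coordinate convention here are my own, not part of the final code):

import CoreGraphics

// Rough sketch of the finger-to-sound mapping (helper name is mine, not from the final code).
// point is the fingertip in normalized screen coordinates:
// y = 0 at the top of the screen, 1 at the bottom; x = 0 at the left edge, 1 at the right edge.
func pitchAndLoudness(for point: CGPoint) -> (frequency: Float, amplitude: Float) {
    // Top of the screen = A4 (440 Hz), bottom = one octave lower (220 Hz).
    let frequency = 440.0 - 220.0 * Float(point.y)
    // Right edge = full volume, left edge = silent.
    let amplitude = Float(point.x)
    return (frequency, amplitude)
}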

Implementation

Since the Vision framework's hand pose detection requires iOS 14.0+, I made it work on my iOS 14 iPhone. **I really wanted to do it on an iPad with a big screen**, but I gave up because the only iPad I have at home runs iOS 12. The implementation is roughly divided into two steps: **① when the app starts, show the front camera image on the iPhone screen so that the coordinates of the right index finger seen by the front camera can be captured**, and **② convert the coordinates of the right index finger into the pitch/loudness of a sound and play it**.

① When the app starts, show the front camera image on the iPhone screen so that the coordinates of the right index finger seen by the front camera can be captured

I found code for face recognition with the front camera in this article. To recognize hands instead, it should be enough to **swap the face-recognition part of that code for hand recognition**. For hand recognition, Apple provides demo app code, which I referred to: I extracted only the part of that code that tracks the index finger and used it to replace the face-recognition part mentioned above.

The code below **"shows the front camera image on the iPhone screen when the app starts, and prints the coordinates of the right index finger seen by the front camera"**. The Y coordinate is 0 at the top of the screen and 1 at the bottom; the X coordinate is 0 at the right edge of the screen and 1 at the left edge. Since I cut and pasted the code together, there may be some unnecessary parts...

ViewController.swift


import UIKit
import Vision
import AVFoundation

class ViewController: UIViewController,
                      AVCaptureVideoDataOutputSampleBufferDelegate {
    private var handPoseRequest = VNDetectHumanHandPoseRequest()
    var indexTip  = CGPoint (x: 0,
                             y: 0)
    private var _captureSession = AVCaptureSession()
    private var _videoDevice = AVCaptureDevice.default(for: AVMediaType.video)
    private var _videoOutput = AVCaptureVideoDataOutput()
    private var _videoLayer : AVCaptureVideoPreviewLayer? = nil
    private var rectArray:[UIView] = []
    var image : UIImage!
    func setupVideo( camPos:AVCaptureDevice.Position,
                     orientaiton:AVCaptureVideoOrientation ){
        //Camera related settings
        self._captureSession = AVCaptureSession()
        self._videoOutput = AVCaptureVideoDataOutput()
        self._videoDevice = AVCaptureDevice.default(.builtInWideAngleCamera,
                                                    for: .video,
                                                    position: camPos)
        //Create Input and add to Session
        do {
            let videoInput = try AVCaptureDeviceInput(device: self._videoDevice!) as AVCaptureDeviceInput
            self._captureSession.addInput(videoInput)
        } catch let error as NSError {
            print(error)
        }
        //Create Output and add to Session
        self._videoOutput.videoSettings = [kCVPixelBufferPixelFormatTypeKey as String : Int(kCVPixelFormatType_32BGRA)]
        self._videoOutput.setSampleBufferDelegate(self,
                                                  queue: DispatchQueue.main)
        self._videoOutput.alwaysDiscardsLateVideoFrames = true
        self._captureSession.addOutput(self._videoOutput)
        for connection in self._videoOutput.connections {
            connection.videoOrientation = orientaiton
        }
        //Create an output layer
        self._videoLayer = AVCaptureVideoPreviewLayer(session: self._captureSession)
        self._videoLayer?.frame = UIScreen.main.bounds
        self._videoLayer?.videoGravity = AVLayerVideoGravity.resizeAspectFill
        self._videoLayer?.connection?.videoOrientation = orientaiton
        self.view.layer.addSublayer(self._videoLayer!)
        //Start recording
        self._captureSession.startRunning()
    }
    private func imageFromSampleBuffer(sampleBuffer: CMSampleBuffer) -> UIImage {
        let imageBuffer: CVImageBuffer = CMSampleBufferGetImageBuffer(sampleBuffer)!
        CVPixelBufferLockBaseAddress(imageBuffer,
                                     CVPixelBufferLockFlags(rawValue: 0))
        let colorSpace = CGColorSpaceCreateDeviceRGB()
        let bitmapInfo = (CGBitmapInfo.byteOrder32Little.rawValue | CGImageAlphaInfo.premultipliedFirst.rawValue)
        let context = CGContext(data: CVPixelBufferGetBaseAddressOfPlane(imageBuffer,
                                                                         0),
                                width: CVPixelBufferGetWidth(imageBuffer),
                                height: CVPixelBufferGetHeight(imageBuffer),
                                bitsPerComponent: 8,
                                bytesPerRow: CVPixelBufferGetBytesPerRow(imageBuffer),
                                space: colorSpace,
                                bitmapInfo: bitmapInfo)
        let imageRef = context!.makeImage()
        CVPixelBufferUnlockBaseAddress(imageBuffer,
                                       CVPixelBufferLockFlags(rawValue: 0))
        let resultImage: UIImage = UIImage(cgImage: imageRef!)
        return resultImage
    }
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        let handler = VNImageRequestHandler(cmSampleBuffer: sampleBuffer,
                                            orientation: .up,
                                            options: [:])
        do {
            // Perform VNDetectHumanHandPoseRequest
            try handler.perform([handPoseRequest])
            // Continue only when a hand was detected in the frame.
            // Since we set the maximumHandCount property of the request to 1, there will be at most one observation.
            guard let observation = handPoseRequest.results?.first else {
                return
            }
            // Get points for index finger.
            let indexFingerPoints = try observation.recognizedPoints(.indexFinger)
            // Look for tip points.
            guard let indexTipPoint = indexFingerPoints[.indexTip] else {
                return
            }
            indexTip = CGPoint(x: indexTipPoint.location.x,
                               y: 1 - indexTipPoint.location.y)
            print(indexTip)
        } catch {
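            // Hand pose detection failed for this frame; skip it and wait for the next one.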
            
        }
    }
    override func viewDidLoad() {
        super.viewDidLoad()
        // This sample app detects one hand only.
        handPoseRequest.maximumHandCount = 1
        setupVideo(camPos: .front,
                   orientaiton: .portrait)
    }
}

Don't forget the **camera permissions**.
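Concretely, the NSCameraUsageDescription key has to be present in Info.plist, otherwise the app is terminated as soon as the capture session starts. If you also want to handle the authorization flow yourself rather than relying on the system prompt alone, a minimal sketch could look like this (this helper is my own addition, not part of the original project):

import AVFoundation

// Optional helper (not in the original project): check camera authorization
// before calling setupVideo(). NSCameraUsageDescription must also be set in Info.plist.
func ensureCameraAccess(then start: @escaping () -> Void) {
    switch AVCaptureDevice.authorizationStatus(for: .video) {
    case .authorized:
        start()
    case .notDetermined:
        AVCaptureDevice.requestAccess(for: .video) { granted in
            if granted { DispatchQueue.main.async { start() } }
        }
    default:
        print("Camera access denied; no camera, no theremin.")
    }
}

In viewDidLoad you could then wrap the setupVideo call in ensureCameraAccess.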

② Convert the coordinates of the right index finger into the pitch/loudness of a sound and play it.

As mentioned earlier, AudioKit is used here. It was my first time using AudioKit, so I wrote the code following this article, which describes how to use it, but **the compiler complained**: "Module 'AudioKit' has no member named 'output'" and "Module 'AudioKit' has no member named 'start'". It's probably because the version of the library used in the article differs from the version I actually installed. The "Example Code" section for "AudioKit V4.11" on AudioKit's official page uses AKManager.output = oscillator, so I tried rewriting the failing lines that way, but I still got an error, and another error appeared on top of it. **Hmm.** When I looked at the official page again, it said something like **"users installing AudioKit for the first time should install ver. 5"**, and since it also gave the following instructions, I reinstalled with ver. 5.

To add AudioKit to your Xcode project, select File -> Swift Packages -> Add Package Dependency. Enter https://github.com/AudioKit/AudioKit for the URL. Check the use branch option and enter v5-main or v5-develop.

After a **long, long loading time**, I corrected the code by referring to the migration guide and finally arrived at **"convert the coordinates of the right index finger into the pitch/loudness of a sound and play it"**. The code is below; it realizes "higher pitch at the top of the screen, lower pitch at the bottom, louder sound at the right of the screen, and quieter sound at the left".
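For reference, here is a minimal sketch of how the setup differs between the two versions, as far as I understand it. The v4 names in the comments (AKOscillator, AudioKit.output / AKManager.output) are what the older article and the v4.11 example code used; the v5 lines mirror the setup in the final code below. (In newer AudioKit 5 releases the Oscillator class has moved to the separate SoundpipeAudioKit package, so the import may need adjusting.)

import AudioKit

// AudioKit 5: audio flows into an explicit AudioEngine instance
// instead of the global AudioKit / AKManager singleton used in v4.
final class ToneSource {
    let engine = AudioEngine()     // v4: implicit global engine
    let oscillator = Oscillator()  // v4: AKOscillator()

    func start() throws {
        engine.output = oscillator // v4: AudioKit.output = oscillator (AKManager.output in later 4.x)
        try engine.start()         // v4: try AudioKit.start()
        oscillator.start()
    }
}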

However, I couldn't play it as it was: **I'm not a theremin player, so I couldn't remember which finger position produced which note.** So, referring to this article, I drew a yellow line for each note of the scale. I referred to this article for how to calculate the note frequencies needed to place the lines. When the app starts, a 440 Hz beep sounds like a startup chime, but I don't mind that.
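As a quick sanity check on the math (my own example, not from the referenced articles): a note d semitones below A4 has frequency 440 * 2^(d/12), and since the finger mapping is frequency = 440 - 220 * y (with y running from 0 at the top of the screen to 1 at the bottom), a note of frequency f gets its guide line at y = (440 - f) / 220 of the screen height. For example, E4 lands roughly halfway down:

import UIKit

// E4 sits 5 semitones below A4 (440 Hz); check where its guide line should go.
let e4: Float = 440.0 * pow(2.0, Float(-5) / 12.0)   // ≈ 329.6 Hz
let normalizedY: Float = (440.0 - e4) / 220.0        // ≈ 0.50, about halfway down the screen
let screenY = CGFloat(normalizedY) * UIScreen.main.bounds.height
print("E4 ≈ \(e4) Hz → guide line at y ≈ \(screenY) pt")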

ViewController.swift


import UIKit
import Vision
import AVFoundation
import AudioKit

class ViewController: UIViewController,
                      AVCaptureVideoDataOutputSampleBufferDelegate {
    let oscillator = Oscillator()
    let engine = AudioEngine()
    private var handPoseRequest = VNDetectHumanHandPoseRequest()
    var indexTip  = CGPoint (x: 0,
                             y: 0)
    private var _captureSession = AVCaptureSession()
    private var _videoDevice = AVCaptureDevice.default(for: AVMediaType.video)
    private var _videoOutput = AVCaptureVideoDataOutput()
    private var _videoLayer : AVCaptureVideoPreviewLayer? = nil
    private var rectArray:[UIView] = []
    var image : UIImage!
    func setupVideo( camPos:AVCaptureDevice.Position,
                     orientaiton:AVCaptureVideoOrientation){
        //Camera related settings
        self._captureSession = AVCaptureSession()
        self._videoOutput = AVCaptureVideoDataOutput()
        self._videoDevice = AVCaptureDevice.default(.builtInWideAngleCamera,
                                                    for: .video,
                                                    position: camPos)
        //Create Input and add to Session
        do {
            let videoInput = try AVCaptureDeviceInput(device: self._videoDevice!) as AVCaptureDeviceInput
            self._captureSession.addInput(videoInput)
        } catch let error as NSError {
            print(error)
        }
        //Create Output and add to Session
        self._videoOutput.videoSettings = [kCVPixelBufferPixelFormatTypeKey as String : Int(kCVPixelFormatType_32BGRA)]
        self._videoOutput.setSampleBufferDelegate(self,
                                                  queue: DispatchQueue.main)
        self._videoOutput.alwaysDiscardsLateVideoFrames = true
        self._captureSession.addOutput(self._videoOutput)
        for connection in self._videoOutput.connections {
            connection.videoOrientation = orientaiton
        }
        //Create an output layer
        self._videoLayer = AVCaptureVideoPreviewLayer(session: self._captureSession)
        self._videoLayer?.frame = UIScreen.main.bounds
        self._videoLayer?.videoGravity = AVLayerVideoGravity.resizeAspectFill
        self._videoLayer?.connection?.videoOrientation = orientaiton
        self.view.layer.addSublayer(self._videoLayer!)
        //Start recording
        self._captureSession.startRunning()
    }
    private func imageFromSampleBuffer(sampleBuffer: CMSampleBuffer) -> UIImage {
        let imageBuffer: CVImageBuffer = CMSampleBufferGetImageBuffer(sampleBuffer)!
        CVPixelBufferLockBaseAddress(imageBuffer,
                                     CVPixelBufferLockFlags(rawValue: 0))
        let colorSpace = CGColorSpaceCreateDeviceRGB()
        let bitmapInfo = (CGBitmapInfo.byteOrder32Little.rawValue | CGImageAlphaInfo.premultipliedFirst.rawValue)
        let context = CGContext(data: CVPixelBufferGetBaseAddressOfPlane(imageBuffer, 0),
                                width: CVPixelBufferGetWidth(imageBuffer),
                                height: CVPixelBufferGetHeight(imageBuffer),
                                bitsPerComponent: 8,
                                bytesPerRow: CVPixelBufferGetBytesPerRow(imageBuffer),
                                space: colorSpace,
                                bitmapInfo: bitmapInfo)
        let imageRef = context!.makeImage()
        CVPixelBufferUnlockBaseAddress(imageBuffer,
                                       CVPixelBufferLockFlags(rawValue: 0))
        let resultImage: UIImage = UIImage(cgImage: imageRef!)
        return resultImage
    }
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        let handler = VNImageRequestHandler(cmSampleBuffer: sampleBuffer,
                                            orientation: .up,
                                            options: [:])
        do {
            // Perform VNDetectHumanHandPoseRequest
            try handler.perform([handPoseRequest])
            // Continue only when a hand was detected in the frame.
            // Since we set the maximumHandCount property of the request to 1, there will be at most one observation.
            guard let observation = handPoseRequest.results?.first else {
                oscillator.stop()
                return
            }
            // Get points for index finger.
            let indexFingerPoints = try observation.recognizedPoints(.indexFinger)
            // Look for tip points.
            guard let indexTipPoint = indexFingerPoints[.indexTip] else {
                return
            }
            indexTip = CGPoint(x: 1 - indexTipPoint.location.x,
                               y: 1 - indexTipPoint.location.y)
            //Map the fingertip's Y coordinate to a frequency between the A one octave below A4 (220 Hz) and A4 (440 Hz)
            let frequency = 440.000 - 220 * indexTip.y
            oscillator.frequency = AUValue(frequency)
            oscillator.amplitude = AUValue(indexTip.x)
            if oscillator.isStopped {
                oscillator.start()
            }
        } catch {
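            // Hand pose detection failed for this frame; keep the current tone and wait for the next frame.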

        }
    }
    override func viewDidLoad() {
        super.viewDidLoad()
        let mixer = Mixer(oscillator)
        engine.output = mixer
        try? engine.start()
        oscillator.start()
        // This app detects one hand only.
        handPoseRequest.maximumHandCount = 1
        setupVideo(camPos: .front,
                   orientaiton: .portrait)
        drawLines(positionArray: frequencyToPosition(frequencyArray: notes()))
    }
    //Draws a horizontal guide line at the screen position of each note.
    func drawLines(positionArray: [CGFloat]){
        let linePath = UIBezierPath()
        for position in positionArray {
            linePath.move(to: CGPoint(x: 0,
                                      y: position))
            linePath.addLine(to: CGPoint(x: UIScreen.main.bounds.width,
                                         y: position))
        }
        let lineLayer = CAShapeLayer()
        lineLayer.path = linePath.cgPath
        lineLayer.strokeColor = UIColor.yellow.cgColor
        lineLayer.lineWidth = 4
        self.view.layer.addSublayer(lineLayer)
    }
    //A function that converts a frequency to a Y coordinate on the screen (the inverse of the finger-to-frequency mapping).
    func frequencyToPosition(frequencyArray: [Float]) -> [CGFloat] {
        var yPosition : Float = 0.0
        var positionArray : [CGFloat] = []
        for frequency in frequencyArray {
            let x  = (frequency - 440.0) / -220.0
            yPosition = Float(UIScreen.main.bounds.height) * x
            positionArray.append(CGFloat(yPosition))
        }
        return positionArray
    }
    //A function that returns the array of semitone frequencies from the A one octave below A4 (220 Hz) up to A4 (440 Hz).
    func notes() -> [Float] {
        var f : Float = 0
        var frequencyArray : [Float] = []
        for d in -12 ... 0 {
            f = 440.0 * pow(2.0,
                            Float(d) / 12.0)
            frequencyArray.append(f)
        }
        return frequencyArray
    }
}

This completes the pseudo theremin!

In conclusion

I made it, but I don't feel like I can play it properly. If anyone can, please show me... I also want to actually play a real theremin someday.
