See, Hear, and Read: Deep Aligned Representations arxiv-vanity.com/papers/1706.00932/
The model is trained on paired data such as images and captions. As a result, it turns out that a sound–text relationship with reasonable quantitative performance can be learned, even for unseen images. CNNs are used for text, sound, and images. The interesting part is that the lower layers do not share weights; only the upper layers share weights.
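A minimal sketch (not the authors' code) of that layout in PyTorch: each modality gets its own lower layers, and a single set of upper layers is shared across all of them. The layer sizes, the 512-d embedding, and the input formats (RGB image, 1-channel spectrogram, word-embedding sequence) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AlignedNet(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        # Modality-specific lower layers (no weight sharing).
        self.image_lower = nn.Sequential(              # input: (B, 3, H, W) RGB image
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1024),
        )
        self.sound_lower = nn.Sequential(              # input: (B, 1, F, T) spectrogram
            nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1024),
        )
        self.text_lower = nn.Sequential(               # input: (B, 300, T) word embeddings
            nn.Conv1d(300, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, 1024),
        )
        # Shared upper layers: the same weights process every modality.
        self.shared_upper = nn.Sequential(
            nn.ReLU(), nn.Linear(1024, embed_dim),
        )

    def forward(self, x, modality):
        lower = {"image": self.image_lower,
                 "sound": self.sound_lower,
                 "text": self.text_lower}[modality]
        return self.shared_upper(lower(x))
```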
Paired images and sounds, or paired images and texts, are trained so that the two elements of a pair fall into the same group (via a KL-divergence objective).
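A sketch of what such a KL-divergence alignment could look like: both members of a pair are pushed to produce the same distribution over groups. The group head (here a linear layer to 1000 groups) and the choice of which side serves as the target are assumptions, not the paper's exact formulation.

```python
import torch.nn as nn
import torch.nn.functional as F

def kl_alignment_loss(emb_a, emb_b, group_head):
    """emb_a, emb_b: (B, D) embeddings of a batch of paired inputs
    (image/sound or image/text). group_head: Linear layer mapping D -> #groups."""
    p = F.softmax(group_head(emb_a), dim=1)           # target group distribution (e.g. image)
    log_q = F.log_softmax(group_head(emb_b), dim=1)   # distribution from the other modality
    # KL(p || q), averaged over the batch: both modalities of a pair should
    # assign the example to the same group.
    return F.kl_div(log_q, p, reduction="batchmean")

# Usage (assumed sizes): head = nn.Linear(512, 1000); loss = kl_alignment_loss(img_emb, txt_emb, head)
```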
The model is fed sound, text, and image data. For training, there are only two kinds of pairs: image + text and image + sound. Nevertheless, the result is that sound–text pairs are learned as well (you could say the pairing is learned by using the image as a bridge).
The point is that this learns recognition spanning three modalities: images, sound, and text, which is close to how humans perceive. I think this is the first attempt at this scale covering three senses/modalities. As a result, even though no sound/text pairs are trained on directly, that alignment becomes possible by using the image as a bridge. In other words, it is like translating between two languages by passing through a bridge language (the classic approach of pivoting through English): English → image, image → French, where "English" is sound or text and "French" is text or sound.

Where is the heart of the method? CNNs are used for all of audio, images, and text, and the key point is a new network in which all the upper layers are connected (shared). A training-loop sketch for the two pair types follows below.
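A sketch of that training loop, reusing the AlignedNet and kl_alignment_loss sketches above: only image+text and image+sound pairs are ever used, yet sound and text embeddings end up aligned because both are tied to the image branch. The function signature and batch handling are assumptions.

```python
def train_step(model, group_head, optimizer, images, paired, modality):
    """One update on a batch of image+text or image+sound pairs.
    `model` is an AlignedNet; `kl_alignment_loss` is the sketch above."""
    emb_img = model(images, "image")
    emb_other = model(paired, modality)      # modality is "text" or "sound"
    loss = kl_alignment_loss(emb_img, emb_other, group_head)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Alternate between the two pair types; sound and text are never paired directly.
# train_step(model, head, opt, image_batch, text_batch, "text")
# train_step(model, head, opt, image_batch, sound_batch, "sound")
```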
Cross-modal retrieval scores were compared. Cross-modal retrieval means, for example, that when you want data in one modality (text, image, or sound), you query with data from another modality and see whether you can retrieve the desired item.
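A minimal sketch of how such retrieval could work in the shared embedding space: embed the query with one modality's branch, embed the database with another, and rank by cosine similarity. The batching, normalization, and top-k value are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cross_modal_search(model, query, query_modality, database, db_modality, top_k=5):
    q = F.normalize(model(query, query_modality), dim=1)    # (1, D) query embedding
    db = F.normalize(model(database, db_modality), dim=1)   # (N, D) database embeddings
    sims = q @ db.t()                                       # cosine similarities, (1, N)
    return sims.topk(top_k, dim=1).indices                  # indices of the best matches

# Example: retrieve images from a text query.
# idx = cross_modal_search(model, text_query, "text", image_db, "image")
```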
Sound and language: voice search is quite common (probabilistic models). Language and vision: this differs from automatic caption generation from images. In this experiment, only the relationships between images and sound, or images and text, are learned directly. Another new point is that a CNN, rather than an RNN, is used for text.
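For reference, a standalone sketch of a 1-D CNN over word embeddings, the kind of text branch that stands in for the RNNs more commonly used at the time. Vocabulary size, embedding dimension, and filter widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=300, out_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(256, out_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),                 # pool over word positions
        )

    def forward(self, token_ids):                    # (B, T) integer word ids
        x = self.embed(token_ids).transpose(1, 2)    # (B, embed_dim, T)
        return self.conv(x).squeeze(-1)              # (B, out_dim) text feature
```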