How to Build a Multi-modal Image Search App from Scratch
Part 1 — VISTA embedding model
Are you interested in building cutting-edge image search applications?
In this first part of the series, we'll lay the groundwork for a powerful image search engine that can handle multi-modal search input. Let's dive in!
Understanding Image Search Methods
Before diving into the app-building process, it’s crucial to understand the three primary methods of image search:
- Text-based search: Users input text queries (e.g., “a blue suit”) to find relevant images.
- Image-based search: Users upload an image to find similar or related images.
- Multi-modal search: Users provide both text and image inputs to refine their search results.
Our tutorial focuses on the third method, multi-modal search, which offers the most flexibility and power for users seeking specific visual content.
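To make the three methods concrete, here is a minimal sketch of embedding-based search using toy NumPy vectors in place of a real encoder. The catalog items, the 4-dimensional vectors, and the simple "normalize and add" fusion are all illustrative assumptions; a model like VISTA fuses image and text inside the encoder rather than by averaging vectors. The point is only to show how one cosine-similarity ranking function serves all three search modes, and how adding text refines an image query.

```python
import numpy as np

# Toy 4-d embeddings standing in for a real encoder's output.
# Dimensions loosely mean [blue, red, suit-like, jeans-like] -- purely illustrative.
catalog = {
    "blue_suit.jpg":  np.array([1.0, 0.0, 1.0, 0.0]),
    "red_suit.jpg":   np.array([0.0, 1.0, 1.0, 0.0]),
    "blue_jeans.jpg": np.array([0.9, 0.0, 0.0, 1.0]),
}

def normalize(v):
    return v / np.linalg.norm(v)

def search(query_vec, catalog, top_k=1):
    """Rank catalog items by cosine similarity to the query embedding."""
    q = normalize(query_vec)
    scores = {name: float(normalize(vec) @ q) for name, vec in catalog.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Text-based: embedding of the phrase "a blue suit".
text_vec = np.array([1.0, 0.0, 0.5, 0.0])
# Image-based: embedding of an uploaded photo of a red suit.
image_vec = np.array([0.0, 0.9, 1.0, 0.0])
# Multi-modal: fuse both signals (naive sum of unit vectors; real
# multi-modal models do this inside the encoder).
fused = normalize(text_vec) + normalize(image_vec)

print(search(text_vec, catalog))   # text matches the blue suit
print(search(image_vec, catalog))  # image matches the red suit
print(search(fused, catalog))      # text refines the image query: blue suit
```

Note how the fused query behaves like "this suit, but in blue": the uploaded red-suit photo alone retrieves the red suit, while adding the text steers the result to the blue one.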
Introducing VISTA: a Multi-modal Embedding Model
At the heart of our image search app is VISTA (Visualized Text Embedding for Universal Multi-modal Retrieval). This multi-modal embedding model encodes image and text inputs into a single shared vector space, enabling more accurate and relevant search results.