How to Build a Multi-modal Image Search App from Scratch

Angelina Yang
4 min read · Jan 16, 2025


Part 1 — VISTA embedding model

Are you interested in building cutting-edge image search applications?

In this first part of the series, we’ll lay the groundwork for creating a powerful image search engine that can handle multi-modal search input. Let’s dive in!

Understanding Image Search Methods

Before diving into the app-building process, it’s crucial to understand the three primary methods of image search:

  1. Text-based search: Users input text queries (e.g., “a blue suit”) to find relevant images.
  2. Image-based search: Users upload an image to find similar or related images.
  3. Multi-modal search: Users provide both text and image inputs to refine their search results.

Our tutorial focuses on the third method, multi-modal search, which offers the most flexibility and power for users seeking specific visual content.
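All three methods reduce to the same operation once queries and images live in a shared embedding space: embed the query, then rank catalog images by similarity. The sketch below illustrates this with toy 3-dimensional vectors and cosine similarity; the vectors, image names, and the `search` helper are hypothetical stand-ins for what a real embedding model would produce.

```python
# Sketch: one similarity search serves all three query types, assuming
# text and images are embedded into the same vector space.
# The embeddings below are made-up toy values, not real model output.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical catalog: image name -> precomputed embedding
catalog = {
    "blue_suit.jpg":  [0.9, 0.1, 0.0],
    "red_dress.jpg":  [0.1, 0.9, 0.0],
    "blue_jeans.jpg": [0.7, 0.2, 0.1],
}

def search(query_vec, top_k=2):
    # Rank catalog images by cosine similarity to the query embedding.
    ranked = sorted(catalog,
                    key=lambda name: cosine(query_vec, catalog[name]),
                    reverse=True)
    return ranked[:top_k]

# Text-based search: a toy embedding of the query "a blue suit"
print(search([0.85, 0.15, 0.0]))  # → ['blue_suit.jpg', 'blue_jeans.jpg']
```

For image-based search, the query vector would come from embedding an uploaded image instead of text; the ranking step is identical.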

Introducing VISTA: A Multi-modal Embedding Model

At the heart of our image search app is a model called VISTA. This multi-modal embedding model maps both images and text into a shared vector space, allowing the two input types to be combined seamlessly and yielding more accurate, relevant search results.
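VISTA fuses the image and text inputs inside the model itself. As a simple illustrative stand-in (not VISTA’s actual mechanism), averaging two separately computed embeddings shows the basic idea of turning a text-plus-image query into a single vector; the vectors below are made-up toy values.

```python
# Illustrative stand-in for multi-modal query fusion, assuming text and
# image embeddings already live in the same space. A real model like
# VISTA performs this fusion internally rather than by averaging.
def fuse(text_vec, image_vec):
    # Element-wise average of the two embeddings into one query vector.
    return [(t + i) / 2 for t, i in zip(text_vec, image_vec)]

text_vec  = [0.9, 0.1, 0.0]   # toy embedding of "a blue suit"
image_vec = [0.5, 0.1, 0.4]   # toy embedding of an uploaded photo
query = fuse(text_vec, image_vec)
```

The fused `query` vector can then be fed to the same nearest-neighbor ranking used for text-only or image-only search.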
