How to Build a Multi-modal Image Search App from Scratch

Angelina Yang
4 min read · Jan 16, 2025


Part 1 — VISTA embedding model

Are you interested in building cutting-edge image search applications?

In this first part of the series, we’ll lay the groundwork for creating a powerful image search engine that can handle multi-modal search input. Let’s dive in!

Understanding Image Search Methods

Before diving into the app-building process, it’s crucial to understand the three primary methods of image search:

  1. Text-based search: Users input text queries (e.g., “a blue suit”) to find relevant images.
  2. Image-based search: Users upload an image to find similar or related images.
  3. Multi-modal search: Users provide both text and image inputs to refine their search results.

Our tutorial focuses on the third method, multi-modal search, which offers the most flexibility and power for users seeking specific visual content.
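All three methods reduce to the same operation once queries and images live in a shared embedding space: embed the query, then rank catalog images by similarity. The sketch below illustrates this with toy 3-dimensional vectors and cosine similarity; the vectors, image names, and the `search` helper are hypothetical stand-ins for what a real embedding model would produce.

```python
# Sketch: one similarity search serves all three query types, assuming
# text and images are embedded into the same vector space.
# The embeddings below are made-up toy values, not real model output.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical catalog: image name -> precomputed embedding
catalog = {
    "blue_suit.jpg":  [0.9, 0.1, 0.0],
    "red_dress.jpg":  [0.1, 0.9, 0.0],
    "blue_jeans.jpg": [0.7, 0.2, 0.1],
}

def search(query_vec, top_k=2):
    # Rank catalog images by cosine similarity to the query embedding.
    ranked = sorted(catalog,
                    key=lambda name: cosine(query_vec, catalog[name]),
                    reverse=True)
    return ranked[:top_k]

# Text-based search: a toy embedding of the query "a blue suit"
print(search([0.85, 0.15, 0.0]))  # → ['blue_suit.jpg', 'blue_jeans.jpg']
```

For image-based search, the query vector would come from embedding an uploaded image instead of text; the ranking step is identical.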

Introducing VISTA: A Multi-modal Embedding Model

At the heart of our image search app is a model called VISTA. This multi-modal embedding model maps both images and text into a shared vector space, allowing the two input types to be combined seamlessly and yielding more accurate, relevant search results.
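VISTA fuses the image and text inputs inside the model itself. As a simple illustrative stand-in (not VISTA’s actual mechanism), averaging two separately computed embeddings shows the basic idea of turning a text-plus-image query into a single vector; the vectors below are made-up toy values.

```python
# Illustrative stand-in for multi-modal query fusion, assuming text and
# image embeddings already live in the same space. A real model like
# VISTA performs this fusion internally rather than by averaging.
def fuse(text_vec, image_vec):
    # Element-wise average of the two embeddings into one query vector.
    return [(t + i) / 2 for t, i in zip(text_vec, image_vec)]

text_vec  = [0.9, 0.1, 0.0]   # toy embedding of "a blue suit"
image_vec = [0.5, 0.1, 0.4]   # toy embedding of an uploaded photo
query = fuse(text_vec, image_vec)
```

The fused `query` vector can then be fed to the same nearest-neighbor ranking used for text-only or image-only search.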
