New Breakthroughs in Text Embedding 🚀!

Angelina Yang
4 min read · Feb 15, 2024

Text embeddings play a pivotal role in Natural Language Processing (NLP) tasks such as modern information retrieval, question answering, and retrieval-augmented generation (RAG). From classics like GloVe and BERT to the latest state-of-the-art (SOTA) models, these embeddings encode semantic information and have become indispensable across a wide range of applications.

Text Embedding📽️

If you’re new to the concept of embeddings, check out our quick 2-minute interview prep on this topic.

Concretely,

Text embeddings are vector representations of natural language that encode its semantic information. They are widely used in various natural language processing (NLP) tasks, such as information retrieval (IR), question answering, semantic textual similarity, bitext mining, item recommendation, etc.
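To make this concrete, here is a minimal sketch of embedding a few sentences and comparing them by cosine similarity. It assumes the open-source sentence-transformers library and the small all-MiniLM-L6-v2 model, neither of which is part of the work discussed in this article; any embedding model would illustrate the same idea.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# A small, widely used embedding model (illustrative choice only).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "How do I reset my password?",
    "Steps to recover account access",
    "Best hiking trails near Seattle",
]

# Encode sentences into unit-length vectors, then compare them pairwise.
embeddings = model.encode(sentences, normalize_embeddings=True)
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)  # semantically related sentences score higher
```

The first two sentences score much higher with each other than with the third, which is exactly the property that retrieval, similarity, and RAG pipelines rely on.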

From the renowned BERT to sophisticated multi-stage-trained models such as E5 and BGE, the objective is the same: capture the rich contextual details of natural language while balancing training cost and efficiency. However, complex multi-stage training on carefully curated data raises concerns about data quantity, cost, and the diversity of tasks covered.

Recently, a new option emerged from Microsoft Research: E5-mistral-7b-instruct.
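The model is published on Hugging Face as intfloat/e5-mistral-7b-instruct. As a hedged sketch rather than the official recipe, the snippet below follows the common usage pattern for this family of models: prepend a task instruction to the query, embed documents as plain text, pool the last non-padding token, and L2-normalize. Check the model card before relying on the exact prompt format or pooling.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "intfloat/e5-mistral-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.padding_side = "right"              # keeps last-token pooling below simple
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # defensive: ensure a pad token exists
model = AutoModel.from_pretrained(MODEL_ID, torch_dtype=torch.float16)  # ~7B params, needs a large GPU

def last_token_pool(hidden_states, attention_mask):
    """Take the hidden state of each sequence's final non-padding token."""
    seq_lens = attention_mask.sum(dim=1) - 1
    return hidden_states[torch.arange(hidden_states.size(0)), seq_lens]

# Queries carry a task instruction; documents are embedded as-is.
task = "Given a web search query, retrieve relevant passages that answer the query"
texts = [
    f"Instruct: {task}\nQuery: what are text embeddings?",
    "Text embeddings are vector representations of natural language that encode semantics.",
]

batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)
embeddings = F.normalize(last_token_pool(outputs.last_hidden_state, batch["attention_mask"]), p=2, dim=1)
print(embeddings[0] @ embeddings[1])  # cosine similarity between query and passage
```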

Idea

The concept appears intuitive: with powerful Large Language Models (LLMs) at our disposal and embedding models in need of training data,

why not employ LLMs to create synthetic data?

The research team put this idea into practice by harnessing LLMs to produce synthetic data spanning a wide array of text embedding tasks across 93 languages, generating hundreds of thousands of training instances.

The team employs a two-step prompting strategy: the LLM is first guided to brainstorm a pool of candidate tasks, and then prompted to generate data tailored to a task selected from that pool, as sketched below.
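The snippet below is a simplified, illustrative sketch of that two-step loop using the OpenAI Python client; the prompt wording, JSON schema, and model name are placeholders, not the prompts used in the paper.

```python
# Illustrative two-step synthetic-data generation; prompts and schema are invented here.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the paper used GPT-4-class models
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    return resp.choices[0].message.content

# Step 1: brainstorm a pool of candidate text-embedding tasks.
tasks = json.loads(chat(
    "Brainstorm 5 diverse text retrieval tasks. "
    "Return only a JSON list of short task descriptions."
))

# Step 2: generate a concrete training example for one selected task:
# a user query, a relevant (positive) document, and a hard negative.
# (A real pipeline would validate and repair the model's JSON output.)
example = json.loads(chat(
    f"Retrieval task: {tasks[0]}\n"
    "Return only a JSON object with keys 'user_query', "
    "'positive_document', and 'hard_negative_document'."
))
print(example)
```

Splitting the work into a brainstorming step and a generation step is what gives the synthetic data its task diversity: the second prompt is always conditioned on a freshly sampled task rather than a fixed template.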
