
Optimized Embedding Model using Optimum-Intel

LlamaIndex supports loading quantized embedding models for Intel hardware through the Optimum-Intel library.

Optimized models are smaller and faster, with minimal loss of accuracy; see the Optimum-Intel documentation and an optimization guide based on the IntelLabs/fastRAG library.

Optimization relies on instruction sets available in 4th generation Intel® Xeon® processors and newer.
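
If you are unsure whether your host CPU exposes these instruction sets, you can inspect its feature flags. Below is a minimal sketch, assuming the third-party py-cpuinfo package (not among the dependencies listed below) and Linux-style flag names such as amx_int8 and avx512_vnni:

import cpuinfo  # assumption: third-party py-cpuinfo package (pip install py-cpuinfo)

# Collect the CPU feature flags reported by the OS
flags = set(cpuinfo.get_cpu_info().get("flags", []))

# Linux flag names for INT8 acceleration instructions;
# AMX-INT8 was introduced with 4th generation Xeon processors
for feature in ("amx_int8", "avx512_vnni"):
    print(feature, "supported" if feature in flags else "not detected")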

To load and use the quantized models, install the required dependencies:

pip install optimum[exporters] optimum-intel neural-compressor intel_extension_for_pytorch

Loading is done with the IntelEmbedding class; usage is similar to any other locally hosted HuggingFace embedding model. See the example below:

%pip install llama-index-embeddings-huggingface-optimum-intel

from llama_index.embeddings.huggingface_optimum_intel import IntelEmbedding

# Load Intel's statically quantized INT8 version of BGE-small
embed_model = IntelEmbedding("Intel/bge-small-en-v1.5-rag-int8-static")

embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))  # embedding dimension
print(embeddings[:5])  # first five values

Output:

384
[-0.0032782123889774084, -0.013396517373621464, 0.037944991141557693, -0.04642259329557419, 0.027709005400538445]
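
Once loaded, IntelEmbedding behaves like any other LlamaIndex embedding model: it can be set as the global embedding model via Settings, and it exposes the usual query and text embedding methods. The following is a minimal sketch; the query and passage strings are made up for illustration:

from llama_index.core import Settings
from llama_index.embeddings.huggingface_optimum_intel import IntelEmbedding

embed_model = IntelEmbedding("Intel/bge-small-en-v1.5-rag-int8-static")

# Use the quantized model everywhere LlamaIndex needs embeddings
Settings.embed_model = embed_model

# Score a query against candidate passages (sample strings for illustration)
query_embedding = embed_model.get_query_embedding("What is a quantized model?")
for text in (
    "Quantization stores model weights in lower-precision integers.",
    "The weather today is sunny with a light breeze.",
):
    text_embedding = embed_model.get_text_embedding(text)
    # BaseEmbedding.similarity computes cosine similarity by default
    score = embed_model.similarity(query_embedding, text_embedding)
    print(f"{score:.4f}  {text}")

With Settings.embed_model assigned, any index built afterwards (for example with VectorStoreIndex.from_documents) will embed documents and queries using the quantized model.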