Hybrid RAG with Qdrant: multi-tenancy, custom sharding, distributed setup
What you’ll build
This notebook implements a production-style Hybrid RAG on Qdrant using LlamaIndex, designed for multitenancy and scale-out via custom sharding.
- Hybrid search: dense embeddings + sparse BM25 for higher recall and precision.
- Multitenancy: isolate tenants using payload filters and shard routing.
- Custom sharding: keep each tenant local for performance and cost efficiency.
- Distributed Qdrant: multi-node setup with replication for high availability and throughput.
This notebook walks through an end-to-end Retrieval-Augmented Generation workflow that uses Qdrant as a distributed hybrid search backend and LlamaIndex as the orchestration layer. You will build a tenant-aware RAG pipeline that combines dense vectors with sparse signals, isolates data per tenant with filters, and routes data and queries with a custom shard key for scale.
Install dependencies
About the dependencies
- llama-index: orchestration layer for ingestion, indexing, and retrieval.
- llama-index-vector-stores-qdrant: Qdrant integration with hybrid support.
- llama-index-embeddings-fastembed: LlamaIndex wrapper that provides FastEmbedEmbedding.
- fastembed: lightweight, CPU-friendly dense and sparse embedding models.
%pip install -U llama-index llama-index-vector-stores-qdrant llama-index-embeddings-fastembed fastembed
Make sure you have a distributed Qdrant cluster up and running. Here is a compose.yaml file:
services:
  qdrant_primary:
    image: "qdrant/qdrant:latest"
    ports:
      - "6333:6333"
    environment:
      QDRANT__CLUSTER__ENABLED: "true"
    command: ["./qdrant", "--uri", "http://qdrant_primary:6335"]
    restart: always
  qdrant_secondary:
    image: "qdrant/qdrant:latest"
    environment:
      QDRANT__CLUSTER__ENABLED: "true"
    command: ["./qdrant", "--bootstrap", "http://qdrant_primary:6335"]
    restart: always
Imports and global settings
Settings and connectivity
- Embeddings: FastEmbedEmbedding('BAAI/bge-base-en-v1.5') is a compact, high-quality baseline.
- Connection: QDRANT_URL defaults to a local HTTP endpoint (http://localhost:6333); set QDRANT_API_KEY for secured/cloud setups.
import os

from qdrant_client import AsyncQdrantClient, QdrantClient
from qdrant_client import models

from llama_index.core import (
    Settings,
    VectorStoreIndex,
    Document,
    StorageContext,
)
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.fastembed import FastEmbedEmbedding

# Embeddings, small and fast
Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-base-en-v1.5")

# Qdrant connection, local by default; set QDRANT_URL and QDRANT_API_KEY for cloud
QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")

client: QdrantClient = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
aclient: AsyncQdrantClient = AsyncQdrantClient(
    url=QDRANT_URL, api_key=QDRANT_API_KEY
)
COLLECTION = "hybrid_rag_multitenant_sharding_demo"
Create distributed-ready collection
Configure dual-vector schema (dense + sparse)
- Define vector field names: dense for embeddings and sparse for BM25-style signals.
- Dense config:
  - Determine embedding dimensionality at runtime by probing Settings.embed_model (avoids hardcoding).
  - Use cosine distance for semantic similarity.
- Sparse config:
  - Enable an in-memory sparse index (on_disk=False) to support hybrid scoring.
- These settings establish the collection's dual-index layout used later by QdrantVectorStore for hybrid retrieval.
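To make the dual-vector layout more concrete, the following library-free sketch shows the shape of a sparse vector: a list of indices paired with a list of weights, in contrast to a fixed-length dense vector. The whitespace tokenizer, the CRC32 hashing scheme, and the helper name toy_sparse_vector are illustrative assumptions; the sparse model configured later in this notebook computes its own indices and weights.

```python
import zlib
from collections import Counter


def toy_sparse_vector(
    text: str, vocab_size: int = 2**16
) -> tuple[list[int], list[float]]:
    """Hypothetical helper: hash tokens to indices, use raw term counts as weights."""
    counts = Counter(text.lower().split())
    weights: dict[int, float] = {}
    for token, count in counts.items():
        # Hash each token into a fixed index space; colliding tokens share a slot.
        index = zlib.crc32(token.encode("utf-8")) % vocab_size
        weights[index] = weights.get(index, 0.0) + float(count)
    indices = sorted(weights)
    return indices, [weights[i] for i in indices]


indices, values = toy_sparse_vector("solar panels reduce solar bills")
print(indices, values)
```

Only the non-zero positions are stored, which is why the sparse side needs its own index configuration rather than a size and distance metric.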
dense_vector_name = "dense"
dense_config = models.VectorParams(
    size=len(Settings.embed_model.get_text_embedding("probe")),
    distance=models.Distance.COSINE,
)
sparse_vector_name = "sparse"
sparse_config = models.SparseVectorParams(
    index=models.SparseIndexParams(on_disk=False)
)
Shard keys and selector contract
- shard_keys: ['tenant_a', 'tenant_b'], the predefined partitions used with custom sharding to keep each tenant local.
- payload_indexes: keyword index on tenant_id to accelerate filter-based queries.
- shard_key_selector_fn(tenant_id) -> tenant_id: returns the shard key used for both writes and reads.
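Because the same selector is applied on writes and on reads, it can be useful to fail early when a tenant has no registered shard key. The sketch below is one possible stricter variant; KNOWN_SHARD_KEYS and strict_shard_key_selector are hypothetical names and are not used elsewhere in this notebook.

```python
# Hypothetical registry of shard keys that exist in the collection.
KNOWN_SHARD_KEYS = frozenset({"tenant_a", "tenant_b"})


def strict_shard_key_selector(tenant_id: str) -> str:
    """Return the shard key for a tenant, rejecting tenants without one."""
    if tenant_id not in KNOWN_SHARD_KEYS:
        raise ValueError(f"No shard key registered for tenant {tenant_id!r}")
    return tenant_id


print(strict_shard_key_selector("tenant_a"))
```

Raising on an unknown tenant avoids silently routing a write or query to a shard key that was never created.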
shard_keys = ["tenant_a", "tenant_b"]
payload_indexes = [
    {
        "field_name": "tenant_id",
        "field_schema": models.PayloadSchemaType.KEYWORD,
    }
]

def shard_key_selector_fn(tenant_id: str) -> models.ShardKeySelector:
    return tenant_id
Initialize hybrid Qdrant store with custom sharding
This step creates or attaches to the collection named in COLLECTION and configures a dual-vector hybrid store:
- Hybrid search: enable_hybrid=True with dense_vector_name='dense' and sparse_vector_name='sparse'.
- Dense config: dense_config uses cosine distance and derives size from Settings.embed_model.
- Sparse config: sparse_config enables an in-memory sparse index; fastembed_sparse_model='Qdrant/bm25' supplies BM25-style signals.
- Distributed topology:
  - sharding_method=models.ShardingMethod.CUSTOM with shard_keys=['tenant_a', 'tenant_b'].
  - shard_key_selector_fn(tenant_id) -> tenant_id routes both writes and reads.
  - shard_number=6, replication_factor=2 for scale and high availability.
- Payload index: payload_indexes accelerates filtering on tenant_id.
Idempotent behavior: the vector store will create the collection if missing and reuse it on subsequent runs.
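A quick back-of-the-envelope calculation helps when sizing this topology. The sketch below assumes that shard_number counts shards per shard key, as Qdrant describes for custom sharding, and that replicas spread evenly across the two nodes from the compose file; the function name total_shard_replicas is hypothetical.

```python
def total_shard_replicas(
    shards_per_key: int, num_shard_keys: int, replication_factor: int
) -> int:
    """Number of physical shard replicas the cluster has to host."""
    return shards_per_key * num_shard_keys * replication_factor


# Values from this notebook: shard_number=6, two shard keys, replication_factor=2.
total = total_shard_replicas(shards_per_key=6, num_shard_keys=2, replication_factor=2)
per_node = total // 2  # assumption: even spread over a two-node cluster
print(total, per_node)  # 24 12
```

For a demo with six documents this is far more shards than necessary; fewer shards per key would likely be enough on a small cluster.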
vector_store = QdrantVectorStore(
    collection_name=COLLECTION,
    client=client,
    aclient=aclient,
    dense_vector_name=dense_vector_name,
    sparse_vector_name=sparse_vector_name,
    enable_hybrid=True,
    dense_config=dense_config,
    sparse_config=sparse_config,
    fastembed_sparse_model="Qdrant/bm25",
    shard_number=6,
    sharding_method=models.ShardingMethod.CUSTOM,
    shard_key_selector_fn=shard_key_selector_fn,
    shard_keys=shard_keys,
    replication_factor=2,
    payload_indexes=payload_indexes,
)
Prepare multi-tenant dataset
We create two tenants with small document sets.
Dataset design and extensibility
We simulate two tenants with a few short documents each. Every Document carries:
- tenant_id for isolation and shard routing,
- tags for quick filtering and debugging,
- text content used for dense/sparse indexing.
TENANT_DOCS: dict[str, list[Document]] = {
    "tenant_a": [
        Document(
            text="Solar panels reduce electricity bills and carbon footprint",
            metadata={"tenant_id": "tenant_a", "tags": ["energy", "solar"]},
        ),
        Document(
            text="Inverters convert DC power to AC for home appliances",
            metadata={"tenant_id": "tenant_a", "tags": ["energy", "hardware"]},
        ),
        Document(
            text="Net metering policies vary by region and utility provider",
            metadata={
                "tenant_id": "tenant_a",
                "tags": ["policy", "regulation"],
            },
        ),
    ],
    "tenant_b": [
        Document(
            text="Kubernetes orchestrates containers across a cluster",
            metadata={"tenant_id": "tenant_b", "tags": ["cloud", "k8s"]},
        ),
        Document(
            text="Service meshes add observability and traffic management",
            metadata={
                "tenant_id": "tenant_b",
                "tags": ["cloud", "networking"],
            },
        ),
        Document(
            text="Helm charts package and deploy Kubernetes applications",
            metadata={"tenant_id": "tenant_b", "tags": ["cloud", "devops"]},
        ),
    ],
}
Ingest with shard key for locality
Here we embed text with the active Settings.embed_model, then upsert each point with payload and a shard key. This keeps each tenant local to a shard group in a cluster.
Embedding strategy
- FastEmbed keeps this demo CPU-friendly. For production, consider a hosted service (e.g., text-embedding-3-large or an in-house model) and cache embeddings.
- If you change the model, update dense_config.size to match and reindex the collection, since vectors from different models are not comparable.
- Avoid embedding on every run in notebooks; persist or cache to speed up iterations.
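The caching advice above can be sketched as a small in-memory cache keyed by model name and a hash of the text. EmbeddingCache and fake_embed are hypothetical names: fake_embed is only a stand-in for a real call such as Settings.embed_model.get_text_embedding, and a persistent store would replace the dictionary in practice.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Hypothetical in-memory cache keyed by (model name, text hash)."""

    def __init__(
        self, model_name: str, embed_fn: Callable[[str], list[float]]
    ) -> None:
        self._model_name = model_name
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}
        self.misses = 0

    def _key(self, text: str) -> str:
        # Include the model name so switching models never reuses stale vectors.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"{self._model_name}:{digest}"

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text: str) -> list[float]:
    # Stand-in for a real embedding call; not a real model.
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache("fake-model", fake_embed)
cache.get("Solar panels reduce electricity bills")
cache.get("Solar panels reduce electricity bills")
print(cache.misses)  # 1
```

Keying on the model name as well as the text means a model change naturally invalidates old entries.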
def create_dense_embeddings(docs: list[Document]) -> list[Document]:
    for doc in docs:
        doc.embedding = Settings.embed_model.get_text_embedding(doc.text)
    return docs
Ingestion flow and locality guarantees
- We embed each doc with the configured embedding model (dense) and rely on the vector store to build the sparse representation.
- Writes use shard_identifier=tenant_id, ensuring documents live on the intended shard group.
Tip: For large batches, prefer the async ingestion APIs and chunk documents for backpressure control.
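The tip about chunking and backpressure can be sketched with a semaphore that caps the number of in-flight batches. Everything here is an assumption for illustration: ingest_in_batches and chunked are hypothetical helpers, and fake_upsert stands in for an async write such as the vector store's async_add call.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


def chunked(items: Sequence, size: int) -> list[Sequence]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i : i + size] for i in range(0, len(items), size)]


async def ingest_in_batches(
    items: Sequence,
    upsert: Callable[[Sequence], Awaitable[None]],
    batch_size: int = 2,
    max_in_flight: int = 2,
) -> int:
    """Upsert batches concurrently, never exceeding max_in_flight at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run(batch: Sequence) -> int:
        async with semaphore:
            await upsert(batch)
            return len(batch)

    counts = await asyncio.gather(*(run(b) for b in chunked(items, batch_size)))
    return sum(counts)


async def fake_upsert(batch: Sequence) -> None:
    # Stand-in for a real async write to the vector store.
    await asyncio.sleep(0)


total = asyncio.run(ingest_in_batches(list(range(5)), fake_upsert))
print(total)  # 5
```

Inside a notebook, where an event loop is already running, call `await ingest_in_batches(...)` directly instead of `asyncio.run(...)`.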
for tenant_id, docs in TENANT_DOCS.items():
    docs = create_dense_embeddings(docs)
    await vector_store.async_add(docs, shard_identifier=tenant_id)
Index wrapping and reusability
StorageContext.from_defaults(vector_store=vector_store) binds the Qdrant collection to LlamaIndex's VectorStoreIndex without re-ingesting data.
Benefits:
- Reuse the same physical collection for multiple retrievers or query pipelines.
- Swap retrieval modes (dense-only, sparse-only, hybrid) via retriever config, not data layout.
- Keep ingestion concerns (sharding, replication) decoupled from application query logic.
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_vector_store(
    vector_store, storage_context=storage_context
)
Multi-tenant retrieval
Use a tenant-scoped hybrid retriever and keep queries shard-local. You can also use metadata filters if you want to filter within the tenant's data.
Retrieval tips for hybrid mode
- Set vector_store_query_mode=HYBRID to combine dense and sparse. Tune similarity_top_k, sparse_top_k, and hybrid_top_k.
- Pass vector_store_kwargs={"shard_identifier": tenant_id} to keep queries within the tenant's shard.
- Add metadata filters (e.g., on tenant_id or tags) to further narrow candidates when needed.
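To build intuition for what combining dense and sparse means, here is a minimal sketch of one common approach: min-max normalize each result list, then take a weighted sum. This is an illustration only and not necessarily the fusion the vector store applies; the function names, the alpha weight, and the sample scores are all made up.

```python
def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max scale scores into [0, 1]; a single-valued list maps to 1.0."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def fuse(
    dense: dict[str, float],
    sparse: dict[str, float],
    alpha: float = 0.5,
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Weighted sum of normalized dense and sparse scores, best first."""
    dense_n, sparse_n = normalize(dense), normalize(sparse)
    fused = {
        doc_id: alpha * dense_n.get(doc_id, 0.0)
        + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Made-up scores: dense similarities and sparse scores live on different scales.
dense_hits = {"mesh": 0.82, "k8s": 0.61, "helm": 0.40}
sparse_hits = {"mesh": 7.5, "helm": 1.2}
print(fuse(dense_hits, sparse_hits, top_k=2))
```

In this picture, similarity_top_k and sparse_top_k control how many candidates each side contributes, while hybrid_top_k plays the role of top_k after fusion.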
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.vector_stores.types import VectorStoreQueryMode

def create_retriever_for_tenant(tenant_id: str) -> VectorIndexRetriever:
    if tenant_id not in shard_keys:
        raise ValueError(
            f"Unknown tenant_id: {tenant_id}. Expected one of {shard_keys}"
        )
    return VectorIndexRetriever(
        index=index,
        vector_store_query_mode=VectorStoreQueryMode.HYBRID,
        similarity_top_k=5,
        sparse_top_k=5,
        hybrid_top_k=5,
        vector_store_kwargs={"shard_identifier": tenant_id},
    )

tenant_id = "tenant_b"
retriever = create_retriever_for_tenant(tenant_id)

query = "manage microservices traffic and observability"
results = retriever.retrieve(query)

print(f"Tenant: {tenant_id} | Query: {query}")
for i, r in enumerate(results, 1):
    meta = r.node.metadata
    print(
        f"{i}. score={r.score:.4f} | tags={meta.get('tags')} | text={r.node.get_content()}"
    )

Tenant: tenant_b | Query: manage microservices traffic and observability
1. score=4.6271 | tags=['cloud', 'networking'] | text=Service meshes add observability and traffic management
2. score=0.1213 | tags=['cloud', 'k8s'] | text=Kubernetes orchestrates containers across a cluster
3. score=0.0000 | tags=['cloud', 'devops'] | text=Helm charts package and deploy Kubernetes applications
Interpreting results
- The printout shows the hybrid score, tags (metadata), and a snippet of the matched text.
- Verify tenant isolation by switching tenant_id and observing that results come only from that tenant's documents.
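The isolation check can also be automated rather than done by eye. The sketch below assumes each retrieved item exposes its metadata as r.node.metadata, as in the printing loop in this notebook; leaked_results and fake_result are hypothetical helpers, and fake_result only mimics that shape so the example runs without a cluster.

```python
from types import SimpleNamespace


def leaked_results(results: list, tenant_id: str) -> list:
    """Return retrieved items whose tenant_id metadata is not the expected tenant."""
    return [r for r in results if r.node.metadata.get("tenant_id") != tenant_id]


def fake_result(tenant: str, text: str) -> SimpleNamespace:
    # Stand-in shaped like a retrieved item (r.node.metadata); not a LlamaIndex class.
    return SimpleNamespace(
        node=SimpleNamespace(metadata={"tenant_id": tenant}, text=text)
    )


hits = [
    fake_result("tenant_b", "Service meshes add observability"),
    fake_result("tenant_a", "Solar panels reduce electricity bills"),  # a leak
]
print(len(leaked_results(hits, "tenant_b")))  # 1
```

With real retriever output, an empty list from leaked_results(results, tenant_id) indicates that every hit belongs to the queried tenant.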