# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License.
Bring-Your-Own Vector Store
This notebook demonstrates how to implement a custom vector store and register it for use with GraphRAG.
Overview
GraphRAG uses a plug-and-play architecture that allows easy integration of custom vector stores (beyond those natively supported) through a factory design pattern. This allows you to:
- Extend functionality: Add support for new vector database backends
- Customize behavior: Implement specialized search logic or data structures
- Integrate existing systems: Connect GraphRAG to your existing vector database infrastructure
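To make the factory idea concrete, here is a minimal sketch of a name-to-class registry. `SimpleRegistry` and `DummyStore` are hypothetical names invented for this illustration; this is not GraphRAG's actual factory, only the general pattern it follows.

```python
from typing import Any, Callable, Optional


class SimpleRegistry:
    """Hypothetical registry mapping a type name to a constructor."""

    def __init__(self) -> None:
        self._builders: dict[str, Callable[..., Any]] = {}

    def register(self, name: str, builder: Callable[..., Any]) -> None:
        # A later registration under the same name replaces the earlier one.
        self._builders[name] = builder

    def create(self, name: str, init_args: Optional[dict[str, Any]] = None) -> Any:
        if name not in self._builders:
            raise ValueError(f"Unknown type: {name!r}")
        return self._builders[name](**(init_args or {}))

    def __contains__(self, name: str) -> bool:
        return name in self._builders


class DummyStore:
    """Stand-in backend used only for this illustration."""

    def __init__(self, index_name: str = "default") -> None:
        self.index_name = index_name


registry = SimpleRegistry()
registry.register("dummy", DummyStore)
store = registry.create("dummy", {"index_name": "demo"})
print(type(store).__name__, store.index_name)
```

Callers only need the string key, so a new backend can be added without touching the code that requests a store.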
What You'll Learn
- Understanding the VectorStore interface
- Implementing a custom vector store class
- Registering your vector store with the VectorStoreFactory
- Testing and validating your implementation
- Configuring GraphRAG to use your custom vector store
Let's get started!
Step 1: Install Required Dependencies
First, let's install GraphRAG. The components and other dependencies we need are imported in the next step.
pip install graphrag
Step 2: Understand the VectorStore Interface
Before implementing a custom vector store, let's examine the VectorStore interface to understand which methods need to be implemented.
import inspect

# Let's inspect the VectorStore class to understand the required methods
from typing import Any

import numpy as np
import yaml
from graphrag_vectors import (
    IndexSchema,
    TextEmbedder,
    VectorStore,
    VectorStoreDocument,
    VectorStoreFactory,
    VectorStoreSearchResult,
)

print("VectorStore Abstract Methods:")
print("=" * 80)

abstract_methods = []
for name, method in inspect.getmembers(VectorStore, predicate=inspect.isfunction):
    if getattr(method, "__isabstractmethod__", False):
        abstract_methods.append(name)
        print(f"\n{name}:")
        print(f" {inspect.signature(method)}")

print(f"\nTotal abstract methods to implement: {len(abstract_methods)}")
VectorStore Abstract Methods:
================================================================================

connect:
 (self) -> None

create_index:
 (self) -> None

load_documents:
 (self, documents: list[graphrag_vectors.vector_store.VectorStoreDocument]) -> None

search_by_id:
 (self, id: str) -> graphrag_vectors.vector_store.VectorStoreDocument

similarity_search_by_vector:
 (self, query_embedding: list[float], k: int = 10) -> list[graphrag_vectors.vector_store.VectorStoreSearchResult]

Total abstract methods to implement: 5
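The `__isabstractmethod__` flag checked above is set by Python's `abc.abstractmethod` decorator. As a minimal sketch of what that means in practice, the hypothetical `MiniStore` interface below (a two-method stand-in, not the real VectorStore) shows that a subclass missing any abstract method cannot be instantiated.

```python
from abc import ABC, abstractmethod


class MiniStore(ABC):
    """Hypothetical two-method interface, standing in for a larger one."""

    @abstractmethod
    def connect(self) -> None: ...

    @abstractmethod
    def search_by_id(self, id: str) -> dict: ...


class IncompleteStore(MiniStore):
    def connect(self) -> None:
        pass
    # search_by_id is intentionally left unimplemented


class CompleteStore(MiniStore):
    def connect(self) -> None:
        pass

    def search_by_id(self, id: str) -> dict:
        return {"id": id}


try:
    IncompleteStore()
    instantiated = True
except TypeError as err:
    # Python refuses to build an object with unimplemented abstract methods
    instantiated = False
    print(f"Rejected: {err}")

print(sorted(MiniStore.__abstractmethods__))
```

The failure happens at construction time, so a missing method surfaces as soon as the factory tries to create the store rather than later during a query.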
Step 3: Implement a Custom Vector Store
Now let's implement a simple in-memory vector store as an example. This vector store will:
- Store documents and vectors in memory using Python data structures
- Support all required VectorStore methods
Note: This is a simplified example for demonstration. Production vector stores would typically use optimized libraries like FAISS, more sophisticated indexing, and persistent storage.
class SimpleInMemoryVectorStore(VectorStore):
    """A simple in-memory vector store implementation for demonstration purposes.

    This vector store stores documents and their embeddings in memory and provides
    basic similarity search functionality using cosine similarity.

    WARNING: This is for demonstration only - not suitable for production use.
    For production, consider using optimized vector databases like LanceDB,
    Azure AI Search, or other specialized vector stores.
    """

    # Internal storage for documents and vectors
    documents: dict[str, VectorStoreDocument]
    vectors: dict[str, np.ndarray]
    connected: bool

    def __init__(self, **kwargs: Any):
        """Initialize the in-memory vector store."""
        super().__init__(**kwargs)
        self.documents: dict[str, VectorStoreDocument] = {}
        self.vectors: dict[str, np.ndarray] = {}
        self.connected = False

    def connect(self, **kwargs: Any) -> None:
        """Connect to the vector store (simulated for in-memory store)."""
        print("Connecting to in-memory vector store...")
        self.connected = True
        print("Connected successfully!")

    def create_index(self, **kwargs: Any) -> None:
        """Create an index (simulated for in-memory store).

        In a real vector database, this would create the necessary data structures
        and indexes for efficient vector search.
        """
        print(f"Creating index: {self.index_name}")
        # For in-memory store, we just ensure our storage dictionaries are ready
        if not isinstance(self.documents, dict):
            self.documents = {}
        if not isinstance(self.vectors, dict):
            self.vectors = {}
        print("Index created successfully!")

    def load_documents(
        self, documents: list[VectorStoreDocument], overwrite: bool = False
    ) -> None:
        """Load documents into the vector store."""
        if overwrite:
            print("Clearing existing documents...")
            self.documents.clear()
            self.vectors.clear()

        print(f"Loading {len(documents)} documents...")
        for doc in documents:
            self.documents[doc.id] = doc
            if doc.vector:
                self.vectors[doc.id] = np.array(doc.vector)
        print(f"Successfully loaded {len(documents)} documents!")

    def similarity_search_by_vector(
        self, query_embedding: list[float], k: int = 10, **kwargs: Any
    ) -> list[VectorStoreSearchResult]:
        """Search for similar documents using a query vector."""
        if not self.vectors:
            return []

        query_vector = np.array(query_embedding)

        # Calculate cosine similarity for all documents
        similarities = []
        for doc_id, doc_vector in self.vectors.items():
            # Cosine similarity
            similarity = np.dot(query_vector, doc_vector) / (
                np.linalg.norm(query_vector) * np.linalg.norm(doc_vector)
            )
            similarities.append((doc_id, similarity))

        # Sort by similarity (highest first) and take top k
        similarities.sort(key=lambda x: x[1], reverse=True)
        top_results = similarities[:k]

        # Convert to search results
        results = []
        for doc_id, score in top_results:
            doc = self.documents[doc_id]
            results.append(VectorStoreSearchResult(document=doc, score=float(score)))
        return results

    def similarity_search_by_text(
        self,
        text: str,
        text_embedder: TextEmbedder,
        k: int = 10,
        **kwargs: Any,
    ) -> list[VectorStoreSearchResult]:
        """Search for similar documents using a text query."""
        # Embed the query text
        query_embedding = text_embedder(text)
        # Use vector search
        return self.similarity_search_by_vector(query_embedding, k, **kwargs)

    def search_by_id(self, id: str) -> VectorStoreDocument | None:
        """Retrieve a document by its ID."""
        return self.documents.get(id)
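The cosine computation above divides by the product of the two vector norms, which is undefined when either vector is all zeros. As a hedged sketch of one way to handle that, the standalone helper below returns 0.0 in that case (an assumption made for illustration, not something the interface prescribes) and uses `heapq.nlargest` to pick the top k without sorting every score. It is written in pure Python so it runs without NumPy; the function names are hypothetical.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat zero vectors as "not similar"
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


def top_k(
    query: list[float], vectors: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    scored = (
        (doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


vectors = {
    "same": [1.0, 0.0],
    "orthogonal": [0.0, 1.0],
    "opposite": [-1.0, 0.0],
    "zero": [0.0, 0.0],
}
ranked = top_k([2.0, 0.0], vectors, k=2)
print(ranked)
```

Whether a zero vector should score 0.0, be skipped, or raise an error is a design choice that depends on how your embeddings are produced.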
Step 4: Register the Custom Vector Store
Now let's register our custom vector store with the VectorStoreFactory so it can be used throughout GraphRAG.
# Register our custom vector store with a unique identifier
CUSTOM_VECTOR_STORE_TYPE = "simple_memory"
# Register the vector store class
VectorStoreFactory().register(CUSTOM_VECTOR_STORE_TYPE, SimpleInMemoryVectorStore)
print(f"✅ Registered custom vector store with type: '{CUSTOM_VECTOR_STORE_TYPE}'")
# Verify registration
available_types = VectorStoreFactory().keys()
print(f"\n📋 Available vector store types: {available_types}")
print(
f"🔍 Is our custom type supported? {CUSTOM_VECTOR_STORE_TYPE in VectorStoreFactory()}"
)
✅ Registered custom vector store with type: 'simple_memory'

📋 Available vector store types: ['simple_memory']
🔍 Is our custom type supported? True
Step 5: Test the Custom Vector Store
Let's create some sample data and test our custom vector store implementation.
# Create sample documents with mock embeddings
def create_mock_embedding(dimension: int = 384) -> list[float]:
    """Create a random embedding vector for testing."""
    return np.random.normal(0, 1, dimension).tolist()


# Sample documents
sample_documents = [
    VectorStoreDocument(
        id="doc_1",
        vector=create_mock_embedding(),
    ),
    VectorStoreDocument(
        id="doc_2",
        vector=create_mock_embedding(),
    ),
    VectorStoreDocument(
        id="doc_3",
        vector=create_mock_embedding(),
    ),
    VectorStoreDocument(
        id="doc_4",
        vector=create_mock_embedding(),
    ),
]
print(f"📝 Created {len(sample_documents)} sample documents")
📝 Created 4 sample documents
# Test creating vector store using the factory
schema = IndexSchema(index_name="test_collection")
# Create vector store instance using factory
vector_store = VectorStoreFactory().create(
CUSTOM_VECTOR_STORE_TYPE, {"index_schema": schema}
)
print(f"✅ Created vector store instance: {type(vector_store).__name__}")
print(f"📊 Initial document count: {len(vector_store.documents)}")
✅ Created vector store instance: SimpleInMemoryVectorStore
📊 Initial document count: 0
# Connect and load documents
vector_store.connect()
vector_store.create_index()
vector_store.load_documents(sample_documents)
print(f"📊 Documents loaded: {len(vector_store.documents)}")
Connecting to in-memory vector store...
Connected successfully!
Creating index: vector_index
Index created successfully!
Loading 4 documents...
Successfully loaded 4 documents!
📊 Documents loaded: 4
# Test similarity search
query_vector = create_mock_embedding() # Random query vector for testing
search_results = vector_store.similarity_search_by_vector(
query_vector,
k=3, # Get top 3 similar documents
)
print(f"🔍 Found {len(search_results)} similar documents:\n")
for i, result in enumerate(search_results, 1):
    doc = result.document
    print(f"{i}. ID: {doc.id}")
    print(f" Similarity Score: {result.score:.4f}")
    print()
🔍 Found 3 similar documents:

1. ID: doc_4
 Similarity Score: 0.0282

2. ID: doc_1
 Similarity Score: -0.0035

3. ID: doc_3
 Similarity Score: -0.0095
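The scores above are all close to zero, which is expected for random test data: independent Gaussian vectors in high dimensions are nearly orthogonal, with cosine similarities spread roughly within 1/sqrt(d) of zero. The sketch below illustrates this with a fixed seed and pure Python; it is an illustration of the statistical effect, not part of the vector store itself.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
DIM = 384  # same dimension as the mock embeddings used in this notebook


def random_vector(dim: int) -> list[float]:
    """Draw a vector of independent standard-normal components."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


scores = [cosine(random_vector(DIM), random_vector(DIM)) for _ in range(50)]
expected_spread = 1 / math.sqrt(DIM)

print(f"largest |score| over 50 pairs: {max(abs(s) for s in scores):.3f}")
print(f"rough expected spread 1/sqrt(d): {expected_spread:.3f}")
```

With real embeddings, related documents should score well above this noise floor, so near-zero scores here say nothing about the quality of the search logic.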
# Test search by ID (search_by_id returns None when the ID is unknown)
found_doc = vector_store.search_by_id("doc_2")
if found_doc is not None:
    print("✅ Found document by ID:")
    print(f" ID: {found_doc.id}")
else:
    print("❌ Document not found")
✅ Found document by ID:
 ID: doc_2
Step 6: Configuration for GraphRAG
Now let's see how you would configure GraphRAG to use your custom vector store in a settings file.
# Example GraphRAG yaml settings
example_settings = {
    "vector_store": {
        "default_vector_store": {
            "type": CUSTOM_VECTOR_STORE_TYPE,  # "simple_memory"
            "collection_name": "graphrag_entities",
            # Add any custom parameters your vector store needs
            "custom_parameter": "custom_value",
        }
    },
    # Other GraphRAG configuration...
    "models": {
        "default_embedding_model": {
            "type": "embedding",
            "model_provider": "openai",
            "model": "text-embedding-3-small",
        }
    },
}

# Convert to YAML format for settings.yml
yaml_config = yaml.dump(example_settings, default_flow_style=False, indent=2)
print("📄 Example settings.yml configuration:")
print("=" * 40)
print(yaml_config)
📄 Example settings.yml configuration:
========================================
models:
  default_embedding_model:
    model: text-embedding-3-small
    model_provider: openai
    type: embedding
vector_store:
  default_vector_store:
    collection_name: graphrag_entities
    custom_parameter: custom_value
    type: simple_memory
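One way to picture how such a settings block could reach your class: the `type` key selects the implementation, and the remaining keys are passed to the constructor as keyword arguments. The sketch below is an assumption about the general shape of that flow, not a description of GraphRAG's actual config loader; `ConfigurableStore` is a hypothetical class.

```python
from typing import Any


class ConfigurableStore:
    """Hypothetical store that reads its options from constructor kwargs."""

    def __init__(self, **kwargs: Any) -> None:
        self.collection_name = kwargs.get("collection_name", "default")
        self.custom_parameter = kwargs.get("custom_parameter")


settings = {
    "vector_store": {
        "default_vector_store": {
            "type": "simple_memory",
            "collection_name": "graphrag_entities",
            "custom_parameter": "custom_value",
        }
    }
}

# Copy the block so the original settings stay untouched
store_config = dict(settings["vector_store"]["default_vector_store"])
store_type = store_config.pop("type")  # selects the implementation
store = ConfigurableStore(**store_config)  # the rest becomes constructor kwargs

print(store_type, store.collection_name, store.custom_parameter)
```

Accepting `**kwargs` in `__init__`, as SimpleInMemoryVectorStore does, is what lets extra parameters pass through without changing the constructor signature.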
Step 7: Integration with GraphRAG Pipeline
Here's how your custom vector store would be used in a typical GraphRAG pipeline.
# Example of how GraphRAG would use your custom vector store
def simulate_graphrag_pipeline():
    """Simulate how GraphRAG would use the custom vector store."""
    print("🚀 Simulating GraphRAG pipeline with custom vector store...\n")

    # 1. GraphRAG creates vector store using factory
    schema = IndexSchema(index_name="graphrag_entities")
    store = VectorStoreFactory().create(
        CUSTOM_VECTOR_STORE_TYPE,
        {"index_schema": schema, "similarity_threshold": 0.3},
    )
    store.connect()
    store.create_index()
    print("✅ Step 1: Vector store created and connected")

    # 2. During indexing, GraphRAG loads extracted entities
    entity_documents = [
        VectorStoreDocument(
            id=f"entity_{i}",
            vector=create_mock_embedding(),
        )
        for i in range(10)
    ]
    store.load_documents(entity_documents)
    print(f"✅ Step 2: Loaded {len(entity_documents)} entity documents")

    # 3. During query time, GraphRAG searches for relevant entities
    query_embedding = create_mock_embedding()
    relevant_entities = store.similarity_search_by_vector(query_embedding, k=5)
    print(f"✅ Step 3: Found {len(relevant_entities)} relevant entities for query")

    # 4. GraphRAG uses these entities for context building
    context_entities = [result.document for result in relevant_entities]
    print("✅ Step 4: Context built using retrieved entities")

    # The store has no get_stats() method, so report the document count directly
    print(f"📊 Final document count: {len(store.documents)}")
    return context_entities


# Run the simulation
context = simulate_graphrag_pipeline()
print(f"\n🎯 Retrieved {len(context)} entities for context building")
🚀 Simulating GraphRAG pipeline with custom vector store...

Connecting to in-memory vector store...
Connected successfully!
Creating index: vector_index
Index created successfully!
✅ Step 1: Vector store created and connected
Loading 10 documents...
Successfully loaded 10 documents!
✅ Step 2: Loaded 10 entity documents
✅ Step 3: Found 5 relevant entities for query
✅ Step 4: Context built using retrieved entities
📊 Final document count: 10

🎯 Retrieved 5 entities for context building
Step 8: Testing and Validation
Let's create a test suite to ensure our vector store works correctly. The checks match the behavior of SimpleInMemoryVectorStore as implemented in Step 3: search_by_id returns None for an unknown ID, and searching a store with no documents returns an empty list.
def test_custom_vector_store():
    """Comprehensive test suite for the custom vector store."""
    print("🧪 Running comprehensive vector store tests...\n")

    # Test 1: Basic functionality
    print("Test 1: Basic functionality")
    store = VectorStoreFactory().create(
        CUSTOM_VECTOR_STORE_TYPE,
        {"index_schema": IndexSchema(index_name="test")},
    )
    store.connect()
    store.create_index()

    # Load test documents
    test_docs = sample_documents[:2]
    store.load_documents(test_docs)
    assert len(store.documents) == 2, "Should have 2 documents"
    assert len(store.vectors) == 2, "Should have 2 vectors"
    print("✅ Basic functionality test passed")

    # Test 2: Search functionality
    print("\nTest 2: Search functionality")
    query_vec = create_mock_embedding()
    results = store.similarity_search_by_vector(query_vec, k=5)
    assert len(results) <= 2, "Should not return more results than documents"
    assert all(isinstance(r, VectorStoreSearchResult) for r in results), (
        "Should return VectorStoreSearchResult objects"
    )
    assert all(-1 <= r.score <= 1 for r in results), (
        "Similarity scores should be between -1 and 1"
    )
    print("✅ Search functionality test passed")

    # Test 3: Search by ID
    print("\nTest 3: Search by ID")
    found_doc = store.search_by_id("doc_1")
    assert found_doc.id == "doc_1", "Should find correct document"
    # This implementation returns None for an unknown ID instead of raising
    assert store.search_by_id("nonexistent") is None, (
        "Should return None for nonexistent ID"
    )
    print("✅ Search by ID test passed")

    # Test 4: Empty store handling
    print("\nTest 4: Empty store handling")
    empty_store = VectorStoreFactory().create(
        CUSTOM_VECTOR_STORE_TYPE,
        {"index_schema": IndexSchema(index_name="test2")},
    )
    assert empty_store.similarity_search_by_vector(query_vec) == [], (
        "Should return no results when no documents are loaded"
    )
    assert empty_store.search_by_id("doc_1") is None, (
        "Should return None when no documents are loaded"
    )
    print("✅ Empty store handling test passed")

    print("\n🎉 All tests passed! Your custom vector store is working correctly.")


# Run the tests
test_custom_vector_store()
🧪 Running comprehensive vector store tests...

Test 1: Basic functionality
Connecting to in-memory vector store...
Connected successfully!
Creating index: vector_index
Index created successfully!
Loading 2 documents...
Successfully loaded 2 documents!
✅ Basic functionality test passed

Test 2: Search functionality
✅ Search functionality test passed

Test 3: Search by ID
✅ Search by ID test passed

Test 4: Empty store handling
✅ Empty store handling test passed

🎉 All tests passed! Your custom vector store is working correctly.
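Random embeddings only let a test check shapes and score ranges. A complementary approach is to use hand-picked vectors whose ranking is known in advance. The sketch below shows the idea against `TinyStore`, a hypothetical stand-in written in pure Python so it runs without GraphRAG; the same assertions could be adapted to any store that returns scored results.

```python
import math


class TinyStore:
    """Hypothetical stand-in exposing a load/search pair for illustration."""

    def __init__(self) -> None:
        self.vectors: dict[str, list[float]] = {}

    def load(self, vectors: dict[str, list[float]]) -> None:
        self.vectors.update(vectors)

    def search(self, query: list[float], k: int = 10) -> list[tuple[str, float]]:
        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))

        scored = [(doc_id, cosine(query, vec)) for doc_id, vec in self.vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


def check_ranking(store: TinyStore) -> None:
    """Assert the ranking for vectors with known geometry."""
    store.load({"east": [1.0, 0.0], "north": [0.0, 1.0], "west": [-1.0, 0.0]})
    results = store.search([1.0, 0.0], k=2)
    ids = [doc_id for doc_id, _ in results]
    scores = [score for _, score in results]

    assert ids == ["east", "north"], ids  # identical direction ranks first
    assert scores == sorted(scores, reverse=True)  # best score first
    assert math.isclose(scores[0], 1.0)
    assert len(results) == 2  # k is respected


check_ranking(TinyStore())
print("ranking checks passed")
```

Because the expected order is fixed by geometry, a failure here points at the search logic rather than at unlucky random data.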
Summary and Next Steps
Congratulations! You've successfully learned how to implement and register a custom vector store with GraphRAG. Here's what you accomplished:
What You Built
- ✅ Custom Vector Store Class: Implemented SimpleInMemoryVectorStore with all required methods
- ✅ Factory Integration: Registered your vector store with VectorStoreFactory
- ✅ Comprehensive Testing: Validated functionality with a full test suite
- ✅ Configuration Examples: Learned how to configure GraphRAG to use your vector store
Key Takeaways
- Interface Compliance: Always implement all abstract methods from VectorStore
- Factory Pattern: Use VectorStoreFactory().register() to make your vector store available
- Testing: Validate your implementation thoroughly before production use
- Configuration: Use YAML or environment variables for flexible configuration
Production Considerations
For production use, consider:
- Persistence: Add data persistence mechanisms
- Scalability: Use optimized vector search libraries (FAISS, HNSW)
- Error Handling: Implement robust error handling and logging
- Performance: Add caching, batching, and connection pooling
- Security: Implement authentication and authorization
- Monitoring: Add metrics and health checks
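As one small illustration of the persistence item, the sketch below saves and restores a dictionary of vectors as JSON, writing to a temporary file first so a crash mid-write does not leave a truncated file behind. The helper names and file layout are hypothetical, and a real backend would more likely rely on its database's own storage.

```python
import json
import tempfile
from pathlib import Path


def save_vectors(path: Path, vectors: dict[str, list[float]]) -> None:
    """Write vectors to a temp file, then swap it into place."""
    tmp_path = path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(vectors))
    tmp_path.replace(path)


def load_vectors(path: Path) -> dict[str, list[float]]:
    """Return the stored vectors, or an empty dict if nothing was saved yet."""
    if not path.exists():
        return {}
    return json.loads(path.read_text())


with tempfile.TemporaryDirectory() as tmp_dir:
    store_path = Path(tmp_dir) / "vectors.json"
    original = {"doc_1": [0.1, 0.2], "doc_2": [0.3, 0.4]}

    save_vectors(store_path, original)
    restored = load_vectors(store_path)
    missing = load_vectors(Path(tmp_dir) / "absent.json")

print(restored == original, missing)
```

A store could call helpers like these from `load_documents` and `connect`, trading some write latency for data that survives a restart.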
Happy building! 🚀