# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License.
Bring-Your-Own Vector Store¶
This notebook demonstrates how to implement a custom vector store and register it for use with GraphRAG.
Overview¶
GraphRAG uses a plug-and-play architecture that allows for easy integration of custom vector stores (beyond what is natively supported) by following a factory design pattern (previewed in the snippet after this list). This allows you to:
- Extend functionality: Add support for new vector database backends
- Customize behavior: Implement specialized search logic or data structures
- Integrate existing systems: Connect GraphRAG to your existing vector database infrastructure
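In practice the whole integration boils down to two factory calls, covered in detail in Steps 3 and 4. Here is a minimal preview; `MyVectorStore` is a placeholder for the class you will implement later in this notebook.
# Preview of the factory workflow (details in Steps 3-4).
# `MyVectorStore` is a placeholder for the BaseVectorStore subclass you will implement.
from graphrag.vector_stores.factory import VectorStoreFactory

# VectorStoreFactory.register("my_store", MyVectorStore)
# store = VectorStoreFactory.create_vector_store("my_store", {"collection_name": "entities"})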
What You'll Learn¶
- Understanding the BaseVectorStore interface
- Implementing a custom vector store class
- Registering your vector store with the VectorStoreFactory
- Testing and validating your implementation
- Configuring GraphRAG to use your custom vector store
Let's get started!
Step 1: Import Required Dependencies¶
First, let's import the necessary GraphRAG components and other dependencies we'll need.
%pip install graphrag
from typing import Any
import numpy as np
import yaml
from graphrag.data_model.types import TextEmbedder
# GraphRAG vector store components
from graphrag.vector_stores.base import (
    BaseVectorStore,
    VectorStoreDocument,
    VectorStoreSearchResult,
)
from graphrag.vector_stores.factory import VectorStoreFactory
Step 2: Understand the BaseVectorStore Interface¶
Before implementing a custom vector store, let's examine the BaseVectorStore interface to understand what methods need to be implemented.
# Let's inspect the BaseVectorStore class to understand the required methods
import inspect
print("BaseVectorStore Abstract Methods:")
print("=" * 40)
abstract_methods = []
for name, method in inspect.getmembers(BaseVectorStore, predicate=inspect.isfunction):
    if getattr(method, "__isabstractmethod__", False):
        signature = inspect.signature(method)
        abstract_methods.append(f"• {name}{signature}")
        print(f"• {name}{signature}")
print(f"\nTotal abstract methods to implement: {len(abstract_methods)}")
BaseVectorStore Abstract Methods:
========================================
• connect(self, **kwargs: Any) -> None
• filter_by_id(self, include_ids: list[str] | list[int]) -> Any
• load_documents(self, documents: list[graphrag.vector_stores.base.VectorStoreDocument], overwrite: bool = True) -> None
• search_by_id(self, id: str) -> graphrag.vector_stores.base.VectorStoreDocument
• similarity_search_by_text(self, text: str, text_embedder: collections.abc.Callable[[str], list[float]], k: int = 10, **kwargs: Any) -> list[graphrag.vector_stores.base.VectorStoreSearchResult]
• similarity_search_by_vector(self, query_embedding: list[float], k: int = 10, **kwargs: Any) -> list[graphrag.vector_stores.base.VectorStoreSearchResult]
Total abstract methods to implement: 6
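As a starting point for your own backend, the interface above can be stubbed out like this. This is a bare skeleton only, reusing the imports from Step 1; the method bodies are placeholders to fill in.
# Skeleton subclass: override the six abstract methods listed above.
# Bodies are placeholders (...) to be replaced with real logic.
class MyVectorStore(BaseVectorStore):
    def connect(self, **kwargs: Any) -> None: ...

    def load_documents(
        self, documents: list[VectorStoreDocument], overwrite: bool = True
    ) -> None: ...

    def filter_by_id(self, include_ids: list[str] | list[int]) -> Any: ...

    def search_by_id(self, id: str) -> VectorStoreDocument: ...

    def similarity_search_by_text(
        self, text: str, text_embedder: TextEmbedder, k: int = 10, **kwargs: Any
    ) -> list[VectorStoreSearchResult]: ...

    def similarity_search_by_vector(
        self, query_embedding: list[float], k: int = 10, **kwargs: Any
    ) -> list[VectorStoreSearchResult]: ...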
Step 3: Implement a Custom Vector Store¶
Now let's implement a simple in-memory vector store as an example. This vector store will:
- Store documents and vectors in memory using Python data structures
- Support all required BaseVectorStore methods
Note: This is a simplified example for demonstration. Production vector stores would typically use optimized libraries like FAISS, more sophisticated indexing, and persistent storage.
class SimpleInMemoryVectorStore(BaseVectorStore):
"""A simple in-memory vector store implementation for demonstration purposes.
This vector store stores documents and their embeddings in memory and provides
basic similarity search functionality using cosine similarity.
WARNING: This is for demonstration only - not suitable for production use.
For production, consider using optimized vector databases like LanceDB,
Azure AI Search, or other specialized vector stores.
"""
# Internal storage for documents and vectors
documents: dict[str, VectorStoreDocument]
vectors: dict[str, np.ndarray]
connected: bool
def __init__(self, **kwargs: Any):
"""Initialize the in-memory vector store."""
super().__init__(**kwargs)
self.documents: dict[str, VectorStoreDocument] = {}
self.vectors: dict[str, np.ndarray] = {}
self.connected = False
print(
f"🚀 SimpleInMemoryVectorStore initialized for collection: {self.collection_name}"
)
def connect(self, **kwargs: Any) -> None:
"""Connect to the vector storage (no-op for in-memory store)."""
self.connected = True
print(f"✅ Connected to in-memory vector store: {self.collection_name}")
def load_documents(
self, documents: list[VectorStoreDocument], overwrite: bool = True
) -> None:
"""Load documents into the vector store."""
if not self.connected:
msg = "Vector store not connected. Call connect() first."
raise RuntimeError(msg)
if overwrite:
self.documents.clear()
self.vectors.clear()
loaded_count = 0
for doc in documents:
if doc.vector is not None:
doc_id = str(doc.id)
self.documents[doc_id] = doc
self.vectors[doc_id] = np.array(doc.vector, dtype=np.float32)
loaded_count += 1
print(f"📚 Loaded {loaded_count} documents into vector store")
def _cosine_similarity(self, vec1: np.ndarray, vec2: np.ndarray) -> float:
"""Calculate cosine similarity between two vectors."""
# Normalize vectors
norm1 = np.linalg.norm(vec1)
norm2 = np.linalg.norm(vec2)
if norm1 == 0 or norm2 == 0:
return 0.0
return float(np.dot(vec1, vec2) / (norm1 * norm2))
def similarity_search_by_vector(
self, query_embedding: list[float], k: int = 10, **kwargs: Any
) -> list[VectorStoreSearchResult]:
"""Perform similarity search using a query vector."""
if not self.connected:
msg = "Vector store not connected. Call connect() first."
raise RuntimeError(msg)
if not self.vectors:
return []
query_vec = np.array(query_embedding, dtype=np.float32)
similarities = []
# Calculate similarity with all stored vectors
for doc_id, stored_vec in self.vectors.items():
similarity = self._cosine_similarity(query_vec, stored_vec)
similarities.append((doc_id, similarity))
# Sort by similarity (descending) and take top k
similarities.sort(key=lambda x: x[1], reverse=True)
top_k = similarities[:k]
# Create search results
results = []
for doc_id, score in top_k:
document = self.documents[doc_id]
result = VectorStoreSearchResult(document=document, score=score)
results.append(result)
return results
def similarity_search_by_text(
self, text: str, text_embedder: TextEmbedder, k: int = 10, **kwargs: Any
) -> list[VectorStoreSearchResult]:
"""Perform similarity search using text (which gets embedded first)."""
# Embed the text first
query_embedding = text_embedder(text)
# Use vector search with the embedding
return self.similarity_search_by_vector(query_embedding, k, **kwargs)
def filter_by_id(self, include_ids: list[str] | list[int]) -> Any:
"""Build a query filter to filter documents by id.
For this simple implementation, we return the list of IDs as the filter.
"""
return [str(id_) for id_ in include_ids]
def search_by_id(self, id: str) -> VectorStoreDocument:
"""Search for a document by id."""
doc_id = str(id)
if doc_id not in self.documents:
msg = f"Document with id '{id}' not found"
raise KeyError(msg)
return self.documents[doc_id]
def get_stats(self) -> dict[str, Any]:
"""Get statistics about the vector store (custom method)."""
return {
"collection_name": self.collection_name,
"document_count": len(self.documents),
"vector_count": len(self.vectors),
"connected": self.connected,
"vector_dimension": len(next(iter(self.vectors.values())))
if self.vectors
else 0,
}
print("✅ SimpleInMemoryVectorStore class defined!")
✅ SimpleInMemoryVectorStore class defined!
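As the note above says, a production store would hand similarity search to an optimized index rather than a Python loop. A minimal sketch of what that swap could look like, assuming the faiss package is installed (the helper below is hypothetical and not part of GraphRAG):
# Hypothetical helper showing how the brute-force loop could be replaced by FAISS.
import faiss  # assumed extra dependency, not installed by graphrag
import numpy as np


def build_faiss_index(vectors: dict[str, np.ndarray]) -> tuple[faiss.Index, list[str]]:
    """Pack stored vectors into an exact inner-product FAISS index (cosine after L2-normalization)."""
    ids = list(vectors.keys())
    matrix = np.stack([vectors[i] for i in ids]).astype(np.float32)
    faiss.normalize_L2(matrix)                  # normalize so inner product == cosine similarity
    index = faiss.IndexFlatIP(matrix.shape[1])  # exact inner-product index over the vector dimension
    index.add(matrix)
    return index, ids


# Inside similarity_search_by_vector you would then do roughly:
# query = np.array([query_embedding], dtype=np.float32)
# faiss.normalize_L2(query)
# scores, positions = index.search(query, k)
# and map each position back to a document id via `ids`.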
Step 4: Register the Custom Vector Store¶
Now let's register our custom vector store with the VectorStoreFactory so it can be used throughout GraphRAG.
# Register our custom vector store with a unique identifier
CUSTOM_VECTOR_STORE_TYPE = "simple_memory"
# Register the vector store class
VectorStoreFactory.register(CUSTOM_VECTOR_STORE_TYPE, SimpleInMemoryVectorStore)
print(f"✅ Registered custom vector store with type: '{CUSTOM_VECTOR_STORE_TYPE}'")
# Verify registration
available_types = VectorStoreFactory.get_vector_store_types()
print(f"\n📋 Available vector store types: {available_types}")
print(
f"🔍 Is our custom type supported? {VectorStoreFactory.is_supported_type(CUSTOM_VECTOR_STORE_TYPE)}"
)
✅ Registered custom vector store with type: 'simple_memory' 📋 Available vector store types: ['lancedb', 'azure_ai_search', 'cosmosdb', 'simple_memory'] 🔍 Is our custom type supported? True
Step 5: Test the Custom Vector Store¶
Let's create some sample data and test our custom vector store implementation.
# Create sample documents with mock embeddings
def create_mock_embedding(dimension: int = 384) -> list[float]:
    """Create a random embedding vector for testing."""
    return np.random.normal(0, 1, dimension).tolist()


# Sample documents
sample_documents = [
    VectorStoreDocument(
        id="doc_1",
        text="GraphRAG is a powerful knowledge graph extraction and reasoning framework.",
        vector=create_mock_embedding(),
        attributes={"category": "technology", "source": "documentation"},
    ),
    VectorStoreDocument(
        id="doc_2",
        text="Vector stores enable efficient similarity search over high-dimensional data.",
        vector=create_mock_embedding(),
        attributes={"category": "technology", "source": "research"},
    ),
    VectorStoreDocument(
        id="doc_3",
        text="Machine learning models can process and understand natural language text.",
        vector=create_mock_embedding(),
        attributes={"category": "AI", "source": "article"},
    ),
    VectorStoreDocument(
        id="doc_4",
        text="Custom implementations allow for specialized behavior and integration.",
        vector=create_mock_embedding(),
        attributes={"category": "development", "source": "tutorial"},
    ),
]
print(f"📝 Created {len(sample_documents)} sample documents")
📝 Created 4 sample documents
# Test creating vector store using the factory
vector_store_config = {"collection_name": "test_collection"}
# Create vector store instance using factory
vector_store = VectorStoreFactory.create_vector_store(
    CUSTOM_VECTOR_STORE_TYPE, vector_store_config
)
print(f"✅ Created vector store instance: {type(vector_store).__name__}")
print(f"📊 Initial stats: {vector_store.get_stats()}")
🚀 SimpleInMemoryVectorStore initialized for collection: test_collection ✅ Created vector store instance: SimpleInMemoryVectorStore 📊 Initial stats: {'collection_name': 'test_collection', 'document_count': 0, 'vector_count': 0, 'connected': False, 'vector_dimension': 0}
# Connect and load documents
vector_store.connect()
vector_store.load_documents(sample_documents)
print(f"📊 Updated stats: {vector_store.get_stats()}")
✅ Connected to in-memory vector store: test_collection 📚 Loaded 4 documents into vector store 📊 Updated stats: {'collection_name': 'test_collection', 'document_count': 4, 'vector_count': 4, 'connected': True, 'vector_dimension': 384}
# Test similarity search
query_vector = create_mock_embedding() # Random query vector for testing
search_results = vector_store.similarity_search_by_vector(
    query_vector,
    k=3,  # Get top 3 similar documents
)

print(f"🔍 Found {len(search_results)} similar documents:\n")

for i, result in enumerate(search_results, 1):
    doc = result.document
    print(f"{i}. ID: {doc.id}")
    print(f"   Text: {doc.text[:60]}...")
    print(f"   Similarity Score: {result.score:.4f}")
    print(f"   Category: {doc.attributes.get('category', 'N/A')}")
    print()
🔍 Found 3 similar documents: 1. ID: doc_3 Text: Machine learning models can process and understand natural l... Similarity Score: 0.0746 Category: AI 2. ID: doc_4 Text: Custom implementations allow for specialized behavior and in... Similarity Score: 0.0292 Category: development 3. ID: doc_1 Text: GraphRAG is a powerful knowledge graph extraction and reason... Similarity Score: 0.0022 Category: technology
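The text-based search path isn't exercised above, so here is a quick check using a stand-in embedder (it simply reuses create_mock_embedding in place of a real embedding model):
# Test similarity search by text with a stand-in embedder
def mock_text_embedder(text: str) -> list[float]:
    """Stand-in for a real embedding model; ignores the text and returns a random vector."""
    return create_mock_embedding()


text_results = vector_store.similarity_search_by_text(
    "knowledge graph frameworks", mock_text_embedder, k=2
)
print(f"🔍 Text query returned {len(text_results)} results")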
# Test search by ID
try:
    found_doc = vector_store.search_by_id("doc_2")
    print("✅ Found document by ID:")
    print(f"   ID: {found_doc.id}")
    print(f"   Text: {found_doc.text}")
    print(f"   Attributes: {found_doc.attributes}")
except KeyError as e:
    print(f"❌ Error: {e}")
# Test filter by ID
id_filter = vector_store.filter_by_id(["doc_1", "doc_3"])
print(f"\n🔧 ID filter result: {id_filter}")
✅ Found document by ID: ID: doc_2 Text: Vector stores enable efficient similarity search over high-dimensional data. Attributes: {'category': 'technology', 'source': 'research'} 🔧 ID filter result: ['doc_1', 'doc_3']
Step 6: Configuration for GraphRAG¶
Now let's see how you would configure GraphRAG to use your custom vector store in a settings file.
# Example GraphRAG yaml settings
example_settings = {
    "vector_store": {
        "default_vector_store": {
            "type": CUSTOM_VECTOR_STORE_TYPE,  # "simple_memory"
            "collection_name": "graphrag_entities",
            # Add any custom parameters your vector store needs
            "custom_parameter": "custom_value",
        }
    },
    # Other GraphRAG configuration...
    "models": {
        "default_embedding_model": {
            "type": "openai_embedding",
            "model": "text-embedding-3-small",
        }
    },
}
# Convert to YAML format for settings.yml
yaml_config = yaml.dump(example_settings, default_flow_style=False, indent=2)
print("📄 Example settings.yml configuration:")
print("=" * 40)
print(yaml_config)
📄 Example settings.yml configuration:
========================================
models:
  default_embedding_model:
    model: text-embedding-3-small
    type: openai_embedding
vector_store:
  default_vector_store:
    collection_name: graphrag_entities
    custom_parameter: custom_value
    type: simple_memory
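One practical caveat not shown above: VectorStoreFactory.register() only affects the Python process that runs it, so whatever script later drives GraphRAG with this settings file must perform the registration first. A small sketch (the settings.yaml path is illustrative):
# Persist the example configuration, and remember to re-register the custom
# store in any process that will run GraphRAG against it.
from pathlib import Path

Path("settings.yaml").write_text(yaml_config, encoding="utf-8")  # illustrative location
VectorStoreFactory.register(CUSTOM_VECTOR_STORE_TYPE, SimpleInMemoryVectorStore)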
Step 7: Integration with GraphRAG Pipeline¶
Here's how your custom vector store would be used in a typical GraphRAG pipeline.
# Example of how GraphRAG would use your custom vector store
def simulate_graphrag_pipeline():
    """Simulate how GraphRAG would use the custom vector store."""
    print("🚀 Simulating GraphRAG pipeline with custom vector store...\n")

    # 1. GraphRAG creates vector store using factory
    config = {"collection_name": "graphrag_entities", "similarity_threshold": 0.3}
    store = VectorStoreFactory.create_vector_store(CUSTOM_VECTOR_STORE_TYPE, config)
    store.connect()
    print("✅ Step 1: Vector store created and connected")

    # 2. During indexing, GraphRAG loads extracted entities
    entity_documents = [
        VectorStoreDocument(
            id=f"entity_{i}",
            text=f"Entity {i} description: Important concept in the knowledge graph",
            vector=create_mock_embedding(),
            attributes={"type": "entity", "importance": i % 3 + 1},
        )
        for i in range(10)
    ]
    store.load_documents(entity_documents)
    print(f"✅ Step 2: Loaded {len(entity_documents)} entity documents")

    # 3. During query time, GraphRAG searches for relevant entities
    query_embedding = create_mock_embedding()
    relevant_entities = store.similarity_search_by_vector(query_embedding, k=5)
    print(f"✅ Step 3: Found {len(relevant_entities)} relevant entities for query")

    # 4. GraphRAG uses these entities for context building
    context_entities = [result.document for result in relevant_entities]
    print("✅ Step 4: Context built using retrieved entities")

    print(f"📊 Final stats: {store.get_stats()}")
    return context_entities
# Run the simulation
context = simulate_graphrag_pipeline()
print(f"\n🎯 Retrieved {len(context)} entities for context building")
🚀 Simulating GraphRAG pipeline with custom vector store... 🚀 SimpleInMemoryVectorStore initialized for collection: graphrag_entities ✅ Connected to in-memory vector store: graphrag_entities ✅ Step 1: Vector store created and connected 📚 Loaded 10 documents into vector store ✅ Step 2: Loaded 10 entity documents ✅ Step 3: Found 5 relevant entities for query ✅ Step 4: Context built using retrieved entities 📊 Final stats: {'collection_name': 'graphrag_entities', 'document_count': 10, 'vector_count': 10, 'connected': True, 'vector_dimension': 384} 🎯 Retrieved 5 entities for context building
Step 8: Testing and Validation¶
Let's create a comprehensive test suite to ensure our vector store works correctly.
def test_custom_vector_store():
    """Comprehensive test suite for the custom vector store."""
    print("🧪 Running comprehensive vector store tests...\n")

    # Test 1: Basic functionality
    print("Test 1: Basic functionality")
    store = VectorStoreFactory.create_vector_store(
        CUSTOM_VECTOR_STORE_TYPE, {"collection_name": "test"}
    )
    store.connect()

    # Load test documents
    test_docs = sample_documents[:2]
    store.load_documents(test_docs)

    assert len(store.documents) == 2, "Should have 2 documents"
    assert len(store.vectors) == 2, "Should have 2 vectors"
    print("✅ Basic functionality test passed")

    # Test 2: Search functionality
    print("\nTest 2: Search functionality")
    query_vec = create_mock_embedding()
    results = store.similarity_search_by_vector(query_vec, k=5)

    assert len(results) <= 2, "Should not return more results than documents"
    assert all(isinstance(r, VectorStoreSearchResult) for r in results), (
        "Should return VectorStoreSearchResult objects"
    )
    assert all(-1 <= r.score <= 1 for r in results), (
        "Similarity scores should be between -1 and 1"
    )
    print("✅ Search functionality test passed")

    # Test 3: Search by ID
    print("\nTest 3: Search by ID")
    found_doc = store.search_by_id("doc_1")
    assert found_doc.id == "doc_1", "Should find correct document"

    try:
        store.search_by_id("nonexistent")
        assert False, "Should raise KeyError for nonexistent ID"
    except KeyError:
        pass  # Expected
    print("✅ Search by ID test passed")

    # Test 4: Filter functionality
    print("\nTest 4: Filter functionality")
    filter_result = store.filter_by_id(["doc_1", "doc_2"])
    assert filter_result == ["doc_1", "doc_2"], "Should return filtered IDs"
    print("✅ Filter functionality test passed")

    # Test 5: Error handling
    print("\nTest 5: Error handling")
    disconnected_store = VectorStoreFactory.create_vector_store(
        CUSTOM_VECTOR_STORE_TYPE, {"collection_name": "test2"}
    )

    try:
        disconnected_store.load_documents(test_docs)
        assert False, "Should raise error when not connected"
    except RuntimeError:
        pass  # Expected

    try:
        disconnected_store.similarity_search_by_vector(query_vec)
        assert False, "Should raise error when not connected"
    except RuntimeError:
        pass  # Expected
    print("✅ Error handling test passed")

    print("\n🎉 All tests passed! Your custom vector store is working correctly.")
# Run the tests
test_custom_vector_store()
🧪 Running comprehensive vector store tests... Test 1: Basic functionality 🚀 SimpleInMemoryVectorStore initialized for collection: test ✅ Connected to in-memory vector store: test 📚 Loaded 2 documents into vector store ✅ Basic functionality test passed Test 2: Search functionality ✅ Search functionality test passed Test 3: Search by ID ✅ Search by ID test passed Test 4: Filter functionality ✅ Filter functionality test passed Test 5: Error handling 🚀 SimpleInMemoryVectorStore initialized for collection: test2 ✅ Error handling test passed 🎉 All tests passed! Your custom vector store is working correctly.
Summary and Next Steps¶
Congratulations! You've successfully learned how to implement and register a custom vector store with GraphRAG. Here's what you accomplished:
What You Built¶
- ✅ Custom Vector Store Class: Implemented SimpleInMemoryVectorStore with all required methods
- ✅ Factory Integration: Registered your vector store with VectorStoreFactory
- ✅ Comprehensive Testing: Validated functionality with a full test suite
- ✅ Configuration Examples: Learned how to configure GraphRAG to use your vector store
Key Takeaways¶
- Interface Compliance: Always implement all methods from BaseVectorStore
- Factory Pattern: Use VectorStoreFactory.register() to make your vector store available
- Configuration: Vector stores are configured in GraphRAG settings files
- Testing: Thoroughly test all functionality before deploying
Next Steps¶
Check out the API Overview notebook to learn how to index and query data via the graphrag API.
Happy building! 🚀