# diskannpy

## Documentation Overview

`diskannpy` is mostly structured around 2 distinct processes: [Index Builder Functions](#index-builders) and [Search Classes](#search-classes).

It also includes a few nascent [utilities](#utilities).

And lastly, it makes substantial use of type hints, with various shorthand [type aliases](#parameter-and-response-type-aliases) documented. When reading the `diskannpy` code we refer to the type aliases, though `pdoc` helpfully expands them.

### Index Builders
- `build_disk_index` - To build an index that cannot fully fit into memory when searching
- `build_memory_index` - To build an index that can fully fit into memory when searching

### Search Classes
- `StaticMemoryIndex` - for indices that can fully fit in memory and won't be changed during the search operations
- `StaticDiskIndex` - for indices that cannot fully fit in memory, thus relying on disk IO to search, and also won't be changed during search operations
- `DynamicMemoryIndex` - for indices that can fully fit in memory and will be mutated via insert/deletion operations as well as search operations

### Parameter Defaults
- `diskannpy.defaults` - Default values exported from the C++ extension for Python users

### Parameter and Response Type Aliases
- `DistanceMetric` - What distance metrics does `diskannpy` support?
- `VectorDType` - What vector datatypes does `diskannpy` support?
- `QueryResponse` - What can I expect as a response to my search?
- `QueryResponseBatch` - What can I expect as a response to my batch search?
- `VectorIdentifier` - What types does `diskannpy` support as vector identifiers?
- `VectorIdentifierBatch` - A batch of identifiers, which must **all** be of the exact same type.
- `VectorLike` - How a vector must look to `diskannpy`, to be inserted or searched with.
- `VectorLikeBatch` - A batch of those vectors, to be inserted or searched with.
- `Metadata` - DiskANN vector binary file metadata (num_points, vector_dim)

### Utilities
- `vectors_to_file` - Turns a 2 dimensional `numpy.typing.NDArray[VectorDType]` with shape `(number_of_points, vector_dim)` into a DiskANN vector bin file.
- `vectors_from_file` - Reads a DiskANN vector bin file representing stored vectors into a numpy ndarray.
- `vectors_metadata_from_file` - Reads the metadata stored in a DiskANN vector bin file without reading the entire file.
- `tags_to_file` - Turns a 1 dimensional `numpy.typing.NDArray[VectorIdentifier]` into a DiskANN tags bin file.
- `tags_from_file` - Reads a DiskANN tags bin file representing stored tags into a numpy ndarray.
- `valid_dtype` - Checks whether a given vector dtype is supported by `diskannpy`.
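As a quick orientation before the API reference below, here is a minimal end-to-end sketch that builds a small in-memory index and immediately searches it. The directory path, random data, and parameter values are illustrative assumptions only, not recommendations; note the index directory must already exist.

```python
import numpy as np

import diskannpy

# 10,000 random float32 vectors of dimension 128 stand in for real data;
# diskannpy indices are built over float32, int8, or uint8 vectors.
rng = np.random.default_rng(1234)
vectors = rng.random((10_000, 128), dtype=np.float32)

diskannpy.build_memory_index(
    data=vectors,
    distance_metric="l2",
    index_directory="/tmp/diskannpy_demo",  # illustrative path; must already exist
    complexity=64,
    graph_degree=32,
    num_threads=0,  # 0 = use all available logical processors
)

index = diskannpy.StaticMemoryIndex(
    index_directory="/tmp/diskannpy_demo",
    num_threads=0,
    initial_search_complexity=64,
)

# QueryResponse is a NamedTuple of (identifiers, distances), so it unpacks.
identifiers, distances = index.search(vectors[0], k_neighbors=10, complexity=64)
```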
Module source (the module docstring is the overview rendered above):

```python
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

from typing import Any, Literal, NamedTuple, Type, Union

import numpy as np
from numpy import typing as npt

DistanceMetric = Literal["l2", "mips", "cosine"]
""" Type alias for one of {"l2", "mips", "cosine"} """
VectorDType = Union[Type[np.float32], Type[np.int8], Type[np.uint8]]
""" Type alias for one of {`numpy.float32`, `numpy.int8`, `numpy.uint8`} """
VectorLike = npt.NDArray[VectorDType]
""" Type alias for something that can be treated as a vector """
VectorLikeBatch = npt.NDArray[VectorDType]
""" Type alias for a batch of VectorLikes """
VectorIdentifier = np.uint32
"""
Type alias for a vector identifier, whether it be an implicit array index identifier from StaticMemoryIndex or
StaticDiskIndex, or an explicit tag identifier from DynamicMemoryIndex
"""
VectorIdentifierBatch = npt.NDArray[np.uint32]
""" Type alias for a batch of VectorIdentifiers """


class QueryResponse(NamedTuple):
    """
    Tuple with two values, identifiers and distances. Both are 1d arrays, positionally correspond, and will contain
    the nearest neighbors from [0..k_neighbors)
    """

    identifiers: npt.NDArray[VectorIdentifier]
    """ A `numpy.typing.NDArray[VectorIdentifier]` array of vector identifiers, 1 dimensional """
    distances: npt.NDArray[np.float32]
    """
    A `numpy.typing.NDArray[numpy.float32]` of distances as calculated by the distance metric function, 1 dimensional
    """


class QueryResponseBatch(NamedTuple):
    """
    Tuple with two values, identifiers and distances. Both are 2d arrays, with dimensionality determined by the
    rows corresponding to the number of queries made, and the columns corresponding to the k neighbors
    requested. The two 2d arrays have an implicit, position-based relationship
    """

    identifiers: npt.NDArray[VectorIdentifier]
    """
    A `numpy.typing.NDArray[VectorIdentifier]` array of vector identifiers, 2 dimensional. The row corresponds to
    the index of the query, and the column corresponds to the k neighbors requested
    """
    distances: npt.NDArray[np.float32]
    """
    A `numpy.typing.NDArray[numpy.float32]` of distances as calculated by the distance metric function, 2 dimensional.
    The row corresponds to the index of the query, and the column corresponds to the distance of the query to the
    *k-th* neighbor
    """


from . import defaults
from ._builder import build_disk_index, build_memory_index
from ._common import valid_dtype
from ._dynamic_memory_index import DynamicMemoryIndex
from ._files import (
    Metadata,
    tags_from_file,
    tags_to_file,
    vectors_from_file,
    vectors_metadata_from_file,
    vectors_to_file,
)
from ._static_disk_index import StaticDiskIndex
from ._static_memory_index import StaticMemoryIndex

__all__ = [
    "build_disk_index",
    "build_memory_index",
    "StaticDiskIndex",
    "StaticMemoryIndex",
    "DynamicMemoryIndex",
    "defaults",
    "DistanceMetric",
    "VectorDType",
    "QueryResponse",
    "QueryResponseBatch",
    "VectorIdentifier",
    "VectorIdentifierBatch",
    "VectorLike",
    "VectorLikeBatch",
    "Metadata",
    "vectors_metadata_from_file",
    "vectors_to_file",
    "vectors_from_file",
    "tags_to_file",
    "tags_from_file",
    "valid_dtype",
]
```
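Since `QueryResponse` and `QueryResponseBatch` are `NamedTuple`s, their fields can be unpacked positionally or accessed by name. The helper below is hypothetical, purely to illustrate the position-based relationship between the two arrays described in the docstrings:

```python
from diskannpy import QueryResponseBatch


def nearest_neighbor_per_query(response: QueryResponseBatch) -> list[tuple[int, float]]:
    # Row i of both arrays holds the k results for query i, nearest first,
    # so column 0 pairs each query's closest identifier with its distance.
    return [
        (int(response.identifiers[i, 0]), float(response.distances[i, 0]))
        for i in range(response.identifiers.shape[0])
    ]
```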
## build_disk_index

```python
def build_disk_index(
    data: Union[str, VectorLikeBatch],
    distance_metric: DistanceMetric,
    index_directory: str,
    complexity: int,
    graph_degree: int,
    search_memory_maximum: float,
    build_memory_maximum: float,
    num_threads: int,
    pq_disk_bytes: int = defaults.PQ_DISK_BYTES,
    vector_dtype: Optional[VectorDType] = None,
    index_prefix: str = "ann",
) -> None:
    _assert(
        (isinstance(data, str) and vector_dtype is not None)
        or isinstance(data, np.ndarray),
        "vector_dtype is required if data is a str representing a path to the vector bin file",
    )
    dap_metric = _valid_metric(distance_metric)
    _assert_is_positive_uint32(complexity, "complexity")
    _assert_is_positive_uint32(graph_degree, "graph_degree")
    _assert(search_memory_maximum > 0, "search_memory_maximum must be larger than 0")
    _assert(build_memory_maximum > 0, "build_memory_maximum must be larger than 0")
    _assert_is_nonnegative_uint32(num_threads, "num_threads")
    _assert_is_nonnegative_uint32(pq_disk_bytes, "pq_disk_bytes")
    _assert(index_prefix != "", "index_prefix cannot be an empty string")

    index_path = Path(index_directory)
    _assert(
        index_path.exists() and index_path.is_dir(),
        "index_directory must both exist and be a directory",
    )

    vector_bin_path, vector_dtype_actual = _valid_path_and_dtype(
        data, vector_dtype, index_directory, index_prefix
    )
    _assert(dap_metric != _native_dap.COSINE, "Cosine is currently not supported in StaticDiskIndex")
    if dap_metric == _native_dap.INNER_PRODUCT:
        _assert(
            vector_dtype_actual == np.float32,
            "Integral vector dtypes (np.uint8, np.int8) are not supported with distance metric mips"
        )

    num_points, dimensions = vectors_metadata_from_file(vector_bin_path)

    if vector_dtype_actual == np.uint8:
        _builder = _native_dap.build_disk_uint8_index
    elif vector_dtype_actual == np.int8:
        _builder = _native_dap.build_disk_int8_index
    else:
        _builder = _native_dap.build_disk_float_index

    index_prefix_path = os.path.join(index_directory, index_prefix)

    _builder(
        distance_metric=dap_metric,
        data_file_path=vector_bin_path,
        index_prefix_path=index_prefix_path,
        complexity=complexity,
        graph_degree=graph_degree,
        final_index_ram_limit=search_memory_maximum,
        indexing_ram_budget=build_memory_maximum,
        num_threads=num_threads,
        pq_disk_bytes=pq_disk_bytes,
    )
    _write_index_metadata(
        index_prefix_path, vector_dtype_actual, dap_metric, num_points, dimensions
    )
```
This function will construct a DiskANN disk index. Disk indices are ideal for very large datasets that are too large to fit in memory. Memory is still used, but it is primarily used to provide precise disk locations for fast retrieval of smaller subsets of the index without compromising much on recall.
If you provide a numpy array, it will save this array to disk in a temp location in the format DiskANN's PQ Flash Index builder requires. This temp folder is deleted upon index creation completion or error.
### Distance Metric and Vector Datatype Restrictions

| Metric \ Datatype | np.float32 | np.uint8 | np.int8 |
|-------------------|------------|----------|---------|
| L2                | ✅         | ✅       | ✅      |
| MIPS              | ✅         | ❌       | ❌      |
| Cosine [^bug-in-disk-cosine] | ❌ | ❌      | ❌      |

[^bug-in-disk-cosine]: For StaticDiskIndex, Cosine distances are not currently supported.

### Parameters

- **data**: Either a `str` representing a path to a DiskANN vector bin file, or a numpy.ndarray of a supported dtype, in 2 dimensions. Note that `vector_dtype` must be provided if `data` is a `str`.
- **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for all 3 vector dtypes, but `mips` is only available for single precision floats.
- **index_directory**: The index files will be saved to this **existing** directory path.
- **complexity**: The size of the candidate nearest neighbor list to use when building the index. Values between 75 and 200 are typical. Larger values will take more time to build but result in indices that provide higher recall for the same search complexity. Use a value that is at least as large as `graph_degree` unless you are prepared to compromise on quality.
- **graph_degree**: The degree of the graph index, typically between 60 and 150. A larger maximum degree will result in larger indices and longer indexing times, but better search quality.
- **search_memory_maximum**: Build the index with the expectation that the search will use at most `search_memory_maximum`, in GB.
- **build_memory_maximum**: Build the index using at most `build_memory_maximum`, in GB. Building processes typically require more memory, while search memory can be reduced.
- **num_threads**: Number of threads to use when creating this index. `0` indicates that all available logical processors should be used.
- **pq_disk_bytes**: Use `0` to store uncompressed data on SSD, which allows the index to asymptote to 100% recall. If your vectors are too large to store on SSD, this parameter provides the option to compress the vectors using PQ for storing on SSD, which will trade off recall. You would also want this to be greater than the number of bytes used for the PQ compressed data stored in-memory. Default is `0`.
- **vector_dtype**: Required if the provided `data` is of type `str`; otherwise the `data.dtype` of the numpy array is used.
- **index_prefix**: The prefix of the index files. Defaults to "ann".
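A sketch of a typical call, under the assumption of an existing empty directory at `/tmp/disk_index` and synthetic float32 data; all parameter values are illustrative and should be tuned to your dataset:

```python
import numpy as np

import diskannpy

rng = np.random.default_rng(5678)
vectors = rng.random((250_000, 96), dtype=np.float32)  # synthetic stand-in data

diskannpy.build_disk_index(
    data=vectors,                       # or a str path to a vector bin file (then vector_dtype is required)
    distance_metric="l2",               # cosine is not currently supported for disk indices
    index_directory="/tmp/disk_index",  # must already exist
    complexity=128,
    graph_degree=64,
    search_memory_maximum=4.0,          # GB budget expected at search time
    build_memory_maximum=16.0,          # GB budget while building
    num_threads=0,                      # all available logical processors
    pq_disk_bytes=0,                    # store uncompressed vectors on SSD
)
```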
## build_memory_index

```python
def build_memory_index(
    data: Union[str, VectorLikeBatch],
    distance_metric: DistanceMetric,
    index_directory: str,
    complexity: int,
    graph_degree: int,
    num_threads: int,
    alpha: float = defaults.ALPHA,
    use_pq_build: bool = defaults.USE_PQ_BUILD,
    num_pq_bytes: int = defaults.NUM_PQ_BYTES,
    use_opq: bool = defaults.USE_OPQ,
    vector_dtype: Optional[VectorDType] = None,
    tags: Union[str, VectorIdentifierBatch] = "",
    filter_labels: Optional[list[list[str]]] = None,
    universal_label: str = "",
    filter_complexity: int = defaults.FILTER_COMPLEXITY,
    index_prefix: str = "ann",
) -> None:
    _assert(
        (isinstance(data, str) and vector_dtype is not None)
        or isinstance(data, np.ndarray),
        "vector_dtype is required if data is a str representing a path to the vector bin file",
    )
    dap_metric = _valid_metric(distance_metric)
    _assert_is_positive_uint32(complexity, "complexity")
    _assert_is_positive_uint32(graph_degree, "graph_degree")
    _assert(
        alpha >= 1,
        "alpha must be >= 1, and realistically should be kept between [1.0, 2.0)",
    )
    _assert_is_nonnegative_uint32(num_threads, "num_threads")
    _assert_is_nonnegative_uint32(num_pq_bytes, "num_pq_bytes")
    _assert_is_nonnegative_uint32(filter_complexity, "filter_complexity")
    _assert(index_prefix != "", "index_prefix cannot be an empty string")
    _assert(
        filter_labels is None or filter_complexity > 0,
        "if filter_labels is provided, filter_complexity must not be 0"
    )

    index_path = Path(index_directory)
    _assert(
        index_path.exists() and index_path.is_dir(),
        "index_directory must both exist and be a directory",
    )

    vector_bin_path, vector_dtype_actual = _valid_path_and_dtype(
        data, vector_dtype, index_directory, index_prefix
    )
    if dap_metric == _native_dap.INNER_PRODUCT:
        _assert(
            vector_dtype_actual == np.float32,
            "Integral vector dtypes (np.uint8, np.int8) are not supported with distance metric mips"
        )

    num_points, dimensions = vectors_metadata_from_file(vector_bin_path)
    if filter_labels is not None:
        _assert(
            len(filter_labels) == num_points,
            "filter_labels must be the same length as the number of points"
        )

    if vector_dtype_actual == np.uint8:
        _builder = _native_dap.build_memory_uint8_index
    elif vector_dtype_actual == np.int8:
        _builder = _native_dap.build_memory_int8_index
    else:
        _builder = _native_dap.build_memory_float_index

    index_prefix_path = os.path.join(index_directory, index_prefix)

    filter_labels_file = ""
    if filter_labels is not None:
        label_counts = {}
        filter_labels_file = f"{index_prefix_path}_pylabels.txt"
        with open(filter_labels_file, "w") as labels_file:
            for labels in filter_labels:
                for label in labels:
                    label_counts[label] = 1 if label not in label_counts else label_counts[label] + 1
                if len(labels) == 0:
                    print("default", file=labels_file)
                else:
                    print(",".join(labels), file=labels_file)
        with open(f"{index_prefix_path}_label_metadata.json", "w") as label_metadata_file:
            json.dump(label_counts, label_metadata_file, indent=True)

    if isinstance(tags, str) and tags != "":
        use_tags = True
        shutil.copy(tags, index_prefix_path + ".tags")
    elif not isinstance(tags, str):
        use_tags = True
        tags_as_array = _castable_dtype_or_raise(tags, expected=np.uint32)
        _assert(len(tags_as_array.shape) == 1, "Provided tags must be 1 dimensional")
        _assert(
            tags_as_array.shape[0] == num_points,
            "Provided tags must contain an identical population to the number of points, "
            f"{tags_as_array.shape[0]=}, {num_points=}",
        )
        tags_to_file(index_prefix_path + ".tags", tags_as_array)
    else:
        use_tags = False

    _builder(
        distance_metric=dap_metric,
        data_file_path=vector_bin_path,
        index_output_path=index_prefix_path,
        complexity=complexity,
        graph_degree=graph_degree,
        alpha=alpha,
        num_threads=num_threads,
        use_pq_build=use_pq_build,
        num_pq_bytes=num_pq_bytes,
        use_opq=use_opq,
        use_tags=use_tags,
        filter_labels_file=filter_labels_file,
        universal_label=universal_label,
        filter_complexity=filter_complexity,
    )

    _write_index_metadata(
        index_prefix_path, vector_dtype_actual, dap_metric, num_points, dimensions
    )
```
This function will construct a DiskANN memory index. Memory indices are ideal for smaller datasets whose indices can fit into memory. Memory indices are faster than disk indices, but usually cannot scale to massive sizes in an individual index on an individual machine.
`diskannpy`'s memory indices take two forms: a `diskannpy.StaticMemoryIndex`, which will not be mutated, only searched upon, and a `diskannpy.DynamicMemoryIndex`, which can be mutated AND searched upon in the same process.

### Important Note

You **must** determine the type of index you are building for. If you are building for a `diskannpy.DynamicMemoryIndex`, you **must** supply a valid value for the `tags` parameter. **Do not supply tags if the index is intended to be a `diskannpy.StaticMemoryIndex`**!

### Distance Metric and Vector Datatype Restrictions

| Metric \ Datatype | np.float32 | np.uint8 | np.int8 |
|-------------------|------------|----------|---------|
| L2                | ✅         | ✅       | ✅      |
| MIPS              | ✅         | ❌       | ❌      |
| Cosine            | ✅         | ✅       | ✅      |

### Parameters

- **data**: Either a `str` representing a path to an existing DiskANN vector bin file, or a numpy.ndarray of a supported dtype in 2 dimensions. Note that `vector_dtype` must be provided if `data` is a `str`.
- **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for all 3 vector dtypes, but `mips` is only available for single precision floats.
- **index_directory**: The index files will be saved to this **existing** directory path.
- **complexity**: The size of the candidate nearest neighbor list to use when building the index. Values between 75 and 200 are typical. Larger values will take more time to build but result in indices that provide higher recall for the same search complexity. Use a value that is at least as large as `graph_degree` unless you are prepared to compromise on quality.
- **graph_degree**: The degree of the graph index, typically between 60 and 150. A larger maximum degree will result in larger indices and longer indexing times, but better search quality.
- **num_threads**: Number of threads to use when creating this index. `0` indicates that all available logical processors should be used.
- **alpha**: The alpha parameter (>= 1) is used to control the nature and number of points that are added to the graph. A higher alpha value (e.g., 1.4) will result in fewer hops (and IOs) to convergence, but probably more distance comparisons compared to a lower alpha value.
- **use_pq_build**: Use product quantization during build. Product quantization is a lossy compression technique that can reduce the size of the index on disk. This will trade off recall. Default is `True`.
- **num_pq_bytes**: The number of bytes used to store the PQ compressed data in memory. This will trade off recall. Default is `0`.
- **use_opq**: Use optimized product quantization during build.
- **vector_dtype**: Required if the provided `data` is of type `str`; otherwise the `data.dtype` of the numpy array is used.
- **tags**: Tags can be defined either as a path on disk to an existing .tags file, or provided as a np.array of the same length as the number of vectors. Tags are used to identify vectors in the index via your *own* numbering conventions, and are absolutely required for loading DynamicMemoryIndex indices `from_file`.
- **filter_labels**: An optional, but exhaustive, list of categories for each vector. This is used to filter search results by category. If provided, this must be a list of lists, where each inner list is the list of categories for the corresponding vector. For example, if you have 3 vectors, and the first vector belongs to categories "a" and "b", the second vector belongs to category "b", and the third vector belongs to no categories, you would provide `filter_labels=[["a", "b"], ["b"], []]`. If you do not want to provide categories for a particular vector, you can provide an empty list. If you do not want to provide categories for any vectors, you can provide `None` for this parameter (which is the default). See the sketch after this list for a worked example.
- **universal_label**: An optional label that indicates that this vector should be included in *every* search in which it also meets the knn search criteria.
- **filter_complexity**: Complexity to use when using filters. Default is 0. 0 is strictly invalid if you are using filters.
- **index_prefix**: The prefix of the index files. Defaults to "ann".
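A sketch of a filtered build, reusing the three-vector `filter_labels` example from the parameter notes above; the directory path is an assumption and must already exist. `tags` is omitted because this index is meant to be loaded as a `StaticMemoryIndex`:

```python
import numpy as np

import diskannpy

rng = np.random.default_rng(42)
vectors = rng.random((3, 16), dtype=np.float32)

diskannpy.build_memory_index(
    data=vectors,
    distance_metric="cosine",
    index_directory="/tmp/filtered_index",  # must already exist
    complexity=64,
    graph_degree=32,
    num_threads=0,
    filter_labels=[["a", "b"], ["b"], []],  # per-vector categories, as in the docs above
    filter_complexity=64,                   # must be > 0 whenever filter_labels is given
)
```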
## StaticDiskIndex

```python
class StaticDiskIndex:
    def __init__(
        self,
        index_directory: str,
        num_threads: int,
        num_nodes_to_cache: int,
        cache_mechanism: int = 1,
        distance_metric: Optional[DistanceMetric] = None,
        vector_dtype: Optional[VectorDType] = None,
        dimensions: Optional[int] = None,
        index_prefix: str = "ann",
    ):
        index_prefix_path = _valid_index_prefix(index_directory, index_prefix)
        vector_dtype, metric, _, _ = _ensure_index_metadata(
            index_prefix_path,
            vector_dtype,
            distance_metric,
            1,  # it doesn't matter because we don't need it in this context anyway
            dimensions,
        )
        dap_metric = _valid_metric(metric)

        _assert_is_nonnegative_uint32(num_threads, "num_threads")
        _assert_is_nonnegative_uint32(num_nodes_to_cache, "num_nodes_to_cache")

        self._vector_dtype = vector_dtype
        if vector_dtype == np.uint8:
            _index = _native_dap.StaticDiskUInt8Index
        elif vector_dtype == np.int8:
            _index = _native_dap.StaticDiskInt8Index
        else:
            _index = _native_dap.StaticDiskFloatIndex
        self._index = _index(
            distance_metric=dap_metric,
            index_path_prefix=index_prefix_path,
            num_threads=num_threads,
            num_nodes_to_cache=num_nodes_to_cache,
            cache_mechanism=cache_mechanism,
        )

    def search(
        self, query: VectorLike, k_neighbors: int, complexity: int, beam_width: int = 2
    ) -> QueryResponse:
        _query = _castable_dtype_or_raise(query, expected=self._vector_dtype)
        _assert(len(_query.shape) == 1, "query vector must be 1-d")
        _assert_is_positive_uint32(k_neighbors, "k_neighbors")
        _assert_is_positive_uint32(complexity, "complexity")
        _assert_is_positive_uint32(beam_width, "beam_width")

        if k_neighbors > complexity:
            warnings.warn(
                f"{k_neighbors=} asked for, but {complexity=} was smaller. Increasing {complexity} to {k_neighbors}"
            )
            complexity = k_neighbors

        neighbors, distances = self._index.search(
            query=_query,
            knn=k_neighbors,
            complexity=complexity,
            beam_width=beam_width,
        )
        return QueryResponse(identifiers=neighbors, distances=distances)

    def batch_search(
        self,
        queries: VectorLikeBatch,
        k_neighbors: int,
        complexity: int,
        num_threads: int,
        beam_width: int = 2,
    ) -> QueryResponseBatch:
        _queries = _castable_dtype_or_raise(queries, expected=self._vector_dtype)
        _assert_2d(_queries, "queries")
        _assert_is_positive_uint32(k_neighbors, "k_neighbors")
        _assert_is_positive_uint32(complexity, "complexity")
        _assert_is_nonnegative_uint32(num_threads, "num_threads")
        _assert_is_positive_uint32(beam_width, "beam_width")

        if k_neighbors > complexity:
            warnings.warn(
                f"{k_neighbors=} asked for, but {complexity=} was smaller. Increasing {complexity} to {k_neighbors}"
            )
            complexity = k_neighbors

        num_queries, dim = _queries.shape
        neighbors, distances = self._index.batch_search(
            queries=_queries,
            num_queries=num_queries,
            knn=k_neighbors,
            complexity=complexity,
            beam_width=beam_width,
            num_threads=num_threads,
        )
        return QueryResponseBatch(identifiers=neighbors, distances=distances)
```
A StaticDiskIndex is a disk-backed index that is not mutable.
### StaticDiskIndex.__init__

#### Parameters

- **index_directory**: The directory containing the index files. This directory must contain the following files:
  - `{index_prefix}_sample_data.bin`
  - `{index_prefix}_mem.index.data`
  - `{index_prefix}_pq_compressed.bin`
  - `{index_prefix}_pq_pivots.bin`
  - `{index_prefix}_sample_ids.bin`
  - `{index_prefix}_disk.index`

  It may also include the following optional files:
  - `{index_prefix}_vectors.bin`: Optional. `diskannpy` builder functions may create this file in the `index_directory` if the index was created from a numpy array.
  - `{index_prefix}_metadata.bin`: Optional. `diskannpy` builder functions create this file to store metadata about the index, such as vector dtype, distance metric, number of vectors and vector dimensionality. If an index is built from the `diskann` cli tools, this file will not exist.
- **num_threads**: Number of threads to use when searching this index. (>= 0), 0 = num_threads in system
- **num_nodes_to_cache**: Number of nodes to cache in memory (> -1)
- **cache_mechanism**: 1 -> use the generated sample_data.bin file for the index to initialize a set of cached nodes, up to `num_nodes_to_cache`, 2 -> ready the cache for up to `num_nodes_to_cache`, but do not initialize it with any nodes. Any other value disables node caching.
- **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for all 3 vector dtypes, but `mips` is only available for single precision floats. Default is `None`. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it does not exist, you are required to provide it.
- **vector_dtype**: The vector dtype this index has been built with. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it does not exist, you are required to provide it.
- **dimensions**: The vector dimensionality of this index. All new vectors inserted must be the same dimensionality. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it does not exist, you are required to provide it.
- **index_prefix**: The prefix of the index files. Defaults to "ann".
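A construction sketch, assuming an index previously written by `diskannpy.build_disk_index` into `/tmp/disk_index` with the default `"ann"` prefix; because the builder also wrote `ann_metadata.bin`, the metric, dtype, and dimensions need not be passed explicitly:

```python
import diskannpy

index = diskannpy.StaticDiskIndex(
    index_directory="/tmp/disk_index",  # assumed to hold a previously built disk index
    num_threads=0,                      # all available logical processors
    num_nodes_to_cache=100_000,
    cache_mechanism=1,                  # warm the cache from the generated sample data
)
```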
### StaticDiskIndex.search

Searches the index by a single query vector.

#### Parameters

- **query**: 1d numpy array of the same dimensionality and dtype of the index.
- **k_neighbors**: Number of neighbors to be returned. If the query vector exists in the index, it almost definitely will be returned as well, so adjust your `k_neighbors` as appropriate. Must be > 0.
- **complexity**: Size of the distance-ordered list of candidate neighbors to use while searching. List size increases accuracy at the cost of latency. Must be at least `k_neighbors` in size.
- **beam_width**: The beamwidth to be used for search. This is the maximum number of IO requests each query will issue per iteration of search code. A larger beamwidth will result in fewer IO round-trips per query, but might result in a slightly higher total number of IO requests to SSD per query. For the highest query throughput with a fixed SSD IOps rating, use `W=1`. For best latency, use `W=4,8` or a higher complexity search. Specifying `0` will optimize the beamwidth depending on the number of threads performing search, but will involve some tuning overhead.
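Continuing the construction sketch above, a single query might look like this; the query dimensionality and dtype must match the index:

```python
import numpy as np

query = np.random.default_rng(7).random(96, dtype=np.float32)  # dim matches the sketched index
identifiers, distances = index.search(query, k_neighbors=10, complexity=64, beam_width=2)
for vector_id, distance in zip(identifiers, distances):
    print(vector_id, distance)  # nearest neighbors first
```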
### StaticDiskIndex.batch_search

Searches the index by a batch of query vectors.

This search is parallelized and far more efficient than searching for each vector individually.

#### Parameters

- **queries**: 2d numpy array, with column dimensionality matching the index and row dimensionality being the number of queries intended to search for in parallel. Dtype must match the dtype of the index.
- **k_neighbors**: Number of neighbors to be returned. If the query vector exists in the index, it almost definitely will be returned as well, so adjust your `k_neighbors` as appropriate. Must be > 0.
- **complexity**: Size of the distance-ordered list of candidate neighbors to use while searching. List size increases accuracy at the cost of latency. Must be at least `k_neighbors` in size.
- **num_threads**: Number of threads to use when searching this index. (>= 0), 0 = num_threads in system
- **beam_width**: The beamwidth to be used for search. This is the maximum number of IO requests each query will issue per iteration of search code. A larger beamwidth will result in fewer IO round-trips per query, but might result in a slightly higher total number of IO requests to SSD per query. For the highest query throughput with a fixed SSD IOps rating, use `W=1`. For best latency, use `W=4,8` or a higher complexity search. Specifying `0` will optimize the beamwidth depending on the number of threads performing search, but will involve some tuning overhead.
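Continuing the same sketch, a batch search over 128 queries; both returned arrays have shape `(num_queries, k_neighbors)` and row `i` corresponds to `queries[i]`:

```python
import numpy as np

queries = np.random.default_rng(11).random((128, 96), dtype=np.float32)
identifiers, distances = index.batch_search(
    queries,
    k_neighbors=10,
    complexity=64,
    num_threads=0,  # parallelize across all available logical processors
)
assert identifiers.shape == (128, 10)  # row i pairs with queries[i]
```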
## StaticMemoryIndex

```python
class StaticMemoryIndex:
    """
    A StaticMemoryIndex is an immutable in-memory DiskANN index.
    """

    def __init__(
        self,
        index_directory: str,
        num_threads: int,
        initial_search_complexity: int,
        index_prefix: str = "ann",
        distance_metric: Optional[DistanceMetric] = None,
        vector_dtype: Optional[VectorDType] = None,
        dimensions: Optional[int] = None,
        enable_filters: bool = False
    ):
        """
        ### Parameters
        - **index_directory**: The directory containing the index files. This directory must contain the following
          files:
            - `{index_prefix}.data`
            - `{index_prefix}`

          It may also include the following optional files:
            - `{index_prefix}_vectors.bin`: Optional. `diskannpy` builder functions may create this file in the
              `index_directory` if the index was created from a numpy array
            - `{index_prefix}_metadata.bin`: Optional. `diskannpy` builder functions create this file to store
              metadata about the index, such as vector dtype, distance metric, number of vectors and vector
              dimensionality. If an index is built from the `diskann` cli tools, this file will not exist.
        - **num_threads**: Number of threads to use when searching this index. (>= 0), 0 = num_threads in system
        - **initial_search_complexity**: Should be set to the most common `complexity` expected to be used during
          the life of this `diskannpy.StaticMemoryIndex` object. The working scratch memory allocated is based off
          of `initial_search_complexity` * `search_threads`. Note that it may be resized if a `search` or
          `batch_search` operation requests a space larger than can be accommodated by these values.
        - **index_prefix**: The prefix of the index files. Defaults to "ann".
        - **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported
          for all 3 vector dtypes, but `mips` is only available for single precision floats. Default is `None`.
          **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it does not exist,
          you are required to provide it.
        - **vector_dtype**: The vector dtype this index has been built with. **This value is only used if a
          `{index_prefix}_metadata.bin` file does not exist.** If it does not exist, you are required to provide it.
        - **dimensions**: The vector dimensionality of this index. All new vectors inserted must be the same
          dimensionality. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it
          does not exist, you are required to provide it.
        - **enable_filters**: Indexes built with filters can also be used for filtered search.
        """
        index_prefix_path = _valid_index_prefix(index_directory, index_prefix)
        self._labels_map = {}
        self._labels_metadata = {}
        if enable_filters:
            try:
                with open(f"{index_prefix_path}_labels_map.txt", "r") as labels_map_if:
                    for line in labels_map_if:
                        (key, val) = line.split("\t")
                        self._labels_map[key] = int(val)
                with open(f"{index_prefix_path}_label_metadata.json", "r") as labels_metadata_if:
                    self._labels_metadata = json.load(labels_metadata_if)
            except:  # noqa: E722
                # exceptions are basically presumed to be either file not found or file not formatted correctly
                raise RuntimeError("Filter labels file was unable to be processed.")
        vector_dtype, metric, num_points, dims = _ensure_index_metadata(
            index_prefix_path,
            vector_dtype,
            distance_metric,
            1,  # it doesn't matter because we don't need it in this context anyway
            dimensions,
        )
        dap_metric = _valid_metric(metric)

        _assert_is_nonnegative_uint32(num_threads, "num_threads")
        _assert_is_positive_uint32(
            initial_search_complexity, "initial_search_complexity"
        )

        self._vector_dtype = vector_dtype
        self._dimensions = dims

        if vector_dtype == np.uint8:
            _index = _native_dap.StaticMemoryUInt8Index
        elif vector_dtype == np.int8:
            _index = _native_dap.StaticMemoryInt8Index
        else:
            _index = _native_dap.StaticMemoryFloatIndex

        self._index = _index(
            distance_metric=dap_metric,
            num_points=num_points,
            dimensions=dims,
            index_path=index_prefix_path,
            num_threads=num_threads,
            initial_search_complexity=initial_search_complexity,
        )

    def search(
        self, query: VectorLike, k_neighbors: int, complexity: int, filter_label: str = ""
    ) -> QueryResponse:
        """
        Searches the index by a single query vector.

        ### Parameters
        - **query**: 1d numpy array of the same dimensionality and dtype of the index.
        - **k_neighbors**: Number of neighbors to be returned. If query vector exists in index, it almost definitely
          will be returned as well, so adjust your ``k_neighbors`` as appropriate. Must be > 0.
        - **complexity**: Size of distance ordered list of candidate neighbors to use while searching. List size
          increases accuracy at the cost of latency. Must be at least k_neighbors in size.
        - **filter_label**: An optional label to restrict the search to vectors assigned that category at build
          time. Requires the index to have been loaded with `enable_filters=True`.
        """
        if filter_label != "":
            if len(self._labels_map) == 0:
                raise ValueError(
                    f"A filter label of {filter_label} was provided, but this class was not initialized with "
                    "filters enabled, e.g. StaticMemoryIndex(..., enable_filters=True)"
                )
            if filter_label not in self._labels_map:
                raise ValueError(
                    f"A filter label of {filter_label} was provided, but the external(str)->internal(np.uint32) "
                    f"labels map does not include that label."
                )
            k_neighbors = min(k_neighbors, self._labels_metadata[filter_label])
        _query = _castable_dtype_or_raise(query, expected=self._vector_dtype)
        _assert(len(_query.shape) == 1, "query vector must be 1-d")
        _assert(
            _query.shape[0] == self._dimensions,
            f"query vector must have the same dimensionality as the index; index dimensionality: {self._dimensions}, "
            f"query dimensionality: {_query.shape[0]}",
        )
        _assert_is_positive_uint32(k_neighbors, "k_neighbors")
        _assert_is_nonnegative_uint32(complexity, "complexity")

        if k_neighbors > complexity:
            warnings.warn(
                f"k_neighbors={k_neighbors} asked for, but list_size={complexity} was smaller. Increasing {complexity} to {k_neighbors}"
            )
            complexity = k_neighbors

        if filter_label == "":
            neighbors, distances = self._index.search(query=_query, knn=k_neighbors, complexity=complexity)
        else:
            filter = self._labels_map[filter_label]
            neighbors, distances = self._index.search_with_filter(
                query=_query,
                knn=k_neighbors,
                complexity=complexity,
                filter=filter
            )
        return QueryResponse(identifiers=neighbors, distances=distances)

    def batch_search(
        self,
        queries: VectorLikeBatch,
        k_neighbors: int,
        complexity: int,
        num_threads: int,
    ) -> QueryResponseBatch:
        """
        Searches the index by a batch of query vectors.

        This search is parallelized and far more efficient than searching for each vector individually.

        ### Parameters
        - **queries**: 2d numpy array, with column dimensionality matching the index and row dimensionality being the
          number of queries intended to search for in parallel. Dtype must match dtype of the index.
        - **k_neighbors**: Number of neighbors to be returned. If query vector exists in index, it almost definitely
          will be returned as well, so adjust your ``k_neighbors`` as appropriate. Must be > 0.
        - **complexity**: Size of distance ordered list of candidate neighbors to use while searching. List size
          increases accuracy at the cost of latency. Must be at least k_neighbors in size.
        - **num_threads**: Number of threads to use when searching this index. (>= 0), 0 = num_threads in system
        """
        _queries = _castable_dtype_or_raise(queries, expected=self._vector_dtype)
        _assert(len(_queries.shape) == 2, "queries must be a 2-d np array")
        _assert(
            _queries.shape[1] == self._dimensions,
            f"query vectors must have the same dimensionality as the index; index dimensionality: {self._dimensions}, "
            f"query dimensionality: {_queries.shape[1]}",
        )
        _assert_is_positive_uint32(k_neighbors, "k_neighbors")
        _assert_is_positive_uint32(complexity, "complexity")
        _assert_is_nonnegative_uint32(num_threads, "num_threads")

        if k_neighbors > complexity:
            warnings.warn(
                f"k_neighbors={k_neighbors} asked for, but list_size={complexity} was smaller. Increasing {complexity} to {k_neighbors}"
            )
            complexity = k_neighbors

        num_queries, dim = _queries.shape
        neighbors, distances = self._index.batch_search(
            queries=_queries,
            num_queries=num_queries,
            knn=k_neighbors,
            complexity=complexity,
            num_threads=num_threads,
        )
        return QueryResponseBatch(identifiers=neighbors, distances=distances)
```
A StaticMemoryIndex is an immutable in-memory DiskANN index.
`__init__(self, index_directory: str, num_threads: int, initial_search_complexity: int, index_prefix: str = "ann", distance_metric: Optional[DistanceMetric] = None, vector_dtype: Optional[VectorDType] = None, dimensions: Optional[int] = None, enable_filters: bool = False)`

### Parameters

- **index_directory**: The directory containing the index files. This directory must contain the following files:
    - `{index_prefix}.data`
    - `{index_prefix}`

  It may also include the following optional files:
    - `{index_prefix}_vectors.bin`: Optional. `diskannpy` builder functions may create this file in the `index_directory` if the index was created from a numpy array.
    - `{index_prefix}_metadata.bin`: Optional. `diskannpy` builder functions create this file to store metadata about the index, such as vector dtype, distance metric, number of vectors and vector dimensionality. If an index is built from the `diskann` cli tools, this file will not exist.
- **num_threads**: Number of threads to use when searching this index. Must be >= 0; 0 = use all threads available on the system.
- **initial_search_complexity**: Should be set to the most common `complexity` expected to be used during the life of this `diskannpy.StaticMemoryIndex` object. The working scratch memory allocated is based off of `initial_search_complexity` * `num_threads`. Note that it may be resized if a `search` or `batch_search` operation requests a space larger than can be accommodated by these values.
- **index_prefix**: The prefix of the index files. Defaults to "ann".
- **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for all 3 vector dtypes, but `mips` is only available for single precision floats. Default is `None`. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it does not exist, you are required to provide it.
- **vector_dtype**: The vector dtype this index has been built with. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it does not exist, you are required to provide it.
- **dimensions**: The vector dimensionality of this index. All query vectors must have the same dimensionality. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it does not exist, you are required to provide it.
- **enable_filters**: Set to True to load the filter label files of an index built with filters, enabling filtered searches via the `filter_label` parameter of `search`.
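As a usage sketch: assuming an index was previously built into a hypothetical `./my_index_dir` directory with the default `index_prefix` of "ann" and a `{index_prefix}_metadata.bin` file present, loading it might look like this:

```python
import diskannpy

# Hypothetical directory; the metadata file is assumed to exist, so the
# distance_metric / vector_dtype / dimensions arguments can be omitted.
index = diskannpy.StaticMemoryIndex(
    index_directory="./my_index_dir",
    num_threads=0,                 # 0 = use all threads available on the system
    initial_search_complexity=64,  # sized for the most common expected complexity
)
```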
`search(self, query: VectorLike, k_neighbors: int, complexity: int, filter_label: str = "") -> QueryResponse`

Searches the index by a single query vector.

### Parameters

- **query**: 1d numpy array of the same dimensionality and dtype of the index.
- **k_neighbors**: Number of neighbors to be returned. If the query vector exists in the index, it almost definitely will be returned as well, so adjust your `k_neighbors` as appropriate. Must be > 0.
- **complexity**: Size of the distance ordered list of candidate neighbors to use while searching. List size increases accuracy at the cost of latency. Must be at least `k_neighbors` in size.
- **filter_label**: An optional filter label to restrict the search to vectors tagged with that label. Requires the index to have been loaded with `enable_filters=True`.
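Continuing the loading sketch above, a single query might look like the following (the 128-dimensional `float32` index is an assumption):

```python
import numpy as np

# the query's dtype and dimensionality must match the (assumed) index
query = np.random.rand(128).astype(np.float32)
response = index.search(query=query, k_neighbors=10, complexity=64)
print(response.identifiers)  # ids of the 10 approximate nearest neighbors
print(response.distances)    # corresponding distances under the index's metric
```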
`batch_search(self, queries: VectorLikeBatch, k_neighbors: int, complexity: int, num_threads: int) -> QueryResponseBatch`

Searches the index by a batch of query vectors.

This search is parallelized and far more efficient than searching for each vector individually.

### Parameters

- **queries**: 2d numpy array, with column dimensionality matching the index and row dimensionality being the number of queries intended to search for in parallel. Dtype must match the dtype of the index.
- **k_neighbors**: Number of neighbors to be returned. If a query vector exists in the index, it almost definitely will be returned as well, so adjust your `k_neighbors` as appropriate. Must be > 0.
- **complexity**: Size of the distance ordered list of candidate neighbors to use while searching. List size increases accuracy at the cost of latency. Must be at least `k_neighbors` in size.
- **num_threads**: Number of threads to use when searching this index. Must be >= 0; 0 = use all threads available on the system.
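A batch variant of the same sketch; the query count and dimensionality are again assumptions:

```python
import numpy as np

# 1,000 hypothetical queries against the assumed 128-dimensional float32 index
queries = np.random.rand(1000, 128).astype(np.float32)
batch_response = index.batch_search(
    queries=queries,
    k_neighbors=10,
    complexity=64,
    num_threads=0,  # 0 = use all threads available on the system
)
# identifiers and distances are 2d arrays: one row per query, k_neighbors columns
print(batch_response.identifiers.shape)
```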
class DynamicMemoryIndex:
    """
    A DynamicMemoryIndex instance is used to both search and mutate a `diskannpy` memory index. This index is unlike
    either `diskannpy.StaticMemoryIndex` or `diskannpy.StaticDiskIndex` in the following ways:

    - It requires an explicit vector identifier for each vector added to it.
    - Insert and (lazy) deletion operations are provided for a flexible, living index.

    The mutable aspect of this index will absolutely impact search time performance as new vectors are added and
    old ones deleted. `DynamicMemoryIndex.consolidate_delete()` should be called periodically to restructure the
    index, removing deleted vectors and improving per-search performance at the cost of an expensive index
    consolidation.
    """

    @classmethod
    def from_file(
        cls,
        index_directory: str,
        max_vectors: int,
        complexity: int,
        graph_degree: int,
        saturate_graph: bool = defaults.SATURATE_GRAPH,
        max_occlusion_size: int = defaults.MAX_OCCLUSION_SIZE,
        alpha: float = defaults.ALPHA,
        num_threads: int = defaults.NUM_THREADS,
        filter_complexity: int = defaults.FILTER_COMPLEXITY,
        num_frozen_points: int = defaults.NUM_FROZEN_POINTS_DYNAMIC,
        initial_search_complexity: int = 0,
        search_threads: int = 0,
        concurrent_consolidation: bool = True,
        index_prefix: str = "ann",
        distance_metric: Optional[DistanceMetric] = None,
        vector_dtype: Optional[VectorDType] = None,
        dimensions: Optional[int] = None,
    ) -> "DynamicMemoryIndex":
        """
        The `from_file` classmethod is used to load a previously saved index from disk. This index *must* have been
        created with a valid `tags` file or `tags` np.ndarray of `diskannpy.VectorIdentifier`s. It is *strongly*
        recommended that you use the same parameters as the `diskannpy.build_memory_index()` function that created
        the index.

        ### Parameters
        - **index_directory**: The directory containing the index files. This directory must contain the following
          files:
            - `{index_prefix}.data`
            - `{index_prefix}.tags`
            - `{index_prefix}`

          It may also include the following optional files:
            - `{index_prefix}_vectors.bin`: Optional. `diskannpy` builder functions may create this file in the
              `index_directory` if the index was created from a numpy array
            - `{index_prefix}_metadata.bin`: Optional. `diskannpy` builder functions create this file to store metadata
              about the index, such as vector dtype, distance metric, number of vectors and vector dimensionality.
              If an index is built from the `diskann` cli tools, this file will not exist.
        - **max_vectors**: Capacity of the memory index including space for future insertions.
        - **complexity**: Complexity (a.k.a `L`) references the size of the list we store candidate approximate
          neighbors in. It's used during save (which is an index rebuild), and it's used as an initial search size to
          warm up our index and lower the latency for initial real searches.
        - **graph_degree**: Graph degree (a.k.a. `R`) is the maximum degree allowed for a node in the index's graph
          structure. This degree will be pruned throughout the course of the index build, but it will never grow
          beyond this value. Higher `graph_degree` values require longer index build times, but may result in an
          index showing excellent recall and latency characteristics.
        - **saturate_graph**: If True, the adjacency list of each node will be saturated with neighbors to have
          exactly `graph_degree` neighbors. If False, each node will have between 1 and `graph_degree` neighbors.
        - **max_occlusion_size**: The maximum number of points that can be considered by the `occlude_list` function.
        - **alpha**: The alpha parameter (>= 1) is used to control the nature and number of points that are added to
          the graph. A higher alpha value (e.g., 1.4) will result in fewer hops (and IOs) to convergence, but probably
          more distance comparisons compared to a lower alpha value.
        - **num_threads**: Number of threads to use when creating this index. `0` indicates we should use all
          available logical processors.
        - **filter_complexity**: Complexity to use when using filters. Default is 0.
        - **num_frozen_points**: Number of points to freeze. Default is 1.
        - **initial_search_complexity**: Should be set to the most common `complexity` expected to be used during the
          life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of
          `initial_search_complexity` * `search_threads`. Note that it may be resized if a `search` or `batch_search`
          operation requests a space larger than can be accommodated by these values.
        - **search_threads**: Should be set to the most common `num_threads` expected to be used during the
          life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of
          `initial_search_complexity` * `search_threads`. Note that it may be resized if a `batch_search`
          operation requests a space larger than can be accommodated by these values.
        - **concurrent_consolidation**: This flag dictates whether consolidation can be run alongside inserts and
          deletes, or whether the index is locked down to changes while consolidation is ongoing.
        - **index_prefix**: The prefix of the index files. Defaults to "ann".
        - **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for
          all 3 vector dtypes, but `mips` is only available for single precision floats. Default is `None`. **This
          value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it does not exist,
          you are required to provide it.
        - **vector_dtype**: The vector dtype this index has been built with. **This value is only used if a
          `{index_prefix}_metadata.bin` file does not exist.** If it does not exist, you are required to provide it.
        - **dimensions**: The vector dimensionality of this index. All new vectors inserted must be the same
          dimensionality. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it
          does not exist, you are required to provide it.

        ### Returns
        A `diskannpy.DynamicMemoryIndex` object, with the index loaded from disk and ready to use for insertions,
        deletions, and searches.
        """
        index_prefix_path = _valid_index_prefix(index_directory, index_prefix)

        # a dynamic index can only be loaded if its tags file exists
        tags_file = index_prefix_path + ".tags"
        _assert(
            Path(tags_file).exists(),
            f"The file {tags_file} does not exist in {index_directory}",
        )
        vector_dtype, dap_metric, num_vectors, dimensions = _ensure_index_metadata(
            index_prefix_path, vector_dtype, distance_metric, max_vectors, dimensions, warn_size_exceeded=True
        )

        index = cls(
            distance_metric=dap_metric,  # type: ignore
            vector_dtype=vector_dtype,
            dimensions=dimensions,
            max_vectors=max_vectors,
            complexity=complexity,
            graph_degree=graph_degree,
            saturate_graph=saturate_graph,
            max_occlusion_size=max_occlusion_size,
            alpha=alpha,
            num_threads=num_threads,
            filter_complexity=filter_complexity,
            num_frozen_points=num_frozen_points,
            initial_search_complexity=initial_search_complexity,
            search_threads=search_threads,
            concurrent_consolidation=concurrent_consolidation,
        )
        index._index.load(index_prefix_path)
        index._num_vectors = num_vectors  # current number of vectors loaded
        return index

    def __init__(
        self,
        distance_metric: DistanceMetric,
        vector_dtype: VectorDType,
        dimensions: int,
        max_vectors: int,
        complexity: int,
        graph_degree: int,
        saturate_graph: bool = defaults.SATURATE_GRAPH,
        max_occlusion_size: int = defaults.MAX_OCCLUSION_SIZE,
        alpha: float = defaults.ALPHA,
        num_threads: int = defaults.NUM_THREADS,
        filter_complexity: int = defaults.FILTER_COMPLEXITY,
        num_frozen_points: int = defaults.NUM_FROZEN_POINTS_DYNAMIC,
        initial_search_complexity: int = 0,
        search_threads: int = 0,
        concurrent_consolidation: bool = True,
    ):
        """
        The `diskannpy.DynamicMemoryIndex` represents our python API into a mutable DiskANN memory index.

        This constructor is used to create a new, empty index. If you wish to load a previously saved index from
        disk, please use the `diskannpy.DynamicMemoryIndex.from_file` classmethod instead.

        ### Parameters
        - **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for
          all 3 vector dtypes, but `mips` is only available for single precision floats.
        - **vector_dtype**: One of {`np.float32`, `np.int8`, `np.uint8`}. The dtype of the vectors this index will
          be storing.
        - **dimensions**: The vector dimensionality of this index. All new vectors inserted must be the same
          dimensionality.
        - **max_vectors**: Capacity of the data store, including space for future insertions.
        - **complexity**: Complexity (a.k.a `L`) references the size of the list we store candidate approximate
          neighbors in. It's used during save (which is an index rebuild), and it's used as an initial search size to
          warm up our index and lower the latency for initial real searches.
        - **graph_degree**: Graph degree (a.k.a. `R`) is the maximum degree allowed for a node in the index's graph
          structure. This degree will be pruned throughout the course of the index build, but it will never grow
          beyond this value. Higher `graph_degree` values require longer index build times, but may result in an
          index showing excellent recall and latency characteristics.
        - **saturate_graph**: If True, the adjacency list of each node will be saturated with neighbors to have
          exactly `graph_degree` neighbors. If False, each node will have between 1 and `graph_degree` neighbors.
        - **max_occlusion_size**: The maximum number of points that can be considered by the `occlude_list` function.
        - **alpha**: The alpha parameter (>= 1) is used to control the nature and number of points that are added to
          the graph. A higher alpha value (e.g., 1.4) will result in fewer hops (and IOs) to convergence, but probably
          more distance comparisons compared to a lower alpha value.
        - **num_threads**: Number of threads to use when creating this index. `0` indicates we should use all
          available logical processors.
        - **filter_complexity**: Complexity to use when using filters. Default is 0.
        - **num_frozen_points**: Number of points to freeze. Default is 1.
        - **initial_search_complexity**: Should be set to the most common `complexity` expected to be used during the
          life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of
          `initial_search_complexity` * `search_threads`. Note that it may be resized if a `search` or `batch_search`
          operation requests a space larger than can be accommodated by these values.
        - **search_threads**: Should be set to the most common `num_threads` expected to be used during the
          life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of
          `initial_search_complexity` * `search_threads`. Note that it may be resized if a `batch_search`
          operation requests a space larger than can be accommodated by these values.
        - **concurrent_consolidation**: This flag dictates whether consolidation can be run alongside inserts and
          deletes, or whether the index is locked down to changes while consolidation is ongoing.
        """
        self._num_vectors = 0
        self._removed_num_vectors = 0
        dap_metric = _valid_metric(distance_metric)
        self._dap_metric = dap_metric
        _assert_dtype(vector_dtype)
        _assert_is_positive_uint32(dimensions, "dimensions")

        self._vector_dtype = vector_dtype
        self._dimensions = dimensions

        _assert_is_positive_uint32(max_vectors, "max_vectors")
        _assert_is_positive_uint32(complexity, "complexity")
        _assert_is_positive_uint32(graph_degree, "graph_degree")
        _assert(
            alpha >= 1,
            "alpha must be >= 1, and realistically should be kept between [1.0, 2.0)",
        )
        _assert_is_nonnegative_uint32(max_occlusion_size, "max_occlusion_size")
        _assert_is_nonnegative_uint32(num_threads, "num_threads")
        _assert_is_nonnegative_uint32(filter_complexity, "filter_complexity")
        _assert_is_nonnegative_uint32(num_frozen_points, "num_frozen_points")
        _assert_is_nonnegative_uint32(
            initial_search_complexity, "initial_search_complexity"
        )
        _assert_is_nonnegative_uint32(search_threads, "search_threads")

        self._max_vectors = max_vectors
        self._complexity = complexity
        self._graph_degree = graph_degree

        # select the native index implementation matching the vector dtype
        if vector_dtype == np.uint8:
            _index = _native_dap.DynamicMemoryUInt8Index
        elif vector_dtype == np.int8:
            _index = _native_dap.DynamicMemoryInt8Index
        else:
            _index = _native_dap.DynamicMemoryFloatIndex

        self._index = _index(
            distance_metric=dap_metric,
            dimensions=dimensions,
            max_vectors=max_vectors,
            complexity=complexity,
            graph_degree=graph_degree,
            saturate_graph=saturate_graph,
            max_occlusion_size=max_occlusion_size,
            alpha=alpha,
            num_threads=num_threads,
            filter_complexity=filter_complexity,
            num_frozen_points=num_frozen_points,
            initial_search_complexity=initial_search_complexity,
            search_threads=search_threads,
            concurrent_consolidation=concurrent_consolidation,
        )
        self._points_deleted = False

    def search(
        self, query: VectorLike, k_neighbors: int, complexity: int
    ) -> QueryResponse:
        """
        Searches the index by a single query vector.

        ### Parameters
        - **query**: 1d numpy array of the same dimensionality and dtype of the index.
        - **k_neighbors**: Number of neighbors to be returned. If the query vector exists in the index, it almost
          definitely will be returned as well, so adjust your `k_neighbors` as appropriate. Must be > 0.
        - **complexity**: Size of the distance ordered list of candidate neighbors to use while searching. List size
          increases accuracy at the cost of latency. Must be at least `k_neighbors` in size.
        """
        _query = _castable_dtype_or_raise(query, expected=self._vector_dtype)
        _assert(len(_query.shape) == 1, "query vector must be 1-d")
        _assert(
            _query.shape[0] == self._dimensions,
            f"query vector must have the same dimensionality as the index; index dimensionality: {self._dimensions}, "
            f"query dimensionality: {_query.shape[0]}",
        )
        _assert_is_positive_uint32(k_neighbors, "k_neighbors")
        _assert_is_nonnegative_uint32(complexity, "complexity")

        if k_neighbors > complexity:
            warnings.warn(
                f"k_neighbors={k_neighbors} asked for, but complexity={complexity} was smaller. "
                f"Increasing complexity to {k_neighbors}"
            )
            complexity = k_neighbors
        neighbors, distances = self._index.search(query=_query, knn=k_neighbors, complexity=complexity)
        return QueryResponse(identifiers=neighbors, distances=distances)

    def batch_search(
        self,
        queries: VectorLikeBatch,
        k_neighbors: int,
        complexity: int,
        num_threads: int,
    ) -> QueryResponseBatch:
        """
        Searches the index by a batch of query vectors.

        This search is parallelized and far more efficient than searching for each vector individually.

        ### Parameters
        - **queries**: 2d numpy array, with column dimensionality matching the index and row dimensionality being the
          number of queries intended to search for in parallel. Dtype must match the dtype of the index.
        - **k_neighbors**: Number of neighbors to be returned. If a query vector exists in the index, it almost
          definitely will be returned as well, so adjust your `k_neighbors` as appropriate. Must be > 0.
        - **complexity**: Size of the distance ordered list of candidate neighbors to use while searching. List size
          increases accuracy at the cost of latency. Must be at least `k_neighbors` in size.
        - **num_threads**: Number of threads to use when searching this index. Must be >= 0; 0 = use all threads
          available on the system.
        """
        _queries = _castable_dtype_or_raise(queries, expected=self._vector_dtype)
        _assert_2d(_queries, "queries")
        _assert(
            _queries.shape[1] == self._dimensions,
            f"query vectors must have the same dimensionality as the index; index dimensionality: {self._dimensions}, "
            f"query dimensionality: {_queries.shape[1]}",
        )

        _assert_is_positive_uint32(k_neighbors, "k_neighbors")
        _assert_is_positive_uint32(complexity, "complexity")
        _assert_is_nonnegative_uint32(num_threads, "num_threads")

        if k_neighbors > complexity:
            warnings.warn(
                f"k_neighbors={k_neighbors} asked for, but complexity={complexity} was smaller. "
                f"Increasing complexity to {k_neighbors}"
            )
            complexity = k_neighbors

        num_queries, dim = _queries.shape
        neighbors, distances = self._index.batch_search(
            queries=_queries,
            num_queries=num_queries,
            knn=k_neighbors,
            complexity=complexity,
            num_threads=num_threads,
        )
        return QueryResponseBatch(identifiers=neighbors, distances=distances)

    def save(self, save_path: str, index_prefix: str = "ann"):
        """
        Saves this index to file.

        ### Parameters
        - **save_path**: The path to save these index files to.
        - **index_prefix**: The prefix of the index files. Defaults to "ann".
        """
        if save_path == "":
            raise ValueError("save_path cannot be empty")
        if index_prefix == "":
            raise ValueError("index_prefix cannot be empty")

        index_prefix = index_prefix.format(complexity=self._complexity, graph_degree=self._graph_degree)
        _assert_existing_directory(save_path, "save_path")
        save_path = os.path.join(save_path, index_prefix)
        if self._points_deleted is True:
            warnings.warn(
                "DynamicMemoryIndex.save() currently requires DynamicMemoryIndex.consolidate_delete() to be called "
                "prior to save when items have been marked for deletion. This is being done automatically now, "
                "though it will increase the time it takes to save; on large sets of data it can take a substantial "
                "amount of time. In the future, we will implement a faster save with unconsolidated deletes, but for "
                "now this is required."
            )
            self._index.consolidate_delete()
        self._index.save(
            save_path=save_path, compact_before_save=True
        )  # we do not yet support uncompacted saves
        _write_index_metadata(
            save_path,
            self._vector_dtype,
            self._dap_metric,
            self._index.num_points(),
            self._dimensions,
        )

    def insert(self, vector: VectorLike, vector_id: VectorIdentifier):
        """
        Inserts a single vector into the index with the provided vector_id.

        If this insertion will overrun the `max_vectors` count boundaries of this index, `consolidate_delete()` will
        be executed automatically.

        ### Parameters
        - **vector**: The vector to insert. Note that dtype must match.
        - **vector_id**: The vector_id to use for this vector.
        """
        _vector = _castable_dtype_or_raise(vector, expected=self._vector_dtype)
        _assert(len(_vector.shape) == 1, "insert vector must be 1-d")
        _assert_is_positive_uint32(vector_id, "vector_id")
        if self._num_vectors + 1 > self._max_vectors:
            if self._removed_num_vectors > 0:
                warnings.warn(
                    f"Inserting this vector would overrun the max_vectors={self._max_vectors} specified at index "
                    f"construction. We are attempting to consolidate_delete() to make space."
                )
                self.consolidate_delete()
            else:
                raise RuntimeError(
                    f"Inserting this vector would overrun the max_vectors={self._max_vectors} specified "
                    f"at index construction. Unable to make space by consolidating deletions. The insert "
                    f"operation has failed."
                )
        status = self._index.insert(_vector, np.uint32(vector_id))
        if status == 0:
            self._num_vectors += 1
        else:
            raise RuntimeError(
                f"Insert was unable to complete successfully; error code returned from diskann C++ lib: {status}"
            )

    def batch_insert(
        self,
        vectors: VectorLikeBatch,
        vector_ids: VectorIdentifierBatch,
        num_threads: int = 0,
    ):
        """
        Inserts a batch of vectors into the index with the provided vector_ids.

        If this batch insertion will overrun the `max_vectors` count boundaries of this index, `consolidate_delete()`
        will be executed automatically.

        ### Parameters
        - **vectors**: The 2d numpy array of vectors to insert.
        - **vector_ids**: The 1d array of vector ids to use. This array must have the same number of elements as
          the vectors array has rows. The dtype of vector_ids must be `np.uint32`.
        - **num_threads**: Number of threads to use when inserting into this index. Must be >= 0; 0 = use all
          threads available on the system.
        """
        # _castable_dtype_or_raise already returns an array of the expected dtype
        _vectors = _castable_dtype_or_raise(vectors, expected=self._vector_dtype)
        _assert(len(_vectors.shape) == 2, "vectors must be a 2-d array")
        _assert(
            _vectors.shape[0] == vector_ids.shape[0],
            "Number of vectors must be equal to number of ids",
        )
        _vector_ids = vector_ids.astype(dtype=np.uint32, casting="safe", copy=False)

        if self._num_vectors + _vector_ids.shape[0] > self._max_vectors:
            if self._max_vectors + self._removed_num_vectors >= _vector_ids.shape[0]:
                warnings.warn(
                    f"Inserting these vectors, count={_vector_ids.shape[0]}, would overrun the "
                    f"max_vectors={self._max_vectors} specified at index construction. We are attempting to "
                    f"consolidate_delete() to make space."
                )
                self.consolidate_delete()
            else:
                raise RuntimeError(
                    f"Inserting these vectors, count={_vector_ids.shape[0]}, would overrun the "
                    f"max_vectors={self._max_vectors} specified at index construction. Unable to make "
                    f"space by consolidating deletions. The batch insert operation has failed."
                )

        statuses = self._index.batch_insert(
            _vectors, _vector_ids, _vector_ids.shape[0], num_threads
        )
        successes = []
        failures = []
        for i in range(0, len(statuses)):
            if statuses[i] == 0:
                successes.append(i)
            else:
                failures.append(i)
        self._num_vectors += len(successes)
        if len(failures) == 0:
            return
        failed_ids = vector_ids[failures]
        raise RuntimeError(
            f"During batch insert, the following vector_ids were unable to be inserted into the index: {failed_ids}. "
            f"{len(successes)} were successfully inserted"
        )

    def mark_deleted(self, vector_id: VectorIdentifier):
        """
        Mark vector for deletion. This is a soft delete that won't return the vector id in any results, but does not
        remove it from the underlying index files or memory structure. To execute a hard delete, call this method and
        then call the much more expensive `consolidate_delete` method on this index.

        ### Parameters
        - **vector_id**: The vector id to delete. Must be a uint32.
        """
        _assert_is_positive_uint32(vector_id, "vector_id")
        self._points_deleted = True
        self._removed_num_vectors += 1
        # we do not decrement self._num_vectors until consolidate_delete
        self._index.mark_deleted(np.uint32(vector_id))

    def consolidate_delete(self):
        """
        This method actually restructures the DiskANN index to remove the items that have been marked for deletion.
        """
        self._index.consolidate_delete()
        self._points_deleted = False
        self._num_vectors -= self._removed_num_vectors
        self._removed_num_vectors = 0
A DynamicMemoryIndex instance is used to both search and mutate a `diskannpy` memory index. This index is unlike either `diskannpy.StaticMemoryIndex` or `diskannpy.StaticDiskIndex` in the following ways:

- It requires an explicit vector identifier for each vector added to it.
- Insert and (lazy) deletion operations are provided for a flexible, living index.

The mutable aspect of this index will absolutely impact search time performance as new vectors are added and old ones deleted. `DynamicMemoryIndex.consolidate_delete()` should be called periodically to restructure the index, removing deleted vectors and improving per-search performance at the cost of an expensive index consolidation.
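A minimal lifecycle sketch; the dimensionality, capacity, and build parameters below are illustrative assumptions, not recommendations:

```python
import diskannpy
import numpy as np

index = diskannpy.DynamicMemoryIndex(
    distance_metric="l2",
    vector_dtype=np.float32,
    dimensions=128,
    max_vectors=10_000,  # capacity, including room for future inserts
    complexity=64,
    graph_degree=32,
)
index.insert(np.random.rand(128).astype(np.float32), 1)  # explicit id is required
index.mark_deleted(1)       # lazy delete: id 1 stops appearing in search results
index.consolidate_delete()  # expensive: physically removes deleted vectors
```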
The `diskannpy.DynamicMemoryIndex` represents our python API into a mutable DiskANN memory index.

This constructor is used to create a new, empty index. If you wish to load a previously saved index from disk, please use the `diskannpy.DynamicMemoryIndex.from_file` classmethod instead.

### Parameters

- **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for all 3 vector dtypes, but `mips` is only available for single precision floats.
- **vector_dtype**: One of {`np.float32`, `np.int8`, `np.uint8`}. The dtype of the vectors this index will be storing.
- **dimensions**: The vector dimensionality of this index. All new vectors inserted must be the same dimensionality.
- **max_vectors**: Capacity of the data store, including space for future insertions.
- **complexity**: Complexity (a.k.a `L`) references the size of the list we store candidate approximate neighbors in. It's used during save (which is an index rebuild), and it's used as an initial search size to warm up our index and lower the latency for initial real searches.
- **graph_degree**: Graph degree (a.k.a. `R`) is the maximum degree allowed for a node in the index's graph structure. This degree will be pruned throughout the course of the index build, but it will never grow beyond this value. Higher `graph_degree` values require longer index build times, but may result in an index showing excellent recall and latency characteristics.
- **saturate_graph**: If True, the adjacency list of each node will be saturated with neighbors to have exactly `graph_degree` neighbors. If False, each node will have between 1 and `graph_degree` neighbors.
- **max_occlusion_size**: The maximum number of points that can be considered by the `occlude_list` function.
- **alpha**: The alpha parameter (>= 1) is used to control the nature and number of points that are added to the graph. A higher alpha value (e.g., 1.4) will result in fewer hops (and IOs) to convergence, but probably more distance comparisons compared to a lower alpha value.
- **num_threads**: Number of threads to use when creating this index. `0` indicates we should use all available logical processors.
- **filter_complexity**: Complexity to use when using filters. Default is 0.
- **num_frozen_points**: Number of points to freeze. Default is 1.
- **initial_search_complexity**: Should be set to the most common `complexity` expected to be used during the life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of `initial_search_complexity` * `search_threads`. Note that it may be resized if a `search` or `batch_search` operation requests a space larger than can be accommodated by these values.
- **search_threads**: Should be set to the most common `num_threads` expected to be used during the life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of `initial_search_complexity` * `search_threads`. Note that it may be resized if a `batch_search` operation requests a space larger than can be accommodated by these values.
- **concurrent_consolidation**: This flag dictates whether consolidation can be run alongside inserts and deletes, or whether the index is locked down to changes while consolidation is ongoing.
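As a sketch of how the scratch-memory parameters interact: the construction below assumes a workload whose searches will mostly run with `complexity=100` across 8 threads, so the working scratch memory is preallocated accordingly (all other values are illustrative):

```python
import diskannpy
import numpy as np

index = diskannpy.DynamicMemoryIndex(
    distance_metric="cosine",
    vector_dtype=np.float32,
    dimensions=256,       # illustrative
    max_vectors=100_000,  # illustrative
    complexity=64,
    graph_degree=32,
    initial_search_complexity=100,  # most common search complexity expected
    search_threads=8,               # most common search thread count expected
)
```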
The from_file
classmethod is used to load a previously saved index from disk. This index must have been
created with a valid tags
file or tags
np.ndarray of diskannpy.VectorIdentifier
s. It is strongly
recommended that you use the same parameters as the diskannpy.build_memory_index()
function that created
the index.
Parameters
index_directory: The directory containing the index files. This directory must contain the following files:
{index_prefix}.data
{index_prefix}.tags
{index_prefix}
It may also include the following optional files:
{index_prefix}_vectors.bin
: Optional.diskannpy
builder functions may create this file in theindex_directory
if the index was created from a numpy array{index_prefix}_metadata.bin
: Optional.diskannpy
builder functions create this file to store metadata about the index, such as vector dtype, distance metric, number of vectors and vector dimensionality. If an index is built from thediskann
cli tools, this file will not exist.
- max_vectors: Capacity of the memory index including space for future insertions.
- complexity: Complexity (a.k.a
L
) references the size of the list we store candidate approximate neighbors in. It's used during save (which is an index rebuild), and it's used as an initial search size to warm up our index and lower the latency for initial real searches. - graph_degree: Graph degree (a.k.a.
R
) is the maximum degree allowed for a node in the index's graph structure. This degree will be pruned throughout the course of the index build, but it will never grow beyond this value. Higher R values require longer index build times, but may result in an index showing excellent recall and latency characteristics. - saturate_graph: If True, the adjacency list of each node will be saturated with neighbors to have exactly
graph_degree
neighbors. If False, each node will have between 1 andgraph_degree
neighbors. - max_occlusion_size: The maximum number of points that can be considered by occlude_list function.
- alpha: The alpha parameter (>=1) is used to control the nature and number of points that are added to the graph. A higher alpha value (e.g., 1.4) will result in fewer hops (and IOs) to convergence, but probably more distance comparisons compared to a lower alpha value.
- num_threads: Number of threads to use when creating this index.
0
indicates we should use all available logical processors. - filter_complexity: Complexity to use when using filters. Default is 0.
- num_frozen_points: Number of points to freeze. Default is 1.
- initial_search_complexity: Should be set to the most common `complexity` expected to be used during the life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of `initial_search_complexity` * `search_threads`. Note that it may be resized if a `search` or `batch_search` operation requests a space larger than can be accommodated by these values.
- search_threads: Should be set to the most common `num_threads` expected to be used during the life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of `initial_search_complexity` * `search_threads`. Note that it may be resized if a `batch_search` operation requests a space larger than can be accommodated by these values.
- concurrent_consolidation: This flag dictates whether consolidation can be run alongside inserts and deletes, or whether the index is locked down to changes while consolidation is ongoing.
- index_prefix: The prefix of the index files. Defaults to "ann".
- distance_metric: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for all 3 vector dtypes, but `mips` is only available for single precision floats. Default is `None`. This value is only used if a `{index_prefix}_metadata.bin` file does not exist. If it does not exist, you are required to provide it.
- vector_dtype: The vector dtype this index has been built with. This value is only used if a `{index_prefix}_metadata.bin` file does not exist. If it does not exist, you are required to provide it.
- dimensions: The vector dimensionality of this index. All new vectors inserted must be the same dimensionality. This value is only used if a `{index_prefix}_metadata.bin` file does not exist. If it does not exist, you are required to provide it.
Returns
A `diskannpy.DynamicMemoryIndex` object, with the index loaded from disk and ready to use for insertions, deletions, and searches.
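A minimal usage sketch follows; the directory path and parameter values here are illustrative assumptions, not defaults, and should match whatever was used when the index was built:

```python
import diskannpy

# Hypothetical directory and parameters; use the values your index was built with.
index = diskannpy.DynamicMemoryIndex.from_file(
    index_directory="/tmp/my_index",  # must contain ann.data, ann.tags, ann
    max_vectors=100_000,              # leave headroom for future insertions
    complexity=64,
    graph_degree=32,
    num_threads=0,                    # 0 = use all available logical processors
    initial_search_complexity=64,
    search_threads=4,
    index_prefix="ann",
)
```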
def search(
    self, query: VectorLike, k_neighbors: int, complexity: int
) -> QueryResponse:
    """
    Searches the index by a single query vector.

    ### Parameters
    - **query**: 1d numpy array of the same dimensionality and dtype of the index.
    - **k_neighbors**: Number of neighbors to be returned. If query vector exists in index, it almost definitely
      will be returned as well, so adjust your ``k_neighbors`` as appropriate. Must be > 0.
    - **complexity**: Size of distance ordered list of candidate neighbors to use while searching. List size
      increases accuracy at the cost of latency. Must be at least k_neighbors in size.
    """
    _query = _castable_dtype_or_raise(query, expected=self._vector_dtype)
    _assert(len(_query.shape) == 1, "query vector must be 1-d")
    _assert(
        _query.shape[0] == self._dimensions,
        f"query vector must have the same dimensionality as the index; index dimensionality: {self._dimensions}, "
        f"query dimensionality: {_query.shape[0]}",
    )
    _assert_is_positive_uint32(k_neighbors, "k_neighbors")
    _assert_is_nonnegative_uint32(complexity, "complexity")

    if k_neighbors > complexity:
        warnings.warn(
            f"k_neighbors={k_neighbors} asked for, but list_size={complexity} was smaller. Increasing {complexity} to {k_neighbors}"
        )
        complexity = k_neighbors
    neighbors, distances = self._index.search(query=_query, knn=k_neighbors, complexity=complexity)
    return QueryResponse(identifiers=neighbors, distances=distances)
Searches the index by a single query vector.
Parameters
- query: 1d numpy array of the same dimensionality and dtype of the index.
- k_neighbors: Number of neighbors to be returned. If query vector exists in index, it almost definitely will be returned as well, so adjust your `k_neighbors` as appropriate. Must be > 0.
- complexity: Size of distance ordered list of candidate neighbors to use while searching. List size increases accuracy at the cost of latency. Must be at least k_neighbors in size.
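For example, a single-query search against a loaded index might look like this sketch (the 128-dimensional float32 index and the random query values are assumptions):

```python
import numpy as np

# `index` is an already-loaded diskannpy index; dims/dtype must match it.
query = np.random.rand(128).astype(np.float32)
response = index.search(query=query, k_neighbors=10, complexity=64)
print(response.identifiers)  # 1d array of the 10 nearest vector ids
print(response.distances)    # 1d array of the corresponding distances
```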
def batch_search(
    self,
    queries: VectorLikeBatch,
    k_neighbors: int,
    complexity: int,
    num_threads: int,
) -> QueryResponseBatch:
    """
    Searches the index by a batch of query vectors.

    This search is parallelized and far more efficient than searching for each vector individually.

    ### Parameters
    - **queries**: 2d numpy array, with column dimensionality matching the index and row dimensionality being the
      number of queries intended to search for in parallel. Dtype must match dtype of the index.
    - **k_neighbors**: Number of neighbors to be returned. If query vector exists in index, it almost definitely
      will be returned as well, so adjust your ``k_neighbors`` as appropriate. Must be > 0.
    - **complexity**: Size of distance ordered list of candidate neighbors to use while searching. List size
      increases accuracy at the cost of latency. Must be at least k_neighbors in size.
    - **num_threads**: Number of threads to use when searching this index. (>= 0), 0 = num_threads in system
    """
    _queries = _castable_dtype_or_raise(queries, expected=self._vector_dtype)
    _assert_2d(_queries, "queries")
    _assert(
        _queries.shape[1] == self._dimensions,
        f"query vectors must have the same dimensionality as the index; index dimensionality: {self._dimensions}, "
        f"query dimensionality: {_queries.shape[1]}",
    )

    _assert_is_positive_uint32(k_neighbors, "k_neighbors")
    _assert_is_positive_uint32(complexity, "complexity")
    _assert_is_nonnegative_uint32(num_threads, "num_threads")

    if k_neighbors > complexity:
        warnings.warn(
            f"k_neighbors={k_neighbors} asked for, but list_size={complexity} was smaller. Increasing {complexity} to {k_neighbors}"
        )
        complexity = k_neighbors

    num_queries, dim = _queries.shape  # use the validated/cast array, not the raw input
    neighbors, distances = self._index.batch_search(
        queries=_queries,
        num_queries=num_queries,
        knn=k_neighbors,
        complexity=complexity,
        num_threads=num_threads,
    )
    return QueryResponseBatch(identifiers=neighbors, distances=distances)
Searches the index by a batch of query vectors.
This search is parallelized and far more efficient than searching for each vector individually.
Parameters
- queries: 2d numpy array, with column dimensionality matching the index and row dimensionality being the number of queries intended to search for in parallel. Dtype must match dtype of the index.
- k_neighbors: Number of neighbors to be returned. If query vector exists in index, it almost definitely will be returned as well, so adjust your `k_neighbors` as appropriate. Must be > 0.
- complexity: Size of distance ordered list of candidate neighbors to use while searching. List size increases accuracy at the cost of latency. Must be at least k_neighbors in size.
- num_threads: Number of threads to use when searching this index. (>= 0), 0 = num_threads in system
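As a sketch (the query count and dimensionality below are assumptions):

```python
import numpy as np

queries = np.random.rand(50, 128).astype(np.float32)  # 50 queries; dims/dtype match index
batch_response = index.batch_search(
    queries=queries, k_neighbors=10, complexity=64, num_threads=0
)
# Row i corresponds to queries[i]; columns are the k nearest neighbors.
first_query_ids = batch_response.identifiers[0]
first_query_distances = batch_response.distances[0]
```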
def save(self, save_path: str, index_prefix: str = "ann"):
    """
    Saves this index to file.

    ### Parameters
    - **save_path**: The path to save these index files to.
    - **index_prefix**: The prefix of the index files. Defaults to "ann".
    """
    if save_path == "":
        raise ValueError("save_path cannot be empty")
    if index_prefix == "":
        raise ValueError("index_prefix cannot be empty")

    index_prefix = index_prefix.format(complexity=self._complexity, graph_degree=self._graph_degree)
    _assert_existing_directory(save_path, "save_path")
    save_path = os.path.join(save_path, index_prefix)
    if self._points_deleted is True:
        warnings.warn(
            "DynamicMemoryIndex.save() currently requires DynamicMemoryIndex.consolidate_delete() to be called "
            "prior to save when items have been marked for deletion. This is being done automatically now, though "
            "it will increase the time it takes to save; on large sets of data it can take a substantial amount of "
            "time. In the future, we will implement a faster save with unconsolidated deletes, but for now this is "
            "required."
        )
        self._index.consolidate_delete()
    self._index.save(
        save_path=save_path, compact_before_save=True
    )  # we do not yet support uncompacted saves
    _write_index_metadata(
        save_path,
        self._vector_dtype,
        self._dap_metric,
        self._index.num_points(),
        self._dimensions,
    )
Saves this index to file.
Parameters
- save_path: The path to save these index files to.
- index_prefix: The prefix of the index files. Defaults to "ann".
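For instance (the destination directory is an assumption and must already exist):

```python
# Writes the index files with prefix "ann" into an existing directory.
index.save(save_path="/tmp/my_index", index_prefix="ann")
```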
def insert(self, vector: VectorLike, vector_id: VectorIdentifier):
    """
    Inserts a single vector into the index with the provided vector_id.

    If this insertion will overrun the `max_vectors` count boundaries of this index, `consolidate_delete()` will
    be executed automatically.

    ### Parameters
    - **vector**: The vector to insert. Note that dtype must match.
    - **vector_id**: The vector_id to use for this vector.
    """
    _vector = _castable_dtype_or_raise(vector, expected=self._vector_dtype)
    _assert(len(_vector.shape) == 1, "insert vector must be 1-d")  # check the cast array, not the raw input
    _assert_is_positive_uint32(vector_id, "vector_id")
    if self._num_vectors + 1 > self._max_vectors:
        if self._removed_num_vectors > 0:
            warnings.warn(f"Inserting this vector would overrun the max_vectors={self._max_vectors} specified at index "
                          f"construction. We are attempting to consolidate_delete() to make space.")
            self.consolidate_delete()
        else:
            raise RuntimeError(f"Inserting this vector would overrun the max_vectors={self._max_vectors} specified "
                               f"at index construction. Unable to make space by consolidating deletions. The insert "
                               f"operation has failed.")
    status = self._index.insert(_vector, np.uint32(vector_id))
    if status == 0:
        self._num_vectors += 1
    else:
        raise RuntimeError(
            f"Insert was unable to complete successfully; error code returned from diskann C++ lib: {status}"
        )
Inserts a single vector into the index with the provided vector_id.
If this insertion will overrun the `max_vectors` count boundaries of this index, `consolidate_delete()` will be executed automatically.
Parameters
- vector: The vector to insert. Note that dtype must match.
- vector_id: The vector_id to use for this vector.
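A sketch, assuming a 128-dimensional float32 index and an id not already in use:

```python
import numpy as np

vector = np.random.rand(128).astype(np.float32)  # dtype/dims must match the index
index.insert(vector=vector, vector_id=np.uint32(12345))  # id must be a positive uint32
```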
def batch_insert(
    self,
    vectors: VectorLikeBatch,
    vector_ids: VectorIdentifierBatch,
    num_threads: int = 0,
):
    """
    Inserts a batch of vectors into the index with the provided vector_ids.

    If this batch insertion will overrun the `max_vectors` count boundaries of this index, `consolidate_delete()`
    will be executed automatically.

    ### Parameters
    - **vectors**: The 2d numpy array of vectors to insert.
    - **vector_ids**: The 1d array of vector ids to use. This array must have the same number of elements as
      the vectors array has rows. The dtype of vector_ids must be `np.uint32`
    - **num_threads**: Number of threads to use when inserting into this index. (>= 0), 0 = num_threads in system
    """
    # cast (or raise) once, then validate and use the cast array throughout
    _vectors = _castable_dtype_or_raise(vectors, expected=self._vector_dtype)
    _assert(len(_vectors.shape) == 2, "vectors must be a 2-d array")
    _assert(
        _vectors.shape[0] == vector_ids.shape[0],
        "Number of vectors must be equal to number of ids",
    )
    _vector_ids = vector_ids.astype(dtype=np.uint32, casting="safe", copy=False)

    if self._num_vectors + _vector_ids.shape[0] > self._max_vectors:
        if self._max_vectors + self._removed_num_vectors >= _vector_ids.shape[0]:
            warnings.warn(f"Inserting these vectors, count={_vector_ids.shape[0]} would overrun the "
                          f"max_vectors={self._max_vectors} specified at index construction. We are attempting to "
                          f"consolidate_delete() to make space.")
            self.consolidate_delete()
        else:
            raise RuntimeError(f"Inserting these vectors count={_vector_ids.shape[0]} would overrun the "
                               f"max_vectors={self._max_vectors} specified at index construction. Unable to make "
                               f"space by consolidating deletions. The batch insert operation has failed.")

    statuses = self._index.batch_insert(
        _vectors, _vector_ids, _vector_ids.shape[0], num_threads
    )
    successes = []
    failures = []
    for i, status in enumerate(statuses):
        if status == 0:
            successes.append(i)
        else:
            failures.append(i)
    self._num_vectors += len(successes)
    if len(failures) == 0:
        return
    failed_ids = vector_ids[failures]
    raise RuntimeError(
        f"During batch insert, the following vector_ids were unable to be inserted into the index: {failed_ids}. "
        f"{len(successes)} were successfully inserted"
    )
Inserts a batch of vectors into the index with the provided vector_ids.
If this batch insertion will overrun the `max_vectors` count boundaries of this index, `consolidate_delete()` will be executed automatically.
Parameters
- vectors: The 2d numpy array of vectors to insert.
- vector_ids: The 1d array of vector ids to use. This array must have the same number of elements as the vectors array has rows. The dtype of vector_ids must be `np.uint32`
- num_threads: Number of threads to use when inserting into this index. (>= 0), 0 = num_threads in system
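A sketch, assuming 1,000 new 128-dimensional float32 vectors with ids not already present in the index:

```python
import numpy as np

vectors = np.random.rand(1_000, 128).astype(np.float32)
vector_ids = np.arange(1, 1_001, dtype=np.uint32)  # ids must be positive uint32
index.batch_insert(vectors=vectors, vector_ids=vector_ids, num_threads=0)
```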
def mark_deleted(self, vector_id: VectorIdentifier):
    """
    Mark vector for deletion. This is a soft delete that won't return the vector id in any results, but does not
    remove it from the underlying index files or memory structure. To execute a hard delete, call this method and
    then call the much more expensive `consolidate_delete` method on this index.

    ### Parameters
    - **vector_id**: The vector id to delete. Must be a uint32.
    """
    _assert_is_positive_uint32(vector_id, "vector_id")
    self._points_deleted = True
    self._removed_num_vectors += 1
    # we do not decrement self._num_vectors until consolidate_delete
    self._index.mark_deleted(np.uint32(vector_id))
Mark vector for deletion. This is a soft delete that won't return the vector id in any results, but does not remove it from the underlying index files or memory structure. To execute a hard delete, call this method and then call the much more expensive `consolidate_delete` method on this index.
Parameters
- vector_id: The vector id to delete. Must be a uint32.
def consolidate_delete(self):
    """
    This method actually restructures the DiskANN index to remove the items that have been marked for deletion.
    """
    self._index.consolidate_delete()
    self._points_deleted = False
    self._num_vectors -= self._removed_num_vectors
    self._removed_num_vectors = 0
This method actually restructures the DiskANN index to remove the items that have been marked for deletion.
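Together, the two deletion calls look like this sketch (the id is an assumption):

```python
import numpy as np

# Soft delete: id 12345 stops appearing in results but still occupies space.
index.mark_deleted(np.uint32(12345))

# Hard delete: restructures the index. Expensive, so prefer batching several
# mark_deleted() calls before a single consolidate_delete().
index.consolidate_delete()
```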
Type alias for one of {"l2", "mips", "cosine"}
Type alias for one of {`numpy.float32`, `numpy.int8`, `numpy.uint8`}
class QueryResponse(NamedTuple):
    """
    Tuple with two values, identifiers and distances. Both are 1d arrays, positionally correspond, and will contain the
    nearest neighbors from [0..k_neighbors)
    """

    identifiers: npt.NDArray[VectorIdentifier]
    """ A `numpy.typing.NDArray[VectorIdentifier]` array of vector identifiers, 1 dimensional """
    distances: npt.NDArray[np.float32]
    """
    A `numpy.typing.NDArray[numpy.float32]` of distances as calculated by the distance metric function, 1 dimensional
    """
Tuple with two values, identifiers and distances. Both are 1d arrays, positionally correspond, and will contain the nearest neighbors from [0..k_neighbors)
Create new instance of QueryResponse(identifiers, distances)
A `numpy.typing.NDArray[VectorIdentifier]` array of vector identifiers, 1 dimensional

A `numpy.typing.NDArray[numpy.float32]` of distances as calculated by the distance metric function, 1 dimensional
Inherited Members
- builtins.tuple
- index
- count
class QueryResponseBatch(NamedTuple):
    """
    Tuple with two values, identifiers and distances. Both are 2d arrays, with dimensionality determined by the
    rows corresponding to the number of queries made, and the columns corresponding to the k neighbors
    requested. The two 2d arrays have an implicit, position-based relationship
    """

    identifiers: npt.NDArray[VectorIdentifier]
    """
    A `numpy.typing.NDArray[VectorIdentifier]` array of vector identifiers, 2 dimensional. The row corresponds to the
    index of the query, and the column corresponds to the k neighbors requested
    """
    distances: npt.NDArray[np.float32]  # npt.NDArray, consistent with QueryResponse
    """
    A `numpy.typing.NDArray[numpy.float32]` of distances as calculated by the distance metric function, 2 dimensional.
    The row corresponds to the index of the query, and the column corresponds to the distance of the query to the
    *k-th* neighbor
    """
Tuple with two values, identifiers and distances. Both are 2d arrays, with dimensionality determined by the rows corresponding to the number of queries made, and the columns corresponding to the k neighbors requested. The two 2d arrays have an implicit, position-based relationship
Create new instance of QueryResponseBatch(identifiers, distances)
A `numpy.typing.NDArray[VectorIdentifier]` array of vector identifiers, 2 dimensional. The row corresponds to the index of the query, and the column corresponds to the k neighbors requested

A `numpy.typing.NDArray[numpy.float32]` of distances as calculated by the distance metric function, 2 dimensional. The row corresponds to the index of the query, and the column corresponds to the distance of the query to the *k-th* neighbor
Inherited Members
- builtins.tuple
- index
- count
Type alias for a vector identifier, whether it be an implicit array index identifier from StaticMemoryIndex or StaticDiskIndex, or an explicit tag identifier from DynamicMemoryIndex
Type alias for a batch of VectorIdentifiers
Type alias for something that can be treated as a vector
Type alias for a batch of VectorLikes
class Metadata(NamedTuple):
    """DiskANN binary vector files contain a small stanza containing some metadata about them."""

    num_vectors: int
    """ The number of vectors in the file. """
    dimensions: int
    """ The dimensionality of the vectors in the file. """
DiskANN binary vector files contain a small stanza containing some metadata about them.
Create new instance of Metadata(num_vectors, dimensions)
Inherited Members
- builtins.tuple
- index
- count
def vectors_metadata_from_file(vector_file: str) -> Metadata:
    """
    Read the metadata from a DiskANN binary vector file.

    ### Parameters
    - **vector_file**: The path to the vector file to read the metadata from.

    ### Returns
    `diskannpy.Metadata`
    """
    _assert_existing_file(vector_file, "vector_file")
    points, dims = np.fromfile(file=vector_file, dtype=np.int32, count=2)
    return Metadata(points, dims)
Read the metadata from a DiskANN binary vector file.
Parameters
- vector_file: The path to the vector file to read the metadata from.
Returns
`diskannpy.Metadata`
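For example (the file path is an assumption; the file must already exist):

```python
import diskannpy

metadata = diskannpy.vectors_metadata_from_file("vectors.bin")
print(metadata.num_vectors, metadata.dimensions)
```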
def vectors_to_file(vector_file: str, vectors: VectorLikeBatch) -> None:
    """
    Utility function that writes a DiskANN binary vector formatted file to the location of your choosing.

    ### Parameters
    - **vector_file**: The path to the vector file to write the vectors to.
    - **vectors**: A 2d array of dtype `numpy.float32`, `numpy.uint8`, or `numpy.int8`
    """
    _assert_dtype(vectors.dtype)
    _assert_2d(vectors, "vectors")
    with open(vector_file, "wb") as fh:
        _write_bin(vectors, fh)
Utility function that writes a DiskANN binary vector formatted file to the location of your choosing.
Parameters
- vector_file: The path to the vector file to write the vectors to.
- vectors: A 2d array of dtype `numpy.float32`, `numpy.uint8`, or `numpy.int8`
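For example, writing 10,000 random 128-dimensional float32 vectors (the shape and path are assumptions):

```python
import numpy as np
import diskannpy

vectors = np.random.rand(10_000, 128).astype(np.float32)
diskannpy.vectors_to_file(vector_file="vectors.bin", vectors=vectors)
```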
def vectors_from_file(vector_file: str, dtype: VectorDType) -> npt.NDArray[VectorDType]:
    """
    Read vectors from a DiskANN binary vector file.

    ### Parameters
    - **vector_file**: The path to the vector file to read the vectors from.
    - **dtype**: The data type of the vectors in the file. Ensure you match the data types exactly

    ### Returns
    `numpy.typing.NDArray[dtype]`
    """
    points, dims = vectors_metadata_from_file(vector_file)
    return np.fromfile(file=vector_file, dtype=dtype, offset=8).reshape(points, dims)
Read vectors from a DiskANN binary vector file.
Parameters
- vector_file: The path to the vector file to read the vectors from.
- dtype: The data type of the vectors in the file. Ensure you match the data types exactly
Returns
numpy.typing.NDArray[dtype]
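Reading back a file such as the one written in the `vectors_to_file` example above (the dtype must match exactly what was written):

```python
import numpy as np
import diskannpy

vectors = diskannpy.vectors_from_file("vectors.bin", dtype=np.float32)
print(vectors.shape)  # (10000, 128) for the file from the earlier example
```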
def valid_dtype(dtype: Type) -> VectorDType:
    """
    Utility method to determine whether the provided dtype is supported by `diskannpy`, and if so, the canonical
    dtype we will use internally (e.g. np.single -> np.float32)
    """
    _assert_dtype(dtype)
    if dtype == np.uint8:
        return np.uint8
    if dtype == np.int8:
        return np.int8
    if dtype == np.float32:
        return np.float32
Utility method to determine whether the provided dtype is supported by `diskannpy`, and if so, the canonical dtype we will use internally (e.g. np.single -> np.float32)
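A quick sketch of the canonicalization (assuming, per the internal `_assert_dtype` check in the source above, that an unsupported dtype raises rather than returning):

```python
import numpy as np
import diskannpy

print(diskannpy.valid_dtype(np.single))   # np.float32 (the canonical form)
print(diskannpy.valid_dtype(np.float32))  # np.float32
# An unsupported dtype such as np.float64 fails _assert_dtype and raises.
```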