diskannpy

Documentation Overview

diskannpy is structured primarily around two distinct kinds of functionality: Index Builder Functions and Search Classes.

It also includes a few nascent utilities.

Lastly, it makes substantial use of type hints, with various shorthand type aliases documented. The diskannpy code itself refers to these aliases, though pdoc helpfully expands them.

Index Builders

  • build_disk_index - To build an index that cannot fully fit into memory when searching
  • build_memory_index - To build an index that can fully fit into memory when searching

Search Classes

  • StaticMemoryIndex - for indices that can fully fit in memory and won't be changed during the search operations
  • StaticDiskIndex - for indices that cannot fully fit in memory, thus relying on disk IO to search, and also won't be changed during search operations
  • DynamicMemoryIndex - for indices that can fully fit in memory and will be mutated via insert/deletion operations as well as search operations

Parameter Defaults

  • diskannpy.defaults - Default values exported from the C++ extension for Python users

Parameter and Response Type Aliases

  • DistanceMetric - What distance metrics does diskannpy support?
  • VectorDType - What vector datatypes does diskannpy support?
  • QueryResponse - What can I expect as a response to my search?
  • QueryResponseBatch - What can I expect as a response to my batch search?
  • VectorIdentifier - What types does diskannpy support as vector identifiers?
  • VectorIdentifierBatch - A batch of identifiers, all of the exact same type.
  • VectorLike - What a vector looks like to diskannpy, to be inserted or searched with.
  • VectorLikeBatch - A batch of those vectors, to be inserted or searched with.
  • Metadata - DiskANN vector binary file metadata (num_points, vector_dim)

Utilities

  • vectors_to_file - Turns a 2 dimensional numpy.typing.NDArray[VectorDType] with shape (number_of_points, vector_dim) into a DiskANN vector bin file.
  • vectors_from_file - Reads a DiskANN vector bin file representing stored vectors into a numpy ndarray.
  • vectors_metadata_from_file - Reads metadata stored in a DiskANN vector bin file without reading the entire file
  • tags_to_file - Turns a 1 dimensional numpy.typing.NDArray[VectorIdentifier] into a DiskANN tags bin file.
  • tags_from_file - Reads a DiskANN tags bin file representing stored tags into a numpy ndarray.
  • valid_dtype - Checks if a given vector dtype is supported by diskannpy
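
As a quick orientation, here is a minimal end-to-end sketch that builds an in-memory index from a numpy array and searches it with StaticMemoryIndex, using the builder and search APIs documented below. The data shape, parameter values, and paths are illustrative, not prescriptive.

import os

import diskannpy
import numpy as np

# Illustrative data: 10,000 random float32 vectors of dimension 128.
rng = np.random.default_rng(0)
vectors = rng.random((10_000, 128), dtype=np.float32)

index_dir = "/tmp/example_index"   # hypothetical path; the directory must already exist
os.makedirs(index_dir, exist_ok=True)

diskannpy.build_memory_index(
    data=vectors,
    distance_metric="l2",
    index_directory=index_dir,
    complexity=64,
    graph_degree=32,
    num_threads=0,        # 0 = use all available logical processors
)

index = diskannpy.StaticMemoryIndex(
    index_directory=index_dir,
    num_threads=0,
    initial_search_complexity=64,
)

# QueryResponse is a named tuple of (identifiers, distances).
identifiers, distances = index.search(vectors[0], k_neighbors=10, complexity=64)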
  1# Copyright (c) Microsoft Corporation. All rights reserved.
  2# Licensed under the MIT license.
  3
  4"""
  5# Documentation Overview
 6`diskannpy` is structured primarily around two distinct kinds of functionality: [Index Builder Functions](#index-builders) and [Search Classes](#search-classes).
  7
  8It also includes a few nascent [utilities](#utilities).
  9
 10Lastly, it makes substantial use of type hints, with various shorthand [type aliases](#parameter-and-response-type-aliases) documented.
 11The `diskannpy` code itself refers to these aliases, though `pdoc` helpfully expands them.
 12
 13## Index Builders
 14- `build_disk_index` - To build an index that cannot fully fit into memory when searching
 15- `build_memory_index` - To build an index that can fully fit into memory when searching
 16
 17## Search Classes
 18- `StaticMemoryIndex` - for indices that can fully fit in memory and won't be changed during the search operations
 19- `StaticDiskIndex` - for indices that cannot fully fit in memory, thus relying on disk IO to search, and also won't be changed during search operations
 20- `DynamicMemoryIndex` - for indices that can fully fit in memory and will be mutated via insert/deletion operations as well as search operations
 21
 22## Parameter Defaults
 23- `diskannpy.defaults` - Default values exported from the C++ extension for Python users
 24
 25## Parameter and Response Type Aliases
 26- `DistanceMetric` - What distance metrics does `diskannpy` support?
 27- `VectorDType` - What vector datatypes does `diskannpy` support?
 28- `QueryResponse` - What can I expect as a response to my search?
 29- `QueryResponseBatch` - What can I expect as a response to my batch search?
 30- `VectorIdentifier` - What types does `diskannpy` support as vector identifiers?
 31- `VectorIdentifierBatch` - A batch of identifiers, all of the exact same type.
 32- `VectorLike` - What a vector looks like to `diskannpy`, to be inserted or searched with.
 33- `VectorLikeBatch` - A batch of those vectors, to be inserted or searched with.
 34- `Metadata` - DiskANN vector binary file metadata (num_points, vector_dim)
 35
 36## Utilities
 37- `vectors_to_file` - Turns a 2 dimensional `numpy.typing.NDArray[VectorDType]` with shape `(number_of_points, vector_dim)` into a DiskANN vector bin file.
 38- `vectors_from_file` - Reads a DiskANN vector bin file representing stored vectors into a numpy ndarray.
 39- `vectors_metadata_from_file` - Reads metadata stored in a DiskANN vector bin file without reading the entire file
 40- `tags_to_file` - Turns a 1 dimensional `numpy.typing.NDArray[VectorIdentifier]` into a DiskANN tags bin file.
 41- `tags_from_file` - Reads a DiskANN tags bin file representing stored tags into a numpy ndarray.
 42- `valid_dtype` - Checks if a given vector dtype is supported by `diskannpy`
 43"""
 44
 45from typing import Any, Literal, NamedTuple, Type, Union
 46
 47import numpy as np
 48from numpy import typing as npt
 49
 50DistanceMetric = Literal["l2", "mips", "cosine"]
 51""" Type alias for one of {"l2", "mips", "cosine"} """
 52VectorDType = Union[Type[np.float32], Type[np.int8], Type[np.uint8]]
 53""" Type alias for one of {`numpy.float32`, `numpy.int8`, `numpy.uint8`} """
 54VectorLike = npt.NDArray[VectorDType]
 55""" Type alias for something that can be treated as a vector """
 56VectorLikeBatch = npt.NDArray[VectorDType]
 57""" Type alias for a batch of VectorLikes """
 58VectorIdentifier = np.uint32
 59""" 
 60Type alias for a vector identifier, whether it be an implicit array index identifier from StaticMemoryIndex or 
 61StaticDiskIndex, or an explicit tag identifier from DynamicMemoryIndex 
 62"""
 63VectorIdentifierBatch = npt.NDArray[np.uint32]
 64""" Type alias for a batch of VectorIdentifiers """
 65
 66
 67class QueryResponse(NamedTuple):
 68    """
 69    Tuple with two values, identifiers and distances. Both are 1d arrays, positionally correspond, and will contain the
 70    nearest neighbors from [0..k_neighbors)
 71    """
 72
 73    identifiers: npt.NDArray[VectorIdentifier]
 74    """ A `numpy.typing.NDArray[VectorIdentifier]` array of vector identifiers, 1 dimensional """
 75    distances: npt.NDArray[np.float32]
 76    """
 77    A `numpy.typing.NDArray[numpy.float32]` of distances as calculated by the distance metric function, 1 dimensional
 78    """
 79
 80
 81class QueryResponseBatch(NamedTuple):
 82    """
 83    Tuple with two values, identifiers and distances. Both are 2d arrays, with dimensionality determined by the
 84    rows corresponding to the number of queries made, and the columns corresponding to the k neighbors
 85    requested. The two 2d arrays have an implicit, position-based relationship
 86    """
 87
 88    identifiers: npt.NDArray[VectorIdentifier]
 89    """ 
 90    A `numpy.typing.NDArray[VectorIdentifier]` array of vector identifiers, 2 dimensional. The row corresponds to index 
 91    of the query, and the column corresponds to the k neighbors requested 
 92    """
 93    distances: npt.NDArray[np.float32]
 94    """
 95    A `numpy.typing.NDArray[numpy.float32]` of distances as calculated by the distance metric function, 2 dimensional.
 96    The row corresponds to the index of the query, and the column corresponds to the distance of the query to the 
 97    *k-th* neighbor 
 98    """
 99
100
101from . import defaults
102from ._builder import build_disk_index, build_memory_index
103from ._common import valid_dtype
104from ._dynamic_memory_index import DynamicMemoryIndex
105from ._files import (
106    Metadata,
107    tags_from_file,
108    tags_to_file,
109    vectors_from_file,
110    vectors_metadata_from_file,
111    vectors_to_file,
112)
113from ._static_disk_index import StaticDiskIndex
114from ._static_memory_index import StaticMemoryIndex
115
116__all__ = [
117    "build_disk_index",
118    "build_memory_index",
119    "StaticDiskIndex",
120    "StaticMemoryIndex",
121    "DynamicMemoryIndex",
122    "defaults",
123    "DistanceMetric",
124    "VectorDType",
125    "QueryResponse",
126    "QueryResponseBatch",
127    "VectorIdentifier",
128    "VectorIdentifierBatch",
129    "VectorLike",
130    "VectorLikeBatch",
131    "Metadata",
132    "vectors_metadata_from_file",
133    "vectors_to_file",
134    "vectors_from_file",
135    "tags_to_file",
136    "tags_from_file",
137    "valid_dtype",
138]
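
To make the vector-file utilities concrete, the following sketch round-trips a numpy array through a DiskANN vector bin file. The file path is illustrative, and the positional, path-first signatures are an assumption based on how the builder source below calls vectors_metadata_from_file.

import diskannpy
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.random((1_000, 64), dtype=np.float32)

# Write the array as a DiskANN vector bin file (illustrative path).
diskannpy.vectors_to_file("/tmp/example_vectors.bin", vectors)

# Read just the (num_points, vector_dim) metadata without loading the whole file.
num_points, dims = diskannpy.vectors_metadata_from_file("/tmp/example_vectors.bin")
assert (num_points, dims) == (1_000, 64)

# Read the vectors back; the dtype must be supplied, since bin files do not encode it.
round_tripped = diskannpy.vectors_from_file("/tmp/example_vectors.bin", np.float32)
assert np.array_equal(round_tripped, vectors)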
def build_disk_index( data: Union[str, numpy.ndarray[Any, numpy.dtype[Union[Type[numpy.float32], Type[numpy.int8], Type[numpy.uint8]]]]], distance_metric: Literal['l2', 'mips', 'cosine'], index_directory: str, complexity: int, graph_degree: int, search_memory_maximum: float, build_memory_maximum: float, num_threads: int, pq_disk_bytes: int = 0, vector_dtype: Union[Type[numpy.float32], Type[numpy.int8], Type[numpy.uint8], NoneType] = None, index_prefix: str = 'ann') -> None:
 53def build_disk_index(
 54    data: Union[str, VectorLikeBatch],
 55    distance_metric: DistanceMetric,
 56    index_directory: str,
 57    complexity: int,
 58    graph_degree: int,
 59    search_memory_maximum: float,
 60    build_memory_maximum: float,
 61    num_threads: int,
 62    pq_disk_bytes: int = defaults.PQ_DISK_BYTES,
 63    vector_dtype: Optional[VectorDType] = None,
 64    index_prefix: str = "ann",
 65) -> None:
 66    """
 67    This function will construct a DiskANN disk index. Disk indices are ideal for very large datasets that
 68    are too large to fit in memory. Memory is still used, but it is primarily used to provide precise disk
 69    locations for fast retrieval of smaller subsets of the index without compromising much on recall.
 70
 71    If you provide a numpy array, it will save this array to disk in a temp location
 72    in the format DiskANN's PQ Flash Index builder requires. This temp folder is deleted upon index creation completion
 73    or error.
 74
 75    ## Distance Metric and Vector Datatype Restrictions
 76    | Metric \ Datatype | np.float32 | np.uint8 | np.int8 |
 77    |-------------------|------------|----------|---------|
 78    | L2                |      ✅     |     ✅    |    ✅    |
 79    | MIPS              |      ✅     |     ❌    |    ❌    |
 80    | Cosine [^bug-in-disk-cosine]     |      ❌     |     ❌    |    ❌    |
 81
 82    [^bug-in-disk-cosine]: For StaticDiskIndex, Cosine distances are not currently supported.
 83
 84    ### Parameters
 85    - **data**: Either a `str` representing a path to a DiskANN vector bin file, or a numpy.ndarray,
 86      of a supported dtype, in 2 dimensions. Note that `vector_dtype` must be provided if data is a `str`
 87    - **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` is supported for all 3 vector
 88      dtypes, `mips` is only available for single precision floats, and `cosine` is not currently supported for disk indices.
 89    - **index_directory**: The index files will be saved to this **existing** directory path
 90    - **complexity**: The size of the candidate nearest neighbor list to use when building the index. Values between 75
 91      and 200 are typical. Larger values will take more time to build but result in indices that provide higher recall
 92      for the same search complexity. Use a value that is at least as large as `graph_degree` unless you are prepared
 93      to compromise on quality
 94    - **graph_degree**: The degree of the graph index, typically between 60 and 150. A larger maximum degree will
 95      result in larger indices and longer indexing times, but better search quality.
 96    - **search_memory_maximum**: Build index with the expectation that the search will use at most
 97      `search_memory_maximum`, in GB.
 98    - **build_memory_maximum**: Build index using at most `build_memory_maximum` GB. Building processes typically
 99      require more memory, while search memory can be reduced.
100    - **num_threads**: Number of threads to use when creating this index. `0` is used to indicate all available
101      logical processors should be used.
102    - **pq_disk_bytes**: Use `0` to store uncompressed data on SSD. This allows the index to asymptote to 100%
103      recall. If your vectors are too large to store in SSD, this parameter provides the option to compress the
104      vectors using PQ for storing on SSD. This will trade off recall. You would also want this to be greater
105      than the number of bytes used for the PQ compressed data stored in-memory. Default is `0`.
106    - **vector_dtype**: Required if the provided `data` is of type `str`, else we use the `data.dtype` if np array.
107    - **index_prefix**: The prefix of the index files. Defaults to "ann".
108    """
109
110    _assert(
111        (isinstance(data, str) and vector_dtype is not None)
112        or isinstance(data, np.ndarray),
113        "vector_dtype is required if data is a str representing a path to the vector bin file",
114    )
115    dap_metric = _valid_metric(distance_metric)
116    _assert_is_positive_uint32(complexity, "complexity")
117    _assert_is_positive_uint32(graph_degree, "graph_degree")
118    _assert(search_memory_maximum > 0, "search_memory_maximum must be larger than 0")
119    _assert(build_memory_maximum > 0, "build_memory_maximum must be larger than 0")
120    _assert_is_nonnegative_uint32(num_threads, "num_threads")
121    _assert_is_nonnegative_uint32(pq_disk_bytes, "pq_disk_bytes")
122    _assert(index_prefix != "", "index_prefix cannot be an empty string")
123
124    index_path = Path(index_directory)
125    _assert(
126        index_path.exists() and index_path.is_dir(),
127        "index_directory must both exist and be a directory",
128    )
129
130    vector_bin_path, vector_dtype_actual = _valid_path_and_dtype(
131        data, vector_dtype, index_directory, index_prefix
132    )
133    _assert(dap_metric != _native_dap.COSINE, "Cosine is currently not supported in StaticDiskIndex")
134    if dap_metric == _native_dap.INNER_PRODUCT:
135        _assert(
136            vector_dtype_actual == np.float32,
137            "Integral vector dtypes (np.uint8, np.int8) are not supported with distance metric mips"
138        )
139
140    num_points, dimensions = vectors_metadata_from_file(vector_bin_path)
141
142    if vector_dtype_actual == np.uint8:
143        _builder = _native_dap.build_disk_uint8_index
144    elif vector_dtype_actual == np.int8:
145        _builder = _native_dap.build_disk_int8_index
146    else:
147        _builder = _native_dap.build_disk_float_index
148
149    index_prefix_path = os.path.join(index_directory, index_prefix)
150
151    _builder(
152        distance_metric=dap_metric,
153        data_file_path=vector_bin_path,
154        index_prefix_path=index_prefix_path,
155        complexity=complexity,
156        graph_degree=graph_degree,
157        final_index_ram_limit=search_memory_maximum,
158        indexing_ram_budget=build_memory_maximum,
159        num_threads=num_threads,
160        pq_disk_bytes=pq_disk_bytes,
161    )
162    _write_index_metadata(
163        index_prefix_path, vector_dtype_actual, dap_metric, num_points, dimensions
164    )

This function will construct a DiskANN disk index. Disk indices are ideal for very large datasets that are too large to fit in memory. Memory is still used, but it is primarily used to provide precise disk locations for fast retrieval of smaller subsets of the index without compromising much on recall.

If you provide a numpy array, it will save this array to disk in a temp location in the format DiskANN's PQ Flash Index builder requires. This temp folder is deleted upon index creation completion or error.

Distance Metric and Vector Datatype Restrictions

Metric \ Datatype | np.float32 | np.uint8 | np.int8
L2                |     ✅     |    ✅    |    ✅
MIPS              |     ✅     |    ❌    |    ❌
Cosine [1]        |     ❌     |    ❌    |    ❌

Parameters

  • data: Either a str representing a path to a DiskANN vector bin file, or a numpy.ndarray, of a supported dtype, in 2 dimensions. Note that vector_dtype must be provided if data is a str
  • distance_metric: A str, strictly one of {"l2", "mips", "cosine"}. l2 is supported for all 3 vector dtypes, mips is only available for single precision floats, and cosine is not currently supported for disk indices.
  • index_directory: The index files will be saved to this existing directory path
  • complexity: The size of the candidate nearest neighbor list to use when building the index. Values between 75 and 200 are typical. Larger values will take more time to build but result in indices that provide higher recall for the same search complexity. Use a value that is at least as large as graph_degree unless you are prepared to compromise on quality
  • graph_degree: The degree of the graph index, typically between 60 and 150. A larger maximum degree will result in larger indices and longer indexing times, but better search quality.
  • search_memory_maximum: Build index with the expectation that the search will use at most search_memory_maximum, in GB.
  • build_memory_maximum: Build index using at most build_memory_maximum GB. Building processes typically require more memory, while search memory can be reduced.
  • num_threads: Number of threads to use when creating this index. 0 is used to indicate all available logical processors should be used.
  • pq_disk_bytes: Use 0 to store uncompressed data on SSD. This allows the index to asymptote to 100% recall. If your vectors are too large to store in SSD, this parameter provides the option to compress the vectors using PQ for storing on SSD. This will trade off recall. You would also want this to be greater than the number of bytes used for the PQ compressed data stored in-memory. Default is 0.
  • vector_dtype: Required if the provided data is of type str, else we use the data.dtype if np array.
  • index_prefix: The prefix of the index files. Defaults to "ann".

  [1] For StaticDiskIndex, Cosine distances are not currently supported.
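
A hedged usage sketch follows; the data size, memory budgets, and paths are illustrative, and a str path to an existing vector bin file (plus vector_dtype) would work equally well for data.

import os

import diskannpy
import numpy as np

index_dir = "/tmp/disk_index"     # hypothetical path; must exist before building
os.makedirs(index_dir, exist_ok=True)

data = np.random.default_rng(0).random((100_000, 96), dtype=np.float32)

diskannpy.build_disk_index(
    data=data,                    # a 2d numpy array; a str path to a vector bin file also works
    distance_metric="l2",         # cosine is not currently supported for disk indices
    index_directory=index_dir,
    complexity=128,
    graph_degree=64,
    search_memory_maximum=4.0,    # GB budget expected at search time
    build_memory_maximum=16.0,    # GB budget while building
    num_threads=0,                # 0 = all available logical processors
    pq_disk_bytes=0,              # 0 = store uncompressed vectors on SSD
)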

def build_memory_index( data: Union[str, numpy.ndarray[Any, numpy.dtype[Union[Type[numpy.float32], Type[numpy.int8], Type[numpy.uint8]]]]], distance_metric: Literal['l2', 'mips', 'cosine'], index_directory: str, complexity: int, graph_degree: int, num_threads: int, alpha: float = 1.2000000476837158, use_pq_build: bool = False, num_pq_bytes: int = 0, use_opq: bool = False, vector_dtype: Union[Type[numpy.float32], Type[numpy.int8], Type[numpy.uint8], NoneType] = None, tags: Union[str, numpy.ndarray[Any, numpy.dtype[numpy.uint32]]] = '', filter_labels: Optional[list[list[str]]] = None, universal_label: str = '', filter_complexity: int = 0, index_prefix: str = 'ann') -> None:
167def build_memory_index(
168    data: Union[str, VectorLikeBatch],
169    distance_metric: DistanceMetric,
170    index_directory: str,
171    complexity: int,
172    graph_degree: int,
173    num_threads: int,
174    alpha: float = defaults.ALPHA,
175    use_pq_build: bool = defaults.USE_PQ_BUILD,
176    num_pq_bytes: int = defaults.NUM_PQ_BYTES,
177    use_opq: bool = defaults.USE_OPQ,
178    vector_dtype: Optional[VectorDType] = None,
179    tags: Union[str, VectorIdentifierBatch] = "",
180    filter_labels: Optional[list[list[str]]] = None,
181    universal_label: str = "",
182    filter_complexity: int = defaults.FILTER_COMPLEXITY,
183    index_prefix: str = "ann",
184) -> None:
185    """
186    This function will construct a DiskANN memory index. Memory indices are ideal for smaller datasets whose
187    indices can fit into memory. Memory indices are faster than disk indices, but usually cannot scale to massive
188    sizes in an individual index on an individual machine.
189
190    `diskannpy`'s memory indices take two forms: a `diskannpy.StaticMemoryIndex`, which will not be mutated, only
191    searched upon, and a `diskannpy.DynamicMemoryIndex`, which can be mutated AND searched upon in the same process.
192
193    ## Important Note:
194    You **must** determine the type of index you are building for. If you are building for a
195    `diskannpy.DynamicMemoryIndex`, you **must** supply a valid value for the `tags` parameter. **Do not supply
196    tags if the index is intended to be `diskannpy.StaticMemoryIndex`**!
197
198    ## Distance Metric and Vector Datatype Restrictions
199
200    | Metric \ Datatype | np.float32 | np.uint8 | np.int8 |
201    |-------------------|------------|----------|---------|
202    | L2                |      ✅     |     ✅    |    ✅    |
203    | MIPS              |      ✅     |     ❌    |    ❌    |
204    | Cosine            |      ✅     |     ✅    |    ✅    |
205
206    ### Parameters
207
208    - **data**: Either a `str` representing a path to an existing DiskANN vector bin file, or a numpy.ndarray of a
209      supported dtype in 2 dimensions. Note that `vector_dtype` must be provided if `data` is a `str`.
210    - **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for all 3
211      vector dtypes, but `mips` is only available for single precision floats.
212    - **index_directory**: The index files will be saved to this **existing** directory path
213    - **complexity**: The size of the candidate nearest neighbor list to use when building the index. Values between 75
214      and 200 are typical. Larger values will take more time to build but result in indices that provide higher recall
215      for the same search complexity. Use a value that is at least as large as `graph_degree` unless you are prepared
216      to compromise on quality
217    - **graph_degree**: The degree of the graph index, typically between 60 and 150. A larger maximum degree will
218      result in larger indices and longer indexing times, but better search quality.
219    - **num_threads**: Number of threads to use when creating this index. `0` is used to indicate all available
220      logical processors should be used.
221    - **alpha**: The alpha parameter (>=1) is used to control the nature and number of points that are added to the
222      graph. A higher alpha value (e.g., 1.4) will result in fewer hops (and IOs) to convergence, but probably more
223      distance comparisons compared to a lower alpha value.
224    - **use_pq_build**: Use product quantization during build. Product quantization is a lossy compression technique
225      that can reduce the size of the index on disk. This will trade off recall. Default is `False`.
226    - **num_pq_bytes**: The number of bytes used to store the PQ compressed data in memory. This will trade off recall.
227      Default is `0`.
228    - **use_opq**: Use optimized product quantization during build.
229    - **vector_dtype**: Required if the provided `data` is of type `str`, else we use the `data.dtype` if np array.
230    - **tags**: Tags can be defined either as a path on disk to an existing .tags file, or provided as a np.array of
231      the same length as the number of vectors. Tags are used to identify vectors in the index via your *own*
232      numbering conventions, and are absolutely required for loading DynamicMemoryIndex indices `from_file`.
233    - **filter_labels**: An optional, but exhaustive list of categories for each vector. This is used to filter
234      search results by category. If provided, this must be a list of lists, where each inner list is a list of
235      categories for the corresponding vector. For example, if you have 3 vectors, and the first vector belongs to
236      categories "a" and "b", the second vector belongs to category "b", and the third vector belongs to no categories,
237      you would provide `filter_labels=[["a", "b"], ["b"], []]`. If you do not want to provide categories for a
238      particular vector, you can provide an empty list. If you do not want to provide categories for any vectors,
239      you can provide `None` for this parameter (which is the default)
240    - **universal_label**: An optional label that indicates that this vector should be included in *every* search
241      in which it also meets the knn search criteria.
242    - **filter_complexity**: Complexity to use when using filters. Default is 0. 0 is strictly invalid if you are
243      using filters.
244    - **index_prefix**: The prefix of the index files. Defaults to "ann".
245    """
246    _assert(
247        (isinstance(data, str) and vector_dtype is not None)
248        or isinstance(data, np.ndarray),
249        "vector_dtype is required if data is a str representing a path to the vector bin file",
250    )
251    dap_metric = _valid_metric(distance_metric)
252    _assert_is_positive_uint32(complexity, "complexity")
253    _assert_is_positive_uint32(graph_degree, "graph_degree")
254    _assert(
255        alpha >= 1,
256        "alpha must be >= 1, and realistically should be kept between [1.0, 2.0)",
257    )
258    _assert_is_nonnegative_uint32(num_threads, "num_threads")
259    _assert_is_nonnegative_uint32(num_pq_bytes, "num_pq_bytes")
260    _assert_is_nonnegative_uint32(filter_complexity, "filter_complexity")
261    _assert(index_prefix != "", "index_prefix cannot be an empty string")
262    _assert(
263        filter_labels is None or filter_complexity > 0,
264        "if filter_labels is provided, filter_complexity must not be 0"
265    )
266
267    index_path = Path(index_directory)
268    _assert(
269        index_path.exists() and index_path.is_dir(),
270        "index_directory must both exist and be a directory",
271    )
272
273    vector_bin_path, vector_dtype_actual = _valid_path_and_dtype(
274        data, vector_dtype, index_directory, index_prefix
275    )
276    if dap_metric == _native_dap.INNER_PRODUCT:
277        _assert(
278            vector_dtype_actual == np.float32,
279            "Integral vector dtypes (np.uint8, np.int8) are not supported with distance metric mips"
280        )
281
282    num_points, dimensions = vectors_metadata_from_file(vector_bin_path)
283    if filter_labels is not None:
284        _assert(
285            len(filter_labels) == num_points,
286            "filter_labels must be the same length as the number of points"
287        )
288
289    if vector_dtype_actual == np.uint8:
290        _builder = _native_dap.build_memory_uint8_index
291    elif vector_dtype_actual == np.int8:
292        _builder = _native_dap.build_memory_int8_index
293    else:
294        _builder = _native_dap.build_memory_float_index
295
296    index_prefix_path = os.path.join(index_directory, index_prefix)
297
298    filter_labels_file = ""
299    if filter_labels is not None:
300        label_counts = {}
301        filter_labels_file = f"{index_prefix_path}_pylabels.txt"
302        with open(filter_labels_file, "w") as labels_file:
303            for labels in filter_labels:
304                for label in labels:
305                    label_counts[label] = 1 if label not in label_counts else label_counts[label] + 1
306                if len(labels) == 0:
307                    print("default", file=labels_file)
308                else:
309                    print(",".join(labels), file=labels_file)
310        with open(f"{index_prefix_path}_label_metadata.json", "w") as label_metadata_file:
311            json.dump(label_counts, label_metadata_file, indent=True)
312
313    if isinstance(tags, str) and tags != "":
314        use_tags = True
315        shutil.copy(tags, index_prefix_path + ".tags")
316    elif not isinstance(tags, str):
317        use_tags = True
318        tags_as_array = _castable_dtype_or_raise(tags, expected=np.uint32)
319        _assert(len(tags_as_array.shape) == 1, "Provided tags must be 1 dimensional")
320        _assert(
321            tags_as_array.shape[0] == num_points,
322            "Provided tags must contain an identical population to the number of points, "
323            f"{tags_as_array.shape[0]=}, {num_points=}",
324        )
325        tags_to_file(index_prefix_path + ".tags", tags_as_array)
326    else:
327        use_tags = False
328
329    _builder(
330        distance_metric=dap_metric,
331        data_file_path=vector_bin_path,
332        index_output_path=index_prefix_path,
333        complexity=complexity,
334        graph_degree=graph_degree,
335        alpha=alpha,
336        num_threads=num_threads,
337        use_pq_build=use_pq_build,
338        num_pq_bytes=num_pq_bytes,
339        use_opq=use_opq,
340        use_tags=use_tags,
341        filter_labels_file=filter_labels_file,
342        universal_label=universal_label,
343        filter_complexity=filter_complexity,
344    )
345
346    _write_index_metadata(
347        index_prefix_path, vector_dtype_actual, dap_metric, num_points, dimensions
348    )

This function will construct a DiskANN memory index. Memory indices are ideal for smaller datasets whose indices can fit into memory. Memory indices are faster than disk indices, but usually cannot scale to massive sizes in an individual index on an individual machine.

diskannpy's memory indices take two forms: a diskannpy.StaticMemoryIndex, which will not be mutated, only searched upon, and a diskannpy.DynamicMemoryIndex, which can be mutated AND searched upon in the same process.

Important Note:

You must determine the type of index you are building for. If you are building for a diskannpy.DynamicMemoryIndex, you must supply a valid value for the tags parameter. Do not supply tags if the index is intended to be diskannpy.StaticMemoryIndex!

Distance Metric and Vector Datatype Restrictions

Metric \ Datatype | np.float32 | np.uint8 | np.int8
L2                |     ✅     |    ✅    |    ✅
MIPS              |     ✅     |    ❌    |    ❌
Cosine            |     ✅     |    ✅    |    ✅

Parameters

  • data: Either a str representing a path to an existing DiskANN vector bin file, or a numpy.ndarray of a supported dtype in 2 dimensions. Note that vector_dtype must be provided if data is a str.
  • distance_metric: A str, strictly one of {"l2", "mips", "cosine"}. l2 and cosine are supported for all 3 vector dtypes, but mips is only available for single precision floats.
  • index_directory: The index files will be saved to this existing directory path
  • complexity: The size of the candidate nearest neighbor list to use when building the index. Values between 75 and 200 are typical. Larger values will take more time to build but result in indices that provide higher recall for the same search complexity. Use a value that is at least as large as graph_degree unless you are prepared to compromise on quality
  • graph_degree: The degree of the graph index, typically between 60 and 150. A larger maximum degree will result in larger indices and longer indexing times, but better search quality.
  • num_threads: Number of threads to use when creating this index. 0 is used to indicate all available logical processors should be used.
  • alpha: The alpha parameter (>=1) is used to control the nature and number of points that are added to the graph. A higher alpha value (e.g., 1.4) will result in fewer hops (and IOs) to convergence, but probably more distance comparisons compared to a lower alpha value.
  • use_pq_build: Use product quantization during build. Product quantization is a lossy compression technique that can reduce the size of the index on disk. This will trade off recall. Default is False.
  • num_pq_bytes: The number of bytes used to store the PQ compressed data in memory. This will trade off recall. Default is 0.
  • use_opq: Use optimized product quantization during build.
  • vector_dtype: Required if the provided data is of type str, else we use the data.dtype if np array.
  • tags: Tags can be defined either as a path on disk to an existing .tags file, or provided as a np.array of the same length as the number of vectors. Tags are used to identify vectors in the index via your own numbering conventions, and are absolutely required for loading DynamicMemoryIndex indices from_file.
  • filter_labels: An optional, but exhaustive list of categories for each vector. This is used to filter search results by category. If provided, this must be a list of lists, where each inner list is a list of categories for the corresponding vector. For example, if you have 3 vectors, and the first vector belongs to categories "a" and "b", the second vector belongs to category "b", and the third vector belongs to no categories, you would provide filter_labels=[["a", "b"], ["b"], []]. If you do not want to provide categories for a particular vector, you can provide an empty list. If you do not want to provide categories for any vectors, you can provide None for this parameter (which is the default)
  • universal_label: An optional label that indicates that this vector should be included in every search in which it also meets the knn search criteria.
  • filter_complexity: Complexity to use when using filters. Default is 0. 0 is strictly invalid if you are using filters.
  • index_prefix: The prefix of the index files. Defaults to "ann".
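
A sketch of the filtered-build workflow described above follows; labels, sizes, and paths are illustrative, and note that filter_complexity must be positive whenever filter_labels is provided.

import os

import diskannpy
import numpy as np

index_dir = "/tmp/filtered_index"   # hypothetical path; must already exist
os.makedirs(index_dir, exist_ok=True)

data = np.random.default_rng(0).random((1_000, 16), dtype=np.float32)

# An exhaustive list of categories, one inner list per vector (empty = no category).
labels = [["even"] if i % 2 == 0 else ["odd"] for i in range(1_000)]

diskannpy.build_memory_index(
    data=data,
    distance_metric="l2",
    index_directory=index_dir,
    complexity=64,
    graph_degree=32,
    num_threads=0,
    filter_labels=labels,
    filter_complexity=64,   # must be > 0 whenever filter_labels is provided
)
# No tags are supplied, so the result is intended for a StaticMemoryIndex.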
class StaticDiskIndex:
 34class StaticDiskIndex:
 35    """
 36    A StaticDiskIndex is a disk-backed index that is not mutable.
 37    """
 38
 39    def __init__(
 40        self,
 41        index_directory: str,
 42        num_threads: int,
 43        num_nodes_to_cache: int,
 44        cache_mechanism: int = 1,
 45        distance_metric: Optional[DistanceMetric] = None,
 46        vector_dtype: Optional[VectorDType] = None,
 47        dimensions: Optional[int] = None,
 48        index_prefix: str = "ann",
 49    ):
 50        """
 51        ### Parameters
 52        - **index_directory**: The directory containing the index files. This directory must contain the following
 53            files:
 54            - `{index_prefix}_sample_data.bin`
 55            - `{index_prefix}_mem.index.data`
 56            - `{index_prefix}_pq_compressed.bin`
 57            - `{index_prefix}_pq_pivots.bin`
 58            - `{index_prefix}_sample_ids.bin`
 59            - `{index_prefix}_disk.index`
 60
 61          It may also include the following optional files:
 62            - `{index_prefix}_vectors.bin`: Optional. `diskannpy` builder functions may create this file in the
 63              `index_directory` if the index was created from a numpy array
 64            - `{index_prefix}_metadata.bin`: Optional. `diskannpy` builder functions create this file to store metadata
 65            about the index, such as vector dtype, distance metric, number of vectors and vector dimensionality.
 66            If an index is built from the `diskann` cli tools, this file will not exist.
 67        - **num_threads**: Number of threads to use when searching this index. (>= 0), 0 = num_threads in system
 68        - **num_nodes_to_cache**: Number of nodes to cache in memory (> -1)
 69        - **cache_mechanism**: 1 -> use the generated sample_data.bin file for
 70            the index to initialize a set of cached nodes, up to `num_nodes_to_cache`, 2 -> ready the cache for up to
 71            `num_nodes_to_cache`, but do not initialize it with any nodes. Any other value disables node caching.
 72        - **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for all 3
 73          vector dtypes, but `mips` is only available for single precision floats. Default is `None`. **This
 74          value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it does not exist,
 75          you are required to provide it.
 76        - **vector_dtype**: The vector dtype this index has been built with. **This value is only used if a
 77          `{index_prefix}_metadata.bin` file does not exist.** If it does not exist, you are required to provide it.
 78        - **dimensions**: The vector dimensionality of this index. All new vectors inserted must be the same
 79          dimensionality. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it
 80          does not exist, you are required to provide it.
 81        - **index_prefix**: The prefix of the index files. Defaults to "ann".
 82        """
 83        index_prefix_path = _valid_index_prefix(index_directory, index_prefix)
 84        vector_dtype, metric, _, _ = _ensure_index_metadata(
 85            index_prefix_path,
 86            vector_dtype,
 87            distance_metric,
 88            1,  # it doesn't matter because we don't need it in this context anyway
 89            dimensions,
 90        )
 91        dap_metric = _valid_metric(metric)
 92
 93        _assert_is_nonnegative_uint32(num_threads, "num_threads")
 94        _assert_is_nonnegative_uint32(num_nodes_to_cache, "num_nodes_to_cache")
 95
 96        self._vector_dtype = vector_dtype
 97        if vector_dtype == np.uint8:
 98            _index = _native_dap.StaticDiskUInt8Index
 99        elif vector_dtype == np.int8:
100            _index = _native_dap.StaticDiskInt8Index
101        else:
102            _index = _native_dap.StaticDiskFloatIndex
103        self._index = _index(
104            distance_metric=dap_metric,
105            index_path_prefix=index_prefix_path,
106            num_threads=num_threads,
107            num_nodes_to_cache=num_nodes_to_cache,
108            cache_mechanism=cache_mechanism,
109        )
110
111    def search(
112        self, query: VectorLike, k_neighbors: int, complexity: int, beam_width: int = 2
113    ) -> QueryResponse:
114        """
115        Searches the index by a single query vector.
116
117        ### Parameters
118        - **query**: 1d numpy array of the same dimensionality and dtype of the index.
119        - **k_neighbors**: Number of neighbors to be returned. If query vector exists in index, it almost definitely
120          will be returned as well, so adjust your ``k_neighbors`` as appropriate. Must be > 0.
121        - **complexity**: Size of distance ordered list of candidate neighbors to use while searching. List size
122          increases accuracy at the cost of latency. Must be at least k_neighbors in size.
123        - **beam_width**: The beamwidth to be used for search. This is the maximum number of IO requests each query
124          will issue per iteration of search code. Larger beamwidth will result in fewer IO round-trips per query,
125          but might result in slightly higher total number of IO requests to SSD per query. For the highest query
126          throughput with a fixed SSD IOps rating, use W=1. For best latency, use W=4 or 8, or a higher-complexity search.
127          Specifying 0 will optimize the beamwidth depending on the number of threads performing search, but will
128          involve some tuning overhead.
129        """
130        _query = _castable_dtype_or_raise(query, expected=self._vector_dtype)
131        _assert(len(_query.shape) == 1, "query vector must be 1-d")
132        _assert_is_positive_uint32(k_neighbors, "k_neighbors")
133        _assert_is_positive_uint32(complexity, "complexity")
134        _assert_is_positive_uint32(beam_width, "beam_width")
135
136        if k_neighbors > complexity:
137            warnings.warn(
138                f"{k_neighbors=} asked for, but {complexity=} was smaller. Increasing {complexity} to {k_neighbors}"
139            )
140            complexity = k_neighbors
141
142        neighbors, distances = self._index.search(
143            query=_query,
144            knn=k_neighbors,
145            complexity=complexity,
146            beam_width=beam_width,
147        )
148        return QueryResponse(identifiers=neighbors, distances=distances)
149
150    def batch_search(
151        self,
152        queries: VectorLikeBatch,
153        k_neighbors: int,
154        complexity: int,
155        num_threads: int,
156        beam_width: int = 2,
157    ) -> QueryResponseBatch:
158        """
159        Searches the index by a batch of query vectors.
160
161        This search is parallelized and far more efficient than searching for each vector individually.
162
163        ### Parameters
164        - **queries**: 2d numpy array, with column dimensionality matching the index and row dimensionality being the
165          number of queries intended to search for in parallel. Dtype must match dtype of the index.
166        - **k_neighbors**: Number of neighbors to be returned. If query vector exists in index, it almost definitely
167          will be returned as well, so adjust your ``k_neighbors`` as appropriate. Must be > 0.
168        - **complexity**: Size of distance ordered list of candidate neighbors to use while searching. List size
169          increases accuracy at the cost of latency. Must be at least k_neighbors in size.
170        - **num_threads**: Number of threads to use when searching this index. (>= 0), 0 = num_threads in system
171        - **beam_width**: The beamwidth to be used for search. This is the maximum number of IO requests each query
172          will issue per iteration of search code. Larger beamwidth will result in fewer IO round-trips per query,
173          but might result in slightly higher total number of IO requests to SSD per query. For the highest query
174          throughput with a fixed SSD IOps rating, use W=1. For best latency, use W=4 or 8, or a higher-complexity search.
175          Specifying 0 will optimize the beamwidth depending on the number of threads performing search, but will
176          involve some tuning overhead.
177        """
178        _queries = _castable_dtype_or_raise(queries, expected=self._vector_dtype)
179        _assert_2d(_queries, "queries")
180        _assert_is_positive_uint32(k_neighbors, "k_neighbors")
181        _assert_is_positive_uint32(complexity, "complexity")
182        _assert_is_nonnegative_uint32(num_threads, "num_threads")
183        _assert_is_positive_uint32(beam_width, "beam_width")
184
185        if k_neighbors > complexity:
186            warnings.warn(
187                f"{k_neighbors=} asked for, but {complexity=} was smaller. Increasing {complexity} to {k_neighbors}"
188            )
189            complexity = k_neighbors
190
191        num_queries, dim = _queries.shape
192        neighbors, distances = self._index.batch_search(
193            queries=_queries,
194            num_queries=num_queries,
195            knn=k_neighbors,
196            complexity=complexity,
197            beam_width=beam_width,
198            num_threads=num_threads,
199        )
200        return QueryResponseBatch(identifiers=neighbors, distances=distances)

A StaticDiskIndex is a disk-backed index that is not mutable.

StaticDiskIndex( index_directory: str, num_threads: int, num_nodes_to_cache: int, cache_mechanism: int = 1, distance_metric: Optional[Literal['l2', 'mips', 'cosine']] = None, vector_dtype: Union[Type[numpy.float32], Type[numpy.int8], Type[numpy.uint8], NoneType] = None, dimensions: Optional[int] = None, index_prefix: str = 'ann')

Parameters

  • index_directory: The directory containing the index files. This directory must contain the following files:

    • {index_prefix}_sample_data.bin
    • {index_prefix}_mem.index.data
    • {index_prefix}_pq_compressed.bin
    • {index_prefix}_pq_pivots.bin
    • {index_prefix}_sample_ids.bin
    • {index_prefix}_disk.index

    It may also include the following optional files:

    • {index_prefix}_vectors.bin: Optional. diskannpy builder functions may create this file in the index_directory if the index was created from a numpy array
    • {index_prefix}_metadata.bin: Optional. diskannpy builder functions create this file to store metadata about the index, such as vector dtype, distance metric, number of vectors and vector dimensionality. If an index is built from the diskann cli tools, this file will not exist.
  • num_threads: Number of threads to use when searching this index. (>= 0), 0 = num_threads in system
  • num_nodes_to_cache: Number of nodes to cache in memory (> -1)
  • cache_mechanism: 1 -> use the generated sample_data.bin file for the index to initialize a set of cached nodes, up to num_nodes_to_cache, 2 -> ready the cache for up to num_nodes_to_cache, but do not initialize it with any nodes. Any other value disables node caching.
  • distance_metric: A str, strictly one of {"l2", "mips", "cosine"}. l2 and cosine are supported for all 3 vector dtypes, but mips is only available for single precision floats. Default is None. This value is only used if a {index_prefix}_metadata.bin file does not exist. If it does not exist, you are required to provide it.
  • vector_dtype: The vector dtype this index has been built with. This value is only used if a {index_prefix}_metadata.bin file does not exist. If it does not exist, you are required to provide it.
  • dimensions: The vector dimensionality of this index. All new vectors inserted must be the same dimensionality. This value is only used if a {index_prefix}_metadata.bin file does not exist. If it does not exist, you are required to provide it.
  • index_prefix: The prefix of the index files. Defaults to "ann".
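
For instance, here is a sketch of loading an index that lacks the optional {index_prefix}_metadata.bin file (say, one built with the diskann cli tools), so the metadata must be supplied explicitly. The path and dimensionality are illustrative.

import diskannpy
import numpy as np

index = diskannpy.StaticDiskIndex(
    index_directory="/tmp/disk_index",   # hypothetical path containing the files listed above
    num_threads=0,                       # 0 = all available logical processors
    num_nodes_to_cache=100_000,
    cache_mechanism=1,                   # warm the cache from {index_prefix}_sample_data.bin
    distance_metric="l2",                # only consulted when no metadata file exists
    vector_dtype=np.float32,             # likewise
    dimensions=96,                       # likewise
)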
def search( self, query: numpy.ndarray[typing.Any, numpy.dtype[typing.Union[typing.Type[numpy.float32], typing.Type[numpy.int8], typing.Type[numpy.uint8]]]], k_neighbors: int, complexity: int, beam_width: int = 2) -> QueryResponse:

Searches the index by a single query vector.

Parameters

  • query: 1d numpy array of the same dimensionality and dtype of the index.
  • k_neighbors: Number of neighbors to be returned. If query vector exists in index, it almost definitely will be returned as well, so adjust your k_neighbors as appropriate. Must be > 0.
  • complexity: Size of distance ordered list of candidate neighbors to use while searching. List size increases accuracy at the cost of latency. Must be at least k_neighbors in size.
  • beam_width: The beamwidth to be used for search. This is the maximum number of IO requests each query will issue per iteration of search code. Larger beamwidth will result in fewer IO round-trips per query, but might result in slightly higher total number of IO requests to SSD per query. For the highest query throughput with a fixed SSD IOps rating, use W=1. For best latency, use W=4 or 8, or a higher-complexity search. Specifying 0 will optimize the beamwidth depending on the number of threads performing search, but will involve some tuning overhead.
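
A sketch of single and batched queries, continuing from the disk index built in the earlier sketch (batch_search is shown in the class source above and parallelizes across queries); shapes and paths are illustrative.

import diskannpy
import numpy as np

index = diskannpy.StaticDiskIndex(
    index_directory="/tmp/disk_index",   # hypothetical path from the earlier build sketch
    num_threads=0,
    num_nodes_to_cache=100_000,
)

queries = np.random.default_rng(1).random((64, 96), dtype=np.float32)

# Single query: returns a QueryResponse of 1d (identifiers, distances).
ids, dists = index.search(queries[0], k_neighbors=10, complexity=64, beam_width=4)

# Batched queries: returns a QueryResponseBatch of 2d arrays, one row per query.
ids, dists = index.batch_search(
    queries, k_neighbors=10, complexity=64, num_threads=0, beam_width=4
)
assert ids.shape == dists.shape == (64, 10)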
class StaticMemoryIndex:
 34class StaticMemoryIndex:
 35    """
 36    A StaticMemoryIndex is an immutable in-memory DiskANN index.
 37    """
 38
 39    def __init__(
 40        self,
 41        index_directory: str,
 42        num_threads: int,
 43        initial_search_complexity: int,
 44        index_prefix: str = "ann",
 45        distance_metric: Optional[DistanceMetric] = None,
 46        vector_dtype: Optional[VectorDType] = None,
 47        dimensions: Optional[int] = None,
 48        enable_filters: bool = False
 49    ):
 50        """
 51        ### Parameters
 52        - **index_directory**: The directory containing the index files. This directory must contain the following
 53          files:
 54            - `{index_prefix}.data`
 55            - `{index_prefix}`
 56
 57
 58          It may also include the following optional files:
 59            - `{index_prefix}_vectors.bin`: Optional. `diskannpy` builder functions may create this file in the
 60              `index_directory` if the index was created from a numpy array
 61            - `{index_prefix}_metadata.bin`: Optional. `diskannpy` builder functions create this file to store metadata
 62            about the index, such as vector dtype, distance metric, number of vectors and vector dimensionality.
 63            If an index is built from the `diskann` cli tools, this file will not exist.
 64        - **num_threads**: Number of threads to use when searching this index. (>= 0), 0 = num_threads in system
 65        - **initial_search_complexity**: Should be set to the most common `complexity` expected to be used during the
 66          life of this `diskannpy.StaticMemoryIndex` object. The working scratch memory allocated is based off of
 67          `initial_search_complexity` * `num_threads`. Note that it may be resized if a `search` or `batch_search`
 68          operation requests a space larger than can be accommodated by these values.
 69        - **index_prefix**: The prefix of the index files. Defaults to "ann".
 70        - **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for all 3
 71          vector dtypes, but `mips` is only available for single precision floats. Default is `None`. **This
 72          value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it does not exist,
 73          you are required to provide it.
 74        - **vector_dtype**: The vector dtype this index has been built with. **This value is only used if a
 75          `{index_prefix}_metadata.bin` file does not exist.** If it does not exist, you are required to provide it.
 76        - **dimensions**: The vector dimensionality of this index. All new vectors inserted must be the same
 77          dimensionality. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it
 78          does not exist, you are required to provide it.
 79        - **enable_filters**: Set to `True` to enable filtered search on an index that was built with filter labels.
 80        """
 81        index_prefix_path = _valid_index_prefix(index_directory, index_prefix)
 82        self._labels_map = {}
 83        self._labels_metadata = {}
 84        if enable_filters:
 85            try:
 86                with open(f"{index_prefix_path}_labels_map.txt", "r") as labels_map_if:
 87                    for line in labels_map_if:
 88                        (key, val) = line.split("\t")
 89                        self._labels_map[key] = int(val)
 90                with open(f"{index_prefix_path}_label_metadata.json", "r") as labels_metadata_if:
 91                    self._labels_metadata = json.load(labels_metadata_if)
 92            except: # noqa: E722
 93                # exceptions are basically presumed to be either file not found or file not formatted correctly
 94                raise RuntimeError("Filter labels file was unable to be processed.")
 95        vector_dtype, metric, num_points, dims = _ensure_index_metadata(
 96            index_prefix_path,
 97            vector_dtype,
 98            distance_metric,
 99            1,  # it doesn't matter because we don't need it in this context anyway
100            dimensions,
101        )
102        dap_metric = _valid_metric(metric)
103
104        _assert_is_nonnegative_uint32(num_threads, "num_threads")
105        _assert_is_positive_uint32(
106            initial_search_complexity, "initial_search_complexity"
107        )
108
109        self._vector_dtype = vector_dtype
110        self._dimensions = dims
111
112        if vector_dtype == np.uint8:
113            _index = _native_dap.StaticMemoryUInt8Index
114        elif vector_dtype == np.int8:
115            _index = _native_dap.StaticMemoryInt8Index
116        else:
117            _index = _native_dap.StaticMemoryFloatIndex
118
119        self._index = _index(
120            distance_metric=dap_metric,
121            num_points=num_points,
122            dimensions=dims,
123            index_path=index_prefix_path,
124            num_threads=num_threads,
125            initial_search_complexity=initial_search_complexity,
126        )
127
128    def search(
129            self, query: VectorLike, k_neighbors: int, complexity: int, filter_label: str = ""
130    ) -> QueryResponse:
131        """
132        Searches the index by a single query vector.
133
134        ### Parameters
135        - **query**: 1d numpy array of the same dimensionality and dtype as the index.
136        - **k_neighbors**: Number of neighbors to be returned. If query vector exists in index, it almost definitely
137          will be returned as well, so adjust your ``k_neighbors`` as appropriate. Must be > 0.
138        - **complexity**: Size of distance ordered list of candidate neighbors to use while searching. List size
139          increases accuracy at the cost of latency. Must be at least k_neighbors in size.
140        """
141        if filter_label != "":
142            if len(self._labels_map) == 0:
143                raise ValueError(
144                    f"A filter label of {filter_label} was provided, but this class was not initialized with filters "
145                    "enabled, e.g. StaticDiskMemory(..., enable_filters=True)"
146                )
147            if filter_label not in self._labels_map:
148                raise ValueError(
149                    f"A filter label of {filter_label} was provided, but the external(str)->internal(np.uint32) labels map "
150                    f"does not include that label."
151                )
152            k_neighbors = min(k_neighbors, self._labels_metadata[filter_label])
153        _query = _castable_dtype_or_raise(query, expected=self._vector_dtype)
154        _assert(len(_query.shape) == 1, "query vector must be 1-d")
155        _assert(
156            _query.shape[0] == self._dimensions,
157            f"query vector must have the same dimensionality as the index; index dimensionality: {self._dimensions}, "
158            f"query dimensionality: {_query.shape[0]}",
159            )
160        _assert_is_positive_uint32(k_neighbors, "k_neighbors")
161        _assert_is_nonnegative_uint32(complexity, "complexity")
162
163        if k_neighbors > complexity:
164            warnings.warn(
165                f"k_neighbors={k_neighbors} asked for, but list_size={complexity} was smaller. Increasing {complexity} to {k_neighbors}"
166            )
167            complexity = k_neighbors
168
169        if filter_label == "":
170            neighbors, distances = self._index.search(query=_query, knn=k_neighbors, complexity=complexity)
171        else:
172            filter_id = self._labels_map[filter_label]
173            neighbors, distances = self._index.search_with_filter(
174                query=_query,
175                knn=k_neighbors,
176                complexity=complexity,
177                filter=filter_id
178            )
179        return QueryResponse(identifiers=neighbors, distances=distances)
180
181
182    def batch_search(
183        self,
184        queries: VectorLikeBatch,
185        k_neighbors: int,
186        complexity: int,
187        num_threads: int,
188    ) -> QueryResponseBatch:
189        """
190        Searches the index by a batch of query vectors.
191
192        This search is parallelized and far more efficient than searching for each vector individually.
193
194        ### Parameters
195        - **queries**: 2d numpy array, with column dimensionality matching the index and row dimensionality being the
196          number of queries intended to search for in parallel. Dtype must match dtype of the index.
197        - **k_neighbors**: Number of neighbors to be returned. If query vector exists in index, it almost definitely
198          will be returned as well, so adjust your ``k_neighbors`` as appropriate. Must be > 0.
199        - **complexity**: Size of distance ordered list of candidate neighbors to use while searching. List size
200          increases accuracy at the cost of latency. Must be at least k_neighbors in size.
201        - **num_threads**: Number of threads to use when searching this index. Must be >= 0; 0 uses all available system threads.
202        """
203
204        _queries = _castable_dtype_or_raise(queries, expected=self._vector_dtype)
205        _assert(len(_queries.shape) == 2, "queries must be a 2-d numpy array")
206        _assert(
207            _queries.shape[1] == self._dimensions,
208            f"query vectors must have the same dimensionality as the index; index dimensionality: {self._dimensions}, "
209            f"query dimensionality: {_queries.shape[1]}",
210        )
211        _assert_is_positive_uint32(k_neighbors, "k_neighbors")
212        _assert_is_positive_uint32(complexity, "complexity")
213        _assert_is_nonnegative_uint32(num_threads, "num_threads")
214
215        if k_neighbors > complexity:
216            warnings.warn(
217                f"k_neighbors={k_neighbors} asked for, but list_size={complexity} was smaller. Increasing {complexity} to {k_neighbors}"
218            )
219            complexity = k_neighbors
220
221        num_queries, dim = _queries.shape
222        neighbors, distances = self._index.batch_search(
223            queries=_queries,
224            num_queries=num_queries,
225            knn=k_neighbors,
226            complexity=complexity,
227            num_threads=num_threads,
228        )
229        return QueryResponseBatch(identifiers=neighbors, distances=distances)
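
As a quick illustration of the `batch_search` method shown above, here is a minimal usage sketch. It is not part of the library source; the index directory, dimensionality, and query count are hypothetical, and it assumes a float32 index was previously saved under the default "ann" prefix:

```python
import numpy as np
import diskannpy

# Load a previously built index (hypothetical path); dtype, metric, and dims are
# read from the {index_prefix}_metadata.bin file written by the diskannpy builders.
index = diskannpy.StaticMemoryIndex(
    index_directory="/tmp/my_index",
    num_threads=0,                 # 0 uses all available system threads
    initial_search_complexity=64,  # sized to the complexity we expect below
)

# 100 hypothetical queries; dtype and column count must match the index.
queries = np.random.default_rng().random((100, 128), dtype=np.float32)

response = index.batch_search(queries, k_neighbors=10, complexity=64, num_threads=0)
# response.identifiers is a (100, 10) array of neighbor ids and
# response.distances is the matching (100, 10) array of distances.
```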

A StaticMemoryIndex is an immutable in-memory DiskANN index.

StaticMemoryIndex( index_directory: str, num_threads: int, initial_search_complexity: int, index_prefix: str = 'ann', distance_metric: Optional[Literal['l2', 'mips', 'cosine']] = None, vector_dtype: Union[Type[numpy.float32], Type[numpy.int8], Type[numpy.uint8], NoneType] = None, dimensions: Optional[int] = None, enable_filters: bool = False)
 39    def __init__(
 40        self,
 41        index_directory: str,
 42        num_threads: int,
 43        initial_search_complexity: int,
 44        index_prefix: str = "ann",
 45        distance_metric: Optional[DistanceMetric] = None,
 46        vector_dtype: Optional[VectorDType] = None,
 47        dimensions: Optional[int] = None,
 48        enable_filters: bool = False
 49    ):
 50        """
 51        ### Parameters
 52        - **index_directory**: The directory containing the index files. This directory must contain the following
 53          files:
 54            - `{index_prefix}.data`
 55            - `{index_prefix}`
 56
 57
 58          It may also include the following optional files:
 59            - `{index_prefix}_vectors.bin`: Optional. `diskannpy` builder functions may create this file in the
 60              `index_directory` if the index was created from a numpy array
 61            - `{index_prefix}_metadata.bin`: Optional. `diskannpy` builder functions create this file to store metadata
 62            about the index, such as vector dtype, distance metric, number of vectors and vector dimensionality.
 63            If an index is built from the `diskann` cli tools, this file will not exist.
 64        - **num_threads**: Number of threads to use when searching this index. Must be >= 0; 0 uses all available system threads.
 65        - **initial_search_complexity**: Should be set to the most common `complexity` expected to be used during the
 66          life of this `diskannpy.StaticMemoryIndex` object. The working scratch memory allocated is based off of
 67          `initial_search_complexity` * `num_threads`. Note that it may be resized if a `search` or `batch_search`
 68          operation requests a space larger than can be accommodated by these values.
 69        - **index_prefix**: The prefix of the index files. Defaults to "ann".
 70        - **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for all 3
 71          vector dtypes, but `mips` is only available for single precision floats. Default is `None`. **This
 72          value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If that file does not
 73          exist, you are required to provide this value.
 74        - **vector_dtype**: The vector dtype this index has been built with. **This value is only used if a
 75          `{index_prefix}_metadata.bin` file does not exist.** If that file does not exist, you are required to provide this value.
 76        - **dimensions**: The vector dimensionality of this index. All query vectors must have the same
 77          dimensionality. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If that
 78          file does not exist, you are required to provide this value.
 79        - **enable_filters**: Set to True to enable filtered search over an index that was built with filter labels.
 80        """
 81        index_prefix_path = _valid_index_prefix(index_directory, index_prefix)
 82        self._labels_map = {}
 83        self._labels_metadata = {}
 84        if enable_filters:
 85            try:
 86                with open(f"{index_prefix_path}_labels_map.txt", "r") as labels_map_if:
 87                    for line in labels_map_if:
 88                        (key, val) = line.split("\t")
 89                        self._labels_map[key] = int(val)
 90                with open(f"{index_prefix_path}_label_metadata.json", "r") as labels_metadata_if:
 91                    self._labels_metadata = json.load(labels_metadata_if)
 92            except Exception:
 93                # exceptions are presumed to mean the file was either not found or not formatted correctly
 94                raise RuntimeError("Filter labels file was unable to be processed.")
 95        vector_dtype, metric, num_points, dims = _ensure_index_metadata(
 96            index_prefix_path,
 97            vector_dtype,
 98            distance_metric,
 99            1,  # it doesn't matter because we don't need it in this context anyway
100            dimensions,
101        )
102        dap_metric = _valid_metric(metric)
103
104        _assert_is_nonnegative_uint32(num_threads, "num_threads")
105        _assert_is_positive_uint32(
106            initial_search_complexity, "initial_search_complexity"
107        )
108
109        self._vector_dtype = vector_dtype
110        self._dimensions = dims
111
112        if vector_dtype == np.uint8:
113            _index = _native_dap.StaticMemoryUInt8Index
114        elif vector_dtype == np.int8:
115            _index = _native_dap.StaticMemoryInt8Index
116        else:
117            _index = _native_dap.StaticMemoryFloatIndex
118
119        self._index = _index(
120            distance_metric=dap_metric,
121            num_points=num_points,
122            dimensions=dims,
123            index_path=index_prefix_path,
124            num_threads=num_threads,
125            initial_search_complexity=initial_search_complexity,
126        )

Parameters

  • index_directory: The directory containing the index files. This directory must contain the following files:

    • {index_prefix}.data
    • {index_prefix}

    It may also include the following optional files:

    • {index_prefix}_vectors.bin: Optional. diskannpy builder functions may create this file in the index_directory if the index was created from a numpy array
    • {index_prefix}_metadata.bin: Optional. diskannpy builder functions create this file to store metadata about the index, such as vector dtype, distance metric, number of vectors and vector dimensionality. If an index is built from the diskann cli tools, this file will not exist.
  • num_threads: Number of threads to use when searching this index. Must be >= 0; 0 uses all available system threads.
  • initial_search_complexity: Should be set to the most common complexity expected to be used during the life of this diskannpy.StaticMemoryIndex object. The working scratch memory allocated is based off of initial_search_complexity * num_threads. Note that it may be resized if a search or batch_search operation requests a space larger than can be accommodated by these values.
  • index_prefix: The prefix of the index files. Defaults to "ann".
  • distance_metric: A str, strictly one of {"l2", "mips", "cosine"}. l2 and cosine are supported for all 3 vector dtypes, but mips is only available for single precision floats. Default is None. This value is only used if a {index_prefix}_metadata.bin file does not exist. If that file does not exist, you are required to provide this value.
  • vector_dtype: The vector dtype this index has been built with. This value is only used if a {index_prefix}_metadata.bin file does not exist. If that file does not exist, you are required to provide this value.
  • dimensions: The vector dimensionality of this index. All query vectors must have the same dimensionality. This value is only used if a {index_prefix}_metadata.bin file does not exist. If that file does not exist, you are required to provide this value.
  • enable_filters: Set to True to enable filtered search over an index that was built with filter labels.
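
When the index was produced by the `diskann` CLI tools, the `{index_prefix}_metadata.bin` file is absent, so the metadata parameters must be passed explicitly. A minimal sketch under that assumption (the path, metric, dtype, and dimensionality below are hypothetical):

```python
import numpy as np
import diskannpy

# Without the metadata file, distance_metric, vector_dtype, and dimensions are
# all required; when the file exists, they can be omitted entirely.
index = diskannpy.StaticMemoryIndex(
    index_directory="/data/cli_built_index",  # hypothetical path
    num_threads=0,
    initial_search_complexity=32,
    index_prefix="ann",
    distance_metric="l2",
    vector_dtype=np.float32,
    dimensions=128,
)
```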
def search( self, query: numpy.ndarray[typing.Any, numpy.dtype[typing.Union[typing.Type[numpy.float32], typing.Type[numpy.int8], typing.Type[numpy.uint8]]]], k_neighbors: int, complexity: int, filter_label: str = '') -> QueryResponse:
128    def search(
129            self, query: VectorLike, k_neighbors: int, complexity: int, filter_label: str = ""
130    ) -> QueryResponse:
131        """
132        Searches the index by a single query vector.
133
134        ### Parameters
135        - **query**: 1d numpy array of the same dimensionality and dtype as the index.
136        - **k_neighbors**: Number of neighbors to be returned. If query vector exists in index, it almost definitely
137          will be returned as well, so adjust your ``k_neighbors`` as appropriate. Must be > 0.
138        - **complexity**: Size of distance ordered list of candidate neighbors to use while searching. List size
139          increases accuracy at the cost of latency. Must be at least k_neighbors in size.
140        """
141        if filter_label != "":
142            if len(self._labels_map) == 0:
143                raise ValueError(
144                    f"A filter label of {filter_label} was provided, but this class was not initialized with filters "
145                    "enabled, e.g. StaticDiskMemory(..., enable_filters=True)"
146                )
147            if filter_label not in self._labels_map:
148                raise ValueError(
149                    f"A filter label of {filter_label} was provided, but the external(str)->internal(np.uint32) labels map "
150                    f"does not include that label."
151                )
152            k_neighbors = min(k_neighbors, self._labels_metadata[filter_label])
153        _query = _castable_dtype_or_raise(query, expected=self._vector_dtype)
154        _assert(len(_query.shape) == 1, "query vector must be 1-d")
155        _assert(
156            _query.shape[0] == self._dimensions,
157            f"query vector must have the same dimensionality as the index; index dimensionality: {self._dimensions}, "
158            f"query dimensionality: {_query.shape[0]}",
159            )
160        _assert_is_positive_uint32(k_neighbors, "k_neighbors")
161        _assert_is_nonnegative_uint32(complexity, "complexity")
162
163        if k_neighbors > complexity:
164            warnings.warn(
165                f"k_neighbors={k_neighbors} asked for, but list_size={complexity} was smaller. Increasing {complexity} to {k_neighbors}"
166            )
167            complexity = k_neighbors
168
169        if filter_label == "":
170            neighbors, distances = self._index.search(query=_query, knn=k_neighbors, complexity=complexity)
171        else:
172            filter_id = self._labels_map[filter_label]
173            neighbors, distances = self._index.search_with_filter(
174                query=_query,
175                knn=k_neighbors,
176                complexity=complexity,
177                filter=filter_id
178            )
179        return QueryResponse(identifiers=neighbors, distances=distances)

Searches the index by a single query vector.

Parameters

  • query: 1d numpy array of the same dimensionality and dtype as the index.
  • k_neighbors: Number of neighbors to be returned. If query vector exists in index, it almost definitely will be returned as well, so adjust your k_neighbors as appropriate. Must be > 0.
  • complexity: Size of distance ordered list of candidate neighbors to use while searching. List size increases accuracy at the cost of latency. Must be at least k_neighbors in size.
  • filter_label: Optional filter label to restrict results to vectors tagged with it. Requires the index to have been loaded with enable_filters=True. Defaults to "" (no filtering).
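
A minimal single-query sketch, including the optional `filter_label` argument; the path, label name, and vector shape are hypothetical, and filtered search assumes the index was loaded with `enable_filters=True`:

```python
import numpy as np
import diskannpy

index = diskannpy.StaticMemoryIndex(
    index_directory="/data/filtered_index",  # hypothetical path
    num_threads=0,
    initial_search_complexity=32,
    enable_filters=True,  # loads the _labels_map.txt and _label_metadata.json files
)

# The query must be 1d and match the index's dtype and dimensionality.
query = np.random.default_rng().random(128, dtype=np.float32)

# Unfiltered search over the whole index.
unfiltered = index.search(query, k_neighbors=10, complexity=32)

# Filtered search: only vectors tagged with the hypothetical label "red" are candidates.
filtered = index.search(query, k_neighbors=10, complexity=32, filter_label="red")
print(filtered.identifiers, filtered.distances)
```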
class DynamicMemoryIndex:
 41class DynamicMemoryIndex:
 42    """
 43    A DynamicMemoryIndex instance is used to both search and mutate a `diskannpy` memory index. This index is unlike
 44    either `diskannpy.StaticMemoryIndex` or `diskannpy.StaticDiskIndex` in the following ways:
 45
 46    - It requires an explicit vector identifier for each vector added to it.
 47    - Insert and (lazy) deletion operations are provided for a flexible, living index
 48
 49    The mutable aspect of this index will absolutely impact search time performance as new vectors are added and
 50    old ones deleted. `DynamicMemoryIndex.consolidate_delete()` should be called periodically to restructure the
 51    index, removing deleted vectors and improving per-search performance, at the cost of an expensive index
 52    consolidation.
 53    """
 54
 55    @classmethod
 56    def from_file(
 57        cls,
 58        index_directory: str,
 59        max_vectors: int,
 60        complexity: int,
 61        graph_degree: int,
 62        saturate_graph: bool = defaults.SATURATE_GRAPH,
 63        max_occlusion_size: int = defaults.MAX_OCCLUSION_SIZE,
 64        alpha: float = defaults.ALPHA,
 65        num_threads: int = defaults.NUM_THREADS,
 66        filter_complexity: int = defaults.FILTER_COMPLEXITY,
 67        num_frozen_points: int = defaults.NUM_FROZEN_POINTS_DYNAMIC,
 68        initial_search_complexity: int = 0,
 69        search_threads: int = 0,
 70        concurrent_consolidation: bool = True,
 71        index_prefix: str = "ann",
 72        distance_metric: Optional[DistanceMetric] = None,
 73        vector_dtype: Optional[VectorDType] = None,
 74        dimensions: Optional[int] = None,
 75    ) -> "DynamicMemoryIndex":
 76        """
 77        The `from_file` classmethod is used to load a previously saved index from disk. This index *must* have been
 78        created with a valid `tags` file or `tags` np.ndarray of `diskannpy.VectorIdentifier`s. It is *strongly*
 79        recommended that you use the same parameters as the `diskannpy.build_memory_index()` function that created
 80        the index.
 81
 82        ### Parameters
 83        - **index_directory**: The directory containing the index files. This directory must contain the following
 84            files:
 85            - `{index_prefix}.data`
 86            - `{index_prefix}.tags`
 87            - `{index_prefix}`
 88
 89          It may also include the following optional files:
 90            - `{index_prefix}_vectors.bin`: Optional. `diskannpy` builder functions may create this file in the
 91              `index_directory` if the index was created from a numpy array
 92            - `{index_prefix}_metadata.bin`: Optional. `diskannpy` builder functions create this file to store metadata
 93            about the index, such as vector dtype, distance metric, number of vectors and vector dimensionality.
 94            If an index is built from the `diskann` cli tools, this file will not exist.
 95        - **max_vectors**: Capacity of the memory index including space for future insertions.
 96        - **complexity**: Complexity (a.k.a `L`) references the size of the list we store candidate approximate
 97          neighbors in. It's used during save (which is an index rebuild), and it's used as an initial search size to
 98          warm up our index and lower the latency for initial real searches.
 99        - **graph_degree**: Graph degree (a.k.a. `R`) is the maximum degree allowed for a node in the index's graph
100          structure. This degree will be pruned throughout the course of the index build, but it will never grow beyond
101          this value. Higher R values require longer index build times, but may result in an index showing excellent
102          recall and latency characteristics.
103        - **saturate_graph**: If True, the adjacency list of each node will be saturated with neighbors to have exactly
104          `graph_degree` neighbors. If False, each node will have between 1 and `graph_degree` neighbors.
105        - **max_occlusion_size**: The maximum number of points that can be considered by the occlude_list function.
106        - **alpha**: The alpha parameter (>=1) is used to control the nature and number of points that are added to the
107          graph. A higher alpha value (e.g., 1.4) will result in fewer hops (and IOs) to convergence, but probably
108          more distance comparisons compared to a lower alpha value.
109        - **num_threads**: Number of threads to use when creating this index. `0` indicates we should use all available
110          logical processors.
111        - **filter_complexity**: Complexity to use when using filters. Default is 0.
112        - **num_frozen_points**: Number of points to freeze. Default is 1.
113        - **initial_search_complexity**: Should be set to the most common `complexity` expected to be used during the
114          life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of
115          `initial_search_complexity` * `search_threads`. Note that it may be resized if a `search` or `batch_search`
116          operation requests a space larger than can be accommodated by these values.
117        - **search_threads**: Should be set to the most common `num_threads` expected to be used during the
118          life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of
119          `initial_search_complexity` * `search_threads`. Note that it may be resized if a `batch_search`
120          operation requests a space larger than can be accommodated by these values.
121        - **concurrent_consolidation**: This flag dictates whether consolidation can be run alongside inserts and
122          deletes, or whether the index is locked down to changes while consolidation is ongoing.
123        - **index_prefix**: The prefix of the index files. Defaults to "ann".
124        - **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for all 3
125          vector dtypes, but `mips` is only available for single precision floats. Default is `None`. **This
126          value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If that file does not
127          exist, you are required to provide this value.
128        - **vector_dtype**: The vector dtype this index has been built with. **This value is only used if a
129          `{index_prefix}_metadata.bin` file does not exist.** If that file does not exist, you are required to provide this value.
130        - **dimensions**: The vector dimensionality of this index. All new vectors inserted must be the same
131          dimensionality. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If that
132          file does not exist, you are required to provide this value.
133
134        ### Returns
135        A `diskannpy.DynamicMemoryIndex` object, with the index loaded from disk and ready to use for insertions,
136        deletions, and searches.
137
138        """
139        index_prefix_path = _valid_index_prefix(index_directory, index_prefix)
140
141        # do tags exist?
142        tags_file = index_prefix_path + ".tags"
143        _assert(
144            Path(tags_file).exists(),
145            f"The file {tags_file} does not exist in {index_directory}",
146        )
147        vector_dtype, dap_metric, num_vectors, dimensions = _ensure_index_metadata(
148            index_prefix_path, vector_dtype, distance_metric, max_vectors, dimensions, warn_size_exceeded=True
149        )
150
151        index = cls(
152            distance_metric=dap_metric,  # type: ignore
153            vector_dtype=vector_dtype,
154            dimensions=dimensions,
155            max_vectors=max_vectors,
156            complexity=complexity,
157            graph_degree=graph_degree,
158            saturate_graph=saturate_graph,
159            max_occlusion_size=max_occlusion_size,
160            alpha=alpha,
161            num_threads=num_threads,
162            filter_complexity=filter_complexity,
163            num_frozen_points=num_frozen_points,
164            initial_search_complexity=initial_search_complexity,
165            search_threads=search_threads,
166            concurrent_consolidation=concurrent_consolidation,
167        )
168        index._index.load(index_prefix_path)
169        index._num_vectors = num_vectors  # current number of vectors loaded
170        return index
171
172    def __init__(
173        self,
174        distance_metric: DistanceMetric,
175        vector_dtype: VectorDType,
176        dimensions: int,
177        max_vectors: int,
178        complexity: int,
179        graph_degree: int,
180        saturate_graph: bool = defaults.SATURATE_GRAPH,
181        max_occlusion_size: int = defaults.MAX_OCCLUSION_SIZE,
182        alpha: float = defaults.ALPHA,
183        num_threads: int = defaults.NUM_THREADS,
184        filter_complexity: int = defaults.FILTER_COMPLEXITY,
185        num_frozen_points: int = defaults.NUM_FROZEN_POINTS_DYNAMIC,
186        initial_search_complexity: int = 0,
187        search_threads: int = 0,
188        concurrent_consolidation: bool = True,
189    ):
190        """
191        The `diskannpy.DynamicMemoryIndex` represents our python API into a mutable DiskANN memory index.
192
193        This constructor is used to create a new, empty index. If you wish to load a previously saved index from disk,
194        please use the `diskannpy.DynamicMemoryIndex.from_file` classmethod instead.
195
196        ### Parameters
197        - **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for all 3
198          vector dtypes, but `mips` is only available for single precision floats.
199        - **vector_dtype**: One of {`np.float32`, `np.int8`, `np.uint8`}. The dtype of the vectors this index will
200          be storing.
201        - **dimensions**: The vector dimensionality of this index. All new vectors inserted must be the same
202          dimensionality.
203        - **max_vectors**: Capacity of the data store including space for future insertions
204        - **graph_degree**: Graph degree (a.k.a. `R`) is the maximum degree allowed for a node in the index's graph
205          structure. This degree will be pruned throughout the course of the index build, but it will never grow beyond
206          this value. Higher `graph_degree` values require longer index build times, but may result in an index showing
207          excellent recall and latency characteristics.
208        - **saturate_graph**: If True, the adjacency list of each node will be saturated with neighbors to have exactly
209          `graph_degree` neighbors. If False, each node will have between 1 and `graph_degree` neighbors.
210        - **max_occlusion_size**: The maximum number of points that can be considered by the occlude_list function.
211        - **alpha**: The alpha parameter (>=1) is used to control the nature and number of points that are added to the
212          graph. A higher alpha value (e.g., 1.4) will result in fewer hops (and IOs) to convergence, but probably
213          more distance comparisons compared to a lower alpha value.
214        - **num_threads**: Number of threads to use when creating this index. `0` indicates we should use all available
215          logical processors.
216        - **filter_complexity**: Complexity to use when using filters. Default is 0.
217        - **num_frozen_points**: Number of points to freeze. Default is 1.
218        - **initial_search_complexity**: Should be set to the most common `complexity` expected to be used during the
219          life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of
220          `initial_search_complexity` * `search_threads`. Note that it may be resized if a `search` or `batch_search`
221          operation requests a space larger than can be accommodated by these values.
222        - **search_threads**: Should be set to the most common `num_threads` expected to be used during the
223          life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of
224          `initial_search_complexity` * `search_threads`. Note that it may be resized if a `batch_search`
225          operation requests a space larger than can be accommodated by these values.
226        - **concurrent_consolidation**: This flag dictates whether consolidation can be run alongside inserts and
227          deletes, or whether the index is locked down to changes while consolidation is ongoing.
228
229        """
230        self._num_vectors = 0
231        self._removed_num_vectors = 0
232        dap_metric = _valid_metric(distance_metric)
233        self._dap_metric = dap_metric
234        _assert_dtype(vector_dtype)
235        _assert_is_positive_uint32(dimensions, "dimensions")
236
237        self._vector_dtype = vector_dtype
238        self._dimensions = dimensions
239
240        _assert_is_positive_uint32(max_vectors, "max_vectors")
241        _assert_is_positive_uint32(complexity, "complexity")
242        _assert_is_positive_uint32(graph_degree, "graph_degree")
243        _assert(
244            alpha >= 1,
245            "alpha must be >= 1, and realistically should be kept between [1.0, 2.0)",
246        )
247        _assert_is_nonnegative_uint32(max_occlusion_size, "max_occlusion_size")
248        _assert_is_nonnegative_uint32(num_threads, "num_threads")
249        _assert_is_nonnegative_uint32(filter_complexity, "filter_complexity")
250        _assert_is_nonnegative_uint32(num_frozen_points, "num_frozen_points")
251        _assert_is_nonnegative_uint32(
252            initial_search_complexity, "initial_search_complexity"
253        )
254        _assert_is_nonnegative_uint32(search_threads, "search_threads")
255
256        self._max_vectors = max_vectors
257        self._complexity = complexity
258        self._graph_degree = graph_degree
259
260        if vector_dtype == np.uint8:
261            _index = _native_dap.DynamicMemoryUInt8Index
262        elif vector_dtype == np.int8:
263            _index = _native_dap.DynamicMemoryInt8Index
264        else:
265            _index = _native_dap.DynamicMemoryFloatIndex
266
267        self._index = _index(
268            distance_metric=dap_metric,
269            dimensions=dimensions,
270            max_vectors=max_vectors,
271            complexity=complexity,
272            graph_degree=graph_degree,
273            saturate_graph=saturate_graph,
274            max_occlusion_size=max_occlusion_size,
275            alpha=alpha,
276            num_threads=num_threads,
277            filter_complexity=filter_complexity,
278            num_frozen_points=num_frozen_points,
279            initial_search_complexity=initial_search_complexity,
280            search_threads=search_threads,
281            concurrent_consolidation=concurrent_consolidation,
282        )
283        self._points_deleted = False
284
285    def search(
286        self, query: VectorLike, k_neighbors: int, complexity: int
287    ) -> QueryResponse:
288        """
289        Searches the index by a single query vector.
290
291        ### Parameters
292        - **query**: 1d numpy array of the same dimensionality and dtype as the index.
293        - **k_neighbors**: Number of neighbors to be returned. If query vector exists in index, it almost definitely
294          will be returned as well, so adjust your ``k_neighbors`` as appropriate. Must be > 0.
295        - **complexity**: Size of distance ordered list of candidate neighbors to use while searching. List size
296          increases accuracy at the cost of latency. Must be at least k_neighbors in size.
297        """
298        _query = _castable_dtype_or_raise(query, expected=self._vector_dtype)
299        _assert(len(_query.shape) == 1, "query vector must be 1-d")
300        _assert(
301            _query.shape[0] == self._dimensions,
302            f"query vector must have the same dimensionality as the index; index dimensionality: {self._dimensions}, "
303            f"query dimensionality: {_query.shape[0]}",
304        )
305        _assert_is_positive_uint32(k_neighbors, "k_neighbors")
306        _assert_is_nonnegative_uint32(complexity, "complexity")
307
308        if k_neighbors > complexity:
309            warnings.warn(
310                f"k_neighbors={k_neighbors} asked for, but list_size={complexity} was smaller. Increasing {complexity} to {k_neighbors}"
311            )
312            complexity = k_neighbors
313        neighbors, distances = self._index.search(query=_query, knn=k_neighbors, complexity=complexity)
314        return QueryResponse(identifiers=neighbors, distances=distances)
315
316    def batch_search(
317        self,
318        queries: VectorLikeBatch,
319        k_neighbors: int,
320        complexity: int,
321        num_threads: int,
322    ) -> QueryResponseBatch:
323        """
324        Searches the index by a batch of query vectors.
325
326        This search is parallelized and far more efficient than searching for each vector individually.
327
328        ### Parameters
329        - **queries**: 2d numpy array, with column dimensionality matching the index and row dimensionality being the
330          number of queries intended to search for in parallel. Dtype must match dtype of the index.
331        - **k_neighbors**: Number of neighbors to be returned. If query vector exists in index, it almost definitely
332          will be returned as well, so adjust your ``k_neighbors`` as appropriate. Must be > 0.
333        - **complexity**: Size of distance ordered list of candidate neighbors to use while searching. List size
334          increases accuracy at the cost of latency. Must be at least k_neighbors in size.
335        - **num_threads**: Number of threads to use when searching this index. Must be >= 0; 0 uses all available system threads.
336        """
337        _queries = _castable_dtype_or_raise(queries, expected=self._vector_dtype)
338        _assert_2d(_queries, "queries")
339        _assert(
340            _queries.shape[1] == self._dimensions,
341            f"query vectors must have the same dimensionality as the index; index dimensionality: {self._dimensions}, "
342            f"query dimensionality: {_queries.shape[1]}",
343        )
344
345        _assert_is_positive_uint32(k_neighbors, "k_neighbors")
346        _assert_is_positive_uint32(complexity, "complexity")
347        _assert_is_nonnegative_uint32(num_threads, "num_threads")
348
349        if k_neighbors > complexity:
350            warnings.warn(
351                f"k_neighbors={k_neighbors} asked for, but list_size={complexity} was smaller. Increasing {complexity} to {k_neighbors}"
352            )
353            complexity = k_neighbors
354
355        num_queries, dim = _queries.shape
356        neighbors, distances = self._index.batch_search(
357            queries=_queries,
358            num_queries=num_queries,
359            knn=k_neighbors,
360            complexity=complexity,
361            num_threads=num_threads,
362        )
363        return QueryResponseBatch(identifiers=neighbors, distances=distances)
364
365    def save(self, save_path: str, index_prefix: str = "ann"):
366        """
367        Saves this index to file.
368
369        ### Parameters
370        - **save_path**: The path to save these index files to.
371        - **index_prefix**: The prefix of the index files. Defaults to "ann".
372        """
373        if save_path == "":
374            raise ValueError("save_path cannot be empty")
375        if index_prefix == "":
376            raise ValueError("index_prefix cannot be empty")
377
378        index_prefix = index_prefix.format(complexity=self._complexity, graph_degree=self._graph_degree)
379        _assert_existing_directory(save_path, "save_path")
380        save_path = os.path.join(save_path, index_prefix)
381        if self._points_deleted is True:
382            warnings.warn(
383                "DynamicMemoryIndex.save() currently requires DynamicMemoryIndex.consolidate_delete() to be called "
384                "prior to save when items have been marked for deletion. This is being done automatically now, though"
385                "it will increase the time it takes to save; on large sets of data it can take a substantial amount of "
386                "time. In the future, we will implement a faster save with unconsolidated deletes, but for now this is "
387                "required."
388            )
389            self._index.consolidate_delete()
390        self._index.save(
391            save_path=save_path, compact_before_save=True
392        )  # we do not yet support uncompacted saves
393        _write_index_metadata(
394            save_path,
395            self._vector_dtype,
396            self._dap_metric,
397            self._index.num_points(),
398            self._dimensions,
399        )
400
401    def insert(self, vector: VectorLike, vector_id: VectorIdentifier):
402        """
403        Inserts a single vector into the index with the provided vector_id.
404
405        If this insertion will overrun the `max_vectors` count boundaries of this index, `consolidate_delete()` will
406        be executed automatically.
407
408        ### Parameters
409        - **vector**: The vector to insert. Note that dtype must match.
410        - **vector_id**: The vector_id to use for this vector.
411        """
412        _vector = _castable_dtype_or_raise(vector, expected=self._vector_dtype)
413        _assert(len(_vector.shape) == 1, "insert vector must be 1-d")
414        _assert_is_positive_uint32(vector_id, "vector_id")
415        if self._num_vectors + 1 > self._max_vectors:
416            if self._removed_num_vectors > 0:
417                warnings.warn(f"Inserting this vector would overrun the max_vectors={self._max_vectors} specified at index "
418                              f"construction. We are attempting to consolidate_delete() to make space.")
419                self.consolidate_delete()
420            else:
421                raise RuntimeError(f"Inserting this vector would overrun the max_vectors={self._max_vectors} specified "
422                                   f"at index construction. Unable to make space by consolidating deletions. The insert "
423                                   f"operation has failed.")
424        status = self._index.insert(_vector, np.uint32(vector_id))
425        if status == 0:
426            self._num_vectors += 1
427        else:
428            raise RuntimeError(
429                f"Insert was unable to complete successfully; error code returned from diskann C++ lib: {status}"
430            )
431
432
433    def batch_insert(
434        self,
435        vectors: VectorLikeBatch,
436        vector_ids: VectorIdentifierBatch,
437        num_threads: int = 0,
438    ):
439        """
440        Inserts a batch of vectors into the index with the provided vector_ids.
441
442        If this batch insertion will overrun the `max_vectors` count boundaries of this index, `consolidate_delete()`
443        will be executed automatically.
444
445        ### Parameters
446        - **vectors**: The 2d numpy array of vectors to insert.
447        - **vector_ids**: The 1d array of vector ids to use. This array must have the same number of elements as
448            the vectors array has rows. The dtype of vector_ids must be `np.uint32`
449        - **num_threads**: Number of threads to use when inserting into this index. Must be >= 0; 0 uses all available system threads.
450        """
451        _vectors = _castable_dtype_or_raise(vectors, expected=self._vector_dtype)
452        _assert(len(_vectors.shape) == 2, "vectors must be a 2-d array")
453        _assert(
454            _vectors.shape[0] == vector_ids.shape[0],
455            "Number of vectors must be equal to number of ids",
456        )
457        # _castable_dtype_or_raise has already cast the vectors to the index dtype
458        _vector_ids = vector_ids.astype(dtype=np.uint32, casting="safe", copy=False)
459
460        if self._num_vectors + _vector_ids.shape[0] > self._max_vectors:
461            if self._max_vectors + self._removed_num_vectors >= _vector_ids.shape[0]:
462                warnings.warn(f"Inserting these vectors, count={_vector_ids.shape[0]} would overrun the "
463                              f"max_vectors={self._max_vectors} specified at index construction. We are attempting to "
464                              f"consolidate_delete() to make space.")
465                self.consolidate_delete()
466            else:
467                raise RuntimeError(f"Inserting these vectors count={_vector_ids.shape[0]} would overrun the "
468                                   f"max_vectors={self._max_vectors} specified at index construction. Unable to make "
469                                   f"space by consolidating deletions. The batch insert operation has failed.")
470
471        statuses = self._index.batch_insert(
472            _vectors, _vector_ids, _vector_ids.shape[0], num_threads
473        )
474        successes = []
475        failures = []
476        for i in range(0, len(statuses)):
477            if statuses[i] == 0:
478                successes.append(i)
479            else:
480                failures.append(i)
481        self._num_vectors += len(successes)
482        if len(failures) == 0:
483            return
484        failed_ids = vector_ids[failures]
485        raise RuntimeError(
486            f"During batch insert, the following vector_ids were unable to be inserted into the index: {failed_ids}. "
487            f"{len(successes)} were successfully inserted"
488        )
489
490
491    def mark_deleted(self, vector_id: VectorIdentifier):
492        """
493        Mark vector for deletion. This is a soft delete that won't return the vector id in any results, but does not
494        remove it from the underlying index files or memory structure. To execute a hard delete, call this method and
495        then call the much more expensive `consolidate_delete` method on this index.
496        ### Parameters
497        - **vector_id**: The vector id to delete. Must be a uint32.
498        """
499        _assert_is_positive_uint32(vector_id, "vector_id")
500        self._points_deleted = True
501        self._removed_num_vectors += 1
502        # we do not decrement self._num_vectors until consolidate_delete
503        self._index.mark_deleted(np.uint32(vector_id))
504
505    def consolidate_delete(self):
506        """
507        This method actually restructures the DiskANN index to remove the items that have been marked for deletion.
508        """
509        self._index.consolidate_delete()
510        self._points_deleted = False
511        self._num_vectors -= self._removed_num_vectors
512        self._removed_num_vectors = 0

A DynamicMemoryIndex instance is used to both search and mutate a diskannpy memory index. This index is unlike either diskannpy.StaticMemoryIndex or diskannpy.StaticDiskIndex in the following ways:

  • It requires an explicit vector identifier for each vector added to it.
  • Insert and (lazy) deletion operations are provided for a flexible, living index

The mutable aspect of this index will absolutely impact search time performance as new vectors are added and old ones deleted. DynamicMemoryIndex.consolidate_delete() should be called periodically to restructure the index, removing deleted vectors and improving per-search performance, at the cost of an expensive index consolidation.
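
To make that lifecycle concrete, here is a minimal sketch of insert, lazy delete, consolidation, and save; the dimensions, ids, and save path are hypothetical:

```python
import numpy as np
import diskannpy

index = diskannpy.DynamicMemoryIndex(
    distance_metric="l2",
    vector_dtype=np.float32,
    dimensions=128,
    max_vectors=10_000,  # capacity, including room for future insertions
    complexity=64,
    graph_degree=32,
)

rng = np.random.default_rng()
# Every vector needs an explicit (positive uint32) identifier.
index.insert(rng.random(128, dtype=np.float32), vector_id=1)
index.insert(rng.random(128, dtype=np.float32), vector_id=2)

# Lazy delete: id 2 stops appearing in results but still occupies space...
index.mark_deleted(2)
# ...until the (expensive) consolidation actually restructures the index.
index.consolidate_delete()

# save() consolidates any remaining marked deletions automatically before
# writing; the target directory (hypothetical here) must already exist.
index.save("/tmp/dynamic_index", index_prefix="ann")
```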

DynamicMemoryIndex( distance_metric: Literal['l2', 'mips', 'cosine'], vector_dtype: Union[Type[numpy.float32], Type[numpy.int8], Type[numpy.uint8]], dimensions: int, max_vectors: int, complexity: int, graph_degree: int, saturate_graph: bool = 0, max_occlusion_size: int = 750, alpha: float = 1.2000000476837158, num_threads: int = 0, filter_complexity: int = 0, num_frozen_points: int = 1, initial_search_complexity: int = 0, search_threads: int = 0, concurrent_consolidation: bool = True)
172    def __init__(
173        self,
174        distance_metric: DistanceMetric,
175        vector_dtype: VectorDType,
176        dimensions: int,
177        max_vectors: int,
178        complexity: int,
179        graph_degree: int,
180        saturate_graph: bool = defaults.SATURATE_GRAPH,
181        max_occlusion_size: int = defaults.MAX_OCCLUSION_SIZE,
182        alpha: float = defaults.ALPHA,
183        num_threads: int = defaults.NUM_THREADS,
184        filter_complexity: int = defaults.FILTER_COMPLEXITY,
185        num_frozen_points: int = defaults.NUM_FROZEN_POINTS_DYNAMIC,
186        initial_search_complexity: int = 0,
187        search_threads: int = 0,
188        concurrent_consolidation: bool = True,
189    ):
190        """
191        The `diskannpy.DynamicMemoryIndex` represents our python API into a mutable DiskANN memory index.
192
193        This constructor is used to create a new, empty index. If you wish to load a previously saved index from disk,
194        please use the `diskannpy.DynamicMemoryIndex.from_file` classmethod instead.
195
196        ### Parameters
197        - **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for all 3
198          vector dtypes, but `mips` is only available for single precision floats.
199        - **vector_dtype**: One of {`np.float32`, `np.int8`, `np.uint8`}. The dtype of the vectors this index will
200          be storing.
201        - **dimensions**: The vector dimensionality of this index. All new vectors inserted must be the same
202          dimensionality.
203        - **max_vectors**: Capacity of the data store including space for future insertions
204        - **graph_degree**: Graph degree (a.k.a. `R`) is the maximum degree allowed for a node in the index's graph
205          structure. This degree will be pruned throughout the course of the index build, but it will never grow beyond
206          this value. Higher `graph_degree` values require longer index build times, but may result in an index showing
207          excellent recall and latency characteristics.
208        - **saturate_graph**: If True, the adjacency list of each node will be saturated with neighbors to have exactly
209          `graph_degree` neighbors. If False, each node will have between 1 and `graph_degree` neighbors.
210        - **max_occlusion_size**: The maximum number of points that can be considered by the occlude_list function.
211        - **alpha**: The alpha parameter (>=1) is used to control the nature and number of points that are added to the
212          graph. A higher alpha value (e.g., 1.4) will result in fewer hops (and IOs) to convergence, but probably
213          more distance comparisons compared to a lower alpha value.
214        - **num_threads**: Number of threads to use when creating this index. `0` indicates we should use all available
215          logical processors.
216        - **filter_complexity**: Complexity to use when using filters. Default is 0.
217        - **num_frozen_points**: Number of points to freeze. Default is 1.
218        - **initial_search_complexity**: Should be set to the most common `complexity` expected to be used during the
219          life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of
220          `initial_search_complexity` * `search_threads`. Note that it may be resized if a `search` or `batch_search`
221          operation requests a space larger than can be accommodated by these values.
222        - **search_threads**: Should be set to the most common `num_threads` expected to be used during the
223          life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of
224          `initial_search_complexity` * `search_threads`. Note that it may be resized if a `batch_search`
225          operation requests a space larger than can be accommodated by these values.
226        - **concurrent_consolidation**: This flag dictates whether consolidation can be run alongside inserts and
227          deletes, or whether the index is locked down to changes while consolidation is ongoing.
228
229        """
230        self._num_vectors = 0
231        self._removed_num_vectors = 0
232        dap_metric = _valid_metric(distance_metric)
233        self._dap_metric = dap_metric
234        _assert_dtype(vector_dtype)
235        _assert_is_positive_uint32(dimensions, "dimensions")
236
237        self._vector_dtype = vector_dtype
238        self._dimensions = dimensions
239
240        _assert_is_positive_uint32(max_vectors, "max_vectors")
241        _assert_is_positive_uint32(complexity, "complexity")
242        _assert_is_positive_uint32(graph_degree, "graph_degree")
243        _assert(
244            alpha >= 1,
245            "alpha must be >= 1, and realistically should be kept between [1.0, 2.0)",
246        )
247        _assert_is_nonnegative_uint32(max_occlusion_size, "max_occlusion_size")
248        _assert_is_nonnegative_uint32(num_threads, "num_threads")
249        _assert_is_nonnegative_uint32(filter_complexity, "filter_complexity")
250        _assert_is_nonnegative_uint32(num_frozen_points, "num_frozen_points")
251        _assert_is_nonnegative_uint32(
252            initial_search_complexity, "initial_search_complexity"
253        )
254        _assert_is_nonnegative_uint32(search_threads, "search_threads")
255
256        self._max_vectors = max_vectors
257        self._complexity = complexity
258        self._graph_degree = graph_degree
259
260        if vector_dtype == np.uint8:
261            _index = _native_dap.DynamicMemoryUInt8Index
262        elif vector_dtype == np.int8:
263            _index = _native_dap.DynamicMemoryInt8Index
264        else:
265            _index = _native_dap.DynamicMemoryFloatIndex
266
267        self._index = _index(
268            distance_metric=dap_metric,
269            dimensions=dimensions,
270            max_vectors=max_vectors,
271            complexity=complexity,
272            graph_degree=graph_degree,
273            saturate_graph=saturate_graph,
274            max_occlusion_size=max_occlusion_size,
275            alpha=alpha,
276            num_threads=num_threads,
277            filter_complexity=filter_complexity,
278            num_frozen_points=num_frozen_points,
279            initial_search_complexity=initial_search_complexity,
280            search_threads=search_threads,
281            concurrent_consolidation=concurrent_consolidation,
282        )
283        self._points_deleted = False

The diskannpy.DynamicMemoryIndex represents our python API into a mutable DiskANN memory index.

This constructor is used to create a new, empty index. If you wish to load a previously saved index from disk, please use the diskannpy.DynamicMemoryIndex.from_file classmethod instead.

Parameters

  • distance_metric: A str, strictly one of {"l2", "mips", "cosine"}. l2 and cosine are supported for all 3 vector dtypes, but mips is only available for single precision floats.
  • vector_dtype: One of {np.float32, np.int8, np.uint8}. The dtype of the vectors this index will be storing.
  • dimensions: The vector dimensionality of this index. All new vectors inserted must be the same dimensionality.
  • max_vectors: Capacity of the data store including space for future insertions
  • complexity: Complexity (a.k.a. L) references the size of the list we store candidate approximate neighbors in. It's used during save (which is an index rebuild), and as an initial search size to warm up the index.
  • graph_degree: Graph degree (a.k.a. R) is the maximum degree allowed for a node in the index's graph structure. This degree will be pruned throughout the course of the index build, but it will never grow beyond this value. Higher graph_degree values require longer index build times, but may result in an index showing excellent recall and latency characteristics.
  • saturate_graph: If True, the adjacency list of each node will be saturated with neighbors to have exactly graph_degree neighbors. If False, each node will have between 1 and graph_degree neighbors.
  • max_occlusion_size: The maximum number of points that can be considered by the occlude_list function.
  • alpha: The alpha parameter (>=1) is used to control the nature and number of points that are added to the graph. A higher alpha value (e.g., 1.4) will result in fewer hops (and IOs) to convergence, but probably more distance comparisons compared to a lower alpha value.
  • num_threads: Number of threads to use when creating this index. 0 indicates we should use all available logical processors.
  • filter_complexity: Complexity to use when using filters. Default is 0.
  • num_frozen_points: Number of points to freeze. Default is 1.
  • initial_search_complexity: Should be set to the most common complexity expected to be used during the life of this diskannpy.DynamicMemoryIndex object. The working scratch memory allocated is based off of initial_search_complexity * search_threads. Note that it may be resized if a search or batch_search operation requests a space larger than can be accommodated by these values.
  • search_threads: Should be set to the most common num_threads expected to be used during the life of this diskannpy.DynamicMemoryIndex object. The working scratch memory allocated is based off of initial_search_complexity * search_threads. Note that it may be resized if a batch_search operation requests a space larger than can be accommodated by these values.
  • concurrent_consolidation: This flag dictates whether consolidation can be run alongside inserts and deletes, or whether the index is locked down to changes while consolidation is ongoing.
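
A short sketch of bulk insertion followed by a reload through `from_file`; the paths, ids, and shapes are hypothetical, and the build parameters are repeated on reload as the docstring recommends:

```python
import numpy as np
import diskannpy

index = diskannpy.DynamicMemoryIndex(
    distance_metric="cosine",
    vector_dtype=np.float32,
    dimensions=64,
    max_vectors=5_000,
    complexity=64,
    graph_degree=32,
)

rng = np.random.default_rng()
vectors = rng.random((1_000, 64), dtype=np.float32)
vector_ids = np.arange(1, 1_001, dtype=np.uint32)  # one positive uint32 id per row

index.batch_insert(vectors, vector_ids, num_threads=0)
index.save("/tmp/dyn_index")  # hypothetical, pre-existing directory

# Reload later; using the same parameters as the original build is strongly recommended.
reloaded = diskannpy.DynamicMemoryIndex.from_file(
    index_directory="/tmp/dyn_index",
    max_vectors=5_000,
    complexity=64,
    graph_degree=32,
)
nearest = reloaded.search(vectors[0], k_neighbors=5, complexity=64)
```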
@classmethod
def from_file( cls, index_directory: str, max_vectors: int, complexity: int, graph_degree: int, saturate_graph: bool = 0, max_occlusion_size: int = 750, alpha: float = 1.2000000476837158, num_threads: int = 0, filter_complexity: int = 0, num_frozen_points: int = 1, initial_search_complexity: int = 0, search_threads: int = 0, concurrent_consolidation: bool = True, index_prefix: str = 'ann', distance_metric: Optional[Literal['l2', 'mips', 'cosine']] = None, vector_dtype: Union[Type[numpy.float32], Type[numpy.int8], Type[numpy.uint8], NoneType] = None, dimensions: Optional[int] = None) -> DynamicMemoryIndex:
 55    @classmethod
 56    def from_file(
 57        cls,
 58        index_directory: str,
 59        max_vectors: int,
 60        complexity: int,
 61        graph_degree: int,
 62        saturate_graph: bool = defaults.SATURATE_GRAPH,
 63        max_occlusion_size: int = defaults.MAX_OCCLUSION_SIZE,
 64        alpha: float = defaults.ALPHA,
 65        num_threads: int = defaults.NUM_THREADS,
 66        filter_complexity: int = defaults.FILTER_COMPLEXITY,
 67        num_frozen_points: int = defaults.NUM_FROZEN_POINTS_DYNAMIC,
 68        initial_search_complexity: int = 0,
 69        search_threads: int = 0,
 70        concurrent_consolidation: bool = True,
 71        index_prefix: str = "ann",
 72        distance_metric: Optional[DistanceMetric] = None,
 73        vector_dtype: Optional[VectorDType] = None,
 74        dimensions: Optional[int] = None,
 75    ) -> "DynamicMemoryIndex":
 76        """
 77        The `from_file` classmethod is used to load a previously saved index from disk. This index *must* have been
 78        created with a valid `tags` file or `tags` np.ndarray of `diskannpy.VectorIdentifier`s. It is *strongly*
 79        recommended that you use the same parameters as the `diskannpy.build_memory_index()` function that created
 80        the index.
 81
 82        ### Parameters
 83        - **index_directory**: The directory containing the index files. This directory must contain the following
 84            files:
 85            - `{index_prefix}.data`
 86            - `{index_prefix}.tags`
 87            - `{index_prefix}`
 88
 89          It may also include the following optional files:
 90            - `{index_prefix}_vectors.bin`: Optional. `diskannpy` builder functions may create this file in the
 91              `index_directory` if the index was created from a numpy array
 92            - `{index_prefix}_metadata.bin`: Optional. `diskannpy` builder functions create this file to store metadata
 93            about the index, such as vector dtype, distance metric, number of vectors and vector dimensionality.
 94            If an index is built from the `diskann` cli tools, this file will not exist.
 95        - **max_vectors**: Capacity of the memory index including space for future insertions.
 96        - **complexity**: Complexity (a.k.a `L`) references the size of the list we store candidate approximate
 97          neighbors in. It's used during save (which is an index rebuild), and it's used as an initial search size to
 98          warm up our index and lower the latency for initial real searches.
 99        - **graph_degree**: Graph degree (a.k.a. `R`) is the maximum degree allowed for a node in the index's graph
100          structure. This degree will be pruned throughout the course of the index build, but it will never grow beyond
101          this value. Higher R values require longer index build times, but may result in an index showing excellent
102          recall and latency characteristics.
103        - **saturate_graph**: If True, the adjacency list of each node will be saturated with neighbors to have exactly
104          `graph_degree` neighbors. If False, each node will have between 1 and `graph_degree` neighbors.
105        - **max_occlusion_size**: The maximum number of points that can be considered by the occlude_list function.
106        - **alpha**: The alpha parameter (>=1) is used to control the nature and number of points that are added to the
107          graph. A higher alpha value (e.g., 1.4) will result in fewer hops (and IOs) to convergence, but probably
108          more distance comparisons compared to a lower alpha value.
109        - **num_threads**: Number of threads to use when creating this index. `0` indicates we should use all available
110          logical processors.
111        - **filter_complexity**: Complexity to use when using filters. Default is 0.
112        - **num_frozen_points**: Number of points to freeze. Default is 1.
113        - **initial_search_complexity**: Should be set to the most common `complexity` expected to be used during the
114          life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of
115          `initial_search_complexity` * `search_threads`. Note that it may be resized if a `search` or `batch_search`
116          operation requests a space larger than can be accommodated by these values.
117        - **search_threads**: Should be set to the most common `num_threads` expected to be used during the
118          life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of
119          `initial_search_complexity` * `search_threads`. Note that it may be resized if a `batch_search`
120          operation requests a space larger than can be accommodated by these values.
121        - **concurrent_consolidation**: This flag dictates whether consolidation can be run alongside inserts and
122          deletes, or whether the index is locked down to changes while consolidation is ongoing.
123        - **index_prefix**: The prefix of the index files. Defaults to "ann".
124        - **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for all 3
125          vector dtypes, but `mips` is only available for single precision floats. Default is `None`. **This
126          value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If that file does not exist,
127          you are required to provide this value.
128        - **vector_dtype**: The vector dtype this index has been built with. **This value is only used if a
129          `{index_prefix}_metadata.bin` file does not exist.** If that file does not exist, you are required to provide this value.
130        - **dimensions**: The vector dimensionality of this index. All new vectors inserted must be the same
131          dimensionality. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If that
132          file does not exist, you are required to provide this value.
133
134        ### Returns
135        A `diskannpy.DynamicMemoryIndex` object, with the index loaded from disk and ready to use for insertions,
136        deletions, and searches.
137
138        """
139        index_prefix_path = _valid_index_prefix(index_directory, index_prefix)
140
141        # do tags exist?
142        tags_file = index_prefix_path + ".tags"
143        _assert(
144            Path(tags_file).exists(),
145            f"The file {tags_file} does not exist in {index_directory}",
146        )
147        vector_dtype, dap_metric, num_vectors, dimensions = _ensure_index_metadata(
148            index_prefix_path, vector_dtype, distance_metric, max_vectors, dimensions, warn_size_exceeded=True
149        )
150
151        index = cls(
152            distance_metric=dap_metric,  # type: ignore
153            vector_dtype=vector_dtype,
154            dimensions=dimensions,
155            max_vectors=max_vectors,
156            complexity=complexity,
157            graph_degree=graph_degree,
158            saturate_graph=saturate_graph,
159            max_occlusion_size=max_occlusion_size,
160            alpha=alpha,
161            num_threads=num_threads,
162            filter_complexity=filter_complexity,
163            num_frozen_points=num_frozen_points,
164            initial_search_complexity=initial_search_complexity,
165            search_threads=search_threads,
166            concurrent_consolidation=concurrent_consolidation,
167        )
168        index._index.load(index_prefix_path)
169        index._num_vectors = num_vectors  # current number of vectors loaded
170        return index

The from_file classmethod is used to load a previously saved index from disk. This index must have been created with a valid tags file or tags np.ndarray of diskannpy.VectorIdentifiers. It is strongly recommended that you use the same parameters as the diskannpy.build_memory_index() function that created the index.

Parameters

  • index_directory: The directory containing the index files. This directory must contain the following files:

    • {index_prefix}.data
    • {index_prefix}.tags
    • {index_prefix}

    It may also include the following optional files:

    • {index_prefix}_vectors.bin: Optional. diskannpy builder functions may create this file in the index_directory if the index was created from a numpy array
    • {index_prefix}_metadata.bin: Optional. diskannpy builder functions create this file to store metadata about the index, such as vector dtype, distance metric, number of vectors and vector dimensionality. If an index is built from the diskann cli tools, this file will not exist.
  • max_vectors: Capacity of the memory index including space for future insertions.
  • complexity: Complexity (a.k.a L) references the size of the list we store candidate approximate neighbors in. It's used during save (which is an index rebuild), and it's used as an initial search size to warm up our index and lower the latency for initial real searches.
  • graph_degree: Graph degree (a.k.a. R) is the maximum degree allowed for a node in the index's graph structure. This degree will be pruned throughout the course of the index build, but it will never grow beyond this value. Higher R values require longer index build times, but may result in an index showing excellent recall and latency characteristics.
  • saturate_graph: If True, the adjacency list of each node will be saturated with neighbors to have exactly graph_degree neighbors. If False, each node will have between 1 and graph_degree neighbors.
  • max_occlusion_size: The maximum number of points that can be considered by the occlude_list function.
  • alpha: The alpha parameter (>=1) is used to control the nature and number of points that are added to the graph. A higher alpha value (e.g., 1.4) will result in fewer hops (and IOs) to convergence, but probably more distance comparisons compared to a lower alpha value.
  • num_threads: Number of threads to use when creating this index. 0 indicates we should use all available logical processors.
  • filter_complexity: Complexity to use when using filters. Default is 0.
  • num_frozen_points: Number of points to freeze. Default is 1.
  • initial_search_complexity: Should be set to the most common complexity expected to be used during the life of this diskannpy.DynamicMemoryIndex object. The working scratch memory allocated is based off of initial_search_complexity * search_threads. Note that it may be resized if a search or batch_search operation requests a space larger than can be accommodated by these values.
  • search_threads: Should be set to the most common num_threads expected to be used during the life of this diskannpy.DynamicMemoryIndex object. The working scratch memory allocated is based off of initial_search_complexity * search_threads. Note that it may be resized if a batch_search operation requests a space larger than can be accommodated by these values.
  • concurrent_consolidation: This flag dictates whether consolidation can be run alongside inserts and deletes, or whether the index is locked down to changes while consolidation is ongoing.
  • index_prefix: The prefix of the index files. Defaults to "ann".
  • distance_metric: A str, strictly one of {"l2", "mips", "cosine"}. l2 and cosine are supported for all 3 vector dtypes, but mips is only available for single precision floats. Default is None. This value is only used if a {index_prefix}_metadata.bin file does not exist. If that file does not exist, you are required to provide this value.
  • vector_dtype: The vector dtype this index has been built with. This value is only used if a {index_prefix}_metadata.bin file does not exist. If that file does not exist, you are required to provide this value.
  • dimensions: The vector dimensionality of this index. All new vectors inserted must be the same dimensionality. This value is only used if a {index_prefix}_metadata.bin file does not exist. If that file does not exist, you are required to provide this value.

Returns

A diskannpy.DynamicMemoryIndex object, with the index loaded from disk and ready to use for insertions, deletions, and searches.
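
Example

A minimal sketch of loading a previously saved dynamic index. The directory ./index_dir, the prefix ann, and the capacity and build parameters below are assumptions for illustration, not prescribed values:

    import diskannpy

    index = diskannpy.DynamicMemoryIndex.from_file(
        index_directory="./index_dir",   # must contain ann.data, ann.tags, and ann
        max_vectors=100_000,             # capacity, including room for future insertions
        complexity=64,
        graph_degree=32,
        initial_search_complexity=64,    # size scratch memory for the typical search complexity
        search_threads=4,                # and for the typical number of search threads
    )
    # if ann_metadata.bin is absent (e.g. the index came from the diskann cli tools),
    # distance_metric, vector_dtype, and dimensions must also be provided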

def search( self, query: numpy.ndarray[typing.Any, numpy.dtype[typing.Union[typing.Type[numpy.float32], typing.Type[numpy.int8], typing.Type[numpy.uint8]]]], k_neighbors: int, complexity: int) -> QueryResponse:
285    def search(
286        self, query: VectorLike, k_neighbors: int, complexity: int
287    ) -> QueryResponse:
288        """
289        Searches the index by a single query vector.
290
291        ### Parameters
292        - **query**: 1d numpy array of the same dimensionality and dtype as the index.
293        - **k_neighbors**: Number of neighbors to be returned. If the query vector exists in the index, it will almost
294          certainly be returned as well, so adjust your `k_neighbors` as appropriate. Must be > 0.
295        - **complexity**: Size of distance ordered list of candidate neighbors to use while searching. List size
296          increases accuracy at the cost of latency. Must be at least k_neighbors in size.
297        """
298        _query = _castable_dtype_or_raise(query, expected=self._vector_dtype)
299        _assert(len(_query.shape) == 1, "query vector must be 1-d")
300        _assert(
301            _query.shape[0] == self._dimensions,
302            f"query vector must have the same dimensionality as the index; index dimensionality: {self._dimensions}, "
303            f"query dimensionality: {_query.shape[0]}",
304        )
305        _assert_is_positive_uint32(k_neighbors, "k_neighbors")
306        _assert_is_nonnegative_uint32(complexity, "complexity")
307
308        if k_neighbors > complexity:
309            warnings.warn(
310                f"k_neighbors={k_neighbors} asked for, but complexity={complexity} was smaller. Increasing complexity to {k_neighbors}"
311            )
312            complexity = k_neighbors
313        neighbors, distances = self._index.search(query=_query, knn=k_neighbors, complexity=complexity)
314        return QueryResponse(identifiers=neighbors, distances=distances)

Searches the index by a single query vector.

Parameters

  • query: 1d numpy array of the same dimensionality and dtype as the index.
  • k_neighbors: Number of neighbors to be returned. If the query vector exists in the index, it will almost certainly be returned as well, so adjust your k_neighbors as appropriate. Must be > 0.
  • complexity: Size of distance ordered list of candidate neighbors to use while searching. List size increases accuracy at the cost of latency. Must be at least k_neighbors in size.
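
Example

A sketch of a single-vector search, assuming index is the DynamicMemoryIndex loaded above and that it was built over 128-dimensional float32 vectors (both assumptions):

    import numpy as np

    query = np.random.rand(128).astype(np.float32)  # dtype and dimensionality must match the index
    response = index.search(query, k_neighbors=10, complexity=64)
    print(response.identifiers)  # 1d array of the 10 nearest vector ids
    print(response.distances)    # 1d array of distances, positionally aligned with identifiers
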
def save(self, save_path: str, index_prefix: str = 'ann'):
365    def save(self, save_path: str, index_prefix: str = "ann"):
366        """
367        Saves this index to file.
368
369        ### Parameters
370        - **save_path**: The path to save these index files to.
371        - **index_prefix**: The prefix of the index files. Defaults to "ann".
372        """
373        if save_path == "":
374            raise ValueError("save_path cannot be empty")
375        if index_prefix == "":
376            raise ValueError("index_prefix cannot be empty")
377
378        index_prefix = index_prefix.format(complexity=self._complexity, graph_degree=self._graph_degree)
379        _assert_existing_directory(save_path, "save_path")
380        save_path = os.path.join(save_path, index_prefix)
381        if self._points_deleted is True:
382            warnings.warn(
383                "DynamicMemoryIndex.save() currently requires DynamicMemoryIndex.consolidate_delete() to be called "
384                "prior to save when items have been marked for deletion. This is being done automatically now, though "
385                "it will increase the time it takes to save; on large sets of data it can take a substantial amount of "
386                "time. In the future, we will implement a faster save with unconsolidated deletes, but for now this is "
387                "required."
388            )
389            self._index.consolidate_delete()
390        self._index.save(
391            save_path=save_path, compact_before_save=True
392        )  # we do not yet support uncompacted saves
393        _write_index_metadata(
394            save_path,
395            self._vector_dtype,
396            self._dap_metric,
397            self._index.num_points(),
398            self._dimensions,
399        )

Saves this index to file.

Parameters

  • save_path: The path to save these index files to.
  • index_prefix: The prefix of the index files. Defaults to "ann".
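
Example

A sketch of saving the index, assuming index is the DynamicMemoryIndex from the examples above. Note that save_path must already exist as a directory:

    import os

    os.makedirs("./saved_index", exist_ok=True)  # the directory must exist before save()
    index.save(save_path="./saved_index", index_prefix="ann")
    # note: if any vectors were marked deleted, save() will consolidate_delete() first
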
def insert( self, vector: numpy.ndarray[typing.Any, numpy.dtype[typing.Union[typing.Type[numpy.float32], typing.Type[numpy.int8], typing.Type[numpy.uint8]]]], vector_id: numpy.uint32):
401    def insert(self, vector: VectorLike, vector_id: VectorIdentifier):
402        """
403        Inserts a single vector into the index with the provided vector_id.
404
405        If this insertion will overrun the `max_vectors` count boundaries of this index, `consolidate_delete()` will
406        be executed automatically.
407
408        ### Parameters
409        - **vector**: The vector to insert. Note that dtype must match.
410        - **vector_id**: The vector_id to use for this vector.
411        """
412        _vector = _castable_dtype_or_raise(vector, expected=self._vector_dtype)
413        _assert(len(_vector.shape) == 1, "insert vector must be 1-d")
414        _assert_is_positive_uint32(vector_id, "vector_id")
415        if self._num_vectors + 1 > self._max_vectors:
416            if self._removed_num_vectors > 0:
417                warnings.warn(f"Inserting this vector would overrun the max_vectors={self._max_vectors} specified at index "
418                              f"construction. We are attempting to consolidate_delete() to make space.")
419                self.consolidate_delete()
420            else:
421                raise RuntimeError(f"Inserting this vector would overrun the max_vectors={self._max_vectors} specified "
422                                   f"at index construction. Unable to make space by consolidating deletions. The insert "
423                                   f"operation has failed.")
424        status = self._index.insert(_vector, np.uint32(vector_id))
425        if status == 0:
426            self._num_vectors += 1
427        else:
428            raise RuntimeError(
429                f"Insert was unable to complete successfully; error code returned from diskann C++ lib: {status}"
430            )

Inserts a single vector into the index with the provided vector_id.

If this insertion will overrun the max_vectors count boundaries of this index, consolidate_delete() will be executed automatically.

Parameters

  • vector: The vector to insert. Note that dtype must match.
  • vector_id: The vector_id to use for this vector.
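
Example

A sketch of a single insertion, continuing the assumed 128-dimensional float32 index from the examples above:

    import numpy as np

    vector = np.random.rand(128).astype(np.float32)   # dtype must match the index
    index.insert(vector, vector_id=np.uint32(12345))  # ids are uint32 and must be > 0
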
def batch_insert( self, vectors: numpy.ndarray[typing.Any, numpy.dtype[typing.Union[typing.Type[numpy.float32], typing.Type[numpy.int8], typing.Type[numpy.uint8]]]], vector_ids: numpy.ndarray[typing.Any, numpy.dtype[numpy.uint32]], num_threads: int = 0):
433    def batch_insert(
434        self,
435        vectors: VectorLikeBatch,
436        vector_ids: VectorIdentifierBatch,
437        num_threads: int = 0,
438    ):
439        """
440        Inserts a batch of vectors into the index with the provided vector_ids.
441
442        If this batch insertion will overrun the `max_vectors` count boundaries of this index, `consolidate_delete()`
443        will be executed automatically.
444
445        ### Parameters
446        - **vectors**: The 2d numpy array of vectors to insert.
447        - **vector_ids**: The 1d array of vector ids to use. This array must have the same number of elements as
448            the vectors array has rows. The dtype of vector_ids must be `np.uint32`
449        - **num_threads**: Number of threads to use when inserting into this index. Must be >= 0; 0 means use all available logical processors.
450        """
451        _vectors = _castable_dtype_or_raise(vectors, expected=self._vector_dtype)
452        _assert(len(_vectors.shape) == 2, "vectors must be a 2-d array")
453        _assert(
454            _vectors.shape[0] == vector_ids.shape[0],
455            "Number of vectors must be equal to number of ids",
456        )
457        # _castable_dtype_or_raise has already cast the vectors to the index dtype
458        _vector_ids = vector_ids.astype(dtype=np.uint32, casting="safe", copy=False)
459
460        if self._num_vectors + _vector_ids.shape[0] > self._max_vectors:
461            if self._max_vectors + self._removed_num_vectors >= _vector_ids.shape[0]:
462                warnings.warn(f"Inserting these vectors, count={_vector_ids.shape[0]} would overrun the "
463                              f"max_vectors={self._max_vectors} specified at index construction. We are attempting to "
464                              f"consolidate_delete() to make space.")
465                self.consolidate_delete()
466            else:
467                raise RuntimeError(f"Inserting these vectors count={_vector_ids.shape[0]} would overrun the "
468                                   f"max_vectors={self._max_vectors} specified at index construction. Unable to make "
469                                   f"space by consolidating deletions. The batch insert operation has failed.")
470
471        statuses = self._index.batch_insert(
472            _vectors, _vector_ids, _vector_ids.shape[0], num_threads
473        )
474        successes = []
475        failures = []
476        for i in range(0, len(statuses)):
477            if statuses[i] == 0:
478                successes.append(i)
479            else:
480                failures.append(i)
481        self._num_vectors += len(successes)
482        if len(failures) == 0:
483            return
484        failed_ids = vector_ids[failures]
485        raise RuntimeError(
486            f"During batch insert, the following vector_ids were unable to be inserted into the index: {failed_ids}. "
487            f"{len(successes)} were successfully inserted"
488        )

Inserts a batch of vectors into the index with the provided vector_ids.

If this batch insertion will overrun the max_vectors count boundaries of this index, consolidate_delete() will be executed automatically.

Parameters

  • vectors: The 2d numpy array of vectors to insert.
  • vector_ids: The 1d array of vector ids to use. This array must have the same number of elements as the vectors array has rows. The dtype of vector_ids must be np.uint32
  • num_threads: Number of threads to use when inserting into this index. Must be >= 0; 0 means use all available logical processors.
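
Example

A sketch of a batch insertion, again assuming the 128-dimensional float32 index from the examples above:

    import numpy as np

    vectors = np.random.rand(1_000, 128).astype(np.float32)  # one row per vector
    ids = np.arange(1, 1_001, dtype=np.uint32)               # one uint32 id per row
    index.batch_insert(vectors, ids, num_threads=0)          # 0 = use all logical processors
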
def mark_deleted(self, vector_id: numpy.uint32):
491    def mark_deleted(self, vector_id: VectorIdentifier):
492        """
493        Mark vector for deletion. This is a soft delete that won't return the vector id in any results, but does not
494        remove it from the underlying index files or memory structure. To execute a hard delete, call this method and
495        then call the much more expensive `consolidate_delete` method on this index.
496        ### Parameters
497        - **vector_id**: The vector id to delete. Must be a uint32.
498        """
499        _assert_is_positive_uint32(vector_id, "vector_id")
500        self._points_deleted = True
501        self._removed_num_vectors += 1
502        # we do not decrement self._num_vectors until consolidate_delete
503        self._index.mark_deleted(np.uint32(vector_id))

Mark vector for deletion. This is a soft delete that won't return the vector id in any results, but does not remove it from the underlying index files or memory structure. To execute a hard delete, call this method and then call the much more expensive consolidate_delete method on this index.

Parameters

  • vector_id: The vector id to delete. Must be a uint32.
def consolidate_delete(self):
505    def consolidate_delete(self):
506        """
507        This method actually restructures the DiskANN index to remove the items that have been marked for deletion.
508        """
509        self._index.consolidate_delete()
510        self._points_deleted = False
511        self._num_vectors -= self._removed_num_vectors
512        self._removed_num_vectors = 0

This method actually restructures the DiskANN index to remove the items that have been marked for deletion.
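
Example

A sketch of the two-step deletion flow, assuming vector id 12345 was previously inserted:

    import numpy as np

    # soft delete: the id stops appearing in search results, but still occupies space
    index.mark_deleted(np.uint32(12345))

    # hard delete: restructure the index and reclaim the space (much more expensive)
    index.consolidate_delete()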

DistanceMetric = typing.Literal['l2', 'mips', 'cosine']

Type alias for one of {"l2", "mips", "cosine"}

VectorDType = typing.Union[typing.Type[numpy.float32], typing.Type[numpy.int8], typing.Type[numpy.uint8]]

Type alias for one of {numpy.float32, numpy.int8, numpy.uint8}

class QueryResponse(typing.NamedTuple):
68class QueryResponse(NamedTuple):
69    """
70    Tuple with two values, identifiers and distances. Both are 1d arrays, positionally correspond, and will contain the
71    nearest neighbors from [0..k_neighbors)
72    """
73
74    identifiers: npt.NDArray[VectorIdentifier]
75    """ A `numpy.typing.NDArray[VectorIdentifier]` array of vector identifiers, 1 dimensional """
76    distances: npt.NDArray[np.float32]
77    """
78    A `numpy.typing.NDArray[numpy.float32]` of distances as calculated by the distance metric function, 1 dimensional
79    """

Tuple with two values, identifiers and distances. Both are 1d arrays, positionally correspond, and will contain the nearest neighbors from [0..k_neighbors)

QueryResponse( identifiers: numpy.ndarray[typing.Any, numpy.dtype[numpy.uint32]], distances: numpy.ndarray[typing.Any, numpy.dtype[numpy.float32]])

Create new instance of QueryResponse(identifiers, distances)

identifiers: numpy.ndarray[typing.Any, numpy.dtype[numpy.uint32]]

A numpy.typing.NDArray[VectorIdentifier] array of vector identifiers, 1 dimensional

distances: numpy.ndarray[typing.Any, numpy.dtype[numpy.float32]]

A numpy.typing.NDArray[numpy.float32] of distances as calculated by the distance metric function, 1 dimensional

Inherited Members
builtins.tuple
index
count
class QueryResponseBatch(typing.NamedTuple):
82class QueryResponseBatch(NamedTuple):
83    """
84    Tuple with two values, identifiers and distances. Both are 2d arrays, where the rows correspond to the
85    queries made and the columns correspond to the k neighbors requested. The two 2d arrays have an implicit,
86    position-based relationship
87    """
88
89    identifiers: npt.NDArray[VectorIdentifier]
90    """ 
91    A `numpy.typing.NDArray[VectorIdentifier]` array of vector identifiers, 2 dimensional. The row corresponds to the index
92    of the query, and the column corresponds to the k neighbors requested 
93    """
94    distances: npt.NDArray[np.float32]
95    """  
96    A `numpy.typing.NDArray[numpy.float32]` of distances as calculated by the distance metric function, 2 dimensional.
97    The row corresponds to the index of the query, and the column corresponds to the distance of the query to the 
98    *k-th* neighbor 
99    """

Tuple with two values, identifiers and distances. Both are 2d arrays, where the rows correspond to the queries made and the columns correspond to the k neighbors requested. The two 2d arrays have an implicit, position-based relationship

QueryResponseBatch( identifiers: numpy.ndarray[typing.Any, numpy.dtype[numpy.uint32]], distances: numpy.ndarray[typing.Any, numpy.dtype[numpy.float32]])

Create new instance of QueryResponseBatch(identifiers, distances)

identifiers: numpy.ndarray[typing.Any, numpy.dtype[numpy.uint32]]

A numpy.typing.NDArray[VectorIdentifier] array of vector identifiers, 2 dimensional. The row corresponds to the index of the query, and the column corresponds to the k neighbors requested

distances: numpy.ndarray[typing.Any, numpy.dtype[numpy.float32]]

A numpy.typing.NDArray[numpy.float32] of distances as calculated by the distance metric function, 2 dimensional. The row corresponds to the index of the query, and the column corresponds to the distance of the query to the k-th neighbor

Inherited Members
builtins.tuple
index
count
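
Example

A sketch of how the two arrays line up, assuming batch is a QueryResponseBatch returned by a batch search:

    # row i corresponds to query i; column j to the j-th nearest neighbor of that query
    for i in range(batch.identifiers.shape[0]):
        nearest_id = batch.identifiers[i, 0]
        nearest_distance = batch.distances[i, 0]
        print(f"query {i}: nearest neighbor {nearest_id} at distance {nearest_distance}")
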
VectorIdentifier = numpy.uint32

Type alias for a vector identifier, whether it be an implicit array index identifier from StaticMemoryIndex or StaticDiskIndex, or an explicit tag identifier from DynamicMemoryIndex

VectorIdentifierBatch = numpy.ndarray[typing.Any, numpy.dtype[numpy.uint32]]

Type alias for a batch of VectorIdentifiers

VectorLike = numpy.ndarray[typing.Any, numpy.dtype[typing.Union[typing.Type[numpy.float32], typing.Type[numpy.int8], typing.Type[numpy.uint8]]]]

Type alias for something that can be treated as a vector

VectorLikeBatch = numpy.ndarray[typing.Any, numpy.dtype[typing.Union[typing.Type[numpy.float32], typing.Type[numpy.int8], typing.Type[numpy.uint8]]]]

Type alias for a batch of VectorLikes

class Metadata(typing.NamedTuple):
15class Metadata(NamedTuple):
16    """DiskANN binary vector files contain a small stanza containing some metadata about them."""
17
18    num_vectors: int
19    """ The number of vectors in the file. """
20    dimensions: int
21    """ The dimensionality of the vectors in the file. """

DiskANN binary vector files contain a small stanza containing some metadata about them.

Metadata(num_vectors: int, dimensions: int)

Create new instance of Metadata(num_vectors, dimensions)

num_vectors: int

The number of vectors in the file.

dimensions: int

The dimensionality of the vectors in the file.

Inherited Members
builtins.tuple
index
count
def vectors_metadata_from_file(vector_file: str) -> Metadata:
24def vectors_metadata_from_file(vector_file: str) -> Metadata:
25    """
26    Read the metadata from a DiskANN binary vector file.
27    ### Parameters
28    - **vector_file**: The path to the vector file to read the metadata from.
29
30    ### Returns
31    `diskannpy.Metadata`
32    """
33    _assert_existing_file(vector_file, "vector_file")
34    points, dims = np.fromfile(file=vector_file, dtype=np.int32, count=2)
35    return Metadata(points, dims)

Read the metadata from a DiskANN binary vector file.

Parameters

  • vector_file: The path to the vector file to read the metadata from.

Returns

diskannpy.Metadata
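
Example

A sketch of reading just the metadata stanza; the file name vectors.bin is an assumption:

    import diskannpy

    meta = diskannpy.vectors_metadata_from_file("vectors.bin")
    print(meta.num_vectors, meta.dimensions)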

def vectors_to_file( vector_file: str, vectors: numpy.ndarray[typing.Any, numpy.dtype[typing.Union[typing.Type[numpy.float32], typing.Type[numpy.int8], typing.Type[numpy.uint8]]]]) -> None:
46def vectors_to_file(vector_file: str, vectors: VectorLikeBatch) -> None:
47    """
48    Utility function that writes a DiskANN binary vector formatted file to the location of your choosing.
49
50    ### Parameters
51    - **vector_file**: The path to the vector file to write the vectors to.
52    - **vectors**: A 2d array of dtype `numpy.float32`, `numpy.uint8`, or `numpy.int8`
53    """
54    _assert_dtype(vectors.dtype)
55    _assert_2d(vectors, "vectors")
56    with open(vector_file, "wb") as fh:
57        _write_bin(vectors, fh)

Utility function that writes a DiskANN binary vector formatted file to the location of your choosing.

Parameters

  • vector_file: The path to the vector file to write the vectors to.
  • vectors: A 2d array of dtype numpy.float32, numpy.uint8, or numpy.int8
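
Example

A sketch of writing vectors to a DiskANN bin file; the array shape and the file name vectors.bin are assumptions:

    import numpy as np
    import diskannpy

    vectors = np.random.rand(10_000, 128).astype(np.float32)  # (number_of_points, vector_dim)
    diskannpy.vectors_to_file("vectors.bin", vectors)
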
def vectors_from_file( vector_file: str, dtype: Union[Type[numpy.float32], Type[numpy.int8], Type[numpy.uint8]]) -> numpy.ndarray[typing.Any, numpy.dtype[typing.Union[typing.Type[numpy.float32], typing.Type[numpy.int8], typing.Type[numpy.uint8]]]]:
60def vectors_from_file(vector_file: str, dtype: VectorDType) -> npt.NDArray[VectorDType]:
61    """
62    Read vectors from a DiskANN binary vector file.
63
64    ### Parameters
65    - **vector_file**: The path to the vector file to read the vectors from.
66    - **dtype**: The data type of the vectors in the file. Ensure you match the data types exactly
67
68    ### Returns
69    `numpy.typing.NDArray[dtype]`
70    """
71    points, dims = vectors_metadata_from_file(vector_file)
72    return np.fromfile(file=vector_file, dtype=dtype, offset=8).reshape(points, dims)

Read vectors from a DiskANN binary vector file.

Parameters

  • vector_file: The path to the vector file to read the vectors from.
  • dtype: The data type of the vectors in the file. Ensure you match the data types exactly

Returns

numpy.typing.NDArray[dtype]
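
Example

A sketch of reading back the file written in the vectors_to_file example above; the dtype must match what was written exactly:

    import numpy as np
    import diskannpy

    loaded = diskannpy.vectors_from_file("vectors.bin", dtype=np.float32)
    assert loaded.shape == (10_000, 128)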

def tags_to_file( tags_file: str, tags: numpy.ndarray[typing.Any, numpy.dtype[numpy.uint32]]) -> None:
75def tags_to_file(tags_file: str, tags: VectorIdentifierBatch) -> None:
76    """
77    Write tags to a DiskANN binary tag file.
78
79    ### Parameters
80    - **tags_file**: The path to the tag file to write the tags to.
81    - **tags**: A 1d array of dtype `numpy.uint32` containing the tags to write. If you have a 2d array of tags with
82      one column, you can pass it here and it will be reshaped and copied to a new array. It is more efficient to
83      reshape it yourself beforehand, as the reshape is a constant time operation while the copy is linear
84
85    """
86    _assert(np.can_cast(tags.dtype, np.uint32), "valid tags must be uint32")
87    _assert(
88        len(tags.shape) == 1 or tags.shape[1] == 1,
89        "tags must be 1d or 2d with 1 column",
90    )
91    if len(tags.shape) == 2:
92        warnings.warn(
93            "Tags in 2d with one column will be reshaped and copied to a new array. "
94            "It is more efficient for you to reshape without copying first."
95        )
96        tags = tags.reshape(tags.shape[0], copy=True)
97    with open(tags_file, "wb") as fh:
98        _write_bin(tags.astype(np.uint32), fh)

Write tags to a DiskANN binary tag file.

Parameters

  • tags_file: The path to the tag file to write the tags to.
  • tags: A 1d array of dtype numpy.uint32 containing the tags to write. If you have a 2d array of tags with one column, you can pass it here and it will be reshaped and copied to a new array. It is more efficient to reshape it yourself beforehand, as the reshape is a constant time operation while the copy is linear
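
Example

A sketch of writing a tags file; the tag values and the file name vectors.tags are assumptions:

    import numpy as np
    import diskannpy

    tags = np.arange(1, 10_001, dtype=np.uint32)  # one uint32 tag per stored vector
    diskannpy.tags_to_file("vectors.tags", tags)
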
def tags_from_file(tags_file: str) -> numpy.ndarray[typing.Any, numpy.dtype[numpy.uint32]]:
101def tags_from_file(tags_file: str) -> VectorIdentifierBatch:
102    """
103    Read tags from a DiskANN binary tag file and return them as a 1d array of dtype `numpy.uint32`.
104
105    ### Parameters
106    - **tags_file**: The path to the tag file to read the tags from.
107    """
108    _assert_existing_file(tags_file, "tags_file")
109    points, dims = vectors_metadata_from_file(
110        tags_file
111    )  # tag files contain the same metadata stanza
112    return np.fromfile(file=tags_file, dtype=np.uint32, offset=8).reshape(points)

Read tags from a DiskANN binary tag file and return them as a 1d array of dtype numpy.uint32.

Parameters

  • tags_file: The path to the tag file to read the tags from.
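
Example

A sketch of reading back the tags file written in the tags_to_file example above:

    import diskannpy

    tags = diskannpy.tags_from_file("vectors.tags")  # 1d array of dtype numpy.uint32
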
def valid_dtype( dtype: Type) -> Union[Type[numpy.float32], Type[numpy.int8], Type[numpy.uint8]]:
27def valid_dtype(dtype: Type) -> VectorDType:
28    """
29    Utility method to determine whether the provided dtype is supported by `diskannpy`, and if so, return the canonical
30    dtype we will use internally (e.g. np.single -> np.float32)
31    """
32    _assert_dtype(dtype)
33    if dtype == np.uint8:
34        return np.uint8
35    if dtype == np.int8:
36        return np.int8
37    if dtype == np.float32:
38        return np.float32

Utility method to determine whether the provided dtype is supported by diskannpy, and if so, return the canonical dtype we will use internally (e.g. np.single -> np.float32)
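
Example

A sketch of normalizing a dtype alias to its canonical form; an unsupported dtype (e.g. numpy.float64) fails the internal dtype assertion rather than returning:

    import numpy as np
    import diskannpy

    canonical = diskannpy.valid_dtype(np.single)  # returns numpy.float32
    assert canonical == np.float32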