# diskannpy

## Documentation Overview

`diskannpy` is mostly structured around 2 distinct processes: [Index Builder Functions](#index-builders) and [Search Classes](#search-classes).

It also includes a few nascent [utilities](#utilities).

And lastly, it makes substantial use of type hints, with various shorthand [type aliases](#parameter-and-response-type-aliases) documented. When reading the `diskannpy` code we refer to the type aliases, though `pdoc` helpfully expands them.

### Index Builders
- `build_disk_index` - To build an index that cannot fully fit into memory when searching
- `build_memory_index` - To build an index that can fully fit into memory when searching

### Search Classes
- `StaticMemoryIndex` - for indices that can fully fit in memory and won't be changed during the search operations
- `StaticDiskIndex` - for indices that cannot fully fit in memory, thus relying on disk IO to search, and also won't be changed during search operations
- `DynamicMemoryIndex` - for indices that can fully fit in memory and will be mutated via insert/deletion operations as well as search operations

### Parameter Defaults
- `diskannpy.defaults` - Default values exported from the C++ extension for Python users

### Parameter and Response Type Aliases
- `DistanceMetric` - What distance metrics does `diskannpy` support?
- `VectorDType` - What vector datatypes does `diskannpy` support?
- `QueryResponse` - What can I expect as a response to my search?
- `QueryResponseBatch` - What can I expect as a response to my batch search?
- `VectorIdentifier` - What types does `diskannpy` support as vector identifiers?
- `VectorIdentifierBatch` - A batch of identifiers, which must **all** be of the exact same type.
- `VectorLike` - How a vector must look to `diskannpy`, to be inserted or searched with.
- `VectorLikeBatch` - A batch of those vectors, to be inserted or searched with.
- `Metadata` - DiskANN vector binary file metadata (num_points, vector_dim)

### Utilities
- `vectors_to_file` - Turns a 2 dimensional `numpy.typing.NDArray[VectorDType]` with shape `(number_of_points, vector_dim)` into a DiskANN vector bin file.
- `vectors_from_file` - Reads a DiskANN vector bin file representing stored vectors into a numpy ndarray.
- `vectors_metadata_from_file` - Reads the metadata stored in a DiskANN vector bin file without reading the entire file.
- `tags_to_file` - Turns a 1 dimensional `numpy.typing.NDArray[VectorIdentifier]` into a DiskANN tags bin file.
- `tags_from_file` - Reads a DiskANN tags bin file representing stored tags into a numpy ndarray.
- `valid_dtype` - Checks whether a given vector dtype is supported by `diskannpy`.
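As a quick orientation before the API reference below, here is a minimal end-to-end sketch that builds a small in-memory index and immediately searches it. The directory path, random data, and parameter values are illustrative assumptions only, not recommendations; note the index directory must already exist.

```python
import numpy as np

import diskannpy

# 10,000 random float32 vectors of dimension 128 stand in for real data;
# diskannpy indices are built over float32, int8, or uint8 vectors.
rng = np.random.default_rng(1234)
vectors = rng.random((10_000, 128), dtype=np.float32)

diskannpy.build_memory_index(
    data=vectors,
    distance_metric="l2",
    index_directory="/tmp/diskannpy_demo",  # illustrative path; must already exist
    complexity=64,
    graph_degree=32,
    num_threads=0,  # 0 = use all available logical processors
)

index = diskannpy.StaticMemoryIndex(
    index_directory="/tmp/diskannpy_demo",
    num_threads=0,
    initial_search_complexity=64,
)

# QueryResponse is a NamedTuple of (identifiers, distances), so it unpacks.
identifiers, distances = index.search(vectors[0], k_neighbors=10, complexity=64)
```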
Module source (the module docstring is the overview rendered above):

```python
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

from typing import Any, Literal, NamedTuple, Type, Union

import numpy as np
from numpy import typing as npt

DistanceMetric = Literal["l2", "mips", "cosine"]
""" Type alias for one of {"l2", "mips", "cosine"} """
VectorDType = Union[Type[np.float32], Type[np.int8], Type[np.uint8]]
""" Type alias for one of {`numpy.float32`, `numpy.int8`, `numpy.uint8`} """
VectorLike = npt.NDArray[VectorDType]
""" Type alias for something that can be treated as a vector """
VectorLikeBatch = npt.NDArray[VectorDType]
""" Type alias for a batch of VectorLikes """
VectorIdentifier = np.uint32
"""
Type alias for a vector identifier, whether it be an implicit array index identifier from StaticMemoryIndex or
StaticDiskIndex, or an explicit tag identifier from DynamicMemoryIndex
"""
VectorIdentifierBatch = npt.NDArray[np.uint32]
""" Type alias for a batch of VectorIdentifiers """


class QueryResponse(NamedTuple):
    """
    Tuple with two values, identifiers and distances. Both are 1d arrays, positionally correspond, and will contain
    the nearest neighbors from [0..k_neighbors)
    """

    identifiers: npt.NDArray[VectorIdentifier]
    """ A `numpy.typing.NDArray[VectorIdentifier]` array of vector identifiers, 1 dimensional """
    distances: npt.NDArray[np.float32]
    """
    A `numpy.typing.NDArray[numpy.float32]` of distances as calculated by the distance metric function, 1 dimensional
    """


class QueryResponseBatch(NamedTuple):
    """
    Tuple with two values, identifiers and distances. Both are 2d arrays, with dimensionality determined by the
    rows corresponding to the number of queries made, and the columns corresponding to the k neighbors
    requested. The two 2d arrays have an implicit, position-based relationship
    """

    identifiers: npt.NDArray[VectorIdentifier]
    """
    A `numpy.typing.NDArray[VectorIdentifier]` array of vector identifiers, 2 dimensional. The row corresponds to
    the index of the query, and the column corresponds to the k neighbors requested
    """
    distances: npt.NDArray[np.float32]
    """
    A `numpy.typing.NDArray[numpy.float32]` of distances as calculated by the distance metric function, 2 dimensional.
    The row corresponds to the index of the query, and the column corresponds to the distance of the query to the
    *k-th* neighbor
    """


from . import defaults
from ._builder import build_disk_index, build_memory_index
from ._common import valid_dtype
from ._dynamic_memory_index import DynamicMemoryIndex
from ._files import (
    Metadata,
    tags_from_file,
    tags_to_file,
    vectors_from_file,
    vectors_metadata_from_file,
    vectors_to_file,
)
from ._static_disk_index import StaticDiskIndex
from ._static_memory_index import StaticMemoryIndex

__all__ = [
    "build_disk_index",
    "build_memory_index",
    "StaticDiskIndex",
    "StaticMemoryIndex",
    "DynamicMemoryIndex",
    "defaults",
    "DistanceMetric",
    "VectorDType",
    "QueryResponse",
    "QueryResponseBatch",
    "VectorIdentifier",
    "VectorIdentifierBatch",
    "VectorLike",
    "VectorLikeBatch",
    "Metadata",
    "vectors_metadata_from_file",
    "vectors_to_file",
    "vectors_from_file",
    "tags_to_file",
    "tags_from_file",
    "valid_dtype",
]
```
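Since `QueryResponse` and `QueryResponseBatch` are `NamedTuple`s, their fields can be unpacked positionally or accessed by name. The helper below is hypothetical, purely to illustrate the position-based relationship between the two arrays described in the docstrings:

```python
from diskannpy import QueryResponseBatch


def nearest_neighbor_per_query(response: QueryResponseBatch) -> list[tuple[int, float]]:
    # Row i of both arrays holds the k results for query i, nearest first,
    # so column 0 pairs each query's closest identifier with its distance.
    return [
        (int(response.identifiers[i, 0]), float(response.distances[i, 0]))
        for i in range(response.identifiers.shape[0])
    ]
```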
## build_disk_index

```python
def build_disk_index(
    data: Union[str, VectorLikeBatch],
    distance_metric: DistanceMetric,
    index_directory: str,
    complexity: int,
    graph_degree: int,
    search_memory_maximum: float,
    build_memory_maximum: float,
    num_threads: int,
    pq_disk_bytes: int = defaults.PQ_DISK_BYTES,
    vector_dtype: Optional[VectorDType] = None,
    index_prefix: str = "ann",
) -> None:
    _assert(
        (isinstance(data, str) and vector_dtype is not None)
        or isinstance(data, np.ndarray),
        "vector_dtype is required if data is a str representing a path to the vector bin file",
    )
    dap_metric = _valid_metric(distance_metric)
    _assert_is_positive_uint32(complexity, "complexity")
    _assert_is_positive_uint32(graph_degree, "graph_degree")
    _assert(search_memory_maximum > 0, "search_memory_maximum must be larger than 0")
    _assert(build_memory_maximum > 0, "build_memory_maximum must be larger than 0")
    _assert_is_nonnegative_uint32(num_threads, "num_threads")
    _assert_is_nonnegative_uint32(pq_disk_bytes, "pq_disk_bytes")
    _assert(index_prefix != "", "index_prefix cannot be an empty string")

    index_path = Path(index_directory)
    _assert(
        index_path.exists() and index_path.is_dir(),
        "index_directory must both exist and be a directory",
    )

    vector_bin_path, vector_dtype_actual = _valid_path_and_dtype(
        data, vector_dtype, index_directory, index_prefix
    )
    _assert(dap_metric != _native_dap.COSINE, "Cosine is currently not supported in StaticDiskIndex")
    if dap_metric == _native_dap.INNER_PRODUCT:
        _assert(
            vector_dtype_actual == np.float32,
            "Integral vector dtypes (np.uint8, np.int8) are not supported with distance metric mips"
        )

    num_points, dimensions = vectors_metadata_from_file(vector_bin_path)

    if vector_dtype_actual == np.uint8:
        _builder = _native_dap.build_disk_uint8_index
    elif vector_dtype_actual == np.int8:
        _builder = _native_dap.build_disk_int8_index
    else:
        _builder = _native_dap.build_disk_float_index

    index_prefix_path = os.path.join(index_directory, index_prefix)

    _builder(
        distance_metric=dap_metric,
        data_file_path=vector_bin_path,
        index_prefix_path=index_prefix_path,
        complexity=complexity,
        graph_degree=graph_degree,
        final_index_ram_limit=search_memory_maximum,
        indexing_ram_budget=build_memory_maximum,
        num_threads=num_threads,
        pq_disk_bytes=pq_disk_bytes,
    )
    _write_index_metadata(
        index_prefix_path, vector_dtype_actual, dap_metric, num_points, dimensions
    )
```
This function will construct a DiskANN disk index. Disk indices are ideal for very large datasets that are too large to fit in memory. Memory is still used, but it is primarily used to provide precise disk locations for fast retrieval of smaller subsets of the index without compromising much on recall.
If you provide a numpy array, it will save this array to disk in a temp location in the format DiskANN's PQ Flash Index builder requires. This temp folder is deleted upon index creation completion or error.
### Distance Metric and Vector Datatype Restrictions

| Metric \ Datatype | np.float32 | np.uint8 | np.int8 |
|-------------------|------------|----------|---------|
| L2                | ✅         | ✅       | ✅      |
| MIPS              | ✅         | ❌       | ❌      |
| Cosine [^bug-in-disk-cosine] | ❌ | ❌      | ❌      |

[^bug-in-disk-cosine]: For StaticDiskIndex, Cosine distances are not currently supported.

### Parameters

- **data**: Either a `str` representing a path to a DiskANN vector bin file, or a numpy.ndarray of a supported dtype, in 2 dimensions. Note that `vector_dtype` must be provided if `data` is a `str`.
- **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for all 3 vector dtypes, but `mips` is only available for single precision floats.
- **index_directory**: The index files will be saved to this **existing** directory path.
- **complexity**: The size of the candidate nearest neighbor list to use when building the index. Values between 75 and 200 are typical. Larger values will take more time to build but result in indices that provide higher recall for the same search complexity. Use a value that is at least as large as `graph_degree` unless you are prepared to compromise on quality.
- **graph_degree**: The degree of the graph index, typically between 60 and 150. A larger maximum degree will result in larger indices and longer indexing times, but better search quality.
- **search_memory_maximum**: Build the index with the expectation that the search will use at most `search_memory_maximum`, in GB.
- **build_memory_maximum**: Build the index using at most `build_memory_maximum`, in GB. Building processes typically require more memory, while search memory can be reduced.
- **num_threads**: Number of threads to use when creating this index. `0` indicates that all available logical processors should be used.
- **pq_disk_bytes**: Use `0` to store uncompressed data on SSD, which allows the index to asymptote to 100% recall. If your vectors are too large to store on SSD, this parameter provides the option to compress the vectors using PQ for storing on SSD, which will trade off recall. You would also want this to be greater than the number of bytes used for the PQ compressed data stored in-memory. Default is `0`.
- **vector_dtype**: Required if the provided `data` is of type `str`; otherwise the `data.dtype` of the numpy array is used.
- **index_prefix**: The prefix of the index files. Defaults to "ann".
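A sketch of a typical call, under the assumption of an existing empty directory at `/tmp/disk_index` and synthetic float32 data; all parameter values are illustrative and should be tuned to your dataset:

```python
import numpy as np

import diskannpy

rng = np.random.default_rng(5678)
vectors = rng.random((250_000, 96), dtype=np.float32)  # synthetic stand-in data

diskannpy.build_disk_index(
    data=vectors,                       # or a str path to a vector bin file (then vector_dtype is required)
    distance_metric="l2",               # cosine is not currently supported for disk indices
    index_directory="/tmp/disk_index",  # must already exist
    complexity=128,
    graph_degree=64,
    search_memory_maximum=4.0,          # GB budget expected at search time
    build_memory_maximum=16.0,          # GB budget while building
    num_threads=0,                      # all available logical processors
    pq_disk_bytes=0,                    # store uncompressed vectors on SSD
)
```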
## build_memory_index

```python
def build_memory_index(
    data: Union[str, VectorLikeBatch],
    distance_metric: DistanceMetric,
    index_directory: str,
    complexity: int,
    graph_degree: int,
    num_threads: int,
    alpha: float = defaults.ALPHA,
    use_pq_build: bool = defaults.USE_PQ_BUILD,
    num_pq_bytes: int = defaults.NUM_PQ_BYTES,
    use_opq: bool = defaults.USE_OPQ,
    vector_dtype: Optional[VectorDType] = None,
    tags: Union[str, VectorIdentifierBatch] = "",
    filter_labels: Optional[list[list[str]]] = None,
    universal_label: str = "",
    filter_complexity: int = defaults.FILTER_COMPLEXITY,
    index_prefix: str = "ann",
) -> None:
    _assert(
        (isinstance(data, str) and vector_dtype is not None)
        or isinstance(data, np.ndarray),
        "vector_dtype is required if data is a str representing a path to the vector bin file",
    )
    dap_metric = _valid_metric(distance_metric)
    _assert_is_positive_uint32(complexity, "complexity")
    _assert_is_positive_uint32(graph_degree, "graph_degree")
    _assert(
        alpha >= 1,
        "alpha must be >= 1, and realistically should be kept between [1.0, 2.0)",
    )
    _assert_is_nonnegative_uint32(num_threads, "num_threads")
    _assert_is_nonnegative_uint32(num_pq_bytes, "num_pq_bytes")
    _assert_is_nonnegative_uint32(filter_complexity, "filter_complexity")
    _assert(index_prefix != "", "index_prefix cannot be an empty string")
    _assert(
        filter_labels is None or filter_complexity > 0,
        "if filter_labels is provided, filter_complexity must not be 0"
    )

    index_path = Path(index_directory)
    _assert(
        index_path.exists() and index_path.is_dir(),
        "index_directory must both exist and be a directory",
    )

    vector_bin_path, vector_dtype_actual = _valid_path_and_dtype(
        data, vector_dtype, index_directory, index_prefix
    )
    if dap_metric == _native_dap.INNER_PRODUCT:
        _assert(
            vector_dtype_actual == np.float32,
            "Integral vector dtypes (np.uint8, np.int8) are not supported with distance metric mips"
        )

    num_points, dimensions = vectors_metadata_from_file(vector_bin_path)
    if filter_labels is not None:
        _assert(
            len(filter_labels) == num_points,
            "filter_labels must be the same length as the number of points"
        )

    if vector_dtype_actual == np.uint8:
        _builder = _native_dap.build_memory_uint8_index
    elif vector_dtype_actual == np.int8:
        _builder = _native_dap.build_memory_int8_index
    else:
        _builder = _native_dap.build_memory_float_index

    index_prefix_path = os.path.join(index_directory, index_prefix)

    filter_labels_file = ""
    if filter_labels is not None:
        label_counts = {}
        filter_labels_file = f"{index_prefix_path}_pylabels.txt"
        with open(filter_labels_file, "w") as labels_file:
            for labels in filter_labels:
                for label in labels:
                    label_counts[label] = 1 if label not in label_counts else label_counts[label] + 1
                if len(labels) == 0:
                    print("default", file=labels_file)
                else:
                    print(",".join(labels), file=labels_file)
        with open(f"{index_prefix_path}_label_metadata.json", "w") as label_metadata_file:
            json.dump(label_counts, label_metadata_file, indent=True)

    if isinstance(tags, str) and tags != "":
        use_tags = True
        shutil.copy(tags, index_prefix_path + ".tags")
    elif not isinstance(tags, str):
        use_tags = True
        tags_as_array = _castable_dtype_or_raise(tags, expected=np.uint32)
        _assert(len(tags_as_array.shape) == 1, "Provided tags must be 1 dimensional")
        _assert(
            tags_as_array.shape[0] == num_points,
            "Provided tags must contain an identical population to the number of points, "
            f"{tags_as_array.shape[0]=}, {num_points=}",
        )
        tags_to_file(index_prefix_path + ".tags", tags_as_array)
    else:
        use_tags = False

    _builder(
        distance_metric=dap_metric,
        data_file_path=vector_bin_path,
        index_output_path=index_prefix_path,
        complexity=complexity,
        graph_degree=graph_degree,
        alpha=alpha,
        num_threads=num_threads,
        use_pq_build=use_pq_build,
        num_pq_bytes=num_pq_bytes,
        use_opq=use_opq,
        use_tags=use_tags,
        filter_labels_file=filter_labels_file,
        universal_label=universal_label,
        filter_complexity=filter_complexity,
    )

    _write_index_metadata(
        index_prefix_path, vector_dtype_actual, dap_metric, num_points, dimensions
    )
```
This function will construct a DiskANN memory index. Memory indices are ideal for smaller datasets whose indices can fit into memory. Memory indices are faster than disk indices, but usually cannot scale to massive sizes in an individual index on an individual machine.
`diskannpy`'s memory indices take two forms: a `diskannpy.StaticMemoryIndex`, which will not be mutated, only searched upon, and a `diskannpy.DynamicMemoryIndex`, which can be mutated AND searched upon in the same process.

### Important Note

You **must** determine the type of index you are building for. If you are building for a `diskannpy.DynamicMemoryIndex`, you **must** supply a valid value for the `tags` parameter. **Do not supply tags if the index is intended to be a `diskannpy.StaticMemoryIndex`**!

### Distance Metric and Vector Datatype Restrictions

| Metric \ Datatype | np.float32 | np.uint8 | np.int8 |
|-------------------|------------|----------|---------|
| L2                | ✅         | ✅       | ✅      |
| MIPS              | ✅         | ❌       | ❌      |
| Cosine            | ✅         | ✅       | ✅      |

### Parameters

- **data**: Either a `str` representing a path to an existing DiskANN vector bin file, or a numpy.ndarray of a supported dtype in 2 dimensions. Note that `vector_dtype` must be provided if `data` is a `str`.
- **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for all 3 vector dtypes, but `mips` is only available for single precision floats.
- **index_directory**: The index files will be saved to this **existing** directory path.
- **complexity**: The size of the candidate nearest neighbor list to use when building the index. Values between 75 and 200 are typical. Larger values will take more time to build but result in indices that provide higher recall for the same search complexity. Use a value that is at least as large as `graph_degree` unless you are prepared to compromise on quality.
- **graph_degree**: The degree of the graph index, typically between 60 and 150. A larger maximum degree will result in larger indices and longer indexing times, but better search quality.
- **num_threads**: Number of threads to use when creating this index. `0` indicates that all available logical processors should be used.
- **alpha**: The alpha parameter (>= 1) is used to control the nature and number of points that are added to the graph. A higher alpha value (e.g., 1.4) will result in fewer hops (and IOs) to convergence, but probably more distance comparisons compared to a lower alpha value.
- **use_pq_build**: Use product quantization during build. Product quantization is a lossy compression technique that can reduce the size of the index on disk. This will trade off recall. Default is `True`.
- **num_pq_bytes**: The number of bytes used to store the PQ compressed data in memory. This will trade off recall. Default is `0`.
- **use_opq**: Use optimized product quantization during build.
- **vector_dtype**: Required if the provided `data` is of type `str`; otherwise the `data.dtype` of the numpy array is used.
- **tags**: Tags can be defined either as a path on disk to an existing .tags file, or provided as a np.array of the same length as the number of vectors. Tags are used to identify vectors in the index via your *own* numbering conventions, and are absolutely required for loading DynamicMemoryIndex indices `from_file`.
- **filter_labels**: An optional, but exhaustive, list of categories for each vector. This is used to filter search results by category. If provided, this must be a list of lists, where each inner list is the list of categories for the corresponding vector. For example, if you have 3 vectors, and the first vector belongs to categories "a" and "b", the second vector belongs to category "b", and the third vector belongs to no categories, you would provide `filter_labels=[["a", "b"], ["b"], []]`. If you do not want to provide categories for a particular vector, you can provide an empty list. If you do not want to provide categories for any vectors, you can provide `None` for this parameter (which is the default). See the sketch after this list for a worked example.
- **universal_label**: An optional label that indicates that this vector should be included in *every* search in which it also meets the knn search criteria.
- **filter_complexity**: Complexity to use when using filters. Default is 0. 0 is strictly invalid if you are using filters.
- **index_prefix**: The prefix of the index files. Defaults to "ann".
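A sketch of a filtered build, reusing the three-vector `filter_labels` example from the parameter notes above; the directory path is an assumption and must already exist. `tags` is omitted because this index is meant to be loaded as a `StaticMemoryIndex`:

```python
import numpy as np

import diskannpy

rng = np.random.default_rng(42)
vectors = rng.random((3, 16), dtype=np.float32)

diskannpy.build_memory_index(
    data=vectors,
    distance_metric="cosine",
    index_directory="/tmp/filtered_index",  # must already exist
    complexity=64,
    graph_degree=32,
    num_threads=0,
    filter_labels=[["a", "b"], ["b"], []],  # per-vector categories, as in the docs above
    filter_complexity=64,                   # must be > 0 whenever filter_labels is given
)
```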
## StaticDiskIndex

```python
class StaticDiskIndex:
    def __init__(
        self,
        index_directory: str,
        num_threads: int,
        num_nodes_to_cache: int,
        cache_mechanism: int = 1,
        distance_metric: Optional[DistanceMetric] = None,
        vector_dtype: Optional[VectorDType] = None,
        dimensions: Optional[int] = None,
        index_prefix: str = "ann",
    ):
        index_prefix_path = _valid_index_prefix(index_directory, index_prefix)
        vector_dtype, metric, _, _ = _ensure_index_metadata(
            index_prefix_path,
            vector_dtype,
            distance_metric,
            1,  # it doesn't matter because we don't need it in this context anyway
            dimensions,
        )
        dap_metric = _valid_metric(metric)

        _assert_is_nonnegative_uint32(num_threads, "num_threads")
        _assert_is_nonnegative_uint32(num_nodes_to_cache, "num_nodes_to_cache")

        self._vector_dtype = vector_dtype
        if vector_dtype == np.uint8:
            _index = _native_dap.StaticDiskUInt8Index
        elif vector_dtype == np.int8:
            _index = _native_dap.StaticDiskInt8Index
        else:
            _index = _native_dap.StaticDiskFloatIndex
        self._index = _index(
            distance_metric=dap_metric,
            index_path_prefix=index_prefix_path,
            num_threads=num_threads,
            num_nodes_to_cache=num_nodes_to_cache,
            cache_mechanism=cache_mechanism,
        )

    def search(
        self, query: VectorLike, k_neighbors: int, complexity: int, beam_width: int = 2
    ) -> QueryResponse:
        _query = _castable_dtype_or_raise(query, expected=self._vector_dtype)
        _assert(len(_query.shape) == 1, "query vector must be 1-d")
        _assert_is_positive_uint32(k_neighbors, "k_neighbors")
        _assert_is_positive_uint32(complexity, "complexity")
        _assert_is_positive_uint32(beam_width, "beam_width")

        if k_neighbors > complexity:
            warnings.warn(
                f"{k_neighbors=} asked for, but {complexity=} was smaller. Increasing {complexity} to {k_neighbors}"
            )
            complexity = k_neighbors

        neighbors, distances = self._index.search(
            query=_query,
            knn=k_neighbors,
            complexity=complexity,
            beam_width=beam_width,
        )
        return QueryResponse(identifiers=neighbors, distances=distances)

    def batch_search(
        self,
        queries: VectorLikeBatch,
        k_neighbors: int,
        complexity: int,
        num_threads: int,
        beam_width: int = 2,
    ) -> QueryResponseBatch:
        _queries = _castable_dtype_or_raise(queries, expected=self._vector_dtype)
        _assert_2d(_queries, "queries")
        _assert_is_positive_uint32(k_neighbors, "k_neighbors")
        _assert_is_positive_uint32(complexity, "complexity")
        _assert_is_nonnegative_uint32(num_threads, "num_threads")
        _assert_is_positive_uint32(beam_width, "beam_width")

        if k_neighbors > complexity:
            warnings.warn(
                f"{k_neighbors=} asked for, but {complexity=} was smaller. Increasing {complexity} to {k_neighbors}"
            )
            complexity = k_neighbors

        num_queries, dim = _queries.shape
        neighbors, distances = self._index.batch_search(
            queries=_queries,
            num_queries=num_queries,
            knn=k_neighbors,
            complexity=complexity,
            beam_width=beam_width,
            num_threads=num_threads,
        )
        return QueryResponseBatch(identifiers=neighbors, distances=distances)
```
A StaticDiskIndex is a disk-backed index that is not mutable.
### StaticDiskIndex.__init__

#### Parameters

- **index_directory**: The directory containing the index files. This directory must contain the following files:
  - `{index_prefix}_sample_data.bin`
  - `{index_prefix}_mem.index.data`
  - `{index_prefix}_pq_compressed.bin`
  - `{index_prefix}_pq_pivots.bin`
  - `{index_prefix}_sample_ids.bin`
  - `{index_prefix}_disk.index`

  It may also include the following optional files:
  - `{index_prefix}_vectors.bin`: Optional. `diskannpy` builder functions may create this file in the `index_directory` if the index was created from a numpy array.
  - `{index_prefix}_metadata.bin`: Optional. `diskannpy` builder functions create this file to store metadata about the index, such as vector dtype, distance metric, number of vectors and vector dimensionality. If an index is built from the `diskann` cli tools, this file will not exist.
- **num_threads**: Number of threads to use when searching this index. (>= 0), 0 = num_threads in system
- **num_nodes_to_cache**: Number of nodes to cache in memory (> -1)
- **cache_mechanism**: 1 -> use the generated sample_data.bin file for the index to initialize a set of cached nodes, up to `num_nodes_to_cache`, 2 -> ready the cache for up to `num_nodes_to_cache`, but do not initialize it with any nodes. Any other value disables node caching.
- **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for all 3 vector dtypes, but `mips` is only available for single precision floats. Default is `None`. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it does not exist, you are required to provide it.
- **vector_dtype**: The vector dtype this index has been built with. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it does not exist, you are required to provide it.
- **dimensions**: The vector dimensionality of this index. All new vectors inserted must be the same dimensionality. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it does not exist, you are required to provide it.
- **index_prefix**: The prefix of the index files. Defaults to "ann".
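A construction sketch, assuming an index previously written by `diskannpy.build_disk_index` into `/tmp/disk_index` with the default `"ann"` prefix; because the builder also wrote `ann_metadata.bin`, the metric, dtype, and dimensions need not be passed explicitly:

```python
import diskannpy

index = diskannpy.StaticDiskIndex(
    index_directory="/tmp/disk_index",  # assumed to hold a previously built disk index
    num_threads=0,                      # all available logical processors
    num_nodes_to_cache=100_000,
    cache_mechanism=1,                  # warm the cache from the generated sample data
)
```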
### StaticDiskIndex.search

Searches the index by a single query vector.

#### Parameters

- **query**: 1d numpy array of the same dimensionality and dtype of the index.
- **k_neighbors**: Number of neighbors to be returned. If the query vector exists in the index, it almost definitely will be returned as well, so adjust your `k_neighbors` as appropriate. Must be > 0.
- **complexity**: Size of the distance-ordered list of candidate neighbors to use while searching. List size increases accuracy at the cost of latency. Must be at least `k_neighbors` in size.
- **beam_width**: The beamwidth to be used for search. This is the maximum number of IO requests each query will issue per iteration of search code. A larger beamwidth will result in fewer IO round-trips per query, but might result in a slightly higher total number of IO requests to SSD per query. For the highest query throughput with a fixed SSD IOps rating, use `W=1`. For best latency, use `W=4,8` or a higher complexity search. Specifying `0` will optimize the beamwidth depending on the number of threads performing search, but will involve some tuning overhead.
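Continuing the construction sketch above, a single query might look like this; the query dimensionality and dtype must match the index:

```python
import numpy as np

query = np.random.default_rng(7).random(96, dtype=np.float32)  # dim matches the sketched index
identifiers, distances = index.search(query, k_neighbors=10, complexity=64, beam_width=2)
for vector_id, distance in zip(identifiers, distances):
    print(vector_id, distance)  # nearest neighbors first
```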
### StaticDiskIndex.batch_search

Searches the index by a batch of query vectors.

This search is parallelized and far more efficient than searching for each vector individually.

#### Parameters

- **queries**: 2d numpy array, with column dimensionality matching the index and row dimensionality being the number of queries intended to search for in parallel. Dtype must match the dtype of the index.
- **k_neighbors**: Number of neighbors to be returned. If the query vector exists in the index, it almost definitely will be returned as well, so adjust your `k_neighbors` as appropriate. Must be > 0.
- **complexity**: Size of the distance-ordered list of candidate neighbors to use while searching. List size increases accuracy at the cost of latency. Must be at least `k_neighbors` in size.
- **num_threads**: Number of threads to use when searching this index. (>= 0), 0 = num_threads in system
- **beam_width**: The beamwidth to be used for search. This is the maximum number of IO requests each query will issue per iteration of search code. A larger beamwidth will result in fewer IO round-trips per query, but might result in a slightly higher total number of IO requests to SSD per query. For the highest query throughput with a fixed SSD IOps rating, use `W=1`. For best latency, use `W=4,8` or a higher complexity search. Specifying `0` will optimize the beamwidth depending on the number of threads performing search, but will involve some tuning overhead.
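Continuing the same sketch, a batch search over 128 queries; both returned arrays have shape `(num_queries, k_neighbors)` and row `i` corresponds to `queries[i]`:

```python
import numpy as np

queries = np.random.default_rng(11).random((128, 96), dtype=np.float32)
identifiers, distances = index.batch_search(
    queries,
    k_neighbors=10,
    complexity=64,
    num_threads=0,  # parallelize across all available logical processors
)
assert identifiers.shape == (128, 10)  # row i pairs with queries[i]
```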
## StaticMemoryIndex

```python
class StaticMemoryIndex:
    """
    A StaticMemoryIndex is an immutable in-memory DiskANN index.
    """

    def __init__(
        self,
        index_directory: str,
        num_threads: int,
        initial_search_complexity: int,
        index_prefix: str = "ann",
        distance_metric: Optional[DistanceMetric] = None,
        vector_dtype: Optional[VectorDType] = None,
        dimensions: Optional[int] = None,
        enable_filters: bool = False
    ):
        """
        ### Parameters
        - **index_directory**: The directory containing the index files. This directory must contain the following
          files:
            - `{index_prefix}.data`
            - `{index_prefix}`

          It may also include the following optional files:
            - `{index_prefix}_vectors.bin`: Optional. `diskannpy` builder functions may create this file in the
              `index_directory` if the index was created from a numpy array
            - `{index_prefix}_metadata.bin`: Optional. `diskannpy` builder functions create this file to store
              metadata about the index, such as vector dtype, distance metric, number of vectors and vector
              dimensionality. If an index is built from the `diskann` cli tools, this file will not exist.
        - **num_threads**: Number of threads to use when searching this index. (>= 0), 0 = num_threads in system
        - **initial_search_complexity**: Should be set to the most common `complexity` expected to be used during
          the life of this `diskannpy.StaticMemoryIndex` object. The working scratch memory allocated is based off
          of `initial_search_complexity` * `search_threads`. Note that it may be resized if a `search` or
          `batch_search` operation requests a space larger than can be accommodated by these values.
        - **index_prefix**: The prefix of the index files. Defaults to "ann".
        - **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported
          for all 3 vector dtypes, but `mips` is only available for single precision floats. Default is `None`.
          **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it does not exist,
          you are required to provide it.
        - **vector_dtype**: The vector dtype this index has been built with. **This value is only used if a
          `{index_prefix}_metadata.bin` file does not exist.** If it does not exist, you are required to provide it.
        - **dimensions**: The vector dimensionality of this index. All new vectors inserted must be the same
          dimensionality. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it
          does not exist, you are required to provide it.
        - **enable_filters**: Indexes built with filters can also be used for filtered search.
        """
        index_prefix_path = _valid_index_prefix(index_directory, index_prefix)
        self._labels_map = {}
        self._labels_metadata = {}
        if enable_filters:
            try:
                with open(f"{index_prefix_path}_labels_map.txt", "r") as labels_map_if:
                    for line in labels_map_if:
                        (key, val) = line.split("\t")
                        self._labels_map[key] = int(val)
                with open(f"{index_prefix_path}_label_metadata.json", "r") as labels_metadata_if:
                    self._labels_metadata = json.load(labels_metadata_if)
            except:  # noqa: E722
                # exceptions are basically presumed to be either file not found or file not formatted correctly
                raise RuntimeError("Filter labels file was unable to be processed.")
        vector_dtype, metric, num_points, dims = _ensure_index_metadata(
            index_prefix_path,
            vector_dtype,
            distance_metric,
            1,  # it doesn't matter because we don't need it in this context anyway
            dimensions,
        )
        dap_metric = _valid_metric(metric)

        _assert_is_nonnegative_uint32(num_threads, "num_threads")
        _assert_is_positive_uint32(
            initial_search_complexity, "initial_search_complexity"
        )

        self._vector_dtype = vector_dtype
        self._dimensions = dims

        if vector_dtype == np.uint8:
            _index = _native_dap.StaticMemoryUInt8Index
        elif vector_dtype == np.int8:
            _index = _native_dap.StaticMemoryInt8Index
        else:
            _index = _native_dap.StaticMemoryFloatIndex

        self._index = _index(
            distance_metric=dap_metric,
            num_points=num_points,
            dimensions=dims,
            index_path=index_prefix_path,
            num_threads=num_threads,
            initial_search_complexity=initial_search_complexity,
        )

    def search(
        self, query: VectorLike, k_neighbors: int, complexity: int, filter_label: str = ""
    ) -> QueryResponse:
        """
        Searches the index by a single query vector.

        ### Parameters
        - **query**: 1d numpy array of the same dimensionality and dtype of the index.
        - **k_neighbors**: Number of neighbors to be returned. If query vector exists in index, it almost definitely
          will be returned as well, so adjust your ``k_neighbors`` as appropriate. Must be > 0.
        - **complexity**: Size of distance ordered list of candidate neighbors to use while searching. List size
          increases accuracy at the cost of latency. Must be at least k_neighbors in size.
        - **filter_label**: An optional label to restrict the search to vectors assigned that category at build
          time. Requires the index to have been loaded with `enable_filters=True`.
        """
        if filter_label != "":
            if len(self._labels_map) == 0:
                raise ValueError(
                    f"A filter label of {filter_label} was provided, but this class was not initialized with "
                    "filters enabled, e.g. StaticMemoryIndex(..., enable_filters=True)"
                )
            if filter_label not in self._labels_map:
                raise ValueError(
                    f"A filter label of {filter_label} was provided, but the external(str)->internal(np.uint32) "
                    f"labels map does not include that label."
                )
            k_neighbors = min(k_neighbors, self._labels_metadata[filter_label])
        _query = _castable_dtype_or_raise(query, expected=self._vector_dtype)
        _assert(len(_query.shape) == 1, "query vector must be 1-d")
        _assert(
            _query.shape[0] == self._dimensions,
            f"query vector must have the same dimensionality as the index; index dimensionality: {self._dimensions}, "
            f"query dimensionality: {_query.shape[0]}",
        )
        _assert_is_positive_uint32(k_neighbors, "k_neighbors")
        _assert_is_nonnegative_uint32(complexity, "complexity")

        if k_neighbors > complexity:
            warnings.warn(
                f"k_neighbors={k_neighbors} asked for, but list_size={complexity} was smaller. Increasing {complexity} to {k_neighbors}"
            )
            complexity = k_neighbors

        if filter_label == "":
            neighbors, distances = self._index.search(query=_query, knn=k_neighbors, complexity=complexity)
        else:
            filter = self._labels_map[filter_label]
            neighbors, distances = self._index.search_with_filter(
                query=_query,
                knn=k_neighbors,
                complexity=complexity,
                filter=filter
            )
        return QueryResponse(identifiers=neighbors, distances=distances)

    def batch_search(
        self,
        queries: VectorLikeBatch,
        k_neighbors: int,
        complexity: int,
        num_threads: int,
    ) -> QueryResponseBatch:
        """
        Searches the index by a batch of query vectors.

        This search is parallelized and far more efficient than searching for each vector individually.

        ### Parameters
        - **queries**: 2d numpy array, with column dimensionality matching the index and row dimensionality being the
          number of queries intended to search for in parallel. Dtype must match dtype of the index.
        - **k_neighbors**: Number of neighbors to be returned. If query vector exists in index, it almost definitely
          will be returned as well, so adjust your ``k_neighbors`` as appropriate. Must be > 0.
        - **complexity**: Size of distance ordered list of candidate neighbors to use while searching. List size
          increases accuracy at the cost of latency. Must be at least k_neighbors in size.
        - **num_threads**: Number of threads to use when searching this index. (>= 0), 0 = num_threads in system
        """
        _queries = _castable_dtype_or_raise(queries, expected=self._vector_dtype)
        _assert(len(_queries.shape) == 2, "queries must be a 2-d np array")
        _assert(
            _queries.shape[1] == self._dimensions,
            f"query vectors must have the same dimensionality as the index; index dimensionality: {self._dimensions}, "
            f"query dimensionality: {_queries.shape[1]}",
        )
        _assert_is_positive_uint32(k_neighbors, "k_neighbors")
        _assert_is_positive_uint32(complexity, "complexity")
        _assert_is_nonnegative_uint32(num_threads, "num_threads")

        if k_neighbors > complexity:
            warnings.warn(
                f"k_neighbors={k_neighbors} asked for, but list_size={complexity} was smaller. Increasing {complexity} to {k_neighbors}"
            )
            complexity = k_neighbors

        num_queries, dim = _queries.shape
        neighbors, distances = self._index.batch_search(
            queries=_queries,
            num_queries=num_queries,
            knn=k_neighbors,
            complexity=complexity,
            num_threads=num_threads,
        )
        return QueryResponseBatch(identifiers=neighbors, distances=distances)
```
A StaticMemoryIndex is an immutable in-memory DiskANN index.
`__init__(self, index_directory: str, num_threads: int, initial_search_complexity: int, index_prefix: str = "ann", distance_metric: Optional[DistanceMetric] = None, vector_dtype: Optional[VectorDType] = None, dimensions: Optional[int] = None, enable_filters: bool = False)`

### Parameters

- **index_directory**: The directory containing the index files. This directory must contain the following files:
    - `{index_prefix}.data`
    - `{index_prefix}`

  It may also include the following optional files:
    - `{index_prefix}_vectors.bin`: Optional. `diskannpy` builder functions may create this file in the `index_directory` if the index was created from a numpy array.
    - `{index_prefix}_metadata.bin`: Optional. `diskannpy` builder functions create this file to store metadata about the index, such as vector dtype, distance metric, number of vectors and vector dimensionality. If an index is built from the `diskann` cli tools, this file will not exist.
- **num_threads**: Number of threads to use when searching this index. Must be >= 0; 0 = use all threads available on the system.
- **initial_search_complexity**: Should be set to the most common `complexity` expected to be used during the life of this `diskannpy.StaticMemoryIndex` object. The working scratch memory allocated is based off of `initial_search_complexity` * `num_threads`. Note that it may be resized if a `search` or `batch_search` operation requests a space larger than can be accommodated by these values.
- **index_prefix**: The prefix of the index files. Defaults to "ann".
- **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for all 3 vector dtypes, but `mips` is only available for single precision floats. Default is `None`. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it does not exist, you are required to provide it.
- **vector_dtype**: The vector dtype this index has been built with. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it does not exist, you are required to provide it.
- **dimensions**: The vector dimensionality of this index. All query vectors must have the same dimensionality. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it does not exist, you are required to provide it.
- **enable_filters**: Set to True to load the filter label files of an index built with filters, enabling filtered searches via the `filter_label` parameter of `search`.
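As a usage sketch: assuming an index was previously built into a hypothetical `./my_index_dir` directory with the default `index_prefix` of "ann" and a `{index_prefix}_metadata.bin` file present, loading it might look like this:

```python
import diskannpy

# Hypothetical directory; the metadata file is assumed to exist, so the
# distance_metric / vector_dtype / dimensions arguments can be omitted.
index = diskannpy.StaticMemoryIndex(
    index_directory="./my_index_dir",
    num_threads=0,                 # 0 = use all threads available on the system
    initial_search_complexity=64,  # sized for the most common expected complexity
)
```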
`search(self, query: VectorLike, k_neighbors: int, complexity: int, filter_label: str = "") -> QueryResponse`

Searches the index by a single query vector.

### Parameters

- **query**: 1d numpy array of the same dimensionality and dtype of the index.
- **k_neighbors**: Number of neighbors to be returned. If the query vector exists in the index, it almost definitely will be returned as well, so adjust your `k_neighbors` as appropriate. Must be > 0.
- **complexity**: Size of the distance ordered list of candidate neighbors to use while searching. List size increases accuracy at the cost of latency. Must be at least `k_neighbors` in size.
- **filter_label**: An optional filter label to restrict the search to vectors tagged with that label. Requires the index to have been loaded with `enable_filters=True`.
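Continuing the loading sketch above, a single query might look like the following (the 128-dimensional `float32` index is an assumption):

```python
import numpy as np

# the query's dtype and dimensionality must match the (assumed) index
query = np.random.rand(128).astype(np.float32)
response = index.search(query=query, k_neighbors=10, complexity=64)
print(response.identifiers)  # ids of the 10 approximate nearest neighbors
print(response.distances)    # corresponding distances under the index's metric
```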
`batch_search(self, queries: VectorLikeBatch, k_neighbors: int, complexity: int, num_threads: int) -> QueryResponseBatch`

Searches the index by a batch of query vectors.

This search is parallelized and far more efficient than searching for each vector individually.

### Parameters

- **queries**: 2d numpy array, with column dimensionality matching the index and row dimensionality being the number of queries intended to search for in parallel. Dtype must match the dtype of the index.
- **k_neighbors**: Number of neighbors to be returned. If a query vector exists in the index, it almost definitely will be returned as well, so adjust your `k_neighbors` as appropriate. Must be > 0.
- **complexity**: Size of the distance ordered list of candidate neighbors to use while searching. List size increases accuracy at the cost of latency. Must be at least `k_neighbors` in size.
- **num_threads**: Number of threads to use when searching this index. Must be >= 0; 0 = use all threads available on the system.
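A batch variant of the same sketch; the query count and dimensionality are again assumptions:

```python
import numpy as np

# 1,000 hypothetical queries against the assumed 128-dimensional float32 index
queries = np.random.rand(1000, 128).astype(np.float32)
batch_response = index.batch_search(
    queries=queries,
    k_neighbors=10,
    complexity=64,
    num_threads=0,  # 0 = use all threads available on the system
)
# identifiers and distances are 2d arrays: one row per query, k_neighbors columns
print(batch_response.identifiers.shape)
```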
class DynamicMemoryIndex:
    """
    A DynamicMemoryIndex instance is used to both search and mutate a `diskannpy` memory index. This index is unlike
    either `diskannpy.StaticMemoryIndex` or `diskannpy.StaticDiskIndex` in the following ways:

    - It requires an explicit vector identifier for each vector added to it.
    - Insert and (lazy) deletion operations are provided for a flexible, living index.

    The mutable aspect of this index will absolutely impact search time performance as new vectors are added and
    old ones deleted. `DynamicMemoryIndex.consolidate_delete()` should be called periodically to restructure the
    index, removing deleted vectors and improving per-search performance at the cost of an expensive index
    consolidation.
    """

    @classmethod
    def from_file(
        cls,
        index_directory: str,
        max_vectors: int,
        complexity: int,
        graph_degree: int,
        saturate_graph: bool = defaults.SATURATE_GRAPH,
        max_occlusion_size: int = defaults.MAX_OCCLUSION_SIZE,
        alpha: float = defaults.ALPHA,
        num_threads: int = defaults.NUM_THREADS,
        filter_complexity: int = defaults.FILTER_COMPLEXITY,
        num_frozen_points: int = defaults.NUM_FROZEN_POINTS_DYNAMIC,
        initial_search_complexity: int = 0,
        search_threads: int = 0,
        concurrent_consolidation: bool = True,
        index_prefix: str = "ann",
        distance_metric: Optional[DistanceMetric] = None,
        vector_dtype: Optional[VectorDType] = None,
        dimensions: Optional[int] = None,
    ) -> "DynamicMemoryIndex":
        """
        The `from_file` classmethod is used to load a previously saved index from disk. This index *must* have been
        created with a valid `tags` file or `tags` np.ndarray of `diskannpy.VectorIdentifier`s. It is *strongly*
        recommended that you use the same parameters as the `diskannpy.build_memory_index()` function that created
        the index.

        ### Parameters
        - **index_directory**: The directory containing the index files. This directory must contain the following
          files:
            - `{index_prefix}.data`
            - `{index_prefix}.tags`
            - `{index_prefix}`

          It may also include the following optional files:
            - `{index_prefix}_vectors.bin`: Optional. `diskannpy` builder functions may create this file in the
              `index_directory` if the index was created from a numpy array
            - `{index_prefix}_metadata.bin`: Optional. `diskannpy` builder functions create this file to store metadata
              about the index, such as vector dtype, distance metric, number of vectors and vector dimensionality.
              If an index is built from the `diskann` cli tools, this file will not exist.
        - **max_vectors**: Capacity of the memory index including space for future insertions.
        - **complexity**: Complexity (a.k.a `L`) references the size of the list we store candidate approximate
          neighbors in. It's used during save (which is an index rebuild), and it's used as an initial search size to
          warm up our index and lower the latency for initial real searches.
        - **graph_degree**: Graph degree (a.k.a. `R`) is the maximum degree allowed for a node in the index's graph
          structure. This degree will be pruned throughout the course of the index build, but it will never grow
          beyond this value. Higher `graph_degree` values require longer index build times, but may result in an
          index showing excellent recall and latency characteristics.
        - **saturate_graph**: If True, the adjacency list of each node will be saturated with neighbors to have
          exactly `graph_degree` neighbors. If False, each node will have between 1 and `graph_degree` neighbors.
        - **max_occlusion_size**: The maximum number of points that can be considered by the `occlude_list` function.
        - **alpha**: The alpha parameter (>= 1) is used to control the nature and number of points that are added to
          the graph. A higher alpha value (e.g., 1.4) will result in fewer hops (and IOs) to convergence, but probably
          more distance comparisons compared to a lower alpha value.
        - **num_threads**: Number of threads to use when creating this index. `0` indicates we should use all
          available logical processors.
        - **filter_complexity**: Complexity to use when using filters. Default is 0.
        - **num_frozen_points**: Number of points to freeze. Default is 1.
        - **initial_search_complexity**: Should be set to the most common `complexity` expected to be used during the
          life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of
          `initial_search_complexity` * `search_threads`. Note that it may be resized if a `search` or `batch_search`
          operation requests a space larger than can be accommodated by these values.
        - **search_threads**: Should be set to the most common `num_threads` expected to be used during the
          life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of
          `initial_search_complexity` * `search_threads`. Note that it may be resized if a `batch_search`
          operation requests a space larger than can be accommodated by these values.
        - **concurrent_consolidation**: This flag dictates whether consolidation can be run alongside inserts and
          deletes, or whether the index is locked down to changes while consolidation is ongoing.
        - **index_prefix**: The prefix of the index files. Defaults to "ann".
        - **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for
          all 3 vector dtypes, but `mips` is only available for single precision floats. Default is `None`. **This
          value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it does not exist,
          you are required to provide it.
        - **vector_dtype**: The vector dtype this index has been built with. **This value is only used if a
          `{index_prefix}_metadata.bin` file does not exist.** If it does not exist, you are required to provide it.
        - **dimensions**: The vector dimensionality of this index. All new vectors inserted must be the same
          dimensionality. **This value is only used if a `{index_prefix}_metadata.bin` file does not exist.** If it
          does not exist, you are required to provide it.

        ### Returns
        A `diskannpy.DynamicMemoryIndex` object, with the index loaded from disk and ready to use for insertions,
        deletions, and searches.
        """
        index_prefix_path = _valid_index_prefix(index_directory, index_prefix)

        # a dynamic index can only be loaded if its tags file exists
        tags_file = index_prefix_path + ".tags"
        _assert(
            Path(tags_file).exists(),
            f"The file {tags_file} does not exist in {index_directory}",
        )
        vector_dtype, dap_metric, num_vectors, dimensions = _ensure_index_metadata(
            index_prefix_path, vector_dtype, distance_metric, max_vectors, dimensions, warn_size_exceeded=True
        )

        index = cls(
            distance_metric=dap_metric,  # type: ignore
            vector_dtype=vector_dtype,
            dimensions=dimensions,
            max_vectors=max_vectors,
            complexity=complexity,
            graph_degree=graph_degree,
            saturate_graph=saturate_graph,
            max_occlusion_size=max_occlusion_size,
            alpha=alpha,
            num_threads=num_threads,
            filter_complexity=filter_complexity,
            num_frozen_points=num_frozen_points,
            initial_search_complexity=initial_search_complexity,
            search_threads=search_threads,
            concurrent_consolidation=concurrent_consolidation,
        )
        index._index.load(index_prefix_path)
        index._num_vectors = num_vectors  # current number of vectors loaded
        return index

    def __init__(
        self,
        distance_metric: DistanceMetric,
        vector_dtype: VectorDType,
        dimensions: int,
        max_vectors: int,
        complexity: int,
        graph_degree: int,
        saturate_graph: bool = defaults.SATURATE_GRAPH,
        max_occlusion_size: int = defaults.MAX_OCCLUSION_SIZE,
        alpha: float = defaults.ALPHA,
        num_threads: int = defaults.NUM_THREADS,
        filter_complexity: int = defaults.FILTER_COMPLEXITY,
        num_frozen_points: int = defaults.NUM_FROZEN_POINTS_DYNAMIC,
        initial_search_complexity: int = 0,
        search_threads: int = 0,
        concurrent_consolidation: bool = True,
    ):
        """
        The `diskannpy.DynamicMemoryIndex` represents our python API into a mutable DiskANN memory index.

        This constructor is used to create a new, empty index. If you wish to load a previously saved index from
        disk, please use the `diskannpy.DynamicMemoryIndex.from_file` classmethod instead.

        ### Parameters
        - **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for
          all 3 vector dtypes, but `mips` is only available for single precision floats.
        - **vector_dtype**: One of {`np.float32`, `np.int8`, `np.uint8`}. The dtype of the vectors this index will
          be storing.
        - **dimensions**: The vector dimensionality of this index. All new vectors inserted must be the same
          dimensionality.
        - **max_vectors**: Capacity of the data store, including space for future insertions.
        - **complexity**: Complexity (a.k.a `L`) references the size of the list we store candidate approximate
          neighbors in. It's used during save (which is an index rebuild), and it's used as an initial search size to
          warm up our index and lower the latency for initial real searches.
        - **graph_degree**: Graph degree (a.k.a. `R`) is the maximum degree allowed for a node in the index's graph
          structure. This degree will be pruned throughout the course of the index build, but it will never grow
          beyond this value. Higher `graph_degree` values require longer index build times, but may result in an
          index showing excellent recall and latency characteristics.
        - **saturate_graph**: If True, the adjacency list of each node will be saturated with neighbors to have
          exactly `graph_degree` neighbors. If False, each node will have between 1 and `graph_degree` neighbors.
        - **max_occlusion_size**: The maximum number of points that can be considered by the `occlude_list` function.
        - **alpha**: The alpha parameter (>= 1) is used to control the nature and number of points that are added to
          the graph. A higher alpha value (e.g., 1.4) will result in fewer hops (and IOs) to convergence, but probably
          more distance comparisons compared to a lower alpha value.
        - **num_threads**: Number of threads to use when creating this index. `0` indicates we should use all
          available logical processors.
        - **filter_complexity**: Complexity to use when using filters. Default is 0.
        - **num_frozen_points**: Number of points to freeze. Default is 1.
        - **initial_search_complexity**: Should be set to the most common `complexity` expected to be used during the
          life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of
          `initial_search_complexity` * `search_threads`. Note that it may be resized if a `search` or `batch_search`
          operation requests a space larger than can be accommodated by these values.
        - **search_threads**: Should be set to the most common `num_threads` expected to be used during the
          life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of
          `initial_search_complexity` * `search_threads`. Note that it may be resized if a `batch_search`
          operation requests a space larger than can be accommodated by these values.
        - **concurrent_consolidation**: This flag dictates whether consolidation can be run alongside inserts and
          deletes, or whether the index is locked down to changes while consolidation is ongoing.
        """
        self._num_vectors = 0
        self._removed_num_vectors = 0
        dap_metric = _valid_metric(distance_metric)
        self._dap_metric = dap_metric
        _assert_dtype(vector_dtype)
        _assert_is_positive_uint32(dimensions, "dimensions")

        self._vector_dtype = vector_dtype
        self._dimensions = dimensions

        _assert_is_positive_uint32(max_vectors, "max_vectors")
        _assert_is_positive_uint32(complexity, "complexity")
        _assert_is_positive_uint32(graph_degree, "graph_degree")
        _assert(
            alpha >= 1,
            "alpha must be >= 1, and realistically should be kept between [1.0, 2.0)",
        )
        _assert_is_nonnegative_uint32(max_occlusion_size, "max_occlusion_size")
        _assert_is_nonnegative_uint32(num_threads, "num_threads")
        _assert_is_nonnegative_uint32(filter_complexity, "filter_complexity")
        _assert_is_nonnegative_uint32(num_frozen_points, "num_frozen_points")
        _assert_is_nonnegative_uint32(
            initial_search_complexity, "initial_search_complexity"
        )
        _assert_is_nonnegative_uint32(search_threads, "search_threads")

        self._max_vectors = max_vectors
        self._complexity = complexity
        self._graph_degree = graph_degree

        # select the native index implementation matching the vector dtype
        if vector_dtype == np.uint8:
            _index = _native_dap.DynamicMemoryUInt8Index
        elif vector_dtype == np.int8:
            _index = _native_dap.DynamicMemoryInt8Index
        else:
            _index = _native_dap.DynamicMemoryFloatIndex

        self._index = _index(
            distance_metric=dap_metric,
            dimensions=dimensions,
            max_vectors=max_vectors,
            complexity=complexity,
            graph_degree=graph_degree,
            saturate_graph=saturate_graph,
            max_occlusion_size=max_occlusion_size,
            alpha=alpha,
            num_threads=num_threads,
            filter_complexity=filter_complexity,
            num_frozen_points=num_frozen_points,
            initial_search_complexity=initial_search_complexity,
            search_threads=search_threads,
            concurrent_consolidation=concurrent_consolidation,
        )
        self._points_deleted = False

    def search(
        self, query: VectorLike, k_neighbors: int, complexity: int
    ) -> QueryResponse:
        """
        Searches the index by a single query vector.

        ### Parameters
        - **query**: 1d numpy array of the same dimensionality and dtype of the index.
        - **k_neighbors**: Number of neighbors to be returned. If the query vector exists in the index, it almost
          definitely will be returned as well, so adjust your `k_neighbors` as appropriate. Must be > 0.
        - **complexity**: Size of the distance ordered list of candidate neighbors to use while searching. List size
          increases accuracy at the cost of latency. Must be at least `k_neighbors` in size.
        """
        _query = _castable_dtype_or_raise(query, expected=self._vector_dtype)
        _assert(len(_query.shape) == 1, "query vector must be 1-d")
        _assert(
            _query.shape[0] == self._dimensions,
            f"query vector must have the same dimensionality as the index; index dimensionality: {self._dimensions}, "
            f"query dimensionality: {_query.shape[0]}",
        )
        _assert_is_positive_uint32(k_neighbors, "k_neighbors")
        _assert_is_nonnegative_uint32(complexity, "complexity")

        if k_neighbors > complexity:
            warnings.warn(
                f"k_neighbors={k_neighbors} asked for, but complexity={complexity} was smaller. "
                f"Increasing complexity to {k_neighbors}"
            )
            complexity = k_neighbors
        neighbors, distances = self._index.search(query=_query, knn=k_neighbors, complexity=complexity)
        return QueryResponse(identifiers=neighbors, distances=distances)

    def batch_search(
        self,
        queries: VectorLikeBatch,
        k_neighbors: int,
        complexity: int,
        num_threads: int,
    ) -> QueryResponseBatch:
        """
        Searches the index by a batch of query vectors.

        This search is parallelized and far more efficient than searching for each vector individually.

        ### Parameters
        - **queries**: 2d numpy array, with column dimensionality matching the index and row dimensionality being the
          number of queries intended to search for in parallel. Dtype must match the dtype of the index.
        - **k_neighbors**: Number of neighbors to be returned. If a query vector exists in the index, it almost
          definitely will be returned as well, so adjust your `k_neighbors` as appropriate. Must be > 0.
        - **complexity**: Size of the distance ordered list of candidate neighbors to use while searching. List size
          increases accuracy at the cost of latency. Must be at least `k_neighbors` in size.
        - **num_threads**: Number of threads to use when searching this index. Must be >= 0; 0 = use all threads
          available on the system.
        """
        _queries = _castable_dtype_or_raise(queries, expected=self._vector_dtype)
        _assert_2d(_queries, "queries")
        _assert(
            _queries.shape[1] == self._dimensions,
            f"query vectors must have the same dimensionality as the index; index dimensionality: {self._dimensions}, "
            f"query dimensionality: {_queries.shape[1]}",
        )

        _assert_is_positive_uint32(k_neighbors, "k_neighbors")
        _assert_is_positive_uint32(complexity, "complexity")
        _assert_is_nonnegative_uint32(num_threads, "num_threads")

        if k_neighbors > complexity:
            warnings.warn(
                f"k_neighbors={k_neighbors} asked for, but complexity={complexity} was smaller. "
                f"Increasing complexity to {k_neighbors}"
            )
            complexity = k_neighbors

        num_queries, dim = _queries.shape
        neighbors, distances = self._index.batch_search(
            queries=_queries,
            num_queries=num_queries,
            knn=k_neighbors,
            complexity=complexity,
            num_threads=num_threads,
        )
        return QueryResponseBatch(identifiers=neighbors, distances=distances)

    def save(self, save_path: str, index_prefix: str = "ann"):
        """
        Saves this index to file.

        ### Parameters
        - **save_path**: The path to save these index files to.
        - **index_prefix**: The prefix of the index files. Defaults to "ann".
        """
        if save_path == "":
            raise ValueError("save_path cannot be empty")
        if index_prefix == "":
            raise ValueError("index_prefix cannot be empty")

        index_prefix = index_prefix.format(complexity=self._complexity, graph_degree=self._graph_degree)
        _assert_existing_directory(save_path, "save_path")
        save_path = os.path.join(save_path, index_prefix)
        if self._points_deleted is True:
            warnings.warn(
                "DynamicMemoryIndex.save() currently requires DynamicMemoryIndex.consolidate_delete() to be called "
                "prior to save when items have been marked for deletion. This is being done automatically now, "
                "though it will increase the time it takes to save; on large sets of data it can take a substantial "
                "amount of time. In the future, we will implement a faster save with unconsolidated deletes, but for "
                "now this is required."
            )
            self._index.consolidate_delete()
        self._index.save(
            save_path=save_path, compact_before_save=True
        )  # we do not yet support uncompacted saves
        _write_index_metadata(
            save_path,
            self._vector_dtype,
            self._dap_metric,
            self._index.num_points(),
            self._dimensions,
        )

    def insert(self, vector: VectorLike, vector_id: VectorIdentifier):
        """
        Inserts a single vector into the index with the provided vector_id.

        If this insertion will overrun the `max_vectors` count boundaries of this index, `consolidate_delete()` will
        be executed automatically.

        ### Parameters
        - **vector**: The vector to insert. Note that dtype must match.
        - **vector_id**: The vector_id to use for this vector.
        """
        _vector = _castable_dtype_or_raise(vector, expected=self._vector_dtype)
        _assert(len(_vector.shape) == 1, "insert vector must be 1-d")
        _assert_is_positive_uint32(vector_id, "vector_id")
        if self._num_vectors + 1 > self._max_vectors:
            if self._removed_num_vectors > 0:
                warnings.warn(
                    f"Inserting this vector would overrun the max_vectors={self._max_vectors} specified at index "
                    f"construction. We are attempting to consolidate_delete() to make space."
                )
                self.consolidate_delete()
            else:
                raise RuntimeError(
                    f"Inserting this vector would overrun the max_vectors={self._max_vectors} specified "
                    f"at index construction. Unable to make space by consolidating deletions. The insert "
                    f"operation has failed."
                )
        status = self._index.insert(_vector, np.uint32(vector_id))
        if status == 0:
            self._num_vectors += 1
        else:
            raise RuntimeError(
                f"Insert was unable to complete successfully; error code returned from diskann C++ lib: {status}"
            )

    def batch_insert(
        self,
        vectors: VectorLikeBatch,
        vector_ids: VectorIdentifierBatch,
        num_threads: int = 0,
    ):
        """
        Inserts a batch of vectors into the index with the provided vector_ids.

        If this batch insertion will overrun the `max_vectors` count boundaries of this index, `consolidate_delete()`
        will be executed automatically.

        ### Parameters
        - **vectors**: The 2d numpy array of vectors to insert.
        - **vector_ids**: The 1d array of vector ids to use. This array must have the same number of elements as
          the vectors array has rows. The dtype of vector_ids must be `np.uint32`.
        - **num_threads**: Number of threads to use when inserting into this index. Must be >= 0; 0 = use all
          threads available on the system.
        """
        # _castable_dtype_or_raise already returns an array of the expected dtype
        _vectors = _castable_dtype_or_raise(vectors, expected=self._vector_dtype)
        _assert(len(_vectors.shape) == 2, "vectors must be a 2-d array")
        _assert(
            _vectors.shape[0] == vector_ids.shape[0],
            "Number of vectors must be equal to number of ids",
        )
        _vector_ids = vector_ids.astype(dtype=np.uint32, casting="safe", copy=False)

        if self._num_vectors + _vector_ids.shape[0] > self._max_vectors:
            if self._max_vectors + self._removed_num_vectors >= _vector_ids.shape[0]:
                warnings.warn(
                    f"Inserting these vectors, count={_vector_ids.shape[0]}, would overrun the "
                    f"max_vectors={self._max_vectors} specified at index construction. We are attempting to "
                    f"consolidate_delete() to make space."
                )
                self.consolidate_delete()
            else:
                raise RuntimeError(
                    f"Inserting these vectors, count={_vector_ids.shape[0]}, would overrun the "
                    f"max_vectors={self._max_vectors} specified at index construction. Unable to make "
                    f"space by consolidating deletions. The batch insert operation has failed."
                )

        statuses = self._index.batch_insert(
            _vectors, _vector_ids, _vector_ids.shape[0], num_threads
        )
        successes = []
        failures = []
        for i in range(0, len(statuses)):
            if statuses[i] == 0:
                successes.append(i)
            else:
                failures.append(i)
        self._num_vectors += len(successes)
        if len(failures) == 0:
            return
        failed_ids = vector_ids[failures]
        raise RuntimeError(
            f"During batch insert, the following vector_ids were unable to be inserted into the index: {failed_ids}. "
            f"{len(successes)} were successfully inserted"
        )

    def mark_deleted(self, vector_id: VectorIdentifier):
        """
        Mark vector for deletion. This is a soft delete that won't return the vector id in any results, but does not
        remove it from the underlying index files or memory structure. To execute a hard delete, call this method and
        then call the much more expensive `consolidate_delete` method on this index.

        ### Parameters
        - **vector_id**: The vector id to delete. Must be a uint32.
        """
        _assert_is_positive_uint32(vector_id, "vector_id")
        self._points_deleted = True
        self._removed_num_vectors += 1
        # we do not decrement self._num_vectors until consolidate_delete
        self._index.mark_deleted(np.uint32(vector_id))

    def consolidate_delete(self):
        """
        This method actually restructures the DiskANN index to remove the items that have been marked for deletion.
        """
        self._index.consolidate_delete()
        self._points_deleted = False
        self._num_vectors -= self._removed_num_vectors
        self._removed_num_vectors = 0
A DynamicMemoryIndex instance is used to both search and mutate a `diskannpy` memory index. This index is unlike either `diskannpy.StaticMemoryIndex` or `diskannpy.StaticDiskIndex` in the following ways:

- It requires an explicit vector identifier for each vector added to it.
- Insert and (lazy) deletion operations are provided for a flexible, living index.

The mutable aspect of this index will absolutely impact search time performance as new vectors are added and old ones deleted. `DynamicMemoryIndex.consolidate_delete()` should be called periodically to restructure the index, removing deleted vectors and improving per-search performance at the cost of an expensive index consolidation.
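A minimal lifecycle sketch; the dimensionality, capacity, and build parameters below are illustrative assumptions, not recommendations:

```python
import diskannpy
import numpy as np

index = diskannpy.DynamicMemoryIndex(
    distance_metric="l2",
    vector_dtype=np.float32,
    dimensions=128,
    max_vectors=10_000,  # capacity, including room for future inserts
    complexity=64,
    graph_degree=32,
)
index.insert(np.random.rand(128).astype(np.float32), 1)  # explicit id is required
index.mark_deleted(1)       # lazy delete: id 1 stops appearing in search results
index.consolidate_delete()  # expensive: physically removes deleted vectors
```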
The `diskannpy.DynamicMemoryIndex` represents our python API into a mutable DiskANN memory index.

This constructor is used to create a new, empty index. If you wish to load a previously saved index from disk, please use the `diskannpy.DynamicMemoryIndex.from_file` classmethod instead.

### Parameters

- **distance_metric**: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for all 3 vector dtypes, but `mips` is only available for single precision floats.
- **vector_dtype**: One of {`np.float32`, `np.int8`, `np.uint8`}. The dtype of the vectors this index will be storing.
- **dimensions**: The vector dimensionality of this index. All new vectors inserted must be the same dimensionality.
- **max_vectors**: Capacity of the data store, including space for future insertions.
- **complexity**: Complexity (a.k.a `L`) references the size of the list we store candidate approximate neighbors in. It's used during save (which is an index rebuild), and it's used as an initial search size to warm up our index and lower the latency for initial real searches.
- **graph_degree**: Graph degree (a.k.a. `R`) is the maximum degree allowed for a node in the index's graph structure. This degree will be pruned throughout the course of the index build, but it will never grow beyond this value. Higher `graph_degree` values require longer index build times, but may result in an index showing excellent recall and latency characteristics.
- **saturate_graph**: If True, the adjacency list of each node will be saturated with neighbors to have exactly `graph_degree` neighbors. If False, each node will have between 1 and `graph_degree` neighbors.
- **max_occlusion_size**: The maximum number of points that can be considered by the `occlude_list` function.
- **alpha**: The alpha parameter (>= 1) is used to control the nature and number of points that are added to the graph. A higher alpha value (e.g., 1.4) will result in fewer hops (and IOs) to convergence, but probably more distance comparisons compared to a lower alpha value.
- **num_threads**: Number of threads to use when creating this index. `0` indicates we should use all available logical processors.
- **filter_complexity**: Complexity to use when using filters. Default is 0.
- **num_frozen_points**: Number of points to freeze. Default is 1.
- **initial_search_complexity**: Should be set to the most common `complexity` expected to be used during the life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of `initial_search_complexity` * `search_threads`. Note that it may be resized if a `search` or `batch_search` operation requests a space larger than can be accommodated by these values.
- **search_threads**: Should be set to the most common `num_threads` expected to be used during the life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of `initial_search_complexity` * `search_threads`. Note that it may be resized if a `batch_search` operation requests a space larger than can be accommodated by these values.
- **concurrent_consolidation**: This flag dictates whether consolidation can be run alongside inserts and deletes, or whether the index is locked down to changes while consolidation is ongoing.
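As a sketch of how the scratch-memory parameters interact: the construction below assumes a workload whose searches will mostly run with `complexity=100` across 8 threads, so the working scratch memory is preallocated accordingly (all other values are illustrative):

```python
import diskannpy
import numpy as np

index = diskannpy.DynamicMemoryIndex(
    distance_metric="cosine",
    vector_dtype=np.float32,
    dimensions=256,       # illustrative
    max_vectors=100_000,  # illustrative
    complexity=64,
    graph_degree=32,
    initial_search_complexity=100,  # most common search complexity expected
    search_threads=8,               # most common search thread count expected
)
```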
The from_file
classmethod is used to load a previously saved index from disk. This index must have been
created with a valid tags
file or tags
np.ndarray of diskannpy.VectorIdentifier
s. It is strongly
recommended that you use the same parameters as the diskannpy.build_memory_index()
function that created
the index.
Parameters
index_directory: The directory containing the index files. This directory must contain the following files:
{index_prefix}.data
{index_prefix}.tags
{index_prefix}
It may also include the following optional files:
{index_prefix}_vectors.bin
: Optional.diskannpy
builder functions may create this file in theindex_directory
if the index was created from a numpy array{index_prefix}_metadata.bin
: Optional.diskannpy
builder functions create this file to store metadata about the index, such as vector dtype, distance metric, number of vectors and vector dimensionality. If an index is built from thediskann
cli tools, this file will not exist.
- max_vectors: Capacity of the memory index including space for future insertions.
- complexity: Complexity (a.k.a
L
) references the size of the list we store candidate approximate neighbors in. It's used during save (which is an index rebuild), and it's used as an initial search size to warm up our index and lower the latency for initial real searches. - graph_degree: Graph degree (a.k.a.
R
) is the maximum degree allowed for a node in the index's graph structure. This degree will be pruned throughout the course of the index build, but it will never grow beyond this value. Higher R values require longer index build times, but may result in an index showing excellent recall and latency characteristics. - saturate_graph: If True, the adjacency list of each node will be saturated with neighbors to have exactly
graph_degree
neighbors. If False, each node will have between 1 andgraph_degree
neighbors. - max_occlusion_size: The maximum number of points that can be considered by occlude_list function.
- alpha: The alpha parameter (>=1) is used to control the nature and number of points that are added to the graph. A higher alpha value (e.g., 1.4) will result in fewer hops (and IOs) to convergence, but probably more distance comparisons compared to a lower alpha value.
- num_threads: Number of threads to use when creating this index.
0
indicates we should use all available logical processors. - filter_complexity: Complexity to use when using filters. Default is 0.
- num_frozen_points: Number of points to freeze. Default is 1.
- initial_search_complexity: Should be set to the most common `complexity` expected to be used during the life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of `initial_search_complexity` * `search_threads`. Note that it may be resized if a `search` or `batch_search` operation requests a space larger than can be accommodated by these values.
- search_threads: Should be set to the most common `num_threads` expected to be used during the life of this `diskannpy.DynamicMemoryIndex` object. The working scratch memory allocated is based off of `initial_search_complexity` * `search_threads`. Note that it may be resized if a `batch_search` operation requests a space larger than can be accommodated by these values.
- concurrent_consolidation: This flag dictates whether consolidation can be run alongside inserts and deletes, or whether the index is locked down to changes while consolidation is ongoing.
- index_prefix: The prefix of the index files. Defaults to "ann".
- distance_metric: A `str`, strictly one of {"l2", "mips", "cosine"}. `l2` and `cosine` are supported for all 3 vector dtypes, but `mips` is only available for single precision floats. Default is `None`. This value is only used if a `{index_prefix}_metadata.bin` file does not exist. If it does not exist, you are required to provide it.
- vector_dtype: The vector dtype this index has been built with. This value is only used if a `{index_prefix}_metadata.bin` file does not exist. If it does not exist, you are required to provide it.
- dimensions: The vector dimensionality of this index. All new vectors inserted must be the same dimensionality. This value is only used if a `{index_prefix}_metadata.bin` file does not exist. If it does not exist, you are required to provide it.
Returns
A `diskannpy.DynamicMemoryIndex` object, with the index loaded from disk and ready to use for insertions, deletions, and searches.
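A minimal usage sketch follows; the directory path and parameter values here are illustrative assumptions, not defaults, and should match whatever was used when the index was built:

```python
import diskannpy

# Hypothetical directory and parameters; use the values your index was built with.
index = diskannpy.DynamicMemoryIndex.from_file(
    index_directory="/tmp/my_index",  # must contain ann.data, ann.tags, ann
    max_vectors=100_000,              # leave headroom for future insertions
    complexity=64,
    graph_degree=32,
    num_threads=0,                    # 0 = use all available logical processors
    initial_search_complexity=64,
    search_threads=4,
    index_prefix="ann",
)
```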
def search(
    self, query: VectorLike, k_neighbors: int, complexity: int
) -> QueryResponse:
    """
    Searches the index by a single query vector.

    ### Parameters
    - **query**: 1d numpy array of the same dimensionality and dtype of the index.
    - **k_neighbors**: Number of neighbors to be returned. If query vector exists in index, it almost definitely
      will be returned as well, so adjust your ``k_neighbors`` as appropriate. Must be > 0.
    - **complexity**: Size of distance ordered list of candidate neighbors to use while searching. List size
      increases accuracy at the cost of latency. Must be at least k_neighbors in size.
    """
    _query = _castable_dtype_or_raise(query, expected=self._vector_dtype)
    _assert(len(_query.shape) == 1, "query vector must be 1-d")
    _assert(
        _query.shape[0] == self._dimensions,
        f"query vector must have the same dimensionality as the index; index dimensionality: {self._dimensions}, "
        f"query dimensionality: {_query.shape[0]}",
    )
    _assert_is_positive_uint32(k_neighbors, "k_neighbors")
    _assert_is_nonnegative_uint32(complexity, "complexity")

    if k_neighbors > complexity:
        warnings.warn(
            f"k_neighbors={k_neighbors} asked for, but list_size={complexity} was smaller. Increasing {complexity} to {k_neighbors}"
        )
        complexity = k_neighbors
    neighbors, distances = self._index.search(query=_query, knn=k_neighbors, complexity=complexity)
    return QueryResponse(identifiers=neighbors, distances=distances)
Searches the index by a single query vector.
Parameters
- query: 1d numpy array of the same dimensionality and dtype of the index.
- k_neighbors: Number of neighbors to be returned. If query vector exists in index, it almost definitely will be returned as well, so adjust your `k_neighbors` as appropriate. Must be > 0.
- complexity: Size of distance ordered list of candidate neighbors to use while searching. List size increases accuracy at the cost of latency. Must be at least k_neighbors in size.
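For example, a single-query search against a loaded index might look like this sketch (the 128-dimensional float32 index and the random query values are assumptions):

```python
import numpy as np

# `index` is an already-loaded diskannpy index; dims/dtype must match it.
query = np.random.rand(128).astype(np.float32)
response = index.search(query=query, k_neighbors=10, complexity=64)
print(response.identifiers)  # 1d array of the 10 nearest vector ids
print(response.distances)    # 1d array of the corresponding distances
```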
def batch_search(
    self,
    queries: VectorLikeBatch,
    k_neighbors: int,
    complexity: int,
    num_threads: int,
) -> QueryResponseBatch:
    """
    Searches the index by a batch of query vectors.

    This search is parallelized and far more efficient than searching for each vector individually.

    ### Parameters
    - **queries**: 2d numpy array, with column dimensionality matching the index and row dimensionality being the
      number of queries intended to search for in parallel. Dtype must match dtype of the index.
    - **k_neighbors**: Number of neighbors to be returned. If query vector exists in index, it almost definitely
      will be returned as well, so adjust your ``k_neighbors`` as appropriate. Must be > 0.
    - **complexity**: Size of distance ordered list of candidate neighbors to use while searching. List size
      increases accuracy at the cost of latency. Must be at least k_neighbors in size.
    - **num_threads**: Number of threads to use when searching this index. (>= 0), 0 = num_threads in system
    """
    _queries = _castable_dtype_or_raise(queries, expected=self._vector_dtype)
    _assert_2d(_queries, "queries")
    _assert(
        _queries.shape[1] == self._dimensions,
        f"query vectors must have the same dimensionality as the index; index dimensionality: {self._dimensions}, "
        f"query dimensionality: {_queries.shape[1]}",
    )

    _assert_is_positive_uint32(k_neighbors, "k_neighbors")
    _assert_is_positive_uint32(complexity, "complexity")
    _assert_is_nonnegative_uint32(num_threads, "num_threads")

    if k_neighbors > complexity:
        warnings.warn(
            f"k_neighbors={k_neighbors} asked for, but list_size={complexity} was smaller. Increasing {complexity} to {k_neighbors}"
        )
        complexity = k_neighbors

    num_queries, dim = _queries.shape  # use the validated/cast array, not the raw input
    neighbors, distances = self._index.batch_search(
        queries=_queries,
        num_queries=num_queries,
        knn=k_neighbors,
        complexity=complexity,
        num_threads=num_threads,
    )
    return QueryResponseBatch(identifiers=neighbors, distances=distances)
Searches the index by a batch of query vectors.
This search is parallelized and far more efficient than searching for each vector individually.
Parameters
- queries: 2d numpy array, with column dimensionality matching the index and row dimensionality being the number of queries intended to search for in parallel. Dtype must match dtype of the index.
- k_neighbors: Number of neighbors to be returned. If query vector exists in index, it almost definitely will be returned as well, so adjust your `k_neighbors` as appropriate. Must be > 0.
- complexity: Size of distance ordered list of candidate neighbors to use while searching. List size increases accuracy at the cost of latency. Must be at least k_neighbors in size.
- num_threads: Number of threads to use when searching this index. (>= 0), 0 = num_threads in system
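As a sketch (the query count and dimensionality below are assumptions):

```python
import numpy as np

queries = np.random.rand(50, 128).astype(np.float32)  # 50 queries; dims/dtype match index
batch_response = index.batch_search(
    queries=queries, k_neighbors=10, complexity=64, num_threads=0
)
# Row i corresponds to queries[i]; columns are the k nearest neighbors.
first_query_ids = batch_response.identifiers[0]
first_query_distances = batch_response.distances[0]
```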
def save(self, save_path: str, index_prefix: str = "ann"):
    """
    Saves this index to file.

    ### Parameters
    - **save_path**: The path to save these index files to.
    - **index_prefix**: The prefix of the index files. Defaults to "ann".
    """
    if save_path == "":
        raise ValueError("save_path cannot be empty")
    if index_prefix == "":
        raise ValueError("index_prefix cannot be empty")

    index_prefix = index_prefix.format(complexity=self._complexity, graph_degree=self._graph_degree)
    _assert_existing_directory(save_path, "save_path")
    save_path = os.path.join(save_path, index_prefix)
    if self._points_deleted is True:
        warnings.warn(
            "DynamicMemoryIndex.save() currently requires DynamicMemoryIndex.consolidate_delete() to be called "
            "prior to save when items have been marked for deletion. This is being done automatically now, though "
            "it will increase the time it takes to save; on large sets of data it can take a substantial amount of "
            "time. In the future, we will implement a faster save with unconsolidated deletes, but for now this is "
            "required."
        )
        self._index.consolidate_delete()
    self._index.save(
        save_path=save_path, compact_before_save=True
    )  # we do not yet support uncompacted saves
    _write_index_metadata(
        save_path,
        self._vector_dtype,
        self._dap_metric,
        self._index.num_points(),
        self._dimensions,
    )
Saves this index to file.
Parameters
- save_path: The path to save these index files to.
- index_prefix: The prefix of the index files. Defaults to "ann".
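For instance (the destination directory is an assumption and must already exist):

```python
# Writes the index files with prefix "ann" into an existing directory.
index.save(save_path="/tmp/my_index", index_prefix="ann")
```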
def insert(self, vector: VectorLike, vector_id: VectorIdentifier):
    """
    Inserts a single vector into the index with the provided vector_id.

    If this insertion will overrun the `max_vectors` count boundaries of this index, `consolidate_delete()` will
    be executed automatically.

    ### Parameters
    - **vector**: The vector to insert. Note that dtype must match.
    - **vector_id**: The vector_id to use for this vector.
    """
    _vector = _castable_dtype_or_raise(vector, expected=self._vector_dtype)
    _assert(len(_vector.shape) == 1, "insert vector must be 1-d")  # check the cast array, not the raw input
    _assert_is_positive_uint32(vector_id, "vector_id")
    if self._num_vectors + 1 > self._max_vectors:
        if self._removed_num_vectors > 0:
            warnings.warn(f"Inserting this vector would overrun the max_vectors={self._max_vectors} specified at index "
                          f"construction. We are attempting to consolidate_delete() to make space.")
            self.consolidate_delete()
        else:
            raise RuntimeError(f"Inserting this vector would overrun the max_vectors={self._max_vectors} specified "
                               f"at index construction. Unable to make space by consolidating deletions. The insert "
                               f"operation has failed.")
    status = self._index.insert(_vector, np.uint32(vector_id))
    if status == 0:
        self._num_vectors += 1
    else:
        raise RuntimeError(
            f"Insert was unable to complete successfully; error code returned from diskann C++ lib: {status}"
        )
Inserts a single vector into the index with the provided vector_id.
If this insertion will overrun the `max_vectors` count boundaries of this index, `consolidate_delete()` will be executed automatically.
Parameters
- vector: The vector to insert. Note that dtype must match.
- vector_id: The vector_id to use for this vector.
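A sketch, assuming a 128-dimensional float32 index and an id not already in use:

```python
import numpy as np

vector = np.random.rand(128).astype(np.float32)  # dtype/dims must match the index
index.insert(vector=vector, vector_id=np.uint32(12345))  # id must be a positive uint32
```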
def batch_insert(
    self,
    vectors: VectorLikeBatch,
    vector_ids: VectorIdentifierBatch,
    num_threads: int = 0,
):
    """
    Inserts a batch of vectors into the index with the provided vector_ids.

    If this batch insertion will overrun the `max_vectors` count boundaries of this index, `consolidate_delete()`
    will be executed automatically.

    ### Parameters
    - **vectors**: The 2d numpy array of vectors to insert.
    - **vector_ids**: The 1d array of vector ids to use. This array must have the same number of elements as
      the vectors array has rows. The dtype of vector_ids must be `np.uint32`
    - **num_threads**: Number of threads to use when inserting into this index. (>= 0), 0 = num_threads in system
    """
    # cast (or raise) once, then validate and use the cast array throughout
    _vectors = _castable_dtype_or_raise(vectors, expected=self._vector_dtype)
    _assert(len(_vectors.shape) == 2, "vectors must be a 2-d array")
    _assert(
        _vectors.shape[0] == vector_ids.shape[0],
        "Number of vectors must be equal to number of ids",
    )
    _vector_ids = vector_ids.astype(dtype=np.uint32, casting="safe", copy=False)

    if self._num_vectors + _vector_ids.shape[0] > self._max_vectors:
        if self._max_vectors + self._removed_num_vectors >= _vector_ids.shape[0]:
            warnings.warn(f"Inserting these vectors, count={_vector_ids.shape[0]} would overrun the "
                          f"max_vectors={self._max_vectors} specified at index construction. We are attempting to "
                          f"consolidate_delete() to make space.")
            self.consolidate_delete()
        else:
            raise RuntimeError(f"Inserting these vectors count={_vector_ids.shape[0]} would overrun the "
                               f"max_vectors={self._max_vectors} specified at index construction. Unable to make "
                               f"space by consolidating deletions. The batch insert operation has failed.")

    statuses = self._index.batch_insert(
        _vectors, _vector_ids, _vector_ids.shape[0], num_threads
    )
    successes = []
    failures = []
    for i, status in enumerate(statuses):
        if status == 0:
            successes.append(i)
        else:
            failures.append(i)
    self._num_vectors += len(successes)
    if len(failures) == 0:
        return
    failed_ids = vector_ids[failures]
    raise RuntimeError(
        f"During batch insert, the following vector_ids were unable to be inserted into the index: {failed_ids}. "
        f"{len(successes)} were successfully inserted"
    )
Inserts a batch of vectors into the index with the provided vector_ids.
If this batch insertion will overrun the `max_vectors` count boundaries of this index, `consolidate_delete()` will be executed automatically.
Parameters
- vectors: The 2d numpy array of vectors to insert.
- vector_ids: The 1d array of vector ids to use. This array must have the same number of elements as the vectors array has rows. The dtype of vector_ids must be `np.uint32`
- num_threads: Number of threads to use when inserting into this index. (>= 0), 0 = num_threads in system
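A sketch, assuming 1,000 new 128-dimensional float32 vectors with ids not already present in the index:

```python
import numpy as np

vectors = np.random.rand(1_000, 128).astype(np.float32)
vector_ids = np.arange(1, 1_001, dtype=np.uint32)  # ids must be positive uint32
index.batch_insert(vectors=vectors, vector_ids=vector_ids, num_threads=0)
```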
def mark_deleted(self, vector_id: VectorIdentifier):
    """
    Mark vector for deletion. This is a soft delete that won't return the vector id in any results, but does not
    remove it from the underlying index files or memory structure. To execute a hard delete, call this method and
    then call the much more expensive `consolidate_delete` method on this index.

    ### Parameters
    - **vector_id**: The vector id to delete. Must be a uint32.
    """
    _assert_is_positive_uint32(vector_id, "vector_id")
    self._points_deleted = True
    self._removed_num_vectors += 1
    # we do not decrement self._num_vectors until consolidate_delete
    self._index.mark_deleted(np.uint32(vector_id))
Mark vector for deletion. This is a soft delete that won't return the vector id in any results, but does not remove it from the underlying index files or memory structure. To execute a hard delete, call this method and then call the much more expensive `consolidate_delete` method on this index.
Parameters
- vector_id: The vector id to delete. Must be a uint32.
def consolidate_delete(self):
    """
    This method actually restructures the DiskANN index to remove the items that have been marked for deletion.
    """
    self._index.consolidate_delete()
    self._points_deleted = False
    self._num_vectors -= self._removed_num_vectors
    self._removed_num_vectors = 0
This method actually restructures the DiskANN index to remove the items that have been marked for deletion.
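Together, the two deletion calls look like this sketch (the id is an assumption):

```python
import numpy as np

# Soft delete: id 12345 stops appearing in results but still occupies space.
index.mark_deleted(np.uint32(12345))

# Hard delete: restructures the index. Expensive, so prefer batching several
# mark_deleted() calls before a single consolidate_delete().
index.consolidate_delete()
```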
Type alias for one of {"l2", "mips", "cosine"}
Type alias for one of {`numpy.float32`, `numpy.int8`, `numpy.uint8`}
class QueryResponse(NamedTuple):
    """
    Tuple with two values, identifiers and distances. Both are 1d arrays, positionally correspond, and will contain the
    nearest neighbors from [0..k_neighbors)
    """

    identifiers: npt.NDArray[VectorIdentifier]
    """ A `numpy.typing.NDArray[VectorIdentifier]` array of vector identifiers, 1 dimensional """
    distances: npt.NDArray[np.float32]
    """
    A `numpy.typing.NDArray[numpy.float32]` of distances as calculated by the distance metric function, 1 dimensional
    """
Tuple with two values, identifiers and distances. Both are 1d arrays, positionally correspond, and will contain the nearest neighbors from [0..k_neighbors)
Create new instance of QueryResponse(identifiers, distances)
A `numpy.typing.NDArray[VectorIdentifier]` array of vector identifiers, 1 dimensional

A `numpy.typing.NDArray[numpy.float32]` of distances as calculated by the distance metric function, 1 dimensional
Inherited Members
- builtins.tuple
- index
- count
class QueryResponseBatch(NamedTuple):
    """
    Tuple with two values, identifiers and distances. Both are 2d arrays, with dimensionality determined by the
    rows corresponding to the number of queries made, and the columns corresponding to the k neighbors
    requested. The two 2d arrays have an implicit, position-based relationship
    """

    identifiers: npt.NDArray[VectorIdentifier]
    """
    A `numpy.typing.NDArray[VectorIdentifier]` array of vector identifiers, 2 dimensional. The row corresponds to the
    index of the query, and the column corresponds to the k neighbors requested
    """
    distances: npt.NDArray[np.float32]  # npt.NDArray, consistent with QueryResponse
    """
    A `numpy.typing.NDArray[numpy.float32]` of distances as calculated by the distance metric function, 2 dimensional.
    The row corresponds to the index of the query, and the column corresponds to the distance of the query to the
    *k-th* neighbor
    """
Tuple with two values, identifiers and distances. Both are 2d arrays, with dimensionality determined by the rows corresponding to the number of queries made, and the columns corresponding to the k neighbors requested. The two 2d arrays have an implicit, position-based relationship
Create new instance of QueryResponseBatch(identifiers, distances)
A `numpy.typing.NDArray[VectorIdentifier]` array of vector identifiers, 2 dimensional. The row corresponds to the index of the query, and the column corresponds to the k neighbors requested

A `numpy.typing.NDArray[numpy.float32]` of distances as calculated by the distance metric function, 2 dimensional. The row corresponds to the index of the query, and the column corresponds to the distance of the query to the *k-th* neighbor
Inherited Members
- builtins.tuple
- index
- count
Type alias for a vector identifier, whether it be an implicit array index identifier from StaticMemoryIndex or StaticDiskIndex, or an explicit tag identifier from DynamicMemoryIndex
Type alias for a batch of VectorIdentifiers
Type alias for something that can be treated as a vector
Type alias for a batch of VectorLikes
class Metadata(NamedTuple):
    """DiskANN binary vector files contain a small stanza containing some metadata about them."""

    num_vectors: int
    """ The number of vectors in the file. """
    dimensions: int
    """ The dimensionality of the vectors in the file. """
DiskANN binary vector files contain a small stanza containing some metadata about them.
Create new instance of Metadata(num_vectors, dimensions)
Inherited Members
- builtins.tuple
- index
- count
def vectors_metadata_from_file(vector_file: str) -> Metadata:
    """
    Read the metadata from a DiskANN binary vector file.

    ### Parameters
    - **vector_file**: The path to the vector file to read the metadata from.

    ### Returns
    `diskannpy.Metadata`
    """
    _assert_existing_file(vector_file, "vector_file")
    points, dims = np.fromfile(file=vector_file, dtype=np.int32, count=2)
    return Metadata(points, dims)
Read the metadata from a DiskANN binary vector file.
Parameters
- vector_file: The path to the vector file to read the metadata from.
Returns
`diskannpy.Metadata`
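For example (the file path is an assumption; the file must already exist):

```python
import diskannpy

metadata = diskannpy.vectors_metadata_from_file("vectors.bin")
print(metadata.num_vectors, metadata.dimensions)
```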
def vectors_to_file(vector_file: str, vectors: VectorLikeBatch) -> None:
    """
    Utility function that writes a DiskANN binary vector formatted file to the location of your choosing.

    ### Parameters
    - **vector_file**: The path to the vector file to write the vectors to.
    - **vectors**: A 2d array of dtype `numpy.float32`, `numpy.uint8`, or `numpy.int8`
    """
    _assert_dtype(vectors.dtype)
    _assert_2d(vectors, "vectors")
    with open(vector_file, "wb") as fh:
        _write_bin(vectors, fh)
Utility function that writes a DiskANN binary vector formatted file to the location of your choosing.
Parameters
- vector_file: The path to the vector file to write the vectors to.
- vectors: A 2d array of dtype `numpy.float32`, `numpy.uint8`, or `numpy.int8`
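For example, writing 10,000 random 128-dimensional float32 vectors (the shape and path are assumptions):

```python
import numpy as np
import diskannpy

vectors = np.random.rand(10_000, 128).astype(np.float32)
diskannpy.vectors_to_file(vector_file="vectors.bin", vectors=vectors)
```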
def vectors_from_file(vector_file: str, dtype: VectorDType) -> npt.NDArray[VectorDType]:
    """
    Read vectors from a DiskANN binary vector file.

    ### Parameters
    - **vector_file**: The path to the vector file to read the vectors from.
    - **dtype**: The data type of the vectors in the file. Ensure you match the data types exactly

    ### Returns
    `numpy.typing.NDArray[dtype]`
    """
    points, dims = vectors_metadata_from_file(vector_file)
    return np.fromfile(file=vector_file, dtype=dtype, offset=8).reshape(points, dims)
Read vectors from a DiskANN binary vector file.
Parameters
- vector_file: The path to the vector file to read the vectors from.
- dtype: The data type of the vectors in the file. Ensure you match the data types exactly
Returns
numpy.typing.NDArray[dtype]
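Reading back a file such as the one written in the `vectors_to_file` example above (the dtype must match exactly what was written):

```python
import numpy as np
import diskannpy

vectors = diskannpy.vectors_from_file("vectors.bin", dtype=np.float32)
print(vectors.shape)  # (10000, 128) for the file from the earlier example
```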
def valid_dtype(dtype: Type) -> VectorDType:
    """
    Utility method to determine whether the provided dtype is supported by `diskannpy`, and if so, the canonical
    dtype we will use internally (e.g. np.single -> np.float32)
    """
    _assert_dtype(dtype)
    if dtype == np.uint8:
        return np.uint8
    if dtype == np.int8:
        return np.int8
    if dtype == np.float32:
        return np.float32
Utility method to determine whether the provided dtype is supported by `diskannpy`, and if so, the canonical dtype we will use internally (e.g. np.single -> np.float32)
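A quick sketch of the canonicalization (assuming, per the internal `_assert_dtype` check in the source above, that an unsupported dtype raises rather than returning):

```python
import numpy as np
import diskannpy

print(diskannpy.valid_dtype(np.single))   # np.float32 (the canonical form)
print(diskannpy.valid_dtype(np.float32))  # np.float32
# An unsupported dtype such as np.float64 fails _assert_dtype and raises.
```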