diskannpy.defaults

Parameter Defaults

These parameter defaults are re-exported from the C++ extension module and are used to keep the Python wrapper in sync with the C++ code.


# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

"""
# Parameter Defaults
These parameter defaults are re-exported from the C++ extension module and are used to keep the Python wrapper in sync
with the C++ code.
"""
from ._diskannpy import defaults as _defaults

ALPHA = _defaults.ALPHA
"""
Note that, as ALPHA is a `float32` (single precision float) in C++, when converted into Python it becomes a
`float64` (double precision float). The actual value is 1.2f. The alpha parameter (>=1) is used to control the nature
and number of points that are added to the graph. A higher alpha value (e.g., 1.4) will result in fewer hops (and IOs)
to convergence, but probably more distance comparisons compared to a lower alpha value.
"""
NUM_THREADS = _defaults.NUM_THREADS
""" Number of threads to use. `0` will use all detected logical processors. """
MAX_OCCLUSION_SIZE = _defaults.MAX_OCCLUSION_SIZE
"""
The maximum number of points that can be occluded by a single point. This is used to prevent a single point from
dominating the graph structure. If a point has more than `max_occlusion_size` neighbors closer to it than the current
point, it will not be added to the graph. This is a tradeoff between index build time and search quality.
"""
FILTER_COMPLEXITY = _defaults.FILTER_COMPLEXITY
"""
Complexity (a.k.a. `L`) references the size of the list we store candidate approximate neighbors in while doing a
filtered search. This value must be larger than `k_neighbors`, and larger values tend toward higher recall in the
resultant ANN search at the cost of more time.
"""
NUM_FROZEN_POINTS_STATIC = _defaults.NUM_FROZEN_POINTS_STATIC
""" Number of points frozen by default in a StaticMemoryIndex """
NUM_FROZEN_POINTS_DYNAMIC = _defaults.NUM_FROZEN_POINTS_DYNAMIC
""" Number of points frozen by default in a DynamicMemoryIndex """
SATURATE_GRAPH = _defaults.SATURATE_GRAPH
""" Whether to saturate the graph or not. Default is `False` (exposed to Python as `0`). """
GRAPH_DEGREE = _defaults.GRAPH_DEGREE
"""
Graph degree (a.k.a. `R`) is the maximum degree allowed for a node in the index's graph structure. This degree will be
pruned throughout the course of the index build, but it will never grow beyond this value. Higher R values require
longer index build times, but may result in an index showing excellent recall and latency characteristics.
"""
COMPLEXITY = _defaults.COMPLEXITY
"""
Complexity (a.k.a. `L`) references the size of the list we store candidate approximate neighbors in while doing build
or search tasks. It's used during index build as part of the index optimization processes. It's used in index search
classes both to help mitigate poor latencies during cold start, as well as on subsequent queries to conduct the search.
Larger values will likely increase latency but may also improve recall, so tuning this value for your particular
index is a reasonable choice.
"""
PQ_DISK_BYTES = _defaults.PQ_DISK_BYTES
"""
Use `0` to store uncompressed data on SSD. This allows the index to asymptote to 100% recall. If your vectors are
too large to store on SSD, this parameter provides the option to compress the vectors using PQ for storing on SSD.
This comes at the cost of recall. You would also want this to be greater than the number of bytes used for the PQ
compressed data stored in-memory. Default is `0`.
"""
USE_PQ_BUILD = _defaults.USE_PQ_BUILD
"""
Whether to use product quantization in the index building process. Product quantization is an approximation
technique that can vastly speed up vector computations and comparisons in a spatial neighborhood, but it is still an
approximation technique. It should be preferred when index creation times take longer than you can afford for your
use case.
"""
NUM_PQ_BYTES = _defaults.NUM_PQ_BYTES
"""
The number of product quantization bytes to use. More bytes require more resources in both memory and time, but are
likely to result in better approximations.
"""
USE_OPQ = _defaults.USE_OPQ
""" Whether to use Optimized Product Quantization or not. """

ALPHA = 1.2000000476837158

Note that, as ALPHA is a float32 (single precision float) in C++, when converted into Python it becomes a float64 (double precision float). The actual value is 1.2f. The alpha parameter (>=1) is used to control the nature and number of points that are added to the graph. A higher alpha value (e.g., 1.4) will result in fewer hops (and IOs) to convergence, but probably more distance comparisons compared to a lower alpha value.
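
The widening described above can be reproduced without the extension module. The following is a minimal sketch that round-trips 1.2 through a 4-byte IEEE 754 float using the standard-library `struct` module. It does not call diskannpy, and `as_float32` is a hypothetical helper name used only for illustration.

```python
import math
import struct


def as_float32(value: float) -> float:
    """Round `value` to single precision, then widen it back to a Python float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


alpha = as_float32(1.2)
print(alpha)  # 1.2000000476837158

# The widened value is not equal to the Python literal 1.2,
# so compare with a tolerance instead of ==.
print(alpha == 1.2)                            # False
print(math.isclose(alpha, 1.2, rel_tol=1e-6))  # True
```

This is why a check such as `alpha == 1.2` in calling code would be false for the default, while a tolerance-based comparison behaves as expected.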

NUM_THREADS = 0

Number of threads to use. 0 will use all detected logical processors.
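
As an illustration of the "0 means all processors" convention, here is a small sketch of how calling code might resolve the value before sizing its own worker pool. `resolve_num_threads` is a hypothetical helper, not part of diskannpy, and `os.cpu_count()` is assumed to be a reasonable stand-in for the logical processors the C++ side detects.

```python
import os


def resolve_num_threads(num_threads: int = 0) -> int:
    """Map the 0 sentinel to the number of logical processors reported by the OS."""
    if num_threads < 0:
        raise ValueError("num_threads must be >= 0")
    if num_threads == 0:
        # os.cpu_count() can return None on some platforms; fall back to 1.
        return os.cpu_count() or 1
    return num_threads


print(resolve_num_threads(4))  # 4
print(resolve_num_threads(0))  # machine dependent
```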

MAX_OCCLUSION_SIZE = 750

The maximum number of points that can be occluded by a single point. This is used to prevent a single point from dominating the graph structure. If a point has more than max_occlusion_size neighbors closer to it than the current point, it will not be added to the graph. This is a tradeoff between index build time and search quality.

FILTER_COMPLEXITY = 0

Complexity (a.k.a. L) references the size of the list we store candidate approximate neighbors in while doing a filtered search. This value must be larger than k_neighbors, and larger values tend toward higher recall in the resultant ANN search at the cost of more time.
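
The constraint between the candidate list size and `k_neighbors` can be checked before a search is issued. The sketch below uses a hypothetical helper, `check_complexity`, that is not part of diskannpy. It follows the wording above literally and rejects equal values, which is an assumption; the library may treat that boundary differently.

```python
def check_complexity(complexity: int, k_neighbors: int) -> int:
    """Return `complexity` if it is usable for a search that returns `k_neighbors` results."""
    if k_neighbors <= 0:
        raise ValueError("k_neighbors must be a positive integer")
    # Assumption: strictly larger, as the description states.
    if complexity <= k_neighbors:
        raise ValueError(
            f"complexity ({complexity}) must be larger than k_neighbors ({k_neighbors})"
        )
    return complexity


print(check_complexity(complexity=100, k_neighbors=10))  # 100
```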

NUM_FROZEN_POINTS_STATIC = 0

Number of points frozen by default in a StaticMemoryIndex

NUM_FROZEN_POINTS_DYNAMIC = 1

Number of points frozen by default in a DynamicMemoryIndex

SATURATE_GRAPH = 0

Whether to saturate the graph or not. The default is False (shown above as 0).

GRAPH_DEGREE = 64

Graph degree (a.k.a. R) is the maximum degree allowed for a node in the index's graph structure. This degree will be pruned throughout the course of the index build, but it will never grow beyond this value. Higher R values require longer index build times, but may result in an index showing excellent recall and latency characteristics.
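
To make the interaction between the degree cap R and ALPHA more concrete, here is a simplified sketch of an alpha-based pruning rule of the kind described in the DiskANN paper. It is an illustration only and not the library's implementation: `robust_prune` and its arguments are hypothetical names, points are plain coordinate tuples, and Euclidean distance is assumed.

```python
import math


def robust_prune(point, candidates, alpha=1.2, max_degree=64):
    """Pick at most `max_degree` neighbors for `point` from `candidates`."""
    # Visit candidates from nearest to farthest.
    remaining = sorted(candidates, key=lambda c: math.dist(point, c))
    neighbors = []
    while remaining and len(neighbors) < max_degree:
        closest = remaining.pop(0)
        neighbors.append(closest)
        # Drop every candidate that `closest` occludes, i.e. one that is
        # at least `alpha` times closer to `closest` than to `point`.
        remaining = [
            c for c in remaining
            if alpha * math.dist(closest, c) > math.dist(point, c)
        ]
    return neighbors


origin = (0.0, 0.0)
candidates = [(1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]

print(robust_prune(origin, candidates, alpha=1.0))
# [(1.0, 0.0), (0.0, 3.0)]
print(robust_prune(origin, candidates, alpha=2.5))
# [(1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
print(robust_prune(origin, candidates, alpha=2.5, max_degree=1))
# [(1.0, 0.0)]
```

In this toy setting a larger alpha prunes fewer candidates and so keeps more edges, while `max_degree` bounds the result regardless of alpha.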

COMPLEXITY = 100

Complexity (a.k.a. L) references the size of the list we store candidate approximate neighbors in while doing build or search tasks. It's used during index build as part of the index optimization processes. It's used in index search classes both to help mitigate poor latencies during cold start, as well as on subsequent queries to conduct the search. Larger values will likely increase latency but may also improve recall, so tuning this value for your particular index is a reasonable choice.

PQ_DISK_BYTES = 0

Use 0 to store uncompressed data on SSD. This allows the index to asymptote to 100% recall. If your vectors are too large to store on SSD, this parameter provides the option to compress the vectors using PQ for storing on SSD. This comes at the cost of recall. You would also want this to be greater than the number of bytes used for the PQ compressed data stored in-memory. Default is 0.

USE_PQ_BUILD = False

Whether to use product quantization in the index building process. Product quantization is an approximation technique that can vastly speed up vector computations and comparisons in a spatial neighborhood, but it is still an approximation technique. It should be preferred when index creation times take longer than you can afford for your use case.

NUM_PQ_BYTES = 0

The number of product quantization bytes to use. More bytes require more resources in both memory and time, but are likely to result in better approximations.
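
The memory side of that tradeoff can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation under stated assumptions: vectors are stored as 4-byte floats, each compressed vector takes exactly the chosen number of PQ bytes, and codebooks and graph storage are ignored. The helper names and the dataset size are hypothetical.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_component: int = 4) -> int:
    """Bytes needed to hold the uncompressed vectors (float32 assumed by default)."""
    return num_vectors * dimensions * bytes_per_component


def pq_vector_bytes(num_vectors: int, num_pq_bytes: int) -> int:
    """Bytes needed to hold one PQ code of `num_pq_bytes` bytes per vector."""
    return num_vectors * num_pq_bytes


n, dims = 1_000_000, 128  # hypothetical dataset

raw = raw_vector_bytes(n, dims)
compressed = pq_vector_bytes(n, num_pq_bytes=32)

print(raw)               # 512000000
print(compressed)        # 32000000
print(raw / compressed)  # 16.0
```

Doubling the number of PQ bytes doubles the compressed footprint in this estimate, which is the resource cost weighed against the better approximation.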

USE_OPQ = False

Whether to use Optimized Product Quantization or not.
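
As a closing illustration, the sketch below shows one way calling code might start from these defaults and override only the values it wants to tune. `BuildParams` is a hypothetical container, not a diskannpy class. The numbers are copied from this page so the example runs without the extension module; in real code they would be read from `diskannpy.defaults` (for example `diskannpy.defaults.GRAPH_DEGREE`).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BuildParams:
    """Hypothetical bundle of build settings, seeded with the values listed on this page."""
    alpha: float = 1.2000000476837158
    num_threads: int = 0
    graph_degree: int = 64
    complexity: int = 100
    use_pq_build: bool = False
    num_pq_bytes: int = 0
    use_opq: bool = False


defaults = BuildParams()

# Override only what needs tuning; everything else keeps its default.
tuned = replace(defaults, graph_degree=96, complexity=150)

print(tuned.graph_degree, tuned.complexity)  # 96 150
print(tuned.alpha == defaults.alpha)         # True
```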