Optimize index

One way to index new data files and merge them into an existing index is to call refreshIndex in the "incremental" mode. In this mode, each refresh creates new index files for the newly appended source data files and updates the index metadata to include them in the index content. As these index files accumulate, they can hurt query performance: when the index is used for a query, more index files need to be accessed and potentially read to compute the results, which increases overall query time.

NOTE: Since Hyperspace v0.4.0, the command below shows the number of index data files for the given index.

import com.microsoft.hyperspace._
val hs = new Hyperspace(spark)

hs.index("lineitem_index4").select("numIndexFiles").show

Hyperspace provides the optimizeIndex command to alleviate this issue by changing the layout of index files for an index that has accumulated many index files through incremental refresh calls. It does so by merging index files where possible and replacing them with fewer, larger files that contain exactly the same index records. This process is similar to compaction in append-only, log-structured merge index structures.

One important point is that optimizeIndex differs from refreshIndex in that optimizeIndex is an index-only operation. Unlike refreshIndex, optimizeIndex does not look at the current state of the source data files while it runs on an index. If any source data files changed after the last refresh, optimizeIndex will not apply those changes to the index files.

Note that the optimizeIndex command is a best-effort attempt to modify the layout of index files, and its final outcome depends on how index records are stored in the existing index files. Two or more index files can be merged if and only if all of them contain index records that belong to the same bucket (according to the index configuration). When running the optimizeIndex command, Hyperspace tries to find such groups of index files and merge them. If all index records for each bucket are already stored in a single index file, then no index files are merged during optimize, and the physical layout of the index files after running optimize is the same as the original layout. An example of such an index is an index right after creation or a full refresh.
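The merge eligibility rule above can be sketched in a few lines of plain Python. This is only an illustration of the rule, not Hyperspace's implementation: the IndexFile class, its fields, and the mergeable_groups helper are hypothetical names introduced here, and the bucket of each file is assumed to be known up front.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexFile:
    # Hypothetical stand-in for an index data file; not a Hyperspace class.
    name: str
    bucket: int
    size_bytes: int


def mergeable_groups(files):
    """Group files by bucket and keep only groups with two or more files."""
    by_bucket = defaultdict(list)
    for f in files:
        by_bucket[f.bucket].append(f)
    # A bucket with a single file has nothing to merge with.
    return {b: group for b, group in by_bucket.items() if len(group) > 1}


# Right after creation or a full refresh: one file per bucket.
fresh = [IndexFile("f0", 0, 100), IndexFile("f1", 1, 120)]

# After an incremental refresh: bucket 0 gained a second file.
after_refresh = fresh + [IndexFile("f2", 0, 10)]

fresh_groups = mergeable_groups(fresh)
refreshed_groups = mergeable_groups(after_refresh)
```

In this sketch, fresh_groups is empty, which mirrors the case where optimize leaves the layout unchanged, while refreshed_groups contains only bucket 0 with its two files.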

Currently, there are two optimize modes available for an index: "quick" and "full". These modes differ from each other in the subset of index files they identify and try to merge.

Optimize Modes

| | Quick (small files only) | Full |
|---|---|---|
| Optimize speed | Faster | Slower |
| API | optimizeIndex(mode="quick") | optimizeIndex(mode="full") |
| What it does | Best-effort merge of small index files within a bucket; DOES NOT refresh the index | Creates a single file per bucket by merging small and large files; DOES NOT refresh the index |
| When to use | When performance starts degrading because of many index data files from incremental refreshes | When performance starts degrading because of many index data files from incremental refreshes |

Optimize Index - Quick Mode

Running the optimizeIndex command in the "quick" mode on an index that has many index files due to incremental refreshes causes Hyperspace to look for index files that are smaller than a configurable size threshold and try to merge them. This mode aims for a moderate query performance improvement through a fast optimize process. The size threshold that makes an index file eligible for merging during quick optimization can be changed; check the configuration page to see how to adjust it. Quick mode is the default mode for optimizeIndex.
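As a rough illustration of how quick mode narrows down its candidates, the sketch below filters files by size and then groups the remaining small files by bucket. It is a simplified model, not Hyperspace code: the quick_merge_plan function, the tuple layout, and the SIZE_THRESHOLD_BYTES value are all assumptions made for this example, and the real threshold comes from Hyperspace configuration.

```python
from collections import defaultdict

# Hypothetical threshold chosen for this example only.
SIZE_THRESHOLD_BYTES = 256


def quick_merge_plan(files, threshold=SIZE_THRESHOLD_BYTES):
    """files: list of (name, bucket, size_bytes) tuples.

    Returns {bucket: [names]} for buckets that hold at least two files
    smaller than the threshold.
    """
    small_by_bucket = defaultdict(list)
    for name, bucket, size in files:
        if size < threshold:  # large files are left untouched in this mode
            small_by_bucket[bucket].append(name)
    return {b: names for b, names in small_by_bucket.items() if len(names) > 1}


files = [
    ("big-0", 0, 1000),
    ("small-0a", 0, 10),
    ("small-0b", 0, 20),
    ("big-1", 1, 900),
    ("small-1a", 1, 15),
]
plan = quick_merge_plan(files)
```

Here only bucket 0 ends up in the plan, with its two small files. Bucket 1 has a single small file, so in this model there is nothing to merge it with, and both large files are skipped.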

Assume you have an index named empIndex. After adding some source data files to the dataset that empIndex was created on, and refreshing the index in the incremental mode, you can optimize empIndex in the quick mode as shown below:

Scala:

import com.microsoft.hyperspace._

val hs = new Hyperspace(spark)
hs.optimizeIndex("empIndex", "quick")

Python:

from hyperspace import Hyperspace

hs = Hyperspace(spark)
hs.optimizeIndex("empIndex", "quick")

Optimize Index - Full Mode

In the "full" mode, the optimizeIndex command considers all existing index files as candidates for merging. Hyperspace therefore does a full scan of the current index content and identifies groups of index files whose index records belong to the same bucket according to the index configuration. It then replaces each group with a single index file created by merging all the files in that group. This mode aims for the best query performance improvement at the cost of a potentially slow optimize process.
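The effect of full mode can be pictured with the following simplified model, in which every file of a bucket is folded into one merged entry regardless of its size. The full_merge function and the in-memory record lists are hypothetical; real index files are stored on disk, and this sketch makes no claim about how Hyperspace orders or writes the merged records.

```python
from collections import Counter, defaultdict


def full_merge(files):
    """files: list of (name, bucket, records) tuples.

    Returns {bucket: merged_records}, i.e. one merged entry per bucket.
    """
    merged = defaultdict(list)
    for _name, bucket, records in files:
        merged[bucket].extend(records)
    return dict(merged)


files = [
    ("big-0", 0, ["a", "c"]),
    ("small-0", 0, ["b"]),
    ("big-1", 1, ["x"]),
    ("small-1", 1, ["y", "z"]),
]
merged = full_merge(files)

# The merged layout holds exactly the same records as the original files.
records_before = Counter(r for _n, _b, recs in files for r in recs)
records_after = Counter(r for recs in merged.values() for r in recs)
```

Four input files collapse into two merged entries, one per bucket, and records_before equals records_after, which reflects that optimize changes the file layout but not the index content.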

Assume you have an index named empIndex. After adding some source data files to the dataset that empIndex was created on, and refreshing the index in the incremental mode, you can optimize empIndex in the full mode as shown below:

Scala:

import com.microsoft.hyperspace._

val hs = new Hyperspace(spark)
hs.optimizeIndex("empIndex", "full")

Python:

from hyperspace import Hyperspace

hs = Hyperspace(spark)
hs.optimizeIndex("empIndex", "full")
