Hyperspace is an early-phase indexing subsystem for Apache Spark™ that lets users build indexes on their data, maintain them through a multi-user concurrency model, and leverage them automatically, without any change to their application code, for query and workload acceleration.
Users interact with Hyperspace through a small set of simple APIs.
Simple API
With simple APIs such as create, refresh, delete, restore, vacuum, and cancel, Hyperspace helps you get started easily!
Multi-language support
Don’t know Scala? Don’t worry! Hyperspace supports Scala, Python, and .NET, allowing you to be productive right away.
Just works!
Hyperspace works out of the box with open-source Apache Spark™ v2.4 and does not depend on any external service.
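To try it, you can pull the Hyperspace package into a Spark shell session. A minimal sketch is below; the artifact version and Scala suffix shown (0.4.0, _2.12) are illustrative assumptions, so pick the release that matches your Spark and Scala versions.

```shell
# Launch spark-shell with the Hyperspace package resolved from Maven Central.
# Adjust the Scala suffix (_2.11 for Spark 2.4, _2.12 for Spark 3.x) and the
# version to match your environment.
spark-shell --packages com.microsoft.hyperspace:hyperspace-core_2.12:0.4.0
```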
Hyperspace Usage API in Apache Spark™
import org.apache.spark.sql._
import com.microsoft.hyperspace._
import com.microsoft.hyperspace.index._
object HyperspaceSampleApp extends App {
  val spark = SparkSession.builder().appName("main").master("local").getOrCreate()
  import spark.implicits._

  // Create sample data and read it back as a DataFrame
  Seq((1, "name1"), (2, "name2")).toDF("id", "name").write.mode("overwrite").parquet("table")
  val df = spark.read.parquet("table")

  // Create a Hyperspace object
  val hs = new Hyperspace(spark)

  // Create indexes
  hs.createIndex(df, IndexConfig("index1", indexedColumns = Seq("id"), includedColumns = Seq("name")))
  hs.createIndex(df, IndexConfig("index2", indexedColumns = Seq("name")))

  // Display available indexes
  hs.indexes.show()

  // Simple filter query
  val query = df.filter(df("id") === 1).select("name")

  // Check whether any indexes will be utilized
  hs.explain(query, verbose = true)

  // Now utilize the indexes
  spark.enableHyperspace
  query.show()

  // The following are the index management APIs
  hs.refreshIndex("index1")
  hs.deleteIndex("index1") // soft delete: the index can still be restored
  hs.restoreIndex("index1")
  hs.deleteIndex("index2")
  hs.vacuumIndex("index2") // hard delete: physically removes the index

  // Clean up
  spark.stop()
}