Hyperspace is an early-phase indexing subsystem for Apache Spark™ that lets users build indexes on their data, maintain them through a multi-user concurrency model, and leverage them automatically, without any change to their application code, for query and workload acceleration.
Users interact with Hyperspace through a small set of simple APIs.
Simple API
With simple APIs such as create, refresh, delete, restore, vacuum, and cancel, Hyperspace helps you get started easily!
Multi-language support
Don’t know Scala? Don’t worry! Hyperspace supports Scala, Python, and .NET, allowing you to be productive right away.
Just works!
Hyperspace works out of the box with open-source Apache Spark™ v2.4 and does not depend on any external service.
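To try it, you can pull the Hyperspace package into a Spark shell session. A minimal sketch is below; the artifact version and Scala suffix shown (0.4.0, _2.12) are illustrative assumptions, so pick the release that matches your Spark and Scala versions.

```shell
# Launch spark-shell with the Hyperspace package resolved from Maven Central.
# Adjust the Scala suffix (_2.11 for Spark 2.4, _2.12 for Spark 3.x) and the
# version to match your environment.
spark-shell --packages com.microsoft.hyperspace:hyperspace-core_2.12:0.4.0
```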
Hyperspace Usage API in Apache Spark™
import org.apache.spark.sql._
import com.microsoft.hyperspace._
import com.microsoft.hyperspace.index._
object HyperspaceSampleApp extends App {
  val spark = SparkSession.builder().appName("main").master("local").getOrCreate()
  import spark.implicits._

  // Create sample data and read it back as a DataFrame
  Seq((1, "name1"), (2, "name2")).toDF("id", "name").write.mode("overwrite").parquet("table")
  val df = spark.read.parquet("table")

  // Create a Hyperspace object
  val hs = new Hyperspace(spark)

  // Create indexes
  hs.createIndex(df, IndexConfig("index1", indexedColumns = Seq("id"), includedColumns = Seq("name")))
  hs.createIndex(df, IndexConfig("index2", indexedColumns = Seq("name")))

  // Display available indexes
  hs.indexes.show()

  // Simple filter query
  val query = df.filter(df("id") === 1).select("name")

  // Check whether any indexes will be utilized
  hs.explain(query, verbose = true)

  // Now utilize the indexes
  spark.enableHyperspace
  query.show()

  // The following are the index management APIs
  hs.refreshIndex("index1")
  hs.deleteIndex("index1") // soft delete: the index can still be restored
  hs.restoreIndex("index1")
  hs.deleteIndex("index2")
  hs.vacuumIndex("index2") // hard delete: physically removes the index

  // Clean up
  spark.stop()
}