Inverted Index
GE provides built-in inverted index support for substring queries on cell field values. GE only indexes the cell field values that are marked with [index] attribute in TSL.
Only the cell fields with string type can be indexed. There are two cases. 1)
The type of the cell field is string. During substring query processing, a
cell is matched if the value of its indexed field contains the queried
substring. 2) The cell field is a collection of strings, e.g., List<string>
or List<List<string>>
. During substring query processing, a cell is matched as
long as any string in the collection contains the queried substring.
Index Declaration #
To make a field indexed, add an [index]
attribute to it:
cell MyCell
{
[Index]
string Name;
}
The index attribute is only valid for the fields whose type is string or a collection of strings.
It is possible to declare an index on a nested field, for example:
struct leaf
{
[Index]
string data;
}
cell root
{
leaf substructure;
}
Because there is no way to globally identify such a struct in GE, the index
attribute on struct leaf
alone does not produce a valid index. Only when such
a struct is contained in a cell, say root
, root.leaf.data
becomes an
accessible cell field and a valid index will be built. It is allowed to have an
indexed field included in a substructure of a substructure,
cell.inner_1.inner_2. ... .leaf.data
is indexable.
If a substructure with one or more indexed fields is included in multiple cell structs, then an index will be built for each of these cell types, and the indexes are independent with each other.
Substring Queries #
For an indexed field, we can issue a substring query against it by calling the
method Index.SubstringQuery
with a field identifier and a query string. A
field identifier, such as Index.root.leaf.data, is an automatically generated
nested class defined within the Index
class. It is used to specify which cell
field we are going to query against: Index.root.leaf.data identifies the
data
field in the leaf
struct within the root cell.
Index.SubstringQuery(Index.root.leaf.data, "query string")
returns a list of
cell ids. Each root.leaf.data field of the corresponding cells contains the
"query string".
The method Index.SubstringQuery also accepts a sequence of query strings.
Given a sequence of query strings q1, q2, ..., qn
, this method will perform a
wildcard search with the *q1*q2*...*qn*
pattern. It means that the strings
q1, q2, ..., qn
are substrings of a string in the order specified in the
sequence.
Index Update #
If the cells are continuously updated, the changes made to the indexed fields may not be immediately reflected in the index. That is, a substring query may return outdated results. To rule out false positives (the previously matched cells do not match now), we can check the field values again after getting the matched cell ids. It is not easy to address false negatives though, i.e., a cell should be matched, but not included in the index yet.
The indexes are updated periodically by the system. To manually update an index,
we can call Index.UpdateSubstringIndex()
on a specific field identifier.
LINQ Integration #
The inverted index subsystem is integrated with LINQ. In a LINQ query over a
selector, GE translates
String.Contains
on an indexed field into inverted index queries. The same
rule applies for IEnumerable<string>.Contains
. Refer to the
LINQ section for more details.