Data Import
We can load data in either distributed or embedded mode. Data import typically involves the following steps:

1. Specify the data schema in TSL so that each entity or object is associated with a cell type.
2. Implement the code for loading the data into cell objects.
3. Dump the memory storage to disk by calling SaveStorage. This step is critical because GE is an in-memory system: shutting down the GE instance before SaveStorage finishes will drop all the in-memory data. In the embedded mode, we can directly call Global.LocalStorage.SaveStorage. In the distributed mode, there are two ways of calling SaveStorage: the first is to call Global.LocalStorage.SaveStorage on each server; the second is to issue Global.CloudStorage.SaveStorage on one client instance.
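For the last step, persisting the storage is a single call. The two lines below are a minimal sketch using the SaveStorage entry points mentioned above, assuming the instance or cluster is running and all cells have already been saved into memory:

/* Embedded mode: persist the local in-memory storage to disk. */
Global.LocalStorage.SaveStorage();

/* Distributed mode: issued from one client, this asks every server
 * in the cluster to persist its local memory storage. */
Global.CloudStorage.SaveStorage();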
Common Data Import Patterns
There are two major data import patterns.
An Object Per Fetch
Suppose we have a single type of data in a file:
#File courses.txt
#======================================================
#course_id name total_score
1 Math 100
2 English 100
3 Geography 50
4 Biology 50
5 P.E. 30
...
In this file, each row represents one object and the data fields are separated with \t. In this case, we can model the data as follows:
cell Course
{
    /* For brevity, we regard course_id as the cell id. */
    string Name;
    int TotalScore;
}
We can load this data set to GE by reading the file line-by-line. For each line,
we call the extension method SaveCourse(long CellId, string Name, int
TotalScore)
generated by the TSL compiler. The sample code is as follows:
using (var reader = new StreamReader("courses.txt"))
{
    string line;
    string[] fields;
    while (null != (line = reader.ReadLine()))
    {
        try
        {
            fields = line.Split('\t');
            Global.LocalStorage.SaveCourse(
                long.Parse(fields[0]),
                fields[1],
                int.Parse(fields[2]));
        }
        catch
        {
            Console.Error.WriteLine("Failed to import the line:");
            Console.Error.WriteLine(line);
        }
    }
}
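If the import runs against a GE cluster instead of an embedded instance, the same loop applies and only the save call changes. The snippet below is a sketch that assumes the TSL compiler also generates the corresponding CloudStorage extension method, which ships the cell to the server owning the given cell id:

/* Distributed-mode variant of the save call inside the loop above
 * (assumed CloudStorage counterpart of SaveCourse). */
Global.CloudStorage.SaveCourse(
    long.Parse(fields[0]),
    fields[1],
    int.Parse(fields[2]));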
Similarly, the data file may consist of JSON objects, one per line:
/* File students.json */
{"CellID" : 10000, "name" : "Alice", "scores": [ { "cid" : 1, "score": 100 }, { "cid" : 3, "score" : 15 }]}
{"CellID" : 10001, "name" : "Bob", "scores": [ { "cid" : 1, "score": 72 }, { "cid" : 2, "score" : 97 }]}
...
We can model the data in TSL simply by mapping it 1:1 to the schema of the JSON objects:
struct ScoreRecord
{
    long cid; /* The course id */
    int score;
}
cell Student
{
    /* CellID is available for all cells */
    string name;
    /* Both List<T> and Array<T> are compatible with JSON arrays when parsing */
    List<ScoreRecord> scores;
}
The code for importing this file is surprisingly simple:
using (var reader = new StreamReader("students.json"))
{
    string line;
    while (null != (line = reader.ReadLine()))
    {
        try
        {
            Global.LocalStorage.SaveStudent(Student.Parse(line));
        }
        catch
        {
            Console.Error.WriteLine("Failed to import the line:");
            Console.Error.WriteLine(line);
        }
    }
}
The Student.Parse method is generated by the TSL compiler. It leverages pre-compiled regular expressions for parsing the JSON representation of Student objects, which is usually more efficient than parsing the strings with a generic JSON parser.
No matter where the data is stored, as long as we can obtain one object per fetch, the data loading pattern holds. For example, importing data from MongoDB to GE would be as easy as:
BsonClassMap.RegisterClassMap<ScoreRecord>(cm => {
    cm.MapField("cid");
    cm.MapField("score");
});
BsonClassMap.RegisterClassMap<Student>(cm => {
    cm.MapIdField("CellID");
    cm.MapField("name");
    cm.MapField("scores");
});
/* Assuming the data is stored in a collection "students" of an IMongoDatabase instance db */
var students = db.GetCollection<Student>("students");
foreach (var student in students.Find(Builders<Student>.Filter.Empty).ToEnumerable())
{
    Global.LocalStorage.SaveStudent(student);
}
Partial Object Per Fetch
Things are subtler when we cannot fetch a complete object in a single fetch operation and have to join multiple lines to obtain an entry. Here is such a data set:
#File: professors.rdf
#Data given in triples <prof_id, property_name, property_value>
1000001 "teaches" "Computer science"
1000000 "name" "Jeff"
1000000 "comment" "A distinguished computer scientist."
1000001 "hobby" "Taichi"
1000000 "teaches" "P.E."
1000001 "name" "Alice"
1000001 "teaches" "Math"
...
It takes some effort to model this kind of data. As shown in the example, each line contains a single property of an object. Multiple properties with the same name should be grouped into a List<T>. If a property is not mandatory for all objects, we should mark it with the optional modifier. For this example, we can model the data set as follows:
cell Professor
{
    /* Let CellID represent prof_id */
    List<string> teaches;
    string name;
    optional string hobby;
}
The simplest way to import data of this kind is to use accessors to upsert the objects. An upsert operation updates an existing object, or inserts a new one if no matching object is found. Global.LocalStorage.UseProfessor(long cellId, CellAccessOptions options) has upsert semantics when the CellAccessOptions.CreateNewOnCellNotFound flag is switched on. The sample code is as follows:
using (var reader = new StreamReader("professors.rdf"))
{
    string line;
    string[] cols;
    while (null != (line = reader.ReadLine()))
    {
        try
        {
            /* Split into at most three columns so that property values
             * containing spaces stay intact, then strip the quotes used
             * in the data file. */
            cols = line.Split(new[] { ' ', '\t' }, 3, StringSplitOptions.RemoveEmptyEntries);
            string key   = cols[1].Trim('"');
            string value = cols[2].Trim('"');
            using (var prof_accessor = Global.LocalStorage.UseProfessor(
                long.Parse(cols[0]),
                CellAccessOptions.CreateNewOnCellNotFound))
            {
                switch (key)
                {
                    case "teaches":
                        prof_accessor.teaches.Add(value);
                        break;
                    case "name":
                        prof_accessor.name = value;
                        break;
                    case "hobby":
                        prof_accessor.hobby = value;
                        break;
                }
            }
        }
        catch
        {
            Console.Error.WriteLine("Failed to import the line:");
            Console.Error.WriteLine(line);
        }
    }
}
This approach, however, usually has performance issues. Appending new fields one by one causes many cell resize operations, which put great pressure on the GE memory management system. The best practice in this case is to sort the data set so that the properties of an object are grouped together. Then we can build and save a whole cell at a time without fragmenting the memory storage, as sketched after the cell definitions below. After sorting, the data set looks like:
# File: entities.rdf
# The data is sorted by keys
1 name Book
1 type ObjectClassDefinition
1 property title
1 property author
1 property price
2 title "Secret Garden: An Inky Treasure Hunt and Coloring Book"
2 author "Johanna Basford"
2 price "$10.24"
2 type Book
3 type Author
3 name "Johanna Basford"
3 write_book 2
4 name Author
4 property write_book
4 property name
4 type ObjectClassDefinition
...
Now we can read a few consecutive lines to construct a data object. Moreover, we can observe that the triples starting with 1 and 4 specify the data schemata (schema triples), and the triples starting with 2 and 3 are ordinary data objects (data triples) of the types defined by the schema triples. For each schema object, we can add a corresponding cell type in the TSL file:
/* The definition for the data type definition objects */
cell ObjectClassDefinition
{
    string name;
    List<string> property;
}
/* Constructed from object #1 */
cell Book
{
    string title;
    string author;
    string price;
}
/* Constructed from object #4 */
cell Author
{
    List<CellID> write_book;
    string name;
}
...
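To make the "whole cell at a time" idea concrete before moving on to generic cells, here is a minimal sketch for the Professor data from the previous example. It assumes professors.rdf has been pre-sorted by prof_id (written here as a hypothetical professors_sorted.rdf) and that the TSL compiler generates a SaveProfessor(long CellId, List<string> teaches, string name, string hobby) extension method analogous to SaveCourse above; the exact generated signature for cells with optional fields may differ. Each professor's properties are accumulated in local buffers and the whole cell is saved once:

using (var reader = new StreamReader("professors_sorted.rdf"))
{
    string line;
    long cell_id = -1;
    var teaches = new List<string>();
    string name = null, hobby = null;

    /* Save the buffered professor (if any) and reset the buffers. */
    Action commit = () =>
    {
        if (cell_id != -1)
            Global.LocalStorage.SaveProfessor(cell_id, teaches, name, hobby);
        teaches = new List<string>();
        name = null;
        hobby = null;
    };

    while (null != (line = reader.ReadLine()))
    {
        var cols = line.Split(new[] { ' ', '\t' }, 3, StringSplitOptions.RemoveEmptyEntries);
        long line_cell_id = long.Parse(cols[0]);
        if (line_cell_id != cell_id)
        {
            commit();               // The previous cell is complete: save it once.
            cell_id = line_cell_id;
        }
        string key = cols[1].Trim('"'), value = cols[2].Trim('"');
        switch (key)
        {
            case "teaches": teaches.Add(value); break;
            case "name":    name  = value;      break;
            case "hobby":   hobby = value;      break;
        }
    }
    commit(); // Save the last professor in the file.
}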
Generic cells can be used to handle different types of data, as long as the type of a cell can be determined from the input data. The following code demonstrates the usage of generic cells on the sorted data:
/* Assuming data sorted by cell id, but the fields of a cell may come in any order. */
using (var reader = new StreamReader("entities.rdf"))
{
    string line;
    string[] cols;
    string cell_type_str = "";
    long cell_id = long.MaxValue;
    List<string> field_keys = new List<string>();
    List<string> field_values = new List<string>();
    ICell generic_cell;
    while (null != (line = reader.ReadLine()))
    {
        try
        {
            cols = line.Split(new[] { ' ', '\t' }, 3, StringSplitOptions.RemoveEmptyEntries);
            long line_cell_id = long.Parse(cols[0]);
            if (line_cell_id != cell_id) // We're done with the current cell
            {
                if (field_keys.Count != 0) // Commit the current cell if it contains data
                {
                    /* We assume that cell_type_str exactly represents the
                     * type string of the cell. If cell_type_str is invalid,
                     * an exception will be thrown and the current record will
                     * be dropped.
                     */
                    generic_cell = Global.LocalStorage.NewGenericCell(cell_id, cell_type_str);
                    /* We use AppendToField to save data to a GenericCell. The
                     * value will be converted to a compatible type on a
                     * best-effort basis. The semantics of AppendToField is
                     * determined by the data type: if the target field is a
                     * list or a string, the supplied value will be appended to
                     * its tail; otherwise, the target field will be
                     * overwritten. This makes it convenient to use a single
                     * interface for saving to both appendable data fields (lists
                     * and strings) and non-appendable data fields (numbers,
                     * objects, etc.).
                     */
                    for (int i = 0, e = field_keys.Count; i != e; ++i)
                    {
                        generic_cell.AppendToField(field_keys[i], field_values[i]);
                    }
                    /* In this example, the local memory storage is accessed
                     * only once for each imported cell, which is more
                     * efficient than the upsert method shown above.
                     */
                    Global.LocalStorage.SaveGenericCell(generic_cell);
                    /* We're done with the current cell. Clear up the
                     * environment and move on to the next cell.
                     */
                    cell_type_str = "";   // Reset the cell type string.
                    field_keys.Clear();   // Reset the data field buffers.
                    field_values.Clear();
                }
                cell_id = line_cell_id;   // Set the target cell id.
            }
            /* Process the line now */
            if (cols[1] == "type") // We're reading the type string
            {
                cell_type_str = cols[2];
            }
            else // Otherwise, a data field.
            {
                field_keys.Add(cols[1]);
                field_values.Add(cols[2]);
            }
        }
        catch
        {
            Console.Error.WriteLine("Failed to import the line:");
            Console.Error.WriteLine(line);
        }
    }
    /* Commit the last cell after the loop; otherwise the final record
     * in the file would be dropped. */
    if (field_keys.Count != 0)
    {
        generic_cell = Global.LocalStorage.NewGenericCell(cell_id, cell_type_str);
        for (int i = 0, e = field_keys.Count; i != e; ++i)
        {
            generic_cell.AppendToField(field_keys[i], field_values[i]);
        }
        Global.LocalStorage.SaveGenericCell(generic_cell);
    }
}
Data Import via RESTful APIs
Sometimes it is more convenient to process certain data sets in a programming language other than C#. With the built-in support for user-defined RESTful APIs, we can easily post data to Graph Engine via HTTP interfaces.
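As a rough sketch of the posting side only: assume the GE server exposes a user-defined HTTP protocol at http://localhost:8080/ImportStudent (the endpoint name and port are hypothetical, not part of GE's built-in API). Any language with an HTTP client can issue the same POST; here it is shown in C# for consistency with the rest of this page:

/* Requires System.Net.Http and System.Text. The endpoint URL is illustrative. */
using (var http = new HttpClient())
{
    /* One Student JSON object per request body. */
    var payload = new StringContent(
        "{\"CellID\": 10002, \"name\": \"Carol\", \"scores\": []}",
        Encoding.UTF8, "application/json");
    var response = http.PostAsync("http://localhost:8080/ImportStudent", payload).Result;
    response.EnsureSuccessStatusCode();
}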