Outputs

The default pipeline produces a series of output tables that align with the conceptual knowledge model. This page describes the schema of each table. By default, the tables are written to disk as Parquet files.

Shared fields

All tables have two identifier fields:

name	type	description
id	str	Generated UUID, ensuring global uniqueness.
human_readable_id	int	This is an incremented short ID created per-run. For example, we use this short ID with generated summaries that print citations so they are easy to cross-reference visually.
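
As an illustration of how the two identifier fields relate, the following is a minimal sketch, not the pipeline's actual code. The helper name `assign_ids`, the use of `uuid.uuid4`, and the choice to start the short ID at 0 are all assumptions made for the example.

```python
import uuid


def assign_ids(rows):
    """Return copies of `rows` with both shared identifier fields added.

    Hypothetical helper: the real pipeline may generate these differently.
    """
    out = []
    for index, row in enumerate(rows):
        out.append({
            "id": str(uuid.uuid4()),     # globally unique (assumed UUID4)
            "human_readable_id": index,  # short ID, incremented within a run
            **row,
        })
    return out


rows = assign_ids([{"title": "ALPHA"}, {"title": "BETA"}])
for row in rows:
    print(row["human_readable_id"], row["id"], row["title"])
```

The short ID is only meaningful within one run, so cross-run references should use `id`.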

communities

This is a list of the final communities generated by Leiden. Communities are strictly hierarchical, subdividing into children as the cluster affinity is narrowed.

name	type	description
community	int	Leiden-generated cluster ID for the community. Note that these increment with depth, so they are unique through all levels of the community hierarchy. For this table, human_readable_id is a copy of the community ID rather than a plain increment.
parent	int	Parent community ID.
children	int[]	List of child community IDs.
level	int	Depth of the community in the hierarchy.
title	str	Friendly name of the community.
entity_ids	str[]	List of entities that are members of the community.
relationship_ids	str[]	List of relationships that are wholly within the community (source and target are both in the community).
text_unit_ids	str[]	List of text units represented within the community.
period	str	Date of ingest in ISO 8601 format, used for incremental update merges.
size	int	Size of the community (entity count), used for incremental update merges.
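
Because each row carries both `parent` and `children`, the hierarchy can be walked in either direction. The sketch below uses hand-written sample rows (not real pipeline output) and hypothetical helper names. The `-1` parent value for top-level communities is an assumption; `roots` only relies on the parent not being a known community ID.

```python
# Hypothetical sample rows containing only the hierarchy columns.
communities = [
    {"community": 0, "parent": -1, "children": [2, 3], "level": 0},
    {"community": 1, "parent": -1, "children": [], "level": 0},
    {"community": 2, "parent": 0, "children": [4], "level": 1},
    {"community": 3, "parent": 0, "children": [], "level": 1},
    {"community": 4, "parent": 2, "children": [], "level": 2},
]


def build_index(rows):
    """Map community ID -> row for constant-time lookups."""
    return {row["community"]: row for row in rows}


def roots(index):
    """Communities whose parent is not itself a community in the table."""
    return [cid for cid, row in index.items() if row["parent"] not in index]


def descendants(index, community_id):
    """Collect every community ID below `community_id` (depth-first)."""
    found = []
    stack = list(index[community_id]["children"])
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(index[current]["children"])
    return found


index = build_index(communities)
print(roots(index))
print(sorted(descendants(index, 0)))
```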

community_reports

This is the list of summarized reports for each community.

name	type	description
community	int	Short ID of the community this report applies to.
parent	int	Parent community ID.
children	int[]	List of child community IDs.
level	int	Level of the community this report applies to.
title	str	LM-generated title for the report.
summary	str	LM-generated summary of the report.
full_content	str	LM-generated full report.
rank	float	LM-derived relevance ranking of the report, based on member entity salience.
rating_explanation	str	LM-derived explanation of the rank.
findings	dict	LM-derived list of the top 5-10 insights from the community. Contains `summary` and `explanation` values.
full_content_json	json	Full JSON output as returned by the LM. Most fields are extracted into columns, but this JSON is sent for query summarization so we leave it to allow for prompt tuning to add fields/content by end users.
period	str	Date of ingest in ISO 8601 format, used for incremental update merges.
size	int	Size of the community (entity count), used for incremental update merges.
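
A common way to consume this table is to pick the most relevant reports at one level of the hierarchy. The following is a minimal sketch over hand-written sample rows. The helper name `top_reports` is hypothetical, and it assumes that a higher `rank` means a more relevant report.

```python
# Hypothetical sample rows with a subset of the columns described above.
reports = [
    {"community": 0, "level": 0, "title": "Report A", "rank": 7.5},
    {"community": 1, "level": 0, "title": "Report B", "rank": 9.0},
    {"community": 2, "level": 1, "title": "Report C", "rank": 8.0},
]


def top_reports(rows, level, limit=3):
    """Return up to `limit` reports at `level`, highest rank first."""
    at_level = [row for row in rows if row["level"] == level]
    return sorted(at_level, key=lambda row: row["rank"], reverse=True)[:limit]


best = top_reports(reports, level=0, limit=1)
print([row["title"] for row in best])
```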

covariates

(Optional) If claim extraction is turned on, this is a list of the extracted covariates. Note that claims are typically oriented around identifying malicious behavior such as fraud, so they are not useful for all datasets.

name	type	description
covariate_type	str	This is always "claim" with our default covariates.
type	str	Nature of the claim type.
description	str	LM-generated description of the behavior.
subject_id	str	Name of the source entity (that is performing the claimed behavior).
object_id	str	Name of the target entity (that the claimed behavior is performed on).
status	str	LM-derived assessment of the correctness of the claim. One of [TRUE, FALSE, SUSPECTED].
start_date	str	LM-derived start of the claimed activity, in ISO 8601 format.
end_date	str	LM-derived end of the claimed activity, in ISO 8601 format.
source_text	str	Short string of text containing the claimed behavior.
text_unit_id	str	ID of the text unit the claim text was extracted from.

documents

List of document content after import.

name	type	description
title	str	Filename, unless otherwise configured during CSV import.
text	str	Full text of the document.
text_unit_ids	str[]	List of text units (chunks) that were parsed from the document.
metadata	dict	If specified during CSV import, this is a dict of metadata for the document.

entities

List of all entities found in the data by the LM.

name	type	description
title	str	Name of the entity.
type	str	Type of the entity. By default this will be "organization", "person", "geo", or "event" unless configured differently or auto-tuning is used.
description	str	Textual description of the entity. Entities may be found in many text units, so this is an LM-derived summary of all descriptions.
text_unit_ids	str[]	List of the text units containing the entity.
frequency	int	Count of text units the entity was found within.
degree	int	Node degree (connectedness) in the graph.
x	float	X position of the node for visual layouts. If graph embeddings and UMAP are not turned on, this will be 0.
y	float	Y position of the node for visual layouts. If graph embeddings and UMAP are not turned on, this will be 0.

relationships

List of all entity-to-entity relationships found in the data by the LM. This is also the edge list for the graph.

name	type	description
source	str	Name of the source entity.
target	str	Name of the target entity.
description	str	LM-derived description of the relationship. As with entity descriptions, a relationship may be found in many text units, so this is a summary of all descriptions.
weight	float	Weight of the edge in the graph. This is summed from an LM-derived "strength" measure for each relationship instance.
combined_degree	int	Sum of source and target node degrees.
text_unit_ids	str[]	List of text units the relationship was found within.
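
The `weight` and `combined_degree` columns can both be derived from simpler inputs. The sketch below shows one way to do that on hand-written sample data; it is not the pipeline's implementation. The per-instance field name `strength`, the helper names, the use of an ordered (source, target) pair as the merge key, and counting degree as the number of distinct neighbours are all assumptions.

```python
from collections import defaultdict

# Hypothetical relationship instances, one per extraction from a text unit.
instances = [
    {"source": "A", "target": "B", "strength": 2.0},
    {"source": "A", "target": "B", "strength": 3.0},
    {"source": "A", "target": "C", "strength": 1.0},
    {"source": "C", "target": "D", "strength": 4.0},
]


def merge_instances(rows):
    """Sum per-instance strength into one weighted edge per (source, target)."""
    weights = defaultdict(float)
    for row in rows:
        weights[(row["source"], row["target"])] += row["strength"]
    return [
        {"source": source, "target": target, "weight": weight}
        for (source, target), weight in weights.items()
    ]


def node_degrees(edges):
    """Count distinct neighbours of every entity in the edge list."""
    neighbours = defaultdict(set)
    for edge in edges:
        neighbours[edge["source"]].add(edge["target"])
        neighbours[edge["target"]].add(edge["source"])
    return {name: len(others) for name, others in neighbours.items()}


def with_combined_degree(edges):
    """Attach combined_degree = degree(source) + degree(target) to each edge."""
    degrees = node_degrees(edges)
    return [
        {**edge, "combined_degree": degrees[edge["source"]] + degrees[edge["target"]]}
        for edge in edges
    ]


edges = with_combined_degree(merge_instances(instances))
for edge in edges:
    print(edge)
```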

text_units

List of all text chunks parsed from the input documents.

name	type	description
text	str	Raw full text of the chunk.
n_tokens	int	Number of tokens in the chunk. This should normally match the `chunk_size` config parameter, except for the last chunk which is often shorter.
document_ids	str[]	List of document IDs the chunk came from. This is normally only 1 due to our default groupby, but for very short text documents (e.g., microblogs) it can be configured so text units span multiple documents.
entity_ids	str[]	List of entities found in the text unit.
relationship_ids	str[]	List of relationships found in the text unit.
covariate_ids	str[]	Optional list of covariates found in the text unit.
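
The ID list columns act as foreign keys into the other tables. As a minimal sketch, the example below resolves `entity_ids` on each text unit to entity titles by joining on the shared `id` field. The sample rows and the helper name `entity_titles` are hypothetical, and the short IDs such as "e1" stand in for real UUIDs.

```python
# Hypothetical sample rows; real `id` values are UUID strings.
entities = [
    {"id": "e1", "title": "ALICE"},
    {"id": "e2", "title": "ACME"},
]
text_units = [
    {"id": "t1", "text": "Alice joined Acme.", "entity_ids": ["e1", "e2"]},
    {"id": "t2", "text": "Acme expanded.", "entity_ids": ["e2"]},
]


def entity_titles(text_unit, entities_by_id):
    """Look up the title of each entity referenced by a text unit.

    IDs with no matching entity row are skipped rather than raising.
    """
    return [
        entities_by_id[entity_id]["title"]
        for entity_id in text_unit["entity_ids"]
        if entity_id in entities_by_id
    ]


entities_by_id = {entity["id"]: entity for entity in entities}
resolved = {unit["id"]: entity_titles(unit, entities_by_id) for unit in text_units}
print(resolved)
```

The same lookup pattern applies to the other ID list columns, using the corresponding table as the index.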