Graph File Format

This document presents the on-disk file formats supported by GF. The in-memory graph formats are covered in a separate document.

GF Graph

The GF Graph format is the primary and most efficient way to serialize graphs of any size. It is compatible with open-source and cloud tooling, suited to distributed reading and writing, and easy to create manually or with other tools.

Usage

To read and write a GF Graph in memory, use dgf.io.read_graph and dgf.io.write_graph, respectively.

graph, schema = dgf.io.read_graph(
    "/path/to/ogb_arxiv")

dgf.io.write_graph(graph, schema, path="/tmp/my_graph")

To read and write a GF Graph as a Beam distributed graph, use dgf.beam.io.read_graph and dgf.beam.io.write_graph, respectively.

Directory structure

A GF graph is structured as follows:

  • metadata.json: A JSON file specifying details about the graph encoding:
    • The version of the format, as an integer.
    • Optionally, the number of nodes in each nodeset and the number of edges in each edgeset.
  • schema.json: The dgf.data.GraphSchema graph schema in JSON.
    • The nodesets are required to have a feature of type bytes or int (int32, int64) with semantic PRIMARY_KEY.
    • The edgesets can optionally have a feature of type bytes or int (int32, int64) with semantic PRIMARY_KEY. This specifies if the edges have ids.
  • nodesets directory:
    • <nodeset name>@*.parquet: Feature values of each nodeset, stored as sharded Parquet files. The #id column holds the ID of each node.
  • edgesets directory:
    • <edgeset name>@*.parquet: Source and target node IDs, and feature values, of each edgeset, stored as sharded Parquet files. The optional #id column holds the ID of each edge. The IDs of the source and target nodes are stored in the #source and #target columns.
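
The layout above can be checked with a few lines of standard-library Python. The following is a minimal sketch, not part of GF: the helper names (list_sets, check_layout, make_demo_graph_dir) are hypothetical, the shard suffixes such as "paper@0.parquet" are placeholders that merely match the <name>@*.parquet pattern, and the sketch assumes both the nodesets and edgesets directories are present. It only verifies that the expected entries exist and that the two JSON files parse; it does not interpret their fields or open the Parquet files.

```python
import json
import tempfile
from pathlib import Path


def list_sets(graph_dir, kind):
    """Group the shard files of the 'nodesets' or 'edgesets' directory by set name."""
    shards = {}
    for shard in sorted((Path(graph_dir) / kind).glob("*@*.parquet")):
        # ASSUMPTION: everything before the last "@" is the nodeset/edgeset name.
        name, _, _ = shard.name.rpartition("@")
        shards.setdefault(name, []).append(shard.name)
    return shards


def check_layout(graph_dir):
    """Raise if a top-level entry is missing; otherwise return the shards found."""
    graph_dir = Path(graph_dir)
    # ASSUMPTION: all four entries are mandatory for this check.
    for required in ("metadata.json", "schema.json", "nodesets", "edgesets"):
        if not (graph_dir / required).exists():
            raise FileNotFoundError(f"missing {required} in {graph_dir}")
    # Both JSON files must at least parse; their fields are not inspected here.
    for name in ("metadata.json", "schema.json"):
        json.loads((graph_dir / name).read_text())
    return {kind: list_sets(graph_dir, kind) for kind in ("nodesets", "edgesets")}


def make_demo_graph_dir():
    """Create a throwaway directory with the expected layout and empty shard files."""
    root = Path(tempfile.mkdtemp())
    (root / "metadata.json").write_text("{}")
    (root / "schema.json").write_text("{}")
    demo_shards = {
        "nodesets": ["paper@0.parquet", "paper@1.parquet"],
        "edgesets": ["cites@0.parquet"],
    }
    for kind, names in demo_shards.items():
        (root / kind).mkdir()
        for name in names:
            # Empty placeholder files, not valid Parquet data.
            (root / kind / name).touch()
    return root


if __name__ == "__main__":
    print(check_layout(make_demo_graph_dir()))
```

Grouping shards by the text before the last "@" keeps the check independent of how many shards each set has.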

About Parquet files

Parquet has become the de facto format for tabular data. Many internal and external tools can produce Parquet files natively. You can use parquet-tools to inspect Parquet files directly.

In Google SQL / BigQuery, you can generate Parquet files with EXPORT DATA and format = 'PARQUET'. Beam pipelines can also generate Parquet files.

Graph AI HGraph

A Graph AI HGraph (or just HGraph) is another directory-based format for defining heterogeneous graphs.

The HGraph format is similar to the GF format, with the following major differences:

  • Nodes and edges are stored as tf.train.Example or other proprietary protos (instead of parquet files).
  • The protos are encapsulated within SSTables, RecordIO or TF Records.
  • The schema is defined by a TF-GNN schema proto (instead of a GF JSON Schema). This schema is more lenient (e.g., ID features are optional) and does not carry semantic information.

Recommendation

For efficiency reasons (up to 100x speed difference), Cloud / open-source compatibility, and ease of creation / consumption, using the GF format is always recommended over the Graph AI HGraph format.

Usage

To read and write a Graph AI HGraph in memory, use dgf.io.read_graphai_hgraph and dgf.io.write_graphai_hgraph, respectively.

To read and write a Graph AI HGraph as a Beam distributed graph, use dgf.beam.io.read_graphai_hgraph and dgf.beam.io.write_graphai_hgraph, respectively.

Combining dgf.io.read_graphai_hgraph with dgf.io.write_graph allows you to convert small HGraphs (e.g. <100 edges) into GF Graphs.
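
The in-memory conversion can be sketched as follows. This is an illustration under assumptions, not the documented API: it assumes that dgf.io.read_graphai_hgraph takes a path and returns a (graph, schema) pair like dgf.io.read_graph does, which this document does not confirm. The helper convert_hgraph_to_gf is hypothetical, and the io argument stands in for the dgf.io module so that the sketch can be exercised with a stub.

```python
from types import SimpleNamespace


def convert_hgraph_to_gf(io, hgraph_path, gf_path):
    """Read an HGraph into memory and write it back as a GF Graph.

    `io` stands in for the `dgf.io` module.
    ASSUMPTION: read_graphai_hgraph(path) returns (graph, schema),
    mirroring read_graph; check the actual signature before relying on this.
    """
    graph, schema = io.read_graphai_hgraph(hgraph_path)
    io.write_graph(graph, schema, path=gf_path)


# Stub replacing dgf.io, so the sketch runs without the library installed.
calls = []
fake_io = SimpleNamespace(
    read_graphai_hgraph=lambda path: (("graph", path), "schema"),
    write_graph=lambda graph, schema, path: calls.append((graph, schema, path)),
)
convert_hgraph_to_gf(fake_io, "/path/to/hgraph", "/tmp/my_graph")
```

With the real library, the call would presumably be convert_hgraph_to_gf(dgf.io, ...), subject to the assumption above.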

The "convert_hgraph_to_gf_graph" example Beam CLI allows you to convert large HGraphs into GF Graphs.

TF Graph Samples

A TF Graph Samples file defines one or several small heterogeneous graphs, where each graph fits into the memory of a single computer. TF Graph Samples files are well suited to representing a large number of small graphs, e.g., graph samples.

A TF Graph Samples file (or sharded file) is a TensorFlow Record file containing serialized tensorflow.Example protos, following the conventions of TF GNN Graph.

Note: Unlike for Graph AI HGraph, the use of tf.train.Example in TF Graph Samples is mostly efficient.

Usage

TF Graph Samples can be read/written in memory using dgf.io.read_tfgnn_graphs and dgf.io.write_tfgnn_graphs.

TF Graph Samples can be read/written in Beam using dgf.beam.io.read_tfgnn_graphs and dgf.beam.io.write_tfgnn_graphs.