SBHPF-Standard/SBHPF-v1.md
2025-02-14 20:12:29 +01:00

13 KiB
Raw Permalink Blame History

Simple Binary Hierarchical Property Format (Version 1)

1. Overview

This document specifies the structure of a compact and efficient binary tree-like property format designed for hierarchical key-value storage. The format prioritizes fast serialization/deserialization and minimal storage overhead.

SBHPF stands for Simple Binary Hierarchical Property Format.

The term "file" or "property file" in this document doesn't strictly imply a file in the typical sense of an entry in a filesystem. "file", in this context, simply refers to a byte array. This byte array can in turn be stored in a traditional file, or in any other format. We chose the term "file" to make the document easier to read.

2. File Structure

A binary property file consists of: A fixed-length header (2 bytes) that defines the format version and feature set. A hierarchical node tree that contains nodes, and key-value entries known as "properties".

3. File Header

The file begins with a 2-byte fixed-length header:

Offset Size Field Name Description
0 1 Version Format version. Always 0x01 for this version. Future versions increment this.
1 1 Feature Flags Reserved for future use. Always 0x00 in version 1.

After the header, the first node starts immediately at byte 2.

4. Node Structure

Each node represents a hierarchical entity and contains a fixed-length header, properties, and child nodes. Note that only the node header is required for a node to be valid. Properties and child nodes are optional.

A node has a structure with the following components:

  • Node Header - Data describing the node. Has fixed size of 9 bytes.
  • Node Name - (optional) Only present on named nodes. A node is considered named if the Name Length node header is bigger than 0. That header also defines the size of this component.
  • Node Properties - The beginning of node properties. How this works is described in detail later, but this is only present if the Propert Count node header is bigger than 0. That header, in combination with information on each property, defines the size of this component.
  • Child Nodes - The beginning of child nodes. This is also described more later, but this is only present if the Child Count node header is bigger than 0. The size of this component is defined by that header, in combination with the Node Size headers of the child nodes.

4.1 Node Header

Each node begins with a fixed 9-byte header:

Offset Size Field Name Description
0 4 Node Size Total size of this node (including header, entries, and children).
4 2 Property Count Number of key-value entries (properties) within this node.
6 2 Child Count Number of direct child nodes.
8 1 Name Length Length of the node name field (in bytes). A value of 0 means unnamed.

These fields are unsigned integers.

4.1.1 Node Size

Node Size specifies the total size of the node in bytes.

The size is defined as: the amount of bytes from the first byte of the node header, to the last byte of the last child of the node.

If the node does not have any children, this definition changes to: the last byte of the last property.

And if there are no properties, the definition will simply be: the last byte of the node header (if the node is named, it will be last byte of node header + Name Length).

Since Node Size has a fixed size of 4 bytes, the theoretical max size of a node is 4,294,967,296 bytes, or 4.295 gigabytes. This of course, includes root nodes.

This constraint might be addressed in future revisions, but storing that amount of data in a format primarily intended for configuration files does not make sense anyway.

4.1.2 Property Count

Property Count specifies the amount of properties defined on the node. This, together with other parts of the node header, allows defining "fixed" sizes of variable fields in the binary structure. By using this approach, we eliminate the need of recursively terminating values which would contribute to inefficient parsing.

Since Property Count has a fixed size of 2 bytes, the max value of this field is 65535. This effectively means that each node can have a total of 65535 properties defined on it.

4.1.3 Child Count

Child Count specifies the amount of child nodes belonging to the node. There are no explicitly defined pointers or IDs in this binary format, because these node headers are all we need to dynamically determine the location of data.

Child nodes are defined immediately after the last property on the node, or the end of the node name if no properties are defined, or the end of the node header if no name is defined.

Since Child Count has a fixed size of 2 bytes, the max value of this field is 65535. This effectively means that each node can have up to 65535 children defined on it.

4.1.4 Name Length

If Name Length > 0, the node name is stored immediately after the node header, as a UTF-8 encoded string.

Since Name Length has a fixed size of 1 byte, the max value of this field is 255. This effectively means that the name of each node, if defined, can have a maximum length of 255.

If unnamed, this section is omitted, and the next section (properties) starts immediately after the header.

4.3 Properties

A.k.a key-value entries.

After the node header (and optional node name), the node contains properties. Each property follows this format:

Offset Size Field Name Description
0 1 Key Length Length of the key string (in bytes).
2 1 Value Type Specifies the type of the value (see Section 5).
3 X Key UTF-8 string of Key Length bytes.
X+3 Y Value The value itself, determined by Value Type.

The number of properties on a node is given by Property Count in the Node Header.

The total size of each property is variable, depending on Key Length and Value Type.

This can also be described as "the first two bytes are fixed, and are used to determine the length of the rest of the property".

4.3.1 Key Length

Defines the length of the Key that follows immediately after Value Type at offset 2.

Since Key Length has a fixed size of 1 byte, the max value of this field is 255. This effectively means that each

4.4 Child Nodes

After all properties, the node contains child nodes.

The number of children is given by Child Count in the Node Header.

Each child node follows immediately after the previous nodes last property.

Each child node is stored in the same format as a regular node (header → name → properties → children).

This immediate tree structure allows rapid traversal.

5. Value Types

Each property has a 1-byte type specifier before the value. Supported types:

| Type ID | Type Name | Size | Description | | 0x01 | int8 | 1 byte | 8-bit signed integer | | 0x02 | uint8 | 1 byte | 8-bit unsigned integer | | 0x03 | int16 | 2 bytes | 16-bit signed integer | | 0x04 | uint16 | 2 bytes | 16-bit unsigned integer | | 0x05 | int32 | 4 bytes | 32-bit signed integer | | 0x06 | uint32 | 4 bytes | 32-bit unsigned integer | | 0x07 | int64 | 8 bytes | 64-bit signed integer | | 0x08 | uint64 | 8 bytes | 64-bit unsigned integer | | 0x09 | float32 | 4 bytes | 32-bit IEEE 754 single precision float | | 0x0A | float64 | 8 bytes | 64-bit IEEE 754 single precision float | | 0x0B | bool | 1 byte | 0x00 = false, 0x01 = true | | 0x0C | string | Variable | UTF-8 encoded string (preceded by a 2-byte length prefix) |

The structure and size of a property is determined by the type.

Some types (strings) are more complex because of their variable size, and dedicate 2 bytes to specify the size of the string. This is called the length prefix. Because this prefix is fixed, the maximum size of the value of a string property is 65535 bytes.

6. Example Structure (Hex Representation)

Heres an example of a configuration file storing a named root node with a nameless child node and some properties:

01 00                  	; Header: Version 1, Feature Flags 0
10 00 00 00            	; Node Size (57 bytes)
02 00                  	; Property Count = 2
01 00                  	; Child Count = 1
06                     	; Node Name Length = 6 ("config")
63 6F 6E 66 69 67                   ; Node Name "config" (UTF-8 encoded)
05 0B 73 65 74 75 70 01             ; Property 1: Key "setup" (bool = true)
04 0C 70 61 74 68 04 00 2F 75 73 72 ; Property 2: Key "path" (string = "/usr")
10 00 00 00				; Node Size (20 bytes)
01 00					; Property Count = 1
00 00					; Child Count = 0
00						; Node Name Length = 0 (nameless node)
05 02 6C 65 76 65 6C 03 00 00 00	; Property 1: Key "level" (uint32 = 3)

This example demonstrates:

  • Named root node with name "config"
  • Two properties (setup = true, path = "/usr")
  • One nameless child node with one property (level = 3)

6.1 Understanding the structure

The rest of this document contains enough information to understand the data structure shown above. However, it might be a bit confusing if you are unfamiliar with concepts such as hexadecimal and binary data storage. While you do not need to understand any of this to use libraries implementing this standard, it is still good practice to "know your tools". And it might even be interesting, who knows.

To help explain the structure above, let's look at the most complex entry: the "path" string.

Here is the "path" string property from the example: 04 0C 70 61 74 68 04 00 2F 75 73 72

Let's break it apart to make it easier to understand: 04 0C 70 61 74 68 04 00 2F 75 73 72

Now let's explain each part of the property:

  • 04 - Key Length: In this case, a key length of 4. Because the key "path" is 4 bytes. All properties have a key length defined.
  • 0C - Value Type: Controls the type of the property. Some types also have unique formatting. In this case, we specify a type of 12 (hex 0C) which corresponds to a string value.
  • 70 61 74 68 - Key: This is the UTF-8 key. This key has a size of 4 bytes, as specified by the Key Length. It has a UTF-8 value of "path". Everything in the property after the key is unique to the value type.
  • 04 00 - String Length Prefix: This is unique to the string value type. It is fixed to a size of 2 bytes and specifies the size of the actual string value. In this case 4 bytes.
  • 2F 75 73 72 - String value: The actual string. In this case "/usr". This value has a size of 4 bytes, as specified by the String Length Prefix.

This might seem very complex just to define a string, but from a technical perspective it is actually really simple and pretty efficient. Implementing a similar structure as the example above in JSON would take around 100 bytes of storage, while this format only takes up 57 bytes. Parsing this format is also a lot more efficient compared to JSON.

There are many improvements that can be made to this format in terms of parsing efficiency, by for example padding values to multiples of 2. But the version defined here (version 1) serves as a good foundation that can be built upon in the future.

8. Future Extensions (Versions 2+)

This format is designed for future expansion. Examples of improvements that could be made in future versions are:

  • Extended Feature Flags: byte 2 of the file header can enable new features.
  • Variable Node Size Fields: to allow nodes >4GB, if needed in later versions.
  • Compression Flags: optional zlib or LZ4 support for even smaller storage.
  • Optimized Field Sizes: byte-aligning parts of the structure with padding to allow for logic-optimized parser implementations.

And probably much more...

But! This serves as a pretty solid foundation to build upon. And is a fun experiment. The presence of the version flag at the start of a file also enables backwards compatibility.

9. Conclusion

This SBHPF Version 1 specification ensures:

  • Fast, direct access to nodes with fixed headers and easily calculated offsets.
  • Minimal parsing complexity thanks to the absence of unnecessary variable-size fields.
  • Hierarchical, structured key-value storage with efficient traversal.
  • Future-proofing via a version + feature flags system.

This probably sounds like "cheesy tech talk". In reality this is just a functional and pretty efficient binary format that was created because "why not".

10. Document revisions

Additional revisions might be made to this version standard definition at any time. These revisions should never be breaking changes, as that should be defined in a new version of the format. These revisions are specifically intended for correcting grammar mistakes and improving readability and clearity.

  • 2025-02-14 - First publication.