preserves/preserves-binary.md

10 KiB
Raw Blame History

no_site_title title
true Preserves: Binary Syntax

Tony Garnock-Jones tonyg@leastfixedpoint.com
{{ site.version_date }}. Version {{ site.version }}.

Preserves is a data model, with associated serialization formats. This document defines one of those formats: a binary syntax for Values from the Preserves data model that is easy for computer software to read and write. An equivalent human-readable text syntax also exists.

Machine-Oriented Binary Syntax

A Repr is a binary-syntax encoding, or representation, of a Value. For a value v, we write «v» for the Repr of v.

Type and Length representation.

Each Repr starts with a tag byte, describing the kind of information represented. Depending on the tag, a length indicator, further encoded information, and/or an ending tag may follow.

tag                          (simple atomic data)
tag ++ length ++ binarydata  (floats, doubles, integers, strings, symbols, and binary)
tag ++ repr ++ ... ++ endtag (compound data and annotations)

The unique end tag is byte value 0x84.

If present after a tag, the length of a following piece of binary data is formatted as a base 128 varint.1 We write varint(m) for the varint-encoding of m. Quoting the Google Protocol Buffers definition,

Each byte in a varint, except the last byte, has the most significant bit (msb) set this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two's complement representation of the number in groups of 7 bits, least significant group first.

For example, varint(15) is [0x0F], and varint(1000000000) is [0x80, 0x94, 0xeb, 0xdc, 0x03].

It is an error for a varint-encoded m in a Repr to be anything other than the unique shortest encoding for that m. That is, a varint-encoding of m MUST NOT end in 0 unless m=0.

Records, Sequences, Sets and Dictionaries.

      «<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
        «[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
       «#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
«{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]

There is no ordering requirement on the E_i elements or K_i/V_i pairs.2 They may appear in any order. However, the E_i and K_i MUST be pairwise distinct. In addition, implementations SHOULD default to writing set elements and dictionary key/value pairs in order sorted lexicographically by their Reprs3, and MAY offer the option of serializing in some other implementation-defined order.

SignedIntegers.

«x» = [0xB0] ++ varint(|intbytes(x)|) ++ intbytes(x)  if x ∈ SignedInteger

The function intbytes(x) gives the big-endian two's-complement binary representation of x, taking exactly as many whole bytes as needed to unambiguously identify the value and its sign. The value 0 needs zero bytes to identify the value; non-zero values need at least one byte, and the most-significant bit in the first byte is the sign bit. See the examples in the appendix below.

Strings, ByteStrings and Symbols.

«S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ String
      [0xB2] ++ varint(|S|) ++ S              if S ∈ ByteString
      [0xB3] ++ varint(|utf8(S)|) ++ utf8(S)  if S ∈ Symbol

Syntax for these three types varies only in the tag used. For String and Symbol, the data following the tag is a UTF-8 encoding of the Value, while for ByteString it is the raw data contained within the Value unmodified.

Booleans.

«#f» = [0x80]
«#t» = [0x81]

Floats and Doubles.

«F» = [0x87, 0x04] ++ binary32(F)  if F ∈ Float
«D» = [0x87, 0x08] ++ binary64(D)  if D ∈ Double

The functions binary32(F) and binary64(D) yield big-endian 4- and 8-byte IEEE 754 binary representations of F and D, respectively.

Embeddeds.

«#!V» = [0x86] ++ «V»

The Repr of an Embedded is the Repr of a Value chosen to represent the denoted object, prefixed with [0x86].

Annotations.

«@V_1...@V_n V» = [0xBF] ++ «V» ++ «V_1» ++...++ «V_n» ++ [0x84]

V MUST NOT itself be annotated, but each V_i in V_1...V_n MAY be annotated. Furthermore, n MUST be greater than zero. For example, the Repr corresponding to textual syntax @a@b[], i.e. an empty sequence annotated with two symbols, a and b, is

«@a @b []» = [0xBF] ++ «[]» ++ «a» ++ «b» ++ [0x84]
           = [0xBF, 0xB5, 0x84, 0xB3, 0x01, 0x61, 0xB3, 0x01, 0x62, 0x84]

Implementations SHOULD default to omitting annotations from binary Reprs.

Security Considerations

Annotations. In modes where a Value is being read while annotations are skipped, an endless nesting of annotations may give an illusion of progress.

Canonical form for cryptographic hashing and signing. No canonical textual encoding of a Value is specified. However, a canonical form exists for binary encoded Values, and implementations SHOULD produce canonical binary encodings by default; however, an implementation MAY permit two serializations of the same Value to yield different binary Reprs.

Appendix. Autodetection of textual or binary syntax

Every tag byte in a binary Preserves Document falls within the range [0x80, 0xBF]. These bytes, interpreted as UTF-8, are continuation bytes, and will never occur as the first byte of a UTF-8 encoding. This means no binary-encoded document can be misinterpreted as valid UTF-8.

Conversely, a UTF-8 document must start with a valid scalar value, meaning in particular that it must not start with a byte in the range [0x80, 0xBF]. This means that no UTF-8 encoded textual-syntax Preserves document can be misinterpreted as a binary-syntax document.

Examination of the top two bits of the first byte of a document gives its syntax: if the top two bits are 10, it should be interpreted as a binary-syntax document; otherwise, it should be interpreted as text.

Appendix. Table of tag values

 80 - False
 81 - True
 84 - End marker
 86 - Embedded
 87 - Float and Double

 B0 - Integer
 B1 - String
 B2 - ByteString
 B3 - Symbol
 B4 - Record
 B5 - Sequence
 B6 - Set
 B7 - Dictionary

 BF - Annotated Repr (not itself starting with BF) followed by annotations

All tags fall in the range [0x80, 0xBF].

Tag values 82, 83, 85, 88...AF, and B8...BE are reserved.

Appendix. Binary SignedInteger representation

Languages that provide fixed-width machine word types may find the following table useful in encoding and decoding binary SignedInteger values.

Integer range Bytes required Encoding (hex)
0 2 B0 00
-27 ≤ n < 27 (i8) 3 B0 01 XX
-215 ≤ n < 215 (i16) 4 B0 02 XX XX
-223 ≤ n < 223 (i24) 5 B0 03 XX XX XX
-231 ≤ n < 231 (i32) 6 B0 04 XX XX XX XX
-239 ≤ n < 239 (i40) 7 B0 05 XX XX XX XX XX
-247 ≤ n < 247 (i48) 8 B0 06 XX XX XX XX XX XX
-255 ≤ n < 255 (i56) 9 B0 07 XX XX XX XX XX XX XX
-263 ≤ n < 263 (i64) 10 B0 08 XX XX XX XX XX XX XX XX

Appendix. Binary SignedInteger examples

  «-257» = B0 02 FE FF     «-2» = B0 01 FE       «255» = B0 02 00 FF
  «-256» = B0 02 FF 00     «-1» = B0 01 FF       «256» = B0 02 01 00
  «-255» = B0 02 FF 01      «0» = B0 00        «32767» = B0 02 7F FF
  «-129» = B0 02 FF 7F      «1» = B0 01 01     «32768» = B0 03 00 80 00
  «-128» = B0 01 80       «127» = B0 01 7F     «65535» = B0 03 00 FF FF
  «-127» = B0 01 81       «128» = B0 02 00 80  «65536» = B0 03 01 00 00

  «87112285931760246646623899502532662132736»
    = B0 12 01 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00
            00 00

Notes


  1. Also known as LEB128 encoding, for unsigned integers. Varints and LEB128-encoded integers differ only for negative numbers, which cannot appear as length indicators and are thus not used in Preserves. ↩︎

  2. In the BitTorrent encoding format, bencoding, dictionary key/value pairs must be sorted by key. This is a necessary step for ensuring serialization of Values is canonical. We encourage, but do not require that key/value pairs (or set elements) be in sorted order for serialized Values; however, a canonical form for Reprs does exist where a sorted ordering is required. ↩︎

  3. It's important to note that the sort ordering for writing out set elements and dictionary key/value pairs is not the same as the sort ordering implied by the semantic ordering of those elements or keys. For example, the Repr of a negative number very far from zero will start with byte that is greater than the byte which starts the Repr of zero, making it sort lexicographically later by Repr, despite being semantically less than zero.

    Rationale. This is for ease-of-implementation reasons: not all languages can easily represent sorted sets or sorted dictionaries, but encoding and then sorting byte strings is much more likely to be within easy reach. ↩︎