Preserves: Binary Syntax

Tony Garnock-Jones tonyg@leastfixedpoint.com
{{ site.version_date }}. Version {{ site.version }}.

Preserves is a data model, with associated serialization formats. This document defines one of those formats: a binary syntax for Values from the Preserves data model that is easy for computer software to read and write. An equivalent human-readable text syntax also exists.

Machine-Oriented Binary Syntax

A Repr is a binary-syntax encoding, or representation, of a Value. For a value v, we write «v» for the Repr of v.

Type and Length representation.

Each Repr starts with a tag byte, describing the kind of information represented.

However, inspired by argdata, a Repr does not describe its own length. Instead, the expected length of the Repr is always available from the surrounding context: either from a containing encoded value, or from the overall container of the data, which could be a file, an HTTP message, a UDP packet, etc.

As a consequence, Reprs for Compound values store the lengths of their contained values. Each contained Value is represented as a length in bytes followed by its own Repr. Implementations use each stored length to decide when to stop reading the following Repr.

Each length is stored as an argdata-compatible big-endian base 128 varint.1 Each byte of a varint stores seven bits of the length. All bytes have a clear upper bit, except the final byte, which has the upper bit set. We write len(m) for the varint-encoding of a non-negative integer m, defined recursively as follows:

len(m) = e(m, 128)
       where e(v, d) = [v + d]                           if v < 128
                       e(v / 128, 0) ++ [(v % 128) + d]  if v ≥ 128

We write len(|r|) for the varint-encoding of the length of Repr r.

There is no requirement that a varint-encoded m in a Repr be the unique shortest encoding for that m.2 However, implementations SHOULD use the shortest encoding wherever possible when writing, and MAY reject encodings with more than eight leading 0 bytes when reading encoded values.
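As a rough illustration of the definition of len(m), the following Python sketch encodes and decodes lengths. The names `encode_len` and `decode_len` and the `max_leading_zeros` parameter are hypothetical, not part of the specification; the default limit of eight simply mirrors the MAY clause above.

```python
def encode_len(m):
    """Shortest varint for a non-negative integer m, following len(m)."""
    if m < 0:
        raise ValueError("lengths are non-negative")
    out = [(m % 128) + 128]      # final byte: low seven bits, upper bit set
    m //= 128
    while m > 0:
        out.append(m % 128)      # earlier bytes: upper bit clear
        m //= 128
    return bytes(reversed(out))  # big-endian: most significant group first


def decode_len(data, pos=0, max_leading_zeros=8):
    """Read one varint starting at data[pos]; return (value, next_pos)."""
    value = 0
    zeros = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        b = data[pos]
        pos += 1
        if value == 0 and b == 0:
            # A 0 byte before any non-zero bits is a redundant leading byte.
            zeros += 1
            if zeros > max_leading_zeros:
                raise ValueError("too many redundant leading 0 bytes")
        value = value * 128 + (b & 0x7F)
        if b & 0x80:             # upper bit set marks the final byte
            return value, pos
```

For instance, `encode_len(300)` yields the two bytes 2 and 172, and `decode_len` accepts the overlong form 0, 2, 172 as the same length.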

Records, Sequences, Sets and Dictionaries.

      «<L F_1...F_m>» = [0xA7] ++ seq(«L», «F_1», ..., «F_m»)
        «[X_1...X_m]» = [0xA8] ++ seq(«X_1», ..., «X_m»)
       «#{E_1...E_m}» = [0xA9] ++ seq(«E_1», ..., «E_m»)
«{K_1:V_1...K_m:V_m}» = [0xAA] ++ seq(«K_1», «V_1», ..., «K_m», «V_m»)

   seq(R_1, ..., R_m) = len(|R_1|) ++ R_1 ++...++ len(|R_m|) ++ R_m

There is no ordering requirement on the E_i elements or K_i/V_i pairs.3 They may appear in any order. However, the E_i and K_i MUST be pairwise distinct. In addition, implementations SHOULD default to writing set elements and dictionary key/value pairs in order sorted lexicographically by their Reprs4, and MAY offer the option of serializing in some other implementation-defined order.
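A minimal sketch of the recommended default ordering, assuming the elements, keys and values have already been encoded as Reprs. The function names are hypothetical. Sorting dictionary entries by the Repr of the key alone is an interpretation of the rule above; it agrees with the worked examples later in this document.

```python
def encode_len(m):
    # Shortest big-endian base-128 varint; only the final byte has its upper bit set.
    out = [(m % 128) + 128]
    m //= 128
    while m:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def seq(reprs):
    # seq(R_1, ..., R_m): each Repr preceded by its varint-encoded length.
    return b"".join(encode_len(len(r)) + r for r in reprs)


def encode_set(element_reprs):
    # Assumes every element Repr is in its shortest form, so that comparing
    # Reprs is enough to detect duplicate Values.
    element_reprs = list(element_reprs)
    if len(set(element_reprs)) != len(element_reprs):
        raise ValueError("set elements must be pairwise distinct")
    # Python compares bytes objects lexicographically by unsigned byte value.
    return b"\xA9" + seq(sorted(element_reprs))


def encode_dictionary(pairs):
    # pairs: iterable of (key_repr, value_repr) tuples, sorted here by key Repr.
    pairs = sorted(pairs, key=lambda kv: kv[0])
    keys = [k for k, _ in pairs]
    if len(set(keys)) != len(keys):
        raise ValueError("dictionary keys must be pairwise distinct")
    return b"\xAA" + seq([r for kv in pairs for r in kv])
```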

No sentinel marks the end of a sequence of length-prefixed Reprs. During decoding, use the length of the containing Repr to decide when to stop expecting more contained Reprs.
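The decoding rule can be sketched as follows: given the bytes of a compound Repr after its tag byte, keep reading length-prefixed Reprs until the container is exhausted. The name `split_seq` is hypothetical, and this sketch does not limit overlong lengths.

```python
def split_seq(body):
    """Split the body of a compound Repr into its contained Reprs."""
    items = []
    pos = 0
    while pos < len(body):           # the container's length is the only stop signal
        n = 0
        while True:                  # read one varint-encoded length
            if pos >= len(body):
                raise ValueError("truncated length")
            b = body[pos]
            pos += 1
            n = n * 128 + (b & 0x7F)
            if b & 0x80:             # upper bit set marks the final length byte
                break
        if pos + n > len(body):
            raise ValueError("contained Repr overruns its container")
        items.append(body[pos:pos + n])
        pos += n
    return items
```

Applied to the bytes following the tag of a Record, the first item returned is the Repr of the label and the rest are the Reprs of the fields.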

SignedIntegers.

«x» when x ∈ SignedInteger = [0xA3] ++ intbytes(x)

The function intbytes(x) gives a big-endian two's-complement binary representation of x, taking at least as many whole bytes as needed to unambiguously identify the value and its sign; intbytes(0) may be the empty byte sequence.5 The most-significant bit in the first byte in intbytes(x) is the sign bit. While every SignedInteger SHOULD be represented with its shortest possible encoding (which will often include a necessary leading 0xFF or 0x00), redundant leading 0xFF or 0x00 bytes MAY be used.6
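One way to compute the shortest intbytes(x) is sketched below in Python, whose integers are arbitrary-precision. The helper name `encode_signed_integer` is hypothetical, and the sketch chooses the empty encoding for zero, which the text above permits but does not require.

```python
def intbytes(x):
    """Shortest big-endian two's-complement encoding of x; empty for zero."""
    if x == 0:
        return b""
    # Count the magnitude bits, then leave room for one sign bit.
    # For negative x, ~x == -x - 1 has the same magnitude bits as x's
    # two's-complement form with the sign bits removed.
    bits = x.bit_length() if x > 0 else (~x).bit_length()
    nbytes = bits // 8 + 1
    return x.to_bytes(nbytes, "big", signed=True)


def encode_signed_integer(x):
    return b"\xA3" + intbytes(x)
```

For example, 128 needs a leading 0x00 so that its top bit is not read as a sign bit, while -128 fits in the single byte 0x80.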

Strings, ByteStrings and Symbols.

«S» = [0xA4] ++ utf8(S) ++ [0]  if S ∈ String
      [0xA5] ++ S               if S ∈ ByteString
      [0xA6] ++ utf8(S)         if S ∈ Symbol

For String and Symbol, the data following the tag is a UTF-8 encoding of the Value's code points, while for ByteString it is the raw data contained within the Value unmodified.

Each String has a trailing zero byte appended. This extra byte MUST NOT be treated as part of the Value: it exists to permit zero-copy C interoperability.7
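A small sketch of these three encodings, with hypothetical function names. The decoder shows the trailing zero byte being stripped rather than searched for, since a String may itself contain the zero code point.

```python
def encode_string(s):
    # UTF-8 bytes of the code points, plus the trailing zero byte.
    return b"\xA4" + s.encode("utf-8") + b"\x00"


def encode_bytestring(b):
    # Raw data, unmodified.
    return b"\xA5" + bytes(b)


def encode_symbol(s):
    # UTF-8 bytes of the code points, with no trailing zero byte.
    return b"\xA6" + s.encode("utf-8")


def decode_string(r):
    # r is a complete String Repr whose length is known from its context.
    if len(r) < 2 or r[0] != 0xA4 or r[-1] != 0x00:
        raise ValueError("not a well-formed String Repr")
    # Drop the tag and exactly one trailing zero byte; any earlier zero
    # bytes belong to the Value.
    return r[1:-1].decode("utf-8")
```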

Booleans.

«#f» = [0xA0]
«#t» = [0xA1]

Floats and Doubles.

«F» when F ∈ Float  = [0xA2] ++ binary32(F)
«D» when D ∈ Double = [0xA2] ++ binary64(D)

The functions binary32(F) and binary64(D) yield big-endian 4- and 8-byte IEEE 754 binary representations of F and D, respectively.
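Because Floats and Doubles share the tag 0xA2, a reader has to rely on the length supplied by the surrounding context to tell them apart. The sketch below uses Python's standard `struct` module; the function names are hypothetical.

```python
import struct


def encode_float(f):
    # ">f" is big-endian IEEE 754 binary32.
    return b"\xA2" + struct.pack(">f", f)


def encode_double(d):
    # ">d" is big-endian IEEE 754 binary64.
    return b"\xA2" + struct.pack(">d", d)


def decode_float_or_double(r):
    """Return ("float" or "double", value) for a complete 0xA2 Repr."""
    if r[:1] != b"\xA2":
        raise ValueError("not a Float or Double Repr")
    body = r[1:]
    if len(body) == 4:
        return "float", struct.unpack(">f", body)[0]
    if len(body) == 8:
        return "double", struct.unpack(">d", body)[0]
    raise ValueError("Float/Double body must be 4 or 8 bytes")
```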

Embeddeds.

The Repr of an Embedded is the Repr of a Value chosen to represent the denoted object, prefixed with [0xAB].

«#!V» = [0xAB] ++ «V»

Annotations.

To annotate a Repr r with some sequence of Values [v_1, ..., v_m], surround r as follows:

[0xBF] ++ len(|r|) ++ r ++ len(|«v_1»|) ++ «v_1» ++...++ len(|«v_m»|) ++ «v_m»

The Repr r MUST NOT already have annotations; that is, it must not begin with 0xBF. The sequence [v_1, ..., v_m] MUST contain at least one Value.
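The wrapping rule and its two constraints might be implemented along these lines. The name `annotate` is hypothetical, and the annotations are assumed to be supplied as already-encoded Reprs.

```python
def encode_len(m):
    # Shortest big-endian base-128 varint; only the final byte has its upper bit set.
    out = [(m % 128) + 128]
    m //= 128
    while m:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def annotate(r, annotation_reprs):
    """Wrap Repr r with a non-empty list of annotation Reprs."""
    annotation_reprs = list(annotation_reprs)
    if not annotation_reprs:
        raise ValueError("at least one annotation is required")
    if r[:1] == b"\xBF":
        raise ValueError("r is already annotated")
    out = b"\xBF" + encode_len(len(r)) + r
    for a in annotation_reprs:
        out += encode_len(len(a)) + a
    return out
```

A caller that wants to add annotations to a Repr that already carries some would need to unwrap it first and emit a single 0xBF wrapper holding the combined list.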

Examples

Varints (length representations).

The following table illustrates varint-encoding.

Number, m     m in binary, grouped into 7-bit chunks     len(m) bytes
15            0001111                                    143
300           0000010 0101100                            2 172
1000000000    0000011 1011100 1101011 0010100 0000000    3 92 107 20 128

Atoms.

       «#f» = A0                                  «#t» = A1

   «0.123f» = A2 3D FB E7 6D                   «0.123» = A2 3F BF 7C ED 91 68 72 B0

     «-257» = A3 FE FF        «-3» = A3 FD       «128» = A3 00 80
     «-256» = A3 FF 00        «-2» = A3 FE       «255» = A3 00 FF
     «-255» = A3 FF 01        «-1» = A3 FF       «256» = A3 01 00
     «-254» = A3 FF 02         «0» = A3        «32767» = A3 7F FF
     «-129» = A3 FF 7F         «1» = A3 01     «32768» = A3 00 80 00
     «-128» = A3 80           «12» = A3 0C     «65535» = A3 00 FF FF
     «-127» = A3 81           «13» = A3 0D     «65536» = A3 01 00 00
       «-4» = A3 FC          «127» = A3 7F    «131072» = A3 02 00 00

                   «87112285931760246646623899502532662132736»
                           = A3 01 00 00 00 00 00 00 00
                                00 00 00 00 00 00 00 00
                                00 00

       «""» = A4 00                               «||» = A6
      «"a"» = A4 61 00                           «|a|» = A6 61
  «"hello"» = A4 68 65 6C 6C 6F 00           «|hello|» = A6 68 65 6C 6C 6F

                                  «#[]» = A5
                              «#[AQ==]» = A5 01
                          «#[AQIDBAU=]» = A5 01 02 03 04 05

Compounds.

             «<window 100 120 500 300>» = A7 87 A6 77696E646F77
                                             82 A3 64
                                             82 A3 78
                                             83 A3 01F4
                                             83 A3 012C

             «["zzzz(...192 zs...)zzzz"]»
                 (a length-1 sequence containing a length-200 string)
                     = A8 01CA A4 7A7A7A7A (... 192 repetitions of 7A ...)
                                  7A7A7A7A 00

             «[H, He, Li, Be, B, C, N, O, F, Ne]»
                     = A8 82A648 83A64865 83A64C69 83A64265
                          82A642 82A643 82A64E 82A64F 82A646 83A64E65

             «#{H He Li Be B C N O F Ne}»
                     = A9 82A642           (B)
                          83A64265         (Be)
                          82A643           (C)
                          82A646           (F)
                          82A648           (H)
                          83A64865         (He)
                          83A64C69         (Li)
                          82A64E           (N)
                          83A64E65         (Ne)
                          82A64F           (O)

             «{H: 1.0080f, He: 4.0026f, Li: 6.94f, Be: 9.0122f,
               B: 10.81f, C: 12.011f, N: 14.007f, O: 15.999f,
               F: 18.998f, Ne: 20.180f}»
                     = AA 82A642   85A2412CF5C3    (B: 10.81f)
                          83A64265 85A2411031F9    (Be: 9.0122f)
                          82A643   85A241402D0E    (C: 12.011f)
                          82A646   85A24197FBE7    (F: 18.998f)
                          82A648   85A23F810625    (H: 1.0080f)
                          83A64865 85A24080154D    (He: 4.0026f)
                          83A64C69 85A240DE147B    (Li: 6.94f)
                          82A64E   85A241601CAC    (N: 14.007f)
                          83A64E65 85A241A170A4    (Ne: 20.180f)
                          82A64F   85A2417FFBE7    (O: 15.999f)

             «[[H 1.0080f] [He 4.0026f] [Li 6.94f] [Be 9.0122f]
               [B 10.81f] [C 12.011f] [N 14.007f] [O 15.999f]
               [F 18.998f] [Ne 20.180f]]»
                     = A8 8A A8 82A648   85A23F810625    ([H 1.0080f])
                          8B A8 83A64865 85A24080154D    ([He 4.0026f])
                          8B A8 83A64C69 85A240DE147B    ([Li 6.94f])
                          8B A8 83A64265 85A2411031F9    ([Be 9.0122f])
                          8A A8 82A642   85A2412CF5C3    ([B 10.81f])
                          8A A8 82A643   85A241402D0E    ([C 12.011f])
                          8A A8 82A64E   85A241601CAC    ([N 14.007f])
                          8A A8 82A64F   85A2417FFBE7    ([O 15.999f])
                          8A A8 82A646   85A24197FBE7    ([F 18.998f])
                          8B A8 83A64E65 85A241A170A4    ([Ne 20.180f])

Annotations.

The Repr corresponding to textual syntax @a@b[], i.e. an empty sequence annotated with two symbols, a and b, is

«@a @b []»
  = [0xBF] ++ len(|«[]»|) ++ «[]» ++ len(|«a»|) ++ «a» ++ len(|«b»|) ++ «b»
  = [0xBF, 0x81, 0xA8, 0x82, 0xA6, 0x61, 0x82, 0xA6, 0x62]

Security Considerations

Annotations. In modes where a Value is being read while annotations are skipped, an endless sequence of annotations may give an illusion of progress.

Overlong varints. The binary format allows (but discourages) overlong varints. Because every Repr has a bound on its length from its surrounding context, this is not a denial-of-service vector per se; however, implementations may wish to consider optional restrictions on the number of redundant leading 0 bytes accepted when reading a varint.

Canonical form for cryptographic hashing and signing. No canonical textual encoding of a Value is specified. A canonical form exists for binary encoded Values, and implementations SHOULD produce canonical binary encodings by default; however, an implementation MAY permit two serializations of the same Value to yield different binary Reprs.

Acknowledgements

The exclusion of lengths from Reprs, placing lengths instead ahead of contained values in sequences, is inspired by argdata, as is the inclusion of a NUL byte in String Reprs for C interoperability.

Appendix. Autodetection of textual or binary syntax

Every tag byte in a binary Preserves Repr falls within the range [0x80, 0xBF]. These bytes, interpreted as UTF-8, are continuation bytes, and will never occur as the first byte of a UTF-8 encoded code point. This means no binary-encoded Repr can be misinterpreted as valid UTF-8.

Conversely, a UTF-8 Document must start with a valid codepoint, meaning in particular that it must not start with a byte in the range [0x80, 0xBF]. This means that no UTF-8 encoded textual-syntax Preserves Document can be misinterpreted as a binary-syntax Repr.

Examination of the top two bits of the first byte of an encoded Value gives its syntax: if the top two bits are 10, it should be interpreted as a binary-syntax Repr; otherwise, it should be interpreted as text.
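The check amounts to a single mask and compare. A sketch, with a hypothetical function name:

```python
def detect_syntax(first_byte):
    """Classify an encoded Value by the top two bits of its first byte."""
    # 0xC0 masks the top two bits; 0x80 is the bit pattern 10xxxxxx.
    return "binary" if (first_byte & 0xC0) == 0x80 else "text"
```

Every byte from 0x80 to 0xBF is classified as binary, and every other byte, including all of ASCII, as text.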

Streaming. Autodetection is still possible when streaming an undetermined number of Values across, say, a TCP/IP connection:

  • If the text syntax is to be used for the connection, simply start writing each Document one after the other. Documents for Atoms are in general ambiguous if not separated from their neighbours by whitespace; whitespace SHOULD be used to separate adjacent documents. Specifically, whitespace separating adjacent documents SHOULD be ASCII newline (10).

  • If the binary syntax is to be used for the connection, start the connection with byte 0xA8 (sequence). After the initial byte, send each value v as len(|«v»|) ++ «v». A side effect of this approach is that the entire stream, when complete, is a valid Sequence Repr.
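The binary streaming convention can be sketched with a writer and an incremental reader. The names are hypothetical, and the reader assumes a blocking file-like object whose `read(n)` returns fewer than n bytes only at end of stream (as `io.BytesIO` and buffered files do); a raw socket would need an extra read loop.

```python
import io


def encode_len(m):
    # Shortest big-endian base-128 varint; only the final byte has its upper bit set.
    out = [(m % 128) + 128]
    m //= 128
    while m:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def write_stream(out, reprs):
    out.write(b"\xA8")                      # initial sequence tag
    for r in reprs:
        out.write(encode_len(len(r)) + r)   # len(|«v»|) ++ «v»


def read_stream(inp):
    """Yield each Repr from a binary stream as it arrives."""
    if inp.read(1) != b"\xA8":
        raise ValueError("binary stream must start with 0xA8")
    while True:
        n, started = 0, False
        while True:                         # read one varint-encoded length
            byte = inp.read(1)
            if not byte:
                if started:
                    raise ValueError("stream ended inside a length")
                return                      # clean end of stream
            started = True
            n = n * 128 + (byte[0] & 0x7F)
            if byte[0] & 0x80:
                break
        r = inp.read(n)
        if len(r) != n:
            raise ValueError("stream ended inside a Repr")
        yield r
```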

Appendix. Table of tag values

(8x)  RESERVED 80-8F
(9x)  RESERVED 90-9F

 A0 - False
 A1 - True
 A2 - Float or Double (length disambiguates)
 A3 - SignedIntegers (0 may be encoded with no bytes at all)
 A4 - String (a trailing NUL is added)
 A5 - ByteString
 A6 - Symbol

 A7 - Record
 A8 - Sequence
 A9 - Set
 AA - Dictionary

 AB - Embedded

(Ax)  RESERVED AC-AF
(Bx)  RESERVED B0-BE
 BF - Annotations. {BF Lval val Lann0 ann0 Lann1 ann1 ...}

Appendix. Binary SignedInteger representation

Languages that provide fixed-width machine word types may find the following table useful in encoding and decoding binary SignedInteger values.

Integer range              Bytes required   Encoding (hex)
0                          1                A3
-2^7 ≤ n < 2^7 (i8)        2                A3 XX
-2^15 ≤ n < 2^15 (i16)     3                A3 XX XX
-2^23 ≤ n < 2^23 (i24)     4                A3 XX XX XX
-2^31 ≤ n < 2^31 (i32)     5                A3 XX XX XX XX
-2^39 ≤ n < 2^39 (i40)     6                A3 XX XX XX XX XX
-2^47 ≤ n < 2^47 (i48)     7                A3 XX XX XX XX XX XX
-2^55 ≤ n < 2^55 (i56)     8                A3 XX XX XX XX XX XX XX
-2^63 ≤ n < 2^63 (i64)     9                A3 XX XX XX XX XX XX XX XX
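The following sketch imitates, in Python, what a decoder targeting a 64-bit signed word might do: discard redundant leading bytes, reject anything still wider than eight bytes, then sign-extend. The name `decode_i64` is hypothetical.

```python
def decode_i64(r):
    """Decode a SignedInteger Repr that must fit in a signed 64-bit word."""
    if r[:1] != b"\xA3":
        raise ValueError("not a SignedInteger Repr")
    body = r[1:]
    # Drop redundant leading bytes: a 0x00 before a byte with a clear top
    # bit, or a 0xFF before a byte with a set top bit, changes nothing.
    while len(body) > 1 and (
        (body[0] == 0x00 and body[1] < 0x80)
        or (body[0] == 0xFF and body[1] >= 0x80)
    ):
        body = body[1:]
    if len(body) > 8:
        raise OverflowError("value does not fit in 64 bits")
    n = 0
    for b in body:
        n = (n << 8) | b                # accumulate big-endian bytes
    if body and body[0] & 0x80:
        n -= 1 << (8 * len(body))       # sign bit set: sign-extend
    return n
```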

Notes


  1. Argdata's length representation is very close to Variable-length quantity (VLQ) encoding, differing only in the flipped interpretation of the high bit of each byte. It is big-endian, unlike LEB128 encoding (as used by Google in protobufs). ↩︎

  2. Implementation note. The spec permits overlong length encodings to reduce wasted activity in resource-constrained situations. If an implementation is in anything other than a very low-level language, it is likely to be able to use IOList-style data structures to avoid unnecessary copying. ↩︎

  3. In the BitTorrent encoding format, bencoding, dictionary key/value pairs must be sorted by key. This is a necessary step for ensuring serialization of Values is canonical. We do not require that key/value pairs (or set elements) be in sorted order for serialized Values; however, a canonical form for Reprs does exist where a sorted ordering is required. ↩︎

  4. It's important to note that the sort ordering for writing out set elements and dictionary key/value pairs is not the same as the sort ordering implied by the semantic ordering of those elements or keys. For example, the Repr of -1 (A3 FF) sorts lexicographically after both the Repr of 0 (A3) and the Repr of 1 (A3 01), despite -1 being semantically less than both.

    Rationale. This is for ease-of-implementation reasons: not all languages can easily represent sorted sets or sorted dictionaries, but encoding and then sorting byte strings is much more likely to be within easy reach. ↩︎

  5. The value 0 needs zero bytes to identify the value, so intbytes(0) can be the empty byte string. Non-zero values need at least one byte. ↩︎

  6. Implementation note. The spec permits overlong SignedInteger encodings to allow e.g. construction of Reprs by filling in partially-completed templates, which can be useful in resource-constrained situations. ↩︎

  7. Some care must still be taken when passing String Reprs directly to a C-style ABI, since Strings may contain the zero Unicode code point, which C library routines will usually misinterpret as an end-of-string marker. ↩︎