---
no_site_title: true
title: "Preserves: Binary Syntax"
---

Tony Garnock-Jones <tonyg@leastfixedpoint.com>  
{{ site.version_date }}. Version {{ site.version }}.

  [LEB128]: https://en.wikipedia.org/wiki/LEB128
  [argdata]: https://github.com/NuxiNL/argdata
  [canonical]: canonical-binary.html
  [google-varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
  [vlq]: https://en.wikipedia.org/wiki/Variable-length_quantity

*Preserves* is a data model, with associated serialization formats. This
document defines one of those formats: a binary syntax for `Value`s from
the [Preserves data model](preserves.html) that is easy for computer
software to read and write. An [equivalent human-readable text
syntax](preserves-text.html) also exists.

## Machine-Oriented Binary Syntax

A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
For a value `v`, we write `«v»` for the `Repr` of `v`.

### Type and Length representation.

Each `Repr` starts with a tag byte, describing the kind of information
represented.

However, inspired by [argdata][], a `Repr` does *not* describe its own
length. Instead, the surrounding context must supply the length of the
`Repr`.

As a consequence, `Repr`s for `Compound` values store the lengths of
their contained values. Each contained `Value` is represented as a
length in bytes followed by its own `Repr`.

<a id="varint"></a> Each length is stored as an [argdata][]-compatible
big-endian base 128 *varint*.[^see-also-leb128] Each byte of a varint
stores seven bits of the length. All bytes have a clear upper bit,
except the final byte, which has the upper bit set. We write
`len(m)` for the varint-encoding of a non-negative integer `m`,
defined recursively as follows, where `/` denotes integer division
(rounding down) and `%` the remainder:

    len(m) = e(m, 128)
           where e(v, d) = [v + d]                           if v < 128
                           e(v / 128, 0) ++ [(v % 128) + d]  if v ≥ 128

  [^see-also-leb128]: Argdata's length representation is very close to
    [Variable-length quantity (VLQ)][VLQ] encoding, differing only in
    the flipped interpretation of the high bit of each byte. It is
    big-endian, unlike [LEB128][] encoding ([as used by
    Google][google-varint] in protobufs).

We write `len(|r|)` for the varint-encoding of the length of `Repr` `r`.

The following table illustrates varint-encoding.

| Number, `m` | `m` in binary, grouped into 7-bit chunks  | `len(m)` bytes  |
|-------------|-------------------------------------------|-----------------|
| 15          | `0001111`                                 | 143             |
| 300         | `0000010 0101100`                         | 2 172           |
| 1000000000  | `0000011 1011100 1101011 0010100 0000000` | 3 92 107 20 128 |

It is an error for a varint-encoded `m` in a `Repr` to be anything other
than the unique shortest encoding for that `m`. That is, a
varint-encoding of `m` *MUST NOT* start with a `0` byte.
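
The recursive definition and the shortest-encoding rule can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation; the names `encode_varint` and `decode_varint` are hypothetical and are not taken from any Preserves library.

```python
from typing import Tuple


def encode_varint(m: int) -> bytes:
    """Encode a non-negative integer as len(m): big-endian, base 128."""
    if m < 0:
        raise ValueError("lengths must be non-negative")
    out = [(m % 128) + 128]      # final byte: upper bit set
    m //= 128
    while m > 0:
        out.append(m % 128)      # earlier bytes: upper bit clear
        m //= 128
    return bytes(reversed(out))  # most significant chunk first


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, position after the varint), rejecting non-shortest forms."""
    if pos < len(data) and data[pos] == 0:
        raise ValueError("varint must not start with a 0 byte")
    value = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        value = value * 128 + (byte & 0x7F)
        if byte & 0x80:          # upper bit set marks the final byte
            return value, pos
```

Applied to the rows of the table above, `encode_varint(300)` yields the bytes 2 and 172, and `decode_varint` recovers 300 from them.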

### Records, Sequences, Sets and Dictionaries.

          «<L F_1...F_m>» = [0xA7] ++ seq(«L», «F_1», ..., «F_m»)
            «[X_1...X_m]» = [0xA8] ++ seq(«X_1», ..., «X_m»)
           «#{E_1...E_m}» = [0xA9] ++ seq(«E_1», ..., «E_m»)
    «{K_1:V_1...K_m:V_m}» = [0xAA] ++ seq(«K_1», «V_1», ..., «K_m», «V_m»)

       seq(R_1, ..., R_m) = len(|R_1|) ++ R_1 ++...++ len(|R_m|) ++ R_m
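
As a rough sketch of how `seq` composes with the tag bytes, the following Python builds `Record` and `Sequence` `Repr`s from already-encoded children and splits a compound body back into its children. All function names here are hypothetical, chosen only for illustration.

```python
from typing import List


def encode_varint(m: int) -> bytes:
    """len(m): big-endian base-128 varint; only the final byte has its upper bit set."""
    out = [(m % 128) + 128]
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def seq(*reprs: bytes) -> bytes:
    """seq(R_1, ..., R_m): each Repr preceded by its varint-encoded byte length."""
    return b"".join(encode_varint(len(r)) + r for r in reprs)


def encode_record(label: bytes, *fields: bytes) -> bytes:
    return b"\xA7" + seq(label, *fields)


def encode_sequence(*items: bytes) -> bytes:
    return b"\xA8" + seq(*items)


def split_seq(body: bytes) -> List[bytes]:
    """Inverse of seq: split the bytes following a compound's tag into child Reprs."""
    children = []
    pos = 0
    while pos < len(body):
        if body[pos] == 0:
            raise ValueError("length is not in its shortest encoding")
        length = 0
        while True:                          # read one varint
            if pos >= len(body):
                raise ValueError("truncated length")
            byte = body[pos]
            pos += 1
            length = length * 128 + (byte & 0x7F)
            if byte & 0x80:                  # upper bit marks the final byte
                break
        if pos + length > len(body):
            raise ValueError("child Repr runs past the end of its container")
        children.append(body[pos:pos + length])
        pos += length
    return children
```

Note that `split_seq` needs the total length of the body from its caller, which mirrors the rule that a `Repr` does not describe its own length.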

There is *no* ordering requirement on the `E_i` elements or
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
addition, implementations *SHOULD* default to writing set elements and
dictionary key/value pairs in order sorted lexicographically by their
`Repr`s[^not-sorted-semantically], and *MAY* offer the option of
serializing in some other implementation-defined order.

  [^no-sorting-rationale]: In the BitTorrent encoding format,
    [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
    dictionary key/value pairs must be sorted by key. This is a
    necessary step for ensuring serialization of `Value`s is
    canonical. We do not require that key/value pairs (or set
    elements) be in sorted order for serialized `Value`s; however, a
    [canonical form][canonical] for `Repr`s does exist where a sorted
    ordering is required.

  [^not-sorted-semantically]: It's important to note that the sort
    ordering for writing out set elements and dictionary key/value
    pairs is *not* the same as the sort ordering implied by the
    semantic ordering of those elements or keys. For example, the
    `Repr` of a negative number very far from zero will start with a
    byte that is *greater* than the byte which starts the `Repr` of
    zero, making it sort lexicographically later by `Repr`, despite
    being semantically *less than* zero.

    **Rationale**. This is for ease-of-implementation reasons: not all
    languages can easily represent sorted sets or sorted dictionaries,
    but encoding and then sorting byte strings is much more likely to
    be within easy reach.
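
The recommended default ordering can be sketched for sets as follows. The function name `encode_set` is hypothetical, and the sketch assumes the caller passes already-encoded `Repr`s of pairwise distinct elements. Python compares `bytes` objects lexicographically, which is exactly the ordering described above.

```python
from typing import Iterable


def encode_varint(m: int) -> bytes:
    """len(m): big-endian base-128 varint; only the final byte has its upper bit set."""
    out = [(m % 128) + 128]
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def encode_set(element_reprs: Iterable[bytes]) -> bytes:
    """Encode a Set, writing elements sorted lexicographically by their Reprs."""
    ordered = sorted(element_reprs)   # bytewise comparison, not semantic order
    return b"\xA9" + b"".join(encode_varint(len(r)) + r for r in ordered)
```

For the integers -1, 0 and 1, whose `Repr`s are `A3 FF`, `A3` and `A3 01`, this writes the elements in the order 0, 1, -1, illustrating that the `Repr` ordering differs from the numeric ordering.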

### SignedIntegers.

    «x» when x ∈ SignedInteger = [0xA3] ++ intbytes(x)

The function `intbytes(x)` gives the big-endian two's-complement binary
representation of `x`, taking exactly as many whole bytes as needed to
unambiguously identify the value and its sign. As a special case,
`intbytes(0)` is the empty byte sequence. The most-significant bit in
the first byte in `intbytes(x)` (for `x`≠0) is the sign
bit.[^zero-intbytes] Every `SignedInteger` *MUST* be represented with
its shortest possible encoding.

For example,

      «87112285931760246646623899502532662132736»
        = A3 01 00 00 00 00 00 00 00
             00 00 00 00 00 00 00 00
             00 00

      «-257» = A3 FE FF        «-3» = A3 FD       «128» = A3 00 80
      «-256» = A3 FF 00        «-2» = A3 FE       «255» = A3 00 FF
      «-255» = A3 FF 01        «-1» = A3 FF       «256» = A3 01 00
      «-254» = A3 FF 02         «0» = A3        «32767» = A3 7F FF
      «-129» = A3 FF 7F         «1» = A3 01     «32768» = A3 00 80 00
      «-128» = A3 80           «12» = A3 0C     «65535» = A3 00 FF FF
      «-127» = A3 81           «13» = A3 0D     «65536» = A3 01 00 00
        «-4» = A3 FC          «127» = A3 7F    «131072» = A3 02 00 00

  [^zero-intbytes]: The value 0 needs zero bytes to identify the
    value, so `intbytes(0)` is the empty byte string. Non-zero values
    need at least one byte.
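
A possible Python rendering of `intbytes` and its inverse is sketched below. The helper names are hypothetical; the byte-count calculation relies on Python's arbitrary-precision `int.bit_length`, and the decoder rejects encodings that are longer than necessary.

```python
def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement form of x; empty for zero."""
    if x == 0:
        return b""
    # For negative x, ~x is non-negative and needs the same number of value bits.
    magnitude_bits = (x if x >= 0 else ~x).bit_length()
    n = magnitude_bits // 8 + 1          # leaves room for the sign bit
    return x.to_bytes(n, "big", signed=True)


def encode_signed_integer(x: int) -> bytes:
    return b"\xA3" + intbytes(x)


def decode_signed_integer(repr_bytes: bytes) -> int:
    if repr_bytes[:1] != b"\xA3":
        raise ValueError("not a SignedInteger Repr")
    body = repr_bytes[1:]
    if body == b"\x00":
        raise ValueError("zero must be encoded with no bytes after the tag")
    if len(body) >= 2:
        # A leading 00 or FF is redundant when the next byte already carries
        # the same sign bit.
        if (body[0] == 0x00 and body[1] < 0x80) or (body[0] == 0xFF and body[1] >= 0x80):
            raise ValueError("SignedInteger is not in its shortest encoding")
    return int.from_bytes(body, "big", signed=True)
```

The examples listed above can be used as test vectors: `encode_signed_integer(-257)` gives `A3 FE FF` and `encode_signed_integer(128)` gives `A3 00 80`.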

### Strings, ByteStrings and Symbols.

Syntax for these three types varies only in the tag used. For `String`
and `Symbol`, the data following the tag is a UTF-8 encoding of the
`Value`'s code points, while for `ByteString` it is the raw data
contained within the `Value` unmodified.

    «S» = [0xA4] ++ utf8(S)  if S ∈ String
          [0xA5] ++ S        if S ∈ ByteString
          [0xA6] ++ utf8(S)  if S ∈ Symbol
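
In Python, the three cases might look like this. It is only a sketch: the function names are hypothetical, and Python's `str` stands in for both `String` and `Symbol` contents.

```python
def encode_string(s: str) -> bytes:
    return b"\xA4" + s.encode("utf-8")      # tag, then UTF-8 of the code points


def encode_bytestring(data: bytes) -> bytes:
    return b"\xA5" + bytes(data)            # tag, then the raw bytes unmodified


def encode_symbol(name: str) -> bytes:
    return b"\xA6" + name.encode("utf-8")   # same payload as String, different tag
```

No length or terminator is written: as with every other `Repr`, the enclosing context supplies the length.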

### Booleans.

    «#f» = [0xA0]
    «#t» = [0xA1]

### Floats and Doubles.

    «F» when F ∈ Float  = [0xA2] ++ binary32(F)
    «D» when D ∈ Double = [0xA2] ++ binary64(D)

The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
8-byte IEEE 754 binary representations of `F` and `D`, respectively.
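
Because both types share the tag `0xA2`, a reader has to tell them apart by the length of the `Repr`. The following sketch uses Python's standard `struct` module for the big-endian IEEE 754 conversions; the function names are hypothetical.

```python
import struct
from typing import Tuple


def encode_float(f: float) -> bytes:
    return b"\xA2" + struct.pack(">f", f)   # big-endian binary32, 4 bytes


def encode_double(d: float) -> bytes:
    return b"\xA2" + struct.pack(">d", d)   # big-endian binary64, 8 bytes


def decode_float_or_double(repr_bytes: bytes) -> Tuple[str, float]:
    """Return ("Float" | "Double", value), chosen by the length of the body."""
    if repr_bytes[:1] != b"\xA2":
        raise ValueError("not a Float or Double Repr")
    body = repr_bytes[1:]
    if len(body) == 4:
        return "Float", struct.unpack(">f", body)[0]
    if len(body) == 8:
        return "Double", struct.unpack(">d", body)[0]
    raise ValueError("Float/Double body must be 4 or 8 bytes long")
```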

### Embeddeds.

The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to
represent the denoted object, prefixed with `[0xBF]`.

    «#!V» = [0xBF] ++ «V»

### Annotations.

To annotate a `Repr` `r` with some sequence of `Value`s `[v_1, ...,
v_m]`, surround `r` as follows:

    [0xBE] ++ len(|r|) ++ r ++ len(|«v_1»|) ++ «v_1» ++...++ len(|«v_m»|) ++ «v_m»

The `Repr` `r` *MUST NOT* already have annotations; that is, it must not begin with `0xBE`.

For example, the `Repr` corresponding to textual syntax `@a@b[]`, i.e.
an empty sequence annotated with two symbols, `a` and `b`, is

    «@a @b []»
      = [0xBE] ++ len(|«[]»|) ++ «[]» ++ len(|«a»|) ++ «a» ++ len(|«b»|) ++ «b»
      = [0xBE, 0x81, 0xA8, 0x82, 0xA6, 0x61, 0x82, 0xA6, 0x62]
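
The worked example can be reproduced with a short Python sketch. The name `annotate` is hypothetical; it takes the `Repr` to annotate and the `Repr`s of the annotation `Value`s, and enforces the rule that the annotated `Repr` must not itself begin with `0xBE`.

```python
from typing import Sequence


def encode_varint(m: int) -> bytes:
    """len(m): big-endian base-128 varint; only the final byte has its upper bit set."""
    out = [(m % 128) + 128]
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def annotate(r: bytes, annotation_reprs: Sequence[bytes]) -> bytes:
    """Wrap Repr r with the given annotation Reprs."""
    if r[:1] == b"\xBE":
        raise ValueError("Repr is already annotated")
    parts = [b"\xBE", encode_varint(len(r)), r]
    for a in annotation_reprs:
        parts.append(encode_varint(len(a)) + a)   # each annotation is length-prefixed
    return b"".join(parts)
```

Calling `annotate(b"\xA8", [b"\xA6a", b"\xA6b"])` produces the nine bytes shown above.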

## Security Considerations

**Annotations.** In modes where a `Value` is being read while
annotations are skipped, an endless sequence of annotations may give an
illusion of progress.

**Canonical form for cryptographic hashing and signing.** No canonical
textual encoding of a `Value` is specified. A
[canonical form][canonical] exists for binary encoded `Value`s, and
implementations *SHOULD* produce canonical binary encodings by
default; however, an implementation *MAY* permit two serializations of
the same `Value` to yield different binary `Repr`s.

## Acknowledgements

The exclusion of lengths from `Repr`s, placing lengths instead ahead of
contained values in sequences, is inspired by [argdata][].

## Appendix. Autodetection of textual or binary syntax

Every tag byte in a binary Preserves `Repr` falls within the range
[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
bytes*, and will never occur as the first byte of a UTF-8 encoded code
point. This means no binary-encoded `Repr` can be misinterpreted as
valid UTF-8.

Conversely, a UTF-8 `Document` must start with a valid code point,
meaning in particular that it must not start with a byte in the range
[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
Preserves `Document` can be misinterpreted as a binary-syntax `Repr`.

Examination of the top two bits of the first byte of an encoded `Value`
gives its syntax: if the top two bits are `10`, it should be interpreted
as a binary-syntax `Repr`; otherwise, it should be interpreted as text.
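
The top-two-bits test amounts to a single mask and compare. A minimal Python sketch, with a hypothetical function name, might be:

```python
def detect_syntax(encoded: bytes) -> str:
    """Classify an encoded Value as "binary" or "text" from its first byte."""
    if not encoded:
        raise ValueError("cannot detect the syntax of empty input")
    # 0xC0 selects the top two bits; 0x80 is the bit pattern 10xxxxxx.
    return "binary" if (encoded[0] & 0xC0) == 0x80 else "text"
```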

**Streaming.** Autodetection is still possible when streaming an
undetermined number of `Value`s across, say, a TCP/IP connection:

 - If the text syntax is to be used for the connection, simply start
   writing each `Document` one after the other. Documents for `Atom`s
   *MUST* be separated from their neighbours by whitespace; in general,
   whitespace *SHOULD* be used to separate adjacent documents.
   Specifically, whitespace separating adjacent documents *SHOULD* be
   ASCII newline (10).

 - If the binary syntax is to be used for the connection, start the
   connection with byte `0xA8` (sequence). After the initial byte, send
   each value `v` as `len(|«v»|) ++ «v»`. A side effect of this approach
   is that the entire stream, when complete, is a valid `Sequence`
   `Repr`.
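
The binary streaming convention could be sketched as a Python generator that yields the chunks to write to the connection. The name `binary_stream` is hypothetical, and the sketch assumes the values have already been encoded to `Repr`s.

```python
from typing import Iterable, Iterator


def encode_varint(m: int) -> bytes:
    """len(m): big-endian base-128 varint; only the final byte has its upper bit set."""
    out = [(m % 128) + 128]
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def binary_stream(reprs: Iterable[bytes]) -> Iterator[bytes]:
    """Yield the byte chunks for a binary-syntax stream of Reprs."""
    yield b"\xA8"                          # initial byte: Sequence tag
    for r in reprs:
        yield encode_varint(len(r)) + r    # each value as len(|«v»|) ++ «v»
```

Concatenating every chunk of a finished stream gives the same bytes as encoding the values as a single `Sequence`.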

## Appendix. Table of tag values

    (8x)  RESERVED 80-8F
    (9x)  RESERVED 90-9F

     A0 - False
     A1 - True
     A2 - Float or Double (length disambiguates)
     A3 - SignedInteger (0 is encoded as the tag alone, with no further bytes)
     A4 - String (no trailing NUL is added)
     A5 - ByteString
     A6 - Symbol

     A7 - Record
     A8 - Sequence
     A9 - Set
     AA - Dictionary

    (Ax)  RESERVED AB-AF

    (Bx)  RESERVED B0-BD
     BE - Annotations. {BE Lval val Lann0 ann0 Lann1 ann1 ...}
     BF - Embedded

## Appendix. Binary SignedInteger representation

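Languages that provide fixed-width machine word types may find the
following table useful in encoding and decoding binary `SignedInteger`
values. Byte counts include the `A3` tag byte, and because every
`SignedInteger` *MUST* use its shortest possible encoding, each value
uses the first row whose range contains it.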

| Integer range                              | Bytes required | Encoding (hex)                               |
| ---                                        | ---            | ---                                          |
| 0                                          | 1              | `A3`                                         |
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8)    | 2              | `A3` `XX`                                    |
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3              | `A3` `XX` `XX`                               |
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4              | `A3` `XX` `XX` `XX`                          |
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5              | `A3` `XX` `XX` `XX` `XX`                     |
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6              | `A3` `XX` `XX` `XX` `XX` `XX`                |
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7              | `A3` `XX` `XX` `XX` `XX` `XX` `XX`           |
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8              | `A3` `XX` `XX` `XX` `XX` `XX` `XX` `XX`      |
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9              | `A3` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |

<!-- Heading to visually offset the footnotes from the main document: -->
## Notes