237 lines
10 KiB
Markdown
237 lines
10 KiB
Markdown
---
|
|
no_site_title: true
|
|
title: "Preserves: Zero-copy Binary Syntax"
|
|
---
|
|
|
|
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
|
|
{{ site.version_date }}. Version {{ site.version }}.
|
|
|
|
*Preserves* is a data model, with associated serialization formats. This
|
|
document defines one of those formats: a binary syntax for `Value`s from
|
|
the [Preserves data model](preserves.html) that avoids, in many cases, use
|
|
of intermediate data structures during reading and writing. This makes it
|
|
suitable for use for representation of very large values whose
|
|
fully-decoded representations may not fit in working memory.
|
|
|
|
## Zero-Copy Binary Syntax
|
|
|
|
A `Buf` is a zero-copy syntax encoding, or representation, of a
|
|
non-immediate `Value`. A `Ref` is either a type-tagged representation of a
|
|
small immediate `Value` or a type-tagged pointer to a `Buf`.
|
|
|
|
Each `Ref` is a 64-bit unsigned value. Its tag appears in the low 4 bits.
|
|
The remaining 60 bits encode either an unsigned offset pointing to a
|
|
previously-encoded `Buf`, or an immediate value. Pointers always point
|
|
backwards to earlier positions.
|
|
|
|
Each `Buf` is prefixed with a 64-bit payload length, counted in units of
|
|
bytes, and is zero-padded to the nearest multiple of 16 bytes. Neither the
|
|
length of the padding nor the length of the length itself are included in
|
|
the length.
|
|
|
|
Offsets in pointer `Ref`s are counted in 16-byte units, measuring from the
|
|
beginning of the length indicator of the `Buf` in which the `Ref` appears.
|
|
A zero offset is special: it denotes an *empty value* of the type
|
|
associated with the tag in the `Ref`.
|
|
|
|
All multi-byte quantities are encoded using little-endian byte order.
|
|
|
|
### Header.
|
|
|
|
Because `Ref`s are typed, but `Buf`s are not, the outermost `Value` in e.g.
|
|
a file or network stream is always encoded preceded by a special header.
|
|
|
|
Offset Length Description
|
|
-------- ------ -----------
|
|
00000000 1 Marker byte 0xFF
|
|
00000001 1 Version number 0x00
|
|
00000002 6 Reserved, 0x00
|
|
00000008 8 Special Ref
|
|
00000010 8 Length ("n") of encoded data, in bytes
|
|
00000018 n Encoded data
|
|
- - Zero-padding to next 16-byte boundary
|
|
|
|
The `Ref` in the header at offset 8 is special.
|
|
|
|
If it encodes an immediate `Value`, that `Value` is the encoded value, and
|
|
the length field and encoded data are omitted. The entire encoded value is
|
|
exactly 16 bytes long in this case.
|
|
|
|
However, if the special `Ref` is an encoding of a pointer to a `Buf`, the
|
|
offset is interpreted as counting back from the very end of the padding at
|
|
the end of the encoded data. The entire encoded value is the length of the
|
|
encoded data, plus 24, rounded up to the next multiple of 16.
|
|
|
|
Either way, the tag on the special `Ref` is the type of the encoded value.
|
|
|
|
### Tags and Refs.
|
|
|
|
The following table maps bit values in the low (leftmost) byte of a `Ref`
|
|
to their interpretation. In interpretations including a three-bit `nnn`
|
|
value, the `nnn` bits specify the length of the used portion of the
|
|
remaining 56 bits of the `Ref`, counted in bytes, starting from the
|
|
following byte, with value `000` disallowed.
|
|
|
|
Bit number Meaning
|
|
7654 3210
|
|
--------- --- -------------------------------------------------------------
|
|
0000 0000 IMM Boolean; next byte = 0 means false; 1 means true.
|
|
...1 0000 IMM reserved
|
|
nnn0 0001 IMM Float: nnn must be 100, meaning a 32-bit IEEE754 value.
|
|
nnn1 0001 IMM ByteString
|
|
nnn0 0010 IMM String
|
|
nnn1 0010 IMM Symbol
|
|
|
|
.... 0011 IMM SignedInteger between -2^59 and (2^59)-1, inclusive
|
|
|
|
.... 0100 PTR SignedInteger outside the immediate range
|
|
.... 0101 PTR String
|
|
.... 0110 PTR ByteString
|
|
.... 0111 PTR Symbol
|
|
.... 1000 PTR Record
|
|
.... 1001 PTR Sequence
|
|
.... 1010 PTR Set
|
|
.... 1011 PTR Dictionary
|
|
.... 1100 PTR Embedded
|
|
.... 1101 PTR Double: length of pointed-to Buf must be 8
|
|
.... 1110 reserved
|
|
.... 1111 reserved
|
|
|
|
### Records, Sequences, Sets and Dictionaries.
|
|
|
|
Offset Length Description
|
|
-------- ------ -----------
|
|
00000000 8 n*8: length of following sequence of n Refs, in bytes
|
|
00000008 8 Ref 0
|
|
... ... ...
|
|
n*8 8 Ref n-1
|
|
(n+1)*8 8 Padding, only if n is even
|
|
|
|
Each compound datum is represented as a `Buf` containing a sequence of
|
|
`Ref`s representing the contained `Value`s. Each `Record`'s sequence
|
|
represents the label, followed by the fields in order. Each `Sequence`'s
|
|
representation is just its contained values in order. `Set`s are ordered
|
|
arbitrarily into a sequence. The key-value pairs in a `Dictionary` are
|
|
ordered arbitrarily, alternating between keys and their matching values.
|
|
|
|
There is *no* ordering requirement on the elements of `Set`s or the
|
|
key-value pairs in a `Dictionary`. They may appear in any order. However,
|
|
the elements and keys *MUST* be pairwise distinct according to the
|
|
[Preserves equivalence relation](preserves.html#equivalence).
|
|
|
|
Empty structures are represented using a `Ref` with a zero offset and the
|
|
appropriate tag.
|
|
|
|
### SignedIntegers.
|
|
|
|
Integers between -2<sup>59</sup> and 2<sup>59</sup>-1, inclusive, are
|
|
represented as immediate values in a `Ref` with tag 3. Integers outside
|
|
this range are represented with a `Ref` with tag 4 pointing to a `Buf`
|
|
containing exactly as many 64-bit words as needed to unambiguously identify
|
|
the value and its sign, in little-endian byte and word ordering. Every
|
|
`SignedInteger` *MUST* be represented with its shortest possible encoding.
|
|
Zero is represented using tag 3; use of tag 4 with a zero offset is
|
|
forbidden.
|
|
|
|
For example,
|
|
|
|
Number (decimal) Ref (64-bit) Buf (hex bytes)
|
|
----------------------------------------- ---------------- ----------------
|
|
-576460752303423488 8000000000000003 -
|
|
-257 FFFFFFFFFFFFEFF3 -
|
|
-1 FFFFFFFFFFFFFFF3 -
|
|
0 0000000000000003 -
|
|
1 0000000000000013 -
|
|
257 0000000000001013 -
|
|
576460752303423487 7FFFFFFFFFFFFFF3 -
|
|
|
|
1000000000000000000000000000000 ...............4 1000000000000000
|
|
00000040EAED7446
|
|
D09C2C9F0C000000
|
|
0000000000000000
|
|
|
|
-1000000000000000000000000000000 ...............4 1000000000000000
|
|
000000C015128BB9
|
|
2F63D360F3FFFFFF
|
|
0000000000000000
|
|
|
|
87112285931760246646623899502532662132736 ...............4 1800000000000000
|
|
0000000000000000
|
|
0000000000000000
|
|
0001000000000000
|
|
|
|
### Strings, ByteStrings and Symbols.
|
|
|
|
Syntax for these three types varies only in the tag used. For `String` and
|
|
`Symbol`, the encoded data is a UTF-8 encoding of the `Value`, while for
|
|
`ByteString` it is the raw data contained within the `Value` unmodified.
|
|
|
|
Encoded data of length between 1 and 7 bytes is represented as an immediate
|
|
`Ref` where the low *five* bits are `00010` (`String`), `10001`
|
|
(`ByteString`), or `10010` (`Symbol`). The upper three bits of the low byte
|
|
of the `Ref` give the length in bytes. The remaining bytes in the `Ref` are
|
|
the data, in memory order.
|
|
|
|
`Ref` tags 5, 6, and 7 are pointers to `String`, `ByteString` and `Symbol`
|
|
`Buf`s, respectively. Offset zero signifies zero-length data; otherwise,
|
|
the pointed-to `Buf` contains the bytes of encoded data.
|
|
|
|
Empty values (length 0) *MUST* be encoded using pointer `Ref` form with
|
|
special offset zero.
|
|
|
|
For example,
|
|
|
|
Value Ref (64-bit) Buf (hex bytes)
|
|
----------------------------------------- ---------------- ----------------
|
|
"" 0000000000000005 -
|
|
#"" 0000000000000006 -
|
|
|| 0000000000000007 -
|
|
"Hello" 00006F6C6C6548A2 -
|
|
#"a\0b" 0000000062006171 -
|
|
xyz 000000007A797872 -
|
|
|
|
"Hello, world!" ...............5 0D00000000000000
|
|
48656C6C6F2C2077
|
|
6F726C6421000000
|
|
0000000000000000
|
|
|
|
### Booleans.
|
|
|
|
Value Ref (64-bit) Buf (hex bytes)
|
|
----------------------------------------- ---------------- ----------------
|
|
#f 0000000000000000 -
|
|
#t 0000000000000100 -
|
|
|
|
### Floats and Doubles.
|
|
|
|
4-byte (32-bit) IEEE 754 `Float`s are encoded within immediate `Ref`s with
|
|
low byte equal to 0x81. The next four lowest bytes are the 4-byte,
|
|
little-endian binary representation of the floating-point value, and the
|
|
upper three bytes of the `Ref` are unused.
|
|
|
|
8-byte (64-bit) IEEE 754 `Double`s are encoded into a `Buf`, pointed to by
|
|
a `Ref` with tag 13. The length of the `Buf` must be 8 bytes.
|
|
|
|
((This is a very sparse encoding for `Double`s! Each `Double` takes up 24
|
|
bytes split across the `Buf` and `Ref`.))
|
|
|
|
### Embeddeds.
|
|
|
|
To encode an `Embedded`, first choose a `Value` to represent the denoted
|
|
object, and encode that, producing a `Ref`. Place that ref in a `Buf` all
|
|
of its own (with length 8). Finally, point to the `Buf` with a `Ref` with
|
|
tag 12.
|
|
|
|
### Annotations.
|
|
|
|
((Not sure: put them as a trailer after a Header?))
|
|
|
|
## Security Considerations
|
|
|
|
((TBD))
|
|
|
|
## Appendix. Autodetection of textual or binary syntax
|
|
|
|
The first byte of a Header is 0xFF, which may not appear in any UTF-8
|
|
string. ((...))
|