preserves/preserves-zerocopy.md

237 lines
10 KiB
Markdown
Raw Permalink Normal View History

2023-06-23 22:49:12 +00:00
---
no_site_title: true
title: "Preserves: Zero-copy Binary Syntax"
---
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
{{ site.version_date }}. Version {{ site.version }}.
*Preserves* is a data model, with associated serialization formats. This
document defines one of those formats: a binary syntax for `Value`s from
the [Preserves data model](preserves.html) that avoids, in many cases, use
of intermediate data structures during reading and writing. This makes it
suitable for use for representation of very large values whose
fully-decoded representations may not fit in working memory.
## Zero-Copy Binary Syntax
A `Buf` is a zero-copy syntax encoding, or representation, of a
non-immediate `Value`. A `Ref` is either a type-tagged representation of a
small immediate `Value` or a type-tagged pointer to a `Buf`.
Each `Ref` is a 64-bit unsigned value. Its tag appears in the low 4 bits.
The remaining 60 bits encode either an unsigned offset pointing to a
previously-encoded `Buf`, or an immediate value. Pointers always point
backwards to earlier positions.
Each `Buf` is prefixed with a 64-bit payload length, counted in units of
bytes, and is zero-padded to the nearest multiple of 16 bytes. Neither the
length of the padding nor the length of the length itself are included in
the length.
Offsets in pointer `Ref`s are counted in 16-byte units, measuring from the
beginning of the length indicator of the `Buf` in which the `Ref` appears.
A zero offset is special: it denotes an *empty value* of the type
associated with the tag in the `Ref`.
All multi-byte quantities are encoded using little-endian byte order.
### Header.
Because `Ref`s are typed, but `Buf`s are not, the outermost `Value` in e.g.
a file or network stream is always encoded preceded by a special header.
Offset Length Description
-------- ------ -----------
00000000 1 Marker byte 0xFF
00000001 1 Version number 0x00
00000002 6 Reserved, 0x00
00000008 8 Special Ref
00000010 8 Length ("n") of encoded data, in bytes
00000018 n Encoded data
- - Zero-padding to next 16-byte boundary
The `Ref` in the header at offset 8 is special.
If it encodes an immediate `Value`, that `Value` is the encoded value, and
the length field and encoded data are omitted. The entire encoded value is
exactly 16 bytes long in this case.
However, if the special `Ref` is an encoding of a pointer to a `Buf`, the
offset is interpreted as counting back from the very end of the padding at
the end of the encoded data. The entire encoded value is the length of the
encoded data, plus 24, rounded up to the next multiple of 16.
Either way, the tag on the special `Ref` is the type of the encoded value.
### Tags and Refs.
2023-06-27 20:19:52 +00:00
The following table maps bit values in the low (leftmost) byte of a `Ref`
to their interpretation. In interpretations including a three-bit `nnn`
value, the `nnn` bits specify the length of the used portion of the
remaining 56 bits of the `Ref`, counted in bytes, starting from the
following byte, with value `000` disallowed.
Bit number Meaning
7654 3210
--------- --- -------------------------------------------------------------
0000 0000 IMM Boolean; next byte = 0 means false; 1 means true.
...1 0000 IMM reserved
nnn0 0001 IMM Float: nnn must be 100, meaning a 32-bit IEEE754 value.
nnn1 0001 IMM ByteString
nnn0 0010 IMM String
nnn1 0010 IMM Symbol
.... 0011 IMM SignedInteger between -2^59 and (2^59)-1, inclusive
.... 0100 PTR SignedInteger outside the immediate range
.... 0101 PTR String
.... 0110 PTR ByteString
.... 0111 PTR Symbol
.... 1000 PTR Record
.... 1001 PTR Sequence
.... 1010 PTR Set
.... 1011 PTR Dictionary
.... 1100 PTR Embedded
.... 1101 PTR Double: length of pointed-to Buf must be 8
.... 1110 reserved
.... 1111 reserved
2023-06-23 22:49:12 +00:00
### Records, Sequences, Sets and Dictionaries.
Offset Length Description
-------- ------ -----------
00000000 8 n*8: length of following sequence of n Refs, in bytes
00000008 8 Ref 0
... ... ...
n*8 8 Ref n-1
(n+1)*8 8 Padding, only if n is even
2023-06-27 20:19:52 +00:00
Each compound datum is represented as a `Buf` containing a sequence of
`Ref`s representing the contained `Value`s. Each `Record`'s sequence
represents the label, followed by the fields in order. Each `Sequence`'s
representation is just its contained values in order. `Set`s are ordered
arbitrarily into a sequence. The key-value pairs in a `Dictionary` are
ordered arbitrarily, alternating between keys and their matching values.
2023-06-23 22:49:12 +00:00
There is *no* ordering requirement on the elements of `Set`s or the
key-value pairs in a `Dictionary`. They may appear in any order. However,
the elements and keys *MUST* be pairwise distinct according to the
[Preserves equivalence relation](preserves.html#equivalence).
2023-06-27 20:19:52 +00:00
Empty structures are represented using a `Ref` with a zero offset and the
appropriate tag.
2023-06-23 22:49:12 +00:00
### SignedIntegers.
Integers between -2<sup>59</sup> and 2<sup>59</sup>-1, inclusive, are
2023-06-27 20:19:52 +00:00
represented as immediate values in a `Ref` with tag 3. Integers outside
this range are represented with a `Ref` with tag 4 pointing to a `Buf`
2023-06-23 22:49:12 +00:00
containing exactly as many 64-bit words as needed to unambiguously identify
the value and its sign, in little-endian byte and word ordering. Every
`SignedInteger` *MUST* be represented with its shortest possible encoding.
2023-06-27 20:19:52 +00:00
Zero is represented using tag 3; use of tag 4 with a zero offset is
forbidden.
2023-06-23 22:49:12 +00:00
For example,
Number (decimal) Ref (64-bit) Buf (hex bytes)
----------------------------------------- ---------------- ----------------
2023-06-27 20:19:52 +00:00
-576460752303423488 8000000000000003 -
-257 FFFFFFFFFFFFEFF3 -
-1 FFFFFFFFFFFFFFF3 -
0 0000000000000003 -
1 0000000000000013 -
257 0000000000001013 -
576460752303423487 7FFFFFFFFFFFFFF3 -
1000000000000000000000000000000 ...............4 1000000000000000
2023-06-23 22:49:12 +00:00
00000040EAED7446
D09C2C9F0C000000
0000000000000000
2023-06-27 20:19:52 +00:00
-1000000000000000000000000000000 ...............4 1000000000000000
2023-06-23 22:49:12 +00:00
000000C015128BB9
2F63D360F3FFFFFF
0000000000000000
2023-06-27 20:19:52 +00:00
87112285931760246646623899502532662132736 ...............4 1800000000000000
2023-06-23 22:49:12 +00:00
0000000000000000
0000000000000000
0001000000000000
### Strings, ByteStrings and Symbols.
Syntax for these three types varies only in the tag used. For `String` and
`Symbol`, the encoded data is a UTF-8 encoding of the `Value`, while for
`ByteString` it is the raw data contained within the `Value` unmodified.
2023-06-23 22:49:12 +00:00
2023-06-27 20:19:52 +00:00
Encoded data of length between 1 and 7 bytes is represented as an immediate
`Ref` where the low *five* bits are `00010` (`String`), `10001`
(`ByteString`), or `10010` (`Symbol`). The upper three bits of the low byte
of the `Ref` give the length in bytes. The remaining bytes in the `Ref` are
the data, in memory order.
`Ref` tags 5, 6, and 7 are pointers to `String`, `ByteString` and `Symbol`
`Buf`s, respectively. Offset zero signifies zero-length data; otherwise,
the pointed-to `Buf` contains the bytes of encoded data.
2023-06-23 22:49:12 +00:00
2023-06-27 20:19:52 +00:00
Empty values (length 0) *MUST* be encoded using pointer `Ref` form with
special offset zero.
2023-06-23 22:49:12 +00:00
For example,
Value Ref (64-bit) Buf (hex bytes)
----------------------------------------- ---------------- ----------------
2023-06-27 20:27:41 +00:00
"" 0000000000000005 -
#"" 0000000000000006 -
|| 0000000000000007 -
"Hello" 00006F6C6C6548A2 -
#"a\0b" 0000000062006171 -
xyz 000000007A797872 -
2023-06-23 22:49:12 +00:00
"Hello, world!" ...............5 0D00000000000000
48656C6C6F2C2077
6F726C6421000000
0000000000000000
### Booleans.
Value Ref (64-bit) Buf (hex bytes)
----------------------------------------- ---------------- ----------------
#f 0000000000000000 -
2023-06-27 20:19:52 +00:00
#t 0000000000000100 -
2023-06-23 22:49:12 +00:00
### Floats and Doubles.
2023-06-27 20:19:52 +00:00
4-byte (32-bit) IEEE 754 `Float`s are encoded within immediate `Ref`s with
low byte equal to 0x81. The next four lowest bytes are the 4-byte,
little-endian binary representation of the floating-point value, and the
upper three bytes of the `Ref` are unused.
8-byte (64-bit) IEEE 754 `Double`s are encoded into a `Buf`, pointed to by
a `Ref` with tag 13. The length of the `Buf` must be 8 bytes.
2023-06-23 22:49:12 +00:00
2023-06-27 20:19:52 +00:00
((This is a very sparse encoding for `Double`s! Each `Double` takes up 24
bytes split across the `Buf` and `Ref`.))
2023-06-23 22:49:12 +00:00
### Embeddeds.
To encode an `Embedded`, first choose a `Value` to represent the denoted
object, and encode that, producing a `Ref`. Place that ref in a `Buf` all
of its own (with length 8). Finally, point to the `Buf` with a `Ref` with
2023-06-27 20:19:52 +00:00
tag 12.
2023-06-23 22:49:12 +00:00
### Annotations.
((Not sure: put them as a trailer after a Header?))
## Security Considerations
((TBD))
## Appendix. Autodetection of textual or binary syntax
The first byte of a Header is 0xFF, which may not appear in any UTF-8
string. ((...))