Experiment with zero-copy format
parent
54b9bb6f25
commit
9f556fd9e6
@ -0,0 +1,211 @@
---
no_site_title: true
title: "Preserves: Zero-copy Binary Syntax"
---

Tony Garnock-Jones <tonyg@leastfixedpoint.com>
{{ site.version_date }}. Version {{ site.version }}.

*Preserves* is a data model, with associated serialization formats. This
document defines one of those formats: a binary syntax for `Value`s from
the [Preserves data model](preserves.html) that avoids, in many cases, use
of intermediate data structures during reading and writing. This makes it
suitable for representing very large values whose fully-decoded
representations may not fit in working memory.

## Zero-Copy Binary Syntax

A `Buf` is a zero-copy syntax encoding, or representation, of a
non-immediate `Value`. A `Ref` is either a type-tagged representation of a
small immediate `Value` or a type-tagged pointer to a `Buf`.

Each `Ref` is a 64-bit unsigned value. Its tag appears in the low 4 bits.
The remaining 60 bits encode either an unsigned offset pointing to a
previously-encoded `Buf`, or an immediate value. Pointers always point
backwards to earlier positions.

Each `Buf` is prefixed with a 64-bit payload length, counted in units of
bytes, and is zero-padded up to the next multiple of 16 bytes. Neither the
padding nor the 8-byte length prefix itself is counted in the payload
length.

Offsets in pointer `Ref`s are counted in 16-byte units, measuring backwards
from the beginning of the length indicator of the `Buf` in which the `Ref`
appears. A zero offset is special: it denotes an *empty value* of the type
associated with the tag in the `Ref`.

All multi-byte quantities are encoded using little-endian byte order.
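
The layout rules above can be sketched in a few lines of Python. This is an illustration only, not a reference implementation: the helper names (`make_ref`, `split_ref`, `ref_bytes`, `encode_buf`, `pointer_target`) are invented for this example, and `pointer_target` assumes that the caller already knows the byte position at which the containing `Buf` starts.

```python
import struct

PAYLOAD_BITS = 60  # a Ref is 64 bits: 60 payload bits above a 4-bit tag


def make_ref(tag, payload):
    """Pack a 4-bit tag and a 60-bit unsigned payload into a 64-bit Ref."""
    if not 0 <= tag <= 0xF:
        raise ValueError("tag must fit in 4 bits")
    if not 0 <= payload < (1 << PAYLOAD_BITS):
        raise ValueError("payload must fit in 60 bits")
    return (payload << 4) | tag


def split_ref(ref):
    """Return (tag, payload) for a 64-bit Ref."""
    return ref & 0xF, ref >> 4


def ref_bytes(ref):
    """Serialize a Ref as 8 little-endian bytes."""
    return struct.pack("<Q", ref)


def encode_buf(payload):
    """Frame a payload: 8-byte length prefix, payload, zero padding to 16 bytes."""
    framed = struct.pack("<Q", len(payload)) + payload
    return framed + bytes(-len(framed) % 16)


def pointer_target(containing_buf_start, ref):
    """Byte position of the Buf a pointer Ref refers to, or None for offset 0."""
    _tag, offset = split_ref(ref)
    if offset == 0:
        return None  # zero offset: empty value of the type named by the tag
    return containing_buf_start - 16 * offset
```

For instance, `encode_buf(b"Hello, world!")` yields 32 bytes: the 8-byte length `0D 00 ... 00`, the 13 data bytes, and 11 bytes of zero padding.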

### Header.

Because `Ref`s are typed, but `Buf`s are not, the outermost `Value` in e.g.
a file or network stream is always preceded by a special header.

    Offset    Length  Description
    --------  ------  -----------
    00000000  1       Marker byte 0xFF
    00000001  1       Version number 0x00
    00000002  6       Reserved, 0x00
    00000008  8       Special Ref
    00000010  8       Length ("n") of encoded data, in bytes
    00000018  n       Encoded data
    -         -       Zero-padding to next 16-byte boundary

The `Ref` in the header at offset 8 is special.

If it encodes an immediate `Value`, that `Value` is the encoded value, and
the length field and encoded data are omitted. The entire encoded value is
exactly 16 bytes long in this case.

However, if the special `Ref` encodes a pointer to a `Buf`, the offset is
interpreted as counting back from the very end of the padding at the end of
the encoded data. The total size of the entire encoded value is then the
length of the encoded data, plus 24, rounded up to the next multiple of 16.

Either way, the tag on the special `Ref` is the type of the encoded value.
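
As a rough sketch of the header rules, the following Python builds the 16-byte form used for immediate values, computes the total size for the pointer case, and reads back the tag and payload of the special `Ref`. The function names are hypothetical, and the sketch assumes a reader should reject any version other than 0x00; the text above does not say how unknown versions or non-zero reserved bytes are to be handled.

```python
import struct


def header_for_immediate(ref):
    """16-byte encoding of a value whose special Ref is an immediate."""
    marker_and_version = bytes([0xFF, 0x00])
    reserved = bytes(6)
    return marker_and_version + reserved + struct.pack("<Q", ref)


def total_encoded_size(data_length):
    """Size of a pointer-form encoding: data length plus 24, rounded up to 16."""
    return (data_length + 24 + 15) // 16 * 16


def read_special_ref(data):
    """Check marker and version, then return (tag, payload) of the special Ref."""
    if len(data) < 16:
        raise ValueError("header is at least 16 bytes long")
    if data[0] != 0xFF:
        raise ValueError("missing 0xFF marker byte")
    if data[1] != 0x00:
        raise ValueError("unsupported version")  # assumption: reject other versions
    (ref,) = struct.unpack_from("<Q", data, 8)
    return ref & 0xF, ref >> 4
```

With these helpers, `header_for_immediate(0x10)` is the complete 16-byte encoding of `#t`, and `total_encoded_size(9)` is 48.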

### Tags and Refs.

    Tag  Type           Interpretation of 60-bit payload
    ---  -------------  --------------------------------
    0    Boolean        0 = False, 1 = True
    1    IEEE 754       Offset to Buf holding little-endian 32/64-bit float
    2    SignedInteger  Signed 60-bit integer
    3    SignedInteger  Offset to Buf holding little-endian signed integer
    4    String         0-7 bytes of UTF-8; length in lower 4 bits
    5    String         Offset to Buf holding UTF-8 data
    6    ByteString     0-7 bytes of raw binary; length in lower 4 bits
    7    ByteString     Offset to Buf holding raw binary data
    8    Symbol         0-7 bytes of UTF-8; length in lower 4 bits
    9    Symbol         Offset to Buf holding UTF-8 data
    A    Record         Offset to Buf holding Refs (label, fields)
    B    Sequence       Offset to Buf holding Refs (sequence values)
    C    Set            Offset to Buf holding Refs (elements in arbitrary order)
    D    Dictionary     Offset to Buf holding Refs (key/value pairs)
    E    Embedded       Offset to Buf holding a single Ref
    F    -              (reserved)

### Records, Sequences, Sets and Dictionaries.

    Offset    Length  Description
    --------  ------  -----------
    00000000  8       n*8: length of following sequence of n Refs, in bytes
    00000008  8       Ref 0
    ...       ...     ...
    n*8       8       Ref n-1
    (n+1)*8   8       Padding, only if n is even

Each compound datum is represented as a sequence of `Ref`s representing the
contained `Value`s. Each `Record`'s sequence represents the label, followed
by the fields in order. Each `Sequence`'s representation is just its
contained values in order. `Set`s are ordered arbitrarily into a sequence.
The key-value pairs in a `Dictionary` are ordered arbitrarily, alternating
between keys and their matching values.

There is *no* ordering requirement on the elements of `Set`s or the
key-value pairs in a `Dictionary`. They may appear in any order. However,
the elements and keys *MUST* be pairwise distinct according to the
[Preserves equivalence relation](preserves.html#equivalence).
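
A minimal sketch of the compound layout, assuming the contained `Ref`s have already been computed (including any pointer offsets, which this example does not derive). The names `encode_ref_buf` and `decode_ref_buf` are made up for illustration.

```python
import struct


def encode_ref_buf(refs):
    """Buf holding n Refs: length n*8, the Refs, and 8 padding bytes if n is even."""
    body = b"".join(struct.pack("<Q", ref) for ref in refs)
    framed = struct.pack("<Q", len(body)) + body
    return framed + bytes(-len(framed) % 16)


def decode_ref_buf(buf):
    """Recover the list of Refs from a Buf produced by encode_ref_buf."""
    (length,) = struct.unpack_from("<Q", buf, 0)
    if length % 8 != 0:
        raise ValueError("payload length must be a multiple of 8")
    return [struct.unpack_from("<Q", buf, 8 + i)[0] for i in range(0, length, 8)]


# A Dictionary alternates keys and values; here {1: #t, 2: #f}, all immediates.
dictionary_refs = [0x12, 0x10, 0x22, 0x00]
dictionary_buf = encode_ref_buf(dictionary_refs)
```

With four `Ref`s (n even), the `Buf` is 8 + 32 = 40 bytes before padding and 48 bytes after it; with three `Ref`s it is exactly 32 bytes and needs no padding.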

### SignedIntegers.

Integers between -2<sup>59</sup> and 2<sup>59</sup>-1, inclusive, are
represented as immediate values in a `Ref` with tag 2. Integers outside
this range are represented with a `Ref` with tag 3 pointing to a `Buf`
containing exactly as many 64-bit words as needed to unambiguously identify
the value and its sign, in little-endian byte and word ordering. Every
`SignedInteger` *MUST* be represented with its shortest possible encoding.

For example,

    Number (decimal)                           Ref (64-bit)      Buf (hex bytes)
    -----------------------------------------  ----------------  ----------------
    -576460752303423488                        8000000000000002  -
    -257                                       FFFFFFFFFFFFEFF2  -
    -1                                         FFFFFFFFFFFFFFF2  -
    0                                          0000000000000002  -
    1                                          0000000000000012  -
    257                                        0000000000001012  -
    576460752303423487                         7FFFFFFFFFFFFFF2  -

    1000000000000000000000000000000            ...............3  1000000000000000
                                                                 00000040EAED7446
                                                                 D09C2C9F0C000000
                                                                 0000000000000000

    -1000000000000000000000000000000           ...............3  1000000000000000
                                                                 000000C015128BB9
                                                                 2F63D360F3FFFFFF
                                                                 0000000000000000

    87112285931760246646623899502532662132736  ...............3  1800000000000000
                                                                 0000000000000000
                                                                 0000000000000000
                                                                 0001000000000000
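
The rows above can be reproduced with a short Python sketch. It is one possible reading of the rule, under the assumption that "as many 64-bit words as needed" means the smallest two's-complement width that is a whole number of 64-bit words. The names `encode_signed_integer` and `decode_immediate_integer` are illustrative, and the pointer `Ref` itself is not built because its offset depends on where the `Buf` is placed.

```python
import struct


def encode_signed_integer(n):
    """Return ("immediate", ref) for small n, or ("buf", buf_bytes) otherwise."""
    if -(1 << 59) <= n < (1 << 59):
        payload = n & ((1 << 60) - 1)  # 60-bit two's complement
        return "immediate", (payload << 4) | 2
    words = 1
    while not -(1 << (64 * words - 1)) <= n < (1 << (64 * words - 1)):
        words += 1  # grow until n fits in a signed value of this many words
    payload = n.to_bytes(8 * words, "little", signed=True)
    framed = struct.pack("<Q", len(payload)) + payload
    return "buf", framed + bytes(-len(framed) % 16)


def decode_immediate_integer(ref):
    """Sign-extend the 60-bit payload of a tag-2 Ref."""
    payload = ref >> 4
    return payload - (1 << 60) if payload >> 59 else payload
```

For example, `encode_signed_integer(-257)` gives the immediate `Ref` `FFFFFFFFFFFFEFF2`, while 2<sup>136</sup> needs three words and therefore a 24-byte payload.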

### Strings, ByteStrings and Symbols.

Syntax for these three types varies only in the tag used. For `String` and
`Symbol`, the encoded data is a UTF-8 encoding of the `Value`'s code
points, while for `ByteString` it is the raw data contained within the
`Value` unmodified.

Encoded data of length 7 bytes or shorter is represented as an immediate
`Ref` with tag 4 (`String`), 6 (`ByteString`) or 8 (`Symbol`). The lower 4
bits of the 60-bit payload are the length of the encoded data; the upper 56
bits are 7 bytes of data, with the first data byte in the uppermost byte.

Data longer than 7 bytes is represented with a `Ref` with tag 5, 7 or 9
pointing to a `Buf` containing the bytes of encoded data. Empty values
(length 0) *MUST* be encoded using immediate `Ref` form.

For example,

    Value                                      Ref (64-bit)      Buf (hex bytes)
    -----------------------------------------  ----------------  ----------------
    ""                                         0000000000000004  -
    #""                                        0000000000000006  -
    ||                                         0000000000000008  -
    "Hello"                                    48656C6C6F000054  -
    "a\0a"                                     6100610000000034  -

    "Hello, world!"                            ...............5  0D00000000000000
                                                                 48656C6C6F2C2077
                                                                 6F726C6421000000
                                                                 0000000000000000
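
The immediate form can be checked against the examples with a small Python sketch. The table `IMMEDIATE_TAGS` and the functions `encode_short` and `decode_short` are names chosen for this illustration only.

```python
IMMEDIATE_TAGS = {"string": 4, "bytestring": 6, "symbol": 8}


def encode_short(kind, data):
    """Immediate Ref for up to 7 bytes: data in the top 56 bits, then length, then tag."""
    if len(data) > 7:
        raise ValueError("longer data must go in a Buf (tags 5, 7 or 9)")
    padded = data.ljust(7, b"\x00")  # first data byte ends up in the uppermost byte
    return (int.from_bytes(padded, "big") << 8) | (len(data) << 4) | IMMEDIATE_TAGS[kind]


def decode_short(ref):
    """Return (tag, data) for an immediate String, ByteString or Symbol Ref."""
    tag = ref & 0xF
    length = (ref >> 4) & 0xF
    if length > 7:
        raise ValueError("immediate length must be 0-7")
    return tag, (ref >> 8).to_bytes(7, "big")[:length]
```

Here `encode_short("string", "Hello".encode("utf-8"))` is `0x48656C6C6F000054`. Because a `Ref` is itself stored little-endian, the first data byte is the last of the eight bytes written out.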

### Booleans.

    Value                                      Ref (64-bit)      Buf (hex bytes)
    -----------------------------------------  ----------------  ----------------
    #f                                         0000000000000000  -
    #t                                         0000000000000010  -

### Floats and Doubles.

Each IEEE 754 4- and 8-byte binary representation is encoded into a `Buf`,
pointed to with a `Ref` with tag 1. The length of the `Buf` disambiguates
between 32-bit floats and 64-bit doubles.

((This is a very sparse encoding! Each float/double takes up 24 bytes split
across the `Buf` and `Ref`.))
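
The following Python sketch shows how the payload length alone distinguishes the two widths. `encode_float_buf` and `decode_float_buf` are hypothetical helper names, and the tag-1 `Ref` pointing at the `Buf` is left out because its offset depends on placement.

```python
import struct


def encode_float_buf(value, double=True):
    """Buf holding a little-endian IEEE 754 double (8 bytes) or float (4 bytes)."""
    payload = struct.pack("<d" if double else "<f", value)
    framed = struct.pack("<Q", len(payload)) + payload
    return framed + bytes(-len(framed) % 16)


def decode_float_buf(buf):
    """Use the payload length to decide between a 32-bit and a 64-bit value."""
    (length,) = struct.unpack_from("<Q", buf, 0)
    if length == 4:
        return struct.unpack_from("<f", buf, 8)[0]
    if length == 8:
        return struct.unpack_from("<d", buf, 8)[0]
    raise ValueError("a float Buf holds exactly 4 or 8 payload bytes")
```

Both widths occupy a 16-byte `Buf`, which together with the 8-byte `Ref` gives the 24 bytes noted above.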

### Embeddeds.

To encode an `Embedded`, first choose a `Value` to represent the denoted
object, and encode that, producing a `Ref`. Place that `Ref` in a `Buf` all
of its own (with length 8). Finally, point to the `Buf` with a `Ref` with
tag 14 (hex E).

### Annotations.

((Not sure: put them as a trailer after a Header?))

## Security Considerations

((TBD))

## Appendix. Autodetection of textual or binary syntax

The first byte of a Header is 0xFF, which never appears in any valid UTF-8
string. ((...))
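
As a tentative illustration of the observation above, the check below treats input starting with 0xFF as this binary syntax. The rest of the autodetection procedure is not specified in this document, so the function name `looks_like_zero_copy_binary` and the decision to inspect only the first byte are assumptions.

```python
def looks_like_zero_copy_binary(data):
    """True if the input starts with the 0xFF Header marker byte."""
    return data[:1] == b"\xff"


def is_valid_utf8(data):
    """True if the bytes decode as UTF-8; any input containing 0xFF does not."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```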