diff --git a/preserves-zerocopy.md b/preserves-zerocopy.md
new file mode 100644
index 0000000..e26bedc
--- /dev/null
+++ b/preserves-zerocopy.md
@@ -0,0 +1,211 @@
+---
+no_site_title: true
+title: "Preserves: Zero-copy Binary Syntax"
+---
+
+Tony Garnock-Jones
+{{ site.version_date }}. Version {{ site.version }}.
+
+*Preserves* is a data model, with associated serialization formats. This
+document defines one of those formats: a binary syntax for `Value`s from
+the [Preserves data model](preserves.html) that avoids, in many cases, use
+of intermediate data structures during reading and writing. This makes it
+suitable for representing very large values whose
+fully-decoded representations may not fit in working memory.
+
+## Zero-Copy Binary Syntax
+
+A `Buf` is a zero-copy syntax encoding, or representation, of a
+non-immediate `Value`. A `Ref` is either a type-tagged representation of a
+small immediate `Value` or a type-tagged pointer to a `Buf`.
+
+Each `Ref` is a 64-bit unsigned value. Its tag appears in the low 4 bits.
+The remaining 60 bits encode either an unsigned offset pointing to a
+previously-encoded `Buf`, or an immediate value. Pointers always point
+backwards to earlier positions.
+
+Each `Buf` is prefixed with a 64-bit payload length, counted in units of
+bytes, and is zero-padded to the next multiple of 16 bytes. Neither the
+length of the padding nor the length of the length field itself is included
+in the payload length.
+
+Offsets in pointer `Ref`s are counted in 16-byte units, measuring from the
+beginning of the length indicator of the `Buf` in which the `Ref` appears.
+A zero offset is special: it denotes an *empty value* of the type
+associated with the tag in the `Ref`.
+
+All multi-byte quantities are encoded using little-endian byte order.
+
+### Header.
+
+Because `Ref`s are typed, but `Buf`s are not, the outermost `Value` in e.g.
+a file or network stream is always preceded by a special header.
+
+    Offset   Length Description
+    -------- ------ -----------
+    00000000 1      Marker byte 0xFF
+    00000001 1      Version number 0x00
+    00000002 6      Reserved, 0x00
+    00000008 8      Special Ref
+    00000010 8      Length ("n") of encoded data, in bytes
+    00000018 n      Encoded data
+    -        -      Zero-padding to next 16-byte boundary
+
+The `Ref` in the header at offset 8 is special.
+
+If it encodes an immediate `Value`, that `Value` is the encoded value, and
+the length field and encoded data are omitted. The entire encoded value is
+exactly 16 bytes long in this case.
+
+However, if the special `Ref` is an encoding of a pointer to a `Buf`, the
+offset is interpreted as counting back from the very end of the padding at
+the end of the encoded data. The entire encoded value is the length of the
+encoded data, plus 24, rounded up to the next multiple of 16.
+
+Either way, the tag on the special `Ref` is the type of the encoded value.
+
+### Tags and Refs.
+
+    Tag Type          Interpretation of 60-bit payload
+    --- ------------- --------------------------------
+    0   Boolean       0 = False, 1 = True
+    1   IEEE 754      Offset to Buf holding little-endian 32/64-bit float
+    2   SignedInteger Signed 60-bit integer
+    3   SignedInteger Offset to Buf holding little-endian signed integer
+    4   String        0-7 bytes of UTF-8; length in lower 4 bits
+    5   String        Offset to Buf holding UTF-8 data
+    6   ByteString    0-7 bytes of raw binary; length in lower 4 bits
+    7   ByteString    Offset to Buf holding raw binary data
+    8   Symbol        0-7 bytes of UTF-8; length in lower 4 bits
+    9   Symbol        Offset to Buf holding UTF-8 data
+    A   Record        Offset to Buf holding Refs (label, fields)
+    B   Sequence      Offset to Buf holding Refs (sequence values)
+    C   Set           Offset to Buf holding Refs (elements in arbitrary order)
+    D   Dictionary    Offset to Buf holding Refs (key/value pairs)
+    E   Embedded      Offset to Buf holding a single Ref
+    F   -             (reserved)
+
+### Records, Sequences, Sets and Dictionaries.
+
+    Offset   Length Description
+    -------- ------ -----------
+    00000000 8      n*8: length of following sequence of n Refs, in bytes
+    00000008 8      Ref 0
+    ...      ...    ...
+    n*8      8      Ref n-1
+    (n+1)*8  8      Padding, only if n is even
+
+Each compound datum is represented as a sequence of `Ref`s representing the
+contained `Value`s. Each `Record`'s sequence represents the label, followed
+by the fields in order. Each `Sequence`'s representation is just its
+contained values in order. `Set`s are ordered arbitrarily into a sequence.
+The key-value pairs in a `Dictionary` are ordered arbitrarily, alternating
+between keys and their matching values.
+
+There is *no* ordering requirement on the elements of `Set`s or the
+key-value pairs in a `Dictionary`. They may appear in any order. However,
+the elements and keys *MUST* be pairwise distinct according to the
+[Preserves equivalence relation](preserves.html#equivalence).
+
+### SignedIntegers.
+
+Integers between -2<sup>59</sup> and 2<sup>59</sup>-1, inclusive, are
+represented as immediate values in a `Ref` with tag 2. Integers outside
+this range are represented with a `Ref` with tag 3 pointing to a `Buf`
+containing exactly as many 64-bit words as needed to unambiguously identify
+the value and its sign, in little-endian byte and word ordering. Every
+`SignedInteger` *MUST* be represented with its shortest possible encoding.
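
The integer rule above can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of the specification: the names `make_buf` and `encode_signed_integer` are hypothetical, and the sketch returns the `Buf` bytes for large integers without building the tag-3 `Ref`, because the pointer offset depends on where the `Buf` is placed in the output.

```python
import struct

TAG_SMALL_INT = 0x2            # immediate SignedInteger tag from the tag table
PAYLOAD_MASK = (1 << 60) - 1   # a Ref carries a 60-bit payload above its tag


def make_buf(payload: bytes) -> bytes:
    """Hypothetical helper: 64-bit little-endian length prefix, then the
    payload, zero-padded so the whole Buf is a multiple of 16 bytes."""
    body = struct.pack("<Q", len(payload)) + payload
    return body + b"\x00" * (-len(body) % 16)


def encode_signed_integer(n: int):
    """Return (ref, None) for an immediate integer, or (None, buf) when the
    value needs a Buf. Building the tag-3 pointer Ref is left to the caller."""
    if -(1 << 59) <= n < (1 << 59):
        # Two's-complement truncation to 60 bits, shifted above the tag.
        return ((n & PAYLOAD_MASK) << 4) | TAG_SMALL_INT, None
    # Bits needed for magnitude plus sign, rounded up to whole 64-bit words.
    bits = (n if n >= 0 else ~n).bit_length() + 1
    words = (bits + 63) // 64
    return None, make_buf(n.to_bytes(8 * words, "little", signed=True))
```

Choosing the smallest word count that still holds the sign bit is one way to satisfy the shortest-encoding requirement.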
+
+For example,
+
+    Number (decimal)                          Ref (64-bit)     Buf (hex bytes)
+    ----------------------------------------- ---------------- ----------------
+    -576460752303423488                       8000000000000002 -
+    -257                                      FFFFFFFFFFFFEFF2 -
+    -1                                        FFFFFFFFFFFFFFF2 -
+    0                                         0000000000000002 -
+    1                                         0000000000000012 -
+    257                                       0000000000001012 -
+    576460752303423487                        7FFFFFFFFFFFFFF2 -
+
+    1000000000000000000000000000000           ...............3 1000000000000000
+                                                               00000040EAED7446
+                                                               D09C2C9F0C000000
+                                                               0000000000000000
+
+    -1000000000000000000000000000000          ...............3 1000000000000000
+                                                               000000C015128BB9
+                                                               2F63D360F3FFFFFF
+                                                               0000000000000000
+
+    87112285931760246646623899502532662132736 ...............3 1800000000000000
+                                                               0000000000000000
+                                                               0000000000000000
+                                                               0001000000000000
+
+### Strings, ByteStrings and Symbols.
+
+Syntax for these three types varies only in the tag used. For `String` and
+`Symbol`, the encoded data is a UTF-8 encoding of the `Value`'s code
+points, while for `ByteString` it is the raw data contained within the
+`Value` unmodified.
+
+Encoded data of length 7 bytes or shorter is represented as an immediate
+`Ref` with tag 4 (`String`), 6 (`ByteString`) or 8 (`Symbol`). The lower 4
+bits of the 60-bit payload are the length of the encoded data; the upper 56
+bits are 7 bytes of data, with the first data byte in the uppermost byte.
+
+Data longer than 7 bytes is represented with a `Ref` with tag 5, 7 or 9
+pointing to a `Buf` containing the bytes of encoded data. Empty values
+(length 0) *MUST* be encoded using immediate `Ref` form.
+
+For example,
+
+    Value                                     Ref (64-bit)     Buf (hex bytes)
+    ----------------------------------------- ---------------- ----------------
+    ""                                        0000000000000004 -
+    #""                                       0000000000000006 -
+    ||                                        0000000000000008 -
+    "Hello"                                   48656C6C6F000054 -
+    "a\0a"                                    6100610000000034 -
+
+    "Hello, world!"                           ...............5 0D00000000000000
+                                                               48656C6C6F2C2077
+                                                               6F726C6421000000
+                                                               0000000000000000
+
+### Booleans.
+
+    Value                                     Ref (64-bit)     Buf (hex bytes)
+    ----------------------------------------- ---------------- ----------------
+    #f                                        0000000000000000 -
+    #t                                        0000000000000010 -
+
+### Floats and Doubles.
+
+Each IEEE 754 4- and 8-byte binary representation is encoded into a `Buf`,
+pointed to with a `Ref` with tag 1. The length of the `Buf` disambiguates
+between 32-bit floats and 64-bit doubles.
+
+((This is a very sparse encoding! Each float/double takes up 24 bytes split
+across the `Buf` and `Ref`.))
+
+### Embeddeds.
+
+To encode an `Embedded`, first choose a `Value` to represent the denoted
+object, and encode that, producing a `Ref`. Place that ref in a `Buf` all
+of its own (with length 8). Finally, point to the `Buf` with a `Ref` with
+tag 14 (hex `E`).
+
+### Annotations.
+
+((Not sure: put them as a trailer after a Header?))
+
+## Security Considerations
+
+((TBD))
+
+## Appendix. Autodetection of textual or binary syntax
+
+The first byte of a Header is 0xFF, which may not appear in any UTF-8
+string. ((...))
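
As a rough illustration of how a reader might use the marker byte together with the Header layout, the Python sketch below checks the first byte, validates the fixed 16-byte prefix, splits the special `Ref` into tag and payload, and decodes the immediate forms. The function names are hypothetical, pointer-tagged `Ref`s are deliberately left undecoded, and returning a plain `str` for both `String` and `Symbol` is a simplification made only for this example.

```python
import struct


def looks_binary(data: bytes) -> bool:
    """True if the input starts with the Header marker byte 0xFF."""
    return data[:1] == b"\xff"


def read_special_ref(data: bytes):
    """Validate the first 16 header bytes and return (tag, payload)."""
    # Layout: marker (1), version (1), reserved (6), special Ref (8, LE).
    marker, version, reserved, ref = struct.unpack_from("<BB6sQ", data, 0)
    if marker != 0xFF or version != 0x00 or reserved != bytes(6):
        raise ValueError("not a version-0 zero-copy header")
    return ref & 0xF, ref >> 4   # tag in the low 4 bits, 60-bit payload


def decode_immediate(tag: int, payload: int):
    """Decode the immediate forms (tags 0, 2, 4, 6, 8) from the tag table."""
    if tag == 0x0:
        if payload not in (0, 1):
            raise ValueError("invalid Boolean payload")
        return payload == 1
    if tag == 0x2:
        # Sign-extend the 60-bit two's-complement payload.
        return payload - (1 << 60) if payload >> 59 else payload
    if tag in (0x4, 0x6, 0x8):
        length = payload & 0xF
        if length > 7:
            raise ValueError("immediate data is limited to 7 bytes")
        # Upper 56 bits hold the data, first byte in the uppermost byte.
        raw = (payload >> 4).to_bytes(7, "big")[:length]
        # Simplification: String and Symbol both come back as str.
        return raw if tag == 0x6 else raw.decode("utf-8")
    raise ValueError("tag does not denote an immediate value")
```

A caller would typically test `looks_binary` first and fall back to the textual syntax otherwise.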