preserves/preserves-zerocopy.md

10 KiB

no_site_title title
true Preserves: Zero-copy Binary Syntax

Tony Garnock-Jones tonyg@leastfixedpoint.com
{{ site.version_date }}. Version {{ site.version }}.

Preserves is a data model, with associated serialization formats. This document defines one of those formats: a binary syntax for Values from the Preserves data model that avoids, in many cases, use of intermediate data structures during reading and writing. This makes it suitable for use for representation of very large values whose fully-decoded representations may not fit in working memory.

Zero-Copy Binary Syntax

A Buf is a zero-copy syntax encoding, or representation, of a non-immediate Value. A Ref is either a type-tagged representation of a small immediate Value or a type-tagged pointer to a Buf.

Each Ref is a 64-bit unsigned value. Its tag appears in the low 4 bits. The remaining 60 bits encode either an unsigned offset pointing to a previously-encoded Buf, or an immediate value. Pointers always point backwards to earlier positions.

Each Buf is prefixed with a 64-bit payload length, counted in units of bytes, and is zero-padded to the nearest multiple of 16 bytes. Neither the length of the padding nor the length of the length itself are included in the length.

Offsets in pointer Refs are counted in 16-byte units, measuring from the beginning of the length indicator of the Buf in which the Ref appears. A zero offset is special: it denotes an empty value of the type associated with the tag in the Ref.

All multi-byte quantities are encoded using little-endian byte order.

Header.

Because Refs are typed, but Bufs are not, the outermost Value in e.g. a file or network stream is always encoded preceded by a special header.

Offset    Length  Description
--------  ------  -----------
00000000       1  Marker byte 0xFF
00000001       1  Version number 0x00
00000002       6  Reserved, 0x00
00000008       8  Special Ref
00000010       8  Length ("n") of encoded data, in bytes
00000018       n  Encoded data
       -       -  Zero-padding to next 16-byte boundary

The Ref in the header at offset 8 is special.

If it encodes an immediate Value, that Value is the encoded value, and the length field and encoded data are omitted. The entire encoded value is exactly 16 bytes long in this case.

However, if the special Ref is an encoding of a pointer to a Buf, the offset is interpreted as counting back from the very end of the padding at the end of the encoded data. The entire encoded value is the length of the encoded data, plus 24, rounded up to the next multiple of 16.

Either way, the tag on the special Ref is the type of the encoded value.

Tags and Refs.

The following table maps bit values in the low (leftmost) byte of a Ref to their interpretation. In interpretations including a three-bit nnn value, the nnn bits specify the length of the used portion of the remaining 56 bits of the Ref, counted in bytes, starting from the following byte, with value 000 disallowed.

Bit number    Meaning
7654 3210
--------- --- -------------------------------------------------------------
0000 0000 IMM Boolean; next byte = 0 means false; 1 means true.
...1 0000 IMM reserved
nnn0 0001 IMM Float: nnn must be 100, meaning a 32-bit IEEE754 value.
nnn1 0001 IMM ByteString
nnn0 0010 IMM String
nnn1 0010 IMM Symbol

.... 0011 IMM SignedInteger between -2^59 and (2^59)-1, inclusive

.... 0100 PTR SignedInteger outside the immediate range
.... 0101 PTR String
.... 0110 PTR ByteString
.... 0111 PTR Symbol
.... 1000 PTR Record
.... 1001 PTR Sequence
.... 1010 PTR Set
.... 1011 PTR Dictionary
.... 1100 PTR Embedded
.... 1101 PTR Double: length of pointed-to Buf must be 8
.... 1110     reserved
.... 1111     reserved

Records, Sequences, Sets and Dictionaries.

Offset    Length  Description
--------  ------  -----------
00000000       8  n*8: length of following sequence of n Refs, in bytes
00000008       8  Ref 0
  ...      ...    ...
     n*8       8  Ref n-1
 (n+1)*8       8  Padding, only if n is even

Each compound datum is represented as a Buf containing a sequence of Refs representing the contained Values. Each Record's sequence represents the label, followed by the fields in order. Each Sequence's representation is just its contained values in order. Sets are ordered arbitrarily into a sequence. The key-value pairs in a Dictionary are ordered arbitrarily, alternating between keys and their matching values.

There is no ordering requirement on the elements of Sets or the key-value pairs in a Dictionary. They may appear in any order. However, the elements and keys MUST be pairwise distinct according to the Preserves equivalence relation.

Empty structures are represented using a Ref with a zero offset and the appropriate tag.

SignedIntegers.

Integers between -259 and 259-1, inclusive, are represented as immediate values in a Ref with tag 3. Integers outside this range are represented with a Ref with tag 4 pointing to a Buf containing exactly as many 64-bit words as needed to unambiguously identify the value and its sign, in little-endian byte and word ordering. Every SignedInteger MUST be represented with its shortest possible encoding. Zero is represented using tag 3; use of tag 4 with a zero offset is forbidden.

For example,

Number (decimal)                           Ref (64-bit)      Buf (hex bytes)
-----------------------------------------  ----------------  ----------------
-576460752303423488                        8000000000000003  -
-257                                       FFFFFFFFFFFFEFF3  -
-1                                         FFFFFFFFFFFFFFF3  -
0                                          0000000000000003  -
1                                          0000000000000013  -
257                                        0000000000001013  -
576460752303423487                         7FFFFFFFFFFFFFF3  -

1000000000000000000000000000000            ...............4  1000000000000000
                                                             00000040EAED7446
                                                             D09C2C9F0C000000
                                                             0000000000000000

-1000000000000000000000000000000           ...............4  1000000000000000
                                                             000000C015128BB9
                                                             2F63D360F3FFFFFF
                                                             0000000000000000

87112285931760246646623899502532662132736  ...............4  1800000000000000
                                                             0000000000000000
                                                             0000000000000000
                                                             0001000000000000

Strings, ByteStrings and Symbols.

Syntax for these three types varies only in the tag used. For String and Symbol, the encoded data is a UTF-8 encoding of the Value, while for ByteString it is the raw data contained within the Value unmodified.

Encoded data of length between 1 and 7 bytes is represented as an immediate Ref where the low five bits are 00010 (String), 10001 (ByteString), or 10010 (Symbol). The upper three bits of the low byte of the Ref give the length in bytes. The remaining bytes in the Ref are the data, in memory order.

Ref tags 5, 6, and 7 are pointers to String, ByteString and Symbol Bufs, respectively. Offset zero signifies zero-length data; otherwise, the pointed-to Buf contains the bytes of encoded data.

Empty values (length 0) MUST be encoded using pointer Ref form with special offset zero.

For example,

Value                                      Ref (64-bit)      Buf (hex bytes)
-----------------------------------------  ----------------  ----------------
""                                         0000000000000005  -
#""                                        0000000000000006  -
||                                         0000000000000007  -
"Hello"                                    00006F6C6C6548A2  -
#"a\0b"                                    0000000062006171  -
xyz                                        000000007A797872  -

"Hello, world!"                            ...............5  0D00000000000000
                                                             48656C6C6F2C2077
                                                             6F726C6421000000
                                                             0000000000000000

Booleans.

Value                                      Ref (64-bit)      Buf (hex bytes)
-----------------------------------------  ----------------  ----------------
#f                                         0000000000000000  -
#t                                         0000000000000100  -

Floats and Doubles.

4-byte (32-bit) IEEE 754 Floats are encoded within immediate Refs with low byte equal to 0x81. The next four lowest bytes are the 4-byte, little-endian binary representation of the floating-point value, and the upper three bytes of the Ref are unused.

8-byte (64-bit) IEEE 754 Doubles are encoded into a Buf, pointed to by a Ref with tag 13. The length of the Buf must be 8 bytes.

((This is a very sparse encoding for Doubles! Each Double takes up 24 bytes split across the Buf and Ref.))

Embeddeds.

To encode an Embedded, first choose a Value to represent the denoted object, and encode that, producing a Ref. Place that ref in a Buf all of its own (with length 8). Finally, point to the Buf with a Ref with tag 12.

Annotations.

((Not sure: put them as a trailer after a Header?))

Security Considerations

((TBD))

Appendix. Autodetection of textual or binary syntax

The first byte of a Header is 0xFF, which may not appear in any UTF-8 string. ((...))