Experiment with zero-copy format

2023-06-24 00:49:12 +02:00 · 2023-06-24 00:49:12 +02:00 · 9f556fd9e6
parent 54b9bb6f25
commit 9f556fd9e6
1 changed file with 211 additions and 0 deletions
--- a/preserves-zerocopy.md
+++ b/preserves-zerocopy.md
@@ -0,0 +1,211 @@
+---
+no_site_title: true
+title: "Preserves: Zero-copy Binary Syntax"
+---
+
+Tony Garnock-Jones <tonyg@leastfixedpoint.com>  
+{{ site.version_date }}. Version {{ site.version }}.
+
+*Preserves* is a data model with associated serialization formats. This
+document defines one of those formats: a binary syntax for `Value`s from
+the [Preserves data model](preserves.html) that in many cases avoids
+building intermediate data structures during reading and writing. This
+makes it suitable for representing very large values whose fully-decoded
+representations may not fit in working memory.
+
+## Zero-Copy Binary Syntax
+
+A `Buf` is a zero-copy syntax encoding, or representation, of a
+non-immediate `Value`. A `Ref` is either a type-tagged representation of a
+small immediate `Value` or a type-tagged pointer to a `Buf`.
+
+Each `Ref` is a 64-bit unsigned value. Its tag appears in the low 4 bits.
+The remaining 60 bits encode either an unsigned offset pointing to a
+previously-encoded `Buf`, or an immediate value. Pointers always point
+backwards to earlier positions.
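
As an illustration only, the split between tag and payload can be sketched in Python. The helper names `make_ref`, `split_ref` and `ref_bytes` are hypothetical and not part of this specification; the byte order follows the little-endian rule stated in this section.

```python
PAYLOAD_MASK = (1 << 60) - 1  # the upper 60 bits of a Ref, as an unsigned value


def make_ref(tag, payload):
    """Pack a 4-bit tag and a 60-bit unsigned payload into a 64-bit Ref."""
    if not 0 <= tag <= 0xF:
        raise ValueError("tag must fit in 4 bits")
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload must fit in 60 bits")
    return (payload << 4) | tag


def split_ref(ref):
    """Return (tag, payload) for a 64-bit Ref."""
    return ref & 0xF, ref >> 4


def ref_bytes(ref):
    """Serialize a Ref as 8 little-endian bytes."""
    return ref.to_bytes(8, "little")
```

For instance, `make_ref(0, 1)` yields the `Ref` listed for `#t` in the Booleans table.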
+
+Each `Buf` is prefixed with a 64-bit payload length, counted in bytes, and
+is zero-padded so that prefix, payload and padding together occupy a
+multiple of 16 bytes. Neither the length of the padding nor the length of
+the length prefix itself is included in the payload length.
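
A minimal sketch of `Buf` framing follows. It assumes that padding brings the length prefix and payload together up to a 16-byte boundary, which is what the worked examples in this document suggest; `encode_buf` is a hypothetical helper name.

```python
def encode_buf(payload):
    """Frame raw payload bytes as a Buf: length prefix, payload, zero padding."""
    prefix = len(payload).to_bytes(8, "little")  # counts payload bytes only
    unpadded = prefix + payload
    padding = -len(unpadded) % 16  # assumed: prefix + payload are padded together
    return unpadded + bytes(padding)
```

Under this reading, the 13-byte payload `b"Hello, world!"` produces a 32-byte `Buf`, matching the string example later in the document.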
+
+Offsets in pointer `Ref`s are counted in 16-byte units, measuring from the
+beginning of the length indicator of the `Buf` in which the `Ref` appears.
+A zero offset is special: it denotes an *empty value* of the type
+associated with the tag in the `Ref`.
+
+All multi-byte quantities are encoded using little-endian byte order.
+
+### Header.
+
+Because `Ref`s are typed, but `Buf`s are not, the outermost `Value` in e.g.
+a file or network stream is always preceded by a special header.
+
+    Offset    Length  Description
+    --------  ------  -----------
+    00000000       1  Marker byte 0xFF
+    00000001       1  Version number 0x00
+    00000002       6  Reserved, 0x00
+    00000008       8  Special Ref
+    00000010       8  Length ("n") of encoded data, in bytes
+    00000018       n  Encoded data
+           -       -  Zero-padding to next 16-byte boundary
+
+The `Ref` in the header at offset 8 is special.
+
+If it encodes an immediate `Value`, that `Value` is the encoded value, and
+the length field and encoded data are omitted. The entire encoded value is
+exactly 16 bytes long in this case.
+
+However, if the special `Ref` is an encoding of a pointer to a `Buf`, the
+offset is interpreted as counting back from the very end of the padding at
+the end of the encoded data. The total size of the encoded value is then
+the length of the encoded data, plus 24, rounded up to a multiple of 16.
+
+Either way, the tag on the special `Ref` is the type of the encoded value.
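
The size rule for the pointer case is simple arithmetic and can be checked with a short sketch. The names `HEADER_FIXED` and `total_encoded_length` are hypothetical; the constant 24 is the sum of the three fixed 8-byte fields in the table above.

```python
# Marker/version/reserved (8 bytes) + special Ref (8 bytes) + length field (8 bytes).
HEADER_FIXED = 8 + 8 + 8

# When the special Ref is an immediate, only the first two fields are present.
IMMEDIATE_TOTAL = 16


def total_encoded_length(data_length):
    """Total bytes on the wire when the special Ref points to a Buf."""
    unpadded = HEADER_FIXED + data_length
    return (unpadded + 15) // 16 * 16  # round up to a multiple of 16
```

A reader could use such a function to locate the end of the padding, from which the special `Ref`'s offset counts back.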
+
+### Tags and Refs.
+
+    Tag  Type           Interpretation of 60-bit payload
+    ---  -------------  --------------------------------
+      0  Boolean        0 = False, 1 = True
+      1  IEEE 754       Offset to Buf holding little-endian 32/64-bit float
+      2  SignedInteger  Signed 60-bit integer
+      3  SignedInteger  Offset to Buf holding little-endian signed integer
+      4  String         0-7 bytes of UTF-8; length in lower 4 bits
+      5  String         Offset to Buf holding UTF-8 data
+      6  ByteString     0-7 bytes of raw binary; length in lower 4 bits
+      7  ByteString     Offset to Buf holding raw binary data
+      8  Symbol         0-7 bytes of UTF-8; length in lower 4 bits
+      9  Symbol         Offset to Buf holding UTF-8 data
+      A  Record         Offset to Buf holding Refs (label, fields)
+      B  Sequence       Offset to Buf holding Refs (sequence values)
+      C  Set            Offset to Buf holding Refs (elements in arbitrary order)
+      D  Dictionary     Offset to Buf holding Refs (key/value pairs)
+      E  Embedded       Offset to Buf holding a single Ref
+      F  -              (reserved)
+
+### Records, Sequences, Sets and Dictionaries.
+
+    Offset    Length  Description
+    --------  ------  -----------
+    00000000       8  n*8: length of following sequence of n Refs, in bytes
+    00000008       8  Ref 0
+      ...      ...    ...
+         n*8       8  Ref n-1
+     (n+1)*8       8  Padding, only if n is even
+
+Each compound datum is represented as a sequence of `Ref`s representing the
+contained `Value`s. Each `Record`'s sequence represents the label, followed
+by the fields in order. Each `Sequence`'s representation is just its
+contained values in order. `Set`s are ordered arbitrarily into a sequence.
+The key-value pairs in a `Dictionary` are ordered arbitrarily, alternating
+between keys and their matching values.
+
+There is *no* ordering requirement on the elements of `Set`s or the
+key-value pairs in a `Dictionary`. They may appear in any order. However,
+the elements and keys *MUST* be pairwise distinct according to the
+[Preserves equivalence relation](preserves.html#equivalence).
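
The layout of a compound `Buf` can be sketched as follows, treating each `Ref` as an already-computed 64-bit integer. The helpers `encode_ref_buf`, `record_refs` and `dictionary_refs` are hypothetical names, and the sketch does not check the distinctness requirement.

```python
def encode_ref_buf(refs):
    """Frame a sequence of 64-bit Refs as a Buf (length prefix, Refs, padding)."""
    payload = b"".join(r.to_bytes(8, "little") for r in refs)
    buf = len(payload).to_bytes(8, "little") + payload
    return buf + bytes(-len(buf) % 16)  # 8 bytes of padding exactly when n is even


def record_refs(label_ref, field_refs):
    """A Record's sequence is its label followed by its fields, in order."""
    return [label_ref, *field_refs]


def dictionary_refs(pairs):
    """A Dictionary's sequence alternates keys and their matching values."""
    return [ref for key, value in pairs for ref in (key, value)]
```

Because the length prefix is itself 8 bytes, an odd number of `Ref`s already ends on a 16-byte boundary, which is why padding is only needed when n is even.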
+
+### SignedIntegers.
+
+Integers between -2<sup>59</sup> and 2<sup>59</sup>-1, inclusive, are
+represented as immediate values in a `Ref` with tag 2. Integers outside
+this range are represented with a `Ref` with tag 3 pointing to a `Buf`
+containing exactly as many 64-bit words as needed to unambiguously identify
+the value and its sign, in little-endian byte and word ordering. Every
+`SignedInteger` *MUST* be represented with its shortest possible encoding.
+
+For example,
+
+    Number (decimal)                           Ref (64-bit)      Buf (hex bytes)
+    -----------------------------------------  ----------------  ----------------
+    -576460752303423488                        8000000000000002  -
+    -257                                       FFFFFFFFFFFFEFF2  -
+    -1                                         FFFFFFFFFFFFFFF2  -
+    0                                          0000000000000002  -
+    1                                          0000000000000012  -
+    257                                        0000000000001012  -
+    576460752303423487                         7FFFFFFFFFFFFFF2  -
+
+    1000000000000000000000000000000            ...............3  1000000000000000
+                                                                 00000040EAED7446
+                                                                 D09C2C9F0C000000
+                                                                 0000000000000000
+
+    -1000000000000000000000000000000           ...............3  1000000000000000
+                                                                 000000C015128BB9
+                                                                 2F63D360F3FFFFFF
+                                                                 0000000000000000
+
+    87112285931760246646623899502532662132736  ...............3  1800000000000000
+                                                                 0000000000000000
+                                                                 0000000000000000
+                                                                 0001000000000000
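
The rules above can be sketched in Python, whose integers are arbitrary-precision. `encode_signed_integer` is a hypothetical helper; it returns either an immediate `Ref`, or the payload bytes that would go into a `Buf` pointed to by a tag-3 `Ref` (the pointer itself depends on where the `Buf` is placed, so it is left out).

```python
IMMEDIATE_MIN = -(1 << 59)
IMMEDIATE_MAX = (1 << 59) - 1


def encode_signed_integer(n):
    """Return (ref, None) for an immediate, or (None, payload) for a Buf payload."""
    if IMMEDIATE_MIN <= n <= IMMEDIATE_MAX:
        payload = n & ((1 << 60) - 1)  # 60-bit two's complement
        return (payload << 4) | 2, None
    # Find the smallest number of 64-bit words that holds n with its sign.
    words = 1
    while not -(1 << (64 * words - 1)) <= n < (1 << (64 * words - 1)):
        words += 1
    return None, n.to_bytes(8 * words, "little", signed=True)
```

Choosing the smallest word count is what makes the result the shortest possible encoding for out-of-range values.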
+
+### Strings, ByteStrings and Symbols.
+
+Syntax for these three types varies only in the tag used. For `String` and
+`Symbol`, the encoded data is a UTF-8 encoding of the `Value`'s code
+points, while for `ByteString` it is the raw data contained within the
+`Value` unmodified.
+
+Encoded data of length 7 bytes or shorter is represented as an immediate
+`Ref` with tag 4 (`String`), 6 (`ByteString`) or 8 (`Symbol`). The lower 4
+bits of the 60-bit payload are the length of the encoded data; the upper 56
+bits are 7 bytes of data, with the first data byte in the uppermost byte.
+
+Data longer than 7 bytes is represented with a `Ref` with tag 5, 7 or 9
+pointing to a `Buf` containing the bytes of encoded data. Empty values
+(length 0) *MUST* be encoded using immediate `Ref` form.
+
+For example,
+
+    Value                                      Ref (64-bit)      Buf (hex bytes)
+    -----------------------------------------  ----------------  ----------------
+    ""                                         0000000000000004  -
+    #""                                        0000000000000006  -
+    ||                                         0000000000000008  -
+    "Hello"                                    48656C6C6F000054  -
+    "a\0a"                                     6100610000000034  -
+
+    "Hello, world!"                            ...............5  0D00000000000000
+                                                                 48656C6C6F2C2077
+                                                                 6F726C6421000000
+                                                                 0000000000000000
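
The immediate form can be sketched as below. `encode_short` is a hypothetical helper that takes the immediate tag (4, 6 or 8) and the already-encoded bytes (UTF-8 for `String` and `Symbol`); longer data would instead go into a `Buf` referenced with the next tag up.

```python
def encode_short(tag, data):
    """Pack up to 7 bytes of data into an immediate Ref with the given tag."""
    if len(data) > 7:
        raise ValueError("data longer than 7 bytes needs a Buf")
    padded = data + bytes(7 - len(data))  # first data byte ends up uppermost
    payload = (int.from_bytes(padded, "big") << 4) | len(data)
    return (payload << 4) | tag
```

The explicit length nibble is what lets `"a\0a"` be told apart from a shorter string followed by zero padding.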
+
+### Booleans.
+
+    Value                                      Ref (64-bit)      Buf (hex bytes)
+    -----------------------------------------  ----------------  ----------------
+    #f                                         0000000000000000  -
+    #t                                         0000000000000010  -
+
+### Floats and Doubles.
+
+Each IEEE 754 4- and 8-byte binary representation is encoded into a `Buf`,
+pointed to with a `Ref` with tag 1. The length of the `Buf` disambiguates
+between 32-bit floats and 64-bit doubles.
+
+((This is a very sparse encoding! Each float/double takes up 24 bytes split
+across the `Buf` and `Ref`.))
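
A sketch of the float `Buf` follows, using Python's standard `struct` module for the IEEE 754 encodings. `encode_float_buf` and `decode_float_buf` are hypothetical names, and the padding behaviour assumes the `Buf` framing described earlier.

```python
import struct


def encode_float_buf(x, double=True):
    """Frame a float as a Buf holding 8 (double) or 4 (single) payload bytes."""
    payload = struct.pack("<d" if double else "<f", x)
    buf = len(payload).to_bytes(8, "little") + payload
    return buf + bytes(-len(buf) % 16)


def decode_float_buf(buf):
    """Use the Buf's payload length to choose between single and double."""
    length = int.from_bytes(buf[:8], "little")
    if length == 4:
        return struct.unpack("<f", buf[8:12])[0]
    if length == 8:
        return struct.unpack("<d", buf[8:16])[0]
    raise ValueError("float Buf payload must be 4 or 8 bytes")
```

Either precision yields a 16-byte `Buf`, which together with the 8-byte `Ref` accounts for the 24 bytes noted above.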
+
+### Embeddeds.
+
+To encode an `Embedded`, first choose a `Value` to represent the denoted
+object, and encode that, producing a `Ref`. Place that ref in a `Buf` all
+of its own (with length 8). Finally, point to the `Buf` with a `Ref` with
+tag 14 (`E`).
+
+### Annotations.
+
+((Not sure: put them as a trailer after a Header?))
+
+## Security Considerations
+
+((TBD))
+
+## Appendix. Autodetection of textual or binary syntax
+
+The first byte of a Header is 0xFF, which never appears in valid UTF-8
+text. ((...))