Update tagging scheme

Tony Garnock-Jones 2023-06-27 22:19:52 +02:00
parent 5562add7ba
commit 775741f944
1 changed file with 77 additions and 107 deletions

@@ -66,76 +66,36 @@ Either way, the tag on the special `Ref` is the type of the encoded value.
 ### Tags and Refs.

-    ................................................................
-    Version 1
-
-    00000000 IMM bool
-    ...00100 IMM RESERVED
-    nnn01000 IMM float    nnn = length of payload in bytes. 000 disallowed
-    nnn10000 IMM str
-    nnn10100 IMM bytes
-    nnn11000 IMM sym
-
-    ....1100 IMM int
-    .....010     RESERVED
-    ....0110 PTR embedded
-    ....1110 PTR float
-
-    ....0001 PTR str
-    ....0101 PTR bytes
-    ....1001 PTR sym
-    ....1101 PTR int
-
-    ....0011 PTR rec
-    ....0111 PTR seq
-    ....1011 PTR set
-    ....1111 PTR map
-
-    ................................................................
-    Version 2
-
-    0000 0000 IMM bool
-    ...1 0000 IMM RESERVED
-    nnn0 0001 IMM float    nnn = length of payload in bytes. 000 disallowed
-    nnn1 0001 IMM bytes
-    nnn0 0010 IMM str
-    nnn1 0010 IMM sym
-    .... 0011 IMM int
-    .... 0100 PTR int
-    .... 0101 PTR str
-    .... 0110 PTR bytes
-    .... 0111 PTR sym
-    .... 1000 PTR rec
-    .... 1001 PTR seq
-    .... 1010 PTR set
-    .... 1011 PTR map
-    .... 1100 PTR embedded
-    .... 1101 PTR float
-    .... 1110     RESERVED
-    .... 1111     RESERVED
-
-    Tag Type          Interpretation of 60-bit payload
-    --- ------------- --------------------------------
-    0   Boolean       0 = False, 1 = True
-    1   IEEE 754      Offset to Buf holding little-endian 32/64-bit float
-    2   SignedInteger Signed 60-bit integer
-    3   SignedInteger Offset to Buf holding little-endian signed integer
-    4   String        0-7 bytes of UTF-8; length in lower 4 bits
-    5   String        Offset to Buf holding UTF-8 data
-    6   ByteString    0-7 bytes of raw binary; length in lower 4 bits
-    7   ByteString    Offset to Buf holding raw binary data
-    8   Symbol        0-7 bytes of UTF-8; length in lower 4 bits
-    9   Symbol        Offset to Buf holding UTF-8 data
-    A   Record        Offset to Buf holding Refs (label, fields)
-    B   Sequence      Offset to Buf holding Refs (sequence values)
-    C   Set           Offset to Buf holding Refs (elements in arbitrary order)
-    D   Dictionary    Offset to Buf holding Refs (key/value pairs)
-    E   Embedded      Offset to Buf holding a single Ref
-    F   -             (reserved)
+The following table maps bit values in the low (leftmost) byte of a `Ref`
+to their interpretation. In interpretations including a three-bit `nnn`
+value, the `nnn` bits specify the length of the used portion of the
+remaining 56 bits of the `Ref`, counted in bytes, starting from the
+following byte, with value `000` disallowed.
+
+    Bit number    Meaning
+    7654 3210
+    --------- --- -------------------------------------------------------------
+    0000 0000 IMM Boolean; next byte = 0 means false; 1 means true.
+    ...1 0000 IMM reserved
+    nnn0 0001 IMM Float: nnn must be 100, meaning a 32-bit IEEE754 value.
+    nnn1 0001 IMM ByteString
+    nnn0 0010 IMM String
+    nnn1 0010 IMM Symbol
+    .... 0011 IMM SignedInteger between -2^59 and (2^59)-1, inclusive
+    .... 0100 PTR SignedInteger outside the immediate range
+    .... 0101 PTR String
+    .... 0110 PTR ByteString
+    .... 0111 PTR Symbol
+    .... 1000 PTR Record
+    .... 1001 PTR Sequence
+    .... 1010 PTR Set
+    .... 1011 PTR Dictionary
+    .... 1100 PTR Embedded
+    .... 1101 PTR Double: length of pointed-to Buf must be 8
+    .... 1110     reserved
+    .... 1111     reserved

 ### Records, Sequences, Sets and Dictionaries.
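
As an illustration of the new low-byte table in the hunk above, the following Python sketch classifies a `Ref` by its low byte. The name `classify_low_byte` and the shape of the returned tuples are hypothetical and not part of the commit. Bit patterns that the table does not list are reported as unspecified rather than guessed at.

```python
# Illustrative sketch only: all names here are hypothetical, not from the commit.

# Immediate types selected by the low five bits of the low byte.
IMMEDIATE_BY_LOW5 = {
    0b00001: "Float",
    0b10001: "ByteString",
    0b00010: "String",
    0b10010: "Symbol",
}

# Types selected by the low four bits alone (upper bits belong to the payload).
BY_LOW4 = {
    0b0011: ("IMM", "SignedInteger"),
    0b0100: ("PTR", "SignedInteger"),
    0b0101: ("PTR", "String"),
    0b0110: ("PTR", "ByteString"),
    0b0111: ("PTR", "Symbol"),
    0b1000: ("PTR", "Record"),
    0b1001: ("PTR", "Sequence"),
    0b1010: ("PTR", "Set"),
    0b1011: ("PTR", "Dictionary"),
    0b1100: ("PTR", "Embedded"),
    0b1101: ("PTR", "Double"),
}


def classify_low_byte(low: int):
    """Return (kind, type_name, nnn); nnn is None when the row has no length field."""
    if not 0 <= low <= 0xFF:
        raise ValueError("low byte must be in range 0..255")
    low4, low5, nnn = low & 0x0F, low & 0x1F, low >> 5
    if low == 0x00:
        return ("IMM", "Boolean", None)
    if low5 == 0b10000:
        return ("IMM", "reserved", None)
    if low5 in IMMEDIATE_BY_LOW5:
        name = IMMEDIATE_BY_LOW5[low5]
        if nnn == 0:
            raise ValueError("nnn = 000 is disallowed")
        if name == "Float" and nnn != 0b100:
            raise ValueError("Float requires nnn = 100")
        return ("IMM", name, nnn)
    if low4 in BY_LOW4:
        return BY_LOW4[low4] + (None,)
    if low4 in (0b1110, 0b1111):
        return (None, "reserved", None)
    # Assumption: anything else (e.g. 0x20) is simply not covered by the table.
    return (None, "unspecified", None)
```

For instance, `classify_low_byte(0xA2)` reports an immediate `String` with a five-byte payload, and `classify_low_byte(0x0D)` reports a pointer to a `Double`.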
@@ -147,50 +107,55 @@ Either way, the tag on the special `Ref` is the type of the encoded value.
     n*8        8   Ref n-1
     (n+1)*8    8   Padding, only if n is even

-Each compound datum is represented as a sequence of `Ref`s representing the
-contained `Value`s. Each `Record`'s sequence represents the label, followed
-by the fields in order. Each `Sequence`'s representation is just its
-contained values in order. `Set`s are ordered arbitrarily into a sequence.
-The key-value pairs in a `Dictionary` are ordered arbitrarily, alternating
-between keys and their matching values.
+Each compound datum is represented as a `Buf` containing a sequence of
+`Ref`s representing the contained `Value`s. Each `Record`'s sequence
+represents the label, followed by the fields in order. Each `Sequence`'s
+representation is just its contained values in order. `Set`s are ordered
+arbitrarily into a sequence. The key-value pairs in a `Dictionary` are
+ordered arbitrarily, alternating between keys and their matching values.

 There is *no* ordering requirement on the elements of `Set`s or the
 key-value pairs in a `Dictionary`. They may appear in any order. However,
 the elements and keys *MUST* be pairwise distinct according to the
 [Preserves equivalence relation](preserves.html#equivalence).

+Empty structures are represented using a `Ref` with a zero offset and the
+appropriate tag.
+
 ### SignedIntegers.

 Integers between -2<sup>59</sup> and 2<sup>59</sup>-1, inclusive, are
-represented as immediate values in a `Ref` with tag 2. Integers outside
-this range are represented with a `Ref` with tag 3 pointing to a `Buf`
+represented as immediate values in a `Ref` with tag 3. Integers outside
+this range are represented with a `Ref` with tag 4 pointing to a `Buf`
 containing exactly as many 64-bit words as needed to unambiguously identify
 the value and its sign, in little-endian byte and word ordering. Every
 `SignedInteger` *MUST* be represented with its shortest possible encoding.

+Zero is represented using tag 3; use of tag 4 with a zero offset is
+forbidden.
+
 For example,

     Number (decimal)                          Ref (64-bit)     Buf (hex bytes)
     ----------------------------------------- ---------------- ----------------
-    -576460752303423488                       8000000000000002 -
-    -257                                      FFFFFFFFFFFFEFF2 -
-    -1                                        FFFFFFFFFFFFFFF2 -
-    0                                         0000000000000002 -
-    1                                         0000000000000012 -
-    257                                       0000000000001012 -
-    576460752303423487                        7FFFFFFFFFFFFFF2 -
-    1000000000000000000000000000000           ...............3 1000000000000000
+    -576460752303423488                       8000000000000003 -
+    -257                                      FFFFFFFFFFFFEFF3 -
+    -1                                        FFFFFFFFFFFFFFF3 -
+    0                                         0000000000000003 -
+    1                                         0000000000000013 -
+    257                                       0000000000001013 -
+    576460752303423487                        7FFFFFFFFFFFFFF3 -
+    1000000000000000000000000000000           ...............4 1000000000000000
                                                                00000040EAED7446
                                                                D09C2C9F0C000000
                                                                0000000000000000
-    -1000000000000000000000000000000          ...............3 1000000000000000
+    -1000000000000000000000000000000          ...............4 1000000000000000
                                                                000000C015128BB9
                                                                2F63D360F3FFFFFF
                                                                0000000000000000
-    87112285931760246646623899502532662132736 ...............3 1800000000000000
+    87112285931760246646623899502532662132736 ...............4 1800000000000000
                                                                0000000000000000
                                                                0000000000000000
                                                                0001000000000000
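
The new `SignedInteger` rows in the hunk above can be reproduced with a short Python sketch. The helper names `encode_immediate_int` and `int_buf_words` are hypothetical. The sketch assumes, as the example rows suggest, that an immediate integer occupies the upper 60 bits of the `Ref` in two's complement with the low four bits set to `0011`. It produces only the integer words of a `Buf`, not the leading length word or the trailing padding shown in the table.

```python
# Illustrative sketch only: helper names are hypothetical, not from the commit.
MASK64 = (1 << 64) - 1
IMM_MIN, IMM_MAX = -(1 << 59), (1 << 59) - 1


def encode_immediate_int(n: int) -> int:
    """Pack n into the upper 60 bits of a 64-bit Ref, with low nibble 0011."""
    if not IMM_MIN <= n <= IMM_MAX:
        raise ValueError("out of immediate range; a pointer Ref is needed")
    return ((n << 4) | 0b0011) & MASK64


def int_buf_words(n: int) -> bytes:
    """Shortest run of little-endian 64-bit words holding n in two's complement."""
    # Magnitude bits plus one sign bit; ~n handles negative values correctly.
    bits = (n if n >= 0 else ~n).bit_length() + 1
    words = max(1, (bits + 63) // 64)
    return n.to_bytes(words * 8, "little", signed=True)
```

Formatting `encode_immediate_int(-257)` with `format(..., "016X")` gives the `Ref` column of the `-257` row, and `int_buf_words(10**30).hex()` gives the two data words of the corresponding `Buf`.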
@@ -202,27 +167,28 @@ Syntax for these three types varies only in the tag used. For `String` and
 points, while for `ByteString` it is the raw data contained within the
 `Value` unmodified.

-Encoded data of length 7 bytes or shorter is represented as an immediate
-`Ref` with tag 4 (`String`), 6 (`ByteString`) or 8 (`Symbol`). The lower 4
-bits of the 60-bit payload are the length of the encoded data; the upper 56
-bits are 7 bytes of data, with the first data byte in the lowest byte, so
-that the order of data bytes in memory in an immediate encoding matches the
-order in a `Buf` encoding.
+Encoded data of length between 1 and 7 bytes is represented as an immediate
+`Ref` where the low *five* bits are `00010` (`String`), `10001`
+(`ByteString`), or `10010` (`Symbol`). The upper three bits of the low byte
+of the `Ref` give the length in bytes. The remaining bytes in the `Ref` are
+the data, in memory order.

-Data longer than 7 bytes is represented with a `Ref` with tag 5, 7 or 9
-pointing to a `Buf` containing the bytes of encoded data. Empty values
-(length 0) *MUST* be encoded using pointer `Ref` form with special offset
-zero.
+`Ref` tags 5, 6, and 7 are pointers to `String`, `ByteString` and `Symbol`
+`Buf`s, respectively. Offset zero signifies zero-length data; otherwise,
+the pointed-to `Buf` contains the bytes of encoded data.
+
+Empty values (length 0) *MUST* be encoded using pointer `Ref` form with
+special offset zero.

 For example,

     Value                                     Ref (64-bit)     Buf (hex bytes)
     ----------------------------------------- ---------------- ----------------
-    ""                                        0000000000000005 -
-    #""                                       0000000000000007 -
-    ||                                        0000000000000009 -
-    "Hello"                                   48656C6C6F000054 -
-    "a\0a"                                    6100610000000034 -
+    ""                                        0000000000000002 -
+    #""                                       0000000000000011 -
+    ||                                        0000000000000012 -
+    "Hello"                                   48656C6C6F0000A2 -
+    #"a\0a"                                   6100610000000071 -
     "Hello, world!"                           ...............5 0D00000000000000
                                                                48656C6C6F2C2077
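
To make the new immediate form for `String`, `ByteString` and `Symbol` concrete, the following Python sketch computes only the low byte of such a `Ref`: the length goes in the upper three bits and the type pattern in the low five. The names `TYPE_BITS` and `immediate_low_byte` are hypothetical. How the data bytes are laid out in the rest of the `Ref` is deliberately left out of the sketch.

```python
# Illustrative sketch only: names are hypothetical, not from the commit.
TYPE_BITS = {
    "String": 0b00010,
    "ByteString": 0b10001,
    "Symbol": 0b10010,
}


def immediate_low_byte(kind: str, data: bytes) -> int:
    """Low byte of an immediate Ref: length in bits 7..5, type pattern in bits 4..0."""
    if kind not in TYPE_BITS:
        raise ValueError("kind must be String, ByteString or Symbol")
    if not 1 <= len(data) <= 7:
        # Empty data uses the pointer form; longer data needs a Buf.
        raise ValueError("immediate form holds between 1 and 7 bytes")
    return (len(data) << 5) | TYPE_BITS[kind]
```

For the five UTF-8 bytes of `"Hello"` this yields `0xA2`, and for the three bytes of `#"a\0a"` it yields `0x71`, matching the low bytes in the example rows.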
@@ -234,23 +200,27 @@ For example,
     Value                                     Ref (64-bit)     Buf (hex bytes)
     ----------------------------------------- ---------------- ----------------
     #f                                        0000000000000000 -
-    #t                                        0000000000000010 -
+    #t                                        0000000000000100 -

 ### Floats and Doubles.

-Each IEEE 754 4- and 8-byte binary representation is encoded into a `Buf`,
-pointed to with a `Ref` with tag 1. The length of the `Buf` disambiguates
-between 32-bit floats and 64-bit doubles.
+4-byte (32-bit) IEEE 754 `Float`s are encoded within immediate `Ref`s with
+low byte equal to 0x81. The next four lowest bytes are the 4-byte,
+little-endian binary representation of the floating-point value, and the
+upper three bytes of the `Ref` are unused.

-((This is a very sparse encoding! Each float/double takes up 24 bytes split
-across the `Buf` and `Ref`.))
+8-byte (64-bit) IEEE 754 `Double`s are encoded into a `Buf`, pointed to by
+a `Ref` with tag 13. The length of the `Buf` must be 8 bytes.
+
+((This is a very sparse encoding for `Double`s! Each `Double` takes up 24
+bytes split across the `Buf` and `Ref`.))

 ### Embeddeds.

 To encode an `Embedded`, first choose a `Value` to represent the denoted
 object, and encode that, producing a `Ref`. Place that ref in a `Buf` all
 of its own (with length 8). Finally, point to the `Buf` with a `Ref` with
-tag 15.
+tag 12.

 ### Annotations.
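
The immediate `Float` layout introduced in this hunk can be sketched in Python as follows. The function names are hypothetical. The sketch assumes that the low byte of a `Ref` comes first in memory, as the "low (leftmost) byte" wording suggests, and that the three unused bytes are written as zero, which the commit does not specify.

```python
# Illustrative sketch only: names and the zero fill are assumptions.
import struct

FLOAT_LOW_BYTE = 0x81  # nnn = 100 (4 payload bytes), type bits 00001


def encode_float_ref(x: float) -> bytes:
    """Eight Ref bytes in memory order: 0x81, little-endian float32, three unused bytes."""
    return bytes([FLOAT_LOW_BYTE]) + struct.pack("<f", x) + b"\x00" * 3


def decode_float_ref(ref: bytes) -> float:
    """Inverse of encode_float_ref; the unused upper three bytes are ignored."""
    if len(ref) != 8 or ref[0] != FLOAT_LOW_BYTE:
        raise ValueError("not an immediate Float Ref")
    return struct.unpack("<f", ref[1:5])[0]
```

Under these assumptions `encode_float_ref(1.0)` produces the bytes `81 00 00 80 3F 00 00 00`, and `decode_float_ref` recovers any value that is exactly representable as a 32-bit float.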