Update tagging scheme

Tony Garnock-Jones 2023-06-27 22:19:52 +02:00
parent 5562add7ba
commit 775741f944
1 changed files with 77 additions and 107 deletions

@@ -66,76 +66,36 @@ Either way, the tag on the special `Ref` is the type of the encoded value.
### Tags and Refs.
................................................................
Version 1
The following table maps bit values in the low (leftmost) byte of a `Ref`
to their interpretation. In interpretations including a three-bit `nnn`
value, the `nnn` bits specify the length of the used portion of the
remaining 56 bits of the `Ref`, counted in bytes, starting from the
following byte, with value `000` disallowed.
00000000 IMM bool
...00100 IMM RESERVED
nnn01000 IMM float nnn = length of payload in bytes. 000 disallowed
nnn10000 IMM str
nnn10100 IMM bytes
nnn11000 IMM sym
Bit number Meaning
7654 3210
--------- --- -------------------------------------------------------------
0000 0000 IMM Boolean; next byte = 0 means false; 1 means true.
...1 0000 IMM reserved
nnn0 0001 IMM Float: nnn must be 100, meaning a 32-bit IEEE754 value.
nnn1 0001 IMM ByteString
nnn0 0010 IMM String
nnn1 0010 IMM Symbol
....1100 IMM int
.... 0011 IMM SignedInteger between -2^59 and (2^59)-1, inclusive
.....010 RESERVED
....0110 PTR embedded
....1110 PTR float
....0001 PTR str
....0101 PTR bytes
....1001 PTR sym
....1101 PTR int
....0011 PTR rec
....0111 PTR seq
....1011 PTR set
....1111 PTR map
................................................................
Version 2
0000 0000 IMM bool
...1 0000 IMM RESERVED
nnn0 0001 IMM float nnn = length of payload in bytes. 000 disallowed
nnn1 0001 IMM bytes
nnn0 0010 IMM str
nnn1 0010 IMM sym
.... 0011 IMM int
.... 0100 PTR int
.... 0101 PTR str
.... 0110 PTR bytes
.... 0111 PTR sym
.... 1000 PTR rec
.... 1001 PTR seq
.... 1010 PTR set
.... 1011 PTR map
.... 1100 PTR embedded
.... 1101 PTR float
.... 1110 RESERVED
.... 1111 RESERVED
Tag Type Interpretation of 60-bit payload
--- ------------- --------------------------------
0 Boolean 0 = False, 1 = True
1 IEEE 754 Offset to Buf holding little-endian 32/64-bit float
2 SignedInteger Signed 60-bit integer
3 SignedInteger Offset to Buf holding little-endian signed integer
4 String 0-7 bytes of UTF-8; length in lower 4 bits
5 String Offset to Buf holding UTF-8 data
6 ByteString 0-7 bytes of raw binary; length in lower 4 bits
7 ByteString Offset to Buf holding raw binary data
8 Symbol 0-7 bytes of UTF-8; length in lower 4 bits
9 Symbol Offset to Buf holding UTF-8 data
A Record Offset to Buf holding Refs (label, fields)
B Sequence Offset to Buf holding Refs (sequence values)
C Set Offset to Buf holding Refs (elements in arbitrary order)
D Dictionary Offset to Buf holding Refs (key/value pairs)
E Embedded Offset to Buf holding a single Ref
F - (reserved)
.... 0100 PTR SignedInteger outside the immediate range
.... 0101 PTR String
.... 0110 PTR ByteString
.... 0111 PTR Symbol
.... 1000 PTR Record
.... 1001 PTR Sequence
.... 1010 PTR Set
.... 1011 PTR Dictionary
.... 1100 PTR Embedded
.... 1101 PTR Double: length of pointed-to Buf must be 8
.... 1110 reserved
.... 1111 reserved
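The updated table (the rows headed by bit numbers `7654 3210`, with type names such as `Boolean`, `ByteString` and `Dictionary`) can be read as a decision procedure on the low byte of a `Ref`. The following Python sketch is one possible reading, not part of the specification: it assumes a `Ref` is handled as an unsigned 64-bit integer whose least significant byte is the tag byte, and the name `classify_ref` is hypothetical. Low bytes that the table does not list (low nibble `0000` with bit 4 clear and nonzero upper bits) are rejected here, which is also an assumption.

```python
from typing import Tuple

# Low-nibble values 4..13 are the pointer tags in the updated table.
POINTER_TYPES = {
    0x4: "SignedInteger", 0x5: "String", 0x6: "ByteString", 0x7: "Symbol",
    0x8: "Record", 0x9: "Sequence", 0xA: "Set", 0xB: "Dictionary",
    0xC: "Embedded", 0xD: "Double",
}

# (low nibble, bit 4) -> immediate type carrying an nnn length field.
LENGTH_TAGGED = {
    (0x1, 0): "Float", (0x1, 1): "ByteString",
    (0x2, 0): "String", (0x2, 1): "Symbol",
}


def classify_ref(ref: int) -> Tuple[str, str]:
    """Return (form, type) for a Ref, where form is 'IMM', 'PTR' or '-'."""
    low = ref & 0xFF
    nibble = low & 0x0F
    bit4 = (low >> 4) & 1
    nnn = low >> 5  # payload length in bytes, for length-tagged immediates

    if nibble == 0x0:
        if low == 0x00:
            return ("IMM", "Boolean")
        if bit4:
            return ("IMM", "reserved")
        raise ValueError("low byte is not listed in the table")
    if nibble in (0x1, 0x2):
        if nnn == 0:
            raise ValueError("nnn = 000 is disallowed")
        kind = LENGTH_TAGGED[(nibble, bit4)]
        if kind == "Float" and nnn != 4:
            raise ValueError("Float requires nnn = 100")
        return ("IMM", kind)
    if nibble == 0x3:
        return ("IMM", "SignedInteger")
    if nibble in POINTER_TYPES:
        return ("PTR", POINTER_TYPES[nibble])
    return ("-", "reserved")  # low nibble 1110 or 1111
```

For instance, a `Ref` whose low byte is `A2` classifies as an immediate `String`, and one whose low byte is `0D` classifies as a pointer to a `Double`.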
### Records, Sequences, Sets and Dictionaries.
@@ -147,50 +107,55 @@ Either way, the tag on the special `Ref` is the type of the encoded value.
n*8 8 Ref n-1
(n+1)*8 8 Padding, only if n is even
Each compound datum is represented as a sequence of `Ref`s representing the
contained `Value`s. Each `Record`'s sequence represents the label, followed
by the fields in order. Each `Sequence`'s representation is just its
contained values in order. `Set`s are ordered arbitrarily into a sequence.
The key-value pairs in a `Dictionary` are ordered arbitrarily, alternating
between keys and their matching values.
Each compound datum is represented as a `Buf` containing a sequence of
`Ref`s representing the contained `Value`s. Each `Record`'s sequence
represents the label, followed by the fields in order. Each `Sequence`'s
representation is just its contained values in order. `Set`s are ordered
arbitrarily into a sequence. The key-value pairs in a `Dictionary` are
ordered arbitrarily, alternating between keys and their matching values.
There is *no* ordering requirement on the elements of `Set`s or the
key-value pairs in a `Dictionary`. They may appear in any order. However,
the elements and keys *MUST* be pairwise distinct according to the
[Preserves equivalence relation](preserves.html#equivalence).
Empty structures are represented using a `Ref` with a zero offset and the
appropriate tag.
### SignedIntegers.
Integers between -2<sup>59</sup> and 2<sup>59</sup>-1, inclusive, are
represented as immediate values in a `Ref` with tag 2. Integers outside
this range are represented with a `Ref` with tag 3 pointing to a `Buf`
represented as immediate values in a `Ref` with tag 3. Integers outside
this range are represented with a `Ref` with tag 4 pointing to a `Buf`
containing exactly as many 64-bit words as needed to unambiguously identify
the value and its sign, in little-endian byte and word ordering. Every
`SignedInteger` *MUST* be represented with its shortest possible encoding.
Zero is represented using tag 3; use of tag 4 with a zero offset is
forbidden.
For example,
Number (decimal) Ref (64-bit) Buf (hex bytes)
----------------------------------------- ---------------- ----------------
-576460752303423488 8000000000000002 -
-257 FFFFFFFFFFFFEFF2 -
-1 FFFFFFFFFFFFFFF2 -
0 0000000000000002 -
1 0000000000000012 -
257 0000000000001012 -
576460752303423487 7FFFFFFFFFFFFFF2 -
-576460752303423488 8000000000000003 -
-257 FFFFFFFFFFFFEFF3 -
-1 FFFFFFFFFFFFFFF3 -
0 0000000000000003 -
1 0000000000000013 -
257 0000000000001013 -
576460752303423487 7FFFFFFFFFFFFFF3 -
1000000000000000000000000000000 ...............3 1000000000000000
1000000000000000000000000000000 ...............4 1000000000000000
00000040EAED7446
D09C2C9F0C000000
0000000000000000
-1000000000000000000000000000000 ...............3 1000000000000000
-1000000000000000000000000000000 ...............4 1000000000000000
000000C015128BB9
2F63D360F3FFFFFF
0000000000000000
87112285931760246646623899502532662132736 ...............3 1800000000000000
87112285931760246646623899502532662132736 ...............4 1800000000000000
0000000000000000
0000000000000000
0001000000000000
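The integer rules with the updated tags (immediate tag 3, pointer tag 4) can be checked against the example rows with a short Python sketch. It is illustrative only: the function names are hypothetical, a `Ref` is assumed to be an unsigned 64-bit integer with the tag in its low four bits, and `big_int_payload` produces only the little-endian data words, not the length word or the trailing padding that the `Buf` column also shows.

```python
MASK64 = (1 << 64) - 1
IMM_MIN, IMM_MAX = -(1 << 59), (1 << 59) - 1
IMM_INT_TAG = 0x3


def encode_immediate_int(n: int) -> int:
    """Pack an integer in the immediate range into a 64-bit Ref."""
    if not IMM_MIN <= n <= IMM_MAX:
        raise ValueError("outside the immediate range; use the pointer form")
    # Python's << and | act on the two's-complement value; mask to 64 bits.
    return ((n << 4) | IMM_INT_TAG) & MASK64


def decode_immediate_int(ref: int) -> int:
    """Recover the signed 60-bit payload of an immediate integer Ref."""
    if ref & 0xF != IMM_INT_TAG:
        raise ValueError("not an immediate SignedInteger")
    payload = ref >> 4
    if payload >= 1 << 59:  # sign bit of the 60-bit payload is set
        payload -= 1 << 60
    return payload


def big_int_payload(n: int) -> bytes:
    """Little-endian two's-complement words for an out-of-range integer."""
    if IMM_MIN <= n <= IMM_MAX:
        raise ValueError("shortest encoding is the immediate form")
    # One sign bit plus the magnitude bits, rounded up to whole 64-bit words.
    bits = (n if n >= 0 else ~n).bit_length() + 1
    words = (bits + 63) // 64
    return n.to_bytes(words * 8, "little", signed=True)
```

With these definitions, `encode_immediate_int(-257)` yields `0xFFFFFFFFFFFFEFF3`, and `big_int_payload(10**30)` yields the two data words shown for that number in the table.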
@@ -202,27 +167,28 @@ Syntax for these three types varies only in the tag used. For `String` and
points, while for `ByteString` it is the raw data contained within the
`Value` unmodified.
Encoded data of length 7 bytes or shorter is represented as an immediate
`Ref` with tag 4 (`String`), 6 (`ByteString`) or 8 (`Symbol`). The lower 4
bits of the 60-bit payload are the length of the encoded data; the upper 56
bits are 7 bytes of data, with the first data byte in the lowest byte, so
that the order of data bytes in memory in an immediate encoding matches the
order in a `Buf` encoding.
Encoded data of length between 1 and 7 bytes is represented as an immediate
`Ref` where the low *five* bits are `00010` (`String`), `10001`
(`ByteString`), or `10010` (`Symbol`). The upper three bits of the low byte
of the `Ref` give the length in bytes. The remaining bytes in the `Ref` are
the data, in memory order.
Data longer than 7 bytes is represented with a `Ref` with tag 5, 7 or 9
pointing to a `Buf` containing the bytes of encoded data. Empty values
(length 0) *MUST* be encoded using pointer `Ref` form with special offset
zero.
`Ref` tags 5, 6, and 7 are pointers to `String`, `ByteString` and `Symbol`
`Buf`s, respectively. Offset zero signifies zero-length data; otherwise,
the pointed-to `Buf` contains the bytes of encoded data.
Empty values (length 0) *MUST* be encoded using pointer `Ref` form with
special offset zero.
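The low byte of an immediate `String`, `ByteString` or `Symbol` `Ref`, as described above for the updated scheme, combines a three-bit length with a five-bit type pattern. A minimal Python sketch of that one byte follows; the function names are hypothetical, and the placement of the data bytes within the rest of the `Ref` is deliberately left out.

```python
# Low five bits of the tag byte for each immediate textual type.
LOW_FIVE_BITS = {
    "String": 0b00010,
    "ByteString": 0b10001,
    "Symbol": 0b10010,
}


def immediate_low_byte(kind: str, length: int) -> int:
    """Build the low byte of an immediate Ref for 1..7 bytes of data."""
    if not 1 <= length <= 7:
        raise ValueError("immediate form holds between 1 and 7 bytes")
    return (length << 5) | LOW_FIVE_BITS[kind]


def split_low_byte(low: int) -> tuple:
    """Split a low byte into (length in bytes, low five bits)."""
    return (low >> 5, low & 0x1F)
```

A five-byte `String` such as `"Hello"` therefore has low byte `0xA2`, and a three-byte `ByteString` has low byte `0x71`.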
For example,
Value Ref (64-bit) Buf (hex bytes)
----------------------------------------- ---------------- ----------------
"" 0000000000000005 -
#"" 0000000000000007 -
|| 0000000000000009 -
"Hello" 48656C6C6F000054 -
"a\0a" 6100610000000034 -
"" 0000000000000002 -
#"" 0000000000000011 -
|| 0000000000000012 -
"Hello" 48656C6C6F0000A2 -
#"a\0a" 6100610000000071 -
"Hello, world!" ...............5 0D00000000000000
48656C6C6F2C2077
@@ -234,23 +200,27 @@ For example,
Value Ref (64-bit) Buf (hex bytes)
----------------------------------------- ---------------- ----------------
#f 0000000000000000 -
#t 0000000000000010 -
#t 0000000000000100 -
### Floats and Doubles.
Each IEEE 754 4- and 8-byte binary representation is encoded into a `Buf`,
pointed to with a `Ref` with tag 1. The length of the `Buf` disambiguates
between 32-bit floats and 64-bit doubles.
4-byte (32-bit) IEEE 754 `Float`s are encoded within immediate `Ref`s with
low byte equal to 0x81. The next four lowest bytes are the 4-byte,
little-endian binary representation of the floating-point value, and the
upper three bytes of the `Ref` are unused.
((This is a very sparse encoding! Each float/double takes up 24 bytes split
across the `Buf` and `Ref`.))
8-byte (64-bit) IEEE 754 `Double`s are encoded into a `Buf`, pointed to by
a `Ref` with tag 13. The length of the `Buf` must be 8 bytes.
((This is a very sparse encoding for `Double`s! Each `Double` takes up 24
bytes split across the `Buf` and `Ref`.))
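The immediate `Float` layout in the updated scheme (low byte 0x81, followed by four little-endian bytes) can be sketched in Python with the standard `struct` module. This is an illustration under one assumption, namely that a `Ref` is handled as an unsigned 64-bit integer whose least significant byte is the tag byte; the function names are hypothetical.

```python
import struct

FLOAT_LOW_BYTE = 0x81  # nnn = 100 (4 bytes), low five bits 00001


def encode_float_ref(x: float) -> int:
    """Pack a 32-bit float into an immediate Ref; upper 3 bytes stay zero."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return (bits << 8) | FLOAT_LOW_BYTE


def decode_float_ref(ref: int) -> float:
    """Extract the 32-bit float from an immediate Float Ref."""
    if ref & 0xFF != FLOAT_LOW_BYTE:
        raise ValueError("not an immediate Float")
    bits = (ref >> 8) & 0xFFFFFFFF
    (x,) = struct.unpack("<f", struct.pack("<I", bits))
    return x
```

Under that assumption, `1.0` (bit pattern `3F800000`) becomes the `Ref` `0000003F80000081`.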
### Embeddeds.
To encode an `Embedded`, first choose a `Value` to represent the denoted
object, and encode that, producing a `Ref`. Place that ref in a `Buf` all
of its own (with length 8). Finally, point to the `Buf` with a `Ref` with
tag 15.
tag 12.
### Annotations.