New "blue jelly" machine-oriented binary syntax, inspired by argdata
parent 4528100248
commit 7055a6467c
|
@ -14,4 +14,4 @@ defaults:
|
|||
|
||||
title: "Preserves"
|
||||
version_date: "June 2022"
|
||||
version: "0.6.3"
|
||||
version: "0.7.0"
|
||||
|
|
|
@ -6,9 +6,11 @@ title: "Preserves: Binary Syntax"
|
|||
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
|
||||
{{ site.version_date }}. Version {{ site.version }}.
|
||||
|
||||
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
|
||||
[LEB128]: https://en.wikipedia.org/wiki/LEB128
|
||||
[argdata]: https://github.com/NuxiNL/argdata
|
||||
[canonical]: canonical-binary.html
|
||||
[google-varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
|
||||
[vlq]: https://en.wikipedia.org/wiki/Variable-length_quantity
|
||||
|
||||
*Preserves* is a data model, with associated serialization formats. This
|
||||
document defines one of those formats: a binary syntax for `Value`s from
|
||||
|
@ -24,49 +26,52 @@ For a value `v`, we write `«v»` for the `Repr` of v.
|
|||
### Type and Length representation.
|
||||
|
||||
Each `Repr` starts with a tag byte, describing the kind of information
|
||||
represented. Depending on the tag, a length indicator, further encoded
|
||||
information, and/or an ending tag may follow.
|
||||
represented.
|
||||
|
||||
tag (simple atomic data and small integers)
|
||||
tag ++ binarydata (most integers)
|
||||
tag ++ length ++ binarydata (large integers, strings, symbols, and binary)
|
||||
tag ++ repr ++ ... ++ endtag (compound data)
|
||||
However, inspired by [argdata][], a `Repr` does *not* describe its own
|
||||
length. Instead, the surrounding context must supply the length of the
|
||||
`Repr`.
|
||||
|
||||
The unique end tag is byte value `0x84`.
|
||||
As a consequence, `Repr`s for `Compound` values store the lengths of
|
||||
their contained values. Each contained `Value` is represented as a
|
||||
length in bytes followed by its own `Repr`.
|
||||
|
||||
If present after a tag, the length of a following piece of binary data
|
||||
is formatted as a [base 128 varint][varint].[^see-also-leb128] We
|
||||
write `varint(m)` for the varint-encoding of `m`. Quoting the
|
||||
[Google Protocol Buffers][varint] definition,
|
||||
<a id="varint"></a> Each length is stored as an [argdata][]-compatible
|
||||
big-endian base 128 *varint*.[^see-also-leb128] Each byte of a varint
|
||||
stores seven bits of the length. All bytes have a clear upper bit,
|
||||
except the final byte, which has the upper bit set. We write
|
||||
`len(m)` for the varint-encoding of a non-negative integer `m`,
|
||||
defined recursively as follows:
|
||||
|
||||
[^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned
|
||||
integers. Varints and LEB128-encoded integers differ only for
|
||||
signed integers, which are not used in Preserves.
|
||||
len(m) = e(m, 128)
|
||||
where e(v, d) = [v + d] if v < 128
|
||||
e(v / 128, 0) ++ [(v % 128) + d] if v ≥ 128
|
||||
|
||||
> Each byte in a varint, except the last byte, has the most
|
||||
> significant bit (msb) set – this indicates that there are further
|
||||
> bytes to come. The lower 7 bits of each byte are used to store the
|
||||
> two's complement representation of the number in groups of 7 bits,
|
||||
> least significant group first.
|
||||
[^see-also-leb128]: Argdata's length representation is very close to
|
||||
[Variable-length quantity (VLQ)][VLQ] encoding, differing only in
|
||||
the flipped interpretation of the high bit of each byte. It is
|
||||
big-endian, unlike [LEB128][] encoding ([as used by
|
||||
Google][google-varint] in protobufs).
|
||||
|
||||
The following table illustrates varint-encoding.
|
||||
|
||||
| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
|
||||
| ------ | ------------------- | ------------ |
|
||||
| 15 | `0001111` | 15 |
|
||||
| 300 | `0000010 0101100` | 172 2 |
|
||||
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
|
||||
| Number, `m` | `m` in binary, grouped into 7-bit chunks | `len(m)` bytes |
|
||||
|-------------|-------------------------------------------|-----------------|
|
||||
| 15 | `0001111` | 143 |
|
||||
| 300 | `0000010 0101100` | 2 172 |
|
||||
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 3 92 107 20 128 |
|
||||
|
||||
It is an error for a varint-encoded `m` in a `Repr` to be anything
|
||||
other than the unique shortest encoding for that `m`. That is, a
|
||||
varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0.
|
||||
It is an error for a varint-encoded `m` in a `Repr` to be anything other
|
||||
than the unique shortest encoding for that `m`. That is, a
|
||||
varint-encoding of `m` *MUST NOT* start with `0`.
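
The new `len(m)` definition above can be sketched in Python as follows. This is a minimal illustration derived only from the recursive definition and the table; the names `encode_len` and `decode_len` are hypothetical and do not come from the specification or any Preserves implementation.

```python
from typing import Tuple


def encode_len(m: int) -> bytes:
    """Big-endian base-128 encoding; only the final byte has its high bit set."""
    if m < 0:
        raise ValueError("lengths are non-negative")
    out = [(m % 128) + 128]      # final byte: low seven bits plus the marker bit
    m //= 128
    while m > 0:
        out.append(m % 128)      # earlier bytes: seven bits, high bit clear
        m //= 128
    return bytes(reversed(out))


def decode_len(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (m, next_pos), rejecting encodings that start with a 0 byte."""
    if pos >= len(data):
        raise ValueError("truncated length")
    if data[pos] == 0:
        raise ValueError("non-shortest length encoding")
    m = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated length")
        b = data[pos]
        pos += 1
        m = m * 128 + (b & 0x7F)
        if b & 0x80:             # high bit set: this was the final byte
            return m, pos
```

For instance, `encode_len(300)` yields the bytes `2 172`, matching the second row of the new table.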
|
||||
|
||||
### Records, Sequences, Sets and Dictionaries.
|
||||
|
||||
«<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
|
||||
«[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
|
||||
«#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
|
||||
«{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]
|
||||
«<L F_1...F_m>» = [0xA7] ++ seq(«L», «F_1», ..., «F_m»)
|
||||
«[X_1...X_m]» = [0xA8] ++ seq(«X_1», ..., «X_m»)
|
||||
«#{E_1...E_m}» = [0xA9] ++ seq(«E_1», ..., «E_m»)
|
||||
«{K_1:V_1...K_m:V_m}» = [0xAA] ++ seq(«K_1», «V_1», ..., «K_m», «V_m»)
|
||||
where seq(R_1, ... R_m) = len(R_1) ++ R_1 ++...++ len(R_m) ++ R_m
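
As a rough sketch of the new compound layout, the following Python builds `seq(...)` from already-encoded `Repr`s, reading `len(R)` as the length encoding of the byte count of `R`. The helper names (`encode_len`, `seq`, `record_repr`, `sequence_repr`) are illustrative assumptions, not names from the specification.

```python
RECORD_TAG = 0xA7
SEQUENCE_TAG = 0xA8


def encode_len(m: int) -> bytes:
    """Big-endian base-128 length; only the final byte has its high bit set."""
    out = [(m % 128) + 128]
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def seq(*reprs: bytes) -> bytes:
    """Concatenate each Repr, prefixed with its own length in bytes."""
    return b"".join(encode_len(len(r)) + r for r in reprs)


def sequence_repr(*item_reprs: bytes) -> bytes:
    return bytes([SEQUENCE_TAG]) + seq(*item_reprs)


def record_repr(label_repr: bytes, *field_reprs: bytes) -> bytes:
    return bytes([RECORD_TAG]) + seq(label_repr, *field_reprs)
```

Because each child carries its length, a reader can skip over a child without decoding it, and no end marker is needed.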
|
||||
|
||||
There is *no* ordering requirement on the `E_i` elements or
|
||||
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
|
||||
|
@ -89,7 +94,7 @@ serializing in some other implementation-defined order.
|
|||
ordering for writing out set elements and dictionary key/value
|
||||
pairs is *not* the same as the sort ordering implied by the
|
||||
semantic ordering of those elements or keys. For example, the
|
||||
`Repr` of a negative number very far from zero will start with
|
||||
`Repr` of a negative number very far from zero will start with a
|
||||
byte that is *greater* than the byte which starts the `Repr` of
|
||||
zero, making it sort lexicographically later by `Repr`, despite
|
||||
being semantically *less than* zero.
|
||||
|
@ -101,39 +106,31 @@ serializing in some other implementation-defined order.
|
|||
|
||||
### SignedIntegers.
|
||||
|
||||
«x» when x ∈ SignedInteger = [0xB0] ++ varint(m) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m>16
|
||||
([0xA0] + m - 1) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m≤16
|
||||
([0xA0] + x) if (-3≤x≤-1)
|
||||
([0x90] + x) if ( 0≤x≤12)
|
||||
where m = |intbytes(x)|
|
||||
«x» when x ∈ SignedInteger = [0xA3] ++ intbytes(x)
|
||||
|
||||
Integers in the range [-3,12] are compactly represented with tags
|
||||
between `0x90` and `0x9F` because they are so frequently used.
|
||||
Integers up to 16 bytes long are represented with a single-byte tag
|
||||
encoding the length of the integer. Larger integers are represented
|
||||
with an explicit varint length. Every `SignedInteger` *MUST* be
|
||||
represented with its shortest possible encoding.
|
||||
The function `intbytes(x)` gives the big-endian two's-complement binary
|
||||
representation of `x`, taking exactly as many whole bytes as needed to
|
||||
unambiguously identify the value and its sign. As a special case,
|
||||
`intbytes(0)` is the empty byte sequence. The most-significant bit in
|
||||
the first byte in `intbytes(x)` (for `x`≠0) is the sign
|
||||
bit.[^zero-intbytes] Every `SignedInteger` *MUST* be represented with
|
||||
its shortest possible encoding.
|
||||
|
||||
The function `intbytes(x)` gives the big-endian two's-complement
|
||||
binary representation of `x`, taking exactly as many whole bytes as
|
||||
needed to unambiguously identify the value and its sign, and `m =
|
||||
|intbytes(x)|`. The most-significant bit in the first byte in
|
||||
`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes] For
|
||||
example,
|
||||
For example,
|
||||
|
||||
«87112285931760246646623899502532662132736»
|
||||
= B0 12 01 00 00 00 00 00 00 00
|
||||
00 00 00 00 00 00 00 00
|
||||
00 00
|
||||
= A3 01 00 00 00 00 00 00 00
|
||||
00 00 00 00 00 00 00 00
|
||||
00 00
|
||||
|
||||
«-257» = A1 FE FF «-3» = 9D «128» = A1 00 80
|
||||
«-256» = A1 FF 00 «-2» = 9E «255» = A1 00 FF
|
||||
«-255» = A1 FF 01 «-1» = 9F «256» = A1 01 00
|
||||
«-254» = A1 FF 02 «0» = 90 «32767» = A1 7F FF
|
||||
«-129» = A1 FF 7F «1» = 91 «32768» = A2 00 80 00
|
||||
«-128» = A0 80 «12» = 9C «65535» = A2 00 FF FF
|
||||
«-127» = A0 81 «13» = A0 0D «65536» = A2 01 00 00
|
||||
«-4» = A0 FC «127» = A0 7F «131072» = A2 02 00 00
|
||||
«-257» = A3 FE FF «-3» = A3 FD «128» = A3 00 80
|
||||
«-256» = A3 FF 00 «-2» = A3 FE «255» = A3 00 FF
|
||||
«-255» = A3 FF 01 «-1» = A3 FF «256» = A3 01 00
|
||||
«-254» = A3 FF 02 «0» = A3 «32767» = A3 7F FF
|
||||
«-129» = A3 FF 7F «1» = A3 01 «32768» = A3 00 80 00
|
||||
«-128» = A3 80 «12» = A3 0C «65535» = A3 00 FF FF
|
||||
«-127» = A3 81 «13» = A3 0D «65536» = A3 01 00 00
|
||||
«-4» = A3 FC «127» = A3 7F «131072» = A3 02 00 00
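
The `intbytes(x)` function and the new single-tag integer encoding can be sketched in Python like this. It is an illustration under the assumption that Python's arbitrary-precision `int` stands in for `SignedInteger`; `signed_integer_repr` is a hypothetical helper name.

```python
def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement bytes for x; empty for zero."""
    if x == 0:
        return b""
    # For negative x, ~x equals -x - 1, which is non-negative; its bit length
    # is the number of value bits needed. One extra bit holds the sign.
    value_bits = (x if x > 0 else ~x).bit_length()
    nbytes = (value_bits + 1 + 7) // 8
    return x.to_bytes(nbytes, "big", signed=True)


def signed_integer_repr(x: int) -> bytes:
    """New encoding: tag 0xA3 followed directly by intbytes(x)."""
    return bytes([0xA3]) + intbytes(x)
```

The surrounding context supplies the length, so the `Repr` of zero is the tag byte alone.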
|
||||
|
||||
[^zero-intbytes]: The value 0 needs zero bytes to identify the
|
||||
value, so `intbytes(0)` is the empty byte string. Non-zero values
|
||||
|
@ -146,19 +143,19 @@ and `Symbol`, the data following the tag is a UTF-8 encoding of the
|
|||
`Value`'s code points, while for `ByteString` it is the raw data
|
||||
contained within the `Value` unmodified.
|
||||
|
||||
«S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ String
|
||||
[0xB2] ++ varint(|S|) ++ S if S ∈ ByteString
|
||||
[0xB3] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ Symbol
|
||||
«S» = [0xA4] ++ utf8(S) if S ∈ String
|
||||
[0xA5] ++ S if S ∈ ByteString
|
||||
[0xA6] ++ utf8(S) if S ∈ Symbol
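
A short Python sketch of the new length-free encodings for `String`, `ByteString` and `Symbol`. The function names are illustrative assumptions; Python `str` and `bytes` stand in for the Preserves types.

```python
STRING_TAG = 0xA4
BYTESTRING_TAG = 0xA5
SYMBOL_TAG = 0xA6


def string_repr(s: str) -> bytes:
    # Tag followed by the UTF-8 bytes; no length and no trailing NUL.
    return bytes([STRING_TAG]) + s.encode("utf-8")


def bytestring_repr(b: bytes) -> bytes:
    # Tag followed by the raw bytes, unmodified.
    return bytes([BYTESTRING_TAG]) + b


def symbol_repr(name: str) -> bytes:
    return bytes([SYMBOL_TAG]) + name.encode("utf-8")
```

A decoder recovers the payload as everything after the tag byte, since the enclosing context has already said how long the whole `Repr` is.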
|
||||
|
||||
### Booleans.
|
||||
|
||||
«#f» = [0x80]
|
||||
«#t» = [0x81]
|
||||
«#f» = [0xA0]
|
||||
«#t» = [0xA1]
|
||||
|
||||
### Floats and Doubles.
|
||||
|
||||
«F» when F ∈ Float = [0x82] ++ binary32(F)
|
||||
«D» when D ∈ Double = [0x83] ++ binary64(D)
|
||||
«F» when F ∈ Float = [0xA2] ++ binary32(F)
|
||||
«D» when D ∈ Double = [0xA2] ++ binary64(D)
|
||||
|
||||
The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
|
||||
8-byte IEEE 754 binary representations of `F` and `D`, respectively.
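
Since the new encoding gives `Float` and `Double` the same tag, a decoder has to tell them apart by payload length. The sketch below uses Python's standard `struct` module for the big-endian IEEE 754 conversion; the function names are hypothetical.

```python
import struct
from typing import Tuple

FLOAT_OR_DOUBLE_TAG = 0xA2


def float_repr(f: float) -> bytes:
    return bytes([FLOAT_OR_DOUBLE_TAG]) + struct.pack(">f", f)   # binary32


def double_repr(d: float) -> bytes:
    return bytes([FLOAT_OR_DOUBLE_TAG]) + struct.pack(">d", d)   # binary64


def decode_float_or_double(r: bytes) -> Tuple[str, float]:
    """Disambiguate by the number of bytes that follow the tag."""
    if r[:1] != bytes([FLOAT_OR_DOUBLE_TAG]):
        raise ValueError("not a Float or Double Repr")
    body = r[1:]
    if len(body) == 4:
        return "Float", struct.unpack(">f", body)[0]
    if len(body) == 8:
        return "Double", struct.unpack(">d", body)[0]
    raise ValueError("payload must be 4 or 8 bytes")
```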
|
||||
|
@ -166,20 +163,25 @@ The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
|
|||
### Embeddeds.
|
||||
|
||||
The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to
|
||||
represent the denoted object, prefixed with `[0x86]`.
|
||||
represent the denoted object, prefixed with `[0xBF]`.
|
||||
|
||||
«#!V» = [0x86] ++ «V»
|
||||
«#!V» = [0xBF] ++ «V»
|
||||
|
||||
### Annotations.
|
||||
|
||||
To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
|
||||
`[0x85] ++ «v»`. For example, the `Repr` corresponding to textual
|
||||
syntax `@a@b[]`, i.e. an empty sequence annotated with two symbols,
|
||||
`a` and `b`, is
|
||||
To annotate a `Repr` `r` with some sequence of `Value`s `[v_1, ...,
|
||||
v_m]`, surround `r` as follows:
|
||||
|
||||
[0xBE] ++ len(r) ++ r ++ len(«v_1») ++ «v_1» ++...++ len(«v_m») ++ «v_m»
|
||||
|
||||
The `Repr` `r` *MUST NOT* already have annotations; that is, it must not begin with `0xBE`.
|
||||
|
||||
For example, the `Repr` corresponding to textual syntax `@a@b[]`, i.e.
|
||||
an empty sequence annotated with two symbols, `a` and `b`, is
|
||||
|
||||
«@a @b []»
|
||||
= [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
|
||||
= [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]
|
||||
= [0xBE] ++ len(«[]») ++ «[]» ++ len(«a») ++ «a» ++ len(«b») ++ «b»
|
||||
= [0xBE, 0x81, 0xA8, 0x82, 0xA6, 0x61, 0x82, 0xA6, 0x62]
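
The new annotation layout can be sketched as follows, working on already-encoded `Repr`s. The name `annotate` and its argument shapes are illustrative assumptions, not part of the specification.

```python
from typing import Sequence

ANNOTATION_TAG = 0xBE


def encode_len(m: int) -> bytes:
    """Big-endian base-128 length; only the final byte has its high bit set."""
    out = [(m % 128) + 128]
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def annotate(r: bytes, annotation_reprs: Sequence[bytes]) -> bytes:
    """Tag, then the length-prefixed value, then each length-prefixed annotation."""
    if r[:1] == bytes([ANNOTATION_TAG]):
        raise ValueError("Repr is already annotated")
    out = bytes([ANNOTATION_TAG]) + encode_len(len(r)) + r
    for a in annotation_reprs:
        out += encode_len(len(a)) + a
    return out
```

Calling `annotate` with the `Repr` of `[]` and the `Repr`s of the symbols `a` and `b` reproduces the nine bytes of the example above.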
|
||||
|
||||
## Security Considerations
|
||||
|
||||
|
@ -194,45 +196,67 @@ implementations *SHOULD* produce canonical binary encodings by
|
|||
default; however, an implementation *MAY* permit two serializations of
|
||||
the same `Value` to yield different binary `Repr`s.
|
||||
|
||||
## Acknowledgements
|
||||
|
||||
The exclusion of lengths from `Repr`s, placing lengths instead ahead of
|
||||
contained values in sequences, is inspired by [argdata][].
|
||||
|
||||
## Appendix. Autodetection of textual or binary syntax
|
||||
|
||||
Every tag byte in a binary Preserves `Document` falls within the range
|
||||
Every tag byte in a binary Preserves `Repr` falls within the range
|
||||
[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
|
||||
bytes*, and will never occur as the first byte of a UTF-8 encoded code
|
||||
point. This means no binary-encoded document can be misinterpreted as
|
||||
point. This means no binary-encoded `Repr` can be misinterpreted as
|
||||
valid UTF-8.
|
||||
|
||||
Conversely, a UTF-8 document must start with a valid codepoint,
|
||||
Conversely, a UTF-8 `Document` must start with a valid codepoint,
|
||||
meaning in particular that it must not start with a byte in the range
|
||||
[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
|
||||
Preserves document can be misinterpreted as a binary-syntax document.
|
||||
Preserves `Document` can be misinterpreted as a binary-syntax `Repr`.
|
||||
|
||||
Examination of the top two bits of the first byte of a document gives
|
||||
its syntax: if the top two bits are `10`, it should be interpreted as
|
||||
a binary-syntax document; otherwise, it should be interpreted as text.
|
||||
Examination of the top two bits of the first byte of an encoded `Value`
|
||||
gives its syntax: if the top two bits are `10`, it should be interpreted
|
||||
as a binary-syntax `Repr`; otherwise, it should be interpreted as text.
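
The two-bit test described here is small enough to show directly. The function name `detect_syntax` is a hypothetical choice for this sketch.

```python
def detect_syntax(first_byte: int) -> str:
    """Classify an encoded Value by the top two bits of its first byte."""
    # 0b10xxxxxx covers 0x80..0xBF: a UTF-8 continuation byte, hence binary syntax.
    return "binary" if (first_byte & 0xC0) == 0x80 else "text"
```

For example, `detect_syntax(0xA8)` reports binary syntax, while the byte for `[` (0x5B) or a UTF-8 lead byte such as 0xC3 reports text.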
|
||||
|
||||
**Streaming.** Autodetection is still possible when streaming an
|
||||
undetermined number of `Value`s across, say, a TCP/IP connection:
|
||||
|
||||
- If the text syntax is to be used for the connection, simply start
|
||||
writing each `Document` one after the other. Documents for `Atom`s
|
||||
*MUST* be separated from their neighbours by whitespace; in general,
|
||||
whitespace *SHOULD* be used to separate adjacent documents.
|
||||
Specifically, whitespace separating adjacent documents *SHOULD* be
|
||||
ASCII newline (10).
|
||||
|
||||
- If the binary syntax is to be used for the connection, start the
|
||||
connection with byte `0xA8` (sequence). After the initial byte, send
|
||||
each value `v` as `len(«v») ++ «v»`. A side effect of this approach
|
||||
is that the entire stream, when complete, is a valid `Sequence`
|
||||
`Repr`.
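
The binary streaming convention above can be sketched as a writer and a reader over a complete byte buffer. This is an illustration only: `write_stream` and `read_stream` are hypothetical names, and a real connection would read incrementally rather than from one `bytes` object.

```python
from typing import Iterable, Iterator, List, Tuple

SEQUENCE_TAG = 0xA8


def encode_len(m: int) -> bytes:
    out = [(m % 128) + 128]          # final byte carries the high bit
    m //= 128
    while m > 0:
        out.append(m % 128)
        m //= 128
    return bytes(reversed(out))


def decode_len(data: bytes, pos: int) -> Tuple[int, int]:
    if pos < len(data) and data[pos] == 0:
        raise ValueError("non-shortest length encoding")
    m = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated length")
        b = data[pos]
        pos += 1
        m = m * 128 + (b & 0x7F)
        if b & 0x80:
            return m, pos


def write_stream(reprs: Iterable[bytes]) -> Iterator[bytes]:
    """Emit the initial sequence tag, then each Repr with its length prefix."""
    yield bytes([SEQUENCE_TAG])
    for r in reprs:
        yield encode_len(len(r)) + r


def read_stream(data: bytes) -> List[bytes]:
    """Split a completed stream back into its individual Reprs."""
    if data[:1] != bytes([SEQUENCE_TAG]):
        raise ValueError("binary stream must start with 0xA8")
    pos, items = 1, []
    while pos < len(data):
        n, pos = decode_len(data, pos)
        if pos + n > len(data):
            raise ValueError("truncated Repr")
        items.append(data[pos:pos + n])
        pos += n
    return items
```

The bytes produced by `write_stream` are exactly a `Sequence` `Repr` of the values sent, which is the side effect the text describes.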
|
||||
|
||||
## Appendix. Table of tag values
|
||||
|
||||
80 - False
|
||||
81 - True
|
||||
82 - Float
|
||||
83 - Double
|
||||
84 - End marker
|
||||
85 - Annotation
|
||||
86 - Embedded
|
||||
(8x) RESERVED 87-8F
|
||||
(8x) RESERVED 80-8F
|
||||
(9x) RESERVED 90-9F
|
||||
|
||||
9x - Small integers 0..12,-3..-1
|
||||
An - Medium integers, (n+1) bytes long
|
||||
B0 - Large integers, variable length
|
||||
B1 - String
|
||||
B2 - ByteString
|
||||
B3 - Symbol
|
||||
A0 - False
|
||||
A1 - True
|
||||
A2 - Float or Double (length disambiguates)
|
||||
A3 - SignedIntegers (0 is encoded with no bytes at all)
|
||||
A4 - String (no trailing NUL is added)
|
||||
A5 - ByteString
|
||||
A6 - Symbol
|
||||
|
||||
B4 - Record
|
||||
B5 - Sequence
|
||||
B6 - Set
|
||||
B7 - Dictionary
|
||||
A7 - Record
|
||||
A8 - Sequence
|
||||
A9 - Set
|
||||
AA - Dictionary
|
||||
|
||||
(Ax) RESERVED AB-AF
|
||||
|
||||
(Bx) RESERVED B0-BD
|
||||
BE - Annotations. {BE Lval val Lann0 ann0 Lann1 ann1 ...}
|
||||
BF - Embedded
|
||||
|
||||
## Appendix. Binary SignedInteger representation
|
||||
|
||||
|
@ -242,15 +266,15 @@ values.
|
|||
|
||||
| Integer range | Bytes required | Encoding (hex) |
|
||||
| --- | --- | --- |
|
||||
| -3 ≤ n ≤ 12 | 1 | `9X` |
|
||||
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `A0` `XX` |
|
||||
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `A1` `XX` `XX` |
|
||||
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `A2` `XX` `XX` `XX` |
|
||||
| 0 | 1 | `A3` |
|
||||
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `A3` `XX` |
|
||||
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `A3` `XX` `XX` |
|
||||
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `A3` `XX` `XX` `XX` |
|
||||
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `A3` `XX` `XX` `XX` `XX` |
|
||||
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `A4` `XX` `XX` `XX` `XX` `XX` |
|
||||
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `A5` `XX` `XX` `XX` `XX` `XX` `XX` |
|
||||
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `A6` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
|
||||
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `A7` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
|
||||
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `A3` `XX` `XX` `XX` `XX` `XX` |
|
||||
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `A3` `XX` `XX` `XX` `XX` `XX` `XX` |
|
||||
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `A3` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
|
||||
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `A3` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
|
||||
|
||||
<!-- Heading to visually offset the footnotes from the main document: -->
|
||||
## Notes
|
||||
|
|
|
@ -206,8 +206,8 @@ object, prefixed with `#!`.
|
|||
Embedded = "#!" Value
|
||||
|
||||
Finally, any `Value` may be represented by escaping from the textual
|
||||
syntax to the [machine-oriented binary syntax](preserves-binary.html)
|
||||
by prefixing a `ByteString` containing the binary representation of the
|
||||
syntax to the [machine-oriented binary syntax](preserves-binary.html) by
|
||||
prefixing a `ByteString` containing the binary representation of the
|
||||
`Value` with `#=`.[^rationale-switch-to-binary]
|
||||
[^no-literal-binary-in-text] [^machine-value-annotations]
|
||||
|
||||
|
@ -216,18 +216,18 @@ by prefixing a `ByteString` containing the binary representation of the
|
|||
[^rationale-switch-to-binary]: **Rationale.** The textual syntax
|
||||
cannot express every `Value`: specifically, it cannot express the
|
||||
several million floating-point NaNs, or the two floating-point
|
||||
Infinities. Since the machine-oriented binary format for `Value`s
|
||||
expresses each `Value` with precision, embedding binary `Value`s
|
||||
solves the problem.
|
||||
Infinities. Since the machine-oriented binary format for `Value`s expresses
|
||||
each `Value` with precision, embedding binary `Value`s solves the
|
||||
problem.
|
||||
|
||||
[^no-literal-binary-in-text]: Every text is ultimately physically
|
||||
stored as bytes; therefore, it might seem possible to escape to the
|
||||
raw form of binary encoding from within a piece of textual syntax.
|
||||
However, while bytes must be involved in any *representation* of
|
||||
text, the text *itself* is logically a sequence of *code points* and
|
||||
is not *intrinsically* a binary structure at all. It would be
|
||||
incoherent to expect to be able to access the representation of the
|
||||
text from within the text itself.
|
||||
stored as bytes; therefore, it might seem possible to escape to
|
||||
the raw binary encoding from within a
|
||||
piece of textual syntax. However, while bytes must be involved in
|
||||
any *representation* of text, the text *itself* is logically a
|
||||
sequence of *code points* and is not *intrinsically* a binary
|
||||
structure at all. It would be incoherent to expect to be able to
|
||||
access the representation of the text from within the text itself.
|
||||
|
||||
[^machine-value-annotations]: Any text-syntax annotations preceding
|
||||
the `#` are prepended to any binary-syntax annotations yielded by
|
||||
|
@ -235,11 +235,11 @@ by prefixing a `ByteString` containing the binary representation of the
|
|||
|
||||
## Annotations
|
||||
|
||||
When written down, a `Value` may have an associated sequence of
|
||||
*annotations* carrying “out-of-band” contextual metadata about the
|
||||
value. Each annotation is, in turn, a `Value`, and may itself have
|
||||
annotations. The ordering of annotations attached to a `Value` is
|
||||
significant.
|
||||
When written down, a `Value` may have an associated
|
||||
sequence of *annotations* carrying “out-of-band” contextual metadata
|
||||
about the value. Each annotation is, in turn, a `Value`, and may
|
||||
itself have annotations. The ordering of annotations attached to a
|
||||
`Value` is significant.
|
||||
|
||||
Value =/ ws "@" Value Value
|
||||
|
||||
|
@ -276,7 +276,7 @@ different.
|
|||
|
||||
## Security Considerations
|
||||
|
||||
**Whitespace.** The textual format allows arbitrary whitespace in many
|
||||
**Whitespace.** The text syntax allows arbitrary whitespace in many
|
||||
positions. Consider optional restrictions on the amount of consecutive
|
||||
whitespace that may appear.
|
||||
|
||||
|
|
142
preserves.md
|
@ -220,21 +220,21 @@ The total ordering specified [above](#total-order) means that the following stat
|
|||
<!-- TODO: Give some examples of large and small Preserves, perhaps -->
|
||||
<!-- translated from various JSON blobs floating around the internet. -->
|
||||
|
||||
| Value | Encoded byte sequence |
|
||||
|-----------------------------|---------------------------------------------------------------------------------|
|
||||
| `<capture <discard>>` | B4 B3 07 'c' 'a' 'p' 't' 'u' 'r' 'e' B4 B3 07 'd' 'i' 's' 'c' 'a' 'r' 'd' 84 84 |
|
||||
| `[1 2 3 4]` | B5 91 92 93 94 84 |
|
||||
| `[-2 -1 0 1]` | B5 9E 9F 90 91 84 |
|
||||
| `"hello"` (format B) | B1 05 'h' 'e' 'l' 'l' 'o' |
|
||||
| `["a" b #"c" [] #{} #t #f]` | B5 B1 01 'a' B3 01 'b' B2 01 'c' B5 84 B6 84 81 80 84 |
|
||||
| `-257` | A1 FE FF |
|
||||
| `-1` | 9F |
|
||||
| `0` | 90 |
|
||||
| `1` | 91 |
|
||||
| `255` | A1 00 FF |
|
||||
| `1.0f` | 82 3F 80 00 00 |
|
||||
| `1.0` | 83 3F F0 00 00 00 00 00 00 |
|
||||
| `-1.202e300` | 83 FE 3C B7 B7 59 BF 04 26 |
|
||||
| Value | Encoded byte sequence |
|
||||
|-----------------------------|------------------------------------------------------------------------------|
|
||||
| `<capture <discard>>` | A7 88 A6 'c' 'a' 'p' 't' 'u' 'r' 'e' 8A A7 88 A6 'd' 'i' 's' 'c' 'a' 'r' 'd' |
|
||||
| `[1 2 3 4]` | A8 82 A3 01 82 A3 02 82 A3 03 82 A3 04 |
|
||||
| `[-2 -1 0 1]` | A8 82 A3 FE 82 A3 FF 81 A3 82 A3 01 |
|
||||
| `"hello"` | A4 'h' 'e' 'l' 'l' 'o' |
|
||||
| `["a" b #"c" [] #{} #t #f]` | A8 82 A4 'a' 82 A6 'b' 82 A5 'c' 81 A8 81 A9 81 A1 81 A0 |
|
||||
| `-257` | A3 FE FF |
|
||||
| `-1` | A3 FF |
|
||||
| `0` | A3 |
|
||||
| `1` | A3 01 |
|
||||
| `255` | A3 00 FF |
|
||||
| `1.0f` | A2 3F 80 00 00 |
|
||||
| `1.0` | A2 3F F0 00 00 00 00 00 00 |
|
||||
| `-1.202e300` | A2 FE 3C B7 B7 59 BF 04 26 |
|
||||
|
||||
The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record`
|
||||
|
||||
|
@ -242,24 +242,21 @@ The next example uses a non-`Symbol` label for a record.[^extensibility2] The `R
|
|||
|
||||
encodes to
|
||||
|
||||
B4 ;; Record
|
||||
B5 ;; Sequence
|
||||
B3 06 74 69 74 6C 65 64 ;; Symbol, "titled"
|
||||
B3 06 70 65 72 73 6F 6E ;; Symbol, "person"
|
||||
92 ;; SignedInteger, "2"
|
||||
B3 05 74 68 69 6E 67 ;; Symbol, "thing"
|
||||
91 ;; SignedInteger, "1"
|
||||
84 ;; End (sequence)
|
||||
A0 65 ;; SignedInteger, "101"
|
||||
B1 09 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell"
|
||||
B4 ;; Record
|
||||
B3 04 64 61 74 65 ;; Symbol, "date"
|
||||
A1 07 1D ;; SignedInteger, "1821"
|
||||
92 ;; SignedInteger, "2"
|
||||
93 ;; SignedInteger, "3"
|
||||
84 ;; End (record)
|
||||
B1 02 44 72 ;; String, "Dr"
|
||||
84 ;; End (record)
|
||||
A7 ;; Record
|
||||
9E A8 ;; Length 30, Sequence
|
||||
87 A6 74 69 74 6C 65 64 ;; Length 7, Symbol, "titled"
|
||||
87 A6 70 65 72 73 6F 6E ;; Length 7, Symbol, "person"
|
||||
82 A3 02 ;; Length 2, SignedInteger, "2"
|
||||
86 A6 74 68 69 6E 67 ;; Length 6, Symbol, "thing"
|
||||
82 A3 01 ;; Length 2, SignedInteger, "1"
|
||||
82 A3 65 ;; Length 2, SignedInteger, "101"
|
||||
8A A4 42 6C 61 63 6B 77 65 6C 6C ;; Length 10, String, "Blackwell"
|
||||
91 A7 ;; Length 17, Record
|
||||
85 A6 64 61 74 65 ;; Length 5, Symbol, "date"
|
||||
83 A3 07 1D ;; Length 3, SignedInteger, "1821"
|
||||
82 A3 02 ;; Length 2, SignedInteger, "2"
|
||||
82 A3 03 ;; Length 2, SignedInteger, "3"
|
||||
83 A4 44 72 ;; Length 3, String, "Dr"
|
||||
|
||||
[^extensibility2]: It happens to line up with Racket's
|
||||
representation of a record label for an inheritance hierarchy
|
||||
|
@ -311,27 +308,23 @@ The first RFC 8259 example:
|
|||
when read using the Preserves text syntax encodes via the binary syntax
|
||||
as follows:
|
||||
|
||||
B7
|
||||
B1 05 "Image"
|
||||
B7
|
||||
B1 03 "IDs" B5
|
||||
A0 74
|
||||
A1 03 AF
|
||||
A1 00 EA
|
||||
A2 00 97 89
|
||||
84
|
||||
B1 05 "Title" B1 14 "View from 15th Floor"
|
||||
B1 05 "Width" A1 03 20
|
||||
B1 06 "Height" A1 02 58
|
||||
B1 08 "Animated" B3 05 "false"
|
||||
B1 09 "Thumbnail"
|
||||
B7
|
||||
B1 03 "Url" B1 26 "http://www.example.com/image/481989943"
|
||||
B1 05 "Width" A0 64
|
||||
B1 06 "Height" A0 7D
|
||||
84
|
||||
84
|
||||
84
|
||||
AA
|
||||
86 A4 "Image"
|
||||
01 AC AA
|
||||
89 A4 "Animated" 86 A6 "false"
|
||||
87 A4 "Height" 83 A3 02 58
|
||||
84 A4 "IDs" 91 A8
|
||||
82 A3 74
|
||||
83 A3 03 AF
|
||||
83 A3 00 EA
|
||||
84 A3 00 97 89
|
||||
8A A4 "Thumbnail"
|
||||
C3 AA
|
||||
87 A4 "Height" 82 A3 7D
|
||||
84 A4 "Url" A7 A4 "http://www.example.com/image/481989943"
|
||||
86 A4 "Width" 82 A3 64
|
||||
86 A4 "Title" 95 A4 "View from 15th Floor"
|
||||
86 A4 "Width" 83 A3 03 20
|
||||
|
||||
The second RFC 8259 example:
|
||||
|
||||
|
@ -360,28 +353,25 @@ The second RFC 8259 example:
|
|||
|
||||
encodes to binary as follows:
|
||||
|
||||
B5
|
||||
B7
|
||||
B1 03 "Zip" B1 05 "94107"
|
||||
B1 04 "City" B1 0D "SAN FRANCISCO"
|
||||
B1 05 "State" B1 02 "CA"
|
||||
B1 07 "Address" B1 00
|
||||
B1 07 "Country" B1 02 "US"
|
||||
B1 08 "Latitude" 83 40 42 E2 26 80 9D 49 52
|
||||
B1 09 "Longitude" 83 C0 5E 99 56 6C F4 1F 21
|
||||
B1 09 "precision" B1 03 "zip"
|
||||
84
|
||||
B7
|
||||
B1 03 "Zip" B1 05 "94085"
|
||||
B1 04 "City" B1 09 "SUNNYVALE"
|
||||
B1 05 "State" B1 02 "CA"
|
||||
B1 07 "Address" B1 00
|
||||
B1 07 "Country" B1 02 "US"
|
||||
B1 08 "Latitude" 83 40 42 AF 9D 66 AD B4 03
|
||||
B1 09 "Longitude" 83 C0 5E 81 AA 4F CA 42 AF
|
||||
B1 09 "precision" B1 03 "zip"
|
||||
84
|
||||
84
|
||||
A8
|
||||
FE AA
|
||||
88 A4 "Address" 81 A4
|
||||
85 A4 "City" 8E A4 "SAN FRANCISCO"
|
||||
88 A4 "Country" 83 A4 "US"
|
||||
89 A4 "Latitude" 89 A2 40 42 E2 26 80 9D 49 52
|
||||
8A A4 "Longitude" 89 A2 C0 5E 99 56 6C F4 1F 21
|
||||
86 A4 "State" 83 A4 "CA"
|
||||
84 A4 "Zip" 86 A4 "94107"
|
||||
8A A4 "precision" 84 A4 "zip"
|
||||
FA AA
|
||||
88 A4 "Address" 81 A4
|
||||
85 A4 "City" 8A A4 "SUNNYVALE"
|
||||
88 A4 "Country" 83 A4 "US"
|
||||
89 A4 "Latitude" 89 A2 40 42 AF 9D 66 AD B4 03
|
||||
8A A4 "Longitude" 89 A2 C0 5E 81 AA 4F CA 42 AF
|
||||
86 A4 "State" 83 A4 "CA"
|
||||
84 A4 "Zip" 86 A4 "94085"
|
||||
8A A4 "precision" 84 A4 "zip"
|
||||
|
||||
<!-- Heading to visually offset the footnotes from the main document: -->
|
||||
## Notes
|
||||
|
|
|
@ -2,6 +2,8 @@
|
|||
title: "Representing Values in Programming Languages"
|
||||
---
|
||||
|
||||
[erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
|
||||
|
||||
**NOT YET READY**
|
||||
|
||||
We have given a definition of `Value` and its semantics, and proposed
|
||||
|
|