Major revision of binary syntax: placeholders; annotations; forbid empty format-C chunks
parent
d349e89ea4
commit
e2b859e55d
|
@ -1,5 +1,6 @@
|
|||
body {
|
||||
font-family: palatino, "Palatino Linotype", "Palatino LT STD", "URW Palladio L", "TeX Gyre Pagella", serif;
|
||||
font-family: palatino, "Palatino Linotype", "Palatino LT STD", "URW Palladio L", "TeX Gyre Pagella", serif;
|
||||
box-sizing: border-box;
|
||||
}
|
||||
@media screen {
|
||||
body { padding-top: 2rem; max-width: 40em; margin: auto; font-size: 120%; }
|
||||
|
@ -59,3 +60,20 @@ h2#notes:before {
|
|||
}
|
||||
|
||||
.footnotes > ol { padding: 0; font-size: 90%; }
|
||||
|
||||
table {
|
||||
border-collapse: collapse;
|
||||
width: 100%;
|
||||
}
|
||||
thead tr {
|
||||
border-bottom: solid black 1px;
|
||||
}
|
||||
th {
|
||||
font-weight: normal;
|
||||
text-align: left;
|
||||
padding-right: 0.5rem;
|
||||
padding-bottom: 0.3rem;
|
||||
}
|
||||
td {
|
||||
padding-right: 0.5rem;
|
||||
}
|
||||
|
|
376
preserves.md
|
@ -6,7 +6,7 @@
|
|||
# Preserves: an Expressive Data Language
|
||||
|
||||
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
|
||||
November 2018. Version 0.0.4.
|
||||
June 2019. Version 0.0.5.
|
||||
|
||||
[sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
|
||||
[spki]: http://world.std.com/~cme/html/spki.html
|
||||
|
@ -72,9 +72,8 @@ follows:[^ordering-by-syntax]
|
|||
< String < ByteString < Symbol
|
||||
|
||||
[^ordering-by-syntax]: The observant reader may note that the
|
||||
ordering here is (almost) the same as that implied by the tagging
|
||||
scheme used in the concrete binary syntax for `Value`s. (The
|
||||
exception is the syntax for small integers near zero.)
|
||||
ordering here is the same as that implied by the tagging scheme
|
||||
used in the concrete binary syntax for `Value`s.
|
||||
|
||||
**Equivalence.**<a name="equivalence"></a> Two `Value`s are equal if
|
||||
neither is less than the other according to the total order.
|
||||
|
@ -400,7 +399,8 @@ itself have annotations.
|
|||
Value =/ ws "@" Value Value
|
||||
|
||||
Each annotation is preceded by `@`; the underlying annotated value
|
||||
follows its annotations.
|
||||
follows its annotations. Here we extend only the syntactic nonterminal
|
||||
named "`Value`" without altering the semantic class of `Value`s.
|
||||
|
||||
**Equivalence.** Annotations appear within syntax denoting a `Value`;
|
||||
however, the annotations are not part of the denoted value. They are
|
||||
|
@ -411,19 +411,24 @@ Reflective tools such as debuggers, user interfaces, and message
|
|||
routers and relays---tools which process `Value`s generically---may
|
||||
use annotated inputs to tailor their operation, or may insert
|
||||
annotations in their outputs. By contrast, in ordinary programs, as a
|
||||
rule of thumb, the presence, absence or specific value of an
|
||||
annotation should not change the control flow or output of the
|
||||
program. Annotations are data *describing* `Value`s, and are not in
|
||||
the domain of any specific application of `Value`s. That is, an
|
||||
annotation will almost never cause a non-reflective program to do
|
||||
anything observably different.
|
||||
rule of thumb, the presence, absence or content of an annotation
|
||||
should not change the control flow or output of the program.
|
||||
Annotations are data *describing* `Value`s, and are not in the domain
|
||||
of any specific application of `Value`s. That is, an annotation will
|
||||
almost never cause a non-reflective program to do anything observably
|
||||
different.
|
||||
|
||||
## Compact Binary Syntax
|
||||
|
||||
A `Repr` is an encoding, or representation, of a specific `Value`.
|
||||
Each `Repr` comprises one or more bytes describing first the kind of
|
||||
represented `Value` and the length of the representation, and then the
|
||||
encoded details of the `Value` itself.
|
||||
A `Repr` is a binary-syntax encoding, or representation, of either
|
||||
|
||||
- a `Value`,
|
||||
- a "placeholder" for a `Value`, or
|
||||
- an annotation on a `Repr`.
|
||||
|
||||
Each `Repr` comprises one or more bytes describing the kind of
|
||||
represented information and the length of the representation, followed
|
||||
by the encoded details.
|
||||
|
||||
For a value `v`, we write `[[v]]` for the `Repr` of `v`.
|
||||
|
||||
|
@ -431,19 +436,16 @@ For a value `v`, we write `[[v]]` for the `Repr` of v.
|
|||
|
||||
Each `Repr` takes one of three possible forms:
|
||||
|
||||
- (A) a fixed-length form, used for simple values such as `Boolean`s
|
||||
or `Float`s.
|
||||
- (A) a type-specific form, used for simple values such as `Boolean`s
|
||||
or `Float`s, for placeholders, and for introducing annotations.
|
||||
|
||||
- (B) a variable-length form with length specified up-front, used for
|
||||
almost all `Record`s as well as for most `Sequence`s and `String`s,
|
||||
when their sizes are known at the time serialization begins.
|
||||
compound and variable-length atomic data structures when their
|
||||
sizes are known at the time serialization begins.
|
||||
|
||||
- (C) a variable-length streaming form with unknown or unpredictable
|
||||
length, used only seldom for `Record`s, since the number of fields
|
||||
in a `Record` is usually statically known, but sometimes used for
|
||||
`Sequence`s, `String`s etc., such as in cases when serialization
|
||||
begins before the number of elements or bytes in the corresponding
|
||||
`Value` is known.
|
||||
length, used in cases when serialization begins before the number
|
||||
of elements or bytes in the corresponding `Value` is known.
|
||||
|
||||
Applications may choose between formats B and C depending on their
|
||||
needs at serialization time.
|
||||
|
@ -455,30 +457,33 @@ Every `Repr` starts with a *lead byte*, constructed by
|
|||
|
||||
leadbyte(t,n,m) = [t*64 + n*16 + m]
|
||||
|
||||
The arguments `t` and `n` describe the rest of the
|
||||
representation:[^some-encodings-unused]
|
||||
The arguments `t`, `n` and `m` describe the rest of the
|
||||
representation.[^some-encodings-unused]
|
||||
|
||||
[^some-encodings-unused]: Some encodings are unused. All such
|
||||
encodings are reserved for future versions of this specification.
|
||||
|
||||
- `t`=0, `n`=0 (format A) represents an `Atom` with fixed-length binary representation.
|
||||
- `t`=0, `n`=1 (format A) represents certain small `SignedInteger`s.
|
||||
- `t`=0, `n`=2 (format C) is a Stream Start byte.
|
||||
- `t`=0, `n`=3 (format C) is a Stream End byte.
|
||||
- `t`=1 (format B) represents an `Atom` with variable-length binary representation.
|
||||
- `t`=2 (format B) represents a `Record`.
|
||||
- `t`=3 (format B) represents a `Sequence`, `Set` or `Dictionary`.
|
||||
| `t` | `n` | `m` | Meaning |
| --- | --- | --- | ------- |
| 0 | 0 | 0–3 | (format A) An `Atom` with fixed-length binary representation |
| 0 | 0 | 4 | (format C) Stream end |
| 0 | 0 | 5 | (format A) Annotation |
| 0 | 1 | | (format A) Placeholder for an application-specific `Value` |
| 0 | 2 | | (format C) Stream start |
| 0 | 3 | | (format A) Certain small `SignedInteger`s |
| 1 | | | (format B) An `Atom` with variable-length binary representation |
| 2 | | | (format B) A `Compound` with variable-length representation |
|
||||
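The lead-byte layout can be sketched in a few lines of Python. This is an illustrative sketch rather than part of the specification: `leadbyte` follows the formula given earlier, while `split_leadbyte` is a hypothetical helper name introduced here for the inverse operation.

```python
def leadbyte(t, n, m):
    """Pack t (2 bits), n (2 bits) and m (4 bits) into one lead byte."""
    if not (0 <= t < 4 and 0 <= n < 4 and 0 <= m < 16):
        raise ValueError("t and n must fit in 2 bits, m in 4 bits")
    return bytes([t * 64 + n * 16 + m])


def split_leadbyte(b):
    """Recover (t, n, m) from a lead byte (hypothetical helper)."""
    return (b >> 6) & 0b11, (b >> 4) & 0b11, b & 0b1111
```

With these definitions, `leadbyte(0, 0, 4)` is the Stream End byte and `leadbyte(0, 0, 5)` is the byte that introduces an annotation.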
|
||||
#### Encoding data of fixed length (format A).
|
||||
#### Encoding data of type-specific length (format A).
|
||||
|
||||
Each specific type of data defines its own rules for this format.
|
||||
Each type of data defines its own rules for this format.
|
||||
|
||||
#### Encoding data of known length (format B).
|
||||
|
||||
A `Repr` where the length of the `Value` to be encoded is variable but
|
||||
known uses the value of `m` in `leadbyte` to encode its length. The
|
||||
length counts *bytes* for atomic `Value`s, but counts *contained
|
||||
values* for compound `Value`s.
|
||||
Format B is used where the length `l` of the `Value` to be encoded is
|
||||
known when serialization begins. Format B `Repr`s use `m` in
|
||||
`leadbyte` to encode `l`. The length counts *bytes* for atomic
|
||||
`Value`s, but counts *contained values* for compound `Value`s.
|
||||
|
||||
- A length `l` between 0 and 14 is represented using `leadbyte` with
|
||||
`m=l`.
|
||||
|
@ -503,11 +508,13 @@ definition,
|
|||
> two's complement representation of the number in groups of 7 bits,
|
||||
> least significant group first.
|
||||
|
||||
**Examples.**
|
||||
The following table illustrates varint-encoding.
|
||||
|
||||
- The varint representation of 15 is just the byte 15.
|
||||
- 300 (binary, grouped into 7-bit chunks, `10 0101100`) varint-encodes to the two bytes 172 and 2.
|
||||
- 1000000000 (binary `11 1011100 1101011 0010100 0000000`) varint-encodes to bytes 128, 148, 235, 220, and 3.
|
||||
| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
| ------ | ------------------- | ------------ |
| 15 | `0001111` | 15 |
| 300 | `0000010 0101100` | 172 2 |
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
|
||||
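The varint examples can be reproduced with a short Python sketch. It assumes non-negative inputs (lengths and placeholder numbers are never negative) and the continuation-bit convention visible in the examples: every byte except the last has its high bit set.

```python
def varint(m):
    """Encode a non-negative integer in 7-bit groups, least significant first."""
    if m < 0:
        raise ValueError("this sketch only handles non-negative numbers")
    out = bytearray()
    while True:
        group = m & 0x7F      # low 7 bits
        m >>= 7
        if m:
            out.append(group | 0x80)  # more groups follow
        else:
            out.append(group)         # final group: high bit clear
            return bytes(out)
```

For instance, `list(varint(300))` yields the two bytes 172 and 2 shown for 300.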
|
||||
#### Streaming data of unknown length (format C).
|
||||
|
||||
|
@ -516,61 +523,69 @@ not known at the time serialization of the `Value` starts is encoded
|
|||
by a single Stream Start (“open”) byte, followed by zero or more
|
||||
*chunks*, followed by a matching Stream End (“close”) byte:
|
||||
|
||||
open(t,n) = leadbyte(0,2, t*4 + n)
|
||||
close(t,n) = leadbyte(0,3, t*4 + n)
|
||||
open(t,n) = leadbyte(0,2, t*4 + n) = [0x20 + t*4 + n]
|
||||
close() = leadbyte(0,0, 4) = [0x04]
|
||||
|
||||
For a `Repr` of a `Value` containing binary data, each chunk is to be
|
||||
a format B `Repr` of a `ByteString`, no matter the type of the overall
|
||||
`Repr`.
|
||||
For a format C `Repr` of an atomic `Value`, each chunk is to be a
|
||||
format B `Repr` of a `ByteString`, no matter the type of the overall
|
||||
`Value`. Annotations are not allowed on these individual chunks.
|
||||
|
||||
For a `Repr` of a `Value` containing other `Value`s, each chunk is to
|
||||
be a single `Repr`.
|
||||
For a format C `Repr` of a compound `Value`, each chunk is to be a
|
||||
single `Repr`, which may itself be annotated.
|
||||
|
||||
Each chunk within a format C `Repr` *MUST* have non-zero length.
|
||||
Software that decodes `Repr`s *MUST* reject `Repr`s that include
|
||||
zero-length chunks.
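As a rough illustration of format C for atoms, the following Python sketch emits a streamed `String`, `ByteString` or `Symbol` from a list of chunks and refuses to produce empty chunks. The names `open_stream`, `close_stream` and `stream_atom` are hypothetical. The definition of `header` (a `varint` follows when the length is 15 or more) is an assumption based on the `[0x1F, 0x66]` placeholder example and the lead-byte appendix.

```python
def leadbyte(t, n, m):
    return bytes([t * 64 + n * 16 + m])


def varint(m):
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def header(t, n, m):
    # Assumed: lengths below 15 fit in the lead byte, others use a varint.
    return leadbyte(t, n, m) if m < 15 else leadbyte(t, n, 15) + varint(m)


def open_stream(t, n):
    return leadbyte(0, 2, t * 4 + n)


def close_stream():
    return leadbyte(0, 0, 4)


def stream_atom(n, chunks):
    """Encode an atom (n=1 String, n=2 ByteString, n=3 Symbol) in format C."""
    out = bytearray(open_stream(1, n))
    for chunk in chunks:
        if len(chunk) == 0:
            raise ValueError("zero-length chunks are forbidden in format C")
        # Every chunk is a format B ByteString, whatever the overall type.
        out += header(1, 2, len(chunk)) + chunk
    out += close_stream()
    return bytes(out)
```

Streaming the string `"hello"` as the chunks `b"he"` and `b"llo"` gives `25 62 68 65 63 6C 6C 6F 04` under these rules.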
|
||||
|
||||
### Records.
|
||||
|
||||
Format B (known length):
|
||||
|
||||
[[ L(F_1...F_m) ]] = header(2,3,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]
|
||||
[[ L(F_1...F_m) ]] = header(2,0,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]
|
||||
|
||||
For `m` fields, `m+1` is supplied to `header`, to account for the
|
||||
encoding of the record label.
|
||||
|
||||
Format C (streaming):
|
||||
|
||||
[[ L(F_1...F_m) ]] = open(2,3) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close(2,3)
|
||||
[[ L(F_1...F_m) ]] = open(2,0) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close()
|
||||
|
||||
Applications *SHOULD* prefer the known-length format for encoding
|
||||
`Record`s.
|
||||
|
||||
#### Application-specific short form for labels.
|
||||
### Placeholders.
|
||||
|
||||
Any given protocol using Preserves may additionally define an
|
||||
interpretation for `n`∈{0,1,2}, mapping each *short form label
|
||||
number* `n` to a specific record label. When encoding `m` fields with
|
||||
short form label number `n`, format B becomes
|
||||
Any given protocol using Preserves may define an interpretation for
|
||||
numbered *placeholders* in the binary syntax, mapping each
|
||||
*placeholder number* `n` to a specific `Value`. For example, a
|
||||
placeholder number may be assigned for a frequently-used `Record`
|
||||
label.
|
||||
|
||||
header(2,n,m) ++ [[F_1]] ++...++ [[F_m]]
|
||||
A `Value` `v` for which placeholder number `n` has been assigned may
|
||||
be tersely encoded as
|
||||
|
||||
and format C becomes
|
||||
[[v]] = header(0,1,n) when n is a placeholder number for v
|
||||
|
||||
open(2,n) ++ [[F_1]] ++...++ [[F_m]] ++ close(2,n)
|
||||
**Examples.** A protocol may choose to assign placeholder
|
||||
number 4 to the symbol `void`, making
|
||||
|
||||
**Examples.** For example, a protocol may choose to map records
|
||||
labelled `void` to `n=0`, making
|
||||
[[void]] = header(0,1,4) = [0x14]
|
||||
[[void()]] = header(2,0,1) ++ [[void]] = [0x81, 0x14]
|
||||
|
||||
[[void()]] = header(2,0,0) = [0x80]
|
||||
or it may map symbol `person` to placeholder number 102, making
|
||||
|
||||
or it may map records labelled `person` to short form label number 1,
|
||||
making
|
||||
[[person]] = header(0,1,102) = [0x1F, 0x66]
|
||||
|
||||
and so
|
||||
|
||||
[[person("Dr", "Elizabeth", "Blackwell")]]
|
||||
= header(2,1,3) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
|
||||
= [0x93] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
|
||||
= header(2,0,4) ++ [[person]] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
|
||||
= [0x84, 0x1F, 0x66] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
|
||||
|
||||
for format B, or
|
||||
|
||||
= open(2,1) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ close(2,1)
|
||||
= [0x29] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x39]
|
||||
open(2,0) ++ [[person]] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ close()
|
||||
= [0x28, 0x1F, 0x66] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x04]
|
||||
|
||||
for format C.
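The placeholder examples can be checked with a small Python sketch. The table `PLACEHOLDERS` and the `encode_*` function names are hypothetical, chosen only to mirror the examples; `header` is assumed to append a `varint` when its last argument is 15 or more, which is what the `[0x1F, 0x66]` encoding of placeholder 102 suggests.

```python
def leadbyte(t, n, m):
    return bytes([t * 64 + n * 16 + m])


def varint(m):
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def header(t, n, m):
    # Assumed definition: small values inline, larger ones via varint.
    return leadbyte(t, n, m) if m < 15 else leadbyte(t, n, 15) + varint(m)


# Hypothetical protocol-specific placeholder assignments.
PLACEHOLDERS = {"void": 4, "person": 102}


def encode_symbol(name):
    if name in PLACEHOLDERS:
        return header(0, 1, PLACEHOLDERS[name])
    data = name.encode("utf-8")
    return header(1, 3, len(data)) + data


def encode_string(text):
    data = text.encode("utf-8")
    return header(1, 1, len(data)) + data


def encode_record(label, encoded_fields):
    """Format B record: the count covers the label plus the fields."""
    body = encode_symbol(label) + b"".join(encoded_fields)
    return header(2, 0, len(encoded_fields) + 1) + body
```

Here `encode_record("person", [...])` begins with `84 1F 66`, matching the format B example, and a symbol without an assigned placeholder falls back to its ordinary format B encoding.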
|
||||
|
||||
|
@ -578,9 +593,9 @@ for format C.
|
|||
|
||||
Format B (known length):
|
||||
|
||||
[[ [X_1...X_m] ]] = header(3,0,m) ++ [[X_1]] ++...++ [[X_m]]
|
||||
[[ #set{X_1...X_m} ]] = header(3,1,m) ++ [[X_1]] ++...++ [[X_m]]
|
||||
[[ {K_1:V_1...K_m:V_m} ]] = header(3,2,m*2) ++ [[K_1]] ++ [[V_1]] ++...
|
||||
[[ [X_1...X_m] ]] = header(2,1,m) ++ [[X_1]] ++...++ [[X_m]]
|
||||
[[ #set{X_1...X_m} ]] = header(2,2,m) ++ [[X_1]] ++...++ [[X_m]]
|
||||
[[ {K_1:V_1...K_m:V_m} ]] = header(2,3,m*2) ++ [[K_1]] ++ [[V_1]] ++...
|
||||
++ [[K_m]] ++ [[V_m]]
|
||||
|
||||
Note that `m*2` is given to `header` for a `Dictionary`, since there
|
||||
|
@ -588,10 +603,10 @@ are two `Value`s in each key-value pair.
|
|||
|
||||
Format C (streaming):
|
||||
|
||||
[[ [X_1...X_m] ]] = open(3,0) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,0)
|
||||
[[ #set{X_1...X_m} ]] = open(3,1) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,1)
|
||||
[[ {K_1:V_1...K_m:V_m} ]] = open(3,2) ++ [[K_1]] ++ [[V_1]] ++...
|
||||
++ [[K_m]] ++ [[V_m]] ++ close(3,2)
|
||||
[[ [X_1...X_m] ]] = open(2,1) ++ [[X_1]] ++...++ [[X_m]] ++ close()
|
||||
[[ #set{X_1...X_m} ]] = open(2,2) ++ [[X_1]] ++...++ [[X_m]] ++ close()
|
||||
[[ {K_1:V_1...K_m:V_m} ]] = open(2,3) ++ [[K_1]] ++ [[V_1]] ++...
|
||||
++ [[K_m]] ++ [[V_m]] ++ close()
|
||||
|
||||
Applications may use whichever format suits their needs on a
|
||||
case-by-case basis.
|
||||
|
@ -616,15 +631,13 @@ order.
|
|||
the option of serializing with set elements and dictionary keys in
|
||||
sorted order.
|
||||
|
||||
Note that `header(3,3,m)` and `open(3,3)`/`close(3,3)` are unused and reserved.
|
||||
|
||||
### SignedIntegers.
|
||||
|
||||
Format B/A (known length/fixed-size):
|
||||
|
||||
[[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x) if x<-3 ∨ 13≤x
|
||||
header(0,1,x+16) if -3≤x<0
|
||||
header(0,1,x) if 0≤x<13
|
||||
header(0,3,x+16) if -3≤x<0
|
||||
header(0,3,x) if 0≤x<13
|
||||
|
||||
Integers in the range [-3,12] are compactly represented using format A
|
||||
because they are so frequently used. Other integers are represented
|
||||
|
@ -644,22 +657,22 @@ needed to unambiguously identify the value and its sign, and `m =
|
|||
|
||||
For example,
|
||||
|
||||
[[ -257 ]] = 42 FE FF [[ -3 ]] = 1D [[ 128 ]] = 42 00 80
|
||||
[[ -256 ]] = 42 FF 00 [[ -2 ]] = 1E [[ 255 ]] = 42 00 FF
|
||||
[[ -255 ]] = 42 FF 01 [[ -1 ]] = 1F [[ 256 ]] = 42 01 00
|
||||
[[ -254 ]] = 42 FF 02 [[ 0 ]] = 10 [[ 32767 ]] = 42 7F FF
|
||||
[[ -129 ]] = 42 FF 7F [[ 1 ]] = 11 [[ 32768 ]] = 43 00 80 00
|
||||
[[ -128 ]] = 41 80 [[ 12 ]] = 1C [[ 65535 ]] = 43 00 FF FF
|
||||
[[ -257 ]] = 42 FE FF [[ -3 ]] = 3D [[ 128 ]] = 42 00 80
|
||||
[[ -256 ]] = 42 FF 00 [[ -2 ]] = 3E [[ 255 ]] = 42 00 FF
|
||||
[[ -255 ]] = 42 FF 01 [[ -1 ]] = 3F [[ 256 ]] = 42 01 00
|
||||
[[ -254 ]] = 42 FF 02 [[ 0 ]] = 30 [[ 32767 ]] = 42 7F FF
|
||||
[[ -129 ]] = 42 FF 7F [[ 1 ]] = 31 [[ 32768 ]] = 43 00 80 00
|
||||
[[ -128 ]] = 41 80 [[ 12 ]] = 3C [[ 65535 ]] = 43 00 FF FF
|
||||
[[ -127 ]] = 41 81 [[ 13 ]] = 41 0D [[ 65536 ]] = 43 01 00 00
|
||||
[[ -4 ]] = 41 FC [[ 127 ]] = 41 7F [[ 131072 ]] = 43 02 00 00
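The integer examples can be reproduced with the following Python sketch. Two points are assumptions rather than quotations from the specification: `intbytes` is taken to be the shortest big-endian two's complement representation, and the small-integer case uses `leadbyte` directly, because the examples show `-1` encoded as the single byte `3F`.

```python
def leadbyte(t, n, m):
    return bytes([t * 64 + n * 16 + m])


def varint(m):
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def header(t, n, m):
    # Assumed definition: lengths of 15 or more are followed by a varint.
    return leadbyte(t, n, m) if m < 15 else leadbyte(t, n, 15) + varint(m)


def intbytes(x):
    """Shortest big-endian two's complement encoding that keeps the sign."""
    magnitude_bits = (x if x >= 0 else ~x).bit_length()
    return x.to_bytes(magnitude_bits // 8 + 1, "big", signed=True)


def encode_integer(x):
    if 0 <= x < 13:
        return leadbyte(0, 3, x)        # format A, small non-negative
    if -3 <= x < 0:
        return leadbyte(0, 3, x + 16)   # format A, small negative
    body = intbytes(x)
    return header(1, 0, len(body)) + body  # format B
```

Calling `encode_integer(-257).hex()` gives `42feff`, in line with the first entry of the examples.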
|
||||
|
||||
### Strings, ByteStrings and Symbols.
|
||||
|
||||
Syntax for these three types varies only in the value of `n` supplied
|
||||
to `header`, `open`, and `close`. In each case, the payload following
|
||||
the header is a binary sequence; for `String` and `Symbol`, it is a
|
||||
UTF-8 encoding of the `Value`'s code points, while for `ByteString` it
|
||||
is the raw data contained within the `Value` unmodified.
|
||||
to `header` and `open`. In each case, the payload following the header
|
||||
is a binary sequence; for `String` and `Symbol`, it is a UTF-8
|
||||
encoding of the `Value`'s code points, while for `ByteString` it is
|
||||
the raw data contained within the `Value` unmodified.
|
||||
|
||||
Format B (known length):
|
||||
|
||||
|
@ -671,7 +684,8 @@ Format B (known length):
|
|||
|
||||
To stream a `String`, `ByteString` or `Symbol`, emit `open(1,n)` and
|
||||
then a sequence of zero or more format B chunks, followed by
|
||||
`close(1,n)`. Every chunk must be a `ByteString`.
|
||||
`close()`. Every chunk must be a `ByteString`, and no chunk may be
|
||||
annotated.
|
||||
|
||||
While the overall content of a streamed `String` or `Symbol` must be
|
||||
valid UTF-8, individual chunks do not have to conform to UTF-8.
|
||||
|
@ -680,8 +694,8 @@ valid UTF-8, individual chunks do not have to conform to UTF-8.
|
|||
|
||||
Fixed-length atoms all use format A, and do not have a length
|
||||
representation. They repurpose the bits that format B `Repr`s use to
|
||||
specify lengths. Applications *MUST NOT* use format C with
|
||||
`open(0,n)` or `close(0,n)` for any `n`.
|
||||
specify lengths. Applications *MUST NOT* use format C with `open(0,n)`
|
||||
for any `n`.
|
||||
|
||||
#### Booleans.
|
||||
|
||||
|
@ -696,6 +710,18 @@ specify lengths. Applications *MUST NOT* use format C with
|
|||
The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
|
||||
8-byte IEEE 754 binary representations of `F` and `D`, respectively.
|
||||
|
||||
### Annotations.
|
||||
|
||||
To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
|
||||
`[0x05] ++ [[v]]`.
|
||||
|
||||
For example, the `Repr` corresponding to textual syntax `@a@b[]`, i.e.
|
||||
an empty sequence annotated with two symbols, `a` and `b`, is
|
||||
|
||||
[[ @a @b [] ]]
|
||||
= [0x05] ++ [[a]] ++ [0x05] ++ [[b]] ++ [[ [] ]]
|
||||
= [0x05, 0x71, 0x61, 0x05, 0x71, 0x62, 0x90]
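A minimal Python sketch of the prepending rule follows. The function name `annotate` is hypothetical, and it works on already-encoded `Repr`s rather than on `Value`s.

```python
ANNOTATION = b"\x05"  # leadbyte(0,0,5)


def annotate(repr_bytes, annotation_reprs):
    """Prefix a Repr with annotations, given in textual (left-to-right) order."""
    prefix = b"".join(ANNOTATION + a for a in annotation_reprs)
    return prefix + repr_bytes
```

Annotating the empty sequence `90` with the symbols `a` (`71 61`) and `b` (`71 62`) in that order reproduces the byte sequence of the example.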
|
||||
|
||||
## Examples
|
||||
|
||||
### Simple examples.
|
||||
|
@ -703,25 +729,25 @@ The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
|
|||
<!-- TODO: Give some examples of large and small Preserves, perhaps -->
|
||||
<!-- translated from various JSON blobs floating around the internet. -->
|
||||
|
||||
For the following examples, imagine an application that maps `Record`
|
||||
short form label number 0 to label `discard`, 1 to `capture`, and 2 to
|
||||
For the following examples, imagine an application that maps
|
||||
placeholder number 0 to symbol `discard`, 1 to `capture`, and 2 to
|
||||
`observe`.
|
||||
|
||||
| Value | Encoded hexadecimal byte sequence |
|
||||
|---------------------------------------------------|----------------------------------------------------------------------|
|
||||
| `capture(discard())` | 91 80 |
|
||||
| `observe(speak(discard(), capture(discard())))` | A1 B3 75 73 70 65 61 6B 80 91 80 |
|
||||
| `[1 2 3 4]` (format B) | C4 11 12 13 14 |
|
||||
| `[1 2 3 4]` (format C) | 2C 11 12 13 14 3C |
|
||||
| `[-2 -1 0 1]` | C4 1E 1F 10 11 |
|
||||
| `capture(discard())` | 82 11 81 10 |
|
||||
| `observe(speak(discard(), capture(discard())))` | 82 12 83 75 73 70 65 61 6B 81 10 82 11 81 10 |
|
||||
| `[1 2 3 4]` (format B) | 94 31 32 33 34 |
|
||||
| `[1 2 3 4]` (format C) | 29 31 32 33 34 04 |
|
||||
| `[-2 -1 0 1]` | 94 3E 3F 30 31 |
|
||||
| `"hello"` (format B) | 55 68 65 6C 6C 6F |
|
||||
| `"hello"` (format C, 2 chunks) | 25 62 68 65 63 6C 6C 6F 04 |
|
||||
| `"hello"` (format C, 5 chunks) | 25 62 68 65 62 6C 6C 60 60 61 6F 35 |
|
||||
| `["hello" there #"world" [] #set{} #true #false]` | C7 55 68 65 6C 6C 6F 75 74 68 65 72 65 65 77 6F 72 6C 64 C0 D0 01 00 |
|
||||
| `"hello"` (format C, 5 chunks) | 25 61 68 61 65 61 6C 61 6C 61 6F 04 |
|
||||
| `["hello" there #"world" [] #set{} #true #false]` | 97 55 68 65 6C 6C 6F 75 74 68 65 72 65 65 77 6F 72 6C 64 90 A0 01 00 |
|
||||
| `-257` | 42 FE FF |
|
||||
| `-1` | 1F |
|
||||
| `0` | 10 |
|
||||
| `1` | 11 |
|
||||
| `-1` | 3F |
|
||||
| `0` | 30 |
|
||||
| `1` | 31 |
|
||||
| `255` | 42 00 FF |
|
||||
| `1.0f` | 02 3F 80 00 00 |
|
||||
| `1.0` | 03 3F F0 00 00 00 00 00 00 |
|
||||
|
@ -733,20 +759,20 @@ The next example uses a non-`Symbol` label for a record.[^extensibility2] The `R
|
|||
|
||||
encodes to
|
||||
|
||||
B5 ;; Record, generic, 4+1
|
||||
C5 ;; Sequence, 5
|
||||
85 ;; Record, generic, 4+1
|
||||
95 ;; Sequence, 5
|
||||
76 74 69 74 6C 65 64 ;; Symbol, "titled"
|
||||
76 70 65 72 73 6F 6E ;; Symbol, "person"
|
||||
12 ;; SignedInteger, "2"
|
||||
32 ;; SignedInteger, "2"
|
||||
75 74 68 69 6E 67 ;; Symbol, "thing"
|
||||
11 ;; SignedInteger, "1"
|
||||
31 ;; SignedInteger, "1"
|
||||
41 65 ;; SignedInteger, "101"
|
||||
59 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell"
|
||||
B4 ;; Record, generic, 3+1
|
||||
84 ;; Record, generic, 3+1
|
||||
74 64 61 74 65 ;; Symbol, "date"
|
||||
42 07 1D ;; SignedInteger, "1821"
|
||||
12 ;; SignedInteger, "2"
|
||||
13 ;; SignedInteger, "3"
|
||||
32 ;; SignedInteger, "2"
|
||||
33 ;; SignedInteger, "3"
|
||||
52 44 72 ;; String, "Dr"
|
||||
|
||||
[^extensibility2]: It happens to line up with Racket's
|
||||
|
@ -787,19 +813,19 @@ read as `Symbol`s. The first example:
|
|||
|
||||
encodes to binary as follows:
|
||||
|
||||
E2
|
||||
B2
|
||||
55 "Image"
|
||||
EC
|
||||
BC
|
||||
55 "Width" 42 03 20
|
||||
55 "Title" 5F 14 "View from 15th Floor"
|
||||
58 "Animated" 75 "false"
|
||||
56 "Height" 42 02 58
|
||||
59 "Thumbnail"
|
||||
E6
|
||||
B6
|
||||
55 "Width" 41 64
|
||||
53 "Url" 5F 26 "http://www.example.com/image/481989943"
|
||||
56 "Height" 41 7D
|
||||
53 "IDs" C4
|
||||
53 "IDs" 94
|
||||
41 74
|
||||
42 03 AF
|
||||
42 00 EA
|
||||
|
@ -832,8 +858,8 @@ and the second example:
|
|||
|
||||
encodes to binary as follows:
|
||||
|
||||
C2
|
||||
EF 10
|
||||
92
|
||||
BF 10
|
||||
59 "precision" 53 "zip"
|
||||
58 "Latitude" 03 40 42 E2 26 80 9D 49 52
|
||||
59 "Longitude" 03 C0 5E 99 56 6C F4 1F 21
|
||||
|
@ -842,7 +868,7 @@ encodes to binary as follows:
|
|||
55 "State" 52 "CA"
|
||||
53 "Zip" 55 "94107"
|
||||
57 "Country" 52 "US"
|
||||
EF 10
|
||||
BF 10
|
||||
59 "precision" 53 "zip"
|
||||
58 "Latitude" 03 40 42 AF 9D 66 AD B4 03
|
||||
59 "Longitude" 03 C0 5E 81 AA 4F CA 42 AF
|
||||
|
@ -957,16 +983,17 @@ such media types following the general rules for ordering of
|
|||
|
||||
| Value | Encoded hexadecimal byte sequence |
|
||||
|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
|
||||
| `mime(application/octet-stream #"abcde")` | B3 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
|
||||
| `mime(text/plain #"ABC")` | B3 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 |
|
||||
| `mime(application/xml #"<xhtml/>")` | B3 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E |
|
||||
| `mime(text/csv #"123,234,345")` | B3 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 |
|
||||
| `mime(application/octet-stream #"abcde")` | 83 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
|
||||
| `mime(text/plain #"ABC")` | 83 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 |
|
||||
| `mime(application/xml #"<xhtml/>")` | 83 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E |
|
||||
| `mime(text/csv #"123,234,345")` | 83 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 |
|
||||
|
||||
Applications making heavy use of `mime` records may choose to use a
|
||||
short form label number for the record type. For example, if short
|
||||
form label number 1 were chosen, the second example above,
|
||||
`mime(text/plain "ABC")`, would be encoded with "92" in place of "B3
|
||||
74 6D 69 6D 65".
|
||||
placeholder number for the symbol `mime` as well as the symbols for
|
||||
individual media types. For example, if placeholder number 1 were
|
||||
chosen for `mime`, and placeholder number 7 for `text/plain`, the
|
||||
second example above, `mime(text/plain #"ABC")`, would be encoded as
|
||||
`83 11 17 63 41 42 43`.
|
||||
|
||||
### Unicode normalization forms.
|
||||
|
||||
|
@ -1027,20 +1054,23 @@ or `date-time` productions of
|
|||
|
||||
## Security Considerations
|
||||
|
||||
**Empty chunks.** Streamed (format C) `String`s, `ByteString`s and
|
||||
`Symbol`s may include chunks of zero length. This opens up a
|
||||
possibility for denial-of-service: an attacker may begin streaming a
|
||||
string, sending an endless sequence of zero length chunks, appearing
|
||||
to make progress but not actually doing so. Implementations may place
|
||||
optional reasonable restrictions on the number of consecutive empty
|
||||
chunks that may appear in a stream, and may even supply an optional
|
||||
mode that rejects empty chunks entirely.
|
||||
**Empty chunks.** Chunks of zero length are prohibited in streamed
|
||||
(format C) `Repr`s. However, a malicious or broken encoder may include
|
||||
them nonetheless. This opens up a possibility for denial-of-service:
|
||||
an attacker may begin streaming a `String`, for example, sending an
|
||||
endless sequence of zero-length chunks, appearing to make progress but
|
||||
not actually doing so. Implementations *MUST* reject zero-length
|
||||
chunks when decoding, and *MUST NOT* produce them when encoding.
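One way a decoder might enforce this rule for streamed atoms is sketched below in Python. The function name `read_streamed_atom` and its return shape are hypothetical, and the sketch handles only format C `String`s, `ByteString`s and `Symbol`s, not compound values or annotations.

```python
def read_streamed_atom(data):
    """Decode one format C atom; return (n, payload, bytes_consumed)."""
    lead = data[0]
    # Stream Start is 0010 ttnn; atoms have tt = 01.
    if lead >> 4 != 0x2 or (lead >> 2) & 0b11 != 0b01:
        raise ValueError("not the start of a streamed atom")
    kind = lead & 0b11
    pos = 1
    payload = bytearray()
    while True:
        if pos >= len(data):
            raise ValueError("stream ended without a Stream End byte")
        chunk_lead = data[pos]
        pos += 1
        if chunk_lead == 0x04:          # Stream End
            return kind, bytes(payload), pos
        if chunk_lead >> 4 != 0x6:      # 0110 mmmm: format B ByteString
            raise ValueError("every chunk must be a format B ByteString")
        length = chunk_lead & 0x0F
        if length == 15:                # length continues as a varint
            length, shift = 0, 0
            while True:
                if pos >= len(data):
                    raise ValueError("truncated length")
                b = data[pos]
                pos += 1
                length |= (b & 0x7F) << shift
                shift += 7
                if not b & 0x80:
                    break
        if length == 0:
            raise ValueError("zero-length chunk rejected")
        if pos + length > len(data):
            raise ValueError("truncated chunk")
        payload += data[pos:pos + length]
        pos += length
```

Because every accepted chunk contributes at least one payload byte, a sender cannot keep such a decoder busy without delivering data.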
|
||||
|
||||
**Whitespace.** Similarly, the textual format for `Value`s allows
|
||||
arbitrary whitespace in many positions. In streaming transfer
|
||||
situations, consider optional restrictions on the amount of
|
||||
consecutive whitespace that may appear in a serialized `Value`.
|
||||
|
||||
**Annotations.** Also similarly, in modes where a `Value` is being
|
||||
read while annotations are skipped, an endless sequence of annotations
|
||||
may give an illusion of progress.
|
||||
|
||||
**Canonical form for cryptographic hashing and signing.** As
|
||||
specified, neither the textual nor the compact binary encoding rules
|
||||
for `Value`s force canonical serializations. Two serializations of the
|
||||
|
@ -1052,24 +1082,26 @@ same `Value` may yield different binary `Repr`s.
|
|||
01 - True
|
||||
02 - Float
|
||||
03 - Double
|
||||
(0x) RESERVED 04-0F
|
||||
1x - Small integers 0..12,-3..-1
|
||||
04 - End Stream
|
||||
05 - Annotation
|
||||
(0x) RESERVED 06-0F
|
||||
1x - Placeholder
|
||||
2x - Start Stream
|
||||
3x - End Stream
|
||||
3x - Small integers 0..12,-3..-1
|
||||
|
||||
4x - SignedInteger
|
||||
5x - String
|
||||
6x - ByteString
|
||||
7x - Symbol
|
||||
|
||||
8x - short form Record label index 0
|
||||
9x - short form Record label index 1
|
||||
Ax - short form Record label index 2
|
||||
Bx - Record
|
||||
8x - Record
|
||||
9x - Sequence
|
||||
Ax - Set
|
||||
Bx - Dictionary
|
||||
|
||||
Cx - Sequence
|
||||
Dx - Set
|
||||
Ex - Dictionary
|
||||
(Cx) RESERVED C0-CF
|
||||
(Dx) RESERVED D0-DF
|
||||
(Ex) RESERVED E0-EF
|
||||
(Fx) RESERVED F0-FF
|
||||
|
||||
## Appendix. Bit fields within lead byte values
|
||||
|
@ -1081,31 +1113,40 @@ same `Value` may yield different binary `Repr`s.
|
|||
00 00 0001 True
|
||||
00 00 0010 Float, 32 bits big-endian binary
|
||||
00 00 0011 Double, 64 bits big-endian binary
|
||||
00 00 0100 End Stream (to match a previous Start Stream)
|
||||
00 00 0101 Annotation; two more Reprs follow
|
||||
|
||||
00 01 xxxx Small integers 0..12,-3..-1
|
||||
00 01 mmmm Placeholder; m is the placeholder number
|
||||
|
||||
00 10 ttnn Start Stream <tt,nn>
|
||||
When tt = 00 --> error
|
||||
01 --> each chunk is a ByteString
|
||||
1x --> each chunk is a single encoded Value
|
||||
00 11 ttnn End Stream <tt,nn> (must match preceding Start Stream)
|
||||
10 --> each chunk is a single encoded Value
|
||||
11 --> error (RESERVED)
|
||||
|
||||
00 11 xxxx Small integers 0..12,-3..-1
|
||||
|
||||
01 00 mmmm SignedInteger, big-endian binary
|
||||
01 01 mmmm String, UTF-8 binary
|
||||
01 10 mmmm ByteString
|
||||
01 11 mmmm Symbol, UTF-8 binary
|
||||
|
||||
10 00 mmmm application-specific Record
|
||||
10 01 mmmm application-specific Record
|
||||
10 10 mmmm application-specific Record
|
||||
10 11 mmmm Record
|
||||
10 00 mmmm Record
|
||||
10 01 mmmm Sequence
|
||||
10 10 mmmm Set
|
||||
10 11 mmmm Dictionary
|
||||
|
||||
11 00 mmmm Sequence
|
||||
11 01 mmmm Set
|
||||
11 10 mmmm Dictionary
|
||||
11 nn mmmm error, RESERVED
|
||||
|
||||
If mmmm = 1111, a varint(m) follows, giving the length, before
|
||||
the body; otherwise, m is the length of the body to follow.
|
||||
Where `mmmm` appears, interpret it as an unsigned 4-bit number `m`. If
|
||||
`m`<15, let `l`=`m`. Otherwise, `m`=15; let `l` be the result of
|
||||
decoding the varint that follows.
|
||||
|
||||
Then, if `ttnn`=`0001`, `l` is the placeholder number; otherwise, `l`
|
||||
is the length of the body that follows, counted in bytes for `tt`=`01`
|
||||
and in `Repr`s for `tt`=`10`.
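The interpretation of `mmmm` described above can be sketched in Python as follows. The function name `read_length` is hypothetical. It deliberately rejects lead bytes whose low four bits are not an `mmmm` field (fixed-length atoms, stream markers, annotations and small integers), since those bits mean something else there.

```python
def read_length(data, pos=0):
    """Return (t, n, l, next_pos) for a lead byte carrying an mmmm field."""
    lead = data[pos]
    t, n, m = lead >> 6, (lead >> 4) & 0b11, lead & 0b1111
    if not (t in (1, 2) or (t, n) == (0, 1)):
        raise ValueError("lead byte does not carry an mmmm field")
    pos += 1
    if m < 15:
        return t, n, m, pos
    # m = 15: the real value follows as a varint, least significant group first.
    l, shift = 0, 0
    while True:
        b = data[pos]
        pos += 1
        l |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            return t, n, l, pos
```

For `t`=0, `n`=1 the returned `l` is the placeholder number; for `t`=1 it is a byte count, and for `t`=2 a count of `Repr`s.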
|
||||
|
||||
<!-- Not yet ready
|
||||
|
||||
## Appendix. Representing Values in Programming Languages
|
||||
|
||||
|
@ -1118,6 +1159,9 @@ When designing a language mapping, an important consideration is
|
|||
roundtripping: serialization after deserialization, and vice versa,
|
||||
should both be identities.
|
||||
|
||||
Also, the presence or absence of annotations on a `Value` should not
|
||||
affect comparisons of that `Value` to others in any way.
|
||||
|
||||
### JavaScript.
|
||||
|
||||
- `Boolean` ↔ `Boolean`
|
||||
|
@ -1211,6 +1255,8 @@ or `Record`s.
|
|||
- `Set` ↔ `Set`
|
||||
- `Dictionary` ↔ `Dictionary`
|
||||
|
||||
-->
|
||||
|
||||
## Appendix. Why not Just Use JSON?
|
||||
|
||||
<!-- JSON lacks semantics: JSON syntax doesn't denote anything -->
|
||||
|
@ -1367,18 +1413,14 @@ Q. Should "symbols" instead be URIs? Relative, usually; relative to
|
|||
what? Some domain-specific base URI?
|
||||
|
||||
Q. Literal small integers: are they pulling their weight? They're not
|
||||
absolutely necessary. They mess up the connection between
|
||||
value-type-ordering and repr-tag-ordering! (The connection between
|
||||
*value* ordering and *repr* ordering is already irretrievably messed
|
||||
up: length prefixes blow lexicographic ordering away, sign bits are
|
||||
the wrong way around, floats are sign-magnitude, etc etc.)
|
||||
absolutely necessary.
|
||||
|
||||
Q. Should we go for trying to make the data ordering line up with the
|
||||
encoding ordering? We'd have to only use streaming forms, and avoid
|
||||
the small integer encoding, and not store record arities, and sort
|
||||
sets and dictionaries, and mask floats and doubles (perhaps
|
||||
[like this](https://stackoverflow.com/questions/43299299/sorting-floating-point-values-using-their-byte-representation)),
|
||||
and pick a specific `NaN`, and I don't know what to do about
|
||||
and perhaps pick a specific `NaN`, and I don't know what to do about
|
||||
SignedIntegers. Perhaps make them more like float formats, with the
|
||||
byte count acting as a kind of exponent underneath the sign bit.
|
||||
|
||||
|
@ -1413,11 +1455,3 @@ link escape"; it is not a printable ASCII character, and is disallowed
|
|||
in the textual Preserves grammar; and it is also mnemonic for "version
|
||||
0", since it is the Preserves binary encoding of the small integer
|
||||
zero.))
|
||||
|
||||
IN PROGRESS: Remove the special short syntax for application-specific record
|
||||
label usage? Then perhaps 8x, 9x, Ax and Bx would work for Record,
|
||||
Sequence, Set and Dictionary, leaving Cx, Dx, Ex and Fx entirely free.
|
||||
|
||||
TODO: Forbid empty chunks.
|
||||
|
||||
## Notes
|
||||
|
|