Split up spec!
parent
1f495eef1e
commit
7d3789e371
14
README.md
|
@ -6,22 +6,24 @@ no_site_title: true
|
||||||
---
|
---
|
||||||
|
|
||||||
This [repository]({{page.projectpages}}) contains a
|
This [repository]({{page.projectpages}}) contains a
|
||||||
[proposal](preserves.html) and various implementations of *Preserves*,
|
[proposal](preserves.html) and various implementations of *Preserves*, a
|
||||||
a new data model and serialization format in many ways comparable to
|
new data model, with associated serialization formats, in many ways
|
||||||
JSON, XML, S-expressions, CBOR, ASN.1 BER, and so on.
|
comparable to JSON, XML, S-expressions, CBOR, ASN.1 BER, and so on.
|
||||||
|
|
||||||
## Core documents
|
## Core documents
|
||||||
|
|
||||||
### Preserves data model and serialization formats
|
### Preserves data model and serialization formats
|
||||||
|
|
||||||
Preserves is defined in terms of a syntax-neutral
|
Preserves is defined in terms of a syntax-neutral
|
||||||
[data model and semantics](preserves.html#starting-with-semantics)
|
[data model and semantics](preserves.html#semantics)
|
||||||
which all transfer syntaxes share. This allows trivial, completely
|
which all transfer syntaxes share. This allows trivial, completely
|
||||||
automatic, perfect-fidelity conversion between syntaxes.
|
automatic, perfect-fidelity conversion between syntaxes.
|
||||||
|
|
||||||
|
- [Preserves specification](preserves.html):
|
||||||
|
- [Preserves semantics and data model](preserves.html#semantics),
|
||||||
|
- [Preserves textual syntax](preserves-text.html), and
|
||||||
|
- [Preserves machine-oriented binary syntax](preserves-binary.html)
|
||||||
- [Preserves tutorial](TUTORIAL.html)
|
- [Preserves tutorial](TUTORIAL.html)
|
||||||
- [Preserves specification](preserves.html), including semantics,
|
|
||||||
data model, textual syntax, and compact binary syntax
|
|
||||||
- [Canonical Form for Binary Syntax](canonical-binary.html)
|
- [Canonical Form for Binary Syntax](canonical-binary.html)
|
||||||
- [Syrup](https://github.com/ocapn/syrup#pseudo-specification), a
|
- [Syrup](https://github.com/ocapn/syrup#pseudo-specification), a
|
||||||
hybrid binary/human-readable syntax for the Preserves data model
|
hybrid binary/human-readable syntax for the Preserves data model
|
||||||
|
|
|
@ -13,3 +13,5 @@ defaults:
|
||||||
layout: page
|
layout: page
|
||||||
|
|
||||||
title: "Preserves"
|
title: "Preserves"
|
||||||
|
version_date: "June 2022"
|
||||||
|
version: "0.6.3"
|
||||||
|
|
|
@ -17,8 +17,8 @@ their *syntax* for equivalence gives the same result as comparing them
|
||||||
That is, canonical forms are equal if and only if the encoded `Value`s
|
That is, canonical forms are equal if and only if the encoded `Value`s
|
||||||
are equal.
|
are equal.
|
||||||
|
|
||||||
This document specifies canonical form for the Preserves compact
|
This document specifies canonical form for the Preserves [machine-oriented
|
||||||
binary syntax.
|
binary syntax](preserves-binary.html).
|
||||||
|
|
||||||
**Annotations.**
|
**Annotations.**
|
||||||
Annotations *MUST NOT* be present.
|
Annotations *MUST NOT* be present.
|
||||||
|
|
|
@ -0,0 +1,260 @@
|
||||||
|
---
|
||||||
|
no_site_title: true
|
||||||
|
title: "Preserves: Binary Syntax"
|
||||||
|
---
|
||||||
|
|
||||||
|
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
|
||||||
|
{{ site.version_date }}. Version {{ site.version }}.
|
||||||
|
|
||||||
|
[sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
|
||||||
|
[spki]: http://world.std.com/~cme/html/spki.html
|
||||||
|
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
|
||||||
|
[LEB128]: https://en.wikipedia.org/wiki/LEB128
|
||||||
|
[erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
|
||||||
|
[abnf]: https://tools.ietf.org/html/rfc7405
|
||||||
|
[canonical]: canonical-binary.html
|
||||||
|
|
||||||
|
*Preserves* is a data model, with associated serialization formats. This
|
||||||
|
document defines one of those formats: a binary syntax for `Value`s from
|
||||||
|
the [Preserves data model](preserves.html) that is easy for computer
|
||||||
|
software to read and write. An [equivalent human-readable text
|
||||||
|
syntax](preserves-text.html) also exists.
|
||||||
|
|
||||||
|
## Machine-Oriented Binary Syntax
|
||||||
|
|
||||||
|
A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
|
||||||
|
For a value `v`, we write `«v»` for the `Repr` of `v`.
|
||||||
|
|
||||||
|
### Type and Length representation.
|
||||||
|
|
||||||
|
Each `Repr` starts with a tag byte, describing the kind of information
|
||||||
|
represented. Depending on the tag, a length indicator, further encoded
|
||||||
|
information, and/or an ending tag may follow.
|
||||||
|
|
||||||
|
tag (simple atomic data and small integers)
|
||||||
|
tag ++ binarydata (most integers)
|
||||||
|
tag ++ length ++ binarydata (large integers, strings, symbols, and binary)
|
||||||
|
tag ++ repr ++ ... ++ endtag (compound data)
|
||||||
|
|
||||||
|
The unique end tag is byte value `0x84`.
|
||||||
|
|
||||||
|
If present after a tag, the length of a following piece of binary data
|
||||||
|
is formatted as a [base 128 varint][varint].[^see-also-leb128] We
|
||||||
|
write `varint(m)` for the varint-encoding of `m`. Quoting the
|
||||||
|
[Google Protocol Buffers][varint] definition,
|
||||||
|
|
||||||
|
[^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned
|
||||||
|
integers. Varints and LEB128-encoded integers differ only for
|
||||||
|
signed integers, which are not used in Preserves.
|
||||||
|
|
||||||
|
> Each byte in a varint, except the last byte, has the most
|
||||||
|
> significant bit (msb) set – this indicates that there are further
|
||||||
|
> bytes to come. The lower 7 bits of each byte are used to store the
|
||||||
|
> two's complement representation of the number in groups of 7 bits,
|
||||||
|
> least significant group first.
|
||||||
|
|
||||||
|
The following table illustrates varint-encoding.
|
||||||
|
|
||||||
|
| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
|
||||||
|
| ------ | ------------------- | ------------ |
|
||||||
|
| 15 | `0001111` | 15 |
|
||||||
|
| 300 | `0000010 0101100` | 172 2 |
|
||||||
|
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
|
||||||
|
|
||||||
|
It is an error for a varint-encoded `m` in a `Repr` to be anything
|
||||||
|
other than the unique shortest encoding for that `m`. That is, a
|
||||||
|
varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0.
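
The varint rules above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names `varint` and `read_varint` are chosen here for convenience, and the decoder rejects non-minimal encodings on the assumption that a reader wants to enforce the shortest-encoding rule strictly.

```python
def varint(m):
    """Encode a non-negative integer as a base-128 varint (unsigned LEB128)."""
    if m < 0:
        raise ValueError("varint values are unsigned")
    out = bytearray()
    while True:
        low7 = m & 0x7F
        m >>= 7
        if m:
            out.append(low7 | 0x80)  # msb set: more bytes follow
        else:
            out.append(low7)         # msb clear: final byte
            return bytes(out)


def read_varint(data, pos=0):
    """Decode a varint at data[pos:]; return (value, position after it)."""
    value = 0
    shift = 0
    start = pos
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if (byte & 0x80) == 0:
            break
    # A multi-byte encoding whose final byte is 0 is not the shortest form.
    if byte == 0 and pos - start > 1:
        raise ValueError("non-minimal varint")
    return value, pos
```

With these definitions, `varint(300)` yields the two bytes 172 and 2, matching the table above.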
|
||||||
|
|
||||||
|
### Records, Sequences, Sets and Dictionaries.
|
||||||
|
|
||||||
|
«<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
|
||||||
|
«[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
|
||||||
|
«#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
|
||||||
|
«{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]
|
||||||
|
|
||||||
|
There is *no* ordering requirement on the `E_i` elements or
|
||||||
|
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
|
||||||
|
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
|
||||||
|
addition, implementations *SHOULD* default to writing set elements and
|
||||||
|
dictionary key/value pairs in order sorted lexicographically by their
|
||||||
|
`Repr`s[^not-sorted-semantically], and *MAY* offer the option of
|
||||||
|
serializing in some other implementation-defined order.
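
As a rough sketch of the compound encodings and the recommended default ordering, the Python below assembles compound `Repr`s from already-encoded parts. The function names are hypothetical, and sorting dictionary entries by the `Repr` of the key alone is an assumption made for this illustration.

```python
END = b"\x84"  # the unique end tag


def encode_record(label_repr, field_reprs):
    """Tag 0xB4, then the label, then each field, then the end tag."""
    return b"\xB4" + label_repr + b"".join(field_reprs) + END


def encode_sequence(item_reprs):
    """Sequences keep the order they were given."""
    return b"\xB5" + b"".join(item_reprs) + END


def _require_distinct(sorted_reprs):
    # After sorting, equal Reprs would be adjacent.
    for a, b in zip(sorted_reprs, sorted_reprs[1:]):
        if a == b:
            raise ValueError("duplicate set element or dictionary key")


def encode_set(element_reprs):
    """Write elements sorted bytewise by Repr (Python compares bytes that way)."""
    ordered = sorted(element_reprs)
    _require_distinct(ordered)
    return b"\xB6" + b"".join(ordered) + END


def encode_dictionary(pair_reprs):
    """pair_reprs is an iterable of (key_repr, value_repr) tuples."""
    ordered = sorted(pair_reprs, key=lambda kv: kv[0])
    _require_distinct([k for k, _ in ordered])
    return b"\xB7" + b"".join(k + v for k, v in ordered) + END
```

Sorting the encoded byte strings, rather than the values themselves, means an implementation needs no ordered-collection type of its own.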
|
||||||
|
|
||||||
|
[^no-sorting-rationale]: In the BitTorrent encoding format,
|
||||||
|
[bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
|
||||||
|
dictionary key/value pairs must be sorted by key. This is a
|
||||||
|
necessary step for ensuring serialization of `Value`s is
|
||||||
|
canonical. We do not require that key/value pairs (or set
|
||||||
|
elements) be in sorted order for serialized `Value`s; however, a
|
||||||
|
[canonical form][canonical] for `Repr`s does exist where a sorted
|
||||||
|
ordering is required.
|
||||||
|
|
||||||
|
[^not-sorted-semantically]: It's important to note that the sort
|
||||||
|
ordering for writing out set elements and dictionary key/value
|
||||||
|
pairs is *not* the same as the sort ordering implied by the
|
||||||
|
semantic ordering of those elements or keys. For example, the
|
||||||
|
`Repr` of a negative number very far from zero will start with a
|
||||||
|
byte that is *greater* than the byte which starts the `Repr` of
|
||||||
|
zero, making it sort lexicographically later by `Repr`, despite
|
||||||
|
being semantically *less than* zero.
|
||||||
|
|
||||||
|
**Rationale**. This is for ease-of-implementation reasons: not all
|
||||||
|
languages can easily represent sorted sets or sorted dictionaries,
|
||||||
|
but encoding and then sorting byte strings is much more likely to
|
||||||
|
be within easy reach.
|
||||||
|
|
||||||
|
### SignedIntegers.
|
||||||
|
|
||||||
|
«x» when x ∈ SignedInteger = [0xB0] ++ varint(m) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m>16
|
||||||
|
([0xA0] + m - 1) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m≤16
|
||||||
|
([0xA0] + x) if (-3≤x≤-1)
|
||||||
|
([0x90] + x) if ( 0≤x≤12)
|
||||||
|
where m = |intbytes(x)|
|
||||||
|
|
||||||
|
Integers in the range [-3,12] are compactly represented with tags
|
||||||
|
between `0x90` and `0x9F` because they are so frequently used.
|
||||||
|
Integers up to 16 bytes long are represented with a single-byte tag
|
||||||
|
encoding the length of the integer. Larger integers are represented
|
||||||
|
with an explicit varint length. Every `SignedInteger` *MUST* be
|
||||||
|
represented with its shortest possible encoding.
|
||||||
|
|
||||||
|
The function `intbytes(x)` gives the big-endian two's-complement
|
||||||
|
binary representation of `x`, taking exactly as many whole bytes as
|
||||||
|
needed to unambiguously identify the value and its sign, and `m =
|
||||||
|
|intbytes(x)|`. The most-significant bit in the first byte in
|
||||||
|
`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes] For
|
||||||
|
example,
|
||||||
|
|
||||||
|
«87112285931760246646623899502532662132736»
|
||||||
|
= B0 12 01 00 00 00 00 00 00 00
|
||||||
|
00 00 00 00 00 00 00 00
|
||||||
|
00 00
|
||||||
|
|
||||||
|
«-257» = A1 FE FF «-3» = 9D «128» = A1 00 80
|
||||||
|
«-256» = A1 FF 00 «-2» = 9E «255» = A1 00 FF
|
||||||
|
«-255» = A1 FF 01 «-1» = 9F «256» = A1 01 00
|
||||||
|
«-254» = A1 FF 02 «0» = 90 «32767» = A1 7F FF
|
||||||
|
«-129» = A1 FF 7F «1» = 91 «32768» = A2 00 80 00
|
||||||
|
«-128» = A0 80 «12» = 9C «65535» = A2 00 FF FF
|
||||||
|
«-127» = A0 81 «13» = A0 0D «65536» = A2 01 00 00
|
||||||
|
«-4» = A0 FC «127» = A0 7F «131072» = A2 02 00 00
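
The examples above can be reproduced with a short Python sketch. It relies on Python's arbitrary-precision `int`; the names `intbytes` and `encode_signed_integer` mirror the notation of this section but are otherwise illustrative assumptions, and `_varint` is a local helper so that the block stands alone.

```python
def _varint(m):
    """Unsigned base-128 varint, least significant group first."""
    out = bytearray()
    while True:
        low7 = m & 0x7F
        m >>= 7
        if m:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def intbytes(x):
    """Shortest big-endian two's-complement form of x; empty for zero."""
    if x == 0:
        return b""
    # Magnitude bits, plus one sign bit, rounded up to whole bytes.
    magnitude_bits = (x if x >= 0 else ~x).bit_length()
    return x.to_bytes(magnitude_bits // 8 + 1, "big", signed=True)


def encode_signed_integer(x):
    if -3 <= x <= -1:
        return bytes([0xA0 + x])          # tags 0x9D..0x9F
    if 0 <= x <= 12:
        return bytes([0x90 + x])          # tags 0x90..0x9C
    body = intbytes(x)
    m = len(body)
    if m <= 16:
        return bytes([0xA0 + m - 1]) + body
    return b"\xB0" + _varint(m) + body    # explicit varint length
```

For negative `x`, the bit length of `~x` is used because Python's `int.bit_length()` measures the absolute value, which would overestimate the size of values such as -128.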
|
||||||
|
|
||||||
|
[^zero-intbytes]: The value 0 needs zero bytes to identify the
|
||||||
|
value, so `intbytes(0)` is the empty byte string. Non-zero values
|
||||||
|
need at least one byte.
|
||||||
|
|
||||||
|
### Strings, ByteStrings and Symbols.
|
||||||
|
|
||||||
|
Syntax for these three types varies only in the tag used. For `String`
|
||||||
|
and `Symbol`, the data following the tag is a UTF-8 encoding of the
|
||||||
|
`Value`'s code points, while for `ByteString` it is the raw data
|
||||||
|
contained within the `Value` unmodified.
|
||||||
|
|
||||||
|
«S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ String
|
||||||
|
[0xB2] ++ varint(|S|) ++ S if S ∈ ByteString
|
||||||
|
[0xB3] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ Symbol
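
A minimal Python sketch of the three length-prefixed encodings follows. The function names are illustrative assumptions; Python has no built-in symbol type, so `encode_symbol` simply takes the symbol's name as a `str`.

```python
def _varint(m):
    """Unsigned base-128 varint, least significant group first."""
    out = bytearray()
    while True:
        low7 = m & 0x7F
        m >>= 7
        if m:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def _length_prefixed(tag, payload):
    # tag ++ varint(length in bytes) ++ payload
    return bytes([tag]) + _varint(len(payload)) + payload


def encode_string(s):
    return _length_prefixed(0xB1, s.encode("utf-8"))


def encode_bytestring(b):
    return _length_prefixed(0xB2, bytes(b))


def encode_symbol(name):
    return _length_prefixed(0xB3, name.encode("utf-8"))
```

Note that the length counts UTF-8 bytes, not code points, so `encode_string("hé")` carries a length of 3.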
|
||||||
|
|
||||||
|
### Booleans.
|
||||||
|
|
||||||
|
«#f» = [0x80]
|
||||||
|
«#t» = [0x81]
|
||||||
|
|
||||||
|
### Floats and Doubles.
|
||||||
|
|
||||||
|
«F» when F ∈ Float = [0x82] ++ binary32(F)
|
||||||
|
«D» when D ∈ Double = [0x83] ++ binary64(D)
|
||||||
|
|
||||||
|
The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
|
||||||
|
8-byte IEEE 754 binary representations of `F` and `D`, respectively.
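
In Python, `binary32` and `binary64` correspond to the standard `struct` format codes `>f` and `>d`. The sketch below is illustrative only; the function names are assumptions, and packing a Python `float` with `>f` rounds it to single precision.

```python
import struct


def encode_float(f):
    """Tag 0x82 followed by the big-endian IEEE 754 binary32 form."""
    return b"\x82" + struct.pack(">f", f)


def encode_double(d):
    """Tag 0x83 followed by the big-endian IEEE 754 binary64 form."""
    return b"\x83" + struct.pack(">d", d)
```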
|
||||||
|
|
||||||
|
### Embeddeds.
|
||||||
|
|
||||||
|
The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to
|
||||||
|
represent the denoted object, prefixed with `[0x86]`.
|
||||||
|
|
||||||
|
«#!V» = [0x86] ++ «V»
|
||||||
|
|
||||||
|
### Annotations.
|
||||||
|
|
||||||
|
To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
|
||||||
|
`[0x85] ++ «v»`. For example, the `Repr` corresponding to textual
|
||||||
|
syntax `@a@b[]`, i.e. an empty sequence annotated with two symbols,
|
||||||
|
`a` and `b`, is
|
||||||
|
|
||||||
|
«@a @b []»
|
||||||
|
= [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
|
||||||
|
= [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]
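
The same construction can be written as a small Python helper. This is a sketch under the assumption that annotations and the annotated value are already available as `Repr` byte strings; the name `annotate` is hypothetical.

```python
def annotate(value_repr, annotation_reprs):
    """Prefix value_repr with [0x85] ++ annotation for each annotation, in order."""
    prefix = b"".join(b"\x85" + a for a in annotation_reprs)
    return prefix + value_repr
```

Calling `annotate(b"\xB5\x84", [b"\xB3\x01a", b"\xB3\x01b"])` reproduces the ten bytes shown above for `@a @b []`.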
|
||||||
|
|
||||||
|
## Security Considerations
|
||||||
|
|
||||||
|
**Annotations.** In modes where a `Value` is being read while
|
||||||
|
annotations are skipped, an endless sequence of annotations may give an
|
||||||
|
illusion of progress.
|
||||||
|
|
||||||
|
**Canonical form for cryptographic hashing and signing.** No canonical
|
||||||
|
textual encoding of a `Value` is specified. A
|
||||||
|
[canonical form][canonical] exists for binary encoded `Value`s, and
|
||||||
|
implementations *SHOULD* produce canonical binary encodings by
|
||||||
|
default; however, an implementation *MAY* permit two serializations of
|
||||||
|
the same `Value` to yield different binary `Repr`s.
|
||||||
|
|
||||||
|
## Appendix. Autodetection of textual or binary syntax
|
||||||
|
|
||||||
|
Every tag byte in a binary Preserves `Document` falls within the range
|
||||||
|
[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
|
||||||
|
bytes*, and will never occur as the first byte of a UTF-8 encoded code
|
||||||
|
point. This means no binary-encoded document can be misinterpreted as
|
||||||
|
valid UTF-8.
|
||||||
|
|
||||||
|
Conversely, a UTF-8 document must start with a valid codepoint,
|
||||||
|
meaning in particular that it must not start with a byte in the range
|
||||||
|
[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
|
||||||
|
Preserves document can be misinterpreted as a binary-syntax document.
|
||||||
|
|
||||||
|
Examination of the top two bits of the first byte of a document gives
|
||||||
|
its syntax: if the top two bits are `10`, it should be interpreted as
|
||||||
|
a binary-syntax document; otherwise, it should be interpreted as text.
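
The top-two-bits test can be sketched in Python as below. The function name and the choice to reject an empty input are assumptions of this example, since the text above does not say how an empty document should be treated.

```python
def detect_syntax(document):
    """Return "binary" or "text" from the first byte of a bytes object."""
    if not document:
        raise ValueError("cannot detect the syntax of an empty document")
    # 0b10xxxxxx is a UTF-8 continuation byte, so it can only begin binary syntax.
    return "binary" if (document[0] & 0xC0) == 0x80 else "text"
```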
|
||||||
|
|
||||||
|
## Appendix. Table of tag values
|
||||||
|
|
||||||
|
80 - False
|
||||||
|
81 - True
|
||||||
|
82 - Float
|
||||||
|
83 - Double
|
||||||
|
84 - End marker
|
||||||
|
85 - Annotation
|
||||||
|
86 - Embedded
|
||||||
|
(8x) RESERVED 87-8F
|
||||||
|
|
||||||
|
9x - Small integers 0..12,-3..-1
|
||||||
|
An - Medium integers, (n+1) bytes long
|
||||||
|
B0 - Large integers, variable length
|
||||||
|
B1 - String
|
||||||
|
B2 - ByteString
|
||||||
|
B3 - Symbol
|
||||||
|
|
||||||
|
B4 - Record
|
||||||
|
B5 - Sequence
|
||||||
|
B6 - Set
|
||||||
|
B7 - Dictionary
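
A decoder typically begins by dispatching on the tag byte. The Python sketch below classifies a tag according to the table above; the names `tag_kind` and `small_integer_value` and the returned labels are assumptions made for illustration.

```python
_FIXED_TAGS = {
    0x80: "False", 0x81: "True", 0x82: "Float", 0x83: "Double",
    0x84: "End", 0x85: "Annotation", 0x86: "Embedded",
    0xB0: "LargeInteger", 0xB1: "String", 0xB2: "ByteString",
    0xB3: "Symbol", 0xB4: "Record", 0xB5: "Sequence",
    0xB6: "Set", 0xB7: "Dictionary",
}


def tag_kind(tag):
    """Classify a tag byte; reserved or unassigned values raise ValueError."""
    if tag in _FIXED_TAGS:
        return _FIXED_TAGS[tag]
    if 0x90 <= tag <= 0x9F:
        return "SmallInteger"
    if 0xA0 <= tag <= 0xAF:
        return "MediumInteger"   # (n + 1) bytes of integer data follow
    raise ValueError("reserved or invalid tag: 0x%02X" % tag)


def small_integer_value(tag):
    """Map 0x90..0x9C to 0..12 and 0x9D..0x9F to -3..-1."""
    if not 0x90 <= tag <= 0x9F:
        raise ValueError("not a small-integer tag")
    return tag - 0x90 if tag <= 0x9C else tag - 0xA0
```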
|
||||||
|
|
||||||
|
## Appendix. Binary SignedInteger representation
|
||||||
|
|
||||||
|
Languages that provide fixed-width machine word types may find the
|
||||||
|
following table useful in encoding and decoding binary `SignedInteger`
|
||||||
|
values.
|
||||||
|
|
||||||
|
| Integer range | Bytes required | Encoding (hex) |
|
||||||
|
| --- | --- | --- |
|
||||||
|
| -3 ≤ n ≤ 12 | 1 | `9X` |
|
||||||
|
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `A0` `XX` |
|
||||||
|
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `A1` `XX` `XX` |
|
||||||
|
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `A2` `XX` `XX` `XX` |
|
||||||
|
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `A3` `XX` `XX` `XX` `XX` |
|
||||||
|
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `A4` `XX` `XX` `XX` `XX` `XX` |
|
||||||
|
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `A5` `XX` `XX` `XX` `XX` `XX` `XX` |
|
||||||
|
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `A6` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
|
||||||
|
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `A7` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
|
||||||
|
|
||||||
|
<!-- Heading to visually offset the footnotes from the main document: -->
|
||||||
|
## Notes
|
|
@ -0,0 +1,302 @@
|
||||||
|
---
|
||||||
|
no_site_title: true
|
||||||
|
title: "Preserves: Text Syntax"
|
||||||
|
---
|
||||||
|
|
||||||
|
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
|
||||||
|
{{ site.version_date }}. Version {{ site.version }}.
|
||||||
|
|
||||||
|
[sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
|
||||||
|
[spki]: http://world.std.com/~cme/html/spki.html
|
||||||
|
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
|
||||||
|
[LEB128]: https://en.wikipedia.org/wiki/LEB128
|
||||||
|
[erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
|
||||||
|
[abnf]: https://tools.ietf.org/html/rfc7405
|
||||||
|
[canonical]: canonical-binary.html
|
||||||
|
|
||||||
|
*Preserves* is a data model, with associated serialization formats. This
|
||||||
|
document defines one of those formats: a textual syntax for `Value`s
|
||||||
|
from the [Preserves data model](preserves.html) that is easy for people
|
||||||
|
to read and write. An [equivalent machine-oriented binary
|
||||||
|
syntax](preserves-binary.html) also exists.
|
||||||
|
|
||||||
|
## Preliminaries
|
||||||
|
|
||||||
|
The definition uses [case-sensitive ABNF][abnf].
|
||||||
|
|
||||||
|
ABNF allows easy definition of US-ASCII-based languages. However,
|
||||||
|
Preserves is a Unicode-based language. Therefore, we reinterpret ABNF as
|
||||||
|
a grammar for recognising sequences of Unicode code points.
|
||||||
|
|
||||||
|
**Encoding.** Textual syntax for a `Value` *SHOULD* be encoded using
|
||||||
|
UTF-8 where possible.
|
||||||
|
|
||||||
|
**Whitespace.** Whitespace is defined as any number of spaces, tabs,
|
||||||
|
carriage returns, line feeds, or commas.
|
||||||
|
|
||||||
|
ws = *(%x20 / %x09 / newline / ",")
|
||||||
|
newline = CR / LF
|
||||||
|
|
||||||
|
## Grammar
|
||||||
|
|
||||||
|
Standalone documents may have trailing whitespace.
|
||||||
|
|
||||||
|
Document = Value ws
|
||||||
|
|
||||||
|
Any `Value` may be preceded by whitespace.
|
||||||
|
|
||||||
|
Value = ws (Record / Collection / Atom / Embedded / Machine)
|
||||||
|
Collection = Sequence / Dictionary / Set
|
||||||
|
Atom = Boolean / Float / Double / SignedInteger /
|
||||||
|
String / ByteString / Symbol
|
||||||
|
|
||||||
|
Each `Record` is an angle-bracket enclosed grouping of its
|
||||||
|
label-`Value` followed by its field-`Value`s.
|
||||||
|
|
||||||
|
Record = "<" Value *Value ws ">"
|
||||||
|
|
||||||
|
`Sequence`s are enclosed in square brackets. `Dictionary` values are
|
||||||
|
curly-brace-enclosed colon-separated pairs of values. `Set`s are
|
||||||
|
written as values enclosed by the tokens `#{` and
|
||||||
|
`}`.[^printing-collections] It is an error for a set to contain
|
||||||
|
duplicate elements or for a dictionary to contain duplicate keys.
|
||||||
|
|
||||||
|
Sequence = "[" *Value ws "]"
|
||||||
|
Dictionary = "{" *(Value ws ":" Value) ws "}"
|
||||||
|
Set = "#{" *Value ws "}"
|
||||||
|
|
||||||
|
[^printing-collections]: **Implementation note.** When implementing
|
||||||
|
printing of `Value`s using the textual syntax, consider supporting
|
||||||
|
(a) optional pretty-printing with indentation, (b) optional
|
||||||
|
JSON-compatible print mode for that subset of `Value` that is
|
||||||
|
compatible with JSON, and (c) optional submodes for no commas,
|
||||||
|
commas separating, and commas terminating elements or key/value
|
||||||
|
pairs within a collection.
|
||||||
|
|
||||||
|
`Boolean`s are the simple literal strings `#t` and `#f` for true and
|
||||||
|
false, respectively.
|
||||||
|
|
||||||
|
Boolean = %s"#t" / %s"#f"
|
||||||
|
|
||||||
|
Numeric data follow the
|
||||||
|
[JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with
|
||||||
|
the addition of a trailing “f” distinguishing `Float` from `Double`
|
||||||
|
values. `Float`s and `Double`s always have either a fractional part or
|
||||||
|
an exponent part, whereas `SignedInteger`s never have
|
||||||
|
either.[^reading-and-writing-floats-accurately]
|
||||||
|
[^arbitrary-precision-signedinteger]
|
||||||
|
|
||||||
|
Float = flt %i"f"
|
||||||
|
Double = flt
|
||||||
|
SignedInteger = int
|
||||||
|
|
||||||
|
digit1-9 = %x31-39
|
||||||
|
nat = %x30 / ( digit1-9 *DIGIT )
|
||||||
|
int = ["-"] nat
|
||||||
|
frac = "." 1*DIGIT
|
||||||
|
exp = %i"e" ["-"/"+"] 1*DIGIT
|
||||||
|
flt = int (frac exp / frac / exp)
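
The numeric productions translate directly into regular expressions. The Python sketch below classifies a complete token as `Float`, `Double`, or `SignedInteger`; the function name `classify_number` is hypothetical, and the sketch assumes the token has already been separated from surrounding text.

```python
import re

_INT = r"-?(?:0|[1-9][0-9]*)"              # int = ["-"] nat
_FRAC = r"\.[0-9]+"                        # frac = "." 1*DIGIT
_EXP = r"[eE][-+]?[0-9]+"                  # exp = %i"e" ["-"/"+"] 1*DIGIT
_FLT = _INT + r"(?:" + _FRAC + r"(?:" + _EXP + r")?|" + _EXP + r")"


def classify_number(token):
    """Return the numeric class of token, or None if it is not a number."""
    if re.fullmatch(_FLT + r"[fF]", token):
        return "Float"
    if re.fullmatch(_FLT, token):
        return "Double"
    if re.fullmatch(_INT, token):
        return "SignedInteger"
    return None
```

Because `flt` requires a fractional part or an exponent, a token such as `1f` is not a `Float`, and `01` is rejected by the `nat` rule.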
|
||||||
|
|
||||||
|
[^reading-and-writing-floats-accurately]: **Implementation note.**
|
||||||
|
Your language's standard library likely has a good routine for
|
||||||
|
converting between decimal notation and IEEE 754 floating-point.
|
||||||
|
However, if not, or if you are interested in the challenges of
|
||||||
|
accurately reading and writing floating point numbers, see the
|
||||||
|
excellent matched pair of 1990 papers by Clinger and Steele &
|
||||||
|
White, and a recent follow-up by Jaffer:
|
||||||
|
|
||||||
|
Clinger, William D. ‘How to Read Floating Point Numbers
|
||||||
|
Accurately’. In Proc. PLDI. White Plains, New York, 1990.
|
||||||
|
<https://doi.org/10.1145/93542.93557>.
|
||||||
|
|
||||||
|
Steele, Guy L., Jr., and Jon L. White. ‘How to Print
|
||||||
|
Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains,
|
||||||
|
New York, 1990. <https://doi.org/10.1145/93542.93559>.
|
||||||
|
|
||||||
|
Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of
|
||||||
|
Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013.
|
||||||
|
<http://arxiv.org/abs/1310.8121>.
|
||||||
|
|
||||||
|
[^arbitrary-precision-signedinteger]: **Implementation note.** Be
|
||||||
|
aware when implementing reading and writing of `SignedInteger`s
|
||||||
|
that the data model *requires* arbitrary-precision integers. Your
|
||||||
|
implementation may (but, ideally, should not) truncate precision
|
||||||
|
when reading or writing a `SignedInteger`; however, if it does so,
|
||||||
|
it should (a) signal its client that truncation has occurred, and
|
||||||
|
(b) make it clear to the client that comparing such truncated
|
||||||
|
values for equality or ordering will not yield results that match
|
||||||
|
the expected semantics of the data model.
|
||||||
|
|
||||||
|
`String`s are,
|
||||||
|
[as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly
|
||||||
|
escaped text surrounded by double quotes. The escaping rules are the
|
||||||
|
same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]
|
||||||
|
|
||||||
|
String = %x22 *char %x22
|
||||||
|
char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
|
||||||
|
unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
|
||||||
|
escape = %x5C ; \
|
||||||
|
escaped = ( %x5C / ; \ reverse solidus U+005C
|
||||||
|
%x2F / ; / solidus U+002F
|
||||||
|
%x62 / ; b backspace U+0008
|
||||||
|
%x66 / ; f form feed U+000C
|
||||||
|
%x6E / ; n line feed U+000A
|
||||||
|
%x72 / ; r carriage return U+000D
|
||||||
|
%x74 ) ; t tab U+0009
|
||||||
|
|
||||||
|
[^string-json-correspondence]: The grammar for `String` has the same
|
||||||
|
effect as the
|
||||||
|
[JSON](https://tools.ietf.org/html/rfc8259#section-7) grammar for
|
||||||
|
`string`. Some auxiliary definitions (e.g. `escaped`) are lifted
|
||||||
|
largely unmodified from the text of RFC 8259.
|
||||||
|
|
||||||
|
[^escaping-surrogate-pairs]: In particular, note JSON's rules around
|
||||||
|
the use of surrogate pairs for code points not in the Basic
|
||||||
|
Multilingual Plane. We encourage implementations to avoid using
|
||||||
|
`\u` escapes when producing output, and instead to rely on the
|
||||||
|
UTF-8 encoding of the entire document to handle non-ASCII
|
||||||
|
codepoints correctly.
|
||||||
|
|
||||||
|
A `ByteString` may be written in any of three different forms.
|
||||||
|
|
||||||
|
The first is similar to a `String`, but prepended with a hash sign
|
||||||
|
`#`. In addition, only Unicode code points overlapping with printable
|
||||||
|
7-bit ASCII are permitted unescaped inside such a `ByteString`; other
|
||||||
|
byte values must be escaped by prepending a two-digit hexadecimal
|
||||||
|
value with `\x`.
|
||||||
|
|
||||||
|
ByteString = "#" %x22 *binchar %x22
|
||||||
|
binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
|
||||||
|
binunescaped = %x20-21 / %x23-5B / %x5D-7E
|
||||||
|
|
||||||
|
The second is as a sequence of pairs of hexadecimal digits interleaved
|
||||||
|
with whitespace and surrounded by `#x"` and `"`.
|
||||||
|
|
||||||
|
ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22
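
A reader for the hexadecimal form might look like the following Python sketch. The function name is hypothetical, and the input is assumed to be exactly one `#x"..."` token. Per the grammar, the two digits of a pair must be adjacent, while whitespace (including commas) may appear between pairs.

```python
import re

# Body: any mix of whitespace characters and adjacent hex-digit pairs.
_HEX_BODY = re.compile(r"(?:[ \t\r\n,]|[0-9A-Fa-f]{2})*")


def parse_hex_bytestring(text):
    if len(text) < 4 or not text.startswith('#x"') or not text.endswith('"'):
        raise ValueError("not a hexadecimal ByteString")
    body = text[3:-1]
    if _HEX_BODY.fullmatch(body) is None:
        raise ValueError("malformed hexadecimal ByteString body")
    pairs = re.findall(r"[0-9A-Fa-f]{2}", body)
    return bytes.fromhex("".join(pairs))
```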
|
||||||
|
|
||||||
|
The third is as a sequence of
|
||||||
|
[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
|
||||||
|
with whitespace and surrounded by `#[` and `]`. Plain and URL-safe
|
||||||
|
Base64 characters are allowed.
|
||||||
|
|
||||||
|
ByteString =/ "#[" *(ws / base64char) ws "]"
|
||||||
|
base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
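
The Base64 form can be read with Python's standard `base64` module, as sketched below. The function name is hypothetical. Two lenient choices here are assumptions of this example rather than requirements of the grammar: every `=` is discarded wherever it appears, and padding is recomputed before decoding.

```python
import base64
import re

# Base64 characters (plain and URL-safe), "=", and Preserves whitespace.
_B64_BODY = re.compile(r"[A-Za-z0-9+/\-_= \t\r\n,]*")


def parse_base64_bytestring(text):
    if len(text) < 3 or not text.startswith("#[") or not text.endswith("]"):
        raise ValueError("not a Base64 ByteString")
    body = text[2:-1]
    if _B64_BODY.fullmatch(body) is None:
        raise ValueError("unexpected character in Base64 ByteString")
    digits = re.sub(r"[ \t\r\n,=]", "", body)
    # Map the URL-safe alphabet onto the plain one, then restore padding.
    digits = digits.replace("-", "+").replace("_", "/")
    digits += "=" * (-len(digits) % 4)
    return base64.b64decode(digits, validate=True)
```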
|
||||||
|
|
||||||
|
A `Symbol` may be written in a “bare” form[^cf-sexp-token] so long as
|
||||||
|
it conforms to certain restrictions on the characters appearing in the
|
||||||
|
symbol. Alternatively, it may be written in a quoted form. The quoted
|
||||||
|
form is much the same as the syntax for `String`s, including embedded
|
||||||
|
escape syntax, except using a bar or pipe character (`|`) instead of a
|
||||||
|
double quote mark.
|
||||||
|
|
||||||
|
Symbol = symstart *symcont / "|" *symchar "|"
|
||||||
|
symstart = ALPHA / sympunct / symustart
|
||||||
|
symcont = ALPHA / sympunct / symustart / symucont / DIGIT / "-"
|
||||||
|
sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
|
||||||
|
"?" / "_" / "=" / "+" / "/" / "."
|
||||||
|
symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
|
||||||
|
symustart = <any code point greater than 127 whose Unicode
|
||||||
|
category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me,
|
||||||
|
Pc, Po, Sc, Sm, Sk, So, or Co>
|
||||||
|
symucont = <any code point greater than 127 whose Unicode
|
||||||
|
category is Nd, Nl, No, or Pd>
|
||||||
|
|
||||||
|
[^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt]
|
||||||
|
definition of “token representation”, and with the
|
||||||
|
[R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4).
|
||||||
|
|
||||||
|
An `Embedded` is written as a `Value` chosen to represent the denoted
|
||||||
|
object, prefixed with `#!`.
|
||||||
|
|
||||||
|
Embedded = "#!" Value
|
||||||
|
|
||||||
|
Finally, any `Value` may be represented by escaping from the textual
|
||||||
|
syntax to the [machine-oriented binary syntax](preserves-binary.html)
|
||||||
|
by prefixing a `ByteString` containing the binary representation of the
|
||||||
|
`Value` with `#=`.[^rationale-switch-to-binary]
|
||||||
|
[^no-literal-binary-in-text] [^machine-value-annotations]
|
||||||
|
|
||||||
|
Machine = "#=" ws ByteString
|
||||||
|
|
||||||
|
[^rationale-switch-to-binary]: **Rationale.** The textual syntax
|
||||||
|
cannot express every `Value`: specifically, it cannot express the
|
||||||
|
several million floating-point NaNs, or the two floating-point
|
||||||
|
Infinities. Since the machine-oriented binary format for `Value`s
|
||||||
|
expresses each `Value` with precision, embedding binary `Value`s
|
||||||
|
solves the problem.
|
||||||
|
|
||||||
|
[^no-literal-binary-in-text]: Every text is ultimately physically
|
||||||
|
stored as bytes; therefore, it might seem possible to escape to the
|
||||||
|
raw form of binary encoding from within a piece of textual syntax.
|
||||||
|
However, while bytes must be involved in any *representation* of
|
||||||
|
text, the text *itself* is logically a sequence of *code points* and
|
||||||
|
is not *intrinsically* a binary structure at all. It would be
|
||||||
|
incoherent to expect to be able to access the representation of the
|
||||||
|
text from within the text itself.
|
||||||
|
|
||||||
|
[^machine-value-annotations]: Any text-syntax annotations preceding
|
||||||
|
the `#=` are prepended to any binary-syntax annotations yielded by
|
||||||
|
decoding the `ByteString`.
|
||||||
|
|
||||||
|
## Annotations
|
||||||
|
|
||||||
|
When written down, a `Value` may have an associated sequence of
|
||||||
|
*annotations* carrying “out-of-band” contextual metadata about the
|
||||||
|
value. Each annotation is, in turn, a `Value`, and may itself have
|
||||||
|
annotations. The ordering of annotations attached to a `Value` is
|
||||||
|
significant.
|
||||||
|
|
||||||
|
Value =/ ws "@" Value Value
|
||||||
|
|
||||||
|
Each annotation is preceded by `@`; the underlying annotated value
|
||||||
|
follows its annotations. Here we extend only the syntactic nonterminal
|
||||||
|
named “`Value`” without altering the semantic class of `Value`s.
|
||||||
|
|
||||||
|
**Comments.** Strings annotating a `Value` are conventionally
|
||||||
|
interpreted as comments associated with that value. Comments are
|
||||||
|
sufficiently common that special syntax exists for them.
|
||||||
|
|
||||||
|
Value =/ ws
|
||||||
|
";" *(%x00-09 / %x0B-0C / %x0E-10FFFF) newline
|
||||||
|
Value
|
||||||
|
|
||||||
|
When written this way, everything between the `;` and the newline is
|
||||||
|
included in the string annotating the `Value`.
|
||||||
|
|
||||||
|
**Equivalence.** Annotations appear within syntax denoting a `Value`;
|
||||||
|
however, the annotations are not part of the denoted value. They are
|
||||||
|
only part of the syntax. Annotations do not play a part in
|
||||||
|
equivalences and orderings of `Value`s.
|
||||||
|
|
||||||
|
Reflective tools such as debuggers, user interfaces, and message
|
||||||
|
routers and relays---tools which process `Value`s generically---may
|
||||||
|
use annotated inputs to tailor their operation, or may insert
|
||||||
|
annotations in their outputs. By contrast, in ordinary programs, as a
|
||||||
|
rule of thumb, the presence, absence or content of an annotation
|
||||||
|
should not change the control flow or output of the program.
|
||||||
|
Annotations are data *describing* `Value`s, and are not in the domain
|
||||||
|
of any specific application of `Value`s. That is, an annotation will
|
||||||
|
almost never cause a non-reflective program to do anything observably
|
||||||
|
different.
|
||||||
|
|
||||||
|
## Security Considerations
|
||||||
|
|
||||||
|
**Whitespace.** The textual format allows arbitrary whitespace in many
|
||||||
|
positions. Consider optional restrictions on the amount of consecutive
|
||||||
|
whitespace that may appear.
|
||||||
|
|
||||||
|
**Annotations.** Similarly, in modes where a `Value` is being read
|
||||||
|
while annotations are skipped, an endless sequence of annotations may
|
||||||
|
give an illusion of progress.
|
||||||
|
|
||||||
|
## Acknowledgements
|
||||||
|
|
||||||
|
The treatment of commas as whitespace in the text syntax is inspired
|
||||||
|
by the same feature of [EDN](https://github.com/edn-format/edn).
|
||||||
|
|
||||||
|
The text syntax for `Boolean`s, `Symbol`s, and `ByteString`s is
|
||||||
|
directly inspired by [Racket](https://racket-lang.org/)'s lexical
|
||||||
|
syntax.
|
||||||
|
|
||||||
|
<!-- Heading to visually offset the footnotes from the main document: -->
|
||||||
|
## Notes
|
619
preserves.md
|
@ -4,7 +4,7 @@ title: "Preserves: an Expressive Data Language"
|
||||||
---
|
---
|
||||||
|
|
||||||
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
|
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
|
||||||
January 2022. Version 0.6.2.
|
{{ site.version_date }}. Version {{ site.version }}.
|
||||||
|
|
||||||
[sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
|
[sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
|
||||||
[spki]: http://world.std.com/~cme/html/spki.html
|
[spki]: http://world.std.com/~cme/html/spki.html
|
||||||
|
@ -14,29 +14,35 @@ January 2022. Version 0.6.2.
|
||||||
[abnf]: https://tools.ietf.org/html/rfc7405
|
[abnf]: https://tools.ietf.org/html/rfc7405
|
||||||
[canonical]: canonical-binary.html
|
[canonical]: canonical-binary.html
|
||||||
|
|
||||||
This document proposes a data model and serialization format called
|
*Preserves* is a data model, with associated serialization formats.
|
||||||
*Preserves*.
|
|
||||||
|
|
||||||
Preserves supports *records* with user-defined *labels*, embedded
|
It supports *records* with user-defined *labels*, embedded *references*,
|
||||||
*references*, and the usual suite of atomic and compound data types,
|
and the usual suite of atomic and compound data types, including
|
||||||
including *binary* data as a distinct type from text strings. Its
|
*binary* data as a distinct type from text strings. Its *annotations*
|
||||||
*annotations* allow separation of data from metadata such as
|
allow separation of data from metadata such as
|
||||||
[comments](conventions.html#comments), trace information, and
|
[comments](conventions.html#comments), trace information, and provenance
|
||||||
provenance information.
|
information.
|
||||||
|
|
||||||
Preserves departs from many other data languages in defining how to
|
Preserves departs from many other data languages in defining how to
|
||||||
*compare* two values. Comparison is based on the data model, not on
|
*compare* two values. Comparison is based on the data model, not on
|
||||||
syntax or on data structures of any particular implementation
|
syntax or on data structures of any particular implementation
|
||||||
language.
|
language.
|
||||||
|
|
||||||
## Starting with Semantics
|
This document defines the core semantics and data model of Preserves and
|
||||||
|
presents a handful of examples. Two other core documents define
|
||||||
|
|
||||||
Taking inspiration from functional programming, we start with a
|
- a [human-readable text syntax](preserves-text.html), and
|
||||||
definition of the *values* that we want to work with and give them
|
- a [machine-oriented binary syntax](preserves-binary.html)
|
||||||
meaning independent of their syntax.
|
|
||||||
|
|
||||||
<a id="values"></a>
|
for the Preserves data model.
|
||||||
Our `Value`s fall into two broad categories: *atomic* and *compound*
|
|
||||||
|
## <a id="semantics"></a><a id="starting-with-semantics"></a>Values
|
||||||
|
|
||||||
|
Preserves *values* are given meaning independent of their syntax. We
|
||||||
|
will write "`Value`" when we mean the set of all Preserves values or an
|
||||||
|
element of that set.
|
||||||
|
|
||||||
|
`Value`s fall into two broad categories: *atomic* and *compound*
|
||||||
data. Every `Value` is finite and non-cyclic. Embedded values, called
|
data. Every `Value` is finite and non-cyclic. Embedded values, called
|
||||||
`Embedded`s, are a third, special-case category.
|
`Embedded`s, are a third, special-case category.
|
||||||
|
|
||||||
|
@ -76,20 +82,23 @@ neither is less than the other according to the total order.
|
||||||
|
|
||||||
### Signed integers.
|
### Signed integers.
|
||||||
|
|
||||||
A `SignedInteger` is a signed integer of arbitrary width.
|
A `SignedInteger` is an arbitrarily-large signed integer.
|
||||||
`SignedInteger`s are compared as mathematical integers.
|
`SignedInteger`s are compared as mathematical integers.
|
||||||
|
|
||||||
### Unicode strings.
|
### Unicode strings.
|
||||||
|
|
||||||
A `String` is a sequence of Unicode
|
A `String` is a sequence of Unicode
|
||||||
[code-point](http://www.unicode.org/glossary/#code_point)s. `String`s
|
[code-point](http://www.unicode.org/glossary/#code_point)s.[^nul-permitted]
|
||||||
are compared lexicographically, code-point by
|
`String`s are compared lexicographically, code-point by
|
||||||
code-point.[^utf8-is-awesome]
|
code-point.[^utf8-is-awesome]
|
||||||
|
|
||||||
[^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
|
[^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
|
||||||
gives the same result as a lexicographic byte-by-byte comparison
|
gives the same result as a lexicographic byte-by-byte comparison
|
||||||
of the UTF-8 encoding of a string!
|
of the UTF-8 encoding of a string!
|
||||||
|
|
||||||
|
[^nul-permitted]: All Unicode code-points are permitted, including NUL
|
||||||
|
(code point zero).
|
||||||
|
|
||||||
### Binary data.
|
### Binary data.
|
||||||
|
|
||||||
A `ByteString` is a sequence of octets. `ByteString`s are compared
|
A `ByteString` is a sequence of octets. `ByteString`s are compared
|
||||||
|
@ -111,11 +120,11 @@ less-than the “true” value.
|
||||||
|
|
||||||
`Float`s and `Double`s are single- and double-precision IEEE 754
|
`Float`s and `Double`s are single- and double-precision IEEE 754
|
||||||
floating-point values, respectively. `Float`s, `Double`s and
|
floating-point values, respectively. `Float`s, `Double`s and
|
||||||
`SignedInteger`s are disjoint; by the rules [above](#total-order),
|
`SignedInteger`s are disjoint; by the rules [above](#total-order), every
|
||||||
every `Float` is less than every `Double`, and every `SignedInteger`
|
`Float` is less than every `Double`, and every `SignedInteger` is
|
||||||
is greater than both. Two `Float`s or two `Double`s are to be ordered
|
greater than both. Two `Float`s or two `Double`s are to be ordered by
|
||||||
by the `totalOrder` predicate defined in section 5.10 of
|
the `totalOrder` predicate defined in section 5.10 of [IEEE Std
|
||||||
[IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).
|
754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).
|
||||||
|
|
||||||
### Records.
|
### Records.
|
||||||
|
|
||||||
|
@ -200,457 +209,13 @@ URL, compared according to
|
||||||
usually be represented as ordinary `Value`s, in which case the
|
usually be represented as ordinary `Value`s, in which case the
|
||||||
ordinary rules for comparing `Value`s will apply.
|
ordinary rules for comparing `Value`s will apply.
|
||||||
|
|
||||||
## Textual Syntax
|
|
||||||
|
|
||||||
Now that we have discussed `Value`s and their meanings, we may turn to
|
|
||||||
techniques for *representing* `Value`s for communication or storage.
|
|
||||||
|
|
||||||
In this section, we use [case-sensitive ABNF][abnf] to define a
|
|
||||||
textual syntax that is easy for people to read and
|
|
||||||
write.[^json-superset] Most of the examples in this document are
|
|
||||||
written using this syntax. In the following section, we will define an
|
|
||||||
equivalent compact machine-readable syntax.
|
|
||||||
|
|
||||||
[^json-superset]: The grammar of the textual syntax is a superset of
|
|
||||||
JSON, with the slightly unusual feature that `true`, `false`, and
|
|
||||||
`null` are all read as `Symbol`s, and that `SignedInteger`s are
|
|
||||||
never read as `Double`s.
|
|
||||||
|
|
||||||
The following [schema](./preserves-schema.html) definitions match
|
|
||||||
exactly the JSON subset of a Preserves input:
|
|
||||||
|
|
||||||
version 1 .
|
|
||||||
JSON = @string string / @integer int / @double double / @boolean JSONBoolean / @null =null
|
|
||||||
/ @array [JSON ...] / @object { string: JSON ...:... } .
|
|
||||||
JSONBoolean = =true / =false .
|
|
||||||
|
|
||||||
### Character set.
|
|
||||||
|
|
||||||
[ABNF][abnf] allows easy definition of US-ASCII-based languages.
|
|
||||||
However, Preserves is a Unicode-based language. Therefore, we
|
|
||||||
reinterpret ABNF as a grammar for recognising sequences of Unicode
|
|
||||||
code points.
|
|
||||||
|
|
||||||
Textual syntax for a `Value` *SHOULD* be encoded using UTF-8 where
|
|
||||||
possible.
|
|
||||||
|
|
||||||
### Whitespace.
|
|
||||||
|
|
||||||
Whitespace is defined as any number of spaces, tabs, carriage returns,
|
|
||||||
line feeds, or commas.
|
|
||||||
|
|
||||||
ws = *(%x20 / %x09 / newline / ",")
|
|
||||||
newline = CR / LF
|
|
||||||
|
|
||||||
### Grammar.
|
|
||||||
|
|
||||||
Standalone documents may have trailing whitespace.
|
|
||||||
|
|
||||||
Document = Value ws
|
|
||||||
|
|
||||||
Any `Value` may be preceded by whitespace.
|
|
||||||
|
|
||||||
Value = ws (Record / Collection / Atom / Embedded / Compact)
|
|
||||||
Collection = Sequence / Dictionary / Set
|
|
||||||
Atom = Boolean / Float / Double / SignedInteger /
|
|
||||||
String / ByteString / Symbol
|
|
||||||
|
|
||||||
Each `Record` is an angle-bracket enclosed grouping of its
|
|
||||||
label-`Value` followed by its field-`Value`s.
|
|
||||||
|
|
||||||
Record = "<" Value *Value ws ">"
|
|
||||||
|
|
||||||
`Sequence`s are enclosed in square brackets. `Dictionary` values are
|
|
||||||
curly-brace-enclosed colon-separated pairs of values. `Set`s are
|
|
||||||
written as values enclosed by the tokens `#{` and
|
|
||||||
`}`.[^printing-collections] It is an error for a set to contain
|
|
||||||
duplicate elements or for a dictionary to contain duplicate keys.
|
|
||||||
|
|
||||||
Sequence = "[" *Value ws "]"
|
|
||||||
Dictionary = "{" *(Value ws ":" Value) ws "}"
|
|
||||||
Set = "#{" *Value ws "}"
|
|
||||||
|
|
||||||
[^printing-collections]: **Implementation note.** When implementing
|
|
||||||
printing of `Value`s using the textual syntax, consider supporting
|
|
||||||
(a) optional pretty-printing with indentation, (b) optional
|
|
||||||
JSON-compatible print mode for that subset of `Value` that is
|
|
||||||
compatible with JSON, and (c) optional submodes for no commas,
|
|
||||||
commas separating, and commas terminating elements or key/value
|
|
||||||
pairs within a collection.
|
|
||||||
|
|
||||||
`Boolean`s are the simple literal strings `#t` and `#f` for true and
|
|
||||||
false, respectively.
|
|
||||||
|
|
||||||
Boolean = %s"#t" / %s"#f"
|
|
||||||
|
|
||||||
Numeric data follow the
|
|
||||||
[JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with
|
|
||||||
the addition of a trailing “f” distinguishing `Float` from `Double`
|
|
||||||
values. `Float`s and `Double`s always have either a fractional part or
|
|
||||||
an exponent part, where `SignedInteger`s never have
|
|
||||||
either.[^reading-and-writing-floats-accurately]
|
|
||||||
[^arbitrary-precision-signedinteger]
|
|
||||||
|
|
||||||
Float = flt %i"f"
|
|
||||||
Double = flt
|
|
||||||
SignedInteger = int
|
|
||||||
|
|
||||||
digit1-9 = %x31-39
|
|
||||||
nat = %x30 / ( digit1-9 *DIGIT )
|
|
||||||
int = ["-"] nat
|
|
||||||
frac = "." 1*DIGIT
|
|
||||||
exp = %i"e" ["-"/"+"] 1*DIGIT
|
|
||||||
flt = int (frac exp / frac / exp)
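
As an illustration of how the numeric grammar separates the three numeric types, the following is a minimal Python sketch that transliterates the `int`, `frac`, `exp` and `flt` rules into regular expressions. The function name `classify_number` and the returned labels are assumptions made for this example only; a real reader would also need to handle surrounding whitespace and the rest of the grammar.

```python
import re

# Direct transliterations of the ABNF rules above.
INT = r"-?(?:0|[1-9][0-9]*)"        # int  = ["-"] nat
FRAC = r"\.[0-9]+"                  # frac = "." 1*DIGIT
EXP = r"[eE][-+]?[0-9]+"            # exp  = %i"e" ["-"/"+"] 1*DIGIT
FLT = rf"{INT}(?:{FRAC}{EXP}|{FRAC}|{EXP})"

FLOAT_RE = re.compile(rf"{FLT}[fF]")  # Float = flt %i"f"
DOUBLE_RE = re.compile(FLT)           # Double = flt
INTEGER_RE = re.compile(INT)          # SignedInteger = int


def classify_number(token):
    """Return the numeric type a complete token denotes, or None if it is
    not a number according to the grammar (hypothetical helper)."""
    if FLOAT_RE.fullmatch(token):
        return "Float"
    if DOUBLE_RE.fullmatch(token):
        return "Double"
    if INTEGER_RE.fullmatch(token):
        return "SignedInteger"
    return None
```

Because a `Double` always carries a fractional part or an exponent and a `SignedInteger` never does, the three patterns cannot match the same token, so the order of the checks is not significant.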
|
|
||||||
|
|
||||||
[^reading-and-writing-floats-accurately]: **Implementation note.**
|
|
||||||
Your language's standard library likely has a good routine for
|
|
||||||
converting between decimal notation and IEEE 754 floating-point.
|
|
||||||
However, if not, or if you are interested in the challenges of
|
|
||||||
accurately reading and writing floating point numbers, see the
|
|
||||||
excellent matched pair of 1990 papers by Clinger and Steele &
|
|
||||||
White, and a recent follow-up by Jaffer:
|
|
||||||
|
|
||||||
Clinger, William D. ‘How to Read Floating Point Numbers
|
|
||||||
Accurately’. In Proc. PLDI. White Plains, New York, 1990.
|
|
||||||
<https://doi.org/10.1145/93542.93557>.
|
|
||||||
|
|
||||||
Steele, Guy L., Jr., and Jon L. White. ‘How to Print
|
|
||||||
Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains,
|
|
||||||
New York, 1990. <https://doi.org/10.1145/93542.93559>.
|
|
||||||
|
|
||||||
Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of
|
|
||||||
Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013.
|
|
||||||
<http://arxiv.org/abs/1310.8121>.
|
|
||||||
|
|
||||||
[^arbitrary-precision-signedinteger]: **Implementation note.** Be
|
|
||||||
aware when implementing reading and writing of `SignedInteger`s
|
|
||||||
that the data model *requires* arbitrary-precision integers. Your
|
|
||||||
implementation may (but, ideally, should not) truncate precision
|
|
||||||
when reading or writing a `SignedInteger`; however, if it does so,
|
|
||||||
it should (a) signal its client that truncation has occurred, and
|
|
||||||
(b) make it clear to the client that comparing such truncated
|
|
||||||
values for equality or ordering will not yield results that match
|
|
||||||
the expected semantics of the data model.
|
|
||||||
|
|
||||||
`String`s are,
|
|
||||||
[as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly
|
|
||||||
escaped text surrounded by double quotes. The escaping rules are the
|
|
||||||
same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]
|
|
||||||
|
|
||||||
String = %x22 *char %x22
|
|
||||||
char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
|
|
||||||
unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
|
|
||||||
escape = %x5C ; \
|
|
||||||
escaped = ( %x5C / ; \ reverse solidus U+005C
|
|
||||||
%x2F / ; / solidus U+002F
|
|
||||||
%x62 / ; b backspace U+0008
|
|
||||||
%x66 / ; f form feed U+000C
|
|
||||||
%x6E / ; n line feed U+000A
|
|
||||||
%x72 / ; r carriage return U+000D
|
|
||||||
%x74 ) ; t tab U+0009
|
|
||||||
|
|
||||||
[^string-json-correspondence]: The grammar for `String` has the same
|
|
||||||
effect as the
|
|
||||||
[JSON](https://tools.ietf.org/html/rfc8259#section-7) grammar for
|
|
||||||
`string`. Some auxiliary definitions (e.g. `escaped`) are lifted
|
|
||||||
largely unmodified from the text of RFC 8259.
|
|
||||||
|
|
||||||
[^escaping-surrogate-pairs]: In particular, note JSON's rules around
|
|
||||||
the use of surrogate pairs for code points not in the Basic
|
|
||||||
Multilingual Plane. We encourage implementations to avoid using
|
|
||||||
`\u` escapes when producing output, and instead to rely on the
|
|
||||||
UTF-8 encoding of the entire document to handle non-ASCII
|
|
||||||
codepoints correctly.
|
|
||||||
|
|
||||||
A `ByteString` may be written in any of three different forms.
|
|
||||||
|
|
||||||
The first is similar to a `String`, but prepended with a hash sign
|
|
||||||
`#`. In addition, only Unicode code points overlapping with printable
|
|
||||||
7-bit ASCII are permitted unescaped inside such a `ByteString`; other
|
|
||||||
byte values must be escaped by prepending a two-digit hexadecimal
|
|
||||||
value with `\x`.
|
|
||||||
|
|
||||||
ByteString = "#" %x22 *binchar %x22
|
|
||||||
binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
|
|
||||||
binunescaped = %x20-21 / %x23-5B / %x5D-7E
|
|
||||||
|
|
||||||
The second is as a sequence of pairs of hexadecimal digits interleaved
|
|
||||||
with whitespace and surrounded by `#x"` and `"`.
|
|
||||||
|
|
||||||
ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22
|
|
||||||
|
|
||||||
The third is as a sequence of
|
|
||||||
[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
|
|
||||||
with whitespace and surrounded by `#[` and `]`. Plain and URL-safe
|
|
||||||
Base64 characters are allowed.
|
|
||||||
|
|
||||||
ByteString =/ "#[" *(ws / base64char) ws "]"
|
|
||||||
base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
|
|
||||||
|
|
||||||
A `Symbol` may be written in a “bare” form[^cf-sexp-token] so long as
|
|
||||||
it conforms to certain restrictions on the characters appearing in the
|
|
||||||
symbol. Alternatively, it may be written in a quoted form. The quoted
|
|
||||||
form is much the same as the syntax for `String`s, including embedded
|
|
||||||
escape syntax, except using a bar or pipe character (`|`) instead of a
|
|
||||||
double quote mark.
|
|
||||||
|
|
||||||
Symbol = symstart *symcont / "|" *symchar "|"
|
|
||||||
symstart = ALPHA / sympunct / symustart
|
|
||||||
symcont = ALPHA / sympunct / symustart / symucont / DIGIT / "-"
|
|
||||||
sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
|
|
||||||
"?" / "_" / "=" / "+" / "/" / "."
|
|
||||||
symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
|
|
||||||
symustart = <any code point greater than 127 whose Unicode
|
|
||||||
category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me,
|
|
||||||
Pc, Po, Sc, Sm, Sk, So, or Co>
|
|
||||||
symucont = <any code point greater than 127 whose Unicode
|
|
||||||
category is Nd, Nl, No, or Pd>
|
|
||||||
|
|
||||||
[^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt]
|
|
||||||
definition of “token representation”, and with the
|
|
||||||
[R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4).
|
|
||||||
|
|
||||||
An `Embedded` is written as a `Value` chosen to represent the denoted
|
|
||||||
object, prefixed with `#!`.
|
|
||||||
|
|
||||||
Embedded = "#!" Value
|
|
||||||
|
|
||||||
Finally, any `Value` may be represented by escaping from the textual
|
|
||||||
syntax to the [compact binary syntax](#compact-binary-syntax) by
|
|
||||||
prefixing a `ByteString` containing the binary representation of the
|
|
||||||
`Value` with `#=`.[^rationale-switch-to-binary]
|
|
||||||
[^no-literal-binary-in-text] [^compact-value-annotations]
|
|
||||||
|
|
||||||
Compact = "#=" ws ByteString
|
|
||||||
|
|
||||||
[^rationale-switch-to-binary]: **Rationale.** The textual syntax
|
|
||||||
cannot express every `Value`: specifically, it cannot express the
|
|
||||||
several million floating-point NaNs, or the two floating-point
|
|
||||||
Infinities. Since the compact binary format for `Value`s expresses
|
|
||||||
each `Value` with precision, embedding binary `Value`s solves the
|
|
||||||
problem.
|
|
||||||
|
|
||||||
[^no-literal-binary-in-text]: Every text is ultimately physically
|
|
||||||
stored as bytes; therefore, it might seem possible to escape to
|
|
||||||
the raw binary form of compact binary encoding from within a
|
|
||||||
piece of textual syntax. However, while bytes must be involved in
|
|
||||||
any *representation* of text, the text *itself* is logically a
|
|
||||||
sequence of *code points* and is not *intrinsically* a binary
|
|
||||||
structure at all. It would be incoherent to expect to be able to
|
|
||||||
access the representation of the text from within the text itself.
|
|
||||||
|
|
||||||
[^compact-value-annotations]: Any text-syntax annotations preceding
|
|
||||||
the `#` are prepended to any binary-syntax annotations yielded by
|
|
||||||
decoding the `ByteString`.
|
|
||||||
|
|
||||||
### Annotations.
|
|
||||||
|
|
||||||
**Syntax.** When written down, a `Value` may have an associated
|
|
||||||
sequence of *annotations* carrying “out-of-band” contextual metadata
|
|
||||||
about the value. Each annotation is, in turn, a `Value`, and may
|
|
||||||
itself have annotations. The ordering of annotations attached to a
|
|
||||||
`Value` is significant.
|
|
||||||
|
|
||||||
Value =/ ws "@" Value Value
|
|
||||||
|
|
||||||
Each annotation is preceded by `@`; the underlying annotated value
|
|
||||||
follows its annotations. Here we extend only the syntactic nonterminal
|
|
||||||
named “`Value`” without altering the semantic class of `Value`s.
|
|
||||||
|
|
||||||
**Comments.** Strings annotating a `Value` are conventionally
|
|
||||||
interpreted as comments associated with that value. Comments are
|
|
||||||
sufficiently common that special syntax exists for them.
|
|
||||||
|
|
||||||
Value =/ ws
|
|
||||||
";" *(%x00-09 / %x0B-0C / %x0E-10FFFF) newline
|
|
||||||
Value
|
|
||||||
|
|
||||||
When written this way, everything between the `;` and the newline is
|
|
||||||
included in the string annotating the `Value`.
|
|
||||||
|
|
||||||
**Equivalence.** Annotations appear within syntax denoting a `Value`;
|
|
||||||
however, the annotations are not part of the denoted value. They are
|
|
||||||
only part of the syntax. Annotations do not play a part in
|
|
||||||
equivalences and orderings of `Value`s.
|
|
||||||
|
|
||||||
Reflective tools such as debuggers, user interfaces, and message
|
|
||||||
routers and relays---tools which process `Value`s generically---may
|
|
||||||
use annotated inputs to tailor their operation, or may insert
|
|
||||||
annotations in their outputs. By contrast, in ordinary programs, as a
|
|
||||||
rule of thumb, the presence, absence or content of an annotation
|
|
||||||
should not change the control flow or output of the program.
|
|
||||||
Annotations are data *describing* `Value`s, and are not in the domain
|
|
||||||
of any specific application of `Value`s. That is, an annotation will
|
|
||||||
almost never cause a non-reflective program to do anything observably
|
|
||||||
different.
|
|
||||||
|
|
||||||
## Compact Binary Syntax
|
|
||||||
|
|
||||||
A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
|
|
||||||
For a value `v`, we write `«v»` for the `Repr` of `v`.
|
|
||||||
|
|
||||||
### Type and Length representation.
|
|
||||||
|
|
||||||
Each `Repr` starts with a tag byte, describing the kind of information
|
|
||||||
represented. Depending on the tag, a length indicator, further encoded
|
|
||||||
information, and/or an ending tag may follow.
|
|
||||||
|
|
||||||
tag (simple atomic data and small integers)
|
|
||||||
tag ++ binarydata (most integers)
|
|
||||||
tag ++ length ++ binarydata (large integers, strings, symbols, and binary)
|
|
||||||
tag ++ repr ++ ... ++ endtag (compound data)
|
|
||||||
|
|
||||||
The unique end tag is byte value `0x84`.
|
|
||||||
|
|
||||||
If present after a tag, the length of a following piece of binary data
|
|
||||||
is formatted as a [base 128 varint][varint].[^see-also-leb128] We
|
|
||||||
write `varint(m)` for the varint-encoding of `m`. Quoting the
|
|
||||||
[Google Protocol Buffers][varint] definition,
|
|
||||||
|
|
||||||
[^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned
|
|
||||||
integers. Varints and LEB128-encoded integers differ only for
|
|
||||||
signed integers, which are not used in Preserves.
|
|
||||||
|
|
||||||
> Each byte in a varint, except the last byte, has the most
|
|
||||||
> significant bit (msb) set – this indicates that there are further
|
|
||||||
> bytes to come. The lower 7 bits of each byte are used to store the
|
|
||||||
> two's complement representation of the number in groups of 7 bits,
|
|
||||||
> least significant group first.
|
|
||||||
|
|
||||||
The following table illustrates varint-encoding.
|
|
||||||
|
|
||||||
| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
|
|
||||||
| ------ | ------------------- | ------------ |
|
|
||||||
| 15 | `0001111` | 15 |
|
|
||||||
| 300 | `0000010 0101100` | 172 2 |
|
|
||||||
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
|
|
||||||
|
|
||||||
It is an error for a varint-encoded `m` in a `Repr` to be anything
|
|
||||||
other than the unique shortest encoding for that `m`. That is, a
|
|
||||||
varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0.
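
The varint rules above can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation; the names `varint` and `decode_varint` are chosen for this example, and the decoder rejects any encoding that is not the unique shortest one.

```python
def varint(m):
    """Encode a non-negative integer as a base 128 varint,
    least significant 7-bit group first."""
    if m < 0:
        raise ValueError("varints encode non-negative integers only")
    out = bytearray()
    while True:
        low = m & 0x7F
        m >>= 7
        if m:
            out.append(low | 0x80)  # more groups follow: set the msb
        else:
            out.append(low)         # last group: msb clear
            return bytes(out)


def decode_varint(data):
    """Decode a varint from the start of `data`.
    Returns (value, number_of_bytes_consumed)."""
    value = 0
    shift = 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            # A multi-byte encoding ending in 0 is not the shortest form.
            if byte == 0 and i > 0:
                raise ValueError("non-shortest varint encoding")
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")
```

With these definitions, `varint(300)` yields the bytes 172 2, matching the table above.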
|
|
||||||
|
|
||||||
### Records, Sequences, Sets and Dictionaries.
|
|
||||||
|
|
||||||
«<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
|
|
||||||
«[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
|
|
||||||
«#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
|
|
||||||
«{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]
|
|
||||||
|
|
||||||
There is *no* ordering requirement on the `E_i` elements or
|
|
||||||
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
|
|
||||||
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
|
|
||||||
addition, implementations *SHOULD* default to writing set elements and
|
|
||||||
dictionary key/value pairs in order sorted lexicographically by their
|
|
||||||
`Repr`s[^not-sorted-semantically], and *MAY* offer the option of
|
|
||||||
serializing in some other implementation-defined order.
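
A minimal Python sketch of the recommended default ordering follows. It assumes that the caller has already encoded each element, key and value to its `Repr` and has already ensured that elements and keys are pairwise distinct as `Value`s; the function names are hypothetical.

```python
def encode_set(element_reprs):
    """Write a Set whose elements are given as already-encoded Reprs,
    sorted lexicographically by those Reprs."""
    return b"\xB6" + b"".join(sorted(element_reprs)) + b"\x84"


def encode_dictionary(pair_reprs):
    """Write a Dictionary from (key Repr, value Repr) pairs,
    sorted lexicographically by key Repr."""
    ordered = sorted(pair_reprs, key=lambda pair: pair[0])
    return b"\xB7" + b"".join(k + v for k, v in ordered) + b"\x84"
```

For example, a set containing 0 (`Repr` `90`) and -257 (`Repr` `A1 FE FF`) is written as `B6 90 A1 FE FF 84`: zero comes first because its `Repr` sorts first, even though -257 is the smaller number.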
|
|
||||||
|
|
||||||
[^no-sorting-rationale]: In the BitTorrent encoding format,
|
|
||||||
[bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
|
|
||||||
dictionary key/value pairs must be sorted by key. This is a
|
|
||||||
necessary step for ensuring serialization of `Value`s is
|
|
||||||
canonical. We do not require that key/value pairs (or set
|
|
||||||
elements) be in sorted order for serialized `Value`s; however, a
|
|
||||||
[canonical form][canonical] for `Repr`s does exist where a sorted
|
|
||||||
ordering is required.
|
|
||||||
|
|
||||||
[^not-sorted-semantically]: It's important to note that the sort
|
|
||||||
ordering for writing out set elements and dictionary key/value
|
|
||||||
pairs is *not* the same as the sort ordering implied by the
|
|
||||||
semantic ordering of those elements or keys. For example, the
|
|
||||||
`Repr` of a negative number very far from zero will start with
|
|
||||||
a byte that is *greater* than the byte which starts the `Repr` of
|
|
||||||
zero, making it sort lexicographically later by `Repr`, despite
|
|
||||||
being semantically *less than* zero.
|
|
||||||
|
|
||||||
**Rationale**. This is for ease-of-implementation reasons: not all
|
|
||||||
languages can easily represent sorted sets or sorted dictionaries,
|
|
||||||
but encoding and then sorting byte strings is much more likely to
|
|
||||||
be within easy reach.
|
|
||||||
|
|
||||||
### SignedIntegers.
|
|
||||||
|
|
||||||
«x» when x ∈ SignedInteger = [0xB0] ++ varint(m) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m>16
|
|
||||||
([0xA0] + m - 1) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m≤16
|
|
||||||
([0xA0] + x) if (-3≤x≤-1)
|
|
||||||
([0x90] + x) if ( 0≤x≤12)
|
|
||||||
where m = |intbytes(x)|
|
|
||||||
|
|
||||||
Integers in the range [-3,12] are compactly represented with tags
|
|
||||||
between `0x90` and `0x9F` because they are so frequently used.
|
|
||||||
Integers up to 16 bytes long are represented with a single-byte tag
|
|
||||||
encoding the length of the integer. Larger integers are represented
|
|
||||||
with an explicit varint length. Every `SignedInteger` *MUST* be
|
|
||||||
represented with its shortest possible encoding.
|
|
||||||
|
|
||||||
The function `intbytes(x)` gives the big-endian two's-complement
|
|
||||||
binary representation of `x`, taking exactly as many whole bytes as
|
|
||||||
needed to unambiguously identify the value and its sign, and `m =
|
|
||||||
|intbytes(x)|`. The most-significant bit in the first byte in
|
|
||||||
`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes] For
|
|
||||||
example,
|
|
||||||
|
|
||||||
«87112285931760246646623899502532662132736»
|
|
||||||
= B0 12 01 00 00 00 00 00 00 00
|
|
||||||
00 00 00 00 00 00 00 00
|
|
||||||
00 00
|
|
||||||
|
|
||||||
«-257» = A1 FE FF «-3» = 9D «128» = A1 00 80
|
|
||||||
«-256» = A1 FF 00 «-2» = 9E «255» = A1 00 FF
|
|
||||||
«-255» = A1 FF 01 «-1» = 9F «256» = A1 01 00
|
|
||||||
«-254» = A1 FF 02 «0» = 90 «32767» = A1 7F FF
|
|
||||||
«-129» = A1 FF 7F «1» = 91 «32768» = A2 00 80 00
|
|
||||||
«-128» = A0 80 «12» = 9C «65535» = A2 00 FF FF
|
|
||||||
«-127» = A0 81 «13» = A0 0D «65536» = A2 01 00 00
|
|
||||||
«-4» = A0 FC «127» = A0 7F «131072» = A2 02 00 00
|
|
||||||
|
|
||||||
[^zero-intbytes]: The value 0 needs zero bytes to identify the
|
|
||||||
value, so `intbytes(0)` is the empty byte string. Non-zero values
|
|
||||||
need at least one byte.
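
The case analysis for `SignedInteger`s can be sketched in Python as below. The sketch is illustrative only; `intbytes` follows the definition above, while `encode_signed_integer` and the local `varint` helper are names chosen for this example.

```python
def varint(m):
    """Base 128 varint of a non-negative integer."""
    out = bytearray()
    while True:
        low = m & 0x7F
        m >>= 7
        if m:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)


def intbytes(x):
    """Shortest big-endian two's-complement representation of x.
    intbytes(0) is the empty byte string."""
    if x == 0:
        return b""
    # Bits needed for the magnitude; one further bit holds the sign.
    magnitude_bits = (x if x >= 0 else ~x).bit_length()
    return x.to_bytes(magnitude_bits // 8 + 1, "big", signed=True)


def encode_signed_integer(x):
    if -3 <= x <= -1:
        return bytes([0xA0 + x])             # 9D, 9E, 9F
    if 0 <= x <= 12:
        return bytes([0x90 + x])             # 90 .. 9C
    body = intbytes(x)
    m = len(body)
    if m <= 16:
        return bytes([0xA0 + m - 1]) + body  # length carried in the tag
    return b"\xB0" + varint(m) + body        # explicit varint length
```

Applied to the values tabulated above, this sketch reproduces the listed byte sequences, for example `A1 FE FF` for -257 and `A2 00 80 00` for 32768.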
|
|
||||||
|
|
||||||
### Strings, ByteStrings and Symbols.
|
|
||||||
|
|
||||||
Syntax for these three types varies only in the tag used. For `String`
|
|
||||||
and `Symbol`, the data following the tag is a UTF-8 encoding of the
|
|
||||||
`Value`'s code points, while for `ByteString` it is the raw data
|
|
||||||
contained within the `Value` unmodified.
|
|
||||||
|
|
||||||
«S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ String
|
|
||||||
[0xB2] ++ varint(|S|) ++ S if S ∈ ByteString
|
|
||||||
[0xB3] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ Symbol
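
As a sketch of the three cases, the following Python functions build each `Repr` from a tag, a varint length and the payload. The function names are assumptions for this example. Note that the length counts UTF-8 bytes, not code points.

```python
def varint(m):
    """Base 128 varint of a non-negative integer."""
    out = bytearray()
    while True:
        low = m & 0x7F
        m >>= 7
        if m:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)


def encode_string(s):
    data = s.encode("utf-8")
    return b"\xB1" + varint(len(data)) + data


def encode_bytestring(data):
    data = bytes(data)
    return b"\xB2" + varint(len(data)) + data


def encode_symbol(name):
    data = name.encode("utf-8")
    return b"\xB3" + varint(len(data)) + data
```

For instance, `encode_string("é")` yields `B1 02 C3 A9`: a single code point, but a length of 2, because its UTF-8 encoding occupies two bytes.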
|
|
||||||
|
|
||||||
### Booleans.
|
|
||||||
|
|
||||||
«#f» = [0x80]
|
|
||||||
«#t» = [0x81]
|
|
||||||
|
|
||||||
### Floats and Doubles.
|
|
||||||
|
|
||||||
«F» when F ∈ Float = [0x82] ++ binary32(F)
|
|
||||||
«D» when D ∈ Double = [0x83] ++ binary64(D)
|
|
||||||
|
|
||||||
The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
|
|
||||||
8-byte IEEE 754 binary representations of `F` and `D`, respectively.
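
In Python, the standard `struct` module can produce these big-endian representations. The sketch below is illustrative; the function names are chosen for this example. Python's own `float` is a double, so `encode_float` rounds its argument to single precision.

```python
import struct


def encode_float(f):
    # ">f" packs a big-endian IEEE 754 binary32 value.
    return b"\x82" + struct.pack(">f", f)


def encode_double(d):
    # ">d" packs a big-endian IEEE 754 binary64 value.
    return b"\x83" + struct.pack(">d", d)
```

With these definitions, `encode_float(1.0)` is `82 3F 80 00 00` and `encode_double(1.0)` is `83 3F F0 00 00 00 00 00 00`.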
|
|
||||||
|
|
||||||
### Embeddeds.
|
|
||||||
|
|
||||||
The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to
|
|
||||||
represent the denoted object, prefixed with `[0x86]`.
|
|
||||||
|
|
||||||
«#!V» = [0x86] ++ «V»
|
|
||||||
|
|
||||||
### Annotations.
|
|
||||||
|
|
||||||
To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
|
|
||||||
`[0x85] ++ «v»`. For example, the `Repr` corresponding to textual
|
|
||||||
syntax `@a@b[]`, i.e. an empty sequence annotated with two symbols,
|
|
||||||
`a` and `b`, is
|
|
||||||
|
|
||||||
«@a @b []»
|
|
||||||
= [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
|
|
||||||
= [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]
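
The same construction, sketched in Python under the assumption that the annotated `Repr` and each annotation's `Repr` are already available as byte strings (the helper name `annotate` is hypothetical):

```python
def annotate(repr_bytes, *annotation_reprs):
    """Prepend [0x85] ++ «v» for each annotation v, in the order given."""
    prefix = b"".join(b"\x85" + a for a in annotation_reprs)
    return prefix + repr_bytes


# «a» = B3 01 61, «b» = B3 01 62, «[]» = B5 84
annotated = annotate(b"\xB5\x84", b"\xB3\x01a", b"\xB3\x01b")
```

Here `annotated` is the ten-byte sequence shown above.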
|
|
||||||
|
|
||||||
## Examples
|
## Examples
|
||||||
|
|
||||||
|
The definitions above are independent of any particular concrete syntax.
|
||||||
|
The examples of `Value`s that follow are written using [the Preserves
|
||||||
|
text syntax](preserves-text.html), and the example encoded byte
|
||||||
|
sequences use [the Preserves binary encoding](preserves-binary.html).
|
||||||
|
|
||||||
### Ordering.
|
### Ordering.
|
||||||
|
|
||||||
The total ordering specified [above](#total-order) means that the following statements are true:
|
The total ordering specified [above](#total-order) means that the following statements are true:
|
||||||
|
@ -720,10 +285,23 @@ encodes to
|
||||||
|
|
||||||
### JSON examples.
|
### JSON examples.
|
||||||
|
|
||||||
The examples from
|
Preserves text syntax is a superset of JSON, so the examples from [RFC
|
||||||
[RFC 8259](https://tools.ietf.org/html/rfc8259#section-13) read as
|
8259](https://tools.ietf.org/html/rfc8259#section-13) read as valid
|
||||||
valid Preserves, though the JSON literals `true`, `false` and `null`
|
Preserves.
|
||||||
read as `Symbol`s. The first example:
|
|
||||||
|
The JSON literals `true`, `false` and `null` all read as `Symbol`s, and
|
||||||
|
JSON numbers read (unambiguously) either as `SignedInteger`s or as
|
||||||
|
`Double`s.[^json-superset]
|
||||||
|
|
||||||
|
[^json-superset]: The following [schema](./preserves-schema.html)
|
||||||
|
definitions match exactly the JSON subset of a Preserves input:
|
||||||
|
|
||||||
|
version 1 .
|
||||||
|
JSON = @string string / @integer int / @double double / @boolean JSONBoolean / @null =null
|
||||||
|
/ @array [JSON ...] / @object { string: JSON ...:... } .
|
||||||
|
JSONBoolean = =true / =false .
|
||||||
|
|
||||||
|
The first RFC 8259 example:
|
||||||
|
|
||||||
{
|
{
|
||||||
"Image": {
|
"Image": {
|
||||||
|
@ -740,7 +318,8 @@ read as `Symbol`s. The first example:
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
encodes to binary as follows:
|
when read using the Preserves text syntax encodes via the binary syntax
|
||||||
|
as follows:
|
||||||
|
|
||||||
B7
|
B7
|
||||||
B1 05 "Image"
|
B1 05 "Image"
|
||||||
|
@ -764,7 +343,7 @@ encodes to binary as follows:
|
||||||
84
|
84
|
||||||
84
|
84
|
||||||
|
|
||||||
and the second example:
|
The second RFC 8259 example:
|
||||||
|
|
||||||
[
|
[
|
||||||
{
|
{
|
||||||
|
@ -814,89 +393,5 @@ encodes to binary as follows:
|
||||||
84
|
84
|
||||||
84
|
84
|
||||||
|
|
||||||
## Security Considerations
|
|
||||||
|
|
||||||
**Whitespace.** The textual format allows arbitrary whitespace in many
|
|
||||||
positions. Consider optional restrictions on the amount of consecutive
|
|
||||||
whitespace that may appear.
|
|
||||||
|
|
||||||
**Annotations.** Similarly, in modes where a `Value` is being read
|
|
||||||
while annotations are skipped, an endless sequence of annotations may
|
|
||||||
give an illusion of progress.
|
|
||||||
|
|
||||||
**Canonical form for cryptographic hashing and signing.** No canonical
|
|
||||||
textual encoding of a `Value` is specified. A
|
|
||||||
[canonical form][canonical] exists for binary encoded `Value`s, and
|
|
||||||
implementations *SHOULD* produce canonical binary encodings by
|
|
||||||
default; however, an implementation *MAY* permit two serializations of
|
|
||||||
the same `Value` to yield different binary `Repr`s.
|
|
||||||
|
|
||||||
## Acknowledgements
|
|
||||||
|
|
||||||
The treatment of commas as whitespace in the text syntax is inspired
|
|
||||||
by the same feature of [EDN](https://github.com/edn-format/edn).
|
|
||||||
|
|
||||||
The text syntax for `Boolean`s, `Symbol`s, and `ByteString`s is
|
|
||||||
directly inspired by [Racket](https://racket-lang.org/)'s lexical
|
|
||||||
syntax.
|
|
||||||
|
|
||||||
## Appendix. Autodetection of textual or binary syntax
|
|
||||||
|
|
||||||
Every tag byte in a binary Preserves `Document` falls within the range
|
|
||||||
[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
|
|
||||||
bytes*, and will never occur as the first byte of a UTF-8 encoded code
|
|
||||||
point. This means no binary-encoded document can be misinterpreted as
|
|
||||||
valid UTF-8.
|
|
||||||
|
|
||||||
Conversely, a UTF-8 document must start with a valid codepoint,
|
|
||||||
meaning in particular that it must not start with a byte in the range
|
|
||||||
[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
|
|
||||||
Preserves document can be misinterpreted as a binary-syntax document.
|
|
||||||
|
|
||||||
Examination of the top two bits of the first byte of a document gives
|
|
||||||
its syntax: if the top two bits are `10`, it should be interpreted as
|
|
||||||
a binary-syntax document; otherwise, it should be interpreted as text.
|
|
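The top-two-bits check described above can be sketched as follows. This is a minimal illustration, not part of the specification: the function name `detect_syntax` and its string results are hypothetical, and treating an empty input as an error is an assumption, since the text does not say how an empty document should be classified.

```python
def detect_syntax(document: bytes) -> str:
    """Classify a Preserves document as 'binary' or 'text' from its first byte.

    Hypothetical helper: the name and return values are illustrative only.
    """
    if not document:
        # Assumption: the text does not cover empty input, so reject it here.
        raise ValueError("cannot autodetect the syntax of an empty document")
    # Binary tag bytes lie in [0x80, 0xBF], i.e. their top two bits are 10.
    # Such bytes are UTF-8 continuation bytes and cannot start a code point.
    if document[0] & 0xC0 == 0x80:
        return "binary"
    return "text"
```

For example, a document starting with the Dictionary tag `B7` is classified as binary, while one starting with `{` or with a multi-byte UTF-8 lead byte is classified as text.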
||||||
|
|
||||||
## Appendix. Table of tag values
|
|
||||||
|
|
||||||
80 - False
|
|
||||||
81 - True
|
|
||||||
82 - Float
|
|
||||||
83 - Double
|
|
||||||
84 - End marker
|
|
||||||
85 - Annotation
|
|
||||||
86 - Embedded
|
|
||||||
(8x) RESERVED 87-8F
|
|
||||||
|
|
||||||
9x - Small integers 0..12,-3..-1
|
|
||||||
An - Medium integers, (n+1) bytes long
|
|
||||||
B0 - Large integers, variable length
|
|
||||||
B1 - String
|
|
||||||
B2 - ByteString
|
|
||||||
B3 - Symbol
|
|
||||||
|
|
||||||
B4 - Record
|
|
||||||
B5 - Sequence
|
|
||||||
B6 - Set
|
|
||||||
B7 - Dictionary
|
|
||||||
|
|
||||||
## Appendix. Binary SignedInteger representation
|
|
||||||
|
|
||||||
Languages that provide fixed-width machine word types may find the
|
|
||||||
following table useful in encoding and decoding binary `SignedInteger`
|
|
||||||
values.
|
|
||||||
|
|
||||||
| Integer range | Bytes required | Encoding (hex) |
|
|
||||||
| --- | --- | --- |
|
|
||||||
| -3 ≤ n ≤ 12 | 1 | `9X` |
|
|
||||||
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `A0` `XX` |
|
|
||||||
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `A1` `XX` `XX` |
|
|
||||||
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `A2` `XX` `XX` `XX` |
|
|
||||||
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `A3` `XX` `XX` `XX` `XX` |
|
|
||||||
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `A4` `XX` `XX` `XX` `XX` `XX` |
|
|
||||||
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `A5` `XX` `XX` `XX` `XX` `XX` `XX` |
|
|
||||||
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `A6` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
|
|
||||||
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `A7` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
|
|
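The table above can be read as an encoding procedure, sketched here under stated assumptions. The function name `encode_signed_integer` is hypothetical. The mapping of small integers (0..12 to `90`..`9C`, and -3..-1 to `9D`..`9F`) is inferred from the ordering "0..12,-3..-1" in the tag table, and the payload is assumed to be big-endian two's complement; neither detail is stated in this excerpt. Only the rows listed in the table (up to i64) are covered; the variable-length `B0` form is not sketched.

```python
def encode_signed_integer(n: int) -> bytes:
    """Encode n using the small (9X) or medium (A0..A7) integer forms.

    Hypothetical helper covering only the ranges listed in the table.
    """
    if -3 <= n <= 12:
        # Assumption: 0..12 map to 0x90..0x9C and -3..-1 to 0x9D..0x9F,
        # i.e. the low nibble is n modulo 16.
        return bytes([0x90 + (n % 16)])
    for extra in range(8):  # tags A0..A7 carry extra + 1 payload bytes
        width = extra + 1
        bound = 1 << (8 * width - 1)
        if -bound <= n < bound:
            # Assumption: payload is big-endian two's complement.
            return bytes([0xA0 + extra]) + n.to_bytes(width, "big", signed=True)
    raise OverflowError("value needs a longer encoding than this sketch covers")
```

Choosing the first width whose range contains `n` yields the shortest encoding among the table rows, which is what the "Bytes required" column describes.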
||||||
|
|
||||||
<!-- Heading to visually offset the footnotes from the main document: -->
|
<!-- Heading to visually offset the footnotes from the main document: -->
|
||||||
## Notes
|
## Notes
|
||||||
|
|