From 7d3789e371220f13aad91a5da2327ef835e96cc9 Mon Sep 17 00:00:00 2001 From: Tony Garnock-Jones Date: Sat, 18 Jun 2022 19:11:08 +0200 Subject: [PATCH] Split up spec! --- README.md | 14 +- _config.yml | 2 + canonical-binary.md | 4 +- preserves-binary.md | 260 +++++++++++++++++++ preserves-text.md | 302 +++++++++++++++++++++ preserves.md | 619 ++++---------------------------------------- 6 files changed, 631 insertions(+), 570 deletions(-) create mode 100644 preserves-binary.md create mode 100644 preserves-text.md diff --git a/README.md b/README.md index 5eac064..a7bb402 100644 --- a/README.md +++ b/README.md @@ -6,22 +6,24 @@ no_site_title: true --- This [repository]({{page.projectpages}}) contains a -[proposal](preserves.html) and various implementations of *Preserves*, -a new data model and serialization format in many ways comparable to -JSON, XML, S-expressions, CBOR, ASN.1 BER, and so on. +[proposal](preserves.html) and various implementations of *Preserves*, a +new data model, with associated serialization formats, in many ways +comparable to JSON, XML, S-expressions, CBOR, ASN.1 BER, and so on. ## Core documents ### Preserves data model and serialization formats Preserves is defined in terms of a syntax-neutral -[data model and semantics](preserves.html#starting-with-semantics) +[data model and semantics](preserves.html#semantics) which all transfer syntaxes share. This allows trivial, completely automatic, perfect-fidelity conversion between syntaxes. 
+ - [Preserves specification](preserves.html): + - [Preserves semantics and data model](preserves.html#semantics), + - [Preserves textual syntax](preserves-text.html), and + - [Preserves machine-oriented binary syntax](preserves-binary.html) - [Preserves tutorial](TUTORIAL.html) - - [Preserves specification](preserves.html), including semantics, - data model, textual syntax, and compact binary syntax - [Canonical Form for Binary Syntax](canonical-binary.html) - [Syrup](https://github.com/ocapn/syrup#pseudo-specification), a hybrid binary/human-readable syntax for the Preserves data model diff --git a/_config.yml b/_config.yml index e638b87..c24e804 100644 --- a/_config.yml +++ b/_config.yml @@ -13,3 +13,5 @@ defaults: layout: page title: "Preserves" +version_date: "June 2022" +version: "0.6.3" diff --git a/canonical-binary.md b/canonical-binary.md index 7048292..8393cab 100644 --- a/canonical-binary.md +++ b/canonical-binary.md @@ -17,8 +17,8 @@ their *syntax* for equivalence gives the same result as comparing them That is, canonical forms are equal if and only if the encoded `Value`s are equal. -This document specifies canonical form for the Preserves compact -binary syntax. +This document specifies canonical form for the Preserves [machine-oriented +binary syntax](preserves-binary.html). **Annotations.** Annotations *MUST NOT* be present. diff --git a/preserves-binary.md b/preserves-binary.md new file mode 100644 index 0000000..3bc3371 --- /dev/null +++ b/preserves-binary.md @@ -0,0 +1,260 @@ +--- +no_site_title: true +title: "Preserves: Binary Syntax" +--- + +Tony Garnock-Jones +{{ site.version_date }}. Version {{ site.version }}. 
+ + [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt + [spki]: http://world.std.com/~cme/html/spki.html + [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints + [LEB128]: https://en.wikipedia.org/wiki/LEB128 + [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map + [abnf]: https://tools.ietf.org/html/rfc7405 + [canonical]: canonical-binary.html + +*Preserves* is a data model, with associated serialization formats. This +document defines one of those formats: a binary syntax for `Value`s from +the [Preserves data model](preserves.html) that is easy for computer +software to read and write. An [equivalent human-readable text +syntax](preserves-text.html) also exists. + +## Machine-Oriented Binary Syntax + +A `Repr` is a binary-syntax encoding, or representation, of a `Value`. +For a value `v`, we write `«v»` for the `Repr` of v. + +### Type and Length representation. + +Each `Repr` starts with a tag byte, describing the kind of information +represented. Depending on the tag, a length indicator, further encoded +information, and/or an ending tag may follow. + + tag (simple atomic data and small integers) + tag ++ binarydata (most integers) + tag ++ length ++ binarydata (large integers, strings, symbols, and binary) + tag ++ repr ++ ... ++ endtag (compound data) + +The unique end tag is byte value `0x84`. + +If present after a tag, the length of a following piece of binary data +is formatted as a [base 128 varint][varint].[^see-also-leb128] We +write `varint(m)` for the varint-encoding of `m`. Quoting the +[Google Protocol Buffers][varint] definition, + + [^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned + integers. Varints and LEB128-encoded integers differ only for + signed integers, which are not used in Preserves. + +> Each byte in a varint, except the last byte, has the most +> significant bit (msb) set – this indicates that there are further +> bytes to come. 
The lower 7 bits of each byte are used to store the +> two's complement representation of the number in groups of 7 bits, +> least significant group first. + +The following table illustrates varint-encoding. + +| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes | +| ------ | ------------------- | ------------ | +| 15 | `0001111` | 15 | +| 300 | `0000010 0101100` | 172 2 | +| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 | + +It is an error for a varint-encoded `m` in a `Repr` to be anything +other than the unique shortest encoding for that `m`. That is, a +varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0. + +### Records, Sequences, Sets and Dictionaries. + + «<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84] + «[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84] + «#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84] + «{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84] + +There is *no* ordering requirement on the `E_i` elements or +`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any +order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In +addition, implementations *SHOULD* default to writing set elements and +dictionary key/value pairs in order sorted lexicographically by their +`Repr`s[^not-sorted-semantically], and *MAY* offer the option of +serializing in some other implementation-defined order. + + [^no-sorting-rationale]: In the BitTorrent encoding format, + [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding), + dictionary key/value pairs must be sorted by key. This is a + necessary step for ensuring serialization of `Value`s is + canonical. We do not require that key/value pairs (or set + elements) be in sorted order for serialized `Value`s; however, a + [canonical form][canonical] for `Repr`s does exist where a sorted + ordering is required.
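
The varint rule and the suggested `Repr`-sorted ordering of set elements described above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: `varint` is the name used in the text, while `read_varint` and `encode_set` are hypothetical helper names chosen here, and `encode_set` assumes its caller has already encoded each element to its `Repr`.

```python
from typing import Iterable, Tuple


def varint(m: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (unsigned LEB128)."""
    if m < 0:
        raise ValueError("varint lengths are unsigned")
    out = bytearray()
    while True:
        low7 = m & 0x7F
        m >>= 7
        if m:
            out.append(low7 | 0x80)  # more groups follow: set the msb
        else:
            out.append(low7)         # final group: msb clear
            return bytes(out)


def read_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a varint starting at data[pos]; return (value, next position).

    Rejects any encoding that is not the unique shortest one.
    """
    value = 0
    shift = 0
    start = pos
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    # A multi-byte encoding whose last byte is 0 has a shorter equivalent.
    if byte == 0 and pos - start > 1:
        raise ValueError("non-shortest varint encoding")
    return value, pos


def encode_set(element_reprs: Iterable[bytes]) -> bytes:
    """Hypothetical helper: wrap already-encoded elements as a Set Repr.

    Elements are written sorted lexicographically by their Reprs, which is
    the suggested default ordering; duplicates are rejected.
    """
    ordered = sorted(element_reprs)  # bytes compare byte-by-byte, unsigned
    if len(set(ordered)) != len(ordered):
        raise ValueError("set elements must be pairwise distinct")
    return b"\xB6" + b"".join(ordered) + b"\x84"
```

Sorting the encoded byte strings, rather than the values themselves, is what keeps this approach within easy reach in languages that lack sorted collections.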
+ + [^not-sorted-semantically]: It's important to note that the sort + ordering for writing out set elements and dictionary key/value + pairs is *not* the same as the sort ordering implied by the + semantic ordering of those elements or keys. For example, the + `Repr` of a negative number very far from zero will start with + a byte that is *greater* than the byte which starts the `Repr` of + zero, making it sort lexicographically later by `Repr`, despite + being semantically *less than* zero. + + **Rationale**. This is for ease-of-implementation reasons: not all + languages can easily represent sorted sets or sorted dictionaries, + but encoding and then sorting byte strings is much more likely to + be within easy reach. + +### SignedIntegers. + + «x» when x ∈ SignedInteger = [0xB0] ++ varint(m) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m>16 + ([0xA0] + m - 1) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m≤16 + ([0xA0] + x) if (-3≤x≤-1) + ([0x90] + x) if ( 0≤x≤12) + where m = |intbytes(x)| + +Integers in the range [-3,12] are compactly represented with tags +between `0x90` and `0x9F` because they are so frequently used. +Integers up to 16 bytes long are represented with a single-byte tag +encoding the length of the integer. Larger integers are represented +with an explicit varint length. Every `SignedInteger` *MUST* be +represented with its shortest possible encoding. + +The function `intbytes(x)` gives the big-endian two's-complement +binary representation of `x`, taking exactly as many whole bytes as +needed to unambiguously identify the value and its sign, and `m = +|intbytes(x)|`.
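
As a rough illustration of the `SignedInteger` rules above, the following Python sketch implements `intbytes` and the four encoding cases. The names `intbytes` and `varint` come from the text; `encode_signed_integer` is a hypothetical name used only for this example, and Python's built-in arbitrary-precision `int` is assumed to stand in for `SignedInteger`.

```python
def varint(m: int) -> bytes:
    """Base-128 varint of a non-negative integer, least significant group first."""
    out = bytearray()
    while True:
        low7, m = m & 0x7F, m >> 7
        out.append(low7 | 0x80 if m else low7)
        if not m:
            return bytes(out)


def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement form of x; empty for zero."""
    if x == 0:
        return b""
    # Bits needed for the magnitude, excluding the sign bit. For negative x,
    # ~x (= -x - 1) has the same bit pattern with the sign removed.
    magnitude_bits = (x if x > 0 else ~x).bit_length()
    nbytes = magnitude_bits // 8 + 1  # one extra bit for the sign, rounded up
    return x.to_bytes(nbytes, "big", signed=True)


def encode_signed_integer(x: int) -> bytes:
    """Hypothetical helper producing the Repr of a SignedInteger."""
    if 0 <= x <= 12:
        return bytes([0x90 + x])          # small non-negative: 90..9C
    if -3 <= x <= -1:
        return bytes([0xA0 + x])          # small negative: 9D..9F
    body = intbytes(x)
    m = len(body)
    if m <= 16:
        return bytes([0xA0 + m - 1]) + body   # medium: tag carries the length
    return b"\xB0" + varint(m) + body         # large: explicit varint length
```

Because `intbytes` always returns the shortest form and the small-integer cases are checked first, each integer has exactly one output, in line with the shortest-encoding requirement.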
The most-significant bit in the first byte in +`intbytes(x)` is the sign bit.[^zero-intbytes] For +example, + + «87112285931760246646623899502532662132736» + = B0 12 01 00 00 00 00 00 00 00 + 00 00 00 00 00 00 00 00 + 00 00 + + «-257» = A1 FE FF «-3» = 9D «128» = A1 00 80 + «-256» = A1 FF 00 «-2» = 9E «255» = A1 00 FF + «-255» = A1 FF 01 «-1» = 9F «256» = A1 01 00 + «-254» = A1 FF 02 «0» = 90 «32767» = A1 7F FF + «-129» = A1 FF 7F «1» = 91 «32768» = A2 00 80 00 + «-128» = A0 80 «12» = 9C «65535» = A2 00 FF FF + «-127» = A0 81 «13» = A0 0D «65536» = A2 01 00 00 + «-4» = A0 FC «127» = A0 7F «131072» = A2 02 00 00 + + [^zero-intbytes]: The value 0 needs zero bytes to identify the + value, so `intbytes(0)` is the empty byte string. Non-zero values + need at least one byte. + +### Strings, ByteStrings and Symbols. + +Syntax for these three types varies only in the tag used. For `String` +and `Symbol`, the data following the tag is a UTF-8 encoding of the +`Value`'s code points, while for `ByteString` it is the raw data +contained within the `Value` unmodified. + + «S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ String + [0xB2] ++ varint(|S|) ++ S if S ∈ ByteString + [0xB3] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ Symbol + +### Booleans. + + «#f» = [0x80] + «#t» = [0x81] + +### Floats and Doubles. + + «F» when F ∈ Float = [0x82] ++ binary32(F) + «D» when D ∈ Double = [0x83] ++ binary64(D) + +The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and +8-byte IEEE 754 binary representations of `F` and `D`, respectively. + +### Embeddeds. + +The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to +represent the denoted object, prefixed with `[0x86]`. + + «#!V» = [0x86] ++ «V» + +### Annotations. + +To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with +`[0x85] ++ «v»`. For example, the `Repr` corresponding to textual +syntax `@a@b[]`, i.e. 
an empty sequence annotated with two symbols, +`a` and `b`, is + + «@a @b []» + = [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]» + = [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84] + +## Security Considerations + +**Annotations.** In modes where a `Value` is being read while +annotations are skipped, an endless sequence of annotations may give an +illusion of progress. + +**Canonical form for cryptographic hashing and signing.** No canonical +textual encoding of a `Value` is specified. A +[canonical form][canonical] exists for binary encoded `Value`s, and +implementations *SHOULD* produce canonical binary encodings by +default; however, an implementation *MAY* permit two serializations of +the same `Value` to yield different binary `Repr`s. + +## Appendix. Autodetection of textual or binary syntax + +Every tag byte in a binary Preserves `Document` falls within the range +[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation +bytes*, and will never occur as the first byte of a UTF-8 encoded code +point. This means no binary-encoded document can be misinterpreted as +valid UTF-8. + +Conversely, a UTF-8 document must start with a valid codepoint, +meaning in particular that it must not start with a byte in the range +[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax +Preserves document can be misinterpreted as a binary-syntax document. + +Examination of the top two bits of the first byte of a document gives +its syntax: if the top two bits are `10`, it should be interpreted as +a binary-syntax document; otherwise, it should be interpreted as text. + +## Appendix. 
Table of tag values + + 80 - False + 81 - True + 82 - Float + 83 - Double + 84 - End marker + 85 - Annotation + 86 - Embedded + (8x) RESERVED 87-8F + + 9x - Small integers 0..12,-3..-1 + An - Medium integers, (n+1) bytes long + B0 - Large integers, variable length + B1 - String + B2 - ByteString + B3 - Symbol + + B4 - Record + B5 - Sequence + B6 - Set + B7 - Dictionary + +## Appendix. Binary SignedInteger representation + +Languages that provide fixed-width machine word types may find the +following table useful in encoding and decoding binary `SignedInteger` +values. + +| Integer range | Bytes required | Encoding (hex) | +| --- | --- | --- | +| -3 ≤ n ≤ 12 | 1 | `9X` | +| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `A0` `XX` | +| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `A1` `XX` `XX` | +| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `A2` `XX` `XX` `XX` | +| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `A3` `XX` `XX` `XX` `XX` | +| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `A4` `XX` `XX` `XX` `XX` `XX` | +| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `A5` `XX` `XX` `XX` `XX` `XX` `XX` | +| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `A6` `XX` `XX` `XX` `XX` `XX` `XX` `XX` | +| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `A7` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` | + + +## Notes diff --git a/preserves-text.md b/preserves-text.md new file mode 100644 index 0000000..258f4f2 --- /dev/null +++ b/preserves-text.md @@ -0,0 +1,302 @@ +--- +no_site_title: true +title: "Preserves: Text Syntax" +--- + +Tony Garnock-Jones +{{ site.version_date }}. Version {{ site.version }}. + + [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt + [spki]: http://world.std.com/~cme/html/spki.html + [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints + [LEB128]: https://en.wikipedia.org/wiki/LEB128 + [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map + [abnf]: https://tools.ietf.org/html/rfc7405 + [canonical]: canonical-binary.html + +*Preserves* is a data model, with associated serialization formats.
This +document defines one of those formats: a textual syntax for `Value`s +from the [Preserves data model](preserves.html) that is easy for people +to read and write. An [equivalent machine-oriented binary +syntax](preserves-binary.html) also exists. + +## Preliminaries + +The definition uses [case-sensitive ABNF][abnf]. + +ABNF allows easy definition of US-ASCII-based languages. However, +Preserves is a Unicode-based language. Therefore, we reinterpret ABNF as +a grammar for recognising sequences of Unicode code points. + +**Encoding.** Textual syntax for a `Value` *SHOULD* be encoded using +UTF-8 where possible. + +**Whitespace.** Whitespace is defined as any number of spaces, tabs, +carriage returns, line feeds, or commas. + + ws = *(%x20 / %x09 / newline / ",") + newline = CR / LF + +## Grammar + +Standalone documents may have trailing whitespace. + + Document = Value ws + +Any `Value` may be preceded by whitespace. + + Value = ws (Record / Collection / Atom / Embedded / Machine) + Collection = Sequence / Dictionary / Set + Atom = Boolean / Float / Double / SignedInteger / + String / ByteString / Symbol + +Each `Record` is an angle-bracket enclosed grouping of its +label-`Value` followed by its field-`Value`s. + + Record = "<" Value *Value ws ">" + +`Sequence`s are enclosed in square brackets. `Dictionary` values are +curly-brace-enclosed colon-separated pairs of values. `Set`s are +written as values enclosed by the tokens `#{` and +`}`.[^printing-collections] It is an error for a set to contain +duplicate elements or for a dictionary to contain duplicate keys. 
+ + Sequence = "[" *Value ws "]" + Dictionary = "{" *(Value ws ":" Value) ws "}" + Set = "#{" *Value ws "}" + + [^printing-collections]: **Implementation note.** When implementing + printing of `Value`s using the textual syntax, consider supporting + (a) optional pretty-printing with indentation, (b) optional + JSON-compatible print mode for that subset of `Value` that is + compatible with JSON, and (c) optional submodes for no commas, + commas separating, and commas terminating elements or key/value + pairs within a collection. + +`Boolean`s are the simple literal strings `#t` and `#f` for true and +false, respectively. + + Boolean = %s"#t" / %s"#f" + +Numeric data follow the +[JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with +the addition of a trailing “f” distinguishing `Float` from `Double` +values. `Float`s and `Double`s always have either a fractional part or +an exponent part, where `SignedInteger`s never have +either.[^reading-and-writing-floats-accurately] +[^arbitrary-precision-signedinteger] + + Float = flt %i"f" + Double = flt + SignedInteger = int + + digit1-9 = %x31-39 + nat = %x30 / ( digit1-9 *DIGIT ) + int = ["-"] nat + frac = "." 1*DIGIT + exp = %i"e" ["-"/"+"] 1*DIGIT + flt = int (frac exp / frac / exp) + + [^reading-and-writing-floats-accurately]: **Implementation note.** + Your language's standard library likely has a good routine for + converting between decimal notation and IEEE 754 floating-point. + However, if not, or if you are interested in the challenges of + accurately reading and writing floating point numbers, see the + excellent matched pair of 1990 papers by Clinger and Steele & + White, and a recent follow-up by Jaffer: + + Clinger, William D. ‘How to Read Floating Point Numbers + Accurately’. In Proc. PLDI. White Plains, New York, 1990. + . + + Steele, Guy L., Jr., and Jon L. White. ‘How to Print + Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains, + New York, 1990. . + + Jaffer, Aubrey. 
‘Easy Accurate Reading and Writing of + Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013. + . + + [^arbitrary-precision-signedinteger]: **Implementation note.** Be + aware when implementing reading and writing of `SignedInteger`s + that the data model *requires* arbitrary-precision integers. Your + implementation may (but, ideally, should not) truncate precision + when reading or writing a `SignedInteger`; however, if it does so, + it should (a) signal its client that truncation has occurred, and + (b) make it clear to the client that comparing such truncated + values for equality or ordering will not yield results that match + the expected semantics of the data model. + +`String`s are, +[as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly +escaped text surrounded by double quotes. The escaping rules are the +same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs] + + String = %x22 *char %x22 + char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG) + unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF + escape = %x5C ; \ + escaped = ( %x5C / ; \ reverse solidus U+005C + %x2F / ; / solidus U+002F + %x62 / ; b backspace U+0008 + %x66 / ; f form feed U+000C + %x6E / ; n line feed U+000A + %x72 / ; r carriage return U+000D + %x74 ) ; t tab U+0009 + + [^string-json-correspondence]: The grammar for `String` has the same + effect as the + [JSON](https://tools.ietf.org/html/rfc8259#section-7) grammar for + `string`. Some auxiliary definitions (e.g. `escaped`) are lifted + largely unmodified from the text of RFC 8259. + + [^escaping-surrogate-pairs]: In particular, note JSON's rules around + the use of surrogate pairs for code points not in the Basic + Multilingual Plane. We encourage implementations to avoid using + `\u` escapes when producing output, and instead to rely on the + UTF-8 encoding of the entire document to handle non-ASCII + codepoints correctly. 
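
The `String` grammar and the advice on `\u` escapes above can be sketched as a small Python writer. This is one possible approach under stated assumptions, not a prescribed algorithm: `write_string` is a hypothetical name, it emits only the escapes the grammar requires, and it leaves every other code point (including `|` and non-ASCII text) to the document's UTF-8 encoding.

```python
# Escapes that the `escaped` rule gives a short form for, plus the double quote.
_SHORT_ESCAPES = {
    "\\": "\\\\",
    '"': '\\"',
    "\b": "\\b",
    "\f": "\\f",
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
}


def write_string(s: str) -> str:
    """Hypothetical helper: render a Python str as text-syntax String."""
    parts = ['"']
    for ch in s:
        if ch in _SHORT_ESCAPES:
            parts.append(_SHORT_ESCAPES[ch])
        elif ord(ch) < 0x20:
            # Remaining control characters are outside `unescaped`,
            # so they need a \uXXXX escape.
            parts.append("\\u%04X" % ord(ch))
        else:
            # Everything else is written as-is and left to UTF-8.
            parts.append(ch)
    parts.append('"')
    return "".join(parts)
```

Since no `\u` escape is produced for code points outside the Basic Multilingual Plane, this writer never needs to emit surrogate pairs; a reader still has to accept them.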
+ +A `ByteString` may be written in any of three different forms. + +The first is similar to a `String`, but prepended with a hash sign +`#`. In addition, only Unicode code points overlapping with printable +7-bit ASCII are permitted unescaped inside such a `ByteString`; other +byte values must be escaped by prepending a two-digit hexadecimal +value with `\x`. + + ByteString = "#" %x22 *binchar %x22 + binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG) + binunescaped = %x20-21 / %x23-5B / %x5D-7E + +The second is as a sequence of pairs of hexadecimal digits interleaved +with whitespace and surrounded by `#x"` and `"`. + + ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22 + +The third is as a sequence of +[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved +with whitespace and surrounded by `#[` and `]`. Plain and URL-safe +Base64 characters are allowed. + + ByteString =/ "#[" *(ws / base64char) ws "]" + base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "=" + +A `Symbol` may be written in a “bare” form[^cf-sexp-token] so long as +it conforms to certain restrictions on the characters appearing in the +symbol. Alternatively, it may be written in a quoted form. The quoted +form is much the same as the syntax for `String`s, including embedded +escape syntax, except using a bar or pipe character (`|`) instead of a +double quote mark. + + Symbol = symstart *symcont / "|" *symchar "|" + symstart = ALPHA / sympunct / symustart + symcont = ALPHA / sympunct / symustart / symucont / DIGIT / "-" + sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" / + "?" / "_" / "=" / "+" / "/" / "." + symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG) + symustart = + symucont = + + [^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt] + definition of “token representation”, and with the + [R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4). 
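
Two of the three `ByteString` forms described above can be sketched as Python writers. The function names `write_bytestring` and `write_hex_bytestring` are hypothetical, and the choice to use `\x` escapes for every byte outside printable ASCII (rather than short escapes such as `\n`) is an assumption made here for simplicity.

```python
def write_bytestring(data: bytes) -> str:
    """Hypothetical helper: first form, a `#`-prefixed quoted string."""
    parts = ['#"']
    for b in data:
        if b == 0x22:
            parts.append('\\"')          # double quote must be escaped
        elif b == 0x5C:
            parts.append("\\\\")         # backslash must be escaped
        elif 0x20 <= b <= 0x7E:
            parts.append(chr(b))         # printable 7-bit ASCII, unescaped
        else:
            parts.append("\\x%02X" % b)  # any other byte: \x plus two hex digits
    parts.append('"')
    return "".join(parts)


def write_hex_bytestring(data: bytes) -> str:
    """Hypothetical helper: second form, hex digit pairs inside #x"..."."""
    return '#x"' + data.hex().upper() + '"'
```

For example, `write_hex_bytestring(b"\x01\xff")` yields `#x"01FF"`; the first form tends to be easier to read when the data is mostly ASCII text.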
+ +An `Embedded` is written as a `Value` chosen to represent the denoted +object, prefixed with `#!`. + + Embedded = "#!" Value + +Finally, any `Value` may be represented by escaping from the textual +syntax to the [machine-oriented binary syntax](preserves-binary.html) +by prefixing a `ByteString` containing the binary representation of the +`Value` with `#=`.[^rationale-switch-to-binary] +[^no-literal-binary-in-text] [^machine-value-annotations] + + Machine = "#=" ws ByteString + + [^rationale-switch-to-binary]: **Rationale.** The textual syntax + cannot express every `Value`: specifically, it cannot express the + several million floating-point NaNs, or the two floating-point + Infinities. Since the machine-oriented binary format for `Value`s + expresses each `Value` with precision, embedding binary `Value`s + solves the problem. + + [^no-literal-binary-in-text]: Every text is ultimately physically + stored as bytes; therefore, it might seem possible to escape to the + raw form of binary encoding from within a piece of textual syntax. + However, while bytes must be involved in any *representation* of + text, the text *itself* is logically a sequence of *code points* and + is not *intrinsically* a binary structure at all. It would be + incoherent to expect to be able to access the representation of the + text from within the text itself. + + [^machine-value-annotations]: Any text-syntax annotations preceding + the `#` are prepended to any binary-syntax annotations yielded by + decoding the `ByteString`. + +## Annotations + +When written down, a `Value` may have an associated sequence of +*annotations* carrying “out-of-band” contextual metadata about the +value. Each annotation is, in turn, a `Value`, and may itself have +annotations. The ordering of annotations attached to a `Value` is +significant. + + Value =/ ws "@" Value Value + +Each annotation is preceded by `@`; the underlying annotated value +follows its annotations. 
Here we extend only the syntactic nonterminal +named “`Value`” without altering the semantic class of `Value`s. + +**Comments.** Strings annotating a `Value` are conventionally +interpreted as comments associated with that value. Comments are +sufficiently common that special syntax exists for them. + + Value =/ ws + ";" *(%x00-09 / %x0B-0C / %x0E-10FFFF) newline + Value + +When written this way, everything between the `;` and the newline is +included in the string annotating the `Value`. + +**Equivalence.** Annotations appear within syntax denoting a `Value`; +however, the annotations are not part of the denoted value. They are +only part of the syntax. Annotations do not play a part in +equivalences and orderings of `Value`s. + +Reflective tools such as debuggers, user interfaces, and message +routers and relays---tools which process `Value`s generically---may +use annotated inputs to tailor their operation, or may insert +annotations in their outputs. By contrast, in ordinary programs, as a +rule of thumb, the presence, absence or content of an annotation +should not change the control flow or output of the program. +Annotations are data *describing* `Value`s, and are not in the domain +of any specific application of `Value`s. That is, an annotation will +almost never cause a non-reflective program to do anything observably +different. + +## Security Considerations + +**Whitespace.** The textual format allows arbitrary whitespace in many +positions. Consider optional restrictions on the amount of consecutive +whitespace that may appear. + +**Annotations.** Similarly, in modes where a `Value` is being read +while annotations are skipped, an endless sequence of annotations may +give an illusion of progress. + +## Acknowledgements + +The treatment of commas as whitespace in the text syntax is inspired +by the same feature of [EDN](https://github.com/edn-format/edn). 
+ +The text syntax for `Boolean`s, `Symbol`s, and `ByteString`s is +directly inspired by [Racket](https://racket-lang.org/)'s lexical +syntax. + + +## Notes diff --git a/preserves.md b/preserves.md index 074bce0..aff94aa 100644 --- a/preserves.md +++ b/preserves.md @@ -4,7 +4,7 @@ title: "Preserves: an Expressive Data Language" --- Tony Garnock-Jones -January 2022. Version 0.6.2. +{{ site.version_date }}. Version {{ site.version }}. [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt [spki]: http://world.std.com/~cme/html/spki.html @@ -14,29 +14,35 @@ January 2022. Version 0.6.2. [abnf]: https://tools.ietf.org/html/rfc7405 [canonical]: canonical-binary.html -This document proposes a data model and serialization format called -*Preserves*. +*Preserves* is a data model, with associated serialization formats. -Preserves supports *records* with user-defined *labels*, embedded -*references*, and the usual suite of atomic and compound data types, -including *binary* data as a distinct type from text strings. Its -*annotations* allow separation of data from metadata such as -[comments](conventions.html#comments), trace information, and -provenance information. +It supports *records* with user-defined *labels*, embedded *references*, +and the usual suite of atomic and compound data types, including +*binary* data as a distinct type from text strings. Its *annotations* +allow separation of data from metadata such as +[comments](conventions.html#comments), trace information, and provenance +information. Preserves departs from many other data languages in defining how to *compare* two values. Comparison is based on the data model, not on syntax or on data structures of any particular implementation language. -## Starting with Semantics +This document defines the core semantics and data model of Preserves and +presents a handful of examples. 
Two other core documents define -Taking inspiration from functional programming, we start with a -definition of the *values* that we want to work with and give them -meaning independent of their syntax. + - a [human-readable text syntax](preserves-text.html), and + - a [machine-oriented binary syntax](preserves-binary.html) - -Our `Value`s fall into two broad categories: *atomic* and *compound* +for the Preserves data model. + +## Values + +Preserves *values* are given meaning independent of their syntax. We +will write "`Value`" when we mean the set of all Preserves values or an +element of that set. + +`Value`s fall into two broad categories: *atomic* and *compound* data. Every `Value` is finite and non-cyclic. Embedded values, called `Embedded`s, are a third, special-case category. @@ -76,20 +82,23 @@ neither is less than the other according to the total order. ### Signed integers. -A `SignedInteger` is a signed integer of arbitrary width. +A `SignedInteger` is an arbitrarily-large signed integer. `SignedInteger`s are compared as mathematical integers. ### Unicode strings. A `String` is a sequence of Unicode -[code-point](http://www.unicode.org/glossary/#code_point)s. `String`s -are compared lexicographically, code-point by +[code-point](http://www.unicode.org/glossary/#code_point)s.[^nul-permitted] +`String`s are compared lexicographically, code-point by code-point.[^utf8-is-awesome] [^utf8-is-awesome]: Happily, the design of UTF-8 is such that this gives the same result as a lexicographic byte-by-byte comparison of the UTF-8 encoding of a string! + [^nul-permitted]: All Unicode code-points are permitted, including NUL + (code point zero). + ### Binary data. A `ByteString` is a sequence of octets. `ByteString`s are compared @@ -111,11 +120,11 @@ less-than the “true” value. `Float`s and `Double`s are single- and double-precision IEEE 754 floating-point values, respectively. 
`Float`s, `Double`s and -`SignedInteger`s are disjoint; by the rules [above](#total-order), -every `Float` is less than every `Double`, and every `SignedInteger` -is greater than both. Two `Float`s or two `Double`s are to be ordered -by the `totalOrder` predicate defined in section 5.10 of -[IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935). +`SignedInteger`s are disjoint; by the rules [above](#total-order), every +`Float` is less than every `Double`, and every `SignedInteger` is +greater than both. Two `Float`s or two `Double`s are to be ordered by +the `totalOrder` predicate defined in section 5.10 of [IEEE Std +754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935). ### Records. @@ -200,457 +209,13 @@ URL, compared according to usually be represented as ordinary `Value`s, in which case the ordinary rules for comparing `Value`s will apply. -## Textual Syntax - -Now we have discussed `Value`s and their meanings, we may turn to -techniques for *representing* `Value`s for communication or storage. - -In this section, we use [case-sensitive ABNF][abnf] to define a -textual syntax that is easy for people to read and -write.[^json-superset] Most of the examples in this document are -written using this syntax. In the following section, we will define an -equivalent compact machine-readable syntax. - - [^json-superset]: The grammar of the textual syntax is a superset of - JSON, with the slightly unusual feature that `true`, `false`, and - `null` are all read as `Symbol`s, and that `SignedInteger`s are - never read as `Double`s. - - The following [schema](./preserves-schema.html) definitions match - exactly the JSON subset of a Preserves input: - - version 1 . - JSON = @string string / @integer int / @double double / @boolean JSONBoolean / @null =null - / @array [JSON ...] / @object { string: JSON ...:... } . - JSONBoolean = =true / =false . - -### Character set. - -[ABNF][abnf] allows easy definition of US-ASCII-based languages. 
-However, Preserves is a Unicode-based language. Therefore, we -reinterpret ABNF as a grammar for recognising sequences of Unicode -code points. - -Textual syntax for a `Value` *SHOULD* be encoded using UTF-8 where -possible. - -### Whitespace. - -Whitespace is defined as any number of spaces, tabs, carriage returns, -line feeds, or commas. - - ws = *(%x20 / %x09 / newline / ",") - newline = CR / LF - -### Grammar. - -Standalone documents may have trailing whitespace. - - Document = Value ws - -Any `Value` may be preceded by whitespace. - - Value = ws (Record / Collection / Atom / Embedded / Compact) - Collection = Sequence / Dictionary / Set - Atom = Boolean / Float / Double / SignedInteger / - String / ByteString / Symbol - -Each `Record` is an angle-bracket enclosed grouping of its -label-`Value` followed by its field-`Value`s. - - Record = "<" Value *Value ws ">" - -`Sequence`s are enclosed in square brackets. `Dictionary` values are -curly-brace-enclosed colon-separated pairs of values. `Set`s are -written as values enclosed by the tokens `#{` and -`}`.[^printing-collections] It is an error for a set to contain -duplicate elements or for a dictionary to contain duplicate keys. - - Sequence = "[" *Value ws "]" - Dictionary = "{" *(Value ws ":" Value) ws "}" - Set = "#{" *Value ws "}" - - [^printing-collections]: **Implementation note.** When implementing - printing of `Value`s using the textual syntax, consider supporting - (a) optional pretty-printing with indentation, (b) optional - JSON-compatible print mode for that subset of `Value` that is - compatible with JSON, and (c) optional submodes for no commas, - commas separating, and commas terminating elements or key/value - pairs within a collection. - -`Boolean`s are the simple literal strings `#t` and `#f` for true and -false, respectively. 
- - Boolean = %s"#t" / %s"#f" - -Numeric data follow the -[JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with -the addition of a trailing “f” distinguishing `Float` from `Double` -values. `Float`s and `Double`s always have either a fractional part or -an exponent part, where `SignedInteger`s never have -either.[^reading-and-writing-floats-accurately] -[^arbitrary-precision-signedinteger] - - Float = flt %i"f" - Double = flt - SignedInteger = int - - digit1-9 = %x31-39 - nat = %x30 / ( digit1-9 *DIGIT ) - int = ["-"] nat - frac = "." 1*DIGIT - exp = %i"e" ["-"/"+"] 1*DIGIT - flt = int (frac exp / frac / exp) - - [^reading-and-writing-floats-accurately]: **Implementation note.** - Your language's standard library likely has a good routine for - converting between decimal notation and IEEE 754 floating-point. - However, if not, or if you are interested in the challenges of - accurately reading and writing floating point numbers, see the - excellent matched pair of 1990 papers by Clinger and Steele & - White, and a recent follow-up by Jaffer: - - Clinger, William D. ‘How to Read Floating Point Numbers - Accurately’. In Proc. PLDI. White Plains, New York, 1990. - . - - Steele, Guy L., Jr., and Jon L. White. ‘How to Print - Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains, - New York, 1990. . - - Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of - Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013. - . - - [^arbitrary-precision-signedinteger]: **Implementation note.** Be - aware when implementing reading and writing of `SignedInteger`s - that the data model *requires* arbitrary-precision integers. 
Your - implementation may (but, ideally, should not) truncate precision - when reading or writing a `SignedInteger`; however, if it does so, - it should (a) signal its client that truncation has occurred, and - (b) make it clear to the client that comparing such truncated - values for equality or ordering will not yield results that match - the expected semantics of the data model. - -`String`s are, -[as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly -escaped text surrounded by double quotes. The escaping rules are the -same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs] - - String = %x22 *char %x22 - char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG) - unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF - escape = %x5C ; \ - escaped = ( %x5C / ; \ reverse solidus U+005C - %x2F / ; / solidus U+002F - %x62 / ; b backspace U+0008 - %x66 / ; f form feed U+000C - %x6E / ; n line feed U+000A - %x72 / ; r carriage return U+000D - %x74 ) ; t tab U+0009 - - [^string-json-correspondence]: The grammar for `String` has the same - effect as the - [JSON](https://tools.ietf.org/html/rfc8259#section-7) grammar for - `string`. Some auxiliary definitions (e.g. `escaped`) are lifted - largely unmodified from the text of RFC 8259. - - [^escaping-surrogate-pairs]: In particular, note JSON's rules around - the use of surrogate pairs for code points not in the Basic - Multilingual Plane. We encourage implementations to avoid using - `\u` escapes when producing output, and instead to rely on the - UTF-8 encoding of the entire document to handle non-ASCII - codepoints correctly. - -A `ByteString` may be written in any of three different forms. - -The first is similar to a `String`, but prepended with a hash sign -`#`. 
In addition, only Unicode code points overlapping with printable -7-bit ASCII are permitted unescaped inside such a `ByteString`; other -byte values must be escaped by prepending a two-digit hexadecimal -value with `\x`. - - ByteString = "#" %x22 *binchar %x22 - binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG) - binunescaped = %x20-21 / %x23-5B / %x5D-7E - -The second is as a sequence of pairs of hexadecimal digits interleaved -with whitespace and surrounded by `#x"` and `"`. - - ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22 - -The third is as a sequence of -[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved -with whitespace and surrounded by `#[` and `]`. Plain and URL-safe -Base64 characters are allowed. - - ByteString =/ "#[" *(ws / base64char) ws "]" - base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "=" - -A `Symbol` may be written in a “bare” form[^cf-sexp-token] so long as -it conforms to certain restrictions on the characters appearing in the -symbol. Alternatively, it may be written in a quoted form. The quoted -form is much the same as the syntax for `String`s, including embedded -escape syntax, except using a bar or pipe character (`|`) instead of a -double quote mark. - - Symbol = symstart *symcont / "|" *symchar "|" - symstart = ALPHA / sympunct / symustart - symcont = ALPHA / sympunct / symustart / symucont / DIGIT / "-" - sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" / - "?" / "_" / "=" / "+" / "/" / "." - symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG) - symustart = - symucont = - - [^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt] - definition of “token representation”, and with the - [R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4). - -An `Embedded` is written as a `Value` chosen to represent the denoted -object, prefixed with `#!`. - - Embedded = "#!" 
Value - -Finally, any `Value` may be represented by escaping from the textual -syntax to the [compact binary syntax](#compact-binary-syntax) by -prefixing a `ByteString` containing the binary representation of the -`Value` with `#=`.[^rationale-switch-to-binary] -[^no-literal-binary-in-text] [^compact-value-annotations] - - Compact = "#=" ws ByteString - - [^rationale-switch-to-binary]: **Rationale.** The textual syntax - cannot express every `Value`: specifically, it cannot express the - several million floating-point NaNs, or the two floating-point - Infinities. Since the compact binary format for `Value`s expresses - each `Value` with precision, embedding binary `Value`s solves the - problem. - - [^no-literal-binary-in-text]: Every text is ultimately physically - stored as bytes; therefore, it might seem possible to escape to - the raw binary form of compact binary encoding from within a - pieces of textual syntax. However, while bytes must be involved in - any *representation* of text, the text *itself* is logically a - sequence of *code points* and is not *intrinsically* a binary - structure at all. It would be incoherent to expect to be able to - access the representation of the text from within the text itself. - - [^compact-value-annotations]: Any text-syntax annotations preceding - the `#` are prepended to any binary-syntax annotations yielded by - decoding the `ByteString`. - -### Annotations. - -**Syntax.** When written down, a `Value` may have an associated -sequence of *annotations* carrying “out-of-band” contextual metadata -about the value. Each annotation is, in turn, a `Value`, and may -itself have annotations. The ordering of annotations attached to a -`Value` is significant. - - Value =/ ws "@" Value Value - -Each annotation is preceded by `@`; the underlying annotated value -follows its annotations. Here we extend only the syntactic nonterminal -named “`Value`” without altering the semantic class of `Value`s. 
- -**Comments.** Strings annotating a `Value` are conventionally -interpreted as comments associated with that value. Comments are -sufficiently common that special syntax exists for them. - - Value =/ ws - ";" *(%x00-09 / %x0B-0C / %x0E-10FFFF) newline - Value - -When written this way, everything between the `;` and the newline is -included in the string annotating the `Value`. - -**Equivalence.** Annotations appear within syntax denoting a `Value`; -however, the annotations are not part of the denoted value. They are -only part of the syntax. Annotations do not play a part in -equivalences and orderings of `Value`s. - -Reflective tools such as debuggers, user interfaces, and message -routers and relays---tools which process `Value`s generically---may -use annotated inputs to tailor their operation, or may insert -annotations in their outputs. By contrast, in ordinary programs, as a -rule of thumb, the presence, absence or content of an annotation -should not change the control flow or output of the program. -Annotations are data *describing* `Value`s, and are not in the domain -of any specific application of `Value`s. That is, an annotation will -almost never cause a non-reflective program to do anything observably -different. - -## Compact Binary Syntax - -A `Repr` is a binary-syntax encoding, or representation, of a `Value`. -For a value `v`, we write `«v»` for the `Repr` of v. - -### Type and Length representation. - -Each `Repr` starts with a tag byte, describing the kind of information -represented. Depending on the tag, a length indicator, further encoded -information, and/or an ending tag may follow. - - tag (simple atomic data and small integers) - tag ++ binarydata (most integers) - tag ++ length ++ binarydata (large integers, strings, symbols, and binary) - tag ++ repr ++ ... ++ endtag (compound data) - -The unique end tag is byte value `0x84`. 
- -If present after a tag, the length of a following piece of binary data -is formatted as a [base 128 varint][varint].[^see-also-leb128] We -write `varint(m)` for the varint-encoding of `m`. Quoting the -[Google Protocol Buffers][varint] definition, - - [^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned - integers. Varints and LEB128-encoded integers differ only for - signed integers, which are not used in Preserves. - -> Each byte in a varint, except the last byte, has the most -> significant bit (msb) set – this indicates that there are further -> bytes to come. The lower 7 bits of each byte are used to store the -> two's complement representation of the number in groups of 7 bits, -> least significant group first. - -The following table illustrates varint-encoding. - -| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes | -| ------ | ------------------- | ------------ | -| 15 | `0001111` | 15 | -| 300 | `0000010 0101100` | 172 2 | -| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 | - -It is an error for a varint-encoded `m` in a `Repr` to be anything -other than the unique shortest encoding for that `m`. That is, a -varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0. - -### Records, Sequences, Sets and Dictionaries. - - «<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84] - «[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84] - «#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84] - «{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84] - -There is *no* ordering requirement on the `E_i` elements or -`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any -order. However, the `E_i` and `K_i` *MUST* be pairwise distinct.
In -addition, implementations *SHOULD* default to writing set elements and -dictionary key/value pairs in order sorted lexicographically by their -`Repr`s[^not-sorted-semantically], and *MAY* offer the option of -serializing in some other implementation-defined order. - - [^no-sorting-rationale]: In the BitTorrent encoding format, - [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding), - dictionary key/value pairs must be sorted by key. This is a - necessary step for ensuring serialization of `Value`s is - canonical. We do not require that key/value pairs (or set - elements) be in sorted order for serialized `Value`s; however, a - [canonical form][canonical] for `Repr`s does exist where a sorted - ordering is required. - - [^not-sorted-semantically]: It's important to note that the sort - ordering for writing out set elements and dictionary key/value - pairs is *not* the same as the sort ordering implied by the - semantic ordering of those elements or keys. For example, the - `Repr` of a negative number very far from zero will start with - byte that is *greater* than the byte which starts the `Repr` of - zero, making it sort lexicographically later by `Repr`, despite - being semantically *less than* zero. - - **Rationale**. This is for ease-of-implementation reasons: not all - languages can easily represent sorted sets or sorted dictionaries, - but encoding and then sorting byte strings is much more likely to - be within easy reach. - -### SignedIntegers. - - «x» when x ∈ SignedInteger = [0xB0] ++ varint(m) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m>16 - ([0xA0] + m - 1) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m≤16 - ([0xA0] + x) if (-3≤x≤-1) - ([0x90] + x) if ( 0≤x≤12) - where m = |intbytes(x)| - -Integers in the range [-3,12] are compactly represented with tags -between `0x90` and `0x9F` because they are so frequently used. -Integers up to 16 bytes long are represented with a single-byte tag -encoding the length of the integer. 
Larger integers are represented -with an explicit varint length. Every `SignedInteger` *MUST* be -represented with its shortest possible encoding. - -The function `intbytes(x)` gives the big-endian two's-complement -binary representation of `x`, taking exactly as many whole bytes as -needed to unambiguously identify the value and its sign, and `m = -|intbytes(x)|`. The most-significant bit in the first byte in -`intbytes(x)` is the sign bit.[^zero-intbytes] For -example, - - «87112285931760246646623899502532662132736» - = B0 12 01 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00 - 00 00 - - «-257» = A1 FE FF «-3» = 9D «128» = A1 00 80 - «-256» = A1 FF 00 «-2» = 9E «255» = A1 00 FF - «-255» = A1 FF 01 «-1» = 9F «256» = A1 01 00 - «-254» = A1 FF 02 «0» = 90 «32767» = A1 7F FF - «-129» = A1 FF 7F «1» = 91 «32768» = A2 00 80 00 - «-128» = A0 80 «12» = 9C «65535» = A2 00 FF FF - «-127» = A0 81 «13» = A0 0D «65536» = A2 01 00 00 - «-4» = A0 FC «127» = A0 7F «131072» = A2 02 00 00 - - [^zero-intbytes]: The value 0 needs zero bytes to identify the - value, so `intbytes(0)` is the empty byte string. Non-zero values - need at least one byte. - -### Strings, ByteStrings and Symbols. - -Syntax for these three types varies only in the tag used. For `String` -and `Symbol`, the data following the tag is a UTF-8 encoding of the -`Value`'s code points, while for `ByteString` it is the raw data -contained within the `Value` unmodified. - - «S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ String - [0xB2] ++ varint(|S|) ++ S if S ∈ ByteString - [0xB3] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ Symbol - -### Booleans. - - «#f» = [0x80] - «#t» = [0x81] - -### Floats and Doubles. - - «F» when F ∈ Float = [0x82] ++ binary32(F) - «D» when D ∈ Double = [0x83] ++ binary64(D) - -The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and -8-byte IEEE 754 binary representations of `F` and `D`, respectively. - -### Embeddeds. 
- -The `Repr` of an `Embedded` is the `Repr` of a `Value` chosen to -represent the denoted object, prefixed with `[0x86]`. - - «#!V» = [0x86] ++ «V» - -### Annotations. - -To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with -`[0x85] ++ «v»`. For example, the `Repr` corresponding to textual -syntax `@a@b[]`, i.e. an empty sequence annotated with two symbols, -`a` and `b`, is - - «@a @b []» - = [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]» - = [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84] - ## Examples +The definitions above are independent of any particular concrete syntax. +The examples of `Value`s that follow are written using [the Preserves +text syntax](preserves-text.html), and the example encoded byte +sequences use [the Preserves binary encoding](preserves-binary.html). + ### Ordering. The total ordering specified [above](#total-order) means that the following statements are true: @@ -720,10 +285,23 @@ encodes to ### JSON examples. -The examples from -[RFC 8259](https://tools.ietf.org/html/rfc8259#section-13) read as -valid Preserves, though the JSON literals `true`, `false` and `null` -read as `Symbol`s. The first example: +Preserves text syntax is a superset of JSON, so the examples from [RFC +8259](https://tools.ietf.org/html/rfc8259#section-13) read as valid +Preserves. + +The JSON literals `true`, `false` and `null` all read as `Symbol`s, and +JSON numbers read (unambiguously) either as `SignedInteger`s or as +`Double`s.[^json-superset] + + [^json-superset]: The following [schema](./preserves-schema.html) + definitions match exactly the JSON subset of a Preserves input: + + version 1 . + JSON = @string string / @integer int / @double double / @boolean JSONBoolean / @null =null + / @array [JSON ...] / @object { string: JSON ...:... } . + JSONBoolean = =true / =false . + +The first RFC 8259 example: { "Image": { @@ -740,7 +318,8 @@ read as `Symbol`s. 
The first example: } } -encodes to binary as follows: +when read using the Preserves text syntax encodes via the binary syntax +as follows: B7 B1 05 "Image" @@ -764,7 +343,7 @@ encodes to binary as follows: 84 84 -and the second example: +The second RFC 8259 example: [ { @@ -814,89 +393,5 @@ encodes to binary as follows: 84 84 -## Security Considerations - -**Whitespace.** The textual format allows arbitrary whitespace in many -positions. Consider optional restrictions on the amount of consecutive -whitespace that may appear. - -**Annotations.** Similarly, in modes where a `Value` is being read -while annotations are skipped, an endless sequence of annotations may -give an illusion of progress. - -**Canonical form for cryptographic hashing and signing.** No canonical -textual encoding of a `Value` is specified. A -[canonical form][canonical] exists for binary encoded `Value`s, and -implementations *SHOULD* produce canonical binary encodings by -default; however, an implementation *MAY* permit two serializations of -the same `Value` to yield different binary `Repr`s. - -## Acknowledgements - -The treatment of commas as whitespace in the text syntax is inspired -by the same feature of [EDN](https://github.com/edn-format/edn). - -The text syntax for `Boolean`s, `Symbol`s, and `ByteString`s is -directly inspired by [Racket](https://racket-lang.org/)'s lexical -syntax. - -## Appendix. Autodetection of textual or binary syntax - -Every tag byte in a binary Preserves `Document` falls within the range -[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation -bytes*, and will never occur as the first byte of a UTF-8 encoded code -point. This means no binary-encoded document can be misinterpreted as -valid UTF-8. - -Conversely, a UTF-8 document must start with a valid codepoint, -meaning in particular that it must not start with a byte in the range -[`0x80`, `0xBF`]. 
This means that no UTF-8 encoded textual-syntax -Preserves document can be misinterpreted as a binary-syntax document. - -Examination of the top two bits of the first byte of a document gives -its syntax: if the top two bits are `10`, it should be interpreted as -a binary-syntax document; otherwise, it should be interpreted as text. - -## Appendix. Table of tag values - - 80 - False - 81 - True - 82 - Float - 83 - Double - 84 - End marker - 85 - Annotation - 86 - Embedded - (8x) RESERVED 87-8F - - 9x - Small integers 0..12,-3..-1 - An - Medium integers, (n+1) bytes long - B0 - Large integers, variable length - B1 - String - B2 - ByteString - B3 - Symbol - - B4 - Record - B5 - Sequence - B6 - Set - B7 - Dictionary - -## Appendix. Binary SignedInteger representation - -Languages that provide fixed-width machine word types may find the -following table useful in encoding and decoding binary `SignedInteger` -values. - -| Integer range | Bytes required | Encoding (hex) | -| --- | --- | --- | -| -3 ≤ n ≤ 12 | 1 | `9X` | -| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `A0` `XX` | -| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `A1` `XX` `XX` | -| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `A2` `XX` `XX` `XX` | -| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `A3` `XX` `XX` `XX` `XX` | -| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `A4` `XX` `XX` `XX` `XX` `XX` | -| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `A5` `XX` `XX` `XX` `XX` `XX` `XX` | -| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `A6` `XX` `XX` `XX` `XX` `XX` `XX` `XX` | -| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `A7` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` | - ## Notes