diff --git a/syndicate/mc/preserve.md b/syndicate/mc/preserve.md
index 10e22de..7c035df 100644
--- a/syndicate/mc/preserve.md
+++ b/syndicate/mc/preserve.md
@@ -5,12 +5,15 @@
 body { font-family: palatino, "Palatino Linotype", "Palatino LT STD", "URW Palladio L", "TeX Gyre Pagella", serif; }
 @media screen {
  body { padding-top: 2rem; max-width: 40em; margin: auto; font-size: 120%; }
+ hr { display: none; }
 }
 @media print {
- @page { margin: 1.5cm; }
+ @page { margin: 4rem 4rem 4.333rem 3rem; }
  body { margin-left: 2rem; margin-right 2rem; }
- h1, h2 { page-break-before: always }
+ h1, h2 { page-break-before: always; margin-top: 0; }
  h1:first-of-type, h2:first-of-type { page-break-before: auto; }
+ hr+* { page-break-before: always; margin-top: 0; }
+ hr { display: none; }
 }
 h1, h2, h3, h4, h5, h6 { margin-left: -1rem; color: #4f81bd; }
 h2 { border-bottom: solid #4f81bd 1px; }
@@ -41,9 +44,9 @@ Preserves also supports the usual suite of atomic and compound data
 types, in particular including *binary* data as a distinct type from
 text strings.
 
-Finally, Preserves defines precisely how to compare two values with
-each other in terms of the data model, not in terms of syntax or of
-the data structures of any particular implementation language.
+Finally, Preserves defines precisely how to *compare* two values.
+Comparison is based on the data model, not on syntax or on data
+structures of any particular implementation language.
 
  [^macro-expressiveness]: By "expressive" I mean *macro-expressive* in
  the sense of Felleisen's 1991 paper, "On the Expressive Power
@@ -66,6 +69,9 @@ definition of the *values* that we want to work with and give them
 meaning independent of their syntax. We will treat syntax separately,
 later in this document.
 
+Our `Value`s fall into two broad categories: *atomic* and *compound*
+data.
+
     Value = Atom
           | Compound
 
@@ -82,14 +88,6 @@ later in this document.
           | Set
           | Dictionary
 
-Our `Value`s fall into two broad categories: *atomic* and *compound*
-data.[^inspiration]
-
- [^inspiration]: This design was loosely inspired by S-expressions,
- as seen in Lisp, Scheme, [SPKI/SDSI][sexp.txt], and many others,
- as well as by the ML type system, as seen in languages such as
- SML, OCaml, Haskell, Rust, and many others.
-
 **Total order.** As we go, we will incrementally specify a total
 order over `Value`s. Two values of the same kind are compared
 using kind-specific rules. The ordering among
@@ -126,10 +124,10 @@ examples of `SignedInteger`s using standard mathematical notation.
 ### Unicode strings.
 
 A `String` is a sequence of Unicode
-[code-point](http://www.unicode.org/glossary/#code_point)s. Two
-`String`s are compared lexicographically, code-point by
+[code-point](http://www.unicode.org/glossary/#code_point)s. `String`s
+are compared lexicographically, code-point by
 code-point.[^utf8-is-awesome] We will write examples of `String`s as
-text surrounded by double-quotes “`"`” using a monospace font.
+text surrounded by quotes “`"`”.
 
  [^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
  gives the same result as a lexicographic byte-by-byte comparison
@@ -139,33 +137,27 @@ text surrounded by double-quotes “`"`” using a monospace font.
 the string containing the three Unicode code-points `z` (0x7A), `水`
 (0x6C34) and `𝄞` (0x1D11E); `""`, the empty string.
 
-**Normalization forms.** Unicode defines multiple
-[normalization forms](http://unicode.org/reports/tr15/) for text. No
-particular normalization form is required for `String`s;
-[see below](#normalization-forms).
-
 ### Binary data.
 
-A `ByteString` is an ordered sequence of zero or more integers in the
-inclusive range [0..255]. `ByteString`s are compared
-lexicographically, byte by byte. We will only write examples of
-`ByteString`s that contain bytes mapping to printable ASCII
-characters, using “`#"`” as an opening quote mark and “`"`” as a
-closing quote mark.
+A `ByteString` is an ordered sequence of zero or more eight-bit bytes.
+`ByteString`s are compared lexicographically. We will only write
+examples of `ByteString`s that contain bytes denoting printable ASCII
+characters, using “`#"`” as an open-quote and “`"`” as a close-quote
+mark.
 
 **Examples.** The `ByteString` containing the integers 65, 66 and 67
 (corresponding to ASCII characters `A`, `B` and `C`) is written as
 `#"ABC"`. The empty `ByteString` is written as `#""`. **N.B.** Despite
 appearances, these are *binary* data.
 
-### Symbols or identifiers.
+### Symbols.
 
 Programming languages like Lisp and Prolog frequently use string-like
 values called *symbols*. Here, a `Symbol` is, like a `String`, a
-sequence of Unicode code-points, intended to represent an identifier
-of some kind. `Symbol`s are also compared lexicographically by
-code-point. We will write examples including only non-empty sequences
-of non-whitespace characters, using a monospace font without quotation
+sequence of Unicode code-points representing an identifier of some
+kind. `Symbol`s are also compared lexicographically by code-point. We
+will write examples including only non-empty sequences of
+non-whitespace characters, using a monospace font without quotation
 marks.
 
 **Examples.** `hello-world`; `utf8-string`; `exact-integer?`.
@@ -176,8 +168,6 @@ There are exactly two `Boolean` values, “false” and “true”. The
 “false” value compares less-than the “true” value. We write `#f` for
 “false”, and `#t` for “true”.
 
-**Examples.** `#f`; `#t`.
-
 ### IEEE floating-point values.
 
 A `Float` is a single-precision IEEE 754 floating-point value; a
@@ -345,6 +335,8 @@ representation:[^some-encodings-unused]
 
 Each specific type of data defines its own rules for this format.
 
+---
+
 #### Encoding data of known length (format B)
 
 A `Repr` where the length of the `Value` to be encoded is variable but
@@ -416,7 +408,7 @@ Applications *SHOULD* prefer the known-length format for encoding
 #### Application-specific short form for labels
 
 Any given protocol using Preserves may additionally define an
-interpretation for `n ∈ {0,1,2}`, mapping each *short form label
+interpretation for `n`∈{0,1,2}, mapping each *short form label
 number* `n` to a specific record label. When encoding `m` fields with
 short form label number `n`, format B becomes
 
@@ -583,7 +575,7 @@ short form label number 0 to label `discard`, 1 to `capture`, and 2 to
 | `(observe (speak (discard) (capture (discard))))` | A1 B3 75 73 70 65 61 6B 80 91 80 |
 | `[1 2 3 4]` (format B) | C4 11 12 13 14 |
 | `[1 2 3 4]` (format C) | 2C 11 12 13 14 3C |
-| `[-2 -1 0 1]` | C4 1E 1F 40 11 |
+| `[-2 -1 0 1]` | C4 1E 1F 10 11 |
 | `"hello"` (format B) | 55 68 65 6C 6C 6F |
 | `"hello"` (format C, 2 chunks) | 25 52 68 65 53 6C 6C 6F 35 |
 | `"hello"` (format C, 5 chunks) | 25 52 68 65 52 6C 6C 50 50 51 6F 35 |
@@ -708,20 +700,20 @@ form label number 1 were chosen, the second example above, `(mime
 text/plain "ABC")`, would be encoded with "92" in place of "B3 74 6D
 69 6D 65".
 
-### Text
+### Unicode normalization forms
 
-#### Normalization forms
-
-In order for users to unambiguously signal or require a particular
-[normalization form](http://unicode.org/reports/tr15/), we define a
-`NormalizedString`, which is a `Record` labelled with
+Unicode defines multiple
+[normalization forms](http://unicode.org/reports/tr15/) for text.
+While no particular normalization form is required for `String`s,
+users may need to unambiguously signal or require a particular
+normalization form. A `NormalizedString` is a `Record` labelled with
 `unicode-normalization` and having two fields, the first of which is a
 `Symbol` specifying the normalization form used (e.g.
 `nfc`, `nfd`, `nfkc`, `nfkd`), and the second of which is a `String`
 whose underlying code point representation *MUST* be normalized
 according to the named normalization form.
 
-#### IRIs (URIs, URLs, URNs, etc.)
+### IRIs (URIs, URLs, URNs, etc.)
 
 An `IRI` is a `Record` labelled with `iri` and having one field, a
 `String` which is the IRI itself and which *MUST* be a valid absolute