---
title: "Conventions for Common Data Types"
---
The `Value` data type is essentially an S-Expression, able to
represent semi-structured data over `ByteString`, `String`,
`SignedInteger` atoms and so on.[^why-not-spki-sexps]

[^why-not-spki-sexps]: Rivest's S-Expressions are in many ways
    similar to Preserves. However, while they include binary data and
    sequences, and an obvious equivalence for them exists, they lack
    numbers *per se* as well as any kind of unordered structure such
    as sets or maps. In addition, while “display hints” allow
    labelling of binary data with an intended interpretation, they
    cannot be attached to any other kind of structure, and the “hint”
    itself can only be a binary blob.

However, users need a wide variety of data types for representing
domain-specific values such as various kinds of encoded and normalized
text, calendrical values, machine words, and so on.

Appropriately-labelled `Record`s denote these domain-specific data
types.[^why-dictionaries]

[^why-dictionaries]: Given `Record`'s existence, it may seem odd
    that `Dictionary`, `Set`, `Float`, etc. are given special
    treatment. Preserves aims to offer a useful basic equivalence
    predicate to programmers, and so if a data type demands a special
    equivalence predicate, as `Dictionary`, `Set` and `Float` all do,
    then the type should be included in the base language. Otherwise,
    it can be represented as a `Record` and treated separately.
    `Boolean`, `String` and `Symbol` are seeming exceptions. The first
    two merit inclusion because of their cultural importance, while
    `Symbol`s are included to allow their use as `Record` labels.
    Primitive `Symbol` support avoids a bootstrapping issue.

All of these conventions are optional. They form a layer atop the core
`Value` structure. Non-domain-specific tools do not in general need to
treat them specially.

**Validity.** Many of the labels we will describe in this section come
with side-conditions on the contents of labelled `Record`s. It is
possible to construct an instance of `Value` that violates these
side-conditions without ceasing to be a `Value` or becoming
unrepresentable. However, we say that such a `Value` is *invalid*
because it fails to honour the necessary side-conditions.

Implementations *SHOULD* allow two modes of working: one which
treats all `Value`s identically, without regard for side-conditions,
and one which enforces validity (i.e. side-conditions) when reading,
writing, or constructing `Value`s.
## IOLists.
Inspired by Erlang's notions of
[`iolist()` and `iodata()`](http://erlang.org/doc/reference_manual/typespec.html),
an `IOList` is any tree constructed from `ByteString`s and
`Sequence`s. Formally, an `IOList` is either a `ByteString` or a
`Sequence` of `IOList`s.

`IOList`s can be useful for
[vectored I/O](https://en.wikipedia.org/wiki/Vectored_I/O).
Additionally, the flexibility of `IOList` trees allows annotation of
interior portions of a tree.
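
The recursive definition above can be sketched in Python. This is only an illustration under a hypothetical model in which a `ByteString` is a Python `bytes` object and a `Sequence` is a Python `list`; the names `is_iolist`, `iter_chunks` and `flatten` are invented for this sketch and are not part of any Preserves implementation.

```python
from typing import Iterator, Union

# Hypothetical model: ByteString -> bytes, Sequence -> list.
IOList = Union[bytes, list]


def is_iolist(value: object) -> bool:
    """Return True if `value` is a ByteString or a Sequence of IOLists."""
    if isinstance(value, bytes):
        return True
    if isinstance(value, list):
        return all(is_iolist(item) for item in value)
    return False


def iter_chunks(value: IOList) -> Iterator[bytes]:
    """Yield the ByteString leaves of an IOList in left-to-right order."""
    if isinstance(value, bytes):
        yield value
    else:
        for item in value:
            yield from iter_chunks(item)


def flatten(value: IOList) -> bytes:
    """Concatenate all leaves into a single ByteString."""
    return b"".join(iter_chunks(value))
```

Keeping the leaves separate, as `iter_chunks` does, is what makes the structure suitable for vectored I/O: the chunks can be handed to a gather-write call without first being copied into one buffer.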
## Comments.
`String` values used as annotations are conventionally interpreted as
comments. Special syntax exists for such string annotations, though
the usual `@`-prefixed annotation notation can also be used.

    ;I am a comment for the Dictionary
    {
      ;I am a comment for the key
      key: ;I am a comment for the value
           value
    }

    ;I am a comment for this entire IOList
    [
      #x"00010203"
      ;I am a comment for the middle half of the IOList
      ;A second comment for the same portion of the IOList
      @ ;I am the first and only comment for the following comment
        "A third (itself commented!) comment for the same part of the IOList"
      [
        ;"I am a comment for the following ByteString"
        #x"04050607"
        #x"08090A0B"
      ]
      #x"0C0D0E0F"
    ]
## MIME-type tagged binary data.
Many internet protocols use
[media types](https://tools.ietf.org/html/rfc6838) (a.k.a. MIME types)
to indicate the format of some associated binary data. For this
purpose, we define `MIMEData` to be a record labelled `mime` with two
fields, the first being a `Symbol`, the media type, and the second
being a `ByteString`, the binary data.

While each media type may define its own rules for comparing
documents, we define ordering among `MIMEData` *representations* of
such media types following the general rules for ordering of
`Record`s.

**Examples.**

    «<mime application/octet-stream #"abcde">»
      = B4 B3 04 "mime" B3 18 "application/octet-stream" B2 05 "abcde" 84

    «<mime text/plain #"ABC">»
      = B4 B3 04 "mime" B3 0A "text/plain" B2 03 "ABC" 84

    «<mime application/xml #"<xhtml/>">»
      = B4 B3 04 "mime" B3 0F "application/xml" B2 08 "<xhtml/>" 84

    «<mime text/csv #"123,234,345">»
      = B4 B3 04 "mime" B3 08 "text/csv" B2 0B "123,234,345" 84
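
The byte layout of these examples can be reproduced with a short Python sketch. The tag bytes (`B4`, `B3`, `B2`, `84`) are read directly off the examples above. The helper names are hypothetical, and the handling of lengths of 128 or more (a little-endian base-128 varint) is an assumption of this sketch, since every example here uses a single-byte length.

```python
def varint(n: int) -> bytes:
    """Encode a non-negative length.

    Lengths below 128 are a single byte, as in the examples. Longer
    lengths use a base-128 varint with a continuation bit; that part
    is an assumption of this sketch.
    """
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_symbol(name: str) -> bytes:
    """B3, length, then the UTF-8 bytes of the symbol."""
    data = name.encode("utf-8")
    return b"\xb3" + varint(len(data)) + data


def encode_bytestring(data: bytes) -> bytes:
    """B2, length, then the raw bytes."""
    return b"\xb2" + varint(len(data)) + data


def encode_mime(media_type: str, data: bytes) -> bytes:
    """B4, label `mime`, the two fields, then the closing 84."""
    return (
        b"\xb4"
        + encode_symbol("mime")
        + encode_symbol(media_type)
        + encode_bytestring(data)
        + b"\x84"
    )
```

For instance, `encode_mime("text/plain", b"ABC")` yields the bytes listed for the `text/plain` example.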
## Unicode normalization forms.
Unicode defines multiple
[normalization forms](http://unicode.org/reports/tr15/) for text.
While no particular normalization form is required for `String`s,
users may need to unambiguously signal or require a particular
normalization form. A `NormalizedString` is a `Record` labelled with
`unicode-normalization` and having two fields, the first of which is a
`Symbol` specifying the normalization form used (e.g. `nfc`, `nfd`,
`nfkc`, `nfkd`), and the second of which is a `String` whose
underlying code point representation *MUST* be normalized according to
the named normalization form.
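
A validity check for this side-condition might look like the following Python sketch, which relies on the standard library's `unicodedata.normalize`. The function name and the choice to pass the `Record`'s two fields as plain Python strings are assumptions made for illustration. Forms other than the four listed are rejected here only because the sketch has no way to check them.

```python
import unicodedata

# Map the symbols named in the text to the form names Python expects.
FORMS = {"nfc": "NFC", "nfd": "NFD", "nfkc": "NFKC", "nfkd": "NFKD"}


def is_valid_normalized_string(form: str, text: str) -> bool:
    """Check that `text` is already normalized according to `form`."""
    py_form = FORMS.get(form)
    if py_form is None:
        return False  # unknown form: cannot be checked by this sketch
    # A string is in a normalization form exactly when normalizing
    # it to that form leaves it unchanged.
    return unicodedata.normalize(py_form, text) == text
```

A reader working in the validity-enforcing mode could call such a check on each `unicode-normalization` record it encounters and reject those for which it returns `False`.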
## IRIs (URIs, URLs, URNs, etc.).
An `IRI` is a `Record` labelled with `iri` and having one field, a
`String` which is the IRI itself and which *MUST* be a valid absolute
or relative IRI.
## Machine words.
The definition of `SignedInteger` captures all integers. However, in
certain circumstances it can be valuable to assert that a number
inhabits a particular range, such as a fixed-width machine word.

A family of labels `i`*n* and `u`*n* for *n* ∈ {8,16,32,64,128} denote
*n*-bit-wide signed and unsigned range restrictions, respectively.
Records with these labels *MUST* have one field, a `SignedInteger`,
which *MUST* fall within the appropriate range. That is, to be valid,

- in `<i8 `*x*`>`, -128 <= *x* <= 127.
- in `<u8 `*x*`>`, 0 <= *x* <= 255.
- in `<i16 `*x*`>`, -32768 <= *x* <= 32767.
- etc.
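
The remaining ranges follow the usual two's-complement pattern: `i`*n* covers -2^(*n*-1) through 2^(*n*-1) - 1, and `u`*n* covers 0 through 2^*n* - 1. A minimal Python sketch of the check, with hypothetical function names and with a record modelled simply as a label string plus a list of fields:

```python
import re
from typing import List, Tuple

# Labels are `i` or `u` followed by one of the permitted widths.
_LABEL = re.compile(r"([iu])(8|16|32|64|128)")


def word_range(label: str) -> Tuple[int, int]:
    """Return the inclusive (low, high) bounds for a machine-word label."""
    m = _LABEL.fullmatch(label)
    if m is None:
        raise ValueError(f"not a machine-word label: {label!r}")
    bits = int(m.group(2))
    if m.group(1) == "i":
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def is_valid_word(label: str, fields: List[object]) -> bool:
    """Exactly one field, an integer, within the range for `label`."""
    if len(fields) != 1:
        return False
    x = fields[0]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(x, bool) or not isinstance(x, int):
        return False
    low, high = word_range(label)
    return low <= x <= high
```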
## Anonymous Tuples and Unit.
A `Tuple` is a `Record` with label `tuple` and zero or more fields,
denoting an anonymous tuple of values.

The 0-ary tuple, `<tuple>`, denotes the empty tuple, sometimes called
“unit” or “void” (but *not* e.g. JavaScript's “undefined” value).
## Null and Undefined.
Tony Hoare's
“[billion-dollar mistake](https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retractions)”
can be represented with the 0-ary `Record` `<null>`. An “undefined”
value can be represented as `<undefined>`.
## Dates and Times.
Dates, times, moments, and timestamps can be represented with a
`Record` with label `rfc3339` having a single field, a `String`, which
*MUST* conform to one of the `full-date`, `partial-time`, `full-time`,
or `date-time` productions of [section 5.6 of RFC
3339](https://tools.ietf.org/html/rfc3339#section-5.6). (In
`date-time`, "T" and "Z" *MUST* be upper-case and "T" *MUST* be used;
a space separating the `full-date` and `full-time` *MUST NOT* be
used.)
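
A syntax-only check for the field of an `rfc3339` record could be sketched in Python as below. The patterns transcribe the four named productions. The function name is hypothetical, the sketch does not check value ranges (it accepts month `13`, for example), and requiring an upper-case "Z" in a standalone `full-time` as well is an assumption of this sketch.

```python
import re

FULL_DATE = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
PARTIAL_TIME = r"[0-9]{2}:[0-9]{2}:[0-9]{2}(?:\.[0-9]+)?"
TIME_OFFSET = r"(?:Z|[+-][0-9]{2}:[0-9]{2})"
FULL_TIME = PARTIAL_TIME + TIME_OFFSET
# Upper-case "T" only; a space separator is deliberately not accepted.
DATE_TIME = FULL_DATE + "T" + FULL_TIME

_RFC3339 = re.compile("|".join([DATE_TIME, FULL_DATE, FULL_TIME, PARTIAL_TIME]))


def is_rfc3339_field(text: str) -> bool:
    """True if `text` matches one of the four productions in full."""
    return _RFC3339.fullmatch(text) is not None
```

Because the pattern is compiled without `re.IGNORECASE`, lower-case "t" and "z" are rejected, matching the upper-case requirement stated above.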
## XML Infoset.
[XML Infoset](https://www.w3.org/TR/2004/REC-xml-infoset-20040204/)
describes the semantics of XML: that is, the underlying information
contained in a document, independent of surface syntax.

A useful subset of XML Infoset, namely its Element Information Items
(omitting processing instructions, entities, entity references,
comments, namespaces, name prefixes, and base URIs), can be captured
with the [schema](preserves-schema.html)

    Node = Text / Element .
    Text = string .
    Element =
      / @withAttributes
        <<rec> @localName symbol [@attributes Attributes @children Node ...]>
      / @withoutAttributes
        <<rec> @localName symbol @children [Node ...]> .
    Attributes = { symbol: string ...:... } .

**Examples.**

    <html
      <h1 {class: "title"} "Hello World!">
      <p
        "I could swear I've seen markup like this somewhere before. "
        "Perhaps it was "
        <a {href: "https://docs.racket-lang.org/search/index.html?q=xexpr%3F"} "here">
        "?"
      >
      <table
        <tr <th> <th "Column 1"> <th "Column 2">>
        <tr <th "Row 1"> <td 123> <td 234>>>
    >
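
To illustrate how an XML document maps onto this shape, here is a Python sketch that walks a tree parsed by the standard library's `xml.etree.ElementTree`. The output model is hypothetical and chosen only for illustration: an `Element` becomes a `(label, fields)` tuple, `Attributes` become a `dict` placed first among the fields when non-empty, and `Text` becomes a `str`. Namespaces are not handled, in line with the subset described above.

```python
import xml.etree.ElementTree as ET


def element_to_node(elem):
    """Convert an ElementTree element to a (label, fields) tuple."""
    children = []
    if elem.text:
        children.append(elem.text)  # text before the first child
    for child in elem:
        children.append(element_to_node(child))
        if child.tail:
            children.append(child.tail)  # text following this child
    # Attributes, when present, come first; then the child nodes.
    fields = ([dict(elem.attrib)] if elem.attrib else []) + children
    return (elem.tag, fields)


def xml_to_node(source: str):
    """Parse an XML string and convert its root element."""
    return element_to_node(ET.fromstring(source))
```

For example, `xml_to_node('<h1 class="title">Hello World!</h1>')` gives `("h1", [{"class": "title"}, "Hello World!"])`, which corresponds to the `h1` record in the example above.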
<!-- Heading to visually offset the footnotes from the main document: -->
## Notes