diff --git a/syndicate/mc/preserve.md b/syndicate/mc/preserve.md index c8d9668..5a5392f 100644 --- a/syndicate/mc/preserve.md +++ b/syndicate/mc/preserve.md @@ -302,12 +302,20 @@ grouping of its field-`Value`s. curly-brace-enclosed colon-separated pairs of values. `Set`s are written either as a simple curly-brace-enclosed non-empty sequence of values, or as a possibly-empty sequence of values enclosed by the -tokens `#set{` and `}`. +tokens `#set{` and `}`.[^printing-collections] Sequence = "[" *Value ws "]" Dictionary = "{" *(Value ws ":" Value) ws "}" Set = %s"#set{" *Value ws "}" / "{" 1*Value ws "}" + [^printing-collections]: **Implementation note.** When implementing + printing of `Value`s using the textual syntax, consider supporting + (a) optional pretty-printing with indentation, (b) optional + JSON-compatible print mode for that subset of `Value` that is + compatible with JSON, and (c) optional submodes for no commas, + commas separating, and commas terminating elements or key/value + pairs within a collection. + Any `Value` may be represented using the [compact binary syntax](#compact-binary-syntax) by directly prefixing the binary form of the `Value` with ASCII `SOH` (`%x01`), or by @@ -419,28 +427,26 @@ URL-safe Base64 characters are allowed. ByteString =/ %s"#base64{" *(ws / base64char) ws "}" / base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "=" -A `Symbol` may be written in a "bare" form,[^cf-sexp-token] so long as +A `Symbol` may be written in a "bare" form[^cf-sexp-token] so long as it conforms to certain restrictions on the characters appearing in the -symbol, or in a quoted form. The quoted form is much the same as the -syntax for `String`s, including embedded escape syntax, except using a -bar or pipe character (`|`) instead of a double quote mark. +symbol. Alternatively, it may be written in a quoted form. The quoted +form is much the same as the syntax for `String`s, including embedded +escape syntax, except using a bar or pipe character (`|`) instead of a +double quote mark. Symbol = symstart *symcont / "|" *symchar "|" - symstart = ALPHA / sympunct - symcont = ALPHA / sympunct / DIGIT / "-" / "." + symstart = ALPHA / sympunct / symunicode + symcont = ALPHA / sympunct / symunicode / DIGIT / "-" / "." sympunct = "~" / "!" / "@" / "$" / "%" / "^" / "&" / "*" / "?" / "_" / "=" / "+" / "<" / ">" / "/" symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG) + symunicode = [^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt] - definition of "token representation". - -TODO: More unicode in unescaped symbols? - -### Printing - -TODO: Recommend a JSON-compatible print mode. Recommend a submode with -trailing commas. + definition of "token representation", and with the + [R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4). ## Compact Binary Syntax @@ -722,6 +728,8 @@ The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and ## Examples +### Simple examples + @@ -749,7 +757,7 @@ short form label number 0 to label `discard`, 1 to `capture`, and 2 to | `1.0` | 03 3F F0 00 00 00 00 00 00 | | `-1.202e300` | 03 FE 3C B7 B7 59 BF 04 26 | -Finally, a larger example, using a non-`Symbol` label for a record.[^extensibility2] The `Record` +The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record` [titled person 2 thing 1](101, "Blackwell", date(1821 2 3), "Dr") @@ -783,6 +791,97 @@ encodes to For more detail on Racket's representations of record labels, see [the Racket documentation for `make-prefab-struct`](http://docs.racket-lang.org/reference/structutils.html#%28def._%28%28quote._~23~25kernel%29._make-prefab-struct%29%29). +--- + +### JSON examples + +The examples from +[RFC 8259](https://tools.ietf.org/html/rfc8259#section-13) read as +valid Preserves, though the JSON literals `true`, `false` and `null` +read as `Symbol`s. The first example: + + { + "Image": { + "Width": 800, + "Height": 600, + "Title": "View from 15th Floor", + "Thumbnail": { + "Url": "http://www.example.com/image/481989943", + "Height": 125, + "Width": 100 + }, + "Animated" : false, + "IDs": [116, 943, 234, 38793] + } + } + +encodes to binary as follows: + + E2 + 55 "Image" + EC + 55 "Width" 42 03 20 + 55 "Title" 5F 14 "View from 15th Floor" + 58 "Animated" 75 "false" + 56 "Height" 42 02 58 + 59 "Thumbnail" + E6 + 55 "Width" 41 64 + 53 "Url" 5F 26 "http://www.example.com/image/481989943" + 56 "Height" 41 7D + 53 "IDs" C4 + 41 74 + 42 03 AF + 42 00 EA + 43 00 97 89 + +and the second example: + + [ + { + "precision": "zip", + "Latitude": 37.7668, + "Longitude": -122.3959, + "Address": "", + "City": "SAN FRANCISCO", + "State": "CA", + "Zip": "94107", + "Country": "US" + }, + { + "precision": "zip", + "Latitude": 37.371991, + "Longitude": -122.026020, + "Address": "", + "City": "SUNNYVALE", + "State": "CA", + "Zip": "94085", + "Country": "US" + } + ] + +encodes to binary as follows: + + C2 + EF 10 + 59 "precision" 53 "zip" + 58 "Latitude" 03 40 42 E2 26 80 9D 49 52 + 59 "Longitude" 03 C0 5E 99 56 6C F4 1F 21 + 57 "Address" 50 + 54 "City" 5D "SAN FRANCISCO" + 55 "State" 52 "CA" + 53 "Zip" 55 "94107" + 57 "Country" 52 "US" + EF 10 + 59 "precision" 53 "zip" + 58 "Latitude" 03 40 42 AF 9D 66 AD B4 03 + 59 "Longitude" 03 C0 5E 81 AA 4F CA 42 AF + 57 "Address" 50 + 54 "City" 59 "SUNNYVALE" + 55 "State" 52 "CA" + 53 "Zip" 55 "94085" + 57 "Country" 52 "US" + ## Conventions for Common Data Types The `Value` data type is essentially an S-Expression, able to @@ -918,8 +1017,6 @@ or `date-time` productions of ## Security Considerations -TODO: Lots of whitespace is just like lots of empty chunks - **Empty chunks.** Streamed (format C) `String`s, `ByteString`s and `Symbol`s may include chunks of zero length. This opens up a possibility for denial-of-service: an attacker may begin streaming a @@ -929,6 +1026,12 @@ optional reasonable restrictions on the number of consecutive empty chunks that may appear in a stream, and may even supply an optional mode that rejects empty chunks entirely. +**Whitespace.** Similarly, the textual format for `Value`s allows +arbitrary whitespace in many positions. In streaming transfer +situations, consider optional restrictions on the amount of +consecutive whitespace and comments that may appear in a serialized +`Value`. + **Canonical form for cryptographic hashing and signing.** As specified, neither the textual nor the compact binary encoding rules for `Value`s force canonical serializations. Two serializations of the diff --git a/syndicate/mc/preserve.rkt b/syndicate/mc/preserve.rkt index 3874c55..77012ac 100644 --- a/syndicate/mc/preserve.rkt +++ b/syndicate/mc/preserve.rkt @@ -4,6 +4,8 @@ (provide (struct-out stream-of) (struct-out record) short-form-labels + read-preserve + string->preserve encode decode wire-value) @@ -176,15 +178,15 @@ (define s (bytes->string/utf-8 bs)) (lambda () (ks (if (= minor 3) (string->symbol s) s) rest))))])) +(define (build-record label fields) + (with-handlers [(exn:fail:contract? (lambda (e) (record label fields)))] + (apply make-prefab-struct label fields))) + (define (decode-record minor fields rest ks kf) - (define (build key fs) - (ks (with-handlers [(exn:fail:contract? (lambda (e) (record key fs)))] - (apply make-prefab-struct key fs)) - rest)) (match* (minor fields) - [(3 (list* key fs)) (build key fs)] + [(3 (list* key fs)) (ks (build-record key fs) rest)] [(3 '()) (kf)] - [(n fs) (build (vector-ref (short-form-labels) n) fs)])) + [(n fs) (ks (build-record (vector-ref (short-form-labels) n) fs) rest)])) (define (decode-collection minor vs rest ks kf) (match minor @@ -244,6 +246,241 @@ ;;--------------------------------------------------------------------------- +(define (read-preserve [i (current-input-port)]) + (local-require (only-in syntax/readerr raise-read-error)) + (local-require net/base64) + (local-require file/sha1) + + (define-match-expander px + (syntax-rules () + [(_ re pat) (app (lambda (v) (regexp-try-match re v)) pat)])) + + (define (skip-whitespace) + (regexp-match? #px#"^(\\s|,)*" i) + (match (peek-char i) + [#\; (regexp-match? #px#"[^\r\n]*[\r\n]" i) (skip-whitespace)] + [_ #t])) + + (define (parse-error fmt . args) + (define-values [line column pos] (port-next-location i)) + (raise-read-error (format "read-preserve: ~a" (apply format fmt args)) + (object-name i) + line + column + pos + #f)) + + (define (eof-guard ch) + (match ch + [(? eof-object?) (parse-error "Unexpected end of input")] + [ch ch])) + + (define (peek/no-eof) (eof-guard (peek-char i))) + (define (read/no-eof) (eof-guard (read-char i))) + + (define (read-sequence terminator) + (sequence-fold '() accumulate-value reverse terminator)) + + (define (accumulate-value acc) + (cons (read-value) acc)) + + (define (read-dictionary-or-set) + (sequence-fold #f + (lambda (acc) + (define k (read-value)) + (skip-whitespace) + (match (peek-char i) + [#\: (when (set? acc) (parse-error "Unexpected key/value separator in set")) + (read-char i) + (define v (read-value)) + (hash-set (or acc (hash)) k v)] + [_ (when (hash? acc) (parse-error "Missing expected key/value separator")) + (set-add (or acc (set)) k)])) + values + #\})) + + (define PIPE #\|) + + (define (read-raw-symbol) + (let loop ((acc '())) + (match (peek-char i) + [(or #\{ #\} #\[ #\] #\" #\; #\, #\# #\: (? char-whitespace?) (? eof-object?) (== PIPE)) + (string->symbol (list->string (reverse acc)))] + [_ (loop (cons (read-char i) acc))]))) + + (define (read-encoded-binary kind valid-char? finish) + (let loop ((acc '())) + (skip-whitespace) + (let ((ch (read/no-eof))) + (cond [(valid-char? ch) + (loop (cons ch acc))] + [(eqv? ch #\}) + (finish (list->string (reverse acc)))] + [else (parse-error "Invalid ~a character" kind)])))) + + (define (read-base64-binary) + (read-encoded-binary "base64" + (lambda (ch) + (or (and (char>=? ch #\A) (char<=? ch #\Z)) + (and (char>=? ch #\a) (char<=? ch #\z)) + (and (char>=? ch #\0) (char<=? ch #\9)) + (memv ch '(#\+ #\/ #\- #\_ #\=)))) + (lambda (s) + (base64-decode (string->bytes/latin-1 s))))) + + (define (read-hex-binary) + (read-encoded-binary "hex" + (lambda (ch) + (or (and (char>=? ch #\A) (char<=? ch #\F)) + (and (char>=? ch #\a) (char<=? ch #\f)) + (and (char>=? ch #\0) (char<=? ch #\9)))) + (lambda (s) + (hex-string->bytes s)))) + + (define (read-stringlike xform-item finish terminator-char hexescape-char hexescape-proc) + (let loop ((acc '())) + (match (read/no-eof) + [(== terminator-char) (finish (reverse acc))] + [#\\ (match (read/no-eof) + [(== hexescape-char) (loop (cons (hexescape-proc) acc))] + [(and c (or (== terminator-char) #\\ #\/)) (loop (cons (xform-item c) acc))] + [#\b (loop (cons (xform-item #\u08) acc))] + [#\f (loop (cons (xform-item #\u0C) acc))] + [#\n (loop (cons (xform-item #\u0A) acc))] + [#\r (loop (cons (xform-item #\u0D) acc))] + [#\t (loop (cons (xform-item #\u09) acc))] + [c (parse-error "Invalid escape code \\~a" c)])] + [c (loop (cons (xform-item c) acc))]))) + + (define (read-string terminator-char) + (read-stringlike values + list->string + terminator-char + #\u + (lambda () + (integer->char + (match i + [(px #px#"^[a-fA-F0-9]{4}" (list hexdigits)) + (define n1 (string->number (bytes->string/utf-8 hexdigits))) + (if (<= #xd800 n1 #xdfff) ;; surrogate pair first half + (match i + [(px #px#"^\\\\u([a-fA-F0-9]{4})" (list hexdigits2)) + (define n2 (string->number (bytes->string/utf-8 hexdigits2))) + (if (<= #xdc00 n2 #xdfff) + (+ (arithmetic-shift (- n1 #xd800) 10) + (- n2 #xdc00) + #x10000) + (parse-error "Bad second half of surrogate pair"))] + [_ (parse-error "Missing second half of surrogate pair")]) + n1)] + [_ (parse-error "Bad string \\u escape")]))))) + + (define (read-literal-binary) + (read-stringlike (lambda (c) + (define b (char->integer c)) + (when (>= b 256) + (parse-error "Invalid code point ~a (~v) in literal binary" b c)) + b) + list->bytes + #\" + #\x + (lambda () + (match i + [(px #px#"^[a-fA-F0-9]{2}" (list hexdigits)) + (string->number (bytes->string/utf-8 hexdigits))] + [_ (parse-error "Bad binary \\x escape")])))) + + (define (read-intpart acc-rev) + (match (peek-char i) + [#\0 (read-fracexp (cons (read-char i) acc-rev))] + [_ (read-digit+ acc-rev read-fracexp)])) + + (define (read-digit* acc-rev k) + (match (peek-char i) + [(? char-numeric?) (read-digit* (cons (read-char i) acc-rev) k)] + [_ (k acc-rev)])) + + (define (read-digit+ acc-rev k) + (match (peek-char i) + [(? char-numeric?) (read-digit* (cons (read-char i) acc-rev) k)] + [_ (parse-error "Incomplete number")])) + + (define (read-fracexp acc-rev) + (match (peek-char i) + [#\. (read-digit+ (cons (read-char i) acc-rev) read-exp)] + [_ (read-exp acc-rev)])) + + (define (read-exp acc-rev) + (match (peek-char i) + [(or #\e #\E) (read-sign-and-exp (cons (read-char i) acc-rev))] + [_ (finish-number acc-rev)])) + + (define (read-sign-and-exp acc-rev) + (match (peek-char i) + [(or #\+ #\-) (read-digit+ (cons (read-char i) acc-rev) finish-number)] + [_ (read-digit+ acc-rev finish-number)])) + + (define (finish-number acc-rev) + (define s (list->string (reverse acc-rev))) + (define n (string->number s)) + (when (not n) (parse-error "Invalid number: ~v" s)) + n) + + (define (read-number) + (match (peek/no-eof) + [#\- (read-intpart (list (read-char i)))] + [_ (read-intpart (list))])) + + (define (sequence-fold acc accumulate-one finish terminator-char) + (let loop ((acc acc)) + (skip-whitespace) + (match (peek/no-eof) + [(== terminator-char) (read-char i) (finish acc)] + [_ (loop (accumulate-one acc))]))) + + (define (collect-fields head) + (skip-whitespace) + (match (peek-char i) + [#\( (collect-fields (build-record head (read-sequence #\))))] + [_ head])) + + (define (read-value) + (skip-whitespace) + (collect-fields + (match (peek-char i) + [(? eof-object? o) o] + [#\{ (read-char i) (read-dictionary-or-set)] + [#\[ (read-char i) (read-sequence #\])] + [(or #\- #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9) (read-number)] + [#\" (read-char i) (read-string #\")] + [(== PIPE) (read-char i) (string->symbol (read-string PIPE))] + [#\# (match i + [(px #px#"^#set\\{" (list _)) + (sequence-fold (set) accumulate-value values #\})] + [(px #px#"^#hexvalue\\{" (list _)) + (decode (read-hex-binary) (lambda () (parse-error "Invalid #hexvalue encoding")))] + [(px #px#"^#true" (list _)) + #t] + [(px #px#"^#false" (list _)) + #f] + [(px #px#"^#\"" (list _)) + (read-literal-binary)] + [(px #px#"^#hex\\{" (list _)) + (read-hex-binary)] + [(px #px#"^#base64\\{" (list _)) + (read-base64-binary)] + [_ + (parse-error "Invalid preserve value")])] + [#\: (parse-error "Unexpected key/value separator between items")] + [_ (read-raw-symbol)]))) + + (read-value)) + +(define (string->preserve s) + (read-preserve (open-input-string s))) + +;;--------------------------------------------------------------------------- + (module+ test (require rackunit) (require (for-syntax racket syntax/srcloc)) @@ -444,4 +681,94 @@ #xE2 #x7A "first-name" #x59 "Elizabeth" #xE2 #x77 "surname" #x59 "Blackwell" )) + + (let () + (local-require json) + (define rfc8259-example1 (string->preserve #<preserve #<