
Ticket #111 (closed design: fixed)

Opened 6 years ago

Last modified 5 years ago

Use of TEXT

Reported by: mnot@pobox.com
Owned by:
Priority:
Milestone: 06
Component: p6-cache
Severity:
Keywords:
Cc:
Origin:

Description

The TEXT rule is used in a number of places, but it is often overlooked that its use implies that both iso-8859-1 and RFC2047 encoding are available.
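To make the implication concrete, here is a minimal Python sketch of what a recipient would have to do to honor both encodings in a TEXT value. The function name `decode_text_value` is hypothetical, and the use of Python's legacy `email.header.decode_header` for the RFC2047 part is an assumption made for illustration; nothing in this ticket prescribes it.

```python
from email.header import decode_header


def decode_text_value(raw: bytes) -> str:
    """Decode a TEXT-style value: ISO-8859-1 octets plus RFC 2047 encoded-words.

    Hypothetical helper, for illustration only.
    """
    # Every octet maps to exactly one ISO-8859-1 character, so this cannot fail.
    as_latin1 = raw.decode("iso-8859-1")
    parts = []
    for chunk, charset in decode_header(as_latin1):
        if isinstance(chunk, bytes):
            # charset is None for runs that were not encoded-words;
            # those fall back to ISO-8859-1.
            chunk = chunk.decode(charset or "iso-8859-1", errors="replace")
        parts.append(chunk)
    return "".join(parts)


# The same word, once as a raw ISO-8859-1 octet and once as an encoded-word.
print(decode_text_value(b"caf\xe9"))
print(decode_text_value(b"=?utf-8?q?caf=C3=A9?="))
```

Both calls yield the same string, which illustrates the ambiguity: any field defined in terms of TEXT silently accepts two different ways of carrying non-ASCII characters.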

The uses of TEXT need to be evaluated, and if such encoding is still viable, it should be called out more explicitly in the BNF and/or surrounding text. Candidates include:

  • reason-phrase
  • filename-parm
  • warn-text

TEXT is also referenced in other places where it is confusing or inappropriate; e.g., the definition of field-content. This should be clarified.

Change History

comment:1 Changed 6 years ago by mnot@pobox.com

proposal:

  • p1, 2.2:

Old: The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 [ISO-8859-1] only when encoded according to the rules of [RFC2047].

TEXT = %x20-7E | %x80-FF | LWS

; any OCTET except CTLs, but including LWS

A CRLF is allowed in the definition of TEXT only as part of a header field continuation. It is expected that the folding LWS will be replaced with a single SP before interpretation of the TEXT value.

New: """ Words of *TEXT MUST NOT contain characters from character sets other than ISO-8859-1 [ISO-8859-1].

TEXT = %x20-7E | %x80-FF | LWS

; any OCTET except CTLs, but including LWS

A CRLF is allowed in the definition of TEXT only as part of a header field continuation. It is expected that the folding LWS will be replaced with a single SP before interpretation of the TEXT value.

Characters outside of ISO-8859-1 MAY be included where the encoded-word rule (as defined in RFC2047, Section 2) is specified. The encoded-word rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. When used in HTTP, encoded-word has no specified length limit. """

Note that I've taken a minimal approach to #63 here, and that the outcome of i74 may change this.
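Both the old and the new text expect folding LWS to be replaced with a single SP before a TEXT value is interpreted. A minimal sketch of that step follows; the name `unfold` is hypothetical, and the sketch collapses only the CRLF plus the whitespace that follows it, leaving any whitespace before the CRLF untouched.

```python
import re

# Folding LWS: a CRLF followed by one or more SP / HTAB octets.
_FOLD = re.compile(rb"\r\n[ \t]+")


def unfold(value: bytes) -> bytes:
    """Replace each header continuation (CRLF + whitespace) with a single SP."""
    return _FOLD.sub(b" ", value)


print(unfold(b"a long\r\n    folded value"))
```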

  • p1, 2.2:

Old: comment = "(" *( ctext | quoted-pair | comment ) ")"

New: """ comment = "(" *( ctext | quoted-pair | comment | encoded-word ) ")" """

  • p1, 4.2:

Old:

field-content = <field content>
                ; the OCTETs making up the field-value
                ; and consisting of either *TEXT or combinations
                ; of token, separators, and quoted-string

New: """
field-content = <field content>
                ; the OCTETs making up the field-value,
                ; according to the syntax specified by the field.
"""

N.B. depending on how we resolve i74, we may want to add a constraint regarding character encodings, so that people don't start minting headers in random ones.

  • p3, B.1:

Old: filename-parm = "filename" "=" quoted-string

New: """ filename-parm = "filename" "=" quoted-string | encoded-word """

N.B.

  • p6, 16.6:

Old: warn-text = quoted-string

New: """ warn-text = quoted-string | encoded-word """

Note that I have NOT suggested the use of encoded-word in the following places:

p1, 3.4 (Transfer Codings -- parameter values), p1, 6.1.1 (Reason-Phrase), p2, 10.2 (expect-extensions), p3, 3.3 (Media Types -- parameter values), p3, 6.1 (accept-extension), p4, 3 (ETag opaque-tag), p6, 16.2 (cache-extension), p6, 16.4 (extension-pragma).

I think the *-extension and parameter value ones are straightforward; if a particular extension wants to specify use of encoded-word, it should; we shouldn't specify use of encoded-word in the generic extension construct, but leave it to the specific instances. I.e., they still conform to TEXT; it's up to them to specify whether that content can contain encoded-words.

comment:2 Changed 6 years ago by fielding@gbiv.com

I think the suggestion at the Dublin IETF meeting was that all such TEXT be reduced to US-ASCII (for generation) but specify that received values may contain non-ASCII OCTETs?

Since the TEXT rule is only intended for fluff, I suggest just removing it and specifying each field in specific terms, deprecating the use of non-ASCII and non-printing OCTETs.

comment:3 Changed 6 years ago by mnot@pobox.com

  • Milestone changed from unassigned to 06

The plan is to remove the TEXT rule (and associated commentary) altogether, and replace it with instances that call out the specific legal octets and how to interpret them.

They are:

1) field-content (p1)

field-content = *( VCHAR / WSP / obs-text )

obs-text = %x80-FF

Historically, HTTP has allowed field-content with text in the ISO-8859-1 charset (allowing other charsets through use of RFC2047 encoding). In practice, most HTTP header field-values are a subset of the ASCII charset, and newly defined headers SHOULD constrain their field-values to ASCII characters. Recipients SHOULD treat obs-text characters in header field-content as raw octets.
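As an illustration of the rule above, the following Python sketch checks a field value against `field-content` and treats obs-text as raw octets instead of guessing a charset. The function names are hypothetical; VCHAR (%x21-7E) and WSP (SP / HTAB) are taken from the RFC 5234 core rules, which is an assumption about where the proposal gets them.

```python
import re

# field-content = *( VCHAR / WSP / obs-text )
#   VCHAR    = %x21-7E    (assumed: RFC 5234 core rule)
#   WSP      = SP / HTAB  (assumed: RFC 5234 core rule)
#   obs-text = %x80-FF
_FIELD_CONTENT = re.compile(rb"[\x21-\x7E \t\x80-\xFF]*")


def is_field_content(value: bytes) -> bool:
    """True if every octet is VCHAR, WSP, or obs-text."""
    return _FIELD_CONTENT.fullmatch(value) is not None


def has_obs_text(value: bytes) -> bool:
    """True if the value contains any octet in %x80-FF."""
    return any(octet >= 0x80 for octet in value)


def field_value_for_display(value: bytes) -> str:
    """Render a value without assigning a charset to obs-text octets."""
    # Non-ASCII octets are escaped (e.g. \xe9) rather than decoded.
    return value.decode("ascii", errors="backslashreplace")


print(is_field_content(b"text/html; charset=utf-8"))
print(has_obs_text(b"caf\xe9"))
print(field_value_for_display(b"caf\xe9"))
```

Escaping is only one way to keep the octets opaque; a recipient could equally pass the bytes through untouched.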

2) comment (p1)

ctext = *( OWS / %x21-27 / %x2A-7E / obs-text )

Note that OWS is Optional White Space, like LWS (work in progress).

3) quoted-string

qdtext = *( OWS / %x21 / %x23-5B / %x5D-7E / obs-text )
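The hex ranges in `ctext` and `qdtext` are easier to compare once expanded into octet sets. The sketch below does that and lists which printable octets each rule leaves out. All names are hypothetical, and since OWS is described as work in progress, treating it as just SP and HTAB is an assumption.

```python
def octets(*ranges):
    """Build a set of octet values from inclusive (low, high) ranges."""
    out = set()
    for low, high in ranges:
        out.update(range(low, high + 1))
    return frozenset(out)


# Assumption: OWS contributes only SP and HTAB as individual octets.
OWS = octets((0x09, 0x09), (0x20, 0x20))
OBS_TEXT = octets((0x80, 0xFF))

# ctext  = *( OWS / %x21-27 / %x2A-7E / obs-text )
CTEXT = OWS | octets((0x21, 0x27), (0x2A, 0x7E)) | OBS_TEXT
# qdtext = *( OWS / %x21 / %x23-5B / %x5D-7E / obs-text )
QDTEXT = OWS | octets((0x21, 0x21), (0x23, 0x5B), (0x5D, 0x7E)) | OBS_TEXT


def excluded_printables(allowed):
    """Printable ASCII octets (SP through '~') that a rule does not allow."""
    return bytes(b for b in range(0x20, 0x7F) if b not in allowed)


print(excluded_printables(CTEXT))   # the comment delimiters
print(excluded_printables(QDTEXT))  # the quote and the quoted-pair escape
```

With these ranges, `ctext` leaves out only the parentheses, while `qdtext` leaves out the double quote and the backslash.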

4) reason-phrase (p1/p2)

Reason-Phrase = *( VCHAR / WSP / obs-text )

Note that this also resolves the editorial issue #94 (Reason-Phrase BNF).

comment:4 Changed 6 years ago by fielding@gbiv.com

From [395]:

Deprecate line folding, addresses #77. Require that invalid whitespace around field-names be rejected, addresses #30. Make non-ASCII content obsolete and opaque in header fields and reason phrase, addresses #63, #74, #94, #111.

comment:5 Changed 6 years ago by julian.reschke@gmx.de

  • Status changed from new to closed
  • Resolution set to fixed

Fixed in [398]:

Resolve #63, #74, #94, #111: Issues around TEXT rule closed with revision [395] (closes #63, #74, #94, #111)

comment:6 Changed 6 years ago by julian.reschke@gmx.de

  • Status changed from closed to reopened
  • Resolution fixed deleted

re-open until reviewed

comment:7 Changed 5 years ago by julian.reschke@gmx.de

  • Component changed from non-specific to p6-cache

Part 6 still allows RFC2047 encoding for the Warning header.

comment:8 Changed 5 years ago by mnot@pobox.com

  • Status changed from reopened to closed
  • Resolution set to fixed