Goals for Work on IRIs in IETF
This page lists some desirable properties of the results of IETF work on Resource Identifiers. Not all of these goals can be achieved simultaneously, but it is hoped that there is sufficient agreement on their desirability to motivate work in this area.
In addition, there are some technical goals of the HTML 5 specification editor that this work should address; see below.
Multiple definitions of the same terms in different standards documents should be avoided; even consistent definitions are problematic, because they require cross-checking. If a concept is defined in one document in a way that is unsuitable for another document which would otherwise refer to the term, avoid "redefining" the term within the newer context; either pick a different term or coordinate agreement to update the referenced specification.
Avoiding security problems (e.g., difficulties due to spoofing, renaming, misuse of DNS) is a high priority; avoiding security problems is a higher priority than being consistent with existing applications.
Optional interpretation rules for resource identifiers which give different results depending on the processing model chosen are to be avoided, if there are significant cases where the difference would cause interoperability problems.
Separate “specification of what a conservative producer should send” from “advice for what a liberal consumer should accept”: for robustness, the specification of what a “conforming” producer of resource identifiers should send can be (if necessary) more restrictive than the specification of what some common applications accept.
Normative lenient processing
It is highly desirable to provide normative rules for compatible "liberal consumers" of identifiers. Mandate that specifications must specify whether particular contexts require strict processing, lenient processing, or allow either. In cases where uniform behavior is necessary for interoperability, implementation advice is not strong enough.
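The distinction between strict and lenient processing can be illustrated with a minimal Python sketch. It is not taken from any specification: the function names `is_strict` and `lenient_repair` are hypothetical, the strict check only looks at individual characters (it does not validate the full URI grammar), and the repair rule (percent-encode disallowed characters as UTF-8 and escape stray "%" signs) is just one assumed policy among several that a normative rule set could choose.

```python
import re
from urllib.parse import quote

# Characters that RFC 3986 permits somewhere in a URI: unreserved,
# reserved (gen-delims and sub-delims), and "%" for percent-encoding.
_URI_CHARS = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*")

# A "%" that is not followed by two hexadecimal digits.
_BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def is_strict(text: str) -> bool:
    """Character-level check a conservative producer might apply."""
    return (_URI_CHARS.fullmatch(text) is not None
            and _BAD_PERCENT.search(text) is None)


def lenient_repair(text: str) -> str:
    """Assumed repair policy for a liberal consumer (illustrative only)."""
    # Escape stray "%" signs first so they survive as literal data.
    text = _BAD_PERCENT.sub("%25", text)
    # Percent-encode everything outside the allowed set as UTF-8 bytes.
    return quote(text, safe="-._~:/?#[]@!$&'()*+,;=%")
```

Under this assumed policy, `lenient_repair("http://www.example.com/a b")` yields `http://www.example.com/a%20b`, and any repaired string passes `is_strict`. If two consumers chose different repair policies, the same input would produce different identifiers, which is the interoperability risk that normative lenient rules are meant to remove.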
Consistency of web and other Internet applications
Interoperability between web applications (browsers, proxies, spiders, etc.) and other Internet applications which use resource identifiers (email, directory services) is important, and should be given priority equal (or nearly equal) to that of interoperability between web browsers. Recommended practice for web applications and other Internet applications should be the same: those creating web content should not be encouraged to create Resource Identifiers (whether called URLs, URIs, IRIs, or Web Addresses) which would not function in other applications.
Consistency of specifications with implementations
When existing specifications do not match the common practice of existing applications, it is appropriate to update the existing specification, even if long standing.
When existing implementations disagree, document existing practice, but recommend (normatively) the behavior that will best lead to improved interoperability.
Minimize options and specifications
The split between URI and IRI as separate protocol elements was an unfortunate necessity. While it was necessary to have two separate normative terms, “URI” and “IRI”, to describe two variations of “resource identifiers”, multiplying non-terminals and terms unnecessarily is harmful. Adding further terms such as “LEIRI”, “Web Address” or HREF should be avoided, if possible. (URI was the term used to unify “URL” and “URN”.)
Unless necessary for other reasons, avoid making existing, conforming, and widely implemented behavior non-conforming: Applications which accept URIs but not other forms should not be made “non-conforming” by a redefinition of terms.
HTML5 Editor Requirements
Another goal is to address the requirements that led to the processing rules previously defined in http://www.w3.org/TR/2009/WD-html5-20090423/.
- A normative definition, one that can be implemented consistently, of the processing of relative and absolute forms, with clear, consistent and testable rules for processing and, in particular, for the hand-off between generic URL processing and HTTP. For example: the handling of UTF-16 and UTF-8 encoded URLs, the effect of the document "base", whether backslashes are or are not changed to slashes, etc.
- A normative, uniform manner of parsing arbitrary strings to determine syntactic components, whether the strings are valid URIs, IRIs or other forms. In particular, a definition of "absolute" that works consistently whether the strings are valid URIs, IRIs or something else.
- A definition of a valid IRI that handles encodings other than UTF-8.
- A definition of absolute IRI under which invalid strings can still be absolute, e.g. http://www.example.com/%X being an "absolute IRI".
- An algorithm that defines how to take an arbitrary string (including one that isn't a valid IRI of any form) and get out the scheme, host, port, hostport, path, query, fragment, and host-specific parts.
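As a rough illustration of the last requirement, the sketch below splits an arbitrary string into components using the regular expression from RFC 3986, Appendix B, which matches every string and therefore never rejects input. The names `Parts`, `split_reference` and `is_absolute` are hypothetical; the rule used for separating host and port (trailing digits after the last colon) and the definition of "absolute" (a scheme is present) are assumptions made for this example, not the rules any specification mandates. The host-specific parts and backslash handling are not covered.

```python
import re
from typing import NamedTuple, Optional

# Regular expression from RFC 3986, Appendix B. Every group is optional,
# so it matches any string, including ones that are not valid URIs or IRIs.
_COMPONENTS = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?",
    re.DOTALL,
)

# Assumed rule: the port is the run of ASCII digits after the last colon.
_HOSTPORT = re.compile(r"(.*?)(?::([0-9]*))?", re.DOTALL)


class Parts(NamedTuple):
    scheme: Optional[str]
    host: Optional[str]
    port: Optional[str]
    hostport: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]


def split_reference(text: str) -> Parts:
    """Split any string into components without validating it."""
    m = _COMPONENTS.match(text)
    scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)
    host = port = hostport = None
    if authority is not None:
        # Drop any userinfo ("user@") before looking for host and port.
        hostport = authority.rpartition("@")[2]
        hp = _HOSTPORT.fullmatch(hostport)
        host, port = hp.group(1), hp.group(2)
    return Parts(scheme, host, port, hostport, path, query, fragment)


def is_absolute(text: str) -> bool:
    """Assumed definition: a string is absolute if it has a scheme."""
    return split_reference(text).scheme is not None
```

With these assumptions, `split_reference("http://www.example.com/%X")` reports the scheme `http`, the host `www.example.com` and the path `/%X`, and `is_absolute` returns `True` for it even though `%X` is not a valid percent-encoding. This is the kind of behavior the requirement on absolute IRIs describes.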
Some related background information is in the archived "Error handling in URIs" message at http://lists.w3.org/Archives/Public/uri/2008Jun/0002.html and the related discussion thread at http://lists.w3.org/Archives/Public/uri/2008Jun/thread.html#msg2