
Ticket #96 (closed defect: fixed)

Opened 3 years ago

Last modified 3 years ago

add definition of terms and introduction to comparison document

Reported by: masinter@adobe.com Owned by: masinter@adobe.com
Priority: major Milestone:
Component: comparison Version:
Severity: - Keywords:
Cc: masinter@adobe.com

Description

http://en.wikipedia.org/wiki/URL_normalization

I think we should clearly state that 'normalization' and 'canonicalization' are just two words that refer to the same process.

http://lists.w3.org/Archives/Public/public-iri/2011Jul/0035.html
shows there is some ambiguity about this.

I wish the draft were stated more clearly in terms of comparison and equivalence rather than normalization/canonicalization, and I'm not sure that the 'ladder' concept holds for all of the comparison functions in use, but perhaps those are other defects.

Change History

comment:1 Changed 3 years ago by masinter@adobe.com

"Normalization" and "canonicalization" are used equivalently, and
neither term is canonical.

In my mail, I specifically meant the terms "normalization" as used in
RFC 3986 and "canonicalization" as used by Adam Barth in his mail.

Those words, in those contexts, mean somewhat different things.

RFC 3986 describes and recommends good practices for implementing
normalization, but does not define any particular normalization
precisely. I think Adam was hoping to define a single normalization/
canonicalization method precisely. Do you agree that what Adam
is trying to define and what RFC 3986 describes are instances of
the same thing?

One can derive an equivalence relationship from a canonicalization
method: two elements are equivalent if they have the same canonical form.

# Indeed. But this is not the only possible use of canonicalization.

I agree.

Defining a normalization/canonicalization method is strictly "more powerful"
than defining an equivalence relationship, since you cannot derive the
normalization mapping from the equivalence relationship even though you
can define equivalence based on normalization.
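
A minimal sketch of that asymmetry, in Python. The helper name `equivalence_from` is hypothetical, and `str.lower` / `str.upper` stand in for real canonicalization methods only on ASCII input (an assumption made for illustration; neither is proposed in this thread). Both mappings induce the same equivalence relationship, so the relationship alone cannot tell you which mapping was used.

```python
from typing import Callable


def equivalence_from(canon: Callable[[str], str]) -> Callable[[str, str], bool]:
    """Derive an equivalence test from a canonicalization function:
    two items are equivalent if they have the same canonical form."""
    def equivalent(a: str, b: str) -> bool:
        return canon(a) == canon(b)
    return equivalent


# Two different (hypothetical) canonicalizations for ASCII host names.
same_by_lower = equivalence_from(str.lower)
same_by_upper = equivalence_from(str.upper)

a, b = "EXAMPLE.com", "example.COM"

# Both mappings agree on which items are equivalent ...
print(same_by_lower(a, b), same_by_upper(a, b))

# ... but they pick different canonical forms, so the mapping cannot be
# recovered from the equivalence relationship alone.
print(str.lower(a), str.upper(a))
```

The equivalence relationship fixes the classes; a canonicalization additionally picks one representative per class.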

The question is whether we actually need a normative normalization
algorithm. I'm having trouble thinking of a use case where it matters
for URIs in general, or even for HTTP URIs in specific.

For IRIs there are several equivalence relationships, useful for
different purposes.

Yep.

Defining a canonical form (choosing a canonical
canonicalization) doesn't seem necessary, although it might be useful.

Well, it's necessary for some use cases; specifically APIs that return
parts of a URI.

Could you please explain? Why do the APIs need a canonical
form? Why can't the APIs that return the parts of a URI/URL/IRI just
work on the strings they are given? What are the requirements for
normalization/canonicalization that aren't satisfied if you parse
and just accept the original form?

Let's say, for example, that you use "string equality"
as your equivalence relationship, and you do not normalize
ports or the case of host names. Then http://example:80,
http://example:80/, http://example, http://example/,
http://eXAMple, etc. (with different cases for 'example')
are not treated as equivalent, and you get
host = "example", port = "80", path = ""
host = "example", port = "80", path = "/"
host = "example", port = "", path = ""
host = "eXAMple", port = "", path = ""

as the parsed components. What is the harm? Why does
the API need more? What harm is it if some APIs parse one
way and others parse other ways? Can you find a deployed web
site which wouldn't work if the API were allowed to choose what
kind of normalization it did?
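
The parsed components above can be reproduced with a short Python sketch. `raw_parts` and `normalized_parts` are hypothetical names, the splitting assumes no userinfo and no IPv6 literal in the authority, and the particular normalization shown (lower-case host, drop port 80 for http, empty path becomes "/") is just one possible choice for illustration, not one this thread settles on.

```python
from urllib.parse import urlsplit


def raw_parts(url: str) -> dict:
    """Parse and accept the original form: no normalization at all.
    Assumes the authority is just host[:port]."""
    parts = urlsplit(url)
    host, _, port = parts.netloc.partition(":")
    return {"host": host, "port": port, "path": parts.path}


def normalized_parts(url: str) -> dict:
    """One possible normalization (an assumption for illustration)."""
    raw = raw_parts(url)
    scheme = urlsplit(url).scheme
    port = "" if (scheme == "http" and raw["port"] == "80") else raw["port"]
    return {"host": raw["host"].lower(), "port": port, "path": raw["path"] or "/"}


urls = [
    "http://example:80",
    "http://example:80/",
    "http://example",
    "http://example/",
    "http://eXAMple",
]

for u in urls:
    print(u, raw_parts(u), normalized_parts(u))
```

Under `raw_parts` the five URLs yield five different component sets; under `normalized_parts` they all collapse to the same one. Which of the two an API returns is exactly the choice being questioned here.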

But you would need a different canonicalization for every equivalence
relationship.

A priori yes. How many different web-facing equivalence relationships
are there in practice?

I don't know how many there are in total, but it is easy to find
lots of different cases. A number of services that do not use
URIs/IRIs for retrieval but rather as a carrier of semantics use
string-equality as the equivalence relationship, including XML
name spaces and semantic web. Perhaps these are not "web-facing"
(I can only imagine what you mean by that, but perhaps 'for
retrieval purposes' is what you mean?).

One major difference seems to be whether you want to be liberal
or conservative in cases where equivalence isn't certain. For
example, is final Greek sigma equivalent to non-final sigma, or
German ß (Eszett) to double s? The same question arises for
any of the various Unicode normalizations.

Now, I have three use cases: determining "same-document
fragment identifier", running an HTTP proxy cache, and comparing
a URI against a list of sites suspected of serving malware.
In the malware case, I might want to be liberal in determining
equivalence, and err on the side of suspecting malware even
if the host name is spelled in a way which some equivalence
relationships might consider not-equivalent. A cache, though,
might need to be conservative and treat IRIs as distinct even
if some comparison methods might consider them equivalent in
some circumstances.
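
A rough Python sketch of the liberal/conservative split. All names here (`Verdict`, `compare`, `matches_liberal`, `matches_conservative`) are hypothetical, and using NFKC plus case folding to flag "possibly equivalent" strings is an assumption chosen only because it happens to cover the sigma and ß examples; it is not a comparison method defined in this thread.

```python
import unicodedata
from enum import Enum


class Verdict(Enum):
    EQUIVALENT = "equivalent"
    DIFFERENT = "different"
    UNDETERMINED = "undetermined"


def fold(s: str) -> str:
    # Assumed "loose" form: NFKC normalization followed by case folding.
    return unicodedata.normalize("NFKC", s).casefold()


def compare(a: str, b: str) -> Verdict:
    """Three-valued comparison: certain match, certain mismatch, or unsure."""
    if a == b:
        return Verdict.EQUIVALENT
    if fold(a) == fold(b):
        return Verdict.UNDETERMINED
    return Verdict.DIFFERENT


def matches_liberal(a: str, b: str) -> bool:
    # Malware list: when in doubt, treat as the same site.
    return compare(a, b) is not Verdict.DIFFERENT


def matches_conservative(a: str, b: str) -> bool:
    # Cache: when in doubt, treat as distinct.
    return compare(a, b) is Verdict.EQUIVALENT


print(compare("straße.example", "strasse.example"))
print(matches_liberal("straße.example", "strasse.example"))
print(matches_conservative("straße.example", "strasse.example"))
```

The same underlying comparison serves both use cases; only the handling of the uncertain case differs.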

comment:2 Changed 3 years ago by masinter@adobe.com

  • Cc masinter@adobe.com added
  • Owner changed from draft-ietf-iri-3987bis@tools.ietf.org to masinter@adobe.com
  • Status changed from new to assigned
  • Component changed from 3987bis to comparison

comment:3 Changed 3 years ago by masinter@adobe.com

  • Status changed from assigned to closed
  • Resolution set to fixed

<section title="Comparison, Equivalence, Normalization and Canonicalization">

<t>In general, when considering a set of items or strings, there are several
interrelated concepts. A comparison method determines, between two items in the
set, their relationship. In particular, a comparison method for determining
equivalence might result in a determination that two (different) items are equivalent,
known to be different, or that equivalence isn't determined. </t>
<t> One way to define a comparison for equivalence is to define
a normalization or canonicalization algorithm. For each item in a set
of equivalent items, one of them could be designated the "normal" or
"canonical" form. </t>

<t>These general concepts are used with IRIs in this document,
and in other circumstances, where a mapping from one sequence of Unicode
characters to another one could be described as a "normalization" algorithm.</t>
<t> In general, this document tries to stay with the "equivalence" or
"comparison" methods, because sometimes the mathematical notion of
"normalization" results in forms that ordinary users might not consider "normal"
in an ordinary sense.

</t>

</section>
