URI Normalization is Weird

Wed Apr 1 00:12:39 2015 Zachary Scott mail@zzak.io

It all started with this bug report: Bug #9127

Robert writes:

The documentation for URI::Generic#normalize is vague and does not provide enough details about the algorithm that applies normalization.

What do the docs tell us?

~ => ri URI::Generic#normalize
------------------------------------------------------------------------------
Returns normalized URI

~ => ri URI::Generic#normalize!
= URI::Generic#normalize!
------------------------------------------------------------------------------
Destructive version of #normalize

Ok... and the code?

#
# Returns normalized URI
#
def normalize
  uri = dup
  uri.normalize!
  uri
end

#
# Destructive version of #normalize
#
def normalize!
  if path && path.empty?
    set_path('/')
  end
  if scheme && scheme != scheme.downcase
    set_scheme(self.scheme.downcase)
  end
  if host && host != host.downcase
    set_host(self.host.downcase)
  end
end

We can see that this method does three things: it sets an empty path to "/", and it downcases the scheme and host components of the URI.
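To make that concrete, here is a minimal sketch of what the method does to a URI. The example URLs are made up for illustration, and the output comments assume the normalize! implementation quoted above.

```ruby
require "uri"

# Mixed-case scheme and host, and no path at all (illustrative URL).
uri = URI("HTTP://Example.COM")

# normalize works on a copy (dup), so `uri` itself is not modified here.
normalized = uri.normalize

puts normalized.to_s   # http://example.com/
puts normalized.path   # /

# Only scheme and host are downcased; a non-empty path keeps its case.
puts URI("http://Example.COM/Some/Path").normalize.to_s
# http://example.com/Some/Path

# normalize! applies the same changes in place.
uri.normalize!
puts uri.host          # example.com
```

Note that nothing else is touched: no percent-decoding, no path cleanup, no query handling.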

But why are they downcased?

Well, we know that URI::Generic implements RFC 2396... because it says so in the documentation... :trollface:

=> ri URI::Generic
= URI::Generic < Object
------------------------------------------------------------------------------
Base class for all URI classes. Implements generic URI syntax as per RFC 2396.

If we take a look at RFC 2396, there is a section on how to handle case-sensitivity.

Section 6: URI Normalization and Equivalence

In many cases, different URI strings may actually identify the identical resource. For example, the host names used in URL are actually case insensitive, and the URL http://www.XEROX.com is equivalent to http://www.xerox.com.

This wasn't the first time we discussed this issue on Ruby's bug tracker.

I found this bug from nearly 5 years ago, normalization incomplete: Bug #2525

Marc-Andre writes:

"hTTp://example.com/" and "http://exa%4dple.com/" should both be normalized to "http://example.com/" as per RFC3986.

However, if you remember from earlier, you would have expected this reply:

Naruse: URI#normalize is based on RFC 2396.

Additionally, Marc-Andre was able to discover the following language in 2396:

Section 3.1: Scheme Component

Scheme names consist of a sequence of characters beginning with a lower case letter and followed by any combination of lower case letters, digits, plus ("+"), period ("."), or hyphen ("-").

This suggested that, at the time, Ruby's URI normalization was indeed incomplete, because it wasn't downcasing the scheme component.

So the story goes, Naruse commits 26227 to add this to #normalize!:

set_scheme(self.scheme.downcase)

Further

Additionally, there is much more to learn from RFC 3986 regarding equivalence and case normalization.

Section 6.1: Equivalence

In testing for equivalence, applications should not directly compare relative references; the references should be converted to their respective target URIs before comparison.

To me, this says we should consider each component of a URI in comparison.

Although it's suggested we can skip this step entirely:

Section 6.2.1: Simple String Comparison

If two URIs, when considered as character strings, are identical, then it is safe to conclude that they are equivalent.

We can see that normalization is key to helping us to make comparisons:

In practical terms, character-by-character comparisons should be done codepoint-by-codepoint after conversion to a common character encoding.

Regarding case-sensitivity, the "not-as-old-but-still-really-old" RFC says:

Section 6.2.2.1: Case Normalization

When a URI uses components of the generic syntax, the component syntax equivalence rules always apply; namely, that the scheme and host are case-insensitive and therefore should be normalized to lowercase.

And there we have it: scheme and host are case-insensitive, and the RFC clearly states that they should be "normalized to lowercase".
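Here is a rough sketch of how the two ideas fit together: a plain string comparison misses URIs that differ only in the case of scheme or host, while comparing after normalize catches them. The helper name equivalent_after_normalize? and the example URLs are hypothetical, made up for this post. This is not how the URI library itself defines equivalence.

```ruby
require "uri"

# Hypothetical helper: compare two URI strings after case normalization.
def equivalent_after_normalize?(a, b)
  URI(a).normalize.to_s == URI(b).normalize.to_s
end

a = "HTTP://WWW.Example.com/"
b = "http://www.example.com/"

# Simple string comparison: the strings differ, so we learn nothing.
puts a == b                               # false

# After downcasing scheme and host, the two strings are identical.
puts equivalent_after_normalize?(a, b)    # true

# The path is left as-is, so a difference in path case still counts.
puts equivalent_after_normalize?("http://example.com/A", "http://example.com/a")
# false
```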

What have we learned?

Ruby's generic URI implementation is vague so we are vague.

We use normalization when we need to compare URI components.

The current implementation of normalize! is only designed to handle case-insensitivity (plus turning an empty path into "/").

Ruby implements == and eql? but not === for generic URIs.
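As a small illustration of that last point, here is how == and eql? behave on two URIs that differ only by host case and a missing path. This is a sketch based on my reading of the uri library shipped with recent Rubies, where == compares the normalized components and eql? compares them as-is; treat the output comments as assumptions about that implementation.

```ruby
require "uri"

a = URI("http://EXAMPLE.com")
b = URI("http://example.com/")

# == normalizes both sides before comparing their components.
puts a == b                          # true

# eql? compares the components without normalizing them first.
puts a.eql?(b)                       # false

# Normalizing by hand first makes eql? agree.
puts a.normalize.eql?(b.normalize)   # true
```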

In summary

In order to fix this bug report, I need to clarify in the documentation that normalization in this case just means downcasing the scheme and host components of the URI (and replacing an empty path with "/").

It might also be worth adding some tests to Ruby, and RubySpec to confirm.

Well... I hope you have learned something about URI normalization and the URI library.

You can see that there is still work to do, and we're always improving.

There's more to discover and explore in the Ruby Standard Library! Enjoy!!