<?xml version="1.0" encoding="US-ASCII"?>

<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY ISO.646.1991 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml2/reference.ISO.646.1991.xml'>
<!ENTITY rfc0020 PUBLIC ''
'http://xml.resource.org/public/rfc/bibxml/reference.RFC.0020.xml'>
<!ENTITY rfc0097 PUBLIC ''
'http://xml.resource.org/public/rfc/bibxml/reference.RFC.0097.xml'>
<!ENTITY rfc0137 PUBLIC ''
'http://xml.resource.org/public/rfc/bibxml/reference.RFC.0137.xml'>
<!ENTITY rfc0139 PUBLIC ''
'http://xml.resource.org/public/rfc/bibxml/reference.RFC.0139.xml'>
<!ENTITY rfc0318 PUBLIC ''
'http://xml.resource.org/public/rfc/bibxml/reference.RFC.0318.xml'>
<!ENTITY rfc0698 PUBLIC ''
'http://xml.resource.org/public/rfc/bibxml/reference.RFC.0698.xml'>
<!ENTITY rfc0854 PUBLIC ''
'http://xml.resource.org/public/rfc/bibxml/reference.RFC.0854.xml'>
<!ENTITY rfc2277 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2277.xml'>
<!ENTITY rfc3629 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.3629.xml'>
<!ENTITY rfc2781 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2781.xml'>
<!ENTITY rfc3491 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.3491.xml'>
<!ENTITY rfc3912 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.3912.xml'>
<!ENTITY rfc2119 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml'>
<!ENTITY rfc0542 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.0542.xml'>
<!ENTITY rfc2821 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2821.xml'>
<!ENTITY rfc0954 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.0954.xml'>
<!ENTITY rfc0742 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.0742.xml'>
<!ENTITY rfc1945 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.1945.xml'>
<!ENTITY rfc2616 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2616.xml'>
<!ENTITY rfc3454 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.3454.xml'>
<!ENTITY rfc4251 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.4251.xml'>
<!ENTITY rfc4690 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.4690.xml'>
<!ENTITY rfc0959 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.0959.xml'>
<!ENTITY rfc5234 PUBLIC ''
   'http://xml.resource.org/public/rfc/bibxml/reference.RFC.5234.xml'>

]>

<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<?rfc rfcedstyle="yes"?>
<?rfc subcompact="no" ?>

<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>

<!-- Expand crefs and put them inline -->
<?rfc comments='no' ?>
<?rfc inline='yes' ?>  

<rfc number="5198" 
	  category="std"
	  updates="854"
	  obsoletes="698">
  <front>
    <title abbrev="Network Unicode">
      Unicode Format for Network Interchange </title>
<author initials="J.C." surname="Klensin" fullname="John C Klensin">
  <organization/>
  <address>
    <postal>
      <street>1770 Massachusetts Ave, #322</street>
      <city>Cambridge</city> <region>MA</region> <code>02140</code>
      <country>USA</country> </postal>
    <phone>+1 617 491 5735 </phone>
    <email>john-ietf@jck.com</email>
  </address>
  </author>
<author initials="M.A." surname="Padlipsky" fullname="Michael A. Padlipsky">
  <organization/>
  <address>
    <postal>
      <street>8011 Stewart Ave.</street>
      <city>Los Angeles</city> <region>CA</region> <code>90045</code>
      <country>USA</country> </postal>
    <phone>+1 310-670-4288</phone>
    <email>the.map@alum.mit.edu</email>
  </address>
  </author>

  <date month="March" year="2008"/>

<!-- [rfced] Please insert any additional keywords (beyond those that 
appear in the title) for use on http://www.rfc-editor.org/rfcsearch.html. -->

    <keyword>Text</keyword>
    <keyword>Internationalization</keyword>
    <keyword>ASCII</keyword>
    <keyword>Unicode</keyword>
    <keyword>UTF-8</keyword>

    <abstract>
      <t> The Internet today is in need of a standardized form for the
		 transmission of internationalized "text" information,
		 paralleling the specifications for the use of ASCII that date
		 from the early days of the ARPANET.  This document specifies
		 that format, using UTF-8 with normalization
		 and specific line-ending sequences.
      </t>
    </abstract>
  </front>

  <middle>
<section title="Introduction">
   <section title="Requirement for a Standardized Text Stream Format">
	  <t> Historically,  Internet protocols have been largely ASCII-based and
		 references to "text" in protocols have assumed ASCII text and
		 specifically text in Network Virtual Terminal ("NVT") or
		 "Network ASCII" form (see <xref target="NVT-History"/> and
		 <xref target="NVT-Printer"/>).
		 Protocols and formats that have moved beyond ASCII have
		 included arrangements to specifically identify the character
		 set and often the language being used. </t>
	  <t> In our more internationalized world, "text" clearly no longer
		equates unambiguously to "network ASCII".  Fortunately, however,
		we are converging on
		Unicode <xref target="Unicode"/> <xref target="ISO10646"/> as a
		single international interchange character coding and 
		no longer need to deal with per-script standards for character
		sets (e.g., one standard for each of Arabic, Cyrillic, Devanagari,
		etc., or even standards keyed to languages that are usually
		considered to share a script, such as French, German, or Swedish).
		Unfortunately, though, while it is certainly time to define a
		Unicode-based text type for use as a common text interchange 
		format, "use Unicode" involves even more ambiguity than "use
		ASCII" did decades ago.</t>
	 <t>Unicode identifies each character by an integer, called its "code
		point", in the range 0-0x10ffff.  These integers can be encoded into
		byte sequences for transmission  in at least three standard and
		generally-recognized encoding forms, all of which are completely
		defined in The  Unicode Standard and the documents cited below:
	<list style="symbols">
	   <t>UTF-8 <xref target="RFC3629"/> defines a variable-length
		  encoding that may be applied uniformly to all code
		  points.</t>
	   <t>UTF-16  <xref target="RFC2781"/> encodes the range of
		  Unicode characters whose code points are less than 65536
		  straightforwardly as 16-bit integers, and provides a
		  "surrogate" mechanism for encoding larger code points in 32
		  bits. </t>
	   <t>UTF-32 (also known as UCS-4) simply encodes each code point
		  as a 32-bit integer.</t>
	   </list>
	   Older forms and nomenclature, such as the 16-bit UCS-2, are
	   now strongly discouraged. </t>
	<t>As with ASCII, any of these forms may be used with different
	   line-ending conventions.  That flexibility can be an additional
	   source of confusion with, e.g., index (offset) references into
	   documents based on character counts.</t> 
 <t> This document proposes to establish "Net-Unicode" as a new
	standardized text
	transmission form for the Internet, to serve as an internationalized
	alternative for NVT ASCII when
	specified in new -- and, where appropriate, updated -- protocols.
	UTF-8 <xref target="RFC3629"/> is chosen for the coding because it
	has good compatibility properties with ASCII and for other reasons
	discussed in the existing IETF character set
	policy <xref target="RFC2277"/>.  "Net-Unicode" is specified in
	Section 2; the subsequent sections of the document provide
	background and explanation.</t>
 <t> Whenever there is a choice, Unicode SHOULD be used with the text
	encoding specified here.  This combination is preferred to the
	double-byte encoding of "extended ASCII"
	<xref target="RFC0698"/> or the assorted
	per-language or per-country character coding systems.
 </t>
 </section>

 <section title="Terminology" anchor="terminology">
 <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in
   <xref target="RFC2119"/>.</t>
   </section>
   </section>

   <section title="Net-Unicode Definition" 
			anchor="NetUnicodeDef">
	  <t> The Network Unicode format (Net-Unicode) is
		 defined as follows.  Parts of this definition are
		 deliberately informal, providing guidance for specific
		 profiles or rules in the protocols that reference this one
		 rather than firm rules that apply globally.

		 <list style="numbers">
		 <t>Characters MUST be encoded in UTF-8 as defined in
			<xref target="RFC3629"/>.</t>
		 <t>If the protocol has the concept of "lines", line-endings
			MUST be indicated by the sequence 
			Carriage-Return (CR, U+000D) followed by Line-Feed
			(LF, U+000A), often known just as CRLF.  CR SHOULD NOT
			appear except when followed by LF.  The only other allowed
			context in which CR is permitted is in the combination CR
			NUL, which is not recommended (see the note at the end of
			this section).</t>
		 <t>The control characters in the ASCII range (U+0000 to U+001F and
			U+007F to U+009F) SHOULD generally be
			avoided.   Space (SP, U+0020), CR, LF, and Form Feed (FF,
			U+000C) are exceptions to this principle, but use of all
			but the first requires care as discussed elsewhere in this
			document. The so-called "C1 Controls" (U+0080 through
			U+009F), which did not appear in ASCII, MUST NOT appear.
			<vspace blankLines="1"/>
			FF should be used only with caution: it does not
			have a standard and universal interpretation and, in
			particular, if its use
			assumes a page length, such assumptions may not be
			appropriate in international 
			contexts (e.g., considering 8.5x11 inch paper versus A4).
			Other control characters
			are used to affect display format, control devices, or to
			structure files.  None of those uses is appropriate for
			streams of plain text.  </t>
		 <t>Before transmission, all character sequences SHOULD be
			normalized according to Unicode normalization form "NFC"
			(see <xref target="Normalization"/>).</t>
		 <t>As suggested in Section 6 of RFC 3629, the Byte Order Mark ("BOM")
			signature MUST NOT appear at the beginning of these text
			strings.</t>
		 <t>Systems conforming to this specification MUST NOT
			transmit any string containing any code point that is
			unassigned in the version of Unicode on which they are
			dependent.  The version of NFC and the version of Unicode
			used by that system MUST be consistent.</t> 
			</list> </t>
		 
		 <t>The use of LF without CR is questionable; see
			<xref target="NVT-Printer"/> for more discussion.  The
			newer control characters IND (U+0084) and NEL ("Next
			Line", U+0085) might have been used to disambiguate the
			various line-ending situations, but, because their use has
			not been established on the Internet, because many protocols
			require CRLF, and because IND and NEL fall within the "C1
			Controls" group (see below), they MUST NOT be used.
			Similar observations apply to the yet newer line and paragraph
			separators at U+2028 and U+2029 and any future characters
			that might be defined to serve these functions.  For this
			specification and protocols that depend on it, lines end
			in CRLF and only in CRLF.  Anything that does not end in
			CRLF is either not a line or is severely malformed.</t>
		 <t> The NVT specification contained a number of
			additional provisions, e.g., for the optional use of
			backspacing and "bare CR" (sent as CR NUL) to generate
			overstruck character sequences. The much greater number of 
			precomposed characters in Unicode, the availability of
			combining characters, and the growing use of markup
			conventions of various types to show, e.g.,
			emphasis (rather than attempting to do that via the use of
			special characters), should make such sequences largely
			unnecessary.  These sequences SHOULD be avoided if at all
			possible.  However, because they were optional in NVT
			applications and this specification is an NVT superset,
			they cannot be prohibited entirely. 
			The most important of these rules is that
			CR MUST NOT appear unless it is immediately followed by
			LF (indicating end of line) or NUL.   Because NUL (an
			octet whose value is all zeros, i.e., %x00 in the notation
			of <xref target="RFC5234"/>) is
			hostile to programming languages that use that character
			as a string delimiter, the CR NUL sequence SHOULD be
			avoided for that reason as well.</t>
      </section>

	  <section title="Normalization" anchor="Normalization">

		 <t>There are cases where strings of Unicode are fundamentally equivalent,
			essentially representing the same text. These are called "canonical
			equivalents" in the Unicode Standard. For example, the following pairs of
			strings are canonically equivalent:
			<vspace blankLines="1"/>
		U+2126 OHM SIGN
		<vspace blankLines="0"/>
		U+03A9 GREEK CAPITAL LETTER OMEGA
		<vspace blankLines="1"/>
		U+0061 LATIN SMALL LETTER A, U+0300 COMBINING GRAVE ACCENT
		<vspace blankLines="0"/>
		U+00E0 LATIN SMALL LETTER A WITH GRAVE</t>
		<t>Comparison of strings becomes much easier if any such cases are always
		represented by a single unique form. The Unicode Consortium specifies a
		normalization form, known as NFC [NFC], which provides the necessary
		mappings and mechanisms to convert all canonically equivalent
		sequences to a
		single unique form. Typically, this form produces precomposed characters for
		any sequences that can be represented in that fashion. It also reorders
		other combining marks so that they have a unique and
		unambiguous order. </t>
	 <t>Of the various normalization forms defined as part of Unicode,
		NFC is closest to actual use in practice, minimizes
		side-effects due to considering characters equivalent that may
		not be equivalent in all situations, and typically requires
		the least work when converting from non-Unicode encodings.</t>
	 <t>The section above requires that, except in very unusual
		circumstances, all Net-Unicode strings be
		transmitted in normalized form.  Recognition of the fact that
		some implementations of applications may rely on operating system
		libraries over which they have little control and adherence
		to the robustness 
		principle suggests that receivers of such strings should be
		prepared to receive unnormalized ones and to not react to that
		in excessive ways.</t>
	</section>

   <section title="Versions of Unicode" anchor="versions">
		<t> Unicode changes and expands over time. Large
		   blocks of space are reserved for future expansion.  New
		   versions, which appear at regular intervals, add new scripts
		   and characters.  Occasionally they also change some property
		   definitions. 
		   In retrospect, one of the advantages of
		   <xref target="ASCII"> ASCII </xref> when it was chosen
		   was that the code space was full when the Standard was
		   first published.  There was no practical way to add
		   characters or change code point assignments without being
		   obviously incompatible.  </t>
		<t>While there are some security issues if people deliberately
		   try to trick the system (see <xref target="Security"/>),
		   Unicode version changes should not have a significant
		   impact on the text stream specification of this document
		   for the following reasons:

		   <list style="symbols">
			  <t> The transformation between Unicode code table
				 positions and the corresponding UTF-8 code is
				 algorithmic; it does not depend on whether a code
				 point has been assigned or not.</t>
			  <t>The normalization recommended here, NFC (see
				 <xref target="Normalization"/>), performs a very
				 limited set of mappings, much more limited than those
				 of the more extensive NFKC used in, e.g.,
				 <xref target="RFC3491"> Nameprep</xref>.</t>

			  </list></t>
		   <t>
			  The NFC tables may be updated over time as new
			  characters are added, but the Unicode Consortium has
			  guaranteed the stability of all NFC strings. 
				That is, if a string does not contain any unassigned
				characters, and it is 
				normalized according to NFC, it will always be
				normalized according to all 
				future versions of the Unicode Standard. The
				stability of the Net-Unicode format is thus
				guaranteed when any implementation that converts text
				into Net-Unicode format does not permit unassigned
				characters.</t>
			 <t> Because Unicode code points that are reserved for
				private use do not have standard definitions or
				normalization interpretations, they SHOULD be avoided
				in strings intended for Internet interchange.</t>
			 <t>Were Unicode to be changed in a way that violated
				these assumptions, i.e., that either invalidated the
				byte string order specified in RFC 3629 or that changed 
				the stability of NFC as stated above, this specification would not apply.
				Put differently, this specification applies only to
				versions of Unicode 
				starting with version 5.0 and extending to, but not
				including, any version 
				for which changes are made in either the UTF-8
				definition or to NFC stability.   Such changes would
				violate established Unicode policies and are hence
				unlikely, but, should they occur, it would be 
				necessary to evaluate them for compatibility with this
				specification and other Internet uses of NFC. </t>
			 <t>If the specification of a protocol references this
				one, strings that are received by that protocol and
				that appear to be UTF-8 and are not otherwise
				identified (e.g., by charset labeling) SHOULD be
				treated as using UTF-8 in conformance with this
				specification.</t> 
		   </section>

	<section
	 title="Applicability and Stability of this Specification">
	   <section title="Use in IETF Applications Specifications">
	   <t>During the development of this specification, there was some
		  confusion about where it would be useful given that, e.g.,
		  the individual MIME media types used in email and with HTTP 
		  have their own rules about UTF-8 character types and
		  normalization, and the
		  application transport protocols impose their own conventions
		  about line endings.  There are
		  three answers.   The first is that, in retrospect, it would
		  have been better to have those protocols and content types
		  standardized in the way specified here, even though it is
		  certainly too 
		  late to change them at this time.   The second is that we
		  have several protocols that are dependent on 
		  either the original Telnet design or other arrangements
		  requiring a standard,  
		  interoperable, string definition without specific
		  content-labels of one sort or another.
		  Whois <xref target="RFC3912"/> is an example member of this
		  group.     As consideration is
		  given to upgrading them for non-ASCII use, this
		  specification provides a normative reference that
		  provides the same stability that NVT has provided the ASCII
		  forms.
		  This specification is intended for use by other
		  specifications that have not yet defined how to use Unicode.
		  Having a preferred standard
		  Internet definition for Unicode text streams -- rather than
		  just one for transmission codings -- may help improve the
		  specification and interoperability of protocols to be
		  developed in the future.	This specification is not intended
		  for use with specifications that already allow the use of
		  UTF-8 and precisely define that use.</t>
	   </section>
	   <section title="Unicode Versions and Applicability"
				anchor="UnicodeDilemma">
		  <t>The IETF faces a practical dilemma with regard to versions of
			 Unicode.  Each new version brings with it new characters
			 and sometimes new combining characters.  Version 5.0
			 introduces the new concept of sequences of characters
			 named as if they were individual characters (see
			 <xref target="NamedSequences" />).
			 The normalization represented by NFC is stable if all
			 strings are transmitted and stored in normalized form if
			 corrections are never made to character 
			 definitions or normalization tables and if unassigned
			 code points are never used.   The latter is important
			 because an unassigned code point always normalizes to
			 itself.  However, if the same code point is assigned to a
			 character in a future version, it may participate in some
			 other normalization mapping (some specific difficulties
			 in this regard are 
			 discussed in <xref target="RFC4690"/>).  It is worth
			 noting that transmission in normalized form is not
			 required by either the
			 IETF's UTF-8 Standard <xref target="RFC3629"/> or by
			 standards dependent on the current version of
			 Stringprep <xref target="RFC3454"/>. </t>
		  <t> All would be well with this as described in
			 <xref target="versions"/> except for one problem:
			 Applications typically do not perform their own
			 conversions to Unicode and may not perform their own
			 normalizations but instead rely on operating system or
			 language library
			 functions -- functions that may be upgraded or otherwise
			 changed without changes to the application code itself.
			 Consequently, there may be no plausible way for an
			 application to know which version of Unicode, or which
			 version of the normalization procedures, it is utilizing,
			 nor is there any way by which it can guarantee that the
			 two will be consistent.</t>
		  <t>Because of per-version changes in definitions and
			 tables, Stringprep and documents depending on it are now
			 tied to Unicode Version 3.2 <xref target="Unicode32"/>
			 and full interoperability of Internet Standard
			 UTF-8 <xref target="RFC3629"/>, when used with
			 normalization as specified here, is dependent on
			 normalization definitions and the definition of UTF-8 itself
			 not changing after Unicode Version 5.0.  These
			 assumptions seem fairly safe, but they are still
			 assumptions.
			 Rather than being linked to the latest available version
			 of Unicode, version 5.0 <xref target="Unicode"/> or
			 broader concepts of version independence based on
			 specific assumptions and conditions, this
			 specification could reasonably have been tied, like 
			 Stringprep and Nameprep to
			 Unicode 3.2 <xref target="Unicode32"/> or some more
			 recent intermediate version, but, in
			 addition to the obvious disadvantages of having different
			 IETF standards tied to 
			 different versions of Unicode, the library-based
			 application 
			 implementation behavior described above makes these
			 version linkages nearly meaningless in practice.</t>
		  <t>In theory, one can get around this problem in four ways:
			 <list style="numbers">
				<t>Freeze on a particular version of Unicode and try
				   to insist that applications enforce that version
				   by, e.g., containing lists of unassigned characters
				   and prohibiting their use.  Of course, this would
				   prohibit evolution to include newly-added scripts
				   and the tables of unassigned code points would be
				   cumbersome.</t>
				<t>Require that every Unicode "text" string or file
				   start with a version indication, somewhat akin to
				   the "byte order mark" indicator.  It is unlikely
				   that this provision would be practical.  More
				   important, it would require that each application
				   implementation be prepared to either support
				   multiple normalization tables and versions or that
				   it reject text from Unicode versions with which it
				   was not prepared to deal.</t>
				<t>Devise a different set of normalization rules that
				   would, e.g., guarantee that no character assigned
				   to a previously-unassigned code point in Unicode
				   was ever normalized to anything but itself and use
				   those rules instead of NFC.   It is not clear
				   whether or not such a set of rules is possible or
				   whether some other completely stable set of rules
				   could be devised, perhaps in combination with
				   restrictions on the ways in which characters were
				   added in future versions of Unicode.</t>
				<t>Devise a normalization process that is otherwise
				   equivalent to NFC but that rejects code points
				   that are unassigned in the current version of
				   Unicode, rather than mapping those code points to
				   themselves.  This would still leave some risk of
				   incompatible corrections in Unicode and possibly a
				   few edge cases, but it is probably stable enough for
				   Internet use in the overwhelming number of cases.
				   This process has been discussed in the Unicode
				   Consortium under the name "Stable NFC".</t>
				</list></t>
			 <t> None of these approaches seems ideal: the ideal
				procedure would be as stable and predictable as ASCII
				has been.  But that level is simply not feasible as
				long as Unicode continues to evolve by the addition of new
				code points and scripts.  The fourth option listed
				above appears to be a reasonable compromise.</t>
		  </section>
		  </section>
		   
 
    <section title="Security Considerations" anchor="Security">
    <t> This specification provides a standard form for the use of
	   Unicode as "network text".  Most of the same security issues that apply
	   to UTF-8, as discussed in <xref target="RFC3629"/>, apply to
	   it, although it should be slightly less subject to some
	   risks by virtue of requiring NFC normalization and generally
	   being somewhat more restrictive.  However, shifts in Unicode
	   versions, as discussed in <xref target="UnicodeDilemma"/>, may
	   introduce other security issues.</t>
	<t> Programs that receive these streams should use extreme caution
	   about assuming that incoming data are normalized, since it
	   might be possible to use unnormalized forms, as well as invalid
	   UTF-8, as part of an attack.  In particular, firewalls and
	   other systems that interpret UTF-8 streams should be developed
	   with the clear knowledge that an attacker may deliberately
	   send unnormalized text, for instance, to avoid detection by
	   naive text-matching systems.</t>
	<t> NVT contains a requirement, of necessity repeated here (see
	   <xref target="NetUnicodeDef"/>), that the CR character be
	   immediately followed by either LF or ASCII NUL (an octet with
	   all bits zero).  NUL may be problematic for some programming
	   languages that use it as a string terminator, and hence a trap
	   for the unwary, unless caution is used.  This may 
	   be an additional reason to avoid the use of CR entirely, except in
	   sequence with LF, as suggested above.</t> 
	<t>The discussion about Unicode versions above
	   (see <xref target="versions"/> and
	   <xref target="UnicodeDilemma"/>) makes several assumptions about
	   future versions of Unicode, about NFC normalization being 
	   applied properly, and about UTF-8 being processed and
	   transmitted exactly as specified in RFC 3629.  If
	   any of those assumptions are not correct, then there are cases
	   in which strings that would be considered equivalent do not
	   compare equal.  Robust code should be prepared for those
	   possibilities. </t>
    </section>

    <section title="Acknowledgments" anchor="Acknowledgments">
    <t> Many thanks to Mark Davis, Martin Duerst, and Michel Suignard
	   for suggestions about Unicode normalization that led to the
	   format described here, and especially to Mark for providing the
	   paragraphs that describe the role of NFC.  Thanks also to Mark,
	   Doug Ewell, Asmus Freytag for corrected text describing Unicode
	   transmission forms, and to Tim Bray, Carsten Bormann,
	   Stephane Bortzmeyer, Martin Duerst, Frank Ellermann,
	   Clive D.W. Feather,
	   Ted Hardie, Bjoern Hoehrmann, Alfred Hoenes, Kent Karlsson,
	   Bill McQuillan, George Michaelson,
	   Chris Newman, and Marcos Sanz
	   for a number of helpful comments and clarification requests.
    </t>
    </section>

  </middle>
 
  <back>

	 <section title="History and Context" anchor="NVT-History">
		<t> This subsection contains a review of prior work in the
		 ARPANET and Internet to establish a standard text type, work
		 that establishes the context and motivation for the approach
		 taken in this document.   The text is explanatory rather than
		 normative: nothing in this section is intended to change or
		 update any current specification.  Those who are
		 uninterested in this review and analysis can safely skip
		 this section.</t> 
   <t> One of the earlier application design decisions made in the
	development of
	ARPANET, a decision that was carried forward into the Internet,
	was the
	decision to standardize on a single and very specific coding for
	"text" to be passed across the
	network <xref target="RFC0020"/>.  Hosts on the network were then
	responsible for translating or mapping from whatever character
	coding conventions were used locally to that common intermediate
	representation, with sending hosts
	mapping to it and receiving ones mapping from it to their local
	forms as needed.
	It is interesting to note that at the time
	the ARPANET was being developed, participating host operating systems used
	at least three different character coding standards: the antiquated BCD (Binary
	Coded Decimal), the then-dominant major manufacturer-backed
	EBCDIC (Extended
	BCD Interchange Code), and the then-still emerging ASCII (American Standard Code
	for Information Interchange).
	Since the ARPANET was an "open" project and EBCDIC was intimately
	linked to a particular hardware vendor, the original Network
	Working Group agreed that its 
	standard should be ASCII.  That ASCII form was precisely "7-bit ASCII in an
	8-bit field", which was in effect a compromise between hosts that
	were natively 7-bit oriented (e.g., with five seven-bit characters
	in a 36-bit word), those that were 8-bit oriented (using eight-bit
	characters) and those that placed the seven-bit ASCII characters
	in 9-bit fields with two leading zero bits (four characters in a
	36-bit word).</t>
	<t>More standardization was suggested in the first preliminary
	description of the Telnet
	protocol <xref target="RFC0097"/>.  With the iterations of that
	protocol <xref target="RFC0137"/> <xref target="RFC0139"/> and the
	drawing together of an essentially formal definition somewhat later
	<xref target="RFC0318"/>, a
	standard abstraction, the Network Virtual Terminal (NVT) was
	established.
	NVT character-coding conventions
	(initially called "Telnet ASCII" and later called "NVT ASCII",
	or, more casually, "network ASCII") included the requirement 
	that Carriage Return followed by Line Feed (CRLF) be the common
    representation for ending lines of text (given that some
    participating "Host" operating systems used the one natively, some
    the other, at least one used both, and a few used neither
	(preferring variable-length lines with counts or special
	delimiters or markers instead) and specified conventions
	for some other characters. Also, since NVT ASCII was restricted to seven-bit
	characters, use of the high-order bit in octets was reserved for
	the transmission of control signaling information.</t>
	 <t>At a very high level, the concept was that a system could use
	whatever character coding and line representations were
	appropriate locally, but text transmitted over the network as text
	must conform to the single "network virtual terminal"
	convention.
	 Virtually all early Internet protocols that presume transfer of
	 "text" assume this virtual terminal model, although different
	 ones assume or limit it in different ways.  Telnet, the command
	 stream and ASCII Type in FTP <xref target="RFC0542"/>, the
	 message stream in SMTP transfer <xref target="RFC2821"/>, and
	 the strings passed to finger <xref target="RFC0742"/> and
	 whois <xref target="RFC0954"/> are the classic examples.  More
	 recently, HTTP <xref target="RFC1945"/> <xref target="RFC2616"/>
	 follows the same general  
	 model but permits 8-bit data and leaves the line end sequence
	 unspecified (the latter has been the source of a significant
	 number of problems).</t>

<!-- [rfced] Do you want to update the citations above to the most recent RFCs on the
  topics (or add mentions of the more recent RFCs)? Three of the references have been obsoleted: 
  742 by 1288, 
  954 by 3912 (which is cited elsewhere), and 
  542 by 959 (also cited elsewhere). -->

  </section>

	 <section title="The ASCII NVT Definition"
			  anchor="NVT-Printer">
		<t> The main body of this specification is intended as an
		   update to, and  internationalized version of, the
		   Net-ASCII definition.  The specification is self-contained
		   in that parts of the Net-ASCII definition that are no
		   longer recommended are not included above.  Because
		   Net-ASCII evolved somewhat over time and there has been
		   debate about which specification is the "official"
		   Net-ASCII, it is appropriate to review the key elements of
		   that definition here.  This review is informal with
		   regard to the contents of Net-ASCII and
		   should not be considered as a normative update or summary
		   of the earlier
		   specifications (<xref target="NetUnicodeDef"/> does specify
		   some normative updates to those specifications and some
		   comments below are consistent with it).</t> 
		<t>The first part of the section titled "THE NVT PRINTER AND
		   KEYBOARD" in <xref target="RFC0854">RFC 854</xref> is
		   generally, although not universally, considered to be the
		   normative definition of the (ASCII) Network Virtual
		   Terminal and hence of Net-ASCII.   It includes not only the
		   graphic ASCII characters but a number of control
		   characters.  The latter are given Internet-specific
		   meanings that are often more specific than the definitions
		   in the ASCII specification.  In today's usage, and for
		   the present specification, the following clarifications and
		   updates to that list should be noted.  Each one is
		   accompanied by a brief explanation of the reason why the
		   original specification is no longer appropriate.

		   <list style="numbers">
			  <t>The "defined but not required" codes --
				 BEL (U+0007), BS (U+0008), HT (U+0009),
				 VT (U+000B), and FF (U+000C) -- and the undefined
				 control codes ("C0")
				 SHOULD NOT be used unless required by 
				 exceptional circumstances.  Either their original
				 "network printer" definitions are no longer in
				 general use, common practice has evolved away from
				 the formats specified there, or their use to simulate
				 characters that are better handled by Unicode is no
				 longer appropriate.  While the appearance of some of
				 these characters on the list may seem surprising, BS
				 now has an ambiguous interpretation in practice
				 (erasing in some systems but not in others), the
				 width associated with HT varies with the environment,
				 and VT and FF do not have a uniform effect with
				 regard to either vertical positioning or the
				 associated horizontal position result.  Of course,
				 telnet escapes are not considered part of the data
				 stream and hence are unaffected by this
				 provision.</t> 
			  <t>In Net-ASCII, CR MUST NOT appear except when
				 immediately followed 
				 by either NUL or LF, with the latter (CR LF)
				 designating the "new line" function.  Today and as
				 specified above, CR should generally appear only
				 when followed by LF. 
				 Because page 
				 layout is better done in other ways, because NUL has
				 a special interpretation in some programming
				 languages, and to avoid other types of confusion, CR NUL
				 should preferably be avoided as specified above.</t>
			  <t>LF CR SHOULD NOT appear except as a side-effect of
				 multiple CR LF sequences (e.g., CR LF CR LF).</t>
			  <t>The historical NVT documents do not call out either
				 "bare LF" (LF without CR) or HT for special
				 treatment.  Both have generally been understood to
				 be problematic.  In the case of LF, there is a
				 difference in interpretation as to whether its
				 semantics imply "go to same position on the next
				 line" or "go to the first position on the next line"
				 and interoperability considerations suggest not
				 depending on which interpretation the receiver
				 applies.  At the same time, misinterpretation of LF
				 is less harmful than misinterpretation of "bare" CR:
				 in the CR case, text may be erased or made
				 completely unreadable; in the LF one, the worst
				 consequence is a very funny-looking display.
				 Obviously, HT is problematic because there is no
				 standard way to transmit intended tab position or
				 width information in running text.  Again, the harm
				 is unlikely to be great if HT is simply interpreted
				 as one or more spaces, but, in general, it cannot be
				 relied upon to format information.</t>=
		   </list></t>
		   <t>It is worth noting that the telnet IAC character
			  (an octet consisting of all ones, i.e., %xFF) itself is
			  not a problem for UTF-8 since that particular octet cannot
			  appear in a valid UTF-8 string.  However, while few of
			  them have been used, telnet permits other
			  command-introducer characters whose bit sequences in an
			  octet may be part of valid UTF-8 characters.  While
			  it causes no ambiguity in UTF-8, Unicode
			  assigns a graphic character ("Latin Small Letter Y with
			  Diaeresis") to U+00FF (octets C3 B0 in UTF-8).  Some
			  caution is clearly in order in this area.</t>
		</section>

	<section title="The Line-Ending Problem">
	   <t>The definition of how a line ending should be denoted in
		  plain text strings on the wire for the Internet has been
		  controversial from even before the introduction of NVT.
		  Some have argued that recipients should be required to
		  interpret almost anything that a sender might intend as a
		  line ending as actually a line ending.  Others have pointed
		  out that this would lead to some ambiguities of
		  interpretation and presentation and would violate the
		  principle that we should minimize the number of forms that
		  are permitted on the wire in order to promote
		  interoperability and eliminate the "every recipient needs to
		  understand every sender format" problem.  The design of this
		  specification, like that of NVT, takes the latter approach.
		  Its designers believe that there is little point in a
		  standard if it is to specify "anyone can do whatever they
		  like and the receiver just needs to cope".</t>
	  <t> A further discussion of the nature and evolution of the
			line-ending problem appears in Section 5.8 of the Unicode
			Standard <xref target="Unicode"/> and is suggested for
			additional reading.  If we were starting with the Internet
			today, it would probably be sensible to follow the
			recommendation there and use LS (U+2028) exclusively, in
			preference to CRLF.  However, the installed base of use of
			CRLF and the importance of forward compatibility with NVT
			and protocols that assume it makes that impossible, so it
			is necessary to continue using CRLF as the "New Line
			Function" ("NLF", see the terminology section in that
			reference)<!-- discussed there-->.</t>
	</section>	  

	<section title="A Note about Related Future Work">
	   <t><!--Once this proposal is approved, c --> Consideration should be
		  given to a Telnet
		  (or SSH <xref target="RFC4251"/>) option to specify this
		  type of stream and an
		  FTP extension <xref target="RFC0959"/> to permit a new
		  "Unicode text" data TYPE. </t>
	   </section>

<references title="Normative References">

	  &rfc3629;   <!-- UTF-8 -->
	  &rfc2119;   <!-- requirement levels -->
w	  &rfc5234;   <!-- ABNF -->

<reference anchor="Unicode">
		  	<front>
	  <title abbrev="Unicode 5.0">
		 The Unicode Standard, Version 5.0
			 </title>
		<author>
		<organization>The Unicode Consortium</organization>
		<address />
		</author>
		<date year="2007" />
		</front>
		<annotation>Boston, MA, USA: Addison-Wesley.
		   ISBN 0-321-48091-0</annotation>
		  </reference>

<reference anchor="Unicode32">
<front>
     <title abbrev="Unicode 3.2">
 The Unicode Standard, Version 3.0
     </title>
<author>
<organization>The Unicode Consortium</organization>
<address />
</author>
<date year="2000" />
</front>
<annotation>
(Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5). Version 3.2
consists of the definition in that book  
                as amended by the Unicode Standard Annex #27: Unicode
                3.1 (http://www.unicode.org/reports/tr27/) and by the
                Unicode Standard Annex #28: Unicode 3.2
                (http://www.unicode.org/reports/tr28/). </annotation>
</reference>

<reference anchor="NFC" target="http://www.unicode.org/reports/tr15/">
	<front>
		 <title abbrev="NFC">
			Unicode Standard Annex #15: Unicode Normalization Forms
		 </title>
	<author initials="M." surname="Davis" fullname="Mark Davis">
	<organization>The Unicode Consortium</organization>
	<address />
	</author>
	<author initials="M." surname="Duerst" fullname="Martin Duerst">
	   <organization/>  <address />
	   </author>
	<date year="2006" month="October" day="12" />
	</front>
</reference>

<reference anchor="ISO10646">
	<front>
	<title> Information Technology - Universal Multiple-Octet
		Coded Character Set (UCS) - Part 1: Architecture
		and Basic Multilingual Plane </title>
	<author>
	<organization>International Organization for Standardization</organization>
	<address />
	</author>
	<date year="2000" month="October" />
	</front>
	<seriesInfo name="ISO/IEC" value="10646-1:2000" />
</reference>
 
    </references>

<references title="Informative References">

   &rfc0020;
   &rfc0097;
   &rfc0854;
   	  &rfc0137;	  <!-- telnet/NVT definition -->
	  &rfc0139;   <!-- telnet/NVT explanation -->
	  &rfc0318;   <!-- telnet definition/ summary and NVT definition -->  
   &rfc0698;	<!-- telnet extended ASCII --> 
   &rfc2277;
   &rfc2781;   <!-- UTF-16 -->
   &rfc3491;   <!-- Nameprep -->
   &rfc3912;   <!-- Whois -->
   &rfc4251;	<!-- SSH -->
   &rfc4690;	<!-- IDNA-issues -->
   
	  &ISO.646.1991;

<reference anchor='ASCII'>
        <front>
          <title>USA Code for Information Interchange</title>
          <author>
            <organization abbrev="ANSI">American National Standards Institute
                (formerly United States of America Standards Institute)
	    </organization>
          </author>
          <date year="1968"/>
        </front>
      <seriesInfo name="ANSI" value="X3.4-1968" />
      <annotation>ANSI X3.4-1968 has been replaced by newer
	  versions with slight modifications, but the 1968 version
	  remains definitive for the Internet.
	  <xref target="ISO.646.1991">ISO 646 International Reverence Version
		 (IRV)</xref> is usually considered equivalent to
	  ASCII. </annotation>
      </reference>
	  
<!-- <reference anchor="ISO.8859.2003">
	 <front>
	 <title>Information processing - 8-bit single-byte coded graphic
		character sets - Part 1: Latin alphabet No. 1 (1998) - Part 2:
		Latin alphabet No. 2 (1999) - Part 3: Latin alphabet No. 3 (1999)
		- Part 4: Latin alphabet No. 4 (1998) - Part 5: Latin/Cyrillic
		alphabet (1999) - Part 6: Latin/Arabic alphabet (1999) - Part 7:
		Latin/Greek alphabet (2003) - Part 8: Latin/Hebrew alphabet (1999)
		- Part 9: Latin alphabet No. 5 (1999) - Part 10: Latin alphabet
		No. 6 (1998) -
		Part 11: Latin/Thai alphabet (2001) -
		Part 13: Latin alphabet No. 7 (1998) -
		Part 14: Latin alphabet No. 8 (Celtic) (1998) -
		Part 15: Latin alphabet No. 9 (1999) -
		Part 16: Part 16: Latin alphabet No. 10 (2001)</title>
	 <author>
	 <organization>International Organization for Standardization</organization>
	 </author>
	 <date month="" year="2003" />
	 </front>
	 <seriesInfo name="ISO" value="Standard 8859" />
</reference>  -->

<reference anchor="NamedSequences"
    target="http://www.unicode.org/Public/UNIDATA/NamedSequences.txt">
   <front>
     <title>NamedSequences-4.1.0.txt</title>
	<author>
	<organization>The Unicode Consortium</organization>
	<address />
	</author>
	<date year="2005" />
	</front>
</reference>

<!-- old NVT-using protocol examples in appendix -->
   &rfc0542;		<!-- ftp -->
   &rfc0959;		<!-- ftp standard -->
   &rfc2821;		<!-- smtp -->
   &rfc0954;		<!-- whois - original, not Leslie's -->
   &rfc0742;		<!-- finger -- original, not Zimmerman's -->
   &rfc1945;		<!-- HTTP 1.0 -->
   &rfc2616;		<!-- HTTP 1.1 -->

   &rfc3454;		<!-- Stringprep -->

</references>
</back>
</rfc>
