<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="type123.xsl"?>
<rt_package>
  <rt_workset rt_description="Created by TYPE123 version 1.0 on Wed 03-Sep-2003 15:32" rt_id="1">
  <!-- a workset is a collection of rt_page elements -->
    <rt_page rt_id="1" rt_width="8.5in" rt_height="11in" rt_description="A collection of paragraphs">
    <!-- each page consists of a number of paragraphs -->
      <rt_paragraph rt_top="15" rt_left="15" rt_width="804" rt_height="1026" rt_align="left" rt_section="1">
      <!-- each paragraph consists of a number of rt_text elements -->

<rt_text rt_font="  10pt Arial" rt_decoration="none"></rt_text>

      </rt_paragraph>
      <rt_paragraph rt_top="161" rt_left="20" rt_width="791" rt_height="884" rt_align="left" rt_section="2">

<rt_text rt_font="  10pt Verdana" rt_decoration="none">A </rt_text>

<rt_text rt_font="bold  10pt Verdana" rt_decoration="none">B</rt_text>

<rt_text rt_font="  10pt Verdana" rt_decoration="none">yte </rt_text>

<rt_text rt_font="bold  10pt Verdana" rt_decoration="none">O</rt_text>

<rt_text rt_font="  10pt Verdana" rt_decoration="none">rder </rt_text>

<rt_text rt_font="bold  10pt Verdana" rt_decoration="none">M</rt_text>

<rt_text rt_font="  10pt Verdana" rt_decoration="none">ark at the front of a text file specifies the type of </rt_text>

<rt_text rt_font="bold  10pt Verdana" rt_decoration="none">U</rt_text>

<rt_text rt_font="  10pt Verdana" rt_decoration="none">nicode </rt_text>

<rt_text rt_font="bold  10pt Verdana" rt_decoration="none">T</rt_text>

<rt_text rt_font="  10pt Verdana" rt_decoration="none">ransformation </rt_text>

<rt_text rt_font="bold  10pt Verdana" rt_decoration="none">F</rt_text>

<rt_text rt_font="  10pt Verdana" rt_decoration="none">ormat that is being used<?rt_break?>
and the little/big endian significance. For example, EF BB BF indicates that the file is UTF-8 encoded. It is known that<?rt_break?>
some utilities and web browsers on NT and XP class platforms will decode files and, with the appropriate font, show<?rt_break?>
you the right glyphs to go with any special Unicode characters cited. What may be a surprise is that utilities like<?rt_break?>
Notepad on Windows XP will decode UTF </rt_text>

<rt_text rt_font="bold  10pt Verdana" rt_decoration="none">even if the Byte Order Mark is not present!<?rt_break?>
<?rt_break?>
</rt_text>

<rt_text rt_font="  10pt Verdana" rt_decoration="none">Doesn't this wreck innocent text files? The answer is no, it works surprisingly well because of the nature of UTF<?rt_break?>
encoding. An ASCII character in the range $00 to $7F is not changed, so many UTF text files can be viewed (or even<?rt_break?>
edited) by utilities that are not aware of UTF. Characters in the range $80 to $7FF are encoded as a sequence of two<?rt_break?>
bytes, $800 to $FFFF a sequence of 3, and so on up to the largest character that can be encoded<?rt_break?>
($7FFFFFFF) which requires a sequence of 6 bytes.<?rt_break?>
<?rt_break?>
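The byte counts quoted above can be checked with a short Python sketch. It leans on Python's built-in "utf-8" codec,<?rt_break?>
which follows the current definition of UTF-8 and stops at four bytes ($10FFFF), so the five and six byte forms are<?rt_break?>
not shown. The names samples and lengths are illustrative only.<?rt_break?>
<?rt_break?>
<![CDATA[
```python
# Characters on either side of each length boundary described above.
samples = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

# Map each character code to the number of bytes in its UTF-8 form.
lengths = {code: len(chr(code).encode("utf-8")) for code in samples}

for code, size in lengths.items():
    print(f"${code:X} takes {size} byte(s)")

# Plain ASCII text comes through byte for byte.
print("Byte Order Mark".encode("utf-8") == b"Byte Order Mark")
```
]]><?rt_break?>
<?rt_break?>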
To decode a file that contains UTF, all bytes are scanned looking for a byte with the high order bit set (1xxxxxxx).<?rt_break?>
The number of bytes in a UTF sequence can be deduced from the number of consecutive one bits at the start of the<?rt_break?>
byte. Thus, 110????? introduces a two byte sequence, 1110????  a three byte sequence and so on. The specification<?rt_break?>
goes on to mandate that nothing should encode to a byte that could be construed as a control code. In fact, UTF<?rt_break?>
encoding adds a lot more redundancy by specifying that second and subsequent bytes start with binary 10. UTF<?rt_break?>
notation is very easy to recognise (by a program) and quite hard to generate by mistake.<?rt_break?>
<?rt_break?>
110xxxxx 10xxxxxx - is a legal two byte sequence.<?rt_break?>
1110xxxx 10xxxxxx 10xxxxxx - a three byte sequence.<?rt_break?>
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx - four bytes<?rt_break?>
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx - five<?rt_break?>
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx - six<?rt_break?>
<?rt_break?>
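The scan described above might be sketched in Python as follows. The function names sequence_length and<?rt_break?>
looks_like_utf8 are made up for illustration. The check is structural only (a lead byte followed by the right number<?rt_break?>
of 10xxxxxx bytes) and is more forgiving than a strict decoder would be.<?rt_break?>
<?rt_break?>
<![CDATA[
```python
def sequence_length(lead):
    """Count the leading one bits of a lead byte to get the sequence length.

    Returns 1 for plain ASCII (0xxxxxxx) and None for a byte that cannot
    start a sequence (a 10xxxxxx follow-on byte, or $FE / $FF).
    """
    if lead < 0x80:
        return 1
    ones = 0
    mask = 0x80
    while lead & mask:
        ones += 1
        mask >>= 1
    return ones if 2 <= ones <= 6 else None


def looks_like_utf8(data):
    """Check that every lead byte has the right number of 10xxxxxx bytes after it."""
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        if n is None:
            return False
        tail = data[i + 1:i + n]
        # Every follow-on byte must start with binary 10.
        if len(tail) != n - 1 or any((b & 0xC0) != 0x80 for b in tail):
            return False
        i += n
    return True
```
]]><?rt_break?>
<?rt_break?>
Text saved in a single byte code page rarely passes this test, which is why the notation is hard to generate by mistake.<?rt_break?>
<?rt_break?>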
This notation is UTF-8. If you count the number of "x" binary digits that can be encoded by each sequence you will<?rt_break?>
note that a three byte sequence is enough to encode any sixteen bit value. The Windows API normally works in 16 bit<?rt_break?>
units - the so called UTF-16 notation. One unit is enough for any double byte Unicode character.<?rt_break?>
<?rt_break?>
To get the UTF-16 code from the UTF-8 just shift the bits into position.<?rt_break?>
<?rt_break?>
110xxxxx 10yyyyyy decodes to 00000xxxxxyyyyyy<?rt_break?>
1110xxxx 10yyyyyy 10zzzzzz decodes to xxxxyyyyyyzzzzzz<?rt_break?>
<?rt_break?>
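The shifting can be written out in a few lines of Python. This is a sketch, not a full decoder: decode_two and<?rt_break?>
decode_three are hypothetical names, and the bytes passed in are assumed to be a well formed sequence already.<?rt_break?>
<?rt_break?>
<![CDATA[
```python
def decode_two(b1, b2):
    """110xxxxx 10yyyyyy  ->  00000xxxxxyyyyyy"""
    # Keep the low 5 bits of the lead byte and the low 6 of the follow-on byte.
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F)


def decode_three(b1, b2, b3):
    """1110xxxx 10yyyyyy 10zzzzzz  ->  xxxxyyyyyyzzzzzz"""
    # Keep the low 4 bits of the lead byte and the low 6 of each follow-on byte.
    return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)
```
]]><?rt_break?>
<?rt_break?>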
The Unicode character for the Greek letter beta </rt_text>

<rt_text rt_font="  18pt Verdana" rt_decoration="none">β</rt_text>

<rt_text rt_font="  10pt Verdana" rt_decoration="none"> is UTF-16 code hex $03B2 binary 00000</rt_text>

<rt_text rt_font="bold  10pt Verdana" rt_decoration="none">011 10110010</rt_text>

<rt_text rt_font="  10pt Verdana" rt_decoration="none"> and this is<?rt_break?>
encoded to a two byte UTF-8 sequence binary 110</rt_text>

<rt_text rt_font="bold  10pt Verdana" rt_decoration="none">01110 </rt_text>

<rt_text rt_font="  10pt Verdana" rt_decoration="none">10</rt_text>

<rt_text rt_font="bold  10pt Verdana" rt_decoration="none">110010 </rt_text>

<rt_text rt_font="  10pt Verdana" rt_decoration="none">or hex $CE $B2<?rt_break?>
<?rt_break?>
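Going the other way, the encoding step can be sketched in Python. The function name encode_utf8 is illustrative,<?rt_break?>
and the sketch assumes a value of sixteen bits or less, so it stops at three byte sequences.<?rt_break?>
<?rt_break?>
<![CDATA[
```python
def encode_utf8(code):
    """Encode a value up to $FFFF as one, two or three UTF-8 bytes."""
    if code < 0x80:
        # ASCII is not changed.
        return bytes([code])
    if code < 0x800:
        # 110xxxxx 10yyyyyy
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    if code < 0x10000:
        # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([
            0xE0 | (code >> 12),
            0x80 | ((code >> 6) & 0x3F),
            0x80 | (code & 0x3F),
        ])
    raise ValueError("this sketch stops at sixteen bit values")
```
]]><?rt_break?>
<?rt_break?>
Calling encode_utf8(0x03B2) returns the two bytes $CE $B2 worked out above.<?rt_break?>
<?rt_break?>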
If you look at this file with IE6 on Windows 98 or Notepad on Windows XP you will see the correct Greek glyph. IE6<?rt_break?>
knows about UTF encoded XML files. XP Notepad knows about UTF!<?rt_break?>
<?rt_break?>
P.W.H September 2003<?rt_break?>
<?rt_break?>
 <?rt_break?>
</rt_text>

<rt_text rt_font="  10pt Arial" rt_decoration="none">© RedTitan Technology 2003<?rt_break?>
</rt_text>

      </rt_paragraph>
      <rt_paragraph rt_top="22" rt_left="21" rt_width="383" rt_height="30" rt_align="left" rt_section="3">

<rt_text rt_font="bold  12pt Verdana" rt_decoration="none">UTF for fun and profit.</rt_text>

      </rt_paragraph>
      <rt_paragraph rt_top="17" rt_left="453" rt_width="128" rt_height="135" rt_align="left" rt_section="4">

<rt_text rt_font="  10pt Verdana" rt_decoration="none">BOM Bytes<?rt_break?>
<?rt_break?>
00 00 FE FF<?rt_break?>
FF FE 00 00<?rt_break?>
FE FF<?rt_break?>
FF FE<?rt_break?>
EF BB BF</rt_text>

      </rt_paragraph>
      <rt_paragraph rt_top="17" rt_left="586" rt_width="165" rt_height="135" rt_align="left" rt_section="5">

<rt_text rt_font="  10pt Verdana" rt_decoration="none">Encoding Form<?rt_break?>
<?rt_break?>
UTF-32, big-endian<?rt_break?>
UTF-32, little-endian<?rt_break?>
UTF-16, big-endian<?rt_break?>
UTF-16, little-endian<?rt_break?>
UTF-8</rt_text>
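
<rt_text rt_font="  10pt Verdana" rt_decoration="none"><?rt_break?>
<?rt_break?>
A minimal Python sketch of how a utility might use the marks listed above.<?rt_break?>
The names BOMS and sniff_bom are hypothetical. The four byte marks are tested<?rt_break?>
first because FF FE 00 00 begins with the same two bytes as FF FE.<?rt_break?>
<?rt_break?>
<![CDATA[
```python
# BOM bytes and encoding forms as listed in this document, longest marks first.
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32, big-endian"),
    (b"\xff\xfe\x00\x00", "UTF-32, little-endian"),
    (b"\xfe\xff", "UTF-16, big-endian"),
    (b"\xff\xfe", "UTF-16, little-endian"),
    (b"\xef\xbb\xbf", "UTF-8"),
]


def sniff_bom(data):
    """Return the encoding form named by a leading BOM, or None if there is none."""
    for mark, name in BOMS:
        if data.startswith(mark):
            return name
    return None
```
]]></rt_text>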

      </rt_paragraph>
      <rt_paragraph rt_top="53" rt_left="22" rt_width="359" rt_height="105" rt_align="left" rt_section="6">

<rt_text rt_font="  10pt Verdana" rt_decoration="none">There are a large number of text editors that are able<?rt_break?>
to show you regular ASCII characters. Newer<?rt_break?>
products are able to handle Unicode characters.<?rt_break?>
Special sequences at the beginning of a text file<?rt_break?>
are used by Unicode aware utilities to help<?rt_break?>
support cross platform editing.     </rt_text>

      </rt_paragraph>
      <rt_paragraph rt_top="1009" rt_left="632" rt_width="177" rt_height="24" rt_align="left" rt_section="7">

<rt_text rt_font="  10pt Arial" rt_decoration="none">utf.xml rev 1 September 2003</rt_text>

      </rt_paragraph>
    </rt_page>
  </rt_workset>
</rt_package>

