draft-ietf-idn-icu-00.txt

     
IETF IDN Working Group            Seungik Lee, Hyewon Shin, Dongman Lee
Internet Draft                                                      ICU
draft-ietf-idn-icu-00.txt                      Eunyong Park, Sungil Kim
Expires: 14 January 2001                                KKU, Netpia.com
                                                           14 July 2000

         Architecture of Internationalized Domain Name System

Status of this Memo

   This document is an Internet-Draft and is in full conformance with 
   all provisions of Section 10 of RFC2026. 
   
   Internet-Drafts are working documents of the Internet Engineering 
   Task Force (IETF), its areas, and its working groups. Note that other 
   groups may also distribute working documents as Internet-Drafts.
   
   Internet-Drafts are draft documents valid for a maximum of six months 
   and may be updated, replaced, or obsoleted by other documents at any 
   time. It is inappropriate to use Internet-Drafts as reference 
   material or to cite them other than as "work in progress." 
   
   The list of current Internet-Drafts can be accessed at 
   http://www.ietf.org/ietf/1id-abstracts.txt 
   
   The list of Internet-Draft Shadow Directories can be accessed at 
   http://www.ietf.org/shadow.html.



1. Abstract

   For restrict use of Domain Name System (DNS) for domain names with 
   alphanumeric characters only, there needs a way to find an Internet 
   host using multi-lingual domain names: Internationalized Domain Name 
   System (IDNS).
   
   This document describes how multi-lingual domain names are handled in 
   a new protocol scheme for IDNS servers and resolvers in architectural 
   view and it updates the [RFC1035] but still preserves the backward 
   compatibility with the current DNS protocol.



2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 
   "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this 
   document are to be interpreted as described in [RFC2119].
   
   "IDNS" (Internationalized Domain Name System) is used here to 
   indicate a new system designed for a domain name service, which 
   supports multi-lingual domain names.
   
   "The current/conventional DNS" or "DNS" (Domain Name System) is used 
   here to indicate the domain name systems currently in use. It 
   fulfills the [RFC1034, RFC1035], but implementations and functional 
   operations may be different from each other.
   
   The "alphanumeric" character data used here is the character set that 
   is allowed for a domain name in DNS query format, [a-zA-Z0-9-]. 



3. Introduction

   Domain Name System (DNS) has eliminated the difficulty of remembering 
   the IP addresses. As the Internet becomes spread over all the people, 
   the likelihood that the people who are not familiar with alphanumeric 
   characters use the Internet increases. The domain names in 
   alphanumeric characters are difficult to remember or use for the 
   people who is not educated English. Therefore, it needs a way to find 
   an Internet host using multi-lingual domain name: Internationalized 
   Domain Name System.


3.1 The current issues of IDNS

   IDNS maps a name to an IP address as the typical DNS does, but it 
   allows domain names to contain multi-lingual characters. The multi-
   lingual characters need to be encoded/decoded into one standardized 
   format, and it needs changes in the conventional DNS protocol 
   described in [RFC1034] and [RFC1035]. But it is required to minimize 
   the changes in the present DNS protocol so that it guarantees the 
   backward compatibility.
   
   The IDNS issues have been discussed in IETF IDN Working Group. These 
   issues are well described in [IDN-REQ]. The main issues are:
   
   - Compatibility and interoperability. The DNS protocol is in use 
   widely in the Internet. Although a new protocol is introduced for DNS, 
   the current protocol may be used with no changes. Therefore, a new 
   design for DNS protocol, IDNS must provide backward compatibility and 
   interoperability with the current DNS.
   
   - Internationalization. IDNS is on the purpose of using multi-lingual 
   domain names. The international character data must be represented by 
   one standardized format in domain names.
   
   - Canonicalization. DNS indexes and matches domain names to look up a 
   domain name from zone data. In the conventional DNS, canonicalization 
   is subjected to US-ASCII only. However, every multi-lingual character 
   data must be canonicalized in its own rules for a DNS standardized 
   matching policy, e.g. case-insensitive matching rule.
   
   - Operational issues. IDNS uses international character data for 
   domain names. Normalization and canonicalization of domain names are 
   needed in addition to the current DNS operations. IDNS also needs an 
   operation for interoperability with the current DNS. Therefore, it is 
   needed to specify the operational guidelines for IDNS.


3.2 Overview of the proposed scheme

   Our proposed scheme for IDNS is also subjected on the issues 
   described earlier to fulfill the requirements of IDN [IDN-REQ].
   
   The proposed scheme can be summarized as following:
   
   - The IN bit, which is reserved and currently unused in the DNS 
   query/response format header, is used to distinguish between the 
   queries generated by IDNS servers or resolvers and those of non-IDNS 
   ones [Oscarsson]. This mechanism is also needed to indicate whether 
   the query is generated by the appropriate IDNS operations for 
   canonicalization and normalization or not.
   
   - The multi-lingual domain names are encoded into UTF-8 as a wire 
   format. UTF-8 is recommended as a default character encoding scheme 
   (CES) in the creation of new protocols which transmit text in 
   [RFC2130]. This scheme allows the IDNS server to handle the DNS query 
   from non-IDNS servers or resolvers because the ASCII code has no 
   changes in UTF-8.
   
   - The UTF-8 domain names must be case-folded before transmission. It 
   minimizes the overhead on server's operations of matching names in 
   case-insensitive. It also guarantees that the result of caching 
   queries can be used without any further normalization and 
   canonicalization. If IDNS server gets non-IDNS query that is not 
   case-folded, it case-folds the query before transmitting to another 
   servers.



4. Design considerations

   Our proposed scheme is designed to fulfill the requirements of IETF 
   IDN WG [IDN-REQ]. All the methods for IDNS schemes must be approved 
   by the requirements documents. The design described in this document 
   is based on these requirements.


4.1 Protocol Extensions

   To indicate an IDNS query format, we use an unallocated bit in the 
   current DNS query format header, named 'IN' bit [Oscarsson]. All IDNS 
   queries are set IN bit to 1. Without this bit set to 1, we cannot 
   guarantee that the query is in the appropriate format for IDNS.

   'IN' bit is to indicate whether the query is from IDNS 
   resolvers/servers or not. It also reduces overhead on canonicalizing 
   operation at IDNS server. It will be described further in <4.4. 
   Canonicalization>.
   
   We devise new operations and new structures of resolvers and name 
   servers to add the multi-lingual domain name handling features into 
   the DNS. This causes changes of all DNS servers and resolvers to use 
   multi-lingual domain names. The new architectures for resolvers and 
   servers will be described in <5. Architectures>


4.2 Compatibility and interoperability

   The 'IN' bit is valid bit location of query for the conventional DNS 
   protocol to be set to zero [RFC1035]. And operations and structures 
   of IDNS preserve the conventional rules of DNS to guarantee the 
   interoperability with the conventional DNS servers or resolvers so 
   that the changes are optional. These make this scheme for IDNS 
   compatible with the current protocol.
   
   Although the current DNS protocol uses 7-bit ASCII characters only, 
   the query format of the current DNS protocol set is 8 bit-clean. 
   Therefore, we can guarantee the backward compatibility and 
   interoperability with the current DNS using UTF-8 code because the 
   ASCII code is preserved with no changes in UTF-8.
   
   Note: There are also in use implementations that are compatible with 
   the current DNS but extend their operations to use UTF-8 domain names. 
   The IDNS described here interoperates well with these implementations. 
   The interoperability with these implementations will be described in 
   <5.4 Interoperability with the current DNS>.


4.3 Internationalization

   All international character data must be represented in one 
   standardized format and the standardized format must be compatible 
   with the current ASCII-based protocols. Therefore, the coded 
   character set (CCS) for IDNS protocol must be Unicode [Unicode], and 
   be encoded using the UTF-8 [RFC2279] character encoding scheme (CES).
   
   The client-side interface may allow the domain names encoded in any 
   local character sets, Unicode, ASCII and so on. But they must be 
   encoded into Unicode before being used in IDNS resolver. The IDNS 
   resolver accepts Unicode character data only, and converts it to UTF-
   8 finally for transmission.


4.4 Canonicalization

   In the current DNS protocol, the domain names are matched in case-
   insensitive. Therefore, the domain names in a query and zone file 
   must be case-folded before equivalence test.
   
   The case-folding issue has been discussed for a long time in IETF IDN 
   WG. The main problem is for case folding in locale-dependent. Some 
   different local characters are overlapped within case-folded format. 
   For example, Latin capital letter I (U+0049) case-folded to lower 
   case in the Turkish context will become Latin small letter dotless i 
   (U+0131). But in the English context, it will become Latin small 
   letter i (U+0069)
   
   Therefore, we case-fold the domain names in locale-independent in our 
   new IDNS design with a method defined in [UTR21].
   
   Multi-lingual domain names should be case-folded in IDNS resolvers or 
   IDNS servers before transmitting to other IDNS/DNS servers. That is, 
   IDNS resolver should case-fold the domain name and converts it to 
   UTF-8 before transmission. In case of IDNS server, if it gets a query 
   with IN bit set to 1, then it needs not to make the multi-lingual 
   domain name canonicalized anymore. If the IDNS server gets a query 
   with IN bit set to 0, then it cannot determine the query is 
   appropriate canonicalized format for IDNS server, so that it case-
   folds that multi-lingual domain name in the query, and set 'IN' bit 
   to 1.
   
   The current DNS queries contain the original case of domain names to 
   preserve the original cases. To be consistent with this rule, all 
   case-folded multi-lingual domain names should be stored by IDNS 
   resolvers or servers before case-folding, and should be restored 
   before sending response.
   
   In the case of case-folding UTF-8 code, using the case-folding method 
   in [UTR21], the UTF-8 should be converted to Unicode and it must be 
   mapped to the mapping table finally. Of course that if we could make 
   a case-folding mapping table of UTF-8 character data, this overhead 
   could be reduced.
   
   However it cannot avoid an overhead in IDNS servers for 
   canonicalization, because the canonicalization of international 
   character data is complicated.
   
   To minimize this overhead, we use 'IN' bit to indicate that the 
   canonicalization for the query has been already handled. That means 
   it needs not canonicalization operation anymore. The detailed 
   operations according to the 'IN' bit are described later in <5. 
   Architectures>.
   
   With international character data, the canonicalization (e.g. case-
   folding) is much more complicated than the one with US-ASCII, and is 
   different from each other's by their locale contexts.
   
   But this document doesn't specify any method or recommendation more 
   than case-folding. For canonicalization of international character 
   data, [UTR15] is a good start. It must be discussed further and 
   specified in the IDNS protocol specification. 


4.5 Operational issues

   In the current DNS scheme, it uses only ASCII code for a wire format. 
   But our new IDNS scheme uses UTF-8 code for a wire format. All the 
   IDNS resolvers must transmit queries encoded in UTF-8 and case-folded. 
   This format can be guaranteed by checking the IN bit: if IN bit is 
   set to 1, the query is encoded in UTF-8 and case-folded. Otherwise 
   the IDNS server cannot assure that the query is encoded in UTF-8 and 
   case-folded. Therefore it needs additional operations for encoding to 
   UTF-8 and case-folding, etc in this case.
   
   The current DNS resolvers transmit the queries in ASCII code. But 
   it's not considerable in IDNS servers because the ASCII code is 
   preserved with no changes in UTF-8.
   
   Some applications and resolvers transmit the queries in UTF-8 
   although they don't fit on the new IDNS resolvers' structures, e.g. 
   Microsoft's DNS servers. We cannot guarantee that those queries are 
   case-folded correctly. Therefore, the IDNS servers should convert 
   them to appropriate IDNS queries instead of the IDNS resolver in that 
   case.
   
   All detailed operations of IDNS servers and resolvers are described 
   in <5. Architectures>.



5. Architectures


5.1 New header format

   A new IDNS servers and resolvers must interoperate with the ones of 
   current DNS. Therefore, we need a way to determine whether the query 
   is for IDN or not. For this reason, we use a new header format as 
   proposed in [Oscarsson].

                                         1  1  1  1  1  1
           0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
          +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
          |                      ID                       |
          +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
          |QR|   Opcode  |AA|TC|RD|RA|IN|AD|CD|   RCODE   |
          +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
          |                    QDCOUNT                    |
          +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
          |                    ANCOUNT                    |
          +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
          |                    NSCOUNT                    |
          +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
          |                    ARCOUNT                    |
          +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+


   The IDNS resolvers and servers identify themselves in a query or a 
   response by setting the 'IN' bit to 1 in the DNS query/response 
   format header. This bit is defined to be zero by default in the 
   current DNS servers and resolvers.


5.2 Structures of IDNS resolvers

   To use multi-lingual domain names with IDNS servers, all the IDNS/DNS 
   resolvers must generate the query in a format of UTF-8 or ASCII. The 
   design of a resolver could be different with each other according to 
   the local operating systems or applications. We propose new design 
   guidelines of a resolver for a new standardization. 
   
   The IDNS resolver accepts Unicode from user interface for domain 
   names. The other character sets should be rejected. It encodes all 
   such character data into UTF-8 for transmission to name servers.
   
   The procedures of the operation of an IDNS resolver are below:
   
   <1>. If the resolver gets a domain name in Unicode or ASCII then it 
   stores the original domain name query. Otherwise the request for 
   lookup is rejected. In the current DNS protocol, the original case of 
   the domain name should be preserved. Therefore, the resolver must 
   store the original cases of the domain names before canonicalization 
   (e.g. case-folding).
   
   <2>. Make the domain name case-folded with locale-independent case-
   mapping table defined in [UTR21].
   
   <3>. Convert it to UTF-8.
   
   <4>. Set IN bit to 1. It indicates the query is from IDNS resolver 
   and the format is UTF-8, case-folded.
    
   <5>. Send request query to name servers.
   
   <6>. Restore the original domain name query into the response query 
   format.
   
   <7>. Send response to the application.


5.3 Structures of IDNS servers

   The operation of IDNS server is similar to the current one of DNS 
   server, but the IDNS server accepts UTF-8 queries and converts them 
   to the appropriate formats additionally.
   
   The IDNS server distinguishes between the IDNS queries and DNS 
   queries by checking IN bit in the query/response format header. 
   According to the 'IN' bit, it operates differently.
   
   The procedures of the operation of an IDNS server are below:
   
   <1>. If the IN bit in the query/response format header is set to 1 
   then it matches the domain name within zone file data or forwards 
   request query to resolve. It operates as same as the operations of 
   the current DNS servers but retrieves UTF-8 code. In this case, it 
   needs not to make domain name canonicalized because the domain name 
   is already canonicalized in the previous procedures of IDNS resolvers 
   or IDNS servers. Go to step <7>.
   
   <2>. Set IN bit to 1.
   
   <3>. Store the original domain name query.
   
   <4>. Make the domain name case-folded with locale-independent case-
   mapping table defined in [UTR21].
   
   <5>. Match the domain name within zone file data or send request 
   query to lookup.
   
   <6>. Restore the original domain name query into the response query 
   format.
   
   <7>. Send response for the query to the resolver or the other server 
   requested.


5.4 Interoperability with the current DNS

   The DNS servers and resolvers accept domain names in ASCII only. But 
   IDNS servers and resolvers accept domain names in UTF-8. Therefore, 
   the queries from DNS ones to IDNS ones can be well handled because 
   the UTF-8 is a superset of ASCII code. But the queries from IDNS ones 
   to DNS ones will be rejected because the UTF-8 code is beyond the 
   range of ASCII code.
   
   Note: There are some implementations which can handle UTF-8 domain 
   names although they don't fit on this specification of IDNS and fully 
   implemented with DNS protocol specification, e.g. Microsoft's DNS 
   server and resolvers. In this case, we cannot guarantee that the 
   queries from these 3rd-party implementations are encoded into UTF-8 
   and well canonicalized. But this queries are set 'IN' bit to 0, so 
   that the IDNS evaluates whether the domain name is the range of UTF-8 
   or not, and converts it into UTF-8 and makes it canonicalized finally.



6. Security Considerations

   This architecture of IDNS uses 8bit-clean queries for transmission 
   and the UTF-8 code is handled instead of ASCII. The DNS protocol has 
   already allocated 8bit query format for domain names Therefore, the 
   IDNS protocol inherits the security issues for the current DNS.
   
   Canonicalization of IDNS is defined in [UTR15] and case folding in 
   [UTR21]. All security issues related with canonicalization or 
   normalization inherits ones described in [UTR15, UTR21].
   
   As always with data, if software does not check for data that can be 
   a problem, security may be affected. As more characters than ASCII is 
   allowed, software only expecting ASCII and with no checks may now get 
   security problems.



7. References

   [IDN-REQ]    James Seng, "Requirements of Internationalized Domain 
                Names," Internet Draft, June 2000
   
   [KWAN]       Stuart Kwan, "Using the UTF-8 Character Set in the 
                Domain Name System," Internet Draft, February 2000
   
   [Oscarsson]  Dan Oscarsson, "Internationalisation of the Domain Name 
                Service," Internet Draft, February 2000
   
   [RFC1034]	Mockapetris, P., "Domain Names - Concepts and 
                Facilities," STD 13, RFC 1034, USC/ISI, November 1987
   
   [RFC1035]    Mockapetris, P., "Domain Names - Implementation and 
                Specification," STD 13, RFC 1035, USC/ISI, November 
                1987
   
   [RFC2119]    S. Bradner, "Key words for use in RFCs to Indicate 
                Requirement Levels," RFC 2119, March 1997
   
   [RFC2130]    C. Weider et. Al., "The Report of the IAB Character Set 
                Workshop held 29 February - 1 March 1996," RFC 2130, 
                Apr 1997.
   
   [RFC2279]    F. Yergeau, "UTF-8, a transformation format of ISO 
                10646," RFC 2279, January 1998
   
   [RFC2535]    D. Eastlake, "Domain Name System Security Extensions," 
                RFC 2535, March 1999
   
   [UNICODE]    The Unicode Consortium, "The Unicode Standard - Version 
                3.0," http://www.unicode.org/unicode/
   
   [UTR15]      M. Davis and M. Duerst, "Unicode Normalization Forms", 
                Unicode Technical Report #15, Nov 1999, 
                http://www.unicode.org/unicode/reports/tr15/
   
   [UTR21]      Mark Davis, "Case Mappings," Unicode Technical Report 
                #21, May 2000, 
                http://www.unicode.org/unicode/reports/tr21


8. Acknowledgments

   Kyoungseok Kim <gimgs@asadal.cs.pusan.ac.kr>
   Chinhyun Bae <piano@netpia.com>



9. Author's Addresses

   Seungik Lee
   Email: silee@icu.ac.kr
   
   Hyewon Shin
   Email: hwshin@icu.ac.kr
   
   Dongman Lee
   Email: dlee@icu.ac.kr
   
   Information & Communications University
   58-4 Whaam-dong Yuseong-gu Taejon, 305-348 Korea
   
   
   Eunyong Park
   Email: eunyong@eunyong.pe.kr
   Konkuk University
   93-1 Mojindong, Kwangjin-ku Seoul, 143-701 Korea
   
   
   Sungil Kim
   Email: clicky@netpia.com
   Netpia.com
   35-1 8-ga Youngdeungpo-dong Youngdeungpo-gu Seoul, 150-038 Korea