Encode a String Using Alternate Character Encoding

Unicode is not the only character-encoding scheme, nor is UTF-16 the only way to represent Unicode characters. When your application needs to exchange character data with external systems (particularly legacy systems), you must convert the data between UTF-16 and the encoding scheme supported by the other system.

The abstract class Encoding and its concrete subclasses provide the functionality to convert characters to and from a variety of encoding schemes. Each subclass instance supports the conversion of characters between UTF-16 and one other encoding scheme. You obtain instances of the encoding-specific classes using the static factory method Encoding.GetEncoding, which accepts either the name or the code page number of the required encoding scheme.
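As a minimal sketch of the two lookup styles, the following code asks GetEncoding for the same scheme once by name and once by code page number and compares the results. The class name GetEncodingSketch and the helper SameScheme are hypothetical names used only for illustration; the names "utf-8" and "us-ascii" are assumed to be the registered names for code pages 65001 and 20127.

```csharp
using System;
using System.Text;

public static class GetEncodingSketch {

    // Hypothetical helper: returns true if looking up an encoding by
    // name and by code page number yields the same code page.
    public static bool SameScheme(string name, int codePage) {
        Encoding byName = Encoding.GetEncoding(name);
        Encoding byCodePage = Encoding.GetEncoding(codePage);
        return byName.CodePage == byCodePage.CodePage;
    }

    public static void Main() {
        // Look up UTF-8 and ASCII both ways and compare
        Console.WriteLine("utf-8    : {0}", SameScheme("utf-8", 65001));
        Console.WriteLine("us-ascii : {0}", SameScheme("us-ascii", 20127));
    }
}
```

Passing a name or number that the runtime does not recognize causes GetEncoding to throw an exception, so code that accepts encoding names from external input may want to catch that case.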

Table 2.1 lists some commonly used character encoding schemes and the code page number you must pass to the GetEncoding method to create an instance of the appropriate encoding class. The table also shows static properties of the Encoding class that provide shortcuts for obtaining the most commonly used types of encoding object.

Table 2.1: Character Encoding Classes
Encoding Scheme	Class	Create Using
ASCII	ASCIIEncoding	GetEncoding(20127) or the ASCII property
Default (current Microsoft Windows default)	Encoding	GetEncoding(0) or the Default property
UTF-7	UTF7Encoding	GetEncoding(65000) or the UTF7 property
UTF-8	UTF8Encoding	GetEncoding(65001) or the UTF8 property
UTF-16 (Big Endian)	UnicodeEncoding	GetEncoding(1201) or the BigEndianUnicode property
UTF-16 (Little Endian)	UnicodeEncoding	GetEncoding(1200) or the Unicode property
Windows-1252 (Western European)	Encoding	GetEncoding(1252)
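The relationship between the shortcut properties and the code page numbers in the table can be checked with a short sketch like the one below. It reads the CodePage property of each shortcut object and then looks the same number up again through GetEncoding. The class name EncodingShortcutsSketch and the helper ShortcutCodePages are hypothetical; the UTF7 and Default properties are left out to keep the sketch small.

```csharp
using System;
using System.Text;

public static class EncodingShortcutsSketch {

    // Hypothetical helper: code pages reported by four of the
    // shortcut properties listed in the table.
    public static int[] ShortcutCodePages() {
        return new int[] {
            Encoding.ASCII.CodePage,
            Encoding.UTF8.CodePage,
            Encoding.BigEndianUnicode.CodePage,
            Encoding.Unicode.CodePage
        };
    }

    public static void Main() {
        // Look each code page up again and show its registered name
        foreach (int codePage in ShortcutCodePages()) {
            Encoding encoding = Encoding.GetEncoding(codePage);
            Console.WriteLine("{0,-6} {1}", codePage, encoding.WebName);
        }
    }
}
```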

Once you have an Encoding object of the appropriate type, you convert a UTF-16 encoded Unicode string to a byte array of encoded characters using the GetBytes method and convert a byte array of encoded characters to a string using the GetString method. The following code demonstrates the use of some encoding classes.

using System;
using System.IO;
using System.Text;

public class CharacterEncodingExample {

    public static void Main() {
        
        // Create a file to hold the output
        using (StreamWriter output = new StreamWriter("output.txt")) {

            // Create and write a string containing the Greek capital
            // letter Pi (U+03A0)
            string srcString = "Area = \u03A0r^2";
            output.WriteLine("Source Text : " + srcString);

            // Write the UTF-16 encoded bytes of the source string
            byte[] utf16String = Encoding.Unicode.GetBytes(srcString);
            output.WriteLine("UTF-16 Bytes: {0}", 
                BitConverter.ToString(utf16String));

            // Convert the UTF-16 encoded source string to UTF-8 and ASCII
            byte[] utf8String = Encoding.UTF8.GetBytes(srcString);
            byte[] asciiString = Encoding.ASCII.GetBytes(srcString);
            
            // Write the UTF-8 and ASCII encoded byte arrays        
            output.WriteLine("UTF-8  Bytes: {0}", 
                BitConverter.ToString(utf8String));
            output.WriteLine("ASCII  Bytes: {0}", 
                BitConverter.ToString(asciiString));

            // Convert UTF-8 and ASCII encoded bytes back to UTF-16 
            // encoded string and write
            output.WriteLine("UTF-8  Text : {0}", 
                Encoding.UTF8.GetString(utf8String));
            output.WriteLine("ASCII  Text : {0}", 
                Encoding.ASCII.GetString(asciiString));

            // The using statement flushes and closes the output file
            // when this block ends
        }
    }
}

Running CharacterEncodingExample will generate a file named output.txt. If you open this file in a text editor that supports Unicode, you will see the following content:

Source Text : Area = Πr^2
UTF-16 Bytes: 41-00-72-00-65-00-61-00-20-00-3D-00-20-00-A0-03-72-00-5E-00-32-00
UTF-8  Bytes: 41-72-65-61-20-3D-20-CE-A0-72-5E-32
ASCII  Bytes: 41-72-65-61-20-3D-20-3F-72-5E-32
UTF-8  Text : Area = Πr^2
ASCII  Text : Area = ?r^2

Notice that in UTF-16, each character in this string occupies 2 bytes, but because most of the characters fall within the ASCII range, the high-order byte is 0. (The use of little-endian byte ordering means that the low-order byte appears first.) This means that most of the characters are encoded using the same numeric values across all three encoding schemes. However, the encoding of the symbol Pi (U+03A0) is different in each scheme: UTF-16 stores it as A0-03, UTF-8 uses the 2 bytes CE-A0, and ASCII has no equivalent and so replaces Pi with the code 3F. As you can see in the text version of the string, 3F is the code for a question mark (?).
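To isolate the one character that differs, the following sketch encodes Pi on its own with each of the three schemes and prints the bytes. The class name PiBytesSketch and the helper Hex are hypothetical names introduced for this illustration.

```csharp
using System;
using System.Text;

public static class PiBytesSketch {

    // Hypothetical helper: encode the text and format the resulting
    // bytes as hyphen-separated hexadecimal
    public static string Hex(Encoding encoding, string text) {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main() {
        string pi = "\u03A0";

        Console.WriteLine("UTF-16: {0}", Hex(Encoding.Unicode, pi)); // A0-03
        Console.WriteLine("UTF-8 : {0}", Hex(Encoding.UTF8, pi));    // CE-A0
        Console.WriteLine("ASCII : {0}", Hex(Encoding.ASCII, pi));   // 3F
    }
}
```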

Warning:

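If you convert Unicode characters to ASCII or a specific code page encoding scheme, you risk losing data. Any Unicode character that can't be represented in the target scheme is replaced with a substitute character (a question mark by default), and the original character cannot be recovered from the encoded bytes.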
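One way to detect this kind of loss, rather than silently accept the substitute character, is to request an encoding object that throws an exception when a character cannot be encoded. The sketch below is one possible approach and not part of the original recipe: it uses the GetEncoding overload that accepts fallback objects, and the class name LossDetectionSketch and the helper CanEncodeAsAscii are hypothetical.

```csharp
using System;
using System.Text;

public static class LossDetectionSketch {

    // Hypothetical helper: returns false if the text contains a
    // character that ASCII cannot represent
    public static bool CanEncodeAsAscii(string text) {

        // Ask for an ASCII encoding that throws instead of
        // substituting a question mark
        Encoding strictAscii = Encoding.GetEncoding(
            "us-ascii",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);

        try {
            strictAscii.GetBytes(text);
            return true;
        } catch (EncoderFallbackException) {
            return false;
        }
    }

    public static void Main() {
        Console.WriteLine(CanEncodeAsAscii("Area = r^2"));
        Console.WriteLine(CanEncodeAsAscii("Area = \u03A0r^2"));
    }
}
```

The first string contains only ASCII characters, so the check succeeds; the second contains Pi, so the strict encoding throws and the helper reports that the conversion would lose data.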

The Encoding class also provides the static method Convert to simplify the conversion of a byte array from one encoding scheme to another without the need to manually perform an interim conversion to UTF-16. For example, the following statement converts the ASCII encoded bytes contained in the asciiString byte array directly from ASCII encoding to UTF-8 encoding:

byte[] utf8String = Encoding.Convert(Encoding.ASCII, Encoding.UTF8,
    asciiString);
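As a self-contained sketch of the same method, the following code converts UTF-8 bytes directly to little-endian UTF-16 bytes and prints both. Because both schemes can represent Pi, nothing is lost in this direction. The class name ConvertSketch and the helper Utf8ToUtf16 are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

public static class ConvertSketch {

    // Hypothetical helper: convert UTF-8 encoded bytes directly to
    // little-endian UTF-16 encoded bytes
    public static byte[] Utf8ToUtf16(byte[] utf8Bytes) {
        return Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
    }

    public static void Main() {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("Area = \u03A0r^2");
        byte[] utf16Bytes = Utf8ToUtf16(utf8Bytes);

        Console.WriteLine("UTF-8  Bytes: {0}",
            BitConverter.ToString(utf8Bytes));
        Console.WriteLine("UTF-16 Bytes: {0}",
            BitConverter.ToString(utf16Bytes));
    }
}
```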
