The implementation of Java character encoding and decoding

  • 2020-04-01 01:56:38
  • OfStack

  Character set basics:

Character set
                A collection of characters, that is, symbols with specific semantics. The letter "A" is a character. "%" is also a character. Characters have no intrinsic numeric value and no direct connection to ASCII, Unicode, or even computers. Symbols existed long before computers were invented.
Coded character set
                A collection of characters, each assigned a numeric value. Assigning codes to characters lets them be represented numerically within a particular coded character set. Other coded character sets can assign different values to the same character. Coded character sets are usually defined by standards bodies; examples include US-ASCII, ISO 8859-1, Unicode (ISO 10646-1), and JIS X0201.
Character-encoding scheme
                A mapping of coded character set members to octets (8-bit bytes). The encoding scheme defines how to express a sequence of character codes as a sequence of bytes. The numeric value of a character code does not need to be the same as the encoded bytes, nor does the relationship need to be one-to-one or one-to-many. In principle, character encoding and decoding can be thought of as roughly analogous to object serialization and deserialization.


Character data is usually encoded for network transmission or file storage. An encoding scheme is not a character set; it is a mapping. But because the two are so closely related, most encoding schemes are associated with a single character set. UTF-8, for example, is used only to encode the Unicode character set. Nevertheless, one encoding scheme can handle multiple character sets. For example, EUC can encode characters for several Asian languages.
Figure 6-1 is a graphical representation of encoding a sequence of Unicode characters into a sequence of bytes using the UTF-8 encoding scheme. UTF-8 encodes a character code value less than 0x80 as a single byte (standard ASCII). All other Unicode characters are encoded as multi-byte sequences of 2 to 6 bytes (http://www.ietf.org/rfc/rfc2279.txt); the later RFC 3629 restricts UTF-8 to at most 4 bytes per character.
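
As a rough illustration of how the UTF-8 byte count grows with the character code value, the sketch below encodes a few sample code points. The class name Utf8Lengths and the helper utf8Length are made up for this example, and the samples are arbitrary.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    // Number of bytes UTF-8 uses for a single Unicode code point
    // (hypothetical helper, not part of the JDK).
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+0041 'A', U+00F1 (n with tilde), U+20AC (euro sign), U+1F600 (an emoji)
        int[] samples = { 0x41, 0xF1, 0x20AC, 0x1F600 };
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
    }
}
```

Values below 0x80 take one byte, and the other samples take two, three, and four bytes respectively.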

Charset (character set)
            The term charset is defined in RFC 2278 (http://ietf.org/rfc/rfc2278.txt). It is the combination of a coded character set and a character-encoding scheme. The central class of the java.nio.charset package is Charset, which encapsulates the charset abstraction.
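  Unicode began as a 16-bit character encoding (it has since grown beyond 16 bits, and Java represents the additional characters as pairs of 16-bit char values). It tries to unify the character sets of all the world's languages into a single, comprehensive mapping. It has earned its place, but many other character encodings are still in widespread use.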
Most operating systems are still byte-oriented in terms of I/O and file storage, so whatever the encoding, Unicode or otherwise, translation between byte sequences and characters is required.
The classes in the java.nio.charset package meet this need. This is not the first time the Java platform has dealt with character set encoding, but it is the most systematic, comprehensive, and flexible solution. The java.nio.charset.spi package provides a service provider interface (SPI) that allows encoders and decoders to be plugged in as needed.


Default character set: the default charset is determined at JVM startup and depends on the underlying operating system environment, locale, and/or JVM configuration. If you need a specific character set, it is safest to name it explicitly. Don't assume that the default in deployment is the same as in your development environment. Character set names are case-insensitive, that is, uppercase and lowercase letters are considered the same when comparing character set names. The Internet Assigned Numbers Authority (IANA) maintains all formally registered character set names.
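
A small sketch of naming the charset explicitly rather than relying on the platform default. The class CharsetNames and the helper encodeExplicitly are hypothetical names chosen for this illustration.

```java
import java.nio.charset.Charset;

public class CharsetNames {
    // Encodes text with an explicitly named charset instead of the platform default
    // (hypothetical helper for this sketch).
    static byte[] encodeExplicitly(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The default varies between machines, so it is only printed here
        System.out.println("Platform default: " + Charset.defaultCharset());
        // Name lookup ignores case: both calls return the same Charset
        System.out.println(Charset.forName("utf-8").equals(Charset.forName("UTF-8")));
        // The same text yields different byte counts under different charsets
        System.out.println(encodeExplicitly("Ma\u00f1ana", "ISO-8859-1").length); // 6
        System.out.println(encodeExplicitly("Ma\u00f1ana", "utf-8").length);      // 7
    }
}
```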


Example 6-1 demonstrates how characters can be translated into byte sequences through different Charset implementations.
 
Example 6-1. Using standard character set encoding


    package com.ronsoft.books.nio.charset;  

    import java.nio.charset.Charset;  
    import java.nio.ByteBuffer;  

      
    public class EncodeTest {  
        public static void main(String[] argv) throws Exception {  
            // This is the character sequence to encode  
            String input = " \u00bfMa\u00f1ana?";  
            // the list of charsets to encode with  
            String[] charsetNames = { "US-ASCII", "ISO-8859-1", "UTF-8",  
                    "UTF-16BE", "UTF-16LE", "UTF-16" // , "X-ROT13"  
            };  
            for (int i = 0; i < charsetNames.length; i++) {  
                doEncode(Charset.forName(charsetNames[i]), input);  
            }  
        }  

          
        private static void doEncode(Charset cs, String input) {  
            ByteBuffer bb = cs.encode(input);  
            System.out.println("Charset: " + cs.name());  
            System.out.println("  Input: " + input);  
            System.out.println("Encoded: ");  
            for (int i = 0; bb.hasRemaining(); i++) {  
                int b = bb.get();  
                int ival = ((int) b) & 0xff;  
                char c = (char) ival;  
                // Keep tabular alignment pretty  
                if (i < 10)  
                    System.out.print(" ");  
                // Print index number  
                System.out.print("  " + i + ": ");  
                // Better formatted output is coming someday...  
                if (ival < 16)  
                    System.out.print("0");  
                // Print the hex value of the byte  
                System.out.print(Integer.toHexString(ival));  
                // If the byte seems to be the value of a  
                // printable character, print it. No guarantee  
                // it will be.  
                if (Character.isWhitespace(c) || Character.isISOControl(c)) {  
                    System.out.println("");  
                } else {  
                    System.out.println(" (" + c + ")");  
                }  
            }  
            System.out.println("");  
        }  
    }  

Results:

 Charset: US-ASCII  
  Input:  ¿Mañana?  
Encoded:   
   0: 20  
   1: 3f (?)  
   2: 4d (M)  
   3: 61 (a)  
   4: 3f (?)  
   5: 61 (a)  
   6: 6e (n)  
   7: 61 (a)  
   8: 3f (?)  

Charset: ISO-8859-1  
  Input:  ¿Mañana?  
Encoded:   
   0: 20  
   1: bf (¿)  
   2: 4d (M)  
   3: 61 (a)  
   4: f1 (ñ)  
   5: 61 (a)  
   6: 6e (n)  
   7: 61 (a)  
   8: 3f (?)  

Charset: UTF-8  
  Input:  ¿Mañana?  
Encoded:   
   0: 20  
   1: c2 (Â)  
   2: bf (¿)  
   3: 4d (M)  
   4: 61 (a)  
   5: c3 (Ã)  
   6: b1 (±)  
   7: 61 (a)  
   8: 6e (n)  
   9: 61 (a)  
  10: 3f (?)  

Charset: UTF-16BE  
  Input:  ¿Mañana?  
Encoded:   
   0: 00  
   1: 20  
   2: 00  
   3: bf (¿)  
   4: 00  
   5: 4d (M)  
   6: 00  
   7: 61 (a)  
   8: 00  
   9: f1 (ñ)  
  10: 00  
  11: 61 (a)  
  12: 00  
  13: 6e (n)  
  14: 00  
  15: 61 (a)  
  16: 00  
  17: 3f (?)  

Charset: UTF-16LE  
  Input:  ¿Mañana?  
Encoded:   
   0: 20  
   1: 00  
   2: bf (¿)  
   3: 00  
   4: 4d (M)  
   5: 00  
   6: 61 (a)  
   7: 00  
   8: f1 (ñ)  
   9: 00  
  10: 61 (a)  
  11: 00  
  12: 6e (n)  
  13: 00  
  14: 61 (a)  
  15: 00  
  16: 3f (?)  
  17: 00  

Charset: UTF-16  
  Input:  ¿Mañana?  
Encoded:   
   0: fe (þ)  
   1: ff (ÿ)  
   2: 00  
   3: 20  
   4: 00  
   5: bf (¿)  
   6: 00  
   7: 4d (M)  
   8: 00  
   9: 61 (a)  
  10: 00  
  11: f1 (ñ)  
  12: 00  
  13: 61 (a)  
  14: 00  
  15: 6e (n)  
  16: 00  
  17: 61 (a)  
  18: 00  
  19: 3f (?) 

Character set:

    package java.nio.charset;   
    public abstract class Charset implements Comparable   
    {   
            public static boolean isSupported (String charsetName)   
            public static Charset forName (String charsetName)   
            public static SortedMap availableCharsets()    
            public final String name()    
            public final Set aliases()   
            public String displayName()   
            public String displayName (Locale locale)    
            public final boolean isRegistered()    
            public boolean canEncode()    
            public abstract CharsetEncoder newEncoder();    
            public final ByteBuffer encode (CharBuffer cb)    
            public final ByteBuffer encode (String str)    
            public abstract CharsetDecoder newDecoder();    
            public final CharBuffer decode (ByteBuffer bb)    
            public abstract boolean contains (Charset cs);   
            public final boolean equals (Object ob)   
            public final int compareTo (Object ob)    
            public final int hashCode()   
            public final String toString()    
    }  

  A Charset object must then satisfy several conditions:

  • The canonical name of the character set should correspond to the name registered with IANA.
  • If IANA registers multiple names for the same character set, the canonical name returned by the object should match the MIME-preferred name in the IANA registry.
  • If a character set name is removed from the registry, the current canonical name should be retained as an alias.
  • If the character set is not registered with IANA, its canonical name must begin with "X-" or "x-".

In most cases, only JVM vendors need to pay attention to these rules. However, if you intend to supply your own character set as part of your application, it helps to know what not to do. Your character set should report false from isRegistered(), and its name should begin with "X-".
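
To make the naming rules concrete, here is a minimal sketch that inspects the canonical name, aliases, and registration status of a built-in charset. CharsetIdentity and aliasesResolve are hypothetical names, and the exact alias list printed depends on the JDK in use.

```java
import java.nio.charset.Charset;

public class CharsetIdentity {
    // True if every alias of the named charset resolves back to the same Charset
    // (hypothetical helper for this sketch).
    static boolean aliasesResolve(String charsetName) {
        Charset cs = Charset.forName(charsetName);
        for (String alias : cs.aliases()) {
            if (!Charset.forName(alias).equals(cs)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Charset cs = Charset.forName("ISO-8859-1");
        System.out.println("Canonical name: " + cs.name());
        System.out.println("Aliases:        " + cs.aliases());
        System.out.println("Registered:     " + cs.isRegistered());
        System.out.println("Aliases resolve: " + aliasesResolve("ISO-8859-1"));
    }
}
```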


Character set comparison:


    public abstract class Charset implements Comparable   
    {   
            // This is a partial API listing   
            public abstract boolean contains (Charset cs);    
            public final boolean equals (Object ob)   
            public final int compareTo (Object ob)    
            public final int hashCode()   
            public final String toString()    
    }  

Recall that a charset is the combination of a coded character set and an encoding scheme for that character set. As with ordinary sets, one charset may be a subset of another. One charset (c1) contains another (c2) if every character that can be represented in c2 can also be represented in c1. Every charset is considered to contain itself. If this containment relationship holds, then any stream you can encode in c2 (the contained subset) can also be encoded in c1 without any substitution.
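
The containment relationship can be probed with contains(). The sketch below uses the constants from java.nio.charset.StandardCharsets; the class name ContainsDemo and the helper are made up for illustration, and the results in the comments are what the built-in charsets of a typical JDK report.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContainsDemo {
    // Thin wrapper so the direction of the test reads clearly (hypothetical helper).
    static boolean contains(Charset outer, Charset inner) {
        return outer.contains(inner);
    }

    public static void main(String[] args) {
        // UTF-8 can represent every US-ASCII character
        System.out.println(contains(StandardCharsets.UTF_8, StandardCharsets.US_ASCII));    // true
        // US-ASCII cannot represent every UTF-8 character
        System.out.println(contains(StandardCharsets.US_ASCII, StandardCharsets.UTF_8));    // false
        // Every charset contains itself
        System.out.println(contains(StandardCharsets.US_ASCII, StandardCharsets.US_ASCII)); // true
    }
}
```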


Character set encoder: a charset is composed of a coded character set and a related encoding scheme. The CharsetEncoder and CharsetDecoder classes implement the encoding scheme.


 float averageBytesPerChar()   
          Returns the average number of bytes that will be produced for each character of input.   
 boolean canEncode(char c)   
          Tells whether or not this encoder can encode the given character.   
 boolean canEncode(CharSequence cs)   
          Tells whether or not this encoder can encode the given character sequence.   
 Charset charset()   
          Returns the charset that created this encoder.   
 ByteBuffer encode(CharBuffer in)   
          Convenience method that encodes the remaining content of a single input character buffer into a newly-allocated byte buffer.   
 CoderResult encode(CharBuffer in, ByteBuffer out, boolean endOfInput)   
          Encodes as many characters as possible from the given input buffer, writing the results to the given output buffer.   
protected abstract  CoderResult encodeLoop(CharBuffer in, ByteBuffer out)   
          Encodes one or more characters into one or more bytes.   
 CoderResult flush(ByteBuffer out)   
          Flushes this encoder.   
protected  CoderResult implFlush(ByteBuffer out)   
          Flushes this encoder.   
protected  void implOnMalformedInput(CodingErrorAction newAction)   
          Reports a change to this encoder's malformed-input action.   
protected  void implOnUnmappableCharacter(CodingErrorAction newAction)   
          Reports a change to this encoder's unmappable-character action.   
protected  void implReplaceWith(byte[] newReplacement)   
          Reports a change to this encoder's replacement value.   
protected  void implReset()   
          Resets this encoder, clearing any charset-specific internal state.   
 boolean isLegalReplacement(byte[] repl)   
          Tells whether or not the given byte array is a legal replacement value for this encoder.   
 CodingErrorAction malformedInputAction()   
          Returns this encoder's current action for malformed-input errors.   
 float maxBytesPerChar()   
          Returns the maximum number of bytes that will be produced for each character of input.   
 CharsetEncoder onMalformedInput(CodingErrorAction newAction)   
          Changes this encoder's action for malformed-input errors.   
 CharsetEncoder onUnmappableCharacter(CodingErrorAction newAction)   
          Changes this encoder's action for unmappable-character errors.   
 byte[] replacement()   
          Returns this encoder's replacement value.   
 CharsetEncoder replaceWith(byte[] newReplacement)   
          Changes this encoder's replacement value.   
 CharsetEncoder reset()   
          Resets this encoder, clearing any internal state.   
 CodingErrorAction unmappableCharacterAction()   
          Returns this encoder's current action for unmappable-character errors.  

A CharsetEncoder object is a stateful transformation engine: characters go in, bytes come out. Several calls to the encoder may be needed to complete a transformation. The encoder remembers the state of the transformation between calls.

A note about the CharsetEncoder API: the simpler form of encode() is a convenience method. It performs an all-in-one encoding of the CharBuffer you provide into a newly allocated ByteBuffer. This is the method ultimately invoked when you call encode() directly on the Charset class.

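The longer form of encode() returns a CoderResult describing one of these conditions:

Underflow
                The input has been consumed, or more input is needed to continue.

Overflow
                The output buffer has no room for more encoded bytes.

Malformed input
                The input is not a legal sequence (for Unicode input, for example, an unpaired surrogate).

Unmappable character
                The input character is valid but cannot be represented in this character set.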


When encoding, if the encoder encounters malformed or unmappable input, a result object is returned. You can also test individual characters, or sequences of characters, to see whether they can be encoded. Here are the methods for checking:


    package java.nio.charset;   
    public abstract class CharsetEncoder    
    {   
             // This is a partial API listing    
            public boolean canEncode (char c)    
            public boolean canEncode (CharSequence cs)   
    }  
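
A minimal sketch of using canEncode() before committing to a charset. The class name CanEncodeDemo and the helper encodable are illustrative assumptions, not part of the API.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class CanEncodeDemo {
    // Asks a fresh encoder for the named charset whether it can encode the text
    // (hypothetical helper for this sketch).
    static boolean encodable(String charsetName, CharSequence text) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(encodable("US-ASCII", "Manana"));                 // true
        System.out.println(encodable("US-ASCII", "Ma\u00f1ana"));            // false
        System.out.println(encodable("ISO-8859-1", "\u00bfMa\u00f1ana?"));   // true
    }
}
```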

  CodingErrorAction defines three public fields:

REPORT
            The default behavior when a CharsetEncoder is created. This action indicates that coding errors should be reported by returning a CoderResult object, as described earlier.

IGNORE
                Indicates that coding errors should be ignored and any erroneous input dropped, with coding resuming after it.

REPLACE
                Handles coding errors by dropping the erroneous input and outputting the current replacement byte sequence defined for this CharsetEncoder.

 

Remember that character set encodings convert characters into sequences of bytes for later decoding. If the replacement sequence cannot be decoded into a valid character sequence, the encoded byte sequence becomes invalid.
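
The three actions can be compared side by side. The sketch below encodes a string containing a character that US-ASCII cannot represent; ErrorActions and encodeAscii are hypothetical names, and the convenience form of encode() is used, which throws a CharacterCodingException when an error is reported.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ErrorActions {
    // Encodes text as US-ASCII with the given error action and returns the
    // resulting bytes as text, or an "error: ..." string if encoding failed
    // (hypothetical helper for this sketch).
    static String encodeAscii(String text, CodingErrorAction action) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(text));
            return StandardCharsets.US_ASCII.decode(bytes).toString();
        } catch (CharacterCodingException e) {
            return "error: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        String input = "Ma\u00f1ana";
        System.out.println(encodeAscii(input, CodingErrorAction.REPORT));  // error: UnmappableCharacterException
        System.out.println(encodeAscii(input, CodingErrorAction.IGNORE));  // Maana
        System.out.println(encodeAscii(input, CodingErrorAction.REPLACE)); // Ma?ana
    }
}
```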

CoderResult class: CoderResult object is returned by CharsetEncoder and CharsetDecoder objects:


    package java.nio.charset;   
    public class CoderResult {   
            public static final CoderResult OVERFLOW   
            public static final CoderResult UNDERFLOW    
            public boolean isUnderflow()    
            public boolean isOverflow()   
            public boolean isError()   
            public boolean isMalformed()    
            public boolean isUnmappable()   
            public int length()    
            public static CoderResult malformedForLength (int length)     
            public static CoderResult unmappableForLength (int length)    
            public void throwException() throws CharacterCodingException   
    }   
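
As a sketch of how these result objects might be inspected, the example below encodes into output buffers of different sizes and describes the CoderResult that comes back. CoderResultDemo and describe are hypothetical names; the encoder keeps its default REPORT action.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class CoderResultDemo {
    // Encodes text as US-ASCII into an output buffer of the given capacity
    // and describes the returned CoderResult (hypothetical helper).
    static String describe(String text, int outCapacity) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
        CoderResult result = encoder.encode(
                CharBuffer.wrap(text), ByteBuffer.allocate(outCapacity), true);
        if (result.isUnderflow()) {
            return "underflow";
        }
        if (result.isOverflow()) {
            return "overflow";
        }
        // Only error results carry a length: the size of the offending input
        if (result.isMalformed()) {
            return "malformed, length " + result.length();
        }
        return "unmappable, length " + result.length();
    }

    public static void main(String[] args) {
        System.out.println(describe("hello", 16));  // underflow: all input consumed
        System.out.println(describe("hello", 2));   // overflow: output buffer too small
        System.out.println(describe("\u00f1", 16)); // unmappable, length 1
    }
}
```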

Character set decoder: a charset decoder is the inverse of an encoder. It converts bytes encoded by a particular encoding scheme into a sequence of 16-bit Unicode characters. Like CharsetEncoder, CharsetDecoder is a stateful transformation engine. Neither is thread-safe, because calling their methods also changes their state, and that state is preserved between calls.

float averageCharsPerByte()   
          Returns the average number of characters that will be produced for each byte of input.   
 Charset charset()   
          Returns the charset that created this decoder.   
 CharBuffer decode(ByteBuffer in)   
          Convenience method that decodes the remaining content of a single input byte buffer into a newly-allocated character buffer.   
 CoderResult decode(ByteBuffer in, CharBuffer out, boolean endOfInput)   
          Decodes as many bytes as possible from the given input buffer, writing the results to the given output buffer.   
protected abstract  CoderResult decodeLoop(ByteBuffer in, CharBuffer out)   
          Decodes one or more bytes into one or more characters.   
 Charset detectedCharset()   
          Retrieves the charset that was detected by this decoder  (optional operation).   
 CoderResult flush(CharBuffer out)   
          Flushes this decoder.   
protected  CoderResult implFlush(CharBuffer out)   
          Flushes this decoder.   
protected  void implOnMalformedInput(CodingErrorAction newAction)   
          Reports a change to this decoder's malformed-input action.   
protected  void implOnUnmappableCharacter(CodingErrorAction newAction)   
          Reports a change to this decoder's unmappable-character action.   
protected  void implReplaceWith(String newReplacement)   
          Reports a change to this decoder's replacement value.   
protected  void implReset()   
          Resets this decoder, clearing any charset-specific internal state.   
 boolean isAutoDetecting()   
          Tells whether or not this decoder implements an auto-detecting charset.   
 boolean isCharsetDetected()   
          Tells whether or not this decoder has yet detected a charset  (optional operation).   
 CodingErrorAction malformedInputAction()   
          Returns this decoder's current action for malformed-input errors.   
 float maxCharsPerByte()   
          Returns the maximum number of characters that will be produced for each byte of input.   
 CharsetDecoder onMalformedInput(CodingErrorAction newAction)   
          Changes this decoder's action for malformed-input errors.   
 CharsetDecoder onUnmappableCharacter(CodingErrorAction newAction)   
          Changes this decoder's action for unmappable-character errors.   
 String replacement()   
          Returns this decoder's replacement value.   
 CharsetDecoder replaceWith(String newReplacement)   
          Changes this decoder's replacement value.   
 CharsetDecoder reset()   
          Resets this decoder, clearing any internal state.   
 CodingErrorAction unmappableCharacterAction()   
          Returns this decoder's current action for unmappable-character errors.  

The methods that actually perform the decoding are:

    package java.nio.charset;   
    public abstract class CharsetDecoder   
    {   
            // This is a partial API listing   
            public final CharsetDecoder reset()    
            public final CharBuffer decode (ByteBuffer in)      
                   throws CharacterCodingException   
            public final CoderResult decode (ByteBuffer in, CharBuffer out,      
                   boolean endOfInput)   
            public final CoderResult flush (CharBuffer out)   
    }   

The decoding process is similar to encoding and contains the same basic steps:

1.     Reset the decoder, by calling reset(), and place the decoder in a known state ready to receive input.

2.     Call decode() zero or more times with endOfInput set to false, feeding bytes to the decoding engine. As decoding proceeds, characters are added to the given CharBuffer.

3.     Set endOfInput to true and call decode() once to inform the decoder that all inputs have been provided.

4.     Call flush() to ensure that all decoded characters have been sent to the output.
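
The four steps above can be sketched as follows, assuming UTF-8 input that arrives in chunks and buffers large enough that overflow never occurs (Example 6-2 handles the general case). StepwiseDecode and decodeChunks are hypothetical names for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class StepwiseDecode {
    // Decodes UTF-8 input delivered in separate chunks.
    // Assumes the total input and output each fit in 64 units, so the
    // overflow case is deliberately not handled in this sketch.
    static String decodeChunks(byte[][] chunks) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.allocate(64);
        CharBuffer out = CharBuffer.allocate(64);

        decoder.reset();                      // step 1: known initial state
        for (byte[] chunk : chunks) {
            in.put(chunk);
            in.flip();
            decoder.decode(in, out, false);   // step 2: more input may follow
            in.compact();                     // keep bytes of an incomplete sequence
        }
        in.flip();
        decoder.decode(in, out, true);        // step 3: signal end of input
        decoder.flush(out);                   // step 4: emit any pending output
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] all = "Ma\u00f1ana".getBytes(StandardCharsets.UTF_8);
        // Split in the middle of the two-byte sequence for U+00F1
        byte[][] chunks = {
            Arrays.copyOfRange(all, 0, 3),
            Arrays.copyOfRange(all, 3, all.length)
        };
        System.out.println(decodeChunks(chunks).equals("Ma\u00f1ana")); // true
    }
}
```

Calling compact() rather than clear() between chunks matters here: the first chunk ends with half of a two-byte character, and those leftover bytes have to stay in the buffer until the rest arrives.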


Example 6-2 illustrates how to decode a byte stream that represents encoded characters.

Example 6-2.   Character set decoding


    package com.ronsoft.books.nio.charset;  

    import java.nio.*;  
    import java.nio.charset.*;  
    import java.nio.channels.*;  
    import java.io.*;  

      
    public class CharsetDecode {  
        /** 
         * Test charset decoding in the general case, detecting and handling buffer 
         * under/overflow and flushing the decoder state at end of input. This code 
         * reads from stdin and decodes the ASCII-encoded byte stream to chars. The 
         * decoded chars are written to stdout. This is effectively a 'cat' for 
         * input ascii files, but another charset encoding could be used by simply 
         * specifying it on the command line. 
         */  
        public static void main(String[] argv) throws IOException {  
            // Default charset is ISO-8859-1 (Latin-1), a superset of ASCII  
            String charsetName = "ISO-8859-1";  
            // Charset name can be specified on the command line  
            if (argv.length > 0) {  
                charsetName = argv[0];  
            }  
            // Wrap a Channel around stdin, wrap a channel around stdout,  
            // find the named Charset and pass them to the decode method.  
            // If the named charset is not valid, an exception of type  
            // UnsupportedCharsetException will be thrown.  
            decodeChannel(Channels.newChannel(System.in), new OutputStreamWriter(  
                    System.out), Charset.forName(charsetName));  
        }  

          
        public static void decodeChannel(ReadableByteChannel source, Writer writer,  
                Charset charset) throws UnsupportedCharsetException, IOException {  
            // Get a decoder instance from the Charset  
            CharsetDecoder decoder = charset.newDecoder();  
            // Tell decoder to replace bad chars with default mark  
            decoder.onMalformedInput(CodingErrorAction.REPLACE);  
            decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);  
            // Allocate radically different input and output buffer sizes  
            // for testing purposes  
            ByteBuffer bb = ByteBuffer.allocateDirect(16 * 1024);  
            CharBuffer cb = CharBuffer.allocate(57);  
            // Buffer starts empty; indicate input is needed  
            CoderResult result = CoderResult.UNDERFLOW;  
            boolean eof = false;  
            while (!eof) {  
                // Input buffer underflow; decoder wants more input  
                if (result == CoderResult.UNDERFLOW) {  
                    // decoder consumed all input, prepare to refill  
                    bb.clear();  
                    // Fill the input buffer; watch for EOF  
                    eof = (source.read(bb) == -1);  
                    // Prepare the buffer for reading by decoder  
                    bb.flip();  
                }  
                // Decode input bytes to output chars; pass EOF flag  
                result = decoder.decode(bb, cb, eof);  
                // If output buffer is full, drain output  
                if (result == CoderResult.OVERFLOW) {  
                    drainCharBuf(cb, writer);  
                }  
            }  
            // Flush any remaining state from the decoder, being careful  
            // to detect output buffer overflow(s)  
            while (decoder.flush(cb) == CoderResult.OVERFLOW) {  
                drainCharBuf(cb, writer);  
            }  
            // Drain any chars remaining in the output buffer  
            drainCharBuf(cb, writer);  
            // Close the channel; push out any buffered data to stdout  
            source.close();  
            writer.flush();  
        }  

          
        static void drainCharBuf(CharBuffer cb, Writer writer) throws IOException {  
            cb.flip(); // Prepare buffer for draining  
            // This writes the chars contained in the CharBuffer but  
            // doesn't actually modify the state of the buffer.  
            // If the char buffer was being drained by calls to get( ),  
            // a loop might be needed here.  
            if (cb.hasRemaining()) {  
                writer.write(cb.toString());  
            }  
            cb.clear(); // Prepare buffer to be filled again  
        }  
    }  

Character set service provider interface: the pluggable SPI architecture is used throughout the Java environment in many different contexts. There are eight packages in the 1.4 JDK, one called spi and the rest with other names. Pluggability is a powerful design technique and one of the cornerstones of Java's portability and adaptability.

Before diving into the API, it helps to explain how the Charset SPI works. The java.nio.charset.spi package contains only one abstract class, CharsetProvider. Concrete implementations of this class supply information about the Charset objects they provide. To define a custom character set, you must first create concrete implementations of Charset, CharsetEncoder, and CharsetDecoder from the java.nio.charset package. You then create a custom subclass of CharsetProvider, which supplies those classes to the JVM.

Create custom character set:

At a minimum, you must create a subclass of java.nio.charset.Charset and provide concrete implementations of its three abstract methods, plus a constructor. The Charset class has no default, parameterless constructor. This means that your custom charset class must have a constructor, even if it accepts no arguments. This is because you must invoke the Charset constructor at instantiation time (by calling super() at the beginning of your constructor) to pass it your charset's canonical name and aliases. Doing so lets the methods in the Charset class handle the name-related work for you, so that's a good thing.

Similarly, you need to provide specific implementations of CharsetEncoder and CharsetDecoder. Recall that a character set is a collection of encoded characters and encoding/decoding schemes. As we've seen before, encoding and decoding are almost symmetrical at the API level. Here is a brief discussion of what is needed to implement an encoder: the same applies to building a decoder.

Like Charset, CharsetEncoder does not have a default constructor, so you need to call super() in the concrete class constructor to provide the required parameters.

To supply your own CharsetEncoder implementation, you must at least provide a concrete encodeLoop() method. For simple encoding algorithms, the default implementations of the other methods should work fine. Note that encodeLoop() takes parameters similar to those of encode(), minus the boolean flag. The encode() method delegates the actual encoding to encodeLoop(), which only needs to concern itself with consuming characters from the CharBuffer parameter and writing the encoded bytes to the supplied ByteBuffer.


Now that we've seen how to implement the custom character set, including the associated encoders and decoders, let's take a look at how to connect them to the JVM so you can use them to run code.


Providing your custom character set:

  To make your own Charset implementation available to the JVM runtime environment, you must create a concrete subclass of the CharsetProvider class in java.nio.charset.spi, with a no-argument constructor. The no-argument constructor is important because your CharsetProvider class will be located by reading its fully qualified name from a configuration file. The class name string is then used with Class.newInstance() to instantiate your provider, which works only through the no-argument constructor.

The configuration file that the JVM reads to locate charset providers is named java.nio.charset.spi.CharsetProvider. It resides in a resource directory (META-INF/services) on the JVM classpath. Each Java archive (JAR) has a META-INF directory that can contain information about the classes and resources in that JAR. A directory named META-INF can also be placed at the top of a regular directory on the JVM classpath.

The CharsetProvider API is almost trivial. The actual work of providing a custom character set takes place in creating the custom Charset, CharsetEncoder, and CharsetDecoder classes. CharsetProvider is simply the facilitator that connects your character set to the runtime environment.
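
To give a feel for how small a provider can be, here is a minimal sketch. Everything named Demo in it is hypothetical: DemoCharset simply reuses the UTF-8 encoder and decoder under a private "X-" name, which is a shortcut for illustration rather than a real custom encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

public class DemoCharsetProvider extends CharsetProvider {
    // Hypothetical charset: same bytes as UTF-8, under an unregistered "X-" name.
    static final class DemoCharset extends Charset {
        DemoCharset() {
            super("X-DEMO-UTF8", new String[] { "x-demo" });
        }
        @Override public boolean contains(Charset cs) {
            return StandardCharsets.UTF_8.contains(cs);
        }
        @Override public CharsetEncoder newEncoder() {
            return StandardCharsets.UTF_8.newEncoder();
        }
        @Override public CharsetDecoder newDecoder() {
            return StandardCharsets.UTF_8.newDecoder();
        }
    }

    private final Charset demo = new DemoCharset();

    // The no-argument constructor is what the lookup mechanism relies on
    public DemoCharsetProvider() {
    }

    // Returns the charset for a canonical name or alias, or null if unknown
    @Override public Charset charsetForName(String charsetName) {
        if (charsetName.equalsIgnoreCase(demo.name())) {
            return demo;
        }
        for (String alias : demo.aliases()) {
            if (alias.equalsIgnoreCase(charsetName)) {
                return demo;
            }
        }
        return null;
    }

    // Iterates over all charsets this provider supplies
    @Override public Iterator<Charset> charsets() {
        return Collections.singletonList(demo).iterator();
    }
}
```

To register such a provider, the file META-INF/services/java.nio.charset.spi.CharsetProvider on the classpath would contain one line with the provider's fully qualified class name (here just DemoCharsetProvider, since the sketch uses the default package).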


Example 6-3 demonstrates the implementation of a custom Charset and CharsetProvider, with sample code that illustrates character set use, encoding and decoding, and the Charset SPI. Example 6-3 implements a custom Charset.

  Example 6-3. Custom Rot13 character set


    package com.ronsoft.books.nio.charset;  

    import java.nio.CharBuffer;  
    import java.nio.ByteBuffer;  
    import java.nio.charset.Charset;  
    import java.nio.charset.CharsetEncoder;  
    import java.nio.charset.CharsetDecoder;  
    import java.nio.charset.CoderResult;  
    import java.util.Map;  
    import java.util.Iterator;  
    import java.io.Writer;  
    import java.io.PrintStream;  
    import java.io.PrintWriter;  
    import java.io.OutputStreamWriter;  
    import java.io.BufferedReader;  
    import java.io.InputStreamReader;  
    import java.io.FileReader;  

      
    public class Rot13Charset extends Charset {  
        // the name of the base charset encoding we delegate to  
        private static final String BASE_CHARSET_NAME = "UTF-8";  
        // Handle to the real charset we'll use for transcoding between  
        // characters and bytes. Doing this allows us to apply the Rot13  
        // algorithm to multibyte charset encodings. But only the  
        // ASCII alpha chars will be rotated, regardless of the base encoding.  
        Charset baseCharset;  

          
        protected Rot13Charset(String canonical, String[] aliases) {  
            super(canonical, aliases);  
            // Save the base charset we're delegating to  
            baseCharset = Charset.forName(BASE_CHARSET_NAME);  
        }  

        // ----------------------------------------------------------  
          
        public CharsetEncoder newEncoder() {  
            return new Rot13Encoder(this, baseCharset.newEncoder());  
        }  

          
        public CharsetDecoder newDecoder() {  
            return new Rot13Decoder(this, baseCharset.newDecoder());  
        }  

          
        public boolean contains(Charset cs) {  
            return (false);  
        }  

          
        private void rot13(CharBuffer cb) {  
            for (int pos = cb.position(); pos < cb.limit(); pos++) {  
                char c = cb.get(pos);  
                char a = '\u0000';  
                // Is it lowercase alpha?  
                if ((c >= 'a') && (c <= 'z')) {  
                    a = 'a';  
                }  
                // Is it uppercase alpha?  
                if ((c >= 'A') && (c <= 'Z')) {  
                    a = 'A';  
                }  
                // If either, roll it by 13  
                if (a != '\u0000') {  
                    c = (char) ((((c - a) + 13) % 26) + a);  
                    cb.put(pos, c);  
                }  
            }  
        }  

        // --------------------------------------------------------  
          
        private class Rot13Encoder extends CharsetEncoder {  
            private CharsetEncoder baseEncoder;  

              
            Rot13Encoder(Charset cs, CharsetEncoder baseEncoder) {  
                super(cs, baseEncoder.averageBytesPerChar(), baseEncoder  
                        .maxBytesPerChar());  
                this.baseEncoder = baseEncoder;  
            }  

              
            protected CoderResult encodeLoop(CharBuffer cb, ByteBuffer bb) {  
                CharBuffer tmpcb = CharBuffer.allocate(cb.remaining());  
                while (cb.hasRemaining()) {  
                    tmpcb.put(cb.get());  
                }  
                tmpcb.rewind();  
                rot13(tmpcb);  
                baseEncoder.reset();  
                CoderResult cr = baseEncoder.encode(tmpcb, bb, true);  
                // If error or output overflow, we need to adjust  
                // the position of the input buffer to match what  
                // was really consumed from the temp buffer. If  
                // underflow (all input consumed), this is a no-op.  
                cb.position(cb.position() - tmpcb.remaining());  
                return (cr);  
            }  
        }  

        // --------------------------------------------------------  
          
        private class Rot13Decoder extends CharsetDecoder {  
            private CharsetDecoder baseDecoder;  

            /** 
             * Constructor, call the superclass constructor with the Charset object 
             * and pass along the chars/byte values from the delegate decoder. 
             */  
            Rot13Decoder(Charset cs, CharsetDecoder baseDecoder) {  
                super(cs, baseDecoder.averageCharsPerByte(), baseDecoder  
                        .maxCharsPerByte());  
                this.baseDecoder = baseDecoder;  
            }  

              
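            protected CoderResult decodeLoop(ByteBuffer bb, CharBuffer cb) {  
                // Remember where the decoded chars will start  
                int start = cb.position();  
                baseDecoder.reset();  
                CoderResult result = baseDecoder.decode(bb, cb, true);  
                // rot13() walks from position to limit, so pass it a view  
                // covering only the chars just written: [start, position)  
                CharBuffer written = cb.duplicate();  
                written.limit(cb.position());  
                written.position(start);  
                rot13(written);  
                return (result);  
            }  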
        }  

        // --------------------------------------------------------  
          
        public static void main(String[] argv) throws Exception {  
            BufferedReader in;  
            if (argv.length > 0) {  
                // Open the named file  
                in = new BufferedReader(new FileReader(argv[0]));  
            } else {  
                // Wrap a BufferedReader around stdin  
                in = new BufferedReader(new InputStreamReader(System.in));  
            }  
            // Create a PrintStream that uses the Rot13 encoding  
            PrintStream out = new PrintStream(System.out, false, "X-ROT13");  
            String s = null;  
            // Read all input and write it to the output.  
            // As the data passes through the PrintStream,  
            // it will be Rot13-encoded.  
            while ((s = in.readLine()) != null) {  
                out.println(s);  
            }  
            out.flush();  
        }  
    }  
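
The rotation at the heart of the class above can be tried in isolation, without any charset machinery. The following is a minimal sketch, not part of the original listing: the class name `Rot13Sketch` and its `String`-based `rot13` helper are hypothetical names introduced for illustration. It shows that only ASCII letters are rotated and that applying the rotation twice restores the input.

```java
public class Rot13Sketch {
    // Hypothetical helper: rotate ASCII letters by 13 places and
    // leave every other character untouched.
    static String rot13(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= 'a' && c <= 'z') {
                c = (char) ((c - 'a' + 13) % 26 + 'a');
            } else if (c >= 'A' && c <= 'Z') {
                c = (char) ((c - 'A' + 13) % 26 + 'A');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String plain = "Hello, World!";
        String rotated = rot13(plain);
        System.out.println(rotated);        // prints "Uryyb, Jbeyq!"
        System.out.println(rot13(rotated)); // prints "Hello, World!"
    }
}
```

Because 13 is half of 26, the same operation serves as both encoder and decoder, which is why the charset above can share one `rot13` method between the two.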

To use this Charset and its encoder and decoder, it must be made available to the Java runtime environment. This is done with the CharsetProvider class (Example 6-4).

Example 6-4. Custom character set provider

    package com.ronsoft.books.nio.charset;  

    import java.nio.charset.Charset;  
    import java.nio.charset.spi.CharsetProvider;  
    import java.util.HashSet;  
    import java.util.Iterator;  

    /** 
     * A CharsetProvider class which makes available the charsets provided by 
     * Ronsoft. Currently there is only one, namely the X-ROT13 charset. This is 
     * not a registered IANA charset, so its name begins with "X-" to avoid name 
     * clashes with official charsets. 
     *  
     * To activate this CharsetProvider, it's necessary to add a file to the 
     * classpath of the JVM runtime at the following location: 
     * META-INF/services/java.nio.charset.spi.CharsetProvider 
     *  
     * That file must contain a line with the fully qualified name of this class on 
     * a line by itself: com.ronsoft.books.nio.charset.RonsoftCharsetProvider 
     *  
     * See the javadoc page for java.nio.charset.spi.CharsetProvider for full 
     * details. 
     *  
     * @author Ron Hitchens (ron@ronsoft.com) 
     */  
    public class RonsoftCharsetProvider extends CharsetProvider {  
        // the name of the charset we provide  
        private static final String CHARSET_NAME = "X-ROT13";  
        // a handle to the Charset object  
        private Charset rot13 = null;  

          
        public RonsoftCharsetProvider() {  
            this.rot13 = new Rot13Charset(CHARSET_NAME, new String[0]);  
        }  

          
        public Charset charsetForName(String charsetName) {  
            if (charsetName.equalsIgnoreCase(CHARSET_NAME)) {  
                return (rot13);  
            }  
            return (null);  
        }  

          
        public Iterator<Charset> charsets() {  
            HashSet<Charset> set = new HashSet<Charset>(1);  
            // This provider supplies exactly one charset  
            set.add(rot13);  
            return (set.iterator());  
        }  
    }  
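
Once the provider's class and its `META-INF/services/java.nio.charset.spi.CharsetProvider` file are on the classpath, the charset can be looked up by name like any built-in one. The sketch below is an illustration rather than part of the original example: the class `CharsetLookupSketch` and the method `lookupOrDefault` are hypothetical names. It checks whether a charset is available and falls back to a default when the provider has not been installed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetLookupSketch {
    // Hypothetical helper: return the named charset if the runtime can
    // find it (built in or via an installed provider), else the fallback.
    static Charset lookupOrDefault(String name, Charset fallback) {
        if (Charset.isSupported(name)) {
            return Charset.forName(name);
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Resolves to X-ROT13 only if the provider is registered;
        // otherwise UTF-8 is used.
        Charset cs = lookupOrDefault("X-ROT13", StandardCharsets.UTF_8);
        System.out.println("Using charset: " + cs.name());
    }
}
```

Checking with `Charset.isSupported` first avoids the `UnsupportedCharsetException` that `Charset.forName` throws for an unknown name, which makes a missing or misnamed services file easier to diagnose.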
