Apr 13, 2012

Tokenizing

This last stop on our journey introduces String tokenizing and covers the following topics:
•     StringTokenizer
•     StreamTokenizer

StringTokenizer
Remember earlier when you wrote lines that contain data items separated by delimiters? The delimiters, which separate the data items (or tokens, as they are called), might be commas, semicolons, or tab characters.
When you read each line back in, you need to separate the tokens. Each line might represent a name and address, such as in a mail-merge file. We want to break that line up into its individual parts, such as name, street, city, state, and zip.
The way to do that is to use the StringTokenizer class. Although this is not part of the I/O hierarchy, we'll cover it here because it is useful in reading delimited streams. Here's how that class works.
Constructing a StringTokenizer object
You construct a StringTokenizer object, giving it the String you want to break up and the delimiters to use. That is, you tell it which characters separate the tokens in the String. After you have constructed the object, there are a few things you can ask it:
•     How many tokens are left in this String? (countTokens())
•     Have I used up all my tokens in this String? (hasMoreTokens())
•     Give me the next token in the String. (nextToken())
StringTokenizer breaks up the String and returns, as a String, the characters up to the next delimiter. The nextToken() method throws a NoSuchElementException if there are no more tokens. Because this exception is derived from RuntimeException, it is unchecked: it is not declared in a throws clause, and you are not required to catch it.
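As a rough illustration of the three questions above, here is a minimal sketch. The class name TokenizerQuestions, the helper split(), and the sample address line are invented for this example; only the StringTokenizer calls come from the class itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.StringTokenizer;

class TokenizerQuestions
{
   // Hypothetical helper: collects every token of a line into a list.
   static List<String> split(String line, String delimiters)
   {
      StringTokenizer st = new StringTokenizer(line, delimiters);
      // countTokens() reports how many tokens are still to come
      List<String> tokens = new ArrayList<>(st.countTokens());
      while (st.hasMoreTokens())
      {
         tokens.add(st.nextToken());
      }
      return tokens;
   }

   public static void main(String[] args)
   {
      // Invented mail-merge style line: name, street, city, state, zip
      String line = "Ann Lee,12 Elm St,Austin,TX,78701";
      StringTokenizer st = new StringTokenizer(line, ",");
      System.out.println("Tokens remaining: " + st.countTokens());
      while (st.hasMoreTokens())
      {
         System.out.println("Token is " + st.nextToken());
      }
      try
      {
         st.nextToken();   // nothing is left, so this call throws
      }
      catch (NoSuchElementException e)
      {
         System.out.println("No more tokens");
      }
   }
}
```

Checking hasMoreTokens() before each call, as the loop does, is usually enough to avoid the exception altogether; the try block is there only to show what happens otherwise.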
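One more thing: you can switch delimiters for a token. In other words, even though you have created a StringTokenizer object that is looking for particular delimiters, you can say, "For this next token, change the delimiter," by passing the new delimiters to nextToken(). The new delimiters then stay in effect for later calls until you change them again.
The example in the next section illustrates the basic tokenizing process.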
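Here is a small sketch of switching delimiters, assuming the same sample String used elsewhere in this post. The class name SwitchDelimiters and the helper tokens() are made up for the illustration. Note one detail that is easy to miss: the delimiter that ended the previous token has not been consumed yet, so the new set should normally still include it, or it becomes part of the next token.

```java
import java.util.StringTokenizer;

class SwitchDelimiters
{
   // Hypothetical helper: reads three tokens, changing delimiters on the way.
   static String[] tokens(String line)
   {
      // Start with the vertical bar as the only delimiter
      StringTokenizer st = new StringTokenizer(line, "|");
      String first = st.nextToken();        // stops at the first '|'

      // For the next token, treat both '|' and '?' as delimiters.
      // '|' stays in the set because the tokenizer is still sitting on it.
      String second = st.nextToken("|?");

      // The new delimiter set remains in effect for this call too
      String third = st.nextToken();
      return new String[] { first, second, third };
   }

   public static void main(String[] args)
   {
      for (String s : tokens("abc|def?ghi"))
      {
         System.out.println("Token is " + s);
      }
   }
}
```

With only the bar as a delimiter throughout, the second token would have been def?ghi; the switch is what splits it.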
StringTokenizer: Example code
Look at the code below. Notice that our String contains a vertical bar and a question mark. Those are our delimiters. We create a new StringTokenizer and pass it the String abc|def?ghi, along with a delimiter String containing the bar and the question mark.
The hasMoreTokens() method returns true if there are still more tokens. If it returns true, we call nextToken(). This returns a String, and s has the value abc the first time around the while loop.
We come around the loop a second time and hasMoreTokens() is still true. Now when we get the next token, def is the value of the String that is returned.
Go around the loop a third time and hasMoreTokens() is still true; ghi is returned.
Now go around the loop one more time and hasMoreTokens() is false. At this point, we drop out of the loop. So we have broken up the String without too much effort.
Here is the example code:
import java.util.*;
public class StringTokenizerExample
{
   public static void main(String args[])
   {
       String line = "abc|defghi".replace("defghi", "def?ghi");
       // Both '|' and '?' act as delimiters
       StringTokenizer st = new StringTokenizer(line, "|?");
       while (st.hasMoreTokens())
       {
          String s = st.nextToken();
          System.out.println("Token is " + s);
       }
   }
}

StreamTokenizer
You can parse an entire input stream. Unlike the StringTokenizer, which parses a String, the StreamTokenizer reads from a Reader and breaks the stream into tokens. The parsing process is controlled by syntax tables and flags.
Each token that is parsed is placed in a category. The categories include identifiers, numbers, quoted Strings, and various comment styles. The parsing that is performed is suitable for breaking a Java, C, or C++ source file into its tokens. StreamTokenizer is not in the I/O hierarchy, but because it is used with streams, we cover it here.
The StreamTokenizer recognizes characters in the range from \u0000 through \u00FF. Each character value can have up to five possible attributes. The attributes are white space, alphabetic, numeric, String quote, and comment character.
•     A character that is white space is used to separate tokens.
•     An alphabetic character is part of an identifier.
•     A numeric character can be part of an identifier or a number.
•     A String quote character surrounds a quoted String.
•     A comment character precedes or surrounds a comment.
•     A character that does not have any of these attributes is an ordinary character. When an ordinary character is encountered, it is treated as a single character token.
Flags can be set to alter the parsing. Line terminators can either be tokens or white space separating tokens. C-style and C++-style comments can be recognized and skipped. Identifiers can be converted to lower case or left as is.
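To make the flags concrete, here is a minimal sketch that turns all of them on and reads from a StringReader rather than a file. The class name TokenizerFlags, the helper words(), the "<EOL>" marker, and the sample text are assumptions made for this illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class TokenizerFlags
{
   // Hypothetical helper: returns the words of the text, marking line ends.
   static List<String> words(String text)
   {
      StreamTokenizer st = new StreamTokenizer(new StringReader(text));
      st.eolIsSignificant(true);    // line terminators become TT_EOL tokens
      st.slashSlashComments(true);  // skip C++-style // comments
      st.slashStarComments(true);   // skip C-style /* */ comments
      st.lowerCaseMode(true);       // convert word tokens to lower case

      List<String> out = new ArrayList<>();
      try
      {
         int type;
         while ((type = st.nextToken()) != StreamTokenizer.TT_EOF)
         {
            if (type == StreamTokenizer.TT_WORD)
            {
               out.add(st.sval);
            }
            else if (type == StreamTokenizer.TT_EOL)
            {
               out.add("<EOL>");
            }
         }
      }
      catch (IOException e)
      {
         throw new UncheckedIOException(e);
      }
      return out;
   }

   public static void main(String[] args)
   {
      // Collected tokens: alpha, <EOL>, beta, gamma, <EOL>
      System.out.println(words("Alpha // note\nBeta /* skip */ Gamma\n"));
   }
}
```

Leaving eolIsSignificant() at its default of false would drop the two line-end markers, since line terminators would then count as ordinary white space.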
Use of StreamTokenizer
To use a StreamTokenizer, you construct it with the underlying Reader. Then you set up the syntax tables and flags. Next, you loop on the tokens, calling nextToken() until it returns TT_EOF.
The nextToken() method parses the next token. The type is both returned by the method and placed in the ttype field. The value of the token is placed in the sval field (String value) if the token is a word or in the nval field (numeric value) if the token is a number.
The token type can be either a character or a value, which represents the type of the token. If a token consists of a single character, the value of the type is the character value. If the token is a quoted String, the value is the value of the quote character. Otherwise it is one of the following:
•     TT_EOF means the end of the stream has been read.
•     TT_EOL means the end of the line has been read (if end of line characters are treated as tokens).
•     TT_NUMBER means that a number token has been read.
•     TT_WORD means that a word token has been read.
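The token types listed above can be told apart with a switch on the ttype field. The following is a sketch only: the class name TokenTypes, the helper describe(), the labels it prints, and the sample input are invented here. It relies on the tokenizer's default syntax, in which the double quote is a String quote character and digits are parsed as numbers.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class TokenTypes
{
   // Hypothetical helper: labels each token with its type.
   static List<String> describe(String text)
   {
      StreamTokenizer st = new StreamTokenizer(new StringReader(text));
      st.eolIsSignificant(true);   // so that TT_EOL is reported
      List<String> out = new ArrayList<>();
      try
      {
         while (st.nextToken() != StreamTokenizer.TT_EOF)
         {
            switch (st.ttype)
            {
               case StreamTokenizer.TT_WORD:
                  out.add("WORD " + st.sval);
                  break;
               case StreamTokenizer.TT_NUMBER:
                  out.add("NUMBER " + st.nval);   // nval is a double
                  break;
               case StreamTokenizer.TT_EOL:
                  out.add("EOL");
                  break;
               case '"':
                  // Quoted String: ttype is the quote character,
                  // and sval holds the text between the quotes
                  out.add("QUOTED " + st.sval);
                  break;
               default:
                  // Ordinary character: ttype is the character itself
                  out.add("CHAR " + (char) st.ttype);
            }
         }
      }
      catch (IOException e)
      {
         throw new UncheckedIOException(e);
      }
      return out;
   }

   public static void main(String[] args)
   {
      for (String s : describe("total = 42 \"two words\"\n"))
      {
         System.out.println(s);
      }
   }
}
```

For the sample line, the tokens come out as a word (total), an ordinary character (=), a number (42.0), a quoted String (two words), and an end of line.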
Methods of StreamTokenizer
There are many methods in StreamTokenizer to set up the syntax tables. We'll only mention two here. The first is resetSyntax(), which sets all characters to ordinary. The second is wordChars(), which gives a range of characters the alphabetic attribute.
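A quick way to see what these two methods do is to reset the tables, mark only letters as word characters, and look at everything that comes back. This is a sketch under those assumptions; the class name SyntaxTableDemo, the helper tokens(), and the bracket notation for single characters are invented for the illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class SyntaxTableDemo
{
   // Hypothetical helper: words are returned as is,
   // ordinary characters are wrapped in square brackets.
   static List<String> tokens(String text)
   {
      StreamTokenizer st = new StreamTokenizer(new StringReader(text));
      st.resetSyntax();          // every character is now ordinary
      st.wordChars('A', 'Z');    // upper-case letters are alphabetic
      st.wordChars('a', 'z');    // lower-case letters are alphabetic

      List<String> out = new ArrayList<>();
      try
      {
         int type;
         while ((type = st.nextToken()) != StreamTokenizer.TT_EOF)
         {
            if (type == StreamTokenizer.TT_WORD)
            {
               out.add(st.sval);
            }
            else
            {
               // After resetSyntax(), spaces, digits, and punctuation
               // all come back as single-character tokens
               out.add("[" + (char) type + "]");
            }
         }
      }
      catch (IOException e)
      {
         throw new UncheckedIOException(e);
      }
      return out;
   }

   public static void main(String[] args)
   {
      // Collected tokens: Hi, [,], [ ], Bob, [ ], [4], [2]
      System.out.println(tokens("Hi, Bob 42"));
   }
}
```

Because resetSyntax() also clears the white space and numeric attributes, the space and each digit show up as separate tokens; a program that only wants words simply ignores them.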
StreamTokenizer: Example code
This example is a simple one that shows the use of the StreamTokenizer. It breaks a file into words consisting of lower- or upper-case letters. When a word is found, nextToken() returns TT_WORD and the word itself is available in the sval field.
If an ordinary character is found (in this case, anything not set as alphabetic), nextToken() returns the value of that character.
import java.io.*;
public class StreamTokenizerExample
{
   public static void main(String args[])
   {
      // try-with-resources closes the reader when we are done
      try (BufferedReader br = new BufferedReader(new FileReader("t.txt")))
      {
          StreamTokenizer st = new StreamTokenizer(br);
          st.resetSyntax();          // every character starts out ordinary
          st.wordChars('A', 'Z');    // upper-case letters form words
          st.wordChars('a', 'z');    // lower-case letters form words
          int type;
          while ((type = st.nextToken()) != StreamTokenizer.TT_EOF)
          {
             // Ordinary characters are returned as their own value; skip them
             if (type == StreamTokenizer.TT_WORD)
             {
                System.out.println(st.sval);
             }
          }
      }
      catch (IOException e)
      {
         System.out.println(e);
      }
   }
}


