Class WordDelimiterIterator


  • public final class WordDelimiterIterator
    extends java.lang.Object
    A BreakIterator-like API for iterating over subwords in text, according to WordDelimiterGraphFilter rules.
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static int ALPHA  
      static int ALPHANUM  
      private byte[] charTypeTable  
      (package private) int current
      Beginning of subword
      static byte[] DEFAULT_WORD_DELIM_TABLE  
      (package private) static int DIGIT  
      static int DONE
      Indicates the end of iteration
      (package private) int end
      End of subword
      (package private) int endBounds
      End position of text, excluding trailing delimiters
      private boolean hasFinalPossessive  
      (package private) int length  
      (package private) static int LOWER  
      private boolean skipPossessive
      If true, the possessive found in the last call to next() must be skipped
      (package private) boolean splitOnCaseChange
      If false, causes case changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens).
      (package private) boolean splitOnNumerics
      If false, causes numeric changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens).
      (package private) int startBounds
      Start position of text, excluding leading delimiters
      (package private) boolean stemEnglishPossessive
      If true, causes trailing "'s" to be removed for each subword.
      (package private) static int SUBWORD_DELIM  
      (package private) char[] text  
      (package private) static int UPPER  
    • Constructor Summary

      Constructors 
      Constructor Description
      WordDelimiterIterator​(byte[] charTypeTable, boolean splitOnCaseChange, boolean splitOnNumerics, boolean stemEnglishPossessive)
      Create a new WordDelimiterIterator operating with the supplied rules.
    • Method Summary

      Modifier and Type Method Description
      private int charType​(int ch)
      Determines the type of the given character
      private boolean endsWithPossessive​(int pos)
      Determines if the text at the given position indicates an English possessive which should be removed
      static byte getType​(int ch)
      Computes the type of the given character
      (package private) static boolean isAlpha​(int type)
      Checks if the given word type includes ALPHA
      private boolean isBreak​(int lastType, int type)
      Determines whether the transition from lastType to type indicates a break
      (package private) static boolean isDigit​(int type)
      Checks if the given word type includes DIGIT
      (package private) boolean isSingleWord()
      Determines if the current word contains only one subword.
      (package private) static boolean isSubwordDelim​(int type)
      Checks if the given word type includes SUBWORD_DELIM
      (package private) static boolean isUpper​(int type)
      Checks if the given word type includes UPPER
      (package private) int next()
      Advance to the next subword in the string.
      private void setBounds()
      Set the internal word bounds (remove leading and trailing delimiters).
      (package private) void setText​(char[] text, int length)
      Reset the text to a new value, and reset all state
      (package private) int type()
      Return the type of the current subword.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • DEFAULT_WORD_DELIM_TABLE

        public static final byte[] DEFAULT_WORD_DELIM_TABLE
      • text

        char[] text
      • length

        int length
      • startBounds

        int startBounds
        Start position of text, excluding leading delimiters
      • endBounds

        int endBounds
        End position of text, excluding trailing delimiters
      • current

        int current
        Beginning of subword
      • end

        int end
        End of subword
      • hasFinalPossessive

        private boolean hasFinalPossessive
      • splitOnCaseChange

        final boolean splitOnCaseChange
        If false, causes case changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens). (Defaults to true)
      • splitOnNumerics

        final boolean splitOnNumerics
        If false, causes numeric changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens). (Defaults to true)
      • stemEnglishPossessive

        final boolean stemEnglishPossessive
        If true, causes trailing "'s" to be removed for each subword. (Defaults to true)

        "O'Neil's" => "O", "Neil"

      • charTypeTable

        private final byte[] charTypeTable
      • skipPossessive

        private boolean skipPossessive
        If true, the possessive found in the last call to next() must be skipped
    • Constructor Detail

      • WordDelimiterIterator

        WordDelimiterIterator​(byte[] charTypeTable,
                              boolean splitOnCaseChange,
                              boolean splitOnNumerics,
                              boolean stemEnglishPossessive)
        Create a new WordDelimiterIterator operating with the supplied rules.
        Parameters:
        charTypeTable - table containing character types
        splitOnCaseChange - if true, causes "PowerShot" to be split into two tokens, "Power" and "Shot" ("Power-Shot" is split into two parts regardless)
        splitOnNumerics - if true, causes "j2se" to be split into three tokens: "j", "2", "se"
        stemEnglishPossessive - if true, causes trailing "'s" to be removed for each subword: "O'Neil's" => "O", "Neil"
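
The effect of the three flags can be approximated with a small standalone sketch. This is not the Lucene implementation: the class SubwordSplitSketch, its typeOf and split helpers, and the numeric flag values are assumptions made for illustration, and the classification is deliberately simplified instead of using a charTypeTable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the Lucene implementation.
final class SubwordSplitSketch {
    // Assumed flag values; the documentation above does not state them.
    static final int LOWER = 0x01, UPPER = 0x02, DIGIT = 0x04, SUBWORD_DELIM = 0x08;
    static final int ALPHA = LOWER | UPPER;

    // Simplified classification standing in for a charTypeTable lookup.
    static int typeOf(char c) {
        if (Character.isLowerCase(c)) return LOWER;
        if (Character.isUpperCase(c)) return UPPER;
        if (Character.isDigit(c)) return DIGIT;
        return SUBWORD_DELIM;
    }

    // Decides whether moving from lastType to type starts a new subword.
    static boolean isBreak(int lastType, int type,
                           boolean splitOnCaseChange, boolean splitOnNumerics) {
        if ((type & lastType) != 0) return false;                   // same kind of character
        boolean lastAlpha = (lastType & ALPHA) != 0, alpha = (type & ALPHA) != 0;
        boolean lastDigit = (lastType & DIGIT) != 0, digit = (type & DIGIT) != 0;
        if (!splitOnCaseChange && lastAlpha && alpha) return false; // case changes ignored
        if ((lastType & UPPER) != 0 && alpha) return false;         // capital followed by a letter
        if (!splitOnNumerics && ((lastAlpha && digit) || (lastDigit && alpha))) return false;
        return true;
    }

    static List<String> split(String s, boolean splitOnCaseChange,
                              boolean splitOnNumerics, boolean stemEnglishPossessive) {
        List<String> out = new ArrayList<>();
        int n = s.length();
        int i = 0;
        while (i < n) {
            // Skip delimiters between subwords.
            while (i < n && typeOf(s.charAt(i)) == SUBWORD_DELIM) i++;
            if (i >= n) break;
            int start = i;
            int last = typeOf(s.charAt(i++));
            while (i < n) {
                int t = typeOf(s.charAt(i));
                if (isBreak(last, t, splitOnCaseChange, splitOnNumerics)) break;
                last = t;
                i++;
            }
            out.add(s.substring(start, i));
            // Drop a trailing "'s" that follows a letter and ends the text or precedes a delimiter.
            if (stemEnglishPossessive && i + 1 < n && s.charAt(i) == '\''
                    && (s.charAt(i + 1) == 's' || s.charAt(i + 1) == 'S')
                    && (last & ALPHA) != 0
                    && (i + 2 == n || typeOf(s.charAt(i + 2)) == SUBWORD_DELIM)) {
                i += 2;
            }
        }
        return out;
    }
}
```

Under these assumptions, split("PowerShot", true, true, true) yields "Power" and "Shot", split("j2se", true, true, true) yields "j", "2" and "se", and split("O'Neil's", true, true, true) yields "O" and "Neil". With the corresponding flag set to false, "PowerShot" and "j2se" each stay whole.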
    • Method Detail

      • next

        int next()
        Advance to the next subword in the string.
        Returns:
        index of the next subword, or DONE if all subwords have been returned
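
Because next() is package-private, the iteration contract is easiest to illustrate with the JDK class this API is modelled on. The sketch below uses java.text.BreakIterator, not WordDelimiterIterator, so the boundaries follow the JDK's word rules and not the subword rules described here; only the loop shape (advance until DONE is returned) is the point. The class name DoneLoopSketch and its segments helper are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper showing the "advance until DONE" loop with the JDK BreakIterator.
final class DoneLoopSketch {
    static List<String> segments(String s) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(s);
        List<String> out = new ArrayList<>();
        int start = it.first();
        // next() returns the following boundary, or DONE once the text is exhausted.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(s.substring(start, end));
        }
        return out;
    }
}
```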
      • type

        int type()
        Return the type of the current subword. This currently uses the type of the first character in the subword.
        Returns:
        type of the current subword
      • setText

        void setText​(char[] text,
                     int length)
        Reset the text to a new value, and reset all state
        Parameters:
        text - New text
        length - length of the text
      • isBreak

        private boolean isBreak​(int lastType,
                                int type)
        Determines whether the transition from lastType to type indicates a break
        Parameters:
        lastType - Last subword type
        type - Current subword type
        Returns:
        true if the transition indicates a break, false otherwise
      • isSingleWord

        boolean isSingleWord()
        Determines if the current word contains only one subword. Note that the subword may still be surrounded by delimiters.
        Returns:
        true if the current word contains only one subword, false otherwise
      • setBounds

        private void setBounds()
        Set the internal word bounds (remove leading and trailing delimiters). If a possessive is found, it is not removed at this point, only noted.
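
The trimming step can be pictured with a minimal sketch. The class BoundsSketch and its isDelim test are assumptions made for illustration (any character that is not a letter or digit counts as a delimiter here), and the possessive bookkeeping mentioned above is left out.

```java
// Hypothetical helper for illustration; not the Lucene implementation.
final class BoundsSketch {
    // Assumed delimiter test: anything that is not a letter or digit.
    static boolean isDelim(char c) {
        return !Character.isLetterOrDigit(c);
    }

    // Returns {startBounds, endBounds}: the text range without leading and trailing delimiters.
    static int[] bounds(char[] text, int length) {
        int start = 0;
        while (start < length && isDelim(text[start])) start++;
        int end = length;
        while (end > start && isDelim(text[end - 1])) end--;
        return new int[] {start, end};
    }
}
```

For the text "--foo-bar__" this gives a start of 2 and an end of 9, so the inner "-" stays inside the bounds and is handled later as a subword delimiter.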
      • endsWithPossessive

        private boolean endsWithPossessive​(int pos)
        Determines if the text at the given position indicates an English possessive which should be removed
        Parameters:
        pos - Position in the text to check if it indicates an English possessive
        Returns:
        true if the text at the position indicates an English possessive, false otherwise
      • charType

        private int charType​(int ch)
        Determines the type of the given character
        Parameters:
        ch - Character whose type is to be determined
        Returns:
        Type of the character
      • getType

        public static byte getType​(int ch)
        Computes the type of the given character
        Parameters:
        ch - Character whose type is to be determined
        Returns:
        Type of the character
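
One plausible way to compute such a type is to map Unicode general categories onto the type flags, and to consult a lookup table first when one is supplied. The sketch below is an approximation under stated assumptions: the class CharTypeSketch, the flag values, the category mapping, and the table-then-fallback order in charType are illustrative and are not confirmed by this page.

```java
// Hypothetical stand-in for illustration; not the Lucene class.
final class CharTypeSketch {
    // Assumed flag values, chosen so that ALPHA covers both cases.
    static final byte LOWER = 0x01;
    static final byte UPPER = 0x02;
    static final byte DIGIT = 0x04;
    static final byte SUBWORD_DELIM = 0x08;
    static final byte ALPHA = 0x03;

    // Maps the Unicode general category of a code point to a type flag.
    static byte getType(int ch) {
        switch (Character.getType(ch)) {
            case Character.UPPERCASE_LETTER:
                return UPPER;
            case Character.LOWERCASE_LETTER:
                return LOWER;
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
                return ALPHA;
            case Character.DECIMAL_DIGIT_NUMBER:
            case Character.LETTER_NUMBER:
            case Character.OTHER_NUMBER:
                return DIGIT;
            default:
                return SUBWORD_DELIM;
        }
    }

    // Uses the table for code points it covers, and the computed type otherwise.
    static int charType(byte[] table, int ch) {
        return ch < table.length ? table[ch] : getType(ch);
    }
}
```

A custom table makes it possible to override individual characters, for example to treat "-" as a letter so that it no longer splits subwords, while characters outside the table keep their computed type.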
      • isAlpha

        static boolean isAlpha​(int type)
        Checks if the given word type includes ALPHA
        Parameters:
        type - Word type to check
        Returns:
        true if the type contains ALPHA, false otherwise
      • isDigit

        static boolean isDigit​(int type)
        Checks if the given word type includes DIGIT
        Parameters:
        type - Word type to check
        Returns:
        true if the type contains DIGIT, false otherwise
      • isSubwordDelim

        static boolean isSubwordDelim​(int type)
        Checks if the given word type includes SUBWORD_DELIM
        Parameters:
        type - Word type to check
        Returns:
        true if the type contains SUBWORD_DELIM, false otherwise
      • isUpper

        static boolean isUpper​(int type)
        Checks if the given word type includes UPPER
        Parameters:
        type - Word type to check
        Returns:
        true if the type contains UPPER, false otherwise
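
The word "includes" in the four predicates above suggests that types are bit flags that can be combined. A minimal sketch of that reading follows; the class TypeFlagsSketch and the numeric values are assumptions, since this page lists the constants without their values.

```java
// Hypothetical stand-in for illustration; flag values are assumed, not taken from Lucene.
final class TypeFlagsSketch {
    static final int LOWER = 0x01;
    static final int UPPER = 0x02;
    static final int DIGIT = 0x04;
    static final int SUBWORD_DELIM = 0x08;
    static final int ALPHA = LOWER | UPPER;     // either letter case
    static final int ALPHANUM = ALPHA | DIGIT;  // letters and digits

    // Each predicate tests whether the type shares at least one bit with the flag.
    static boolean isAlpha(int type) { return (type & ALPHA) != 0; }
    static boolean isDigit(int type) { return (type & DIGIT) != 0; }
    static boolean isSubwordDelim(int type) { return (type & SUBWORD_DELIM) != 0; }
    static boolean isUpper(int type) { return (type & UPPER) != 0; }
}
```

With this encoding a combined type such as ALPHANUM satisfies both isAlpha and isDigit, which is why the predicates are phrased as "includes" and not as equality checks.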