Class Stemmer


  • final class Stemmer
    extends java.lang.Object
    Stemmer uses the affix rules declared in the Dictionary to generate one or more stems for a word. It conforms to the original Hunspell algorithm, including recursive suffix stripping.
    • Constructor Summary

      Constructors 
      Constructor Description
      Stemmer​(Dictionary dictionary)
      Constructs a new Stemmer which will use the provided Dictionary to create its stems.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      (package private) java.util.List<CharsRef> applyAffix​(char[] strippedWord, int length, int affix, int prefixFlag, int recursionDepth, boolean prefix, boolean circumfix, boolean caseVariant)
      Applies the affix rule to the given word, producing a list of stems if any are found
      private void caseFoldLower​(char[] word, int length)
      Folds the lowercase variant of the (title-cased) word into lowerBuffer
      private void caseFoldTitle​(char[] word, int length)
      Folds the title-case variant of the word into titleBuffer
      private int caseOf​(char[] word, int length)
      Returns the EXACT_CASE, TITLE_CASE, or UPPER_CASE type for the word
      private boolean checkCondition​(int condition, char[] c1, int c1off, int c1len, char[] c2, int c2off, int c2len)
      Checks whether the concatenation of two strings satisfies the condition
      private java.util.List<CharsRef> doStem​(char[] word, int length, boolean caseVariant)  
      private boolean hasCrossCheckedFlag​(char flag, char[] flags, boolean matchEmpty)
      Checks if the given flag cross checks with the given array of flags
      private CharsRef newStem​(char[] buffer, int length, IntsRef forms, int formID)  
      java.util.List<CharsRef> stem​(char[] word, int length)
      Find the stem(s) of the provided word
      private java.util.List<CharsRef> stem​(char[] word, int length, int previous, int prevFlag, int prefixFlag, int recursionDepth, boolean doPrefix, boolean doSuffix, boolean previousWasPrefix, boolean circumfix, boolean caseVariant)
      Generates a list of stems for the provided word
      java.util.List<CharsRef> stem​(java.lang.String word)
      Find the stem(s) of the provided word.
      java.util.List<CharsRef> uniqueStems​(char[] word, int length)
      Find the unique stem(s) of the provided word
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • scratch

        private final BytesRef scratch
      • segment

        private final java.lang.StringBuilder segment
      • scratchSegment

        private final java.lang.StringBuilder scratchSegment
      • scratchBuffer

        private char[] scratchBuffer
      • formStep

        private final int formStep
      • lowerBuffer

        private char[] lowerBuffer
      • titleBuffer

        private char[] titleBuffer
    • Constructor Detail

      • Stemmer

        public Stemmer​(Dictionary dictionary)
        Constructs a new Stemmer which will use the provided Dictionary to create its stems.
        Parameters:
        dictionary - Dictionary that will be used to create the stems
    • Method Detail

      • stem

        public java.util.List<CharsRef> stem​(java.lang.String word)
        Find the stem(s) of the provided word.
        Parameters:
        word - Word to find the stems for
        Returns:
        List of stems for the word
      • stem

        public java.util.List<CharsRef> stem​(char[] word,
                                             int length)
        Find the stem(s) of the provided word
        Parameters:
        word - Word to find the stems for
        length - valid length of the word
        Returns:
        List of stems for the word
      • caseOf

        private int caseOf​(char[] word,
                           int length)
        Returns the EXACT_CASE, TITLE_CASE, or UPPER_CASE type for the word
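The classification that caseOf describes can be sketched roughly as follows. This is an illustration under stated assumptions, not the class's actual code: the class name CaseOfSketch and the numeric values of the constants are hypothetical, and only the constant names come from the description above. The sketch assumes a word is title case when only its first character is uppercase, upper case when no character after the first is non-uppercase, and exact case otherwise.

```java
final class CaseOfSketch {
    // Constant names come from the description; the values are assumed.
    static final int EXACT_CASE = 0;
    static final int TITLE_CASE = 1;
    static final int UPPER_CASE = 2;

    static int caseOf(char[] word, int length) {
        // An empty word, or one not starting with an uppercase letter, is taken as-is.
        if (length == 0 || !Character.isUpperCase(word[0])) {
            return EXACT_CASE;
        }
        boolean seenUpper = false; // an uppercase character after the first
        boolean seenOther = false; // a non-uppercase character after the first
        for (int i = 1; i < length; i++) {
            boolean upper = Character.isUpperCase(word[i]);
            seenUpper |= upper;
            seenOther |= !upper;
        }
        if (!seenOther) {
            return UPPER_CASE;
        }
        if (!seenUpper) {
            return TITLE_CASE;
        }
        return EXACT_CASE; // mixed case such as "HeLLo"
    }

    public static void main(String[] args) {
        for (String w : new String[] {"hello", "Hello", "HELLO", "HeLLo"}) {
            System.out.println(w + " -> " + caseOf(w.toCharArray(), w.length()));
        }
    }
}
```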
      • caseFoldTitle

        private void caseFoldTitle​(char[] word,
                                   int length)
        Folds the title-case variant of the word into titleBuffer
      • caseFoldLower

        private void caseFoldLower​(char[] word,
                                   int length)
        Folds the lowercase variant of the (title-cased) word into lowerBuffer
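One plausible reading of the two folding helpers is sketched below. The buffer names match the fields listed for this class, but everything else is an assumption: the class name CaseFoldSketch and the variants helper are hypothetical, and Character.toLowerCase stands in for whatever case folding the Dictionary actually applies.

```java
import java.util.Arrays;

final class CaseFoldSketch {
    // Buffer names mirror the documented fields; the logic is assumed.
    private char[] titleBuffer = new char[0];
    private char[] lowerBuffer = new char[0];

    // Copies the word and lowercases every character after the first,
    // so "HELLO" becomes "Hello".
    void caseFoldTitle(char[] word, int length) {
        titleBuffer = Arrays.copyOf(word, length);
        for (int i = 1; i < length; i++) {
            titleBuffer[i] = Character.toLowerCase(titleBuffer[i]);
        }
    }

    // Copies an already title-cased word and lowercases only its first character.
    void caseFoldLower(char[] word, int length) {
        lowerBuffer = Arrays.copyOf(word, length);
        if (length > 0) {
            lowerBuffer[0] = Character.toLowerCase(lowerBuffer[0]);
        }
    }

    // Hypothetical helper: returns the title-case and lowercase variants of a word.
    static String[] variants(String word) {
        CaseFoldSketch s = new CaseFoldSketch();
        s.caseFoldTitle(word.toCharArray(), word.length());
        s.caseFoldLower(s.titleBuffer, word.length());
        return new String[] {new String(s.titleBuffer), new String(s.lowerBuffer)};
    }
}
```

Under these assumptions, an all-uppercase input yields a title-case variant first, and the lowercase variant is derived from that title-case form rather than from the original word.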
      • doStem

        private java.util.List<CharsRef> doStem​(char[] word,
                                                int length,
                                                boolean caseVariant)
      • uniqueStems

        public java.util.List<CharsRef> uniqueStems​(char[] word,
                                                    int length)
        Find the unique stem(s) of the provided word
        Parameters:
        word - Word to find the stems for
        length - valid length of the word
        Returns:
        List of unique stems for the word
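The deduplication step implied by uniqueStems can be illustrated with a small sketch. It is not the class's implementation: it uses String instead of CharsRef, compares stems case-sensitively, and the class name UniqueStemsSketch and method unique are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

final class UniqueStemsSketch {
    // Removes duplicate stems while keeping the first occurrence of each, in order.
    static List<String> unique(List<String> stems) {
        if (stems.size() < 2) {
            return stems; // nothing to deduplicate
        }
        return new ArrayList<>(new LinkedHashSet<>(stems));
    }

    public static void main(String[] args) {
        System.out.println(unique(Arrays.asList("walk", "walk", "walking")));
    }
}
```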
      • newStem

        private CharsRef newStem​(char[] buffer,
                                 int length,
                                 IntsRef forms,
                                 int formID)
      • stem

        private java.util.List<CharsRef> stem​(char[] word,
                                              int length,
                                              int previous,
                                              int prevFlag,
                                              int prefixFlag,
                                              int recursionDepth,
                                              boolean doPrefix,
                                              boolean doSuffix,
                                              boolean previousWasPrefix,
                                              boolean circumfix,
                                              boolean caseVariant)
                                       throws java.io.IOException
        Generates a list of stems for the provided word
        Parameters:
        word - Word to generate the stems for
        length - valid length of the word
        previous - previous affix that was removed (so we don't remove the same one twice)
        prevFlag - Flag from a previous stemming step that needs to be cross-checked with any affixes in this recursive step
        prefixFlag - flag of the innermost removed prefix, so that when removing a suffix, it is also checked against the word
        recursionDepth - current recursion depth
        doPrefix - true if we should remove prefixes
        doSuffix - true if we should remove suffixes
        previousWasPrefix - true if the previous removal was a prefix. If we are now removing a suffix and it has no continuation requirements, that is OK, but two prefixes (COMPLEXPREFIXES) or two suffixes must have continuation requirements to recurse.
        circumfix - true if the previous prefix removal was marked as a circumfix, which means the innermost suffix must also contain the circumfix flag.
        caseVariant - true if we are searching for a case variant. If the word has the KEEPCASE flag, it cannot succeed.
        Returns:
        List of stems, or empty list if no stems are found
        Throws:
        java.io.IOException
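The recursive shape of this method can be illustrated with a heavily simplified sketch. It only strips suffixes and deliberately omits prefixes, flags, cross-checks, continuation requirements, circumfixes and case variants. The word list, the suffix rules, the depth limit and all names (SuffixStripSketch, WORDS, SUFFIXES, MAX_DEPTH) are hypothetical toy data, not real Hunspell dictionary content.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SuffixStripSketch {
    // Toy dictionary entries (assumed).
    static final Set<String> WORDS = new HashSet<>(Arrays.asList("walk", "happy"));
    // Toy rules: {suffix to remove, strip text to add back} (assumed).
    static final String[][] SUFFIXES = {
        {"s", ""}, {"ing", ""}, {"ness", ""}, {"iness", "y"}
    };
    // Assumed cap on how many suffixes may be removed in a row.
    static final int MAX_DEPTH = 2;

    static List<String> stem(String word, int recursionDepth) {
        List<String> stems = new ArrayList<>();
        if (WORDS.contains(word)) {
            stems.add(word); // the current form is itself a dictionary word
        }
        if (recursionDepth >= MAX_DEPTH) {
            return stems; // do not strip any further
        }
        for (String[] rule : SUFFIXES) {
            String suffix = rule[0];
            String strip = rule[1];
            if (word.length() > suffix.length() && word.endsWith(suffix)) {
                // Remove the suffix, restore the stripped characters, then recurse.
                String candidate =
                    word.substring(0, word.length() - suffix.length()) + strip;
                stems.addAll(stem(candidate, recursionDepth + 1));
            }
        }
        return stems; // empty if no stems are found
    }

    public static void main(String[] args) {
        System.out.println(stem("walkings", 0));
        System.out.println(stem("happiness", 0));
    }
}
```

In this toy setup "walkings" needs two removals ("s", then "ing") to reach "walk", while "happiness" reaches "happy" in one step because the rule restores the stripped "y".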
      • checkCondition

        private boolean checkCondition​(int condition,
                                       char[] c1,
                                       int c1off,
                                       int c1len,
                                       char[] c2,
                                       int c2off,
                                       int c2len)
        Checks whether the concatenation of two strings satisfies the condition
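The idea of testing a condition against two character ranges treated as one string can be sketched as below. This is an assumption-laden illustration: the documented method takes an int condition, whereas the sketch takes a java.util.regex.Pattern directly as a stand-in, and the class name ConditionSketch is hypothetical.

```java
import java.util.regex.Pattern;

final class ConditionSketch {
    // Returns true if the concatenation c1[c1off, c1off+c1len) + c2[c2off, c2off+c2len)
    // matches the condition as a whole.
    static boolean checkCondition(Pattern condition,
                                  char[] c1, int c1off, int c1len,
                                  char[] c2, int c2off, int c2len) {
        StringBuilder joined = new StringBuilder(c1len + c2len);
        joined.append(c1, c1off, c1len).append(c2, c2off, c2len);
        return condition.matcher(joined).matches();
    }

    public static void main(String[] args) {
        // Assumed example condition: the word must end in a non-vowel followed by "y".
        Pattern endsInConsonantY = Pattern.compile(".*[^aeiou]y");
        char[] y = {'y'};
        System.out.println(checkCondition(endsInConsonantY, "happ".toCharArray(), 0, 4, y, 0, 1));
        System.out.println(checkCondition(endsInConsonantY, "pla".toCharArray(), 0, 3, y, 0, 1));
    }
}
```

Passing the two ranges separately lets a caller test a stem plus restored strip characters without first building the combined word. The sketch builds it anyway for simplicity.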
      • applyAffix

        java.util.List<CharsRef> applyAffix​(char[] strippedWord,
                                            int length,
                                            int affix,
                                            int prefixFlag,
                                            int recursionDepth,
                                            boolean prefix,
                                            boolean circumfix,
                                            boolean caseVariant)
                                     throws java.io.IOException
        Applies the affix rule to the given word, producing a list of stems if any are found
        Parameters:
        strippedWord - Word from which the affix has been removed and to which the strip has been added
        length - valid length of stripped word
        affix - HunspellAffix representing the affix rule itself
        prefixFlag - when a prefix has already been stripped, we can't simply recurse and check the suffix unless both are compatible, so the dictionary form must be checked against both before it is added as a stem
        recursionDepth - current recursion depth
        prefix - true if we are removing a prefix (false if it's a suffix)
        Returns:
        List of stems for the word, or an empty list if none are found
        Throws:
        java.io.IOException
      • hasCrossCheckedFlag

        private boolean hasCrossCheckedFlag​(char flag,
                                            char[] flags,
                                            boolean matchEmpty)
        Checks if the given flag cross checks with the given array of flags
        Parameters:
        flag - Flag to cross check with the array of flags
        flags - Array of flags to cross check against. Can be null
        Returns:
        true if the flag is found in the array or the array is null, false otherwise
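A minimal sketch of such a cross-check follows. The matchEmpty parameter is not described above, so the sketch assumes it decides whether a null or empty flag array counts as a match. The class name CrossCheckSketch is hypothetical, and a linear scan is used so that no ordering of the flags is assumed.

```java
final class CrossCheckSketch {
    static boolean hasCrossCheckedFlag(char flag, char[] flags, boolean matchEmpty) {
        // Assumption: with no flags to check against, matchEmpty decides the result.
        if (flags == null || flags.length == 0) {
            return matchEmpty;
        }
        for (char f : flags) {
            if (f == flag) {
                return true; // the flag is present in the array
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasCrossCheckedFlag('A', new char[] {'A', 'B'}, false));
        System.out.println(hasCrossCheckedFlag('A', null, true));
    }
}
```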