Class CustomAnalyzer

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable

    public final class CustomAnalyzer
    extends Analyzer
    A general-purpose Analyzer that can be created with a builder-style API. Under the hood it uses the factory classes TokenizerFactory, TokenFilterFactory, and CharFilterFactory.

    You can create an instance of this Analyzer by passing the SPI names of its components (as defined by the ServiceLoader interface) to the builder:

     Analyzer ana = CustomAnalyzer.builder(Paths.get("/path/to/config/dir"))
       .withTokenizer(StandardTokenizerFactory.NAME)
       .addTokenFilter(LowerCaseFilterFactory.NAME)
       .addTokenFilter(StopFilterFactory.NAME, "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
       .build();
     
    The parameters passed to components are the same ones used by Apache Solr and are documented on the corresponding factory classes. Refer to the documentation of the subclasses of TokenizerFactory, TokenFilterFactory, and CharFilterFactory.

    The following is equivalent to the example above, with the SPI names written as string literals:

     Analyzer ana = CustomAnalyzer.builder(Paths.get("/path/to/config/dir"))
       .withTokenizer("standard")
       .addTokenFilter("lowercase")
       .addTokenFilter("stop", "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
       .build();
     

    The names available for each component type can be looked up through TokenizerFactory.availableTokenizers(), TokenFilterFactory.availableTokenFilters(), and CharFilterFactory.availableCharFilters().

    You can create conditional branches in the analyzer by using CustomAnalyzer.Builder.when(String, String...) and CustomAnalyzer.Builder.whenTerm(Predicate):

     Analyzer ana = CustomAnalyzer.builder()
        .withTokenizer("standard")
        .addTokenFilter("lowercase")
        .whenTerm(t -> t.length() > 10)
          .addTokenFilter("reversestring")
        .endwhen()
        .build();
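
    To see what the conditional branch does to individual terms, the token stream can be consumed directly. The following is a minimal sketch, assuming the Lucene core and analysis-common modules are on the classpath; the class name ConditionalAnalyzerSketch, the helper methods, and the field name "body" are illustrative and not part of the API:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Illustrative helper class; not part of Lucene.
class ConditionalAnalyzerSketch {

  // Builds the conditional analyzer from the example above.
  static Analyzer buildAnalyzer() throws IOException {
    return CustomAnalyzer.builder()
        .withTokenizer("standard")
        .addTokenFilter("lowercase")
        .whenTerm(t -> t.length() > 10)   // only terms longer than 10 chars enter the branch
          .addTokenFilter("reversestring")
        .endwhen()
        .build();
  }

  // Runs the analyzer over the text and collects the emitted terms.
  static List<String> analyze(Analyzer analyzer, String field, String text) throws IOException {
    List<String> terms = new ArrayList<>();
    try (TokenStream ts = analyzer.tokenStream(field, text)) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();                         // required before the first incrementToken()
      while (ts.incrementToken()) {
        terms.add(term.toString());
      }
      ts.end();
    }
    return terms;
  }

  public static void main(String[] args) throws IOException {
    try (Analyzer ana = buildAnalyzer()) {
      System.out.println(analyze(ana, "body", "Short Internationalization"));
    }
  }
}
```

    Short terms should pass through the branch untouched (apart from lowercasing), while terms longer than ten characters are reversed.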
     
    Since:
    5.0.0
    • Field Detail

      • posIncGap

        private final java.lang.Integer posIncGap
      • offsetGap

        private final java.lang.Integer offsetGap
    • Method Detail

      • builder

        public static CustomAnalyzer.Builder builder()
        Returns a builder for custom analyzers that loads all resources from Lucene's classloader. All resource path names must be absolute, including the package prefix.
      • builder

        public static CustomAnalyzer.Builder builder(java.nio.file.Path configDir)
        Returns a builder for custom analyzers that loads all resources from the given file system base directory. Place resources such as stop word files there. Files that are not found in the given directory are loaded from Lucene's classloader.
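
        The lookup order described above (base directory first, classloader second) can be pictured with plain JDK calls. This is a simplified sketch of the idea only, not Lucene's resource loader; ConfigDirLookupSketch, open and demo are hypothetical names, and returning null for a missing resource is an assumption made for the illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only; models "directory first, classloader as fallback".
class ConfigDirLookupSketch {

  // Opens `name` relative to baseDir if a regular file exists there;
  // otherwise asks the class loader. Returns null if neither has it.
  static InputStream open(Path baseDir, String name, ClassLoader fallback) {
    Path candidate = baseDir.resolve(name);
    try {
      if (Files.isRegularFile(candidate)) {
        return Files.newInputStream(candidate);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return fallback.getResourceAsStream(name);
  }

  // Writes a small stop word file into a temporary directory and reads it back.
  static String demo() {
    try {
      Path dir = Files.createTempDirectory("analyzer-config");
      Files.write(dir.resolve("stopwords.txt"), "the\nand\n".getBytes(StandardCharsets.UTF_8));
      try (InputStream in = open(dir, "stopwords.txt", ConfigDirLookupSketch.class.getClassLoader())) {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.print(demo());
  }
}
```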
      • initReader

        protected java.io.Reader initReader(java.lang.String fieldName,
                                            java.io.Reader reader)
        Description copied from class: Analyzer
        Override this if you want to add a CharFilter chain.

        The default implementation returns reader unchanged.

        Overrides:
        initReader in class Analyzer
        Parameters:
        fieldName - IndexableField name being indexed
        reader - original Reader
        Returns:
        reader, optionally decorated with CharFilter(s)
      • getPositionIncrementGap

        public int getPositionIncrementGap(java.lang.String fieldName)
        Description copied from class: Analyzer
        Invoked before indexing an IndexableField instance if terms have already been added to that field. This allows custom analyzers to place an automatic position increment gap between IndexableField instances using the same field name. The default position increment gap is 0. With a position increment gap of 0 and the typical default token position increment of 1, all terms in a field, including across IndexableField instances, are in successive positions, allowing, for instance, exact PhraseQuery matches across IndexableField instance boundaries.
        Overrides:
        getPositionIncrementGap in class Analyzer
        Parameters:
        fieldName - IndexableField name being indexed.
        Returns:
        position increment gap, added to the next token emitted from Analyzer.tokenStream(String,Reader). This value must be >= 0.
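
        The effect of the gap on term positions can be worked through with a small model. The sketch below is a simplified illustration in plain Java, not Lucene's indexing code; PositionGapSketch and positions are hypothetical names, and every token is assumed to have a position increment of 1:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of term positions in a multi-valued field; not Lucene code.
class PositionGapSketch {

  // Returns the position of every token across all values of one field.
  // Each token advances the position by 1; the first token of a value that
  // follows earlier terms additionally advances by `gap`.
  static List<Integer> positions(List<List<String>> values, int gap) {
    List<Integer> out = new ArrayList<>();
    int position = -1;              // so that the first token lands on position 0
    boolean fieldHasTerms = false;
    for (List<String> tokens : values) {
      boolean firstTokenOfValue = true;
      for (int i = 0; i < tokens.size(); i++) {
        position += 1;
        if (firstTokenOfValue && fieldHasTerms) {
          position += gap;          // gap is added to the next token emitted
        }
        firstTokenOfValue = false;
        out.add(position);
      }
      if (!tokens.isEmpty()) {
        fieldHasTerms = true;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<List<String>> values = List.of(List.of("quick", "brown"), List.of("fox"));
    System.out.println(positions(values, 0));   // prints [0, 1, 2]
    System.out.println(positions(values, 100)); // prints [0, 1, 102]
  }
}
```

        With a gap of 0, "brown" and "fox" sit on adjacent positions even though they come from different field instances, so an exact phrase match can span the boundary. With a gap of 100 they no longer do.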
      • getCharFilterFactories

        public java.util.List<CharFilterFactory> getCharFilterFactories()
        Returns the list of char filters that are used in this analyzer.
      • getTokenizerFactory

        public TokenizerFactory getTokenizerFactory()
        Returns the tokenizer that is used in this analyzer.
      • getTokenFilterFactories

        public java.util.List<TokenFilterFactory> getTokenFilterFactories()
        Returns the list of token filters that are used in this analyzer.
      • toString

        public java.lang.String toString()
        Overrides:
        toString in class java.lang.Object