Class JapaneseTokenizerFactory
java.lang.Object
  org.apache.lucene.analysis.util.AbstractAnalysisFactory
    org.apache.lucene.analysis.util.TokenizerFactory
      org.apache.lucene.analysis.ja.JapaneseTokenizerFactory

All Implemented Interfaces: ResourceLoaderAware
public class JapaneseTokenizerFactory extends TokenizerFactory implements ResourceLoaderAware
Factory for JapaneseTokenizer.

<fieldType name="text_ja" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory"
               mode="NORMAL"
               userDictionary="user.txt"
               userDictionaryEncoding="UTF-8"
               discardPunctuation="true"
               discardCompoundToken="false"
    />
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
  </analyzer>
</fieldType>
The additional expert parameters nBestCost and nBestExamples can be used to include searchable tokens beyond those most likely according to the statistical model. A typical use case is to improve recall and make segmentation more resilient to mistakes. The feature can also be used to get a decompounding effect.
The nBestCost parameter specifies an additional Viterbi cost, and when used, JapaneseTokenizer will include all tokens in Viterbi paths that are within the nBestCost value of the best path.
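The selection rule above can be illustrated with a toy model. This is a minimal sketch, not Lucene's lattice implementation: the `NBestCostSketch` class, its `Path` type, and the path costs are all hypothetical and chosen only to show how a cost margin admits tokens from near-best paths.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy illustration of an n-best cost margin; names and costs are hypothetical. */
class NBestCostSketch {

    /** A candidate segmentation together with its total path cost. */
    static final class Path {
        final int cost;
        final List<String> tokens;

        Path(int cost, List<String> tokens) {
            this.cost = cost;
            this.tokens = tokens;
        }
    }

    /** Collects the tokens of every path whose cost is within nBestCost of the cheapest path. */
    static Set<String> tokensWithin(List<Path> paths, int nBestCost) {
        int best = Integer.MAX_VALUE;
        for (Path p : paths) {
            best = Math.min(best, p.cost);
        }
        Set<String> result = new LinkedHashSet<>();
        for (Path p : paths) {
            // Widen to long so the subtraction cannot overflow.
            if ((long) p.cost - best <= nBestCost) {
                result.addAll(p.tokens);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented costs: the single-token path is assumed to be the best one.
        List<Path> paths = List.of(
            new Path(1000, List.of("成田空港")),
            new Path(1500, List.of("成田", "空港")));

        System.out.println(tokensWithin(paths, 0));   // only the best path's tokens
        System.out.println(tokensWithin(paths, 500)); // also the tokens of the second path
    }
}
```

With a margin of 0 only the best path contributes tokens; raising the margin to cover the cost gap also admits the shorter tokens, which is the decompounding effect described above.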
Finding a good value for nBestCost can be difficult to do by hand. The nBestExamples parameter can be used to find an nBestCost value based on examples with desired segmentation outcomes.
For example, a value of /箱根山-箱根/成田空港-成田/ indicates that for the texts 箱根山 (Mt. Hakone) and 成田空港 (Narita Airport), we'd like a cost that gives us the tokens 箱根 (Hakone) and 成田 (Narita). Note that costs are estimated for each example individually, and the maximum nBestCost found across all examples is used.
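The example format can be read as a list of text-token pairs separated by slashes. The sketch below shows one way to parse such a string and take the maximum cost across examples. It is an illustration under assumptions, not the tokenizer's own code: `NBestExamplesSketch`, `maxCostFromExamples`, and the caller-supplied cost function are hypothetical names, and the per-example costs in `main` are invented.

```java
import java.util.Map;
import java.util.function.ToIntBiFunction;

/** Sketch of reading an examples string such as "/text-token/text-token/". */
class NBestExamplesSketch {

    /**
     * Splits the examples string into text-token pairs and returns the largest
     * cost reported by costFor. The cost function is supplied by the caller;
     * how a real tokenizer estimates it is outside the scope of this sketch.
     */
    static int maxCostFromExamples(String examples, ToIntBiFunction<String, String> costFor) {
        int max = 0;
        for (String example : examples.split("/")) {
            if (example.isEmpty()) {
                continue; // the leading "/" produces an empty first element
            }
            String[] pair = example.split("-");
            if (pair.length != 2) {
                throw new IllegalArgumentException("Expected text-token, got: " + example);
            }
            max = Math.max(max, costFor.applyAsInt(pair[0], pair[1]));
        }
        return max;
    }

    public static void main(String[] args) {
        // Invented per-example costs, keyed by the example text.
        Map<String, Integer> invented = Map.of("箱根山", 1200, "成田空港", 3400);
        int cost = maxCostFromExamples(
            "/箱根山-箱根/成田空港-成田/",
            (text, token) -> invented.get(text));
        System.out.println(cost); // the larger of the two invented costs
    }
}
```

Taking the maximum means the resulting cost is large enough to satisfy every example, at the price of admitting more tokens for the easier ones.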
If both nBestCost and nBestExamples are used in a configuration, the larger of the two resulting values is used.
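The combination rule amounts to taking a maximum. A minimal sketch, assuming a hypothetical helper `EffectiveNBestCost.resolve` and assuming that an unconfigured nBestExamples parameter is represented as null:

```java
/** Sketch of combining the two parameters; the class and method names are hypothetical. */
class EffectiveNBestCost {

    /**
     * @param nBestCost        value of the nBestCost parameter
     * @param costFromExamples cost derived from nBestExamples, or null if that
     *                         parameter is not configured (an assumption of this sketch)
     * @return the cost that takes effect
     */
    static int resolve(int nBestCost, Integer costFromExamples) {
        return costFromExamples == null
            ? nBestCost
            : Math.max(nBestCost, costFromExamples);
    }

    public static void main(String[] args) {
        System.out.println(resolve(2000, 3400)); // 3400
        System.out.println(resolve(5000, 3400)); // 5000
        System.out.println(resolve(2000, null)); // 2000
    }
}
```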
Parameters nBestCost and nBestExamples work with all tokenizer modes, but it makes the most sense to use them with NORMAL mode.
- Since:
- 3.6.0
-
-
Field Summary
Fields
Modifier and Type                  Field                     Description
private static java.lang.String    DISCARD_COMPOUND_TOKEN
private static java.lang.String    DISCARD_PUNCTUATION
private boolean                    discardCompoundToken
private boolean                    discardPunctuation
private JapaneseTokenizer.Mode     mode
private static java.lang.String    MODE
static java.lang.String            NAME                      SPI name
private static java.lang.String    NBEST_COST
private static java.lang.String    NBEST_EXAMPLES
private int                        nbestCost
private java.lang.String           nbestExamples
private static java.lang.String    USER_DICT_ENCODING
private static java.lang.String    USER_DICT_PATH
private UserDictionary             userDictionary
private java.lang.String           userDictionaryEncoding
private java.lang.String           userDictionaryPath
-
Fields inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
-
-
Constructor Summary
Constructors
Constructor                                                                        Description
JapaneseTokenizerFactory(java.util.Map<java.lang.String,java.lang.String> args)    Creates a new JapaneseTokenizerFactory
-
Method Summary
Modifier and Type    Method                              Description
JapaneseTokenizer    create(AttributeFactory factory)    Creates a TokenStream of the specified input using the given AttributeFactory
void                 inform(ResourceLoader loader)       Initializes this component with the provided ResourceLoader (used for loading classes, files, etc).
-
Methods inherited from class org.apache.lucene.analysis.util.TokenizerFactory
availableTokenizers, create, findSPIName, forName, lookupClass, reloadTokenizers
-
Methods inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames
-
-
-
-
Field Detail
-
NAME
public static final java.lang.String NAME
SPI name
- See Also:
- Constant Field Values
-
MODE
private static final java.lang.String MODE
- See Also:
- Constant Field Values
-
USER_DICT_PATH
private static final java.lang.String USER_DICT_PATH
- See Also:
- Constant Field Values
-
USER_DICT_ENCODING
private static final java.lang.String USER_DICT_ENCODING
- See Also:
- Constant Field Values
-
DISCARD_PUNCTUATION
private static final java.lang.String DISCARD_PUNCTUATION
- See Also:
- Constant Field Values
-
DISCARD_COMPOUND_TOKEN
private static final java.lang.String DISCARD_COMPOUND_TOKEN
- See Also:
- Constant Field Values
-
NBEST_COST
private static final java.lang.String NBEST_COST
- See Also:
- Constant Field Values
-
NBEST_EXAMPLES
private static final java.lang.String NBEST_EXAMPLES
- See Also:
- Constant Field Values
-
userDictionary
private UserDictionary userDictionary
-
mode
private final JapaneseTokenizer.Mode mode
-
discardPunctuation
private final boolean discardPunctuation
-
discardCompoundToken
private final boolean discardCompoundToken
-
userDictionaryPath
private final java.lang.String userDictionaryPath
-
userDictionaryEncoding
private final java.lang.String userDictionaryEncoding
-
nbestExamples
private final java.lang.String nbestExamples
-
nbestCost
private int nbestCost
-
-
Method Detail
-
inform
public void inform(ResourceLoader loader) throws java.io.IOException
Description copied from interface: ResourceLoaderAware
Initializes this component with the provided ResourceLoader (used for loading classes, files, etc).
- Specified by:
inform in interface ResourceLoaderAware
- Throws:
java.io.IOException
-
create
public JapaneseTokenizer create(AttributeFactory factory)
Description copied from class: TokenizerFactory
Creates a TokenStream of the specified input using the given AttributeFactory
- Specified by:
create in class TokenizerFactory
-
-