public final class Tokenizers extends Object
All methods return immutable objects provided the arguments are also immutable.
Modifier and Type | Method and Description |
---|---|
static Tokenizer | chain(List<Tokenizer> tokenizers) Chains tokenizers together. |
static Tokenizer | chain(Tokenizer tokenizer, Tokenizer... tokenizers) Chains tokenizers together. |
static Tokenizer | filter(Tokenizer tokenizer, com.google.common.base.Predicate<String> predicate) Constructs a new filtering tokenizer. |
static Tokenizer | pattern(Pattern pattern) Returns a tokenizer that splits a string into tokens around the pattern as if calling pattern.split(input, -1). |
static Tokenizer | pattern(String regex) Returns a tokenizer that splits a string into tokens around the pattern as if calling Pattern.compile(regex).split(input, -1). |
static Tokenizer | qGram(int q) Returns a basic q-gram tokenizer for a variable q. |
static Tokenizer | qGramWithFilter(int q) Returns a basic q-gram tokenizer for a variable q. |
static Tokenizer | qGramWithPadding(int q) Returns a basic q-gram tokenizer for a variable q. The input is padded with q-1 special characters before being tokenized. |
static Tokenizer | qGramWithPadding(int q, String padding) Returns a basic q-gram tokenizer for a variable q. The q-gram is extended beyond the length of the string with padding. |
static Tokenizer | qGramWithPadding(int q, String startPadding, String endPadding) Returns a basic q-gram tokenizer for a variable q. The q-gram is extended beyond the length of the string with padding. |
static Tokenizer | transform(Tokenizer tokenizer, com.google.common.base.Function<String,String> function) Constructs a new transforming tokenizer. |
static Tokenizer | whitespace() Returns a tokenizer that splits a string into tokens around whitespace. |
public static Tokenizer chain(List<Tokenizer> tokenizers)
Chains tokenizers together.
Parameters:
tokenizers - a non-empty list of tokenizers

public static Tokenizer chain(Tokenizer tokenizer, Tokenizer... tokenizers)
Chains tokenizers together.
Parameters:
tokenizer - the first tokenizer
tokenizers - the other tokenizers
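A chained tokenizer feeds the output of one tokenizer into the next. A minimal sketch, assuming this is the Tokenizers class from the simmetrics library (package org.simmetrics.tokenizers) and that Tokenizer exposes a tokenizeToList method:

```java
import java.util.List;

import org.simmetrics.tokenizers.Tokenizer;
import org.simmetrics.tokenizers.Tokenizers;

public class ChainExample {
    public static void main(String[] args) {
        // Split on whitespace first, then break each word into bigrams.
        Tokenizer tokenizer = Tokenizers.chain(
                Tokenizers.whitespace(),
                Tokenizers.qGram(2));

        List<String> tokens = tokenizer.tokenizeToList("the cat");
        System.out.println(tokens); // e.g. [th, he, ca, at]
    }
}
```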
public static Tokenizer filter(Tokenizer tokenizer, com.google.common.base.Predicate<String> predicate)
Constructs a new filtering tokenizer.
Parameters:
tokenizer - delegate tokenizer
predicate - for tokens to keep
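A filtering tokenizer discards every token the predicate rejects. A sketch, assuming the simmetrics package layout and that Guava's Predicate can be supplied as a lambda:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.simmetrics.tokenizers.Tokenizer;
import org.simmetrics.tokenizers.Tokenizers;

public class FilterExample {
    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "an"));

        // Keep only tokens that are not stop words.
        Tokenizer tokenizer = Tokenizers.filter(
                Tokenizers.whitespace(),
                token -> !stopWords.contains(token));

        List<String> tokens = tokenizer.tokenizeToList("the quick brown fox");
        System.out.println(tokens); // e.g. [quick, brown, fox]
    }
}
```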
public static Tokenizer pattern(Pattern pattern)
Returns a tokenizer that splits a string into tokens around the pattern as if calling pattern.split(input, -1).
Parameters:
pattern - to split the string around

public static Tokenizer pattern(String regex)
Returns a tokenizer that splits a string into tokens around the pattern as if calling Pattern.compile(regex).split(input, -1).
Parameters:
regex - to split the string around
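A pattern tokenizer splits around whatever the regular expression matches. A sketch, under the same simmetrics assumptions as above:

```java
import java.util.List;

import org.simmetrics.tokenizers.Tokenizer;
import org.simmetrics.tokenizers.Tokenizers;

public class PatternExample {
    public static void main(String[] args) {
        // Split on commas with optional surrounding whitespace.
        Tokenizer csv = Tokenizers.pattern("\\s*,\\s*");

        List<String> tokens = csv.tokenizeToList("a, b,c");
        System.out.println(tokens); // e.g. [a, b, c]
    }
}
```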
public static Tokenizer qGram(int q)
Returns a basic q-gram tokenizer for a variable q.
Parameters:
q - size of the tokens

public static Tokenizer qGramWithFilter(int q)
Returns a basic q-gram tokenizer for a variable q.
Parameters:
q - size of the tokens

public static Tokenizer qGramWithPadding(int q)
Returns a basic q-gram tokenizer for a variable q. The input is padded with q-1 special characters before being tokenized. Uses # as the default padding.
Parameters:
q - size of the tokens
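A q-gram tokenizer slides a window of q characters over the input; padding extends the window past the ends of the string. A sketch, under the same simmetrics assumptions, where the expected outputs assume # padding of q-1 characters on each side:

```java
import java.util.List;

import org.simmetrics.tokenizers.Tokenizers;

public class QGramExample {
    public static void main(String[] args) {
        // Plain bigrams of "hello".
        List<String> plain = Tokenizers.qGram(2).tokenizeToList("hello");
        System.out.println(plain);  // e.g. [he, el, ll, lo]

        // With padding the input is treated as "#hello#".
        List<String> padded = Tokenizers.qGramWithPadding(2).tokenizeToList("hello");
        System.out.println(padded); // e.g. [#h, he, el, ll, lo, o#]
    }
}
```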
public static Tokenizer qGramWithPadding(int q, String padding)
Returns a basic q-gram tokenizer for a variable q. The q-gram is extended beyond the length of the string with padding.
Parameters:
q - size of the tokens
padding - padding to pad start and end of string with

public static Tokenizer qGramWithPadding(int q, String startPadding, String endPadding)
Returns a basic q-gram tokenizer for a variable q. The q-gram is extended beyond the length of the string with padding.
Parameters:
q - size of the tokens
startPadding - padding to pad start of string with
endPadding - padding to pad end of string with
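A transforming tokenizer applies a function to every token the delegate produces, which is how normalization such as lower-casing is typically expressed. A sketch, assuming the simmetrics package layout and that Guava's Function can be supplied as a lambda:

```java
import java.util.List;

import org.simmetrics.tokenizers.Tokenizer;
import org.simmetrics.tokenizers.Tokenizers;

public class TransformExample {
    public static void main(String[] args) {
        // Lower-case every token produced by the delegate tokenizer.
        Tokenizer tokenizer = Tokenizers.transform(
                Tokenizers.whitespace(),
                token -> token.toLowerCase());

        List<String> tokens = tokenizer.tokenizeToList("The Quick FOX");
        System.out.println(tokens); // e.g. [the, quick, fox]
    }
}
```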
public static Tokenizer transform(Tokenizer tokenizer, com.google.common.base.Function<String,String> function)
Constructs a new transforming tokenizer.
Parameters:
tokenizer - delegate tokenizer
function - to transform tokens

public static Tokenizer whitespace()
Returns a tokenizer that splits a string into tokens around whitespace. To create a tokenizer that returns leading and trailing empty tokens use Tokenizers.pattern("\\s+")
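The note above suggests whitespace() suppresses leading and trailing empty tokens while pattern("\\s+") keeps them, matching split(input, -1) semantics. A sketch contrasting the two, under the same simmetrics assumptions:

```java
import java.util.List;

import org.simmetrics.tokenizers.Tokenizers;

public class WhitespaceExample {
    public static void main(String[] args) {
        // whitespace() drops the empty tokens at the ends of the input.
        List<String> trimmed = Tokenizers.whitespace().tokenizeToList(" a b ");
        System.out.println(trimmed); // e.g. [a, b]

        // pattern("\\s+") keeps them, as split(input, -1) would.
        List<String> raw = Tokenizers.pattern("\\s+").tokenizeToList(" a b ");
        System.out.println(raw);     // e.g. [, a, b, ]
    }
}
```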
Copyright © 2014–2018. All rights reserved.