Advance Options allow you to fine-tune your classifier using parameters such as Max Features, N-Gram Range, and Default Filter Stopwords.
Features are the relevant words or n-grams selected from the training text for classification. Max Features is the maximum number of features the classifier selects for classifying text. By default, Max Features is set to 10000; adjust it as required.
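To illustrate what a feature cap does, here is a minimal sketch that keeps only the most frequent terms. The function name `select_top_features` is hypothetical, and ranking by raw frequency is an assumption; the classifier may rank features by a different criterion.

```python
from collections import Counter

def select_top_features(documents, max_features=10000):
    """Return up to max_features terms, most frequent first.

    Assumption: features are ranked by raw term frequency.
    The product's actual ranking criterion may differ.
    """
    counts = Counter()
    for doc in documents:
        # Simple whitespace tokenization, lowercased for consistency.
        counts.update(doc.lower().split())
    return [term for term, _ in counts.most_common(max_features)]

docs = ["spam spam spam offer", "spam offer today", "meeting today"]
top_one = select_top_features(docs, max_features=1)
top_three = select_top_features(docs, max_features=3)
```

A lower cap keeps only the strongest signals and reduces model size; a higher cap keeps rarer terms at the cost of more noise.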
An n-gram is a contiguous sequence of n words. The classifier uses this setting to build n-gram feature vectors for classification. By default, the N-Gram Range is set to “Unigram, Bigram”, i.e., both unigrams and bigrams are considered for classification.
The available options are:
- Unigrams (n=1, single word)
- Bigrams (n=2, sequence of two words)
- Trigrams (n=3, sequence of three words)
- Fourgrams (n=4, sequence of four words)
- Unigrams, Bigrams
- Unigrams, Bigrams, Trigrams
- Unigrams, Bigrams, Trigrams, Fourgrams
- Bigrams, Trigrams
- Bigrams, Trigrams, Fourgrams
- Trigrams, Fourgrams
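The n-gram ranges listed above can be sketched as a function that emits every n-gram between a minimum and maximum length. The name `ngrams` and its parameters are illustrative only and are not part of the product.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of length n_min..n_max as space-joined strings."""
    result = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the sequence.
        for i in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[i:i + n]))
    return result

tokens = "the quick brown fox".split()

# Default range "Unigram, Bigram" corresponds to n_min=1, n_max=2.
unigrams_bigrams = ngrams(tokens, 1, 2)

# "Trigrams, Fourgrams" corresponds to n_min=3, n_max=4.
trigrams_fourgrams = ngrams(tokens, 3, 4)
```

For the four-word sample, the default range yields four unigrams and three bigrams, while the "Trigrams, Fourgrams" range yields two trigrams and one fourgram.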
Default Filter Stopwords
Stopwords are common, uninformative words that are unlikely to help in classifying text. By default, this setting is enabled and the system-provided stopword list is used for filtering. To use custom stopwords, disable “Default Filter Stopwords”.
The stopword list is a comma-separated list of words based on the language selected for the classifier. To add custom stopwords to the list, disable the “Default Filter Stopwords” setting.
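The interaction between the default list and a custom comma-separated list can be sketched as follows. The set `DEFAULT_STOPWORDS` is a tiny hypothetical stand-in for the system-provided list, and the function names are illustrative assumptions.

```python
# Hypothetical stand-in for the system-provided stopword list.
DEFAULT_STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def parse_stopwords(csv_text):
    """Parse a comma-separated string into a set of lowercase stopwords."""
    return {w.strip().lower() for w in csv_text.split(",") if w.strip()}

def filter_stopwords(tokens, use_default=True, custom_csv=""):
    """Drop stopwords, using the default list or a custom one.

    Assumption: when the default filter is disabled, only the
    custom comma-separated list is applied.
    """
    stopwords = DEFAULT_STOPWORDS if use_default else parse_stopwords(custom_csv)
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "the price of the phone is great".split()
with_default = filter_stopwords(tokens)
with_custom = filter_stopwords(tokens, use_default=False, custom_csv="the, phone")
```

With the default list, only the content words remain; with the custom list "the, phone", domain-specific words can be filtered while function words such as "of" and "is" are kept.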
Weight normalization is useful when the training data is unevenly distributed across categories, because it keeps larger categories from dominating the classifier. Disable it if the imbalance in the data should be preserved. By default, weight normalization is enabled.
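One common way to compensate for unbalanced categories is to weight each category inversely to its frequency. The sketch below shows that idea; it is an assumption for illustration, and the classifier's actual normalization formula may differ.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes below 1.
    This is a common heuristic, not necessarily the product's formula.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

labels = ["spam"] * 2 + ["ham"] * 8
weights = balanced_class_weights(labels)
```

Here the rare category "spam" receives weight 2.5 and the frequent category "ham" receives 0.625, so each category contributes equally in total.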
Stemming reduces words to their root form, which helps in generalizing feature patterns. Disable stemming if features should not be generalized. By default, stemming is enabled.
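To show how stemming generalizes features, here is a deliberately naive suffix-stripping sketch. It is a toy written for illustration, not the stemmer the classifier uses; production stemmers such as Porter or Snowball apply far more careful rules.

```python
# Suffixes are checked in order, so longer ones come first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip one common English suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ["connect", "connects", "connected", "connecting"]
stems = [naive_stem(w) for w in words]
```

All four variants collapse to the single feature "connect", so the classifier learns one pattern instead of four.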