Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors of fixed size rather than raw text documents of variable length.
scikit-learn provides utilities for the most common ways to extract numerical features from text content:
tokenizing
strings into integer ids for each possible token. Whitespace characters and punctuation are treated as token separators.
counting
the occurrences of tokens in each document.
normalizing
and weighting (with diminishing importance) tokens that occur in the majority of samples / documents.
Features and samples are defined as:
each individual token occurrence frequency (normalized or not) is treated as a feature.
the vector of all the token frequencies for a given document is considered a multivariate sample.
A corpus can thus be represented by a matrix with one row per document and one column per token (word) occurring in the corpus.
Vectorization is the process of turning a collection of text documents into numerical feature vectors. This strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation: documents are described by word occurrences while ignoring the relative position of the words in the document.
Most documents use only a small subset of the words used in a corpus; the resulting matrix will typically contain more than 99% zeroes.
Implementations typically use a sparse representation from scipy.sparse for storage.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',]
X = vectorizer.fit_transform(corpus)
X
<4x9 sparse matrix of type '<class 'numpy.int64'>' with 19 stored elements in Compressed Sparse Row format>
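As a quick illustration of the sparsity mentioned above, one can compare the number of stored entries with the full matrix size (a rough sketch; on this tiny toy corpus the density is still high, but on a realistic corpus the matrix would be overwhelmingly zeros):
print(X.shape)                                # (4, 9)
print(X.nnz)                                  # 19 explicitly stored (non-zero) entries
print(X.nnz / (X.shape[0] * X.shape[1]))      # density; the remaining entries are implicit zeros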
The default configuration tokenizes the string by extracting words of at least 2 letters. The function that performs this step can be requested explicitly:
Each term found during the fit is assigned a unique integer index corresponding to a column in the resulting matrix.
analyze = vectorizer.build_analyzer()
print(analyze("This is a text document to analyze.") == (
['this', 'is', 'text', 'document', 'to', 'analyze']))
vectorizer.get_feature_names() == (
['and', 'document', 'first', 'is', 'one',
'second', 'the', 'third', 'this'])
print(X.toarray())
True
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]
The converse mapping from feature name to column index is stored in the vocabulary_ attribute of the vectorizer. Hence, words that were not seen in the training corpus will be completely ignored in future calls to the transform method:
print(vectorizer.vocabulary_.get('document'))
print(vectorizer.transform(['Something completely new.']).toarray())
1
[[0 0 0 0 0 0 0 0 0]]
A vectorizer can also be configured to extract bigrams of words in addition to unigrams, so that some local ordering information is preserved:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
                                    token_pattern=r'\b\w+\b', min_df=1)
analyze = bigram_vectorizer.build_analyzer()
analyze('Bi-grams are cool!') == (
['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool'])
True
X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
print(X_2)
# the interrogative bigram 'is this' occurs only in the last document
feature_index = bigram_vectorizer.vocabulary_.get('is this')
print(X_2[:, feature_index])
[[0 0 1 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0]
 [0 0 1 0 0 1 1 0 0 2 1 1 1 0 1 0 0 0 1 1 0]
 [1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1 1 0 0 0]
 [0 0 1 1 1 1 0 1 0 0 0 0 1 1 0 0 0 0 1 0 1]]
[0 0 0 1]
Stop words (“and”, “the”, “him”, etc.) are assumed to be uninformative and may be removed to avoid mistaking them for signal. Sometimes, however, such words are useful for prediction, for example when classifying writing style or personality.
There are several known issues with scikit-learn's default ‘english’ stop word list. It does not aim to be a general, ‘one-size-fits-all’ solution, as some tasks may require a more custom solution. See [NQY18] for more details.
Please take care in choosing a stop word list. Popular stop word lists may include words that are highly informative to some tasks, such as computer.
Ensure the stop word list has undergone the same preprocessing and tokenization as used in the vectorizer. The word “we’ve” is split into “we” and “ve” by CountVectorizer’s default tokenizer, so if “we’ve” is in stop_words but “ve” is not, “ve” will be retained from “we’ve” in the transformed text. Our vectorizers will try to identify and warn about some kinds of inconsistencies.
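As a minimal sketch (reusing the toy corpus from above), stop word filtering can be enabled either with the built-in English list or with a custom list passed via the stop_words parameter; the custom list here is purely illustrative:
stop_vectorizer = CountVectorizer(stop_words='english')
stop_vectorizer.fit_transform(corpus)
print(sorted(stop_vectorizer.vocabulary_))        # common function words such as 'and', 'the', 'is', 'this' no longer appear

custom_stop_vectorizer = CountVectorizer(stop_words=['the', 'is', 'and'])  # illustrative custom list
custom_stop_vectorizer.fit_transform(corpus)
print(sorted(custom_stop_vectorizer.vocabulary_)) # only the listed words are removed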
In a large text corpus, some words (e.g. “the”, “a”, “is” in English) will convey little meaningful information. These very frequent terms would overshadow the frequencies of rarer yet more interesting terms in a classifier.
In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common to use the tf–idf transform.
$\text{tf-idf(t,d)}=\text{tf(t,d)} \times \text{idf(t)}$
TfidfTransformer default settings: TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
inverse document frequency (IDF): $\text{idf}(t) = \log{\frac{1 + n}{1+\text{df}(t)}} + 1$, where $n$ is the total number of documents in the corpus and $\text{df}(t)$ is the number of documents in the corpus containing the term $t$.
The results are Euclidean-normalized: $v_{norm} = \frac{v}{\|v\|_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}$
Setting smooth_idf=False tells the transformer and vectorizer to add the “1” count to the idf instead of the idf's denominator: $\text{idf}(t) = \log{\frac{n}{\text{df}(t)}} + 1$
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer(smooth_idf=False)
counts = [[3, 0, 1],
[2, 0, 0],
[3, 0, 0],
[4, 0, 0],
[3, 2, 0],
[3, 0, 2]]
tfidf = transformer.fit_transform(counts)
print(tfidf,"\n\n",tfidf.toarray())
  (0, 2)	0.5732079309279059
  (0, 0)	0.8194099510753754
  (1, 0)	1.0
  (2, 0)	1.0
  (3, 0)	1.0
  (4, 1)	0.8808994832762984
  (4, 0)	0.47330339145578754
  (5, 2)	0.8135516873095774
  (5, 0)	0.5814926070688599

[[0.81940995 0.         0.57320793]
 [1.         0.         0.        ]
 [1.         0.         0.        ]
 [1.         0.         0.        ]
 [0.47330339 0.88089948 0.        ]
 [0.58149261 0.         0.81355169]]
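To see where these numbers come from, here is a rough by-hand computation for the first document, using the non-smooth idf formula above (a numpy sketch of what the transformer does internally):
import numpy as np

C = np.array(counts)                       # reuse the counts defined above
n = C.shape[0]                             # 6 documents
df = (C > 0).sum(axis=0)                   # document frequency of each term: [6, 1, 2]
idf = np.log(n / df) + 1                   # non-smooth idf: [1.0, ~2.79, ~2.10]
tfidf_doc0 = C[0] * idf                    # raw tf-idf of the first document
tfidf_doc0 /= np.linalg.norm(tfidf_doc0)   # Euclidean (l2) normalization
print(tfidf_doc0)                          # ~[0.8194, 0.0, 0.5732], matching the first row above
With the default smooth_idf=True, used in the next snippet, the extra “+1” terms in the idf shift the values slightly.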
transformer = TfidfTransformer()
transformer.fit_transform(counts).toarray()
array([[0.85151335, 0.        , 0.52433293],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [0.55422893, 0.83236428, 0.        ],
       [0.63035731, 0.        , 0.77630514]])
# idf weights of each feature, learned by the fit method and stored in the idf_ attribute
transformer.idf_
array([1. , 2.25276297, 1.84729786])
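These stored weights can be reproduced from the smooth idf formula given earlier (a quick numpy check; n and df are taken from the counts above):
import numpy as np

n, df = 6, np.array([6, 1, 2])             # number of documents and per-term document frequencies
print(np.log((1 + n) / (1 + df)) + 1)      # ~[1.0, 2.2528, 1.8473], matching transformer.idf_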
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit_transform(corpus)
<4x9 sparse matrix of type '<class 'numpy.float64'>' with 19 stored elements in Compressed Sparse Row format>
Binary occurrence markers (using the binary parameter) may perform better in some cases. Some estimators, Bernoulli Naive Bayes in particular, explicitly model discrete boolean random variables.
Also, very short texts are likely to have noisy tf–idf values while the binary occurrence info is more stable.
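As a rough sketch of that idea, binary counts can be paired with a Bernoulli Naive Bayes model; the labels y below are purely illustrative placeholders:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

binary_vectorizer = CountVectorizer(binary=True)    # 1 if the token occurs in the document, 0 otherwise
X_bin = binary_vectorizer.fit_transform(corpus)
y = [0, 0, 1, 0]                                    # hypothetical labels, one per document in `corpus`
clf = BernoulliNB().fit(X_bin, y)
print(clf.predict(binary_vectorizer.transform(['Is this the first document?'])))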
Use cross validation to find the best feature extraction parameters.
from pprint import pprint
from time import time
import logging
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
categories = [
'alt.atheism',
'talk.religion.misc',
]
# Uncomment the following to do the analysis on all the categories
#categories = None
data = fetch_20newsgroups(subset='train', categories=categories)
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))
857 documents
2 categories
pipeline = Pipeline([
('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', SGDClassifier()),
])
# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
'vect__max_df': (0.5, 0.75, 1.0),
# 'vect__max_features': (None, 5000, 10000, 50000),
'vect__ngram_range': ((1, 1), (1, 2)), # unigrams or bigrams
# 'tfidf__use_idf': (True, False),
# 'tfidf__norm': ('l1', 'l2'),
'clf__max_iter': (20,),
'clf__alpha': (0.00001, 0.000001),
'clf__penalty': ('l2', 'elasticnet'),
# 'clf__max_iter': (10, 50, 80),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
pprint(parameters)
t0 = time()
grid_search.fit(data.data, data.target)
print("done in %0.3fs" % (time() - t0))
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
print("\t%s: %r" % (param_name, best_parameters[param_name]))
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__alpha': (1e-05, 1e-06),
 'clf__max_iter': (20,),
 'clf__penalty': ('l2', 'elasticnet'),
 'vect__max_df': (0.5, 0.75, 1.0),
 'vect__ngram_range': ((1, 1), (1, 2))}
Fitting 5 folds for each of 24 candidates, totalling 120 fits
done in 12.465s
Best score: 0.952
Best parameters set:
	clf__alpha: 1e-05
	clf__max_iter: 20
	clf__penalty: 'l2'
	vect__max_df: 1.0
	vect__ngram_range: (1, 2)
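With the best parameters found, one might, for example, evaluate the refitted pipeline on the held-out test split of the same categories (a short sketch, assuming the objects defined above):
test_data = fetch_20newsgroups(subset='test', categories=categories)
print("test accuracy: %0.3f"
      % grid_search.best_estimator_.score(test_data.data, test_data.target))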
Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding. To work with text files in Python, their bytes must be decoded to a character set called Unicode. Common encodings are ASCII, Latin-1 (Western Europe), KOI8-R (Russian) and the universal encodings UTF-8 and UTF-16. Many others exist.
An encoding can also be called a ‘character set’, but this term is less accurate: several encodings can exist for a single character set.
The text feature extractors in scikit-learn know how to decode text files, but only if you tell them what encoding the files are in. The CountVectorizer takes an encoding
parameter for this purpose. For modern text files, the correct encoding is probably UTF-8, which is therefore the default (encoding="utf-8").
If the text you are loading is not actually encoded with UTF-8, however, you will get a UnicodeDecodeError. The vectorizers can be told to be silent about decoding errors by setting the decode_error parameter to "ignore" or "replace". See the documentation for the Python function bytes.decode for more details (type help(bytes.decode) at the Python prompt).
If you are having trouble decoding text, here are some things to try:
Find out what the actual encoding of the text is. The file might come with a header or README that tells you the encoding, or there might be some standard encoding you can assume based on where the text comes from.
You may be able to find out what kind of encoding it is in general using the UNIX command file. The Python chardet
module comes with a script called chardetect.py
that will guess the specific encoding, though you cannot rely on its guess being correct.
You could try UTF-8 and disregard the errors. You can decode byte strings with bytes.decode(errors='replace')
to replace all decoding errors with a meaningless character, or set decode_error='replace'
in the vectorizer. This may damage the usefulness of your features.
Real text may come from a variety of sources that may have used different encodings, or even be sloppily decoded in a different encoding than the one it was encoded with. This is common in text retrieved from the Web. The Python package ftfy can automatically sort out some classes of decoding errors, so you could try decoding the unknown text as latin-1
and then using ftfy
to fix errors.
If the text is in a mish-mash of encodings that is simply too hard to sort out (which is the case for the 20 Newsgroups dataset), you can fall back on a simple single-byte encoding such as latin-1
. Some text may display incorrectly, but at least the same sequence of bytes will always represent the same feature.
import chardet
text1 = b"Sei mir gegr\xc3\xbc\xc3\x9ft mein Sauerkraut"
text2 = b"holdselig sind deine Ger\xfcche"
text3 = b"\xff\xfeA\x00u\x00f\x00 \x00F\x00l\x00\xfc\x00g\x00e\x00l\x00n\x00 \x00d\x00e\x00s\x00 \x00G\x00e\x00s\x00a\x00n\x00g\x00e\x00s\x00,\x00 \x00H\x00e\x00r\x00z\x00l\x00i\x00e\x00b\x00c\x00h\x00e\x00n\x00,\x00 \x00t\x00r\x00a\x00g\x00 \x00i\x00c\x00h\x00 \x00d\x00i\x00c\x00h\x00 \x00f\x00o\x00r\x00t\x00"
decoded = [x.decode(chardet.detect(x)['encoding'])
for x in (text1, text2, text3)]
v = CountVectorizer().fit(decoded).vocabulary_
print(v)
{'sei': 15, 'mir': 13, 'gegrüßt': 6, 'mein': 12, 'sauerkraut': 14, 'holdselig': 10, 'sind': 16, 'deine': 1, 'gerüche': 7, 'auf': 0, 'flügeln': 4, 'des': 2, 'gesanges': 8, 'herzliebchen': 9, 'trag': 17, 'ich': 11, 'dich': 3, 'fort': 5}
Unigrams (aka bag of words) cannot capture phrases and multi-word expressions, effectively disregarding any word order dependence.
Bag of words models can't account for misspellings or word derivations.
Instead, consider building a collection of bigrams (n=2), which counts occurrences of consecutive-word pairs.
Or, consider a collection of character n-grams, which is more resilient against misspellings and derivations.
ngram_vectorizer = CountVectorizer(analyzer='char_wb',
ngram_range=(2, 2))
counts = ngram_vectorizer.fit_transform(['words',
'wprds'])
print(ngram_vectorizer.get_feature_names() == (
[' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp']))
counts.toarray().astype(int)
True
array([[1, 1, 1, 0, 1, 1, 1, 0],
       [1, 1, 0, 1, 1, 1, 0, 1]])
In the above example, the char_wb analyzer is used: it creates n-grams only from characters inside word boundaries (padded with a space on each side). The char analyzer, by contrast, creates n-grams that span across words:
ngram_vectorizer = CountVectorizer(analyzer='char_wb',
ngram_range=(5, 5))
ngram_vectorizer.fit_transform(['jumpy fox'])
print(ngram_vectorizer.get_feature_names() == (
[' fox ', ' jump', 'jumpy', 'umpy ']),"\n")
ngram_vectorizer = CountVectorizer(analyzer='char',
ngram_range=(5, 5))
ngram_vectorizer.fit_transform(['jumpy fox'])
print(ngram_vectorizer.get_feature_names() == (
['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox']),"\n")
True

True
The char_wb variant is especially interesting for languages that use whitespace for word separation, as it generates significantly less noisy features than the raw char variant in that case.
It can increase both predictive accuracy and convergence speed of classifiers while retaining the robustness to misspellings and word derivations.
While local position information can be preserved by extracting n-grams instead of individual words, BoW and bag of n-grams models destroy most of the inner structure of the document - hence most of the meaning.
To address the wider task of Natural Language Understanding, the local structure of sentences and paragraphs should thus be taken into account. Many such models will thus be cast as “Structured output” problems, which are currently outside the scope of scikit-learn.
Simple vectorization uses an in-memory mapping from the string tokens to the integer feature indices (the vocabulary_ attribute). This causes several problems when dealing with large datasets:
The larger the corpus, the larger the vocabulary - hence the memory use too.
Fitting requires intermediate data structures of size proportional to the original dataset.
Building word maps requires a full pass over the dataset - so it is not possible to fit text classifiers in an online manner.
Pickling/un-pickling vectorizers with a large vocabulary_
can be very slow.
It's not easy to split vectorization into concurrent subtasks - vocabulary_
would have to be a shared state with a fine grained synchronization barrier.
It's possible to overcome these issues by combining the “hashing trick” (feature hashing, implemented by FeatureHasher) with the text preprocessing and tokenization of CountVectorizer.
This combination is built into HashingVectorizer
, a transformer class that is mostly API compatible with CountVectorizer
. HashingVectorizer
is stateless, meaning that you don’t have to call fit on it.
from sklearn.feature_extraction.text import HashingVectorizer
hv = HashingVectorizer(n_features=10)
hv.transform(corpus)
<4x10 sparse matrix of type '<class 'numpy.float64'>' with 16 stored elements in Compressed Sparse Row format>
16 non-zero feature tokens were extracted: this is less than the 19 non-zeros extracted by CountVectorizer
on the same corpus. The discrepancy comes from hash function collisions due to the low n_features
parameter value.
In a real world setting, n_features
can be left to its default of 2^20 (roughly 1e6 possible features). If memory or downstream model size is an issue, use a lower value such as 2^18.
The number of features does not affect the training time of algorithms that operate on CSR matrices (LinearSVC(dual=True), Perceptron, SGDClassifier, PassiveAggressive), but it does for algorithms that work with CSC matrices (LinearSVC(dual=False), Lasso(), etc.).
hv = HashingVectorizer()
hv.transform(corpus)
<4x1048576 sparse matrix of type '<class 'numpy.float64'>' with 19 stored elements in Compressed Sparse Row format>
We no longer get the collisions, but we need a much larger output space dimensionality. Of course, other terms than these 19 might still collide.
HashingVectorizer comes with the following limitations:
It is not possible to invert the mapping (there is no inverse_transform method), nor to access the original string representation of the features, because of the one-way nature of the hash function that performs the mapping.
It does not provide IDF weighting, as that would introduce statefulness in the model. A TfidfTransformer can be appended in a pipeline if required (see the sketch below).
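A minimal sketch of that combination, hashing the term counts and then applying IDF re-weighting fitted on the hashed vectors (alternate_sign=False keeps the hashed counts non-negative):
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

hashing_tfidf = make_pipeline(
    HashingVectorizer(n_features=2**18, alternate_sign=False),
    TfidfTransformer(),
)
X_hashed = hashing_tfidf.fit_transform(corpus)
print(X_hashed.shape)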
The stateless nature of HashingVectorizer also makes out-of-core scaling possible, i.e. learning from data that does not fit into the computer's main memory.
The idea is to stream data to the estimator in mini-batches. Each mini-batch is vectorized to guarantee the estimator's input space always has the same dimensionality.
The amount of memory used at any time is thus bounded by the size of a mini-batch. Although there is no limit to the amount of data ingested using this approach, learning time is usually limited by CPU runtime budget.
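A minimal sketch of such a mini-batch loop, assuming a hypothetical generator iter_minibatches() that yields lists of raw texts together with their labels:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer()            # stateless: safe to reuse on every mini-batch
clf = SGDClassifier()
all_classes = [0, 1]                        # all target classes must be known up front for partial_fit

for texts, labels in iter_minibatches():    # hypothetical streaming data source
    X_batch = vectorizer.transform(texts)   # same dimensionality (2**20) for every batch
    clf.partial_fit(X_batch, labels, classes=all_classes)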
Customize the behavior by passing a callable to the vectorizer constructor.
preprocessor
: a callable that takes an entire document as a single string and returns a possibly transformed version of the document, still as an entire string. This can be used to remove HTML tags, lowercase the document, etc.
tokenizer
: a callable. Takes the output from the preprocessor and returns a list of tokens.
analyzer
: a callable that replaces the preprocessor and tokenizer. The default analyzers all call the preprocessor and tokenizer, but custom analyzers will skip this. N-gram extraction and stop word filtering take place at the analyzer level, so a custom analyzer may have to reproduce these steps.
If documents are pre-tokenized by an external package, store them in files (or strings) with the tokens separated by whitespace and pass analyzer=str.split
.
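For example, a custom preprocessor might strip markup before the default tokenization runs; a sketch (the regular expression and example strings are purely illustrative):
import re

def strip_tags_and_lowercase(doc):
    # illustrative preprocessor: drop anything that looks like an HTML/XML tag, then lowercase
    return re.sub(r'<[^>]*>', ' ', doc).lower()

tag_free_vectorizer = CountVectorizer(preprocessor=strip_tags_and_lowercase)
tag_free_vectorizer.fit_transform(['<p>Hello <b>world</b></p>', 'plain text'])
print(sorted(tag_free_vectorizer.vocabulary_))   # ['hello', 'plain', 'text', 'world']
Note that supplying a custom preprocessor bypasses the default lowercasing, which is why the sketch lowercases explicitly.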
Token-level analysis such as stemming, lemmatizing, compound splitting, filtering based on part-of-speech, etc. is not included in scikit-learn but can be added by customizing either the tokenizer or the analyzer. Here’s a CountVectorizer with a tokenizer and lemmatizer using NLTK:
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

vect = CountVectorizer(tokenizer=LemmaTokenizer())
The following example, for instance, transforms some British spelling to American spelling by customizing the tokenizer:

import re

def to_british(tokens):
    for t in tokens:
        t = re.sub(r"(...)our$", r"\1or", t)
        t = re.sub(r"([bt])re$", r"\1er", t)
        t = re.sub(r"([iy])s(e$|ing|ation)", r"\1z\2", t)
        t = re.sub(r"ogue$", "og", t)
        yield t

class CustomVectorizer(CountVectorizer):
    def build_tokenizer(self):
        tokenize = super().build_tokenizer()
        return lambda doc: list(to_british(tokenize(doc)))
print(CustomVectorizer().build_analyzer()(u"color colour"))
['color', 'color']