### Bag of Words

• Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length.

• scikit-learn provides utilities for the most common ways to extract numerical features from text content:

• tokenizing strings and assigning an integer id to each possible token, for instance by using whitespace and punctuation as token separators.

• counting the occurrences of tokens in each document.

• normalizing and weighting (with diminishing importance) tokens that occur in the majority of samples / documents.

• Features and samples are defined as:

• each individual token occurrence frequency (normalized or not) is treated as a feature.

• the vector of all the token frequencies for a given document is considered a multivariate sample.

• A corpus can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

• Vectorization is the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position of the words in the document.

### Sparsity

• Most documents use only a very small subset of the words used in the corpus, so the resulting matrix will have many zero-valued entries (typically more than 99% of them).

• Implementations typically use a sparse representation from scipy.sparse for storage.
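• As a rough illustration of the storage format, the sketch below vectorizes a hypothetical three-document corpus (not from the original notes) and measures how many entries are actually stored. A toy corpus is far denser than a real one, so the density here is only indicative.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; real corpora are far sparser than this.
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "stock markets fell sharply today",
]

X = CountVectorizer().fit_transform(docs)

# Fraction of matrix entries that are stored (non-zero).
density = X.nnz / (X.shape[0] * X.shape[1])

print(issparse(X))            # True: the result is a scipy.sparse matrix
print(X.shape, X.nnz)         # (3, 13) 13
print(f"density = {density:.2f}")
```

• Each document only touches its own handful of columns, so the share of stored entries shrinks as the vocabulary grows.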

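### Count Vectorizer

• CountVectorizer implements both tokenization and occurrence counting in a single class. It can be fitted directly on a corpus of text documents, even a tiny one.
• The default configuration tokenizes each string by extracting words of at least 2 letters. The function that performs this step can be requested explicitly with build_analyzer().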
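• A minimal sketch of both steps, assuming the four-document corpus used in the scikit-learn user guide (the corpus itself is not reproduced in these notes):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed example corpus (taken from the scikit-learn user guide).
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: one row per document

# The analyzer is the callable that preprocesses and tokenizes one document.
analyze = vectorizer.build_analyzer()
tokens = analyze("This is a text document to analyze.")

print(X.shape)   # (4, 9)
print(tokens)    # ['this', 'is', 'text', 'document', 'to', 'analyze']
```

• The single-letter word "a" is dropped by the default token pattern, and all tokens are lowercased.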

• Each term found during the fit is assigned a unique integer index corresponding to a column in the resulting matrix.

• The map from feature name to column index is stored in vocabulary_.
• Words not seen in the training corpus will be ignored in future calls to transform.
• In the example corpus from the scikit-learn documentation, the first and last documents ("This is the first document." and "Is this the first document?") contain exactly the same words, so they are encoded as equal vectors. In particular, we lose the information that the last document is a question. To preserve some of the local ordering information, we can extract 2-grams of words in addition to the 1-grams (individual words).
• The resulting vocabulary is much bigger, and it can now resolve ambiguities encoded in local positioning patterns.
• For example, the interrogative form "is this" is present only in the last document.
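• The points about vocabulary_, unseen words and 2-grams can be sketched as follows, again assuming the four-document corpus from the scikit-learn user guide:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed example corpus (taken from the scikit-learn user guide).
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

unigram = CountVectorizer().fit(corpus)

# vocabulary_ maps each term to its column index.
col = unigram.vocabulary_["document"]

# Words never seen during fit are silently ignored by transform.
unseen = unigram.transform(["Something completely new."]).toarray()

# With unigrams only, the first and last documents are indistinguishable.
X1 = unigram.transform(corpus).toarray()
same = bool((X1[0] == X1[3]).all())

# Adding 2-grams keeps some local word order.
bigram = CountVectorizer(ngram_range=(1, 2)).fit(corpus)
X2 = bigram.transform(corpus).toarray()
is_this = X2[:, bigram.vocabulary_["is this"]].tolist()

print(col)                                                 # 1
print(unseen.sum())                                        # 0
print(same)                                                # True
print(len(unigram.vocabulary_), len(bigram.vocabulary_))   # 9 21
print(is_this)                                             # [0, 0, 0, 1]
```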

### Stop Words

• Stop words (“and”, “the”, “him”, etc.) are presumed to be uninformative about the content of a text and may be removed to avoid them being construed as signal for prediction. Sometimes, however, similar words are useful for prediction, such as in classifying writing style or personality.

• There are several known issues in scikit-learn's default ‘english’ stop word list. It does not aim to be a general, ‘one-size-fits-all’ solution, as some tasks may require a more custom solution. See [NQY18] for more details.

• Take care in choosing a stop word list. Popular stop word lists may include words that are highly informative for some tasks, such as “computer”.

• Ensure the stop word list has undergone the same preprocessing and tokenization as is used in the vectorizer. The word we’ve is split into we and ve by CountVectorizer’s default tokenizer, so if we’ve is in stop_words but ve is not, ve will be retained from we’ve in the transformed text. The vectorizers will try to identify and warn about some kinds of inconsistencies.
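• The we’ve / ve inconsistency can be reproduced with a small sketch. The stop word list and the sentence below are hypothetical, chosen only to trigger the effect; scikit-learn is also expected to emit a warning about the inconsistent list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stop word list: contains "we've" but not its fragment "ve".
stop_words = ["we", "we've", "it"]

vectorizer = CountVectorizer(stop_words=stop_words)
vectorizer.fit(["We've seen it before"])

# The default tokenizer splits "we've" into "we" and "ve";
# only "we" matches a stop word, so "ve" leaks into the vocabulary.
kept = sorted(vectorizer.vocabulary_)
print(kept)   # ['before', 'seen', 've']
```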

### Tf-Idf Transformer and Vectorizer

• In a large text corpus, some words (e.g. “the”, “a”, “is” in English) are very frequent yet carry little meaningful information about the actual contents of a document. If the raw counts were fed directly to a classifier, these very frequent terms would overshadow the frequencies of rarer yet more interesting terms.

• To re-weight the count features into floating point values suitable for use by a classifier, it is very common to apply the tf–idf transform.

• $\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t)$

• TfidfTransformer default settings: TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

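• Inverse document frequency (IDF): $\text{idf}(t) = \log{\frac{1 + n}{1 + \text{df}(t)}} + 1$, where $n$ is the total number of documents in the corpus and $\text{df}(t)$ is the number of documents in the corpus that contain the term $t$.

• The resulting tf-idf vectors are then Euclidean-normalized: $v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}$

• The default smooth_idf=True adds 1 to both the numerator and the denominator inside the logarithm, as if one extra document containing every term had been seen, which prevents zero divisions. With smooth_idf=False, the transformer and vectorizer instead use the unsmoothed form, in which the constant 1 is added to the idf rather than to its denominator: $\text{idf}(t) = \log{\frac{n}{\text{df}(t)}} + 1$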

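### Example

• Consider term counts for three terms in which the 1st term is present in 100% of the documents, and is therefore not very interesting.
• The 2nd and 3rd terms are present in fewer than 50% of the documents, so they are probably more representative of the content.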
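• A worked sketch of the formulas above, assuming the count matrix from the scikit-learn user guide (six documents, three terms). It uses smooth_idf=False and recomputes the result by hand with NumPy to check each step.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Assumed counts: rows are documents, columns are terms.
# Term 0 occurs in every document; terms 1 and 2 occur in fewer than half.
counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [3, 0, 0],
    [4, 0, 0],
    [3, 2, 0],
    [3, 0, 2],
])

tfidf = TfidfTransformer(smooth_idf=False).fit_transform(counts).toarray()

# The same computation by hand.
n = counts.shape[0]                     # number of documents
df = (counts > 0).sum(axis=0)           # document frequency of each term
idf = np.log(n / df) + 1                # unsmoothed idf
raw = counts * idf                      # tf * idf
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # L2-normalize rows

print(df)                        # [6 1 2]
print(np.round(tfidf[0], 4))     # [0.8194 0.     0.5732]
print(np.allclose(tfidf, manual))  # True
```

• The first term has idf = log(6/6) + 1 = 1, the lowest possible weight, which reflects that it appears in every document.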

### Tfidf Vectorizer

• TfidfVectorizer combines CountVectorizer and TfidfTransformer in a single object.
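• A short check of that equivalence with default settings, assuming the four-document corpus from the scikit-learn user guide:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Assumed example corpus (taken from the scikit-learn user guide).
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# One step: raw text to tf-idf features.
one_step = TfidfVectorizer().fit_transform(corpus)

# Two steps: raw text to counts, then counts to tf-idf.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

equivalent = np.allclose(one_step.toarray(), two_step.toarray())
print(equivalent)   # True
```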

### Binary Occurrences

• Binary occurrence markers (enabled with the binary parameter) may perform better in some cases. Some estimators, Bernoulli Naive Bayes in particular, explicitly model discrete boolean random variables.

• Also, very short texts are likely to have noisy tf–idf values while the binary occurrence info is more stable.

• Use cross validation to find the best feature extraction parameters.
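• The effect of the binary parameter can be seen in a small sketch. The two documents are hypothetical; the only difference between the vectorizers is binary=True.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents with a repeated word.
docs = ["spam spam spam eggs", "eggs and ham"]

counts = CountVectorizer().fit_transform(docs).toarray()
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()

# Columns are sorted alphabetically: and, eggs, ham, spam.
print(counts.tolist())   # [[0, 1, 0, 3], [1, 1, 1, 0]]
print(binary.tolist())   # [[0, 1, 0, 1], [1, 1, 1, 0]]
```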

### Decoding Text files

• Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding. To work with text files in Python, their bytes must be decoded to a character set called Unicode. Common encodings are ASCII, Latin-1 (Western Europe), KOI8-R (Russian) and the universal encodings UTF-8 and UTF-16. Many others exist.

• An encoding can also be called a ‘character set’, but this term is less accurate: several encodings can exist for a single character set.

• The text feature extractors in scikit-learn know how to decode text files, but only if you tell them what encoding the files are in. The CountVectorizer takes an encoding parameter for this purpose. For modern text files, the correct encoding is probably UTF-8, which is therefore the default (encoding="utf-8").

• If the text you are loading is not actually encoded with UTF-8, however, you will get a UnicodeDecodeError. The vectorizers can be told to stay silent about decoding errors by setting the decode_error parameter to "ignore" or "replace". See the documentation for the Python function bytes.decode for more details (type help(bytes.decode) at the Python prompt).
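• A minimal sketch of the encoding and decode_error parameters. The byte string is a hypothetical Latin-1 document, chosen because the byte 0xE9 is not valid UTF-8 in this position.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Bytes encoded as Latin-1, not UTF-8 (0xE9 is "é" in Latin-1).
raw_docs = ["café noir".encode("latin-1")]

# Default settings (encoding="utf-8", decode_error="strict") fail.
try:
    CountVectorizer().fit(raw_docs)
    failed = False
except UnicodeDecodeError:
    failed = True

# Silencing the error drops the undecodable byte and damages the token.
ignored = CountVectorizer(decode_error="ignore").fit(raw_docs)

# Declaring the real encoding keeps the token intact.
correct = CountVectorizer(encoding="latin-1").fit(raw_docs)

print(failed)                        # True
print(sorted(ignored.vocabulary_))   # ['caf', 'noir']
print(sorted(correct.vocabulary_))   # ['café', 'noir']
```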

• If you are having trouble decoding text, here are some things to try:

• Find out what the actual encoding of the text is. The file might come with a header or README that tells you the encoding, or there might be some standard encoding you can assume based on where the text comes from.

• You may be able to find out what kind of encoding it is in general using the UNIX command file. The Python chardet module comes with a script called chardetect.py that will guess the specific encoding, though you cannot rely on its guess being correct.

• You could try UTF-8 and disregard the errors. You can decode byte strings with bytes.decode(errors='replace') to replace all decoding errors with a meaningless character, or set decode_error='replace' in the vectorizer. This may damage the usefulness of your features.

• Real text may come from a variety of sources that may have used different encodings, or even be sloppily decoded in a different encoding than the one it was encoded with. This is common in text retrieved from the Web. The Python package ftfy can automatically sort out some classes of decoding errors, so you could try decoding the unknown text as latin-1 and then using ftfy to fix errors.

• If the text is in a mish-mash of encodings that is simply too hard to sort out (which is the case for the 20 Newsgroups dataset), you can fall back on a simple single-byte encoding such as latin-1. Some text may display incorrectly, but at least the same sequence of bytes will always represent the same feature.

### Bag of Words Limitations

• Unigrams (aka bag of words) cannot capture phrases and multi-word expressions, effectively disregarding any word order dependence.

• Bag of words models can't account for misspellings or word derivations.

• Instead, consider building a collection of bigrams (n=2), which counts occurrences of consecutive-word pairs.

• Or, consider a collection of character n-grams, which is more resilient against misspellings and derivations.

### Example:

• Consider a corpus of two documents: ['words', 'wprds'].
• The 2nd document contains a misspelling of the word ‘words’.
• A simple BoW model would consider them very distinct documents, differing in both of the two possible features.
• A character 2-gram representation would find the documents matching in 4 out of 8 features, which may help a classifier.
• That count uses the char_wb analyzer, which creates n-grams only from characters inside word boundaries (padded with a space on each side).
• The char analyzer, by contrast, creates n-grams that span across words.
• char_wb is especially interesting for languages that use whitespace for word separation - it generates significantly less noisy features than the raw char variant.

• It can increase both predictive accuracy and convergence speed of classifiers while retaining the robustness to misspellings and word derivations.
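• The character n-gram behavior can be checked with a short sketch. The ['words', 'wprds'] corpus comes from the text above; the 'jumpy fox' string is an assumed illustration borrowed from the scikit-learn user guide.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Character 2-grams inside word boundaries for the misspelling example.
wb2 = CountVectorizer(analyzer="char_wb", ngram_range=(2, 2))
counts = wb2.fit_transform(["words", "wprds"]).toarray()
shared = int(((counts[0] > 0) & (counts[1] > 0)).sum())

print(sorted(wb2.vocabulary_))
# [' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp']
print(shared, "of", counts.shape[1])   # 4 of 8

# char_wb never crosses a word boundary; char does.
wb5 = CountVectorizer(analyzer="char_wb", ngram_range=(5, 5)).fit(["jumpy fox"])
ch5 = CountVectorizer(analyzer="char", ngram_range=(5, 5)).fit(["jumpy fox"])

print(sorted(wb5.vocabulary_))   # [' fox ', ' jump', 'jumpy', 'umpy ']
print(sorted(ch5.vocabulary_))   # ['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox']
```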

• While local position information can be preserved by extracting n-grams instead of individual words, BoW and bag of n-grams models destroy most of the inner structure of the document - hence most of the meaning.

• To address the wider task of Natural Language Understanding, the local structure of sentences and paragraphs should therefore be taken into account. Many such models are cast as “Structured output” problems, which are currently outside the scope of scikit-learn.

### The Hashing Trick

• Simple vectorization uses in-memory mapping from the string tokens to the integer feature indices (the vocabulary_). This causes several problems when dealing with large datasets:

• The larger the corpus, the larger the vocabulary - hence the memory use too.

• Fitting requires intermediate data structures of size proportional to the original dataset.

• Building word maps requires a full pass over the dataset - so it is not possible to fit text classifiers in an online manner.

• Pickling/un-pickling vectorizers with a large vocabulary_ can be very slow.

• It's not easy to split vectorization into concurrent subtasks - vocabulary_ would have to be a shared state with a fine grained synchronization barrier.

• These issues can be overcome by combining the “hashing trick” (feature hashing, implemented by FeatureHasher) with the text preprocessing and tokenization features of CountVectorizer.

• This combination is built into HashingVectorizer, a transformer class that is mostly API compatible with CountVectorizer. HashingVectorizer is stateless, meaning that you don’t have to call fit on it.

• With a deliberately small n_features on the example corpus, 16 non-zero feature tokens were extracted, fewer than the 19 non-zeros extracted by CountVectorizer on the same corpus. The discrepancy comes from hash function collisions caused by the low value of the n_features parameter.

• In a real world setting, n_features can be left to its default of 2^20 (roughly 1e6 possible features). If memory or downstream model size is an issue, use a lower value such as 2^18.

• Dimensionality does not affect training time of algorithms which operate on CSR matrices (LinearSVC(dual=True), Perceptron, SGDClassifier, PassiveAggressive). It does for algorithms that work with CSC matrices (LinearSVC(dual=False), Lasso(), etc).

• With the default n_features, the collisions on this corpus disappear, but at the cost of a much larger output space dimensionality. Of course, terms other than the 19 used here might still collide with each other.
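• The comparison can be sketched as below. It assumes the four-document corpus from the scikit-learn user guide and n_features=10 as the "low" setting; neither is stated in these notes, so the exact non-zero count for the small setting is printed rather than asserted.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

# Assumed example corpus (taken from the scikit-learn user guide).
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

exact = CountVectorizer().fit_transform(corpus)

# HashingVectorizer is stateless: transform can be called without fit.
small = HashingVectorizer(n_features=10).transform(corpus)   # assumed low value
large = HashingVectorizer().transform(corpus)                # default 2 ** 20

print(exact.shape, exact.nnz)   # (4, 9) 19
print(small.shape, small.nnz)   # fewer stored values when tokens collide
print(large.shape, large.nnz)
```

• Collisions can only merge features, never create new ones, so the hashed matrices never store more non-zeros than the exact vocabulary-based matrix.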

• HashingVectorizer comes with the following limitations:

• It is not possible to invert the model (no inverse_transform method), nor to access the original string representation of the features, because of the one-way nature of the hash function that performs the mapping.
• It does not provide IDF weighting as that would introduce statefulness in the model. A TfidfTransformer can be appended in a pipeline if required.
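• A minimal sketch of appending a TfidfTransformer in a pipeline. The corpus is assumed (scikit-learn user guide), and the choices n_features=2 ** 18, alternate_sign=False and norm=None are illustrative settings that make the hashed output behave like plain counts before IDF weighting.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Assumed example corpus (taken from the scikit-learn user guide).
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

hashing_tfidf = make_pipeline(
    # Count-like hashed features: no sign flipping, no normalization yet.
    HashingVectorizer(n_features=2 ** 18, alternate_sign=False, norm=None),
    # The stateful IDF weighting lives in this second step.
    TfidfTransformer(),
)

X = hashing_tfidf.fit_transform(corpus)

# TfidfTransformer L2-normalizes each row by default.
row_norms = np.asarray(np.sqrt(X.multiply(X).sum(axis=1))).ravel()
print(X.shape)                      # (4, 262144)
print(np.allclose(row_norms, 1.0))  # True
```

• The pipeline as a whole is no longer stateless, because the IDF weights have to be learned from the data.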

### Out-of-core Scaling with Hashing Vectorizer

• Because HashingVectorizer is stateless, it can be used for out-of-core scaling, that is, learning from data that does not fit into main memory.

• The idea is to stream data to the estimator in mini-batches. Each mini-batch is vectorized with HashingVectorizer, which guarantees that the estimator's input space always has the same dimensionality.

• The amount of memory used at any time is thus bounded by the size of a mini-batch. Although there is no limit to the amount of data ingested using this approach, learning time is usually limited by CPU runtime budget.
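• The mini-batch loop can be sketched as follows. The stream_minibatches generator and its tiny sentiment-style data are hypothetical stand-ins for a real data source, and SGDClassifier is just one estimator that supports partial_fit.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def stream_minibatches():
    """Hypothetical data source yielding (texts, labels) mini-batches.

    In practice this would read chunks from disk or a network stream.
    """
    yield ["good great fine", "bad awful poor"], [1, 0]
    yield ["great good nice", "poor bad terrible"], [1, 0]
    yield ["fine nice great", "awful terrible bad"], [1, 0]


# Stateless vectorizer: every batch maps into the same 2 ** 18 columns.
vectorizer = HashingVectorizer(n_features=2 ** 18)
clf = SGDClassifier(random_state=0)
all_classes = [0, 1]  # partial_fit needs the full label set up front

for texts, labels in stream_minibatches():
    X_batch = vectorizer.transform(texts)   # only this batch is in memory
    clf.partial_fit(X_batch, labels, classes=all_classes)

pred = clf.predict(vectorizer.transform(["good nice", "bad poor"]))
print(pred)
```

• Only one mini-batch is held in memory at a time, while the model keeps a fixed-size weight vector regardless of how much text has been streamed.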

### Custom Vectorizer Classes

• The behavior of a vectorizer can be customized by passing a callable to its constructor.

• preprocessor: a callable that takes an entire document as a single string and returns a possibly transformed version, still as a single string. This can be used to remove HTML tags, lowercase the entire document, etc.

• tokenizer: a callable that takes the output of the preprocessor and splits it into tokens, returning a list of them.

• analyzer: a callable that replaces the preprocessor and tokenizer. The default analyzers all call the preprocessor and tokenizer, but custom analyzers will skip this. N-gram extraction and stop word filtering take place at the analyzer level, so a custom analyzer may have to reproduce these steps.

• If documents are pre-tokenized by an external package, store them in files (or strings) with the tokens separated by whitespace and pass analyzer=str.split.
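• Two of these hooks are sketched below. The strip_html function and the sample strings are hypothetical; note that a custom preprocessor replaces the built-in lowercasing, so it lowercases explicitly.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer


def strip_html(doc):
    """Hypothetical preprocessor: drop HTML tags and lowercase the text."""
    return re.sub(r"<[^>]+>", " ", doc).lower()


# Custom preprocessor, default tokenizer.
html_vectorizer = CountVectorizer(preprocessor=strip_html)
html_tokens = html_vectorizer.build_analyzer()("<p>Hello <b>World</b></p>")

# Pre-tokenized documents: tokens are separated by whitespace and
# the analyzer is replaced entirely by str.split.
pretokenized = CountVectorizer(analyzer=str.split)
pretokenized.fit(["New_York is big", "I like New_York"])
pretokenized_terms = sorted(pretokenized.vocabulary_)

print(html_tokens)          # ['hello', 'world']
print(pretokenized_terms)   # ['I', 'New_York', 'big', 'is', 'like']
```

• With analyzer=str.split nothing is lowercased and one-letter tokens such as "I" are kept, because the default preprocessing and token pattern are bypassed.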

• Token-level analysis such as stemming, lemmatizing, compound splitting, filtering based on part-of-speech, etc. is not included in scikit-learn, but can be added by customizing either the tokenizer or the analyzer. For instance, a CountVectorizer can be given a tokenizer that lemmatizes with NLTK.

• Vectorizer behavior can also be customized by subclassing and overriding factory methods such as build_tokenizer, for example to transform British spelling to American spelling.
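• A sketch of the spelling example, adapted from the scikit-learn user guide. The to_american helper and its regular expressions are a rough illustration that covers only a few spelling patterns, not a complete converter.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer


def to_american(tokens):
    """Rewrite a few common British spellings, token by token."""
    for t in tokens:
        t = re.sub(r"(...)our$", r"\1or", t)                # colour -> color
        t = re.sub(r"([bt])re$", r"\1er", t)                # centre -> center
        t = re.sub(r"([iy])s(e$|ing|ation)", r"\1z\2", t)   # analyse -> analyze
        t = re.sub(r"ogue$", "og", t)                       # dialogue -> dialog
        yield t


class CustomVectorizer(CountVectorizer):
    def build_tokenizer(self):
        # Wrap the default tokenizer so n-grams and stop words still apply.
        tokenize = super().build_tokenizer()
        return lambda doc: list(to_american(tokenize(doc)))


analyze = CustomVectorizer().build_analyzer()
print(analyze("color colour"))     # ['color', 'color']
print(analyze("centre analyse"))   # ['center', 'analyze']
```

• Overriding build_tokenizer rather than passing a custom analyzer keeps the rest of the default pipeline (lowercasing, n-gram extraction, stop word filtering) intact.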