### Feature Extraction (FE)

• Used to extract feature information from text & image datasets.
• Very different from feature selection (feature selection is a technique applied to the features that an FE method produces; FE itself turns raw data into numerical features).

### Features from Dicts

• DictVectorizer converts feature arrays (lists of Python dict objects) to the NumPy/SciPy representations scikit-learn estimators use (NumPy arrays or scipy.sparse matrices).

• Uses one-of-K (aka "one-hot") category coding. Categorical features are unordered attribute:value pairs.

• DictVectorizer accepts multiple string values for one feature (e.g., multiple categories per movie), as in the snippet below.

```python
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()
movie_entry = [{'category': ['thriller', 'drama'], 'year': 2003},
               {'category': ['animation', 'family'], 'year': 2011},
               {'year': 1974}]

vec.fit_transform(movie_entry).toarray()

vec.get_feature_names_out()  # vec.get_feature_names() on scikit-learn < 1.0
# ['category=animation', 'category=drama', 'category=family',
#  'category=thriller', 'year']

# Features unseen during fit are silently ignored:
vec.transform({'category': ['thriller'], 'unseen_feature': '3'}).toarray()
```

### DictVectorizer - NLP applications

• Suppose we have an algorithm that extracts Part of Speech (PoS) tags to use for training a sequence classifier (e.g. a chunker). The following dict could be such a window of features extracted around the word ‘sat’ in the sentence ‘The cat sat on the mat.’:
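
A sketch of such a window, following the example in the scikit-learn docs (the PoS tags are illustrative):

```python
pos_window = [
    {
        'word-2': 'the',
        'pos-2': 'DT',
        'word-1': 'cat',
        'pos-1': 'NN',
        'word+1': 'on',
        'pos+1': 'PP',
    },
    # in a real application one would extract many such dicts, one per word
]
```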

• Such a description can be vectorized into a sparse 2D matrix suitable for feeding into a classifier, as shown below.

• Extracting this info around each individual word of a corpus of documents yields a very wide matrix (many one-hot features) that is mostly zeros, so DictVectorizer uses a scipy.sparse matrix by default.
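
Continuing the sketch above, the window can be vectorized like this (expected outputs shown as comments):

```python
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()
pos_vectorized = vec.fit_transform(pos_window)

pos_vectorized            # 1x6 scipy.sparse matrix (CSR) by default
pos_vectorized.toarray()  # array([[1., 1., 1., 1., 1., 1.]])
vec.get_feature_names_out()
# ['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']
```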

### Feature Hashing

• FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the “hashing trick”.

• Instead of building a hash table of features during training, as vectorizers do, instances of FeatureHasher apply a hash function to the features to directly determine their column index in sample matrices.

• The result is increased speed and reduced memory usage, at the expense of inspectability; the hasher does not remember what the input features looked like and has no inverse_transform method.
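
A minimal illustration of that trade-off (the tiny n_features value and the feature names are made up for the example):

```python
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=10)   # number of output columns is fixed up front
X = hasher.transform([{'dog': 1, 'cat': 2, 'elephant': 4},
                      {'dog': 2, 'run': 5}])
X.toarray()
# Column positions come straight from the hash; there is no vocabulary to
# inspect and no inverse_transform back to 'dog', 'cat', ...
```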

• Since the hash function can cause collisions between (unrelated) features, a signed hash function is used. The sign determines the sign of the value stored in the output matrix for a feature.

• This means that collisions are likely to cancel out rather than accumulate error - so the expected mean of any output feature’s value is zero.

• The signed hash is enabled by default (alternate_sign=True) and is particularly useful for small hash table sizes (n_features < 10000). For large hash tables it can be disabled, allowing outputs to be passed to estimators like MultinomialNB or chi2 feature selectors that expect non-negative inputs, as sketched below.
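
A one-line sketch of turning the signed hash off (the n_features value is illustrative):

```python
from sklearn.feature_extraction import FeatureHasher

# Large hash table: collisions are rare, so the sign trick can be dropped,
# keeping every stored value non-negative for MultinomialNB or chi2.
hasher = FeatureHasher(n_features=2**18, alternate_sign=False)
```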

• FeatureHasher accepts maps (like Python’s dict and its variants in the collections module), (feature, value) pairs, or strings, depending on the constructor parameter input_type. Maps are treated as lists of (feature, value) pairs.

• Single strings have an implicit value of 1, so ['feat1', 'feat2', 'feat3'] is interpreted as [('feat1', 1), ('feat2', 1), ('feat3', 1)].

• If a single feature occurs multiple times in a sample, the feature values will be summed (so ('feat', 2) and ('feat', 3.5) become ('feat', 5.5)). The output from FeatureHasher is a scipy.sparse matrix in the CSR format.
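
A sketch of the 'pair' and 'string' input types and the summing behaviour (alternate_sign is disabled here so the stored values stay non-negative):

```python
from sklearn.feature_extraction import FeatureHasher

# Samples as (feature, value) pairs; repeated features are summed.
h = FeatureHasher(n_features=8, input_type='pair', alternate_sign=False)
h.transform([[('feat', 2), ('feat', 3.5)]]).toarray()  # one column holds 5.5

# Samples as strings; each string carries an implicit value of 1.
h = FeatureHasher(n_features=8, input_type='string', alternate_sign=False)
h.transform([['feat1', 'feat2', 'feat3']]).toarray()   # CSR before .toarray()
```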

• Feature hashing can be used in document classification. Unlike CountVectorizer, FeatureHasher does not do word splitting or any other preprocessing except Unicode-to-UTF-8 encoding. See below for a combined tokenizer/hasher.
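
A minimal tokenizer/hasher combination in the spirit of what the scikit-learn docs describe (the regex tokenizer and the sample documents are illustrative):

```python
import re
from sklearn.feature_extraction import FeatureHasher

def tokens(doc):
    """Yield lowercased word tokens from a document (naive regex split)."""
    return (tok.lower() for tok in re.findall(r"\w+", doc))

raw_documents = ["The cat sat on the mat.", "The mat sat still."]
raw_X = (tokens(doc) for doc in raw_documents)

hasher = FeatureHasher(input_type='string')  # each token contributes value 1
X = hasher.transform(raw_X)                  # repeated tokens sum to term counts
```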

### Implementation

• FeatureHasher uses the signed 32-bit variant of MurmurHash3. The maximum number of features supported is currently $2^{31}-1$.

• The original formulation of the hashing trick used two separate hash functions, $h$ and $\xi$, to determine the column index and the sign of a feature respectively. This implementation uses a single hash function and assumes that the sign bit of MurmurHash3 is independent of its other bits.

• Since a simple modulo is used to transform the hash function's output into a column index, it is advisable to use a power of two for the n_features parameter; otherwise the features will not be mapped evenly onto the columns.
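
An illustrative sketch of the column/sign computation (the real implementation is compiled Cython; murmurhash3_32 is a public scikit-learn utility, but the helper below is hypothetical):

```python
from sklearn.utils import murmurhash3_32

def hashed_column(feature, n_features=2**10):
    """Illustrative: map a feature name to (column, sign), hashing-trick style."""
    h = murmurhash3_32(feature, seed=0, positive=False)  # signed 32-bit MurmurHash3
    column = abs(h) % n_features  # simple modulo; spreads evenly when
                                  # n_features is a power of two
    sign = 1 if h >= 0 else -1    # the sign bit decides the stored value's sign
    return column, sign
```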