DictVectorizer converts feature arrays (lists of standard Python dict
objects) to the NumPy/SciPy representation used by scikit-learn estimators.
It implements one-of-K (aka "one-hot") coding for categorical features. Categorical features are unordered attribute:value pairs.
>>> measurements = [
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Francisco', 'temperature': 18.},
... ]

>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()

>>> vec.fit_transform(measurements).toarray()
array([[ 1.,  0.,  0., 33.],
       [ 0.,  1.,  0., 12.],
       [ 0.,  0.,  1., 18.]])

>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']
>>> movie_entry = [{'category': ['thriller', 'drama'], 'year': 2003},
...                {'category': ['animation', 'family'], 'year': 2011},
...                {'year': 1974}]

>>> vec.fit_transform(movie_entry).toarray()
array([[0.000e+00, 1.000e+00, 0.000e+00, 1.000e+00, 2.003e+03],
       [1.000e+00, 0.000e+00, 1.000e+00, 0.000e+00, 2.011e+03],
       [0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 1.974e+03]])

>>> vec.get_feature_names() == ['category=animation', 'category=drama',
...                             'category=family', 'category=thriller', 'year']
True

>>> vec.transform({'category': ['thriller'],
...                'unseen_feature': '3'}).toarray()
array([[0., 0., 0., 1., 0.]])
Suppose we have an algorithm that extracts Part of Speech (PoS) tags to use for training a sequence classifier (e.g. a chunker). The following dict could be such a window of features extracted around the word ‘sat’ in the sentence ‘The cat sat on the mat.’.
Such a description can be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier. Extracting this kind of window around each individual word of a corpus of documents yields a very wide matrix (many one-hot features) with mostly zero values, which is why DictVectorizer uses a scipy.sparse matrix by default.
>>> pos_window = [
...     {
...         'word-2': 'the',
...         'pos-2': 'DT',
...         'word-1': 'cat',
...         'pos-1': 'NN',
...         'word+1': 'on',
...         'pos+1': 'PP',
...     },
...     # in a real application one would extract many such dictionaries
... ]
>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> print(pos_vectorized)
  (0, 0)    1.0
  (0, 1)    1.0
  (0, 2)    1.0
  (0, 3)    1.0
  (0, 4)    1.0
  (0, 5)    1.0
>>> pos_vectorized.toarray()
array([[1., 1., 1., 1., 1., 1.]])
>>> vec.get_feature_names()
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']
FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the “hashing trick”.
Instead of building a hash table of features during training, as vectorizers do, instances of FeatureHasher apply a hash function to the features to directly determine their column index in sample matrices.
The result is increased speed and reduced memory usage, at the expense of inspectability; the hasher does not remember what the input features looked like and has no inverse_transform method.
Since the hash function can cause collisions between (unrelated) features, a signed hash function is used: the sign of the hash value determines the sign of the value stored in the output matrix for a feature.
This way, collisions are likely to cancel out rather than accumulate error, and the expected mean of any output feature's value is zero.
This mechanism is enabled by default with alternate_sign=True and is particularly useful for small hash table sizes (n_features < 10000). For large hash table sizes it can be disabled, allowing the output to be passed to estimators like MultinomialNB or chi2 feature selectors that expect non-negative inputs.
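To illustrate the idea, here is a toy sketch of signed hashing. This is not FeatureHasher's actual implementation: it uses Python's built-in hash in place of MurmurHash3, purely to show how one hash value can supply both a column index and a sign:

import numpy as np

def hashing_trick(features, n_features=16):
    """Toy signed feature hashing over (feature, value) pairs."""
    x = np.zeros(n_features)
    for feat, value in features:
        # Python's str hash is salted per process, so the column layout
        # is not stable across runs; real code needs a stable hash.
        h = hash(feat)
        index = abs(h) % n_features   # column index from the hash
        sign = 1 if h >= 0 else -1    # sign from the hash's sign bit
        x[index] += sign * value      # colliding features tend to cancel
    return x

x = hashing_trick([('feat1', 1), ('feat2', 2), ('feat3', 0.5)])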
FeatureHasher accepts mappings (like Python's dict and its variants in the collections module), (feature, value) pairs, or strings, depending on the constructor parameter input_type. Mappings are treated as lists of (feature, value) pairs, while single strings have an implicit value of 1, so ['feat1', 'feat2', 'feat3'] is interpreted as [('feat1', 1), ('feat2', 1), ('feat3', 1)].
If a single feature occurs multiple times in a sample, the associated values are summed (so ('feat', 2) and ('feat', 3.5) become ('feat', 5.5)). The output from FeatureHasher is a scipy.sparse matrix in the CSR format.
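As a sketch of the three input types (the feature names and the n_features value here are arbitrary choices):

from sklearn.feature_extraction import FeatureHasher

# input_type='dict' (the default): iterable of mappings
h = FeatureHasher(n_features=16)
X = h.transform([{'dog': 1, 'cat': 2}, {'dog': 2, 'run': 5}])

# input_type='pair': iterable of iterables of (feature, value) pairs
h = FeatureHasher(n_features=16, input_type='pair')
X = h.transform([[('dog', 1), ('cat', 2)], [('dog', 2)]])

# input_type='string': each string is an implicit (feature, 1) pair;
# 'feat2' appearing twice sums to 2 in its hashed column
h = FeatureHasher(n_features=16, input_type='string')
X = h.transform([['feat1', 'feat2', 'feat2'], ['feat3']])  # CSR matrix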
Feature hashing can be used in document classification. Unlike CountVectorizer, FeatureHasher does not do word splitting or any other preprocessing except Unicode-to-UTF-8 encoding; see below for a combined tokenizer/hasher.
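A combined tokenizer/hasher might look like the following sketch, where the regular-expression tokenizer is a deliberately simplified stand-in for real preprocessing:

import re
from sklearn.feature_extraction import FeatureHasher

def tokens(doc):
    """Extract lowercased word tokens from a document (toy tokenizer)."""
    return (tok.lower() for tok in re.findall(r"\w+", doc))

raw_documents = ["The cat sat on the mat.", "The dog barked."]

hasher = FeatureHasher(input_type='string')
X = hasher.transform(tokens(doc) for doc in raw_documents)  # CSR matrix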
FeatureHasher uses the signed 32-bit variant of MurmurHash3. As a result, the maximum number of features supported is currently $2^{31} - 1$.
The original formulation of the hashing trick used two separate hash functions $h$ and $\xi$ to determine the column index and the sign of a feature, respectively. The present implementation works under the assumption that the sign bit of MurmurHash3 is independent of its other bits.
Since a simple modulo is used to transform the hash value to a column index, it is advisable to use a power of two as the n_features parameter; otherwise the features will not be mapped evenly to the columns.
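For example (2**18 is an arbitrary illustrative size):

from sklearn.feature_extraction import FeatureHasher

# a power of two means hash(feature) % n_features keeps the low-order
# hash bits, spreading features evenly across the columns
hasher = FeatureHasher(n_features=2**18, input_type='string')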