cythonized_feature_encoder.pyx module
- class FeatureEncoder
Bases:
object
Encodes JSON encodable objects into float vectors
- __init__()
Initialize the feature encoder for this model
- Parameters:
feature_names (list) – the list of feature names. Order matters - first feature name should be the first feature in the model
string_tables (dict) – a mapping from feature names to string hash tables
model_seed (int) – model seed to be used during string encoding
- Raises:
ValueError – if feature names or tables are corrupt
- _encode()
Encodes a JSON serializable object to a float vector. The encoding rules are as follows:
None, JSON null, {}, [], and NaN are treated as missing features and ignored.
Numbers and booleans are encoded as-is.
Strings are encoded using a lookup table.
- Parameters:
obj (object) – a JSON serializable object to be encoded to a flat key-value structure
path (str) – the path to the current object
into (np.ndarray[double, ndim=1, mode='c']) – an array into which feature values will be encoded
noise_shift (double) – small bias added to the feature value
noise_scale (double) – small multiplier of the feature value
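The rules above can be illustrated with a minimal pure-Python sketch. It is not the module's implementation: the helper names `is_missing` and `flatten`, and the dot-separated path format, are assumptions made for illustration. The sketch only shows which leaves survive the "missing" filter; strings are kept as raw leaves here, whereas the encoder would pass them through a string lookup table.

```python
import math


def is_missing(value):
    """Return True for values the rules above treat as missing."""
    if value is None or value == {} or value == []:
        return True
    return isinstance(value, float) and math.isnan(value)


def flatten(obj, path="", out=None):
    """Flatten a JSON-like object into {path: leaf}, skipping missing values.

    The dot-separated path format is hypothetical, chosen for readability.
    """
    if out is None:
        out = {}
    if is_missing(obj):
        return out
    if isinstance(obj, dict):
        for key, value in obj.items():
            flatten(value, f"{path}.{key}" if path else str(key), out)
    elif isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):
            flatten(value, f"{path}.{index}" if path else str(index), out)
    else:
        # Numbers and booleans are kept as-is; strings would be encoded later.
        out[path] = obj
    return out


flat = flatten({"a": 1, "b": None, "c": {"d": True, "e": float("nan")}, "f": [], "g": "x"})
```

In this example only the `a`, `c.d`, and `g` leaves remain; the null, NaN, and empty-list entries are dropped.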
- encode_context()
Encodes the provided context into the given numpy array
- Parameters:
context (object) – JSON encodable python object
into (np.ndarray[double, ndim=1, mode='c']) – array storing results of encoding
noise_shift (double) – value to be added to features
noise_scale (double) – multiplier used to scale shifted feature value
- encode_feature_vector()
Fully encodes the provided item and context into the np.ndarray passed as the into parameter. into must not be None.
- Parameters:
item (object) – a JSON encodable object to be encoded
context (object) – a JSON encodable object to be encoded
into (np.ndarray[double, ndim=1, mode='c']) – an array into which feature values will be added
noise (double) – value in [0, 1) which will be combined with the feature value
- encode_item()
Encodes the provided item into the given numpy array
- Parameters:
item (object) – JSON encodable python object
into (np.ndarray[double, ndim=1, mode='c']) – array storing results of encoding
noise_shift (double) – value to be added to features
noise_scale (double) – multiplier used to scale shifted feature value
- feature_indexes
A mapping from feature names to feature indexes, built by enumerating the feature names in order.
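As a small illustration of building such a mapping by enumeration, the sketch below uses hypothetical feature names; the real names come from the model passed to the constructor.

```python
# Hypothetical feature names; order defines the column index in the model.
feature_names = ["item.price", "item.color", "context.country"]

# Enumerate the names so that the first name maps to index 0, and so on.
feature_indexes = {name: index for index, name in enumerate(feature_names)}
```

A lookup such as `feature_indexes["item.color"]` then gives the position of that feature in the encoded vector.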
- string_tables
List of StringTable objects. The order of elements follows constructor’s string_tables parameter.
- class StringTable
Bases:
object
A class responsible for target encoding of strings
- __init__()
Initialize the StringTable with the given parameters
- Parameters:
string_table (list) – a list of masked hashed strings for each string feature
model_seed (int) – model seed value
- encode()
Encode input string to a target value
- Parameters:
string (str) – string to be encoded
- Returns:
encoded value
- Return type:
double
- encode_miss()
Encodes string hash as a miss
- Parameters:
string_hash (unsigned long) – string hash to be encoded as a miss
- Returns:
encoded miss value
- Return type:
double
- mask
Integer (at most 64 bits) representing the string hash mask, e.g., 000..00111
- miss_width
Float value representing the snap / width of the ‘miss interval’, the numeric interval into which all missing / unknown values are encoded. The interval is 0-centered.
- model_seed
32-bit random integer used for string hashing with xxhash
- value_table
A mapping from masked string hash to target encoding’s target value for a given feature
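How these attributes might fit together can be sketched as a masked-hash lookup with a fallback into the miss interval. This is an assumption-laden illustration, not the module's code: `hash64` uses `hashlib.blake2b` as a stand-in for xxhash, `lookup` is a hypothetical helper, and the way a missed hash is mapped into the 0-centered interval is assumed to be linear.

```python
import hashlib


def hash64(string, seed):
    """Stand-in seeded 64-bit hash (the module itself uses xxhash)."""
    digest = hashlib.blake2b(
        string.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(4, "little"),  # seed assumed to fit in 32 bits
    ).digest()
    return int.from_bytes(digest, "little")


def lookup(string, value_table, mask, seed, miss_width):
    """Return the target value for a known string, else a value in the miss interval."""
    string_hash = hash64(string, seed)
    masked = string_hash & mask  # keep only the bits selected by the mask
    if masked in value_table:
        return value_table[masked]
    # Assumed miss handling: map the hash to [0, 1], then center it on 0.
    unit = string_hash / float(2**64 - 1)
    return (unit - 0.5) * miss_width


seed = 42
mask = 0b111
table = {hash64("red", seed) & mask: 0.25}  # hypothetical target value
hit = lookup("red", table, mask, seed, 1.0)
miss = lookup("blue", {}, mask, seed, 1.0)
```

With an empty table every string is a miss, so the result always falls within half of `miss_width` on either side of zero.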
- encode_candidates_to_matrix()
Encodes list of candidates to 2D np.array for a given context with provided noise
- Parameters:
candidates (object) – list or tuple or np.ndarray of JSON encodable candidates / items to encode
context (object) – JSON encodable object
noise (double) – noise to be used for sprinkling of encoded features
- Returns:
2D numpy array with encoded candidates
- Return type:
np.ndarray[double, ndim=2, mode='c']
- get_mask()
Returns an integer representation of binary mask for a given string table
- Parameters:
string_table (list) – list of hash string values for a given feature
- Returns:
number of bytes needed to represent the string hashes in the table
- Return type:
unsigned long long
- get_noise_shift_scale()
Returns noise shift (small value added to feature value) and noise scale (value by which shifted feature value is multiplied)
- Parameters:
noise (double) – value in [0, 1) which will be combined with the feature value
- Returns:
tuple of double: (noise_shift, noise_scale)
- Return type:
tuple
- scale()
Scales input miss value to [-width/2, width/2]. Assumes input is within [0, 1] range.
- Parameters:
val (double) – miss value to be scaled
width (double) – miss range width
- Returns:
scaled miss value
- Return type:
double
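One way to map a value from [0, 1] onto [-width/2, width/2] is a linear shift and stretch. The sketch below assumes that linear form; the name `scale_sketch` is hypothetical and the module's own `scale()` may differ in detail.

```python
def scale_sketch(val, width):
    """Linearly map val from [0, 1] onto [-width / 2, width / 2]."""
    return (val - 0.5) * width


low = scale_sketch(0.0, 2.0)   # lower edge of the interval
mid = scale_sketch(0.5, 2.0)   # center of the interval
high = scale_sketch(1.0, 2.0)  # upper edge of the interval
```

With `width = 2.0` the endpoints 0 and 1 land on -1 and 1, and 0.5 lands on the center, 0.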
- sprinkle()
Adds noise shift and scales shifted value
- Parameters:
x (double) – value to be sprinkled
noise_shift (double) – small bias added to the feature value
noise_scale (double) – small multiplier of the shifted feature value
- Returns:
sprinkled value
- Return type:
double
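Read literally, the description above amounts to a shift followed by a multiplication. The following sketch shows that reading; `sprinkle_sketch` is a hypothetical name, and the exact arithmetic in the compiled module is an assumption here.

```python
def sprinkle_sketch(x, noise_shift, noise_scale):
    """Add the noise shift to x, then multiply the shifted value by the noise scale."""
    return (x + noise_shift) * noise_scale


# Exaggerated values chosen so the arithmetic is easy to follow;
# the documented shift and scale are small perturbations.
sprinkled = sprinkle_sketch(2.0, 0.5, 2.0)
```

Here the shift is applied first, so the result is (2.0 + 0.5) * 2.0 rather than 2.0 * 2.0 + 0.5.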