feature_encoder.py module
- ITEM_FEATURE_KEY = 'item'
Feature name prefix for features derived from candidates / items, e.g.:
item == 1 -> feature name is “item”
item == [1] -> feature name is “item.0”
item == {“a”: 1} -> feature name is “item.a”
- CONTEXT_FEATURE_KEY = 'context'
Feature name prefix for features derived from context, e.g.:
context == 1 -> feature name is “context”
context == [1] -> feature name is “context.0”
context == {“a”: 1} -> feature name is “context.a”
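The naming rules for both prefixes can be sketched as a recursive walk over the object. `flatten_names` is a hypothetical helper, not part of the module; it only reproduces the path construction shown in the examples above.

```python
def flatten_names(obj, path):
    """Return the feature names produced for obj under the given prefix."""
    if isinstance(obj, dict):
        names = []
        for key, value in obj.items():
            # dict keys are appended to the path with a dot
            names.extend(flatten_names(value, f"{path}.{key}"))
        return names
    if isinstance(obj, (list, tuple)):
        names = []
        for index, value in enumerate(obj):
            # list positions are appended to the path with a dot
            names.extend(flatten_names(value, f"{path}.{index}"))
        return names
    # scalars keep the path accumulated so far
    return [path]


print(flatten_names(1, "item"))            # ['item']
print(flatten_names([1], "item"))          # ['item.0']
print(flatten_names({"a": 1}, "context"))  # ['context.a']
```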
- class FeatureEncoder
Bases: object
Encodes JSON encodable objects into float vectors
- __init__(feature_names, string_tables, model_seed)
Initialize the feature encoder for this model
- Parameters:
feature_names (list) – the list of feature names. Order matters: the first feature name must correspond to the first feature in the model
string_tables (dict) – a mapping from feature names to string hash tables
model_seed (int) – model seed to be used during string encoding
- Raises:
ValueError – if feature names or tables are corrupt
- property feature_indexes: dict
A mapping from feature names to feature indexes, built by enumerating the feature names in order
- Returns:
a mapping from feature name (str) to feature index (int)
- Return type:
dict
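A minimal sketch of how such a mapping can be built by enumeration. The feature names below are made up for illustration.

```python
# hypothetical feature names, in model order
feature_names = ["item.price", "item.tags.0", "context.country"]

# each name maps to its position in the list
feature_indexes = {name: index for index, name in enumerate(feature_names)}

print(feature_indexes["context.country"])  # 2
```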
- property string_tables: list
List of StringTable objects. The order of elements follows constructor’s string_tables parameter.
- Returns:
list of StringTables
- Return type:
list
- _check_into(into)
Checks that the provided into argument is a numpy array with the np.float64 dtype
- Parameters:
into (np.ndarray) – array which will store feature values
- Raises:
ValueError – if into is not a numpy array or is not of float64 dtype
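The check described here could look roughly like the following. `check_into_sketch` is a hypothetical stand-in, not the module's own code, and the error message is an assumption.

```python
import numpy as np


def check_into_sketch(into):
    """Raise ValueError unless into is a float64 numpy array."""
    if not isinstance(into, np.ndarray) or into.dtype != np.float64:
        raise ValueError("into must be a numpy array of float64 dtype")


check_into_sketch(np.zeros(3))  # passes: np.zeros defaults to float64
```

A plain list or a float32 array would raise ValueError under this sketch.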
- encode_item(item, into, noise_shift=0.0, noise_scale=1.0)
Encodes the provided item into the given numpy array
- Parameters:
item (object) – JSON encodable python object
into (np.ndarray) – array storing results of encoding
noise_shift (float) – value to be added to values of features
noise_scale (float) – multiplier used to scale shifted feature values
- encode_context(context, into, noise_shift=0.0, noise_scale=1.0)
Encodes the provided context into the given numpy array
- Parameters:
context (object) – JSON encodable python object
into (np.ndarray) – array storing results of encoding
noise_shift (float) – value to be added to values of features
noise_scale (float) – multiplier used to scale shifted feature values
- encode_feature_vector(item, context, into, noise=0.0)
Fully encodes the provided item and context into the np.ndarray passed as the into parameter. into must not be None
- Parameters:
item (object) – a JSON encodable item to be encoded
context (object) – a JSON encodable context to be encoded
into (np.ndarray) – an array into which feature values will be added
noise (float) – value in [0, 1) which will be combined with the feature value
- _encode(obj, path, into, noise_shift=0.0, noise_scale=1.0)
Encodes a JSON serializable object to a float vector. The encoding rules are as follows:
None, JSON null, {}, [], and NaN are treated as missing features and ignored.
Numbers and booleans are encoded as-is.
Strings are encoded using a lookup table.
- Parameters:
obj (object) – a JSON serializable object to be encoded to a flat key-value structure
path (str) – the JSON-normalized path to the current object
into (np.ndarray) – an array into which feature values will be encoded
noise_shift (float) – small bias added to the feature value
noise_scale (float) – small multiplier of the shifted feature value
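The encoding rules can be illustrated with a simplified sketch. It is not the module's implementation: `encode_sketch`, the explicit `feature_indexes` argument, and the `string_encoder` callable (standing in for the string lookup table) are assumptions, noise handling is left out, and skipping paths that have no index is also assumed.

```python
import math

import numpy as np


def encode_sketch(obj, path, into, feature_indexes, string_encoder):
    """Walk obj and write encoded values into `into` (hypothetical helper)."""
    if obj is None or (isinstance(obj, float) and math.isnan(obj)):
        return  # missing feature: nothing is written
    if isinstance(obj, dict):
        for key, value in obj.items():  # an empty dict writes nothing
            encode_sketch(value, f"{path}.{key}", into, feature_indexes, string_encoder)
        return
    if isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):  # an empty list writes nothing
            encode_sketch(value, f"{path}.{index}", into, feature_indexes, string_encoder)
        return
    index = feature_indexes.get(path)
    if index is None:
        return  # assumption: paths without an index are skipped
    if isinstance(obj, str):
        into[index] = string_encoder(obj)  # strings go through the lookup stand-in
    elif isinstance(obj, (bool, int, float)):
        into[index] = float(obj)  # numbers and booleans are stored as-is


feature_indexes = {"item.price": 0, "item.tags.0": 1, "item.on_sale": 2}
into = np.full(3, np.nan)
item = {"price": 9.5, "tags": ["new"], "on_sale": True, "note": None}
encode_sketch(item, "item", into, feature_indexes, lambda s: float(len(s)))
print(into)  # [9.5 3.  1. ]
```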
- get_noise_shift_scale(noise)
Returns noise shift (small value added to feature value) and noise scale (value by which shifted feature value is multiplied)
- Parameters:
noise (float) – value in [0, 1) which will be combined with the feature value
- Returns:
tuple of floats: (noise_shift, noise_scale)
- Return type:
tuple
- sprinkle(x, noise_shift, noise_scale)
Adds the noise shift to x and multiplies the shifted value by the noise scale
- Parameters:
x (float) – value to be sprinkled
noise_shift (float) – small bias added to the feature value
noise_scale (float) – small multiplier of the shifted feature value
- Returns:
sprinkled value
- Return type:
float
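Read literally, the description corresponds to the arithmetic below. `sprinkle_sketch` is a hypothetical name used to keep this sketch separate from the real method.

```python
def sprinkle_sketch(x, noise_shift, noise_scale):
    """Shift x by noise_shift, then scale the shifted value."""
    return (x + noise_shift) * noise_scale


print(sprinkle_sketch(2.0, 0.0, 1.0))  # 2.0, the encoders' default shift and scale leave x unchanged
print(sprinkle_sketch(2.0, 0.5, 2.0))  # 5.0
```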
- class StringTable
Bases: object
A class responsible for target encoding of strings for a given feature
- __init__(string_table, model_seed)
Initializes the StringTable with the given parameters
- Parameters:
string_table (list) – a list of masked hashed strings for each string feature
model_seed (int) – model seed value
- property model_seed: int
32-bit random integer used to hash strings with xxhash
- Returns:
model seed
- Return type:
int
- property mask: int
Integer representation (at most 64 bits) of the string hash mask, e.g., 000..00111
- Returns:
mask used to reduce the hashed string value
- Return type:
int
- property miss_width: float
Float value representing the snap / width of the ‘miss interval’: the zero-centered numeric interval into which all missing / unknown values are encoded.
- Returns:
miss width value
- Return type:
float
- property value_table: dict
A mapping from masked string hash to target encoding’s target value for a given feature
- Returns:
a dict with target value encoding
- Return type:
dict
- encode(string)
Encode input string to a target value
- Parameters:
string (str) – string to be encoded
- Returns:
encoded value
- Return type:
float
- encode_miss(string_hash)
Encodes string hash as a miss
- Parameters:
string_hash (int) – string hash to be encoded as a miss
- Returns:
encoded miss value
- Return type:
float
- scale(val, width=2)
Scales input miss value to [-width/2, width/2]. Assumes input is within [0, 1] range.
- Parameters:
val (float) – miss value to be scaled
width (float) – miss range width
- Returns:
scaled miss value
- Return type:
float
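One way to realize this mapping is a linear rescale, shown below. The exact formula is an assumption; the description only fixes the input and output ranges.

```python
def scale_sketch(val, width=2):
    """Map val from [0, 1] linearly onto [-width/2, width/2]."""
    return (val - 0.5) * width


print(scale_sketch(0.0))  # -1.0
print(scale_sketch(0.5))  # 0.0
print(scale_sketch(1.0))  # 1.0
```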
- get_mask(string_table)
Returns an integer representation of a binary mask for a given string table
- Parameters:
string_table (list) – list of hash string values for a given feature
- Returns:
integer mask covering the string hashes in the table
- Return type:
int
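As a rough illustration of what a mask such as 000..00111 does, the sketch below derives a mask from the largest value in a table and applies it to a hash. Both the derivation and the name `mask_sketch` are assumptions; the module may compute its mask differently.

```python
def mask_sketch(string_table):
    """Return a mask with all bits set up to the highest bit used in the table."""
    if not string_table:
        return 0
    # bit_length() gives the number of bits needed for the largest entry
    return (1 << max(string_table).bit_length()) - 1


mask = mask_sketch([1, 5, 6])
print(bin(mask))      # 0b111
print(0xABCD & mask)  # 5, only the three lowest bits of the hash are kept
```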