cythonized_feature_encoder.pyx module

class FeatureEncoder

Bases: object

Encodes JSON encodable objects into float vectors

__init__()

Initialize the feature encoder for this model

Parameters:
  • feature_names (list) – the list of feature names. Order matters - first feature name should be the first feature in the model

  • string_tables (dict) – a mapping from feature names to string hash tables

  • model_seed (int) – model seed to be used during string encoding

Raises:

ValueError if feature names or tables are corrupt

_encode()

Encodes a JSON-serializable object into a float vector. The encoding rules are as follows:

  • None, json null, {}, [], and nan are treated as missing features and ignored.

  • numbers and booleans are encoded as-is.

  • strings are encoded using a lookup table

Parameters:
  • obj (object) – a JSON serializable object to be encoded to a flat key-value structure

  • path (str) – the path to the current object

  • into (np.ndarray[double, ndim=1, mode='c']) – an array into which feature values will be encoded

  • noise_shift (double) – small bias added to the feature value

  • noise_scale (double) – small multiplier of the feature value
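
The encoding rules above can be illustrated with a minimal pure-Python sketch. It is not the Cython implementation: the function name `encode_sketch`, the `path.key` path format, the `encode_string` callback, the choice to skip paths unknown to the model, and the use of a plain list in place of the `np.ndarray` are all assumptions made for illustration.

```python
import math


def encode_sketch(obj, path, into, feature_indexes, encode_string,
                  noise_shift=0.0, noise_scale=1.0):
    """Recursively flatten obj and write feature values into `into`."""
    # None (JSON null), {}, [] and NaN are treated as missing and ignored.
    if obj is None:
        return
    if isinstance(obj, (dict, list, tuple)) and len(obj) == 0:
        return
    if isinstance(obj, float) and math.isnan(obj):
        return

    # Containers are flattened; the "path.key" format is an assumption.
    if isinstance(obj, dict):
        for key, value in obj.items():
            encode_sketch(value, f"{path}.{key}", into, feature_indexes,
                          encode_string, noise_shift, noise_scale)
        return
    if isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):
            encode_sketch(value, f"{path}.{index}", into, feature_indexes,
                          encode_string, noise_shift, noise_scale)
        return

    feature_index = feature_indexes.get(path)
    if feature_index is None:
        return  # assumption: paths the model does not know are skipped

    if isinstance(obj, str):
        value = encode_string(path, obj)  # strings go through a lookup table
    else:
        value = float(obj)  # numbers and booleans are encoded as-is

    # Shift first, then scale, as described for noise_shift and noise_scale.
    into[feature_index] = (value + noise_shift) * noise_scale


# Hypothetical feature names and a constant stand-in for the string lookup.
feature_indexes = {"item.price": 0, "item.color": 1, "item.in_stock": 2}
into = [float("nan")] * len(feature_indexes)
item = {"price": 9.5, "color": "red", "in_stock": True, "tags": [], "note": None}
encode_sketch(item, "item", into, feature_indexes, lambda path, s: 0.25)
print(into)  # [9.5, 0.25, 1.0]
```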

encode_context()

Encodes the provided context into the supplied numpy array

Parameters:
  • context (object) – JSON encodable python object

  • into (np.ndarray[double, ndim=1, mode='c']) – array storing results of encoding

  • noise_shift (double) – value to be added to features

  • noise_scale (double) – multiplier used to scale shifted feature value

encode_feature_vector()

Fully encodes the provided item and context into the np.ndarray passed as the into parameter. into must not be None.

Parameters:
  • item (object) – a JSON encodable object to be encoded

  • context (object) – a JSON encodable object to be encoded

  • into (np.ndarray[double, ndim=1, mode='c']) – an array into which feature values will be added

  • noise (double) – value in [0, 1) which will be combined with the feature value

encode_item()

Encodes the provided item into the supplied numpy array

Parameters:
  • item (object) – JSON encodable python object

  • into (np.ndarray[double, ndim=1, mode='c']) – array storing results of encoding

  • noise_shift (double) – value to be added to features

  • noise_scale (double) – multiplier used to scale shifted feature value

feature_indexes

A map between feature names and feature indexes. Created by simple iteration with enumeration over feature names
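
The "iteration with enumeration" described above amounts to the following sketch. The feature names are hypothetical placeholders.

```python
# Hypothetical feature names; order matters because it defines the indexes.
feature_names = ["item.price", "item.color", "context.country"]

# Each name maps to its position in the list.
feature_indexes = {name: index for index, name in enumerate(feature_names)}

print(feature_indexes)
# {'item.price': 0, 'item.color': 1, 'context.country': 2}
```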

string_tables

List of StringTable objects. The order of elements follows constructor’s string_tables parameter.

class StringTable

Bases: object

A class responsible for target encoding of strings

__init__()

Initializes the StringTable with the provided parameters

Parameters:
  • string_table (list) – a list of masked hashed strings for each string feature

  • model_seed (int) – model seed value

encode()

Encode input string to a target value

Parameters:

string (str) – string to be encoded

Returns:

encoded value

Return type:

double

encode_miss()

Encodes string hash as a miss

Parameters:

string_hash (unsigned long) – string hash to be encoded as a miss

Returns:

encoded miss value

Return type:

double

mask

Integer (at most 64 bits) representing the string hash mask, e.g., 000..00111

miss_width

Float value representing snap / width of the ‘miss interval’ - numeric interval into which all missing / unknown values are encoded. It is 0-centered.

model_seed

32-bit random integer used for string hashing with xxhash

value_table

A mapping from masked string hash to target encoding’s target value for a given feature
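
How mask, value_table, and miss_width might fit together can be sketched as below. This is an illustration under stated assumptions, not the actual class: `StringTableSketch` and `encode_hash` are hypothetical names, the xxhash step is omitted (the sketch starts from an already computed hash), and the way a miss is mapped into the 0-centered interval is a guess.

```python
class StringTableSketch:
    """Hypothetical stand-in showing a masked lookup with a miss fallback."""

    def __init__(self, value_table, mask, miss_width):
        self.value_table = value_table  # masked hash -> target value
        self.mask = mask                # e.g. 0b111
        self.miss_width = miss_width    # width of the 0-centered miss interval

    def encode_hash(self, string_hash):
        # Known strings: mask the hash and look the result up in the table.
        masked = string_hash & self.mask
        if masked in self.value_table:
            return self.value_table[masked]
        return self.encode_miss(string_hash)

    def encode_miss(self, string_hash):
        # Assumption: the low 32 bits are mapped to [0, 1] and then centered,
        # so a miss always lands in [-miss_width / 2, miss_width / 2].
        fraction = (string_hash & 0xFFFFFFFF) / 0xFFFFFFFF
        return (fraction - 0.5) * self.miss_width


table = StringTableSketch(value_table={0b101: 1.0}, mask=0b111, miss_width=0.2)
hit = table.encode_hash(0b1101)   # 0b1101 & 0b111 == 0b101, found in the table
miss = table.encode_hash(0b1000)  # masked value 0 is not in the table
```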

encode_candidates_to_matrix()

Encodes list of candidates to 2D np.array for a given context with provided noise

Parameters:
  • candidates (object) – list or tuple or np.ndarray of JSON encodable candidates / items to encode

  • context (object) – JSON encodable object

  • noise (double) – noise to be used for sprinkling of encoded features

Returns:

2D numpy array with encoded candidates

Return type:

np.ndarray[double, ndim=2, mode='c']

get_mask()

Returns an integer representation of binary mask for a given string table

Parameters:

string_table (list) – list of hash string values for a given feature

Returns:

integer bit mask wide enough to cover the string hashes in the table

Return type:

unsigned long long
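
One way to build a mask of the form 000..00111 is sketched below. It assumes the mask is the smallest all-ones bit pattern that covers the largest value in the table; `get_mask_sketch` is a hypothetical name and the real function may compute the mask differently.

```python
def get_mask_sketch(string_table):
    """Return the smallest all-ones mask covering every value in the table."""
    if not string_table:
        return 0  # assumption: an empty table yields an empty mask
    # bit_length() is the number of bits needed for the largest value.
    return (1 << max(string_table).bit_length()) - 1


print(bin(get_mask_sketch([1, 5, 6])))  # 0b111
```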

get_noise_shift_scale()

Returns noise shift (small value added to feature value) and noise scale (value by which shifted feature value is multiplied)

Parameters:

noise (double) – value in [0, 1) which will be combined with the feature value

Returns:

tuple of double: (noise_shift, noise_scale)

Return type:

tuple

scale()

Scales input miss value to [-width/2, width/2]. Assumes input is within [0, 1] range.

Parameters:
  • val (double) – miss value to be scaled

  • width (double) – miss range width

Returns:

scaled miss value

Return type:

double
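
A linear mapping consistent with this description is sketched below. The formula is an assumption (the simplest map sending [0, 1] onto [-width/2, width/2]), and `scale_sketch` is a hypothetical name.

```python
def scale_sketch(val, width):
    """Map val from [0, 1] onto the 0-centered interval of the given width."""
    return (val - 0.5) * width


print(scale_sketch(0.0, 2.0))  # -1.0
print(scale_sketch(0.5, 2.0))  # 0.0
print(scale_sketch(1.0, 2.0))  # 1.0
```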

sprinkle()

Adds noise shift and scales shifted value

Parameters:
  • x (double) – value to be sprinkled

  • noise_shift (double) – small bias added to the feature value

  • noise_scale (double) – small multiplier of the shifted feature value

Returns:

sprinkled value

Return type:

double
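
Read literally, the description corresponds to the sketch below: the shift is added first and the result is then multiplied by the scale. `sprinkle_sketch` is a hypothetical name, and the input values are deliberately exaggerated for readability; noise values used in practice are described as small.

```python
def sprinkle_sketch(x, noise_shift, noise_scale):
    """Add the shift to x, then multiply the shifted value by the scale."""
    return (x + noise_shift) * noise_scale


print(sprinkle_sketch(2.0, 0.5, 2.0))  # 5.0
```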