Manual

Categorical Features

Functions for encoding a column of categorical data as multiple binary columns.

FeatureEng.encode_hash – Method
encode_hash(column::T, n_cols::Int = 8, prefix::String = "c") where T <: AbstractArray

Deterministically encode categorical features with high cardinality as a DataFrame with n_cols columns.

julia> data = [1:1_000:10_000;];
julia> encode_hash(data)
10×8 DataFrame
 Row │ c1     c2     c3     c4     c5     c6     c7     c8    
     │ Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64 
─────┼────────────────────────────────────────────────────────
   1 │     1      1      1      1      1      1      1      0
   2 │     1      1      0      1      1      0      1      0
   3 │     1      0      1      0      1      1      0      1
   4 │     1      1      1      1      0      1      1      1
   5 │     1      0      0      0      0      0      1      0
   6 │     1      0      1      0      0      1      0      1
   7 │     0      1      1      0      0      1      1      1
   8 │     1      1      0      1      0      1      0      0
   9 │     0      1      0      1      0      1      1      0
  10 │     0      1      0      0      0      1      1      0

See also: encode_onehot, encode_dummy
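
The implementation of encode_hash is not shown on this page. As a rough illustration of the general idea only, the sketch below maps each value to n_cols binary features by taking the low bits of Julia's built-in hash. The helper name hash_bits is hypothetical, and the bit patterns it produces need not match the table above.

```julia
# Hypothetical helper, not part of FeatureEng: derive n_cols binary
# features from the low bits of Base.hash.
function hash_bits(value, n_cols::Int = 8)
    h = hash(value)  # UInt; the same input always yields the same hash within a session
    return [Int((h >> (i - 1)) & one(h)) for i in 1:n_cols]
end

# One vector of bits per input value.
rows = [hash_bits(v) for v in 1:1_000:10_000]
```

Because the output width is fixed by n_cols rather than by the number of distinct values, this style of encoding stays compact for high-cardinality columns, at the cost of possible collisions between categories.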

FeatureEng.encode_onehot – Method
encode_onehot(column::T[, categories::T[, prefix::String]]) where T <: AbstractArray

Convert a categorical column into a DataFrame of one-hot-encoded columns, with one binary column per unique value in column.

Examples

The basic version of this function makes a column for each unique value in column.

julia> data = [3,1,2,4];
julia> encode_onehot(data)
4×4 DataFrame
 Row │ 1      2      3      4     
     │ Bool   Bool   Bool   Bool  
─────┼────────────────────────────
   1 │ false  false   true  false
   2 │  true  false  false  false
   3 │ false   true  false  false
   4 │ false  false  false   true

You can also specify a prefix for each column.

julia> data = [3,1,2,4];
julia> encode_onehot(data,"col_")
4×4 DataFrame
 Row │ col_1  col_2  col_3  col_4 
     │ Bool   Bool   Bool   Bool  
─────┼────────────────────────────
   1 │ false  false   true  false
   2 │  true  false  false  false
   3 │ false   true  false  false
   4 │ false  false  false   true

Additionally, you can specify the categories to convert to columns, regardless of whether they appear in column.

julia> data = [3,1,2,4];
julia> encode_onehot(data,[1:6;],"c")
4×6 DataFrame
 Row │ c1     c2     c3     c4     c5     c6    
     │ Bool   Bool   Bool   Bool   Bool   Bool  
─────┼──────────────────────────────────────────
   1 │ false  false   true  false  false  false
   2 │  true  false  false  false  false  false
   3 │ false   true  false  false  false  false
   4 │ false  false  false   true  false  false

See also: encode_dummy, encode_hash
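
To make the encoding itself concrete, here is a minimal sketch that builds the same true/false pattern as a plain Bool matrix instead of a DataFrame. The function name onehot_matrix is hypothetical and is not part of FeatureEng.

```julia
# Hypothetical helper: one row per element of `column`, one column per category.
function onehot_matrix(column::AbstractVector, categories = sort(unique(column)))
    # Entry (i, j) is true when the i-th value equals the j-th category.
    return [value == category for value in column, category in categories]
end

data = [3, 1, 2, 4]
basic = onehot_matrix(data)        # 4×4, categories inferred from the data
wide = onehot_matrix(data, 1:6)    # 4×6, categories 5 and 6 are all false
```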


DateTime Features

Functions for extracting helpful information from a column of DateTime data.

FeatureEng.extract_date_features – Method
extract_date_features(datetimes::T) where T <: AbstractArray{<:Union{Date,DateTime}}

Extract a DataFrame of features from an array of DateTime or Date objects. Features extracted:

  • year: Year from datetime
  • month: Month from datetime
  • dayofmonth: Day of the month (1-31)
  • dayofweek: Day of the week (ordered)
  • isweekend: Is datetime a weekend?
  • quarter: Quarter of the year from datetime

Examples

julia> data = strp_datetimes([
    "2021-01-27 14:03:25",
    "1999-10-05 01:13:43",
    "2010-06-11 11:00:00"
]);
julia> extract_date_features(data)
3×6 DataFrame
 Row │ year   month    dayofmonth  dayofweek  isweekend  quarter 
     │ Int64  Cat…     Int64       Cat…       Bool       Int64   
─────┼───────────────────────────────────────────────────────────
   1 │  2021  January          27  Wednesday      false        1
   2 │  1999  October           5  Tuesday        false        4
   3 │  2010  June             11  Friday         false        2

See also: extract_datetime_features, extract_time_features
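
For a single value, the listed features can be approximated with the Dates standard library alone. This is only a sketch under that assumption: the helper name date_features is hypothetical, and it returns plain strings where the function above returns categorical columns.

```julia
using Dates

# Hypothetical helper: the date-related features for one Date or DateTime.
function date_features(d::Union{Date,DateTime})
    return (
        year = Dates.year(d),
        month = Dates.monthname(d),
        dayofmonth = Dates.day(d),
        dayofweek = Dates.dayname(d),
        # Dates.dayofweek numbers Monday as 1 through Sunday as 7.
        isweekend = Dates.dayofweek(d) in (Dates.Saturday, Dates.Sunday),
        quarter = Dates.quarterofyear(d),
    )
end

features = date_features(DateTime(2021, 1, 27, 14, 3, 25))
```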

FeatureEng.extract_datetime_features – Method
extract_datetime_features(datetimes::T) where T <: AbstractArray{<:DateTime}

Extract a DataFrame of features from an array of DateTime objects. Features extracted:

  • year: Year from datetime
  • month: Month from datetime
  • dayofmonth: Day of the month (1-31)
  • dayofweek: Day of the week (ordered)
  • isweekend: Is datetime a weekend?
  • quarter: Quarter of the year from datetime
  • hour: Hour of the day from datetime
  • minute: Minute from datetime
  • second: Second from datetime
  • isAM: Is time AM (vs PM)?

This is equivalent to:

julia> hcat(
    extract_date_features(datetimes),
    extract_time_features(datetimes)
)

Examples

julia> data = strp_datetimes([
    "2021-01-27 14:03:25",
    "1999-10-05 01:13:43",
    "2010-06-11 11:00:00"
]);
julia> extract_datetime_features(data)
3×10 DataFrame
 Row │ year   month    dayofmonth  dayofweek  isweekend  quarter  hour   minut ⋯
     │ Int64  Cat…     Int64       Cat…       Bool       Int64    Int64  Int64 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │  2021  January          27  Wednesday      false        1     14        ⋯
   2 │  1999  October           5  Tuesday        false        4      1      1
   3 │  2010  June             11  Friday         false        2     11
                                                               3 columns omitted

See also: extract_date_features, extract_time_features

FeatureEng.extract_time_features – Method
extract_time_features(datetimes::T) where T <: AbstractArray{<:Union{Time,DateTime}}

Extract a DataFrame of features from an array of DateTime or Time objects. Features extracted:

  • hour: Hour of the day from datetime
  • minute: Minute from datetime
  • second: Second from datetime
  • isAM: Is time AM (vs PM)?

Examples

julia> data = strp_datetimes([
    "2021-01-27 14:03:25",
    "1999-10-05 01:13:43",
    "2010-06-11 11:00:00"
]);
julia> extract_time_features(data)
3×4 DataFrame
 Row │ hour   minute  second   isAM  
     │ Int64  Int64   Float64  Bool  
─────┼───────────────────────────────
   1 │    14       3     25.0  false
   2 │     1      13     43.0   true
   3 │    11       0      0.0   true

See also: extract_datetime_features, extract_date_features
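
The time-of-day features can be sketched in the same way with the Dates standard library. The helper name time_features is hypothetical. Returning second as a Float64 that includes milliseconds is an assumption, made only to mirror the Float64 column in the output above.

```julia
using Dates

# Hypothetical helper: the time-related features for one Time or DateTime.
function time_features(t::Union{Time,DateTime})
    return (
        hour = Dates.hour(t),
        minute = Dates.minute(t),
        # Assumption: fold milliseconds into a fractional second.
        second = Dates.second(t) + Dates.millisecond(t) / 1000,
        isAM = Dates.hour(t) < 12,
    )
end

afternoon = time_features(DateTime(2021, 1, 27, 14, 3, 25))
night = time_features(Time(1, 13, 43))
```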

FeatureEng.get_month – Method
get_month(datetimes::T) where T <: AbstractArray{<:Union{Date,DateTime}}

Return an ordered CategoricalArray of month names extracted from datetimes.

Examples:

julia> data = strp_datetimes([
    "2021-01-27 14:03:25",
    "1999-10-05 01:13:43",
    "2010-06-11 11:00:00"
]);
julia> get_month(data)
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "January"
 "October"
 "June"

See also: extract_datetime_features, extract_date_features, get_weekday

FeatureEng.get_weekday – Method
get_weekday(datetimes::T) where T <: AbstractArray{<:Union{Date,DateTime}}

Return an ordered CategoricalArray of weekday names extracted from datetimes.

Examples:

julia> data = strp_datetimes([
    "2021-01-27 14:03:25",
    "1999-10-05 01:13:43",
    "2010-06-11 11:00:00"
]);
julia> get_weekday(data)
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "Wednesday"
 "Tuesday"
 "Friday"

See also: extract_datetime_features, extract_date_features, get_month

FeatureEng.strp_datetimes – Method
strp_datetimes(datetimes::T, format::Union{String,DateFormat} = "y-m-d H:M:S") where T <: AbstractArray{<:AbstractString}

Convert an array of timestamp strings to an array of DateTime objects.

Any string that cannot be parsed is replaced with missing.

Examples

julia> date_strings = [
    "2021-01-27 14:03:25",
    "1999-10-05 01:13:43",
    "abcdefg"
    ];
julia> strp_datetimes(date_strings)
3-element Array{Union{Missing, DateTime},1}:
 2021-01-27T14:03:25
 1999-10-05T01:13:43
 missing
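
The parse-or-missing behaviour can be sketched with tryparse from the Dates standard library, which returns nothing when a string does not match the format. The helper name parse_or_missing is hypothetical, and this is not necessarily how strp_datetimes is implemented.

```julia
using Dates

# Hypothetical helper: parse each string, substituting `missing` on failure.
function parse_or_missing(strings, format::DateFormat = dateformat"y-m-d H:M:S")
    # tryparse yields `nothing` for unparseable input; `something` then
    # falls through to `missing`.
    return [something(tryparse(DateTime, s, format), missing) for s in strings]
end

parsed = parse_or_missing(["2021-01-27 14:03:25", "1999-10-05 01:13:43", "abcdefg"])
```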

Numeric Features

Working with numeric features.

Numeric – Binning Features

Converting continuous data to categorical data.

Numeric – Scaling Features

Scaling or normalizing numeric columns.

A helpful pre-processing step for ML models that are sensitive to data scale (e.g., k-means clustering or regularized regression).

Numeric – Transforming Features

Power transformations for numeric data.

Helpful when the distribution of the data does not suit the model you are using (e.g., log-transforming data drawn from an exponential distribution before linear regression).

FeatureEng.transformBoxCox – Method
transformBoxCox(data::T, λ::Real = 0.0) where T <: AbstractArray{<: Number}

Apply the Box-Cox power transformation, defined as:

\[y_i^{(\lambda)} = \begin{cases} \dfrac{y_i^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0, \\ \ln y_i & \text{if } \lambda = 0. \end{cases}\]

Examples:

julia> data = [0:5;];

julia> transformBoxCox(data)
6-element Array{Float64,1}:
 -Inf
   0.0
   0.6931471805599453
   1.0986122886681098
   1.3862943611198906
   1.6094379124341003

julia> transformBoxCox(data,.1)
6-element Array{Float64,1}:
 -10.0
   0.0
   0.7177346253629313
   1.1612317403390437
   1.486983549970351
   1.7461894308801895

julia> transformBoxCox(data,1)
6-element Array{Float64,1}:
 -1.0
  0.0
  1.0
  2.0
  3.0
  4.0

See also: transformLog, transformRoot
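
The piecewise formula translates almost directly into code. The scalar function below is a minimal sketch of that formula, not the package's implementation, and the name boxcox is hypothetical.

```julia
# Hypothetical scalar version of the Box-Cox formula above.
function boxcox(y::Real, λ::Real = 0.0)
    y = float(y)  # avoid integer-power edge cases
    # λ == 0 is the logarithmic branch; otherwise use the power branch.
    return λ == 0 ? log(y) : (y^λ - 1) / λ
end

# Broadcast over a vector to transform a whole column.
transformed = boxcox.(0:5, 1)
```

With λ = 1 the transformation only shifts the data by one, which matches the last example above.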

FeatureEng.transformLog – Method
transformLog(data::T, base::Real = ℯ) where T <: AbstractArray{<: Number}

Log-transform data using the logarithm base given by base.

Examples:

julia> data = [0:5;];

julia> transformLog(data)
6-element Array{Float64,1}:
 -Inf
   0.0
   0.6931471805599453
   1.0986122886681098
   1.3862943611198906
   1.6094379124341003

julia> transformLog(data,2)
6-element Array{Float64,1}:
 -Inf
   0.0
   1.0
   1.5849625007211563
   2.0
   2.321928094887362

See also: transformRoot, transformBoxCox
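
In plain Julia the same operation is a broadcast of the two-argument log. The sketch below assumes nothing beyond Base, and the name log_transform is hypothetical.

```julia
# Hypothetical helper: element-wise logarithm with a configurable base.
log_transform(data, base::Real = ℯ) = log.(base, float.(data))

natural = log_transform([1, 2, 4])      # natural logarithm by default
binary = log_transform([1, 2, 4], 2)    # base-2 logarithm
```

As the examples above show, zeros map to -Inf, so data containing zeros usually needs an offset or a different transformation.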

FeatureEng.transformRoot – Method
transformRoot(data::T, index::Real = 10) where T <: AbstractArray{<: Number}

Root-transform data by taking the root given by index (for example, index = 2 gives the square root).

Examples:

julia> data = [0:5;];

julia> transformRoot(data)
6-element Array{Float64,1}:
 0.0
 1.0
 1.0717734625362931
 1.1161231740339044
 1.148698354997035
 1.174618943088019

julia> transformRoot(data,2)
6-element Array{Float64,1}:
 0.0
 1.0
 1.4142135623730951
 1.7320508075688772
 2.0
 2.23606797749979

See also: transformLog, transformBoxCox
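
Taking the root of a given index is the same as raising to the power 1/index, as this small sketch shows. The name root_transform is hypothetical. The default of 10 mirrors the signature above.

```julia
# Hypothetical helper: element-wise root via a fractional power.
root_transform(data, index::Real = 10) = float.(data) .^ (1 / index)

square_roots = root_transform([0, 1, 4, 9], 2)
tenth_roots = root_transform([0, 1, 2])
```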


Numeric – Interaction Features

Calculate polynomial features up to a specified degree, for example before fitting a polynomial regression.

FeatureEng.polynomial – Method
polynomial(df::DataFrame, degree::T = 2) where T <: Integer

Calculate polynomial interaction terms between columns in a DataFrame.

For example, for a DataFrame with three columns x, y, and z, the degree-2 interaction terms are x*x, x*y, x*z, y*y, y*z, and z*z. The original columns are kept in the output as well.
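
Which terms appear for a given degree can be worked out by enumerating combinations of column names with repetition. The sketch below is one way to do that; the function name interaction_terms is hypothetical, and joining names with "_" is an assumption taken from the column names in the examples that follow.

```julia
# Hypothetical helper: names of all interaction terms up to `degree`.
function interaction_terms(names::Vector{String}, degree::Int)
    terms = [[i] for i in eachindex(names)]  # degree-1 terms, as index lists
    all_terms = copy(terms)
    for _ in 2:degree
        # Extend each term only with columns at or after its last column,
        # so x*y and y*x are not both generated.
        terms = [vcat(t, j) for t in terms for j in t[end]:length(names)]
        append!(all_terms, terms)
    end
    return sort([join(names[t], "_") for t in all_terms])
end

degree2 = interaction_terms(["a", "b"], 2)
degree3 = interaction_terms(["a", "b"], 3)
```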

Examples

julia> using DataFrames

julia> df = DataFrame(a=1:10,b=repeat(0:1,5))
10×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      0
   2 │     2      1
   3 │     3      0
   4 │     4      1
   5 │     5      0
   6 │     6      1
   7 │     7      0
   8 │     8      1
   9 │     9      0
  10 │    10      1

julia> polynomial(df,2)
10×5 DataFrame
 Row │ a      a_a    a_b    b      b_b   
     │ Int64  Int64  Int64  Int64  Int64 
─────┼───────────────────────────────────
   1 │     1      1      0      0      0
   2 │     2      4      2      1      1
   3 │     3      9      0      0      0
   4 │     4     16      4      1      1
   5 │     5     25      0      0      0
   6 │     6     36      6      1      1
   7 │     7     49      0      0      0
   8 │     8     64      8      1      1
   9 │     9     81      0      0      0
  10 │    10    100     10      1      1

julia> polynomial(df,3)
10×9 DataFrame
 Row │ a      a_a    a_a_a  a_a_b  a_b    a_b_b  b      b_b    b_b_b 
     │ Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64 
─────┼───────────────────────────────────────────────────────────────
   1 │     1      1      1      0      0      0      0      0      0
   2 │     2      4      8      4      2      2      1      1      1
   3 │     3      9     27      0      0      0      0      0      0
   4 │     4     16     64     16      4      4      1      1      1
   5 │     5     25    125      0      0      0      0      0      0
   6 │     6     36    216     36      6      6      1      1      1
   7 │     7     49    343      0      0      0      0      0      0
   8 │     8     64    512     64      8      8      1      1      1
   9 │     9     81    729      0      0      0      0      0      0
  10 │    10    100   1000    100     10     10      1      1      1