Manual
Categorical Features
Functions for encoding a column of categorical data as multiple binary columns.
FeatureEng.encode_dummy — Method
encode_dummy(column::T[, categories::T[, prefix::String]]) where T <: AbstractArray
Same as encode_onehot, except that it drops the first column (to help prevent issues caused by multicollinearity).
See also: encode_onehot, encode_hash
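To make the relationship between the two encodings concrete, here is a minimal plain-Julia sketch that works on Bool matrices rather than DataFrames. The helpers onehot_matrix and dummy_matrix are hypothetical names used only for illustration; they are not part of FeatureEng and may differ from how the package builds its columns.

```julia
# Hypothetical helpers for illustration; not FeatureEng's implementation.

# One Bool column per category, with categories in sorted order.
function onehot_matrix(column::AbstractVector)
    categories = sort(unique(column))
    return [value == category for value in column, category in categories]
end

# Dummy encoding: the same matrix without its first column.
dummy_matrix(column::AbstractVector) = onehot_matrix(column)[:, 2:end]

data = [3, 1, 2, 4]
onehot = onehot_matrix(data)   # 4×4 Matrix{Bool}
dummy = dummy_matrix(data)     # 4×3 Matrix{Bool}
```

Every row of the one-hot matrix sums to 1, so its columns are linearly dependent once a model includes an intercept. Dropping one column removes that dependency; the dropped category is then represented by a row of all false values.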
FeatureEng.encode_hash — Method
encode_hash(column::T, n_cols::Int = 8, prefix::String = "c") where T <: AbstractArray
Deterministically encode categorical features with high cardinality as a DataFrame with n_cols columns.
julia> data = [1:100;1000;];
julia> encode_hash([1:1_000:10_000;])
10×8 DataFrame
Row │ c1 c2 c3 c4 c5 c6 c7 c8
│ Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼────────────────────────────────────────────────────────
1 │ 1 1 1 1 1 1 1 0
2 │ 1 1 0 1 1 0 1 0
3 │ 1 0 1 0 1 1 0 1
4 │ 1 1 1 1 0 1 1 1
5 │ 1 0 0 0 0 0 1 0
6 │ 1 0 1 0 0 1 0 1
7 │ 0 1 1 0 0 1 1 1
8 │ 1 1 0 1 0 1 0 0
9 │ 0 1 0 1 0 1 1 0
10 │ 0 1 0 0 0 1 1 0
See also: encode_onehot, encode_dummy
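The example output above has several 1s per row, which suggests each row is a fixed-width bit pattern derived from a hash of the value. The following sketch shows one way such an encoding could work, assuming the lowest n_cols bits of Base.hash are used. The helper hash_bits is hypothetical, and the bit-selection scheme is an assumption, not FeatureEng's documented algorithm.

```julia
# Hypothetical helper for illustration; FeatureEng may derive its bits differently.
function hash_bits(column::AbstractVector, n_cols::Integer = 8)
    # Row i holds the n_cols lowest bits of hash(column[i]) as 0/1 integers.
    return [Int((hash(value) >> (bit - 1)) & 0x1) for value in column, bit in 1:n_cols]
end

encoded = hash_bits(["red", "green", "blue", "red"], 4)   # 4×4 Matrix{Int}
```

Because the output width is fixed by n_cols rather than by the number of distinct values, the encoding stays compact for high-cardinality columns. The trade-off is that two different values can collide on the same bit pattern.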
FeatureEng.encode_onehot — Method
encode_onehot(column::T[, categories::T[, prefix::String]]) where T <: AbstractArray
Converts a categorical column into a DataFrame of one-hot-encoded columns, with one binary-encoded column per unique value in column.
Examples
The basic version of this function makes a column for each unique value in column.
julia> data = [3,1,2,4];
julia> encode_onehot(data)
4×4 DataFrame
Row │ 1 2 3 4
│ Bool Bool Bool Bool
─────┼────────────────────────────
1 │ false false true false
2 │ true false false false
3 │ false true false false
4 │ false false false true
You can also specify a prefix for each column.
julia> data = [3,1,2,4];
julia> encode_onehot(data,"col_")
4×4 DataFrame
Row │ col_1 col_2 col_3 col_4
│ Bool Bool Bool Bool
─────┼────────────────────────────
1 │ false false true false
2 │ true false false false
3 │ false true false false
4 │ false false false true
Additionally, you can specify the categories to convert to columns, regardless of whether they exist in column.
julia> data = [3,1,2,4];
julia> encode_onehot(data,[1:6;],"c")
4×6 DataFrame
Row │ c1 c2 c3 c4 c5 c6
│ Bool Bool Bool Bool Bool Bool
─────┼──────────────────────────────────────────
1 │ false false true false false false
2 │ true false false false false false
3 │ false true false false false false
4 │ false false false true false false
See also: encode_dummy, encode_hash
DateTime Features
Functions for extracting helpful information from a column of DateTime data.
FeatureEng.extract_date_features — Method
extract_date_features(datetimes::T) where T <: AbstractArray{<:Union{Date,DateTime}}
Extract a DataFrame of features from an array of DateTime or Date objects. Features extracted:
year: Year from datetime
month: Month from datetime
dayofmonth: Day of the month (1-31)
dayofweek: Day of the week (ordered)
isweekend: Is datetime a weekend?
quarter: Quarter from datetime
Examples
julia> data = strp_datetimes([
"2021-01-27 14:03:25",
"1999-10-05 01:13:43",
"2010-06-11 11:00:00"
]);
julia> extract_date_features(data)
3×6 DataFrame
Row │ year month dayofmonth dayofweek isweekend quarter
│ Int64 Cat… Int64 Cat… Bool Int64
─────┼───────────────────────────────────────────────────────────
1 │ 2021 January 27 Wednesday false 1
2 │ 1999 October 5 Tuesday false 4
3 │ 2010 June 11 Friday false 2
See also: extract_datetime_features, extract_time_features
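For readers who want to see where such features can come from, the sketch below computes the same six quantities with the Dates standard library and collects them in a NamedTuple instead of a DataFrame. It is an illustration under the assumption that Saturday and Sunday count as the weekend; it is not FeatureEng's source, and it returns plain strings rather than ordered categorical values.

```julia
using Dates

dates = [Date(2021, 1, 27), Date(1999, 10, 5), Date(2010, 6, 11)]

# Each field is a vector with one entry per input date.
features = (
    year       = year.(dates),
    month      = monthname.(dates),
    dayofmonth = day.(dates),
    dayofweek  = dayname.(dates),
    isweekend  = dayofweek.(dates) .>= 6,   # Dates numbers Monday as 1 and Sunday as 7
    quarter    = quarterofyear.(dates),
)
```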
FeatureEng.extract_datetime_features — Method
extract_datetime_features(datetimes::T) where T <: AbstractArray{<:DateTime}
Extract a DataFrame of features from an array of DateTime objects. Features extracted:
year: Year from datetime
month: Month from datetime
dayofmonth: Day of the month (1-31)
dayofweek: Day of the week (ordered)
isweekend: Is datetime a weekend?
quarter: Quarter from datetime
hour: Hour of the day from datetime
minute: Minute from datetime
second: Second from datetime
isAM: Is time AM (vs PM)?
The same as the following:
julia> hcat(
extract_date_features(datetimes),
extract_time_features(datetimes)
)
Examples
julia> data = strp_datetimes([
"2021-01-27 14:03:25",
"1999-10-05 01:13:43",
"2010-06-11 11:00:00"
]);
julia> extract_datetime_features(data)
3×10 DataFrame
Row │ year month dayofmonth dayofweek isweekend quarter hour minut ⋯
│ Int64 Cat… Int64 Cat… Bool Int64 Int64 Int64 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 2021 January 27 Wednesday false 1 14 ⋯
2 │ 1999 October 5 Tuesday false 4 1 1
3 │ 2010 June 11 Friday false 2 11
3 columns omitted
See also: extract_date_features, extract_time_features
FeatureEng.extract_time_features — Method
extract_time_features(datetimes::T) where T <: AbstractArray{<:Union{Time,DateTime}}
Extract a DataFrame of features from an array of DateTime or Time objects. Features extracted:
hour: Hour of the day from datetime
minute: Minute from datetime
second: Second from datetime
isAM: Is time AM (vs PM)?
Examples
julia> data = strp_datetimes([
"2021-01-27 14:03:25",
"1999-10-05 01:13:43",
"2010-06-11 11:00:00"
]);
julia> extract_time_features(data)
3×4 DataFrame
Row │ hour minute second isAM
│ Int64 Int64 Float64 Bool
─────┼───────────────────────────────
1 │ 14 3 25.0 false
2 │ 1 13 43.0 true
3 │ 11 0 0.0 true
See also: extract_datetime_features, extract_date_features
FeatureEng.get_month — Method
get_month(datetimes::T) where T <: AbstractArray{<:Union{Date,DateTime}}
Return an ordered CategoricalArray of month names extracted from datetimes.
Examples:
julia> data = strp_datetimes([
"2021-01-27 14:03:25",
"1999-10-05 01:13:43",
"2010-06-11 11:00:00"
]);
julia> get_month(data)
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"January"
"October"
"June"
See also: extract_datetime_features, extract_date_features, get_weekday
FeatureEng.get_weekday — Method
get_weekday(datetimes::T) where T <: AbstractArray{<:Union{Date,DateTime}}
Return an ordered CategoricalArray of weekday names extracted from datetimes.
Examples:
julia> data = strp_datetimes([
"2021-01-27 14:03:25",
"1999-10-05 01:13:43",
"2010-06-11 11:00:00"
]);
julia> get_weekday(data)
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"Wednesday"
"Tuesday"
"Friday"
See also: extract_datetime_features, extract_date_features, get_month
FeatureEng.strp_datetimes — Method
strp_datetimes(datetimes::T, format::Union{String,DateFormat} = "y-m-d H:M:S") where T <: AbstractArray{<:AbstractString}
Convert an array of timestamp strings to an array of DateTime objects.
Any string that cannot be parsed is replaced with missing.
Examples
julia> date_strings = [
"2021-01-27 14:03:25",
"1999-10-05 01:13:43",
"abcdefg"
];
julia> strp_datetimes(date_strings)
3-element Array{Union{Missing, DateTime},1}:
2021-01-27T14:03:25
1999-10-05T01:13:43
missing
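The parse-or-missing behaviour can be approximated with the Dates standard library alone. The sketch below assumes that tryparse with a DateFormat is a reasonable stand-in; parse_or_missing is a hypothetical helper, not FeatureEng's implementation.

```julia
using Dates

# Hypothetical helper for illustration; not FeatureEng's implementation.
function parse_or_missing(strings::AbstractVector{<:AbstractString},
                          format::DateFormat = dateformat"y-m-d H:M:S")
    return map(strings) do s
        parsed = tryparse(DateTime, s, format)   # returns `nothing` when parsing fails
        parsed === nothing ? missing : parsed
    end
end

parsed = parse_or_missing(["2021-01-27 14:03:25", "abcdefg"])
```

Returning missing instead of throwing keeps the output aligned with the input row by row, so unparseable entries can be filtered or imputed later.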
Numeric Features
Working with numeric features.
Numeric – Binning Features
Converting continuous data to categorical data.
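As a rough illustration of what binning means, the sketch below assigns each value to an interval defined by sorted bin edges. The helper bin_index and the edge values are hypothetical; this page does not show the signatures of the package's own binning methods, so none are assumed here.

```julia
# Hypothetical helper for illustration; not part of FeatureEng.
function bin_index(data::AbstractVector{<:Real}, edges::AbstractVector{<:Real})
    # searchsortedlast returns the index of the last edge <= x, which is the
    # bin number (0 means x lies below the first edge). `edges` must be sorted.
    return [searchsortedlast(edges, x) for x in data]
end

ages  = [3, 17, 18, 42, 70]
edges = [0, 18, 65]              # bins: [0, 18), [18, 65), [65, ∞)
bins  = bin_index(ages, edges)   # → [1, 1, 2, 2, 3]
```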
FeatureEng.apply_transform — Method
FeatureEng.apply_transform — Method
FeatureEng.fit_transform! — Method
FeatureEng.fit_transform! — Method
Numeric – Scaling Features
Scaling or normalizing numeric columns.
A helpful pre-processing step for ML models that are sensitive to data scale (e.g., k-means clustering, regularized regression).
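One common scaling scheme is standardization (z-scores), where parameters are fitted on training data and then reused on new data. The sketch below illustrates that fit-then-apply split in plain Julia. The helpers fit_zscore and apply_zscore are hypothetical and are not the package's fit_transform! or apply_transform methods, whose signatures are not shown on this page.

```julia
using Statistics

# Hypothetical helpers for illustration; not FeatureEng's API.
fit_zscore(train) = (mean = mean(train), std = std(train))
apply_zscore(data, params) = (data .- params.mean) ./ params.std

train  = [1.0, 2.0, 3.0, 4.0, 5.0]
params = fit_zscore(train)                 # learn the scale from training data only
scaled = apply_zscore(train, params)       # mean 0, standard deviation 1
unseen = apply_zscore([6.0], params)       # new data reuses the fitted parameters
```

Fitting on the training set and reusing the parameters avoids leaking information from held-out data into the pre-processing step.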
FeatureEng.apply_transform — Method
FeatureEng.apply_transform — Method
FeatureEng.apply_transform — Method
FeatureEng.fit_transform! — Method
FeatureEng.fit_transform! — Method
FeatureEng.fit_transform! — Method
Numeric – Transforming Features
Power transformations for numeric data.
Helpful for data whose distribution doesn't suit the model you're using (e.g., log-transforming data drawn from an exponential distribution before linear regression).
FeatureEng.transformBoxCox — Method
transformBoxCox(data::T, λ::Real = 0.0) where T <: AbstractArray{<: Number}
Box-Cox power transformation, defined as:
\[y_i^{(\lambda)} = \begin{cases} \dfrac{y_i^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0, \\ \ln y_i & \text{if } \lambda = 0. \end{cases}\]
Examples:
julia> data = [0:5;];
julia> transformBoxCox(data)
6-element Array{Float64,1}:
-Inf
0.0
0.6931471805599453
1.0986122886681098
1.3862943611198906
1.6094379124341003
julia> transformBoxCox(data,.1)
6-element Array{Float64,1}:
-10.0
0.0
0.7177346253629313
1.1612317403390437
1.486983549970351
1.7461894308801895
julia> transformBoxCox(data,1)
6-element Array{Float64,1}:
-1.0
0.0
1.0
2.0
3.0
4.0
See also: transformLog, transformRoot
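The piecewise definition translates almost directly into code. The scalar helper boxcox below is a hypothetical sketch for illustration, not the package's source; it also shows numerically why the λ = 0 case is the logarithm, since the power branch approaches ln(y) as λ shrinks.

```julia
# Hypothetical scalar helper mirroring the piecewise definition; not FeatureEng's source.
boxcox(y::Real, λ::Real = 0.0) = λ == 0 ? log(y) : (y^λ - 1) / λ

data = [1, 2, 3, 4, 5]
logged  = boxcox.(data)        # λ = 0: identical to log.(data)
shifted = boxcox.(data, 1)     # λ = 1: data .- 1, as floating-point values
nearly  = boxcox(2.0, 1e-8)    # tiny λ: close to log(2)
```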
FeatureEng.transformLog — Method
transformLog(data::T, base::Real = ℯ) where T <: AbstractArray{<: Number}
Log-transform data using the logarithm with base base.
Examples:
julia> data = [0:5;];
julia> transformLog(data)
6-element Array{Float64,1}:
-Inf
0.0
0.6931471805599453
1.0986122886681098
1.3862943611198906
1.6094379124341003
julia> transformLog(data,2)
6-element Array{Float64,1}:
-Inf
0.0
1.0
1.5849625007211563
2.0
2.321928094887362
See also: transformRoot, transformBoxCox
FeatureEng.transformRoot — Method
transformRoot(data::T, index::Real = 10) where T <: AbstractArray{<: Number}
Root-transform data by taking the root with index index (for example, index = 2 gives the square root).
Examples:
julia> data = [0:5;];
julia> transformRoot(data)
6-element Array{Float64,1}:
0.0
1.0
1.0717734625362931
1.1161231740339044
1.148698354997035
1.174618943088019
julia> transformRoot(data,2)
6-element Array{Float64,1}:
0.0
1.0
1.4142135623730951
1.7320508075688772
2.0
2.23606797749979
See also: transformLog, transformBoxCox
Numeric – Interaction Features
Calculate polynomial features to a specified degree before performing something like polynomial regression.
FeatureEng.polynomial — Method
polynomial(df::DataFrame, degree::T = 2) where T <: Integer
Calculate polynomial interaction terms between columns in a DataFrame.
If you have a DataFrame with 3 columns: x, y, and z, you can get degree-2 polynomial interaction terms: x*x, x*y, x*z, y*y, y*z, and z*z.
Examples
julia> using DataFrames
julia> df = DataFrame(a=1:10,b=repeat(0:1,5))
10×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 0
2 │ 2 1
3 │ 3 0
4 │ 4 1
5 │ 5 0
6 │ 6 1
7 │ 7 0
8 │ 8 1
9 │ 9 0
10 │ 10 1
julia> polynomial(df,2)
10×5 DataFrame
Row │ a a_a a_b b b_b
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 1 1 0 0 0
2 │ 2 4 2 1 1
3 │ 3 9 0 0 0
4 │ 4 16 4 1 1
5 │ 5 25 0 0 0
6 │ 6 36 6 1 1
7 │ 7 49 0 0 0
8 │ 8 64 8 1 1
9 │ 9 81 0 0 0
10 │ 10 100 10 1 1
julia> polynomial(df,3)
10×9 DataFrame
Row │ a a_a a_a_a a_a_b a_b a_b_b b b_b b_b_b
│ Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────────────────────────────────
1 │ 1 1 1 0 0 0 0 0 0
2 │ 2 4 8 4 2 2 1 1 1
3 │ 3 9 27 0 0 0 0 0 0
4 │ 4 16 64 16 4 4 1 1 1
5 │ 5 25 125 0 0 0 0 0 0
6 │ 6 36 216 36 6 6 1 1 1
7 │ 7 49 343 0 0 0 0 0 0
8 │ 8 64 512 64 8 8 1 1 1
9 │ 9 81 729 0 0 0 0 0 0
10 │ 10 100 1000 100 10 10 1 1 1
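The column names in the outputs above follow a regular pattern: every product of up to degree columns, with repetition allowed but without counting a*b and b*a separately. The sketch below reproduces that enumeration on plain vectors instead of a DataFrame. The helper polynomial_terms is hypothetical and is not FeatureEng's implementation.

```julia
# Hypothetical helper for illustration; not FeatureEng's implementation.
# Returns name => values pairs for all products of up to `degree` columns.
function polynomial_terms(columns::Vector{Pair{String,Vector{Int}}}, degree::Integer = 2)
    terms = Pair{String,Vector{Int}}[]
    # `start` keeps column indices non-decreasing, so a_b is generated but b_a is not.
    function extend!(name, values, start, depth)
        push!(terms, name => values)
        depth == degree && return
        for j in start:length(columns)
            extend!(name * "_" * columns[j].first, values .* columns[j].second, j, depth + 1)
        end
    end
    for i in eachindex(columns)
        extend!(columns[i].first, columns[i].second, i, 1)
    end
    return terms
end

cols = ["a" => [1, 2, 3], "b" => [0, 1, 0]]
names2 = first.(polynomial_terms(cols, 2))   # → ["a", "a_a", "a_b", "b", "b_b"]
terms3 = polynomial_terms(cols, 3)           # 9 terms, matching the 10×9 output above
```

The number of terms grows quickly with the number of columns and the degree, so it is worth checking the output width before fitting a model on it.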