Manual
Categorical Features
Functions for encoding a column of categorical data as multiple binary columns.
FeatureEng.encode_dummy — Method
encode_dummy(column::T[, categories::T[, prefix::String]]) where T <: AbstractArray
Same as encode_onehot except that it drops the first column (to help prevent issues caused by multicollinearity).
See also: encode_onehot, encode_hash
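To see why dropping one column helps, note that a full set of one-hot columns always sums to 1 in every row, which makes it collinear with a model's intercept. The sketch below uses plain Julia vectors and a hypothetical helper, onehot_columns; it is not FeatureEng's implementation.

```julia
# Hypothetical helper (not part of FeatureEng): one Bool vector per category.
onehot_columns(column, categories = sort(unique(column))) =
    [column .== c for c in categories]

data = [3, 1, 2, 4]
onehot = onehot_columns(data)   # indicator columns for categories 1, 2, 3, 4
dummy = onehot[2:end]           # dummy coding: drop the first category's column

# Every row of the full one-hot encoding sums to exactly 1.
rowsums = [sum(col[i] for col in onehot) for i in eachindex(data)]
```

After dropping the first column, a row of all false values identifies the dropped category, so no information is lost.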
FeatureEng.encode_hash — Method
encode_hash(column::T, n_cols::Int = 8, prefix::String = "c") where T <: AbstractArray
Deterministically encode categorical features with high cardinality as a DataFrame with n_cols columns.
julia> data = [1:100;1000;];
julia> encode_hash([1:1_000:10_000;])
10×8 DataFrame
Row │ c1 c2 c3 c4 c5 c6 c7 c8
│ Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼────────────────────────────────────────────────────────
1 │ 1 1 1 1 1 1 1 0
2 │ 1 1 0 1 1 0 1 0
3 │ 1 0 1 0 1 1 0 1
4 │ 1 1 1 1 0 1 1 1
5 │ 1 0 0 0 0 0 1 0
6 │ 1 0 1 0 0 1 0 1
7 │ 0 1 1 0 0 1 1 1
8 │ 1 1 0 1 0 1 0 0
9 │ 0 1 0 1 0 1 1 0
10 │ 0 1 0 0 0 1 1 0
See also: encode_onehot, encode_dummy
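One way to obtain a fixed number of binary columns from arbitrary values is to keep the low bits of a hash. The sketch below illustrates that idea with Base.hash. Whether encode_hash derives its columns this way is an assumption, hash_bits is a hypothetical name, and the bits produced here are not expected to match the table above.

```julia
# Hypothetical sketch: use the low n_cols bits of Base.hash as 0/1 features.
hash_bits(value, n_cols = 8) =
    [Int((hash(value) >> (i - 1)) & 1) for i in 1:n_cols]

# One 8-element row per input value.
rows = [hash_bits(v) for v in 1:1_000:10_000]
```

The number of columns stays fixed no matter how many distinct categories appear, at the cost of possible collisions between categories.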
FeatureEng.encode_onehot — Method
encode_onehot(column::T[, categories::T[, prefix::String]]) where T <: AbstractArray
Converts a categorical column into a DataFrame of one-hot-encoded columns, with one binary-encoded column per unique value in column.
Examples
The basic version of this function makes a column for each unique value in column.
julia> data = [3,1,2,4];
julia> encode_onehot(data)
4×4 DataFrame
Row │ 1 2 3 4
│ Bool Bool Bool Bool
─────┼────────────────────────────
1 │ false false true false
2 │ true false false false
3 │ false true false false
4 │ false false false true
You can also specify a prefix for each column.
julia> data = [3,1,2,4];
julia> encode_onehot(data,"col_")
4×4 DataFrame
Row │ col_1 col_2 col_3 col_4
│ Bool Bool Bool Bool
─────┼────────────────────────────
1 │ false false true false
2 │ true false false false
3 │ false true false false
4 │ false false false true
Additionally, you can specify the categories to convert to columns, regardless of whether they exist in column.
julia> data = [3,1,2,4];
julia> encode_onehot(data,[1:6;],"c")
4×6 DataFrame
Row │ c1 c2 c3 c4 c5 c6
│ Bool Bool Bool Bool Bool Bool
─────┼──────────────────────────────────────────
1 │ false false true false false false
2 │ true false false false false false
3 │ false true false false false false
4 │ false false false true false false
See also: encode_dummy, encode_hash
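The effect of explicit categories and a prefix can be sketched in plain Julia. The helper onehot_sketch is hypothetical: it returns column names and Bool vectors instead of a DataFrame, and it is not FeatureEng's implementation.

```julia
# Hypothetical helper: column names paired with Bool indicator vectors.
function onehot_sketch(column, categories = sort(unique(column)), prefix = "")
    colnames = [string(prefix, c) for c in categories]
    indicators = [column .== c for c in categories]
    return colnames, indicators
end

data = [3, 1, 2, 4]
colnames, indicators = onehot_sketch(data, 1:6, "c")
```

Categories 5 and 6 never occur in data, so their columns are all false. Fixing the categories up front keeps the column layout identical between training data and new data.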
DateTime Features
Functions for extracting helpful information from a column of DateTime data.
FeatureEng.extract_date_features — Method
extract_date_features(datetimes::T) where T <: AbstractArray{<:Union{Date,DateTime}}
Extract a DataFrame of features from an array of DateTime or Date objects. Features extracted:
year: Year from datetime
month: Month from datetime
dayofmonth: Day of the month (1-31)
dayofweek: Day of the week (ordered)
isweekend: Is datetime a weekend?
quarter: The quarter from datetime
Examples
julia> data = strp_datetimes([
"2021-01-27 14:03:25",
"1999-10-05 01:13:43",
"2010-06-11 11:00:00"
]);
julia> extract_date_features(data)
3×6 DataFrame
Row │ year month dayofmonth dayofweek isweekend quarter
│ Int64 Cat… Int64 Cat… Bool Int64
─────┼───────────────────────────────────────────────────────────
1 │ 2021 January 27 Wednesday false 1
2 │ 1999 October 5 Tuesday false 4
3 │ 2010 June 11 Friday false 2
See also: extract_datetime_features, extract_time_features
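These columns map closely onto accessors in Julia's Dates standard library. A minimal sketch, assuming plain vectors in a NamedTuple rather than the DataFrame with ordered categorical columns that extract_date_features returns. The weekend rule (Saturday or Sunday) is also an assumption.

```julia
using Dates

dts = [
    DateTime(2021, 1, 27, 14, 3, 25),
    DateTime(1999, 10, 5, 1, 13, 43),
    DateTime(2010, 6, 11, 11, 0, 0),
]

features = (
    year       = year.(dts),
    month      = monthname.(dts),
    dayofmonth = day.(dts),
    dayofweek  = dayname.(dts),
    isweekend  = dayofweek.(dts) .>= 6,   # Dates.dayofweek: Monday = 1, Sunday = 7
    quarter    = quarterofyear.(dts),
)
```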
FeatureEng.extract_datetime_features — Method
extract_datetime_features(datetimes::T) where T <: AbstractArray{<:DateTime}
Extract a DataFrame of features from an array of DateTime objects. Features extracted:
year: Year from datetime
month: Month from datetime
dayofmonth: Day of the month (1-31)
dayofweek: Day of the week (ordered)
isweekend: Is datetime a weekend?
quarter: The quarter from datetime
hour: Hour of the day from datetime
minute: Minute from datetime
second: Second from datetime
isAM: Is time AM (vs PM)?
The same as the following:
julia> hcat(
extract_date_features(datetimes),
extract_time_features(datetimes)
)
Examples
julia> data = strp_datetimes([
"2021-01-27 14:03:25",
"1999-10-05 01:13:43",
"2010-06-11 11:00:00"
]);
julia> extract_datetime_features(data)
3×10 DataFrame
Row │ year month dayofmonth dayofweek isweekend quarter hour minut ⋯
│ Int64 Cat… Int64 Cat… Bool Int64 Int64 Int64 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 2021 January 27 Wednesday false 1 14 ⋯
2 │ 1999 October 5 Tuesday false 4 1 1
3 │ 2010 June 11 Friday false 2 11
3 columns omitted
See also: extract_date_features, extract_time_features
FeatureEng.extract_time_features — Method
extract_time_features(datetimes::T) where T <: AbstractArray{<:Union{Time,DateTime}}
Extract a DataFrame of features from an array of DateTime or Time objects. Features extracted:
hour: Hour of the day from datetime
minute: Minute from datetime
second: Second from datetime
isAM: Is time AM (vs PM)?
Examples
julia> data = strp_datetimes([
"2021-01-27 14:03:25",
"1999-10-05 01:13:43",
"2010-06-11 11:00:00"
]);
julia> extract_time_features(data)
3×4 DataFrame
Row │ hour minute second isAM
│ Int64 Int64 Float64 Bool
─────┼───────────────────────────────
1 │ 14 3 25.0 false
2 │ 1 13 43.0 true
3 │ 11 0 0.0 true
See also: extract_datetime_features, extract_date_features
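For comparison, the time-of-day columns can be approximated with the Dates standard library. This is a sketch, not FeatureEng's implementation: it assumes isAM means an hour before 12, and it yields integer seconds where the output above shows Float64.

```julia
using Dates

dts = [
    DateTime(2021, 1, 27, 14, 3, 25),
    DateTime(1999, 10, 5, 1, 13, 43),
    DateTime(2010, 6, 11, 11, 0, 0),
]

hours   = hour.(dts)
minutes = minute.(dts)
seconds = second.(dts)
is_am   = hours .< 12   # assumption: hours 0-11 count as AM
```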
FeatureEng.get_month — Method
get_month(datetimes::T) where T <: AbstractArray{<:Union{Date,DateTime}}
Return an ordered CategoricalArray of month names extracted from datetimes.
Examples:
julia> data = strp_datetimes([
"2021-01-27 14:03:25",
"1999-10-05 01:13:43",
"2010-06-11 11:00:00"
]);
julia> get_month(data)
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"January"
"October"
"June"
See also: extract_datetime_features, extract_date_features, get_weekday
FeatureEng.get_weekday — Method
get_weekday(datetimes::T) where T <: AbstractArray{<:Union{Date,DateTime}}
Return an ordered CategoricalArray of weekday names extracted from datetimes.
Examples:
julia> data = strp_datetimes([
"2021-01-27 14:03:25",
"1999-10-05 01:13:43",
"2010-06-11 11:00:00"
]);
julia> get_weekday(data)
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"Wednesday"
"Tuesday"
"Friday"
See also: extract_datetime_features, extract_date_features, get_month
FeatureEng.strp_datetimes — Method
strp_datetimes(datetimes::T, format::Union{String,DateFormat} = "y-m-d H:M:S") where T <: AbstractArray{<:AbstractString}
Convert an array of timestamp strings to an array of DateTime objects.
Any strings that cannot be parsed are replaced with missing.
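The parse-or-missing behavior can be sketched with tryparse from the Dates standard library, which returns nothing for unparseable input. The helper parse_or_missing is a hypothetical name, and this is not necessarily how strp_datetimes is implemented.

```julia
using Dates

# Hypothetical helper: parse each string, mapping failures to `missing`.
function parse_or_missing(strings, format = dateformat"y-m-d H:M:S")
    map(strings) do s
        parsed = tryparse(DateTime, s, format)
        parsed === nothing ? missing : parsed
    end
end

parsed = parse_or_missing(["2021-01-27 14:03:25", "abcdefg"])
```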
Examples
julia> date_strings = [
"2021-01-27 14:03:25",
"1999-10-05 01:13:43",
"abcdefg"
];
julia> strp_datetimes(date_strings)
3-element Array{Union{Missing, DateTime},1}:
2021-01-27T14:03:25
1999-10-05T01:13:43
missing
Numeric Features
Working with numeric features.
Numeric – Binning Features
Converting continuous data to categorical data.
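The binning methods are listed here without docstrings, so the following is a generic sketch of binning rather than FeatureEng's API. It assumes sorted bin edges and left-closed bins; bin_index and the age labels are hypothetical.

```julia
# Hypothetical helper: index of the bin each value falls into, given sorted edges.
# Bin i covers [edges[i], edges[i+1]); the last bin is open-ended.
bin_index(values, edges) = [searchsortedlast(edges, v) for v in values]

ages   = [3, 17, 18, 42, 70]
edges  = [0, 18, 65]
labels = ["child", "adult", "senior"]

binned = labels[bin_index(ages, edges)]
```

Values below the first edge would get index 0 and need separate handling before being used to index into labels.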
FeatureEng.apply_transform — Method
FeatureEng.apply_transform — Method
FeatureEng.fit_transform! — Method
FeatureEng.fit_transform! — Method
Numeric – Scaling Features
Scaling or normalizing numeric columns.
A helpful pre-processing step for ML models that are sensitive to data scale (e.g., k-means clustering, regularized regression).
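The scaling methods are listed here without docstrings, so the following is a generic sketch of standardization (z-scores) rather than FeatureEng's API. It assumes the usual split between fitting parameters on training data and applying them to other data; standardize is a hypothetical name.

```julia
using Statistics

# "Fit": learn the scaling parameters from the training data only.
train = [2.0, 4.0, 6.0, 8.0]
μ, σ = mean(train), std(train)

# "Apply": reuse those parameters on any data, including unseen values.
standardize(x) = (x .- μ) ./ σ

train_scaled = standardize(train)
test_scaled  = standardize([5.0, 10.0])
```

Reusing the training mean and standard deviation on new data keeps both sets on the same scale.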
FeatureEng.apply_transform — Method
FeatureEng.apply_transform — Method
FeatureEng.apply_transform — Method
FeatureEng.fit_transform! — Method
FeatureEng.fit_transform! — Method
FeatureEng.fit_transform! — Method
Numeric – Transforming Features
Power transformations for numeric data.
Helpful for data whose distribution doesn't suit the model you're using (e.g., log-transforming data drawn from an exponential distribution before linear regression).
FeatureEng.transformBoxCox — Method
transformBoxCox(data::T, λ::Real = 0.0) where T <: AbstractArray{<: Number}
Box-Cox power transformation, defined by the following function:
\[y_i^{(\lambda)} = \begin{cases} \dfrac{y_i^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0, \\ \ln y_i & \text{if } \lambda = 0. \end{cases}\]
Examples:
julia> data = [0:5;];
julia> transformBoxCox(data)
6-element Array{Float64,1}:
-Inf
0.0
0.6931471805599453
1.0986122886681098
1.3862943611198906
1.6094379124341003
julia> transformBoxCox(data,.1)
6-element Array{Float64,1}:
-10.0
0.0
0.7177346253629313
1.1612317403390437
1.486983549970351
1.7461894308801895
julia> transformBoxCox(data,1)
6-element Array{Float64,1}:
-1.0
0.0
1.0
2.0
3.0
4.0
See also: transformLog, transformRoot
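The piecewise definition translates directly into a scalar function. The sketch below uses a hypothetical name, boxcox, and applies it element-wise with broadcasting; it is not FeatureEng's implementation.

```julia
# Scalar Box-Cox transform: natural log when λ == 0, scaled power otherwise.
boxcox(y, λ = 0.0) = λ == 0 ? log(y) : (y^λ - 1) / λ

data = 1:5
transformed = boxcox.(data, 0.1)

# The λ = 0 branch is the limit of the power branch as λ approaches 0.
near_zero = boxcox(2.0, 1e-8)
```

This limit is why the logarithm is used at λ = 0: the transformation stays continuous in λ.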
FeatureEng.transformLog — Method
transformLog(data::T, base::Real = ℯ) where T <: AbstractArray{<: Number}
Log transform data using log-base, base.
Examples:
julia> data = [0:5;];
julia> transformLog(data)
6-element Array{Float64,1}:
-Inf
0.0
0.6931471805599453
1.0986122886681098
1.3862943611198906
1.6094379124341003
julia> transformLog(data,2)
6-element Array{Float64,1}:
-Inf
0.0
1.0
1.5849625007211563
2.0
2.321928094887362
See also: transformRoot, transformBoxCox
FeatureEng.transformRoot — Method
transformRoot(data::T, index::Real = 10) where T <: AbstractArray{<: Number}
Root transform data using root index, index.
Examples:
julia> data = [0:5;];
julia> transformRoot(data)
6-element Array{Float64,1}:
0.0
1.0
1.0717734625362931
1.1161231740339044
1.148698354997035
1.174618943088019
julia> transformRoot(data,2)
6-element Array{Float64,1}:
0.0
1.0
1.4142135623730951
1.7320508075688772
2.0
2.23606797749979
See also: transformLog, transformBoxCox
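A root transform with index n is the same as raising each value to the power 1/n. A one-line sketch, where nth_root is a hypothetical name rather than part of FeatureEng:

```julia
# n-th root as a fractional power; the default index of 10 mirrors the signature above.
nth_root(x, index = 10) = x^(1 / index)

roots = nth_root.(0:5, 2)   # square roots of 0 through 5
```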
Numeric – Interaction Features
Calculate polynomial features to a specified degree before performing something like polynomial regression.
FeatureEng.polynomial — Method
polynomial(df::DataFrame, degree::T = 2) where T <: Integer
Calculate polynomial interaction terms between columns in a DataFrame.
If you have a DataFrame with 3 columns: x, y, and z, you can get degree-2 polynomial interaction terms: x*x, x*y, x*z, y*y, y*z, and z*z.
Examples
julia> using DataFrames
julia> df = DataFrame(a=1:10,b=repeat(0:1,5))
10×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 0
2 │ 2 1
3 │ 3 0
4 │ 4 1
5 │ 5 0
6 │ 6 1
7 │ 7 0
8 │ 8 1
9 │ 9 0
10 │ 10 1
julia> polynomial(df,2)
10×5 DataFrame
Row │ a a_a a_b b b_b
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 1 1 0 0 0
2 │ 2 4 2 1 1
3 │ 3 9 0 0 0
4 │ 4 16 4 1 1
5 │ 5 25 0 0 0
6 │ 6 36 6 1 1
7 │ 7 49 0 0 0
8 │ 8 64 8 1 1
9 │ 9 81 0 0 0
10 │ 10 100 10 1 1
julia> polynomial(df,3)
10×9 DataFrame
Row │ a a_a a_a_a a_a_b a_b a_b_b b b_b b_b_b
│ Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────────────────────────────────
1 │ 1 1 1 0 0 0 0 0 0
2 │ 2 4 8 4 2 2 1 1 1
3 │ 3 9 27 0 0 0 0 0 0
4 │ 4 16 64 16 4 4 1 1 1
5 │ 5 25 125 0 0 0 0 0 0
6 │ 6 36 216 36 6 6 1 1 1
7 │ 7 49 343 0 0 0 0 0 0
8 │ 8 64 512 64 8 8 1 1 1
9 │ 9 81 729 0 0 0 0 0 0
10 │ 10 100 1000 100 10 10 1 1 1
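The interaction terms correspond to combinations of columns with replacement: each unordered group of columns is multiplied together exactly once. The sketch below enumerates those combinations in plain Julia for the degree-2 terms only. The helper combos and the Dict layout are assumptions for illustration, not FeatureEng's implementation.

```julia
# Hypothetical helper: all non-decreasing index vectors of length k drawn from 1:n,
# e.g. combos(2, 2) gives [1, 1], [1, 2], [2, 2].
function combos(n, k)
    k == 0 && return [Int[]]
    result = Vector{Int}[]
    for i in 1:n, rest in combos(n, k - 1)
        # keep indices non-decreasing so a*b and b*a are generated only once
        (isempty(rest) || i <= rest[1]) && push!(result, [i; rest])
    end
    return result
end

colnames = ["a", "b"]
columns  = [collect(1:10), repeat(0:1, 5)]

# Map each term name (such as "a_b") to the element-wise product of its columns.
terms = Dict(
    join(colnames[idx], "_") => reduce((u, v) -> u .* v, columns[idx])
    for idx in combos(2, 2)
)
```

With three columns, combos(3, 2) yields six index pairs, matching the six degree-2 terms listed for x, y, and z.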