I'm using Julia's DataFrames.jl package. In it, I have a dataframe with a column containing a list of strings (e.g. ["Type A", "Type B", "Type D"]). How does one then perform a one-hot encoding? I wasn't able to find a pre-built function in the DataFrames.jl package.
Here is an example of what I want to do:
Original Dataframe
col1 | col2 |
102 |[a] |
103 |[a,b] |
102 |[c,b] |
After One-hot encoding
col1 | a | b | c |
102 | 1 | 0 | 0 |
103 | 1 | 1 | 0 |
102 | 0 | 1 | 1 |
Bogumił Kamiński :
It is easy enough to do it with basic functions we provide though:

julia> df = DataFrame(x=rand([1:3;missing], 20))
20×1 DataFrame
│ Row │ x       │
│     │ Int64?  │
├─────┼─────────┤
│ 1   │ 1       │
│ 2   │ 2       │
│ 3   │ missing │
│ 4   │ 1       │
│ 5   │ 3       │
│ 6   │ missing │
│ 7   │ 3       │
│ 8   │ 3       │
│ 9   │ 3       │
│ 10  │ 3       │
│ 11  │ missing │
│ 12  │ 1       │
│ 13  │ 3       │
│ 14  │ 3       │
│ 15  │ 3       │
│ 16  │ 1       │
│ 17  │ missing │
│ 18  │ 1       │
│ 19  │ 1       │
│ 20  │ missing │

julia> ux = unique(df.x); transform(df, @. :x => ByRow(isequal(ux)) .=> Symbol(:x_, ux))
20×5 DataFrame
│ Row │ x       │ x_1  │ x_2  │ x_missing │ x_3  │
│     │ Int64?  │ Bool │ Bool │ Bool      │ Bool │
├─────┼─────────┼──────┼──────┼───────────┼──────┤
│ 1   │ 1       │ 1    │ 0    │ 0         │ 0    │
│ 2   │ 2       │ 0    │ 1    │ 0         │ 0    │
│ 3   │ missing │ 0    │ 0    │ 1         │ 0    │
│ 4   │ 1       │ 1    │ 0    │ 0         │ 0    │
│ 5   │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 6   │ missing │ 0    │ 0    │ 1         │ 0    │
│ 7   │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 8   │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 9   │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 10  │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 11  │ missing │ 0    │ 0    │ 1         │ 0    │
│ 12  │ 1       │ 1    │ 0    │ 0         │ 0    │
│ 13  │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 14  │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 15  │ 3       │ 0    │ 0    │ 0         │ 1    │
│ 16  │ 1       │ 1    │ 0    │ 0         │ 0    │
│ 17  │ missing │ 0    │ 0    │ 1         │ 0    │
│ 18  │ 1       │ 1    │ 0    │ 0         │ 0    │
│ 19  │ 1       │ 1    │ 0    │ 0         │ 0    │
│ 20  │ missing │ 0    │ 0    │ 1         │ 0    │

EDIT:
Another example:

julia> df = DataFrame(col1=102:104, col2=[["a"], ["a","b"], ["c","b"]])
3×2 DataFrame
│ Row │ col1  │ col2       │
│     │ Int64 │ Array…     │
├─────┼───────┼────────────┤
│ 1   │ 102   │ ["a"]      │
│ 2   │ 103   │ ["a", "b"] │
│ 3   │ 104   │ ["c", "b"] │

julia> ux = unique(reduce(vcat, df.col2))
3-element Array{String,1}:
 "a"
 "b"
 "c"

julia> transform(df, :col2 .=> [ByRow(v -> x in v) for x in ux] .=> Symbol.(:col2_, ux))
3×5 DataFrame
│ Row │ col1  │ col2       │ col2_a │ col2_b │ col2_c │
│     │ Int64 │ Array…     │ Bool   │ Bool   │ Bool   │
├─────┼───────┼────────────┼────────┼────────┼────────┤
│ 1   │ 102   │ ["a"]      │ 1      │ 0      │ 0      │
│ 2   │ 103   │ ["a", "b"] │ 1      │ 1      │ 0      │
│ 3   │ 104   │ ["c", "b"] │ 0      │ 1      │ 1      │
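For readers who want exactly the layout shown in the question (the list column dropped, level names used directly as column names, and integer 0/1 values), the same idea can be wrapped in a small function. This is only a sketch: `onehot_listcol` is a hypothetical helper name, not something DataFrames.jl provides, and it assumes every entry of the column is a vector of strings and that no level name clashes with an existing column.

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): one-hot encode a column
# whose entries are vectors of labels, dropping the original column.
function onehot_listcol(df::AbstractDataFrame, col::Symbol)
    # Collect every distinct label that appears in any row, in sorted order.
    lvls = sort!(unique(reduce(vcat, df[!, col])))
    # Start from a copy of the frame without the list column.
    out = select(df, Not(col))
    # Add one 0/1 integer column per label.
    for lvl in lvls
        out[!, Symbol(lvl)] = [Int(lvl in v) for v in df[!, col]]
    end
    return out
end

# The data from the question.
df = DataFrame(col1=[102, 103, 102], col2=[["a"], ["a", "b"], ["c", "b"]])
encoded = onehot_listcol(df, :col2)
```

Building the columns in a plain loop rather than with broadcasted `=>` pairs is a readability choice; the `transform` call above does the same work in one expression and keeps the original column.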
2020-10-28T07:20:38
Nils Gudat :
There is indeed no one-hot encoding function in DataFrames.jl - I would argue that this is sensible, as this is a particular machine learning transformation that should live in an ML package rather than in a basic DataFrames package.
You've got two options, I think:

1. Use an ML package that does this for you, e.g. MLJ.jl. In MLJ, the OneHotEncoder is a model that transforms any table with Finite features in it into a one-hot encoded version of itself; see the MLJ docs.

2. Use a regression package that automatically generates dummy columns for categorical variables using the StatsModels @formula API. If you fit a regression with e.g. GLM.jl and your formula is @formula(y ~ x), where x is a categorical variable, the model matrix will automatically be constructed by contrast coding x, i.e. having binary dummy columns for all but one level of x.

For the second option, you ideally want your data to be categorical (although strings will work as well), and for this DataFrames.jl includes the categorical! function (note that later DataFrames.jl releases deprecated categorical!; there you would apply categorical from CategoricalArrays.jl to the column instead).
EDIT 17/11/2021: There has since been a definitive thread on this on the Julia Discourse which contains an extensive list of suggestions for doing one-hot encoding: https://discourse.julialang.org/t/all-the-ways-to-do-one-hot-encoding/
Sharing my favourite from there:

julia> x = [1, 2, 1, 3, 2];

julia> unique(x) .== permutedims(x)
3×5 BitMatrix:
 1 0 1 0 0
 0 1 0 0 1
 0 0 0 1 0
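The broadcast trick works because comparing a column vector with a row vector expands to a full matrix of pairwise comparisons. The snippet above puts levels in rows and observations in columns; swapping the operands gives the more common layout with one row per observation. A minimal sketch using only Base Julia, where the string data and the variable names `lvls` and `onehot` are made up for illustration:

```julia
# Illustrative data (not from the thread): one label per observation.
x = ["a", "b", "a", "c", "b"]

# Distinct labels in order of first appearance.
lvls = unique(x)

# permutedims turns the vector of labels into a 1×3 row, so broadcasting it
# against the 5-element vector x yields a 5×3 BitMatrix
# (rows are observations, columns are levels).
onehot = x .== permutedims(lvls)

# Each row contains exactly one true entry: the level of that observation.
@assert all(sum(onehot, dims=2) .== 1)
```

`permutedims` is used rather than `'` or `transpose` because those are recursive and fail on element types such as `String` that do not support transposition.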
2020-10-28T06:06:14