When building a predictive model in R, many functions (such as lm(), glm(), randomForest, xgboost, or neural networks in keras) require that all input variables be numeric. If your data has categorical variables, you may have to choose between discarding some of your data and adding too many new columns.
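To make the "too many new columns" problem concrete, here is a minimal base R sketch. The data frame, the column name zip, and the choice of 50 possible categories are hypothetical and used only for illustration. It one-hot encodes a single categorical variable with model.matrix() and counts the resulting columns.

```r
# Hypothetical toy data: one categorical predictor with many possible levels.
set.seed(1)
df <- data.frame(
  zip = factor(sample(sprintf("zip_%03d", 1:50), 200, replace = TRUE)),
  y   = rnorm(200)
)

# One-hot (dummy) encoding: "- 1" drops the intercept so that every
# observed level of `zip` gets its own 0/1 indicator column.
one_hot <- model.matrix(~ zip - 1, data = df)

n_levels <- nlevels(df$zip)  # number of distinct categories observed
n_cols   <- ncol(one_hot)    # number of numeric columns created

cat("Levels:", n_levels, " Columns after encoding:", n_cols, "\n")
```

A single variable becomes one column per category, so a variable with thousands of categories would add thousands of sparse columns.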
Categorical embeddings are a relatively new method, drawing on techniques popularized in Natural Language Processing, that helps models solve this problem and can help you understand more about the categories themselves.
While there are a number of online tutorials on how to use Keras (usually in Python) to create these embeddings, this talk will use embed::step_embed(), an extension of the recipes package, to create the embeddings.
Alan Feder is a Principal Data Scientist at Invesco, where he uses as much R as possible to solve problems and build products throughout the company. Previously, he worked as a data scientist at AIG and an actuary at Swiss Re. He studied statistics and mathematics at Columbia University. He is unreasonably excited to spread the word about categorical embeddings. Alan lives in New York City with his wife, Ashira, and two children, Matan and Sarit.