Excel to data science to lead ML engineer
We were joined by Javier Orraca-Deatcu, Lead Machine Learning Engineer at Centene.
Among many topics covered, Javier shared how his background in finance and consulting led to his interest in data science to automate some of his work – and how he helped get other data scientists together in his organization.
(26:31) How did you organize and recruit people for the data science community group at Centene?
“So I sort of piggybacked off a general data science community chat that we had at the company. There were several hundred people on it, of varying backgrounds and expertise levels, so there was a lot of conversation happening. There was already a Python group that was meeting, I think, every other month.
So three weeks after I started, I got really excited about the possibility of potentially creating something similar for R users.
- It started by just trying to figure out who owns that already existing data science chat and see if they could help support the idea of creating an R user group, something to meet once a month or once every two months. At larger companies especially, getting that type of top-level executive stamp of approval and support can go a long way, especially if that individual is part of the already existing IT or data science function.
- At the time, I created a Blogdown site. For those of you who are familiar with R Markdown, Blogdown is a package that allows you to create static websites and blogs with R Markdown. Now with Quarto you can do the same thing and create websites. I love the syntax of Quarto.
- We had partnerships with Posit, so we were able to get some people to come in and do workshops as well.
- We also had reticulate sessions: co-branded Python & R workshops where we looked at ways teams working in different languages could communicate a lot more easily. I had a great experience with it. Everyone was so collaborative, and it was such a great way to see the excitement around what you could do with both R and Python.
I think what started as 13 users the first month jumped to about 100–125 users on this monthly meetup.”
(49:10) On the journey to machine learning engineer, what was the hardest part?
“Because of SQL, I had a really good understanding of at least how tabular data could be joined and the different transformations that could be done to these data objects.
I think I would have really struggled without that basic understanding. But having said that, I think the part where I really struggled at first was function writing. Function writing was not intuitive to me.
Basic function writing was, but in general I found it to be very complicated, and it took a solid three to six months of practice to feel actually comfortable with it.
Even when I started building Shiny apps – basic Shiny is quite easy, but large functions underpin the entirety of a Shiny app. Everything you do within Shiny is effectively writing functions.
The process of learning Shiny and becoming more comfortable with it was very difficult and just took a lot of repetitions, but it all sort of played together.
While people may think of Shiny more as a frontend type of system, it did make me a much better programmer in the way I thought about actual functions and function writing.
Another thing I found hard, looking back (I’m sort of embarrassed to say this), was reproducibility of machine learning – being able to rerun a code set and get the exact same predictions every time.
I wasn’t quite sure why this wasn’t working or how to create these fixed views, setting a seed or whatever you need to do to ensure that someone else downstream could replicate your study or analysis and get the exact same findings themselves.”
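The seed-setting idea Javier describes can be sketched in a few lines. This is only an illustration, not code from the episode: it uses Python's standard library rather than R (where `set.seed()` plays the same role), and the function name `train_and_predict`, the toy data, and the 80/20 split are all hypothetical. The "model" simply predicts the mean of a randomly chosen training split, which is enough to show that a fixed seed makes the random split, and therefore the predictions, repeatable.

```python
import random


def train_and_predict(rows, seed):
    """Toy 'model' (hypothetical): predict the mean target of a random 80% training split."""
    rng = random.Random(seed)      # local, seeded generator; avoids hidden global state
    shuffled = rows[:]             # copy so the caller's data is left untouched
    rng.shuffle(shuffled)          # the only source of randomness in this pipeline
    cut = int(len(shuffled) * 0.8)
    train, test = shuffled[:cut], shuffled[cut:]
    mean_target = sum(y for _, y in train) / len(train)
    # One (input, prediction) pair per held-out row
    return [(x, mean_target) for x, _ in test]


# Made-up data for illustration: 50 rows of (x, y)
rows = [(i, 2.0 * i + 1.0) for i in range(50)]

run_a = train_and_predict(rows, seed=42)
run_b = train_and_predict(rows, seed=42)

# Same data and same seed: the split and the predictions match exactly
print(run_a == run_b)
```

In a real project a seed is usually only part of the picture; pinned package versions and identical input data also tend to matter for someone downstream to get the same findings.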
Featured in this episode
As an analytics and data science engineering leader, my teams develop and deploy to production decision intelligence data products, predictive forecasts, and automated reporting solutions across the Amazon, Google, and Microsoft tech stacks. Examples include developing, testing, and deploying cron jobs, internal packages / libraries, APIs, and R Shiny web apps (modular, interactive web apps for reporting, forecasting, statistical inference, and model performance monitoring). My daily toolkit includes R, GitHub / GitLab, Python, SQL, RStudio Workbench / Connect / Package Manager, GCP BigQuery, AWS S3 / Redshift, Docker, and Excel.