Hugo Bowne-Anderson, the host of DataFramed, the DataCamp podcast, recently interviewed Marco Blume, Trading Director at Pinnacle Sports.
Here is the podcast link.
Introducing Marco Blume
Hugo: Hi there, Marco, and welcome to DataFramed.
Marco: Yeah, hi Hugo. Thanks for having me.
Hugo: Real pleasure to have you on the show and I’m actually really excited to have you here today to talk about sports betting, how data science plays a huge role in what you do as trading director at Pinnacle, with respect to sports betting. And also the fact that sports betting in your line of work doesn’t only allude to sports, but that at Pinnacle, you do lots of different types of bets. I’m really excited about getting into the weeds there. But before we get to all of that, I want to find out a bit about you and so I’m wondering first what your colleagues would say that you do.
Marco: My colleagues? Risk management. I think that’s probably the best assessment. I’m responsible for managing all the risk that is associated with wagers at Pinnacle, over all sports, live, pre-live. Any aspect of the betting, I manage the risk at Pinnacle.
Hugo: Fantastic. Do you think your colleagues, as you do so much quantitative stuff, have an awareness of the ins and outs of your daily life? Or do they think it’s all, let’s say, textas and whiteboards, or pen and paper, or writing code and building models?
Marco: I mean, it’s a lot of black box for them. I mean, at the end of the day, most of them have different areas of expertise, and inner workings of the trade floor are just too complex and too specific these days, but I think that’s true for most areas if you deep down look and see how much do you actually know about other areas anymore. So, I would say the day-to-day is probably unknown for them, what my actual day-to-day job is.
Pinnacle
Hugo: Yeah, and I think you’re right that due to increasing specialization across so many disciplines, that it is…things do become more black box as we head down that path. So, maybe we can step back a bit, and you can just tell me a bit about what Pinnacle actually does.
Marco: 2018 is our 20th anniversary. We are one of the largest bookmakers in the world, and we are known for being a very efficient bookmaker in terms of pricing. We are considered … Some people compare us to the NASDAQ of prices, meaning that the traditional bookmakers that people know and heard of are usually more in the recreational field of work, and Pinnacle’s actually a true bookmaker. That means we have very low margins, very high limits. Our website is not so flashy, but we have an API that people can interact with. We are like a real, true bookmaker trying to do quantitative analysis of sports events and other events, and allow people to build models against us and place wagers with us.
Hugo: So then, as trading director, what does your day look like? What are the ins and outs of your actual job?
Marco: I mean, it largely depends on season. So, sports is obviously very seasonal. You have big events, like this summer we had the World Cup, which changes my job dramatically. But overall, day-to-day would be sitting down with my managers, maybe going over the week, going over the month, discussing some plans about some products that we want to roll out, discussing some models that we need to test, discussing some of the new strategies we want to try. And overall, it’s like a constant strive to improve our product, and obviously do analysis about things that we tried that didn’t go so well. That’s the bread and butter of my day-to-day.
Hugo: So, how did you get into data science initially?
Marco: By sheer force. So, I was always a math guy, but I was never in data science, and once we started building our quant team out, our quants started to … Before, we used Excel for everything, and then the quants started using R, and they were coding in R, and I quickly picked up that the level of efficiency gain they had over me was orders of magnitude. They could so easily analyze data that was inaccessible to me just because of the natural restrictions of Excel. So, I started the Coursera course, did my lectures there and then started coding in R. And then pretty soon it became a bread-and-butter tool for me. I couldn’t actually believe that I didn’t have the skillset before and still did my job.
Hugo: And which Coursera course was it that you took?
Marco: It’s the very first one. It was the very first course, I think it’s actually the data science track, that’s what it’s called but-
Hugo: That’s Roger Peng and Jeff Leek.
Marco: Yeah, Roger Peng and Jeff Leek, exactly. That’s the original first course I took.
Hugo: I’m in exactly the same position. I spoke with Roger about this on this podcast, that I was actually in one of their first cohorts and maybe you were too, around 2012, 2013, or something like that.
Marco: Yeah, maybe. Around that time, for sure.
Hugo: Yeah.
Marco: Incredibly tough. Because I didn’t come from coding, so for me, this was brand new. I thought it was a really tough course for me, I actually struggled quite a lot. The thing is I knew I had enough expertise in my team that the answers were available to me if I ever had a question. And I knew exactly what I wanted to achieve, so I had a very clear goal in my mind. What did I want to achieve? I wanted to interact with our data directly, I wanted to access our database directly and do analysis over it without the need to ask somebody for a data pull. And then this data pull had some missing columns or missing attributes and need to ask them again, and need to give them to the analysis team. I just wanted to reduce the red tape and be able to be self-sufficient.
Role of Analytics and Data Science for Bookmakers
Hugo: Some people might have a question revolving around the Venn diagram of data science and sports betting. And I’m wondering historically up until now, what the role has been of analytics and data science for bookmakers?
Marco: In bookmaking, you have a few vectors of data analysis, which makes it really interesting. You have the classical sports analytics: how does the sport work? Sabermetrics, for the people who know baseball, is leading in many aspects. But sabermetrics ideas and concepts now exist in almost every other sport, especially soccer, football for Europeans. There’s a high level of analysis done right now. But this is all the field that surrounds sports data analysis. But since we are a trading house, and we actually have a ticket flow coming in and out, we also have the traditional financial analysis of our risk management assessment and basic game theory strategies and all of that stuff in addition. So we have a very nice overlap between those two worlds and have to manage both separately and then mash them together eventually, which is often the hard part.
Hugo: I’m sure. So when I came into this, our first conversation earlier this year, I was under the misapprehension that sports betting was really only about sports. And you opened my eyes. And at Pinnacle, you do all types of bets. So I thought maybe you could run us through a few of the more interesting, to your mind, types of bets you can make inside and outside the sports space at Pinnacle.
Marco: You can bet on literally every single sport that you could possibly imagine and this includes darts and chess and anything you see, obviously e-sports, video sports, very popular with us. But you also have politics, politics is a big betting field. You have a few of the more exotic and fun things. Since you are recording this from New York I believe, we do Nathan’s Hot Dog Eating Contest. I know exactly when Kobayashi was not on top of his game anymore. I remember that. And the pope election was a fun one… it was very interesting to try to price up the pope. The pope election.
Marco: So it’s almost any event in the world. You could even go as far as to do stuff about Game of Thrones. So we have a Game of Thrones prop: who will sit on the Iron Throne at the end of the season? Oscar betting, Golden Globe betting, you name it, literally any event you could possibly think of, you could place a wager on.
Hugo: That’s incredible. Of course, I don’t want you to give up any of your IP here and of course you won’t, but I’m wondering, let’s take hotdog eating contest or Game of Thrones or who’ll be the next pope. I’m wondering how you even, I mean you have the technical skills but in terms of domain expertise, I don’t suppose-
Marco: This here obviously is zero expertise. Let’s talk about the pope, so how does it work? We read columns about populist writers. What do people believe to be the truth? And then we price according to this. We don’t have any inside information, we don’t know anything about it. But we read up a little bit, we try to price as well as we can and then you let market efficiency, with all the crowd effects, shape the price. Game of Thrones obviously, we’re avid fans ourselves, so we speculate ourselves. But we don’t know. We don’t have an inside. We don’t know George R. R. Martin personally or anybody. It’s guessing.
Marco: But to be fair and frank, these are entertainment props to bet on. They are not…in comparison, let’s say on a World Cup game in soccer, you could bet up to $500,000 or a million dollars with us without even questioning. And on these kinds of props the limits are low, maybe $1,000, maybe $5,000. So there is a difference between the level of scrutiny that goes into pricing one or the other.
Hugo: So in terms of pricing them, I suppose, can you talk us through the process from go to woe in the sense that I presume you have some model which ends up with a probability distribution or probability mass function or density function with respect to outcome and then you price according to those distributions. But maybe you can spell that out in a particular example.
Marco: It really depends. It really, really depends. So yes, we obviously have exactly what you just said. In some aspects it might just be market prices, so the market has a price already. What I mean by that is that if you would like to open an exchange that trades Apple stock, you wouldn’t need to do all the analysis yourself. I mean, Apple stock is traded at many, many exchanges. So you have an idea of what the price should be. And that’s the same in sports betting. Many, many bookmakers exist and they work connected.
Marco: But especially when you talk about a live game, you know, we have tons of models running, so you feed the model tons of input. And then it crunches the numbers and spits out something. And there are several layers of models and all kinds of AI, machine learning elements. And it gets very sophisticated depending on the sport and depending on how much betting there is done in the sport. The more betting is done, the more sophisticated we have to be, because the people on the other end are more sophisticated as well.
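The step Hugo describes, going from a probability distribution over outcomes to a price, can be sketched in a few lines of Python. This is only an illustration: the function names, the proportional way the margin is spread across outcomes, and the 2% margin are assumptions, not a description of how Pinnacle prices its markets.

```python
def price_market(probabilities, margin=0.02):
    """Turn model probabilities into decimal odds with a bookmaker margin.

    Hypothetical helper: the margin is spread proportionally over all
    outcomes, which is one simple convention among several.
    """
    total = sum(probabilities.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Inflate each probability so the implied probabilities sum to 1 + margin.
    return {
        outcome: 1.0 / (p * (1.0 + margin))
        for outcome, p in probabilities.items()
    }


def implied_margin(odds):
    """Recover the margin (overround) from a set of decimal odds."""
    return sum(1.0 / o for o in odds.values()) - 1.0


# Made-up model output for a soccer match.
model = {"home": 0.50, "draw": 0.25, "away": 0.25}
odds = price_market(model, margin=0.02)
```

A low-margin book in the sense Marco describes would correspond to a small `margin` value here, so the quoted odds stay close to the fair odds `1 / p`.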
Risk Management
Hugo: So I think once again this speaks to your job of essentially managing risk and could you just say a few more words about what risk management or what managing risk in general amounts to for you, or looks like, or how you think about it?
Marco: Yeah, I mean, probably, risk is always … even if … how do you maximize equity over a probability space? Meaning even if you have a coin flip, if you’re 50-50, you don’t gain equity there, but maybe somebody pays a little bit more and your odds would now lose a little bit of money. How do you hedge yourself against this risk? Can you take it? Are you willing to take it? What happens if you lose? Does it have an impact on the financial bottom line? Are you exposing yourself? And all of these kinds of questions: how do we think long term about managing our book? We are a big company, everything has to stay afloat, and there are a lot of regulations, so you have to be very careful in managing your risk probability and try to balance the book in some aspect. It is not always easy to balance the book, and the way that betting works is often based on news: you know, somebody is injured. And if you’re not on top of that and you don’t pick up trading secrets very, very quickly, you are just overrun by wagers and then you are exposed in a very unfavorable scenario.
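To make "balancing the book" concrete, the following Python sketch computes what a bookmaker would win or lose under each possible result, given the wagers taken so far. The function name, the wager format, and the numbers are all hypothetical; real risk management would also account for limits, correlated markets and much more.

```python
from collections import defaultdict


def book_result_by_outcome(wagers, outcomes):
    """Bookmaker profit (negative = loss) if each outcome wins.

    `wagers` is a list of (outcome, stake, decimal_odds) tuples.
    This format is an assumption made for the example.
    """
    total_stakes = sum(stake for _, stake, _ in wagers)
    payouts = defaultdict(float)
    for outcome, stake, odds in wagers:
        # Decimal odds include the returned stake.
        payouts[outcome] += stake * odds
    return {o: total_stakes - payouts[o] for o in outcomes}


# An unbalanced book: far more money has come in on "home".
wagers = [("home", 1000.0, 1.95), ("away", 400.0, 1.95)]
result = book_result_by_outcome(wagers, ["home", "away"])
worst_case = min(result.values())
```

In this made-up book the bookmaker loses if "home" wins and profits if "away" wins, which is the kind of one-sided exposure Marco describes when news triggers a wave of wagers on one side.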
Hugo: Right, that makes perfect sense. And how does this idea of risk relate to uncertainty in general?
Marco: Uncertainty is … there are a few levels of uncertainty. Obviously you have an inherent uncertainty because it is a sport event, which is a non-perfect environment, so you never know exactly what the other parameters are that matter, but you also have uncertainty variance, meaning some events are naturally just more volatile and less known than other events. Really it has to do with the historic data available. If 2 people had done the same competition one hundred days in a row, you have very, very, very strong data that eventually backs up one percentage or the other.
Marco: And then you have events like, for example, the soccer World Cup, where Germany plays against Uruguay, which has not happened ever, in the sense that these exact teams have never played against each other. And at the end of the NBA season all of the teams have played against each other many, many times over, so you get a very good idea of the relative strengths of the San Antonio Spurs, the Golden State Warriors, the Cleveland Cavaliers. Even though maybe the Cavs and the Spurs have only paired off 2 or 3 times, because of cross relationships you have a very good idea of how strong each team is. But with Germany playing Uruguay you actually have no idea. What does it matter what Germany/Uruguay was 20 years ago? None of the players are on the pitch anymore and the game has changed.
Marco: So you have a lot of different kinds of uncertainty in the gambling world.
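The "cross relationships" point, that you can estimate how two teams compare even if they have rarely or never met, can be illustrated with an Elo-style rating. Elo is a standard technique substituted here purely for illustration; the transcript does not say which rating method Pinnacle uses, and the team names, starting ratings and `k` factor below are made up.

```python
def expected_score(rating_a, rating_b):
    """Elo expected score (win probability) of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(ratings, winner, loser, k=20.0):
    """Shift ratings after one game; total rating points are conserved."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - exp_win)
    ratings[winner] += delta
    ratings[loser] -= delta


ratings = {"A": 1500.0, "B": 1500.0, "C": 1500.0}

# A and C never meet, but both play B repeatedly.
for _ in range(10):
    update(ratings, "A", "B")  # A beats B
    update(ratings, "B", "C")  # B beats C

# Estimate for a matchup that has never been played.
p_a_beats_c = expected_score(ratings["A"], ratings["C"])
```

The estimate for A against C comes entirely from the shared opponent B. With no common opponents at all, as in Marco's Germany and Uruguay example, there is nothing comparable to lean on.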
Hugo: That’s interesting so it sounds like there is a distinction between uncertainty that you can quantify, so that would be risk. And uncertainty that you just don’t know a lot about the situation so you can’t say so much.
Marco: Exactly, you have the known knowns, the known unknowns and the unknown unknowns. It’s very tricky, especially these big events are very, very tricky. And then you notice it: baseball is a prime example, with 180 games per season; at the end of the season you have a very good idea of what the strength is. But with all bookmakers, at the beginning of the season you will notice that the lines are much more volatile. Bookmakers, including us, are much more careful in taking on risk because we don’t believe that the underlying odds are as certain as they will be later. But at the end of the season, if you want to bet against our lines, we are much less willing to adjust our probabilities based on betting behavior and are willing to accept much more risk, just because the certainty has grown so significantly.
Hugo: Right, so you’re saying that when there is more uncertainty, when you have less data, less knowledge about the space, your lines will be more responsive to betting behavior.
Marco: Absolutely, a wager that could move a line in the beginning of the season 3% might move the line 0% or 0.1% at the end of the season. Absolutely. Just because the certainty is there, eventually you know this is the price, everybody has spoken, the entire world has placed a wager, we know the price, we’re willing to take a gamble here.
Hugo: Yeah, and I don’t know if you think about it in this framework. When I hear stuff like this, I think about it in almost a Bayesian sense: you have some sort of prior knowledge about the space, and then once you have more and more data you can update whatever you’re interested in and get a more precise estimate as you keep updating, essentially.
Marco: Bayesian thinking is predominant in our world. Almost everything we do is from a Bayesian point of reference.
Hugo: Yeah, fantastic. And I think as I said this idea of updating. That’s what came to my mind when you were discussing –
Marco: Yeah, I sort of said, if you price anything up, any event, like you see 2 people in the street and you imagine letting them do a 100 meter dash, your initial price might be 50-50, and then you see one guy is actually on crutches, and suddenly the line moves 30%, right, or 40%. Now it’s 90-10, but then you see he throws the crutches away and then you’re like … on and on, but eventually you have a pretty good idea: "oh okay, this guy is overweight, this guy looks fit", and you’re pretty sure about your price, and if somebody tells you it’s the complete opposite you might not believe it anymore. Not believing means, in our language, that we’re willing to take on a lot of risk before we actually get moved over to the new price.
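The pattern Marco and Hugo describe, where the same new evidence moves the line a lot early on and hardly at all once plenty of data is in, is what a simple Bayesian update produces. Here is a minimal Python sketch using a Beta prior expressed as pseudo-counts of wins and losses; the specific counts are invented for illustration and are not taken from the interview.

```python
def posterior_mean(prior_wins, prior_losses, new_wins, new_losses):
    """Posterior mean win probability under a Beta-Binomial model.

    The prior is Beta(prior_wins, prior_losses), read as pseudo-counts.
    """
    a = prior_wins + new_wins
    b = prior_losses + new_losses
    return a / (a + b)


# Same new evidence in both cases: three straight wins.
# Early season: weak prior, worth about 10 games of information.
early = posterior_mean(5, 5, 3, 0)    # 8 / 13

# Late season: strong prior, worth about 160 games of information.
late = posterior_mean(80, 80, 3, 0)   # 83 / 163

early_shift = early - 0.5
late_shift = late - 0.5
```

With the weak prior the estimate moves from 50% to roughly 61.5%, while with the strong prior it moves to only about 50.9%, mirroring the idea that a wager which moves the line noticeably at the start of a season barely moves it at the end.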
Hugo: Right, that makes sense. So we’ve spoken around this, but as we’ve said it, as trading director you think about everything from R&D to odds making to everything related to markets and I’m just wondering if you could speak to how all these different aspects of your job are related. Perhaps speaking through the lens of a particular real or hypothetical project.
Marco: One of our core goals is to improve our models, higher accuracy in our models, but also to open up new betting opportunities to our clients. And more interesting betting options allow people to then hone their own models and first give us liquidity, and then the machine gets rolling. At the end of the day, what we are is a very low margin, high volume bookmaker. A little bit like Walmart: we don’t want to make a lot of money selling orange juice, we just want to make a little bit, but we want to sell a lot of orange juice.
Marco: So the idea is that we want to develop a new product. Let’s take a hypothetical product: I don’t know, how many throw-ins at half time are there going to be in the soccer game? And so you start modeling this, you start putting it out. Some bettor picks up that your model has completely different or wrong assumptions and bets a lot of money on you, then you refine your model, and so on. Until you get a solid market, and once you have this you can roll it out over many leagues. You have to do more and more refining, but eventually you get to a stable product which a lot of clients enjoy betting on, and thus you’ve created a new market that clients like. A new product that clients are interested in betting on. This product would stay mainstream for the next 10-15 years.
Hugo: Interesting, when you speak about this relationship between domain expertise and data science skills it actually made me think of, have you read a book called Super Forecasters or Super Forecasting?
Marco: Yeah, yes of course.
Hugo: For the listener, this is a project by Philip Tetlock and colleagues. And the basic idea is he found certain members of society who are better at forecasting than other people. One thing he does is analyze the characteristics of these people and see what makes them better forecasters than others. So I’m just wondering, Marco, do you try to hire people who are super forecasters or instill this super forecasting culture within your organization, or how do you think about that?
Marco: I’m actually paying them already. I used to call these people an army of consultants, because if you are a super forecaster in my world, what it means is that you are actually better than our models and you can actually predict the outcome better than we can, and thus make a profit by betting. And so what I’m actually doing is consulting you: please, here is my prediction for this event, what is your prediction? And by placing a wager you’re telling me your opinion, which I can then incorporate in my model and use to change my prices, but I have to pay you the price. And so all these people who are great at forecasting anything are basically working with me on a consulting basis.
Hugo: That’s awesome, and I actually just had a thought that it kind of brings this full circle, in the sense that we’ve moved from data analysis and data science to bookmaking to the idea of super forecasting, and something that becomes very apparent in this book is that, you know, you don’t necessarily need super special skills to be able to be a super forecaster, but there are several key aspects, such as being less prone to confirmation bias than other people in the world. Which of course is the hallmark of a great data analyst as well.
Marco: Yeah, so one of the classic training pitches that I used to give for the longest time when I get new trading recruits, and these are all bright people, they are all successful and bright and eager. And obviously they are at the beginning of their career and they want to make a mark, they are young and willing to gamble, and I always tell them: "This is the strategy, this is how it works, and obviously you have to bring in your own feel, but if you ever start gambling with our money..." And then I show them a credit card. "Do it with your own money, and if you’re rich enough buy yourself an island, but you’re not gambling with our money." I try to really separate it for them: you think you might know something because you’re sitting on this side of the table, but if you could, you would sit on the other side of the table maybe.
Marco: On our end it is hard work, it’s hard analytical work. Every day we have to grind, we have to refine our model, we have to get our data. It’s a craft. You have to hone it over many, many years. You don’t just go in like in the classic Vegas movies and know that the spread should be 8 and a half here and the total is going to be 165 points, that’s not how it works at all. Every day we go back in, every day we have people smarter than us, people better than us, outsmarting us in our own game. So we need to improve all the time. I have seen, over many, many years, customers who actually have a lot of talent who then become lazy, and their analysis becomes sloppy, and at that moment they are not winning anymore because somebody else is hungrier than them and knows the numbers better than them, and bets the other way.
Updating Predictions
Hugo: Yeah, absolutely. And the other quality of super forecasters that I just remembered, and speaks to this idea of updating and doing a Bayesian update essentially is that super forecasters are very good at updating their predictions and beliefs with respect to new data coming in as well.
Marco: So the way that training works is that … one of the key aspects in training is that the past is the past. You cannot change past wagers; the only wager that you can change is the next wager. So what you do is you come up with a scenario, almost like a probability tree, and you say: okay, you put the line here, so I expect this to happen 80% of the time, this to happen 20%, and whatever percent, maybe 0.1%, that should happen. And this is basically your tree of probabilities, built from your experience, but if something unexpected happens then you basically have to update your assumptions very, very quickly, because something out of the ordinary has happened, which means all of your assumptions might not be correct anymore.
Marco: And in politics, I hope this word is okay, we always used to call it the ladies of service problem: for every politician, if he were ever found with a lady of service, all the prior work would be almost meaningless. The odds would be abysmal. I mean, obviously Donald Trump might refute this now, but back in the day it was always this problem that politics is so dangerous, because if there’s one character flaw of a politician being revealed, all the analysis before becomes completely meaningless. The odds would shift from 60% to 5% in a matter of seconds, and you have to account for the possibility that somebody gets this information ahead of you. And we’ve seen it many times over the years that somebody has good information about something that is not public yet.
What tools or techniques should aspiring data scientists learn?
Hugo: So we’ve discussed a couple of tools and techniques, from Bayesian inference to mentioning that historically you started out using Excel and then moved to the R programming language. I’m wondering, for people who want to enter this type of space, bookmaking, sports gambling, or these types of prediction challenges in general, what types of tools and techniques in data science would you suggest they learn? You can speak to this from general suggestions to the type of people you want to hire as well. I don’t have a strong preference there.
Marco: So what we look for is the classic R, Python stack. You know, machine learning is highly sought after, and it doesn’t really matter which framework it is. If you know some machine learning framework, we can teach the other. It is a good thing if you have actually done some sports modeling already; it doesn’t matter which sport, as long as you’re familiar with how sports modeling works conceptually. Some of our people do Kaggle competitions, they like that kind of stuff. There are a lot of different ways: you can come from sabermetrics, which is hardcore baseball analytics, but you could also come from a different direction. We have people who in the past were big in poker AI and did a lot of work on game-theoretic approaches there. Because our field is so diverse, you can actually come from a game-theoretic point of view and build game theory trading models here, or you can come from the sports analytics side and build sports analytical models, and there are many different ways you can bring in your creativity and your knowledge to get there. But the classic computer science background is very highly sought after. The alternative is a strong math background: you come from the other end, you’re very proficient at high level math, and now you learn some coding skills and are able to sit down with another guy to develop a proper AI model, and you do the quant stuff and he does the coding stuff.
Hugo: So you mentioned R and Python. Is there a culture of one of these more strongly in your organization than the other?
Marco: So Pinnacle is very heavily reliant on R we are much bigger in R than we are in Python. We do Python, we are not a cult of R in the sense that we feel like we need to use R, we feel that R gives us the best bang for our buck in many aspects. We have also been very active in the R community for a long time, speaking at conferences, we send people to almost every R conference. And it’s a community that embraces the idea about sports betting. We have released data sets into the community, we have worked with members of community to improve packages that we maintain which are free and available for everybody that helps with sports betting. So it’s a great community, so most people who like analytics like sports analytics.
Marco: Sports analytics are great, in the essence the difference between analyzing sports and many other things is that sports have a finite end, you could analyze a game as much as you want, and after 90 minutes or 1 hour or whatever you have the result on the table and now you have the next game to analyze. There is always something happening in sports which makes it very interesting and with betting you have a way of keeping track with your score. Betting is just a way of keeping track of score, how good is your model? The better it is, the more money you make.
Hugo: Absolutely, and as you say there is a final result in sports. Someone wins and someone loses, most of the time.
Marco: Exactly, you know like, this is the big difference between financial trading and sports betting trading. The key big difference is that financial trading is almost infinite, there is not an end to the price of oil. A commodity exists continuously while all these sporting events are discrete events.
Hugo: Yeah, I’m just wondering with all the new technology and deep learning and video analyses and that type of stuff. I know that a lot of basketball for example is captured on film and things people think about doing deep learning analysis on players movements and that type of stuff. Is this something you’ve thought about at all?
Marco: We think about it; we haven’t done OCR analysis. One of our key requirements is that every attribute that we want to use in our models needs to be available live and fast. It doesn’t help us to have a very rich data set of data points that we cannot get while the game is in play at a reasonable speed. And reasonable for us means a maximum of maybe a second or 2 seconds slow. If something is 10 seconds slow, in our world it is basically like yesterday, it doesn’t matter to us. Our world is very fast-paced, so we need to find data points that can be analyzed and can be transmitted to us at a fast pace. This has increased. If you can imagine, 10 years ago, 20 years ago, in most games the only data points that you would get were very superficial, high level ones. In basketball you might get rebounds, steals, points, blocks, the classic 4. But now you get something like dangerous attacks, where a human puts an element of a concept on it and gives you a judgment that helps you understand the game.
Marco: So it has gotten a lot better and now with eventually biometric wear we will eventually get data super fast, super accurate, that we can use. It sounds fantastic to use heart rates, and basically try to see if heart rates matter and all that kind of stuff.
Hugo: Body temperature, amount of sweat, perspiration on forehead, you know. All of these things. I wonder how many dimensions you need to describe these things, I think that’s a cool question. But in terms of processing all this data in real time, I suppose we have this misconception in the cultural consciousness in data science that to run fast code and machine learning models in production they better be in Python, and that is something we’ve been discussing on the podcast recently actually. I’m just wondering if you could tell me about your experience with productionizing R code and efficiency, which I know you’re very interested in.
Marco: We have actually spent the better part of 2 years now thinking exactly about the question of how to productionalize R effectively. So we’ve done a lot of work. The traditional pipeline for us would be to transfer the R code into C/C# for productionalizing, just because of the speed, but we have now found new APIs. We are actually using Plumber in some aspects, which, for the people who know R, is an API interface to productionalize R. In small scale test scenarios it has been working well for us. So we are actually running R code in a production environment and in trading algorithm environments.
Hugo: Fantastic, and actually I had someone on the podcast who talked about using Plumber and R keras together that it worked very well for them.
Marco: Yeah, I mean it’s still a little bit in its infancy, and some of our team members are on the very cutting edge, working together with the guys on the packages to help improve stuff, but it is promising and it allows us to iterate even faster over our models because we don’t have to take this extra step of productionalizing. Traditionally, Python was used in our world when the AI learning models didn’t have a good interface into R. The distinction was: if you wanted data and data analysis you did it in R, but if you wanted to code AI you had to do it in Python. But nowadays, with R being interfaced with all the other big machine learning frameworks, you don’t necessarily need it anymore.
Hugo: And of course you were at the RStudio conference in San Diego this year, and JJ Allaire’s keynote about R keras, and how we’re seeing more and more of R being an interface language to these pretty serious packaging infrastructures, was really cool.
Marco: It was an amazing talk. I mean, I also couldn’t agree more with him that the way they are pushing it is towards an open framework for everything: let’s make R interfaceable with everything and not try to close it up and do it on our own, because then you get into the classic language wars problem about people being in a cult. People should use whatever language they want to use, and R should be able to help them use all the tools that are available in all other languages.
Open Source Community
Hugo: Exactly. So you’ve mentioned several times in various guises how Pinnacle and yourself work with the R community at large, from giving talks at RStudio conferences and other conferences to working with developers on packages. And I’m wondering if you could speak to how important the sense of community in an open source landscape is for you in your job.
Marco: Oh, it’s everything to us, and that’s part of why we liked R so much. At the very beginning, when we had problems with, you know, the classic ODBC packages, in R you have direct access to the guy who made it and you can ask him a question, and if he knows that you know what you are talking about, he actually works with you on your environment to help troubleshoot and then improve the code. To have this concept that the developer of the package actually cares so much about making a bug fix that addresses a tiny problem that might only exist in your configuration, which is ultimately a bug in his code, and to make it better, is amazing. So we’ve done so much work on that. That’s why we eventually decided to release some packages into the world, to tell people: hey, if you want to get into sports betting, these are some tools that we used internally before that might make your life a little bit easier.
Hugo: Cool, and we will definitely link to some of those packages in the show notes as well so interested listeners can check them out. You mentioned earlier that what you referred to was an army of consultants and I love your military metaphors and analogies in general and one of the ones I really love is that you’ve stated that part of your mission is to train an army of new data scientists.
Marco: R is predominant in the R&D team, but Pinnacle is a data-driven organization, and so we had this huge gap between the people who know R and the people who don’t know R. My belief was that, especially with the Tidyverse coming along, there was a path for people who are unqualified in the sense that they have never done computer science before, they have never coded, they don’t have a math background, and they are not technical people either. They are in human resources, maybe they worked in business analysis, or maybe we met people who worked in customer service for years. So we came up with a curriculum based on the Tidyverse, and also on the Master the Tidyverse lessons that I quite enjoyed myself, built with the help of DataCamp and specifically tailored to these people, and the success has been overwhelming.
Marco: We have now trained over 150 people, maybe even more by now. We have R being used in every aspect of our company. It puts a smile on my face when I go around and I see somebody who I know doesn’t come from this background showing me an R Markdown document that he created in a repo and sent to his colleague, which they are going to discuss. It is just amazing to me. The sense of empowerment at every level of the organization is just fantastic. We have a data warehouse where people can access the data, and we have made an interface which makes it very easy to get the data from the data warehouse directly into your R session. We are using all kinds of tools that RStudio provides. We are using RStudio Server, we are using all the tools that they have, and it is amazing to see. The biggest success story is a 45-year-old woman who worked for us for over 15 years as a customer service rep and who is now a full-blown data scientist with us. She is actually doing some phenomenal work, and really, you wouldn’t know that she was in customer service for years. You would have no idea.
Learning Tools vs In-person Training
Hugo: That’s incredible. My next question, you could I suppose answer in the framework of her story or other success stories, I’m wondering about, how you think about the relationship between using platforms such as DataCamp which as you said has been very successful for you and in-person training and how these two can complement each other.
Marco: For us DataCamp was invaluable, we couldn’t have done this without DataCamp. Undoubtedly we needed DataCamp. What we did was complementary training, definitely. We took DataCamp courses, graded them in terms of difficulty, and then put them together in a logical order, which I believe DataCamp now has themselves; they call it tracks, I think. But back in the day they didn’t exist. So we did it ourselves, we built these tracks, and at the end of each track we got together in a group and did group sessions. We brought up very funny problems, we brought up interesting problems that we found. Often a few problems were actually real Pinnacle problems with real Pinnacle data, and we did some analysis of it and showed the people how efficient this can be. What I was trying to sell people was that R is nothing else than power Excel. I tried to take away the fear of being in an interface where you have to type something in. I just tried to bring it down to them in terms of Excel. This is just like Excel, but instead of being able to work on 60,000 data points or 60,000 rows, you can now work on a few million rows without sweating a beat.
Hugo: That’s awesome. I do wonder how things would have been different if it had been called Power Excel. That type of branding for an open source language such as R.
Marco: To be quite fair, prior to the Tidyverse I don’t think it would have been a fair characterization. But if you break down the Tidyverse, you can actually break it down to dplyr and ggplot, and just those two cover 95% of all data work needs. Everything else is specialist in many aspects, but dplyr and ggplot are all you need to do the vast majority of work. So we basically only teach dplyr, ggplot and markdown, because the company is markdown based: we’re sending out markdowns, we’re sending them to each other. And so we’re no longer in this nightmare scenario where somebody sends you data and you don’t know where they pulled it from, you don’t know what filters they put on, you have no idea. Now that we are in markdown, you can look through the code and see exactly what they did, and you can point it out, "you cannot do this, you forgot this in this scenario", and help them right away to do a better job.
Hugo: Absolutely. So in this particular case, your colleague moved from customer support to learning a lot of R using DataCamp and in-person training at Pinnacle, and then towards a data science role. How much math did she need to pick up, or statistics or machine learning or this type of stuff? I understand using dplyr and ggplot2, the classical data analysis side of the data science world, but then there is another step above that, right?
Marco: For sure. I believe we have now given her training on forecasting models and some simpler techniques, just to get her into a different mindset. Obviously she doesn’t have the traditional training, but data analysis is such an important step for many companies, and for many of our areas, where they just lack basic data analysis. So she was productive from, I think, day one. We actually put her in her old environment in the customer service field, and she was writing the reporting framework for the customer team, because she was obviously a subject matter expert, having been on the frontline for years and years and years. A classic example is that we’ve redone our staffing. We realized that some of our local-language-speaking customer service agents were working at the wrong hours. They were not working when, let’s say, Swedish-speaking customers were having questions. So we are now able to match these two datasets with each other and actually optimize our scheduling to service our clients better. So a direct win for our clients and a direct impact from her.
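The staffing example comes down to matching two datasets: when customers of a given language get in touch, and when agents who speak that language are on shift. The sketch below is a minimal illustration in Python with made-up data. The language codes, hours, shift times, and the function name `uncovered_demand` are all hypothetical and are not taken from Pinnacle.

```python
from collections import Counter

# Hypothetical contact log: (language, hour of day) for each customer contact.
contacts = [("sv", 19), ("sv", 20), ("sv", 20), ("sv", 21), ("en", 10), ("en", 11)]

# Hypothetical schedule: hours of the day covered by agents for each language.
shifts = {"sv": range(9, 17), "en": range(9, 17)}


def uncovered_demand(contacts, shifts):
    """Count contacts per (language, hour) that fall outside that language's shift hours."""
    demand = Counter(contacts)
    return {
        (language, hour): count
        for (language, hour), count in demand.items()
        if hour not in shifts.get(language, ())
    }


gaps = uncovered_demand(contacts, shifts)
for (language, hour), count in sorted(gaps.items()):
    print(f"{language} at {hour}:00 -> {count} contact(s) with no matching agent")
```

In this toy data the Swedish contacts arrive in the evening while the Swedish-speaking shift ends in the afternoon, which is the kind of mismatch the matched datasets would surface.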
Hugo: That’s incredible. So you’ve spoken to a lot of different modeling techniques, you’re using forecasting, including a lot of machine learning. I’m wondering, we all still write a lot of code to build machine learning models and that type of stuff, but speaking to your colleague, you know, the work she’s been doing to impact the business, there are a lot of people worldwide who could impact their businesses by developing machine learning models, but may not be able to code. This is kind of a roundabout way of asking you about how you think about machine learning as a service and as a platform going forward. So people in businesses being able to build machine learning models without necessarily writing code.
Marco: It’s starting for us like this for sure. Because some of us build frameworks for others where they actually don’t really know what’s happening under the hood anymore. But underneath the hood there is machine learning. We’re going to see the shift over the next years for sure, and it’s also a good thing. Just because you’re driving a car does not mean you need to know how the combustion engine works. That is not a requirement. You can drive a car perfectly fine, you can do your job with it, without knowing the inner workings of the combustion engine. And this is what we’re going to see with machine learning as well. Other people are going to do it for you.
Hugo: Absolutely, so what we need in place are checks and balances and processes so that when your car busts up it doesn’t explode and kill you right?
Marco: I’ve seen data analysis gone wrong many times over in our company, many times over, quite obviously. The very famous one is, obviously, that the subject matter expert forgets to tell the model that there are certain parameters which cannot exist. A classical thing is that a volleyball game is best of 5, so there cannot be a 6th set, because it’s impossible. But mathematically you could easily forecast this with some density, some Poisson distribution or whatever it is, and you get a distribution for it. That’s the classic thing where you have to train people to tell the model a little bit better what the framework is that you are operating in.
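To make the volleyball point concrete, here is a minimal sketch in Python. It assumes a best-of-five match, so only 3, 4, or 5 sets are possible, and it uses a hypothetical Poisson mean of 4.0 chosen purely for illustration. It is not Pinnacle's model. It only shows how much probability an unconstrained Poisson puts on impossible set counts, and one simple remedy: restrict the distribution to the valid outcomes and renormalize.

```python
import math


def poisson_pmf(k, lam):
    """Probability that a Poisson(lam) variable equals k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def truncated_pmf(lam, support):
    """Poisson probabilities restricted to `support` and renormalized to sum to 1."""
    weights = {k: poisson_pmf(k, lam) for k in support}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


VALID_SETS = (3, 4, 5)  # assumption: best-of-five match
lam = 4.0               # hypothetical mean, not an estimate from real data

# Probability the naive model assigns to set counts that cannot happen.
impossible_mass = 1.0 - sum(poisson_pmf(k, lam) for k in VALID_SETS)
constrained = truncated_pmf(lam, VALID_SETS)

print(f"Naive model mass on impossible outcomes: {impossible_mass:.3f}")
for sets, p in constrained.items():
    print(f"P({sets} sets) = {p:.3f}")
```

Truncation is only one way of encoding the constraint. The broader lesson from the conversation is that the subject matter expert has to tell the model which outcomes exist at all.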
Favorite Data Science Technique
Hugo: And I love that you use the term data analysis gone wrong. I think we should have a segment on the podcast at some point in the future called data analysis gone wrong, nightmare stories. So speaking of data analysis and data science I’m wondering what one of your favorite data analysis techniques or methodologies is?
Marco: My favorite one myself? You know, in my day-to-day, because I’m too far detached from this stuff, I mainly stick to the Tidyverse myself. Pull some data, graph a little bit around, and then dig deep. That’s all I do. When I speak to the actual quants, I think the ones I always like the best are genetic algorithms. I just find them so cute, how they develop and how they stumble around. I just love watching models grow like this: for a long time they have such poor results, and eventually they just explode and produce results which are far greater than you would have ever expected.
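For readers who have not seen a genetic algorithm, the sketch below shows the basic loop in Python: score a population, keep the fitter half as parents, recombine and mutate them, and repeat. It is a toy example on a made-up problem (maximizing the number of 1s in a bit string), with hypothetical parameter values. It does not reflect any model discussed in the interview.

```python
import random


def fitness(bits):
    """Toy objective: the number of 1s in the bit string."""
    return sum(bits)


def evolve(n_bits=30, pop_size=40, generations=60, mutation_rate=0.02, seed=0):
    """Run a small genetic algorithm and return the best fitness seen in each generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best_per_generation.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # selection: keep the fitter half
        children = [population[0][:]]          # elitism: carry the best one over unchanged
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)     # single-point crossover
            child = mum[:cut] + dad[cut:]
            # mutation: flip each bit with a small probability
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = children
    return best_per_generation


history = evolve()
print("best fitness, first generation:", history[0])
print("best fitness, last generation: ", history[-1])
```

Because the best individual is always carried over unchanged, the best score can never get worse from one generation to the next. Watching `history` over the generations is a small version of the model growth Marco describes.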
Hugo: That’s really cool that we’ve got, you’ve mentioned genetic algorithms, Bayesian inference, and Bayesian updating, machine learning models. I presume when thinking about time series forecasting you think about ARIMA models as well. Is there anything else that is kind of the bread and butter of building these kinds of models in your line of work?
Marco: Bayesian inference is huge for us. That’s probably our bread and butter in many aspects, just because we have a classical lack of data. Every game is so short that you never, ever have sufficient data, so Bayesian inference is very important for us. That might be the biggest one that we use.
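The appeal of Bayesian updating when data is scarce can be shown with the simplest textbook case, a Beta prior on a win probability updated with a handful of results. The sketch below is in Python and all the numbers are hypothetical: a prior belief that a team wins about 60% of the time, followed by one win and three losses. It is an illustration of the general technique, not of how Pinnacle prices anything.

```python
def update_beta(alpha, beta, wins, losses):
    """Posterior Beta parameters after observing `wins` and `losses` (Beta-Binomial update)."""
    return alpha + wins, beta + losses


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


prior = (6.0, 4.0)       # hypothetical prior: roughly a 60% win rate, worth about 10 games
wins, losses = 1, 3      # hypothetical small sample

posterior = update_beta(*prior, wins=wins, losses=losses)
raw_rate = wins / (wins + losses)

print(f"prior mean:     {beta_mean(*prior):.2f}")
print(f"raw win rate:   {raw_rate:.2f}")
print(f"posterior mean: {beta_mean(*posterior):.2f}")
```

With only four games observed, the posterior mean lands between the prior belief and the raw win rate. As more results arrive, the data increasingly outweighs the prior.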
Hugo: Just for fun, what are some current bets that you have that you are really excited about or that you find cute or interesting?
Marco: Oh, I actually don’t know, I always like the Game of Thrones one. That’s something that I always like, but we always try to put up some fun bets. Fun bets just come around when some of us talk about an event and then we post some odds, sometimes mainly to see who is right between us as well, you know. We are just guessing ourselves. It is not unheard of that people are betting on Game of Thrones events in the company just because it is fun.
Hugo: And you may not be able to answer this, but who’s your pick to be on the Iron Throne at the end of Game of Thrones?
Marco: I’m going to go with Cersei, maybe.
Hugo: Yeah, right, cool. Has that changed over the past several seasons, or have you been a Cersei stronghold for some time?
Marco: I think Cersei is so evil. Let’s see, we are going to know the truth next year, you know.
Hugo: We definitely will.
Marco: Let me actually check right now, Game of Thrones, I can actually tell you who is the favorite.
Hugo: Please do.
Marco: So the favorite is: we have Jon Snow, Daenerys, oh wow Bran!
Hugo: Oh wow interesting!
Marco: Those are the three favorites.
Call to Action
Hugo: Interesting, I wonder why Tyrion isn’t a favorite, but I think we are digressing now. We can have a Game of Thrones episode when it comes up. So my final question, Marco, is: do you have a final call to action for our listeners out there, something you’d like to see them do or implement moving forward in their data science careers?
Marco: To me it’s just teach people. If you are in data science, help your colleagues become data scientists, empower everybody. It will make your life easier, it’s going to make their life better, it’s going to change their life. Just try to teach, teach, teach as much as you can.
Hugo: I couldn’t agree more. Marco it has been an absolute pleasure having you on the show.
Marco: Yeah it was a pleasure, any time Hugo. If you ever want me again I will be available.
Hugo: Fantastic.
Marco: Alright.