Full Stack Data Science (Transcript)
Here is a link to the podcast.
Introducing Vicki Boykis
Hugo: Hi there, Vicki, and welcome to DataFramed.
Vicki: Thank you so much for having me.
What are you known for in the data science community?
Hugo: It’s an absolute pleasure to have you on the show. I’m really excited to talk about your work in Python education, full stack data science, end-to-end data science, what these things actually mean, and your work in consulting. Before we get into all of that, I’d love to know a bit about you. I’m wondering what you’re known for in the data community.
Vicki: Probably first and foremost, terrible puns and memes about all sorts of data and programming related things. Secondary is the content. My strategy is a little bit like BuzzFeed, right? Hit them with the memes and then sneak in serious content in between.
Vicki: I’ve written a lot of blog posts about how to do specific things in Python, how to do specific things in data, and then just talking about like where we are in the data community in general, so very high level articles, and talking about things that break down complicated concepts into easy to understand analogies.
Hugo: Fantastic. I love that secondary is the content and that primary are terrible puns and memes. I don’t mean to put you on the spot, but what’s one of the worst puns you’ve said or come up with or heard?
Vicki: They’re all so terrible. I have this series of puns where it’s basically me pretending to talk to a TV producer to pitch them on possible shows or movies, and so that series is a pretty terrible series of tweets.
Hugo: We’ll definitely link to that in the show notes. That’s primary. Secondary is the content. I thought I would just mention that you’re also, in terms of content, in the process of creating a DataCamp course.
Vicki: Yeah, that’s right. I’m working on a course that teaches object-oriented programming for Python, specifically in the context of a data setting. I’ll be going through how to create objects and do manipulations with CSV files and digging into NumPy and pandas internals, so I’m pretty excited about that.
Hugo: Fantastic. Something you’ve also mentioned previously is that the educational stuff you do now is that you’re essentially being the person you needed when you started out.
Vicki: Yep. Yeah, so the internet is a pretty big place and there’s a lot of resources, but if you’re just learning to program or you’re just getting into data science, the best thing you can do is have an in-person mentor or someone who’s ahead of you who you can ask questions. I didn’t really have that person when I just started out, so my goal is to be that person for people just getting into the field.
Hugo: Fantastic. Actually, DataCamp itself has a similar origin story in that the CEO, our CEO Jonathan Cornelissen, when he was at grad school he was looking for something like DataCamp and just couldn’t find it. He was like, “Okay, when I finish grad school I’m going to make this thing,” essentially.
Vicki: Yes.
What do you do professionally?
Hugo: That’s one of our several origin stories. That having been said, can you tell us what you do professionally at the moment?
Vicki: Yep, so I am a consultant. I work for CapTech Consulting. We do a bunch of different stuff. Part of our company deals with management consulting, and part of it is deeply technical consulting practice. Right now I do both data science and data engineering consulting depending on the project scope.
Hugo: That sounds very much like this idea of full stack data science, right?
Vicki: Yeah, so the idea is that a lot of companies will start out by not having the infrastructure set up to do data science, because really data science is kind of a mature product offering. We’ll come in, we’ll build out those pipelines, and then we’ll get to the data science aspect, which is creating the models and presenting those results.
Hugo: Great, and we’ll get to more of that later. In particular, I’m really interested in thinking about this job of building out the pipelines, doing that, but at the same time needing to demonstrate value as quickly as possible within an organization. This is something … That’s a little teaser for some things we’ll chat about later.
How did you get into data science?
Hugo: Before we get to that, though, data science is interesting because so many people have different avenues, all roads lead to data science in some sense. I’m wondering what your journey was. How did you get into data and data science, originally?
Vicki: I think I come from kind of a nontraditional, kind of a traditional background. It’s kind of in the middle. I started out as an undergrad who majored in economics, and the reason I picked that was because I didn’t want to be an English major and I didn’t want to be a math major, and I like that econ kind of combined the two. I like using both sides of my brain a lot. That was my undergrad degree.
Vicki: Then after that, I actually got into economic consulting, which was pretty rare because I don’t know a lot of people who focus on their major out of undergrad, so I guess I was lucky, or maybe unlucky in that sense. That’s where I got tuned into doing stuff with data. Usually when you start right out of college you start doing stuff with spreadsheets, so I started doing stuff with spreadsheets. Then I heard about this new cool programming language that was free that was called R. I got exposed to that a little bit. I had a couple of roles that were analytics-based. Then my last role was as a data analyst where I learned SQL.
Vicki: Then I got tired of waiting for data to come into the SQL database for me, which is when I started really focusing on learning programming with Python and statistical methods, and then I became a data scientist as my next position. At the same time, I decided that I also wanted to get an MBA because I was interested in technical leadership. I actually don’t have a statistics or development background in terms of a Master’s program, but I kind of came to it through the job field.
Hugo: That’s really interesting. Because a lot of people I speak to when thinking about advice to give to aspiring data scientists is one of the most important skills isn’t to be able to build a thousand layer recurrent neural network, but to be able to learn on the job and pick up new skills as you go along, and it sounds like that was an integral part of your journey.
Vicki: Yeah, I think that’s been really important for me the entire time in figuring out what to learn, because there’s just so much to learn in data science. In consulting, that’s one of the primary skills as well because you never know what kind of environment you’re going to come into or what the client needs are going to be. Learning and a broad set of skills.
Hugo: Great. I’m just wondering, with your background in economics and your MBA, how do these play into your work as a data scientist in general? Do you find skills and tools you’ve developed and ways of thinking in economics and your MBA useful in your work in data science?
Vicki: Yeah, so economics and econometrics are actually pretty close to data science, and I think that’s probably partly where data science came from. There’s a lot of hypothesis testing, for example. There’s a lot of statistics and econometrics that goes on. There’s a lot of the social science aspect, where you have a hypothesis about how especially large scale systems would work, and that’s what a lot of data scientists these days do, right? They test large scale social systems like social networks or platforms to see how things will perform, so that’s part of it.
Where do you see data science having the biggest impact?
Hugo: Let’s talk about your work in consulting. I presume you work across a variety of different industries, but which verticals do you see data science having the most impact on, in your experience?
Vicki: This is going to be a really consulting-y answer, but it really depends, and it’s really a broad, broad variety of verticals. The ones that I focus on in my consulting career so far have been telecommunications, banking, and healthcare. Data science has an impact or a place in all of them as long as it’s implemented correctly and as long as the business believes in data and sees it as a priority.
Hugo: What type of challenges have you found in demonstrating the value of data science across these industries?
Vicki: A lot of the times … so we’ll probably get to this later, but a lot of the times it’s even building out that pipeline to get to the point where you can do data science, but a lot of the times, especially in larger companies, so my company deals a lot primarily with Fortune 500 companies, is getting to the point where you can demonstrate that your hypothesis or whatever it is that you said to do, your call to action, actually results in a change in the business.
Hugo: Great. Are you able to give any specific examples? I don’t mean the names of companies or anything like that, but specific examples in telcos, bank, or healthcare, of actual data science projects?
Vicki: A lot of projects … so this has been true for every industry I’ve been in. Every company wants to be able to measure churn or why customers are leaving or joining their platform, and especially tracing why customers are unhappy. For larger companies, this might result in an enormous number of features, not all of which you can control. For example, the signup process, the billing process, issues they’ve had with your service or with their service, outside people that have approached them. You can create a model of what potentially causes customers to churn, but that might not necessarily be reflective of the real world. I think that ties back to econometrics, too, because in econometrics you’re trying to create a model of the entire economy, but what you really have is a representation because you can’t trace all of it.
Hugo: Yeah, great. This is a great example and something I’ve actually been thinking about a lot recently and talking about this morning, in fact, the churn example in particular, the potential for customers to take their business elsewhere, is the intersection between data science and decision science. Because you can build a model that may tell you or approximate what’s happening in the world as to why customers are churning, but that doesn’t tell you what to do, right?
Vicki: Right, so ultimately it’s for the data scientist, in my opinion, to present a number of options, to present clearly what they think their view of the company is, and then a way for the company to move forward. That’s kind of the point where we hand that off to a client. We’ll recommend a couple of options, but we obviously won’t say, “Here’s what you have to do.”
Hugo: Great. In the churn case, I can imagine several courses of action. The first would be to, if you think a customer’s going to churn, reach out to them and make them some sort of offer dependent on how valuable a customer they are to your company. Another would be to try to nip it in the bud well before they’re going to churn. Are these the types of suggestions that you make or are there others as well?
Vicki: Yeah. Usually it’s preventive, or you can change it when they’re about to churn, or you can create preventive measures so that they can channel their frustration somewhere, for example, new support channels.
What are the most common patterns in data science?
Hugo: Great. In your work across all these industries, what are common patterns you’ve seen in data science across them?
Vicki: One is that, I think we’ve heard this a lot, but getting the data to a point where you can actually do data science is always 80% of the work. Usually when we come into a company, a lot of the work will be getting the data to a point where we can do data science. The selection of tools and understanding what everybody else in the industry is doing. Kind of this need for understanding best practices. Are we picking the right tool? Is this what other people in the industry are doing? Is this what people in our industry are doing? Or people who are interested in having data science making the case that we need someone to come in and help us do this data science practice, that we actually need data science, that we actually need help making these decisions. Those are probably the big ones.
Hugo: Interesting. There are actually a lot of things that spring to mind there. The first I want to zoom in on is a lot of it’s data preparation, getting into a form where you can do analytic or data science work with it. This amount of preparation you have to do, do you see this changing in the next 2, 5, 10 years? Will these types of things become more and more automated and hopefully productized?
Vicki: Some of that, but ultimately I think it’s just the feature of data. Because usually unless you’re working in manufacturing or some related field, what you have is you have humans generating the data, making sense of the data, defining how it’s going to be in a business sense, and that kind of data is always going to be messy. Especially across larger organizations where you might have 5 or 10 or even 20 different data flows. Sometimes you have 2 data flows. They’re exactly the same, but just with a little bit of difference. That kind of reconciliation is always going to be existing.
Vicki: What I do see happening more and more lately is a lot of organizations are calling for more data governance. More metadata management is becoming increasingly important in larger organizations. I think over the last 4 years or so, the push was to get stuff into a data lake. It doesn’t matter how. It just needs to all be in one place so we can do something with it. Now the idea is we want to be able to manage our assets in a data lake. We need to be able to see them, represent them, and have the business be able to inventory like an S3 bucket or Hadoop cluster or something like that.
Hugo: Great. The other thing you mentioned that I’d like to discuss is you mentioned kind of the movement towards figuring out best practices for the industry, what other people are doing. I wanted to discuss this kind of in the sense that it appears to me that a lot of people … a lot of data science work is occurring in silos across many different consulting groups, many different organizations, and that a lot of people seem to be reinventing the wheel in parallel in a lot of ways. Is that something you’ve seen as well?
Vicki: Yeah, I think that can definitely be true. What I’ve seen in a couple of my projects that were really successful is the organization or the client was dedicated to centralizing all of this stuff. What I’ve seen come up in larger organizations is something called a Center of Excellence where you have cross functional teams. You have engineers, you have data analysts, you have data scientists, and they all meet together to talk about what they’re doing as a team. I’ve seen that kind of structure come up more and more recently.
What is the most effective data science team structure?
Hugo: Is this the type of structure within an organization for data science teams that you think is the most effective?
Vicki: I think so. I’m a big proponent of always having all the stakeholders of any given data science project in the room, if it’s practical. For example, if you have maybe 200 people that are going to impact, probably not, but I really always push for developers to sit with data analysts, and more importantly, with business users. Because usually the developers are the first part of the process, and the business users are all the way down there. It can be like a game of telephone where the developers build something, that gets put into some warehouse, that gets put into a dashboard. By the time it’s built, the business users don’t necessarily always want it and can’t act on it. I always like to have all those people all in the same room.
Hugo: What do you think about the future of, I suppose, data literacy for business users? Will we increasingly see people in management, C-level, people using dashboards become more and more knowledgeable about what data is and how it works?
Vicki: I think so. I’m really optimistic about that, and not just because it’s job security for me, as people want more and more data. I do believe that the popular press, or at least the tech press, has gotten to a point where … and I’ve seen this in business literature like The Harvard Business Review or what have you, it’s gotten to the point where a lot of executives now understand the need to be data-driven. Usually when meeting the clients, they say, “We want to be data-driven.” I think the next two to three years will be ironing out what that means for them specifically.
Hugo: I presume it will mean some sort of computational literacy. I think it will probably mean a bit of statistics as well. Do you think people will need to learn like the basics of math even and linear algebra and logistic regression and these types of things, or is that expecting too much?
Vicki: No. I think the onus there is on the data scientist to present things for different audiences. If you’re a data scientist and you’re presenting to other data scientists, you can obviously talk about the specifics, the parameters that you have in your logistic regression or what have you. If you are talking to project managers, and especially executives, you should be speaking in a very different way, and you should be talking in a way that makes sense with what they’re interested in. An executive is probably not going to be interested in the algorithm you used, but they’re going to be interested in what you found and what kinds of actions you think that they should take. I am a firm believer in speaking to people in the language that they understand.
Full Stack end-to-end Data Science Solutions
Hugo: I want to shift gears slightly and talk about your approach to building full stack end-to-end data science solutions. Before we do that though, I’m wondering if you could give us the elevator pitch or something analogous on what even full stack end-to-end data science is or means.
Vicki: Full stack to me means basically building out a data science product. You start with some kind of data flow, you transform that data in some environment, and then you output a model and you display that model. That, to me, is end-to-end data science and that’s more of a product rather than a project, which I see as iterating on a specific model, for example.
Hugo: Great, so what’s your approach to building these solutions then?
Vicki: I don’t have a standard approach. It really depends. I usually come into the client’s site and just kind of observe for the first week or so. I see what the team norms are, what kinds of tools they’re using, where their pain points are. I get really annoying and I ask a ton of questions, and I do a lot of documentation. Then we usually start with looking at where the data flows into that team or that organization and seeing what we can leave behind that will be easy to maintain, reproducible, where you can understand the model that’s going into it and where you can easily visualize the output. This is the golden ideal of an end-to-end project.
Hugo: Great. Can you give me an example of one you’ve worked on recently that you think was particularly valuable?
Vicki: Yep, so I did a project a couple of projects ago that was building predictive modeling capabilities into a Software as a Service platform. This client had a number of, let’s say, a number of things that they wanted to predict about their clients. They had the descriptive capability, but they didn’t have the predictive capability. My part was taking the data that they were already getting from their clients, putting that data through a model, so I used a Markov chain model that was kind of similar to modeling page views for this particular industry. Then I integrated that back into their existing software platform.
Vicki: Really my role there was, one, ingesting the data that the company was currently collecting in its SaaS platform, analyzing that data, making sense of it because there had been no data analysis done before, figuring out what kind of model best to use to predict, and it turned out to be a Markov chain because, again, the product was similar to page views where you want to predict kind of the next move of the person or the client. Then wrapping that model around something that you could integrate back into their Software as a Service platform.
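To make the idea of predicting "the next move" concrete, here is a minimal sketch of a first-order Markov chain fitted to page-view sequences. It is not the model from the project described above: the session data, the page names, and the helper names `fit_transitions` and `predict_next` are all made up for illustration.

```python
from collections import Counter, defaultdict


def fit_transitions(sequences):
    """Estimate first-order transition probabilities from observed sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count every (current state, next state) pair in the sequence.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    probs = {}
    for state, next_counts in counts.items():
        total = sum(next_counts.values())
        probs[state] = {nxt: n / total for nxt, n in next_counts.items()}
    return probs


def predict_next(probs, state):
    """Return the most likely next state, or None if the state was never seen."""
    options = probs.get(state)
    if not options:
        return None
    return max(options, key=options.get)


# Hypothetical page-view sessions (assumed data, not from the interview).
sessions = [
    ["home", "pricing", "signup"],
    ["home", "pricing", "home"],
    ["home", "docs", "pricing", "signup"],
]

transitions = fit_transitions(sessions)
next_page = predict_next(transitions, "pricing")
print(next_page)
```

A first-order chain only looks at the current state, which keeps the model easy to inspect: the whole thing is a table of "from this page, where do people go next" probabilities.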
Hugo: Once this model is in production, who then is responsible for maintenance of it, and essentially also responsible for checking on model drift? Which, for our listeners out there, model drift is a phenomenon where when you have a productionized machine learning model, for example, it may not work, it may not give the results you’re expecting after three to six months, for example. Who’s responsible for this type of maintenance then?
Vicki: That depends on the type of project. Usually what we’ll do with our company is we’ll work with clients to stay on a month or so after and monitor the model, but usually we’ll make it so that it can be easy to change on the client side, because ultimately it’s theirs. We have to then make sure that it’s easy to document and easy to change, which is why it’s important to come in and observe it first, like I talked about, to see what toolsets they’re comfortable with, what programming languages they use, what the statistical skillset of the people on the team is, so we can pass it back to them and not have it be a black box.
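One simple way to monitor a deployed model for the kind of drift Hugo describes is to compare its recent accuracy with the accuracy measured at launch. The sketch below is only an illustration under assumed numbers: the `DriftMonitor` class, the window size, and the tolerance are hypothetical and do not come from the interview.

```python
from collections import deque


class DriftMonitor:
    """Flag drift when rolling accuracy falls below baseline minus a tolerance."""

    def __init__(self, baseline_accuracy, window=100, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        # Only the most recent `window` outcomes are kept.
        self.outcomes = deque(maxlen=window)

    def record(self, predicted, actual):
        """Store whether a single prediction turned out to be correct."""
        self.outcomes.append(predicted == actual)

    def rolling_accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def drifted(self):
        accuracy = self.rolling_accuracy()
        return accuracy is not None and accuracy < self.baseline - self.tolerance


# Hypothetical usage: the model scored 0.9 at launch (assumed figure).
monitor = DriftMonitor(baseline_accuracy=0.9, window=4, tolerance=0.05)
for predicted, actual in [("a", "a"), ("b", "b"), ("a", "b"), ("b", "a")]:
    monitor.record(predicted, actual)

print(monitor.rolling_accuracy(), monitor.drifted())
```

A check this small is easy to document and hand over, which fits the goal of leaving the client with something that is not a black box.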
Hugo: Fantastic. That really is setting the expectation to make sure there’s someone in house there who even has the capabilities to do this type of maintenance.
Data Science Generalist
Hugo: Another thing that sprung to mind when you elucidated the process of building full stack end-to-end data science solutions was there are so many steps along the way. To be able to do this as one person as opposed to a team of people with different specialties, it seems like you … one needs to be, and you are, a data science generalist in order to do this.
Vicki: Yeah, I think that’s true. In general, I hate to propel the myth of the data science unicorn. I am certainly not a unicorn, but I do think there are generalists and specialists. For consulting particularly, it makes sense if you are a generalist and if you like to be a generalist, because you could be doing a bunch of different things.
Vicki: Recently I’ve done some prototyping in R. Right now I’m working on a data ingest into AWS. I’ve done, like I said, the Markov chain modeling before. All of that really is the skillset of understanding what the client needs and being able to figure out how to research and to get to the point where you can offer a solution versus a specialist who might be very, very knowledgeable in, for example, deep learning, for a specific industry.
Hugo: Yeah, and you mentioned R there, implicit in your work, of course, is that you work with SQL. In order to do what you need to do, I’m sure you need to do a bunch of command line stuff and you work in Python as well, so there’s this kind of whole array of tools that you use to get the job done, right?
Vicki: Yep. Yeah, I would say my primary tool, when I can use it, is Python just because it’s also kind of like the Swiss Army knife of languages. I actually read somewhere recently that Python is the second best language for almost anything, which I agree with. It’s my personal favorite language. If you want to do almost anything, you can do it with Python. For my position particularly it works really well.
Vicki: Like I said, I’ve worked with R, I’ve worked with Scala, I do a bunch of command line stuff. Recently more and more I’ve been working with cloud platforms, AWS in particular, which is a whole new skillset, and more and more with engineering things like continuous integration, which is putting your model and making sure that you can keep building it and integrating it into the software.
Hugo: Actually, so I’ve referred to Python as the Swiss Army knife and I’ve heard it referred to as a Swiss Army knife for years now. I just had a brain flash, if that’s even a term, that maybe we could call it the Dutch Army knife because of Guido.
Vicki: In honor, yes, in honor.
Hugo: Okay, great. I just want to also make clear to all our listeners that although Vicki … Many guests I have on are data scientist generalists. Definitively not everyone is, and there is not a need to be a generalist, either. Something we may discuss later is that we actually are seeing a lot of specialization emerge within the discipline, right, Vicki?
Vicki: Yep, I totally agree with that. I think there’s a place for both. I’m also a big proponent of data science teams as opposed to just one person doing it alone. I always work in teams. Usually it’ll be someone who knows a little more statistics, someone who knows a little more engineering, and someone who’s more business or business analyst oriented, and someone who’s completely business facing. You have a combination of three or four of those kind of people. The best teams that I’ve been on complement each other in those ways.
Advice on Learning Paths
Hugo: For people who want to get into this type of work building full stack end-to-end data science products and solutions, what advice with respect to learning paths would you give them?
Vicki: I would say to just learn one thing that you’re interested in. The best advice I ever got was to just learn one language really well. It doesn’t matter what language you’re learning, although probably for the generalist Python would make more sense. Learn one language really well, and learn the internals of that language so then you can apply it to other things.
Vicki: Because what generalists do really well is to understand how different things apply to other things. For example, this is how objects work in R, this is how objects work in Python, this is how data flows into AWS, this is how data flows into Hadoop, this is how we would do something in Tableau versus D3. Generalists generally work well with patterns and are able to research different things.
Vicki: What I would suggest is, one, learning one language and then being able to extrapolate from that, and trying to build a product or a project end-to-end. I had a tweet about this, which I can link to. Because it can sometimes be really hard to come up with project ideas and daunting, too. The way that I kind of scratched that itch for myself was I built a project called Soviet Art Bot, which tweets out socialist realism art. For that, I had to get that art from a website. I had to put it in AWS, and I had to have an AWS Lambda to create the bot to tweet. That kind of scratched my itch to figure out how all those different parts came together. Like I said, I have a tweet that I can link to that has a couple of different project ideas that you can …
Hugo: I love that, and we’ll definitely link to that in the show notes.
Hugo: Something that’s in the cultural consciousness at the moment, and has been emerging for some time, is this trade off in predictive analytics, machine learning and deep learning, between model performance, so how good a model is at predicting what it wants to predict, and being interpretable, so trying to figure out why it’s making the predictions it does. I’m wondering in your work and your client work, what is the approach to this trade off, generally?
Vicki: My personal approach is to always create models that are a little bit simpler, but always easier to look at under the covers. The reason for that … and I probably would have a different answer if I were full time at a company, but as a consultant you always need to be able to leave behind work that other people can look at, that they can take apart, that they can rely on, and that is easily documented. Especially for dealing with people that are not as technical, it’s important to be able to explain those things really well. For me, I always err on the side of simpler is better.
Cloud
Hugo: Something you spoke to earlier was the fact that more and more data science work is moving to the cloud, and I’d just love to pick your brain about that. This is a relatively large challenge for us as a community to do, and I was just wondering how you approach this in your work.
Vicki: Yeah, so what we’ve seen recently, well, it’s been a trend over the past couple years, but I’ve seen it come up in more and more projects, is that a lot of clients are starting to realize that they don’t want to maintain infrastructure, and they want to take everything to the cloud. Of course, when they’re doing this they want to consider the fact that there’s now things that you have to manage. For example, you have to manage the security of the cloud.
Vicki: Like there’s been a lot of stories in the news lately with, for example, S3 buckets just kind of left wide open and all the data leaking out, so that’s important to handle. You need to handle some of the cloud management, and most importantly, you need to understand how all of these parts work together, because it can be harder than just, for example, creating a model in scikit-learn, pickling it, and then putting it on some server. You have to understand how all the parts of the ecosystem work together, so that’s becoming more important, too, in data science. I think specifically for data science in the cloud, the toolset is really just emerging at this point. For example, I know there’s SageMaker and Google Cloud has some stuff and there’s Azure Machine Learning, but all of these, I feel like, are just starting to come into their own, but they’ll become more important components as people move in that direction.
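For readers who have not seen the "simple" workflow being contrasted with the cloud here, this is roughly what pickling a model and loading it elsewhere looks like. To keep the sketch dependency-free, a plain dictionary of made-up logistic regression coefficients stands in for a fitted scikit-learn estimator; the feature names, the numbers, and the `predict_proba` helper are all assumptions for illustration.

```python
import math
import pickle

# Hypothetical fitted parameters of a logistic churn model (made-up numbers).
model = {
    "intercept": -1.0,
    "coef": {"support_tickets": 0.8, "months_active": -0.1},
}


def predict_proba(model, features):
    """Logistic regression score: sigmoid(intercept + sum(coef * value))."""
    z = model["intercept"] + sum(
        weight * features[name] for name, weight in model["coef"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))


# "Pickling it": serialize the model to bytes before copying it to a server.
blob = pickle.dumps(model)

# On the serving side: load the bytes back and score a new customer.
restored = pickle.loads(blob)
customer = {"support_tickets": 3, "months_active": 12}
score = predict_proba(restored, customer)
print(round(score, 3))
```

Only unpickle data you trust, since loading a pickle can execute arbitrary code. That is one small example of the security questions that come up once a model leaves a laptop.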
Hugo: Also, I think the fact that these are emerging and rapidly developing technologies means that the barrier to entry might be slightly higher, right?
Vicki: It could be. Yeah, it could be in some ways, it’s less in others. If you already know how to move in cloud environments, the barrier to entry to the cloud is low, and then the barrier to entry for machine learning is lower, too, because there’s already some prototyped components that you can put together. If you don’t know how to operate in those environments, in that sense the barrier to entry can be higher. What I’ve seen recently is a lot of people doing data science are kind of moving a little more towards the engineering path, even.
Hugo: Right. Yeah, I suppose I’m really thinking of the people who are working data scientists or are proficient in machine learning trying to go to the cloud, and it may not even be obvious even documentation wise what to do and how to do it.
Vicki: Right. Yeah, the documentation for a lot of these cloud services leaves a lot to be desired.
Hugo: We’ll see that improving, surely.
Vicki: Yeah. In fact, I know AWS and I think also Microsoft have open sourced their documentation on GitHub, which is really positive.
Hugo: That’s right, and I actually recently had Paige Bailey on the podcast who’s a software developer advocate at Microsoft Azure, and she’s instrumental in a lot of this work as well.
What does the future of data science look like to you?
Hugo: Great, so we’ve talked a lot about kind of the data science landscape and your work currently. I’m wondering what the future of data science looks like to you.
Vicki: I think what we’ll see is a lot of standardization and kind of like a narrowing out of the industry. The last five years have been about this explosive growth in this new field called data science, which nobody really knew what it was at first, and so we started to define that. There’s a lot of now kind of shifting to data science. Everybody almost knows that data scientists are statisticians.
Vicki: What we’re seeing now I think is a lot more, to your point, specialization. There’s a lot of people doing specifically deep learning or specifically AI. A lot more movement to software development, like I mentioned. Especially as more stuff goes into the cloud, data scientists will need to know how to work in those environments. As always, I think the future belongs to people who can be flexible, who can write and read good code in whatever language, and who can teach themselves as the environment shifts.
Hugo: Great. Something you spoke to previously is trying to understand what best practices in data science look like. There isn’t as of yet … I mean people talk about certain things, but there isn’t kind of a solidified system of best practices like there is in front end software engineering, for example, right?
Vicki: Yeah, and I think that’s just starting out. Like I’ve seen both Facebook and Google release guides on machine learning and things to look at. Google’s is particularly good because it has things you should look at, and Facebook just released a bunch of videos. I think that will start to become more solidified. The other side of that is you also hear a lot of people talking about ethics in machine learning and data science, and I think there might be some pressure from that perspective as well to define just what data science means. Of course, there’s GDPR regulations which will have us define what data we can collect. I think all those three things together will give us a little more fleshed out view of what that is.
Hugo: Yeah, great. I think the GDPR’s an interesting example. We’ll be seeing more and more of this. That’s EU specific in a number of ways, if you have any data going through the EU potentially as well. As we see more and more countries adopting these types of things, I’m wondering if that will impact how we use cloud technologies as well.
Vicki: I’m sure it will to some extent. I think the big thing in cloud will be figuring out security… security and data flows first.
Ethics
Hugo: Yeah. You mentioned ethics in data science. I’m wondering what you think the biggest concerns are in the ethical landscape.
Vicki: Personally I would say right now probably the biggest issue is data leaks. There’s a number of different things, but I want to focus on the practical issue, which is a lot of people are not securing their data. The issue there is potentially collecting too much and then not monitoring it carefully enough.
Favorite Data Science Technique
Hugo: Okay. Yeah, I agree with that. We’ve talked a lot about different aspects of data science and the data science flow. I’m wondering in particular what’s one of your favorite data science-y things to do, I mean techniques or methodologies?
Vicki: Yeah, so the ones that I enjoy doing the most because I get the most return out of them are probably decision trees. The reason I like them so much is because they’re very easy to discuss with people who aren’t necessarily data scientists. They’re very easy to visualize and they give you a clear path to a call to action. If I can utilize them, I do.
Hugo: This really speaks again to something we were discussing earlier: one, interpretability, in that you can actually show someone going down the tree and what decisions it makes at each branching point, but also ease of explicability, or just being able to explain something to someone else.
Vicki: Yep, and the ease of porting between multiple platforms as well.
Hugo: In what sense?
Vicki: Implementation details, so you can create a decision tree locally in scikit-learn. You can create one in R. You can create one on almost any platform that exists, so I like it for that.
Hugo: That’s great. Of course in scikit-learn you can … it’s nice it’s compatible with Graphviz, right, so you can visualize it immediately.
Vicki: Yep.
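To make the scikit-learn and Graphviz point concrete, here is a minimal sketch of fitting a small decision tree and exporting it both as plain-text rules and as Graphviz DOT source. The iris dataset, the depth limit of 2, and the variable names are assumptions chosen for illustration; none of them come from the conversation.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text

# Assumption: the bundled iris dataset stands in for real project data.
iris = load_iris()

# A shallow tree keeps the result easy to walk through with non-specialists.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Plain-text view of the branching rules, readable without any plotting tools.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)

# Graphviz DOT source; with out_file=None the function returns a string.
dot_source = export_graphviz(
    clf,
    out_file=None,
    feature_names=list(iris.feature_names),
    class_names=list(iris.target_names),
    filled=True,
)
```

The `dot_source` string can be rendered with the Graphviz `dot` command-line tool or the separate `graphviz` Python package, if either is installed.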
Hugo: What about with respect to data engineering? What really gets … do you love doing there?
Vicki: I’m really into AWS Lambdas, which are basically… think of them as like virtual environments that exist ephemerally. They spin up, they do something, and then they go away. There’s a lot of potential for use with them, and I’m really interested in exploring them a lot more. I’ve used them in my past two projects and I see them only growing.
Hugo: What’s the gain? What’s the big win to be made with AWS Lambda environments, do you think?
Vicki: They’re kind of like functions that do things very quickly. They can move data. They can tweet. I use Lambda functions in my bot to tweet every certain amount of time. They’re very easy to maintain. Once you set them up and have them going, they just kind of keep going.
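As a rough illustration of the "function that spins up, does something, and goes away" idea, here is a minimal sketch of a Python Lambda handler that could be invoked on a schedule. The `lambda_handler(event, context)` signature is the usual AWS Lambda convention for Python; everything else (the `MESSAGES` list, `pick_message`, and `post_message`) is hypothetical and is not taken from Vicki's bot. The posting step is a stand-in that only prints.

```python
import json
from datetime import datetime, timezone

# Hypothetical messages; a real bot would load or generate its own content.
MESSAGES = [
    "Decision trees: easy to explain, easy to draw.",
    "Secure your data before you collect more of it.",
    "Read the docs, then improve the docs.",
]


def pick_message(now):
    """Choose a message based on the current hour (illustrative rule only)."""
    return MESSAGES[now.hour % len(MESSAGES)]


def post_message(text):
    """Hypothetical stand-in for a call to a posting API; it only prints."""
    print(f"posting: {text}")
    return True


def lambda_handler(event, context):
    """Entry point invoked per event; no state is kept between invocations."""
    now = datetime.now(timezone.utc)
    text = pick_message(now)
    posted = post_message(text)
    return {
        "statusCode": 200,
        "body": json.dumps({"posted": posted, "text": text}),
    }
```

Calling `lambda_handler({}, None)` locally exercises the same code path, which is one reason small functions like this tend to be easy to maintain. Running it every certain amount of time would additionally need a schedule trigger configured on the AWS side, which is outside this sketch.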
Call to Action
Hugo: Fantastic. All right, so my last question is do you have a final call to action for our listeners out there?
Vicki: Yeah, so I’m on Twitter. I’m @vboykis. You can find my site there, my tech blog. If you’re interested in more about what my company CapTech does, you can go to captechconsulting.com. We’re always hiring and we’re always taking on new clients.
Hugo: Fantastic. I suppose I do have a follow-up question there. In terms of the hiring process, this is a question I get a lot, do you have any advice or general rules of thumb for people entering an interview process, I mean with you or elsewhere?
Vicki: One, prepare well to understand the company that you’re interviewing for. Especially in consulting it’s a little bit different because we’re looking for people who are good technically, but we’re also looking for people who are interested in doing a lot of different things and are good at doing a lot of different things and can be self learners and do a lot of research.
Vicki: The second thing is to be enthusiastic about what you talk about. Tell me about what you’re passionate about. Tell me about what kinds of projects you’ve done, if you’ve done projects outside of work. Tell me as much as you can about your work projects.
Vicki: Basically when I come into an interview with someone, I’m looking to have … I’m not looking to trick you. I’m looking to have a conversation with you and to see if I can work with you, and that’s it.
Hugo: Vicki, it’s been an absolute pleasure having you on the show.
Vicki: Thank you for having me.