Becoming a Data Scientist (Transcript)
Here is a link to the podcast.
Introducing Renée Teate
Hugo: Hi there, Renée, and welcome to DataFramed.
Renée: Hi Hugo. Great to be here.
Hugo: It’s great to have you on the show and I’m really excited to talk about all the things we’re gonna talk about today, the podcast that you worked on for so long, the idea of becoming a data scientist and your journey and process there, but before that I’d like to find out a bit about you. Maybe you can tell us a bit about what you’re known for in the data community.
Renée: Sure. Well I think I’m known for the podcast that you mentioned. It’s called Becoming a Data Scientist. I interview people about how they got to where they are in their data science journeys and whether they consider themselves to be a data scientist. I plan to start that back up soon. I think that’s what I originally kind of got known for but a lot of people also follow me on Twitter that may or may not have been an original podcast listener. I have a Twitter account called BecomingDataSci and my name on there is Data Science Renée. I try to help people that are transitioning into a data science career to find learning resources and inspiration. I’ve built a site called DataSciGuide.com, which collects learning resources and people can go on there and rate them. I hope to eventually make that into learning paths and things like that. I have a Twitter account called NewDataSciJobs where I share jobs that require less than three years of experience and I try to share articles about learning data science and getting into this field to help people transition in.
Renée: On top of that, I share my own data science challenges and achievements and try to encourage and inspire others so they can kind of watch what I do. I’m really happy, especially in the last year I feel, to see a wide variety of people with different educational backgrounds that want to enter this field, so I intend to help them become data scientists too because I think the broader the background of people in this field, the better it’s gonna get. I guess that’s what I’m known for, the podcast and Twitter account for the most part.
Hugo: Sure. I think a wonderful through line there, one that of course we’re very aligned with at DataCamp, is lowering the barrier to entry for people who want to engage with analytics and data science. One of your wonderful approaches, I think, is that on the podcast you ask people not only about their journey but also whether they consider themselves to be data scientists, what this term means, and how their practices apply to it. It kind of demystifies data science as a whole, which can be a very unapproachable term, I think, with a lot of gatekeepers around as well. I think the work you do is very similar to how we think about our approach at DataCamp so that’s really cool.
Renée: Great. I definitely aim for that.
How did you get into data science?
Hugo: How did you get into data science initially?
Renée: This is my favorite question because this is what we talk about the whole time on my podcast, so hopefully I don’t run too long but I will give a detailed answer. I’ve worked with data my whole career. You might call me a data generalist. Right out of college, I went to James Madison University in Harrisonburg, Virginia, where I still live, and I majored in something called integrated science and technology. It was a very broad major. It gave more breadth than depth in a lot of topics. We covered everything from biotech to manufacturing and engineering to programming, but you kind of get a taste of everything and find out what you like and don’t like. It had a lot of hands-on real-world projects and one thing we learned in the programming courses in the ISAT program was relational database design. This is something I had never done before then but when I was in the class I realized hey I’m pretty good at this. I get this. It makes sense to me. Right out of college, I started doing that type of work. I was designing databases, building data-driven websites, and designing forms and reports to interact with the data. I did a lot of SQL and helped design a reporting data warehouse and building interactive reports where people can interact with the data and I did some analysis on that.
Renée: I wanted to take my career to the next level beyond that. At the time, I thought that a masters in systems engineering would fill in a lot of the gaps in my knowledge so in my undergrad program I didn’t have a lot of depth in math, for instance, or coding. I just had some introductory classes. This program had, it was at the University of Virginia, and it had simulation and modeling courses, optimization, statistics, and at the time I was kind of afraid of the math. I had to take linear algebra at the community college in a summer course to even qualify to apply for this masters program. This is eight years after undergrad. I should have known that it was gonna be more math intensive than I originally thought but I found out that the title of each of these courses in the systems engineering program is kind of like a code for another type of math. It was very math intensive but I needed that. That’s something that I wouldn’t have learned as much on my own if I did all self directed learning.
Hugo: I have a question around that which of course I get a lot as an educator, which is to be an effective data analyst or data scientist, how much linear algebra do people need to know?
Renée: I think it’s good to understand the basics. It gives you a sense of what’s going on behind the scenes of those algorithms, to understand how data is being transformed and processed, however if you’re really going to be an applied data scientist and not so much like a machine learning researcher, you don’t have to really know all those intricacies. I’m glad I got a background in it so I understand how these things work, but I don’t use those skills on my day-to-day work. They’re like packages that abstract all that away so I don’t have to be doing those type of calculations on a daily basis as a data scientist. I would say it’s good to get a grasp of it and feel like you understand the concepts but you don’t need to like have a mastery of the actual computations yourself. I mean that’s what computers are for. They can do a lot of that for you.
Hugo: Yeah. I agree completely and I do think there is a lot of anxiety around learning these types of things, linear algebra and I suppose multivariate calculus in particular. I do also encourage people to push through a bit and persevere a bit because a big part of the challenge is the language and notation. A lot of the concepts aren’t necessarily that tough but when you’re writing a whole bunch of matrices and that type of stuff, you get pretty gnarly pretty quickly.
Renée: Yeah. I still like shudder when I see certain depictions of the … Like you said with multi variable calculus and calculus that’s done in a matrix. It just looks so overwhelming and the notation still gets me so I feel that.
Hugo: Yeah.
Renée: But I’m glad I understand the concepts behind it, even if I still shudder every time I see those.
Hugo: Yeah and you can have some crazy notation that really what it is referring to is the directional flow along a surface or something like that, like something that intuitively is quite easy to grasp but we’ve got this heavy archaic notation around it.
Renée: Yeah and it’s not even consistent. I was in a program that had like professors from different departments at different universities and my husband is a physicist and there was a course where I was just really struggling with this particular type of computation and the notation and he looked at it and he was like you just learned this last semester. I was like I’ve never seen this before. He said no it’s the same concept, it’s just different notation. That’s when I really started to understand like mathematicians and engineers for instance might use different notation for the same thing. It gets complicated. I do think if you’re gonna become like a machine learning researcher or go into like a PhD program or you’re developing things around the cutting edge of data science and really pushing forward the field and building algorithms that other people will use, then you need to really understand that stuff but if you’re mostly applying algorithms that are already built, you don’t have to get as in depth. For statistics I do think you really need a solid statistical foundation. I would kind of say the opposite. Everybody that does data science really needs to understand basic statistics well.
Hugo: Great. So what then happened in your journey while or after you did this program?
Renée: Yeah. While I was in the program, the Data Science Institute got started at UVA. I had been hearing about data science everywhere and I kind of wanted to switch into that program but I couldn’t without completely starting over. They kind of moved as a cohort through their program so I found out that I could take a machine learning course as an elective and so I started taking that just because I wanted to know what it’s about and how close it is to what I’m already doing. It felt like my whole career up to that point was kind of leading towards data science and I had never heard of it. In this machine learning class, it started with a lot of the math and it moved really fast and I’ll be honest I bombed that midterm. I really thought I was gonna fail out of the course but I decided to keep going because the first half of the course was the math and the second half of the course was the coding and applied part of it which was what I was looking forward to, so I thought well even if I get a bad grade I want to learn what I’m supposed to learn in this course so let me stick with it.
Renée: Like you said, with the abstract symbols and things I was having a hard time even understanding the textbook but then the last part of the course we had been building these machine learning algorithms from scratch. Oh and by the way all the examples were in C++ but the professor let us use whatever coding language we wanted to, so I started picking up Python at that point. I didn’t have a very good grasp of C++. I had mostly done Visual Basic .NET and SQL up until that point and I didn’t know Python at all but I figured that was my chance to learn it so I kind of learned Python as I went as well, which is probably part of the reason I struggled in the class. By the end we had this project. By then I kind of got Python and I kind of got what was going on with machine learning. I was going to school part time while I worked, so I asked my manager can I use this data that we use at work to apply it to this project that I’m doing in school. He said yes that was fine.
Renée: So what I did, I was working in the advancement division at JMU which is basically the fundraising arm of the university. For my project, I predicted which alumni were most likely to become donors in the next fiscal year. The professor loved it and maybe even mentioned this is something I could publish in the future. I guess that project outweighed my performance in the math portion of the course because I ended up getting an A in that class, which just blew my mind.
Hugo: That’s incredible.
Renée: I was like okay now that’s kind of confirmation that this is something I should be doing.
Hugo: Absolutely. I just wanna flag that before you go on, that you’ve actually made an incredible point there which is that you didn’t do a project kind of in a vacuum essentially. You were working on data that was meaningful for you, meaningful to your employer, and actually gave some insight into something important to a bunch of stakeholders.
Renée: Yeah and it took what like in class we have pre-prepared data sets and they were all just lists of numbers. They weren’t even like kind of related to the real world at all. The professor chose those data sets because the answer would come out a certain way and so diving into something that was unknown that no one had really looked at before at least in our university and finding some insights that I could share and actually make a real-world difference, that tied it all together for me.
Hugo: In a learning experience as well, working on something that means something to you and interests you is so important.
Renée: Oh absolutely. I always encourage people to find datasets that are interesting to them and use them throughout their learning journey because it keeps you interested when things get tough and also you’ll understand the output better if it’s something that you’ve had a background in or even interested in. If you’re into sports, use a sports data set because you’ll have a better sense of whether the output of your model even makes sense in context of sports.
Hugo: I always say if you, a lot of people wear fitness trackers these days and they can get their own data with respect to exercise and sleeping patterns and that type of stuff. They can quickly do a brief analysis or visualization of stuff that’s happening physiologically with them.
Renée: Yeah. That’s an awesome idea and definitely something I would encourage.
Hugo: Awesome. So what happened next in your journey?
Renée: For my last class, so most of my program that I did in grad school was online. It was synchronous so I was actually watching lectures over the internet that were live and there was a class there but for the last semester I commuted to campus which was an hour for me. I started listening to a lot of data science podcasts because I knew at that point I’m interested in this thing. Back then I was listening to Partially Derivative and Talking Machines and the O’Reilly Data Show, Linear Digressions, Data Skeptic, so I was just absorbing all of this data science information and I knew that this was what I wanted to do. As soon as I graduated, I started diving into books about data science and teaching myself what I needed to know to get a job in this field and move on from, at the time I was a data analyst and I wanted to move into being a data scientist. That’s what I did next.
Renée: Then I applied to a bunch of different jobs. At the time I was just getting comfortable with data science, so I didn’t necessarily want a data scientist job, but I wanted to make sure it was a job that was moving in that direction because the job I was in wasn’t giving me a lot of opportunities to really exercise these new skills and do machine learning on the job. I knew I was good with designing analytical reports. I knew I was good with SQL. I had this new masters degree in systems engineering but I wanted to grow into a data science role. I started applying to a bunch of different jobs that partially involved data science but had components that I knew I already had the skills to provide value in. I didn’t get any of the first several I applied to, but by doing those interviews I was starting to learn what they were gonna ask and what gaps I had in my knowledge so I could go back and learn more.
Renée: At the time, there were two different startups, one on each side of the country, that apparently needed that type of generalist that could do both the backend data engineering and SQL stuff and move into the predictive modeling side. I got two offers at the same time. They were both for remote roles that were like a combination of data analytics and entry level data science. I didn’t have to do whiteboard interviews or coding interviews for either of them which was nice because that part, I don’t think I was as good at the time, but they needed somebody with my background and my experience with databases and someone that was good at communicating with the stakeholders. I think that helped me stand out and I think we’re gonna talk a little bit more about that later.
Hugo: Absolutely.
Renée: But one of those two job offers was with people I had worked with before. I worked at Rosetta Stone as a data analyst and a lot of the people at this startup had come from Rosetta Stone. I was more comfortable with that one and took that one and have been able to build my data science and machine learning skills on the job. That company is called HelioCampus. We work with university data and I can tell you more about that if you’re interested, but I’ve been in that role for about two years now as a data scientist.
Hugo: Fantastic. That’s telling that the project you did did involve alumni data initially, when you were first learning.
Renée: Yeah. At HelioCampus we’ve kind of … It’s extended me into a new domain. It’s still university data but we work a lot with the student success data and admissions and things like that. I guess I’ll give a little brief overview of the company. At universities they have databases that are like all kinds of data that you might not even think of when you’re applying and enrolling at this university. There would be a system for admissions and applications. There’s usually a separate system for enrollment and courses and faculty and then there’s another system that they have for payroll and financials and then they’ll have another system for the fundraising and alumni information. They have all these databases across campus and the leaders want kind of a big picture look at the students’ trajectory through this whole experience of applying and then going to college and becoming alumni.
Renée: To get metrics on that whole system, you have to combine that data. We combine it into a data warehouse and we have reports in Tableau that point at that data. We have some canned reports and then my job is to work with the end users to do analysis that’s not already built, to answer questions they have about the students, and to do some predictive modeling. One example is for the admissions team, we have … We’ll take a look at all the students that have been admitted to a university and try to predict how many of them will enroll or which ones might be on the borderline of the type of students that sometimes enroll and sometimes don’t. They might need some extra outreach in order for the school to get their attention or students that need additional financial aid for instance. We’ve helped them get some insight by doing predictive modeling into what their student body looks like and what type of students they can expect to come to their university and what trend we expect in the future for their enrollment. That’s just one example of many different aspects of what we do with the universities at HelioCampus but that’s the kind of work I’m doing now.
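A minimal sketch of the kind of enrollment prediction described above, not the method HelioCampus actually uses. It assumes a single made-up feature (number of campus visits) and a plain logistic regression fitted by gradient descent; the data, the feature, and the 0.4 to 0.6 "borderline" band are all hypothetical choices for illustration.

```python
import math

# Hypothetical past admits: (campus_visits, enrolled) pairs. Made-up data.
past_admits = [(0, 0), (0, 0), (1, 0), (1, 1), (2, 0), (2, 1), (3, 1), (4, 1)]


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(data, lr=0.1, epochs=5000):
    """Fit p(enroll) = sigmoid(w * visits + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w, b = fit_logistic(past_admits)

# Hypothetical newly admitted students, described only by their visit counts.
new_admits = [0, 1, 2, 3, 4]
probabilities = [sigmoid(w * x + b) for x in new_admits]

# "How many of them will enroll": the sum of individual probabilities.
expected_enrollment = sum(probabilities)

# "On the borderline": students whose predicted probability is near 50%
# and who might benefit from extra outreach (the band is an assumption).
borderline = [x for x, p in zip(new_admits, probabilities) if 0.4 <= p <= 0.6]

print(f"Expected enrollment: {expected_enrollment:.1f} of {len(new_admits)}")
print(f"Visit counts of borderline admits: {borderline}")
```

In practice a model like this would use many features from the combined data warehouse and an established library rather than a hand-written fit; the sketch only shows how individual probabilities turn into an enrollment estimate and an outreach list.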
Hugo: That sounds like very interesting and fulfilling work, particularly with your kind of deep interest and mission as an educator and investing in learners.
Renée: Yeah definitely.
What questions do aspiring data scientists need to think about?
Hugo: It was fantastic to find out once again about your journey to becoming a data scientist and something that of course you do is insist through your podcast, through a lot of different media that this is only one journey, that everyone’s journey particularly to becoming a data scientist, there are a lot of different paths and there isn’t a one-size-fits-all approach to becoming a data scientist, and that before actually deciding on a path, people need to figure out both where they are and where they need to go and connect those points somehow. So: what I’d like to know is what questions do aspiring data scientists need to think about when figuring out where they’re starting from on their journey?
Renée: Yeah definitely. That’s actually why I started my podcast because I was listening to all these other podcasts showing what cool stuff data scientists were doing, but none of them had focused on how did they get there? What did they do? I started asking questions and one of the things I realized that you have to assess, no matter which educational background or career background you have, is your starting point. The kind of questions you need to ask to map out your data science learning path are like: have you coded before? What language have you coded in before? Data scientists typically learn R or Python, often need to know SQL. How comfortable are you with the mathematics and statistics and do you need to brush up on those things and get some refreshers? Maybe you need to take it to the next level from where you’re at? Have you ever presented a report based on data? Have you done an analysis in a professional setting before? Have you ever answered questions with data? These are like the basics that you need.
Renée: Then, you’re gonna probably be working in a particular domain so within that field do you know the lingo? Do you know what kind of data related career paths there are in that domain? How you might focus in your data science learning to target one of those career paths. You might want to talk to a data scientist in that domain or analyst in that field and get a sense of the common questions and state of the art of what problems are they working on and what are they asking so you get that language. It’s kind of this baseline of all the different parts of those common data science Venn diagrams that you see of how many of those pieces do you still need to work on to fill in. You’re just assessing your starting point and then next you’ll look at where you wanna go so that you know how to map out that learning path.
Data Science Profiles
Hugo: Yeah. So to recap, essentially we have coding chops, whether you can program, what languages, comfort with maths and stats, then communication skills and actually presenting I was gonna say data-based reports but I really mean reports based on data and then domain knowledge. I think these are definitely very important aspects of your own practice to analyze when figuring out where you’re starting from and then of course, as we both said, you need to have an idea of where you wanna end up. This may be a relatively amorphous, changing, vague notion but what are the typical data science profiles that we’ve seen emerge that people can end up as?
Renée: Yeah. As you mentioned, data science can mean a whole lot of things. I’ve noticed that there seem to be these groupings of specialties within data science. There’s like an analyst type of data scientist: these are people that are usually working with end users or leaders or other people in the business. They’re understanding the kind of questions that can be asked and figuring out how to convert those questions into data questions and determine “do you have the data available to answer those questions?” and doing the analysis and then presenting the results and probably developing data visualizations for those kind of things. There are the engineer types of data scientists that are doing a lot of the backend work, the coding, working with databases and data warehouses, probably doing some of the feature engineering, working with big data systems and technologies that can handle massive data sets, building those data pipelines that support the analysis.
Renée: Then there’s what I mentioned earlier, the researcher type of data scientist: they’re improving those cutting edge algorithms and developing new tools and techniques, so that’s a different focus of data science. I’ll say that most people end up doing some combination of these things but you end up specializing either in like the analysis part or the engineering part or the research part. In my current role, I do a lot of the back-end engineering stuff because I have that background but also mostly focusing on the analysis tasks and communicating with people at the universities, the institutional researchers and decision makers that are gonna use the results of what it is that I’m doing.
What paths should individuals take?
Hugo: Yeah great. We’ve identified the three archetypes, the analyst, engineer, and researcher as end points or at least career paths. Knowing kind of the ways we need to think about where we are and knowing where we can end up, what are paths that you would recommend? What do recommended paths look like essentially?
Renée: Yeah I’m hoping to formalize this more in the future with the information I’m gathering at DataSciGuide but it really depends on the individual. That starting point that you assessed, the ending point of where you want to end up at, and what are you comfortable teaching yourself or taking courses in, learning online, deciding if you need to go back to school. I do think it’s a myth that you need a PhD to be a data scientist. I don’t have one. A lot of data scientists I know don’t have one. I would say go back to school if there’s something, like there was for math for me, that you would be uncomfortable teaching yourself and you really need someone else to help you understand like the fundamental concepts there. Talk to someone that has a similar background as you and has become a data scientist or find people on Twitter that seem to be following paths that you like and you want to follow that.
Renée: Then do that project-based learning like you talked about. Finding the data set that has the information that you’re interested in, whether that’s sports statistics, or political data, or geospatial imagery or medical data or entertainment data. There’s so many different types of data out there that you can find something that’s really interesting to you. Ask a question that you can answer with the data and then learn whatever techniques you need to learn in order to answer that question. I think project-directed learning is really valuable but that exact path and what resources you use, I have a really hard time recommending any one thing because different things work for different people, though I would recommend you keep trying different things until you find out what works for you. Don’t get discouraged if you pick up a book that a lot of people say is popular and great and you don’t really get it and it’s not sinking in for you. Just try something else. Don’t give up and say oh I’m not cut out for this because this popular book doesn’t make sense to me.
Hugo: Yeah. There’s a lot of great advice in there. Something I haven’t thought about a lot beforehand is talking to someone that has a similar background, essentially finding people like you. I think this is really cool because after you’ve done the work of identifying where you are and where you wanna go or where you’d like to be in whatever time frame you’re thinking, I think it’s easy to forget or to think that there aren’t people like you out there and that you’re alone in this journey, particularly in a field that’s moving so quickly so to find people at different points in their career who are like you, that type of community to advise or be a mentor or a mentee later on, these types of things, is an incredible idea.
Renée: Yeah. I think another thing that I just thought of that ends up being difficult is just even orienting to the terminology. Even when you’re out there looking for someone like you, like there’s a lot of weird words that are used in data science that can be confusing at first and you don’t really know is that person doing what I think I wanna do. I have an article on my blog about how I used Twitter to do this. Podcasts like yours are great for that, just hearing people talk about data science and learning like what kind of things data scientists have to think about. When I was ready to move into this career path I got this book. It was called Doing Data Science by Cathy O’Neil and Rachel Schutt. That was great for me in terms of getting an overview of the big picture of what is this stuff and what do I need to learn and what are some of the basic terms and it pointed you at other resources to learn.
Renée: Yeah just orienting to like how people even talk and what … What matters in data science and maybe there are things that you actually know already but it’s called something else by data scientists. Data science is kind of a combination of fields that have already existed for a while. Yeah just learning that terminology and listening to data scientists and watching them on Twitter and reading articles to figure out what you don’t know yet is an important first step.
Specific Learning Tasks for Beginners
Hugo: In terms of this journey of becoming a data scientist, can you suggest any learning tasks for beginners?
Renée: Yeah. I would say build a report. Like you were saying, maybe use your own data from a Fitbit or something like that. Just explore a dataset and do some basic statistical summaries and then practice communicating those results. As you learn, you’re gonna be using different tools and techniques but you wanna make sure that the outcome is always understandable and so see if you can bridge that gap as you go. Actually I think when you’re learning it’s a great time to do this because that’s when it’s fresh and new to you as well so you can bridge that gap between the technical analysis and using that information to make decisions and talk to people that are less technical to get the point across. Constantly blogging is a great way to do this. Talking to friends or people in your field is a good way to do this and just explaining the analysis you did but in a way that just makes people comfortable that you know what you’re talking about and then makes that information usable without getting into too much of the nitty gritty of statistics behind it.
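As a minimal sketch of that beginner task, the example below summarizes a week of step counts and then restates the result in plain language. The numbers and the 8,000-step goal are made up for illustration; a real fitness-tracker export would need to be loaded and cleaned first.

```python
import statistics

# Hypothetical daily step counts for one week (made-up numbers).
daily_steps = [4200, 7900, 10400, 6100, 12050, 3300, 8800]

# Basic statistical summary of the dataset.
summary = {
    "days": len(daily_steps),
    "mean": statistics.mean(daily_steps),
    "median": statistics.median(daily_steps),
    "stdev": statistics.stdev(daily_steps),
    "min": min(daily_steps),
    "max": max(daily_steps),
}

goal = 8000  # assumed personal goal, not a recommendation
days_meeting_goal = sum(1 for steps in daily_steps if steps >= goal)


def plain_language_report(summary, days_meeting_goal, goal):
    """Turn the numbers into sentences a non-technical reader can use."""
    return (
        f"Over {summary['days']} days I averaged about "
        f"{summary['mean']:,.0f} steps a day "
        f"(a typical day was {summary['median']:,} steps). "
        f"I hit my {goal:,}-step goal on {days_meeting_goal} "
        f"of {summary['days']} days."
    )


report = plain_language_report(summary, days_meeting_goal, goal)
print(report)
```

The split between `summary` and `plain_language_report` mirrors the two halves of the advice: do the analysis, then practice explaining it without the statistical detail.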
Hugo: For sure. I do think working on datasets that are relevant to you is so important. The Titanic and Iris data sets don’t count even if you think they’re relevant to you.
Hugo: We need to move away. I think you dispelled very importantly the myth that you need a PhD to do this type of stuff. I’m wondering what other potential pitfalls or warnings you have for people along the way on their journey.
Renée: I think there’s some misconceptions about how much you need to learn. A pitfall is that it’s really easy to get discouraged when you’re learning. There’s so many topics under this umbrella of data science that you can easily get really overwhelmed and not know where to go, especially with self-directed learning. You have to kind of balance learning enough to qualify for the type of job you want but then not over planning it or overdoing it to the point where you’re starting to feel totally off track and psyching yourself out and feeling like you’re never gonna make it.
Renée: In a talk I gave, I talked about it like you’re planning a trip. You could plan it out turn by turn and print out the directions and know exactly where you’re gonna turn and what it’s gonna look like at each of those turns, but you still wanna have your GPS handy because if you run into unexpected traffic or road closings you gotta route around that. At some point you’re gonna feel lost in your learning or like you’ve totally hit a roadblock but instead of giving up you might just need to go back and find other resources to get you more comfortable with the topic before you move forward again or decide do I really even need to learn this? Maybe you can skip that part and come back later when you have a better understanding. Instead of just getting stuck and waiting for things to kind of clear up in front of you just be prepared to reroute. There’s a whole lot of different paths to a data science career and just be prepared to change course.
Renée: Also I think a lot of people look at those terrible job postings that are like a wish list of everything that company could ever want a data scientist to be able to do and they’re basically describing a whole data science team in one job posting. People think that they need to learn all of those things in order to get that job so I would say no. Learn a few key things really well. Practice applying that knowledge you have to real world problems so you have experience like overcoming challenges that you’re gonna encounter in a real job and that will also help you have a story to tell in your interviews of how you overcame trouble and ended up having usable results in the end. I guess what I’m trying to say is don’t derail yourself and don’t feel like you have to learn everything you’ve ever heard of in data science in order to be a data scientist. None of us know how to do everything. You just have to know enough of the basics that you feel solid in that understanding and confident that you could pick up other tools and techniques as you need them. I would say learn the basics and then learn a couple specialty items that might set you apart or are particular to the field that you’re trying to get into. Also those communication skills are really important too, not just the tools and techniques.
Hugo: Absolutely. To build on that, something that you hinted at earlier is get out there and do some job interviews as well to find out what the market is like and what interviewers want and ask them questions to figure out what gaps you may have as opposed to learning in the abstract what you think may be needed out in the job market.
Renée: Yeah. It can be discouraging not to get a job but I remember once I did get a data science job looking back and saying all those ones that I didn’t get, they weren’t right for me any way so why should I feel bad about not getting them? I wasn’t right for the job or the company wasn’t right for me and so once I found one and it was the right fit and I feel good about it and I like my job so looking back I realize there’s just times when it really gets frustrating or depressing if you keep getting turned down, but there’s just so many different kinds of data science jobs out there. I think everybody can find one that matches their skills even though it might take a while.
Hugo: Yeah and I do think it’s incredibly discouraging and horrifying to not get a bunch of jobs in a row. Advice I give, which I definitely find difficult to take myself though, is that you only need one hit. You’re looking for one hit out of a bunch of opportunities and the ones that don’t work out can be really incredible learning experiences as well. That doesn’t make it any less brutal to be turned down.
Renée: Yeah. It’s not until after the fact that you look back and realize like how much you learned and how valuable those rejections were.
Hugo: Yeah. Exactly. Talking about what employers are looking for, I think one thing that we can forget about when thinking about data science in the abstract is that a lot of the time it’s used to solve business questions. You have a great slide that demonstrates how data analysis and science can be used essentially as an intermediary step to get from a business question to a business answer, so this movement from a business question to a business answer is factored through data science. I’m wondering how key this concept is to your understanding of data science as a whole.
Renée: Yeah I created that for one of my first data science talks in order to illustrate what I think the data analysis process is. I got such good feedback on it and people really like it so I go back to it a lot now. For anyone that hasn’t seen it, it has four little phrases with arrows between them. It starts with business question and goes to data question and then to data answer and then to business answer. I’ll go through each of those.
Renée: For the business question, I don’t necessarily mean like a sales and marketing kind of business but like a domain question, something that a decision maker in your particular field or business might ask. Your job as an analyst is to convert that into a data question. What data is required in order to answer it? Do we have it available? What related questions might we have to answer first to get to that one? What type of analysis needs to be done to get us to a usable answer? Then you have to do the analysis so that’s the data answer piece. The type of analysis will depend on like what kind of field you’re in, what’s your role and your skills, what data is available, so the type of analysis differs, but basically to turn that data question into a data answer you’re doing analysis.
Renée: Then you have to take the results of that and turn that into a business answer. There’s very few people out there that will want to hear your data answer. You have to be able to communicate that in terms that a non data scientist can understand so that they know what the data is telling them and can use that information to make a business decision. You have to be able to convey statistical results and uncertainty in business terms and explain what your analysis means and does not mean so it’s not misused. When we talk about building a report, in the real world the end result is usually not some sort of statistical readout with model evaluation metrics. It’s like a presentation of the results that is clear and usable by people that are not data scientists.
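As a rough illustration of that last step, the sketch below turns a data answer (a proportion with a confidence interval) into a one-sentence business answer. The scenario, the numbers, and the function names are hypothetical and do not come from the talk; the interval is a simple normal-approximation (Wald) interval, chosen only to keep the example short.

```python
import math


def proportion_interval(successes, total, z=1.96):
    """Normal-approximation (Wald) interval for a proportion; z=1.96 is about 95%."""
    p = successes / total
    margin = z * math.sqrt(p * (1 - p) / total)
    # Clamp to [0, 1] so the reported range is always a valid proportion.
    return p, max(0.0, p - margin), min(1.0, p + margin)


def business_answer(successes, total, what):
    """Phrase the statistical result in plain language (hypothetical wording)."""
    p, low, high = proportion_interval(successes, total)
    return (
        f"About {p:.0%} of {what} "
        f"(likely between {low:.0%} and {high:.0%}, based on {total} records)."
    )


# Hypothetical data: 120 of 400 sampled customers renewed.
summary = business_answer(120, 400, "sampled customers renewed")
print(summary)
```

The point of the sketch is the shape of the output: the estimate and its uncertainty are stated as a range a decision maker can read, rather than as model metrics.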
Hugo: Absolutely and I do think to keep in mind that we’re always attempting to answer business questions or develop business insights in this context is incredibly important. I wanna shift slightly. We have a lot of aspiring data scientists and learners out there. I’m wondering what’s your take on where people can learn, particular places people can learn the skills and knowledge necessary to become a data scientist.
Renée: Well like I said I have a hard time giving specific recommendations because it’s so personal but I’ve heard great things about DataCamp of course. It’s actually the highest rated course system on DataSciGuide, so people that use DataCamp seem to really love it.
Hugo: That’s great. Personally, I’m a huge fan of DataCamp as well. I don’t know whether there’s any bias involved here.
Renée: I’m not saying that just to suck up. It’s really… people love it. Also there’s Dataquest. There’s Khan Academy for some of those basic skills. There are lots of books out there. People tend to really like the O’Reilly books and there’s some other favorites. Again, I hesitate to give specific recommendations just because they vary so much. People can tweet me if you’re looking for a certain resource that will get you started from where you’re at and usually I retweet that and lots of people that follow me will help answer. It’s really kind of a personalized answer but I’ll just say there are a ton of resources and it’s easy to get overwhelmed by the resources so don’t be afraid to ask to find what might be best for you and then if someone recommends something and you really don’t like it don’t feel bad about that either. Just move on to the next thing.
Renée: So yeah I mean my site, DataSciGuide, I’m trying to collect those reviews from data science learners so we can get a sense of what did you need to know before you used this resource, because that tripped me up a lot when I was learning: there weren’t clear pre-requisites for certain resources and I would start out real excited like yeah I’m getting it and then five lessons in be totally overwhelmed and wanting to give up. I think that’s dangerous. Yeah talk to people that are just ahead of you on the learning path maybe and find out what helped them get over that first step from where you are to where they are, and maybe not reach out to people that are already working as data scientists but other data science learners.
Hugo: So something we’ve been talking around, Renée, is Twitter which can be an incredible resource for aspiring data scientists so maybe you can tell me a bit more about that.
Renée: Yeah so in addition to all like the books and courses and tutorials, I really use Twitter a lot to get the lingo of data science. There are these great communities on Twitter and you can usually find them by searching for certain hashtags. I’ll give you a few of them. For Python people, there’s pydata, pyladies, p4ds. For people learning R, there’s Rstats and Rladies, R4ds. These are all hashtags you can search. A lot of those have Slack channels too. There’s a data science learning club Slack channel that some followers of mine started a while back based on my podcast learning activities. There’s a Slack called Data for Democracy for people who want to get into political data. There’s a hashtag for data ethics. I’m sure there’s similar groups like these on other social media like Facebook and LinkedIn but I’m mostly on Twitter, so I have a whole blog post about using Twitter to learn data science, and if you start searching for hashtags related to what you’re learning, you’ll usually start finding the leaders or the hubs in these communities and you can learn a whole lot just by following them. Then if you ask a question and use that hashtag you’ll usually get an answer. It’s pretty cool.
Hugo: That’s awesome. We’ll link to your article on how to use Twitter to learn data science in the show notes as well. So for learners, how will they know when they’re ready to actually be a data scientist or start interviewing?
Renée: Yeah. I think people are ready to start applying for jobs before they feel fully ready to make that jump. Don’t wait too long to start looking. Like we talked about, like doing those interviews is really instructional as well but I’d say that you’re ready when you’re confident enough with those basics so you know how to do exploratory data analysis and do some statistical summaries. You know that basic feature engineering, how to get a dataset into shape that you can use for machine learning. You know how to do some of that pre-processing and clean up. You can build a good report and a data visualization and communicate the results. Maybe you’ve used a few basic commonly used machine learning algorithms like logistic regression and random forest, so you’re confident enough with these basics that you know that you’re not gonna be totally struggling on the job.
Renée: Once you feel that you have that solid understanding of like how machine learning works and you can apply it, you probably want to also add in a few specific techniques that will make you stand out, either something you feel like you’re good at. Maybe you’re really awesome at building pretty visualizations that are easy to read. Maybe you’re really good at that back-end data engineering stuff. Something that you can say is your specialty when you’re applying for the jobs but you don’t need to check off the entire list of every algorithm and every tool and technique out there.
Renée: I’ve interviewed for jobs that included skills that I already had throughout my career and I was confident with, plus some skills that I was still picking up. If I knew that I could understand what people wanted and I was confident enough that I could pick up those new tools and techniques along the way, then I realized like I got a job before I thought I was ready and at least I hope and I’ve been told that I’ve done really well there. A lot of stuff you can pick up as you go if you have the basics down. Don’t feel like you have to be an expert in every area. Nobody is. Start applying and you’ll get a sense for what it is that you still need to learn in order to get a certain type of job but yeah don’t wait too long.
Hugo: I think the field is so vast and there are so many techniques and new techniques emerging all the time that if you try to be as comprehensive as possible you’ll always feel there’s more stuff to learn and you’ll never get out there.
Renée: Yeah you’re going to be learning on the job no matter how advanced you are when you apply. There’s a huge demand out there right now for people with data skills, so even if you get kind of a transitional data analyst type of role you might not have the title of data scientist right away, but if it’s a role that offers you the possibility of doing some machine learning yeah you can grow into that as you work.
Biggest Ethical Challenges
Hugo: I wanna shift slightly. Recently you gave a talk called Can a Machine be Racist or Sexist? Using this question you posed as a jumping board, can you speak to what you consider the biggest ethical challenges facing data science and data scientists as a community?
Renée: Yeah so we could do a whole episode just about this. I’ll connect you with some people that I think would be great interviews that could talk extensively on this topic but the main purpose of me doing that talk was to get people to understand that even though you’re using these mathematical algorithms and computers to get a result, that doesn’t mean that things produced by data science are unbiased. There’s so many ways that bias can get in, maybe you’d say racism or sexism, and I’m talking about a systemic kind, so not somebody yelling a word at somebody on the street, but historical racism that’s baked into systems. I have that master’s in systems engineering and I think I’ve always been kind of a systems thinker so I picked up on this quickly and I was trying to share it with other people. You can link to my whole talk for all the slides. I really struggled to cram in all the examples I wanted to give because there’s really so much to learn here.
Renée: With machine learning, you’re really doing pattern matching. That’s what those algorithms are doing, finding patterns in the data, which is a lot like stereotyping. You have to be aware of what data is going into making those decisions and make sure you understand the model outputs and it’s not completely a black box where you don’t understand why a particular decision was made by the model when people’s lives are being affected. Biases can be introduced at every step along the way in this development process. The data could have been incorrectly recorded in the first place. It might not be representative of the full population. It might be a limited sample and you’re training your model assuming it’s gonna generalize and it might not.
Renée: Your data could contain historic biases. For instance, crime databases are only gonna contain records for crimes in areas that are policed. If a crime at a certain location isn’t observed or isn’t recorded into the system by the police, an algorithm you train on that is gonna think there was no crime there and make predictions accordingly so it’s just you’re encoding not what’s happening in the real world necessarily but you’re capturing what people are capturing about the system that you’re looking at. There’s certain techniques that can amplify bias when you’re doing your pre-processing and model training.
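A toy sketch of the policing example: two hypothetical areas have the same true number of incidents, but only incidents that are observed end up in the database. All names and numbers below are made up for illustration, and the "model" is just a share-of-records score, not any real predictive policing method.

```python
# Hypothetical numbers: both areas have the same true number of incidents.
true_incidents = {"area_a": 100, "area_b": 100}

# Assumed fraction of incidents that get observed and recorded in each area.
coverage = {"area_a": 0.9, "area_b": 0.2}

# The database only contains what was recorded, not what actually happened.
recorded = {
    area: round(true_incidents[area] * coverage[area])
    for area in true_incidents
}


def naive_risk_scores(records):
    """Score each area by its share of all recorded incidents."""
    total = sum(records.values())
    return {area: count / total for area, count in records.items()}


scores = naive_risk_scores(recorded)
for area, score in scores.items():
    print(area, recorded[area], f"{score:.0%}")
```

Anything trained on these records would rank area_a as far riskier even though the underlying rates are identical; the difference comes entirely from where observation happened.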
Renée: There’s the question of what are you even optimizing for? For instance, YouTube has this problem where they’re optimizing for viewing time. They want your eyeballs on their ads. If something is like particularly crazy or creepy or exciting, people are gonna watch it a little longer and so those videos that are really extreme will bubble up to the top and be recommended to more people because when you watch them you might be fascinated by them and watch longer. It can kind of radicalize people. People might get to the point, especially kids I think, where you can’t necessarily separate the truth from this fiction that’s constantly in front of you because that fiction is exciting and interesting and makes you watch longer.
Renée: What you’re optimizing for and what kind of effects that could have is important. How do you even decide when to stop optimizing or if the results of your model are good? That’s a decision that requires a human input. How do you know if the results of your model are being used properly and it’s not being misused or misinterpreted? There’s people making decisions at every step along the model development process so you can’t say that oh it’s automated and computerized, there’s no bias involved. There can be bias introduced at every single step.
Hugo: A lot of these issues are cultural as well, that as a community of data scientists we’re only now really starting, well there’s been work done on it previously, I don’t wanna dismiss that but we’re really only starting to think collectively about how to approach these problems now.
Renée: Yeah definitely. Yeah and it’s a culture of how the company is run and it really takes us data scientists making decisions about what we’re willing to do as well. So much of this like the models are being built under pressure for deadlines and being rolled out and you might not even know how it’s being used in the end, but just being aware of the impact of these things that we’re building is important. I love this quote from Susan Etlinger in a TED Talk that she gave. She said we have the potential to make bad decisions far more quickly, efficiently, and with far greater impact than we did in the past. We’re really just speeding up these decisions. We’re not necessarily making them better unless we make an effort to do that, so we have to make sure that as data scientists that we’re not causing harm and we’re in high demand right now so we’re lucky we have some choice in what kind of businesses we’re willing to work for and what kind of products we’re willing to contribute to. We can make a difference in our future and hopefully make it a little less dystopian than the entertainment world imagines or that we can imagine just by being aware of this and making conscious decisions of what we’re willing to build.
Call to Action
Hugo: I couldn’t agree more. Renée, do you have a final call to action for our listeners out there?
Renée: Yeah so I know there’s a lot of people that listen to these podcasts that are just getting into data science but some people have been lurking on Twitter for a long time, listening to podcasts for a long time, reading books, and so my call to action for them is like dig in. Find a data set. Start working with it. Tweet me at becomingdatasci if you need help. I’ll connect you with an online community that can help get you started. Don’t delay actually working with real data.
Renée: My call to action for people that aren’t new to data science is I would encourage you to read up on the data ethics so that you understand how the work that you do in this field can affect real people’s lives. There’s lots of great books out there now so someone remind me when this episode comes out and I will tweet a list and share a bunch of books that I’ve collected that I’ve either read already or they’re in my kindle waiting to be read because I’m really interested in this topic and it’s important to me and I think it’s vital for people in our industry to be well aware of, so that would be my call to action for people that are already data scientists.
Hugo: Fantastic. Renée, it’s been such a pleasure having you on the show.
Renée: Great thanks for having me, Hugo. I’ve been listening for a long time and it’s exciting to actually be on here.
Hugo: It’s great to have you on particularly because I was listening to your podcast for so long, so it was a really fun experience.
Renée: Great.