Data Science R&D at TD Ameritrade

[This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers].

Hugo Bowne-Anderson, the host of DataFramed, the DataCamp podcast, recently interviewed Sean Law, a Senior Applied Researcher and Data Scientist at TD Ameritrade.

Introducing Sean Law

Hugo: Hi there, Sean, and welcome to DataFramed.

Sean: Thanks for having me, Hugo.

Hugo: It’s a real pleasure to have you on the show, and I’m really excited to have you here today to talk about the research and development side of data science, particularly as it pertains to your work at TD Ameritrade, a financial services and brokerage firm. We’re going to get into all of that, but before we do, I’d like to find out a bit about what you do. In an org such as TD Ameritrade, I’m sure there are a lot of different impressions of what data science and R&D do and are capable of, so I’d like to start off by having you tell us a bit about what your colleagues think that you do.

Sean: Yeah, I think a lot of people within the organization are aware that I have some sort of scientific or research background, so they think that I spend the majority of my time dreaming up crazy new ideas or exploring the deepest, darkest areas of research. In reality, while I might do some of that, boiling it down to one sentence: I tend to ask a lot of hard questions, or interesting questions. That’s usually where I start.

Hugo: So you ask a lot of hard questions. Do you get to answer some of them along the way as well?

Sean: Hopefully so. Otherwise, it wouldn’t be that fun of a job. But I think what is important is to ask a lot of hard questions and have a good amount of skepticism, right? Or at least a healthy dose of it, to understand where people are coming from and what problems we’re actually trying to address. Because sometimes when you peel back certain layers, you start to realize that the underlying question is not well posed, or the underlying problem is either extremely hard or perhaps even impossible to solve. That doesn’t mean we need to give up on it immediately, but I think having a good amount of empathy, as well as listening skills, goes a long way, especially when it comes to data science.

Hugo: For sure. This trope of the crazy research scientist still exists in a lot of ways, but one thing we’re here to do today is demystify that. You’ve told us a bit about what you actually do, but maybe you can go into more detail about what your job entails.

Sean: I work in essentially a research and development group; the name of my team is Exploration Lab, here at TD Ameritrade. We used to be called Advanced Technology, but more and more we’ve been focused on how different areas of research, sometimes unrelated to finance, could be applied in the finance sector, and on thinking through what new experiences customers might have. One thing to clarify: a lot of times when people hear that I’m in finance, they will immediately ask me for advice on financial markets or investments. I’m not a financial advisor.

Hugo: Well, it’s funny that you mention that, because ETFs went down in December. I was wondering if you could tell me a bit about … I’m just kidding. So, yeah, go on.

Sean: No, I’m probably the last person you want to be asking for financial advice, because I’m not as qualified as many others are. But at the same time, that also provides a unique perspective: I’m unencumbered by some of that history, and I might bring a fresh perspective to things. So being able to look out for those new experiences for a customer is very important, whether that comes in the form of providing advanced analytic methods to our active traders, providing a more personalized experience for our retail customers, or simply improving process efficiencies for the institutional side of the business, which is our advisors. There’s really a lot of great opportunity, especially in terms of how we can leverage technology. So the focus for us is leveraging technology, but nowadays leveraging technology alone sometimes isn’t sufficient, and that’s where the data side of things comes into play.

Hugo: And so what’s the trade-off there, or let’s say the balance, between doing research, talking with relevant stakeholders about the business problems, sitting down and just thinking, the act of discovery, and all of these things?

Sean: That’s a great question. Initially, as I transitioned away from academia into industry, I thought to myself, "I’m a hard worker, right? If I put my head down and churn out work 90% of the time, and spend 10% of my time building relationships and attending meetings, then I’ll do great." But very quickly, having been in industry for maybe three to six months, I realized that that’s not sufficient. Really, it’s a juggling act: you have to spend your time effectively, not only, to your point, doing some of the research and experimentation, but also managing the relationships and spending a lot of time actually understanding what is being asked by our business stakeholders, even reading between the lines. You have to do your due diligence before writing a single line of code, or even thinking about how to solve a problem: you have to truly identify what problem we’re trying to solve, and whether there is a problem at all.

Sean: Sometimes when you ask somebody, "What are your pain points?", people will tell you what’s top of mind. In reality, for our team, what we really want to do is think about problems or new opportunities that span the entire industry. If TD Ameritrade is the first one to solve that problem, or to begin shaving off some parts of it, then that’s a huge win for us. And again, my team is doing exploration; we’re not doing production. We’re really on the research side of things, and we might build out some initial proofs of concept to prove out the potential of either a technology or, in the data science case, a certain methodology that might be applicable. But we’re typically far away from production.

Hugo: Interesting. And something you spoke to in there: when people have a problem, or think they have a problem, there’s actually a question of whether data science and/or analytics is the correct way to answer it, right?

Sean: Absolutely. That’s what I touched upon at the beginning about having a healthy dose of skepticism, right? Not necessarily in a negative way, but to make sure that people are bringing the right problems to the table. And again, the team that I work on is composed of a very diverse set of people, so I am one of maybe two data scientists on the team. The rest of my team have skills such as UX (user experience design), cybersecurity, front end, back end, mobile, AR/VR, AI, architecture … you name it. We’ve got somebody with some level of expertise, or a considerable amount of experience, in each of those fields, and that makes it a very collaborative environment for our team to explore in.

Hugo: That’s really cool. How big is the team?

Sean: About 12 or so people; that fluctuates over time.

Hugo: That’s a really nice size with those types of skillsets, and it leads me to a challenge that such teams can present. Part of my background is working in a cell biology lab in biophysics, thinking about cell growth, cell division, these types of things. We had physicists, chemists, biologists, and mathematicians such as myself all working together, collaborating on cell biological questions. That means you get a lot of different points of view, and in essence, a lot of different types of creativity and knowledge; everyone brings a lot to the table. But a problem with that is that sometimes you’re not all having the same conversation.

Sean: Absolutely. Maybe that’s the benefit of having come from an academic setting: I worked in a large lab of 25-plus postdocs, grad students, and experimental researchers at the University of Michigan here. Many of them, to your point, were coming from physics, engineering, computer science, mathematics, or biochemistry … that’s actually my background, doing this sort of computational chemistry work. Being able to understand the different viewpoints and what people bring to the table is in itself a very important thing. Having spent a good amount of time there, I realized that there’s no way, even as a scientist, let alone a data scientist, that you can know it all.

Sean: Once you can set aside your ego, the synergy that can be had with your collaborators can do amazing things. Having experienced that in the past, as we build out our team here at TD Ameritrade, that’s something we very consciously keep top of mind: we’re not hiring the same people over and over again. In fact, my biggest thing is that I’m not interested in hiring another Sean, right? What’s important is that we have that diversity of experiences and thinking, which really helps push the boundaries of what our team can look at.

Hugo: For sure. I’m not sure I could handle another Hugo at DataCamp, in all honesty. So are you actually hiring at the moment?

Sean: Right now we’re not, but we always have openings coming up in the future. We also have a data science team within the company that sits here in the Ann Arbor office with us, and they’re actively looking. I’m also part of an enterprise-wide initiative, an AI council: I serve as an advisory member of the internal AI council, where we’re looking at how we might apply data science and aspects of artificial intelligence, namely deep learning methods, to help solve some real problems within the company. For those types of initiatives, we’re obviously going to be expanding into that realm even more over the coming months, so I’m sure that adding new team members will be a priority.

Hugo: Okay, great. So we’ll put a link to your careers page in the show notes and any other resources that may help people who are interested in opening a conversation like this.

Sean: Great.

How did you get into data science?

Hugo: You spoke a bit to your background in scientific research in a lab. I’d love to know just a bit more about your background and how you got into data science originally.

Sean: Yeah, maybe going a little bit further back: when I was growing up, coming from an Asian family, I thought that it would do me well to venture down the path of becoming a physician. That never really panned out. I went to school as a biology major with an emphasis on biochemistry, and I was very data driven about it, if you will, in that I chose that major because it had the highest probability of acceptance into a medical school in Canada. That didn’t pan out either, but growing up I was always very, very good at math, so while doing biology I also did a minor in applied mathematics. I spent a good amount of time doing summer research in a computational, sort of geometric, field where I was exploring protein flexibility and its effects on how proteins bind to different types of ligands.

Sean: As my undergraduate career was coming to a close, I was thinking about the next step, and it was recommended to me: "Hey, maybe you should consider graduate school." Up until that point, it was something I had never considered, but looking into it, it was really a fantastic opportunity. I know that you’ve had several past DataFramed podcast guests, such as Sebastian Raschka and Randy Olson, who went to school at Michigan State and there-

Hugo: That’s right.

Sean: … that’s actually where I went. A side note is that Sebastian came after my time there. I overlapped with Randy at Michigan State, and Sebastian worked in a lab that I also rotated in. So we have a lot in common.

Hugo: Cool, and of course, yourself and Sebastian and Randy are all strong members of the PyData community as well, which is really cool.

Sean: Yeah, yeah. Maybe that’s something that’s important to us as well. I’m one of the co-organizers, along with Ben Zaitlen and Patricia Schuster, here in the Ann Arbor community; we run the PyData Ann Arbor monthly meetup, hosted here in the TD Ameritrade office. We spend a lot of time thinking about what data science means and what value we can bring to the local data science scene, as well as, to some extent, the startup scene. In fact, last month our speaker was a senior legal counsel here at TD Ameritrade, who gave a talk titled "Privacy Isn’t Dead", which I thought was fascinating and something that is important for all of us as data scientists to think about. It’s not just about putting your head down and crunching the numbers, right? There are people behind the data, and it’s important for us to always consider the privacy aspect. Almost all of the talks are recorded and posted on YouTube, so I invite everybody to go check them out.

Hugo: Right, and we’ll link to that in the show notes as well. So grad school, what happened after grad school?

Sean: During grad school, I did a lot of computational work, so I worked strictly in a dry lab. I did computer simulations of protein-DNA interactions, some of the largest simulations of their time.

Hugo: And that was molecular dynamics stuff, right?

Sean: Right, right. I think there’s some overlap with some of your … Michelle Lynn Gill as well who did some-

Hugo: That’s right.

Sean: … MD simulations.

Hugo: And for our listeners … correct me if I’m wrong, but molecular dynamics is simulating stuff on a really short timescale, with all the interactions, and you need a lot of computing power to do this, right?

Sean: Absolutely. I feel like a curmudgeon these days, because when I was doing computer simulations, which was realistically not that long ago, we were using CPUs and parallel computing on some cluster of computers. It took me probably six months to a year to produce several hundreds of nanoseconds of simulation, or I guess even sub-millisecond-type simulations, and we’re talking about simulation time steps of picoseconds here. Now, with the growth in usage of GPUs, people are basically reproducing simulations that I ran within weeks, if not days. People are … maybe "spoiled" is an exaggeration.

Hugo: Definitely not. And I do have this vision of you being like, "Back in my day, we never had GPUs. We got by with …" you know?

Sean: Right.

Hugo: I love it. So, yeah, grad school, molecular dynamics, then what happened?

Sean: So that’s when I moved from Michigan State to the University of Michigan to become a postdoc, in a very similar computational lab. In this case, I was doing simulations of protein-protein and protein-RNA interactions, and what are called coarse-grained simulations, where you can think about different scales of dynamics. Usually, at the atomistic level, you’re looking at atom-to-atom interactions, but when you scale that out using a coarser-grained model, you can look at larger and larger dynamics and study those. So I was looking at the binding of different proteins that might affect transcription. During essentially my entire PhD and postdoc career, what people refer to as data science today, I was just doing as science out of necessity: things like applying PCA or k-means clustering to look at which protein structures look very similar to each other and what some of those dynamics are. People call it machine learning these days. Around my colleagues, it was just a necessity, again.
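For readers who want to try this, here is a minimal sketch of the kind of PCA-plus-k-means workflow Sean describes, using Python and scikit-learn, with synthetic data standing in for real protein-structure coordinates (the array shapes and cluster count are illustrative assumptions, not his actual analysis):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic stand-in: 500 "structures", each flattened to 90 coordinates
# (e.g., 30 atoms x 3 dimensions). Real data would come from an MD trajectory.
rng = np.random.default_rng(42)
structures = rng.normal(size=(500, 90))

# Reduce to a few principal components capturing the dominant motions.
pca = PCA(n_components=3)
projected = pca.fit_transform(structures)

# Cluster in the reduced space to group similar conformations.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(projected)

print("Explained variance ratios:", pca.explained_variance_ratio_)
print("Cluster sizes:", np.bincount(labels))
```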

Hugo: Yeah, sure. So then you transitioned from academia to industry, right?

Sean: Yeah. A large part of that, and maybe something that isn’t often talked about when moving from academia into industry, or even when staying in academia, is what happens when you start growing up, becoming an adult, and needing to raise children. It was precisely at the moment that I had a child that I started thinking, "What does life mean?" I had, perhaps, an existential crisis. I started pondering whether or not I wanted to stay in academia, given the very competitive academic environment I was in, and what it would mean to move into industry and open up my options, especially with scientific funding being very, very challenging. I also realized that maybe, for the rest of my life, I wouldn’t actually be doing science: I’d probably end up either teaching, which I do enjoy, or spending the majority of my time reading papers and writing grants rather than doing fundamental research. That was a little bit depressing for me.

Sean: So I was very lucky. A postdoc I had worked with, Dr. Aaron Frank, who is now a faculty member at the University of Michigan and was a close collaborator, asked me the question, "Did you ever consider data science?" This was back in 2014, and up until that point I had never even heard the phrase, even with DJ Patil popularizing the term. As I looked into some of the job postings out there, I thought to myself, "I feel like I have 90% of the skills that people are looking for, and the other 10% I would be absolutely interested in learning more about." Having that curiosity, that’s when I made the change, transitioned over to industry, and started looking for jobs.

Data Science in Finance

Hugo: Great. So, I want to pivot now to talk about your work at TD Ameritrade, but before we do that, I’d like to speak more generally about data science in these types of organizations. I’m wondering what aspects of financial services and brokerage at TD you think data science can have the largest impact on.

Sean: I think the biggest area of opportunity for us is personalization and, in that same regard, looking at how we take the large amounts of data out there and boil it down to the essential information that is pertinent to individual investors, whether long-term investors or active traders. Now obviously, these types of problems are on different timescales: our long-term investors might be looking at things over months, maybe even years, while our active traders could be looking at things on the minutes-to-hours-to-days timescale.

Sean: When it comes to personalization, we get into the realm of things like recommendation systems, a la Netflix and other folks. We’ve definitely been exploring some of that, but even earlier, when I started my tenure here, we were thinking along the lines of natural language generation and natural language understanding. A lot of NLP-type work: understanding what customers are calling us about, and even taking phone calls and trying to predict what people might be calling about.
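As a rough illustration of what "predicting what people might be calling about" can look like, here is a toy intent classifier in Python using scikit-learn; the example texts and intent labels are invented for illustration and are not TD Ameritrade’s data or model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data standing in for transcribed customer contacts.
texts = [
    "I can't log into my account",
    "what is my account balance",
    "how do I transfer funds to my bank",
    "my statement looks wrong this month",
]
intents = ["login_help", "balance", "transfer", "billing"]

# TF-IDF features feeding a simple linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, intents)

print(model.predict(["hi, I'm having trouble logging in"]))  # -> ['login_help']
```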

Hugo: Right, so in that case, we’re talking about conversational software of some sort, right?

Sean: Yeah. That’s also interesting, because a conversation can happen on a phone call, but it can also occur on some sort of chat platform. For our customers here at TD Ameritrade, we have something called Ted, which is basically an online chat agent. We’ve also built a Facebook chatbot here at TD Ameritrade, as well as a Twitter bot, that our customers can interact with. A lot of that initial work started off on the Exploration Lab team.

Hugo: Right. That’s really interesting. It reminds me of two of my collaborators, Jacqueline and Heather Nolis, who are a data scientist and a machine learning engineer at T-Mobile, respectively. I might get this slightly wrong, but the message is the same. They’re working on conversational software and machine learning at T-Mobile, among other things. There was a simple problem: people would come to log in and have a conversation with an agent, but had to answer a few questions beforehand. The customer would initially say, "Hey, there’s a problem with my bill," and the follow-up question would be, "Are you a customer?", because that’s the first question that’s asked. Of course, from the initial statement, it’s clear that they are a customer. So even solving small challenges like this is incredibly helpful, right?

Sean: Yeah, NLP is an extremely, extremely hard problem, right? And that’s why we have tons and tons of academic researchers that have been looking at this for decades. It definitely poses a problem for us, too, being able to understand the question that’s being asked, but also being able to manage the context of that question as it relates to not only the current conversation, but also as it relates back to the customer’s journey at TD Ameritrade. I think the holy grail for any sort of conversational agent is to be able to tie together all of the touch points that a customer has had with us, and be able to know exactly what you want without you even having to tell us, right? That would be a bright day for humanity in general.

Hugo: Yeah, I always joke … and it’s not a joke at all actually, that NLP must be hard if we still use regular expressions.

Sean: Yeah, now you have two problems, right?

Hugo: Exactly. So tell me a bit more about, I suppose, the brokerage aspects or the financial aspects. I presume you’re probably deeply interested in thinking about time series analysis and prediction and this type of stuff.

Sean: It depends. On our team, I personally have explored a proof of concept looking at trying to spot patterns within a time series. One might think that it would be for predicting the financial market, but in fact, it’s for internal use. It has varied and general applications, such as finding patterns in our server outages, or server resilience: being able to spot that, prior to some strange event, memory usage was very, very high, or that a certain number of cores were down. Spotting those patterns ahead of time and finding early indicators allows us to become even more resilient, so that at the end of the day, we’re better able to serve the customers.

Sean: Other areas of application would be to look at our customers themselves: what is their pattern of usage of our platform, and can we identify whether there are certain steps people take before they reach the next milestone of their investment journey? But, again, what we’ve built here … and hopefully we’ll be able to talk more about this in the future … draws on some research conducted at UC Riverside by a professor out there, which looked at taking a time series, a fairly long time series, and being able to spot patterns in it. That in itself is a very tough problem, because imagine a time series with 10,000 or 100,000 data points: without being told what pattern to look for, can you slide a window across that time series and, at each window, find the top k closest matches to it? If you think about it, that’s in itself an n-squared calculation, depending on how many windows you have, so it becomes computationally intractable very quickly. But the research conducted there showed that, using a smarter algorithm, you can actually get these exact matches and do something with them. So we leveraged some of that, and we currently have a patent filed to apply some of that technology. At the same time, at TD Ameritrade we’re working to open-source some of the underlying code to allow the data science community to start applying some of this to their own time series work as well.
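To make the brute-force version of that concrete, here is a naive Python sketch of the O(n²) idea Sean describes: slide a window across the series and, for each window, find its closest non-overlapping match. This is only an illustration of the problem, not the patented or to-be-open-sourced code; the smarter algorithms he alludes to compute the same answer far faster.

```python
import numpy as np

def nearest_neighbor_distances(ts, m):
    """For each length-m window, the z-normalized distance to its best
    non-overlapping match elsewhere in the series (naive O(n^2) version)."""
    n = len(ts) - m + 1
    windows = np.array([ts[i:i + m] for i in range(n)])
    # z-normalize each window so matches are shape-based, not scale-based
    windows = (windows - windows.mean(axis=1, keepdims=True)) / (
        windows.std(axis=1, keepdims=True) + 1e-12)
    dists = np.full(n, np.inf)
    for i in range(n):
        for j in range(n):
            if abs(i - j) >= m:  # exclude trivial, overlapping matches
                d = np.linalg.norm(windows[i] - windows[j])
                if d < dists[i]:
                    dists[i] = d
    return dists

ts = np.sin(np.linspace(0, 20, 400)) + 0.1 * np.random.randn(400)
profile = nearest_neighbor_distances(ts, m=50)
print("Most repeated pattern starts near index", int(np.argmin(profile)))
```

The minimum of this profile points at the most faithfully repeated pattern in the series, which is the kind of candidate "early indicator" Sean mentions; the two nested loops are exactly why the naive approach becomes intractable on long series.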

Hugo: Fantastic. And this is a great example, because it speaks to a huge part of your work, which has been a through-line here but which we haven’t necessarily stated explicitly: you think about particular questions, problems, and challenges, but you want to draw on results and research from all types of different industries and disciplines, right?

Sean: Yeah, and I think that’s where, again, my background comes into play, along with that pretty stereotypical data science characteristic, the curiosity that everybody talks about. In addition to that, what I tend to do is try to find very niche methodologies that people are looking at. But also, going back to what I was saying earlier about listening: time and time again, when I listen to people’s problems and what they’re trying to solve, I start realizing that there’s an underlying theme across all of them. In this particular case, it is trying to find patterns in a time series without expert human knowledge or domain expertise, necessarily. It isn’t necessarily about doing all the work for you, but it at least makes it easier for the human to direct their attention towards certain patterns.

Sean: I think that’s one thing our team tries to do a lot of: hear what people are saying, try to connect the dots, and keep an eye out for new and emerging methodologies or technologies. But in the same breath, I would say that sometimes there isn’t even a lot that’s new. Think of something like deep learning or neural networks: those things have been around for decades, and they’re now newly interesting. I think that’s something Hilary Mason has said in the past, too.

Hugo: Something I also recall, when I think of your work, is the idea of providing customers with alternative data sets.

Sean: Yeah, that’s something that’s very, very interesting. Investors are always looking for some sort of edge, or maybe some inside scoop, right? But at the end of the day, it’s data, or information … information by proxy … that will help drive your decision making; what you’re getting at is data. So think about it: not only are we reading news and opinion pieces, and looking at how prices are doing for certain securities, but are there also alternative data sets that might be indicative? Again, this is not to say that past performance can help you predict the future, but at the end of the day, the question is how we as a company help provide our customers with the best experience, and that potentially involves providing alternative data sets. As a research team, some of what we do is consider what some of those alternative data sets might look like.

Hugo: So maybe you can’t speak to real examples just for IP reasons, but can you give us a hypothetical example?

Sean: Yeah, definitely. And maybe this gives you a glimpse into how, sort of, my brain and my thought process works.

Hugo: Please, let’s enter the brain of Sean.

Sean: Maybe a couple of years ago, I was literally driving to work, and I was listening to NPR, of all things; I usually don’t listen to music. NPR was talking about how, I think it was NASA … they were releasing a brand-new data set to the community: satellite imagery that allows you to look at pollution levels across the globe. That got me thinking, "Hey, I wonder if we can leverage this open data set." This was NASA’s Aura satellite data set, which maps high-resolution images of air quality.

Sean: Think about it as an investor: I might be invested in some sort of commodity, whether it’s soybeans or the orange crop. If that crop is in an area of high pollution, that could affect the potential yield in the future. So we can marry data such as this image data with some historical data about these commodities, leveraging things like convolutional neural networks, which are in vogue these days, and perhaps provide a more holistic view of how the commodity might perform in the future.
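Since Sean’s example is explicitly hypothetical, here is an equally hypothetical sketch in Python (PyTorch) of how one might fuse features from a small convolutional network over satellite imagery with tabular commodity history; the architecture, tensor shapes, and names are all assumptions for illustration, not anything TD Ameritrade built:

```python
import torch
import torch.nn as nn

class PollutionYieldModel(nn.Module):
    """Hypothetical sketch: fuse satellite-image features with tabular
    commodity history to predict a yield-related target."""
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(                      # image branch
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())     # -> 32 features
        self.head = nn.Sequential(                     # fused regression head
            nn.Linear(32 + 8, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, image, history):
        return self.head(torch.cat([self.cnn(image), history], dim=1))

model = PollutionYieldModel()
image = torch.randn(4, 3, 64, 64)   # fake pollution-imagery patches
history = torch.randn(4, 8)         # fake commodity-history features
print(model(image, history).shape)  # torch.Size([4, 1])
```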

Hugo: That’s cool. I really like that explanation, and I appreciate it because it goes all the way from stating the question, problem, or challenge, to thinking about what type of technology and collected data are appropriate, to the state-of-the-art techniques. That’s really cool, Sean.

Research and Development

Hugo: What I’d like to do now, given we’ve talked about the types of questions and impact that data science can have at TD, is find out more about R&D: the fact that you don’t necessarily build scalable infrastructure or end-of-the-line products, but proofs of concept instead. I’m wondering what a proof of concept looks like, and what the distinction is between proofs of concept, prototypes, and products.

Sean: Yeah, that’s a great question. The way we view this pipeline is that a product is something that tends to be put into the hands of some end customer. Now, that customer could be anybody: an external customer, but also some internal associate here at TD Ameritrade, maybe somebody who will help improve our general processes or help customers make better investment decisions. But we’re at the opposite end of that spectrum, at the proof of concept, and at that level we are literally trying to prove out whether or not there is potential. That tends to mean there’s a lot of risk involved, or at least perceived risk, because we’re literally standing in front of a dark cave.

Sean: That’s the analogy I use: we’re standing in front of a dark cave with a chintzy flashlight that lets us see only 10 feet ahead. We know there might be light at the opposite end of that cave, but we really don’t know until we start venturing in. Whereas if you’re in production, you’re probably pretty sure not only that this is what the customer wants, but that you have a scalable solution or, in the case of machine learning models, something with relatively good precision and recall.

Hugo: Let’s zoom in on that, and maybe this is what you were getting to: what it means to have potential. What’s the sniff test, or the rule of thumb? How does something show enough potential to go to the next phase?

Sean: For us, it’s being able to have a reasonable internal use case that we’re exploring, along with the underlying technology or methodology, and realizing that this is, hopefully, a truly hard problem we’re trying to solve. If we’re able to provide a minimum viable proof of concept … so MVP in this case stands for minimum viable proof of concept … then that might have enough legs to transition into what we call a prototype. The way we view it is that a proof of concept, again, shows the potential: okay, there’s something there.

Sean: To give your listeners an example: say I’m building a machine learning model, maybe a binary classifier of some sort, and I don’t even know where the data is to answer this problem. That’s part of the risk we’re taking on. So we have to explore whether the data is internal, whether we have to collect it ourselves, or whether it sits behind some third-party vendor API we need to connect with. Then, once you bring in that data, how much effort does it take to whip it into shape for some data processing pipeline? Maybe afterwards we do some analytics on it, or build a machine learning model, say that binary classifier. If we build something out of the box, are we getting better than a coin flip?

Sean: Now, we’re not looking to target 99% accuracy, because this isn’t a Kaggle competition, right? What we’re looking for is at least something better than 50/50. That gives us some level of confidence that there is perhaps some signal amongst the noise, and that’s when we start thinking about moving it towards a prototype, where the goal truly is to optimize model performance. That’s one thing, but perhaps 50/50, or at least 60%, accuracy is enough where previously we had no process at all. With no process in place, your accuracy is essentially zero, right? That’s one way we look at things.
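A hedged illustration of that sanity check, using scikit-learn on synthetic data (not TD Ameritrade’s evaluation code): compare an out-of-the-box model against a literal coin flip.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for a real internal set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: random guessing, i.e., the coin flip.
coin_flip = DummyClassifier(strategy="uniform", random_state=0).fit(X_tr, y_tr)
# Out-of-the-box model with no tuning.
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("coin flip:", coin_flip.score(X_te, y_te))  # ~0.5
print("model:    ", model.score(X_te, y_te))      # hopefully clearly above 0.5
```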

Hugo: Yeah, but of course the joke is that if you have a binary classifier, and your accuracy is less than 50, then you just flip your predictions.

Sean: Multiply by negative one?

Hugo: Yeah, exactly. So, you also have a nice example involving the Amazon Echo, Alexa, that maybe you could tell us about.

Sean: On our team, a few years ago, even before things like chatbots became very popular here at TD Ameritrade, we started thinking about communication: how might a customer want to interact with us? That might be through, let’s say, an Amazon Echo. So one of our teammates, Brett Fattori, built out one of the earlier proofs of concept where not only can you interact with Alexa, the assistant on the Amazon Echo, but you can log into your account and query things like: What is my account balance? What is this particular stock trading at? You can even pull some news articles that might be relevant to you.

Sean: When we started doing that, we realized that maybe this is one form factor, but are there opportunities to think about a headless system that is device agnostic? Everybody carries around a personal mobile device, but not everybody owns an Amazon Echo, and some people own multiple different devices, such as Google’s Home Assistant. So one of our other team members started to think about how we might make something that is headless.

Sean: So that opens up a whole realm of research discussions, right, because if you think about it, the first thing you do if you’re going to roll your own, is you have to think about, "Well, how do I capture the audio coming from the customer?" Assuming that they want us to. And then from that audio, how to we transcribe it from audio into text, and then how do we start parsing out that audio and think about what is the actual intent of the customer? And that gets at the realm of natural language understanding. Once you understand what the customer wants, you have to then go and retrieve the relevant answers. And that answer can be pretty complicated. You touched on it earlier, about the context of the conversation, right? If somebody is already a customer and they’re already authenticated into our systems, then that experience is completely different from somebody who is unauthenticated.

Hugo: Does the minimum viable proof of concept need to demonstrate feasibility for each of these components of the pipeline?

Sean: You got it. That’s where it’s important to have good scope, and it’s actually okay to drop out at any of these stages if things don’t work, or if we find that something is extremely challenging. That doesn’t mean we completely abandon it: it might be open to further discussion, or it’s something we might consider rolling over to a more dedicated team. We’ve definitely had proofs of concept on our team where members followed the proof of concept into prototyping and then out into production, becoming the tech lead on those individual teams.

Hugo: What types of things can the proof of concept stage miss that are then caught in the prototype stage or even later? Top three things.

Sean: Yeah, I’d say the top three things would be … the first is scalability. Unless the proof of concept is looking at a technology that claims to provide scalability, we try not to be bogged down by it, because oftentimes that can squash projects from the get-go. The second thing, in the data science realm, is model accuracy. Out of the box, we’re not necessarily interested in getting the best model, but at least from a data standpoint, is there enough to provide a reasonable model? Or, from a methodology standpoint, if new methods come out, such as some new deep learning architecture, or maybe CatBoost or something like that, we’ll test them out, and really the goal there is to show that the method itself works beyond what is being published.

Sean: And the last thing that we try not to focus on too much initially is real-world application. Of course, it’s an ebb and flow: there are times when we definitely want to have a real use case in mind when we go out and build something. But at the same time, to the earlier point about the time series pattern matching methodology, we might not necessarily have a use case in mind at the get-go.

What is your favorite data science technique?

Hugo: Okay, great. So: scalability, model accuracy, and real-world application. Fantastic. We’re going to wrap up in a minute, but we’ve dived into a lot of different interesting questions that you think about, how you approach answers to them, process, and a variety of techniques you’re interested in. I’m just wondering what one of your favorite data science techniques or methodologies is.

Sean: Maybe this is the longer answer, but from the interpretability standpoint, I’ve always been a big fan of linear regression with an L1-norm regularization. I’ve studied that and used it on occasion, and I’ve found that it not only provides good interpretability, but the constrained coefficients that come out of it can be very powerful. I’ve seen some fantastic applications in the realm of predicting the potential energy of protein states, as well as extracting background noise from videos, all using this type of approach. But, as I mentioned earlier, some of the work that I’ve done on time series pattern matching has really opened my eyes a lot, because it’s non-traditional, and I recommend that people look out for that Git repo, or reach out to us if you’re interested, in the future.
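For the curious, L1-regularized linear regression is the lasso, and here is a minimal scikit-learn example of why it aids interpretability; the synthetic data is an assumption for illustration, constructed so that only a few features actually matter:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only 3 of 20 features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_coef = np.zeros(20)
true_coef[[2, 7, 15]] = [3.0, -2.0, 1.5]
y = X @ true_coef + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
# The L1 penalty drives irrelevant coefficients to exactly zero,
# which is what makes the fitted model easy to interpret.
print("Nonzero coefficients:", np.flatnonzero(lasso.coef_))  # -> [ 2  7 15]
```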

Call to Action

Hugo: Great, and we’ll link to it in the show notes as well. So do you have a final call to action for our listeners out there, Sean?

Sean: Yeah. Having worked in data science and science for a good number of years, I think people, especially in the marketing world, tend to over-glamorize AI and machine learning nowadays, and I think it’s actually very important for many of us practitioners to share our failures. Even at our PyData meetup, we always encourage people to point out not only where they succeeded, but also where things didn’t work, because that’s equally valuable. I just wish that, when I was in grad school, there had been a journal of things that didn’t work. That would’ve been helpful.

Hugo: I couldn’t agree more, and I’d love a journal of negative results … I think there actually is a journal of negative results. But we definitely need a lot more negative results published, particularly in biology, where I came from, and where there’s such a reproducibility crisis.

Sean: Absolutely.

Hugo: I love that. So remember, everyone: data science and AI are both hard, but you can learn the techniques by working step by step on problems that interest you. Sean, it’s been an absolute pleasure having you on the show.

Sean: Thank you so much for having me, Hugo.
