Data Science at AT&T Labs Research

Hugo Bowne-Anderson

3 years ago

[This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers].

Hugo Bowne-Anderson, the host of DataFramed, the DataCamp podcast, recently interviewed Noemi Derzsy, a Senior Inventive Scientist at AT&T Labs Research within the Data Science and AI Research organization.

Introducing Noemi Derzsy

Hugo: Hi there, Noemi, and welcome to DataFramed.

Noemi: Hi. Thank you for having me.

Hugo: It’s a real pleasure to have you on the show, and I’m really excited to be talking about your work at AT&T Labs in research at the moment, but before that, I’d like to find out a bit about you. So, on your website, and you sent me a bio, I’m gonna read it out because I love it. You’re a senior inventive scientist at AT&T Labs within the data science and AI research organization, and I love what you say next, that you’re doing lots of science with lots of data.

Noemi: Yes. Well, that is what I do.

What have you been involved in?

Hugo: Exactly. So, you’re working at AT&T now, but you actually have been involved in a lot of other initiatives in the data community. So, I thought maybe you could give me a bit of background, tell me about other things you’ve been involved in.

Noemi: Yeah, for sure. I spent a lot of time before becoming involved in the open source space in academia where I actually didn’t have the opportunity or the bandwidth necessarily to work on open source projects that were actually putting me out there in the open source community, but I started becoming more active in the open source space once I became a NASA Datanaut, and here I started working with NASA’s open source metadata, which is basically information about their over 30,000 datasets that they make publicly available, and that’s how I started getting involved in the data science community and in open source, and recently, I also became the co-organizer of Women In Machine Learning & Data Science meetup in New York, and here we organize meetup events focused on machine learning and data science topics, and our mission is to provide a supportive community that encourages and promotes women and non-binary people in tech.

Hugo: Fantastic, and actually, I’ve recently had Reshama Shaikh on the podcast to talk about a lot of the initiatives at WiMLDS in New York City.

Noemi: Right. She’s my colleague at Women in Machine Learning & Data Science.

Hugo: She’s fantastic.

Noemi: She is.

NASA Datanaut

Hugo: So, I’m also interested in this idea of being a NASA Datanaut. Can you just tell me a bit about this? It sounds really cool to start with. I want to be a Datanaut. So, could you just give me a bit of context around it, on what the program is?

Noemi: Yeah. I think every data scientist whose dream was to become an astronaut but didn’t make it now can become a Datanaut.

Hugo: That’s awesome.

Noemi: Yeah. When the government forced these government agencies to open source some of their datasets, then NASA open sourced over 30,000 datasets, and people don’t know about it, so one way they tried to promote this is with the NASA Datanauts program. So, this is an initiative in which they tried to create this collaborative group every year of individuals who are interested or excited about working with their open source datasets, and we have meetings regularly. We have webinars. People can present what they are working on. They can start collaborations based on NASA’s open source datasets to come up with ideas how they can use them, and then there are data scientists from NASA who are actually presenting what they are doing and telling us how we can get involved in certain projects that they would be interested in seeing results in, but they don’t have the time to work with.

Noemi: Also, the chief knowledge architect from NASA, David Meza, he’s also very involved, and he’s very supportive of this community, so you can always just reach out to him, and he’s going to be very supportive no matter what your question is or what projects you want to work on. So, it’s an amazing community to be a part of. It’s application-based, so every year they launch their application opportunity, and people can apply, and if they get selected, then they can just … Once they become a NASA Datanaut, they will be Datanauts forever.

Hugo: That’s really cool. If our listeners who I’m sure are really excited by the idea of being a NASA Datanaut, at least as much as I am, are interested, we’ll include a link to a few things in the show notes as well.

Noemi: Oh, yeah. Sure. Well, open.nasa.gov is the first place to go to, and then there you can find information.

Hugo: Perfect. So, the other thing that I know that you’re excited about is teaching and pedagogy and data science instruction. Right? And you’ve also run workshops in the wild at conferences and this type of stuff. I think, if I remember correctly, was your interest in network analysis and this type of stuff?

Noemi: Yeah. So, actually, my bachelor’s degree, master’s, and PhD, and then a five-year postdoc all involved network science, so many of my research projects were on understanding complex systems through their network structure. So, I was thinking that these workshops were a good opportunity to show how you can do network analysis, especially because there is this very good NetworkX Python package out there that enables data scientists to analyze data from a network point of view very easily.

How did you get into data science?

Hugo: I love NetworkX so much, and we’ve actually got two courses introducing NetworkX on DataCamp taught by Eric Ma, who’s an old friend and collaborator. He’s now a research data scientist at Novartis, but NetworkX is a really great package, and the API is really nice as well, I’ve found. So, we’ve got a few ideas about your background, but I’m just wondering how you got into data science and analytics originally.

Noemi: Yeah. Well, I got into data science long before it started to be called data science. I think this was back in 2006. So, to give a bit of context, I did a bachelor’s degree in physics and computer science, so I had to write a bachelor’s thesis on something novel and related to the field, and I didn’t know what that would be. I could either decide to do some physics project or some computer science, but I really wanted to find a way to combine the two.

Noemi: So, I actually was very lucky because I had this great quantum physics professor who is the leader in the research space, and he is always interested in physics applications outside the traditional boundaries, and at that point he was working on projects that were focused on understanding complex systems, and even more complex systems with underlying network structure. So, what he was always working on within his projects at the time was to analyze and model these complex systems through some data that he obtained from different sources. So, this was using a lot of computational physics, which I really liked, and also leveraged data analysis. So, I found this topic very exciting.

Hugo: That really explains your interest in networks to this day, why you educate around them, your love of NetworkX, and I suppose also, as we’ll get to, some of your work at AT&T, thinking about networks of individuals in a society and communication between them and that type of stuff.

Noemi: Exactly, and that’s how I actually got to do my first project using social type of data, which was from this Erasmus European scholarship framework, and here I actually built my first network, which I found really exciting. I built a network of European universities where the connections were built by the students who went from one university to the other.

Hugo: Interesting. So, is that a directed graph?

Noemi: Yeah. You can look at it either as a directed network or as an undirected one. We actually built both, because if you take into account which university the student comes from and which university they go to visit, then it would be a directed one, but if you just want to look at professional connections, for example, if I just want to see how universities are interconnected among each other, I don’t really care about the direction. So, then you just look at the undirected version of the network.

Hugo: That’s really cool. I’m wondering about the data collection and data generating process in this case. Did you hand write all these universities down and then figure out using another data set, or did you figure out a way to automate it, or how did that work?

Noemi: So, actually, this was a fairly small dataset. It was only a snapshot from 2003, so they gave us just a small dataset that we could play with. It was basically a matrix that I received, so each row and column was a university, and then the value was the number of students that went from one to the other. So, that was the data that I got back then.
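To make the matrix description concrete, here is a minimal sketch of how such a table could be turned into a directed network and then collapsed into an undirected one. The university names and student counts are made up for illustration, and summing the two directions when dropping direction is an assumption, not something stated in the interview. It uses only the Python standard library, although a package like NetworkX offers the same ideas through its graph classes.

```python
# Hypothetical 3x3 exchange matrix: flows[i][j] = students going from i to j.
universities = ["Univ A", "Univ B", "Univ C"]  # placeholder names
flows = [
    [0, 5, 0],
    [2, 0, 1],
    [0, 4, 0],
]

def directed_edges(names, matrix):
    """Return {(source, target): students} for every nonzero off-diagonal entry."""
    edges = {}
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j and count > 0:
                edges[(names[i], names[j])] = count
    return edges

def undirected_edges(directed):
    """Drop direction: sum the flows in both directions for each pair (an assumption)."""
    edges = {}
    for (src, dst), count in directed.items():
        pair = tuple(sorted((src, dst)))
        edges[pair] = edges.get(pair, 0) + count
    return edges

directed = directed_edges(universities, flows)
undirected = undirected_edges(directed)
```

The directed version keeps "A sends students to B" separate from "B sends students to A", while the undirected version only records that the two universities are connected.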

Hugo: Fantastic. What did you do with it then? What were the takeaways?

Noemi: So, we actually revealed the most interconnected cluster, the subgroup, and this was very interesting because it wasn’t the top universities. Actually, someone at the conference mentioned when I presented this that it looks like the universities belonging to the cities where the students can have the best parties. Yeah. What we basically found was that the connections are very much influenced by the professors’ connectivities. So, the professional network of the professors is the one that basically drives these connectivities among the students, despite the fact that they have the opportunity to choose for themselves where they go.

Hugo: Right, and actually, that’s … So, I actually did a postdoc, or the first half of my postdoc, in Germany in cell biology, and I do remember a lot of people came through the lab and the institute as a function of professors and researchers and social connections between professors where I worked and professors at other institutes and campuses.

Noemi: Right.

Hugo: So then, if I recall correctly, you worked on something else for your master’s thesis, right?

Noemi: I got so fascinated by this topic that I wanted to continue to do the same thing throughout my master’s degree. So, for that thesis, my advisor, the same advisor who I eventually got the PhD with as well, because I was such a big fan of his type of research and work, so he obtained this Enron open source email communication data. I don’t know if you know about it.

Hugo: I know it well. Yeah.

Noemi: Yeah. Okay. So, I used that email communication data to start analyzing how people communicate with each other, and whether we can detect some pattern and build a model, and yeah, the most interesting part, from a physicist’s point of view, was that the communication basically behaved like an exponential decay, like a particle decay.

Hugo: Oh, interesting.

Noemi: So, yeah. We showed that the longer you wait to reply to an email, the less likely it is that you’re actually going to reply to it, and the probability drops exponentially in time.
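As a rough illustration of what "drops exponentially in time" means, the sketch below models the likelihood of a reply arriving after a delay t as rate × exp(−rate × t) and estimates the rate from a handful of delays. The delay values are invented, and the maximum-likelihood fit (rate = 1 / mean delay) is a textbook choice assumed here, not the method used in the thesis.

```python
import math

def reply_density(t_hours, rate_per_hour):
    """Exponential density: likelihood of a reply arriving at delay t."""
    return rate_per_hour * math.exp(-rate_per_hour * t_hours)

def fit_rate(delays_hours):
    """Maximum-likelihood rate for an exponential model: 1 / mean delay."""
    return len(delays_hours) / sum(delays_hours)

delays = [0.5, 1.0, 2.0, 4.0, 12.5]  # made-up reply delays in hours
rate = fit_rate(delays)

# Time after which the reply likelihood has halved, by analogy with particle decay.
half_life = math.log(2) / rate
```

Under this model the chance of a reply halves every `half_life` hours, which is the same shape as radioactive decay.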

Hugo: Right. Well, anecdotally, that seems right for me anyway, because there is this … I don’t want to go into this too much, but there is this barrier, right? If I’ve left an email for a week without replying, then I’m like, “Oh, no. I’ve gotta actually give a proper reply now,” as opposed to just a few words like, “Hey, received whatever.” There is a barrier there. The other thing that I just want to say about the … I’m actually very familiar with the Enron dataset in a slightly strange way. A good friend of mine, and I’ve mentioned this on the podcast once before, a good friend of mine who’s a digital artist has done a project where you can register to receive the Enron emails daily-

Noemi: Oh, cool.

Hugo: Yeah. I can put a link in the show notes as well. So, I actually, in my inbox every day, I receive one of the Enron mails. I think today’s was Mark from legal thanking someone else for dinner last week or something like that, but it’s actually really odd, and a very intimate dataset as well.

Noemi: Yeah. I actually haven’t followed that to see how the data evolves compared to the fraction of the data that I had back in the day, but actually, that would be a cool thing to follow up on and see.

Hugo: Absolutely. Let me ask, is the work that you did for your master’s on the Enron dataset, is that on GitHub or out in the public domain at all?

Noemi: No. I didn’t post it back then. Back then, I was doing it in C++, and it was a very long time ago when … In academia, it’s also not very popular-

Hugo: Yeah, okay. No, that makes sense.

Noemi: … to open source things. That’s something that I think academia can work on.

Hugo: It can, and it wasn’t necessarily incentivized back then, and it’s generational in a lot of ways. We’ve discussed that on the show before, but I do think it’s becoming more and more commonplace, particularly, more and more people are learning R and Python as opposed to MATLAB and whatever else there is. Not that MATLAB isn’t great for certain things. I don’t want to say that.

Noemi: Yeah. Well, I always used C++, so from there, for me a natural transition was Python, and now since most of my colleagues here use R, it’s something that I’m dipping my toes into.

Hugo: For sure. When people ask me why I write Python, my first response always is because I love writing Python code. It’s so much fun to write.

Noemi: Yeah.

AT&T Labs Research

Hugo: But today, we’re here to talk about your work at AT&T Labs Research, so I thought maybe you could just break down for me what the mission and history of AT&T Labs in general is.

Noemi: I actually represent AT&T Labs Research, so I can talk about AT&T Labs Research mission, because AT&T Labs is a very broad research and development division of AT&T. So, the mission of AT&T Labs Research is to look beyond today’s technology solutions to invent disruptive technologies that meet future needs. This comprises very diverse and fascinating research areas that range from AI, 5G technology to video and media analytics.

Hugo: Great. So, this is stuff that maybe won’t be implemented right now, but it’s thinking very much about what my Belgian colleagues would call future music.

Noemi: Right. Yeah. So, this is the big view and the future goal that we’re working toward.

Hugo: Yep. Okay, great. So, maybe you can set the scene historically as well for us, briefly.

Noemi: Oh, right. Yeah. So, AT&T Labs traces its history from AT&T Bell Labs, which is famous for its very rich history in innovation, and as a physicist, I feel particularly honored to be part of this research lab, where several physicists and scientists from other fields have been awarded a total of nine Nobel Prizes for work done at Bell Labs, which I’m still very amazed by. Just to name a few, Bell Labs hosted extraordinary scientists like Walter Shewhart and John Tukey, who contributed to the fundamentals of statistics, and Claude Shannon, who is the father of information theory, and many, many others. So, for me, it’s an amazing opportunity to be here, and I’m proud of it every day.

Hugo: Oh, that’s really exciting, and I’m a huge fan of Claude Shannon’s work, of course, and Tukey’s really interesting, particularly in … Something we are only getting back to now in the cultural discourse is really thinking about the importance of exploratory data analysis, while the focus in academia and industry for a long time has been on positive results. His focus on actually getting to know your data and all the techniques he developed to do that are incredibly beautiful.

Noemi: Right. One of the updates that AT&T Labs has is that we recently opened an AT&T Science & Technology Innovation Center in Middletown, New Jersey, which is a museum that covers the 142 years of inventions that AT&T pioneered. So, you can actually go and check it out.

Hugo: Oh, great. That’s open now, or about to open?

Noemi: So, it opened at the end of last year, so it should be open. Yes.

Hugo: Okay, perfect. So, we’ve discussed briefly how the work at AT&T Labs Research looks toward the future of what can happen. I’m just wondering how it relates to the business side of AT&T currently.

Noemi: So, AT&T Labs was founded so it can focus on solving the hardest tech problems that AT&T’s dealing with, and the solutions of these problems translate for AT&T to improvements in customer service or customer care, and many of the projects also result in cost reductions for the company, like network optimizations and improving advertising and so on.

Hugo: Before we get into the work you do at AT&T Labs Research, I’d just like a general high-level overview … Maybe you can tell me a bit about some of the current projects at AT&T Labs in general that you find most interesting.

Noemi: Yeah. Actually, there are so many, and as a new employee, I’m still absorbing all the information and all the new projects that I get to learn about from my colleagues-

Hugo: I’m sure.

Noemi: … but to mention a few that I’m actually not involved in, but I find very exciting or important, are … So, one of them is creating new products to make a difference in the media and entertainment space, which also helps us build this partnership with Turner that has recently become a division of AT&T’s WarnerMedia. So, AT&T has a lot of TV data, and most of the time AT&T’s not associated with TV data, but since AT&T owns DirecTV, and now Turner, it’s a lot of TV data that can be used to do critical and fundamental research in the media and entertainment space.

Hugo: Great. That sounds very interesting. Do you know much about what type of tools and techniques are used in doing that, or what type of outcomes they’re trying to achieve?

Noemi: Yeah. So, I think what I can say is that basically, that’s why AT&T launched Xandr, which is their new company focused on advertising, after the acquisition of AppNexus. I don’t know if you know about that.

Hugo: Tell me a bit about it.

Noemi: Last year, AT&T acquired AppNexus, and it has now become part of Xandr, which is focused on improving advertising in the media and entertainment space.

Hugo: Interesting. I suppose a big part of that now is thinking about targeted advertising, and I think that’s something your colleagues are thinking about as well.

Noemi: Right. So, the goal is … We have a lot of advertising that is just distracting, and the goal is to provide less advertising, but more relevant advertising. So, there is a lot of research that is going on around this problem.

Hugo: Great. So, are there any other current projects that you find very exciting?

Noemi: There are projects coming out of this advertising work. There is also a very important project that many of my colleagues are currently working on, which is focused on how to combat bias and fairness issues in this targeted advertising. So, this is somehow related to the first one, but it’s also very important. Then I’m going to mention one that has really nothing to do with this. It’s completely different, but when I first got to AT&T and I found out about it, I was like, “Wow. I didn’t even think that that would be a thing in AT&T.”

Noemi: So, one of the projects is working on creating drones for cell tower inspection. So, this research is basically leveraging AI, machine learning, and video analytics, and their goal is to create a deep learning-based algorithm. Once you send out the drones to capture video footage, the algorithm is going to analyze this footage to detect tower defects or anomalies, and this will enable automating the tower inspections, and it will make the work faster and more efficient. So, this is one of the things where I thought, oh, I didn’t even think about that, and I found it really cool. I just wanted to mention it.

Hugo: That’s amazing. The use of drones and the idea of using essentially deep learning and AI technologies and video analytics, as you say, in drones has so many applications. One I’ve been reading quite a bit about recently is in ag-tech, or agricultural technology, with drones analyzing yields of field crops and that type of stuff, and one of the really cool things is … I like this example that if you’ve got a camera on a drone, and you’re trying to build algorithms or get them trained and tested in realtime, it isn’t as though you can throw the deepest neural net at it if you’re trying to run it onboard the drone, for example. So, you’re pushing up against a lot of technological constraints there as well.

Noemi: Yeah, and this is so fascinating because for me, drones was never … I would have never connected it to such an important work at AT&T, which is you have to make sure that those towers are working properly, and if there’s any malfunctioning, you need to detect it in time, and this would be really helping that cause.

What are you working on?

Hugo: Exactly. So, this is a nice cross-section of different research projects at AT&T Labs that interest you. So, I’d love to jump in and find out just about some of the projects you’re currently working on.

Noemi: So, my projects are also just as diverse as the ones I mentioned above, which I find really exciting because I have the opportunity to study and to work on very different types of data, to try to answer very different questions, and to tackle very interesting research projects. So, to mention the first one, which is my favorite, it’s human mobility characterization from mobile network data. Human mobility patterns revealed from cellular telephone networks can offer a large-scale glimpse of how humans move in space and time, and how they interact, and I find this project very exciting because I can study human behavior, which is a phenomenon that as a physicist I’ve been very interested in since undergrad. This project also offers the opportunity to study human behavior through large-scale anonymized customer data and leverage the discoveries to also improve our services.

Hugo: That’s really cool, and as we said before, these types of projects really speak to a lot of your interests that you’ve developed over the past couple of decades as well, and what you’ve worked on.

Noemi: Last year, I was interviewing to find a new job, and I was getting after a point very frustrated that there are so many data science positions out there that I would have to literally throw out all the things I’ve learned in the past and all the research that I’ve done because I wouldn’t be able to leverage it, and I am so happy that I found this job because I can just use everything I studied so far, and it doesn’t go to waste.

Hugo: That’s really cool, and thinking about … I mean, the great thing about a project like this from my outside and naïve perspective is you can view it on so many scales. So, as you said, you can view it as a network of individuals, but you can also view it as including a lot of geospatial data, which is incredibly interesting, or you can view it on the individual level as well. So, there is kind of a separation of scales there where you can answer different questions at different points.

Noemi: Yes. I also find it exciting because with this, I’m also learning new tools, and for example, I got to learn about Nanocubes, which is this open source visualization tool for large spatiotemporal datasets, and this was actually created at AT&T Labs, and it’s open source, and it’s amazing because you can use billions of data points and visualize them realtime, and you can also query it, slice and dice your data as you please, and then visualize subsets of it. So, it’s a lot of fun.

Hugo: Yeah. I’ve actually seen several demos of Nanocubes. I’ve never used it myself, but I think it was maybe Simon Urbanek and Chris Volinsky who showed it to me originally, but I can’t be sure. And actually, as you know, I had Chris, who for our listeners is … I think he’s now assistant VP of data science and AI research where you are.

Noemi: Yes.

Hugo: So, I had him on the podcast last year, and we didn’t discuss this in detail, but the first time I ever encountered Chris must have been, I don’t know, five or six years ago at a conference, and he actually spoke about characterizing human mobility, which of course, is the project you just spoke to, and he gave this great talk, which involved seeing when text messages stopped in a downtown neighborhood in … I can’t remember which city it was, but text messages stopped at a certain point, and phone calls started being made, and he realized, or his team realized, that at this point this was when all the nightclubs and bars shut, and people were calling for taxis, this was before Uber and that type of stuff, calling for taxis. So, from the data, you can actually see the emergent behavior of populations. Right?

Noemi: Right, and actually, they even published a paper. This is why I love my work so much, because we also have the opportunity to publish, and he published these findings. Yeah. It’s called Human Mobility Characterization from Cellular Network Data, and it’s a publication, I think, from 2013.

Hugo: Great. That was around the time I saw this talk, actually, so that would make perfect sense. So, we’ll definitely link to that in the show notes as well. But it’s really cool to be able to publish this stuff, to make discoveries about human mobility and characterizing that, as you say, but also, as you said, to leverage these discoveries to improve the services that AT&T provides people.

Noemi: Right, and that links to another project of mine, which I also find exciting because I can use my network science background. That project is about characterizing our mobile network and analyzing how its topology compares to other real social networks reported out there. This initially just sounds like fun, but it’s also very crucial for us to know what our network topology looks like because it helps us understand how certain dynamical processes progress throughout the network, and this implicitly also helps us improve our services.

Hugo: Great. So, when we’re thinking of this mobile network, generally, for our listeners who don’t necessarily know a lot about network theory, a network or a graph, you’ll have nodes and edges that connect them. So, you can imagine on Twitter that all the nodes are people with Twitter handles, and connections are formed when people follow each other. Those are the edges. Now, I’m wondering what this mobile network looks like. Is it people with cellphones are the nodes? And there’s an edge between them if they call each other or message each other?

Noemi: Right. Yes. This is all anonymized, so the only thing that we’re doing is working at the aggregate level to look at connectivity, like the number of connections and so on, and this helps us understand the topological features. Yeah. So, basically, a network has elements, and between elements that are connected by a certain relationship you build an edge, and this is how you construct your network. The reason why I find this so fascinating is that networks are everywhere around us, and network scientists come from many different fields because it’s a very interdisciplinary topic. You have protein interaction networks. You have neural networks. You have social networks. You have street networks, so transportation networks, power grids. So, we are living in a very interconnected world, and everything is networks, so for me, that’s why it’s so fascinating to work in network science.
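The nodes-and-edges construction described here can be sketched in a few lines. The records below are invented placeholder IDs, and treating any call in either direction as a single undirected edge is an assumption made for illustration; it is not a description of how AT&T builds or stores its network data.

```python
from collections import defaultdict

# Hypothetical anonymized records: each pair is (caller_id, callee_id).
records = [("u1", "u2"), ("u2", "u1"), ("u1", "u3"), ("u4", "u1"), ("u3", "u4")]

def build_network(pairs):
    """Undirected adjacency sets: an edge exists if two IDs ever communicated."""
    neighbors = defaultdict(set)
    for a, b in pairs:
        if a != b:  # ignore self-loops
            neighbors[a].add(b)
            neighbors[b].add(a)
    return dict(neighbors)

network = build_network(records)

# Aggregate features only: how many connections each node has, and the edge count.
degrees = {node: len(nbrs) for node, nbrs in network.items()}
num_edges = sum(degrees.values()) // 2
```

Repeated calls between the same two IDs collapse into one edge here; keeping a count per pair would give a weighted network instead.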

Hugo: Very much so. So, you said two things I just want to kind of tease apart briefly. You talked about how the topology of a network can be crucial to understanding how certain dynamical processes progress throughout the network. I’m just wondering if you could give us insight into what topology in a network actually means, and what type of dynamical processes, for example, you think about.

Noemi: So, topology-wise, you want to see some basic features of the network. You want to see what is the degree distribution of a network. So, that means that I’m trying to see how many nodes I have with this number of connections, how many nodes with that number, and then I’m just building up the distribution. Recent studies have shown that-

Hugo: So, thinking about how connected it is. Yeah.

Noemi: Right, and studies in network science have shown that real networks mostly follow this scale-free pattern when it comes to their degree distribution, which means that most nodes have very few connections, whereas you have this small number of hubs, which have an extremely large number of connections, and this is something you can see in Twitter, too. Right? So, you have these very popular people who have hundreds of thousands of followers, but most people will have a very low number of connections. Then the other thing that you want to look at when you’re looking at topology is how clustered the nodes are within the network. So, is the network more homogeneous, or do you see some more densely connected subgroups, like for example, in social networks you will see many densely connected subgroups.
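The two topological features mentioned here, the degree distribution and how clustered the nodes are, can be computed directly from an adjacency structure. The toy graph below is hypothetical and far too small to show a scale-free pattern; it only illustrates the definitions, with one "hub" and a few low-degree nodes. The clustering measure is the standard local clustering coefficient, chosen here as one reasonable way to quantify "densely connected subgroups".

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy network as undirected adjacency sets.
graph = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub", "b"},
    "b": {"hub", "a"},
    "c": {"hub"},
    "d": {"hub"},
}

def degree_distribution(adj):
    """Map each degree k to the number of nodes that have exactly k neighbors."""
    return Counter(len(nbrs) for nbrs in adj.values())

def local_clustering(adj, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    nbrs = adj[node]
    if len(nbrs) < 2:
        return 0.0  # fewer than two neighbors: no pairs to check
    linked = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    possible = len(nbrs) * (len(nbrs) - 1) / 2
    return linked / possible

distribution = degree_distribution(graph)
clustering = {node: local_clustering(graph, node) for node in graph}
```

In a scale-free network, plotting `distribution` on log-log axes would show many nodes at small k and a long tail of rare, very high-degree hubs.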

Hugo: I was just gonna say, that’s really important because this can give rise to the emergence of filter bubbles and echo chambers. You can imagine politically distinct groups that really communicate within themselves and read particular types of media, but not from the other side, for example.

Noemi: Right, and that’s why it’s very important to understand the structure of your network, because before you start looking at how you can influence people, for example in politics, you first have to see what type of network it is. Is the degree distribution scale-free, so you have these hubs, or does everyone have approximately the same number of connections? These dynamical processes are going to evolve in a completely different manner depending on the structure of the network. To give an example of these dynamical processes, something that I’ve worked on for a very long time is cascading failures, which you can see in power grids. In any type of network where you have information flow, and by information flow you can think of anything, like me trying to convince my friend to buy a product, or a power grid transmitting current from one generator to another, you want to see, in case one node in the system fails, how its failure is going to get transmitted further throughout the network.

Noemi: So, one thing that we need to take into account when looking at these networks that are transferring information is that nodes have an assigned capacity, meaning how much information they can handle. If one of the nodes fails, it’s going to reallocate its load, the information that it was handling, to its neighboring nodes, and if those now have a higher load than their capacity, they’re also going to fail. This is what they call a cascading failure, or an avalanche of failures, and this is a big problem because in power grids you have only milliseconds to actually try to mitigate that failure. So, what you’re trying to do is to build a system that is more robust against that, but these failures are also very dependent on the structure of the network. So, at the basis of every network analysis is the question of what the structure of that network is.
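The load-and-capacity mechanism described above can be sketched as a small simulation. Everything here is a simplifying assumption for illustration: the four-node chain, the load and capacity numbers, and the rule that a failed node splits its load equally among its surviving neighbors. Real power-grid models redistribute load according to physical flow equations rather than equal shares.

```python
def cascade(adj, load, capacity, first_failure):
    """Fail one node, split its load equally among surviving neighbors, repeat."""
    load = dict(load)  # work on a copy so the caller's data is untouched
    failed = set()
    queue = [first_failure]
    while queue:
        node = queue.pop(0)
        if node in failed:
            continue
        failed.add(node)
        alive = [n for n in adj[node] if n not in failed]
        if not alive:
            continue  # nobody left to take over this node's load
        share = load[node] / len(alive)
        load[node] = 0.0
        for n in alive:
            load[n] += share
            # A neighbor pushed past its capacity is scheduled to fail next.
            if load[n] > capacity[n] and n not in queue:
                queue.append(n)
    return failed

# Hypothetical chain a - b - c - d with equal initial loads.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
load = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
capacity = {"a": 2.0, "b": 1.5, "c": 1.8, "d": 5.0}

failed_nodes = cascade(adj, load, capacity, "a")
```

In this toy chain the failure of "a" overloads "b" and then "c", and the avalanche only stops at "d" because it has enough spare capacity, which is the kind of structure-dependent robustness the passage refers to.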

Hugo: As you were talking about dynamic processes propagating through networks, it just sprang to mind, I know that Twitter is used by data scientists so much for thinking about tools and techniques and problem solving and debugging, and I was just wondering if we could see how data science tools actually propagate through social networks on the internet, which could be a cool project for a listener to do at some point.

Noemi: There was a research project focused on how tools spread out and how popular they become on GitHub.

Hugo: Yeah, that’s very interesting.

Noemi: Yeah. So, that’s related.

Hugo: You’ve told us about two of the projects that you love that you’re working on, and you told us at the start that you’ve got three main projects. So, what’s the third one?

Noemi: Oh, yeah. The third one I love working on, too. That’s also something that is a brand new and very fascinating topic for me. So, many people don’t know that AT&T owns DirecTV, and now also, with the acquisition of Turner, we have even more TV data that creates tremendous research opportunities for us in the TV advertising space that I’m still learning a lot about, and I find this very exciting because when I joined AT&T, it didn’t even occur to me during the interview process that I might end up working with TV data. So, it’s really awesome.

Hugo: That’s really cool. This is a relatively new position for you, as you said. So, I’m wondering … We’ve got a lot of listeners out there who are aspiring data scientists and working data scientists, and I’m just wondering what advice you’d give someone who’d be interested in this type of job. What types of skills and tools would they need to have?

Noemi: Sure. We are constantly looking to hire, so I’m very happy to share that information for people who are interested. So, since we are a research lab, we are looking for people with PhDs, because we seek candidates who possess domain expertise and have research experience, and of course, we love seeing people with genuine enthusiasm who are excited about new data and know how to get the best out of it, and of course, this requires great technical skills, high integrity in using the data, and of course, being innovative, as implied by our inventive scientist job titles, and last but not least, our research lab is a very collaborative environment.

Noemi: So, you can come up with project ideas or get involved in projects with other team members. So, a critical soft skill that we are looking for is the ability to successfully collaborate with others. AT&T Labs research also promotes academic collaborations. We can publish, as I mentioned, and also, many of my colleagues have ongoing academic collaborations, so being collaborative in our field is a critical skill that we are looking for.

What does the future of data science look like to you?

Hugo: For sure. So, we’ve bounced back and forth between current data science work that you’re involved in, its impact on the future of AT&T. I’m wondering, generally, what the future of data science looks like to you?

Noemi: It’s funny that you ask me because I have a … So, yesterday I read a Forbes article saying that data science won’t be around in 2029-

Hugo: Great.

Noemi: … and it’s very funny because I have the opposite opinion. So, in my opinion, data science has been around for a while, and since even before being called data science, and I think it will be around for even longer.

Hugo: Of course, though, in 2029, it will be around, but it may not be called data science as well. Right?

Noemi: Right. So, even before, it was data analyst, or it had different-

Hugo: Data mining, a lot of-

Noemi: Right, and then there are other people who are working with data that … For example, when I was doing my research as a PhD, what was I doing? I didn’t even know what term to give to it. I was just hearing from traditional physics professors that this is not physics, and I never knew what to call it. So, now we have a term for it, and maybe in the future it’s going to change, but the job itself, the role, I think it’s going to be around for a long time because this field requires a sciencey, innovative mindset, and I think there will be plenty of opportunities in this field in the future. I think the part of data science that changes rapidly is the tools that we make use of, from how to ingest large-scale data to how to evaluate things, interpret predictive models, so this is changing very rapidly, and that’s why data scientists have to constantly keep learning to be able to keep up with the rapid technological advances, but the data scientist role in itself is gonna be around.

Hugo: I think so, and I think we’re also gonna see data skills, data literacy, and data fluency spread across organizations in really interesting ways as well. I mean, something we’re thinking about a lot is what do product managers need to know about data science and statistics? What do VPs of Marketing know? What does C-suite need to know? Do they need to know the basic definitions of metrics for machine learning models, and maybe a bit about class imbalances, for example, right?

Noemi: It’s very cool because the data science fellowship that I participated in last year, Insight Data Science fellowship, they also launched a product management fellowship, which is really awesome.

Hugo: That’s really cool. I’d actually love to know more of the … I’m gonna look into that, because I do actually think that the relationship between product management and data science is not so ill-defined; the way we talk about it is, but it’s becoming more and more important. But that’s for another conversation. I’d like to wrap up with a couple of questions. I’d just love to know what one of your favorite data sciencey things to do is, as a technique or methodology, or anything.

Noemi: For me, the data science process as a whole is my favorite because I liked these crime novels or movies-

Hugo: Awesome.

Noemi: … and I always feel like that’s what I’m doing when I’m getting the data. It’s like throughout the EDA, the exploratory data analysis, I feel like a detective finding these puzzle pieces in the data, and then in the modeling part, I’m just putting the pieces together to reveal the story. So, for me, it’s like, oh, I feel like a detective in a safe space. I don’t deal with criminals. But to answer your question, one methodology that’s among my favorites is probably text data vectorization, because it’s so simple, yet I find it so fascinating how you can so easily extract features from unstructured text data with this very simple technique, and you can use this for feature extraction, natural language processing, and to build models from it. So, I find it really cool.
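
[Editor’s sketch] As a rough illustration of the text vectorization Noemi mentions, here is a minimal bag-of-words example in plain Python. It is only one possible reading of "text data vectorization" (word counts per document); the helper names `tokenize`, `fit_vocabulary`, and `vectorize` and the two sample sentences are hypothetical, and in practice a library class such as scikit-learn’s `CountVectorizer` would typically be used instead.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_vocabulary(docs):
    """Map every distinct token in the corpus to a column index (sorted)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(doc, vocab):
    """Turn one document into a vector of token counts over `vocab`."""
    vec = [0] * len(vocab)
    for tok, count in Counter(tokenize(doc)).items():
        if tok in vocab:  # tokens unseen at fit time are ignored
            vec[vocab[tok]] = count
    return vec


# Made-up sample documents for illustration only.
docs = ["NASA open data", "Open source data science"]
vocab = fit_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]

print(vocab)
print(vectors)
```

Each document becomes a fixed-length row of counts, one column per vocabulary word, which is the kind of numeric feature matrix a model can be trained on.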

Hugo: That’s awesome, and although I agree that when performing data analysis and doing data science you’re not uncovering criminals, a lot of code people write and a lot of process is almost criminal as well. I mean, we’re still establishing best practices also, and also, I find it really interesting that unstructured data and text data natural language processing is part of your answer to this because a lot of the techniques and work and research you’ve done we’ve discussed today isn’t necessarily involved with text data. So, that’s kind of cool, to know that this is another interest of yours.

Noemi: Actually, throughout my open NASA dataset collection analysis, I did natural language processing, and I’m also developing a course for Pearson, which is called Natural Language Processing for Hackers, which is gonna be out, hopefully, soon.

Call to Action

Hugo: Okay, great. So, my final question is, do you have a call to action for our listeners out there?

Noemi: Yeah. Check out our website at AT&T Labs. It’s about.att.com/sites/labs, and there’s cool research to learn more about what we do and how you can get involved, because we’re always looking for young talent to join our growing team, and now I actually shared with you what type of skills we’re looking for, so yeah, we’re very interested in new talent.

Hugo: Fantastic. We’ll put that link in the show notes as well, and for all of you who do reach out, mention that you heard our conversation on DataFramed as well. But Noemi, I’d just like to thank you so much for coming on the show. It’s been so much fun.

Noemi: Thank you so much for inviting me. It was great being here.
