How to rapidly master data science

Sharp Sight

4 years ago

[This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers].

For some people, learning the basics of data science takes years.

Now, I’ll admit that if you eventually want to be one of the ‘best of the best,’ you will have to put in years of effort.

But to master the basics? Years? Just to get the foundations?

It should not take that long.

Any smart, dedicated person should be able to master the basics of data science within only a few months.

… And I mean master. I firmly think that it’s possible for a motivated data science student to learn the basics so well that they can write the basic syntax ‘in their sleep.’ A good student should be ‘fluent’ in the basics within months, not years.

There’s a good analogy with spoken languages. A smart, dedicated student with a good learning plan should be conversational in a language like Spanish within 8 to 12 weeks.

Similarly, a good data science student studying R should be ‘conversational’ (i.e., fluent in the basics) within 8 to 12 weeks.

Yet, as I’ve already said, it takes most people years to learn the basics.

What’s going on here?

To put it bluntly, most people are terrible at learning. They don’t know how to learn new skills quickly, efficiently, and effectively, so they take much longer than they need to.

On the other hand, if you really know “how to learn,” it’s possible to learn the foundations of data science very, very quickly.

Here’s how.

To rapidly master data science, you need to …

To rapidly master data science, you need to do several things:

Break it down
Figure out what to do, and what not to do
Design a plan
Learn
Practice

Let’s dive into each of these.

Break it down

To learn anything very quickly, you need to break it down into small components.

For example, to rapidly learn a spoken language (like Spanish), you’ll want to break it down into very small units first: words. To be clear, which words you learn is important (we’ll talk about that in a moment), but this is still a critical step. You might even go a step further and break words down into sounds (i.e., phonemes) so that you first learn the unique sounds of the particular language.

A very similar process applies to data science. To rapidly master data science, you need to break data science down into smaller sub-disciplines. Furthermore, those sub-disciplines can be broken down into a set of skills and techniques. Going a step further, all of those techniques can be broken down into small learnable units that you can practice (we’ll talk about practice later).

At a very high level, you can break down basic data science into the following sub-disciplines:

Data visualization
Data manipulation
Data analysis

I’ll add that there are many ‘special topics’ in data science that aren’t exactly covered by these categories. However, speaking at a very high level, these categories account for ‘the basics’ that you need to know.

Find out what to do, and what not to do

After you break things down, you need to figure out what to do, but also what not to do. To rapidly learn data science, it’s critical to select the right material: you need to distinguish between what’s really important and what is unimportant.

Figuring out what not to do is perhaps the more important of the two. When most people begin learning, they try to learn way too much. More often than not, this leaves people feeling overwhelmed, and it often causes them to spend time on topics that are not necessary.

Let me give you an example. As of 2017, there are over 10,000 R packages. You read that right. 10,000. Realistically, you will never learn all 10,000 packages, let alone master all 10,000 packages.

I should also point out that there is a lot of redundancy among these packages. In R, there’s often more than one way to do things. For example, to perform data visualization you can use the plot() function from base R, but there are also several other packages and tools for visualizing and plotting data. Do you know which tools are the ‘best’? Do you know which packages you need to learn, and which ones you should skip?
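To make that redundancy concrete, here is a minimal sketch that draws the same scatter plot twice: once with plot() from base R and once with ggplot(). The choice of the built-in mtcars data set and its wt and mpg columns is my own, purely for illustration, and the sketch assumes the ggplot2 package is installed.

```r
library(ggplot2)  # assumption: ggplot2 is installed

# Base R: plot() draws straight to the graphics device.
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")

# ggplot2: build a plot object from data, aesthetics, and a geom, then print it.
p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()
print(p)
```

Both calls produce a scatter plot of the same two variables. The point is only that two different toolkits cover the same task, so you have to pick one to learn first.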

To master R quickly and efficiently, you need to be able to select a very small number of packages among these 10,000. You need to choose what to learn and what to ignore.

Additionally, once you select the best packages to learn, you need to choose what to learn within those packages. Even if you select the best R packages, within those packages there are tools that you absolutely need to know, but other things that you probably don’t need to learn right now. Some tools and techniques are things that you don’t really need, and it would be better to wait a few months or years to learn them. Again, you need to know what to learn and what not to learn.

Focus on foundations

In the context of ‘selecting’ the right topics to learn, I should mention that it is very useful to focus on foundational skills.

If you study top performers of all kinds, you’ll find that essentially all of them place a strong emphasis on ‘the basics.’ Top performers focus on foundations.

To master data science, you need to do this. You need to master foundational techniques before you move on to advanced topics.

So as you’re learning data science, this means that you need to master 3 critical areas:

Data visualization
Data manipulation
Data analysis (AKA, exploratory data analysis)

A big mistake that beginners make is that they jump into advanced topics too soon, before they’ve mastered these foundations. For example, new data science students get excited about machine learning and want to start there. These students would be much better served by first mastering the critical, foundational tools, like data visualization and data manipulation.

To put this another way, by focusing on the foundations, you put yourself in a position to rapidly master more advanced topics later. If you master the critical foundations first, you will be better equipped to learn more advanced topics later, and you will do so at a much faster rate.

Do you want to be a top-performing data scientist? Master the foundations first.

That invites the question: What are the foundational data science skills in R?

The following is a quick list:

Basic visualizations: bar charts, line charts, histograms
Manipulating colors in charts
Visualization formatting (i.e., how to format your charts to make them look good)
String manipulation
Date manipulation
Data reshaping (i.e., transforming from ‘wide’ format to ‘long’ format and vice versa)
Adding variables
Removing variables
Aggregating data
Reading in data (from external sources)
Working with factor variables (e.g., ordering factors, re-naming factor levels, re-categorizing factor variables, etc.)
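As a rough illustration of the visualization items on this list, here is a minimal ggplot2 sketch of a bar chart, a line chart, and a histogram, with some basic color and formatting choices. The data sets (mtcars from base R, economics from ggplot2), the colors, and the labels are my own assumptions for the example, not a prescribed set.

```r
library(ggplot2)  # assumption: ggplot2 is installed

# Bar chart: number of cars per cylinder count, with a custom fill color.
bar_chart <- ggplot(data = mtcars, aes(x = factor(cyl))) +
  geom_bar(fill = "steelblue") +
  labs(title = "Cars by cylinder count", x = "Cylinders", y = "Count")

# Line chart: the unemploy column of the economics data set over time.
line_chart <- ggplot(data = economics, aes(x = date, y = unemploy)) +
  geom_line(color = "darkred") +
  labs(title = "Unemployment over time", x = "Date", y = "Unemployed (thousands)")

# Histogram: distribution of miles per gallon, with a simpler theme.
histogram <- ggplot(data = mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, fill = "grey40", color = "white") +
  theme_minimal()

print(bar_chart)
print(line_chart)
print(histogram)
```

All three charts follow the same pattern: a data set, an aes() mapping, one geom, and optional formatting layers.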

This is a fairly high-level list, but it is a pretty good list of the things that you absolutely need to know. If you can’t do these, do not move on to more advanced topics. Don’t get shiny object syndrome. If you can’t do all of these skills, you need to go back and learn them right now.

Moreover, if you want to be a top performer, you should be able to do these things without even thinking about them. Top-level data scientists can do these things ‘in their sleep.’
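To give a sense of how small these units are, here is a minimal sketch of several data manipulation items from the list above, written in base R only. The sales data frame and all of its column names are hypothetical, made up for this example. Many people use add-on packages for these tasks instead, and this sketch does not take a position on which tool is best.

```r
# Hypothetical data, for illustration only.
sales <- data.frame(
  region = c("east", "east", "west", "west"),
  signup = c("2017-01-15", "2017-02-20", "2017-03-05", "2017-04-10"),
  q1 = c(10, 20, 30, 40),
  q2 = c(15, 25, 35, 45),
  stringsAsFactors = FALSE
)

# Adding a variable.
sales$total <- sales$q1 + sales$q2

# Aggregating data: sum of total within each region.
by_region <- aggregate(total ~ region, data = sales, FUN = sum)

# Removing a variable.
sales$total <- NULL

# String manipulation: upper-case the region names.
sales$region <- toupper(sales$region)

# Date manipulation: parse text into Date values, then extract the month.
sales$signup <- as.Date(sales$signup)
sales$signup_month <- format(sales$signup, "%m")

# Working with factors: set an explicit level order.
sales$region <- factor(sales$region, levels = c("WEST", "EAST"))

# Data reshaping: wide (one column per quarter) to long (one row per quarter).
sales$id <- seq_len(nrow(sales))
long <- reshape(
  sales,
  direction = "long",
  varying = c("q1", "q2"),
  v.names = "amount",
  timevar = "quarter",
  times = c("q1", "q2"),
  idvar = "id"
)
```

Each step is a line or two, which is what makes these techniques good candidates for repeated practice.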

Design a learning plan

Once you select the right topics, you need a learning plan. Specifically, you need to sequence the topics in the best order.

One reason for this is that some topics are dependent on others. For example, in math, you need to know arithmetic before you learn algebra, and you should know algebra before you learn calculus.

Data science is similar. There are some topics that are dependent on other topics. For example, I have routinely said that the prerequisites for machine learning are data visualization, data manipulation, and data analysis. To effectively learn ML, you need to be able to wrangle a data set, clean it, and visualize it. So, if you want to eventually learn ML, you need to start with visualization and manipulation first.
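One way to picture this kind of dependency is as a list of prerequisites that gets turned into a study order. The sketch below is only an illustration under my own assumptions: the prereqs list encodes the machine learning example from the paragraph above, and sequence_topics() is a hypothetical helper, not a function from any package. It handles hard dependencies only, not finer judgments about what to start with.

```r
# Each topic maps to the topics that must be learned before it.
prereqs <- list(
  "data visualization" = character(0),
  "data manipulation"  = character(0),
  "data analysis"      = character(0),
  "machine learning"   = c("data visualization", "data manipulation", "data analysis")
)

# Hypothetical helper: repeatedly pick every topic whose prerequisites
# have already been scheduled, until nothing is left.
sequence_topics <- function(prereqs) {
  ordered <- character(0)
  remaining <- names(prereqs)
  while (length(remaining) > 0) {
    is_ready <- vapply(
      remaining,
      function(topic) all(prereqs[[topic]] %in% ordered),
      logical(1)
    )
    ready <- remaining[is_ready]
    if (length(ready) == 0) stop("circular prerequisites")
    ordered <- c(ordered, ready)
    remaining <- setdiff(remaining, ready)
  }
  ordered
}

plan <- sequence_topics(prereqs)
print(plan)
```

With these inputs, machine learning comes last because it is the only topic with prerequisites.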

There are also more subtle sequencing considerations. I personally believe that it’s best not to start with data manipulation, because by definition, data manipulation is required for more complex and messy data. There are much better data science topics to start with.

So in order to rapidly master data science, you need to be able to sequence the material in the optimal order, so you learn the right things at the right time.

Learn

Once you have a plan, you can start learning.

However, learning data science topics can be challenging. Many topics can be very confusing. For example, when I try to learn something new, I typically buy 5 or 10 books on the topic, but I often find that many of the books don’t explain the topic in a clear way. Frequently, out of 5 or 10 books on a subject, 1 or 2 books are dramatically better at explaining that subject.

How quickly you learn can depend critically on the quality of your learning materials.

Practice

Having said that, learning is not the final step.

Once you learn the basic concepts and techniques, you need to practice. You need to practice techniques and review concepts until they are ‘second nature.’

This is an extremely important point. There is absolutely a difference between learning something once, and remembering it in the long run.

Let me give you an example. If I show you a video right now that explains the ggplot() function, you’ll probably understand how it works. The syntax is fairly easy to understand once someone breaks it down and explains each piece of the syntax.

Next, if I ask you to write some simple ggplot() code, you’ll probably be able to do that too. For example, let’s say I ask you to create a simple scatter plot:

library(ggplot2)
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point()

If I ask you to do something simple, like typing the code into RStudio, you’ll probably be able to do it.

But what if I ask you to do it again 3 hours later? If I ask you 3 hours later to write that code from memory, there’s a good chance that you won’t be able to do it.

Why?

Because we forget. The human brain naturally forgets.

However, there’s a way to fight this. You can halt this forgetting process by practicing. Specifically, you need to repeat and review what you’ve learned.

Practicing techniques and repeating what you learn will enable you to remember those things in the long run. Moreover, as you practice, you will become more ‘fluent’ in those techniques. The more you practice, the more quickly and with less hesitation you will execute them.

An added benefit of practice is that effective practice methods help you become a “top performer.” In fact, research has shown that elite levels of performance are strongly tied to deliberate practice. If you want to be a top performer, practice is critical.

I’ve said several times that to be a top-performing data scientist, you need to be able to execute the basic techniques ‘in your sleep.’ You should be able to do essential data visualization and data manipulation ‘with your eyes closed.’

You can achieve this level of mastery by practicing the right way, using good practice systems.

Effective learning has large benefits

Learning data science quickly can be a massive benefit.

Let’s put some numbers around it.

Let’s say that two people are learning data science: you and someone else. The other person learns extremely inefficiently, and takes 1000 hours to master the basics. But you learn the basics much faster, in about 200 hours.

The difference, 800 hours, is a really big difference. Again, we can put some numbers around this.

If your free time is worth only $20 an hour, that time savings of 800 hours translates to $16,000.

But let’s say that you really value your time. (You should value your time. Time is the only resource you can’t get back.) If you value your time at $50 an hour, the time you save by learning more efficiently amounts to a staggering $40,000.

Now these are just example numbers for illustration, but you get the idea.
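The arithmetic behind those figures can be written out in a few lines of R. The variable names are my own; the hours and hourly values are the example numbers from the text.

```r
hours_slow <- 1000  # hours the inefficient learner needs
hours_fast <- 200   # hours you need

# Time saved by learning efficiently.
hours_saved <- hours_slow - hours_fast

# Dollar value of the saved time at $20 and $50 per hour.
hourly_values <- c(20, 50)
savings <- hours_saved * hourly_values

print(hours_saved)  # 800
print(savings)      # 16000 40000
```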

Being highly effective and efficient in learning data science has massive benefits.

There’s actually another benefit of being an effective learner. If you really know how to learn, you’ll not only learn more quickly, but you’ll attain higher levels of proficiency and mastery.

If you are highly effective at learning data science, it becomes much easier to become a ‘top performer.’

It really pays to be a top performer. The reason for this is that the best people in tech often receive outsized gains. The best people disproportionately get the best jobs, highest salaries, and best perks. You’ve probably heard about the mythical ’10X developer’ … people who are 10 times more productive. These top performers often get the lion’s share of rewards in the tech industry.

In this regard, the tech world is sort of like sports. Think about basketball: you have a few guys like Kobe, and Michael, and LeBron who made millions of dollars per year. Then you have a larger set of guys in the NBA who have less skill and make dramatically less money. Even worse: for every guy who made it into the NBA, there are hundreds who didn’t.

The tech world is similar. Top performers get the lion’s share of rewards, while less skilled performers make dramatically less (and many people struggle to break into the industry at all).

Get expert help

It definitely pays to learn data science as quickly as possible and to master the techniques.

Let’s review how you can do that:

Break it down
Figure out what to do, and what not to do
Sequence the material
Learn
Practice

At a high level, that’s really it (although the key to getting it right is in the details).

If you can apply this learning process to data science, you’ll accelerate your learning and increase your chances of success.

But if you really want to accelerate your progress and learn as quickly as possible, there’s one more thing you can do.

You can get guidance from an expert.

Top performers understand that they can save massive amounts of time by getting advice from people who have already mastered the topic.

Learning a new subject is time consuming, because you need to figure out what you need to learn, design a learning plan, sequence the material, and do all of the other things I’ve already talked about. But you need to do these things without a clear understanding of the subject. It’s like trying to find your way through a jungle, alone, without knowledge of the terrain. You’d be well served by getting a guide … someone to safely and quickly get you to your destination.

A data science mentor can tell you exactly what to do: “learn this first, learn this second, focus on x-y-z, don’t bother learning that topic, etc.” A good teacher can dramatically accelerate your learning, because they remove the burden of having to find the path on your own.

For example, when ‘superlearner’ Tim Ferriss wants to learn something, he finds a world class expert on the subject and gets help. Ferriss knows that he can dramatically accelerate his learning by getting expert advice.

If you want to rapidly master data science, you need to do the same. While it is possible to learn data science on your own, you can learn much, much faster with expert guidance. That might include finding a data science mentor, but it could also mean a good data science course.

The post How to rapidly master data science appeared first on SHARP SIGHT LABS.
