10 Tips and Tricks for Data Scientists Vol.3

[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I have started a series of articles on tips and tricks for data scientists (mainly in Python and R). In case you missed vol 1 and vol 2.

Python

1. How to work with JSON cells in pandas

Assume that you are dealing with a pandas DataFrame where one of your columns is in JSON format and you want to extract specific information. For this example, we will work with the doc_report.csv dataset from Kaggle.

import pandas as pd
import ast
pd.set_option("max_colwidth", 180)
doc = pd.read_csv("doc_reports.csv", index_col=0)
# print the properties column
doc['properties']
Output

If we look at the data, the properties field is in JSON format. This means that we need to convert it to a dictionary and then extract the required information. We will work with the ast library to convert it to a dictionary and then we will create separate columns for each key as follows:

dummy = doc['properties'].apply(lambda x: ast.literal_eval(x))
doc['gender'] = dummy.apply(lambda x:x.get('gender'))
doc['nationality'] = dummy.apply(lambda x:x.get('nationality'))
doc['document_type'] = dummy.apply(lambda x:x.get('document_type'))
doc['date_of_expiry'] = dummy.apply(lambda x:x.get('date_of_expiry'))
doc['issuing_country'] = dummy.apply(lambda x:x.get('issuing_country'))
# lets get the columns
doc[['gender', 'nationality', 'document_type', 'date_of_expiry','issuing_country' ]]
Converting JSON data to columns

As you can see, we converted a JSON data type cell to columns based on the key values.

2. How to change multiple column values with applymap

We will provide an example of how you can change multiple column values in pandas DataFrames using the applymap function. Assume that your DataFrame takes values 12, and 3 and you want to apply the following mapping function:

  • If 1, then 0.
  • If 2 or 3, then 1.
df = pd.DataFrame({'A':[1,1,2,2,3,3],
'B':[1,2,3,1,2,3]})
df

        A	B
0	1	1
1	1	2
2	2	3
3	2	1
4	3	2
5	3	3

Using the applymap function:

# create the mapping dictionary
d = {1 : 0, 2: 1, 3: 1}
# apply it to all columns
df.applymap(d.get)

        A	B
0	0	0
1	0	1
2	1	1
3	1	0
4	1	1
5	1	1

3. How to build treemaps with Plotly

With plotly.express, you can easily create fancy treemaps. For example, let’s say that we want to represent the life expectancy by country taking into consideration the continent and the population of the country. So, in the treemap below, the size of the rectangle refers to the population and the color refers to the life expectancy. The higher the life expectancy, the bluer the color is. The lower the life expectancy, the redder the color is.

import plotly.express as px
import numpy as np
df = px.data.gapminder().query("year == 2007")
df["world"] = "world" # in order to have a single root node
fig = px.treemap(df, path=['world', 'continent', 'country'], values='pop',
                  color='lifeExp', hover_data=['iso_alpha'],
                  color_continuous_scale='RdBu',
                  color_continuous_midpoint=np.average(df['lifeExp'], weights=df['pop']))
fig.show()
Table with data
Plot chart

4. How to get the correlation between two DataFrames

df1 = pd.DataFrame({'x11' : [10,20,30,40,50,55,60],
                    'x12' : [11,15,20,30,35,60,70]})
df2 = pd.DataFrame({'x21' : [100,150,200,250,300,400,500],
                    'x22' : [110,150,180,250,300,400,600]})
pd.concat([df1, df2], axis=1, keys=['df1', 'df2']).corr().loc['df1', 'df2']

    x21      x22
x11 0.953873 0.900050
x12 0.988908 0.975597

5. How to truncate dates to a month in pandas

Assume that you are dealing with the following DataFrame:

import pandas as pd
df = pd.DataFrame({'MyDate': ['2020-03-11', '2021-04-26', '2021-01-17']})
df['MyDate'] = pd.to_datetime(df.MyDate)
df

    MyDate
0   2020-03-11
1   2021-04-26
2   2021-01-17

And you want to truncate the date to a month:

df['Truncated'] = df['MyDate'] + pd.offsets.MonthBegin(-1)
# OR
# df['Truncated'] = df['MyDate'] - pd.offsets.MonthBegin(1)
df

  MyDate     Truncated
0 2020-03-11 2020-03-01
1 2021-04-26 2021-04-01
2 2021-01-17 2021-01-01

6. How to append multiple CSV files from a folder in pandas

Assume that you have multiple CSV files located in a specific folder and you want to concatenate all of them in a pandas DataFrame. We assume that our CSV files are under the My_Folder.

import os
import pandas as pd
# create an empty pandas data frame
df = pd.DataFrame()
# iterate over all files within "My_Folder"
for file in os.listdir("My_Folder"):
    if file.endswith(".csv"):
        df = pd.concat([df , pd.read_csv(os.path.join("My_Folder", file))], axis=0 )
# reset the index 
df.reset_index(drop=True, inplace=True)
df

Now the df consists of the CSV files within My_Folder.

7. How to concatenate multiple CSV files in Python

Assume that you have multiple CSV files located in a specific folder and you want to concatenate all of them and save them to a file called merged.csv. We can work with pandas and use the trick with mode=a within the .to_csv(), which means append:

import os
import pandas as pd
# iterate over all files within "My_Folder"
for file in os.listdir("My_Folder"):
    if file.endswith(".csv"):
        tmp = pd.read_csv(os.path.join("My_Folder", file))
        tmp.to_csv("merged.csv", index=False, header=False, mode='a')

8. How to concatenate multiple TXT files in Python

Assume that you have multiple .txt files and you want to concatenate all of them into a unique .txt file. Assume that your .txt files are within the dataset folder. Then you will need to get their paths:


import os
# find all the txt files in the dataset folder
inputs = []
for file in os.listdir("dataset"):
    if file.endswith(".txt"):
        inputs.append(os.path.join("dataset", file))
# concatanate all txt files in a file called merged_file.txt
with open('merged_file.txt', 'w') as outfile:
    for fname in inputs:
        with open(fname, encoding="utf-8", errors='ignore') as infile:
            outfile.write(infile.read())

With the snippet above, we managed to concatenate all of them into one file called merged_file.txt. In the case where the files are large, you can work as follows:


with open('merged_file.txt', 'w') as outfile:
    for fname in inputs:
        with open(fname, encoding="utf-8", errors='ignore') as infile:
            for line in infile:
                outfile.write(line)

R

9. How to get the correlation between two DataFrames

Let’s see how we can get the column-wise correlation between two DataFrames in R:

df1 = data.frame(x11 = c(10,20,30,40,50,55,60),
                 x12 = c(11,15,20,30,35,60,70)
                 )
df2 = data.frame(x21 = c(100,150,200,250,300,400,500),
                 x22 = c(110,150,180,250,300,400,600)
                 )
cor(df1,df2)

    x21       x22
x11 0.9538727 0.9000503
x12 0.9889076 0.9755973

10. The “Count(Case When … Else … End)” in R

When I run queries in SQL (or even HiveQL, Spark SQL, and so on), it is quite common to use the count(case when.. else ... end) syntax. Today, I will provide you an example of how you run this type of command in dplyr. Let’s start:

library(sqldf)
library(dplyr)
df<-data.frame(id = 1:10,
               gender = c("m","m","m","f","f","f","m","f","f","f"),
               amt= c(5,20,30,10,20,50,5,20,10,30))
df
Table with data

Let’s get the count and the sum per gender in different columns in SQL:

sqldf("select count(case when gender='m' then id else null end) as male_cnt,
              count(case when gender='f' then id else null end) as female_cnt,
              sum(case when gender='m' then amt else 0 end) as male_amt,
              sum(case when gender='f' then amt else 0 end) as female_amt
              from df")

  male_cnt female_cnt male_amt female_amt
1        4          6       60        140

Let’s get the same output in dplyr. We will need to subset the DataFrame based on one column:

df%>%summarise(male_cnt=length(id[gender=="m"]),
               female_cnt=length(id[gender=="f"]),
               male_amt=sum(amt[gender=="m"]),
               female_amt=sum(amt[gender=="f"])
               )

  male_cnt female_cnt male_amt female_amt
1        4          6       60        140

To leave a comment for the author, please follow the link and comment on their blog: R – Predictive Hacks.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)