DataCamp Competition – Reporting Sales Data
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
? Problem Statement
The accounting department of a company that sells motorcycle parts operates three warehouses in a large metropolitan area. The company want’s their sales data analysed and want to capture sales by payment method. They also want to know the average unit price for each product line.
Objectives
- What are the total sales for each payment method?
- What is the average unit price for each product line?
- Create plots to visualize findings for questions 1 and 2.
- Investigate further (e.g., average purchase value by client type, total purchase value by product line, etc.)
- Summarize the findings.
? Data Summary
#load libraries library(tidyverse) library(lubridate) #load data df <- read_csv("C:/Users/Adejumo/Downloads/sales_data.csv") head(df) ## # A tibble: 6 x 8 ## date warehouse client_type product_line quantity unit_price total ## <date> <chr> <chr> <chr> <dbl> <dbl> <dbl> ## 1 2021-06-01 Central Retail Miscellaneous 8 16.8 135. ## 2 2021-06-01 North Retail Breaking system 9 19.3 174. ## 3 2021-06-01 North Retail Suspension & tract~ 8 32.9 263. ## 4 2021-06-01 North Wholesale Frame & body 16 37.8 605. ## 5 2021-06-01 Central Retail Engine 2 60.5 121. ## 6 2021-06-01 North Wholesale Suspension & tract~ 40 37.4 1495. ## # ... with 1 more variable: payment <chr> skimr::skim(df)
Name | df |
Number of rows | 1000 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
character | 4 |
Date | 1 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
warehouse | 0 | 1 | 4 | 7 | 0 | 3 | 0 |
client_type | 0 | 1 | 6 | 9 | 0 | 2 | 0 |
product_line | 0 | 1 | 6 | 21 | 0 | 6 | 0 |
payment | 0 | 1 | 4 | 11 | 0 | 3 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date | 0 | 1 | 2021-06-01 | 2021-08-28 | 2021-07-14 | 89 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
quantity | 0 | 1 | 9.39 | 9.66 | 1.00 | 4.00 | 6.50 | 10.00 | 40.00 | ▇▂▁▁▁ |
unit_price | 0 | 1 | 30.32 | 12.26 | 10.03 | 21.09 | 28.57 | 37.92 | 66.62 | ▆▇▅▂▁ |
total | 0 | 1 | 289.11 | 345.23 | 10.35 | 93.69 | 178.36 | 321.69 | 2546.33 | ▇▁▁▁▁ |
The sales data has the following 1000 items and 8 fields:
– “date” – The date, from June to August 2021.
– “warehouse” – The company operates three warehouses: North, Central, and West.
– “client_type” – There are two types of customers: Retail and Wholesale.
– “product_line” – Type of products purchased.
– “quantity” – How many items were purchased.
– “unit_price” – Price per item sold.
– “total” – Total sale = quantity * unit_price.
– “payment” – How the client paid: Cash, Credit card, Transfer.
There are no missing values in the data.
Exploratory Data Analysis
Total sale for each payment method
df %>% group_by(payment) %>% summarize(total = sum(total)) ## # A tibble: 3 x 2 ## payment total ## <chr> <dbl> ## 1 Cash 19199. ## 2 Credit card 110272. ## 3 Transfer 159642.
Let’s visualize the results and see the percentage of total sales in each payment method.
df %>% group_by(payment) %>% summarize(total = sum(total)) %>% mutate(perc_sales = (total/sum(total))*100) %>% ggplot(aes(x = reorder(payment, perc_sales), y = perc_sales, fill = payment)) + geom_col() + geom_text(aes(label = scales::comma(perc_sales))) + xlab("Payment Method") + ylab("Percentage of total Sales")
55% of total sales were paid through bank transfer, 38% through credit card, just 7% were paid through cash. Most customers prefer to use bank transfer and credit cards for payment.
Average unit price for each product line
The average unit price per product line is the calculated by dividing the total number of sales for each product line by the total number of units sold.
df %>% group_by(product_line) %>% summarize(avg_unit_price = mean(unit_price)) %>% arrange(desc(avg_unit_price)) %>% ggplot(aes(x = reorder(product_line, avg_unit_price), y = avg_unit_price, fill = product_line)) + geom_col() + geom_text(aes(label = scales::comma(avg_unit_price))) + coord_flip() + xlab("Product Line") + ylab("Average Unit Price")
Products under the Engine product line are more expensive with an average unit price of over 60 dollars, Breaking system having the least expensive products with an average unit price of 17.7 dollars.
Average purchase value by client type
The average purchase value(APV) is giving by the total number of sales divided by the number of orders in each client type.
df %>% group_by(client_type) %>% summarize(avg_pur_value = sum(total)/sum(quantity)) ## # A tibble: 2 x 2 ## client_type avg_pur_value ## <chr> <dbl> ## 1 Retail 30.7 ## 2 Wholesale 30.8
Clients who purchase on wholesale have the highest average purchase value of 30.82 dollars while clients in retail have a purchase value of 30.72 dollars. Seems both clients have similar demands but the demand of clients on wholesale is more high.
Total purchase value by product line.
This is the total sales made in each product line
df %>% group_by(product_line) %>% summarize(total_sales = sum(total)) %>% mutate(perc_sales = total_sales/sum(total_sales)*100) %>% ggplot(aes(x = reorder(product_line, perc_sales), y = perc_sales, fill = product_line)) + geom_col() + coord_flip() + geom_text(aes(label = scales::comma(perc_sales))) + xlab("Product Line") + ylab("Percentage of total sales")
Most sales were made from products in Suspension and traction which constitute 25.25% of total sales. Products in Engine product line and Miscellaneous have the lowest amount of sales with 13.12% and 9.4% respectively.
Month with the highest total sales
df %>% group_by(month = month(date, label = T)) %>% summarize(total = sum(total)) %>% arrange(desc(total)) ## # A tibble: 3 x 2 ## month total ## <ord> <dbl> ## 1 Aug 100245. ## 2 Jun 95320. ## 3 Jul 93548.
Though sales dropped in the month of July, it sky rocketed in the month of August.
Warehouse with the highest total sales
df %>% group_by(warehouse) %>% summarize(total_sales = sum(total)) %>% mutate(perc_sales = (total_sales/sum(total_sales))*100) %>% ggplot(aes(x = reorder(warehouse, perc_sales), y = perc_sales, fill = warehouse)) + geom_col() + xlab("Warehouse") + ylab("Percentage of total sales") + geom_text(aes(label = scales::comma(perc_sales)))
Warehouse in Central of the metropolitan area have more sales than warehouses in other regions with 49 % of total sales made, which is almost close to half of the total sales. The warehouse in the western region have the lowest amount of sales with 16% of total sales.
Conclusion
The company compared to previous months have made a lot of sales with majority from the warehouse in the central region, this is a good sign indicating increase in sales. Most clients prefer to pay through bank transfer or credit card and this is due to the fact that most of them are wholesale and retail and they purchase in bulk which involves huge amount of money, it is more secure than carrying cash. As expected products in Engine are much more expensive than other parts of a motorcycle. Products under suspension and traction appeared to be in demand than other products line.
From this report, the company is doing well in terms of sales but will need to improve sales in the western region, low sales in that area might be due to low demand compared to other regions or high competition which will be advisable to look into.
If you find the insights in this analysis interesting, please upvote.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.