Welcome to Global Superstore. We’re an online retailer specializing in office items, including basic supplies, furniture, and tech products. We serve corporate, consumer, and home-based buyers. As our name suggests, we operate globally, with customers from 165 countries.
We’re looking for a simple, one-time report that helps us understand our sales performance. At the company level, we would like to see sales performance over time, broken down by year. We’re also interested in how our sales varied between global regions in our most recent year of sales (2015) relative to the prior year; specifically, we want to know how sales changed from 2014 to 2015 in each region. Finally, we’d like to know which of our products drive the biggest profit for the company.
Data are held in our BigQuery data warehouse in the `orders` table.
We want a one-time report that includes the following items:

- Sales performance by year (unique customers, orders placed, items sold, revenue, and profit)
- The change in each global region’s sales from 2014 to 2015
- The products that drive the biggest total profit
To create visualizations for the data the client has requested, I will be using R. Prior to starting the analysis, I need to load libraries and clear the workspace in R:
# LOAD LIBRARIES
library(tidyverse)
library(kableExtra)
# PREP WORKSPACE
rm(list = ls())
The following SQL code will pull the data needed to assess yearly performance:
WITH
  allData AS (
    SELECT
      Order_ID AS OrderId,
      EXTRACT(YEAR FROM Order_Date) AS OrderYear, --GET YEAR OF ORDER
      Customer_ID AS CustomerId, --WILL NEED TO FIND UNIQUE CUSTOMER IDS
      Quantity, --NUMBER OF PRODUCT SOLD IN THE ORDER
      Quantity * CAST(Sales AS NUMERIC) AS TotalPrice,
      Quantity * CAST(Profit AS NUMERIC) AS TotalProfit --NEEDED FOR YEARLY PROFIT
    FROM `global_superstore.orders` AS orders
  )
SELECT
  OrderYear,
  COUNT(DISTINCT CustomerId) AS UniqueCustomers,
  COUNT(DISTINCT OrderId) AS OrdersPlaced,
  SUM(Quantity) AS ItemsSold,
  SUM(TotalPrice) AS SalesRevenue,
  SUM(TotalProfit) AS SalesProfit
FROM allData
GROUP BY OrderYear
ORDER BY OrderYear
After running the SQL script above in BigQuery, I saved the output to a CSV file. Here, I load the CSV into R before creating visualizations:
# LOAD THE DATA AND ASSIGN TO VARIABLE
yearly_data <- read.csv('C:/Users/danie/Documents/R Scripts/analytics_portfolio/case_study_global_superstore/data/yearly_sales.csv')
# VIEW THE DATA
yearly_data
## OrderYear UniqueCustomers OrdersPlaced ItemsSold SalesRevenue SalesProfit
## 1 2012 4164 4515 31443 11129609 1276317
## 2 2013 4949 5473 38111 12962199 1414985
## 3 2014 6113 6883 48136 16743666 2018935
## 4 2015 7624 8857 60622 20985820 2518008
To show yearly performance in the metrics requested, I will use simple bar charts. To avoid repetition and clutter, I will provide code for the first chart only. The other charts will have the same format but with a different variable on the y-axis.
# BAR PLOT: UNIQUE CUSTOMERS BY YEAR
ggplot(yearly_data, aes(x = OrderYear,
y = UniqueCustomers)) +
geom_bar(stat = 'identity',
color = 'black',
fill = '#3366CC') +
geom_text(aes(label = UniqueCustomers),
nudge_y = 200) +
labs(title = 'Unique Customers by Year',
x = '',
y = 'Unique Customers') +
theme_bw() +
theme(plot.title = element_text(size = 18, face = 'bold', hjust = 0.5),
axis.title.y = element_text(size = 16, face = 'bold', margin = margin(t = 0, r = 15, b = 0, l = 0)),
axis.text = element_text(size = 12, color = 'black'))
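The remaining yearly charts differ only in the y-variable. As a sketch of how that repetition could be factored out (the helper name and labels here are my own, not part of the original report), a small function using ggplot2's curly-curly tidy evaluation works:

```r
library(tidyverse)

# HELPER: YEARLY BAR CHART FOR ANY METRIC COLUMN (ILLUSTRATIVE SKETCH)
yearly_bar <- function(data, metric, y_label) {
  ggplot(data, aes(x = OrderYear, y = {{ metric }})) +
    geom_col(color = 'black', fill = '#3366CC') +        # EQUIVALENT TO geom_bar(stat = 'identity')
    geom_text(aes(label = {{ metric }}), vjust = -0.5) + # VALUE LABELS ABOVE BARS
    labs(title = paste(y_label, 'by Year'), x = '', y = y_label) +
    theme_bw()
}

# USAGE (with yearly_data loaded as above):
# yearly_bar(yearly_data, UniqueCustomers, 'Unique Customers')
# yearly_bar(yearly_data, OrdersPlaced, 'Orders Placed')
# yearly_bar(yearly_data, ItemsSold, 'Items Sold')
# yearly_bar(yearly_data, SalesRevenue, 'Sales Revenue (USD)')
```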
The following SQL code will pull the data needed to assess regional sales:
WITH
  allData AS (
    SELECT
      Order_ID AS OrderId,
      EXTRACT(YEAR FROM Order_Date) AS OrderYear, --GET YEAR OF ORDER
      Region, --GLOBAL REGION
      Quantity, --NUMBER OF PRODUCT SOLD IN THE ORDER
      Quantity * CAST(Sales AS NUMERIC) AS TotalPrice
    FROM `global_superstore.orders` AS orders
    WHERE EXTRACT(YEAR FROM Order_Date) IN (2014, 2015)
  ),
  agData AS (
    SELECT
      Region,
      OrderYear,
      COUNT(DISTINCT OrderId) AS OrdersPlaced,
      SUM(Quantity) AS ItemsSold,
      SUM(TotalPrice) AS SalesRevenue
    FROM allData
    GROUP BY Region, OrderYear
  )
SELECT *
FROM (
  SELECT
    Region,
    (OrdersPlaced - LAG(OrdersPlaced) OVER (PARTITION BY Region ORDER BY OrderYear))
      / LAG(OrdersPlaced) OVER (PARTITION BY Region ORDER BY OrderYear) * 100 AS PercDiffOrdersPlaced,
    (ItemsSold - LAG(ItemsSold) OVER (PARTITION BY Region ORDER BY OrderYear))
      / LAG(ItemsSold) OVER (PARTITION BY Region ORDER BY OrderYear) * 100 AS PercDiffItemsSold,
    (SalesRevenue - LAG(SalesRevenue) OVER (PARTITION BY Region ORDER BY OrderYear))
      / LAG(SalesRevenue) OVER (PARTITION BY Region ORDER BY OrderYear) * 100 AS PercSalesRevenue
  FROM agData
) AS innerData
WHERE PercDiffOrdersPlaced IS NOT NULL --KEEP ONLY THE 2015-VS-2014 ROWS
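The LAG() window logic above can be sanity-checked in R with dplyr's lag(). This sketch uses a made-up two-region table (the numbers are toy values, not the real data) and reproduces the same percent-difference calculation:

```r
library(tidyverse)

# TOY INPUT MIMICKING agData (VALUES ARE MADE UP)
ag_data <- tibble(
  Region       = c('Canada', 'Canada', 'Oceania', 'Oceania'),
  OrderYear    = c(2014, 2015, 2014, 2015),
  OrdersPlaced = c(100, 120, 200, 150)
)

# SAME PERCENT-CHANGE LOGIC AS THE SQL LAG() WINDOW
ag_data %>%
  arrange(Region, OrderYear) %>%
  group_by(Region) %>%
  mutate(PercDiffOrdersPlaced =
           (OrdersPlaced - lag(OrdersPlaced)) / lag(OrdersPlaced) * 100) %>%
  ungroup() %>%
  filter(!is.na(PercDiffOrdersPlaced))
# Canada: +20%; Oceania: -25%
```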
After running the SQL script above in BigQuery, I saved the output to a CSV file. Here, I load the CSV into R before creating visualizations:
# LOAD THE DATA AND ASSIGN TO VARIABLE
regional_data <- read.csv('C:/Users/danie/Documents/R Scripts/analytics_portfolio/case_study_global_superstore/data/regional_data.csv')
# VIEW THE DATA
regional_data
## Region PercDiffOrdersPlaced PercDiffItemsSold PercSalesRevenue
## 1 Eastern Europe 38.755981 38.028169 40.472984
## 2 Oceania 41.402715 17.899891 18.083653
## 3 Western Asia 36.562500 36.276674 47.488848
## 4 Eastern US 26.273458 20.458554 23.129966
## 5 Eastern Africa 26.666667 22.268041 48.074259
## 6 Southern Asia 25.956284 34.497091 18.380697
## 7 Central Africa 39.285714 9.324009 -32.801671
## 8 Central US 33.114754 22.085630 -21.496249
## 9 Western Europe 26.889715 28.082307 34.112309
## 10 North Africa 29.670330 31.030303 64.966617
## 11 South America 17.317073 9.249926 25.653852
## 12 Central Asia 57.142857 110.752688 76.080216
## 13 Southeastern Asia 40.897098 44.261745 44.763028
## 14 Western US 29.116945 42.358804 43.018900
## 15 Canada 6.896552 7.392996 16.019269
## 16 Caribbean 6.250000 6.356589 6.111547
## 17 Northern Europe 53.333333 66.368421 73.550853
## 18 Eastern Asia 20.952381 19.792531 16.977597
## 19 Central America 13.950617 9.086428 16.531398
## 20 Southern Africa 36.231884 38.562092 14.101771
## 21 Western Africa 40.760870 32.807215 32.391020
## 22 Southern US 28.638498 19.738481 27.960959
## 23 Southern Europe 34.628975 39.529058 14.981577
To show regional performance from 2014 to 2015 in the metrics requested, I will use simple, ordered bar charts. Bars will be filled based on sign, with positive year-over-year changes in green and negative changes in red. To avoid repetition and clutter, I will provide code for the first chart only; the other charts have the same format but with a different variable on the y-axis.
# BAR PLOT: CHANGE IN ORDERS BY REGION
ggplot(regional_data, aes(x = reorder(Region, -PercDiffOrdersPlaced),
y = PercDiffOrdersPlaced,
fill = PercDiffOrdersPlaced < 0)) +
geom_bar(stat = 'identity',
color = 'black') +
scale_fill_manual(values = c("#006600", "firebrick")) + # THIS WILL FORMAT THE FILL BASED ON VALUE (POS / NEG)
labs(title = 'Percent Change in Orders Placed from 2014 to 2015',
x = '',
y = 'Change in Orders (%)') +
theme_bw() +
theme(plot.title = element_text(size = 18, face = 'bold', hjust = 0.5),
axis.title.y = element_text(size = 16, face = 'bold', margin = margin(t = 0, r = 15, b = 0, l = 0)),
axis.text.x = element_text(size = 12, color = 'black', angle = 45, hjust = 1),
axis.text.y = element_text(size = 12, color = 'black'),
legend.position = 'none')
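One detail worth flagging in the code above: fill = PercDiffOrdersPlaced < 0 maps each bar to a logical value, and the unnamed values in scale_fill_manual() are assigned in level order (FALSE before TRUE), so green lands on gains and red on losses. A minimal, self-contained sketch of the same trick with toy values:

```r
library(tidyverse)

# TOY VALUES TO DEMONSTRATE SIGN-BASED BAR FILL (NOT THE REAL REGIONAL DATA)
toy <- tibble(Region     = c('A', 'B', 'C'),
              PercChange = c(30, -10, 15))

ggplot(toy, aes(x = reorder(Region, -PercChange),
                y = PercChange,
                fill = PercChange < 0)) +
  geom_col(color = 'black') +
  scale_fill_manual(values = c("#006600", "firebrick")) + # FALSE -> GREEN, TRUE -> RED
  theme_bw() +
  theme(legend.position = 'none')
```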
The following SQL code will pull the top 10 products ranked by total profit. Note that the `Product_Name` field in the original dataset is formatted as `Product Name, Description` - we’re only interested in the product name, not the description, so I extract the name using `REGEXP_SUBSTR()`. I also pull the number of orders containing each product and the average profit per order.
SELECT
  REGEXP_SUBSTR(Product_Name, r"^(.+?),") AS ProductName, --NAME ONLY, WITHOUT THE DESCRIPTION
  COUNT(*) AS TotalSold, --NUMBER OF ORDERS CONTAINING THE PRODUCT
  ROUND(AVG(Quantity * CAST(Profit AS NUMERIC)), 2) AS AvgProfit, --AVERAGE PROFIT PER ORDER
  SUM(Quantity * CAST(Profit AS NUMERIC)) AS TotalProfit
FROM `global_superstore.orders` AS orders
GROUP BY ProductName --BIGQUERY ALLOWS GROUPING BY THE ALIAS
ORDER BY TotalProfit DESC
LIMIT 10
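The capture group in the pattern behaves as a lazy "everything up to the first comma". The same extraction can be illustrated in R with sub() (the descriptions in the sample strings below are made up for illustration):

```r
# KEEP EVERYTHING BEFORE THE FIRST COMMA, AS REGEXP_SUBSTR(Product_Name, r"^(.+?),") DOES
# (THE DESCRIPTIONS HERE ARE INVENTED FOR ILLUSTRATION)
product_raw <- c('Cisco Smart Phone, Full Size',
                 'Sauder Classic Bookcase, Traditional')

sub('^(.+?),.*$', '\\1', product_raw, perl = TRUE)
# "Cisco Smart Phone"  "Sauder Classic Bookcase"
```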
After running the SQL script above in BigQuery, I saved the output to a CSV file. Here, I load the CSV into R before creating visualizations:
profit_prods <- read.csv('C:/Users/danie/Documents/R Scripts/analytics_portfolio/case_study_global_superstore/data/profitable_products.csv')
profit_prods
## ProductName TotalSold AvgProfit TotalProfit
## 1 Cisco Smart Phone 92 1672.84 153901.45
## 2 Nokia Smart Phone 83 1607.89 133454.56
## 3 Sauder Classic Bookcase 85 1544.10 131248.80
## 4 Hewlett Wireless Fax 82 1311.74 107562.80
## 5 Brother Fax Machine 79 1286.67 101647.30
## 6 Canon Wireless Fax 69 1462.83 100935.38
## 7 Harbour Creations Executive Leather Armchair 84 1195.52 100423.46
## 8 Bush Classic Bookcase 77 1271.58 97911.85
## 9 Memorex Router 77 1233.24 94959.28
## 10 Dania Classic Bookcase 72 1007.05 72507.56
To show the most profitable products, I will use a simple data table.
kable(profit_prods,
      col.names = c('Product',
                    'Number of Orders',
                    'Average Profit per Order (USD)',
                    'Total Profit (USD)'),
      align = c('l', 'c', 'c', 'c'),
      caption = 'Products ranked by total profit generated, all-time') %>%
  kable_styling()
| Product | Number of Orders | Average Profit per Order (USD) | Total Profit (USD) |
|:---|:---:|:---:|:---:|
| Cisco Smart Phone | 92 | 1672.84 | 153901.45 |
| Nokia Smart Phone | 83 | 1607.89 | 133454.56 |
| Sauder Classic Bookcase | 85 | 1544.10 | 131248.80 |
| Hewlett Wireless Fax | 82 | 1311.74 | 107562.80 |
| Brother Fax Machine | 79 | 1286.67 | 101647.30 |
| Canon Wireless Fax | 69 | 1462.83 | 100935.38 |
| Harbour Creations Executive Leather Armchair | 84 | 1195.52 | 100423.46 |
| Bush Classic Bookcase | 77 | 1271.58 | 97911.85 |
| Memorex Router | 77 | 1233.24 | 94959.28 |
| Dania Classic Bookcase | 72 | 1007.05 | 72507.56 |