Business Prompt

Welcome to Global Superstore. We’re an online retailer specializing in office items, including basic supplies, furniture, and tech products. We serve corporate, consumer, and home-based buyers. As our name suggests, we operate globally, with customers from 165 countries.

We’re looking for a simple, one-time report that helps us understand our sales performance. At the company level, we would like to see sales performance over time, broken down by year. We’re also interested in how sales varied between global regions in our most recent year of sales (2015) relative to the prior year; specifically, we want to know how sales changed from 2014 to 2015 in each region. Finally, we’d like to know which of our products drive the biggest profit for the company.

Available Data

Data are held in our BigQuery data warehouse in the orders table.

Deliverables

We want a one-time report that includes the following items:

1. A summary of yearly sales performance across the whole company
2. Regional sales performance in 2015 relative to 2014
3. The products that generate the greatest total profit

Prepare R Workspace

To create visualizations for the data the client has requested, I will be using R. Prior to starting the analysis, I need to load libraries and clear the workspace in R:

# LOAD LIBRARIES
library(tidyverse)
library(kableExtra)

# PREP WORKSPACE
rm(list = ls())

Part 1: Summary of Yearly Sales Performance


Extracting Data from BigQuery

The following SQL code will pull the data needed to assess yearly performance:

WITH 

allData AS (
  SELECT 
    Order_ID AS OrderId,
    EXTRACT(YEAR FROM Order_Date) AS OrderYear, --GET YEAR OF ORDER
    Customer_ID AS CustomerId, --WILL NEED TO FIND UNIQUE CUSTOMER IDS
    Quantity, --NUMBER OF PRODUCT SOLD IN THE ORDER
    Quantity * CAST(Sales AS NUMERIC) AS TotalPrice,
    Quantity * CAST(Profit AS NUMERIC) AS TotalProfit --NEEDED FOR YEARLY PROFIT
  FROM `global_superstore.orders` AS orders
)

SELECT 
  OrderYear,
  COUNT(DISTINCT CustomerId) AS UniqueCustomers,
  COUNT(DISTINCT OrderId) AS OrdersPlaced,
  SUM(Quantity) AS ItemsSold,
  SUM(TotalPrice) AS SalesRevenue,
  SUM(TotalProfit) AS SalesProfit
FROM allData
GROUP BY OrderYear
ORDER BY OrderYear


Loading Data into R for Visualization

After running the SQL script above in BigQuery, I saved the output to a csv file. Here, I load the csv file into R before creating visualizations:

# LOAD THE DATA AND ASSIGN TO VARIABLE
yearly_data <- read.csv('C:/Users/danie/Documents/R Scripts/analytics_portfolio/case_study_global_superstore/data/yearly_sales.csv')

# VIEW THE DATA
yearly_data
##   OrderYear UniqueCustomers OrdersPlaced ItemsSold SalesRevenue SalesProfit
## 1      2012            4164         4515     31443     11129609     1276317
## 2      2013            4949         5473     38111     12962199     1414985
## 3      2014            6113         6883     48136     16743666     2018935
## 4      2015            7624         8857     60622     20985820     2518008


Data Visualizations

To show yearly performance in the metrics requested, I will use simple bar charts. To avoid repetition and clutter, I will provide code for the first chart only. The other charts will have the same format but with a different variable on the y-axis.

# BAR PLOT: UNIQUE CUSTOMERS BY YEAR
ggplot(yearly_data, aes(x = OrderYear,
                        y = UniqueCustomers)) +
    geom_col(color = 'black',
             fill = '#3366CC') +
    geom_text(aes(label = UniqueCustomers),
              nudge_y = 200) +
    labs(title = 'Unique Customers by Year',
         x = '',
         y = 'Unique Customers') +
    theme_bw() +
    theme(plot.title = element_text(size = 18, face = 'bold', hjust = 0.5),
          axis.title.y = element_text(size = 16, face = 'bold', margin = margin(t = 0, r = 15, b = 0, l = 0)),
          axis.text = element_text(size = 12, color = 'black'))
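Since the remaining yearly charts differ only in the variable plotted, they can be produced by one small helper function. The sketch below is my own addition (the function name, its arguments, and the label-offset fraction are illustrative, not part of the client spec); it assumes the `yearly_data` frame loaded above and ggplot2's curly-curly (`{{ }}`) tidy-evaluation syntax:

```r
# SKETCH: ONE HELPER FOR ALL YEARLY BAR CHARTS (NAMES ARE ILLUSTRATIVE)
plot_yearly <- function(data, yvar, ylab) {
    ggplot(data, aes(x = OrderYear, y = {{ yvar }})) +
        geom_col(color = 'black', fill = '#3366CC') +
        geom_text(aes(label = {{ yvar }}),
                  nudge_y = 0.03 * max(dplyr::pull(data, {{ yvar }}))) + # LABEL OFFSET SCALES WITH THE DATA
        labs(title = paste(ylab, 'by Year'), x = '', y = ylab) +
        theme_bw()
}

# USAGE:
# plot_yearly(yearly_data, ItemsSold, 'Items Sold')
```

This keeps the four charts visually identical while avoiding four near-duplicate ggplot calls.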





Part 2: Regional Sales Performance (2015 vs 2014)


Extracting Data from BigQuery

The following SQL code will pull the data needed to assess regional sales performance:

WITH 

allData AS (
  SELECT 
    Order_ID AS OrderId,
    EXTRACT(YEAR FROM Order_Date) AS OrderYear, --GET YEAR OF ORDER
    Region, --GLOBAL REGION
    Quantity, --NUMBER OF PRODUCT SOLD IN THE ORDER
    Quantity * CAST(Sales AS NUMERIC) AS TotalPrice
  FROM `global_superstore.orders` AS orders
  WHERE EXTRACT(YEAR FROM Order_Date) IN (2014, 2015)
),

agData AS (
  SELECT 
    Region,
    OrderYear,
    COUNT(DISTINCT OrderId) AS OrdersPlaced,
    SUM(Quantity) AS ItemsSold,
    SUM(TotalPrice) AS SalesRevenue
  FROM allData
  GROUP BY Region, OrderYear
)

SELECT * 
FROM (
  SELECT 
    Region,
    (OrdersPlaced - LAG(OrdersPlaced) OVER w) / LAG(OrdersPlaced) OVER w * 100 AS PercDiffOrdersPlaced,
    (ItemsSold - LAG(ItemsSold) OVER w) / LAG(ItemsSold) OVER w * 100 AS PercDiffItemsSold,
    (SalesRevenue - LAG(SalesRevenue) OVER w) / LAG(SalesRevenue) OVER w * 100 AS PercSalesRevenue
  FROM agData
  WINDOW w AS (PARTITION BY Region ORDER BY OrderYear) --ONE NAMED WINDOW INSTEAD OF SIX REPEATED OVER() CLAUSES
) innerData
WHERE PercDiffOrdersPlaced IS NOT NULL --DROPS EACH REGION'S 2014 ROW, WHICH HAS NO PRIOR YEAR
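The LAG() arithmetic above is just the standard percent-change formula, (new − old) / old × 100, applied within each region. A quick sanity check of that formula in R (the numbers below are illustrative, not pulled from the warehouse):

```r
# PERCENT CHANGE FORMULA USED IN THE QUERY (ILLUSTRATIVE INPUTS)
perc_change <- function(old, new) (new - old) / old * 100

perc_change(old = 400, new = 500) # 25: a 100-order gain on a 400 base
perc_change(old = 500, new = 400) # -20: the same swing measured against the larger base
```

Note the asymmetry: each change is measured relative to the earlier year, which is why the 2014 baseline row is required and then filtered out.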


Loading Data into R for Visualization

After running the SQL script above in BigQuery, I saved the output to a csv file. Here, I load the csv file into R before creating visualizations:

# LOAD THE DATA AND ASSIGN TO VARIABLE
regional_data <- read.csv('C:/Users/danie/Documents/R Scripts/analytics_portfolio/case_study_global_superstore/data/regional_data.csv')

# VIEW THE DATA
regional_data
##               Region PercDiffOrdersPlaced PercDiffItemsSold PercSalesRevenue
## 1     Eastern Europe            38.755981         38.028169        40.472984
## 2            Oceania            41.402715         17.899891        18.083653
## 3       Western Asia            36.562500         36.276674        47.488848
## 4         Eastern US            26.273458         20.458554        23.129966
## 5     Eastern Africa            26.666667         22.268041        48.074259
## 6      Southern Asia            25.956284         34.497091        18.380697
## 7     Central Africa            39.285714          9.324009       -32.801671
## 8         Central US            33.114754         22.085630       -21.496249
## 9     Western Europe            26.889715         28.082307        34.112309
## 10      North Africa            29.670330         31.030303        64.966617
## 11     South America            17.317073          9.249926        25.653852
## 12      Central Asia            57.142857        110.752688        76.080216
## 13 Southeastern Asia            40.897098         44.261745        44.763028
## 14        Western US            29.116945         42.358804        43.018900
## 15            Canada             6.896552          7.392996        16.019269
## 16         Caribbean             6.250000          6.356589         6.111547
## 17   Northern Europe            53.333333         66.368421        73.550853
## 18      Eastern Asia            20.952381         19.792531        16.977597
## 19   Central America            13.950617          9.086428        16.531398
## 20   Southern Africa            36.231884         38.562092        14.101771
## 21    Western Africa            40.760870         32.807215        32.391020
## 22       Southern US            28.638498         19.738481        27.960959
## 23   Southern Europe            34.628975         39.529058        14.981577


Data Visualizations

To show regional performance from 2014 to 2015 in the metrics requested, I will use simple, ordered bar charts. Bars will be filled based on sign, with positive year-over-year changes in green and negative changes in red. To avoid repetition and clutter, I will provide code for the first chart only. The other charts will have the same format but with a different variable on the y-axis.

# BAR PLOT: CHANGE IN ORDERS BY REGION
ggplot(regional_data, aes(x = reorder(Region, -PercDiffOrdersPlaced),
                          y = PercDiffOrdersPlaced,
                          fill =  PercDiffOrdersPlaced < 0)) +
    geom_col(color = 'black') +
    scale_fill_manual(values = c("#006600", "firebrick")) + # FALSE (POSITIVE) -> GREEN, TRUE (NEGATIVE) -> RED
    labs(title = 'Percent Change in Orders Placed from 2014 to 2015',
         x = '',
         y = 'Change in Orders (%)') +
    theme_bw() +
    theme(plot.title = element_text(size = 18, face = 'bold', hjust = 0.5),
          axis.title.y = element_text(size = 16, face = 'bold', margin = margin(t = 0, r = 15, b = 0, l = 0)),
          axis.text.x = element_text(size = 12, color = 'black', angle = 45, hjust = 1),
          axis.text.y = element_text(size = 12, color = 'black'),
          legend.position = 'none')




Part 3: Most Profitable Items


Extracting Data from BigQuery

The following SQL code will pull the top 10 products ranked by total profit. Note that the Product_Name field in the original dataset is formatted as Product Name, Description - we’re only interested in the product name, not the description, so I extract the name using REGEXP_SUBSTR(). I also pull the number of orders containing each product and the average profit per order.

SELECT *
FROM (
  SELECT 
    REGEXP_SUBSTR(Product_Name, r"^(.+?),") AS ProductName,
    COUNT(*) AS TotalSold,
    ROUND(AVG(Quantity * CAST(Profit AS NUMERIC)), 2) AS AvgProfit,
    SUM(Quantity * CAST(Profit AS NUMERIC)) AS TotalProfit
  FROM `global_superstore.orders` AS orders
  GROUP BY ProductName
) innerData
ORDER BY TotalProfit DESC
LIMIT 10
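The "everything before the first comma" extraction can be spot-checked in R with stringr (loaded as part of the tidyverse); the product string below is made up for illustration:

```r
# SAME EXTRACTION AS REGEXP_SUBSTR ABOVE: TEXT BEFORE THE FIRST COMMA
library(stringr)

str_extract("Cisco Smart Phone, Full Size", "^[^,]+")
## [1] "Cisco Smart Phone"
```

For names that contain a comma, the character-class pattern `^[^,]+` is equivalent to the lazy `^(.+?),` used in the BigQuery call; they differ only when no comma is present.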


Loading Data into R for Visualization

After running the SQL script above in BigQuery, I saved the output to a csv file. Here, I load the csv file into R before creating visualizations:

profit_prods <- read.csv('C:/Users/danie/Documents/R Scripts/analytics_portfolio/case_study_global_superstore/data/profitable_products.csv')

profit_prods
##                                     ProductName TotalSold AvgProfit TotalProfit
## 1                             Cisco Smart Phone        92   1672.84   153901.45
## 2                             Nokia Smart Phone        83   1607.89   133454.56
## 3                       Sauder Classic Bookcase        85   1544.10   131248.80
## 4                          Hewlett Wireless Fax        82   1311.74   107562.80
## 5                           Brother Fax Machine        79   1286.67   101647.30
## 6                            Canon Wireless Fax        69   1462.83   100935.38
## 7  Harbour Creations Executive Leather Armchair        84   1195.52   100423.46
## 8                         Bush Classic Bookcase        77   1271.58    97911.85
## 9                                Memorex Router        77   1233.24    94959.28
## 10                       Dania Classic Bookcase        72   1007.05    72507.56


Data Visualizations

To show the most profitable products, I will use a simple data table.

kable(profit_prods,
      col.names = c('Product',
                    'Orders Containing Product',
                    'Average Profit per Order (USD)',
                    'Total Profit (USD)'),
      align = c('l', 'c', 'c', 'c'),
      caption = 'Products ranked by total profits generated, all-time') %>% 
  kable_styling()

Products ranked by total profits generated, all-time

Product                                        Orders Containing Product  Average Profit per Order (USD)  Total Profit (USD)
Cisco Smart Phone                                          92                       1672.84                   153901.45
Nokia Smart Phone                                          83                       1607.89                   133454.56
Sauder Classic Bookcase                                    85                       1544.10                   131248.80
Hewlett Wireless Fax                                       82                       1311.74                   107562.80
Brother Fax Machine                                        79                       1286.67                   101647.30
Canon Wireless Fax                                         69                       1462.83                   100935.38
Harbour Creations Executive Leather Armchair               84                       1195.52                   100423.46
Bush Classic Bookcase                                      77                       1271.58                    97911.85
Memorex Router                                             77                       1233.24                    94959.28
Dania Classic Bookcase                                     72                       1007.05                    72507.56