Pig: Expense Analysis using GROUP Operator

In this post I will demonstrate how to use Pig’s GROUP operator to analyse credit card expenses and determine the top expenses for the year and their percentage of the total expense. This exercise was done in Hortonworks Data Platform (HDP).

The credit card data has the following fields and the data is for the year:

  1. Bank Account
  2. Date of Transaction
  3. Narrative
  4. Debit Amount
  5. Credit Amount
  6. Categories
  7. Serial Number

Here are the steps:

1.   Load the csv data into PigStorage

data = LOAD '/user/hue/Last1Yr.csv' using PigStorage(',')
AS (BankAccount:chararray,Date:chararray,Narrative:chararray,
   DebitAmount:chararray,CreditAmount:chararray,
   Categories:chararray,Serial:chararray);

2.  Cleanse data to filter only the data of interest

expense = FILTER data BY (Categories == 'FEE') or (Categories == 'INT')
          or (Categories == 'OTHER');

3.  Substring the first word of Narrative to get a close approximation of the vendor name.

This step is required to get the unique vendor name if they are operating in different locations. For e.g. transactions at Caltex North Sydney and Caltex Kellyville will be grouped as Caltex as it represents expenses at a single entity

expense = FOREACH expense GENERATE BankAccount,Date
            ,SUBSTRING(Narrative,0,INDEXOF(Narrative,' ',0)) as Narrative
            ,(float)DebitAmount,(float)CreditAmount,Categories,Serial;

4.  Get overall total expenses for all transactions

expenseALL = Group expense ALL;
expenseALL= FOREACH expenseALL GENERATE FLATTEN(group),
           SUM(expense.DebitAmount) as AllExpense;

5.  Get expenses by vendor and join with the overall expenditure to get the percentage of total expenses.

expense = GROUP expense BY Narrative;
expense = FOREACH expense GENERATE FLATTEN(group)
            ,SUM(expense.DebitAmount) as Spent;
expense = CROSS expense, expenseALL;

6.  Sort by descending to find the top expenses.

expense = ORDER expense BY Spent DESC;

Here is the full script:

data = LOAD '/user/hue/Last1Yr.csv' using PigStorage(',')

AS (BankAccount:chararray,Date:chararray,Narrative:chararray,DebitAmount:chararray,CreditAmount:chararray,Categories:chararray,Serial:chararray);

--cleanse data

expense = FILTER data BY (Categories == 'FEE') or (Categories == 'INT') or (Categories == 'OTHER');

--substring to get vendor name + typecast

expense = FOREACH expense GENERATE BankAccount,Date,SUBSTRING(Narrative,0,INDEXOF(Narrative,' ',0)) as Narrative,(float)DebitAmount,(float)CreditAmount,Categories,Serial;

--Get total expense

expenseALL = Group expense ALL;

expenseALL= FOREACH expenseALL GENERATE FLATTEN(group), SUM(expense.DebitAmount) as AllExpense;

--Get expense by vendor

expense = GROUP expense BY Narrative;

expense = FOREACH expense GENERATE FLATTEN(group), SUM(expense.DebitAmount) as Spent;

--compare expense with total expense

expense = CROSS expense, expenseALL;

--sort results

expense = ORDER expense BY Spent DESC;

DUMP expense;

You could also get the total expense by using the CUBE operator instead of GROUP which I will explain in a separate post.

Reference

Pig Latin Basics

Pig Latin Built In Functions

Related Posts

Pig: Using CUBE Operator to Analyse Energy Rating of Air Conditioners

One thought on “Pig: Expense Analysis using GROUP Operator

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s