In this post I will demonstrate how to use Pig’s GROUP operator to analyse credit card expenses and determine the top expenses for the year and their percentage of the total expense. This exercise was done in Hortonworks Data Platform (HDP).
The credit card data has the following fields and the data is for the year:
- Bank Account
- Date of Transaction
- Narrative
- Debit Amount
- Credit Amount
- Categories
- Serial Number
Here are the steps:
1. Load the csv data into PigStorage
data = LOAD '/user/hue/Last1Yr.csv' using PigStorage(',') AS (BankAccount:chararray,Date:chararray,Narrative:chararray, DebitAmount:chararray,CreditAmount:chararray, Categories:chararray,Serial:chararray);
2. Cleanse data to filter only the data of interest
expense = FILTER data BY (Categories == 'FEE') or (Categories == 'INT') or (Categories == 'OTHER');
3. Substring the first word of Narrative to get a close approximation of the vendor name.
This step is required to get the unique vendor name if they are operating in different locations. For e.g. transactions at Caltex North Sydney and Caltex Kellyville will be grouped as Caltex as it represents expenses at a single entity
expense = FOREACH expense GENERATE BankAccount,Date ,SUBSTRING(Narrative,0,INDEXOF(Narrative,' ',0)) as Narrative ,(float)DebitAmount,(float)CreditAmount,Categories,Serial;
4. Get overall total expenses for all transactions
expenseALL = Group expense ALL; expenseALL= FOREACH expenseALL GENERATE FLATTEN(group), SUM(expense.DebitAmount) as AllExpense;
5. Get expenses by vendor and join with the overall expenditure to get the percentage of total expenses.
expense = GROUP expense BY Narrative; expense = FOREACH expense GENERATE FLATTEN(group) ,SUM(expense.DebitAmount) as Spent; expense = CROSS expense, expenseALL;
6. Sort by descending to find the top expenses.
expense = ORDER expense BY Spent DESC;
Here is the full script:
data = LOAD '/user/hue/Last1Yr.csv' using PigStorage(',') AS (BankAccount:chararray,Date:chararray,Narrative:chararray,DebitAmount:chararray,CreditAmount:chararray,Categories:chararray,Serial:chararray); --cleanse data expense = FILTER data BY (Categories == 'FEE') or (Categories == 'INT') or (Categories == 'OTHER'); --substring to get vendor name + typecast expense = FOREACH expense GENERATE BankAccount,Date,SUBSTRING(Narrative,0,INDEXOF(Narrative,' ',0)) as Narrative,(float)DebitAmount,(float)CreditAmount,Categories,Serial; --Get total expense expenseALL = Group expense ALL; expenseALL= FOREACH expenseALL GENERATE FLATTEN(group), SUM(expense.DebitAmount) as AllExpense; --Get expense by vendor expense = GROUP expense BY Narrative; expense = FOREACH expense GENERATE FLATTEN(group), SUM(expense.DebitAmount) as Spent; --compare expense with total expense expense = CROSS expense, expenseALL; --sort results expense = ORDER expense BY Spent DESC; DUMP expense;
You could also get the total expense by using the CUBE operator instead of GROUP which I will explain in a separate post.
Reference
Related Posts
Pig: Using CUBE Operator to Analyse Energy Rating of Air Conditioners
One thought on “Pig: Expense Analysis using GROUP Operator”