# Pig: Expense Analysis using GROUP Operator

In this post I will demonstrate how to use Pig’s GROUP operator to analyse credit card expenses and determine the top expenses for the year and their percentage of the total expense. This exercise was done in Hortonworks Data Platform (HDP).

The credit card data has the following fields and the data is for the year:

1. Bank Account
2. Date of Transaction
3. Narrative
4. Debit Amount
5. Credit Amount
6. Categories
7. Serial Number

Here are the steps:

## 1.   Load the csv data into PigStorage

```data = LOAD '/user/hue/Last1Yr.csv' using PigStorage(',')
AS (BankAccount:chararray,Date:chararray,Narrative:chararray,
DebitAmount:chararray,CreditAmount:chararray,
Categories:chararray,Serial:chararray);
```

## 2.  Cleanse data to filter only the data of interest

```expense = FILTER data BY (Categories == 'FEE') or (Categories == 'INT')
or (Categories == 'OTHER');
```

## 3.  Substring the first word of Narrative to get a close approximation of the vendor name.

This step is required to get the unique vendor name if they are operating in different locations. For e.g. transactions at Caltex North Sydney and Caltex Kellyville will be grouped as Caltex as it represents expenses at a single entity

```expense = FOREACH expense GENERATE BankAccount,Date
,SUBSTRING(Narrative,0,INDEXOF(Narrative,' ',0)) as Narrative
,(float)DebitAmount,(float)CreditAmount,Categories,Serial;
```

## 4.  Get overall total expenses for all transactions

```expenseALL = Group expense ALL;
expenseALL= FOREACH expenseALL GENERATE FLATTEN(group),
SUM(expense.DebitAmount) as AllExpense;
```

## 5.  Get expenses by vendor and join with the overall expenditure to get the percentage of total expenses.

```expense = GROUP expense BY Narrative;
expense = FOREACH expense GENERATE FLATTEN(group)
,SUM(expense.DebitAmount) as Spent;
expense = CROSS expense, expenseALL;
```

## 6.  Sort by descending to find the top expenses.

```expense = ORDER expense BY Spent DESC;
```

Here is the full script:

```data = LOAD '/user/hue/Last1Yr.csv' using PigStorage(',')

AS (BankAccount:chararray,Date:chararray,Narrative:chararray,DebitAmount:chararray,CreditAmount:chararray,Categories:chararray,Serial:chararray);

--cleanse data

expense = FILTER data BY (Categories == 'FEE') or (Categories == 'INT') or (Categories == 'OTHER');

--substring to get vendor name + typecast

expense = FOREACH expense GENERATE BankAccount,Date,SUBSTRING(Narrative,0,INDEXOF(Narrative,' ',0)) as Narrative,(float)DebitAmount,(float)CreditAmount,Categories,Serial;

--Get total expense

expenseALL = Group expense ALL;

expenseALL= FOREACH expenseALL GENERATE FLATTEN(group), SUM(expense.DebitAmount) as AllExpense;

--Get expense by vendor

expense = GROUP expense BY Narrative;

expense = FOREACH expense GENERATE FLATTEN(group), SUM(expense.DebitAmount) as Spent;

--compare expense with total expense

expense = CROSS expense, expenseALL;

--sort results

expense = ORDER expense BY Spent DESC;

DUMP expense;
```

You could also get the total expense by using the CUBE operator instead of GROUP which I will explain in a separate post.

### Reference

Pig Latin Basics

Pig Latin Built In Functions

### Related Posts

Pig: Using CUBE Operator to Analyse Energy Rating of Air Conditioners