Pig: Using CUBE Operator to Analyse Energy Rating of Air Conditioners

CUBE operator in Pig computes all possible combination of the specified fields. In this post I will demonstrate the use of Cube operator to analyse energy rating of air conditioners in Hortonworks Data Platform (HDP).
The data is sourced from http://data.gov.au/dataset/energy-rating-for-household-appliances. The data set has numerous fields (see the LOAD statement below). Of interest to me is whether the brand and country of origin has an impact on the energy rating.
Here are the steps:

1. Load the csv data into PigStorage

aircon = LOAD '/user/hue/AirConditioners_2014_09_06.csv' using PigStorage(',')
AS (ApplStandard:chararray,MEPSComp:chararray,N_Standard:chararray,ModelName:chararray,Model_No:chararray,avg_pwr_standby_mode:float,Brand:chararray,C_Dehumid_Rated:float,Configuration1:chararray,Configuration2:chararray,Configuration2_unitmount:chararray,Configuration3_Sink:chararray,Configuration3_Source:chararray,Country:chararray,C_Power_Inp_Rated:float,C_Sens_Cool_Rated:float,C_Total_Cool_Rated:float,Depth:int,H2_COP:chararray,H2_HeatPwrCapacity:float,H2_HeatPwrInput:float,Height:int,H_Power_Inp_Rated:float,H_Total_Heat_Rated:float,indoorType:chararray,EERtestAvg:float,COPtestAvg:float,Invert:chararray,Setting_cool:int,Setting_heat:int,Pnoc:float,Pnoh:float,VSCP_EER50:chararray,VSCP_COP50:chararray,eermepslev:float,TestedOutputEER:chararray,TestedOutputCOP:chararray,AnnualOutputEER:float,AnnualOutputCOP:float,PL_EERMEPS:chararray,PL_COPMEPS:chararray,sri2010_cool:float,sri2010_heat:float,Star2010_Cool:float,Star2010_Heat:float,outdoortype:chararray,Phase:chararray,Refrigerant:chararray,Sold_in:chararray,Submit_ID:chararray,ExpDate:chararray,GrandDate:chararray,SubmitStatus:chararray,Type:chararray,Width:int,Product_Class:chararray,Demand_Response_1:chararray,Demand_Response_2:chararray,Demand_Response_4:chararray, Demand_Response_5:chararray,Demand_Response_6:chararray,Demand_Response_7:chararray, PartNumber:chararray, EER:float,Availability_Status:chararray,star2000_cool:float,star2000_heat:float,Product_Website:chararray, Representative_Brand_URL:chararray);

2. Filter only data of interest

For performance reasons it’s a good idea to eliminate records that are not significant to the outcome. In this instance I am only interested in air conditions which have approved heating and cooling rating of 4 and above and is available in market

filter1 = FILTER aircon BY (Availability_Status == 'Available')
                   and (SubmitStatus =='Approved')
                   and (Star2010_Cool >=4) and (Star2010_Heat >= 4);

3. Apply CUBE Operator

Get energy rating by brand and country of origin using CUBE operator

aggregate = CUBE filter1 BY CUBE(Brand,Country);
counts = FOREACH aggregate GENERATE FLATTEN(group)
              AS (Brand,Country), COUNT(cube.ModelName) AS ModelCount;

4. Sort Results

filter2 = ORDER (FILTER counts by ModelCount >0)
BY ModelCount DESC, Brand ASC, Country ASC;

The full script is here

aircon = LOAD '/user/hue/AirConditioners_2014_09_06.csv' using PigStorage(',')
AS (ApplStandard:chararray,MEPSComp:chararray,N_Standard:chararray,ModelName:chararray,Model_No:chararray,avg_pwr_standby_mode:float,Brand:chararray,C_Dehumid_Rated:float,Configuration1:chararray,Configuration2:chararray,Configuration2_unitmount:chararray,Configuration3_Sink:chararray,Configuration3_Source:chararray,Country:chararray,C_Power_Inp_Rated:float,C_Sens_Cool_Rated:float,C_Total_Cool_Rated:float,Depth:int,H2_COP:chararray,H2_HeatPwrCapacity:float,H2_HeatPwrInput:float,Height:int,H_Power_Inp_Rated:float,H_Total_Heat_Rated:float,indoorType:chararray,EERtestAvg:float,COPtestAvg:float,Invert:chararray,Setting_cool:int,Setting_heat:int,Pnoc:float,Pnoh:float,VSCP_EER50:chararray,VSCP_COP50:chararray,eermepslev:float,TestedOutputEER:chararray,TestedOutputCOP:chararray,AnnualOutputEER:float,AnnualOutputCOP:float,PL_EERMEPS:chararray,PL_COPMEPS:chararray,sri2010_cool:float,sri2010_heat:float,Star2010_Cool:float,Star2010_Heat:float,outdoortype:chararray,Phase:chararray,Refrigerant:chararray,Sold_in:chararray,Submit_ID:chararray,ExpDate:chararray,GrandDate:chararray,SubmitStatus:chararray,Type:chararray,Width:int,Product_Class:chararray,Demand_Response_1:chararray,Demand_Response_2:chararray,Demand_Response_4:chararray, Demand_Response_5:chararray,Demand_Response_6:chararray,Demand_Response_7:chararray, PartNumber:chararray, EER:float,Availability_Status:chararray,star2000_cool:float,star2000_heat:float,Product_Website:chararray, Representative_Brand_URL:chararray);

--only consider current records with rating above 4
filter1 = FILTER aircon BY (Availability_Status == 'Available') and (SubmitStatus =='Approved') and (Star2010_Cool >=4) and (Star2010_Heat >= 4);

--check which Brand and Mfg Country produces better rating aircon
aggregate = CUBE filter1 BY CUBE(Brand,Country);
counts = FOREACH aggregate GENERATE FLATTEN(group) AS (Brand,Country), COUNT(cube.ModelName) AS ModelCount;

--sort the results
filter2 = ORDER (FILTER counts by ModelCount >0) BY ModelCount DESC, Brand ASC, Country ASC;

DUMP filter2;

And the results are shown below

Output
Output

The output is interpreted as follows:

  • (,,18) – There are a total of 18 air conditioners meeting our requirement
  • (,Thailand,10) – 10 (out of 18) are made in Thailand
  • (,China,5) – 5 (out of 18) are made in China
  • (,Thailand/Japan,2) – 2 (out of 18) are partly made in Thailand and Japan
  • (,”””Korea/ Republic of”””,1) – And 1 is made in Korea
  • (MITSUBISHI ELECTRIC,,5) – 5 are made by Mitsubishi Electric
  • (MITSUBISHI HEAVY INDUSTRIES LTD.,,2) – 2 are made by Mitsubishi Heavy Industries
  • (DAIKIN,,3) – And 3 by Daikin
  • (FUJITSU,,2) – And 2 by Fujitsu
  • (SAMSUNG ELECTRONICS,China,1) – There is 1 air conditioner from Samsung,China
  • (SAMSUNG ELECTRONICS,Thailand,1) – There is 1 air conditioner from Samsung,Thailiand
  • (DAIKIN,Thailand/Japan,2) – There 2 air conditioners partly made in Thailand and Japan is from Daikin and so on

Reference

Pig Latin Basics
Pig Latin Built In Functions

Related Posts

Pig: Expense Analysis using GROUP Operator

One thought on “Pig: Using CUBE Operator to Analyse Energy Rating of Air Conditioners

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s