CUBE operator in Pig computes all possible combination of the specified fields. In this post I will demonstrate the use of Cube operator to analyse energy rating of air conditioners in Hortonworks Data Platform (HDP).
The data is sourced from http://data.gov.au/dataset/energy-rating-for-household-appliances. The data set has numerous fields (see the LOAD statement below). Of interest to me is whether the brand and country of origin has an impact on the energy rating.
Here are the steps:
1. Load the csv data into PigStorage
aircon = LOAD '/user/hue/AirConditioners_2014_09_06.csv' using PigStorage(',') AS (ApplStandard:chararray,MEPSComp:chararray,N_Standard:chararray,ModelName:chararray,Model_No:chararray,avg_pwr_standby_mode:float,Brand:chararray,C_Dehumid_Rated:float,Configuration1:chararray,Configuration2:chararray,Configuration2_unitmount:chararray,Configuration3_Sink:chararray,Configuration3_Source:chararray,Country:chararray,C_Power_Inp_Rated:float,C_Sens_Cool_Rated:float,C_Total_Cool_Rated:float,Depth:int,H2_COP:chararray,H2_HeatPwrCapacity:float,H2_HeatPwrInput:float,Height:int,H_Power_Inp_Rated:float,H_Total_Heat_Rated:float,indoorType:chararray,EERtestAvg:float,COPtestAvg:float,Invert:chararray,Setting_cool:int,Setting_heat:int,Pnoc:float,Pnoh:float,VSCP_EER50:chararray,VSCP_COP50:chararray,eermepslev:float,TestedOutputEER:chararray,TestedOutputCOP:chararray,AnnualOutputEER:float,AnnualOutputCOP:float,PL_EERMEPS:chararray,PL_COPMEPS:chararray,sri2010_cool:float,sri2010_heat:float,Star2010_Cool:float,Star2010_Heat:float,outdoortype:chararray,Phase:chararray,Refrigerant:chararray,Sold_in:chararray,Submit_ID:chararray,ExpDate:chararray,GrandDate:chararray,SubmitStatus:chararray,Type:chararray,Width:int,Product_Class:chararray,Demand_Response_1:chararray,Demand_Response_2:chararray,Demand_Response_4:chararray, Demand_Response_5:chararray,Demand_Response_6:chararray,Demand_Response_7:chararray, PartNumber:chararray, EER:float,Availability_Status:chararray,star2000_cool:float,star2000_heat:float,Product_Website:chararray, Representative_Brand_URL:chararray);
2. Filter only data of interest
For performance reasons it’s a good idea to eliminate records that are not significant to the outcome. In this instance I am only interested in air conditions which have approved heating and cooling rating of 4 and above and is available in market
filter1 = FILTER aircon BY (Availability_Status == 'Available') and (SubmitStatus =='Approved') and (Star2010_Cool >=4) and (Star2010_Heat >= 4);
3. Apply CUBE Operator
Get energy rating by brand and country of origin using CUBE operator
aggregate = CUBE filter1 BY CUBE(Brand,Country); counts = FOREACH aggregate GENERATE FLATTEN(group) AS (Brand,Country), COUNT(cube.ModelName) AS ModelCount;
4. Sort Results
filter2 = ORDER (FILTER counts by ModelCount >0) BY ModelCount DESC, Brand ASC, Country ASC;
The full script is here
aircon = LOAD '/user/hue/AirConditioners_2014_09_06.csv' using PigStorage(',') AS (ApplStandard:chararray,MEPSComp:chararray,N_Standard:chararray,ModelName:chararray,Model_No:chararray,avg_pwr_standby_mode:float,Brand:chararray,C_Dehumid_Rated:float,Configuration1:chararray,Configuration2:chararray,Configuration2_unitmount:chararray,Configuration3_Sink:chararray,Configuration3_Source:chararray,Country:chararray,C_Power_Inp_Rated:float,C_Sens_Cool_Rated:float,C_Total_Cool_Rated:float,Depth:int,H2_COP:chararray,H2_HeatPwrCapacity:float,H2_HeatPwrInput:float,Height:int,H_Power_Inp_Rated:float,H_Total_Heat_Rated:float,indoorType:chararray,EERtestAvg:float,COPtestAvg:float,Invert:chararray,Setting_cool:int,Setting_heat:int,Pnoc:float,Pnoh:float,VSCP_EER50:chararray,VSCP_COP50:chararray,eermepslev:float,TestedOutputEER:chararray,TestedOutputCOP:chararray,AnnualOutputEER:float,AnnualOutputCOP:float,PL_EERMEPS:chararray,PL_COPMEPS:chararray,sri2010_cool:float,sri2010_heat:float,Star2010_Cool:float,Star2010_Heat:float,outdoortype:chararray,Phase:chararray,Refrigerant:chararray,Sold_in:chararray,Submit_ID:chararray,ExpDate:chararray,GrandDate:chararray,SubmitStatus:chararray,Type:chararray,Width:int,Product_Class:chararray,Demand_Response_1:chararray,Demand_Response_2:chararray,Demand_Response_4:chararray, Demand_Response_5:chararray,Demand_Response_6:chararray,Demand_Response_7:chararray, PartNumber:chararray, EER:float,Availability_Status:chararray,star2000_cool:float,star2000_heat:float,Product_Website:chararray, Representative_Brand_URL:chararray); --only consider current records with rating above 4 filter1 = FILTER aircon BY (Availability_Status == 'Available') and (SubmitStatus =='Approved') and (Star2010_Cool >=4) and (Star2010_Heat >= 4); --check which Brand and Mfg Country produces better rating aircon aggregate = CUBE filter1 BY CUBE(Brand,Country); counts = FOREACH aggregate GENERATE FLATTEN(group) AS (Brand,Country), COUNT(cube.ModelName) AS ModelCount; --sort the results filter2 = ORDER (FILTER counts by ModelCount >0) BY ModelCount DESC, Brand ASC, Country ASC; DUMP filter2;
And the results are shown below

The output is interpreted as follows:
- (,,18) – There are a total of 18 air conditioners meeting our requirement
- (,Thailand,10) – 10 (out of 18) are made in Thailand
- (,China,5) – 5 (out of 18) are made in China
- (,Thailand/Japan,2) – 2 (out of 18) are partly made in Thailand and Japan
- (,”””Korea/ Republic of”””,1) – And 1 is made in Korea
- (MITSUBISHI ELECTRIC,,5) – 5 are made by Mitsubishi Electric
- (MITSUBISHI HEAVY INDUSTRIES LTD.,,2) – 2 are made by Mitsubishi Heavy Industries
- (DAIKIN,,3) – And 3 by Daikin
- (FUJITSU,,2) – And 2 by Fujitsu
- (SAMSUNG ELECTRONICS,China,1) – There is 1 air conditioner from Samsung,China
- (SAMSUNG ELECTRONICS,Thailand,1) – There is 1 air conditioner from Samsung,Thailiand
- (DAIKIN,Thailand/Japan,2) – There 2 air conditioners partly made in Thailand and Japan is from Daikin and so on
Reference
Pig Latin Basics
Pig Latin Built In Functions
One thought on “Pig: Using CUBE Operator to Analyse Energy Rating of Air Conditioners”