A decision tree is a data mining model that graphically represents the parameters most likely to influence an outcome, and the extent of that influence. The output resembles a tree or flowchart with nodes, branches and leaves. The nodes represent the parameters, the branches represent the classification question or decision, and the leaves represent the outcome (Screen Capture 1). Internally, the decision tree algorithm recursively partitions the input dataset and assigns each record to the segment of the tree where it fits best.
There are several packages in R that generate decision trees. For this post, I am using the ctree() function from the party package. The input data is the energy rating of household air conditioners.
This data is publicly available and can be downloaded from here:
http://data.gov.au/dataset/energy-rating-for-household-appliances/resource/0973a476-eb0c-45e6-9a18-054f74307843
Using a decision tree, I want to find out how the cooling rating (star2000_cool in the dataset) of an air conditioner is influenced by the refrigerant, inverter, country of manufacture and brand.
Here is the code.
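    library("party")
    # stringsAsFactors = TRUE: ctree() needs factors, and R >= 4.0 no longer converts strings by default
    aircon <- read.csv("C:/data/AirConditioners_2014_09_06.csv", stringsAsFactors = TRUE)
    # Keep air conditioners rated above 3.5; rows with a missing rating are dropped too
    aircon <- subset(aircon, star2000_cool > 3.5)
    # Model the cooling rating as a function of the four candidate parameters
    aircon_tree <- ctree(star2000_cool ~ Refrigerant + Invert + Country + Brand, data = aircon)
    plot(aircon_tree, type = "simple", main = "Analysis of Energy Rating of Air conditioners using R Decision Trees")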
- First the party package is loaded using the library() function.
- Then the CSV data is read into the aircon data frame. It is important to eliminate missing (NA) values from the fields used in the decision tree, otherwise you might get this error:
“Error in model@dpp(…) : missing values in response variable not allowed”
- Eliminating unwanted records also reduces the calculation time. Here I am interested only in air conditioners with a cooling rating above 3.5, so I filter the data using the subset() function.
- Then the decision tree function ctree() is called. The goal is to determine the influence of refrigerant, inverter, country of manufacture and brand on the cooling rating (star2000_cool), which is expressed by the formula star2000_cool ~ Refrigerant + Invert + Country + Brand.
- Finally the decision tree is plotted.
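The missing-value cleanup mentioned in the steps above can be sketched in base R. The data frame below is a made-up stand-in for the air conditioner file (only the column names come from the post), and aircon_demo, model_cols and aircon_clean are names introduced here purely for illustration.

```r
# Hypothetical stand-in for the air conditioner data; all values are made up.
aircon_demo <- data.frame(
  star2000_cool = c(4.0, NA, 5.5, 4.5, 3.0),
  Refrigerant   = factor(c("R410A", "R22", NA, "R410A", "R22")),
  Invert        = c(1, 0, 1, 0, 0)
)

# Columns that appear in the model formula.
model_cols <- c("star2000_cool", "Refrigerant", "Invert")

# Keep only the rows that have no NA in any of the model columns.
aircon_clean <- aircon_demo[complete.cases(aircon_demo[, model_cols]), ]

# Then apply the rating filter, as in the post.
aircon_clean <- subset(aircon_clean, star2000_cool > 3.5)

nrow(aircon_clean)
```

The error quoted above concerns the response variable, so dropping rows with a missing star2000_cool is the essential part; whether to also drop rows with missing predictors is a judgment call.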
The decision tree output will be as shown below
Interpretation of the Decision Tree
Node #1 at the top is called the root node. Here the root node splits the tree on whether the air conditioner is equipped with an inverter. The branch to the right of the root node is the path taken when the air conditioner has an inverter (branch value >0). The branch to the left is the path taken when it does not (branch value <=0).
Traversing the right side, node #1 leads to node #7, which means that when the air conditioner has an inverter, the next parameter that influences the energy rating is the country of manufacture. Node #7 branches into two paths, each labelled with a list of countries. The countries on the left of node #7 end in leaf #8. In a leaf, n denotes the number of records and y denotes the average energy rating (star2000_cool). So there are 266 records with an average energy rating of 5.004 for air conditioners that have an inverter and are manufactured in one of the countries on the left of the node. If the country of manufacture is one of those on the right of node #7, there are 131 records with an average cooling rating of 4.71.
Traversing the left side of the decision tree, the branch <=0 (no inverter) from node #1 leads to node #2, which is the refrigerant. This indicates that the energy rating of air conditioners without an inverter appears to be influenced by the type of refrigerant used. The refrigerant node splits into two branches based on the type of refrigerant. If the refrigerant is R22, R407 or R410A (branch value), then the country of manufacture, node #3, comes into the picture. Leaf #4 corresponds to the list of countries on the left of node #3 and has an average rating of 4.632 across 485 records. Leaf #5 corresponds to the list of countries on the right of node #3.
However, if the refrigerant is not R22, R407 or R410A, the energy rating is likely to be 4.196 (leaf #6).
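The n and y values shown in each leaf are the record count and the mean rating of the records that follow that path. A minimal base R sketch with made-up numbers shows the arithmetic; the demo data frame and the single inverter split are assumptions for illustration, not the actual tree.

```r
# Made-up records; only the column names mirror the post.
demo <- data.frame(
  Invert        = c(1, 1, 1, 0, 0, 0),
  star2000_cool = c(5.0, 5.5, 4.5, 4.0, 4.5, 5.0)
)

# Assign each record to a segment using the same rule as the root split.
demo$segment <- ifelse(demo$Invert > 0, "inverter", "no inverter")

# n = number of records, y = mean rating, computed per segment.
leaf_n <- tapply(demo$star2000_cool, demo$segment, length)
leaf_y <- tapply(demo$star2000_cool, demo$segment, mean)

leaf_summary <- data.frame(
  segment = names(leaf_n),
  n       = as.vector(leaf_n),
  y       = as.vector(leaf_y)
)
leaf_summary
```

A deeper leaf works the same way: each extra split on the path simply narrows the set of records before the count and mean are taken.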
Why did the tree stop growing?
Notice that the decision tree has no node corresponding to brand, although brand was specified as one of the input parameters. That’s because the decision tree has stopped growing! According to Berson, Smith and Thearling, a decision tree stops growing when one of the following conditions is reached [Source: Building Data Mining Applications for CRM by Alex Berson, Stephen Smith, and Kurt Thearling http://www.thearling.com/text/dmtechniques/dmtechniques.htm]:
- The segment contains only one record. There is no further question you could ask that would refine a segment of just one. This, however, is not the case with the energy rating dataset, since there is more than one record for each brand of air conditioner that could fall into each segment.
- All the records in the segment have identical characteristics. There is no reason to continue asking further segmentation questions since all the remaining records are the same. This is a possibility with the energy rating dataset.
- The improvement is not substantial enough to warrant making the split. This is a possibility with the energy rating dataset.
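The three stopping conditions above can be sketched as a small base R check. This is an illustration only: stop_reason() and its min_gain threshold are hypothetical, and the improvement is measured here as the relative reduction in squared error, whereas ctree() itself decides with a statistical significance test.

```r
# Returns the reason to stop splitting a segment, or NA if a split looks warranted.
# `min_gain` is a hypothetical threshold chosen for illustration.
stop_reason <- function(segment, response, split_col, min_gain = 0.05) {
  y <- segment[[response]]
  predictors <- segment[, setdiff(names(segment), response), drop = FALSE]

  # Condition 1: the segment contains only one record.
  if (nrow(segment) == 1) return("single record")

  # Condition 2: all records have identical predictor values.
  if (nrow(unique(predictors)) == 1) return("identical records")

  # Condition 3: splitting on split_col barely reduces the squared error.
  sse <- function(v) sum((v - mean(v))^2)
  sse_before <- sse(y)
  sse_after  <- sum(sapply(split(y, segment[[split_col]], drop = TRUE), sse))
  gain <- if (sse_before > 0) (sse_before - sse_after) / sse_before else 0
  if (gain < min_gain) return("improvement too small")

  NA_character_
}

# Made-up segment in which both brands have the same average rating.
seg <- data.frame(
  Brand         = c("A", "A", "B", "B"),
  star2000_cool = c(4.0, 5.0, 4.1, 4.9)
)
stop_reason(seg, "star2000_cool", "Brand")
```

In this made-up segment both brands average 4.5, so splitting on Brand explains nothing and the third condition applies.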
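The absence of a brand node suggests that, once inverter, refrigerant and country of manufacture are accounted for, brand has little additional influence on the energy rating, at least not enough to warrant another split.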
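One way to probe this is to relax the stopping rule and see whether the tree grows further. In party, ctree() splits a node only when its association test clears the threshold set by mincriterion in ctree_control(), which is 0.95 by default. The sketch below uses synthetic data rather than the original CSV; the data, the seed and the 0.50 threshold are all assumptions made for illustration.

```r
library("party")

set.seed(42)  # arbitrary seed so the synthetic data is reproducible
n <- 300
demo <- data.frame(
  Invert = rbinom(n, 1, 0.5),
  Brand  = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
# The rating depends on the inverter only; Brand is pure noise by construction.
demo$star2000_cool <- 4 + 0.8 * demo$Invert + rnorm(n, sd = 0.3)

# Default control: a split is made only if 1 - p-value exceeds 0.95.
tree_default <- ctree(star2000_cool ~ Invert + Brand, data = demo)

# Relaxed control: accept weaker evidence for a split.
tree_relaxed <- ctree(star2000_cool ~ Invert + Brand, data = demo,
                      controls = ctree_control(mincriterion = 0.50))

# Count the terminal nodes (leaves) of each tree.
leaves_default <- length(unique(where(tree_default)))
leaves_relaxed <- length(unique(where(tree_relaxed)))
c(default = leaves_default, relaxed = leaves_relaxed)
```

If Brand shows up only under the relaxed threshold, the evidence for its influence is weak; if it never shows up, that is consistent with the reading above. Applying the same comparison to the aircon data frame would require the original file.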