CART (Classification and Regression Trees) is a decision tree algorithm. Trees created by CART are binary: exactly two branches come out of each node. The algorithm works as follows: consider every possible partition and choose the best one according to a “goodness” criterion. To reduce the complexity of the resulting tree, there are pruning (branch-cutting) techniques.
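To make the split search concrete, here is a minimal Python sketch of how a single candidate binary split could be scored with Gini impurity, one common “goodness” criterion in CART. The function names and toy data are illustrative, not taken from any particular library:

```python
# A minimal sketch of scoring one candidate binary split in the CART
# spirit, using Gini impurity as the "goodness" criterion. Names and
# data below are made up for illustration.
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_goodness(labels_left, labels_right):
    """Weighted impurity after a binary split; lower is better."""
    n = len(labels_left) + len(labels_right)
    return (len(labels_left) / n) * gini(labels_left) + \
           (len(labels_right) / n) * gini(labels_right)

# CART tries every possible partition and keeps the one that
# minimizes the weighted impurity of the two resulting branches.
left, right = ["yes", "yes", "no"], ["no", "no", "no", "yes"]
print(split_goodness(left, right))
```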

C4.5 is also a decision tree algorithm. What differs is that it can create multi-way (more than binary) splits. It also uses “information gain” to decide which attribute to split on: the attribute with the biggest information gain (equivalently, the greatest reduction in entropy) classifies the data with the least amount of information needed to classify correctly.
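scikit-learn does not ship C4.5 itself (its trees are binary, CART-style), but it supports the same entropy-based criterion for attribute selection. A minimal sketch, with toy data made up for illustration:

```python
# scikit-learn's trees are CART-based (binary splits), but the
# criterion="entropy" option selects attributes by information gain,
# the same idea C4.5 uses. The toy data is illustrative only.
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]  # two binary attributes
y = [0, 1, 1, 0]                      # class labels (XOR pattern)

clf = DecisionTreeClassifier(criterion="entropy")  # information gain
clf.fit(X, y)
print(clf.predict([[1, 0]]))
```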

Entropy is the expected number of bits needed to transmit the outcome of an event occurring with probability p. For each possible split of the training set into subsets, the information requirement can be calculated as the weighted sum of the entropies of the subsets. The algorithm chooses the optimal split: the one with the biggest information gain.
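The same idea in code: a hand-rolled sketch (names and data are illustrative) of entropy and of information gain, computed as the entropy before the split minus the weighted sum of the subset entropies:

```python
# A hand-rolled sketch of the quantities described above: entropy of
# a label set, and the information gain of splitting it into subsets.
from math import log2
from collections import Counter

def entropy(labels):
    """Expected number of bits needed to encode one class label."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, subsets):
    """Entropy reduction achieved by a given split of `labels`."""
    n = len(labels)
    after = sum((len(s) / n) * entropy(s) for s in subsets)
    return entropy(labels) - after

parent = ["yes"] * 5 + ["no"] * 5
split = [["yes"] * 4 + ["no"], ["no"] * 4 + ["yes"]]
print(information_gain(parent, split))  # the algorithm maximizes this
```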

Disadvantages of the C4.5 algorithm are its large memory and processor requirements, which are necessary to produce the rules.

The C5.0 algorithm was presented in 1997 as a commercial version of C4.5. It was an important step forward, offering both better classification results and support for more data types.

November 20th, 2016

Posted In: C4.5, web content, web mining, YouTube
