Weka is a collection of algorithms commonly used in data mining. It offers both a graphical and a command line interface; the latter is probably more useful for complicated projects, but for me the simple Explorer was enough. Moreover, you can call Weka from your own Java code. Weka contains tools for data preparation (normalization, discretization and a bunch of others), classification, clustering, regression and association rules, not to mention extensive visualization.
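To give a taste of the "personal Java code" route, here is a minimal sketch, assuming Weka 3.x on the classpath and a hypothetical iris.arff file, that loads a dataset and cross-validates a decision tree:

  import java.util.Random;
  import weka.classifiers.Evaluation;
  import weka.classifiers.trees.J48;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class WekaQuickStart {
      public static void main(String[] args) throws Exception {
          // DataSource handles ARFF, CSV and the other supported formats
          Instances data = DataSource.read("iris.arff");
          data.setClassIndex(data.numAttributes() - 1); // last attribute is the class

          // Build a C4.5-style tree (J48) and estimate accuracy with 10-fold cross-validation
          J48 tree = new J48();
          Evaluation eval = new Evaluation(data);
          eval.crossValidateModel(tree, data, 10, new Random(1));
          System.out.println(eval.toSummaryString());
      }
  }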


Weka – explorer window

I enjoyed working with Weka very much. After some struggling with the input data format (I used CSV), a little practice opened up a wide range of possibilities. I used Weka in a Unix environment, Ubuntu 8.1.

The basic data format for Weka is ARFF (Attribute-Relation File Format), an ASCII file format. It describes instances that share a set of attributes. You can also choose another file format (.names, .data (C4.5), .csv, .libsvm, .dat, .bsi, .xrff), which happens most of the time, at least at the beginning of a project, when you have a lot of data from external sources such as MySQL databases or Excel.
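For illustration, a minimal ARFF file looks roughly like this (the relation, attributes and values below are invented):

  @relation weather
  @attribute outlook {sunny, overcast, rainy}
  @attribute temperature numeric
  @attribute play {yes, no}

  @data
  sunny,85,no
  overcast,83,yes
  rainy,70,yes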

Some functions are worth mentioning, like the various kinds of filters, supervised and unsupervised, for adding jitter or other random "pollution", randomization, sampling or standardization. It is possible to use Perl commands or visualize datasets in many ways. At any moment you can check the log to find out what is happening inside or check memory usage; logging also appears in the Ubuntu console from which the program was started.
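The same filters are also reachable from Java through the weka.filters package; a minimal, hypothetical sketch that standardizes the numeric attributes of a dataset:

  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;
  import weka.filters.Filter;
  import weka.filters.unsupervised.attribute.Standardize;

  public class FilterSketch {
      public static void main(String[] args) throws Exception {
          Instances data = DataSource.read("iris.arff"); // hypothetical input file
          // Standardize all numeric attributes to zero mean and unit variance
          Standardize standardize = new Standardize();
          standardize.setInputFormat(data);
          Instances standardized = Filter.useFilter(data, standardize);
          System.out.println(standardized.numInstances() + " instances standardized");
      }
  }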

I have to mention the "dancing" Kiwi that appears while an algorithm is running. It is a strange feeling when you have to watch it for a couple of hours.

Dancing Kiwi


November 26th, 2016

Posted In: web content, web mining, Weka, YouTube


Apriori is an algorithm used in affinity analysis. It generates a set of rules, usually implications, that describe a dataset. Finding frequent itemsets in a transaction database is a popular task in data mining applications, and it is not as simple as it initially seems. The reason is the computational complexity, which grows extremely fast for very large databases (I like the fancy way it is described in Top 10 Algorithms in Data Mining: combinatorial explosion).

The idea of Apriori is: find frequent itemsets (frequent meaning those that reach a previously assigned level of support) and then generate rules that satisfy a previously assigned level of confidence. Candidate itemsets are generated and serve as the base for finding n-element frequent itemsets (the first step of the procedure is to find one-element frequent itemsets, and then repeat, eliminating itemsets whose support is not sufficient). The procedure of generating candidate and frequent sets is repeated as long as new frequent itemsets can be found. The main point exploits monotonicity: "if an itemset is not frequent, any of its supersets is never frequent" (again Top 10 Algorithms in Data Mining). A smart way to eliminate itemsets.
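In Weka the algorithm is available as weka.associations.Apriori. A minimal sketch, assuming a dataset of nominal attributes (as market-basket data usually is) and with illustrative support and confidence thresholds:

  import weka.associations.Apriori;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class AprioriSketch {
      public static void main(String[] args) throws Exception {
          Instances data = DataSource.read("transactions.arff"); // hypothetical nominal dataset
          Apriori apriori = new Apriori();
          apriori.setLowerBoundMinSupport(0.3); // "frequent" = support of at least 30%
          apriori.setMinMetric(0.9);            // keep rules with confidence >= 0.9
          apriori.setNumRules(10);              // report the 10 best rules
          apriori.buildAssociations(data);
          System.out.println(apriori);          // prints the discovered rules
      }
  }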

Apriori is one of the most important algorithms in data mining. Other ideas make it even more efficient, e.g. new ways to create candidate itemsets: hashing techniques (smaller candidate itemsets), partitioning (dividing the problem into smaller ones and exploring them separately, if only real-life problems worked this way!) or sampling. An important improvement over Apriori is the FP-growth algorithm, which compresses the database (without losing important information) and then partitions it.

Although Apriori is rather simple, its easy implementation and solid results make it a serious solution to many problems.

November 25th, 2016

Posted In: web content, web mining, YouTube


Naive Bayes is a statistical classifier. It is one of the oldest formal classification algorithms, but thanks to its simplicity and efficiency it is still often used, for example in anti-spam mechanisms. The method is an instance of supervised classification: we are given a set of objects with classes assigned, and we want to use it to generate rules that help us assign future objects to classes.

MAP (maximum a posteriori) classification is a very popular estimation method in Bayesian statistics. MAP is said to be optimal: it achieves the minimal error. The problem is the computational complexity, which is of the order of c^n (c being the number of classes, n the number of describing variables). Naive Bayes, however, assumes that the variables (components) are conditionally independent given the class. The point is that if this assumption holds, NB also gives optimal results.
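In standard notation (my formulation, not a quotation from the sources), the MAP rule combined with the naive independence assumption reduces to

  \hat{c} = \arg\max_{c} P(c) \prod_{i=1}^{n} P(x_i \mid c)

so instead of estimating the full joint distribution, only the per-variable conditionals P(x_i | c) and the priors P(c) have to be estimated.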

It may seem that the independence assumption is too strict to apply in the real world. Nevertheless, what happens before classification makes the difference: for example, selection and elimination of correlated variables is always part of the data mining methodology.

Sources: same as previous posts

November 24th, 2016

Posted In: naive Bayes, web content, web mining, YouTube


Classification and regression trees

Decision trees are one of the classification methods: structures consisting of nodes connected with branches. Unlike a natural tree, the root appears at the top of the structure and the branches go down, ending with leaves or leading to other nodes.

The main goal of the algorithm is to select attributes (both which ones and in what order matter) so as to obtain the highest confidence level. Decision trees fall under the supervised learning category.

It is possible to employ classification trees when:

  • a training set with a defined target variable exists
  • the training set provides the algorithm with a representative group of records (enough examples)
  • the target variable is discrete
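A minimal sketch of inducing such a tree in Weka (J48, the C4.5-family implementation) and printing its node/branch/leaf structure; the file name is invented:

  import weka.classifiers.trees.J48;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class TreeSketch {
      public static void main(String[] args) throws Exception {
          Instances data = DataSource.read("training.arff"); // discrete target variable expected
          data.setClassIndex(data.numAttributes() - 1);
          J48 tree = new J48();
          tree.buildClassifier(data);
          System.out.println(tree); // textual view of the tree: nodes, branches and leaves
      }
  }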

Bibl. [Daniel Larose „Odkrywanie wiedzy z danych” 2006 PWN, 109, 111]

November 23rd, 2016

Posted In: decision trees, web content, web mining, YouTube


SVM stands for support vector machines. The idea of this classification algorithm is to generate a border between objects that belong to different decision classes. A big advantage of this approach is simple training and, moreover, it can easily be used to solve multi-dimensional problems. The line (hyperplane) between objects is found by an iterative algorithm.

Types of SVM:

  • C-SVM
  • nu-SVM
  • epsilon-SVM regression (epsilon-SVR)
  • nu-SVM regression (nu-SVR)
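These names come from the LibSVM family of implementations. Weka's built-in SVM classifier is SMO; a minimal sketch, with an RBF kernel chosen purely for illustration, might look like this:

  import weka.classifiers.functions.SMO;
  import weka.classifiers.functions.supportVector.RBFKernel;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class SvmSketch {
      public static void main(String[] args) throws Exception {
          Instances data = DataSource.read("training.arff"); // hypothetical dataset
          data.setClassIndex(data.numAttributes() - 1);
          SMO svm = new SMO();
          svm.setKernel(new RBFKernel()); // non-linear border between decision classes
          svm.buildClassifier(data);
          System.out.println(svm);
      }
  }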


November 22nd, 2016

Posted In: web content, web mining, YouTube


The Internet is probably the world's biggest database. Moreover, the data is available using easily accessible techniques. Often it is important and detailed data that lets people achieve their goals or use it in various realms. The data is held in various forms: text, multimedia, databases. Web pages follow the HTML standard (or another member of the markup-language family), which gives them a kind of structure, but not enough to easily use them in data mining. A typical website contains, in addition to the main content and links, various other things like ads or navigation items. It is also widely known that most of the data on the Internet is redundant: a lot of information appears on different sites, in more or less similar form.

The deep web (hidden web, invisible web, invisible Internet) refers to the lower level of the global network. It does not appear in search engine results, and search tools do not index or list this area. It is said that a great part of the global web belongs to the deep web and stays hidden until a specific query, targeted at the right interface, triggers the content to appear. This sentence also reveals some of the barriers that keep the data hidden, such as a specific interface, the requirement to have specific knowledge about the data, high security (passwords) or simply a lack of linkage. It is also possible to block ranges of IP addresses, block automated access to interfaces (e.g. using CAPTCHA) or just keep the data in a non-standard format. The reasons mentioned above are a natural barrier for crawlers and web robots, keeping part of the web outside the linked web.

Looking for a definition of Internet exploration, the easiest way is to describe it as the part of data mining where web resources are explored. It is commonly divided into three areas:

  1. web content mining is the closest to "classic" data mining, as WCM mostly operates on text, and text is the most common way to put information on the Internet,
  2. web linkage mining aims to use the nature of the Internet, its connection structure, as it is a collection of documents connected with links,
  3. web usage mining looks for useful patterns in logs and documents containing the history of users' activity.

These three are also what distinguishes web mining from data mining, because the subject of research is not only the data but the structure and flow as well. Additionally, web mining takes the data "as it is" (and the imagination of Internet content creators is wide when it comes to creating new forms), while data mining operates rather on structured data.

Finally, the application of web mining goes beyond tweaking websites or data analysis. It can be used as a tool for improving tasks, projects and processes in companies and institutions, or as a method that helps in solving technical or analytical problems. Web mining is currently used in ranking web pages, electronic commerce, Internet advertising, reliability evaluation, recommendation systems, personalization of web services and more.


November 22nd, 2016

Posted In: mining, research, web content, web mining, YouTube


CRISP-DM stands for CRoss Industry Standard Process for Data Mining. It is a methodology used for running data mining projects, since data exploration, like other business processes, demands a general guide to follow.

The basic methodology is split into four parts:

  1. problem identification
  2. data preprocessing (turning data into information, whatever that means)
  3. data exploration
  4. evaluation (result examination)

Data mining is, in general, a mechanism that lets us make better decisions in the future by analysing (in a very fancy way) past data. There are two moments in the data mining process where we have to be careful: when we discover a pattern that may be false, or when the pattern is true but useless. The first is a straightforward danger, because business decisions made on a false basis simply cost money (sometimes an awful lot of money). The second has an additional, hidden trap, because it only becomes clear that the rule is useless after implementing it: the system simply does not pass the reality check. Following the methodology gives us a mechanism to minimize the probability of making such mistakes.

According to crisp-dm.org, it is an open methodology that keeps the industrial data mining process close to a general business-and-research problem solving strategy. The process is divided into 6 steps:

  1. business problem and condition understanding
  2. data understanding
  3. data preparation
  4. modelling
  5. evaluation
  6. implementation

It is very important to notice that each step is strictly connected with the results of the previous one, and it is often necessary to jump several times between levels (not only in the order presented above!). It is also natural that the result of one step causes a return to the starting point of the project and a re-evaluation of some opinions or earlier designs.

[M. Berry, G. Linoff „Data Mining Techniques”, Wiley 2004.]

[Daniel Larose „Odkrywanie wiedzy z danych” 2006 PWN, 5]

November 21st, 2016

Posted In: CRISP-DM, web content, web mining, YouTube


CART (Classification And Regression Trees) is a decision tree algorithm. Trees created by CART are binary: there are two branches coming out of each node. The algorithm goes as follows: consider every possible partition and choose the best one (according to a "goodness" criterion). To reduce complexity there are some pruning (= cutting branches) techniques.

C4.5 is also a decision tree algorithm. What differs is the possibility of creating more-than-binary trees. It is also the information gain that decides about attribute selection. The attribute with the biggest information gain (equivalently, the greatest entropy reduction) gives a classification that needs the lowest amount of information to classify correctly.

Entropy is the number of bits needed to transmit information about the outcome of an event with probability p. For each possible split of the training set into subsets, it is possible to calculate the information requirement (as a weighted sum of the entropies of the subsets). The algorithm chooses the optimal split, the one with the biggest information gain.
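In the usual notation (my formulation, consistent with the description above), for a set S with class proportions p_i and a split on attribute A into subsets S_v:

  H(S) = -\sum_{i} p_i \log_2 p_i, \qquad \mathrm{Gain}(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)

The split with the largest Gain(S, A) is the one chosen.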

Disadvantages of the C4.5 algorithm are the large memory and processor capacity requirements needed to produce rules.

The C5.0 algorithm was presented in 1997 as a commercial version of C4.5. Tests showed an important step forward, both in classification results and in the supported types of data.

November 20th, 2016

Posted In: C4.5, web content, web mining, YouTube


AdaBoost (Adaptive Boosting) is a meta-algorithm used to improve classification results. The concept is to make a lot of weak classifiers cooperate to boost the result. Adaptability means in this case that detecting a wrong classification makes the algorithm do more work on it (by changing the weights and making the algorithm put more effort where it failed).

AdaBoost is sensitive to noisy data or outliers.
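In Weka this is available as AdaBoostM1. A minimal sketch boosting decision stumps (the base learner and the number of rounds are illustrative choices):

  import weka.classifiers.meta.AdaBoostM1;
  import weka.classifiers.trees.DecisionStump;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class BoostingSketch {
      public static void main(String[] args) throws Exception {
          Instances data = DataSource.read("training.arff"); // hypothetical dataset
          data.setClassIndex(data.numAttributes() - 1);
          AdaBoostM1 boost = new AdaBoostM1();
          boost.setClassifier(new DecisionStump()); // weak base classifier
          boost.setNumIterations(50);               // number of boosting rounds
          boost.buildClassifier(data);
          System.out.println(boost);
      }
  }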

[http://www.cs.princeton.edu/~schapire/boost.html; Wu et al., "Top 10 algorithms in data mining", Springer 2008]

November 19th, 2016

Posted In: ADABoost, web content, web mining, YouTube


K Nearest Neighbours is a basic classification algorithm. The idea probably comes from an extension of the Rote classifier, which is about as simple as the point system in 'Whose Line Is It Anyway'. The system memorizes the whole training set and classifies only items that have exactly the same values as some item in the training set. The obvious disadvantage is that a lot of objects will remain unclassified. The "next generation" of the concept classifies using the value of the nearest point in the dataset. Compared to the previous approach that is a huge difference, but the system is still vulnerable to noise and outliers.

kNN is (compared to the previous strategies) a bit more sophisticated. The algorithm finds a group of k objects in the training set that are closest under some notion of "distance" and, based on them, assigns the new object to one of the previously given classes, respecting the weights assigned to the neighbours. Important issues are:

  • the number of neighbours (important enough that it is in the very name of the algorithm)
  • the meaning of distance
  • the training set itself, which is the basis of the method

These parameters are very important to the results, and I am going to write another post to discuss them a little more.

The procedure goes as follows:

  1. Memorize the training set (and be prepared to update it dynamically if data comes in continuously)
  2. Measure the distance between the new object and the objects in the training set to find the nearest ones
  3. Use the collected information to classify the new object
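A minimal sketch of this procedure with Weka's kNN implementation (IBk); k = 3 and the file names are invented:

  import weka.classifiers.lazy.IBk;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class KnnSketch {
      public static void main(String[] args) throws Exception {
          Instances train = DataSource.read("training.arff");   // step 1: memorize the training set
          train.setClassIndex(train.numAttributes() - 1);
          IBk knn = new IBk();
          knn.setKNN(3);                                         // number of neighbours
          knn.buildClassifier(train);                            // lazy: mostly just stores the data

          Instances newObjects = DataSource.read("new.arff");    // steps 2 and 3: find neighbours, classify
          newObjects.setClassIndex(newObjects.numAttributes() - 1);
          for (int i = 0; i < newObjects.numInstances(); i++) {
              double predicted = knn.classifyInstance(newObjects.instance(i));
              System.out.println(newObjects.classAttribute().value((int) predicted));
          }
      }
  }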

In spite of the fact that building the model with kNN is not a very difficult task, the cost of classification is relatively high. Comparing the new object with the whole training set (lazy learning) is responsible for that, and it is especially visible with large datasets. There are some techniques that reduce the amount of computation, from simply editing the training set (sometimes the results are even better than classification with the larger database) to proximity graphs.

Sources: [Top 10 algorithms in data mining, Springer 2008]


November 18th, 2016

Posted In: algorithm, web content, web mining, YouTube

