WebContentMining.com


web content mining – introduction

Posted in definitions, general, web mining by admin on the August 3rd, 2010

Web content mining is the part of the data mining domain that is closest to the classic definition of data mining (DM). Its main aspects correspond to similar areas in classic data mining:

  • automatic content extraction from web pages
  • integration of the information
  • opinion and review extraction
  • knowledge synthesis
  • noise detection and segmentation

In brief, the web content mining tasks listed above are solutions to more or less complicated problems connected with automating web usage. They lead to improvements in several aspects of everyday Internet life, both technical and non-technical.


web mining – what do we research?

Posted in general, web mining by admin on the May 30th, 2010

The Internet is probably the world’s biggest database. Moreover, its data is available through easily accessible techniques. Often it is important and detailed data that lets people achieve their goals or that can be used in various realms. Data is held in various forms: text, multimedia, databases. Web pages follow the HTML standard (or another member of the markup-language family), which gives them a kind of structure, but not enough to use them easily in data mining. A typical website contains, in addition to the main content and links, various extras such as ads or navigation items. It is also widely known that much of the data on the Internet is redundant: a lot of information appears on different sites in more or less similar form.
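To make the point about ads and navigation items concrete, here is a minimal sketch of separating main content from page noise. It uses only the Python standard library; the list of "noise" tags and the sample page are assumptions made for illustration, and real pages usually need far more careful heuristics.

```python
from html.parser import HTMLParser

# Tags whose content is treated as noise here (an assumption, not a standard list).
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any of the NOISE_TAGS."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # number of noise elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the visible text of a page with the noise elements skipped."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: navigation, main content, an ad block and a footer.
SAMPLE_PAGE = """
<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<div><h1>Deep web</h1><p>Most of the web is not indexed.</p></div>
<aside>Buy cheap ads here!</aside>
<footer>Copyright notice</footer>
</body></html>
"""

main_text = extract_main_text(SAMPLE_PAGE)
print(main_text)
```

Relying on tag names alone only works for pages that mark up their navigation and ads semantically, which is exactly why HTML is "not sufficient" by itself for data mining.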

The deep web (hidden web, invisible web, invisible Internet) refers to the lower level of the global network. It does not appear in search engine results, and search engines do not index or list this area. It is said that a large part of the global web belongs to the deep web and stays hidden until a specific query, targeted at the right interface, triggers the content to appear. This sentence also reveals some of the barriers that keep the data hidden, such as a specific interface, the need for specific knowledge about the data, high security (passwords) or simply a lack of links. It is also possible to block ranges of IP addresses, to guard interfaces (e.g. using CAPTCHA) or just to keep data in a non-standard format. These are natural barriers for crawlers and web robots, and they keep part of the web out of the linked web.

The easiest way to define Internet exploration (web mining) is as the part of data mining in which web resources are explored. It is commonly divided into three areas:

  1. web content mining is the closest to “classic” data mining, as WCM mostly operates on text, and text is the most common way to put information on the Internet,
  2. web linkage mining aims to use the nature of the Internet, its connection structure, as it is a collection of documents connected by links,
  3. web usage mining looks for useful patterns in logs and documents containing the history of users’ activity.
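The third area, web usage mining, is perhaps the easiest to sketch in a few lines. The example below counts how often users move directly from one page to another. The log records are invented and assumed to be already parsed into (user, timestamp, page) tuples; a real server log would need parsing and session splitting first.

```python
from collections import Counter

# Hypothetical, already-parsed log records: (user, timestamp, page).
LOG = [
    ("alice", 1, "/"), ("alice", 2, "/forum"), ("alice", 3, "/forum/post"),
    ("bob", 1, "/"), ("bob", 2, "/forum"), ("bob", 3, "/faq"),
    ("carol", 5, "/"), ("carol", 6, "/forum"),
]


def count_transitions(records):
    """Count how often users go directly from one page to the next."""
    last_page = {}           # last page seen for each user
    transitions = Counter()  # (from_page, to_page) -> number of occurrences
    # Order by user and time so consecutive records of a user are adjacent.
    for user, _, page in sorted(records, key=lambda r: (r[0], r[1])):
        if user in last_page:
            transitions[(last_page[user], page)] += 1
        last_page[user] = page
    return transitions


transitions = count_transitions(LOG)
most_common_step = transitions.most_common(1)[0]
print(most_common_step)  # (('/', '/forum'), 3)
```

Frequent transitions like this one are the simplest kind of "useful pattern" in usage data; longer paths and per-session analysis follow the same idea.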

These three areas are also what distinguishes web mining from data mining, because the subject of research is not only data but structure and flow as well. Additionally, web mining takes data “as it is”, and the imagination of Internet content creators is wide when it comes to inventing new forms, while data mining operates mostly on structured data.

Finally, the general application of web mining goes beyond tweaking websites or data analysis. It can be used as a tool for improving tasks, projects and processes in companies and institutions, or as a method that aids in solving technical or analytical problems. Web mining is currently used in web page ranking, e-commerce, Internet advertising, reliability evaluation, recommendation systems, personalization of web services and more.


hunting content creators (2)

Posted in hunting content creatos, projects, social networks, web mining by admin on the May 27th, 2010

As I wrote in the previous part (content creators part 1), discovering ubercreators and exploiting this knowledge should be an important part of the development of every social-networking site.

My project (idea) is to set up a system that finds content creators on a functioning Internet board, using data mining algorithms. Some details:

  • a MySQL database with over 3k users and about 70 parameters describing them,
  • the parameters describing users must be selected (manually; technically this comes down to selecting tables in the database, and the process could be automated if necessary),
  • Weka is used as a set of classifiers and clustering algorithms (the data must be prepared for both the program and the algorithm).

Creating content on a discussion board is not a really complex issue. Although it is difficult to evaluate the value of messages, in most cases it is not even necessary. It is enough to eliminate obvious cases of spamming and just let the snowball roll down the hill.

At a certain moment, discovering users with hidden potential to create valuable content can give an evolving community a serious boost. Given a set of users with their parameters, with an emphasis on those parameters describing activity and “creative spirit”, the algorithm does the rest of the job, clustering users into groups with a high level of similarity. The point is to use the results to give positive feedback to possible creators and so exploit their potential.
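The clustering step can be sketched as follows. This is not the Weka setup used in the project, only a plain k-means stand-in written from scratch. The two features (posts per month, threads started) and the seven users are hypothetical and stand in for the roughly 70 real parameters.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equally long tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: returns the final centroids and a cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        # Assignment step: put every point into the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(col) / len(members) for col in zip(*members))
    labels = [
        min(range(k), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]
    return centroids, labels


# Hypothetical users: (posts per month, threads started per month).
users = [
    (1, 0), (2, 1), (1, 1), (0, 0),   # mostly passive readers
    (20, 5), (25, 7), (22, 6),        # candidates for "content creators"
]

centroids, labels = kmeans(users, k=2)
print(labels)
```

With real data the parameters would need scaling first, since a feature with a large numeric range otherwise dominates the distance.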

The most reliable way to measure results is to implement the model in a real-life system. However, some modelling is also necessary, because walking in the dark without even a prediction (a flashlight) of whether it is going to succeed is unacceptable in any business. Success in this case means quick development of the network community with visible growth of valuable content and SEO parameters.

Content creators in social-networking sites part 1

The next part covers the chosen parameters, the algorithm and the modelling.


Pagerank

Posted in data mining, general, web mining by admin on the May 22nd, 2010

PageRank, Larry Page’s algorithm, is probably the most popular and best-known use of web linkage mining. This context-free approach is simply a popularity contest, where the weight of a ‘vote’ is measured by the importance of the originating site itself. The better the site linking to my page is, the bigger the gain in rating I get. Looking inside, the importance of a site is measured by the probability of visiting it; the exact way the numbers are obtained is Google’s secret, obviously (I bet naive Bayes is used somewhere in there ;).

What about reality? PageRank is vulnerable to spamming, and a lot of people cheat PR for a living. In short, a farm of sites (servicer) is created, and its coordinated work pulls the target site up in the ranking. Dealing with ambiguous keywords is also a language problem. Then there is a technical problem, solved more or less well by the taxation mechanism, with pages that have no further links (PR value thieves, as PR popularity flows there and stays forever). Random jumping also helps with dead-end sites. Prediction mechanisms are also worth mentioning, as is using local resources to save time and computing power, e.g. processing data for a whole domain or server.
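The voting idea, taxation and random jumping can be put together in a short textbook-style sketch. This is of course not Google’s production algorithm. The link graph is invented, `beta` (the share of rank that follows links) is set to the commonly quoted 0.85, and spreading the rank of dead ends evenly over all pages is one common convention among several.

```python
def pagerank(links, beta=0.85, iterations=100):
    """Power-iteration PageRank.

    links maps each page to the list of pages it links to.
    beta is the share of rank that follows links; the remaining
    (1 - beta) is the "tax" redistributed by random jumps.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank sitting on dead ends would be lost; spread it over all pages.
        dead_end_mass = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - beta) / n + beta * dead_end_mass / n
        new_rank = {p: base for p in pages}
        # Each page splits the taxed part of its rank among its out-links.
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += beta * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical link graph; "e" is a dead end with no out-links.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
    "e": [],
}

rank = pagerank(links)
best_page = max(rank, key=rank.get)
print(best_page, round(rank[best_page], 3))
```

Page "c" collects votes from three pages, so it ends up on top, and "a" comes second only because its single vote comes from the important "c".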

There are some modifications of the PageRank algorithm. An interesting one is topic-sensitive PageRank by T. Haveliwala. Contexts are added (topic-specific groups, such as DMOZ categories), and the idea is to keep results close to a previously specified topic. The big advantage of this approach is that personalization of the search process can be applied easily (a user-specific popularity ranking rather than the general one).
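A minimal sketch of the topic idea: the random jump no longer lands on any page with equal probability, only on pages known to belong to the chosen topic. The graph, the page names and the two topic sets below are invented for illustration, and the sketch simplifies the published method, which precomputes one ranking per topic and mixes them at query time.

```python
def topic_pagerank(links, topic_pages, beta=0.85, iterations=100):
    """PageRank whose random jumps land only on pages from topic_pages.

    links maps each page to the list of pages it links to;
    topic_pages is assumed to be a non-empty subset of the pages in links.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    teleport = {
        p: (1.0 / len(topic_pages) if p in topic_pages else 0.0) for p in pages
    }
    rank = dict(teleport)
    for _ in range(iterations):
        # Mass that does not follow links: the taxed share plus rank on dead ends.
        dead_end_mass = sum(rank[p] for p in pages if not links.get(p))
        jump_mass = (1.0 - beta) + beta * dead_end_mass
        new_rank = {p: jump_mass * teleport[p] for p in pages}
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += beta * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical graph: two small communities joined by a hub page.
links = {
    "s1": ["s2", "hub"], "s2": ["s1"],   # "sports" pages
    "c1": ["c2", "hub"], "c2": ["c1"],   # "cooking" pages
    "hub": ["s1", "c1"],
}

sports_rank = topic_pagerank(links, topic_pages={"s1", "s2"})
cooking_rank = topic_pagerank(links, topic_pages={"c1", "c2"})
print(round(sports_rank["s1"], 3), round(cooking_rank["s1"], 3))
```

The same page gets a different score depending on the topic set, which is what makes a user-specific ranking possible: pick the topic set from the user’s interests.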


Weka

Posted in data mining, definitions, hunting content creatos, projects, web mining by admin on the May 20th, 2010

Weka is a collection of algorithms commonly used in data mining. It has both a graphical and a command-line interface; the latter is probably useful for more complicated projects, but for mine the simple Explorer was enough. Moreover, one can use one’s own Java code. Weka contains tools for data preparation (normalization, discretization and a bunch of others), classification, clustering, regression and association rules, not to mention well-developed visualization.


Weka - explorer window

I enjoyed working with Weka very much. After some struggling with the input data format (I used CSV) and a little practice, a wide choice of possibilities appeared. I used Weka in a Unix environment, Ubuntu 8.1.

The basic data format for Weka is ARFF (Attribute-Relation File Format), an ASCII file format. It describes instances that share a set of attributes. You can choose another file format (.names, .data (C4.5), .csv, .libsvm, .dat, .bsi, .xrff), which happens most of the time, at least at the beginning of projects, when you have a lot of data from external sources such as MySQL databases or Excel.
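To show what an ARFF file looks like, here is a small sketch that converts CSV text into ARFF. The column names and values are invented. The converter only knows two attribute types (numeric, and nominal for everything else) and assumes simple names and values without spaces or commas; Weka’s own converters handle many more cases.

```python
import csv
import io


def is_number(value):
    """True if the string can be read as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Convert CSV text with a header row into a minimal ARFF document."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for index, name in enumerate(header):
        column = [row[index] for row in data]
        if all(is_number(value) for value in column):
            lines.append(f"@attribute {name} numeric")
        else:
            # Nominal attribute: list every distinct value in braces.
            values = ",".join(sorted(set(column)))
            lines.append(f"@attribute {name} {{{values}}}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Hypothetical export from a users table.
CSV_TEXT = """posts,threads,role
120,14,creator
3,0,reader
45,2,reader
"""

arff = csv_to_arff(CSV_TEXT, relation="users")
print(arff)
```

The header part declares the relation and its attributes, and everything after `@data` is one instance per line, which is why ARFF is described as instances sharing attributes.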

There are some functions worth mentioning, such as various kinds of filters (supervised or unsupervised), jittering or other kinds of random “pollution”, randomization, sampling and standardization. It is possible to use Perl commands or visualize datasets in many ways. At any moment you can check the log to find out what is happening inside or check memory use; logging also takes place in the Ubuntu console where the program was started.
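As an illustration of one of these preparation steps, standardization rescales an attribute to zero mean and unit standard deviation. The sketch below is plain Python, not Weka’s filter; the sample values are invented, and using the population standard deviation is an assumption made here.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant attribute carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical attribute: number of posts for eight users.
posts = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = standardize(posts)
print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After this step attributes measured on very different scales become comparable, which matters for distance-based algorithms such as clustering.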

I have to mention the “dancing” Kiwi shown while an algorithm works. It is a strange feeling when you have to watch it for a couple of hours.

