Internet is probably the biggest world’s database. Moreover, data is available using easily accessible techniques. Often it is important and detailed data, that let people achieve goals or use it in various realms. Data is held in various forms: text, multimedia, database. Web pages keep standard of html (or another ML family member) which makes it kind of structural form, but not sufficent to easily use it in data mining. Typical website contains, in addition to main content and links, various stuff like ads or navigation items.  It is also widely known that most of the data in the Internet is redundant – a lot of information appear in different sites, in more or less alike form.

Deep web (hidden web, invisible web, invisible Internet) refers to the lower niveau of the global network. It doesn’t appear in the results of the search engine’s work and the searching devices don’t index or list this area. It is said the great part of the global web belongs to deep web and stays hidden, until specific enquiry, targeted to the right interface triggers content to appear. This sentences also reveals some barriers that keep the data hidden, like specific interface, requirement to have specific knowledge about data, high security (passwords) or simply lack of linkage. It is also possible to block range of IP addresses, interfaces (e.g. using CAPTCHA) or just keep data in non-standard format. Reasons mentioned above are natural barrier for crawlers and web robots, keeping some part of the web out of the linked web.

Looking for the definition of the Internet exploration, the easiest way is to put it as a part of data mining, where web resources are explored. It is commonly divided into three:

  1. web content mining is the closest one to the “classic” data mining”, as WCM mostly operates on text and it is generally common way to put information in Internet as text,
  2. web linkage mining goal is to use nature of the Internet – connection structure – as it is a bunch of documents connected with links.
  3. web usage mining is looking for useful patterns in logs and documents containing history of user’s activity.

Three of them are also factors varying web mining from data mining, because topic of the research is not only data, but structure and flow as well. Additionally, web mining takes data “as it is” – and the imagination of internet content creators is wide when it comes to create new ones – while data mining operates rather on structured data.

Finally – general application of web mining goes beyond tweaking websites or data analyse. It could be used as a tool for upgrading tasks, projects and processes in companies and institutions or as a method providing aid while solving technical or analitical problems. Web mining is currently used in ranking of the web pages, electronic trade, internet advertising, reliability evaluation, recommendation systems, personalization of web services and more.

Web Content mining article sponsored by Saskatoon Roofing Services, best company for roofing Saskatoon has.

November 22nd, 2016

Posted In: mining, research, web content, web mining, YouTube

Leave a Reply

Your email address will not be published. Required fields are marked *