The Internet is probably the world’s biggest database. Moreover, its data is available through easily accessible techniques. Often this data is important and detailed, and it helps people achieve their goals or can be used in various fields. Data is held in various forms: text, multimedia, databases. Web pages follow the HTML standard (or another member of the markup-language family), which gives them a kind of structure, but not enough to make them easy to use in data mining. A typical website contains, in addition to its main content and links, various extras such as ads or navigation items. It is also widely known that most of the data on the Internet is redundant: a lot of information appears on different sites in a more or less similar form.

The deep web (hidden web, invisible web, invisible Internet) refers to the lower level of the global network. It does not appear in search engine results, because search engines do not index or list this area. It is said that a large part of the global web belongs to the deep web and stays hidden until a specific query, targeted at the right interface, triggers the content to appear. This statement also reveals some of the barriers that keep the data hidden, such as a specific interface, the need for specific knowledge about the data, strong security (passwords) or simply a lack of links. It is also possible to block ranges of IP addresses, to guard interfaces (e.g. using CAPTCHA) or just to keep data in a non-standard format. The reasons mentioned above form a natural barrier for crawlers and web robots, keeping part of the web out of the linked web.

The easiest way to define Internet exploration (web mining) is as the part of data mining in which web resources are explored. It is commonly divided into three areas:

  1. web content mining (WCM) is the closest to “classic” data mining, as it mostly operates on text, and text is the most common way of putting information on the Internet,
  2. web linkage mining aims to use the nature of the Internet, its connection structure, as the web is a collection of documents connected by links,
  3. web usage mining looks for useful patterns in logs and documents containing the history of users’ activity.
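
As an illustration of the second area, the sketch below ranks pages purely by their link structure, using a simplified PageRank-style power iteration. PageRank is not named in this post; it is used here only as a well-known example of link analysis. The function name `pagerank`, the damping value and the four-page site are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by link structure using simple power iteration.

    links maps each page to the list of pages it links to.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small base score, independent of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page site; the page names are made up for illustration.
example_links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
    "orphan": ["home"],
}

ranks = pagerank(example_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph the page that most other pages link to ends up with the highest score, while the page nobody links to ends up with the lowest. No page text is read at all, which is the point of linkage mining.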

These three areas are also what distinguishes web mining from data mining, because the subject of the research is not only data, but structure and flow as well. Additionally, web mining takes data “as it is”, and the imagination of Internet content creators is wide when it comes to inventing new forms, while data mining operates mostly on structured data.
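
The “flow” side, web usage mining, can be sketched in a similar way. The example below counts how often visitors move directly from one page to another. It assumes the server log has already been parsed into `(user_id, timestamp, page)` tuples; that record format, the function name `count_transitions` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def count_transitions(log_entries):
    """Count how often visitors move from one page directly to another.

    log_entries is an iterable of (user_id, timestamp, page) tuples.
    """
    # Group the visits of each user together.
    visits_by_user = defaultdict(list)
    for user_id, timestamp, page in log_entries:
        visits_by_user[user_id].append((timestamp, page))

    transitions = Counter()
    for visits in visits_by_user.values():
        visits.sort()  # order each user's visits by time
        # Pair every visit with the next one made by the same user.
        for (_, source), (_, target) in zip(visits, visits[1:]):
            transitions[(source, target)] += 1
    return transitions


# Hypothetical, already-parsed log records.
example_log = [
    ("u1", 1, "/home"), ("u1", 2, "/products"), ("u1", 3, "/cart"),
    ("u2", 1, "/home"), ("u2", 2, "/products"),
    ("u3", 5, "/home"), ("u3", 6, "/about"),
]

transitions = count_transitions(example_log)
for (source, target), count in transitions.most_common(3):
    print(f"{source} -> {target}: {count}")
```

Frequent transitions like these are one simple kind of usage pattern. Real logs would first need cleaning, for example splitting visits into sessions and filtering out robots.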

Finally, the general application of web mining goes beyond tweaking websites or data analysis. It can be used as a tool for improving tasks, projects and processes in companies and institutions, or as an aid in solving technical or analytical problems. Web mining is currently used in web page ranking, e-commerce, Internet advertising, reliability evaluation, recommendation systems, personalization of web services and more.

November 22nd, 2016

Posted In: mining, research, web content, web mining, YouTube

Web content mining is the part of the data mining domain that is closest to the classic definition of data mining (DM). Its main tasks correspond to similar areas in classic data mining:

  • automatic content extraction from web pages
  • integration of the information
  • opinion and reviews extraction
  • knowledge synthesis
  • noise detection and segmentation

In short, the web content mining tasks listed above are solutions to more or less complicated problems connected with automating web usage. They lead to improvements in several aspects of everyday Internet life, in both technical and non-technical matters.
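
To make the first and last items of the list more concrete, here is a minimal sketch of content extraction with very crude noise removal, using only Python’s standard `html.parser` module. The set of “noise” tags is an assumption made for the example, and the class name, function name and sample page are hypothetical; real pages usually need far more careful segmentation.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect visible text while skipping elements that are usually noise."""

    # Assumption: these tags rarely hold main content.
    NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # how many noise elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in self.NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text only when we are outside every noise element.
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page with navigation, an ad-like aside and a footer.
example_html = """
<html><head><style>p { color: red; }</style></head>
<body>
  <nav><a href="/">Home</a> <a href="/ads">Ads</a></nav>
  <p>Web content mining extracts information from pages.</p>
  <aside>Buy now!</aside>
  <footer>Contact us</footer>
</body></html>
"""

print(extract_main_text(example_html))
```

Only the paragraph text survives; the navigation links, the aside and the footer are dropped. Relying on tag names alone is a deliberate simplification, since many sites put ads and menus into plain `div` elements.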

Web mining is generally a branch of data mining. To introduce web mining, I want to take one step back and present some thoughts about data mining.

Data mining, or data exploration, is a set of techniques used to automatically discover non-trivial relations, patterns and schemes in large data collections. In other words, we are looking for deeply hidden knowledge in very large datasets (in the case of web mining, the Internet), and we only accept automatic solutions. Why? For better understanding. With such a mechanism, we can ask much more difficult questions (compared to e.g. SQL).

At this point, we can say that web mining is data mining with the Internet as the dataset.

Let’s take a short look at the applications of web mining:

  • data classification (e.g. customers’ sentiment, reviews…)
  • natural language processing (NLP, not to be confused with neuro-linguistic programming)
  • WWW personalization
  • knowledge management

Sources: data mining (.pps); Wikipedia; Bing Liu, Web Mining 2005 Tutorial.

November 15th, 2016

Posted In: introduction, mining, web content, YouTube
