Rafazwonull vs Joe I/O


Quality Model Development
25.05.2009, 22:25
Filed under: Data Mining, On My Own Behalf, Information, Studies

Hi folks,

as suggested by Stefan Martens, I will provide you with some more ideas on how the model development works in data mining.

As I have written before, the basis for model development is data: an appropriate collection of both high quality blogs and, ideally, low quality blogs.

Crawling high quality collections

The first one is easy. Where do you get a set of high quality blogs from? Technorati with its top list is of course a good starting point, but there are other ready-made collections of the best-rated blogs. I chose the German Wikio top list, which is created automatically by an approach comparable to Technorati's. I extracted the top 300 entries and fed them into the crawler. Among these supposedly high quality examples were blogs such as the following. I will not post them as links: if my method can tell spam blogs from non-spam blogs, the methods of highly paid Google strategists can as well…

  • http://netzpolitik.org/
  • http://www.basicthinking.de/blog/
  • http://www.nerdcore.de/wp/
  • http://www.stefan-niggemeier.de/blog/

The blog crawler follows each of the URLs it is given and analyzes the blog's features. Just imagine it scanning each blog for a set of defined characteristics. It takes into account, for example, the length of the URL (1), which has proved to be quite a good quality indicator in other web mining tasks. It also checks for the existence of an RSS feed (2) and of RSS comment feeds (4), counts the number of graphics (3), determines the blog's update interval from the information in the RSS feed (5), and measures the length of the titles (6) and of the text (7). These are just some examples of the 150 features the blog crawler tries to extract from the blog's HTML and XML structure.
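To make the idea of "scanning for defined characteristics" more concrete, here is a minimal sketch of how a handful of such features could be read from a page's HTML. It is not the crawler described above: the names `BlogFeatureParser` and `extract_features` are made up for illustration, and treating any feed link whose address contains "comment" as a comment feed is a simplifying assumption.

```python
from html.parser import HTMLParser


class BlogFeatureParser(HTMLParser):
    """Collects a few structural counts from one HTML page (illustrative only)."""

    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def __init__(self):
        super().__init__()
        self.image_count = 0
        self.feed_links = []
        self.title = ""
        self.text_length = 0
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.image_count += 1
        elif tag == "link" and (attrs.get("type") or "").lower() in self.FEED_TYPES:
            self.feed_links.append(attrs.get("href") or "")
        elif tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth:
            # Count visible characters, ignoring surrounding whitespace.
            self.text_length += len(data.strip())


def extract_features(url, html):
    """Return a small feature dictionary for one page."""
    parser = BlogFeatureParser()
    parser.feed(html)
    parser.close()
    return {
        "url_length": len(url),
        "has_feed": bool(parser.feed_links),
        # Assumption: comment feeds carry "comment" somewhere in their address.
        "has_comment_feed": any("comment" in h.lower() for h in parser.feed_links),
        "image_count": parser.image_count,
        "title_length": len(parser.title.strip()),
        "text_length": parser.text_length,
    }
```

Calling `extract_features(url, html)` for every crawled page and writing the dictionaries out as rows would give a CSV file of the kind mentioned in this post, only with far fewer columns.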

[Figure: Blogcrawler_Features]

I have already shown you what some of the data the crawler gathers from one blog looks like, but that was only part of the information actually collected. In fact, the CSV output file you get from this collection has a size of about 1 MB.

Crawling … well … "not so high quality" collections

This was the easy part, but in order to create models for good blogs, you need a collection of bad blogs to separate them from. In this respect it is like us humans: we determine what we are by knowing what we are not. The classification algorithms need information on what high quality is not. Alas, there is no collection of low quality blogs that is anywhere near big enough to create statistical profiles from, so I built a module which collects a number of random pages from the web. Being random does not by itself mean they are of low quality, but some manual testing has confirmed the assumption that the vast majority of the randomly collected ones have … well … less quality. The tiny number of quality blogs picked up by the module does not really matter statistically. Anybody interested in a sample of not-so-good blogs (i.e. not in the German top 300)?

  • http://lisungu.wordpress.com/
  • http://dasca.de/
  • http://viaperdita.wordpress.com/
  • http://armu.de/

Data Cleansing

The next step is data cleansing. I wrote in my other post "Status update for quality models" how important it is to have clean data. Trust me: this is not fun! It is all a manual process, and you will be staring at a lot of data in XLS tables for a long time… The result of this painful process is a 4 MB CSV file with two collections in it: the blogs labelled good and the blogs labelled bad. You load this into your data mining tool and choose an algorithm which can handle nominal targets. The target is the class (here: good / bad) that the classifier learns to predict from the attribute data it is given.

Classification with J48

This is where the real fun begins and you start creating your models. As an example, I chose J48, which is maybe the most intuitive algorithm. It creates a decision tree for the target classes using the attributes it is provided with. I evaluated it with 10-fold cross-validation: the data is split into ten parts, the classifier is trained on 90% of the data, and the resulting model is validated on the remaining 10%. This is done ten times, so that each part serves as the validation set once, and it may take a while depending on which algorithm you use and how many attributes and instances you apply it to.
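For those who like to see the mechanics, here is a small sketch of the splitting behind 10-fold cross-validation. It is a plain, non-stratified version written for illustration, not what the data mining tool does internally; `k_fold_indices`, `cross_validate` and the `train_fn` callback are hypothetical names.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train, test) index lists; each instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(instances, labels, train_fn, k=10, seed=0):
    """Return the share of instances classified correctly over all k rounds.

    train_fn(train_instances, train_labels) must return a function that maps
    one instance to a predicted label.
    """
    correct = 0
    for train, test in k_fold_indices(len(instances), k, seed):
        model = train_fn([instances[i] for i in train], [labels[i] for i in train])
        correct += sum(model(instances[i]) == labels[i] for i in test)
    return correct / len(instances)
```

Any learner can be plugged in as `train_fn`. A useful baseline is one that always predicts the majority class of its training data: a real model has to beat that number to be worth anything.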

The result for J48 is a decision tree and an accuracy, which tells you what percentage of the instances the classifier assigned to their original class using the attributes it was provided with. An accuracy of 90% simply means that the decision tree model from J48 assigned 90 out of 100 blogs to the correct class (OK, this is quite obvious, isn't it?). What the data mining tool gives you is the tree and these numbers:

  • 98.55% of "good" recognized as "good"
  • 1.45% of "good" recognized as "bad"
  • 0.03% of "bad" recognized as "good"
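Figures like these are row percentages: for each true class, the share of its blogs that ended up in each predicted class. As a rough sketch of how they come about, the following counts (actual, predicted) pairs and derives both the row percentages and the overall accuracy. The function names are made up for illustration, and it works on whatever label lists you pass in, not on my data.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of labels occurs."""
    return Counter(zip(actual, predicted))


def row_percentages(matrix, classes):
    """For each actual class, the percentage assigned to each predicted class."""
    result = {}
    for a in classes:
        row_total = sum(matrix[(a, p)] for p in classes)
        for p in classes:
            result[(a, p)] = 100.0 * matrix[(a, p)] / row_total if row_total else 0.0
    return result


def accuracy(matrix):
    """Share of all instances whose predicted label equals the actual one."""
    total = sum(matrix.values())
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / total if total else 0.0
```

Note that the percentages within one row always add up to 100%, while the accuracy also depends on how many blogs each class contains.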

This is basically a confusion matrix. The decision tree which the algorithm generated, and on which this accuracy is based, looks like the one I have posted before:

This means that using only the features relationLinksToSize, firstTableRowBGcolor and MeanFeedUpdateIntervall we get an accuracy of 98%. We could actually reduce the attributes considered to these three and still get this accuracy, but only for the collections used. This is the downside of the model generated here: it is not easy to transfer to other settings, and you cannot conclude that these features are in general enough to tell quality blogs apart from other blogs. The results are just a hint that these automatically extractable features matter for what we perceive as quality. What is more, the features do not seem to be equally important for each language, which suggests that they may be culture-specific. This is why I collected some other data sets in Russian and English to compare these results with, and as it seems, the models created are far from being generally applicable. Each language-specific collection has its own discriminative features, and the resulting models vary greatly in accuracy. While some of the differences can be explained by differences within the data sets, some are due to differing blog profiles which result from the data. So what is needed are bigger crawls and representative collections, to create more robust models and to be able to explain the differences between the resulting quality concepts in more detail.

-r-



Findings on Quality Blog recognition
24.05.2009, 13:20
Filed under: Data Mining, On My Own Behalf, Information, Studies

Hi everyone,

I have almost finished formatting my work and I guess I will go and print it in the next few days. I have learnt a lot about cleansing data and about the problems you stumble upon while data mining. Before really closing this chapter and moving on to other projects, I thought I might give you some idea of my results.

In general, determining the high quality blogs with the features used works quite well, even with rather simple algorithms. The J48 decision tree for recognizing the controlled collection consists of only three nodes but gives you an accuracy of 98% for the German collection. The most interesting feature seems to be relationOutLinksToSize. This is confirmed by other collections and algorithms, and it makes intuitive sense. Just imagine link-spam blogs with a huge number of outgoing links but no content of their own! What is more, in all collections the link-based features, such as the number of incoming links, were really good discriminating features.
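To illustrate the intuition behind a feature like relationOutLinksToSize, here is one possible way to compute a ratio of outgoing links to page size. Both the definition of "outgoing" (an http or https link to a different host) and the definition of "size" (length of the HTML in characters) are assumptions made for this sketch, and `out_links_to_size` is a hypothetical name.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def out_links_to_size(page_url, html):
    """Number of links leaving the page's host, divided by the HTML length."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    own_host = urlparse(page_url).netloc.lower()
    outgoing = 0
    for href in collector.hrefs:
        # Resolve relative links against the page address before comparing hosts.
        target = urlparse(urljoin(page_url, href))
        if target.scheme in ("http", "https") and target.netloc.lower() != own_host:
            outgoing += 1
    return outgoing / len(html) if html else 0.0
```

A page that consists of little more than external links gets a high value, while a long article with a few references gets a low one.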

Interestingly, the accuracy (that is, how many blogs are recognized correctly) is not determined by the number of features you consider, but rather by how well they work together. With only 10 attributes taken together I get about the same accuracy for most algorithms, while some others improve substantially. This shows how important it is to have reliable data. The more useless features you give your algorithms to analyze, the more unreliable the resulting models will be. So it is much better to take a reduced set, which is easier to analyze, and base your results on it…
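One common way to find such a reduced set is to rank the features by information gain, i.e. by how much knowing a feature's value reduces the uncertainty about the class. The sketch below shows this for nominal (or already discretized) feature values. It is only an illustration of the general idea, not necessarily how the 10 attributes above were picked, and all function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus their entropy after splitting by value."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(rows, labels):
    """Return (feature name, gain) pairs, most informative feature first.

    rows is a list of dictionaries that all share the same feature names.
    """
    gains = {
        name: information_gain([row[name] for row in rows], labels)
        for name in rows[0]
    }
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)
```

Keeping only the top entries of `rank_features(rows, labels)` gives a smaller attribute set to train on. Features with a gain near zero tell the classifier almost nothing about the class.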

[Figure: J48 decision tree]

So, what about the other collections? Here is the downside. The models seem to be at least partly language-specific, although I cannot really tell to what extent. All I can tell is that the Russian and the English collections produced different models. While I achieved somewhat weaker results with the Russian one (67% with J48), the English collection was a real catastrophe: J48 got only 8% correct. I wondered why that was and tried to reduce the number of features considered, but only the Russian collection improved substantially through this (78% with J48). The English one stayed at that low level.

I guess the difference in accuracy is due to a certain lack of homogeneity in the data. While I got rather homogeneous collections in Russian and German, there are a lot of people blogging in English (like me right now). Necessarily, the number of different types of blogs rises in this language, and so does the overall number of blogs available. What is more, the average quality does not increase, so I get a random collection which is really heterogeneous, and the classifier cannot work out a profile of high quality blogs. However, this idea needs to be backed up by further results.

So, what do you think? Is anybody interested in the results of the language recognition (Russian blogs work best!)?

-r-