Filed under: Data Mining, Personal Matters, Information, Studies | Tags: Data Mining, J48, Quality, Companies
Hi folks,
as suggested by Stefan Martens, I will give you some more detail on how model development works in data mining.
As I have written before, model development is based on data: an appropriate collection of high quality blogs and, ideally, low quality blogs as well.
Crawling high quality collections
The first one is easy. Where do you get a set of high quality blogs from? Technorati with its top list is of course a good starting point, but there are other ready-made collections of the best-rated blogs. I chose the German Wikio top list, which is created automatically by an approach comparable to Technorati’s. I extracted the top 300 entries and fed them into the crawler. Among these supposedly high quality examples were blogs such as the following. I will not post them as links: if my method can tell spam blogs from non-spam blogs, the methods of highly paid Google strategists can as well…
- http://netzpolitik.org/
- http://www.basicthinking.de/blog/
- http://www.nerdcore.de/wp/
- http://www.stefan-niggemeier.de/blog/
The blog crawler follows each of the URLs it is given and analyzes the blog’s features. Just imagine it scanning each blog for a set of defined characteristics. It takes into account, for example, the length of the URL (1), which has proved to be quite a good indicator of quality in other web mining tasks. In addition, it checks for the existence of an RSS feed (2) and RSS comment feeds (4), counts the number of graphics (3), determines the blog’s update interval from the information in the RSS feed (5), and measures the text length (7) and the length of the titles (6). These are just some examples of the 150 features the blog crawler tries to extract from the blog’s HTML and XML structure.
I have already shown you what some of the data gathered from one blog looks like, but that was only part of the information that is actually collected. In fact, the CSV output file you get from this collection has a size of about 1 MB.
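To give a rough idea of what such a feature scan can look like, here is a minimal sketch in Python. It is not the actual crawler: the class and function names, the feature names in the result and the sample page are all made up for illustration, and it only covers a handful of the features mentioned above. The text length in particular is a crude count that does not skip script or style content.

```python
from html.parser import HTMLParser

# MIME types that usually mark a feed in <link rel="alternate"> tags.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class BlogFeatureParser(HTMLParser):
    """Collects a few structural counts from one HTML page (illustrative)."""

    def __init__(self):
        super().__init__()
        self.image_count = 0
        self.feed_links = []  # href values of advertised RSS/Atom feeds
        self.title = ""
        self.text_length = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.image_count += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "link" and "alternate" in (attrs.get("rel") or "").lower():
            if (attrs.get("type") or "").lower() in FEED_TYPES:
                self.feed_links.append(attrs.get("href") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        else:
            # Crude text length: every non-title text node, whitespace trimmed.
            self.text_length += len(data.strip())


def extract_features(url, html):
    """Return a dict of hypothetical feature names for one page."""
    parser = BlogFeatureParser()
    parser.feed(html)
    parser.close()
    # Assumption: comment feeds can be recognized by "comment" in their URL.
    comment_feeds = [h for h in parser.feed_links if "comment" in h.lower()]
    return {
        "url_length": len(url),
        "has_rss_feed": bool(parser.feed_links),
        "has_comment_feed": bool(comment_feeds),
        "image_count": parser.image_count,
        "title_length": len(parser.title.strip()),
        "text_length": parser.text_length,
    }


# Made-up sample page, not taken from any of the crawled blogs.
SAMPLE_HTML = """<html><head><title>My Blog</title>
<link rel="alternate" type="application/rss+xml" href="/feed/">
<link rel="alternate" type="application/rss+xml" href="/comments/feed/">
</head><body><h1>Hello</h1><img src="a.png"><p>Some text.</p></body></html>"""

features = extract_features("http://example.org/blog/", SAMPLE_HTML)
print(features)
```

Each crawled blog would yield one such row; writing all rows out gives the kind of CSV file described above.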
Crawling …. well … „not so high quality“ collections
This was the easy part, but in order to create models for good blogs, you need a collection of bad blogs to separate them from. In this respect it is like us humans: we determine what we are by knowing what we are not. The classification algorithms need information on what high quality is not. Alas, there is no collection of low quality blogs nearly big enough to build statistical profiles from, so I built a module that collects a number of random pages from the web. They are random, which does not mean they are of low quality, but some manual testing has confirmed the assumption that the vast majority of the randomly collected ones have … well … less quality. The tiny number of quality blogs collected by the module does not really matter statistically. Anybody care for a sample of not-so-good blogs (i.e. not in the German top 300)?
- http://lisungu.wordpress.com/
- http://dasca.de/
- http://viaperdita.wordpress.com/
- http://armu.de/
Data Cleansing
The next step is data cleansing. I wrote in my other post „Status update for quality models“ how important it is to have clean data. Trust me: this is not fun! It is an entirely manual process, and you will be staring at a lot of data in XLS tables for a long time… The result of this painful process is a 4 MB CSV file containing two collections: the blogs labelled good and the blogs labelled bad. You load this into your data mining tool and choose an algorithm that can handle nominal targets. The target is the class (here good / bad) that the classifier learns to predict from the attribute data it is given.
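The cleansing itself is manual work, but the last step, putting both collections into one file with a nominal class column, can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name, the feature names and the numbers are hypothetical, and using "?" as the placeholder for missing values is my choice here, not something the tool dictates.

```python
import csv
import io


def write_labeled_csv(good_rows, bad_rows, out):
    """Write both collections to one CSV with a nominal 'class' column."""
    labeled = [dict(row, **{"class": "good"}) for row in good_rows]
    labeled += [dict(row, **{"class": "bad"}) for row in bad_rows]
    # Use the union of all feature names so rows with gaps still fit.
    features = sorted({key for row in labeled for key in row} - {"class"})
    writer = csv.DictWriter(out, fieldnames=features + ["class"], restval="?")
    writer.writeheader()
    writer.writerows(labeled)
    return len(labeled)


# Hypothetical rows; real rows would carry around 150 features each.
good = [{"url_length": 22, "image_count": 14}]
bad = [{"url_length": 31}]  # image_count is missing for this blog

buffer = io.StringIO()
count = write_labeled_csv(good, bad, buffer)
print(buffer.getvalue())
```

The class column goes last so that it is easy to pick as the target attribute when the file is loaded.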
Classification with J48
This is where the real fun begins and you start creating your models. As an example, I chose J48, which is maybe the most intuitive algorithm. It creates a decision tree for the target classes using the attributes it is given. I trained and evaluated it with 10-fold cross-validation: the data is split into ten parts, the classifier is trained on 90% of the data, and the resulting model is validated on the remaining 10%. This is done ten times, so that each part serves as the validation set once, and it may take a while depending on the algorithm and on how many attributes and instances you use it for.
The result for J48 is a decision tree and an accuracy, which tells you what percentage of the instances the classifier assigned to their original class using the attributes it was given. An accuracy of 90% simply means that the decision tree model from J48 assigned 90 out of 100 blogs to the correct class (ok, this is quite obvious, isn’t it?). What the data mining tool gives you is the tree and these numbers:
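If you want to see the mechanics of 10-fold cross-validation without a data mining tool, here is a small self-contained Python sketch. Please note the assumptions: the learner is a toy one-feature threshold rule (a decision stump), not J48, and the URL lengths are invented so that the two classes separate cleanly. All function names are my own.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Shuffle instance indices and deal them into k nearly equal folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_stump(values, labels):
    """Toy learner (NOT J48): choose the threshold on one numeric feature
    that misclassifies the fewest training instances."""
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between neighbouring values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = None
    for threshold in candidates:
        for low, high in (("good", "bad"), ("bad", "good")):
            errors = sum((low if v <= threshold else high) != y
                         for v, y in zip(values, labels))
            if best is None or errors < best[0]:
                best = (errors, threshold, low, high)
    if best is None:  # only one distinct value: predict the majority class
        majority = max(set(labels), key=labels.count)
        return lambda v: majority
    _, best_threshold, best_low, best_high = best
    return lambda v: best_low if v <= best_threshold else best_high


def cross_validate(values, labels, k=10, seed=0):
    """Hold each fold out once, train on the other k-1 folds, and return
    the overall fraction of held-out instances classified correctly."""
    correct = 0
    for held_out in k_fold_indices(len(values), k, seed):
        held = set(held_out)
        train = [i for i in range(len(values)) if i not in held]
        model = train_stump([values[i] for i in train],
                            [labels[i] for i in train])
        correct += sum(model(values[i]) == labels[i] for i in held_out)
    return correct / len(values)


# Invented data: 20 "good" blogs with short URLs, 20 "bad" ones with long URLs.
url_lengths = list(range(10, 30)) + list(range(40, 60))
labels = ["good"] * 20 + ["bad"] * 20

accuracy = cross_validate(url_lengths, labels)
print(accuracy)  # 1.0 on this cleanly separable toy data
```

The important point is that every instance is classified exactly once by a model that never saw it during training, which is what makes the reported accuracy more trustworthy than one measured on the training data itself.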
- 98.55% of „good“ recognized as „good“
- 1.45% of „good“ recognized as „bad“
- 0.03% of „bad“ recognized as „good“
This is basically a confusion matrix. The decision tree that the algorithm generated, and on which this accuracy is based, is the one I have posted before.
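For those who have not met a confusion matrix before: the percentages listed above are shares per actual class, so the two „good“ entries add up to 100%. The following Python sketch shows how such a matrix, its row percentages and the overall accuracy are computed. The function names and the ten toy predictions are invented for illustration and have nothing to do with the real crawl.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes=("good", "bad")):
    """Count how often each actual class was assigned to each predicted class."""
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in classes} for a in classes}


def row_percentages(matrix):
    """Share of each actual class that ended up in each predicted class."""
    result = {}
    for actual, row in matrix.items():
        total = sum(row.values())
        result[actual] = {p: (100.0 * n / total if total else 0.0)
                          for p, n in row.items()}
    return result


def accuracy(matrix):
    """Fraction of all instances on the diagonal (correctly classified)."""
    correct = sum(matrix[c][c] for c in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


# Invented toy labels: 4 good blogs and 6 bad blogs.
actual = ["good"] * 4 + ["bad"] * 6
predicted = ["good", "good", "good", "bad"] + ["bad"] * 5 + ["good"]

matrix = confusion_matrix(actual, predicted)
print(matrix)
print(row_percentages(matrix))
print(accuracy(matrix))
```

In this toy case 75% of „good“ is recognized as „good“ and the overall accuracy is 0.8, because 8 of the 10 blogs sit on the diagonal of the matrix.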

This means that using only the features relationLinksToSize, firstTableRowBGcolor and MeanFeedUpdateIntervall we get an accuracy of 98%. We could actually reduce the set of attributes to these three and still get this accuracy, but only for the collections used. This is the downside of the model generated here: it does not transfer easily to other settings, and you cannot conclude that these features are enough in general to tell quality blogs apart from other blogs. The results are just a hint that these automatically extractable features matter for what we perceive as quality.

What is more, these features seem not to be equally important in every language, which makes it even more likely that they are culture-specific. This is why I collected some other data sets in Russian and English to compare these results with, and it seems the models created are far from generally applicable. Each language-specific collection has its own discriminative features, and the resulting models vary greatly in accuracy. While some of the differences can be explained by differences within the data sets, some are due to differing blog profiles that emerge from the data. So what is needed are bigger crawls and representative collections, to create more robust models and to explain the differences between the resulting quality concepts in more detail.
-r-
1 comment so far
Even if you have better models, I suppose that they would not be stable because of changing cultures and fast changing technologies of blog creation.
Comment by Marcel Knust, 30.05.2009 @ 22:49