Filed under: Datamining, In eigener Sache, Information | Tags: Crawler, Data Mining, Information, Qualität
Hello World,
After a long time of idleness I thought I might give you an update on my work.
I have crawled quite a lot of pages by now and have built up some interesting collections of controlled and random pages in English and German. The Russian collection is currently being generated thanks to Max's help, and I am looking forward to finding out its features. Alas, nobody seems to know any Persian or any Persian-speaking people (can anybody help out?).
I have run some tests and got some interesting results. First of all, it seems to be really important to have very(!) clean data. At the beginning I only imported the data into Excel, cleansed it and exported it again. The classifiers reached accuracies of up to 80% under ten-fold cross-validation, which was somewhat disappointing. I tried to cleanse the data further by removing the "0" entries where the crawler had failed to determine a value, but the accuracy dropped even further, to about 30%. I manually checked the accuracy of the crawler: it is rather rare that the crawler cannot analyze a feature, and I have a 95% probability of getting any given value correct. If I remove the 0-values, of which 95% were correct, the classifier simply lacks information.
Finally I found another source of error: Excel converts decimals to dates (01.05. becomes 01. May 2009), and you really need to be careful about this. I changed the settings to avoid it, and even with Naive Bayes I now get accuracies of over 90% under ten-fold cross-validation. This is kind of a success. However, this accuracy combines the classifier's ability to recognize quality blogs with its ability to recognize mediocre blogs, and in general the latter are much easier to identify. The distribution of the correctly classified instances gives you a better estimate of the classifier's reliability.
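To illustrate why the per-class distribution says more than the overall accuracy, here is a minimal sketch in plain Python. The function names and the toy labels are invented for illustration only; they are not my crawl data and not the tool I used for the tests.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Share of all instances that were classified correctly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def per_class_recall(actual, predicted):
    """For each class, the share of its instances that were classified correctly."""
    totals = Counter(actual)
    hits = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: hits[label] / totals[label] for label in totals}

# Invented toy labels (assumption): 8 mediocre blogs, 2 quality blogs.
actual = ["mediocre"] * 8 + ["quality"] * 2
# The hypothetical classifier gets every mediocre blog right,
# but only one of the two quality blogs.
predicted = ["mediocre"] * 8 + ["quality", "mediocre"]

print(accuracy(actual, predicted))          # 0.9
print(per_class_recall(actual, predicted))  # {'mediocre': 1.0, 'quality': 0.5}
```

In this made-up case the overall accuracy looks fine at 90%, while only half of the quality blogs are recognized, which is exactly the kind of imbalance a single accuracy figure hides.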

As expected, the link-based features work best. Google indegree is a really good measure, but comment-based analysis seems to give reliable results as well. I also found some other graphic features that yield about 20% information gain, but I do not know how to interpret this yet.
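For readers who have not met information gain before, the following is a rough sketch of how it can be computed for a single feature. It assumes the feature has already been discretized into a few values, and the feature values and labels below are invented; it is not necessarily how my toolkit computes or normalizes the figure quoted above.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Invented toy data (assumption): two quality and two mediocre blogs.
labels = ["quality", "quality", "mediocre", "mediocre"]
separating = ["high", "high", "low", "low"]  # splits the classes perfectly
useless = ["a", "b", "a", "b"]               # tells us nothing about the class

print(information_gain(separating, labels))  # 1.0
print(information_gain(useless, labels))     # 0.0
```

With two equally frequent classes the label entropy is 1 bit, so a feature can gain at most 1 bit; a feature that leaves each group as mixed as before gains nothing.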
In another test setting I tried to tell English blogs apart from German ones using the features I logged. It seems I get only about 50% correct (which is close to "It does not work!"), even with clean data.
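Why 50% is so bad can be seen from a simple baseline: a classifier that always predicts the most frequent class. The sketch below assumes an evenly split set of English and German blogs, which is an assumption made for illustration; the labels are invented.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical balanced collection (assumption): 50 English and 50 German blogs.
labels = ["en"] * 50 + ["de"] * 50

print(majority_baseline(labels))  # 0.5
```

If the two languages are roughly balanced, a result of about 50% is therefore no better than not looking at the features at all.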
Looking forward to reporting more…
-r-