Rafazwonull vs Joe I/O


Findings on Quality Blog recognition
24.05.2009, 13:20
Filed under: Data mining, Personal matters, Information, Studies

Hi everyone,

I have almost finished formatting my work and I guess I will go and print it in the next few days. I have learnt a lot about cleansing data and the problems you stumble upon while data mining. Before really closing this chapter and moving on to other projects, I thought I might give you some idea of my results.

In general, the features used work quite well for identifying high-quality blogs, even with rather simple algorithms. The J48 decision tree for recognizing the controlled collection consists of only three nodes but gives an accuracy of 98% for the German collection. The most interesting feature seems to be relationOutLinksToSize. This is confirmed by other collections and algorithms, and it is an intuitively important feature: just imagine link-spam blogs with a huge number of outgoing links but no content of their own! What is more, in all collections the link-based features, such as the number of incoming links, were really good discriminating features.
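To make the "three nodes" idea concrete, here is a minimal sketch of a one-split tree (a root and two leaves) on a single feature. It is not J48 itself: J48 chooses splits by gain ratio, while this sketch simply maximizes accuracy. The definition of the ratio (outgoing links per character), the function names and the toy blogs are all assumptions made up for illustration.

```python
def relation_out_links_to_size(out_links, size_chars):
    """Outgoing links per character of content (assumed definition)."""
    return out_links / max(size_chars, 1)


def fit_stump(values, labels):
    """Find the single threshold that classifies best on one feature.

    Returns (threshold, label_if_at_or_below, accuracy), or None if the
    feature has only one distinct value and cannot be split.
    """
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between adjacent distinct values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = None
    for t in candidates:
        for low_label in (0, 1):
            high_label = 1 - low_label
            correct = sum(
                (low_label if v <= t else high_label) == y
                for v, y in zip(values, labels)
            )
            acc = correct / len(labels)
            if best is None or acc > best[2]:
                best = (t, low_label, acc)
    return best


# Made-up toy blogs: (outgoing links, size in characters, 1=good / 0=bad).
blogs = [
    (12, 40000, 1),
    (30, 90000, 1),
    (8, 25000, 1),
    (400, 2000, 0),   # link-spam style: many links, hardly any content
    (950, 3000, 0),
    (300, 1200, 0),
]
ratios = [relation_out_links_to_size(o, s) for o, s, _ in blogs]
labels = [y for _, _, y in blogs]
threshold, low_label, accuracy = fit_stump(ratios, labels)
print(threshold, low_label, accuracy)
```

On such cleanly separated toy data a single threshold is enough; real collections will of course be messier.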

Interestingly, the accuracy (that is, the share of blogs a classifier recognizes correctly) is not determined by the number of features you consider, but rather by how well they work together. With only 10 attributes I get about the same accuracy for most algorithms, while some others improve substantially. This shows how important it is to have reliable data. The more useless features you give your algorithms to analyze, the more unreliable the resulting models will be. So it is much better to take a reduced set of features, which is easier to analyze, and base your results on those.
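One simple way to cut a feature set down could look like the sketch below. It ranks each feature by how far apart the class means are, relative to the feature's overall spread, and keeps the top k. This is only an illustration under assumptions: the post does not say which selection method was actually used, and the scoring rule, the names `incomingLinks` and `noiseFeature`, and the numbers are invented.

```python
from statistics import mean, pstdev


def separation_score(values, labels):
    """Absolute difference of the class means, scaled by overall spread."""
    good = [v for v, y in zip(values, labels) if y == 1]
    bad = [v for v, y in zip(values, labels) if y == 0]
    spread = pstdev(values)
    if spread == 0 or not good or not bad:
        return 0.0  # constant feature or a missing class: no information
    return abs(mean(good) - mean(bad)) / spread


def top_k_features(columns, labels, k):
    """columns maps a feature name to its list of values, one per blog."""
    ranked = sorted(
        columns,
        key=lambda name: separation_score(columns[name], labels),
        reverse=True,
    )
    return ranked[:k]


# Made-up toy data: 1 = good blog, 0 = bad blog.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "relationOutLinksToSize": [0.01, 0.02, 0.01, 0.30, 0.25, 0.40],
    "incomingLinks": [120, 90, 150, 3, 0, 5],
    "noiseFeature": [5, 1, 3, 3, 1, 5],  # same mean in both classes
}
selected = top_k_features(columns, labels, k=2)
print(selected)
```

A per-feature score like this ignores how features interact, so it only approximates "how well they work together"; it is meant as a starting point, not as the method behind the numbers above.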

J48 decision tree

So, what about the other collections? Here is the downside. The models seem to be at least partly language-specific, although I cannot really tell to what extent. All I can tell is that the Russian and the English collections produced different models. While I achieved somewhat weaker results with the Russian one (67% with J48), the English collection was really a catastrophe: J48 got only 8% correct. I wondered why that was and tried to reduce the number of features considered, but only the Russian collection improved substantially through this (78% with J48). The English one stayed at that low level.

I guess the difference in accuracy is due to a certain lack of homogeneity in the data. While I got rather homogeneous collections in Russian and German, there are a lot of people blogging in English (like me right now). Necessarily, the number of different types of blogs rises in this language, and the overall number of blogs available increases. What is more, the average quality does not increase, so I get a random collection which is really heterogeneous, and the classifier does not know how to determine a profile of high-quality blogs. However, this idea needs to be confirmed by other results.

So, what do you think? Is anybody interested in the results of the language recognition (Russian blogs work best!)?

-r-


4 comments so far

Hey Mr. -r- (that sounds quite interesting :) )
I reckon that your last comment in the brackets was just meant to include me in the conversation (I could call it an intrigue as well). Of course the Russian blogs work best, there is no doubt, and I cannot recall a time in which people tried to doubt exactly this fact without losing their heads. So much for the historical part of our issue!
I am pretty impressed by the results, and your conclusions seem to be informative and supported by your research. Of course you raise new issues, which you’ve already listed in your summary. I am looking forward to hearing more about that! Will you actually continue with your research, or do you really want to tell me that you are not interested in finding out why the English blogs have such high heterogeneity?!

Take care!
- m -

Comment by max

Dear Mr -m- (or just "M" as they call you in various spy thrillers),

thanks a lot for the compliments. I love you too. I would never think badly of the great Russian people, their women or their cuisine. Nor of their blogosphere.

I will write some stuff on the language recognition as well. Actually, I would like to continue the research on the heterogeneity issues, and I would also like to dig deeper into text mining to really get a better hold on semantic quality. But this is something for a PhD or another thesis!

However, what we have here as a method and a tool could easily be extended to other fields: finding people on the net who fit job descriptions, finding places which fit people's profiles, finding guys who fit girls. Once you have understood the concept, this is really powerful. We should have our own company! Does anybody know VCs who can be trusted? ;)

Comment by Rafa

Hey Rafi,

I could imagine there’re some people out there that would really like to see an example.

What do you think of posting an example blog with screenshots?

I reckon that’d make your work more visible to people that haven’t spent the last 6 months writing a thesis on that topic.

I’d really like to see that, and I am more than willing to contribute to your comments here AND leverage the power of social media to lead some traffic here for a discussion.

Cheers,
Stefan

Comment by Stefan Martens

Hey Stefan,
thanks for this contribution. Alas, I guess this is the downside of data mining models: they are hard to visualize, and J48 is maybe one of the most understandable ones. Most are neither intuitive nor understandable, and the models result purely from the data they are based upon.

Imagine one of the linear algorithms, for example:
the algorithm recursively computes the weight for each of the attributes the decision models are based upon. Then it takes each of the numeric values it has considered, multiplies it by its weight and sums these products for each instance. This is the computed numeric value of the class, whose correlation with the initial class values (1=good and 0=bad) is then computed. This is at least what I have understood of what my supervisor explained to me. Any suggestions on how to visualize that ;) ? This is not even the toughest one: imagine vectors in multidimensional spaces (Edit: Thanks Max!), where each of the features is a dimension… great considering 150 attributes…
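The scoring step described above can at least be written down, even if it is hard to draw. The sketch below takes weights as given (how an algorithm arrives at them is left out entirely), computes the weighted sum per instance and then the Pearson correlation with the 1=good / 0=bad class values. The two features, the weights and the data are made up, and reading "correlation" as Pearson correlation is my assumption.

```python
from math import sqrt


def linear_scores(instances, weights):
    """Weighted sum of the attribute values, one score per instance."""
    return [sum(w * x for w, x in zip(weights, row)) for row in instances]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up instances with two attributes each, and their class values.
instances = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.2, 0.9],
    [0.1, 0.8],
]
classes = [1, 1, 0, 0]   # 1 = good, 0 = bad
weights = [1.0, -1.0]    # hypothetical weights, not learned here

scores = linear_scores(instances, weights)
r = pearson(scores, classes)
print(scores, r)
```

With 150 attributes the code stays the same; only the rows get longer, which is exactly why a picture of it is so hard to come by.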

Anyway, I will gladly post some example distribution graphs to supply you with some visual goodies, if you care for them! You raise a very good point indeed: I will post some data on J48 and explain to you what happened!

Comment by Rafa



