<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Rafazwonull vs Joe I/O</title>
	<atom:link href="http://rafazwonull.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://rafazwonull.wordpress.com</link>
	<description>Unreflected impressions between net culture, Web 2.0 and user experience design</description>
	<lastBuildDate>Tue, 26 May 2009 07:07:08 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<language>de</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<image>
		<url>http://www.gravatar.com/blavatar/0a1972a6dd03bf8ebd159f40a986dbeb?s=96&#038;d=http://s.wordpress.com/i/buttonw-com.png</url>
		<title>Rafazwonull vs Joe I/O</title>
		<link>http://rafazwonull.wordpress.com</link>
	</image>
			<item>
		<title>Quality Model Development</title>
		<link>http://rafazwonull.wordpress.com/2009/05/25/quality-model-development/</link>
		<comments>http://rafazwonull.wordpress.com/2009/05/25/quality-model-development/#comments</comments>
		<pubDate>Mon, 25 May 2009 21:25:17 +0000</pubDate>
		<dc:creator>Rafa</dc:creator>
				<category><![CDATA[Datamining]]></category>
		<category><![CDATA[In eigener Sache]]></category>
		<category><![CDATA[Information]]></category>
		<category><![CDATA[Studium]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[J48]]></category>
		<category><![CDATA[Qualität]]></category>
		<category><![CDATA[Unternehmen]]></category>

		<guid isPermaLink="false">http://rafazwonull.wordpress.com/?p=263</guid>
		<description><![CDATA[Hi folks,
as suggested by Stefan Martens, I will provide you with some more ideas on how the model development works in data mining.
As I have written before, the basis for the model development is data which is an appropriate collection of both high quality blogs and &#8211; at best &#8211; low quality blogs.
Crawling high quality [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Hi folks,</p>
<p>as suggested by <a href="http://www.smartens.eu/">Stefan Martens</a>, I will provide you with some more ideas on how the model development works in data mining.</p>
<p>As I have written before, the basis for the model development is data which is an appropriate collection of both high quality blogs and &#8211; at best &#8211; low quality blogs.</p>
<h4><strong>Crawling high quality collections</strong></h4>
<p>The first one is easy. Where do you get a set of high quality blogs from? <a href="http://technorati.com/" target="_blank">Technorati</a> with its toplist is of course a good starting point, but there are other ready-made collections of the best rated blogs. I chose the <a href="http://www.wikio.de/blogs/top">German Wikio-toplist</a>, which is created automatically by an approach comparable to Technorati&#8217;s. I extracted the top 300 examples and fed them into the crawler. Among these supposedly high quality examples were blogs such as the following ones. I will not post them as links: if my method can tell spam blogs from non-spam blogs, the methods of highly paid Google strategists can as well&#8230;</p>
<ul>
<li>http://netzpolitik.org/</li>
<li>http://www.basicthinking.de/blog/</li>
<li>http://www.nerdcore.de/wp/</li>
<li>http://www.stefan-niggemeier.de/blog/</li>
</ul>
<p>The <a href="http://rafazwonull.wordpress.com/2009/02/17/qualitatsmodelle-und-data-mining-in-blogs/">blog crawler</a> follows each of the URLs it is provided with and analyzes the blog&#8217;s features. Just imagine scanning it for a set of defined characteristics. It takes into account, for example, the length of the URL (1), which has proved to be quite a good feature for determining quality in other web mining tasks. What is more, it looks for the existence of an RSS feed (2) and RSS comment feeds (4), counts the number of graphics (3), determines the blog update interval from the information in the RSS feed (5) and measures the text length (7) and the length of the titles (6). These are just some examples of the 150 features the blog crawler tries to extract from the blog&#8217;s HTML and XML structure.</p>
<p style="text-align:left;"><a href="http://rafazwonull.files.wordpress.com/2009/05/blogcrawler_features2.jpg"><img class="aligncenter size-full wp-image-269" title="Blogcrawler_Features" src="http://rafazwonull.files.wordpress.com/2009/05/blogcrawler_features2.jpg?w=420&#038;h=263" alt="Blogcrawler_Features" width="420" height="263" /></a></p>
<p style="text-align:left;">I have <a href="http://rafazwonull.wordpress.com/2009/02/24/crawling-und-grose-dokumente-marcels-herausforderung/">already shown</a> you what some of the data it gathers from one blog looks like, but this was only a part of the information that is actually collected. In fact, the CSV output file you get from this collection has a size of about 1 MB.</p>
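<p>To make the idea of scanning a page for defined characteristics a bit more concrete, here is a minimal sketch in Python. It is not the actual crawler: the class, function and feature names are made up for illustration, and it covers only four of the features mentioned above (URL length, presence of an RSS feed link, number of graphics and title length).</p>

```python
from html.parser import HTMLParser


class FeatureParser(HTMLParser):
    """Collects a few simple structural features from one HTML page.

    Hypothetical helper for illustration, not part of the real crawler.
    """

    def __init__(self):
        super().__init__()
        self.image_count = 0
        self.has_rss_feed = False
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.image_count += 1
        elif tag == "link" and attrs.get("type") == "application/rss+xml":
            self.has_rss_feed = True
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside the title element is collected.
        if self._in_title:
            self.title += data


def extract_features(url, html):
    """Return a dict with one row of (assumed) feature names and values."""
    parser = FeatureParser()
    parser.feed(html)
    return {
        "urlLength": len(url),
        "hasRssFeed": parser.has_rss_feed,
        "imageCount": parser.image_count,
        "titleLength": len(parser.title.strip()),
    }


sample_html = (
    "<html><head><title>My Blog</title>"
    '<link rel="alternate" type="application/rss+xml" href="/feed/">'
    "</head><body><img src='a.png'><img src='b.png'></body></html>"
)
features = extract_features("http://example.org/", sample_html)
print(features)
```

<p>Each blog would yield one such row, and writing the rows of all blogs to a file gives a table like the CSV output described above.</p>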
<h4>Crawling &#8230;. well &#8230; &#8220;not so high quality&#8221; collections</h4>
<p>This was the easy part, but in order to create models for good blogs, you need a collection of bad blogs to separate them from. In this respect the algorithms are like us humans: we determine what we are by knowing what we are not. The classification algorithms need information on what high quality is not. Alas, there is no collection of low quality blogs that is nearly big enough to create statistical profiles from, so I built a module which collects a number of random pages from the web. They are random, which does not mean they are of low quality, but some manual testing has confirmed the assumption that the vast majority of the randomly collected ones are of &#8230; well &#8230; lower quality. The tiny number of quality blogs collected through the module does not really matter statistically. Does anybody care for a sample of not-so-good blogs (i.e. not in the German top 300)?</p>
<ul>
<li>http://lisungu.wordpress.com/</li>
<li>http://dasca.de/</li>
<li>http://viaperdita.wordpress.com/</li>
<li>http://armu.de/</li>
</ul>
<h4>Data Cleansing</h4>
<p>The next step is the data cleansing. I wrote in <a href="http://rafazwonull.wordpress.com/2009/04/24/status-update-for-quality-models/">my other post &#8220;Status update for quality models&#8221;</a> how important it is to have clean data. Trust me: this is not fun! It is all a manual process, and you will be staring at a lot of data in XLS tables for a long time&#8230; The result of this painful process is a 4 MB CSV file with two collections in it: the blogs labelled <em>good</em> and the blogs labelled <em>bad</em>. You load this into your data mining program and choose an algorithm that can handle nominal targets, i.e. the classes (<em>good / bad</em>) which the classifier is supposed to predict from the attribute data it is provided with.</p>
<h4>Classification with J48</h4>
<p>This is where the real fun begins and you start creating your models. As an example, I chose <a href="http://grb.mnsu.edu/grbts/doc/manual/J48_Decision_Trees.html">J48</a>, which is maybe the most intuitive algorithm. It creates a decision tree for the target classes using the attributes it is provided with. I trained and validated it with 10-fold cross-validation: the data is split into ten parts, the classifier is trained on 90% of the data, and the resulting model is validated on the remaining 10%. This is done ten times, each time with a different part held out, and may take a while depending on which algorithm you are using and how many attributes and instances you are using it on.</p>
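<p>The splitting behind 10-fold cross-validation can be sketched in a few lines of Python. This is only an illustration of the principle with a plain random split; the data mining tool does this internally, and the function name and the number of 600 instances are arbitrary assumptions.</p>

```python
import random


def k_fold_splits(n_instances, k=10, seed=42):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(600, k=10))
for train, test in splits[:2]:
    # Each round trains on 90% of the instances and tests on the other 10%.
    print(len(train), len(test))
```

<p>Every instance ends up in the test part exactly once, so the reported accuracy is based on predictions for the whole collection.</p>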
<p>The result for J48 is a decision tree and an accuracy, which tells you what percentage of the instances the classifier assigned to the correct class using the attributes it was provided with. An accuracy of 90% simply means that the decision tree model from J48 assigned 90 out of 100 blogs to the correct class (ok, this is quite obvious, isn&#8217;t it?). What the data mining tool gives you is the tree and these numbers:</p>
<ul>
<li>98.55% of &#8220;good&#8221; recognized as &#8220;good&#8221;</li>
<li>1.45% of &#8220;good&#8221; recognized as &#8220;bad&#8221;</li>
<li>0.03% of &#8220;bad&#8221; recognized as &#8220;good&#8221;</li>
</ul>
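<p>How such percentages come about can be shown with a small Python sketch. The labels here are invented and much fewer than in the real collections; the function name is also just an assumption. For each actual class, the sketch counts how often it was recognized as each class and turns the counts into percentages.</p>

```python
from collections import Counter


def confusion_percentages(actual, predicted, classes=("good", "bad")):
    """Return {actual_class: {predicted_class: percent}}; each row sums to 100."""
    counts = Counter(zip(actual, predicted))
    table = {}
    for a in classes:
        row_total = sum(counts[(a, p)] for p in classes)
        table[a] = {
            p: (100.0 * counts[(a, p)] / row_total if row_total else 0.0)
            for p in classes
        }
    return table


# Hypothetical labels, not the data from this post.
actual = ["good"] * 4 + ["bad"] * 6
predicted = ["good", "good", "good", "bad"] + ["bad"] * 5 + ["good"]

table = confusion_percentages(actual, predicted)
# Overall accuracy: share of all instances that were assigned to their actual class.
accuracy = 100.0 * sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(table)
print(accuracy)
```

<p>The per-class rows are more informative than the single accuracy value, because they show which class the errors come from.</p>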
<p>These numbers are basically a confusion matrix. The decision tree that the algorithm generated, and on which this accuracy is based, looks like the one I have posted before:</p>
<p><img class="aligncenter" title="J48 decision tree" src="http://rafazwonull.files.wordpress.com/2009/05/j48.png?w=300&#038;h=274" alt="" width="300" height="274" /></p>
<p>This means that using only the features <em>relationLinksToSize</em>, <em>firstTableRowBGcolor</em> and <em>MeanFeedUpdateIntervall</em> we get an accuracy of 98%. We could actually reduce the number of attributes to these three and still get this accuracy, but only for the collections used. This is the downside of the model generated here: it is not easy to transfer it to other settings, and you cannot conclude that in general these features are enough to tell quality blogs apart from other blogs. These results are just a hint that these automatically measurable features are important to what we perceive as quality. What is more, these features seem not to be equally important for each language, which means that they are even more likely to be culture specific. This is why I collected some other sets of data in Russian and English to compare these results with, and as it seems, the models created are far from being generally applicable. Each language-specific collection has its own discriminative features, and the resulting models have greatly varying accuracies. While some of the differences can be explained by differences within the data sets, some are due to differing blog profiles which result from the data. So what is needed are bigger crawls and representative collections to create more robust models and to be able to explain the differences between the resulting quality concepts in more detail.</p>
<p>-r-</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://rafazwonull.wordpress.com/2009/05/25/quality-model-development/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/86136f24d90942ff03d4256f453e2653?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Rafa</media:title>
		</media:content>

		<media:content url="http://rafazwonull.files.wordpress.com/2009/05/blogcrawler_features2.jpg" medium="image">
			<media:title type="html">Blogcrawler_Features</media:title>
		</media:content>

		<media:content url="http://rafazwonull.files.wordpress.com/2009/05/j48.png?w=300&#38;h=274" medium="image">
			<media:title type="html">J48 decision tree</media:title>
		</media:content>
	</item>
		<item>
		<title>Findings on Quality Blog recognition</title>
		<link>http://rafazwonull.wordpress.com/2009/05/24/findings-on-quality-blog-recognition/</link>
		<comments>http://rafazwonull.wordpress.com/2009/05/24/findings-on-quality-blog-recognition/#comments</comments>
		<pubDate>Sun, 24 May 2009 12:20:58 +0000</pubDate>
		<dc:creator>Rafa</dc:creator>
				<category><![CDATA[Datamining]]></category>
		<category><![CDATA[In eigener Sache]]></category>
		<category><![CDATA[Information]]></category>
		<category><![CDATA[Studium]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[J48]]></category>
		<category><![CDATA[Meta-Blogging]]></category>
		<category><![CDATA[Qualität]]></category>

		<guid isPermaLink="false">http://rafazwonull.wordpress.com/?p=257</guid>
		<description><![CDATA[Hi everyone,
I have almost finished formatting my work and I guess I will go and print it in the next few days. I have learnt a lot of things about cleansing data and problems you stumble upon while data mining. Before really closing this chapter here and getting on to other projects I thought I [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Hi everyone,</p>
<p>I have almost finished formatting my work and I guess I will go and print it in the next few days. I have learnt a lot of things about cleansing data and problems you stumble upon while data mining. Before really closing this chapter here and getting on to other projects I thought I might give you some ideas on my results.</p>
<p>In general it works quite well to determine the high quality blogs with the features used, even with rather simple algorithms. The J48 decision tree for recognizing the controlled collection consists of only three nodes but gives you an accuracy of 98% for the German collection. The most interesting feature seems to be <em>relationOutLinksToSize</em>. This is confirmed by other collections and algorithms, and it is intuitively quite an important feature. Just imagine link-spam blogs with a huge number of outgoing links but no content of their own! What is more, in all collections the link-based features, such as the number of incoming links, were really good discriminating features.</p>
<p>Interestingly, the accuracy (that is, how many blogs the classifiers recognize correctly) is not determined by the number of features you consider, but rather by how well they work together. With only 10 attributes taken together I get about the same accuracy for most algorithms, while some others improve substantially. This shows how important it is to have reliable data. The more useless features you give your algorithms to analyze, the more unreliable the resulting models will be. So it is much better to take a reduced number of features, which is easier to analyze, and base your results on these&#8230;</p>
<div id="attachment_259" class="wp-caption aligncenter" style="width: 310px"><img class="size-medium wp-image-259" title="J48" src="http://rafazwonull.files.wordpress.com/2009/05/j48.png?w=300&#038;h=274" alt="J48 decision tree" width="300" height="274" /><p class="wp-caption-text">J48 decision tree</p></div>
<p>So, what about the other collections? Here is the downside. The models seem to be at least partly language specific, although I cannot really tell to what extent. All I can tell is that the Russian and the English collections produced different models. While I achieved somewhat weaker results with the Russian one (67% with J48), the English collection was really a catastrophe: J48 got only 8% correct. I wondered why that was and tried to reduce the number of features considered, but only the Russian collection improved substantially through this (78% with J48). The English one stayed at that low level.</p>
<p>I guess the difference in accuracy is due to a certain lack of homogeneity in the data. While I got rather homogeneous collections in Russian and German, there are a lot of people blogging in English (like me right now). Necessarily the number of different types of blogs rises in this language, and the overall number of blogs available increases. What is more, the average quality does not increase, so I get a random collection which is really heterogeneous, and the classifier does not know how to determine a profile of high quality blogs. However, this idea needs to be confirmed by other results.</p>
<p>So, what do you think? Does anybody care for the results of the language recognition (Russian blogs work best!)?</p>
<p>-r-</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://rafazwonull.wordpress.com/2009/05/24/findings-on-quality-blog-recognition/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/86136f24d90942ff03d4256f453e2653?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Rafa</media:title>
		</media:content>

		<media:content url="http://rafazwonull.files.wordpress.com/2009/05/j48.png?w=300" medium="image">
			<media:title type="html">J48</media:title>
		</media:content>
	</item>
		<item>
		<title>Data Cleansing and Quality Model Generation</title>
		<link>http://rafazwonull.wordpress.com/2009/04/24/status-update-for-quality-models/</link>
		<comments>http://rafazwonull.wordpress.com/2009/04/24/status-update-for-quality-models/#comments</comments>
		<pubDate>Fri, 24 Apr 2009 08:06:51 +0000</pubDate>
		<dc:creator>Rafa</dc:creator>
				<category><![CDATA[Datamining]]></category>
		<category><![CDATA[In eigener Sache]]></category>
		<category><![CDATA[Information]]></category>
		<category><![CDATA[Crawler]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Qualität]]></category>

		<guid isPermaLink="false">http://rafazwonull.wordpress.com/2009/04/24/status-update-for-quality-models/</guid>
		<description><![CDATA[Hello World,
after a long time of idleness I thought I might give you an update on my work.
I have crawled quite a lot of pages until now and I have built up some interesting collections of controlled and random pages in English and German. The Russian collection is currently being generated thanks to Max&#8217;s help [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Hello World,<br />
after a long time of idleness I thought I might give you an update on my work.</p>
<p>I have crawled quite a lot of pages until now and I have built up some interesting collections of controlled and random pages in English and German. The Russian collection is currently being generated thanks to Max&#8217;s help and I am looking forward to finding out its features. Alas, nobody seems to know any Persians or Persian-speaking people (can anybody help out?).</p>
<p>I have run some tests and got some interesting results. First of all, it seems to be really, really important to have very(!) clean data. At the beginning I only imported the data into Excel, cleansed it and exported it again. The classifiers had accuracies of up to 80% after <a href="http://en.wikipedia.org/wiki/Cross-validation_(statistics)" target="_blank">ten-fold cross-validation</a>, which was somewhat disappointing. I tried to cleanse the data once again by removing the &#8220;0&#8221; entries where the crawler failed to determine values, but the accuracy dropped even further to about 30%. I manually checked the accuracy of the crawler, and it seems to be a rather rare case that the crawler cannot analyze some features: I have a 95% probability of getting any value correct. If I remove the 0-values, of which 95% were correct, the classifier just lacks information. Finally I found another source of error: Excel converts decimals to dates (01.05. to 01. May 2009), and you really need to be careful about this. I changed the settings to avoid this, and even with <a href="http://en.wikipedia.org/wiki/Bayes_estimator" target="_blank">Naive Bayes</a> I get accuracies of over 90% after 10-fold cross-validation. This is kind of a success. However, this accuracy is a combination of the classifier&#8217;s ability to recognize quality blogs and mediocre blogs, and in general it is much easier to recognize the latter. The distribution of the correctly classified instances gives you a better estimate of the classifier&#8217;s reliability:</p>
<p><img class="aligncenter size-full wp-image-253" title="accuracy_after_10f-c1" src="http://rafazwonull.files.wordpress.com/2009/04/accuracy_after_10f-c1.jpg?w=419&#038;h=221" alt="accuracy_after_10f-c1" width="419" height="221" /></p>
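<p>The date conversion problem described above is easy to overlook in a big table. As a small safeguard, one could scan the exported CSV for cells that no longer parse as numbers. The following Python sketch does this; the column names and the sample rows are invented for illustration, and the real export may use a different separator or decimal format.</p>

```python
import csv
import io


def find_non_numeric_cells(csv_text, label_column="class"):
    """Return (row_number, column, value) for every cell that is not a plain number."""
    suspicious = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column, value in row.items():
            if column == label_column:
                continue  # the class label is nominal, not numeric
            try:
                float(value)
            except (TypeError, ValueError):
                suspicious.append((row_number, column, value))
    return suspicious


# Hypothetical export: one value was turned into a date by the spreadsheet.
sample = (
    "urlLength,meanFeedUpdateInterval,class\n"
    "23,1.05,good\n"
    "41,01. May 2009,bad\n"
    "35,0,bad\n"
)
problems = find_non_numeric_cells(sample)
print(problems)
```

<p>Note that a plain 0 passes the check. This fits the observation above that the 0-values are mostly correct measurements and should stay in the data.</p>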
<p>As expected, the link-based features work best by the way. Google-Indegree is really a good measure, but Comment-based analysis seems to get you reliable results as well. I found some other graphic features which get you about 20% information-gain, but I do not know how to interpret this yet.</p>
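<p>For readers who have not met information gain before: for a nominal feature it is the entropy of the class labels minus the weighted entropy that remains after splitting the data by that feature. The Python sketch below shows this textbook definition on invented toy data. The feature names are assumptions, and it does not cover how numeric features are discretized or how the percentage quoted above was normalized.</p>

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy data, invented for illustration.
labels = ["good", "good", "bad", "bad"]
has_comment_feed = [True, True, False, False]  # separates the classes perfectly
has_images = [True, False, True, False]        # tells us nothing about the class

gain_feed = information_gain(has_comment_feed, labels)
gain_images = information_gain(has_images, labels)
print(gain_feed)
print(gain_images)
```

<p>A feature that separates the classes perfectly gets the full entropy of the labels as its gain, while a feature that is independent of the class gets a gain of zero.</p>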
<p>In another test-setting I tried to tell apart English from German blogs with the features I logged. As it seems I get about 50% correct (which is close to &#8220;It does not work!&#8221;), even with clean data.</p>
<p>Looking forward to reporting more&#8230;<br />
-r-</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://rafazwonull.wordpress.com/2009/04/24/status-update-for-quality-models/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/86136f24d90942ff03d4256f453e2653?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Rafa</media:title>
		</media:content>

		<media:content url="http://rafazwonull.files.wordpress.com/2009/04/accuracy_after_10f-c1.jpg" medium="image">
			<media:title type="html">accuracy_after_10f-c1</media:title>
		</media:content>
	</item>
		<item>
		<title>Crawling Russian and Persian Collections</title>
		<link>http://rafazwonull.wordpress.com/2009/03/08/crawling-russian-and-persian-collections/</link>
		<comments>http://rafazwonull.wordpress.com/2009/03/08/crawling-russian-and-persian-collections/#comments</comments>
		<pubDate>Sun, 08 Mar 2009 20:03:22 +0000</pubDate>
		<dc:creator>Rafa</dc:creator>
				<category><![CDATA[Datamining]]></category>
		<category><![CDATA[Information]]></category>
		<category><![CDATA[Kultur]]></category>
		<category><![CDATA[Studium]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Meta-Blogging]]></category>
		<category><![CDATA[Persian]]></category>
		<category><![CDATA[Russian]]></category>

		<guid isPermaLink="false">http://rafazwonull.wordpress.com/?p=246</guid>
		<description><![CDATA[I just ran the crawler on a random collection of Persian and Russian blogs. I expect interesting outcomes from applying data mining methods to the attributes coming from culture-specific blog collections, so checking whether the parsers work for these types was an important step. Of course I have already used it on blogs from [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I just ran the crawler on a random collection of Persian and Russian blogs. I expect interesting outcomes from applying data mining methods to the attributes coming from culture-specific blog collections, so checking whether the parsers work for these types was an important step. Of course I have already used it on blogs from the U.S., but the differences really were not at all grave. I fear there is no structural difference between English and German blogs, so the next idea would be to apply it to some more distant language area.</p>
<p><a href="http://en.wikipedia.org/wiki/Persian_language"><img class="alignright" title="Farsi" src="http://upload.wikimedia.org/wikipedia/en/0/0c/Nastaliq-proportions.jpg" alt="" width="290" height="302" /></a></p>
<p>I downloaded a <a href="http://de.wikipedia.org/wiki/Stopwort">stopword list</a> in Russian and integrated it into the search. Looking at the results, at first I thought I had a character encoding issue. While a German blog has about 300 stopwords per page, a Russian one has between 15 and 30, while the overall number of words does not decrease. I asked a Russian friend of mine what could be wrong with my list, and he had a look at the pages and stopwords in question. Apparently the algorithm, the list and the pages were all fine, and he supposed that the difference was due to an actual difference in the structure of the language. While the number of Russian stopwords in the list was at least the same as in German, articles were in fact rarely used on the web pages. If you consider the structure of German or English, which are languages that demand articles all the time, I may have bumped into a structural feature which might really help me tell German blogs apart from Russian ones.</p>
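<p>Counting stopwords as described above can be sketched in Python as follows. The stopword lists here are tiny and invented for illustration (the real lists are much longer), and the tokenization is a simple assumption, not necessarily what the crawler does.</p>

```python
import re


def count_stopwords(text, stopwords):
    """Return (number_of_stopword_tokens, number_of_tokens) for a text."""
    # \w+ matches runs of letters and digits, including Cyrillic letters.
    tokens = re.findall(r"\w+", text.lower())
    hits = sum(1 for token in tokens if token in stopwords)
    return hits, len(tokens)


# Tiny illustrative stopword lists, not the lists used for the crawl.
german_stopwords = {"der", "die", "das", "ein", "eine", "und", "ist"}
russian_stopwords = {"и", "в", "не", "на", "что"}

german_hits, german_total = count_stopwords(
    "Der Hund ist ein Tier und die Katze ist ein Tier", german_stopwords
)
russian_hits, russian_total = count_stopwords(
    "Собака и кошка живут в доме", russian_stopwords
)
print(german_hits, german_total)
print(russian_hits, russian_total)
```

<p>Dividing the number of hits by the number of tokens gives a stopword ratio per page, which makes pages of different length comparable.</p>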
<p>Another issue I am facing now is the lack of a controlled collection in <a href="http://en.wikipedia.org/wiki/Persian_language">Farsi</a>. My friend Max has already helped me find the top 100 or so blogs in Russian, but Farsi seems to be a much harder nut to crack. Does anybody know a Persian speaker or, even better, both the Persian language and a collection of the top 300 Persian blogs? Help is highly appreciated&#8230;</p>
<p>-r-</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://rafazwonull.wordpress.com/2009/03/08/crawling-russian-and-persian-collections/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/86136f24d90942ff03d4256f453e2653?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Rafa</media:title>
		</media:content>

		<media:content url="http://upload.wikimedia.org/wikipedia/en/0/0c/Nastaliq-proportions.jpg" medium="image">
			<media:title type="html">Farsi</media:title>
		</media:content>
	</item>
		<item>
		<title>Modelling topical coherence of Blogs: News on my project</title>
		<link>http://rafazwonull.wordpress.com/2009/03/06/modelling-topical-coherence-of-blogs-news-on-my-project/</link>
		<comments>http://rafazwonull.wordpress.com/2009/03/06/modelling-topical-coherence-of-blogs-news-on-my-project/#comments</comments>
		<pubDate>Fri, 06 Mar 2009 01:03:48 +0000</pubDate>
		<dc:creator>Rafa</dc:creator>
				<category><![CDATA[Datamining]]></category>
		<category><![CDATA[In eigener Sache]]></category>
		<category><![CDATA[Information]]></category>
		<category><![CDATA[Studium]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Meta-Blogging]]></category>

		<guid isPermaLink="false">http://rafazwonull.wordpress.com/?p=222</guid>
		<description><![CDATA[Again, a friend pointed me to something which I had not thought of before. Of course it had occurred to me that the issue I am covering with my thesis might be of interest to the English-speaking world, but up until now I was quite unsure whether to change the way I write (and [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Again, a friend pointed me to something which I had not thought of before. Of course it had occurred to me that the issue I am covering with my thesis might be of interest to the English-speaking world, but up until now I was quite unsure whether to change the way I write (and publicly and unchangeably leave my English traces in the never-forgetting web). However, the decision has been made and this will be my first English post.</p>
<p><strong>Aquaint Blog crawler project</strong></p>
<p>As I introduced before, I am currently writing my master&#8217;s thesis on a topic which may be titled &#8220;Quality Models and Data Mining in Blogs&#8221;. I explained to the German-speaking readers that I was going to implement a web crawler based on an implementation from the <a href="http://www.uni-hildesheim.de/de/index.htm">Hildesheim University</a> AQUAINT project. The bot is to crawl through a controlled collection of weblogs and record attributes (currently 150) which could be of interest for the statistical creation of a binary, quality-based model. Using the popular <a href="http://www.cs.waikato.ac.nz/ml/weka/">WEKA toolset</a>, I am confident that it will be possible to find significant patterns in the entirety of the quality-labeled blogs. These patterns will help us to separate, within a much bigger random collection, equally &#8220;good&#8221; (i.e. reliable, high-quality, high-reputation, A-list) blogs from blogs which do not reach the standard of the controlled collection. However, this is not the only goal of collecting those features: we might find different structures in e.g. German and English blogs or, which is even more probable, in European and Asian blogs. The range of possibilities is huge once the crawler is running in a stable version and I am sure the features I am recording really express what I think they do.</p>
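<p>WEKA reads its data from ARFF files, which list the attributes with their types followed by the data rows. As a rough sketch of how recorded attributes could be handed over, the following Python snippet writes numeric attributes plus a nominal class attribute in that format. The function, the relation name and the attribute names are invented for illustration; the real attribute set is much larger.</p>

```python
def to_arff(relation, numeric_attributes, rows, classes=("good", "bad")):
    """Build an ARFF document with numeric attributes and one nominal class attribute."""
    lines = ["@relation " + relation, ""]
    for name in numeric_attributes:
        lines.append("@attribute " + name + " numeric")
    # The nominal target lists its allowed values in curly braces.
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for values, label in rows:
        lines.append(",".join(str(v) for v in values) + "," + label)
    return "\n".join(lines) + "\n"


# Hypothetical attribute names and values, for illustration only.
arff = to_arff(
    "blogs",
    ["urlLength", "imageCount"],
    [((23, 4), "good"), ((61, 0), "bad")],
)
print(arff)
```

<p>The class attribute is declared as nominal with the two values good and bad, which is what makes the task a binary classification problem for the learning algorithms.</p>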
<p><strong>The current implementation</strong>&#8230;</p>
<p>Currently, the crawler runs on a server and collects a number of random German blogs. I have run some trials with WEKA, and it seems my idea is not all that dumb: using the learned model, I am able to tell apart high-quality blogs from a random collection with different algorithms (the weakest of which is Naive Bayes with an accuracy of 50%, the strongest (currently) being J48 with about 92%).</p>
<p><strong>&#8230; and my current challenge: Coherence Analysis</strong></p>
<p>What I am currently working on is a notion of cohesiveness of a blog, which I would like to explain in this post. What do I mean by &#8220;cohesiveness&#8221;? I would like to introduce it as a measure of the topic variation of a given blog. There may be blogs which cover e.g. the issues &#8220;dogs&#8221; and &#8220;cats&#8221;, i.e. which occupy themselves with pets. On the other hand, there may be bloggers writing on &#8220;cats&#8221;, &#8220;cars&#8221; and &#8220;computers&#8221;. As you can see, the range of different topics is bigger in the latter example. This means the blogger is less focussed on a specific range of topics.</p>
<p>How can we measure this cohesiveness and see how many topics the blogger is writing about? I consider the usage of tags and categories an important clue, as most bloggers use them to organize their posts semantically. And what is more important, each tag is linked to a page which collects and displays all the posts marked with that tag. This page may be considered the blogger&#8217;s concept of the tag he uses. This is an important notion, as one blog might use the tag &#8220;cats&#8221; to write on animals, while another one might use it as an acronym for &#8220;<a title="Computer Assisted Trading System" href="http://de.wikipedia.org/wiki/Computer_Assisted_Trading_System">Computer Assisted Trading System</a>&#8221;. So we can assume that the absolute meaning of the terms and their distance does not determine the actual topical difference within a blog as well as the blog-specific term usage does.</p>
<div class="wp-caption aligncenter" style="width: 394px"><a href="http://commons.wikimedia.org/wiki/Category:Felis_silvestris_catus?uselang=de"><img title="Cats!=Cats;" src="http://upload.wikimedia.org/wikipedia/commons/4/49/Felis_catus-skull-drawing.jpg" alt="Cats != Cats." width="384" height="239" /></a><p class="wp-caption-text">If (Cats!=Cats)...</p></div>
<p>Now we can assume that each of our tag pages (or category pages, respectively) which links back to the same host page represents one of the issues the blog is dealing with. How can we determine the similarity of the issues? One way, and I guess the most obvious one, is to determine the pairwise similarity of the tag pages as a basis for calculating the overall cohesiveness. This approach is not entirely new: <a href="http://portal.acm.org/citation.cfm?id=1390757" target="_blank">it was used successfully by a group of Dutch scientists</a> some time ago. I modified it to fit my needs and I would like to introduce some of the results here, including their shortcomings.</p>
<p><strong>Jiyin He Coherence</strong></p>
<p>There are a couple of possible approaches. The one I currently consider most applicable and quite sophisticated uses a package called <a href="http://www.dcs.shef.ac.uk/~sam/simmetrics.html">simmetrics</a>, which allows the application of various string comparison algorithms, including <a href="http://en.wikipedia.org/wiki/Vector_space_model">some vector space models</a> such as cosine similarity and Euclidean distance. It also offers some other, more basic measures such as q-gram or Dice, but I was primarily focussing on the vector space models.</p>
<p style="text-align:center;">
<div class="wp-caption aligncenter" style="width: 412px"><a href="http://www.dcs.shef.ac.uk/~sam/stringmetrics.html"><img title="Simmetrics Performance" src="http://www.dcs.shef.ac.uk/~sam/images/string_metrics_comparison.jpg" alt="Simmetrics Performance (by http://www.dcs.shef.ac.uk/~sam)" width="402" height="277" /></a><p class="wp-caption-text">Simmetrics Performance (by http://www.dcs.shef.ac.uk/~sam)</p></div>
<p>One of my latest approaches is to use the cohesiveness notion introduced by the above-mentioned group to calculate the similarity of all the pages the tag links point to. The coherence ranges from 0 (meaning none) to 1 (meaning identical documents). In my crawler, it is called &#8220;JiyinHeCoherencePerUrls&#8221; after the person who introduced it (to me!). The pseudo-code is as follows:</p>
<p><code>For a given Blog, get all the Tag-Pages.</code><br />
<code>Take the first Tag-Page and get all the text on the page;</code></p>
<p><code>For each of the remaining Tag-Pages,</code><br />
<code>{</code><br />
<code>take the next Tag-Page and get all the text on the page;</code><br />
<code>Compare the two texts using the simmetrics package and save the result;</code><br />
<code>Accumulate the results of the comparisons;</code><br />
<code>Count how many comparisons we have done;</code><br />
<code>Set first Tag-Page to next Tag-Page.</code><br />
<code>}</code><br />
<code>Set the Coherence to the accumulated similarities divided by the number of comparisons.</code></p>
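<p>The pseudo-code above could be sketched roughly as follows. This is only an illustration in Python, not the crawler&#8217;s Java code: the simmetrics comparison is replaced by a plain bag-of-words cosine similarity, and the function names <code>term_vector</code>, <code>cosine_similarity</code> and <code>coherence_per_urls</code> are assumptions made for this example.</p>

```python
import math
import re
from collections import Counter


def term_vector(text):
    # Lower-case the text and count word occurrences (bag of words).
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a, text_b):
    # Cosine of the angle between the two term-count vectors (0.0 to 1.0).
    vec_a, vec_b = term_vector(text_a), term_vector(text_b)
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def coherence_per_urls(tag_page_texts):
    # Compare each tag page with the next one and average the similarities.
    similarities = [
        cosine_similarity(first, second)
        for first, second in zip(tag_page_texts, tag_page_texts[1:])
    ]
    if not similarities:
        return 0.0
    return sum(similarities) / len(similarities)
```

<p>Like the pseudo-code, the sketch only compares consecutive tag pages, so each page text is needed at most twice and never has to be downloaded again.</p>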
<p>However, this way of measuring has some drawbacks which I can only partly compensate for:</p>
<ul>
<li>The simmetrics calculation using vector space models is not stable. It sometimes just hangs, and I don&#8217;t know why. However, I am not sure whether using character-based models is an appropriate alternative.</li>
<li>As you can see, I keep the text of the second page and reuse it as the first page of the next comparison. This avoids accessing the server twice to download the same page. Without reducing server strain we would be opening 10 pages per second, which is why I add a delay before each page access: between 4 seconds (trial) and 12 seconds (real crawling situation).</li>
<li>This method eats up a lot of resources. I reduced the size of the strings to compare by using just the link labels of a page, an approach which was favoured by a lot of search engines in former times (and maybe still is). The reason is that the content of a page can usually be determined quite well from the links it uses. However, I still have the problem of the vector space calculation hanging on some pages&#8230;</li>
</ul>
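<p>Reducing a page to its link labels, as described in the last point, might look like the following sketch. It uses Python&#8217;s standard <code>html.parser</code> module instead of the crawler&#8217;s own DOM handling, and the names <code>LinkLabelCollector</code> and <code>link_labels</code> are made up for this example.</p>

```python
from html.parser import HTMLParser


class LinkLabelCollector(HTMLParser):
    # Collects the visible text inside every anchor element of a page.

    def __init__(self):
        super().__init__()
        self.labels = []
        self._inside_link = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._inside_link = True
            self._parts = []

    def handle_data(self, data):
        if self._inside_link:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside_link:
            # Collapse runs of whitespace; skip links without text (e.g. images).
            label = " ".join("".join(self._parts).split())
            if label:
                self.labels.append(label)
            self._inside_link = False


def link_labels(html_text):
    collector = LinkLabelCollector()
    collector.feed(html_text)
    collector.close()
    return collector.labels
```

<p>Joining the returned labels with spaces gives a much shorter string per tag page, which can then be fed into whichever comparison is used.</p>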
<p><strong>Simple Term-based coherence</strong></p>
<p>Another notion of coherence I have been experimenting with is simpler, but also more stable. I have refrained from calculating a similarity; instead, I define the coherence of two pages as the number of words the two pages share divided by the number of all words on both pages. Of course we only consider unique terms and ignore repeated occurrences, spaces, punctuation etc. This leads to a really simple model, the pseudo-code of which follows:</p>
<p><code><em>For a given Blog,<br />
get  all the Tag-Pages.</em></code></p>
<p><code><em>Take the first Tag-Page and get all the text on the page;<br />
Tokenize the content to words. </em></code></p>
<p><code><em>For each of the Tag-Pages,<br />
{<br />
take the next Tag-Page and get all the text on the page;<br />
Tokenize the content to words.</em></code><br />
<code><br />
<em>For each of the words of  the first Tag page<br />
For each of the words of the next Tag page<br />
If word#1 equals word#2<br />
increment nr of same words per url;</em></code></p>
<p><code><em> Division: Divide the nr of same words per url by (words on first Tag page + words on next Tag page);<br />
Set first Tag page text to next Tag page text<br />
}</em></code></p>
<p><code><em>Calculate Standard Deviation, Mean and Median for all the Divisions done throughout the page comparisons.</em></code></p>
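<p>A compact way to express this second pseudo-code is sketched below in Python. It is an illustration under assumptions, not the crawler&#8217;s code: the nested word loops are replaced by a set intersection over unique terms, the population standard deviation is used, and the names <code>unique_terms</code>, <code>term_overlap</code> and <code>term_coherence</code> are hypothetical.</p>

```python
import re
import statistics


def unique_terms(text):
    # Tokenize to lower-case words; the set drops duplicates and punctuation.
    return set(re.findall(r"\w+", text.lower()))


def term_overlap(text_a, text_b):
    # Shared unique words divided by the unique words of both pages added up.
    terms_a, terms_b = unique_terms(text_a), unique_terms(text_b)
    total = len(terms_a) + len(terms_b)
    if total == 0:
        return 0.0
    return len(terms_a.intersection(terms_b)) / total


def term_coherence(tag_page_texts):
    # One division per pair of consecutive tag pages, then summary statistics.
    divisions = [
        term_overlap(first, second)
        for first, second in zip(tag_page_texts, tag_page_texts[1:])
    ]
    if not divisions:
        return None
    return {
        "mean": statistics.mean(divisions),
        "median": statistics.median(divisions),
        "stdev": statistics.pstdev(divisions),
    }
```

<p>Note that with this denominator two identical pages score 0.5 rather than 1, because the shared words are counted once in the numerator but on both pages in the denominator.</p>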
<p>This approach is much faster and more stable; however, one can doubt the validity of the calculated feature (<a href="http://rafazwonull.wordpress.com/2009/02/24/crawling-und-grose-dokumente-marcels-herausforderung/">as Max usually does</a>).</p>
<p>After all, I am not sure which algorithm to use in the end. Actually I prefer the first one, but as the vector-based similarity approach is not working, I doubt whether the second one is any worse than <a href="http://en.wikipedia.org/wiki/N-gram">q-gram</a> analysis. We must also not forget that the feature I am discussing here will not be used in isolation but is one attribute next to about 149 others, which I feel are less complicated. While you can surely argue about the validity of the coherence measure, there is less uncertainty about attributes such as the number of H1 tags or outlinks. But I have to say, this one has been the most interesting so far.</p>
<p>Anyway, hints on how to use the vector-based simmetrics measures and on why they hang are highly appreciated, and so are ideas on which measure is the more appropriate one. I will happily provide the source code of the classes upon request. Usually I would publish it right away, but you see, my thesis is still to be evaluated, so I had better avoid running into allegations that I committed plagiarism from an Internet source&#8230; <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />  So, if you like it, there is more to come.</p>
<p>Good Night.</p>
<p>-r-</p>
  </div>]]></content:encoded>
			<wfw:commentRss>http://rafazwonull.wordpress.com/2009/03/06/modelling-topical-coherence-of-blogs-news-on-my-project/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/86136f24d90942ff03d4256f453e2653?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Rafa</media:title>
		</media:content>

		<media:content url="http://upload.wikimedia.org/wikipedia/commons/4/49/Felis_catus-skull-drawing.jpg" medium="image">
			<media:title type="html">Cats!=Cats;</media:title>
		</media:content>

		<media:content url="http://www.dcs.shef.ac.uk/~sam/images/string_metrics_comparison.jpg" medium="image">
			<media:title type="html">Simmetrics Performance</media:title>
		</media:content>
	</item>
		<item>
		<title>Crawling and Large Documents: Marcel's Challenge</title>
		<link>http://rafazwonull.wordpress.com/2009/02/24/crawling-und-grose-dokumente-marcels-herausforderung/</link>
		<comments>http://rafazwonull.wordpress.com/2009/02/24/crawling-und-grose-dokumente-marcels-herausforderung/#comments</comments>
		<pubDate>Tue, 24 Feb 2009 01:12:00 +0000</pubDate>
		<dc:creator>Rafa</dc:creator>
				<category><![CDATA[Datamining]]></category>
		<category><![CDATA[In eigener Sache]]></category>
		<category><![CDATA[Studium]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[Crawling]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Meta-Blogging]]></category>

		<guid isPermaLink="false">http://rafazwonull.wordpress.com/?p=217</guid>
		<description><![CDATA[Marcel has just tweeted the Smashing Magazine 5000-Comments-Challenge at me as a challenge for the crawler, and I gladly accept.
The first three times I got a timeout, which at the moment also happens to me in the browser and which is (I think) OK for now given an HTML file of 1.6 MB. The fourth time the crawler ran successfully [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p><a href="http://twitter.com/knust" target="_blank">Marcel</a> has just tweeted the <a href="http://www.smashingmagazine.com/2009/02/23/hardware-giveaway-5000-comments-challenge" target="_blank">5000-Comments-Challenge </a>of Smashing Magazine at me as a challenge for the crawler, and I gladly accept.</p>
<p>The first three times I got a timeout, which at the moment also happens to me in the browser and which is (I think) OK for now given an HTML file of 1.6 MB. The fourth time the crawler ran through successfully, if not quickly. This yields some interesting logged features, some of which I list here. I dedicate them to: Marcel.</p>
<p>nrOutLinks: 2629<br />
nrOutLinksSameHost: 258<br />
fileSize: 1572606<br />
nrDOMElems: 20672<br />
nrTagMeta: 8<br />
nrTagTable: 2<br />
nrTagTd: 6<br />
nrTagTr: 3<br />
nrTagH1: 2<br />
nrTagH2: 1<br />
nrTagP: 0<br />
nrTagB: 0<br />
nrTagScript: 38<br />
nrTagLayer: 0<br />
nrTagStyle: 0<br />
nrTagHr: 0<br />
nrTextLayoutTags: 2298<br />
nrTagCursBold: 17<br />
nrTagFont: 0<br />
linkLabelLengthDev: 13.0<br />
linkLabelLengthMedian: 13.0<br />
linkLabelLengthAve: 13191616766467000<br />
nrTagFrameset: 0<br />
nrTagForm: 2<br />
nrSentenceMarkers: 6171<br />
nrTableInTables: 0<br />
blanksInText: 54122<br />
lengthTitle: 73<br />
lengthPureText: 304992<br />
uniqueWordsPureText: 3895<br />
nrWordsPureText: 16835<br />
nrStopwordsPureText: 5623<br />
nrImgLinks: 2322<br />
relationLinksToLinksToSameHost: 0.09813617<br />
nrhtmlLinkstoFeeds: 7<br />
nrAlternateLinkstoFeeds: 1</p>
<p>Anyone who wants to is welcome to count for themselves.</p>
<p>-r-</p>
  </div>]]></content:encoded>
			<wfw:commentRss>http://rafazwonull.wordpress.com/2009/02/24/crawling-und-grose-dokumente-marcels-herausforderung/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/86136f24d90942ff03d4256f453e2653?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Rafa</media:title>
		</media:content>
	</item>
		<item>
		<title>The Burden and Joy of Crawling</title>
		<link>http://rafazwonull.wordpress.com/2009/02/22/last-und-lust-des-crawling/</link>
		<comments>http://rafazwonull.wordpress.com/2009/02/22/last-und-lust-des-crawling/#comments</comments>
		<pubDate>Sun, 22 Feb 2009 15:34:47 +0000</pubDate>
		<dc:creator>Rafa</dc:creator>
				<category><![CDATA[Datamining]]></category>
		<category><![CDATA[In eigener Sache]]></category>
		<category><![CDATA[Studium]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[Crawling]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Meta-Blogging]]></category>

		<guid isPermaLink="false">http://rafazwonull.wordpress.com/?p=205</guid>
		<description><![CDATA[Having recently outlined the core idea of my thesis, I now want to take the opportunity to report a little more on my experiences with the crawler.
My bot is based on the open-source project JoBo1.4. Its website is more modest than necessary, as it says that JoBo follows links and downloads the [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Having recently outlined the core idea of my thesis, I now want to take the opportunity to report a little more on my experiences with the crawler.</p>
<p>My bot is based on the open-source project <a href="http://www.matuschek.net/jobo/" target="_blank">JoBo1.4</a>. Its website is more modest than necessary: it says that JoBo follows links and downloads pages, but in principle it is a great foundation for a crawler. Fetching the pages is not enough, though, because at best I could read out a chunk of source code without being able to infer the structure of the document, which holds essential information about the page. <a href="http://www.uni-hildesheim.de/de/11061.htm" target="_blank">AQUAINT</a> already implemented parsing the fetched HTML pages with <a href="http://jtidy.sourceforge.net/project-summary.html">JTidy</a> and converting them into a DOM, so that I can make effective use of the markup information.</p>
<p>The other packages and classes are arranged within this framework; they have different capabilities and analyze different parts of a blog:</p>
<p><strong>HTMLAnalyzer</strong></p>
<p>The HTML analyzer comes from AQUAINT; it analyzes and logs a good 100 markup features. If the page the crawler encounters turns out to be a blog, this class is called to log the corresponding features. As I wrote in the last post, the check for whether a page is a blog is done quite simply by testing whether the page&#8217;s header declares an available ALTERNATE feed. I am aware that this is hardly sufficient to say definitively whether a blog is present, because pages that are not blogs but simply editorially produced news now offer feeds as well. Nevertheless, I want to leave room for all kinds of possible blogs, because there are also excellent editorially produced blogs which are reviewed and come closer to the concept of a newspaper, but still belong to the category of blogs. Unfortunately the boundary is fluid, and without a catalogue, a restriction to a few typical blog hosts such as WordPress or Blogspot, or a qualitative analysis, it is really hard to say whether a page is a blog or online news. Unfortunately, this is an imprecision I will probably have to accept.</p>
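<p>The feed check described above could be sketched like this. It is a simplified Python illustration, not the AQUAINT or JTidy code: it uses the standard <code>html.parser</code> module, accepts RSS and Atom MIME types, and does not verify that the link element actually sits in the header. The names <code>AlternateFeedFinder</code>, <code>alternate_feeds</code> and <code>looks_like_blog</code> are assumptions for this example.</p>

```python
from html.parser import HTMLParser

FEED_TYPES = ("application/rss+xml", "application/atom+xml")


class AlternateFeedFinder(HTMLParser):
    # Collects href values of link elements declared as alternate feeds.

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = {name: (value or "") for name, value in attrs}
        rel = attributes.get("rel", "").lower().split()
        feed_type = attributes.get("type", "").lower()
        if "alternate" in rel and feed_type in FEED_TYPES and attributes.get("href"):
            self.feeds.append(attributes["href"])


def alternate_feeds(html_text):
    finder = AlternateFeedFinder()
    finder.feed(html_text)
    return finder.feeds


def looks_like_blog(html_text):
    # The heuristic from the text: at least one alternate feed is declared.
    return len(alternate_feeds(html_text)) != 0
```

<p>As noted in the text, this heuristic also accepts news sites that offer feeds, so it only narrows the collection and does not prove that a page is a blog.</p>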
<p><strong>Wrapper</strong></p>
<p>In addition, a wrapper is implemented, i.e. a class that calls a number of search engines to check the rank, inlinks etc. of the page. This method is not exactly popular with search engine providers because it generates a lot of load. To avoid getting my IP banned at the third page, I built a delay into the wrapper which can be adjusted so far that the wrapper&#8217;s page-request behaviour should be hardly distinguishable from that of a real user. It works without much trouble for most pages.</p>
<p><strong>Bloganalyzer</strong></p>
<p>A core class of the blog bot is the BlogAnalyzer. It analyzes all those criteria of a blog which HTML pages usually do not have and which the HTML analyzer therefore does not analyze. These include, for example, the number of comments on a page, the number of tags and categories, etc. The main problem here is that standards are missing, so my BlogAnalyzer often does not know what to look for. Initially I had the idea of analyzing the blogroll, where present. Although there are some really good initiatives, e.g. to establish <a href="http://microformats.org/wiki/xoxo-brainstorming" target="_blank">XOXO</a> as a standard, not all bloggers and blog providers adhere to them, so there is no standardized markup here. Blogroll analysis would offer interesting further possibilities, e.g. checking what authority the linked sites have, but for now it fails, quite trivially, because of the lack of standards.</p>
<p><strong>FeedAnalyzer</strong></p>
<p>Another essential class, and one I am particularly proud of, is my FeedAnalyzer. The FeedAnalyzer gives me, for example, information about the number of posts, the average intervals between the publication of individual posts, and the length of their titles. The possibilities become particularly interesting with comment feeds, from which e.g. the number of commenters can be determined. I only analyze RSS2.0 feeds because they are supposedly the most widespread. Atom is the newer and more detailed format, but it is not offered by all blogs, so RSS2.0 is usually the lowest common denominator. For all modelled features it is important that they work more or less well on the majority of blogs, because if a highly specialized method throws exceptions and crashes on 3/4 of the collection, I do not get meaningful numbers. RSS already offers quite a lot, is nicely structured as XML and can be analyzed well by a further parser based on <a href="http://www.saxproject.org/" target="_blank">SAX</a>. At this point I had also tried to work with <a href="https://rome.dev.java.net/" target="_blank">ROME</a>, which is a further development for all kinds of syndication formats, but unfortunately I failed: somehow ROME did not provide the functions I needed, so I hardly got proper results. The page was parsed nicely, but I could not access individual elements of the feeds. So currently I use ROME only to check which kind of feed is present before I send the FeedAnalyzer off, which is really a waste&#8230; If anyone knows ROME well, tips are highly appreciated!</p>
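<p>To illustrate the kind of features mentioned here (number of posts, average interval between posts, title length), the following is a minimal Python sketch. It is not the FeedAnalyzer itself: it uses <code>xml.etree.ElementTree</code> rather than a SAX parser, and the function name <code>analyze_rss</code> and the feature names in the returned dictionary are made up for this example.</p>

```python
import statistics
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def analyze_rss(feed_xml):
    # Parse an RSS 2.0 document and derive a few simple per-feed features.
    channel = ET.fromstring(feed_xml).find("channel")
    items = channel.findall("item") if channel is not None else []
    titles = [item.findtext("title", default="") for item in items]
    # pubDate uses the RFC 822 date format, e.g. "Mon, 25 May 2009 21:25:17 +0000".
    dates = sorted(
        parsedate_to_datetime(item.findtext("pubDate"))
        for item in items
        if item.findtext("pubDate")
    )
    # Gaps between consecutive posts, in hours.
    gaps = [
        (later - earlier).total_seconds() / 3600
        for earlier, later in zip(dates, dates[1:])
    ]
    return {
        "nr_items": len(items),
        "mean_title_length": statistics.mean(map(len, titles)) if titles else 0,
        "mean_interval_hours": statistics.mean(gaps) if gaps else None,
    }
```

<p>A feed with malformed XML makes <code>ET.fromstring</code> raise <code>ParseError</code>, so a real crawl would need to catch that and log the failure instead of aborting.</p>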
<p><strong>The thing about time&#8230;</strong></p>
<p>When given a collection of pages, the crawler requests quite a lot of additional pages, not only through the wrapper but above all through the BlogAnalyzer, i.e. for each page I give it to analyze, it accesses some server via HTTP GET a good 25 times. Accordingly, it produces a bit of load. I have reduced the load on other servers through throttling, but requesting 25 pages within 1.5 minutes already borders on commercial use, so it is only a question of time until my provider writes me angry emails. For some time now, the crawler has therefore been running only on the university&#8217;s server. This has further advantages, because the server is of course also considerably faster than my own machine at computing some of the features. Still, it may be of interest if I give some figures so that you can estimate the duration of a crawl session:</p>
<p>To analyze one page the crawler needs a good 1.5 minutes, requesting 25 further pages and pausing a few seconds between them. My small controlled collection of blogs is 300 pages thick, my uncontrolled one currently 2400. For the analysis of my controlled collection my crawler needs about one day from the server; for the large one it recently took 4-5 days. This duration is multiplied considerably when I switch on some of my test analyses, which are still very much in beta and are therefore not yet used for analyzing large collections.</p>
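<p>As a rough cross-check of these figures, the pure analysis time can be estimated from the per-page duration alone. The function name <code>crawl_days</code> is hypothetical, and the estimate is only a lower bound: it ignores timeouts, retries and slow hosts, and the durations reported above are in fact longer.</p>

```python
def crawl_days(nr_pages, minutes_per_page=1.5):
    # Pure analysis time in days, ignoring timeouts, retries and slow hosts.
    return nr_pages * minutes_per_page / (60 * 24)


controlled = crawl_days(300)     # 0.3125 days, i.e. 7.5 hours
uncontrolled = crawl_days(2400)  # 2.5 days
```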
<p>You can see below what a log file from the server can look like. The analysis did not succeed entirely cleanly, which in this case was due to unclean XML in the feed.</p>
<p><em>http://bulgariana.blogspot.com/ 19 Feb 2009 16:31:27 GMT<br />
text/html; charset=UTF-8</em></p>
<p><em>***Alternate feed found. Should be a Blog here!<br />
***Starting to log HTML features!</em></p>
<p><em>***Starting Stopword Analysis</em></p>
<p><em>***Starting Wrapper with 11 sec SleepTime<br />
***Finishing Wrapper</em></p>
<p><em>***Starting BlogAnalyzer<br />
***Finishing BlogAnalyzer</em></p>
<p><em>***Trying to find Alternate Feeds<br />
***Alternate Feed found: http://bulgariana.blogspot.com/feeds/posts/default<br />
***Feed Type: rss_2.0<br />
***Alternate Feed found: http://bulgariana.blogspot.com/feeds/posts/default?alt=rss<br />
***Feed Type: rss_2.0<br />
***More than single alternate RSS2.0 feed found: 2<br />
***Applying heuristics.<br />
***Got single alternate feed: http://bulgariana.blogspot.com/feeds/posts/default</em></p>
<p><em>***Starting to log RSS features<br />
19.02.2009 17:32:49 hellmann.quality.AnalyzedURL setFeature<br />
SCHWERWIEGEND: Fehler bei SetFeature: null<br />
***Feed successfully analyzed.<br />
***Processing HTMLLinks heuristically to find Comments Feed<br />
*Error parsing CommentsFeed: Invalid XML: Error on line 42: The entity name must immediately follow the &#8216;&amp;&#8217; in the entity reference.<br />
*Error parsing CommentsFeed: Invalid document<br />
***RSS Comments Feed Determination failed</em></p>
<p>-r-</p>
  </div>]]></content:encoded>
			<wfw:commentRss>http://rafazwonull.wordpress.com/2009/02/22/last-und-lust-des-crawling/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/86136f24d90942ff03d4256f453e2653?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Rafa</media:title>
		</media:content>
	</item>
		<item>
		<title>Quality Models and Data Mining in Blogs</title>
		<link>http://rafazwonull.wordpress.com/2009/02/17/qualitatsmodelle-und-data-mining-in-blogs/</link>
		<comments>http://rafazwonull.wordpress.com/2009/02/17/qualitatsmodelle-und-data-mining-in-blogs/#comments</comments>
		<pubDate>Tue, 17 Feb 2009 00:57:12 +0000</pubDate>
		<dc:creator>Rafa</dc:creator>
				<category><![CDATA[Datamining]]></category>
		<category><![CDATA[In eigener Sache]]></category>
		<category><![CDATA[Netzkultur]]></category>
		<category><![CDATA[Studium]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[Crawling]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Meta-Blogging]]></category>
		<category><![CDATA[Web2.0]]></category>

		<guid isPermaLink="false">http://rafazwonull.wordpress.com/?p=191</guid>
		<description><![CDATA[I do not need to blog in my spare time, because I already deal with blogs at university until I drop. I told Marcel about it, and Stefan too, and everyone thought I should blog about my blog master's thesis. Meta-blogging, so to speak. I put this off for a long time, but by now [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p style="text-align:left;">Ich muss nicht in meiner Freizeit bloggen, denn mit Blogs beschäftige ich mich schon jetzt auch während der Uni bis zum Umfallen. Ich habe Marcel davon erzählt und auch <a href="http://www.smartens.eu/blog/" target="_blank">Stefan</a>, und alle waren der Meinung, dass ich über meine Blog-Magisterarbeit bloggen soll. Meta-Blogging sozusagen. Ich habe das lange vor mir hergeschoben, aber inzwischen will mir niemand mehr zuhören (ich scheine ein ziemlicher Fachidiot geworden zu sein in den letzten Wochen), und ich fange lieber an, das jetzt als Beitrag zu verarbeiten, bevor ich tatsächlich Freunde verliere.</p>
<p>Ever since Robert Basic sold his <a href="http://www.basicthinking.de/blog/" target="_blank">blog</a>, it has finally become clear in Germany too that high-quality blogs can carry both economic and social weight. In the States this trend has been around for a long time: <a href="http://www.huffingtonpost.com/" target="_blank">Huffingtonpost</a> has been number 1 on Technorati for (what feels like) as long as the Internet has existed. Blogs have thus become a more or less permanent part of our media landscape, and this trend seems to be attributable above all to the particularly high-quality and frequently referenced <a href="http://www.blogninja.com/hicss05.blogconv.pdf" target="_blank">A-listers</a> (also known as power bloggers).</p>
<p>Our blog is not an A-lister. It may not even be particularly high-quality. Still, it surely has some characteristics that set it apart, in terms of quality, from worse blogs. Perhaps it is a certain update frequency, although blogs can also be updated with spam. Possibly it is a particular range of topics, or a certain care in how it was designed and how the posts were written. Perhaps it is the length of the posts. Or the length of all posts combined. Or a combination of several criteria. Nobody can say for sure, and although most people have certain hunches (&#8220;NO! It can never be the number of line breaks!&#8221;), what can be said with certainty so far does not go much beyond those hunches.</p>
<p>This is where my work comes in. Nobody wants to check thousands of blogs for shared criteria, at least not by hand (even though constructing guiding questions and answering them for 7 example blogs would be the normal approach at our university). What you need for this is a decent bot that walks through a given collection, follows links, and logs all interesting features. That is exactly what happens here.</p>
<p>I am certainly not a good programmer. Others are better, and I would like to thank the open-source principle and everyone involved with it in advance: my crawler is based on Java1.6 and, on the crawling side, relies heavily on components of the open-source tool <a href="http://www.matuschek.net/jobo/" target="_blank">JoBo</a>, which was implemented, in the form of a search engine, within the <a href="http://www.uni-hildesheim.de/de/11061.htm" target="_blank">AQUAINT project</a> at the University of Hildesheim for the automatic quality assessment of HTML. I use some of the features and classes modeled in AQUAINT and, building on a few other freely available classes, have put together a crawler with the following structure:</p>
<div id="attachment_192" class="wp-caption aligncenter" style="width: 430px"><img class="size-full wp-image-192" title="bc_architektur" src="http://rafazwonull.files.wordpress.com/2009/02/folie2.jpg?w=420&#038;h=315" alt="bc_architektur" width="420" height="315" /><p class="wp-caption-text">Blog crawler architecture in Java1.6</p></div>
<p style="text-align:left;">The documents the crawler encounters are converted into a <a href="http://de.wikipedia.org/wiki/Document_Object_Model">DOM</a> using the org.w3c packages, provided they are HTML documents. I do not follow documents that are not HTML, and I also skip documents that do not declare any feeds in their header. If you assume that being able to subscribe to a blog via a feed is an essential criterion for a blog, you can apply this first restriction to keep the collection clean, so that in the end the following lines determine whether a site is followed or not. In our case, the relevant part of the header looks like this:</p>
<pre>&lt;<span class="start-tag">html</span><span class="attribute-name"> xmlns</span>=<span class="attribute-value">"http://www.w3.org/1999/xhtml" </span><span class="attribute-name">dir</span>=<span class="attribute-value">"ltr" </span><span class="attribute-name">lang</span>=<span class="attribute-value">"de"</span>&gt;
&lt;<span class="start-tag">head</span>&gt;
&lt;<span class="start-tag">title</span>&gt; Rafazwonull vs Joe I/O&lt;/<span class="end-tag">title</span>&gt;
	&lt;<span class="start-tag">meta</span><span class="attribute-name"> http-equiv</span>=<span class="attribute-value">"Content-Type" </span><span class="attribute-name">content</span>=<span class="attribute-value">"text/html; charset=UTF-8" </span><span class="error"><span class="attribute-name">/</span></span>&gt;
	&lt;<span class="start-tag">style</span><span class="attribute-name"> type</span>=<span class="attribute-value">"text/css" </span><span class="attribute-name">media</span>=<span class="attribute-value">"screen"</span>&gt;
		@import url( http://s3.wordpress.com/wp-content/themes/pub/benevolence/style.css?m=1232213194b );
	&lt;/<span class="end-tag">style</span>&gt;
	&lt;<span class="start-tag">link</span><span class="attribute-name"> rel</span>=<span class="attribute-value">"alternate" </span><span class="attribute-name">type</span>=<span class="attribute-value">"application/rss+xml" </span><span class="attribute-name">title</span>=<span class="attribute-value">"RSS 2.0" </span><span class="attribute-name">href</span>=<span class="attribute-value">"http://rafazwonull.wordpress.com/feed/" </span><span class="error"><span class="attribute-name">/</span></span>&gt;
	&lt;<span class="start-tag">link</span><span class="attribute-name"> rel</span>=<span class="attribute-value">"pingback" </span><span class="attribute-name">href</span>=<span class="attribute-value">"http://rafazwonull.wordpress.com/xmlrpc.php" </span><span class="error"><span class="attribute-name">/</span></span>&gt;</pre>
<p style="text-align:left;">What do these lines mean for the analysis? From the header alone, the following statements can already be made:</p>
<ul>
<li>It could be a blog, because we can subscribe to it.</li>
<li>It is a blog that supports only the RSS2.0 format.</li>
<li>In our case the title is 23 characters long, which becomes relevant when you consider that there are spam blogs with titles such as &#8220;Zeichnen-Backen-Basteln-mit-Fingermalfarben-und-Salzteig&#8221;&#8230;</li>
</ul>
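<p>As a rough illustration of the feed check described above, here is a minimal sketch in Java. It assumes the page has already been parsed into an org.w3c.dom Document. The class and method names (FeedHeaderCheck, declaresFeed, titleLength, sampleHeader) are made up for this example and are not taken from JoBo, AQUAINT or the actual crawler, and the sample header is built by hand rather than crawled. Accepting Atom links as feeds as well is an additional assumption.</p>
<pre>
```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper for illustration; not part of JoBo, AQUAINT or the real crawler. */
final class FeedHeaderCheck {

    /** True if any link element has rel="alternate" and an RSS or Atom type. */
    static boolean declaresFeed(Document doc) {
        NodeList links = doc.getElementsByTagName("link");
        for (int i = 0; i != links.getLength(); i++) {
            Element link = (Element) links.item(i);
            String rel = link.getAttribute("rel").trim();
            String type = link.getAttribute("type").trim();
            if (!rel.equalsIgnoreCase("alternate")) {
                continue; // e.g. rel="pingback" is not a feed declaration
            }
            if (type.equalsIgnoreCase("application/rss+xml")
                    || type.equalsIgnoreCase("application/atom+xml")) {
                return true;
            }
        }
        return false;
    }

    /** Length of the raw text of the first title element, or 0 if there is none. */
    static int titleLength(Document doc) {
        NodeList titles = doc.getElementsByTagName("title");
        if (titles.getLength() == 0) {
            return 0;
        }
        return titles.item(0).getTextContent().length();
    }

    /** Hand-built DOM that mirrors the header shown above (an assumption, not crawled data). */
    static Document sampleHeader() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
        Element html = doc.createElement("html");
        Element head = doc.createElement("head");
        Element title = doc.createElement("title");
        title.setTextContent(" Rafazwonull vs Joe I/O"); // leading blank as in the header
        Element feed = doc.createElement("link");
        feed.setAttribute("rel", "alternate");
        feed.setAttribute("type", "application/rss+xml");
        feed.setAttribute("href", "http://rafazwonull.wordpress.com/feed/");
        Element pingback = doc.createElement("link");
        pingback.setAttribute("rel", "pingback");
        pingback.setAttribute("href", "http://rafazwonull.wordpress.com/xmlrpc.php");
        head.appendChild(title);
        head.appendChild(feed);
        head.appendChild(pingback);
        html.appendChild(head);
        doc.appendChild(html);
        return doc;
    }

    public static void main(String[] args) {
        Document doc = sampleHeader();
        System.out.println("follow site: " + declaresFeed(doc));
        System.out.println("title length: " + titleLength(doc));
    }
}
```
</pre>
<p>Counting the raw title text including the leading blank, as it appears in the header above, gives 23 characters for this blog. A real check would probably also have to cope with rel attributes that carry several values, which this sketch ignores.</p>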
<p>By now I have looked at blogs in this and similar ways and have modeled around 150 features in Java, some taken from the literature, some from AQUAINT, and some developed myself. They can be roughly grouped into the following areas:</p>
<ul>
<li>File measures (e.g. file size)</li>
<li>HTML measures (e.g. &lt;br&gt; tags)</li>
<li>Table measures</li>
<li>List measures (e.g. &lt;li&gt; tags)</li>
<li>Color measures</li>
<li>Ratio measures (i.e. values calculated from features, or combinations of them)</li>
<li>Linguistic measures (i.e. number of words and punctuation marks)</li>
<li>RSS measures (i.e. everything that can be learned from the feeds)</li>
<li>Reputation measures (i.e. everything that can be learned through <a href="http://de.wikipedia.org/wiki/Hubs_und_Authorities" target="_blank">HITS</a> or <a href="http://de.wikipedia.org/wiki/PageRank">PageRank</a>-based methods)</li>
</ul>
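<p>To give an idea of what a few of these measures could look like in code, here is a small sketch. It is not the real feature set: the class SimpleBlogMeasures and its methods are invented for this illustration, the sample document is built by hand, and counting words by splitting on whitespace is a simplifying assumption.</p>
<pre>
```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical feature extractors; illustrative only, not the roughly 150 real features. */
final class SimpleBlogMeasures {

    /** HTML or list measure: how often a given tag occurs in the document. */
    static int tagCount(Document doc, String tagName) {
        return doc.getElementsByTagName(tagName).getLength();
    }

    /** Ratio measure: share of one tag among all elements, 0.0 for an empty document. */
    static double tagShare(Document doc, String tagName) {
        int all = doc.getElementsByTagName("*").getLength();
        return all == 0 ? 0.0 : (double) tagCount(doc, tagName) / all;
    }

    /** Linguistic measure: number of whitespace-separated words in a piece of text. */
    static int wordCount(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Hand-built sample: html, body, one list with two items, one paragraph with a break. */
    static Document sampleBody() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element list = doc.createElement("ul");
        list.appendChild(doc.createElement("li"));
        list.appendChild(doc.createElement("li"));
        Element paragraph = doc.createElement("p");
        paragraph.appendChild(doc.createElement("br"));
        body.appendChild(list);
        body.appendChild(paragraph);
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }

    public static void main(String[] args) {
        Document doc = sampleBody();
        System.out.println("br tags: " + tagCount(doc, "br"));
        System.out.println("li tags: " + tagCount(doc, "li"));
        System.out.println("li share: " + tagShare(doc, "li"));
        System.out.println("words: " + wordCount("Blogs can also be updated with spam"));
    }
}
```
</pre>
<p>Each method returns a single number, so the results can be written out as one row per blog and one column per feature, which is the shape most data mining tools expect.</p>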
<p>Taken in isolation, none of these criteria is likely to say much about what characterizes high-quality blogs that have a large readership, high-quality articles and a good reputation. In combination, however, and across a collection of, say, a good thousand blogs, comparing and checking meaningful structures lets us determine correlations, cluster features sensibly, and so on; in short, everything that data mining can uncover in seemingly unrelated data. Some of these relationships are surely obvious, e.g. the one between authority and inlinks. For others, only the statistical analysis of large numbers of records will do.</p>
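<p>One simple way to check such a relationship between two feature columns is a correlation coefficient. The following sketch computes Pearson's r; choosing Pearson is an assumption made for this example, since the text above does not say which statistic is used, and the class name FeatureCorrelation as well as the numbers in main are made up purely for illustration and are not measured data.</p>
<pre>
```java
/** Hypothetical helper: Pearson correlation between two feature columns. */
final class FeatureCorrelation {

    /** Returns r between -1 and 1, or NaN if one of the columns has zero variance. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("columns must be non-empty and equally long");
        }
        int n = x.length;

        // First pass: column means.
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i != n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // Second pass: sums of products of the deviations from the means.
        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i != n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Made-up toy values, one entry per blog; not results from the crawl.
        double[] inLinks = {1.0, 2.0, 3.0, 4.0};
        double[] authority = {2.0, 4.0, 6.0, 8.0};
        System.out.println("r = " + pearson(inLinks, authority));
    }
}
```
</pre>
<p>A value close to 1 or -1 points to a strong linear relationship, a value close to 0 to none. With around 150 features it would make sense to compute this for every pair of columns and look only at the strongest ones.</p>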
<p>Let's see how this goes; maybe I will write a bit more about it at some point and keep you posted.</p>
<p>That's it for now. Good night.</p>
<p>-r-</p>
  </div>]]></content:encoded>
			<wfw:commentRss>http://rafazwonull.wordpress.com/2009/02/17/qualitatsmodelle-und-data-mining-in-blogs/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/86136f24d90942ff03d4256f453e2653?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Rafa</media:title>
		</media:content>

		<media:content url="http://rafazwonull.files.wordpress.com/2009/02/folie2.jpg" medium="image">
			<media:title type="html">bc_architektur</media:title>
		</media:content>

	</item>
		<item>
		<title>Why I Have Been Blogging So Rarely Lately</title>
		<link>http://rafazwonull.wordpress.com/2009/02/17/warum-ich-in-letzter-zeit-so-selten-blogge/</link>
		<comments>http://rafazwonull.wordpress.com/2009/02/17/warum-ich-in-letzter-zeit-so-selten-blogge/#comments</comments>
		<pubDate>Mon, 16 Feb 2009 23:34:04 +0000</pubDate>
		<dc:creator>Rafa</dc:creator>
				<category><![CDATA[In eigener Sache]]></category>

		<guid isPermaLink="false">http://rafazwonull.wordpress.com/?p=187</guid>
		<description><![CDATA[Some of you may be wondering why I have been blogging so little lately. Perhaps you think I am devoting my attention to new trends such as microblogging, but that is not really the case. I recently started writing my Magister thesis (whose topic will also be the topic of the next blog post), and Joe is busy building his career again [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Some of you may be wondering why I have been blogging so little lately. Perhaps you think I am devoting my attention to new trends such as microblogging, but that is not really the case. I recently started writing my Magister thesis (whose topic will also be the topic of the next blog post), and Joe is busy building his career in Hamburg again. So we have things to do. Really. We have not turned away from you. Never!</p>
<p>-r-</p>
  </div>]]></content:encoded>
			<wfw:commentRss>http://rafazwonull.wordpress.com/2009/02/17/warum-ich-in-letzter-zeit-so-selten-blogge/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/86136f24d90942ff03d4256f453e2653?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Rafa</media:title>
		</media:content>
	</item>
		<item>
		<title>Wassup 2008</title>
		<link>http://rafazwonull.wordpress.com/2008/10/25/wassup-2008/</link>
		<comments>http://rafazwonull.wordpress.com/2008/10/25/wassup-2008/#comments</comments>
		<pubDate>Sat, 25 Oct 2008 14:21:24 +0000</pubDate>
		<dc:creator>Rafa</dc:creator>
				<category><![CDATA[Netzkultur]]></category>
		<category><![CDATA[Politik]]></category>
		<category><![CDATA[USA]]></category>
		<category><![CDATA[Wahl]]></category>
		<category><![CDATA[Wahlwerbung]]></category>
		<category><![CDATA[Werbung]]></category>

		<guid isPermaLink="false">http://rafazwonull.wordpress.com/?p=184</guid>
		<description><![CDATA[Thanks to Marcel for the campaign-ad find of the week, which he just tweeted my way ( &#8230; or of the month &#8230; or of the last two months. Mea culpa. My studies come first. ).

-r-
       ]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Thanks to Marcel for the campaign-ad find of the week, which he just tweeted my way ( &#8230; or of the month &#8230; or of the last two months. Mea culpa. My studies come first. ).</p>
<p><span style="text-align:center; display: block;"><a href="http://rafazwonull.wordpress.com/2008/10/25/wassup-2008/"><img src="http://img.youtube.com/vi/Qq8Uc5BFogE/2.jpg" alt="" /></a></span></p>
<p>-r-</p>
  </div>]]></content:encoded>
			<wfw:commentRss>http://rafazwonull.wordpress.com/2008/10/25/wassup-2008/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/86136f24d90942ff03d4256f453e2653?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Rafa</media:title>
		</media:content>

		<media:content url="http://img.youtube.com/vi/Qq8Uc5BFogE/2.jpg" medium="image" />
	</item>
	</channel>
</rss>