different blog stats

I recently stumbled upon a tool called AntConc in an archive folder on an old external hard drive and couldn’t help but analyze my blog with it.

So, what is AntConc anyway? It’s a linguistic tool for analyzing corpora (text collections) by building concordances, which can in turn be analyzed for word frequency, collocations and things like that. So I thought it might be interesting to get some stats other than the views/referrer/country stuff that WordPress brings along anyway.

Step one: collecting texts

I put all my blog posts into a single text file, which can be analyzed in the tool. Not the hardest of all tasks.
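
For anyone who would rather script this step than copy and paste, here is a minimal sketch. It assumes each post has already been exported as a plain `.txt` file into one folder; the function name `build_corpus` and the folder layout are made up for illustration and are not part of AntConc or WordPress.

```python
from pathlib import Path


def build_corpus(post_dir: Path, corpus_file: Path) -> int:
    """Concatenate every .txt file in post_dir into corpus_file.

    Returns the number of posts written. Keep corpus_file outside
    post_dir so it is not picked up as a post on the next run.
    """
    posts = sorted(post_dir.glob("*.txt"))  # sorted for a stable order
    with corpus_file.open("w", encoding="utf-8") as out:
        for post in posts:
            out.write(post.read_text(encoding="utf-8").strip())
            out.write("\n\n")  # blank line between posts
    return len(posts)
```

Calling `build_corpus(Path("posts"), Path("corpus.txt"))` would then produce a single file that can be opened in the tool.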

Step two: gathering data for comparison

I wanted to do an analysis in terms of lemmata, you know, those main dictionary entries, so that, for example, all the first-, second- and third-person singular and plural forms of the verb are counted under “be”. Luckily there is a lemma list based on the British National Corpus (BNC) available on the AntConc homepage. As there is a word frequency list of the BNC available as well, I just loaded that into the tool, too. This allows me to compare the frequency of words in my blog with the general frequency of these words in the BNC.
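
To make the idea of a lemma list concrete, here is a small sketch of how such a list can be turned into a lookup table. It assumes one lemma per line in the form `lemma -> form1, form2, ...`; that layout is an assumption, so check the file you actually download. The function name `parse_lemma_list` is hypothetical.

```python
def parse_lemma_list(lines):
    """Map every inflected form to its lemma, e.g. 'were' -> 'be'.

    Assumes lines shaped like 'be -> am, are, is' (an assumed format).
    Lines without '->' are skipped.
    """
    form_to_lemma = {}
    for line in lines:
        line = line.strip()
        if not line or "->" not in line:
            continue
        lemma, forms = line.split("->", 1)
        lemma = lemma.strip().lower()
        form_to_lemma[lemma] = lemma  # the lemma maps to itself
        for form in forms.split(","):
            form = form.strip().lower()
            if form:
                form_to_lemma[form] = lemma
    return form_to_lemma


sample = [
    "be -> am, are, is, was, were, been, being",
    "test -> tests, tested, testing",
]
lemmas = parse_lemma_list(sample)
print(lemmas["were"])     # be
print(lemmas["testing"])  # test
```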

Step three: let the tool do what it’s meant for

Let the tool analyze my data. I am not really interested in a keyword-in-context analysis, so I skipped that and just had a look at the frequency of words.

Overall I wrote 30009 words made up of 2833 different lemmata. And the 5 most used words on my blog are: the, be, a, and, to.
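
The two numbers above are a token count and a count of distinct lemmata. A rough sketch of how such figures can be computed outside the tool is shown below. The tokenizer (lowercase runs of letters with an optional straight apostrophe) is a simplifying assumption and will not match AntConc's token definition exactly, so the totals would differ somewhat; `lemma_frequencies` is a made-up name.

```python
import re
from collections import Counter


def lemma_frequencies(text, form_to_lemma):
    """Return (number of tokens, Counter of lemma frequencies).

    Words missing from form_to_lemma are counted under their own form.
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(form_to_lemma.get(tok, tok) for tok in tokens)
    return len(tokens), counts


# Tiny illustrative lemma map, not the real BNC-based list.
form_to_lemma = {"is": "be", "was": "be", "tests": "test"}
total, counts = lemma_frequencies(
    "The test is a test. The tests was fine.", form_to_lemma
)
print(total)        # 9 tokens
print(len(counts))  # 5 different lemmata
```

`counts.most_common(5)` would then give the five most frequent lemmata.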

Step four: refine the results

Since this is not particularly interesting or surprising, I decided to go through the word list the tool produced and kick out all the function words, so that I could have a look at the lexical words I used. And here I present to you the top 25 lexical words I have written on this blog so far:
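
The filtering can be sketched in a few lines as well. This version keeps each word's rank in the full list and lets words with the same frequency share a rank; that tie handling is an assumption about how the ranks were produced, and both the name `lexical_top` and the tiny function-word set are purely illustrative.

```python
def lexical_top(counts, function_words, n=25):
    """Return up to n (overall_rank, frequency, word) tuples for lexical words.

    Ranks refer to the full list including function words; words with
    equal frequency share the rank of the first of them.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    result = []
    rank, prev_freq = 0, None
    for position, (word, freq) in enumerate(ordered, start=1):
        if freq != prev_freq:
            rank, prev_freq = position, freq
        if word not in function_words:
            result.append((rank, freq, word))
    return result[:n]


# Toy counts for illustration only, not the real blog data.
counts = {"the": 10, "be": 8, "test": 5, "a": 5, "day": 3}
print(lexical_top(counts, {"the", "be", "a", "and", "to"}))
# [(3, 5, 'test'), (5, 3, 'day')]
```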

Overall rank   Frequency   Word
14             151         test
27             76          day
46             46          course
48             45          team
51             43          time
53             42          post
59             36          blog
60             35          people
60             35          thing
72             31          script
74             30          twitter
77             29          automation
79             28          communication
79             28          point
84             27          tester
93             26          experience
93             26          story
93             26          word
98             25          bbst
98             25          part
107            22          game
107            22          istqb
112            20          approach
112            20          group
112            20          way

Step five: compare the results

Comparing these frequencies with the general frequencies of the same words in the BNC yielded no big surprise: I used these words more frequently than average. Since this blog is about a niche topic, that is to be expected. I will spare you the keyness values used for that comparison, as they are just numbers telling you what I just told you. Believe me. It’s the truth.
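
For the curious, keyness is commonly computed as a log-likelihood score that compares a word's frequency in the study corpus with its frequency in a reference corpus. The sketch below shows that standard formula. Whether it is exactly the statistic behind the numbers mentioned here is an assumption, and the reference frequency in the example is invented; only 151 and 30009 come from this post, and 100 million is the approximate size of the BNC.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word across two corpora.

    freq_* are the word's counts, size_* the corpus sizes in tokens.
    The score is 0 when both relative frequencies are identical and
    grows as they diverge (in either direction).
    """
    combined = freq_study + freq_ref
    total = size_study + size_ref
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    score = 0.0
    if freq_study > 0:
        score += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score


def is_overused(freq_study, size_study, freq_ref, size_ref):
    """True if the word is relatively more frequent in the study corpus."""
    return freq_study / size_study > freq_ref / size_ref


# "test": 151 of 30009 words on the blog; 15000 is a made-up BNC count.
print(log_likelihood(151, 30009, 15000, 100_000_000))
print(is_overused(151, 30009, 15000, 100_000_000))  # True
```

The score itself says nothing about direction, which is why the second helper checks whether the word is over- or underused relative to the reference corpus.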

Step six: interpret the results

Now any kind of data is just data without interpretation. Most of the data was not overly surprising to me, given the topic of the blog, but a few things stood out. If you had asked me beforehand, I would have expected communication to be higher up the ranks, given that it is one of the sub-themes I try to cover here, so maybe I should put some more effort into that. ISTQB is more prominent than I thought, given that I don’t agree too much with the testing school they promote. Then again, speaking out against it and writing a comparison between the BBST and ISTQB foundation courses pushed those numbers up the board. What surprised me a bit was that user didn’t show up until #43, since I usually stress the importance of the user’s perspective when talking about testing and development in general.