BioStar: Is the BioStar fading? An Annual Follow-Up.

Posted on March 11, 2012

One year has passed since I asked the provocative question: “Is the BioStar fading?”. At the time I concluded that the BioStar was not fading, even though the data suggested that it was not flourishing either. I have now crawled the BioStar website again, updated last year’s statistical plots with current data, and expanded the explanations of how I interpret them. This blog post presents the results.

Introduction

The first analysis of the BioStar forum’s contents was triggered by my impression that the overall quality of the questions and answers provided by the BioStar community was decreasing. With the help of some simple statistics and visualisation techniques I tried to evaluate whether this claim can be backed up formally; the prospect of carrying out such an analysis also seemed intriguing in itself.

Evaluating the quality of questions and answers in a forum is not an easy task, and the metrics that I chose last year have drawn criticism. Whilst there might be better metrics for determining the quality of posts on the forum, I explicitly avoided common metrics such as the total number of users, posts per month, etc., which measure sheer quantity rather than quality. The five metrics I chose are oriented towards individual forum posts instead: they focus on the number of votes on questions and answers and on the number of answers provided.

Both web-crawling and statistics scripts are provided in this blog post too, so that it is possible for anyone to reproduce my work and to re-use the scripts for carrying out evaluations on other interesting metrics.

Results

Five graphs have been created using the statistical computing software R from data provided by the web-crawling script. Each graph plots values that have been binned per month of BioStar’s existence. The light grey lines denote the progression of the monthly average, and the purple lines indicate the trend of the data points over time.
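As a rough sketch of the monthly binning, the following Ruby snippet numbers the months from the earliest one in the data and averages the values that fall into each month. The names month_index and monthly_averages are hypothetical, and the sample rows are invented for illustration; the plots themselves are produced by the R script in the appendix.

```ruby
# Hypothetical sample rows: [year, month, value]. The numbers are made up
# for illustration and are not taken from the crawled data.
SAMPLE_ROWS = [
  [2009, 10, 4],
  [2009, 10, 2],
  [2009, 11, 3],
  [2010, 1, 1],
  [2010, 1, 2]
].freeze

# Month index counted from the earliest month in the data, starting at 1.
def month_index(year, month, first_year, first_month)
  (year - first_year) * 12 + (month - first_month) + 1
end

# Returns a hash that maps each month index to the mean value of that month.
def monthly_averages(rows)
  first_year, first_month = rows.map { |y, m, _| [y, m] }.min
  bins = rows.group_by { |y, m, _| month_index(y, m, first_year, first_month) }
  averages = {}
  bins.each do |index, members|
    values = members.map { |_, _, v| v }
    averages[index] = values.inject(:+) / values.length.to_f
  end
  averages
end

p monthly_averages(SAMPLE_ROWS)
```

Months without any questions simply have no bin in this sketch, so the average line would skip them.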

In the following I will use the terms “voting-score” and “score” when I am referring to the integer that is just labelled “Vote” on the BioStar forum. This “Vote” is actually the result of a voting process, where BioStar members can either make this integer go up by one (“up-vote”) or down by one (“down-vote”) by placing their vote on a per question/per answer basis.

Voting-Score of Questions

Voting-scores are an indicator of the activity of BioStar members within the forum as well as a gauge of the community’s mood. Highly positive scores show active participation by many forum members and indicate that the up-voted questions were welcomed. Negative scores also show active participation, but they express disapproval or dissent. The score a question received is not an absolute measure of participation though, because equally many members may have voted a question “up” and “down”, which results in a score of zero even though many people took part in the voting. I can only say from my own experience that most questions receive predominantly either “up” or “down” votes and that the score of a particular question very seldom swings between positive and negative values.
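The caveat about scores of zero can be illustrated with a small sketch. It assumes that up-votes and down-votes were known separately, which is not the case for the crawled data, where only the net score is visible. The function names and vote counts are hypothetical.

```ruby
# Net voting-score as displayed on a post: each up-vote adds one,
# each down-vote subtracts one.
def voting_score(up_votes, down_votes)
  up_votes - down_votes
end

# Total number of votes cast, which the displayed score does not reveal.
def votes_cast(up_votes, down_votes)
  up_votes + down_votes
end

# Hypothetical vote counts, chosen only to illustrate the point.
contested = { up: 6, down: 6 }
ignored   = { up: 0, down: 0 }

[contested, ignored].each do |q|
  puts "score=#{voting_score(q[:up], q[:down])} votes=#{votes_cast(q[:up], q[:down])}"
end
```

Both hypothetical questions end up with a score of zero, although twelve votes were cast on the first one and none on the second.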

[Figure: Voting-Score of Questions]

Number of Answers per Question

Engagement of users within the BioStar community is evaluated by the number of answers that are provided per question. A higher number of answers does not indicate better quality as such, but it expresses the involvement of users regarding forum content. Unlike the voting-score, the number of answers can only increase, which makes this a better measure of member participation.

[Figure: Number of Answers per Question]

Minimal Voting-Score that an Answer Received per Question

The lowest voting-score among a question’s answers gauges the quality of the least popular answer. A high score for the least popular answer means that despite the answer’s bad ranking, it is still a valuable contribution. A lower score, and especially a negative score, means on the other hand that the answer did not meet the expectations of the community.

[Figure: Minimal Answer Voting-Score per Question]

Maximal Voting-Score that an Answer Received per Question

The most popular answers receive a high voting-score, which expresses the community’s approval of the answer’s quality. It also shows the involvement of BioStar members, since high scores can essentially only be achieved when many people are actually voting. The lower the score of the top answer, the less accepted it is within the community, either due to disapproval or a sheer lack of members participating in the voting.

[Figure: Maximal Answer Voting-Score per Question]

Average Answer Voting-Score per Question

Using the average voting-score provided for answers per question, it is possible to account for the overall quality of the provided answers. If a question has many answers with high scores and only a few answers with low scores, then the average is still going to be high. On the other hand, a low average results from questions with answers that have received mostly low voting-scores.
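To make the three answer-based metrics concrete, here is a minimal sketch that summarises the answer scores of a single question by their minimum, maximum and average. The function name answer_summary and the example scores are hypothetical; the crawler in the appendix computes the same three values per question.

```ruby
# Summarises the voting-scores of all answers to one question.
# Returns nil for a question without answers, mirroring the crawler,
# which skips such questions.
def answer_summary(scores)
  return nil if scores.empty?
  {
    min: scores.min,
    max: scores.max,
    avg: scores.inject(:+) / scores.length.to_f
  }
end

# Hypothetical scores for the answers to two questions.
mostly_good = [5, 4, 4, -1]
mostly_poor = [0, 0, 1, -2]

p answer_summary(mostly_good)
p answer_summary(mostly_poor)
```

In the first hypothetical question a single down-voted answer pulls the minimum below zero while the average stays high; in the second, the average is low because most answers received low scores.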

[Figure: Average Answer Voting-Score per Question]

Discussion & Conclusion

Each of the presented metrics shows a downward trend, which can be interpreted as a decrease in quality in the sense of the interpretations described above. Questions on the BioStar forum are receiving fewer answers over time and the scores given by the BioStar community are dropping too. The trend lines across all graphs suggest that the BioStar will indeed fade eventually.
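For readers who want to see how such a trend is read off, the following sketch fits an ordinary least-squares line and checks the sign of its slope. The trend lines in the graphs come from R’s lm function; the function linear_trend below is a hypothetical stand-in, and the sample values are invented.

```ruby
# Ordinary least-squares fit of y = intercept + slope * x.
# Returns [slope, intercept]; assumes at least two distinct x values.
def linear_trend(xs, ys)
  n = xs.length.to_f
  mean_x = xs.inject(:+) / n
  mean_y = ys.inject(:+) / n
  sxy = xs.zip(ys).map { |x, y| (x - mean_x) * (y - mean_y) }.inject(:+)
  sxx = xs.map { |x| (x - mean_x)**2 }.inject(:+)
  slope = sxy / sxx
  [slope, mean_y - slope * mean_x]
end

# Hypothetical monthly values; not taken from the crawled data.
months = [1, 2, 3, 4, 5]
scores = [5, 4, 4, 2, 2]

slope, intercept = linear_trend(months, scores)
puts "slope=#{slope} intercept=#{intercept}"
puts(slope < 0 ? 'downward trend' : 'no downward trend')
```

A negative slope corresponds to a falling trend line; how steeply it falls says nothing about whether the decline is statistically significant.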

However, as an active member of the BioStar forum myself, I find that the overall quality of the forum has improved over the last months. It might be the case that the initial euphoria encouraged people to participate at the cost of the forum’s quality: members were keen to collect as many votes on their questions and answers as possible in order to raise their BioStar reputation score and to unlock the many available member-badges (see also this feed by Paulo Nuin). Since the forum has become quieter now, I get the impression that answers are less rushed and that votes are distributed with better aim.

I hope that the crawling script will still work with the BioStar next year, when I am going to conduct the same analysis again. Until then we can only wait and see what happens to this fairly young bioinformatics forum.

Appendix: Data Retrieval and Evaluation Scripts

The Ruby script for crawling BioStar has changed only marginally since last year: I increased the upper bound of the unique identifier in the URL to match the most recent posts on BioStar, and the remaining changes are fixes that were required for running the script under Mac OS X Lion. The R script for the evaluation was just adapted to account for the new filename of the crawled data, which is available as biostar20120306.tsv (change suffix from _tsv.doc to .tsv).

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'date'

STDOUT.sync = true

def strip(s)
    # String#strip removes leading and trailing whitespace and never returns nil.
    return s.strip
end

def bioextract(id)
    url = "http://biostar.stackexchange.com/questions/#{id}"
    page = nil

    begin
        page = open(url)
    rescue
        return
    end

    if page.base_uri.to_s != url then
        # Answer URL.
        return
    end

    doc = Hpricot(page.read)

    questions = doc.search("div#question")

    return if questions.length == 0

    if questions.length > 1 then
        puts 'Hm. More than one question?'
        return
    end

    question_vote = nil
    question_date = nil
    answers_vote = []

    questions[0].search("div.vote")[0].search("span.vote-count-post") { |vote|
        question_vote = strip(vote.inner_text).to_i
    }

    answers = doc.search("div#answers")
    answers.each { |answer|
        answer.search("div.vote").search("span.vote-count-post") { |vote|
            answers_vote << strip(vote.inner_text).to_i
        }
    }

    return unless answers_vote.length > 0

    dates = doc.search("span.relativetime")
    question_date = dates[0].get_attribute('title')
    datetime = DateTime.parse(question_date)

    puts "#{id}\t#{question_date}\t#{datetime.year}\t#{datetime.month}\t" +
        "#{question_vote}\t#{answers_vote.length}\t" +
        "#{answers_vote.min}\t#{answers_vote.max}\t" +
        "#{answers_vote.inject(:+) / answers_vote.length.to_f}"

    page.close
end

for i in 1..18304 do
    bioextract(i)
end
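
The crawler prints one tab-separated line per question. As a hedged sketch, such a line could be read back in Ruby as follows. The column names are borrowed from the col.names in the R script, parse_row and COLUMNS are names introduced here for illustration, and the sample line is made up.

```ruby
# Column names follow the col.names used in the R script.
COLUMNS = [:id, :date, :year, :month, :qvote, :anum, :amin, :amax, :aavg].freeze

# Parses one tab-separated line of crawler output into a hash.
# The date stays a string, the average becomes a float, and the
# remaining fields become integers.
def parse_row(line)
  fields = line.chomp.split("\t")
  unless fields.length == COLUMNS.length
    raise ArgumentError, "expected #{COLUMNS.length} fields, got #{fields.length}"
  end
  row = Hash[COLUMNS.zip(fields)]
  COLUMNS.each do |key|
    next if key == :date
    row[key] = key == :aavg ? Float(row[key]) : Integer(row[key], 10)
  end
  row
end

# Hypothetical line in the crawler's output format; the values are made up.
sample = "42\t2010-03-01 12:00:00Z\t2010\t3\t5\t3\t0\t4\t2.0\n"
p parse_row(sample)
```

Raising on a wrong field count is a design choice of this sketch, so that a truncated line is noticed instead of silently producing a misaligned row.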

The R code has not been modified since last year, except for the change of the filename and plot titles in the script:

biostar <- read.table(
        "biostar20120306.tsv",
        sep="\t",
        col.names=c(
            "id",
            "date",
            "year",
            "month",
            "qvote",
            "anum",
            "amin",
            "amax",
            "aavg"
        )
    )

biostar$age <- (
        (
            biostar$year -
            rep(min(biostar$year), times=length(biostar$year))
        )*12 +
        biostar$month
    )
biostar$age <- biostar$age -
            rep(min(biostar$age), times=length(biostar$age)) +
            1

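# Sorted unique month indices, kept numeric so that the comparison and
# the call to lines() below do not rely on character coercion.
biolevels <- sort(unique(biostar$age))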

bioeval <- function(column, xlabel, ylabel, colour) {
    avg <- rep(0, times=length(biolevels))
    for (i in 1:length(biolevels)) {
        avg[i] <- mean(
            column[ biostar$age == biolevels[i] ]
        )
    }

    plot(
            biostar$age,
            t(column),
            col=colour,
            pch=19,
            xlab=xlabel,
            ylab=ylabel,
            axes=FALSE
        )
    lines(biolevels, avg, col=rgb(.6,.6,.6))
    abline(lm(column ~ biostar$age), col=rgb(.6,0,.6))
    axis(1,min(biostar$age):max(biostar$age))
    axis(2,min(column):max(column))
}

png("qvote.png", height=600, width=600, unit="px", pointsize=13)
bioeval(
        biostar$qvote,
        'Biostar Month',
        'Voting-Score of Questions',
        rgb(100,100,0,15,maxColorValue=255)
    )
dev.off()

png("num.png", height=600, width=600, unit="px", pointsize=13)
bioeval(
        biostar$anum,
        'Biostar Month',
        'Number of Answers per Question',
        rgb(0,100,100,15,maxColorValue=255)
    )
dev.off()

png("min.png", height=600, width=600, unit="px", pointsize=13)
bioeval(
        biostar$amin,
        'Biostar Month',
        'Min. Answer Voting-Score per Question',
        rgb(100,0,0,15,maxColorValue=255)
    )
dev.off()

png("max.png", height=600, width=600, unit="px", pointsize=13)
bioeval(
        biostar$amax,
        'Biostar Month',
        'Max. Answer Voting-Score per Question',
        rgb(0,0,100,15,maxColorValue=255)
    )
dev.off()

png("avote.png", height=600, width=600, unit="px", pointsize=13)
bioeval(
        biostar$aavg,
        'Biostar Month',
        'Avg. Answer Voting-Score per Question',
        rgb(100,0,100,15,maxColorValue=255)
    )
dev.off()
Posted in: Bioinformatics