BioStar: Is the BioStar fading?

Posted on March 7, 2011



BioStar is an open forum for bioinformaticians where anyone can ask questions and the forum’s subscriber community provides answers. Everyone can vote questions and answers up or down to give them a qualitative rating. BioStar is essentially a bio-clone of the famous Stack Overflow. Over the past weeks I have noticed a change in the quality of questions and answers, and in the enthusiasm to vote on BioStar. After the latest bummer, where I answered a question that had not been tackled for some time but neither received an up-vote nor had my answer marked as ‘accepted’, I decided to look into the community’s voting behaviour in more detail.

Crawling BioStar

In order to analyse the votes of answered questions, I crawled the BioStar web-site with the Ruby script given below, which outputs tab-separated values (TSV) with the following columns:

  1. Question ID, e.g. question http://biostar.stackexchange.com/questions/2550 has the ID 2550
  2. Date and time when the question was last edited
  3. Year of 2. (redundant, for easier parsing)
  4. Month of 2. (redundant, for easier parsing)
  5. Received vote of the question
  6. Number of answers given (>= 1, unanswered questions are omitted)
  7. Received vote of the least favourite answer (min. vote of all answers)
  8. Received vote of the most favourite answer (max. vote of all answers)
  9. Average vote over all received votes for answers

The script was functional as of the time of posting, but it may stop working when the web-layout of BioStar changes. I hardcoded the question ID 6170 as the last ID of the URL to be downloaded, which was the latest ID when I started writing the script. The output was redirected into a file called biostar.tsv (see the very end for a downloadable version of the file) that is later loaded into R.

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'date'    # DateTime.parse is used below

def strip(s)
    # Remove leading and trailing whitespace.
    return s.strip
end

def bioextract(id)
    url = "http://biostar.stackexchange.com/questions/#{id}"
    page = nil

    begin
        page = open(url)
    rescue 
        return
    end

    if page.base_uri.to_s != url then
        # Answer URL.
        return
    end

    doc = Hpricot(page.read)

    questions = doc.search("div#question")

    # Skip pages without a question block instead of aborting the whole crawl.
    return if questions.length == 0

    if questions.length > 1 then
        puts 'Hm. More than one question?'
        return
    end

    question_vote = nil
    question_date = nil
    answers_vote = []

    questions[0].search("div.vote")[0].search("span.vote-count-post") { |vote|
        question_vote = strip(vote.inner_text).to_i
    }

    answers = doc.search("div#answers")
    answers.each { |answer|
        answer.search("div.vote").search("span.vote-count-post") { |vote|
            answers_vote << strip(vote.inner_text).to_i
        }
    }

    return unless answers_vote.length > 0

    dates = doc.search("span.relativetime")
    question_date = dates[0].get_attribute('title')
    datetime = DateTime.parse(question_date)

    puts "#{id}\t#{question_date}\t#{datetime.year}\t#{datetime.month}\t" +
        "#{question_vote}\t#{answers_vote.length}\t" +
        "#{answers_vote.min}\t#{answers_vote.max}\t" +
        "#{answers_vote.inject(:+) / answers_vote.length.to_f}"

    page.close
end

for i in 1..6170 do
    bioextract(i)
end
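
The output format can be checked without crawling anything. The following is a minimal sketch and not part of the original script: `format_row` and `parse_row` are hypothetical helper names, and the sample date and votes are made up for illustration. It builds one line in the nine-column layout listed above from a list of answer votes, then parses it back into named fields (the field names follow the column names used in the R script).

```ruby
require 'date'

# Column names, in the order in which the crawler prints them.
COLUMNS = [:id, :date, :year, :month, :qvote, :anum, :amin, :amax, :aavg]

# Hypothetical helper: build one TSV line in the nine-column layout.
def format_row(id, date_string, question_vote, answers_vote)
    datetime = DateTime.parse(date_string)
    # Sum the votes, then divide by a float to avoid integer division.
    average = answers_vote.inject(:+) / answers_vote.length.to_f
    [
        id, date_string, datetime.year, datetime.month,
        question_vote, answers_vote.length,
        answers_vote.min, answers_vote.max, average
    ].join("\t")
end

# Hypothetical helper: split a TSV line into a hash keyed by column name.
def parse_row(line)
    row = Hash[COLUMNS.zip(line.chomp.split("\t"))]
    [:id, :year, :month, :qvote, :anum, :amin, :amax].each do |key|
        row[key] = row[key].to_i
    end
    row[:aavg] = row[:aavg].to_f
    row
end

# Made-up example values, not taken from BioStar.
line = format_row(2550, '2010-09-15 12:30:00Z', 3, [1, 4, -1])
row = parse_row(line)
puts line
```

Summing with `inject(:+)` keeps the arithmetic in Ruby itself, so negative votes need no special handling.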

Evaluation

I used the R script given below to plot the figures that follow it. They show the vote per question, the number of answers given, the min./max. answer vote per question, and the average answer vote as scatter plots over the months of BioStar’s existence. Additionally, each graph shows a line through the average value for each month and a trend line fitted to the original data (i.e. not to the monthly averages).

biostar <- read.table(
        "biostar.tsv",
        sep="\t",
        col.names=c(
            "id",
            "date",
            "year",
            "month",
            "qvote",
            "anum",
            "amin",
            "amax",
            "aavg"
        )
    )

biostar$age <- (
        (
            biostar$year -
            rep(min(biostar$year), times=length(biostar$year))
        )*12 +
        biostar$month
    )
biostar$age <- biostar$age -
            rep(min(biostar$age), times=length(biostar$age)) +
            1

biolevels <- levels(factor(biostar$age))

bioeval <- function(column, xlabel, ylabel, colour) {
    avg <- rep(0, times=length(biolevels))
    for (i in 1:length(biolevels)) {
        avg[i] <- mean(
            column[ biostar$age == biolevels[i] ]
        )
    }

    plot(
            biostar$age,
            t(column),
            col=colour,
            pch=19,
            xlab=xlabel,
            ylab=ylabel,
            axes=FALSE
        )
    lines(biolevels, avg, col=rgb(.6,.6,.6))
    abline(lm(column ~ biostar$age), col=rgb(.6,0,.6))
    axis(1,min(biostar$age):max(biostar$age))
    axis(2,min(column):max(column))
}

png("qvote.png", height=600, width=600, unit="px", pointsize=13)
bioeval(
        biostar$qvote,
        'Biostar Month',
        'Vote of Questions',
        rgb(100,100,0,15,maxColorValue=255)
    )
dev.off()

png("num.png", height=600, width=600, unit="px", pointsize=13)
bioeval(
        biostar$anum,
        'Biostar Month',
        'Number of Answers per Question',
        rgb(0,100,100,15,maxColorValue=255)
    )
dev.off()

png("min.png", height=600, width=600, unit="px", pointsize=13)
bioeval(
        biostar$amin,
        'Biostar Month',
        'Min. Answer Vote per Question',
        rgb(100,0,0,15,maxColorValue=255)
    )
dev.off()

png("max.png", height=600, width=600, unit="px", pointsize=13)
bioeval(
        biostar$amax,
        'Biostar Month',
        'Max. Answer Vote per Question',
        rgb(0,0,100,15,maxColorValue=255)
    )
dev.off()

png("avote.png", height=600, width=600, unit="px", pointsize=13)
bioeval(
        biostar$aavg,
        'Biostar Month',
        'Avg. Answer Vote per Question',
        rgb(100,0,100,15,maxColorValue=255)
    )
dev.off()
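
For readers who prefer to stay in Ruby, the month numbering, the monthly averages, and the trend line can be sketched as follows. This is an illustration under assumptions, not a port of the R script: `month_index`, `monthly_means`, and `trend_line` are hypothetical names, the sample rows are made up, and the trend is an ordinary least-squares fit with a single predictor, which is the kind of model that `lm(column ~ biostar$age)` fits.

```ruby
# Hypothetical helper: number the months from 1, starting at the earliest one.
def month_index(rows)
    raw = rows.map { |r| r[:year] * 12 + r[:month] }
    first = raw.min
    raw.map { |m| m - first + 1 }
end

# Hypothetical helper: mean of the values that fall into each month index.
def monthly_means(ages, values)
    groups = Hash.new { |hash, key| hash[key] = [] }
    ages.zip(values).each { |age, value| groups[age] << value }
    means = {}
    groups.keys.sort.each do |age|
        means[age] = groups[age].inject(:+) / groups[age].length.to_f
    end
    means
end

# Hypothetical helper: least-squares fit y = intercept + slope * x.
# The slope is undefined (NaN) if all x values are identical.
def trend_line(xs, ys)
    n = xs.length.to_f
    mean_x = xs.inject(:+) / n
    mean_y = ys.inject(:+) / n
    sxy = xs.zip(ys).inject(0.0) { |sum, (x, y)| sum + (x - mean_x) * (y - mean_y) }
    sxx = xs.inject(0.0) { |sum, x| sum + (x - mean_x)**2 }
    slope = sxy / sxx
    [mean_y - slope * mean_x, slope]
end

# Made-up rows, not taken from biostar.tsv.
rows = [
    { :year => 2010, :month => 11, :qvote => 4 },
    { :year => 2010, :month => 11, :qvote => 2 },
    { :year => 2010, :month => 12, :qvote => 3 },
    { :year => 2011, :month => 1,  :qvote => 1 }
]
ages = month_index(rows)
votes = rows.map { |r| r[:qvote] }
means = monthly_means(ages, votes)
intercept, slope = trend_line(ages, votes)
puts "slope: #{slope}"
```

A negative slope on real data would correspond to the downward trend lines discussed in the conclusion; the sign here only reflects the made-up sample.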

 

Figure: Vote of Questions

Figure: Number of Answers per Question

Figure: Min. Answer Vote per Question

Figure: Max. Answer Vote per Question

Figure: Avg. Answer Vote per Question

Conclusion

The scatter plots suggest that more people have been participating in BioStar lately, as indicated by the points getting darker towards the right. Not all trend lines have a slope that can be clearly interpreted. The exceptions are the trends for the vote per question and for the maximum vote that answers received, which indicate a decline in up-votes (or an increase in down-votes) in both cases. It cannot be said for sure that BioStar is fading, but it certainly is not flourishing.

Acknowledgements

I used http://wiki.xmlhack.ru/ for the syntax highlighting in this blog post.

Downloads

TSV-file (rename to biostar.tsv): biostar.tsv

Posted in: Bioinformatics