One year has passed since I asked the provocative question “Is the BioStar fading?”. At the time I concluded that the BioStar was not fading, even though the data suggested that it was not flourishing either. I have now crawled the BioStar web-site again, updated last year’s statistical plots with current data, and expanded the explanation of how I interpret them, all of which I present in this blog post.
Introduction
The first analysis of the BioStar forum’s contents was triggered by my impression that the overall quality of the questions and answers provided by the BioStar community was decreasing. With the help of some simple statistics and visualisation techniques I tried to evaluate whether this claim can be backed up formally. The prospect of carrying out such an analysis also seemed intriguing in itself.
Evaluating the quality of questions and answers in a forum is not an easy task, and the metrics that I chose last year have drawn criticism. Whilst there might be better metrics for determining the quality of posts on the forum, I explicitly avoided common metrics such as the total number of users or posts per month, which measure sheer quantity rather than quality. The five metrics I chose are instead oriented towards individual forum posts: they focus on the number of votes on questions and answers and on the number of answers provided.
Both web-crawling and statistics scripts are provided in this blog post too, so that it is possible for anyone to reproduce my work and to re-use the scripts for carrying out evaluations on other interesting metrics.
Results
Five graphs have been created using the statistical computing software R from data provided by the web-crawling script. Each graph plots values that have been binned per month of BioStar’s existence. In each graph, the light grey line denotes the progression of the monthly average value and the purple line indicates the linear trend of the data points over time.
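To make the two lines in each graph more concrete, here is a minimal Ruby sketch of the underlying arithmetic: a mean per month and an ordinary least-squares line through all data points. The function names and the sample values are made up for illustration, and the sketch assumes Ruby 2.4 or later for `Array#sum`; the actual plots are produced by the R script in the appendix.

```ruby
# Hypothetical helper: mean value per month for [month_index, value] pairs.
def monthly_means(points)
  points.group_by { |month, _| month }
        .map { |month, rows| [month, rows.sum { |_, v| v }.to_f / rows.length] }
        .sort
        .to_h
end

# Hypothetical helper: ordinary least-squares line y = intercept + slope * x.
def trend_line(points)
  n = points.length.to_f
  mean_x = points.sum { |x, _| x } / n
  mean_y = points.sum { |_, y| y } / n
  sxy = points.sum { |x, y| (x - mean_x) * (y - mean_y) }
  sxx = points.sum { |x, _| (x - mean_x)**2 }
  slope = sxy / sxx
  [mean_y - slope * mean_x, slope]
end

# Invented sample: two questions in each of three months.
sample = [[1, 4], [1, 2], [2, 3], [2, 1], [3, 1], [3, 1]]

means = monthly_means(sample)
intercept, slope = trend_line(sample)

puts means.inspect
puts "trend: y = #{intercept} + #{slope} * month"
```

A negative slope, as in this invented sample, corresponds to a trend line that falls from left to right.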
In the following I will use the terms “voting-score” and “score” when I am referring to the integer that is just labelled “Vote” on the BioStar forum. This “Vote” is actually the result of a voting process, where BioStar members can either make this integer go up by one (“up-vote”) or down by one (“down-vote”) by placing their vote on a per question/per answer basis.
Voting-Score of Questions
Voting-scores are an indicator of the activity of BioStar members within the forum as well as a gauge of the community’s mood. Many positive scores show active participation by many forum members and indicate that the up-voted questions were welcomed. Negative scores also show active participation, but they express disapproval or dissent. The voting-score a question received is not an absolute measure of participation though, because equally many members may have voted a question “up” as “down”, which would result in a score of zero even though many people took part in the voting. I can only say from my own experience that most questions predominantly receive either “up” or “down” votes and that the score of a particular question very seldom swings between positive and negative.
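The caveat about cancelling votes can be illustrated with a few lines of Ruby. The separate up-vote and down-vote counts are hypothetical; the crawled data only contains the net score.

```ruby
# Net voting-score: every up-vote adds one, every down-vote subtracts one.
def voting_score(up_votes, down_votes)
  up_votes - down_votes
end

# Number of members who took part in the voting.
def votes_cast(up_votes, down_votes)
  up_votes + down_votes
end

# Hypothetical vote counts for two questions.
contested = { up: 6, down: 6 }
popular   = { up: 7, down: 0 }

[contested, popular].each do |q|
  puts "score #{voting_score(q[:up], q[:down])} from #{votes_cast(q[:up], q[:down])} votes"
end
```

The contested question ends up with a score of zero from twelve votes, whereas the popular one reaches a score of seven from only seven votes.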
Number of Answers per Question
Engagement of users within the BioStar community is evaluated by the number of answers that are provided per question. A higher number of answers does not indicate better quality as such, but it expresses the involvement of users regarding forum content. Unlike the voting-score, the number of answers can only increase, which makes this a better measure of member participation.
Minimal Voting-Score that an Answer Received per Question
The answer that received the lowest voting-score determines the quality of the least popular answer. A high score for the least popular answer means that despite the answer’s bad ranking, it is still a valuable contribution. A lower score and especially negative scores on the other hand mean that the answer did not meet the expectations of the community.
Maximal Voting-Score that an Answer Received per Question
The most popular answers receive a high voting-score, which expresses the community’s approval of the answer’s quality. It also shows the involvement of BioStar members, since high scores can essentially only be achieved when many people are actually voting. The lower the score of the top answer, the less accepted it is within the community, either due to disapproval or due to a sheer lack of members participating in the voting.
Average Answer Voting-Score per Question
Using the average voting-score provided for answers per question, it is possible to account for the overall quality of the provided answers. If a question has many answers with high scores and only a few answers with low scores, then the average is still going to be high. On the other hand, a low average results from questions with answers that have received mostly low voting-scores.
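As a small worked example of the three answer-based metrics, the following Ruby sketch summarises the voting-scores of one question’s answers. The function name and the scores are invented for illustration, and `Array#sum` assumes Ruby 2.4 or later.

```ruby
# Hypothetical helper: summarise the voting-scores of all answers to one question.
def answer_summary(scores)
  return nil if scores.empty?  # questions without answers have no summary
  {
    count: scores.length,
    min: scores.min,
    max: scores.max,
    avg: scores.sum.to_f / scores.length
  }
end

# Invented scores of four answers to a single question.
summary = answer_summary([5, 3, -1, 1])
puts summary.inspect
```

Here the least popular answer has a score of -1, the most popular one a score of 5, and the average over all four answers is 2.0.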
Discussion & Conclusion
Each of the presented metrics shows a downward trend, which can be read as a decrease in quality in the sense of the interpretations described above. Questions on the BioStar forum are receiving fewer answers over time and the scores given by the BioStar community are dropping too. The trend lines in all graphs suggest that the BioStar will indeed fade eventually.
However, as an active member of the BioStar forum myself, I find that the overall quality of the forum has improved over the last months. It might be the case that the initial euphoria encouraged people to participate at the cost of the forum’s quality: members were eager to collect as many votes as possible for their questions and answers in order to raise their BioStar reputation score and to unlock the many available member-badges (see also this feed by Paulo Nuin). Since the forum has become quieter now, I get the impression that answers are less rushed and that votes are distributed with better aim.
I hope that the crawling script will still work with the BioStar next year, when I am going to conduct the same analysis again. Until then we can only wait and see what happens to this fairly young bioinformatics forum.
Appendix: Data Retrieval and Evaluation Scripts
The Ruby script for crawling BioStar has changed only marginally since last year: I increased the upper bound of the unique identifier in the URL to match the most recent posts on BioStar, and the remaining changes are fixes that were required for running the script under Mac OS X Lion. The R script for the evaluation was just adapted to account for the new filename of the crawled data, which is available as biostar20120306.tsv (change suffix from _tsv.doc to .tsv).
require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'date'

STDOUT.sync = true

# Remove leading and trailing whitespace.
def strip(s)
  s.strip
end

# Fetch one question page and print its statistics as a tab-separated line.
def bioextract(id)
  url = "http://biostar.stackexchange.com/questions/#{id}"
  begin
    # On Ruby 3.0 and later, open-uri requires URI.open(url) instead.
    page = open(url)
  rescue
    # Skip identifiers that cannot be retrieved.
    return
  end
  begin
    # A redirect means that the identifier belongs to an answer URL.
    return if page.base_uri.to_s != url
    doc = Hpricot(page.read)
    questions = doc.search("div#question")
    return if questions.length == 0
    if questions.length > 1
      puts 'Hm. More than one question?'
      return
    end
    question_vote = nil
    answers_vote = []
    questions[0].search("div.vote")[0].search("span.vote-count-post") { |vote|
      question_vote = strip(vote.inner_text).to_i
    }
    doc.search("div#answers").each { |answer|
      answer.search("div.vote").search("span.vote-count-post") { |vote|
        answers_vote << strip(vote.inner_text).to_i
      }
    }
    # Questions without answers are not reported.
    return unless answers_vote.length > 0
    dates = doc.search("span.relativetime")
    question_date = dates[0].get_attribute('title')
    datetime = DateTime.parse(question_date)
    # Columns: id, date, year, month, question score, number of answers,
    # minimal, maximal and average answer score.
    puts "#{id}\t#{question_date}\t#{datetime.year}\t#{datetime.month}\t" +
         "#{question_vote}\t#{answers_vote.length}\t" +
         "#{answers_vote.min}\t#{answers_vote.max}\t" +
         "#{answers_vote.inject(:+) / answers_vote.length.to_f}"
  ensure
    # Close the page on every exit path, including the early returns.
    page.close
  end
end

(1..18304).each { |i| bioextract(i) }
The R code has not been modified since last year, except for the change of the filename and plot titles in the script:
# Read the tab-separated output of the crawling script.
biostar <- read.table(
  "biostar20120306.tsv",
  sep="\t",
  col.names=c(
    "id",
    "date",
    "year",
    "month",
    "qvote",
    "anum",
    "amin",
    "amax",
    "aavg"
  )
)
# Month index relative to the first year in the data...
biostar$age <- (
  (
    biostar$year -
    rep(min(biostar$year), times=length(biostar$year))
  )*12 +
  biostar$month
)
# ...shifted so that the first month with data has index 1.
biostar$age <- biostar$age -
  rep(min(biostar$age), times=length(biostar$age)) +
  1
biolevels <- levels(factor(biostar$age))
# Scatter plot of one metric with monthly averages and a linear trend line.
bioeval <- function(column, xlabel, ylabel, colour) {
  avg <- rep(0, times=length(biolevels))
  for (i in 1:length(biolevels)) {
    avg[i] <- mean(
      column[ biostar$age == biolevels[i] ]
    )
  }
  plot(
    biostar$age,
    t(column),
    col=colour,
    pch=19,
    xlab=xlabel,
    ylab=ylabel,
    axes=FALSE
  )
  # Light grey: monthly averages. Purple: linear trend.
  lines(biolevels, avg, col=rgb(.6,.6,.6))
  abline(lm(column ~ biostar$age), col=rgb(.6,0,.6))
  axis(1,min(biostar$age):max(biostar$age))
  axis(2,min(column):max(column))
}
png("qvote.png", height=600, width=600, units="px", pointsize=13)
bioeval(
  biostar$qvote,
  'Biostar Month',
  'Voting-Score of Questions',
  rgb(100,100,0,15,maxColorValue=255)
)
dev.off()
png("num.png", height=600, width=600, units="px", pointsize=13)
bioeval(
  biostar$anum,
  'Biostar Month',
  'Number of Answers per Question',
  rgb(0,100,100,15,maxColorValue=255)
)
dev.off()
png("min.png", height=600, width=600, units="px", pointsize=13)
bioeval(
  biostar$amin,
  'Biostar Month',
  'Min. Answer Voting-Score per Question',
  rgb(100,0,0,15,maxColorValue=255)
)
dev.off()
png("max.png", height=600, width=600, units="px", pointsize=13)
bioeval(
  biostar$amax,
  'Biostar Month',
  'Max. Answer Voting-Score per Question',
  rgb(0,0,100,15,maxColorValue=255)
)
dev.off()
png("avote.png", height=600, width=600, units="px", pointsize=13)
bioeval(
  biostar$aavg,
  'Biostar Month',
  'Avg. Answer Voting-Score per Question',
  rgb(100,0,100,15,maxColorValue=255)
)
dev.off()
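For readers who prefer Ruby over R, the “Biostar Month” index that the R script derives from the year and month columns can be sketched as follows. The function name and the sample dates are invented for illustration; the computation mirrors the two assignments to `biostar$age`.

```ruby
# Hypothetical helper: turn [year, month] pairs into month indices that
# start at 1 for the earliest month present in the data.
def month_indices(dates)
  min_year = dates.map { |year, _| year }.min
  # Months counted from January of the first year in the data.
  raw = dates.map { |year, month| (year - min_year) * 12 + month }
  first = raw.min
  # Shift so that the earliest month gets index 1.
  raw.map { |value| value - first + 1 }
end

# Invented question dates as [year, month] pairs.
indices = month_indices([[2009, 10], [2009, 12], [2010, 1], [2012, 3]])
puts indices.inspect
```

With these invented dates the earliest month gets index 1 and the last one index 30, so that every question falls into exactly one monthly bin.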

dalloliogm
March 26, 2012
Hello Joachin,
thank you very much for your analysis, it is very useful and interesting.
One possible explanation is that as the total number of users in biostar is increasing, the variety of topics discussed is getting more and more diverse, and this makes it more difficult for a single person to answer. When I open biostar, most of the questions I see are about topics I do not know, so I have to scroll past more questions before finding one that I can answer, and overall this reduces the number of good answers that can be made.
Another possible explanation is that, unfortunately, we are spending too much time with the platform migration. I do not want to criticize the people who are writing the biostar beta interface, but the fact is that the time spent to rewrite it is subtracted from the time they can spend answering questions. So, the overall quality of the contents is decreasing.
Gio