BioStar is an open forum for bioinformaticians where anyone can ask questions and the forum’s community of subscribers provides answers. Everyone can vote questions and answers up or down to rate their quality. BioStar is essentially a bio-clone of the famous Stack Overflow. Over the past weeks I have noticed a change in the quality of questions and answers, and in the enthusiasm to vote on BioStar. After the latest bummer, where I answered a question that had not been tackled for some time but received neither an up-vote nor an ‘accepted’ mark for the answer, I decided to look into the community’s voting behaviour in more detail.
Crawling BioStar
To analyse the votes of answered questions, I crawled the BioStar web-site with the Ruby script given below, which outputs tab-separated values (TSV) with the following columns:
- Question ID, e.g. question http://biostar.stackexchange.com/questions/2550 has the ID 2550
- Date and time when the question was last edited
- Year of the date in the second column (redundant, for easier parsing)
- Month of the date in the second column (redundant, for easier parsing)
- Received vote of the question
- Number of answers given (>= 1, unanswered questions are omitted)
- Received vote of the least favourite answer (min. vote of all answers)
- Received vote of the most favourite answer (max. vote of all answers)
- Average vote over all received votes for answers
The script worked at the time of posting, but it may break when BioStar's page layout changes. I hardcoded 6170 as the last question ID to download, which was the latest ID when I started writing the script. The output was redirected into a file called biostar.tsv (see the very end for a downloadable version of the file), which is later loaded into R.
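As a side note on the output format, a row of biostar.tsv can be read back with a few lines of Ruby. This is only a minimal sketch and separate from the crawler: the Struct name `Row`, the field names and the helper `parse_row` are hypothetical labels chosen here, and the sample row is made up rather than real BioStar data.

```ruby
# Field names are hypothetical labels; their order follows the column list above.
Row = Struct.new(:id, :date, :year, :month, :qvote, :anum, :amin, :amax, :aavg)

# Parse one tab-separated line into a Row, converting numeric columns.
def parse_row(line)
  f = line.chomp.split("\t")
  raise ArgumentError, "expected 9 columns, got #{f.length}" unless f.length == 9
  Row.new(f[0].to_i, f[1], f[2].to_i, f[3].to_i,
          f[4].to_i, f[5].to_i, f[6].to_i, f[7].to_i, f[8].to_f)
end

# Made-up example row, not real BioStar data.
sample = "2550\t2010-09-14 08:15:00Z\t2010\t9\t3\t2\t1\t4\t2.5\n"
row = parse_row(sample)
puts "question #{row.id}: #{row.anum} answers, average vote #{row.aavg}"
```

Rejecting rows that do not have exactly nine columns is a design choice of this sketch; it makes a truncated or malformed line fail loudly instead of silently producing zeros.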
require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'date'

# Trim leading and trailing whitespace.
def strip(s)
  s.strip
end

# Fetch one question page and print a TSV row for it if it has answers.
def bioextract(id)
  url = "http://biostar.stackexchange.com/questions/#{id}"
  page = nil
  begin
    # On Ruby >= 3.0, open-uri requires URI.open(url) instead of open(url).
    page = open(url)
  rescue
    return
  end
  if page.base_uri.to_s != url then
    # Answer URL.
    return
  end
  doc = Hpricot(page.read)
  questions = doc.search("div#question")
  # No question block on the page: abort the whole crawl.
  exit if questions.length == 0
  if questions.length > 1 then
    puts 'Hm. More than one question?'
    return
  end
  question_vote = nil
  question_date = nil
  answers_vote = []
  questions[0].search("div.vote")[0].search("span.vote-count-post") { |vote|
    question_vote = strip(vote.inner_text).to_i
  }
  answers = doc.search("div#answers")
  answers.each { |answer|
    answer.search("div.vote").search("span.vote-count-post") { |vote|
      answers_vote << strip(vote.inner_text).to_i
    }
  }
  # Unanswered questions are omitted.
  return unless answers_vote.length > 0
  dates = doc.search("span.relativetime")
  question_date = dates[0].get_attribute('title')
  datetime = DateTime.parse(question_date)
  # Sum the votes directly instead of building a string for eval.
  total = answers_vote.inject(0) { |sum, v| sum + v }
  puts "#{id}\t#{question_date}\t#{datetime.year}\t#{datetime.month}\t" +
       "#{question_vote}\t#{answers_vote.length}\t" +
       "#{answers_vote.min}\t#{answers_vote.max}\t" +
       "#{total / answers_vote.length.to_f}"
  page.close
end

for i in 1..6170 do
  bioextract(i)
end
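The per-question summary written into the last three columns (minimum, maximum and mean answer vote) can be isolated into a small function, which makes it easy to check without fetching any pages. The following is a minimal sketch; the name `summarize_votes` is hypothetical and the votes are made up for illustration.

```ruby
# Summarise a list of integer answer votes as [min, max, mean].
# Returns nil for an empty list, mirroring the crawler's choice to
# omit unanswered questions.
def summarize_votes(votes)
  return nil if votes.empty?
  total = votes.inject(0) { |sum, v| sum + v }
  [votes.min, votes.max, total / votes.length.to_f]
end

# Made-up votes for illustration.
min, max, mean = summarize_votes([3, -1, 4])
puts "min=#{min} max=#{max} mean=#{mean}"
```

Dividing by `votes.length.to_f` forces floating-point division, so a question with votes 1 and 2 gets a mean of 1.5 rather than 1.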
Evaluation
I used the R script given below to plot the figures shown further below. They show the vote received per question, the number of answers given, the min./max. answer vote per question and the average answer vote as scatter plots over the months of BioStar’s existence. Additionally, each graph shows a line through the average value for each month and a trend line fitted to the original data (i.e. not to the monthly averages).
biostar <- read.table(
  "biostar.tsv",
  sep="\t",
  col.names=c("id", "date", "year", "month", "qvote", "anum", "amin", "amax", "aavg")
)

# Month index: months since the first recorded question, starting at 1.
biostar$age <- (biostar$year - min(biostar$year)) * 12 + biostar$month
biostar$age <- biostar$age - min(biostar$age) + 1
# Distinct month indices, sorted, as numbers so they can be used as x values.
biolevels <- as.numeric(levels(factor(biostar$age)))

# Scatter plot of one column over the month index, with the monthly
# average (grey line) and a linear trend over the raw data (purple line).
bioeval <- function(column, xlabel, ylabel, colour) {
  avg <- rep(0, times=length(biolevels))
  for (i in 1:length(biolevels)) {
    avg[i] <- mean(column[biostar$age == biolevels[i]])
  }
  plot(biostar$age, column, col=colour, pch=19, xlab=xlabel, ylab=ylabel, axes=FALSE)
  lines(biolevels, avg, col=rgb(.6,.6,.6))
  abline(lm(column ~ biostar$age), col=rgb(.6,0,.6))
  axis(1, min(biostar$age):max(biostar$age))
  axis(2, min(column):max(column))
}

png("qvote.png", height=600, width=600, units="px", pointsize=13)
bioeval(biostar$qvote, 'Biostar Month', 'Vote of Questions',
        rgb(100,100,0,15,maxColorValue=255))
dev.off()

png("num.png", height=600, width=600, units="px", pointsize=13)
bioeval(biostar$anum, 'Biostar Month', 'Number of Answers per Question',
        rgb(0,100,100,15,maxColorValue=255))
dev.off()

png("min.png", height=600, width=600, units="px", pointsize=13)
bioeval(biostar$amin, 'Biostar Month', 'Min. Answer Vote per Question',
        rgb(100,0,0,15,maxColorValue=255))
dev.off()

png("max.png", height=600, width=600, units="px", pointsize=13)
bioeval(biostar$amax, 'Biostar Month', 'Max. Answer Vote per Question',
        rgb(0,0,100,15,maxColorValue=255))
dev.off()

png("avote.png", height=600, width=600, units="px", pointsize=13)
bioeval(biostar$aavg, 'Biostar Month', 'Avg. Answer Vote per Question',
        rgb(100,0,100,15,maxColorValue=255))
dev.off()
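For readers who prefer to stay in Ruby, the month index and the per-month averages computed by the R script can be sketched as follows. This is an illustration only: the helper names `month_ages` and `monthly_means`, the hash keys and the three records are hypothetical, and the records are made up rather than taken from biostar.tsv.

```ruby
# Month index as in the R script: months since the earliest record, starting at 1.
def month_ages(records)
  min_year = records.map { |r| r[:year] }.min
  raw = records.map { |r| (r[:year] - min_year) * 12 + r[:month] }
  offset = raw.min
  raw.map { |a| a - offset + 1 }
end

# Mean of one column per month index, as sorted [age, mean] pairs.
def monthly_means(records, key)
  ages = month_ages(records)
  groups = Hash.new { |h, k| h[k] = [] }
  records.each_with_index { |r, i| groups[ages[i]] << r[key] }
  groups.keys.sort.map do |age|
    values = groups[age]
    [age, values.inject(0.0) { |s, v| s + v } / values.length]
  end
end

# Made-up records for illustration.
records = [
  { year: 2010, month: 11, amax: 4 },
  { year: 2010, month: 11, amax: 2 },
  { year: 2011, month: 1,  amax: 1 },
]
monthly_means(records, :amax).each { |age, mean| puts "month #{age}: #{mean}" }
```

Months without any answered question simply do not appear in the result, which matches the R script's use of only those month indices that occur in the data.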
Conclusion
The scatter plots show that more people have been participating in BioStar lately, which is indicated by the darkening towards the right. Not all trends show a slope that can be clearly interpreted; the exceptions are the trends for the vote per question and for the maximum vote that answers received, both of which indicate a decline in up-votes (or an increase in down-votes). It cannot be said for sure that BioStar is fading, but it certainly is not flourishing.
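The direction of a trend can also be read off as a number. The trend lines in the figures come from R's `lm`, which for a single predictor is an ordinary least-squares fit; the sketch below computes the same kind of slope and intercept by hand. The function name `trend_line` is hypothetical and the data points are made up, so the printed numbers say nothing about BioStar itself.

```ruby
# Ordinary least-squares fit of y = intercept + slope * x.
# Returns [slope, intercept].
def trend_line(xs, ys)
  if xs.empty? || xs.length != ys.length
    raise ArgumentError, "need equally long, non-empty inputs"
  end
  n = xs.length.to_f
  mean_x = xs.inject(0.0) { |s, v| s + v } / n
  mean_y = ys.inject(0.0) { |s, v| s + v } / n
  sxy = xs.zip(ys).inject(0.0) { |s, (x, y)| s + (x - mean_x) * (y - mean_y) }
  sxx = xs.inject(0.0) { |s, x| s + (x - mean_x)**2 }
  raise ArgumentError, "all x values are identical" if sxx.zero?
  slope = sxy / sxx
  [slope, mean_y - slope * mean_x]
end

# Made-up month indices and votes for illustration.
ages  = [1, 2, 3, 4]
votes = [5, 4, 4, 2]
slope, intercept = trend_line(ages, votes)
puts "slope per month: #{slope.round(2)}, intercept: #{intercept.round(2)}"
```

A negative slope corresponds to a falling trend line, but the slope alone does not say whether the decline is statistically significant.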
Acknowledgements
I used http://wiki.xmlhack.ru/ for the syntax highlighting in this blog post.
Downloads
TSV-file (rename to biostar.tsv): biostar.tsv









flxlex
March 7, 2011
What about the number of questions over time?
Joachim
March 8, 2011
Good point, but I deliberately did not focus on the total number of questions, answers or users over time, because I do not think it is a very meaningful measure. The number of questions over time only tells you something about the activity on BioStar, but nothing about the quality of the service as such.
Adrian
March 8, 2011
At this point, the more relevant metrics would be traffic, and the number of (answered) questions per unit time.
It’s not at all clear that one would expect ‘quality’ to increase over time – indeed, one might well expect the opposite, as the community moves from early-adopter keeners to everybody else – but that doesn’t mean the site isn’t useful to a greater number of people. Also, once a question has been answered satisfactorily (even if not optimally), there may not be enough incentive to provide better answers.
Joachim
March 8, 2011
I am not convinced that measuring the traffic volume is an indicator for the forum’s quality. Actually, you say something to the contrary in your second paragraph yourself.
About the number of answers: I wrote that there is no significant decrease visible there. Instead, I pointed out that the maximum vote is decreasing over time. You can actually see in the scatter plot that there are more zero and even negative maximum vote answers occurring recently. This measure is independent from the actual number of answers provided per question.
Aleksandr Levchuk
March 8, 2011
This is really code. I’m your fan.