SimilaR

Maciej Bartoszuk

7 years ago

[This article was first published on Rexamine » Blog/R-bloggers, and kindly contributed to R-bloggers].

Introduction

Being a teacher can be a very gratifying job. If you teach programming, which is also your favorite hobby, nothing can be better. Only one thing can spoil the dream: cheating students. As we all know, one can learn programming only by writing code oneself. Copying another student's source code makes no sense: the student learns nothing and, what is more, gets points for work that he or she did not do.

When there are only a few homework assignments to check, it is easy to do it manually. But what if there is a large number of submissions? Then we need an application to automate the process. There are well-known tools for “standard” programming languages, such as MOSS or JPlag, which support e.g. C, C++, C#, Java, Scheme or JavaScript.

But what if we want to automate checking the similarity of R source code? Until now, no such tool was available. But things have changed.

SimilaR

SimilaR is a service designed to detect similar source code patterns in R code snippets. To create an account, you need an e-mail address in the edu domain and you must prove to us somehow that you are a tutor (show us your webpage etc.). Once the account is activated, you just upload your students’ submissions and wait a moment for the results.

Let us see a working example. Assume that one student submitted the following file:

almostPi <- function(n)
{
# this is a function which approximates the constant Pi
stopifnot(is.numeric(n),length(n)==1,n>0,(n-floor(n))<=1e-8)

x <- runif(n,-1,1);
y <- runif(n,-1,1);
4*sum((sqrt(x^2+y^2))<=1)/n
}

pythagoreanTriples<-function(m,n)
{
stopifnot(length(m)==length(n));
a<-m^2-n^2;
b<-2*m*n;
c<-m^2+n^2;
mat<-matrix(c(a,b,c),3,length(a),byrow=TRUE); 
# I arrange triples in a matrix
l<-split(mat,col(mat));
l[[length(a)+1]] <- (a^2+b^2==c^2 & a*b*c!=0 & a*c>0)
return(l)
}

and the other one sent:

almostPi<-function (n=10000) {
stopifnot(length(n)==1,n>0,(n-floor(n))==0)
# Checking if n is a numeric vector of length 1,
# and if it is a natural number
4*sum(sqrt(runif(n,-1,1)^2+runif(n,-1,1)^2)<=1)/n
}

pythagoreanTriples<-function(m,n){
stopifnot(is.vector(m),is.vector(n),is.numeric(c(m,n)),
length(n)==length(m),length(n)>0,all(c(n-floor(n),m-floor(m))<=1e-10),
all(c(m,n)>=0))

a<-m^2-n^2
b<-2*m*n
cc<-m^2+n^2
l<-mapply(c,a,b,cc,SIMPLIFY=FALSE)
l[length(l)+1]<-list(a^2+b^2==cc^2)
l
}
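Although the two files look different, the functions in them do essentially the same job, which is why a plain text comparison is of little help here. The following is only a quick sketch of a manual check in base R, not a part of SimilaR: the names pythagoreanTriplesA and pythagoreanTriplesB are hypothetical labels introduced to keep both submissions in one session, and the inputs are arbitrary. The bodies are copied from the two files above, with comments dropped and spacing added.

```r
# Submission of the first student (hypothetical name with suffix A).
pythagoreanTriplesA <- function(m, n) {
  stopifnot(length(m) == length(n))
  a <- m^2 - n^2
  b <- 2 * m * n
  c <- m^2 + n^2
  mat <- matrix(c(a, b, c), 3, length(a), byrow = TRUE)
  l <- split(mat, col(mat))
  l[[length(a) + 1]] <- (a^2 + b^2 == c^2 & a * b * c != 0 & a * c > 0)
  return(l)
}

# Submission of the second student (hypothetical name with suffix B).
pythagoreanTriplesB <- function(m, n) {
  stopifnot(is.vector(m), is.vector(n), is.numeric(c(m, n)),
            length(n) == length(m), length(n) > 0,
            all(c(n - floor(n), m - floor(m)) <= 1e-10),
            all(c(m, n) >= 0))
  a <- m^2 - n^2
  b <- 2 * m * n
  cc <- m^2 + n^2
  l <- mapply(c, a, b, cc, SIMPLIFY = FALSE)
  l[length(l) + 1] <- list(a^2 + b^2 == cc^2)
  l
}

# Arbitrary test inputs with m > n > 0.
resA <- pythagoreanTriplesA(c(2, 3), c(1, 2))
resB <- pythagoreanTriplesB(c(2, 3), c(1, 2))

# Compare the values only; the first version returns a named list.
same <- isTRUE(all.equal(resA, resB, check.attributes = FALSE))
print(same)
```

Note that the first version also requires non-zero, positive sides in its final check, so the two results can differ for degenerate inputs such as m equal to n.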

So we log into SimilaR, choose Antiplagiarism system -> New submission, and we see a screen like this:

In the area marked with a green rectangle we provide a name for the new submission; this name identifies the group of files. In the blue rectangle we choose the smallest group of functions (functions within one group are not compared with each other): a group of files, a single file, or no grouping at all, in which case every function is compared with every other. Since every student in our example provided her homework in a separate file, we choose the second option.
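To see what the grouping choice means for the number of comparisons, here is a small sketch in base R. It is only an illustration of the counting, not SimilaR's code: the file names, the data frame funs, and the helper countComparedPairs are hypothetical, and it assumes that two functions are compared exactly when they belong to different groups.

```r
# Hypothetical inventory of the uploaded functions: two files, two functions each.
funs <- data.frame(
  file = c("student1.R", "student1.R", "student2.R", "student2.R"),
  name = c("almostPi", "pythagoreanTriples", "almostPi", "pythagoreanTriples"),
  stringsAsFactors = FALSE
)

# Count the pairs of functions that fall into different groups.
# `group` holds one group label per function.
countComparedPairs <- function(group) {
  idx <- combn(length(group), 2)           # all unordered pairs of functions
  sum(group[idx[1, ]] != group[idx[2, ]])  # keep only the cross-group pairs
}

perFile     <- countComparedPairs(funs$file)            # one file per group
perFunction <- countComparedPairs(seq_len(nrow(funs)))  # every function on its own
oneGroup    <- countComparedPairs(rep("homework1", 4))  # all files in a single group

print(c(perFile = perFile, perFunction = perFunction, oneGroup = oneGroup))
# perFile = 4, perFunction = 6, oneGroup = 0
```

With one file per group, functions written by the same student are never compared with each other, which is what we want for homework submitted one file per student.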

After we click Submit, we obtain:

In this view we can make sure that the system understands the uploaded files as we expect. If something is wrong, e.g. the source code has syntax errors, we are notified at this step. Please note that the comments have been removed from the source code and the indentation style is uniform. If everything is OK, we click the Confirm button.
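A similar effect can be reproduced locally with base R. The sketch below is not SimilaR's actual preprocessing, which is not described in this post; it only assumes that parsing the code without source references and deparsing it again is a reasonable way to drop comments and unify the layout. The helper name normalize is hypothetical.

```r
# The second student's almostPi, as plain text.
src <- "almostPi<-function (n=10000) {
stopifnot(length(n)==1,n>0,(n-floor(n))==0)
# Checking if n is a numeric vector of length 1,
# and if it is a natural number
4*sum(sqrt(runif(n,-1,1)^2+runif(n,-1,1)^2)<=1)/n
}"

# Parse the text and deparse each top-level expression.
# keep.source = FALSE discards the original text, so comments and
# the author's spacing are lost; parse() fails on invalid syntax.
normalize <- function(code) {
  exprs <- parse(text = code, keep.source = FALSE)
  unlist(lapply(exprs, deparse))
}

normalized <- normalize(src)
cat(normalized, sep = "\n")

# A syntax error surfaces as an ordinary R error that can be caught.
bad <- tryCatch(normalize("f <- function(x) {"),
                error = function(e) "syntax error")
print(bad)
```

Running the same helper on both submissions gives code in one uniform style, which makes a side-by-side reading much easier.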

After that we see a list of our submissions, together with the progress of the new one, which is updated dynamically. When it is ready, it moves to the top of the list and we can open it.

Let us see the results. There are 4 pairs, as there were 2 functions in each of the 2 files. The pairs are ordered from the most similar to the least similar. At first we see only the first 10 pairs, and we can assess each pair by indicating whether we believe it is similar or not. After evaluating some pairs (see the green rectangle), we can see more of them. This step is needed because the system is based on statistical learning algorithms, and we need as much training data as we can obtain so that it becomes even more useful in the future.

Summary

We hope that SimilaR will be a useful tool, that it will make evaluating the similarity of students’ homework faster and more accurate, and that it will make a teacher’s job more convenient. With this tool, R tutors can focus on the most important part of the teaching process: teaching, not searching for plagiarism and dishonest students. Prior to using the system, make sure you agree with the Terms and Conditions.
