
Wikipedia for Kaggle Participants

[This article was first published on Back Side Smack » R Stuff, and kindly contributed to R-bloggers].

Kaggle has released a new data-mining challenge: use data from 10 years of Wikipedia edits to predict future edit rates. The dataset has been anonymized to obscure editor and article identity, which simultaneously focuses the challenge and robs the dataset of considerable richness. I have some experience with Wikipedia, both from a data science standpoint and as a participant: I am an editor and an administrator on the English Wikipedia with about 20,000 edits under my belt. Some of that information and experience will be less helpful for data scientists on this particular challenge, but the beauty of Wikipedia is that all of the data is available. Everything, should you want it.

Many of these suggestions will be remedial or duplicative for veteran data miners. However, you can always benefit from local knowledge. If you need more information, leave a comment here or on my Wikipedia talk page.

But enough of that, on to my suggestions for folks looking to win the challenge!

General model:

Editors

Articles

Statistics

