Kaggle has released a new data-mining challenge: use data from 10 years of Wikipedia edits to predict future edit rates. The dataset has been anonymized to obscure editor and article identity, which simultaneously sharpens the focus of the challenge and robs the dataset of considerable richness. I have some experience with Wikipedia from both a data-science standpoint and personal experience: I am an editor and an administrator on the English Wikipedia with about 20,000 edits under my belt. Some of that experience will be less helpful for this particular challenge, but the beauty of Wikipedia is that all of the data is available. Everything, should you want it.
Many of these suggestions will be remedial or duplicative for veteran data miners, but you can always benefit from local knowledge. You can leave a comment here or on my Wikipedia talk page if you need more information.
But enough of that, on to my suggestions for folks looking to win the challenge!
General model
- Zeros are very important. Depending on where you are in the dataset and on account tenure, zeros may comprise 30-50% of editors after a certain number of months. Modeling zeros separately from a small number of edits will be important.
- Zeros at different tenures. Many accounts probably follow a Poisson or zero-inflated Poisson (ZIP) process (see the sketch after this list), but early drop-offs (not quite at 1 edit, but maybe 2-3) are very different, as are very late-stage drop-offs. In those cases (>1 year tenure) you will probably find different predictors of retirement, namely social networks. Imagine an editor who registers an account and makes ~40 edits over the span of a month. They have passed the technical hurdles for contribution (syntax, registration, autoconfirmation, etc.) but haven't necessarily established a social network. Given the reduced dataset for the contest, you may have to infer this from the number of edits made to (user) talk pages.
- Time matters. Account characteristics are very different based on how, why, and when editors came to Wikipedia. Many editors showed up not because of the (edit) button but because they read about Wikipedia in the popular press.
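To make the zero-handling point concrete, here is a minimal sketch of a zero-inflated Poisson fit in R with the pscl package. The data frame and column names (train, test, edits_next, edits_so_far, tenure_months, talk_edits) are hypothetical stand-ins for whatever target and features you actually build from the contest data, not names taken from the dataset itself.

```r
# Minimal ZIP sketch: the count component models how many edits an active
# editor makes; the zero component models the probability of outright retirement.
# Assumes hypothetical columns: edits_next (target), edits_so_far,
# tenure_months, talk_edits.
library(pscl)

fit <- zeroinfl(
  edits_next ~ log1p(edits_so_far) + tenure_months |  # count component
               tenure_months + talk_edits,            # zero (retirement) component
  data = train,
  dist = "poisson"
)

summary(fit)

# Expected future edits, combining both components
pred <- predict(fit, newdata = test, type = "response")
```

A hurdle model (pscl::hurdle) is worth trying as well, since early drop-offs and late-stage retirements may behave more like a separate process than like ordinary low-activity editing.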
Editors
- Look at edit type. Does the edit remove content (measured simply in bytes) or add it? Is the edit an exact reversion to a previous state? Reversion is a variable in the dataset (probably determined by comparing hashes).
- Look at edit persistence. Do the edits of a given editor tend to stick around or be reverted/trimmed over time? There are multiple implications to persistence, so don’t assume “good” edits are always kept and “bad” edits reverted.
- User page. Do those editors have a user page? You can determine this by checking how many contributions to the User namespace an account ID has associated with it. There is a strong social norm against editing other users’ pages, so an account that has multiple edits to a single User page is likely editing their own page.
- Look at the activity on their User Talk page. What kind of messages are they getting? Do they respond (by adding bytes) or remove the message (a strict reversion)? Again, the restrictions on the dataset preclude perfect matching between accounts and user pages, but the User Talk page most edited by a given account ID is likely to be their own talk page. (A sketch of these per-editor features follows this list.)
- Advanced permissions probably don’t matter. Only a vanishingly small percentage of editors are administrators. Permissions like ACC are also rare. Rollback might be interesting to look at, but it didn’t exist for about half of the sample.
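Most of the editor-level signals above boil down to per-account aggregates. Below is a rough sketch with dplyr; the edits data frame and its columns (user_id, namespace, delta, reverted) are assumptions about how you have loaded and named the contest data, and the namespace codes 2 and 3 follow Wikipedia's convention for User and User talk pages.

```r
# Per-editor feature sketch under the assumed column names above.
library(dplyr)

editor_features <- edits %>%
  group_by(user_id) %>%
  summarise(
    n_edits        = n(),
    share_removals = mean(delta < 0),       # edits that remove content (in bytes)
    share_reverted = mean(reverted == 1),   # crude edit-persistence proxy
    user_ns_edits  = sum(namespace == 2),   # likely their own user page
    talk_ns_edits  = sum(namespace == 3)    # user-talk activity
  )
```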
Articles
- Which articles or article topics an editor edits matters, but perhaps not as much as some other characteristics. Again, this is difficult to determine completely with anonymized data, but you can see how article edits cluster together.
- Look for different kinds of edits. Small edits with “minor” marked on their edit summary indicate “wikignoming”, a very different editing style from someone who creates or expands articles. The dataset does not seem to include the “minor” edit bit, but you can take a guess by looking at the deltas.
- Imagine that editors have not a single focus but a space of focus across the dimensions of categorization. Maybe the space is all connected: an editor might be interested in statistics, mechanics, econometrics, and signal processing. Or maybe it is not connected: an editor interested in Phong shading, Pixar, and Jersey Shore. Again, your guess will have to come from editor clustering (see the sketch after this list), but it will be better than nothing.
- When looking at groupings of articles you may discover that some articles are more conducive to new editors than others. You might also note that some articles (holding level of protection constant) are more likely to attract attention from new editors. What is also happening is that reception and attention are endogenous to editor behavior. For most subjects a single editor is atomistic, but new editors may find receptive topic areas more readily.
- Pay less attention to the "quality" scores for articles. Most of the actual quality review is for "Good" or "Featured" articles, and only ~0.5% of articles are Good or Featured. Pay much more attention to the edit rate and (if you can manage it) some measure of traffic. Edits are roughly proportional to traffic under two conditions: there must not be an "edit war" ongoing, and the article must remain at a given level of "protection". Breaking either of those two conditions makes the edit rate a very noisy measure of traffic.
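Since article and editor identities are anonymized, clustering co-edit patterns is about the only way to recover topic structure. Here is one possible sketch: build a sparse editor-by-article matrix, reduce it with a truncated SVD (the irlba package), and run k-means on the latent dimensions. The column names and the choices of 20 dimensions and 10 clusters are placeholders, not recommendations.

```r
# Cluster editors by which articles they touch, using the same hypothetical
# `edits` data frame (user_id, article_id).
library(Matrix)
library(irlba)   # truncated SVD for large sparse matrices

users    <- factor(edits$user_id)
articles <- factor(edits$article_id)

# Editor-by-article count matrix; duplicate (editor, article) pairs are summed
m <- sparseMatrix(i = as.integer(users), j = as.integer(articles), x = 1)

svd_fit  <- irlba(m, nv = 20)                 # 20 latent dimensions (arbitrary)
clusters <- kmeans(svd_fit$u, centers = 10)   # 10 clusters (arbitrary)

# Attach cluster membership back to editors as a modeling feature
editor_cluster <- data.frame(user_id = levels(users),
                             cluster = clusters$cluster)
```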
Statistics
- You can lean on past work on Wikipedia itself, both from editors and academics.
- Even small-sample studies may give some hints. Look at this small study done on editors whose first edits to the encyclopedia were new articles.