How to raise money on Kickstarter – extensive EDA and prediction tutorial


In this tutorial, we will explore the characteristics of projects on Kickstarter and try to understand what separates the winners from the projects that failed to reach their funding goals.

Qs for Exploratory Analysis:

We will start our analysis with the aim of answering the following questions:

    1. How many projects were successful on Kickstarter, by year and category?
    2. Which sub-categories raised the most amount of money?
    3. Which countries do the projects originate from?
    4. How many projects exceeded their funding goal by 50% or more?
    5. Did any projects reach $100,000 or more? $1,000,000 or higher?
    6. What was the average amount contributed by each backer, and how does this change over time? Does this amount differ with categories?
    7. What is the average funding period?

 

Predicting success rates:
Using the answers from the above questions, we will try to create a model that can predict which projects are most likely to be successful.

The dataset is available on Kaggle, and you can run this script LIVE using this kernel link. If you find this tutorial useful or interesting, please do upvote the kernel!

Step 1 – Data Pre-processing

a) Let us take a look at the input dataset:
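The read-in step itself is not shown in this post; below is a minimal sketch, assuming data.table::fread and the Kaggle file name "ks-projects-201801.csv" (both the function choice and the file name are assumptions):

# Assumed loading step (not shown in the original post).
# data.table::fread is a fast CSV reader whose progress log matches the output below.
library(data.table)

# "ks-projects-201801.csv" is the usual Kaggle file name; adjust if yours differs.
ksdf <- as.data.frame(fread("ks-projects-201801.csv", stringsAsFactors = FALSE))

str(ksdf)   # structure of the 378,661 x 14 data frame shown below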

## 
Read 68.7% of 378661 rows
Read 378661 rows and 14 (of 14) columns from 0.051 GB file in 00:00:03

## 'data.frame':    378661 obs. of  14 variables:
##  $ ID              : int  1000002330 1000003930 1000004038 1000007540 1000011046 1000014025 1000023410 1000030581 1000034518 100004195 ...
##  $ name            : chr  "The Songs of Adelaide & Abullah" "Greeting From Earth: ZGAC Arts Capsule For ET" "Where is Hank?" "ToshiCapital Rekordz Needs Help to Complete Album" ...
##  $ category        : chr  "Poetry" "Narrative Film" "Narrative Film" "Music" ...
##  $ main_category   : chr  "Publishing" "Film & Video" "Film & Video" "Music" ...
##  $ currency        : chr  "GBP" "USD" "USD" "USD" ...
##  $ deadline        : chr  "2015-10-09" "2017-11-01" "2013-02-26" "2012-04-16" ...
##  $ goal            : num  1000 30000 45000 5000 19500 50000 1000 25000 125000 65000 ...
##  $ launched        : chr  "2015-08-11 12:12:28" "2017-09-02 04:43:57" "2013-01-12 00:20:50" "2012-03-17 03:24:11" ...
##  $ pledged         : num  0 2421 220 1 1283 ...
##  $ state           : chr  "failed" "failed" "failed" "failed" ...
##  $ backers         : int  0 15 3 1 14 224 16 40 58 43 ...
##  $ country         : chr  "GB" "US" "US" "US" ...
##  $ usd.pledged     : num  0 100 220 1 1283 ...
##  $ usd_pledged_real: num  0 2421 220 1 1283 ...

The projects are divided into main and sub-categories. The pledged amount ("pledged") has an equivalent value converted to USD, called "usd_pledged_real". However, the goal amount does not have this conversion, so for now we will use the amounts as they are.

We can see how many people are backing each individual project using the "backers" column.

b) Now let us look at the first few records:

The name doesn’t really indicate any specific pattern, although it might be interesting to see whether longer names have better success rates. We will not pursue that angle at this time, though.

##           ID                                                       name
## 1 1000002330                            The Songs of Adelaide & Abullah
## 2 1000003930              Greeting From Earth: ZGAC Arts Capsule For ET
## 3 1000004038                                             Where is Hank?
## 4 1000007540          ToshiCapital Rekordz Needs Help to Complete Album
## 5 1000011046 Community Film Project: The Art of Neighborhood Filmmaking
## 6 1000014025                                       Monarch Espresso Bar
##         category main_category currency   deadline  goal
## 1         Poetry    Publishing      GBP 2015-10-09  1000
## 2 Narrative Film  Film & Video      USD 2017-11-01 30000
## 3 Narrative Film  Film & Video      USD 2013-02-26 45000
## 4          Music         Music      USD 2012-04-16  5000
## 5   Film & Video  Film & Video      USD 2015-08-29 19500
## 6    Restaurants          Food      USD 2016-04-01 50000
##              launched pledged      state backers country usd.pledged
## 1 2015-08-11 12:12:28       0     failed       0      GB           0
## 2 2017-09-02 04:43:57    2421     failed      15      US         100
## 3 2013-01-12 00:20:50     220     failed       3      US         220
## 4 2012-03-17 03:24:11       1     failed       1      US           1
## 5 2015-07-04 08:35:03    1283   canceled      14      US        1283
## 6 2016-02-26 13:38:27   52375 successful     224      US       52375
##   usd_pledged_real
## 1                0
## 2             2421
## 3              220
## 4                1
## 5             1283
## 6            52375

c) Looking for missing values:

Hurrah, a really clean dataset: apart from ~3,800 NAs in the "usd.pledged" column, there are no missing values, and even a search for "empty" strings in the name field comes back clean.

# Check for NAs:
sapply(ksdf, function(x) sum(is.na(x)))

##               ID             name         category    main_category 
##                0                0                0                0 
##         currency         deadline             goal         launched 
##                0                0                0                0 
##          pledged            state          backers          country 
##                0                0                0                0 
##      usd.pledged usd_pledged_real 
##             3797                0

# Check for empty strings:
nrow(subset(ksdf, is.na(ksdf$name) | ksdf$name == ""))

## [1] 0

 d) Date Formatting and splitting:

We have two dates in our dataset, the launch date ("launched") and the deadline date ("deadline"). We convert them from strings to date format.
We also split these dates into their respective year and month columns so that we can plot variations over time.
We will now have 4 new columns: launch_year, launch_month, deadline_year and deadline_month.
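The conversion code is not shown in the post; a minimal sketch with lubridate (the four new column names come from the text above, everything else is an assumption):

# Assumed date-handling step, sketched with lubridate (not necessarily the author's exact code):
library(lubridate)

ksdf$launched <- ymd_hms(ksdf$launched)   # "2015-08-11 12:12:28" -> POSIXct
ksdf$deadline <- ymd(ksdf$deadline)       # "2015-10-09"          -> Date

# Split into year / month columns for plotting trends over time:
ksdf$launch_year    <- year(ksdf$launched)
ksdf$launch_month   <- month(ksdf$launched)
ksdf$deadline_year  <- year(ksdf$deadline)
ksdf$deadline_month <- month(ksdf$deadline)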

Exploratory analysis:

a) How many projects are successful?

prop.table(table(ksdf$state))*100

## 
##   canceled     failed       live successful  suspended  undefined 
## 10.2410864 52.2153060  0.7391836 35.3762336  0.4875073  0.9406831

We see that "failed" and "successful" are the two main categories, comprising ~88% of our dataset.
Sadly, we do not know why some projects are marked "undefined" or "canceled".
"Live" projects are those whose deadlines have not yet passed, although a few of them have already achieved their goal.
Surprisingly, some "canceled" projects had also met their goals (pledged amount >= goal).
Since these other categories are a very small portion of the dataset, we will subset the data and only consider records with status "failed" or "successful" for the rest of the analysis, as sketched below.
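The subsetting code is not shown; a one-line sketch, assuming the data frame keeps the name ksdf:

# Keep only the two main outcomes for the rest of the analysis (assumed step):
ksdf <- subset(ksdf, state %in% c("failed", "successful"))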

b) How many countries have projects on kickstarter?

## 
##     AT     AU     BE     CA     CH     DE     DK     ES     FR     GB 
##    485   6616    523  12370    652   3436    926   1873   2520  29454 
##     HK     IE     IT     JP     LU     MX  N,0""     NL     NO     NZ 
##    477    683   2369     23     57   1411    210   2411    582   1274 
##     SE     SG     US 
##   1509    454 261360

We see that the projects are overwhelmingly US-based. Some records carry the malformed country tag N,0"", so we mark them as "unknown", as sketched below.
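The recoding step is not shown; a small sketch (the literal string in the comparison is copied from the table above and may need adjusting for quoting):

# Replace the malformed country code with an explicit "unknown" label (assumed step);
# check the exact malformed value with table(ksdf$country) before running.
ksdf$country[ksdf$country == 'N,0"'] <- "unknown"
table(ksdf$country)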

c) Number of projects launched per year:

## 
##  2009  2010  2011  2012  2013  2014  2015  2016  2017 
##  1179  9577 24049 38480 41101 59306 65272 49292 43419

Looks like some records have launch dates like 1970, which cannot be right. So we discard any records with a launch / deadline year before 2009.
Plotting the counts per year on a graph (a sketch follows below), it looks like the number of projects peaked in 2015 and then declined. However, this should NOT be taken as an indicator of success rates.
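The filtering and plotting code is not shown; a sketch using ggplot2 (the package choice is an assumption, any bar plot would do):

# Drop records with obviously wrong dates (assumed step, per the text above):
ksdf <- subset(ksdf, launch_year >= 2009 & deadline_year >= 2009)

# Bar chart of project counts by launch year:
library(ggplot2)
ggplot(ksdf, aes(x = factor(launch_year))) +
  geom_bar(fill = "steelblue") +
  labs(x = "Launch year", y = "Number of projects",
       title = "Kickstarter projects launched per year")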

 

 

Let us drill down a bit more and look at the count of projects by main_category (a sketch of the grouping follows after the list below).

Over the years, the largest numbers of projects have been in the categories:

    1. Film & Video
    2. Music
    3. Publishing
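The grouping behind this chart is produced off-screen; a minimal sketch with dplyr and ggplot2 (both package choices are assumptions):

# Count projects per launch year and main category, then plot grouped bars:
library(dplyr)
library(ggplot2)

ksdf %>%
  count(launch_year, main_category) %>%
  ggplot(aes(x = factor(launch_year), y = n, fill = main_category)) +
  geom_col(position = "dodge") +
  labs(x = "Launch year", y = "Number of projects", fill = "Main category")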

 d) Number of projects by sub-category: (Top 20 only)


The Top 5 sub-categories are:

    1. Product Design
    2. Documentary
    3. Music
    4. Tabletop Games (interesting!!!)
    5. Shorts (really?!)

Let us now see the "state" of projects for these Top 5 sub-categories (a sketch of building such a chart follows below):
From the graph below, we see that for the sub-categories "Shorts" and "Tabletop Games" there are more successful projects than failed ones.
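The chart itself is built off-screen; here is a minimal sketch of one way to do it, assuming dplyr and ggplot2 as loaded above, with top5 simply holding the five sub-category names listed earlier:

# Success vs failure counts for the five biggest sub-categories (assumed plotting step):
top5 <- c("Product Design", "Documentary", "Music", "Tabletop Games", "Shorts")

ksdf %>%
  filter(category %in% top5) %>%
  count(category, state) %>%
  ggplot(aes(x = category, y = n, fill = state)) +
  geom_col(position = "dodge") +
  labs(x = "Sub-category", y = "Number of projects", fill = "State")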

 e) Backers by category and sub-category:

Since there are a lot of sub-categories, let us explore the sub-categories under the main theme "Design".

Product Design is not just the sub-category with the highest count of projects, but also the one with the highest success ratio (a sketch of the calculation follows).
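The success-ratio numbers are also computed off-screen; a sketch with dplyr (an assumption), treating the ratio as the share of "successful" projects among the failed + successful records:

# Backers and success ratio per sub-category within the "Design" main category (assumed calculation):
ksdf %>%
  filter(main_category == "Design") %>%
  group_by(category) %>%
  summarise(projects     = n(),
            backers      = sum(backers),
            success_rate = mean(state == "successful")) %>%
  arrange(desc(success_rate))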

 f) Add a flag to see how many projects got funded more than their goal:

ksdf$goal_flag <- ifelse(ksdf$pledged >= ksdf$goal, 1, 0)
prop.table(table(ksdf$goal_flag))*100

## 
##        0        1 
## 59.61197 40.38803

So ~40% of projects reached or surpassed their goal, which matches the proportion of successful projects.

 g) Calculate average contribution per backer:
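The derivation of the contrib column is not shown in the post; a minimal sketch, assuming it is simply the pledged amount divided by the number of backers, with zero-backer projects set to 0:

# Average contribution per backer (assumed formula: pledged / backers):
ksdf$contrib <- ifelse(ksdf$backers > 0, ksdf$pledged / ksdf$backers, 0)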

From the mean, median and max values we quickly see that the median amount contributed per backer is only ~$40, whereas the mean is higher due to extreme positive values. The maximum average contribution is about $50,000 per backer.

summary(ksdf$contrib)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00    16.00    41.78    73.35    78.00 50000.00

hist(ksdf$contrib, main = "Histogram of contribution per backer")

h) Calculate reach_ratio

The amount per backer is a good start, but what if the goal amount itself is only $1,000? Then an average contribution per backer of $50 implies we only need 20 backers.
So, to better understand the probability of a project's success, we create a derived metric called "reach_ratio" (sketched below).
This takes the average contribution per backer and compares it against the goal amount.
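The exact formula for reach_ratio is not spelled out in the post; going by the interpretation below (a median of 0.75 read as "less than 1%"), here is a sketch that expresses the ratio as a percentage:

# Reach ratio: average contribution per backer as a percentage of the goal (assumed formula):
ksdf$reach_ratio <- (ksdf$contrib / ksdf$goal) * 100
summary(ksdf$reach_ratio)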

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      0.00      0.16      0.75      6.17      2.16 166666.67

We see that the median reach_ratio is less than 1%; only at the third quartile do we even touch 2%.
Clearly most projects have a very low reach ratio. We could subset for "successful" projects only and check whether their reach_ratio is higher.

 i) Number of days to achieve goal:
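The calculation is not shown in the post; a sketch that measures the funding period as the number of days between launch and deadline (the column name funding_days is my own):

# Funding period in days: deadline minus launch date (assumed calculation);
# assumes 'deadline' and 'launched' were already converted to dates in the pre-processing step.
ksdf$funding_days <- as.numeric(difftime(ksdf$deadline, as.Date(ksdf$launched), units = "days"))
summary(ksdf$funding_days)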

 Predictive Analytics:

We will apply a very simple decision tree algorithm to our dataset.
Since we do not have a separate "test" set, we will split the input data frame into two parts (a 70/30 split).
We will use the smaller set to test the accuracy of our algorithm.

ksdf$status = ifelse(ksdf$state == 'failed', 0, 1)

## 70% of the sample size
smp_size <- floor(0.7 * nrow(ksdf))

## set the seed to make your partition reproducible
set.seed(486)
train_ind <- sample(seq_len(nrow(ksdf)), size = smp_size)

train <- ksdf[train_ind, ]
test <- ksdf[-train_ind, ]

library(tree)
tree1 <- tree(status ~ goal + reach_ratio + category + backers + country + launch_year, data = train)

## Warning in tree(status ~ goal + reach_ratio + category + backers + country
## + : NAs introduced by coercion

summary(tree1)

## 
## Regression tree:
## tree(formula = status ~ goal + reach_ratio + category + backers + 
##     country + launch_year, data = train)
## Variables actually used in tree construction:
## [1] "backers"     "reach_ratio"
## Number of terminal nodes:  9 
## Residual mean deviance:  0.02429 = 5640 / 232200 
## Distribution of residuals:
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.9816000 -0.0006945 -0.0006945  0.0000000  0.0410400  0.9993000

Taking a peek at the decision tree rules:

plot(tree1)
text(tree1, pretty = 0)

Kickstarter success decision tree


tree1

## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 232172 55900.00 0.4039000  
##    2) backers < 17.5 121911  9326.00 0.0834600  
##      4) reach_ratio < 5.88118 107986    74.95 0.0006945 *
##      5) reach_ratio > 5.88118 13925  2774.00 0.7253000  
##       10) backers < 5.5 4723  1031.00 0.3218000  
##         20) reach_ratio < 19.9667 2792     0.00 0.0000000 *
##         21) reach_ratio > 19.9667 1931   323.50 0.7872000 *
##       11) backers > 5.5 9202   580.00 0.9324000 *
##    3) backers > 17.5 110261 20210.00 0.7583000  
##      6) reach_ratio < 0.79672 40852 10190.00 0.5217000  
##       12) backers < 128.5 16858    39.91 0.0023730 *
##       13) backers > 128.5 23994  2413.00 0.8866000 *
##      7) reach_ratio > 0.79672 69409  6383.00 0.8975000  
##       14) backers < 35.5 20535  3836.00 0.7514000  
##         28) reach_ratio < 2.85458 4816     0.00 0.0000000 *
##         29) reach_ratio > 2.85458 15719   284.60 0.9816000 *
##       15) backers > 35.5 48874  1924.00 0.9590000 *

Thus we see that "backers" and "reach_ratio" are the most significant variables.

Re-applying the tree rules to the training set itself, we can validate our model:

Predt <- predict(tree1, train)

validf <- data.frame(kickstarter_id = train$ID, orig_status = train$status, new_status = Predt)
validf$new = ifelse(validf$new_status < 0.5, 0, 1)

# Contingency tables:
table(validf$orig_status, validf$new)

##    
##          0      1
##   0 132337   6051
##   1    115  93669

# Area under the curve
library(pROC)
auc(validf$orig_status, validf$new)

## Area under the curve: 0.9775

From the above tables, we see that the error rate is ~3% and the area under the curve is ~0.98.

Finally, applying the tree rules to the test set, we get the following stats (the evaluation is sketched below):

Pred1 <- predict(tree1, test)
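The rest of the test-set evaluation is not visible in the post; a sketch that simply mirrors the training-set check above:

# Mirror the training-set validation on the held-out test set (assumed step):
testf <- data.frame(kickstarter_id = test$ID, orig_status = test$status, new_status = Pred1)
testf$new <- ifelse(testf$new_status < 0.5, 0, 1)

table(testf$orig_status, testf$new)   # contingency table
auc(testf$orig_status, testf$new)     # area under the curve (pROC loaded above)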


From these tables, we see that the error rate is still ~3% and the area under the curve is above 0.97.

 

Conclusion:

Thus, in this tutorial, we explored the factors that contribute to a project's success. The main theme and sub-category were important, but the number of backers and the "reach_ratio" were found to be the most critical.
If a founder wanted to gauge their probability of success, they could measure their "reach_ratio" halfway to the deadline, or perhaps when 25% of the timeline is complete. If the numbers are low, it means they need to double down and use promotions / social media marketing to attract more backers and funding.

If you liked this tutorial, feel free to fork the script. And don't forget to upvote the kernel!
