What Can We Learn from the Apps on Your Smartphone? Topic Modeling and Matrix Factorization
[This article was first published on Engaging Market Research, and kindly contributed to R-bloggers].
The website for The Burning House begins with a simple question:
If your house was burning, what would you take with you? It’s a conflict between what’s practical, valuable and sentimental. What you would take reflects your interests, background and priorities. Think of it as an interview condensed into one question.
But what about the more mundane and everyday stuff? As an example from an earlier post, I borrowed a quote popularized by the Iron Chef, “Tell me what you eat, and I will tell you what you are.”
Do the choices we make reveal some underlying trait or situational demands that will enable us to predict future behavior? If one gathers up all the valuables and leaves the wedding album behind to burn in the house, can we use this information to guess more accurately how long the marriage will last?
What about the apps on your smartphone? The list is long and growing with games, lifestyle, travel, music and entertainment, plus utilities, education, books, reference and business. Such a categorization imposes an organization that may not reflect the patterns we see in actual app download and usage. As an analogy, we can divide the supermarket into aisles marked with different signage (bread here in the middle of the store and butter over there along the wall), yet the market baskets rolled through the checkout reflect a very different associative network. Your phone is the basket, and your apps are the items purchased (or frequency of app usage equals amount of item bought). What is in your basket reflects your personal combination of wants and needs (e.g., a special trip before the big game on TV or your primary shopping for the week).
An alternative approach might treat this as a form of topic modeling with your phone as a document and app usage levels as frequency of word occurrences. Topics are the latent variables that generate the pattern of app co-occurrences (e.g., similar interests, needs or networks). The R package stm for Structural Topic Modeling may provide a gentle introduction for the social scientist, although Latent Dirichlet Allocation (LDA) remains one step beyond the statistical training of most. This, of course, will change as more researchers are motivated to learn the mathematics given the promise of easier-to-use R packages.
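To make the document analogy concrete, here is a minimal Python sketch (not code from the original post, which works in R) that turns a raw usage log into a user-by-app count matrix, the counterpart of a document-term matrix. The log format, user names, and app names are all hypothetical.

```python
from collections import Counter

# Hypothetical usage log: one (user, app) pair per app session.
usage_log = [
    ("ann", "camera"), ("ann", "camera"), ("ann", "email"),
    ("bob", "chess"), ("bob", "chess"), ("bob", "chess"), ("bob", "email"),
    ("cat", "camera"), ("cat", "facebook"),
]

def usage_matrix(log):
    """Return (users, apps, matrix) where matrix[i][j] counts sessions of app j by user i."""
    counts = Counter(log)                      # sessions per (user, app) pair
    users = sorted({user for user, _ in log})  # rows: phones ("documents")
    apps = sorted({app for _, app in log})     # columns: apps ("words")
    matrix = [[counts[(u, a)] for a in apps] for u in users]
    return users, apps, matrix

users, apps, matrix = usage_matrix(usage_log)
for user, row in zip(users, matrix):
    print(user, dict(zip(apps, row)))
```

Each row plays the role of a document and each column the role of a word, so the same matrix can be handed to either a topic model or a matrix factorization.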
Meanwhile, nonnegative matrix factorization (NMF) accomplishes a similar task with decomposition techniques from linear algebra. We start with the assumption that there are underlying usage patterns, such as taking pictures, emailing them or posting them to Facebook. Several apps are likely to achieve the same goal. The purposes that organize app usage are the latent variables. These same latent variables are also responsible for similarity among users. Users are similar because they use the same apps, but they use the same apps because they share the same motivations or purposes represented by the latent variables. The apps work together for like users to achieve the same ends.
To simplify, we can think of this as a joint factor analysis of the apps and a cluster analysis of the users. Thus, from a single data matrix NMF delivers two matrices: (1) “factor loadings” showing the relationship between the latent variables and the apps and (2) cluster membership weights reported for each user indicating the contribution of each latent variable to that user’s app usage profile. We now have a matrix factorization or decomposition:
[ usage data ] = [ user latent profile ] x [ app latent profile ].
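The factorization above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: it uses the Lee and Seung multiplicative updates for squared error, which is one common NMF algorithm and not necessarily what the R package NMF runs by default, and the toy usage matrix is made up. Here `w` is the user latent profile and `h` is the app latent profile.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Sum of squared differences between v and the product w x h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor nonnegative v (users x apps) into w (users x rank) and h (rank x apps)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start so that no entry is stuck at zero.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), elementwise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h

# Made-up usage counts: rows are users, columns are apps.
usage = [
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 3, 4],
    [0, 0, 4, 3],
    [2, 2, 2, 2],  # a user who mixes both usage patterns
]
w, h = nmf(usage, rank=2)
print("squared error:", frobenius_error(usage, w, h))
```

Because the updates only multiply nonnegative numbers, both factors stay nonnegative, which is what allows the rows of `h` to be read as "factor loadings" for apps and the rows of `w` as cluster membership weights for users.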
Do you play games on your phone to pass the time while waiting? These games are likely not the same ones you would play if you wanted to compete against others. More than one game can accomplish the same goal, so multiple apps “load” on the same latent variable. A user might do both at different times; therefore, that user belongs in both “clusters” (with each user cluster defined by a different latent variable). In NMF, the whole can be generated as the sum of the parts, which I have illustrated in my building blocks post.
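A small Python sketch of the "sum of the parts" idea follows. The app names, the two latent profiles, and the user's weights are hypothetical numbers chosen by hand for illustration; they are not output from any fitted model.

```python
# Hypothetical apps and two hand-made latent app profiles.
apps = ["solitaire", "sudoku", "chess_online", "racing_online"]
app_profiles = [
    [1.0, 1.0, 0.0, 0.0],  # latent variable 1: passing the time while waiting
    [0.0, 0.0, 1.0, 1.0],  # latent variable 2: competing against others
]

# One user's membership weights on the two latent variables.
user_weights = [3.0, 2.0]

# Each part is one latent profile scaled by the user's weight on it.
parts = [[weight * loading for loading in profile]
         for weight, profile in zip(user_weights, app_profiles)]

# The whole usage profile is the elementwise sum of the parts.
whole = [sum(column) for column in zip(*parts)]

print(dict(zip(apps, whole)))
```

The user has a nonzero weight on both latent variables, so the reconstructed usage profile draws on both blocks of games at once.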
Nielsen reports that fewer than 30 apps are used by the average smartphone owner. Can we agree that the usage data matrix will be sparse, given the number of available apps? However, we expect to find user-by-app blocks with higher densities. Such “clumping” occurs in high-dimensional spaces when users are heterogeneous with different groupings of wants and needs. These user clusters seek out blocks of apps that serve their purposes, creating joint clusters of users and apps. This is what we uncover with NMF and LDA.
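The contrast between overall sparsity and dense user-by-app blocks can be shown with a toy example in Python. The matrix below is invented for illustration, and the block is picked by hand rather than discovered by NMF or LDA.

```python
def density(matrix, rows=None, cols=None):
    """Share of nonzero cells in the submatrix given by rows and cols (default: all)."""
    rows = range(len(matrix)) if rows is None else rows
    cols = range(len(matrix[0])) if cols is None else cols
    cells = [matrix[i][j] for i in rows for j in cols]
    return sum(1 for cell in cells if cell > 0) / len(cells)

# Invented usage indicators: rows are users, columns are apps.
usage = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 0],
]

overall = density(usage)                              # whole matrix
block = density(usage, rows=[0, 1], cols=[0, 1, 2])   # first two users, first three apps

print("overall density:", overall)
print("block density:", block)
```

The full matrix is mostly zeros, while the hand-picked block is nearly full; finding such blocks without being told where they are is the job left to the factorization.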
Note: My post on Brand and Product Category Representation ends with a list of examples containing the R code needed to run the R package NMF and perform the type of analysis reviewed here.