Customizable Dash front-ends for word2vec and NLP backends
“When is Mom’s birthday?” “Remind me to pick up flowers and a cake this afternoon.” “How do I get to the nearest flower shop?” “Find the best bakeries near me.” “Find a route to Mom’s house.”
In our modern times, personal assistants are not exclusive to executives in high-powered jobs; we can now talk directly to our phones and ask them for what we need. This is made possible by advances in the human-technology interface.
The quality of this interface rests on two directions of communication: human-given directives and technology-produced outputs. Unfortunately, humans and computers rarely speak the same language. Since you’d be hard-pressed to find someone who is fluent in machine code, the bottleneck for communication in this particular case is computers’ limited ability to understand the way that humans talk.
In this article, we’ll go through a subset of the steps involved in building a machine learning application, highlighting some of the techniques used in one of our demo applications. You can find the app in our gallery.
Enter NLP
Natural language processing, or NLP, is what allows us to parameterize human speech while preserving the correct semantic meaning of a string of words—like “When is Mom’s birthday?” It has a multitude of use cases that range from personal-assistant software to predictive text and semantic analysis.
Creating an NLP model is a process that involves many different steps. To facilitate knowledge transfer and encourage future development, we need a way to visualize, explore, and share the results. Continue reading below to see how we were able to do this with the machine learning app above, using Dash in conjunction with our Dash Enterprise offerings.
Building and evaluating an NLP model with Dash
NLP employs a wide variety of complex algorithms. Three such examples are word2vec, UMAP, and t-SNE. The word2vec algorithm encodes words as N-dimensional vectors—this is also known as “word embedding.” UMAP and t-SNE are two algorithms that reduce high-dimensional vectors to two or three dimensions (more on this later in the article). In the application, we are using word2vec to encode words as 300-dimensional vectors, and using UMAP and t-SNE to reduce those 300 dimensions to three.
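To make the embedding step concrete, here is a minimal sketch using the gensim Python library (which the app also relies on later). It assumes you have separately downloaded the publicly available pretrained Google News vectors; the file is not bundled with the app.

```python
from gensim.models import KeyedVectors

# Load 300-dimensional vectors pretrained on the Google News corpus.
# (Assumes "GoogleNews-vectors-negative300.bin" was downloaded beforehand.)
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Each word maps to a single 300-dimensional vector.
print(vectors["birthday"].shape)  # (300,)
```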
There are many directions that one can take when developing a natural language processing model. In the case of our machine learning application, we are mainly looking to understand how different parameters affect the “correctness” of the dimension reduction.
An ideal dimension reduction algorithm will preserve the spatial relationships between words that were established with word2vec. To roughly evaluate the accuracy of our algorithm, we display the 3D coordinates of each word as a point in a 3D graph created with plotly.py. Then, we allow the user to select a word by clicking on its corresponding point in the graph; this will highlight its nearest neighbours in the original 300-dimensional space. In addition, there is a bar graph on the right that will display the Euclidean distance in the original space from the neighbours to the selected word. If our model is a good one, the highlighted words should be clustered closely around our selection.
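Here is a rough sketch of that neighbour-highlighting logic. The random arrays stand in for the app’s real word2vec and UMAP/t-SNE outputs, and the variable and function names are ours, not the app’s.

```python
import numpy as np
import plotly.express as px

rng = np.random.default_rng(0)
words = [f"word_{i}" for i in range(500)]   # stand-in vocabulary
high_dim = rng.normal(size=(500, 300))      # word2vec space, shape (N, 300)
low_dim = rng.normal(size=(500, 3))         # reduced space, shape (N, 3)

def nearest_neighbours(selected_idx, embeddings, k=10):
    """Return the k closest words to the selection, measured by Euclidean
    distance in the original 300-dimensional space, plus those distances."""
    dists = np.linalg.norm(embeddings - embeddings[selected_idx], axis=1)
    order = np.argsort(dists)[1 : k + 1]    # skip the selected word itself
    return order, dists[order]

# The 3D view the user clicks on; in a Dash app, the clicked point's index
# arrives through the figure's clickData callback property.
fig = px.scatter_3d(x=low_dim[:, 0], y=low_dim[:, 1], z=low_dim[:, 2],
                    hover_name=words)

idx, dists = nearest_neighbours(0, high_dim)
print([words[i] for i in idx])  # the words to highlight; dists feeds the bar graph
```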
Before we proceed with actually building the app, let’s get down some foundational principles of the steps that we need to take to create it.
Finding a suitable dataset
Since there are essentially two steps to this process (computing a word embedding, then reducing its dimensionality), we need two datasets: one to train word2vec before we can use it to compute the word embedding, and another for the dimensionality reduction step. Both of these datasets should be large to produce meaningful results.
A good dataset, on top of being large, will contain examples of the way that actual humans communicate naturally. When selecting a dataset, it’s also important to acknowledge that it might have some inherent bias, based on the population from which the writing samples have been taken.
In our machine learning application, we have chosen to use a word2vec model that was trained on a dataset from Google News; for the dimensionality reduction, we have included two datasets—one from Twitter and another from Wikipedia.
Now that we have our data, we can begin the analysis process.
Converting words to points in space with word2vec
We understand words based on their meanings and their relationships to other words. For instance, we know that a “king” is a male monarch, and that “king” is to “queen” as “man” is to “woman.” However, words on their own rarely convey enough meaning to be useful. To understand sentences as full thoughts, we supplement this information with contextual clues; this allows us to resolve ambiguities for sets of words that look the same but have different meanings. Consider the phrases “she will lead the new product proposal meeting” and “she will lead us on this hike.” Based on the context, we can infer that “lead” refers to being in charge or in control in the first case. In the second case, it has a more physical meaning of being a guide along a path.
The ideal output of the word2vec algorithm is a representation of words as vectors that preserve contextual patterns and analogous relationships. Words that are often seen in close proximity to one another will have semantic closeness translated to physical closeness in space; for example, the words “telephone” and “number” may be close to one another. Additionally, the vectors between analogous pairs of words will have similar characteristics—the magnitude and direction of the vector from, for example, “king” to “queen” will be approximately the same as the magnitude and direction of the vector from “man” to “woman.” In the case of our example of contextual clues, we might find the word “lead” close to both of the words “meeting” and “hike” since it is found in both of those contexts.
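With a pretrained model loaded in gensim, this analogy property is easy to probe. The snippet below is illustrative only; the exact ranking it returns depends on the training data.

```python
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# "king" - "man" + "woman": if analogous pairs really do share vector
# offsets, the nearest words to this combination should include "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```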
We’ve chosen to have the word2vec algorithm map words to a 300-dimensional vector space in our app. When words are encoded along this many different dimensions, we have a lot of information about them; however, a 300-dimensional space isn’t easy for humans to understand.
Dimension reduction with UMAP and t-SNE
UMAP and t-SNE are two algorithms that serve the purpose of reducing high-dimensional vectors—like those generated by our word2vec algorithm—to dimensionalities that make more sense to us. In the case of this app, we’ve chosen to reduce the data to three dimensions (keep in mind, however, that it’s also possible to reduce it to two dimensions). As was mentioned earlier, a good dimension reduction algorithm will preserve the closeness of sets of points that are near one another in the original higher-dimensional space.
Putting it all together with Dash
Step one: Finding a suitable dataset
In our app, we have provided two preloaded datasets that are typically used in word embedding examples: one taken from a selection of tweets and one from articles on Wikipedia.
As a developer, you may not want the user to be limited to a fixed selection of datasets. In the “Advanced” tab, we have made use of a dcc.Upload component, which allows the user to upload any text file they have available.
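A minimal layout fragment using dcc.Upload might look like the following sketch, which assumes Dash 2-style imports; the component IDs and the callback are placeholders of ours, not the actual app’s code.

```python
from dash import Dash, dcc, html, Input, Output, State

app = Dash(__name__)
app.layout = html.Div([
    dcc.Upload(
        id="upload-corpus",
        children=html.Div(["Drag and drop or ", html.A("select a text file")]),
        multiple=False,  # accept one corpus at a time
    ),
    html.Div(id="upload-status"),
])

@app.callback(
    Output("upload-status", "children"),
    Input("upload-corpus", "contents"),   # base64-encoded file contents
    State("upload-corpus", "filename"),
)
def show_filename(contents, filename):
    return f"Loaded: {filename}" if contents else "No file uploaded yet."
```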
Step two: Converting words to points in space using word2vec
There are many different variables that can influence our specific implementation of the word2vec algorithm to better understand the relationships between words. As an example, in the “Advanced” tab of this application, we’ve included the option of using the skip-gram (SG) or the continuous bag-of-words (CBOW) approach when computing the word embedding. SG attempts to predict the context of any given word; for instance, if you give it the word “rainy”, it might predict the context as being “it was a [rainy] day.” CBOW attempts to predict a word, given a context; if you give it the phrase “it was a […] day”, it might predict the missing word to be “rainy.”
In our “Overview” tab, we’ve used a word2vec model that was trained on data from Google News articles. In the “Advanced” tab, we have the option of training a model ourselves with the gensim Python library. We train the model using text from the selected dataset (in this case, “Alice”) and our selection of SG or CBOW.
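A hedged sketch of that training step with gensim might look like this; the corpus file name, line-by-line preprocessing, and parameter values are our assumptions rather than the app’s exact settings (gensim 4.x calls the dimensionality parameter vector_size; older versions called it size).

```python
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Hypothetical corpus file standing in for the selected dataset.
with open("alice.txt", encoding="utf-8") as f:
    sentences = [simple_preprocess(line) for line in f if line.strip()]

use_skip_gram = True  # toggled by the user's SG/CBOW choice in the UI

model = Word2Vec(
    sentences,
    vector_size=300,               # match the pretrained model's dimensionality
    sg=1 if use_skip_gram else 0,  # 1 = skip-gram, 0 = CBOW
    window=5,
    min_count=2,
)
word_vectors = model.wv            # the trained KeyedVectors
```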
Since these two methods will likely produce different results, we’ve also added our Snapshot Engine tool. This takes a “snapshot” of the state of the app and assigns it a unique URL; so, if we want to share the results that we got using a particular set of parameters, we don’t have to communicate all of that extra information ourselves; we can simply generate a snapshot and send the corresponding URL. We can also look through an archive of all the snapshots created in the past in the “Snapshots” tab, for a comprehensive history of how the app has been used.
Step three: Dimension reduction with UMAP and t-SNE
In this particular app, to save time, we’ve pre-computed the t-SNE mappings for each combination of the modelling parameters in the “Overview” tab. However, it is also possible to use UMAP. In the “Advanced” tab, we’ve used the sklearn and umap Python libraries to compute the dimension reduction according to the algorithm that the user has chosen.
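In code, the reduction step is short with either library. This sketch uses a random matrix in place of the real 300-dimensional word vectors, and default parameters that the app may well override.

```python
import numpy as np
from sklearn.manifold import TSNE
import umap  # the umap-learn package

# Stand-in for the (N, 300) matrix of word2vec vectors.
word_vectors_300d = np.random.default_rng(0).normal(size=(1000, 300))

# t-SNE: preserves local neighbourhoods well, but is expensive for large N.
tsne_3d = TSNE(n_components=3).fit_transform(word_vectors_300d)

# UMAP: typically faster, and also aims to preserve local structure.
umap_3d = umap.UMAP(n_components=3).fit_transform(word_vectors_300d)

print(tsne_3d.shape, umap_3d.shape)  # both (1000, 3)
```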
Regardless of which algorithm you use, performance is an important consideration—both are fairly expensive in terms of processing time. We can, however, add flexibility without sacrificing too much of the end-user experience. Since we’ve chosen to deploy this app using Dash Enterprise’s App Manager, we could leverage the task-scheduling capabilities of Celery to run the computations asynchronously in the background, allowing the user to continue interacting with the app. Additionally, as mentioned earlier, we can save the results we get with a particular set of parameters by using the Snapshot Engine. This allows us to directly view and interact with those results in the future without having to re-run this time-intensive computation.
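As a rough illustration of the asynchronous pattern (and only that), a generic Celery task could look like the following; the broker URL, task name, and wiring are our assumptions and say nothing about Dash Enterprise’s actual internals.

```python
from celery import Celery

# Hypothetical broker/backend configuration.
celery_app = Celery("nlp_tasks",
                    broker="redis://localhost:6379/0",
                    backend="redis://localhost:6379/0")

@celery_app.task
def reduce_dimensions(vectors, algorithm="tsne"):
    """Run the chosen reduction in a worker process so the UI stays responsive."""
    import numpy as np
    arr = np.asarray(vectors)
    if algorithm == "umap":
        import umap
        return umap.UMAP(n_components=3).fit_transform(arr).tolist()
    from sklearn.manifold import TSNE
    return TSNE(n_components=3).fit_transform(arr).tolist()

# A Dash callback would enqueue the job and poll for the result:
#   job = reduce_dimensions.delay(vectors_300d.tolist())
#   coords = job.get(timeout=600)
```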
We can take a snapshot here if we want, and then access it later via the generated URL—the entire view will be preserved.
We now have a fully functional application that encourages exploration and experimentation. Furthermore, the ability to quickly share results makes it easy to collaborate with others as we develop our model.
Why Dash?
There are many libraries already available in Python and R that have been created specifically to implement the algorithms that we’ve discussed. Dash is available in both of these languages, which allows us to seamlessly integrate computation and visualization.
Specifically, the machine learning app that we’ve created serves as a robust but intuitive interface to the complex mathematics that go into word embeddings and dimension reduction for natural language processing. Along the way, we’ve used many features of Dash and Dash Enterprise:
- Interactive components, like dropdowns and sliders, that allow us to quickly view different datasets and change parameters that we use to train our models;
- An interactive scatter plot, connected to other parts of the app, that allows us to highlight only the data that we are interested in (namely, the n nearest neighbours of a given word);
- The Snapshot Engine, which allows us to save and quickly share configurations that we use for the models in this app, as well as specific views of the 3D scatter plot;
- The App Manager, which hosts the app so that it can be shared by multiple users, and which allows for asynchronous task scheduling for the computationally expensive process of dimension reduction;
- The Design Kit, which allows us to easily make this application as visually stunning as it is functional, regardless of whether the user is on a laptop, tablet, or mobile phone.
Interested in learning more about this application, Dash, or Dash Enterprise? Schedule a virtual Dash workshop to upskill your machine learning & data science team during this era of remote work or join us during our live weekly Dash demos.