# install.packages("topicmodels")
# install.packages("tm")
library(topicmodels)
library(tm)
Loading required package: NLP
Changjun Lee
May 6, 2023
Topic modeling is an unsupervised machine learning technique used to discover latent structures within a large collection of documents or text data. This technique is particularly useful for exploring, organizing, and understanding vast text corpora. One popular method for topic modeling is Latent Dirichlet Allocation (LDA), a generative probabilistic model that assumes each document is a mixture of topics and each topic is a probability distribution over words. In this blog post, we will delve into the details of LDA and demonstrate how to perform topic modeling in R using the ‘topicmodels’ package.
Latent Dirichlet Allocation (LDA):
LDA is based on the idea that documents are mixtures of topics, where each topic is a probability distribution over a fixed vocabulary. The generative process for LDA can be summarized as follows:
For each topic \(k\), sample a word distribution \(\phi_k \sim \mathrm{Dir}(\beta)\).
For each document \(d\), sample a topic distribution \(\theta_d \sim \mathrm{Dir}(\alpha)\).
For each word position \(n\) in document \(d\), sample a topic \(z_{d,n} \sim \mathrm{Multinomial}(\theta_d)\), then sample the word \(w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}})\).
Here, Dir(α) and Dir(β) denote Dirichlet distributions with parameters α and β, respectively. α and β are hyperparameters controlling the shape of the distributions. The main goal of LDA is to infer the latent topic structures θ and φ by observing the documents.
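To make this generative story concrete, here is a minimal base-R sketch that simulates a tiny corpus from the LDA process. The corpus dimensions, hyperparameter values, and the inline rdirichlet helper are illustrative assumptions, not part of the example below:

set.seed(42)
K <- 3; V <- 8; M <- 5              # topics, vocabulary size, documents (toy values)
alpha <- 0.5; beta <- 0.1           # Dirichlet hyperparameters
rdirichlet <- function(n, a) {      # n draws from Dir(a) via normalized gamma variates
  x <- matrix(rgamma(n * length(a), shape = a), nrow = n, byrow = TRUE)
  x / rowSums(x)
}
phi   <- rdirichlet(K, rep(beta, V))   # topic-word distributions, one row per topic
theta <- rdirichlet(M, rep(alpha, K))  # document-topic distributions, one row per document
docs <- lapply(1:M, function(d) {
  z <- sample(1:K, 20, replace = TRUE, prob = theta[d, ])   # a topic for each of 20 words
  sapply(z, function(k) sample(1:V, 1, prob = phi[k, ]))    # a word from that topic's phi
})

Each element of docs is a vector of word indices drawn exactly as the three steps above describe.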
We will use the ‘topicmodels’ package in R to perform LDA. First, let’s install and load the necessary packages:
# install.packages("topicmodels")
# install.packages("tm")
library(topicmodels)
library(tm)
Loading required package: NLP
Now, let’s preprocess our text data using the ‘tm’ package. For this example, we will use the ‘AssociatedPress’ dataset, which is available within the ‘topicmodels’ package:
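A minimal sketch of this loading step, which matches the printed dimensions below; the AssociatedPress data ships with ‘topicmodels’ as a ready-made document-term matrix, and inspecting the first six documents is an assumption based on the output:

data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress    # a DocumentTermMatrix with 2246 documents and 10473 terms
inspect(dtm[1:6, ])       # look at the first six documents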
<<DocumentTermMatrix (documents: 6, terms: 10473)>>
Non-/sparse entries: 1045/61793
Sparsity : 98%
Maximal term length: 18
Weighting : term frequency (tf)
Now, we are ready to fit the LDA model using the ‘LDA’ function from the ‘topicmodels’ package. We will specify the number of topics (K) and the hyperparameters α and β:
K <- 10            # number of topics
alpha <- 50 / K    # document-topic prior (a common rule-of-thumb setting)
beta <- 0.1        # topic-word prior
# Note: in topicmodels' Gibbs control list, the topic-word prior beta is passed as 'delta'
lda_model <- LDA(dtm, k = K, method = "Gibbs",
                 control = list(alpha = alpha, delta = beta, iter = 1000, verbose = 50))
K = 10; V = 10473; M = 2246
Sampling 1000 iterations!
Iteration 50 ...
Iteration 100 ...
Iteration 150 ...
Iteration 200 ...
Iteration 250 ...
Iteration 300 ...
Iteration 350 ...
Iteration 400 ...
Iteration 450 ...
Iteration 500 ...
Iteration 550 ...
Iteration 600 ...
Iteration 650 ...
Iteration 700 ...
Iteration 750 ...
Iteration 800 ...
Iteration 850 ...
Iteration 900 ...
Iteration 950 ...
Iteration 1000 ...
Gibbs sampling completed!
Here, we use the Gibbs sampling method to estimate the LDA parameters with 1000 iterations. The ‘verbose’ option is set to 50, which means that the progress will be displayed every 50 iterations.
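For longer runs it is common to discard early samples and thin the chain; the Gibbs control list also accepts burnin, thin, and seed. A variant of the call above might look like this (the object name lda_model2 and the specific values are illustrative, not part of the original example):

lda_model2 <- LDA(dtm, k = K, method = "Gibbs",
                  control = list(alpha = alpha, delta = beta,
                                 burnin = 500, thin = 10, iter = 1000, seed = 2023))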
Once the model is fitted, we can extract the topic-word and document-topic distributions using the ‘posterior’ function:
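A minimal sketch of that step, which reproduces the summary printed below; the object name post is my own choice:

post <- posterior(lda_model)
# post$terms is the K x V topic-word matrix (10 x 10473 = 104730 entries);
# post$topics is the M x K document-topic matrix (2246 x 10 = 22460 entries)
summary(post)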
Length Class Mode
terms 104730 -none- numeric
topics 22460 -none- numeric
To visualize the results, we can display the top words for each topic:
top_words <- 10
top_terms <- terms(lda_model, top_words)  # a top_words x K matrix, one column per topic
for (k in 1:K) {
  cat("Topic", k, ":", paste(top_terms[, k], collapse = " "), "\n")
}
Topic 1 : police million report i percent i year court united government
Topic 2 : people company health bush market years house case soviet party
Topic 3 : two new years president year school committee state states people
Topic 4 : city year people dukakis prices like congress attorney military political
Topic 5 : miles workers medical campaign million family bill judge war south
Topic 6 : killed last work think new life federal charges west national
Topic 7 : officials business program going oil show billion federal east soviet
Topic 8 : fire corp study new higher new new trial foreign president
Topic 9 : air first problems state rose just trade prison american communist
Topic 10 : area billion state white stock time states office officials leader
This will output the top 10 words for each of the 10 topics in our LDA model.
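As a complementary view, each document can be labeled with its most probable topic. A small sketch, assuming the post object from the posterior() call above:

doc_topics <- topics(lda_model, 1)        # most likely topic per document
head(doc_topics)
head(apply(post$topics, 1, which.max))    # the same assignment via the posterior matrix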
In this blog post, we introduced the Latent Dirichlet Allocation (LDA) method for topic modeling and demonstrated how to fit and inspect an LDA model in R using the ‘topicmodels’ package.