# install.packages("topicmodels")
# install.packages("tm")
library(topicmodels)
library(tm)
Loading required package: NLP
Changjun Lee
May 6, 2023
Topic modeling is an unsupervised machine learning technique used to discover latent structures within a large collection of documents or text data. This technique is particularly useful for exploring, organizing, and understanding vast text corpora. One popular method for topic modeling is Latent Dirichlet Allocation (LDA), a generative probabilistic model that assumes each document is a mixture of topics and each topic is a probability distribution over words. In this blog post, we will delve into the details of LDA and demonstrate how to perform topic modeling in R using the ‘topicmodels’ package.
Latent Dirichlet Allocation (LDA):
LDA is based on the idea that documents are mixtures of topics, where each topic is a probability distribution over a fixed vocabulary. The generative process for LDA can be summarized as follows:
For each topic \(k\), sample a word distribution \(\phi_k \sim \mathrm{Dir}(\beta)\).
For each document \(d\), sample a topic distribution \(\theta_d \sim \mathrm{Dir}(\alpha)\).
For each word position \(n\) in document \(d\), sample a topic \(z_{d,n} \sim \mathrm{Multinomial}(\theta_d)\), then sample the word \(w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}})\).
Here, Dir(α) and Dir(β) denote Dirichlet distributions with parameters α and β, respectively. α and β are hyperparameters controlling the shape of the distributions. The main goal of LDA is to infer the latent topic structures θ and φ by observing the documents.
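To make this generative story concrete, here is a minimal base-R sketch that simulates a tiny corpus from the LDA process. The corpus dimensions, hyperparameter values, and the inline rdirichlet helper are illustrative assumptions, not part of the example below:

set.seed(42)
K <- 3; V <- 8; M <- 5              # topics, vocabulary size, documents (toy values)
alpha <- 0.5; beta <- 0.1           # Dirichlet hyperparameters
rdirichlet <- function(n, a) {      # n draws from Dir(a) via normalized gamma variates
  x <- matrix(rgamma(n * length(a), shape = a), nrow = n, byrow = TRUE)
  x / rowSums(x)
}
phi   <- rdirichlet(K, rep(beta, V))   # topic-word distributions, one row per topic
theta <- rdirichlet(M, rep(alpha, K))  # document-topic distributions, one row per document
docs <- lapply(1:M, function(d) {
  z <- sample(1:K, 20, replace = TRUE, prob = theta[d, ])   # a topic for each of 20 words
  sapply(z, function(k) sample(1:V, 1, prob = phi[k, ]))    # a word from that topic's phi
})

Each element of docs is a vector of word indices drawn exactly as the three steps above describe.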
We will use the ‘topicmodels’ package in R to perform LDA. First, let’s install and load the necessary packages:
# install.packages("topicmodels")
# install.packages("tm")
library(topicmodels)
library(tm)
Loading required package: NLP
Now, let’s preprocess our text data using the ‘tm’ package. For this example, we will use the ‘AssociatedPress’ dataset, which is available within the ‘topicmodels’ package:
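A minimal sketch of this loading step, which matches the printed dimensions below; the AssociatedPress data ships with ‘topicmodels’ as a ready-made document-term matrix, and inspecting the first six documents is an assumption based on the output:

data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress    # a DocumentTermMatrix with 2246 documents and 10473 terms
inspect(dtm[1:6, ])       # look at the first six documents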
<<DocumentTermMatrix (documents: 6, terms: 10473)>>
Non-/sparse entries: 1045/61793
Sparsity : 98%
Maximal term length: 18
Weighting : term frequency (tf)
Now, we are ready to fit the LDA model using the ‘LDA’ function from the ‘topicmodels’ package. We will specify the number of topics (K) and the hyperparameters α and β:
K <- 10            # number of topics
alpha <- 50 / K    # document-topic prior (a common rule-of-thumb setting)
beta <- 0.1        # topic-word prior
# Note: in topicmodels' Gibbs control list, the topic-word prior beta is passed as 'delta'
lda_model <- LDA(dtm, k = K, method = "Gibbs",
                 control = list(alpha = alpha, delta = beta, iter = 1000, verbose = 50))
K = 10; V = 10473; M = 2246
Sampling 1000 iterations!
Iteration 50 ...
Iteration 100 ...
Iteration 150 ...
Iteration 200 ...
Iteration 250 ...
Iteration 300 ...
Iteration 350 ...
Iteration 400 ...
Iteration 450 ...
Iteration 500 ...
Iteration 550 ...
Iteration 600 ...
Iteration 650 ...
Iteration 700 ...
Iteration 750 ...
Iteration 800 ...
Iteration 850 ...
Iteration 900 ...
Iteration 950 ...
Iteration 1000 ...
Gibbs sampling completed!
Here, we use the Gibbs sampling method to estimate the LDA parameters with 1000 iterations. The ‘verbose’ option is set to 50, which means that the progress will be displayed every 50 iterations.
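For longer runs it is common to discard early samples and thin the chain; the Gibbs control list also accepts burnin, thin, and seed. A variant of the call above might look like this (the object name lda_model2 and the specific values are illustrative, not part of the original example):

lda_model2 <- LDA(dtm, k = K, method = "Gibbs",
                  control = list(alpha = alpha, delta = beta,
                                 burnin = 500, thin = 10, iter = 1000, seed = 2023))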
Once the model is fitted, we can extract the topic-word and document-topic distributions using the ‘posterior’ function:
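A minimal sketch of that step, which reproduces the summary printed below; the object name post is my own choice:

post <- posterior(lda_model)
# post$terms is the K x V topic-word matrix (10 x 10473 = 104730 entries);
# post$topics is the M x K document-topic matrix (2246 x 10 = 22460 entries)
summary(post)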
Length Class Mode
terms 104730 -none- numeric
topics 22460 -none- numeric
To visualize the results, we can display the top words for each topic:
top_words <- 10
top_terms <- terms(lda_model, top_words)  # a top_words x K matrix, one column per topic
for (k in 1:K) {
  cat("Topic", k, ":", paste(top_terms[, k], collapse = " "), "\n")
}
Topic 1 : police million report i percent i year court united government
Topic 2 : people company health bush market years house case soviet party
Topic 3 : two new years president year school committee state states people
Topic 4 : city year people dukakis prices like congress attorney military political
Topic 5 : miles workers medical campaign million family bill judge war south
Topic 6 : killed last work think new life federal charges west national
Topic 7 : officials business program going oil show billion federal east soviet
Topic 8 : fire corp study new higher new new trial foreign president
Topic 9 : air first problems state rose just trade prison american communist
Topic 10 : area billion state white stock time states office officials leader
This will output the top 10 words for each of the 10 topics in our LDA model.
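As a complementary view, each document can be labeled with its most probable topic. A small sketch, assuming the post object from the posterior() call above:

doc_topics <- topics(lda_model, 1)        # most likely topic per document
head(doc_topics)
head(apply(post$topics, 1, which.max))    # the same assignment via the posterior matrix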
In this blog post, we introduced the Latent Dirichlet Allocation (LDA) method for topic modeling and demonstrated how to fit and inspect an LDA model in R using the ‘topicmodels’ package.