# install.packages("topicmodels")
# install.packages("tm")
library(topicmodels)
library(tm)
Loading required package: NLP
Changjun Lee
May 6, 2023
Topic modeling is an unsupervised machine learning technique used to discover latent structures within a large collection of documents or text data. This technique is particularly useful for exploring, organizing, and understanding vast text corpora. One popular method for topic modeling is Latent Dirichlet Allocation (LDA), a generative probabilistic model that treats each document as a mixture of topics and each topic as a distribution over words. In this blog post, we will delve into the details of LDA and demonstrate how to perform topic modeling in R using the ‘topicmodels’ package.
Latent Dirichlet Allocation (LDA):
LDA is based on the idea that documents are mixtures of topics, where each topic is a probability distribution over a fixed vocabulary. The generative process for LDA can be summarized as follows:
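For each topic k, sample a word distribution $\phi_k \sim \mathrm{Dir}(\beta)$.
For each document d, sample a topic distribution $\theta_d \sim \mathrm{Dir}(\alpha)$.
For each word w in document d, sample a topic $z \sim \mathrm{Multinomial}(\theta_d)$, then sample the word itself as $w \sim \mathrm{Multinomial}(\phi_z)$.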
Here, Dir(α) and Dir(β) denote Dirichlet distributions with parameters α and β, respectively. α and β are hyperparameters controlling the shape of the distributions. The main goal of LDA is to infer the latent topic structures θ and φ by observing the documents.
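To make the generative story concrete, here is a minimal base R sketch that simulates it on a toy scale. The sizes, the hyperparameter values, and the helper `rdirichlet_sym` are all assumptions made for illustration (base R has no Dirichlet sampler, so the helper normalizes Gamma draws); it is not part of the ‘topicmodels’ workflow.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Hypothetical helper: n draws from a symmetric Dirichlet of dimension `dim`,
# obtained by normalizing independent Gamma(conc, 1) variates.
rdirichlet_sym <- function(n, dim, conc) {
  g <- matrix(rgamma(n * dim, shape = conc, rate = 1), nrow = n)
  g / rowSums(g)
}

# Toy sizes (assumed): topics, vocabulary size, documents, words per document
K <- 3; V <- 8; M <- 4; N <- 20
alpha <- 0.5  # prior on document-topic distributions
beta  <- 0.1  # prior on topic-word distributions

phi   <- rdirichlet_sym(K, V, beta)   # K x V: one word distribution per topic
theta <- rdirichlet_sym(M, K, alpha)  # M x K: one topic distribution per document

# For every word slot in every document: draw a topic, then a word id from it
docs <- lapply(seq_len(M), function(d) {
  z <- sample.int(K, N, replace = TRUE, prob = theta[d, ])
  vapply(z, function(k) sample.int(V, 1, prob = phi[k, ]), integer(1))
})

str(docs)  # a list of M integer vectors of word ids
```

LDA inference runs this story in reverse: given only the word ids in `docs`, it estimates `theta` and `phi`.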
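We will use the ‘topicmodels’ package in R to perform LDA, together with ‘tm’, which supplies the document-term matrix class. Both are installed and loaded by the commands at the top of this post; loading ‘tm’ also attaches its dependency ‘NLP’.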
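Now, let’s prepare our text data. For this example, we will use the ‘AssociatedPress’ dataset, which is available within the ‘topicmodels’ package. It ships as a ready-made document-term matrix (a class defined by the ‘tm’ package), so no further preprocessing is needed. The summary below covers a six-document slice of it; the model in the next step is fitted to all 2246 documents: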
<<DocumentTermMatrix (documents: 6, terms: 10473)>>
Non-/sparse entries: 1045/61793
Sparsity : 98%
Maximal term length: 18
Weighting : term frequency (tf)
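The code that produced this summary is not shown above, so the following is only a plausible sketch of the step. The object name `dtm` is an assumption chosen to match the `LDA()` call later in the post, and taking the first six rows is a guess at which slice was printed.

```r
library(topicmodels)  # provides the AssociatedPress data set
library(tm)           # provides the DocumentTermMatrix class and its methods

data("AssociatedPress", package = "topicmodels")

# Assumed name: the later LDA() call expects an object called `dtm`
dtm <- AssociatedPress

dim(dtm)     # number of documents, number of terms
dtm[1:6, ]   # summary of a six-document slice (assumed to be the first six)
```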
Now, we are ready to fit the LDA model using the ‘LDA’ function from the ‘topicmodels’ package. We will specify the number of topics (K) and the hyperparameters α and β:
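K <- 10 # Number of topics
alpha <- 50/K # Document-topic Dirichlet prior (50/K is the package default)
beta <- 0.1 # Topic-word Dirichlet prior; the Gibbs control calls it `delta`
lda_model <- LDA(dtm, k = K, method = "Gibbs", control = list(alpha = alpha, delta = beta, iter = 1000, verbose = 50))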
K = 10; V = 10473; M = 2246
Sampling 1000 iterations!
Iteration 50 ...
Iteration 100 ...
Iteration 150 ...
Iteration 200 ...
Iteration 250 ...
Iteration 300 ...
Iteration 350 ...
Iteration 400 ...
Iteration 450 ...
Iteration 500 ...
Iteration 550 ...
Iteration 600 ...
Iteration 650 ...
Iteration 700 ...
Iteration 750 ...
Iteration 800 ...
Iteration 850 ...
Iteration 900 ...
Iteration 950 ...
Iteration 1000 ...
Gibbs sampling completed!
Here, we use the Gibbs sampling method to estimate the LDA parameters with 1000 iterations. The ‘verbose’ option is set to 50, which means that the progress will be displayed every 50 iterations.
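Gibbs sampling is stochastic, so two runs will generally give different topics unless the seed is fixed. As a hedged sketch, the Gibbs control list also accepts `seed` and `burnin`. The 100-document slice, the 5 topics, and the specific values below are assumptions chosen only to keep the example fast.

```r
library(topicmodels)
library(tm)

data("AssociatedPress", package = "topicmodels")
small_dtm <- AssociatedPress[1:100, ]  # small slice so the sketch runs quickly

ctrl <- list(
  alpha  = 50 / 5,  # same 50/K heuristic, here with K = 5
  delta  = 0.1,     # topic-word prior (beta in the text)
  seed   = 123,     # fixes the sampler's random seed
  burnin = 200,     # iterations discarded before the sampling iterations begin
  iter   = 500
)

fit_a <- LDA(small_dtm, k = 5, method = "Gibbs", control = ctrl)
fit_b <- LDA(small_dtm, k = 5, method = "Gibbs", control = ctrl)

# With the same seed and settings, both runs should report the same top terms
identical(terms(fit_a, 5), terms(fit_b, 5))
```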
Once the model is fitted, we can extract the topic-word and document-topic distributions using the ‘posterior’ function:
Length Class Mode
terms 104730 -none- numeric
topics 22460 -none- numeric
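The two lengths in this summary are just the matrix sizes: 10 × 10473 for `terms` and 2246 × 10 for `topics`. The sketch below shows how the two matrices might be inspected. It refits a small model on an assumed 100-document slice with 5 topics so that it runs on its own; with the post's `lda_model` the same calls apply.

```r
library(topicmodels)
library(tm)

data("AssociatedPress", package = "topicmodels")
small_dtm <- AssociatedPress[1:100, ]  # assumed slice, for speed only

fit <- LDA(small_dtm, k = 5, method = "Gibbs",
           control = list(seed = 1, iter = 200))

post <- posterior(fit)

dim(post$terms)   # topics x vocabulary: each row is a topic's word distribution
dim(post$topics)  # documents x topics: each row is a document's topic mixture

# Every row is a probability distribution, so each should sum to 1
range(rowSums(post$terms))
range(rowSums(post$topics))
```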
To visualize the results, we can display the top words for each topic:
top_words <- 10
# terms() returns a top_words x K matrix: one column per topic, rows ordered by rank
top_terms <- terms(lda_model, top_words)
for (k in seq_len(K)) {
  cat("Topic", k, ":", paste(top_terms[, k], collapse = " "), "\n")
}
Topic 1 : soviet united states government party union east west president foreign
Topic 2 : court case attorney law judge charges office federal trial prison
Topic 3 : million company billion new year workers last money business corp
Topic 4 : bush president house dukakis campaign committee congress bill reagan senate
Topic 5 : school new years first show students york year john news
Topic 6 : i people just dont get think time like going back
Topic 7 : program report state children health new years year public department
Topic 8 : air two miles officials area fire people city service spokesman
Topic 9 : police people military government south killed two army troops war
Topic 10 : percent market year prices oil million new higher rose stock
This will output the top 10 words for each of the 10 topics in our LDA model.
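Besides the top words per topic, it is often useful to see which topic dominates each document. A minimal sketch using the package's `topics()` function follows. As before, the 100-document slice, the 5 topics, and the iteration count are assumptions made so the example runs quickly on its own.

```r
library(topicmodels)
library(tm)

data("AssociatedPress", package = "topicmodels")
small_dtm <- AssociatedPress[1:100, ]  # assumed slice, for speed only

fit <- LDA(small_dtm, k = 5, method = "Gibbs",
           control = list(seed = 1, iter = 200))

# Most likely topic for each document (one integer per document)
doc_topic <- topics(fit)
head(doc_topic)

# How many documents are assigned to each topic
table(doc_topic)
```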
In this blog post, we introduced the Latent Dirichlet Allocation (LDA) method for topic modeling and demonstrated