Review & Wrap-up

Class review

Weekly content


Structural Equation Modeling: Foundations and Applications


Introduction

Structural Equation Modeling (SEM) is a comprehensive statistical methodology that combines factor analysis and multiple regression techniques to examine complex relationships among observed and latent variables. Over the past few decades, SEM has become a powerful tool for researchers across disciplines such as social sciences, psychology, education, business, and health sciences. This chapter introduces the theoretical underpinnings of SEM, its key components, and practical applications, with an emphasis on latent variables and path analysis.


Key Concepts in SEM

1. Observed and Latent Variables

  • Observed Variables: These are directly measured variables, often referred to as indicators. For example, responses to survey items like “I enjoy learning” and “I find studying rewarding” can serve as observed variables for the latent construct “Motivation.”

  • Latent Variables: These are unobserved constructs that are inferred from observed variables. SEM excels in estimating latent variables while accounting for measurement error.

2. Measurement Error

Measurement error is the discrepancy between the observed value and the true value of a variable. SEM explicitly incorporates measurement error into its models, which sets it apart from traditional regression approaches.
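In classical test theory notation, an observed score can be written as a true score plus an error term:

\[ X = T + \epsilon \]

SEM models the error component \(\epsilon\) explicitly for each indicator, rather than assuming the variables are measured without error.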

3. Path Diagrams

Path diagrams visually represent SEM models. Key components include:

  • Circles: Represent latent variables.

  • Squares: Represent observed variables.

  • Arrows: Depict causal relationships or associations.


Components of SEM

1. Measurement Model

The measurement model specifies the relationships between latent variables and their observed indicators. It answers the question: How well do the observed variables measure the underlying construct?

Example: Motivation as a latent variable might be measured by three observed items:

\[ Motivation \sim Item_1 + Item_2 + Item_3 \]


2. Structural Model

The structural model specifies the relationships between latent variables (or between latent and observed variables). It answers the question: What are the causal relationships between constructs?

Example: Testing whether motivation mediates the relationship between hours studied and test performance:

\[ Test\_Score \sim Motivation + Hours\_Studied \]

\[ Motivation \sim Hours\_Studied \]
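As a sketch, the same mediation model can be written in lavaan syntax; the data frame and variable names (Hours_Studied, Test_Score, and three Motivation items) are illustrative placeholders.

# Hypothetical lavaan specification of the mediation example above
library(lavaan)
mediation_model <- '
  # Measurement model: Motivation as a latent variable
  Motivation =~ Item_1 + Item_2 + Item_3

  # Structural model: Hours_Studied -> Motivation -> Test_Score
  Motivation ~ a * Hours_Studied
  Test_Score ~ b * Motivation + c * Hours_Studied

  # Defined parameter: the indirect (mediated) effect
  indirect := a * b
'
# fit <- sem(mediation_model, data = study_data)  # study_data is a placeholder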


Steps in Conducting SEM

Step 1: Define the Model

Develop a theoretical model based on prior research or hypotheses. Clearly specify the relationships among variables in a path diagram.

Step 2: Specify the Model

Translate the path diagram into a set of equations. This involves defining the measurement and structural models.

Step 3: Collect Data

Ensure the dataset includes sufficient sample size and all observed indicators for latent variables.

Step 4: Estimate the Model

Use software such as R (lavaan package), AMOS, Mplus, or LISREL to estimate the parameters.

Step 5: Evaluate Model Fit

Examine fit indices and modify the model, if necessary, based on theoretical justifications.

Step 6: Interpret the Results

Interpret path coefficients, factor loadings, and fit indices in the context of your research question.
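Steps 4 through 6 map onto a few lavaan calls. A minimal sketch, where model is a lavaan model string and my_data is a placeholder data frame:

# Minimal sketch of Steps 4-6 with lavaan (model syntax and data are placeholders)
library(lavaan)
fit <- sem(model, data = my_data)                       # Step 4: estimate the parameters
summary(fit, fit.measures = TRUE, standardized = TRUE)  # Steps 5-6: fit indices and estimates
fitMeasures(fit, c("cfi", "tli", "rmsea", "srmr"))      # Step 5: key fit indices only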


Cronbach’s Alpha: A Measure of Reliability

Introduction

Cronbach’s Alpha (α) is a widely used statistic in research to assess the internal consistency reliability of a set of items or indicators. It is particularly useful for measuring the reliability of scales, tests, or questionnaires in social sciences, psychology, education, and other fields. Reliability refers to the extent to which a scale produces consistent results across repeated measurements or multiple items measuring the same construct.

Conceptual Foundation

Internal Consistency

Internal consistency evaluates how well the items in a scale measure the same underlying construct. If items are highly correlated, it suggests they are measuring the same concept, which contributes to higher reliability.

For example:

  • A scale measuring motivation might include items like:

    • “I enjoy learning new things.”

    • “I am motivated to achieve my goals.”

    • “I find studying rewarding.”

If these items are highly interrelated, the scale has good internal consistency.

Definition of Cronbach’s Alpha

Cronbach’s Alpha quantifies internal consistency as a value between 0 and 1. Higher values indicate better reliability. The formula for α is:

\[ \alpha = \frac{k}{k - 1} \left(1 - \frac{\sum s_i^2}{s_t^2}\right) \]

Where:

  • k: Number of items in the scale.

    • The factor k/(k − 1) corrects for the tendency of the uncorrected ratio to underestimate reliability as the number of items grows.

  • \(s_i^2\): Variance of each individual item.

    • The sum of the individual item variances reflects how much the items vary independently of one another.

  • \(s_t^2\): Variance of the total scale score.

    • The variance of the total score reflects the extent to which the items vary together.

    • The ratio \(\sum s_i^2 / s_t^2\) compares how much the items vary individually with how much the total score varies.

      • The larger the ratio, the more independently the items behave, and the lower the internal consistency.

      • The smaller the ratio, the more the items move together, and the higher the internal consistency.

      • Subtracting this ratio from 1 therefore measures the degree to which the items contribute in common.
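A minimal sketch of this formula in R, for any data frame of numeric item scores (the function name is illustrative):

# Compute Cronbach's alpha directly from the formula above
cronbach_alpha <- function(items) {
  k   <- ncol(items)              # number of items
  s_i <- apply(items, 2, var)     # variance of each individual item
  s_t <- var(rowSums(items))      # variance of the total scale score
  (k / (k - 1)) * (1 - sum(s_i) / s_t)
}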


Interpreting Cronbach’s Alpha

  • α > 0.90: Excellent reliability (possibly redundant items).

  • 0.80 ≤ α < 0.90: Good reliability.

  • 0.70 ≤ α < 0.80: Acceptable reliability.

  • 0.60 ≤ α < 0.70: Questionable reliability.

  • α < 0.60: Poor reliability (likely indicates issues with item quality or scale design).


Assumptions of Cronbach’s Alpha

  1. Unidimensionality: The scale should measure a single construct.

  2. Equal Variance Contribution: Each item should contribute equally to the total score (violations can lead to under- or overestimation of reliability).

  3. Correlation Between Items: Items should be positively correlated.


Practical Example: Calculating Cronbach’s Alpha in R

Simulating a Dataset

# Simulate data for a 5-item motivation scale
set.seed(123)
n <- 100  # Number of respondents
motivation_data <- data.frame(
  Q1 = rnorm(n, mean = 4, sd = 0.8),  # Item 1
  Q2 = rnorm(n, mean = 4.2, sd = 0.7),  # Item 2
  Q3 = rnorm(n, mean = 3.9, sd = 0.9),  # Item 3
  Q4 = rnorm(n, mean = 4.1, sd = 0.6),  # Item 4
  Q5 = rnorm(n, mean = 4.0, sd = 0.8)   # Item 5
)
head(motivation_data, 10)
         Q1       Q2       Q3       Q4       Q5
1  3.551619 3.702715 5.878929 3.670855 3.941155
2  3.815858 4.379819 5.081172 3.648387 3.065079
3  5.246967 4.027316 3.661369 3.536877 3.492201
4  4.056407 3.956720 4.388875 3.468492 3.976927
5  4.103430 3.533867 3.527094 3.837704 4.536557
6  5.372052 4.168481 3.471378 4.298708 2.679563
7  4.368733 3.650567 3.190257 2.891474 3.720197
8  2.987951 3.032441 3.364844 4.227188 4.605125
9  3.450518 3.933841 5.385817 4.842005 3.568953
10 3.643470 4.843298 3.851375 5.322544 4.181834
# Install and load the psych package
# install.packages("psych")
library(psych)
Warning: package 'psych' was built under R version 4.4.2
# Calculate Cronbach's Alpha
alpha_result <- psych::alpha(motivation_data)
Number of categories should be increased  in order to count frequencies. 
Warning in psych::alpha(motivation_data): Some items were negatively correlated with the first principal component and probably 
should be reversed.  
To do this, run the function again with the 'check.keys=TRUE' option
Some items ( Q1 Q2 Q4 ) were negatively correlated with the first principal component and 
probably should be reversed.  
To do this, run the function again with the 'check.keys=TRUE' option
Warning in sqrt(Vtc): NaNs produced
print(alpha_result)

Reliability analysis   
Call: psych::alpha(x = motivation_data)

  raw_alpha std.alpha G6(smc) average_r   S/N  ase mean   sd median_r
     -0.38     -0.36   -0.24    -0.056 -0.27 0.22  4.1 0.29   -0.044

    95% confidence boundaries 
         lower alpha upper
Feldt    -0.86 -0.38  0.00
Duhachek -0.81 -0.38  0.04

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r    S/N alpha se  var.r  med.r
Q1     -0.11     -0.10  -0.066    -0.024 -0.095     0.18 0.0039 -0.022
Q2     -0.40     -0.39  -0.254    -0.076 -0.282     0.23 0.0049 -0.044
Q3     -0.37     -0.33  -0.203    -0.065 -0.246     0.22 0.0070 -0.047
Q4     -0.43     -0.44  -0.276    -0.083 -0.306     0.23 0.0068 -0.089
Q5     -0.16     -0.14  -0.092    -0.032 -0.125     0.19 0.0039 -0.044

 Item statistics 
     n raw.r std.r r.cor r.drop mean   sd
Q1 100  0.28  0.30   NaN -0.227  4.1 0.73
Q2 100  0.41  0.45   NaN -0.065  4.1 0.68
Q3 100  0.51  0.42   NaN -0.097  4.0 0.85
Q4 100  0.39  0.48   NaN -0.042  4.1 0.62
Q5 100  0.37  0.32   NaN -0.191  4.1 0.79

Output Interpretation

The output includes:

  • Raw Alpha: The Cronbach’s Alpha for the current scale.

    • raw_alpha (-0.38)

      • A negative alpha indicates essentially no internal consistency, or that the items may be negatively related to one another.

      • This happens when the items do not reflect the same underlying construct, or measure entirely different things; here it is expected, because the five items were simulated as independent random draws.

  • Standardized Alpha: Adjusted for standardized item variances.

  • Item-Total Correlations: Correlation of each item with the total scale score.

  • Alpha If Deleted: The α value if a specific item is removed.


Improving Reliability

  1. Increase the Number of Items: Add more items that measure the same construct.

  2. Refine Items: Ensure items are clear, relevant, and unambiguous.

  3. Remove Low-Quality Items: Exclude items that weaken the overall reliability (α).
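Two follow-ups can be sketched with the psych package, building on the earlier motivation_data example (dropping Q1 is purely illustrative):

# Reverse negatively keyed items, as the earlier warning suggested
alpha_rev <- psych::alpha(motivation_data, check.keys = TRUE)

# Recompute alpha after dropping a weak item, guided by the "Reliability if an item is dropped" table
alpha_no_q1 <- psych::alpha(motivation_data[, -1])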


Confirmatory Factor Analysis (CFA)

Introduction

Confirmatory Factor Analysis (CFA) is a statistical technique used to test whether a hypothesized measurement model fits the observed data. Unlike Exploratory Factor Analysis (EFA), which identifies underlying factor structures without prior assumptions, CFA is hypothesis-driven and requires the researcher to specify the relationships between observed variables and latent factors before analysis.

CFA is widely used in psychology, education, and social sciences to validate measurement scales, confirm theoretical constructs, and ensure the reliability of instruments.


Key Concepts in CFA

1. Latent Variables

Latent variables are unobserved constructs that are inferred from multiple observed variables. For example:

  • Motivation might be a latent variable inferred from survey items like “I enjoy studying” and “I set academic goals.”

2. Observed Variables (Indicators)

These are measurable variables that serve as proxies for the latent variable. In CFA, each observed variable is expected to load on a specific latent factor.

3. Factor Loadings

Factor loadings quantify the relationship between observed variables and their underlying latent factor. Higher loadings indicate stronger relationships.

4. Measurement Errors

CFA explicitly models measurement error for each observed variable, improving the accuracy of parameter estimates compared to traditional methods.


The CFA Model

The CFA model is typically represented as:

\[ X = \Lambda \xi + \delta \]

Where:

  • X: Vector of observed variables.

  • Λ: Matrix of factor loadings.

  • ξ: Vector of latent variables.

  • δ: Vector of measurement errors.
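For a single latent factor measured by three indicators, the matrix equation unpacks to:

\[ \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} = \begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \lambda_3 \end{pmatrix} \xi + \begin{pmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \end{pmatrix} \]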


Model Fit Indices in CFA

Several indices are used to assess the goodness of fit of a CFA model:

  1. Chi-Square:

    • Tests the null hypothesis that the model fits the data perfectly.

    • Smaller, non-significant values indicate a good fit.

    • Sensitive to sample size.

  2. CFI (Comparative Fit Index):

    • Compares the fit of the hypothesized model to a null model.

    • Values > 0.90 indicate good fit; > 0.95 indicates excellent fit.

  3. TLI (Tucker-Lewis Index):

    • Adjusts for model complexity.

    • Values > 0.90 indicate good fit.

  4. RMSEA (Root Mean Square Error of Approximation):

    • Measures the discrepancy between the model and the data per degree of freedom.

    • Values < 0.08 indicate acceptable fit; < 0.05 indicates excellent fit.

  5. SRMR (Standardized Root Mean Square Residual):

    • Measures the average discrepancy between observed and predicted correlations.

    • Values < 0.08 indicate a good fit.
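These cutoffs can also be checked programmatically. A small sketch, assuming fit is a fitted lavaan object (the helper name check_fit is illustrative):

# Pull the indices discussed above and compare them to the common cutoffs
check_fit <- function(fit) {
  m <- lavaan::fitMeasures(fit, c("cfi", "tli", "rmsea", "srmr"))
  data.frame(index = names(m),
             value = round(as.numeric(m), 3),
             acceptable = c(m["cfi"] > 0.90, m["tli"] > 0.90,
                            m["rmsea"] < 0.08, m["srmr"] < 0.08))
}
# check_fit(fit)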


Example of CFA in R

Here, we validate a hypothetical 3-factor model with 9 observed variables.

Simulating the Data

# Simulate data for a 3-factor model
set.seed(123)
n <- 300

# Latent variables
Factor1 <- rnorm(n, mean = 5, sd = 1)
Factor2 <- rnorm(n, mean = 3, sd = 1)
Factor3 <- rnorm(n, mean = 4, sd = 1)

# Observed variables
Item1 <- 0.8 * Factor1 + rnorm(n, sd = 0.5)
Item2 <- 0.7 * Factor1 + rnorm(n, sd = 0.5)
Item3 <- 0.9 * Factor1 + rnorm(n, sd = 0.5)

Item4 <- 0.8 * Factor2 + rnorm(n, sd = 0.5)
Item5 <- 0.7 * Factor2 + rnorm(n, sd = 0.5)
Item6 <- 0.9 * Factor2 + rnorm(n, sd = 0.5)

Item7 <- 0.8 * Factor3 + rnorm(n, sd = 0.5)
Item8 <- 0.7 * Factor3 + rnorm(n, sd = 0.5)
Item9 <- 0.9 * Factor3 + rnorm(n, sd = 0.5)

data <- data.frame(Item1, Item2, Item3, Item4, Item5, Item6, Item7, Item8, Item9)

head(data, 10)
      Item1    Item2    Item3     Item4     Item5       Item6    Item7    Item8
1  3.044562 3.417592 3.585079 1.1831242 1.4992570  1.81471450 3.056150 3.476655
2  3.420201 2.960121 4.139212 1.4705645 1.7670278  1.75690647 3.075209 2.616979
3  5.396763 5.016858 5.451788 1.6205070 1.8399821  1.56147279 2.348807 2.052586
4  4.875933 3.175391 4.876992 2.1863633 1.2929837  1.54688911 1.979492 1.390110
5  4.645739 3.905621 5.176536 2.8439994 2.0218909  2.66114923 3.384857 4.652515
6  5.059768 5.248876 7.107165 2.8246841 1.7590390  3.12625963 3.054573 2.633779
7  4.781694 3.328420 5.097882 0.9794274 0.5652273 -0.04096974 2.905108 2.797026
8  2.963667 3.168455 2.924054 2.4477619 2.0382372  1.96048858 1.818510 1.719319
9  3.601175 2.774437 4.394070 3.4133666 3.0635048  3.80159049 2.448991 2.895078
10 3.773651 3.335213 4.551284 3.3277863 3.7049616  4.60878599 2.214880 2.179294
      Item9
1  4.414632
2  4.667474
3  4.004848
4  2.121336
5  3.359624
6  3.267019
7  3.504126
8  2.515587
9  3.466267
10 3.358392

Defining and Running the CFA Model

# Load lavaan package
library(lavaan)
This is lavaan 0.6-19
lavaan is FREE software! Please report any bugs.

Attaching package: 'lavaan'
The following object is masked from 'package:psych':

    cor2cov
# Define the CFA model
model <- '
  Factor1 =~ Item1 + Item2 + Item3
  Factor2 =~ Item4 + Item5 + Item6
  Factor3 =~ Item7 + Item8 + Item9
'

# Fit the model
fit <- cfa(model, data = data)

# Summary with standardized estimates
summary(fit, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-19 ended normally after 29 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        21

  Number of observations                           300

Model Test User Model:
                                                      
  Test statistic                                16.404
  Degrees of freedom                                24
  P-value (Chi-square)                           0.873

Model Test Baseline Model:

  Test statistic                              1438.055
  Degrees of freedom                                36
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000
  Tucker-Lewis Index (TLI)                       1.008

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -2871.907
  Loglikelihood unrestricted model (H1)      -2863.705
                                                      
  Akaike (AIC)                                5785.814
  Bayesian (BIC)                              5863.593
  Sample-size adjusted Bayesian (SABIC)       5796.994

Root Mean Square Error of Approximation:

  RMSEA                                          0.000
  90 Percent confidence interval - lower         0.000
  90 Percent confidence interval - upper         0.024
  P-value H_0: RMSEA <= 0.050                    0.998
  P-value H_0: RMSEA >= 0.080                    0.000

Standardized Root Mean Square Residual:

  SRMR                                           0.021

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Factor1 =~                                                            
    Item1             1.000                               0.805    0.877
    Item2             0.741    0.051   14.600    0.000    0.596    0.772
    Item3             0.985    0.063   15.559    0.000    0.793    0.834
  Factor2 =~                                                            
    Item4             1.000                               0.768    0.835
    Item5             0.862    0.055   15.806    0.000    0.662    0.807
    Item6             1.203    0.071   17.040    0.000    0.924    0.898
  Factor3 =~                                                            
    Item7             1.000                               0.786    0.845
    Item8             0.823    0.055   15.072    0.000    0.647    0.772
    Item9             1.233    0.073   16.903    0.000    0.969    0.903

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Factor1 ~~                                                            
    Factor2          -0.020    0.040   -0.503    0.615   -0.033   -0.033
    Factor3          -0.009    0.041   -0.228    0.820   -0.015   -0.015
  Factor2 ~~                                                            
    Factor3          -0.016    0.039   -0.412    0.680   -0.027   -0.027

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .Item1             0.194    0.034    5.772    0.000    0.194    0.231
   .Item2             0.241    0.026    9.429    0.000    0.241    0.404
   .Item3             0.275    0.037    7.528    0.000    0.275    0.305
   .Item4             0.256    0.031    8.195    0.000    0.256    0.302
   .Item5             0.234    0.026    9.103    0.000    0.234    0.349
   .Item6             0.204    0.037    5.457    0.000    0.204    0.193
   .Item7             0.247    0.033    7.590    0.000    0.247    0.286
   .Item8             0.284    0.029    9.802    0.000    0.284    0.404
   .Item9             0.212    0.043    4.987    0.000    0.212    0.184
    Factor1           0.647    0.073    8.846    0.000    1.000    1.000
    Factor2           0.590    0.070    8.453    0.000    1.000    1.000
    Factor3           0.618    0.072    8.538    0.000    1.000    1.000

Output

(1) Chi-square test

  • Test statistic = 16.404, Degrees of freedom = 24, P-value = 0.873

    • The high p-value (0.873) indicates that the null hypothesis (H0) is not rejected.

    • Null hypothesis: "the model fits the data well." Since p > 0.05, the data can be interpreted as fitting the model well.

(2) Comparative Fit Index (CFI) and Tucker-Lewis Index (TLI)

  • CFI = 1.000, TLI = 1.008

    • CFI and TLI values of 0.95 or higher indicate very good fit.

    • Here both indices are very high (at or above 1), indicating excellent model fit.

(3) Root Mean Square Error of Approximation (RMSEA)

  • RMSEA = 0.000, 90% CI = [0.000, 0.024], P-value = 0.998

    • An RMSEA of 0.05 or below, with the upper bound of the 90% confidence interval also at or below 0.05, indicates excellent fit.

    • P-value for H0: RMSEA ≤ 0.050 = 0.998 → the hypothesis that RMSEA is at most 0.05 is retained.

    • RMSEA close to 0 indicates a very good fit.

(4) Standardized Root Mean Square Residual (SRMR)

  • SRMR = 0.021

    • An SRMR of 0.08 or below indicates a good fit.

(5) Factor loadings

  • How strongly each item is associated with its factor:

    • Factor1:

      • Item1: 0.805 (standardized loading 0.877)

      • Item2: 0.596 (standardized loading 0.772)

      • Item3: 0.793 (standardized loading 0.834)

    • Factor2:

      • Item4: 0.768 (standardized loading 0.835)

      • Item5: 0.662 (standardized loading 0.807)

      • Item6: 0.924 (standardized loading 0.898)

    • Factor3:

      • Item7: 0.786 (standardized loading 0.845)

      • Item8: 0.647 (standardized loading 0.772)

      • Item9: 0.969 (standardized loading 0.903)

  • Standardized loadings of 0.7 or higher indicate that an item contributes strongly to its factor.

(6) Covariances between factors

  • Factor1 ~~ Factor2 = -0.033 (p = 0.615)

  • Factor1 ~~ Factor3 = -0.015 (p = 0.820)

  • Factor2 ~~ Factor3 = -0.027 (p = 0.680)

    • The covariances between factors are very small and not statistically significant.

    • This indicates that the factors are independent of one another (they measure different constructs).

(7) Residual variances of the items

  • The portion of each item's variance that remains unexplained (the residual):

    • Item1: residual variance = 0.194 (proportion of unexplained variance = 23.1%)

    • Item6: residual variance = 0.204 (proportion of unexplained variance = 19.3%)

    • For most items the unexplained proportion is small (well under 50%), indicating that the factors explain these items well.


SEM Practice

# Install lavaan if not already installed
# install.packages("lavaan")

# Load lavaan
library(lavaan)
# Load required package for data simulation
library(MASS)

# Set random seed for reproducibility
set.seed(123)

# Number of observations
n <- 300

# Define factor structure (3 latent factors, uncorrelated)
latent_factors <- mvrnorm(
  n = n,
  mu = c(0, 0, 0),       # Mean of latent factors
  Sigma = diag(3)        # Identity matrix implies no correlation between factors
)

# Factor loadings for each latent factor
loadings <- list(
  Factor1 = c(0.8, 0.7, 0.9),  # Loadings for Item1, Item2, Item3
  Factor2 = c(0.9, 0.8, 0.85), # Loadings for Item4, Item5, Item6
  Factor3 = c(0.85, 0.75, 0.9) # Loadings for Item7, Item8, Item9
)

# Residual variances (to ensure observed variables are not perfectly predicted by latent factors)
residuals <- list(
  Factor1 = c(0.4, 0.5, 0.3),
  Factor2 = c(0.3, 0.4, 0.35),
  Factor3 = c(0.35, 0.5, 0.3)
)

# Generate observed variables for each factor
observed_data <- data.frame(
  Item1 = latent_factors[, 1] * loadings$Factor1[1] + rnorm(n, 0, residuals$Factor1[1]),
  Item2 = latent_factors[, 1] * loadings$Factor1[2] + rnorm(n, 0, residuals$Factor1[2]),
  Item3 = latent_factors[, 1] * loadings$Factor1[3] + rnorm(n, 0, residuals$Factor1[3]),
  Item4 = latent_factors[, 2] * loadings$Factor2[1] + rnorm(n, 0, residuals$Factor2[1]),
  Item5 = latent_factors[, 2] * loadings$Factor2[2] + rnorm(n, 0, residuals$Factor2[2]),
  Item6 = latent_factors[, 2] * loadings$Factor2[3] + rnorm(n, 0, residuals$Factor2[3]),
  Item7 = latent_factors[, 3] * loadings$Factor3[1] + rnorm(n, 0, residuals$Factor3[1]),
  Item8 = latent_factors[, 3] * loadings$Factor3[2] + rnorm(n, 0, residuals$Factor3[2]),
  Item9 = latent_factors[, 3] * loadings$Factor3[3] + rnorm(n, 0, residuals$Factor3[3])
)

# Check summary
summary(observed_data)
     Item1              Item2              Item3             Item4         
 Min.   :-2.25631   Min.   :-2.32729   Min.   :-2.5787   Min.   :-2.67780  
 1st Qu.:-0.58956   1st Qu.:-0.60444   1st Qu.:-0.6469   1st Qu.:-0.59562  
 Median : 0.08151   Median : 0.04872   Median : 0.1015   Median : 0.08360  
 Mean   : 0.02416   Mean   : 0.01666   Mean   : 0.0305   Mean   : 0.01016  
 3rd Qu.: 0.65430   3rd Qu.: 0.65044   3rd Qu.: 0.7065   3rd Qu.: 0.68210  
 Max.   : 2.60049   Max.   : 2.56593   Max.   : 2.6296   Max.   : 2.52406  
     Item5              Item6               Item7               Item8          
 Min.   :-2.26428   Min.   :-2.750106   Min.   :-2.716450   Min.   :-2.252780  
 1st Qu.:-0.56261   1st Qu.:-0.567966   1st Qu.:-0.550449   1st Qu.:-0.590119  
 Median :-0.02097   Median :-0.021206   Median : 0.007724   Median :-0.095031  
 Mean   :-0.01454   Mean   : 0.002263   Mean   : 0.044208   Mean   :-0.002405  
 3rd Qu.: 0.58973   3rd Qu.: 0.616462   3rd Qu.: 0.677705   3rd Qu.: 0.578858  
 Max.   : 2.49658   Max.   : 2.368748   Max.   : 3.176766   Max.   : 2.512839  
     Item9         
 Min.   :-2.24574  
 1st Qu.:-0.58138  
 Median :-0.01305  
 Mean   : 0.04343  
 3rd Qu.: 0.62350  
 Max.   : 3.18565  
  • CFA first
# Load lavaan package for SEM
library(lavaan)

# Define the CFA model
cfa_model <- '
  # Latent variables
  Factor1 =~ Item1 + Item2 + Item3
  Factor2 =~ Item4 + Item5 + Item6
  Factor3 =~ Item7 + Item8 + Item9

  # Covariances (optional; default assumes correlated factors)
  Factor1 ~~ Factor2
  Factor1 ~~ Factor3
  Factor2 ~~ Factor3
'

# Fit the CFA model
cfa_fit <- cfa(model = cfa_model, data = observed_data)

# Summarize results
summary(cfa_fit, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-19 ended normally after 34 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        21

  Number of observations                           300

Model Test User Model:
                                                      
  Test statistic                                22.719
  Degrees of freedom                                24
  P-value (Chi-square)                           0.536

Model Test Baseline Model:

  Test statistic                              2339.834
  Degrees of freedom                                36
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000
  Tucker-Lewis Index (TLI)                       1.001

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -2412.941
  Loglikelihood unrestricted model (H1)      -2401.581
                                                      
  Akaike (AIC)                                4867.881
  Bayesian (BIC)                              4945.661
  Sample-size adjusted Bayesian (SABIC)       4879.061

Root Mean Square Error of Approximation:

  RMSEA                                          0.000
  90 Percent confidence interval - lower         0.000
  90 Percent confidence interval - upper         0.044
  P-value H_0: RMSEA <= 0.050                    0.978
  P-value H_0: RMSEA >= 0.080                    0.000

Standardized Root Mean Square Residual:

  SRMR                                           0.024

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Factor1 =~                                                            
    Item1             1.000                               0.862    0.922
    Item2             0.855    0.041   20.832    0.000    0.736    0.835
    Item3             1.072    0.041   26.131    0.000    0.924    0.941
  Factor2 =~                                                            
    Item4             1.000                               0.875    0.944
    Item5             0.881    0.033   26.657    0.000    0.771    0.895
    Item6             0.978    0.033   30.037    0.000    0.856    0.934
  Factor3 =~                                                            
    Item7             1.000                               0.819    0.921
    Item8             0.841    0.044   19.012    0.000    0.689    0.797
    Item9             1.075    0.041   26.020    0.000    0.881    0.952

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Factor1 ~~                                                            
    Factor2          -0.018    0.046   -0.390    0.697   -0.024   -0.024
    Factor3          -0.059    0.043   -1.369    0.171   -0.084   -0.084
  Factor2 ~~                                                            
    Factor3          -0.034    0.044   -0.773    0.439   -0.047   -0.047

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .Item1             0.132    0.020    6.568    0.000    0.132    0.151
   .Item2             0.236    0.023   10.305    0.000    0.236    0.303
   .Item3             0.110    0.021    5.143    0.000    0.110    0.114
   .Item4             0.093    0.015    6.252    0.000    0.093    0.108
   .Item5             0.147    0.016    9.441    0.000    0.147    0.198
   .Item6             0.106    0.015    7.098    0.000    0.106    0.127
   .Item7             0.120    0.019    6.260    0.000    0.120    0.151
   .Item8             0.272    0.025   10.852    0.000    0.272    0.365
   .Item9             0.081    0.020    4.014    0.000    0.081    0.094
    Factor1           0.742    0.073   10.233    0.000    1.000    1.000
    Factor2           0.766    0.071   10.807    0.000    1.000    1.000
    Factor3           0.671    0.066   10.185    0.000    1.000    1.000
# Check model fit indices
fitMeasures(cfa_fit)
                 npar                  fmin                 chisq 
               21.000                 0.038                22.719 
                   df                pvalue        baseline.chisq 
               24.000                 0.536              2339.834 
          baseline.df       baseline.pvalue                   cfi 
               36.000                 0.000                 1.000 
                  tli                  nnfi                   rfi 
                1.001                 1.001                 0.985 
                  nfi                  pnfi                   ifi 
                0.990                 0.660                 1.001 
                  rni                  logl     unrestricted.logl 
                1.001             -2412.941             -2401.581 
                  aic                   bic                ntotal 
             4867.881              4945.661               300.000 
                 bic2                 rmsea        rmsea.ci.lower 
             4879.061                 0.000                 0.000 
       rmsea.ci.upper        rmsea.ci.level          rmsea.pvalue 
                0.044                 0.900                 0.978 
       rmsea.close.h0 rmsea.notclose.pvalue     rmsea.notclose.h0 
                0.050                 0.000                 0.080 
                  rmr            rmr_nomean                  srmr 
                0.019                 0.019                 0.024 
         srmr_bentler   srmr_bentler_nomean                  crmr 
                0.024                 0.024                 0.026 
          crmr_nomean            srmr_mplus     srmr_mplus_nomean 
                0.026                 0.024                 0.024 
                cn_05                 cn_01                   gfi 
              481.847               568.532                 0.984 
                 agfi                  pgfi                   mfi 
                0.969                 0.525                 1.002 
                 ecvi 
                0.216 
# Inspect modification indices (to identify potential model improvements)
modificationIndices(cfa_fit, sort = TRUE)
       lhs op   rhs    mi    epc sepc.lv sepc.all sepc.nox
76   Item7 ~~ Item8 5.993 -0.412  -0.412   -2.282   -2.282
30 Factor1 =~ Item9 5.133  0.068   0.059    0.063    0.063
46   Item1 ~~ Item5 4.340  0.023   0.023    0.165    0.165
31 Factor2 =~ Item1 3.736  0.059   0.052    0.056    0.056
78   Item8 ~~ Item9 3.427  0.353   0.353    2.379    2.379
50   Item1 ~~ Item9 2.345  0.016   0.016    0.160    0.160
53   Item2 ~~ Item5 2.297 -0.020  -0.020   -0.105   -0.105
51   Item2 ~~ Item3 2.243 -0.358  -0.358   -2.222   -2.222
29 Factor1 =~ Item8 2.198 -0.057  -0.049   -0.057   -0.057
28 Factor1 =~ Item7 2.039 -0.042  -0.036   -0.041   -0.041
44   Item1 ~~ Item3 2.017  0.539   0.539    4.473    4.473
33 Factor2 =~ Item3 1.998 -0.045  -0.039   -0.040   -0.040
38 Factor3 =~ Item2 1.605 -0.048  -0.040   -0.045   -0.045
57   Item2 ~~ Item9 1.523 -0.016  -0.016   -0.113   -0.113
48   Item1 ~~ Item7 1.446 -0.013  -0.013   -0.103   -0.103
34 Factor2 =~ Item7 1.384 -0.034  -0.029   -0.033   -0.033
77   Item7 ~~ Item9 1.264  0.394   0.394    4.011    4.011
47   Item1 ~~ Item6 0.934 -0.010  -0.010   -0.085   -0.085
37 Factor3 =~ Item1 0.841  0.030   0.025    0.027    0.027
36 Factor2 =~ Item9 0.821  0.026   0.023    0.025    0.025
55   Item2 ~~ Item7 0.687  0.010   0.010    0.062    0.062
75   Item6 ~~ Item9 0.646  0.008   0.008    0.084    0.084
70   Item5 ~~ Item7 0.635 -0.008  -0.008   -0.062   -0.062
74   Item6 ~~ Item8 0.634 -0.010  -0.010   -0.060   -0.060
40 Factor3 =~ Item4 0.632  0.023   0.019    0.021    0.021
26 Factor1 =~ Item5 0.589  0.023   0.020    0.023    0.023
41 Factor3 =~ Item5 0.563 -0.024  -0.019   -0.022   -0.022
58   Item3 ~~ Item4 0.430 -0.007  -0.007   -0.067   -0.067
32 Factor2 =~ Item2 0.390 -0.022  -0.019   -0.022   -0.022
62   Item3 ~~ Item8 0.386 -0.009  -0.009   -0.052   -0.052
25 Factor1 =~ Item4 0.380 -0.017  -0.015   -0.016   -0.016
59   Item3 ~~ Item5 0.352 -0.007  -0.007   -0.052   -0.052
67   Item4 ~~ Item8 0.316  0.007   0.007    0.044    0.044
60   Item3 ~~ Item6 0.303  0.006   0.006    0.054    0.054
66   Item4 ~~ Item7 0.302  0.005   0.005    0.050    0.050
54   Item2 ~~ Item6 0.258  0.006   0.006    0.039    0.039
68   Item4 ~~ Item9 0.257 -0.005  -0.005   -0.056   -0.056
63   Item3 ~~ Item9 0.209  0.005   0.005    0.053    0.053
73   Item6 ~~ Item7 0.200 -0.004  -0.004   -0.038   -0.038
52   Item2 ~~ Item4 0.195  0.005   0.005    0.036    0.036
43   Item1 ~~ Item2 0.186  0.092   0.092    0.525    0.525
71   Item5 ~~ Item8 0.173  0.006   0.006    0.028    0.028
69   Item5 ~~ Item6 0.164 -0.134  -0.134   -1.071   -1.071
35 Factor2 =~ Item8 0.118  0.013   0.011    0.013    0.013
65   Item4 ~~ Item6 0.083  0.131   0.131    1.317    1.317
56   Item2 ~~ Item8 0.050 -0.004  -0.004   -0.015   -0.015
42 Factor3 =~ Item6 0.033 -0.005  -0.004   -0.005   -0.005
72   Item5 ~~ Item9 0.032  0.002   0.002    0.017    0.017
64   Item4 ~~ Item5 0.030  0.061   0.061    0.518    0.518
61   Item3 ~~ Item7 0.015 -0.001  -0.001   -0.011   -0.011
45   Item1 ~~ Item4 0.009  0.001   0.001    0.009    0.009
49   Item1 ~~ Item8 0.006  0.001   0.001    0.006    0.006
39 Factor3 =~ Item3 0.000  0.001   0.000    0.000    0.000
27 Factor1 =~ Item6 0.000  0.000   0.000    0.000    0.000
# Residuals analysis
residuals(cfa_fit, type = "cor")
$type
[1] "cor.bollen"

$cov
       Item1  Item2  Item3  Item4  Item5  Item6  Item7  Item8  Item9
Item1  0.000                                                        
Item2  0.000  0.000                                                 
Item3  0.000 -0.001  0.000                                          
Item4  0.024 -0.021 -0.028  0.000                                   
Item5  0.061 -0.020 -0.005  0.000  0.000                            
Item6  0.026 -0.013 -0.016  0.000  0.000  0.000                     
Item7 -0.016 -0.046 -0.024 -0.007 -0.040 -0.025  0.000              
Item8 -0.030 -0.072 -0.049  0.024  0.002  0.000 -0.002  0.000       
Item9  0.044 -0.021  0.026  0.019 -0.006  0.010  0.000  0.001  0.000
# Visualize the model (optional)
# install.packages("semPlot")
library(semPlot)
Warning: package 'semPlot' was built under R version 4.4.2
semPaths(cfa_fit, what = "std", layout = "tree", 
         nCharNodes = 0, edge.label.cex = 0.8)

  • Fit indices:

    • Criteria for good fit:

      • CFI, TLI ≥ 0.95

      • RMSEA ≤ 0.05

      • SRMR ≤ 0.08

  • Standardized factor loadings:

    • How strongly each item contributes to its corresponding latent variable.

    • Criterion: loadings ≥ 0.7 are interpreted as a strong contribution.

  • Modification indices:

    • Suggest paths between items that would most improve the model; if a change is theoretically justified, the model can be revised and re-estimated.

  • Residuals:

    • The differences between observed and model-implied values; the smaller the residuals, the better the model explains the data.
# Standardized loadings table
inspect(cfa_fit, what = "std")
$lambda
      Factr1 Factr2 Factr3
Item1  0.922  0.000  0.000
Item2  0.835  0.000  0.000
Item3  0.941  0.000  0.000
Item4  0.000  0.944  0.000
Item5  0.000  0.895  0.000
Item6  0.000  0.934  0.000
Item7  0.000  0.000  0.921
Item8  0.000  0.000  0.797
Item9  0.000  0.000  0.952

$theta
      Item1 Item2 Item3 Item4 Item5 Item6 Item7 Item8 Item9
Item1 0.151                                                
Item2 0.000 0.303                                          
Item3 0.000 0.000 0.114                                    
Item4 0.000 0.000 0.000 0.108                              
Item5 0.000 0.000 0.000 0.000 0.198                        
Item6 0.000 0.000 0.000 0.000 0.000 0.127                  
Item7 0.000 0.000 0.000 0.000 0.000 0.000 0.151            
Item8 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.365      
Item9 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.094

$psi
        Factr1 Factr2 Factr3
Factor1  1.000              
Factor2 -0.024  1.000       
Factor3 -0.084 -0.047  1.000
# Covariance matrix of latent factors
inspect(cfa_fit, what = "cov.lv")
        Factr1 Factr2 Factr3
Factor1  0.742              
Factor2 -0.018  0.766       
Factor3 -0.059 -0.034  0.671


Fitting the SEM Model

# Define the SEM model
sem_model <- '
  # Measurement model (defines the factor structure)
  Factor1 =~ Item1 + Item2 + Item3
  Factor2 =~ Item4 + Item5 + Item6
  Factor3 =~ Item7 + Item8 + Item9

  # Structural model (defines the relationships between factors)
  Factor2 ~ Factor1            # Factor1 influences Factor2
  Factor3 ~ Factor1 + Factor2  # direct effect of Factor1 plus mediation via Factor2
'
# Fit the SEM model
sem_fit <- sem(model = sem_model, data = observed_data)

# Summarize the results
summary(sem_fit, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-19 ended normally after 35 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        21

  Number of observations                           300

Model Test User Model:
                                                      
  Test statistic                                22.719
  Degrees of freedom                                24
  P-value (Chi-square)                           0.536

Model Test Baseline Model:

  Test statistic                              2339.834
  Degrees of freedom                                36
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000
  Tucker-Lewis Index (TLI)                       1.001

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -2412.941
  Loglikelihood unrestricted model (H1)      -2401.581
                                                      
  Akaike (AIC)                                4867.881
  Bayesian (BIC)                              4945.661
  Sample-size adjusted Bayesian (SABIC)       4879.061

Root Mean Square Error of Approximation:

  RMSEA                                          0.000
  90 Percent confidence interval - lower         0.000
  90 Percent confidence interval - upper         0.044
  P-value H_0: RMSEA <= 0.050                    0.978
  P-value H_0: RMSEA >= 0.080                    0.000

Standardized Root Mean Square Residual:

  SRMR                                           0.024

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Factor1 =~                                                            
    Item1             1.000                               0.862    0.922
    Item2             0.855    0.041   20.832    0.000    0.736    0.835
    Item3             1.072    0.041   26.131    0.000    0.924    0.941
  Factor2 =~                                                            
    Item4             1.000                               0.875    0.944
    Item5             0.881    0.033   26.657    0.000    0.771    0.895
    Item6             0.978    0.033   30.037    0.000    0.856    0.934
  Factor3 =~                                                            
    Item7             1.000                               0.819    0.921
    Item8             0.841    0.044   19.012    0.000    0.689    0.797
    Item9             1.075    0.041   26.020    0.000    0.881    0.952

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Factor2 ~                                                             
    Factor1          -0.024    0.062   -0.390    0.696   -0.024   -0.024
  Factor3 ~                                                             
    Factor2          -0.046    0.057   -0.811    0.418   -0.049   -0.049
    Factor1          -0.081    0.058   -1.398    0.162   -0.085   -0.085

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .Item1             0.132    0.020    6.568    0.000    0.132    0.151
   .Item2             0.236    0.023   10.305    0.000    0.236    0.303
   .Item3             0.110    0.021    5.143    0.000    0.110    0.114
   .Item4             0.093    0.015    6.252    0.000    0.093    0.108
   .Item5             0.147    0.016    9.441    0.000    0.147    0.198
   .Item6             0.106    0.015    7.098    0.000    0.106    0.127
   .Item7             0.120    0.019    6.260    0.000    0.120    0.151
   .Item8             0.272    0.025   10.852    0.000    0.272    0.365
   .Item9             0.081    0.020    4.014    0.000    0.081    0.094
    Factor1           0.742    0.073   10.233    0.000    1.000    1.000
   .Factor2           0.766    0.071   10.806    0.000    0.999    0.999
   .Factor3           0.665    0.065   10.177    0.000    0.990    0.990
# Check model fit indices
fitMeasures(sem_fit)
                 npar                  fmin                 chisq 
               21.000                 0.038                22.719 
                   df                pvalue        baseline.chisq 
               24.000                 0.536              2339.834 
          baseline.df       baseline.pvalue                   cfi 
               36.000                 0.000                 1.000 
                  tli                  nnfi                   rfi 
                1.001                 1.001                 0.985 
                  nfi                  pnfi                   ifi 
                0.990                 0.660                 1.001 
                  rni                  logl     unrestricted.logl 
                1.001             -2412.941             -2401.581 
                  aic                   bic                ntotal 
             4867.881              4945.661               300.000 
                 bic2                 rmsea        rmsea.ci.lower 
             4879.061                 0.000                 0.000 
       rmsea.ci.upper        rmsea.ci.level          rmsea.pvalue 
                0.044                 0.900                 0.978 
       rmsea.close.h0 rmsea.notclose.pvalue     rmsea.notclose.h0 
                0.050                 0.000                 0.080 
                  rmr            rmr_nomean                  srmr 
                0.019                 0.019                 0.024 
         srmr_bentler   srmr_bentler_nomean                  crmr 
                0.024                 0.024                 0.026 
          crmr_nomean            srmr_mplus     srmr_mplus_nomean 
                0.026                 0.024                 0.024 
                cn_05                 cn_01                   gfi 
              481.847               568.532                 0.984 
                 agfi                  pgfi                   mfi 
                0.969                 0.525                 1.002 
                 ecvi 
                0.216 
# Check modification indices (modify the model if needed)
modificationIndices(sem_fit, sort = TRUE)
       lhs op   rhs    mi    epc sepc.lv sepc.all sepc.nox
76   Item7 ~~ Item8 5.993 -0.412  -0.412   -2.282   -2.282
30 Factor1 =~ Item9 5.133  0.068   0.059    0.063    0.063
46   Item1 ~~ Item5 4.340  0.023   0.023    0.165    0.165
31 Factor2 =~ Item1 3.736  0.059   0.052    0.056    0.056
78   Item8 ~~ Item9 3.427  0.353   0.353    2.379    2.379
50   Item1 ~~ Item9 2.345  0.016   0.016    0.160    0.160
53   Item2 ~~ Item5 2.297 -0.020  -0.020   -0.105   -0.105
51   Item2 ~~ Item3 2.242 -0.358  -0.358   -2.222   -2.222
29 Factor1 =~ Item8 2.198 -0.057  -0.049   -0.057   -0.057
28 Factor1 =~ Item7 2.039 -0.042  -0.036   -0.041   -0.041
44   Item1 ~~ Item3 2.017  0.539   0.539    4.474    4.474
33 Factor2 =~ Item3 1.998 -0.045  -0.039   -0.040   -0.040
38 Factor3 =~ Item2 1.605 -0.048  -0.040   -0.045   -0.045
57   Item2 ~~ Item9 1.523 -0.016  -0.016   -0.113   -0.113
48   Item1 ~~ Item7 1.446 -0.013  -0.013   -0.103   -0.103
34 Factor2 =~ Item7 1.384 -0.034  -0.029   -0.033   -0.033
77   Item7 ~~ Item9 1.264  0.394   0.394    4.012    4.012
47   Item1 ~~ Item6 0.934 -0.010  -0.010   -0.085   -0.085
37 Factor3 =~ Item1 0.841  0.030   0.025    0.027    0.027
36 Factor2 =~ Item9 0.821  0.026   0.023    0.025    0.025
55   Item2 ~~ Item7 0.687  0.010   0.010    0.062    0.062
75   Item6 ~~ Item9 0.646  0.008   0.008    0.084    0.084
70   Item5 ~~ Item7 0.635 -0.008  -0.008   -0.062   -0.062
74   Item6 ~~ Item8 0.634 -0.010  -0.010   -0.060   -0.060
40 Factor3 =~ Item4 0.632  0.023   0.019    0.021    0.021
26 Factor1 =~ Item5 0.589  0.023   0.020    0.023    0.023
41 Factor3 =~ Item5 0.563 -0.024  -0.019   -0.022   -0.022
58   Item3 ~~ Item4 0.430 -0.007  -0.007   -0.067   -0.067
32 Factor2 =~ Item2 0.390 -0.022  -0.019   -0.022   -0.022
62   Item3 ~~ Item8 0.386 -0.009  -0.009   -0.052   -0.052
25 Factor1 =~ Item4 0.380 -0.017  -0.015   -0.016   -0.016
59   Item3 ~~ Item5 0.352 -0.007  -0.007   -0.052   -0.052
67   Item4 ~~ Item8 0.316  0.007   0.007    0.044    0.044
60   Item3 ~~ Item6 0.303  0.006   0.006    0.054    0.054
66   Item4 ~~ Item7 0.302  0.005   0.005    0.050    0.050
54   Item2 ~~ Item6 0.258  0.006   0.006    0.039    0.039
68   Item4 ~~ Item9 0.257 -0.005  -0.005   -0.056   -0.056
63   Item3 ~~ Item9 0.209  0.005   0.005    0.053    0.053
73   Item6 ~~ Item7 0.200 -0.004  -0.004   -0.038   -0.038
52   Item2 ~~ Item4 0.195  0.005   0.005    0.036    0.036
43   Item1 ~~ Item2 0.186  0.092   0.092    0.525    0.525
71   Item5 ~~ Item8 0.173  0.006   0.006    0.028    0.028
69   Item5 ~~ Item6 0.164 -0.134  -0.134   -1.071   -1.071
35 Factor2 =~ Item8 0.118  0.013   0.011    0.013    0.013
65   Item4 ~~ Item6 0.083  0.131   0.131    1.316    1.316
56   Item2 ~~ Item8 0.050 -0.004  -0.004   -0.015   -0.015
42 Factor3 =~ Item6 0.033 -0.005  -0.004   -0.005   -0.005
72   Item5 ~~ Item9 0.032  0.002   0.002    0.017    0.017
64   Item4 ~~ Item5 0.030  0.060   0.060    0.518    0.518
61   Item3 ~~ Item7 0.015 -0.001  -0.001   -0.011   -0.011
45   Item1 ~~ Item4 0.009  0.001   0.001    0.009    0.009
49   Item1 ~~ Item8 0.006  0.001   0.001    0.006    0.006
39 Factor3 =~ Item3 0.000  0.001   0.000    0.000    0.000
27 Factor1 =~ Item6 0.000  0.000   0.000    0.000    0.000
# semPaths(): adjust edge color, width, and fading
semPaths(
  object = sem_fit,         # fitted SEM model object
  what = "std",             # show standardized path coefficients
  layout = "tree",          # tree layout
  edge.label.cex = 1.2,     # larger path label size
  edge.color = "black",     # edge color
  edge.width = 1,           # edge width
  residuals = TRUE,         # include residuals
  intercepts = FALSE,       # omit intercepts
  fade = FALSE,             # no fading, so edges stay fully opaque
  nCharNodes = 0            # do not abbreviate node names
)


Clustering

  • About Mechanism pdf

K-Means Clustering in R

The basic idea behind k-means clustering consists of defining clusters so that the total intra-cluster variation (known as total within-cluster variation) is minimized. There are several k-means algorithms available. The standard algorithm is the Hartigan-Wong algorithm (Hartigan and Wong 1979), which defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid.
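
In formula form, writing \(\mu_k\) for the centroid of cluster \(C_k\), the quantity being minimized is:

\[ W(C_k) = \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2, \qquad \text{tot.withinss} = \sum_{k=1}^{K} W(C_k) \]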


Let’s experiment with this simple shiny app

https://jjallaire.shinyapps.io/shiny-k-means/


Hands-on practice

The data we’ll use is USArrests. The data must contain only continuous variables, as the k-means algorithm uses variable means.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ ggplot2::%+%()   masks psych::%+%()
✖ ggplot2::alpha() masks psych::alpha()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ dplyr::select()  masks MASS::select()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data("USArrests")      # Loading the data set
head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7

The USArrests dataset is a built-in dataset in R that contains data on crime rates (number of arrests per 100,000 residents) in the United States in 1973. The dataset has 50 observations, corresponding to the 50 US states, and 4 variables:

  • Murder: Murder arrests (number of arrests for murder per 100,000 residents).

  • Assault: Assault arrests (number of arrests for assault per 100,000 residents).

  • UrbanPop: Urban population (percentage of the population living in urban areas).

  • Rape: Rape arrests (number of arrests for rape per 100,000 residents).


Visualize the data

See the link for details (in Korean)

  • Let’s create a new column for the state name
#change row names to column (variable)
crime <- rownames_to_column(USArrests, var="state") 
head(crime)
       state Murder Assault UrbanPop Rape
1    Alabama   13.2     236       58 21.2
2     Alaska   10.0     263       48 44.5
3    Arizona    8.1     294       80 31.0
4   Arkansas    8.8     190       50 19.5
5 California    9.0     276       91 40.6
6   Colorado    7.9     204       78 38.7
#change the upper letter to lower character in state variable
crime$state <- tolower(crime$state) 
head(crime)
       state Murder Assault UrbanPop Rape
1    alabama   13.2     236       58 21.2
2     alaska   10.0     263       48 44.5
3    arizona    8.1     294       80 31.0
4   arkansas    8.8     190       50 19.5
5 california    9.0     276       91 40.6
6   colorado    7.9     204       78 38.7

The states_map <- map_data("state") code is used to create a dataframe that contains map data for the US states.

The map_data() function is from the ggplot2 package, and it returns a dataframe that contains latitude and longitude coordinates for the boundaries of each state, along with additional information that can be used to plot the map.

The argument to the map_data() function is the name of the region for which to retrieve the map data. In this case, the argument is "state", which indicates that we want map data for the US states.

The resulting states_map dataframe contains the following columns:

  • long: A vector of longitudes representing the boundaries of the state.

  • lat: A vector of latitudes representing the boundaries of the state.

  • group: An integer indicating the group to which each point belongs. This is used to group the points together when plotting the map.

  • order: An integer indicating the order in which the points should be plotted.

  • region: A character string indicating the name of the state.

  • subregion: A character string indicating the name of a subregion within the state, if applicable. This is usually NA for the state maps.

states_map <- map_data("state")
head(states_map)
       long      lat group order  region subregion
1 -87.46201 30.38968     1     1 alabama      <NA>
2 -87.48493 30.37249     1     2 alabama      <NA>
3 -87.52503 30.37249     1     3 alabama      <NA>
4 -87.53076 30.33239     1     4 alabama      <NA>
5 -87.57087 30.32665     1     5 alabama      <NA>
6 -87.58806 30.32665     1     6 alabama      <NA>

The code library(ggiraphExtra) loads the ggiraphExtra package, which extends the functionality of the ggplot2 package to allow for interactive graphics in R.

The ggChoropleth() function is from the ggiraphExtra package and is used to create a choropleth map in which each state is colored according to its value of a specified variable.

The first argument to ggChoropleth() is the data frame containing the data to be plotted, which is crime in this case.

The second argument is the aes() function, which is used to map variables in the data frame to visual properties of the plot. The fill aesthetic is used to specify that the color of each state should be determined by the value of the Murder variable in the crime data frame. The map_id aesthetic is used to specify that each state should be identified by its name, which is found in the state variable in the crime data frame.

The third argument is the map argument, which specifies the data frame containing the map data. In this case, the states_map data frame is used, which was created earlier using the map_data() function.

# install.packages("ggiraphExtra")
library(ggiraphExtra)

ggChoropleth(data=crime, aes(fill=Murder, map_id=state), map=states_map)

  • Remove legend and change background color

  • Murder rate by state

library(ggthemes)

Attaching package: 'ggthemes'
The following object is masked from 'package:ggiraphExtra':

    theme_clean
ggChoropleth(data=crime, aes(fill=Murder, map_id=state), map=states_map) + 
  theme_map() + theme(legend.position="right")

  • Urban pop by state
ggChoropleth(data=crime, aes(fill=UrbanPop, map_id=state), map=states_map) + 
  theme_map() + theme(legend.position="right")

  • Assault rate by state
ggChoropleth(data=crime, aes(fill=Assault, map_id=state), map=states_map) + 
  theme_map() + theme(legend.position="right")

  • Rape rate by state
ggChoropleth(data=crime, aes(fill=Rape, map_id=state), map=states_map) + 
  theme_map() + theme(legend.position="right")

Required R packages and functions

The standard R function for k-means clustering is kmeans() [stats package], whose simplified format is as follows:

kmeans(x, centers, iter.max = 10, nstart = 1)

glimpse(USArrests)
Rows: 50
Columns: 4
$ Murder   <dbl> 13.2, 10.0, 8.1, 8.8, 9.0, 7.9, 3.3, 5.9, 15.4, 17.4, 5.3, 2.…
$ Assault  <int> 236, 263, 294, 190, 276, 204, 110, 238, 335, 211, 46, 120, 24…
$ UrbanPop <int> 58, 48, 80, 50, 91, 78, 77, 72, 80, 60, 83, 54, 83, 65, 57, 6…
$ Rape     <dbl> 21.2, 44.5, 31.0, 19.5, 40.6, 38.7, 11.1, 15.8, 31.9, 25.8, 2…
  • Z-score Standardization for k-means clustering
df <- scale(USArrests) # Scaling the data
head(df)
               Murder   Assault   UrbanPop         Rape
Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
Arizona    0.07163341 1.4788032  0.9989801  1.042878388
Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
California 0.27826823 1.2628144  1.7589234  2.067820292
Colorado   0.02571456 0.3988593  0.8608085  1.864967207
summary(df)
     Murder           Assault           UrbanPop             Rape        
 Min.   :-1.6044   Min.   :-1.5090   Min.   :-2.31714   Min.   :-1.4874  
 1st Qu.:-0.8525   1st Qu.:-0.7411   1st Qu.:-0.76271   1st Qu.:-0.6574  
 Median :-0.1235   Median :-0.1411   Median : 0.03178   Median :-0.1209  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
 3rd Qu.: 0.7949   3rd Qu.: 0.9388   3rd Qu.: 0.84354   3rd Qu.: 0.5277  
 Max.   : 2.2069   Max.   : 1.9948   Max.   : 1.75892   Max.   : 2.6444  

To create a beautiful graph of the clusters generated with the kmeans() function, we will use the factoextra package.

# install.packages("factoextra")
library(factoextra)
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
  • How many clusters (k) is best?
g1<-fviz_nbclust(df, kmeans, method = "wss")
g1

g2<-fviz_nbclust(df, kmeans, method = "silhouette")
g2

This code uses the fviz_nbclust() function from the factoextra package in R to determine the optimal number of clusters to use in a K-means clustering analysis.

The first argument of fviz_nbclust() is the data frame df that contains the variables to be used in the clustering analysis.

The second argument is the clustering function kmeans that specifies the algorithm to be used for clustering.

The third argument method specifies the method to be used to determine the optimal number of clusters. In this case, two methods are used:

  • "wss": Within-cluster sum of squares. This method computes the sum of squared distances between each observation and its assigned cluster center, and then adds up these values across all clusters. The goal is to find the number of clusters that minimize the within-cluster sum of squares.

  • "silhouette": Silhouette width. This method computes a silhouette width for each observation, which measures how similar the observation is to its own cluster compared to other clusters. The goal is to find the number of clusters that maximize the average silhouette width across all observations.

Let’s run unsupervised clustering given k=4

# Compute k-means with k = 4
set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)

As the result of k-means clustering is sensitive to the random starting assignments, we specify nstart = 25. This means that R will try 25 different random starting assignments and keep the best result, i.e., the one with the lowest within-cluster variation. The default value of nstart in R is 1, but it is strongly recommended to run k-means with a larger value such as 25 or 50 in order to obtain a more stable result.
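As a small illustration of why this matters, compare the within-cluster sum of squares obtained with nstart = 1 and nstart = 25:

# Hedged sketch: stability of nstart = 1 vs nstart = 25
set.seed(42)  # arbitrary seed chosen only for illustration
replicate(5, kmeans(df, centers = 4, nstart = 1)$tot.withinss)   # may vary from run to run
replicate(5, kmeans(df, centers = 4, nstart = 25)$tot.withinss)  # should all match the best value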

  • Print the result
# Print the results
print(km.res)
K-means clustering with 4 clusters of sizes 8, 13, 16, 13

Cluster means:
      Murder    Assault   UrbanPop        Rape
1  1.4118898  0.8743346 -0.8145211  0.01927104
2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
3 -0.4894375 -0.3826001  0.5758298 -0.26165379
4  0.6950701  1.0394414  0.7226370  1.27693964

Clustering vector:
       Alabama         Alaska        Arizona       Arkansas     California 
             1              4              4              1              4 
      Colorado    Connecticut       Delaware        Florida        Georgia 
             4              3              3              4              1 
        Hawaii          Idaho       Illinois        Indiana           Iowa 
             3              2              4              3              2 
        Kansas       Kentucky      Louisiana          Maine       Maryland 
             3              2              1              2              4 
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
             3              4              2              1              4 
       Montana       Nebraska         Nevada  New Hampshire     New Jersey 
             2              2              4              2              3 
    New Mexico       New York North Carolina   North Dakota           Ohio 
             4              4              1              2              3 
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
             3              3              3              3              1 
  South Dakota      Tennessee          Texas           Utah        Vermont 
             2              1              4              3              2 
      Virginia     Washington  West Virginia      Wisconsin        Wyoming 
             3              3              2              2              3 

Within cluster sum of squares by cluster:
[1]  8.316061 11.952463 16.212213 19.922437
 (between_SS / total_SS =  71.2 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

The printed output displays:

  • the cluster means or centers: a matrix whose rows are the cluster numbers (1 to 4) and whose columns are the variables
  • the clustering vector: a vector of integers (from 1:k) indicating the cluster to which each observation is allocated
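Note that these centers are in z-score units because we clustered the scaled data. As a small sketch, they can be converted back to the original units using the attributes stored by scale():

# Hedged sketch: undo the z-score scaling to report cluster centers
# in the original units of USArrests
centers_original <- t(apply(km.res$centers, 1, function(r)
  r * attr(df, "scaled:scale") + attr(df, "scaled:center")))
centers_original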

To add the point classifications to the original data, first inspect the structure of the k-means object to see where the cluster assignments are stored:

str(km.res)
List of 9
 $ cluster     : Named int [1:50] 1 4 4 1 4 4 3 3 4 1 ...
  ..- attr(*, "names")= chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
 $ centers     : num [1:4, 1:4] 1.412 -0.962 -0.489 0.695 0.874 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "1" "2" "3" "4"
  .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
 $ totss       : num 196
 $ withinss    : num [1:4] 8.32 11.95 16.21 19.92
 $ tot.withinss: num 56.4
 $ betweenss   : num 140
 $ size        : int [1:4] 8 13 16 13
 $ iter        : int 2
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
km.res$cluster
       Alabama         Alaska        Arizona       Arkansas     California 
             1              4              4              1              4 
      Colorado    Connecticut       Delaware        Florida        Georgia 
             4              3              3              4              1 
        Hawaii          Idaho       Illinois        Indiana           Iowa 
             3              2              4              3              2 
        Kansas       Kentucky      Louisiana          Maine       Maryland 
             3              2              1              2              4 
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
             3              4              2              1              4 
       Montana       Nebraska         Nevada  New Hampshire     New Jersey 
             2              2              4              2              3 
    New Mexico       New York North Carolina   North Dakota           Ohio 
             4              4              1              2              3 
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
             3              3              3              3              1 
  South Dakota      Tennessee          Texas           Utah        Vermont 
             2              1              4              3              2 
      Virginia     Washington  West Virginia      Wisconsin        Wyoming 
             3              3              2              2              3 
  • Create a new data frame including cluster information
dd <- data.frame(USArrests, cluster = km.res$cluster)
head(dd)
           Murder Assault UrbanPop Rape cluster
Alabama      13.2     236       58 21.2       1
Alaska       10.0     263       48 44.5       4
Arizona       8.1     294       80 31.0       4
Arkansas      8.8     190       50 19.5       1
California    9.0     276       91 40.6       4
Colorado      7.9     204       78 38.7       4
  • Check groups’ characteristics
dd %>% 
  group_by(cluster) %>% 
  summarize_all(mean)
# A tibble: 4 × 5
  cluster Murder Assault UrbanPop  Rape
    <int>  <dbl>   <dbl>    <dbl> <dbl>
1       1  13.9    244.      53.8  21.4
2       2   3.6     78.5     52.1  12.2
3       3   5.66   139.      73.9  18.8
4       4  10.8    257.      76    33.2
  • Cluster number for each of the observations
table(dd$cluster)

 1  2  3  4 
 8 13 16 13 
  • Reshape the dataset to visualize
library(reshape2)

dd %>% 
  melt(id.vars="cluster") %>% 
  mutate(cluster=as.factor(cluster)) %>% 
  head(20)
   cluster variable value
1        1   Murder  13.2
2        4   Murder  10.0
3        4   Murder   8.1
4        1   Murder   8.8
5        4   Murder   9.0
6        4   Murder   7.9
7        3   Murder   3.3
8        3   Murder   5.9
9        4   Murder  15.4
10       1   Murder  17.4
11       3   Murder   5.3
12       2   Murder   2.6
13       4   Murder  10.4
14       3   Murder   7.2
15       2   Murder   2.2
16       3   Murder   6.0
17       2   Murder   9.7
18       1   Murder  15.4
19       2   Murder   2.1
20       4   Murder  11.3
  • Check groups’ characteristics with ggplot
dd %>% 
  melt(id.vars="cluster") %>% 
  mutate(cluster=as.factor(cluster)) %>% 
  ggplot(aes(x=cluster, y=value))+
  geom_boxplot()+
  facet_wrap(~variable, scale="free_y")

  • cluster 1: rural areas with high murder and assault rates
  • cluster 2: peaceful rural areas with low crime overall
  • cluster 3: urban areas with relatively low crime
  • cluster 4: large urban areas with high crime across the board (Gotham City…?)

Let’s see whether the clusters are well separated.

fviz_cluster(km.res, data = df)
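When the data contains more than two variables, fviz_cluster() projects the observations onto the first two principal components. A minimal sketch of that projection, using df and km.res from above, looks like this:

# Hedged sketch: roughly what fviz_cluster() plots, i.e. the observations
# projected onto the first two principal components, coloured by cluster
library(ggplot2)

pca <- prcomp(df)
plot_df <- data.frame(pca$x[, 1:2], cluster = factor(km.res$cluster))
ggplot(plot_df, aes(x = PC1, y = PC2, colour = cluster)) +
  geom_point()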

Let’s try different values of k (the number of clusters).

k2 <- kmeans(df, centers = 2, nstart = 25)
k3 <- kmeans(df, centers = 3, nstart = 25)
k4 <- kmeans(df, centers = 4, nstart = 25)
k5 <- kmeans(df, centers = 5, nstart = 25)

# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = df) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point",  data = df) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = df) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = df) + ggtitle("k = 5")

library(gridExtra)

grid.arrange(p1, p2, p3, p4, nrow = 2)