Getting Started with semanticfa

Overview

semanticfa performs exploratory factor analysis on language model embeddings of psychological scale items. Given item text, it embeds each item, computes a similarity matrix, and extracts latent factors — entirely from the text, with no human response data required.

The package is designed to feel familiar to psych and EFAtools users.

Quick start with bundled data

The package ships with the 50-item IPIP Big Five inventory and precomputed sentence-BERT embeddings, so you can try it with zero setup:

library(semanticfa)
data(big5)

fit <- sfa(
  big5$items,
  nfactors    = 5,
  embeddings  = big5$embeddings,
  scoring     = big5$scoring
)
print(fit)
#> Semantic Factor Analysis
#>   Encoding: atomic 
#>   Embedding dim: 384 
#>   Factors:5 (minres + oblimin)
#> 
#> Diagnostics:
#>   KMO:  0.866 (meritorious - higher is better)
#>   TEFI: -47.0724 (lower is better)
#>   RMSR: 0.0556 (acceptable - lower is better)
#>   CAF:  0.4880 (marginal - higher is better)
#> 
#> Factor loadings:
#> 
#> Loadings:
#>         MR1    MR4    MR3    MR5    MR2   
#> item_39  0.617                            
#> item_35  0.598                            
#> item_34  0.561                            
#> item_05  0.531                            
#> item_32  0.524                            
#> item_49  0.509                            
#> item_04  0.492                            
#> item_28  0.485                            
#> item_38  0.444                            
#> item_12  0.438                0.307       
#> item_07  0.425                            
#> item_31  0.379                            
#> item_02  0.372         0.338              
#> item_01  0.366                            
#> item_33  0.365                            
#> item_36  0.343                            
#> item_13                                   
#> item_37                                   
#> item_16         0.601                     
#> item_48         0.548                0.307
#> item_24         0.530  0.315              
#> item_21         0.493  0.417              
#> item_23         0.467                     
#> item_30         0.464                     
#> item_47         0.444                0.347
#> item_29         0.413         0.351       
#> item_15         0.411                     
#> item_19         0.383         0.377       
#> item_06         0.319                     
#> item_27                0.888              
#> item_25                0.708              
#> item_08                0.498              
#> item_22                0.480              
#> item_09                0.418              
#> item_10  0.361         0.400              
#> item_03         0.366  0.383              
#> item_20                       0.786       
#> item_14                       0.686       
#> item_18                       0.662       
#> item_11                       0.498       
#> item_17                       0.485       
#> item_26                                   
#> item_44                0.362         0.686
#> item_42                              0.672
#> item_50                              0.649
#> item_45                              0.603
#> item_43                              0.528
#> item_46                              0.512
#> item_41         0.327                0.390
#> item_40                                   
#> 
#>                  MR1   MR4   MR3   MR5   MR2
#> SS loadings    4.762 3.180 3.289 2.955 3.149
#> Proportion Var 0.095 0.064 0.066 0.059 0.063
#> Cumulative Var 0.095 0.159 0.225 0.284 0.347
#> 
#> Factor correlations (Phi):
#>       MR1   MR4   MR3   MR5   MR2
#> MR1 1.000 0.381 0.227 0.337 0.276
#> MR4 0.381 1.000 0.336 0.358 0.129
#> MR3 0.227 0.336 1.000 0.202 0.183
#> MR5 0.337 0.358 0.202 1.000 0.052
#> MR2 0.276 0.129 0.183 0.052 1.000
#> 
#> Variance accounted for:
#>                         MR1   MR4   MR3   MR5   MR2
#> SS loadings           5.968 4.208 3.880 3.572 3.524
#> Proportion Var        0.119 0.084 0.078 0.071 0.070
#> Cumulative Var        0.119 0.204 0.281 0.353 0.423
#> Proportion Explained  0.282 0.199 0.183 0.169 0.167
#> Cumulative Proportion 0.282 0.481 0.665 0.833 1.000

Factor retention

When you omit nfactors, sfa() chooses the number of factors with embedding-adapted parallel analysis, which uses random unit vectors in the embedding dimension as the null:

fit_auto <- sfa(
  big5$items,
  embeddings = big5$embeddings,
  scoring    = big5$scoring
)
cat("Auto-detected factors:", fit_auto$factors, "\n")
#> Auto-detected factors: 8
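
To make the idea of a random-unit-vector null concrete, here is a minimal base-R sketch. It is not semanticfa's internal code: the helper names (unit_rows, cosine_sim, parallel_unit_vectors), the 95th-percentile threshold, the use of plain cosine similarity, and the toy data are all assumptions made for illustration.

```r
# Illustrative sketch only; all function names and defaults are hypothetical.
set.seed(1)

# Scale every row of a matrix to unit length.
unit_rows <- function(m) m / sqrt(rowSums(m^2))

# Cosine similarity between the rows of an embedding matrix.
cosine_sim <- function(emb) tcrossprod(unit_rows(emb))

parallel_unit_vectors <- function(emb, n_iter = 50, quantile_p = 0.95) {
  n <- nrow(emb)
  d <- ncol(emb)
  observed <- eigen(cosine_sim(emb), symmetric = TRUE, only.values = TRUE)$values
  # Null: eigenvalues of similarity matrices built from random directions
  # (Gaussian rows become uniformly distributed directions once normalised).
  null_eigs <- replicate(n_iter, {
    rand <- matrix(rnorm(n * d), n, d)
    eigen(cosine_sim(rand), symmetric = TRUE, only.values = TRUE)$values
  })
  threshold <- apply(null_eigs, 1, quantile, probs = quantile_p)
  exceeds <- observed > threshold
  # Keep factors up to the first eigenvalue that fails to beat the null.
  if (all(exceeds)) n else which(!exceeds)[1] - 1
}

# Toy embeddings: 21 items in 3 well-separated clusters, 32 dimensions.
centers <- matrix(0, 3, 32)
centers[1, 1:10]  <- 1
centers[2, 11:20] <- 1
centers[3, 21:32] <- 1
cluster <- rep(1:3, length.out = 21)
toy_emb <- centers[cluster, ] + matrix(rnorm(21 * 32, sd = 0.2), 21, 32)

n_keep <- parallel_unit_vectors(toy_emb)
```

On cleanly clustered toy data like this the count should match the number of clusters. Real item embeddings are much less tidy, which is one reason an automatic estimate can differ from the number of factors a scale was designed around.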

For a multi-method comparison, use sfa_nfactors():

sim <- sfa_similarity(big5$embeddings, encoding = "atomic_reversed",
                      scoring = big5$scoring)
nf <- sfa_nfactors(sim, big5$embeddings,
                   methods = c("parallel", "kaiser"),
                   parallel_iter = 50)
print(nf)
#> Factor retention analysis (embedding-adapted)
#> 
#>   Method       n_factors
#>   parallel     8
#>   kaiser       13
#>   ------------------------
#>   Consensus    8
#> 
#> Eigenvalues: 13.5   3.5   2.9   2.1   1.8   1.7   1.6   1.5   1.3   1.2  ...
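
For reference, the Kaiser criterion simply counts eigenvalues greater than 1. The sketch below applies it to a small hand-made correlation matrix; kaiser_count and toy_sim are hypothetical names, and the package may compute its "kaiser" estimate differently in detail.

```r
# Kaiser criterion: retain as many factors as there are eigenvalues above 1.
# kaiser_count() is an illustrative helper, not part of semanticfa.
kaiser_count <- function(sim) {
  ev <- eigen(sim, symmetric = TRUE, only.values = TRUE)$values
  sum(ev > 1)
}

# Toy 4 x 4 correlation matrix made of two correlated item pairs.
# Its eigenvalues are 1.8, 1.6, 0.4, and 0.2.
toy_sim <- matrix(c( 1, .8,  0,  0,
                    .8,  1,  0,  0,
                     0,  0,  1, .6,
                     0,  0, .6,  1), nrow = 4)
kaiser_count(toy_sim)
#> [1] 2
```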

Encoding methods

The encoding argument controls how embeddings become a similarity matrix:

sim_ar  <- sfa_similarity(big5$embeddings, "atomic_reversed", big5$scoring)
sim_sq  <- sfa_similarity(big5$embeddings, "squid", big5$scoring)
#> Warning: 'scoring' has reverse-keyed items but is ignored for encoding =
#> "squid": this method is keying-free by design. Use "atomic_reversed" for keyed
#> sign-flipping.
sim_mcp <- sfa_similarity(big5$embeddings, "mean_centered_pearson", big5$scoring)
#> Warning: 'scoring' has reverse-keyed items but is ignored for encoding =
#> "mean_centered_pearson": this method is keying-free by design. Use
#> "atomic_reversed" for keyed sign-flipping.

cat("atomic_reversed range:", range(sim_ar[lower.tri(sim_ar)]), "\n")
#> atomic_reversed range: -0.8933168 0.7684432
cat("squid range:          ", range(sim_sq[lower.tri(sim_sq)]), "\n")
#> squid range:           -0.3360389 0.8676635
cat("mean_centered_pearson:", range(sim_mcp[lower.tri(sim_mcp)]), "\n")
#> mean_centered_pearson: -0.02542094 0.8932819

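SQuID and mean-centered Pearson recover negative correlations between reverse-keyed dimensions from the embeddings alone, without using the scoring key; atomic_reversed does not, and instead flips signs according to scoring.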
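
The difference between keyed and keying-free similarity can be illustrated with base R. This is a rough sketch under stated assumptions, not the package's definitions: the random emb matrix, the +1/-1 keys vector, and the three constructions below are hypothetical stand-ins, and sfa_similarity() may define "atomic_reversed" and "mean_centered_pearson" differently. SQuID is not sketched here.

```r
# Hypothetical data: 6 items, 16 embedding dimensions.
set.seed(42)
emb  <- matrix(rnorm(6 * 16), nrow = 6)
keys <- c(1, 1, -1, 1, -1, 1)   # assumed format: -1 marks a reverse-keyed item

# Plain cosine similarity between item vectors.
unit    <- emb / sqrt(rowSums(emb^2))
sim_cos <- tcrossprod(unit)

# Keyed sign-flipping (assumption): negate the similarity of every pair
# whose items are keyed in opposite directions.
sim_flipped <- sim_cos * outer(keys, keys)

# Keying-free alternative: Pearson correlation between item vectors,
# i.e. cosine similarity after centering each vector on its own mean.
sim_pearson <- cor(t(emb))

range(sim_flipped[lower.tri(sim_flipped)])
range(sim_pearson[lower.tri(sim_pearson)])
```

In the keyed version, the sign of each off-diagonal entry depends on the keys vector; in the Pearson version, any negative value comes from the vectors themselves.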

Visualization

plot(fit, type = "scree")

Figure: Scree plot with parallel analysis threshold

plot(fit, type = "loadings")

Figure: Factor loading heatmap

Comparing with empirical factor analysis

The $loadings component works directly with psych functions:

# Run human-data EFA (not run — requires response data)
human_fit <- psych::fa(response_data, nfactors = 5, rotate = "oblimin")

# Compare
psych::factor.congruence(fit$loadings, human_fit$loadings)
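
psych::factor.congruence() reports Tucker's congruence coefficient for each pair of factors, which is the cosine of the angle between two loading columns. A minimal sketch of that formula, using a hypothetical helper name and made-up loadings, looks like this:

```r
# Tucker's congruence coefficient between every column of x and every column of y.
# tucker_congruence() is an illustrative helper, not a psych or semanticfa function.
tucker_congruence <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  crossprod(x, y) / sqrt(outer(colSums(x^2), colSums(y^2)))
}

# Made-up loadings: 4 items on 2 factors.
a <- matrix(c(.7, .6, .1,  0,
              .1,  0, .8, .5), ncol = 2)
b <- a * 2   # same pattern of loadings, different scale

round(tucker_congruence(a, b), 3)
```

Because the coefficient ignores scale, matching factors in a and b have congruence 1 on the diagonal even though b's loadings are twice as large.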

For other congruence metrics (NMI, ARI, Frobenius, and disattenuated correlation), use sfa_congruence():

cong <- sfa_congruence(fit, big5$factors, metrics = c("nmi", "ari"))
print(cong)
#> Factor structure congruence
#> 
#>   NMI:            0.428 (weak - higher is better)
#>   ARI:            0.257 (poor - higher is better)
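
As background for the ARI figure, the adjusted Rand index compares two partitions of the same items and corrects for chance agreement. The sketch below implements the standard formula in base R. How sfa_congruence() turns loadings into item clusters is not shown in this document, so the dominant_factor() step (assigning each item to its largest absolute loading) is an assumption, and both helper names are hypothetical.

```r
# Adjusted Rand index between two cluster labelings of the same items.
adjusted_rand <- function(a, b) {
  tab   <- table(a, b)
  comb2 <- function(x) x * (x - 1) / 2          # number of pairs, choose(x, 2)
  sum_ij   <- sum(comb2(tab))                   # pairs placed together in both
  sum_a    <- sum(comb2(rowSums(tab)))          # pairs placed together in a
  sum_b    <- sum(comb2(colSums(tab)))          # pairs placed together in b
  expected <- sum_a * sum_b / comb2(length(a))  # agreement expected by chance
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}

# Assumed clustering rule: each item belongs to the factor on which
# it has the largest absolute loading.
dominant_factor <- function(loadings) apply(abs(loadings), 1, which.max)

# Identical partitions with swapped labels score 1.
adjusted_rand(c(1, 1, 2, 2), c(2, 2, 1, 1))
#> [1] 1
```

The index does not depend on how clusters are labeled, so it can be used even when the extracted factors come out in a different order than the reference factors.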

Using your own embeddings

Pass any embedding model’s output via embeddings=, or supply your own embedding function via embed=:

# With sentence-transformers (requires reticulate + Python).
# The default model is "Qwen/Qwen3-Embedding-0.6B"; larger models such as
# "Qwen/Qwen3-Embedding-4B" (8 GB RAM) or "Qwen/Qwen3-Embedding-8B" (16 GB RAM)
# recover factor structure more accurately.
emb <- sfa_embed(my_items, embed = "sbert", model = "Qwen/Qwen3-Embedding-0.6B")
fit <- sfa(my_items, embeddings = emb, scoring = my_scoring)

# Or bring your own function
my_embedder <- function(texts) {
  # ... your embedding logic ...
  # must return a numeric matrix (n_items x dim)
}
fit <- sfa(my_items, embed = my_embedder, scoring = my_scoring)
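
To show the shape such a function needs to return, here is a deliberately crude toy embedder that hashes character trigrams into a fixed number of buckets. It is a placeholder for testing the plumbing, not a meaningful language model; toy_embedder, its dim argument, and the hashing scheme are all hypothetical.

```r
# Toy embedder: counts hashed character trigrams per item.
# Returns a numeric matrix with one row per text and `dim` columns.
toy_embedder <- function(texts, dim = 64) {
  emb <- matrix(0, nrow = length(texts), ncol = dim)
  for (i in seq_along(texts)) {
    chars <- utf8ToInt(tolower(texts[i]))
    if (length(chars) < 3) next                 # too short for any trigram
    for (j in seq_len(length(chars) - 2)) {
      # Map the trigram starting at position j to a bucket in 1..dim.
      bucket <- (chars[j] * 31^2 + chars[j + 1] * 31 + chars[j + 2]) %% dim + 1
      emb[i, bucket] <- emb[i, bucket] + 1
    }
  }
  emb
}

example_items <- c("I am the life of the party.", "I don't talk a lot.")
emb_toy <- toy_embedder(example_items)
stopifnot(is.numeric(emb_toy), nrow(emb_toy) == length(example_items))
```

A function like this could be passed as embed = toy_embedder to check that a pipeline runs end to end before switching to a real model.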
