Vishal Singh
Text as Data · Teaching Case

How to read this case — four generations of method
  1. Count. Tokenize, score with a public lexicon. Transparent, free, context-blind.
  2. Engineer & classify. Pull features from the text; see if they separate the groups.
  3. Discover. Let a topic model surface clusters you still have to name.
  4. Measure with an LLM. Score constructs named in advance. Convenient, but it costs money and drifts.
01

Sentiment, counted

lexicon

Sentiment over time

VADER compound, averaged per period, by group

Average tone by group

Mean compound · share clearly positive vs negative
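
The two sentiment panels come down to simple aggregation once each unit has a compound score. The sketch below is illustrative only: the `rows` records are made-up data, and the ±0.05 cutoffs for "clearly" positive or negative are an assumption (a common convention for VADER compound scores), not something this case states.

```python
from collections import defaultdict

# Hypothetical per-unit records: (group, period, VADER compound in [-1, 1]).
rows = [
    ("A", "2024-01", 0.62),
    ("A", "2024-01", -0.10),
    ("A", "2024-02", 0.02),
    ("B", "2024-01", -0.48),
    ("B", "2024-02", 0.30),
    ("B", "2024-02", -0.70),
]

POS, NEG = 0.05, -0.05  # assumed cutoffs for "clearly" positive / negative


def mean_by(rows, key):
    """Average the compound score within each bucket returned by key(row)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[key(row)].append(row[2])
    return {k: sum(v) / len(v) for k, v in buckets.items()}


def tone_shares(rows):
    """Per group: share of units at or beyond the positive and negative cutoffs."""
    by_group = defaultdict(list)
    for group, _, score in rows:
        by_group[group].append(score)
    return {
        g: {
            "positive": sum(s >= POS for s in scores) / len(scores),
            "negative": sum(s <= NEG for s in scores) / len(scores),
        }
        for g, scores in by_group.items()
    }


over_time = mean_by(rows, key=lambda r: (r[0], r[1]))  # sentiment over time
by_group = mean_by(rows, key=lambda r: r[0])           # average tone by group
shares = tone_shares(rows)
```

Units that fall between the two cutoffs count toward neither share, so the positive and negative shares need not sum to one.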

02

Distinctive language

weighted log-odds

Fightin' words: terms most over-represented in each group after a Dirichlet prior shrinks rare-word noise. The group is encoded twice — by color, and by which side of the zero line a term sits on.
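
A minimal sketch of the weighted log-odds score, assuming the informative-prior variant in which each word's prior pseudo-count is proportional to its frequency in the pooled corpus. The function name, the `prior_mass` parameter, and the toy word counts are illustrative assumptions; the case does not say which prior or settings it uses.

```python
import math
from collections import Counter


def weighted_log_odds(counts_a, counts_b, prior_mass=10.0):
    """Return {word: z-score}; positive leans toward group A, negative toward B.

    counts_a, counts_b: Counter of word frequencies per group.
    prior_mass: total Dirichlet pseudo-counts (an assumed setting), spread
    over words in proportion to their pooled frequency.
    Assumes the pooled vocabulary has at least two distinct words.
    """
    background = counts_a + counts_b
    total = sum(background.values())
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())

    scores = {}
    for word, bg in background.items():
        alpha = prior_mass * bg / total  # prior pseudo-count for this word
        y_a, y_b = counts_a[word], counts_b[word]
        # Smoothed odds of the word within each group.
        odds_a = (y_a + alpha) / (n_a + prior_mass - y_a - alpha)
        odds_b = (y_b + alpha) / (n_b + prior_mass - y_b - alpha)
        delta = math.log(odds_a) - math.log(odds_b)
        # Approximate variance of the log-odds difference; large for rare words.
        variance = 1.0 / (y_a + alpha) + 1.0 / (y_b + alpha)
        scores[word] = delta / math.sqrt(variance)
    return scores


# Toy counts, for illustration only.
group_a = Counter("tax cut jobs jobs growth tax".split())
group_b = Counter("health care jobs care families health care".split())
scores = weighted_log_odds(group_a, group_b)
ranked = sorted(scores, key=scores.get)  # most B-leaning first, most A-leaning last
```

Dividing by the standard deviation is what shrinks rare words: a term seen once has a large variance, so its z-score stays near zero even when its raw log-odds looks extreme.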

03

Features pulled from the text

feature engineering

Every feature is a deterministic property of the raw text — length, punctuation, capitalization, links, sentiment. These exact columns feed the classifier below, so anything it learns stays auditable.
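
To make "deterministic property of the raw text" concrete, here is a small sketch of such an extractor. The feature names and the URL pattern are assumptions chosen for illustration, not the case's actual column definitions, and the sentiment feature is left out because it depends on an external lexicon.

```python
import re

# Assumed, deliberately simple link pattern.
URL_RE = re.compile(r"https?://\S+")


def extract_features(text):
    """Map one text unit to a dict of deterministic surface features."""
    letters = [ch for ch in text if ch.isalpha()]
    return {
        "n_chars": len(text),
        "n_words": len(text.split()),
        "n_exclaim": text.count("!"),
        "n_question": text.count("?"),
        # Share of alphabetic characters that are uppercase (0.0 if no letters).
        "upper_share": (
            sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
        ),
        "n_links": len(URL_RE.findall(text)),
    }


example = extract_features("WOW! See https://example.com now?")
```

Because nothing here is learned or sampled, the same text always yields the same row, which is what keeps the downstream classifier auditable.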

Feature fingerprint by group

Group means, each feature scaled to its own range · hover for raw values
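
One plausible reading of "scaled to its own range" is min-max scaling against the unit-level minimum and maximum of each feature, so that group means land in [0, 1] and features with different units can share one axis. The sketch below assumes that reading; the function name and the sample values are hypothetical.

```python
def fingerprint(units, groups):
    """Group means per feature, min-max scaled by the unit-level range.

    units: list of feature dicts with identical keys.
    groups: parallel list of group labels.
    """
    features = list(units[0])
    lo = {f: min(u[f] for u in units) for f in features}
    hi = {f: max(u[f] for u in units) for f in features}

    result = {}
    for g in sorted(set(groups)):
        members = [u for u, label in zip(units, groups) if label == g]
        result[g] = {}
        for f in features:
            mean = sum(u[f] for u in members) / len(members)
            span = hi[f] - lo[f]
            # A constant feature has no range; report 0.0 rather than divide by zero.
            result[g][f] = (mean - lo[f]) / span if span else 0.0
    return result


# Made-up feature rows for two groups.
units = [
    {"n_words": 10, "n_links": 0},
    {"n_words": 30, "n_links": 2},
    {"n_words": 20, "n_links": 0},
    {"n_words": 50, "n_links": 0},
]
groups = ["A", "A", "B", "B"]
scaled = fingerprint(units, groups)
```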

When are they posted?

Share of units by hour of day

04

Can features tell the groups apart?

logistic regression

Accuracy by representation

Held-out test accuracy vs majority baseline
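
The majority baseline is the accuracy obtained by always predicting the most common training label; a classifier is only informative to the extent it beats that number. A minimal sketch with hypothetical labels and predictions:

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of held-out units whose predicted label matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_train, y_test):
    """Accuracy from always predicting the most frequent training label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [majority] * len(y_test))


# Hypothetical labels and model predictions, for illustration only.
y_train = ["A", "A", "A", "B", "B"]
y_test = ["A", "B", "B", "A"]
y_pred = ["A", "B", "A", "A"]

model_acc = accuracy(y_test, y_pred)
baseline_acc = majority_baseline(y_train, y_test)
lift = model_acc - baseline_acc
```

With imbalanced groups the baseline can be high on its own, so the lift over it is more telling than raw accuracy.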

ROC curve

Confusion matrix

Held-out predictions, headline model

What the model keys on

Largest positive logistic coefficients toward each class

05

Topics, discovered

LDA

Latent Dirichlet Allocation clusters co-occurring words into topics. The model finds structure; a human still has to name it. The instability panel reruns LDA with a second random seed on the same corpus.

Topic prevalence by group

Average topic share within each group · darker = more present

Seed (in)stability

Same data, two seeds · similar themes, different word lists
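
One way to put a number on seed instability is to compare the top-word lists from the two runs. The sketch below pairs each topic from the first seed with its closest topic from the second by Jaccard overlap. This is an illustrative measure, not necessarily the one behind the panel, and the word lists are invented.

```python
def jaccard(a, b):
    """Overlap of two non-empty word lists: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def best_matches(topics_seed1, topics_seed2):
    """For each seed-1 topic, find the most similar seed-2 topic.

    Returns a list of (seed1_index, seed2_index, jaccard_score).
    """
    matches = []
    for i, words in enumerate(topics_seed1):
        j, score = max(
            ((j, jaccard(words, other)) for j, other in enumerate(topics_seed2)),
            key=lambda pair: pair[1],
        )
        matches.append((i, j, score))
    return matches


# Hypothetical top words per topic from two runs on the same corpus.
seed1 = [["tax", "jobs", "growth", "wage"], ["health", "care", "cost", "plan"]]
seed2 = [["care", "health", "insurance", "plan"], ["jobs", "tax", "economy", "growth"]]
matches = best_matches(seed1, seed2)
```

Topic indices are arbitrary across runs, which is why the comparison matches on content rather than on position. Scores well below 1 for the best match are the "similar themes, different word lists" pattern.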

06

LLM measurement

GABRIEL

Construct ratings by group

Mean 0–100 on constructs named before reading any output

Run cost

Dominant frame mix by group

Share of units whose first applicable frame label is each category
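
The frame mix can be read as a tally of each unit's first applicable label, normalized within group. A small sketch, assuming each unit carries an ordered list of frame labels and that units with no applicable frame are counted under a placeholder category `"none"` (both are assumptions about the data layout, and the frame names are invented):

```python
from collections import Counter, defaultdict


def frame_mix(units):
    """units: iterable of (group, ordered list of applicable frame labels).

    Returns {group: {frame: share of that group's units}}.
    """
    counts = defaultdict(Counter)
    for group, labels in units:
        dominant = labels[0] if labels else "none"  # first applicable frame
        counts[group][dominant] += 1
    return {
        g: {frame: n / sum(c.values()) for frame, n in c.items()}
        for g, c in counts.items()
    }


# Made-up units with hypothetical frame names.
units = [
    ("A", ["economic", "moral"]),
    ("A", ["moral"]),
    ("A", ["economic"]),
    ("A", []),
    ("B", ["security", "economic"]),
    ("B", ["security"]),
]
mix = frame_mix(units)
```

Taking only the first label means the result depends on the order in which labels are listed, so that ordering should be fixed before the run.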

07

Do the methods agree?

triangulation

Per-unit outputs lined up and correlated. Agreement is reassuring, not proof; disagreement is where the assumptions live.

Cross-method correlation

Pearson correlation across per-unit method outputs
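
For reference, the Pearson correlation between two per-unit score vectors $x$ and $y$ is $r = \sum_i (x_i - \bar{x})(y_i - \bar{y}) \,/\, \sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}$. A minimal sketch of the pairwise matrix follows; the method names and scores are placeholders, not outputs from this case.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def correlation_matrix(outputs):
    """outputs: {method_name: per-unit scores}, all aligned to the same units."""
    names = list(outputs)
    return {(a, b): pearson(outputs[a], outputs[b]) for a in names for b in names}


# Placeholder per-unit outputs from three methods, aligned by unit.
outputs = {
    "lexicon_compound": [0.6, -0.2, 0.1, -0.7, 0.3],
    "classifier_prob": [0.8, 0.4, 0.5, 0.1, 0.7],
    "llm_rating": [75, 40, 55, 10, 60],
}
corr = correlation_matrix(outputs)
```

Pearson correlation ignores differences in scale, so a 0 to 100 rating and a score in [-1, 1] can be compared directly, but it only captures linear agreement.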

Method scorecard