Our Methodology
Generative AI‑enabled Theme Organization and Structuring — a peer‑reviewed approach to qualitative analysis with full traceability.
Every insight traces back to source data. No black boxes.
Methodology Guarantees
The Difference
Without traceability
Black box outputs
AI generates "insights" with no way to verify if they came from your data or the model's imagination.
Undetectable hallucination
Plausible-sounding themes that aren't grounded in participant responses. No audit trail to catch them.
Unverifiable claims
"67% mentioned X" — but can you see which responses? Can a reviewer check your numbers?
Lost voices
Summarization strips nuance. Themes become abstractions disconnected from the people who expressed them.
With the GATOS workflow
Full evidence chain
Every theme traces through codes, clusters, and extracts back to specific participant utterances.
Hallucination prevention
Constrained code generation reads nearest-neighbor extracts first. The model cannot invent categories unsupported by data.
Audit-ready quantification
Every prevalence count links to the actual extracts. Reviewers can drill down to verify any finding.
Preserved participant voice
Extract-based workflow captures each idea in the participant's own framing. Nothing is lost to summarization.
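As a rough illustration of audit-ready quantification, the sketch below computes a prevalence figure and keeps the extract IDs behind it, so a reviewer can see which responses produced the number. The `CodedExtract` record, the ID scheme, and the `prevalence` helper are assumptions made for this example and are not taken from the GATOS implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedExtract:
    extract_id: str      # hypothetical identifier scheme
    participant_id: str
    code: str
    text: str


def prevalence(extracts, total_participants):
    """Per code: distinct participant count, share, and the extract IDs behind it."""
    by_code = defaultdict(list)
    for ex in extracts:
        by_code[ex.code].append(ex)

    report = {}
    for code, items in by_code.items():
        participants = {ex.participant_id for ex in items}
        report[code] = {
            "participants": len(participants),
            "share": len(participants) / total_participants,
            "extract_ids": [ex.extract_id for ex in items],
        }
    return report


# Illustrative sample data, not real study results.
sample = [
    CodedExtract("e1", "p1", "wait time", "Long wait time in emergency department"),
    CodedExtract("e2", "p1", "communication", "Lack of communication during wait"),
    CodedExtract("e3", "p2", "wait time", "ER wait was unreasonable"),
    CodedExtract("e4", "p3", "wait time", "Waited over two hours with no updates"),
]
report = prevalence(sample, total_participants=4)
print(report["wait time"])
```

Because the count is derived from the set of participant IDs and the extract IDs are returned alongside it, the percentage and its supporting evidence cannot drift apart.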
The Workflow
GATOS maintains a chain of custody at every stage. Each step feeds the next with full provenance intact.
Original participant feedback enters the pipeline — interviews, surveys, open-ended responses. Every source is tagged and preserved.
Each utterance is distilled into discrete information points, capturing a single idea in the participant's own framing. One response may yield multiple extracts.
Extracts are embedded into vector space and grouped using PCA, UMAP, and agglomerative clustering. Similar ideas from different participants converge naturally — no predefined categories.
Codes are generated through constrained nearest-neighbor retrieval. The model must read existing extracts before proposing new codes — preventing hallucination by design.
Codes are organized into themes with every connection preserved. Any theme can be traced back through codes → clusters → extracts → original utterances.
The Key Innovation
Every theme can be traced back through codes → clusters → extracts → original utterances. You can verify exactly which participant voices contributed to each insight.
Explore how insights connect back to source data
Key: Every theme traces back through codes, clusters, and extracts to specific participant utterances, so any claim can be checked against the source data.
Deep Dive
Raw participant utterances are distilled into discrete information points — each capturing a single idea in the participant's own framing.
Participant utterance
“I waited forever in the ER and nobody told me what was happening. The nurse was nice though.”
Extracted information points
“Long wait time in emergency department”
“Lack of communication during wait”
“Positive interaction with nursing staff”
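The example above could be represented roughly as follows. This is a minimal sketch, assuming the extraction model returns a list of strings; the `Utterance` and `Extract` records, the `attach_provenance` helper, and the ID format are hypothetical names introduced here. The point it shows is that each extract carries a link to its source utterance from the moment it is created.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    utterance_id: str
    participant_id: str
    text: str


@dataclass(frozen=True)
class Extract:
    extract_id: str
    utterance_id: str    # link back to the source utterance
    participant_id: str
    text: str


def attach_provenance(utterance, points):
    """Wrap each extracted information point with a link to its source utterance."""
    return [
        Extract(
            extract_id=f"{utterance.utterance_id}-x{i}",
            utterance_id=utterance.utterance_id,
            participant_id=utterance.participant_id,
            text=point,
        )
        for i, point in enumerate(points, start=1)
    ]


utterance = Utterance(
    "u1",
    "p1",
    "I waited forever in the ER and nobody told me what was happening. "
    "The nurse was nice though.",
)
# In practice these points would come from the extraction model; here they are
# hard-coded from the example above for illustration.
points = [
    "Long wait time in emergency department",
    "Lack of communication during wait",
    "Positive interaction with nursing staff",
]
extracts = attach_provenance(utterance, points)
for ex in extracts:
    print(ex.extract_id, "<-", ex.utterance_id, ":", ex.text)
```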
Extracts are embedded into vector space and clustered using PCA, UMAP, and agglomerative clustering. Similar ideas from different participants converge naturally.
Extracts from different participants converge
“Long wait time in emergency department”
“Waited over two hours with no updates”
“ER wait was unreasonable”
“Nobody communicated expected wait time”
Cluster: “Emergency wait times & communication”
47 extracts from 31 participants
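The grouping step can be sketched in a few lines. The code below covers only the agglomerative clustering stage, using average linkage over cosine distance in plain Python. The embedding, PCA, and UMAP stages are omitted, the 2-D vectors are made-up stand-ins for real embeddings, and the `threshold` value is an arbitrary assumption, so treat it as an illustration of the idea and not as the GATOS pipeline itself.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def agglomerative(vectors, threshold):
    """Average-linkage clustering: keep merging the two closest clusters
    until the closest pair is farther apart than `threshold`."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        total = sum(cosine_distance(vectors[i], vectors[j]) for i, j in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


texts = [
    "Long wait time in emergency department",
    "Waited over two hours with no updates",
    "ER wait was unreasonable",
    "Positive interaction with nursing staff",
]
# Toy vectors standing in for reduced embeddings (assumption for illustration).
vectors = [[0.9, 0.1], [0.8, 0.3], [0.95, 0.05], [0.1, 0.9]]

clusters = agglomerative(vectors, threshold=0.2)
for members in clusters:
    print([texts[i] for i in members])
```

No categories are supplied in advance: the three wait-time extracts end up together only because their vectors are close, while the nursing extract stays in its own cluster.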
Codes are generated through nearest-neighbor retrieval, ensuring new codes are grounded in existing patterns. The model cannot invent categories not supported by the data.
Nearest-neighbor extracts retrieved
“Long wait time in emergency department”
“Waited over two hours with no updates”
“Nobody communicated expected wait time”
Generated code: “Excessive wait time with inadequate communication”
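One way to picture constrained generation is as a retrieval step that decides what the model is allowed to see. The sketch below ranks extracts by cosine similarity to a query vector and builds a prompt containing only the top matches. The toy vectors, the `nearest_extracts` and `build_code_prompt` helpers, and the prompt wording are all assumptions for illustration; the source does not specify how the constraint is enforced.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest_extracts(query_vec, extracts, k):
    """Return the texts of the k extracts most similar to the query vector."""
    ranked = sorted(
        extracts,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def build_code_prompt(neighbors):
    """Assemble a prompt that exposes only the retrieved extracts (hypothetical wording)."""
    lines = [
        "Propose one code that summarizes ONLY the extracts below.",
        "Do not introduce ideas that are absent from them.",
        "",
    ]
    lines += [f"- {text}" for text in neighbors]
    return "\n".join(lines)


# (text, toy vector) pairs; the vectors are made up for illustration.
extracts = [
    ("Long wait time in emergency department", [0.9, 0.1]),
    ("Waited over two hours with no updates", [0.8, 0.3]),
    ("Nobody communicated expected wait time", [0.7, 0.4]),
    ("Positive interaction with nursing staff", [0.1, 0.9]),
]
neighbors = nearest_extracts([0.8, 0.3], extracts, k=3)
prompt = build_code_prompt(neighbors)
print(prompt)
```

The prompt would then be sent to a generative model, and the returned code stored together with the retrieved extracts so the link between code and evidence is kept.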
Codes are organized into themes, with every connection preserved. Ask about any theme and trace it back to the specific participant utterances that contributed to it.
Full evidence chain
Theme: “Communication Gaps During Care”
Code: “Uncertainty about wait status”
Cluster: 847 extracts about waiting + communication
Extracts: “No updates during wait”, “Didn't know if forgotten”
Participants: Patient 142, Patient 891, and 845 others
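A trace like the one above can be modeled as a series of lookups, one per level of the chain. The sketch below is a minimal illustration using plain dictionaries. The IDs, the table layout, the `trace` helper, and the pairing of each extract with a particular patient are assumptions made for the example; only the theme, code, extract texts, and patient labels come from the chain shown here.

```python
# Each mapping is one link in the chain: theme -> codes -> clusters -> extracts -> utterances.
theme_to_codes = {
    "Communication Gaps During Care": ["Uncertainty about wait status"],
}
code_to_clusters = {
    "Uncertainty about wait status": ["waiting + communication"],
}
cluster_to_extracts = {
    "waiting + communication": ["x1", "x2"],
}
extracts = {
    "x1": {"text": "No updates during wait", "utterance_id": "u142"},
    "x2": {"text": "Didn't know if forgotten", "utterance_id": "u891"},
}
utterances = {
    "u142": {"participant": "Patient 142"},
    "u891": {"participant": "Patient 891"},
}


def trace(theme):
    """Walk the chain from a theme down to the participants behind it."""
    rows = []
    for code in theme_to_codes[theme]:
        for cluster in code_to_clusters[code]:
            for extract_id in cluster_to_extracts[cluster]:
                extract = extracts[extract_id]
                participant = utterances[extract["utterance_id"]]["participant"]
                rows.append((theme, code, cluster, extract["text"], participant))
    return rows


rows = trace("Communication Gaps During Care")
for row in rows:
    print(" -> ".join(row))
```

A missing link at any level raises a `KeyError` in this sketch, which is one simple way to surface a theme that has no supporting evidence.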
Published Research
The GATOS methodology is documented in peer‑reviewed research.
GATOS: Generative AI-enabled Theme Organization and Structuring for Qualitative Research
Read paper →
SAGE · 2025
Leveraging Generative Text Models and NLP to Perform Traditional Thematic Data Analysis
Read paper →
SAGE · 2024
Using Generative Text Models to Create Qualitative Codebooks for Student Evaluations of Teaching
Read paper →
Get Started
We share anonymized portfolio examples during a consultation to show exactly how GATOS works on real data.