Our Methodology

The GATOS workflow

Generative AI‑enabled Theme Organization and Structuring — a peer‑reviewed approach to qualitative analysis with full traceability.

Every insight traces back to source data. No black boxes.


Methodology Guarantees

Traceability by design

Every theme maps back to specific participant evidence
Extract‑based workflow preserves nuance and context
Audit‑ready deliverables you can defend internally

The Difference

Why traceability changes everything

Without traceability

Black box outputs
AI generates "insights" with no way to verify if they came from your data or the model's imagination.

Undetectable hallucination
Plausible-sounding themes that aren't grounded in participant responses. No audit trail to catch them.

Unverifiable claims
"67% mentioned X", but can you see which responses? Can a reviewer check your numbers?

Lost voices
Summarization strips nuance. Themes become abstractions disconnected from the people who expressed them.

With the GATOS workflow

Full evidence chain
Every theme traces through codes, clusters, and extracts back to specific participant utterances.

Hallucination prevention
Constrained code generation reads nearest-neighbor extracts first. The model cannot invent categories unsupported by data.

Audit-ready quantification
Every prevalence count links to the actual extracts. Reviewers can drill down to verify any finding.

Preserved participant voice
Extract-based workflow captures each idea in the participant's own framing. Nothing is lost to summarization.
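To make the idea of audit-ready quantification concrete, here is a minimal sketch of a prevalence count that returns its supporting extracts alongside the percentage. The participant IDs, code labels, and the `prevalence` helper are hypothetical and chosen for illustration; they are not taken from the GATOS implementation.

```python
# Hypothetical coded extracts: (participant_id, extract_text, code).
coded_extracts = [
    ("P12", "Long wait time in emergency department", "wait_time"),
    ("P89", "Waited over two hours with no updates", "wait_time"),
    ("P89", "Nobody told me what was happening", "communication"),
    ("P204", "ER wait was unreasonable", "wait_time"),
]
# Hypothetical full participant list, including one with no coded extracts.
all_participants = {"P12", "P89", "P204", "P317"}


def prevalence(code):
    """Return the share of participants with `code`, plus the supporting extracts."""
    evidence = [(pid, text) for pid, text, c in coded_extracts if c == code]
    participants = {pid for pid, _ in evidence}
    return len(participants) / len(all_participants), evidence


share, evidence = prevalence("wait_time")
print(f"{share:.0%} of participants mentioned wait_time")
for pid, text in evidence:
    print(f"  {pid}: {text}")
```

Because the count and the evidence come from the same filtered list, a reviewer who doubts the percentage can read the exact extracts behind it.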

The Workflow

From raw data to traceable themes

GATOS maintains a chain of custody at every stage. Each step feeds the next with full provenance intact.

Raw Utterances

Original participant feedback enters the pipeline — interviews, surveys, open-ended responses. Every source is tagged and preserved.

Extract Creation

Each utterance is distilled into discrete information points, capturing a single idea in the participant's own framing. One response may yield multiple extracts.

Semantic Clustering

Extracts are embedded into vector space and grouped using PCA, UMAP, and agglomerative clustering. Similar ideas from different participants converge naturally — no predefined categories.

Codebook Development

Codes are generated through constrained nearest-neighbor retrieval. The model must read existing extracts before proposing new codes — preventing hallucination by design.

Theme Synthesis

Codes are organized into themes with every connection preserved. Any theme can be traced back through codes → clusters → extracts → original utterances.

The Key Innovation

Every theme can be traced back through codes → clusters → extracts → original utterances. You can verify exactly which participant voices contributed to each insight.

Interactive Traceability Chain

Explore how insights connect back to source data

Theme

Workflow Friction Drives Frustration

High-level pattern identified across multiple codes

Key: Every theme traces back through codes, clusters, and extracts to specific participant utterances—no hallucination possible.

Deep Dive

How each step works

Extract Creation

Raw participant utterances are distilled into discrete information points — each capturing a single idea in the participant's own framing.

Participant utterance

“I waited forever in the ER and nobody told me what was happening. The nurse was nice though.”

Extracted information points

“Long wait time in emergency department”

“Lack of communication during wait”

“Positive interaction with nursing staff”
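One way to picture the relationship between an utterance and its extracts is as two record types joined by an ID. The sketch below is an assumption about how such records might look, not the GATOS data model: the class names, field names, and ID scheme are hypothetical, and the information points are supplied by hand rather than distilled by a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    utterance_id: str
    participant_id: str
    text: str


@dataclass(frozen=True)
class Extract:
    extract_id: str
    utterance_id: str  # provenance link back to the source utterance
    text: str


def make_extracts(utterance, points):
    """Attach provenance to information points already distilled from one utterance."""
    return [
        Extract(f"{utterance.utterance_id}-e{i}", utterance.utterance_id, point)
        for i, point in enumerate(points, start=1)
    ]


utterance = Utterance(
    "u1",
    "P12",
    "I waited forever in the ER and nobody told me what was happening. "
    "The nurse was nice though.",
)
extracts = make_extracts(utterance, [
    "Long wait time in emergency department",
    "Lack of communication during wait",
    "Positive interaction with nursing staff",
])
for extract in extracts:
    print(extract.extract_id, "<-", extract.utterance_id, ":", extract.text)
```

Each extract carries the ID of the utterance it came from, so one response can yield several extracts without losing its source.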

Semantic Clustering

Extracts are embedded into vector space and clustered using PCA, UMAP, and agglomerative clustering. Similar ideas from different participants converge naturally.

Extracts from different participants converge

Patient 12

“Long wait time in emergency department”

Patient 89

“Waited over two hours with no updates”

Patient 204

“ER wait was unreasonable”

Patient 317

“Nobody communicated expected wait time”

Cluster #23

“Emergency wait times & communication”

47 extracts from 31 participants
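The grouping step can be illustrated with a small, standard-library-only sketch of agglomerative clustering. Several things here are assumptions made for illustration: the 2-D vectors are invented stand-ins for real embeddings, the PCA and UMAP reduction steps are omitted, and the average-linkage rule and distance threshold are choices the page does not specify. In practice, libraries such as scikit-learn and umap-learn provide these steps.

```python
import math
from itertools import combinations

# Toy 2-D "embeddings" (hypothetical values); real embeddings have many more dimensions.
embeddings = {
    "Long wait time in emergency department": (0.90, 0.10),
    "ER wait was unreasonable": (0.85, 0.15),
    "Waited over two hours with no updates": (0.80, 0.30),
    "Positive interaction with nursing staff": (0.10, 0.90),
    "The nurse was kind": (0.15, 0.85),
}


def average_linkage(a, b):
    """Mean Euclidean distance over all cross-cluster pairs."""
    total = sum(math.dist(embeddings[x], embeddings[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def agglomerate(items, threshold):
    """Repeatedly merge the two closest clusters until none are within `threshold`."""
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        # Find the closest pair of clusters under average linkage.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        if average_linkage(clusters[i], clusters[j]) > threshold:
            break
        merged = clusters.pop(j)  # j > i, so index i is unaffected by the pop
        clusters[i].extend(merged)
    return clusters


clusters = agglomerate(list(embeddings), threshold=0.5)
for number, members in enumerate(clusters, start=1):
    print(f"Cluster {number}: {members}")
```

No category names are supplied up front: the groups emerge from distances alone, which is the sense in which similar ideas converge without predefined categories.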

Constrained Codebook Development

Codes are generated through nearest-neighbor retrieval, ensuring new codes are grounded in existing patterns. The model cannot invent categories not supported by the data.

Nearest-neighbor extracts retrieved

“Long wait time in emergency department”

“Waited over two hours with no updates”

“Nobody communicated expected wait time”

Generated code

“Excessive wait time with inadequate communication”

Safeguards

Temperature = 0 to minimize run-to-run variation
Must read nearest neighbors first
Explicit anti-hallucination instructions

Quality criteria

Parsimony (minimal redundancy)
Consistent abstraction level
Non-overlapping categories
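The retrieval-then-generate pattern described above can be sketched as follows. This is a hedged illustration, not the GATOS prompt or pipeline: the 3-D vectors, the `nearest_extracts` and `build_code_prompt` helpers, the prompt wording, and the shape of the `request` dictionary are all hypothetical, and no model is actually called.

```python
import math

# Hypothetical extract embeddings (toy 3-D vectors).
extract_vectors = {
    "Long wait time in emergency department": (0.9, 0.1, 0.0),
    "Waited over two hours with no updates": (0.8, 0.3, 0.1),
    "Nobody communicated expected wait time": (0.6, 0.6, 0.0),
    "Positive interaction with nursing staff": (0.0, 0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def nearest_extracts(query_vector, k):
    """Return the k extracts most similar to the query vector."""
    ranked = sorted(
        extract_vectors,
        key=lambda text: cosine(query_vector, extract_vectors[text]),
        reverse=True,
    )
    return ranked[:k]


def build_code_prompt(neighbors):
    """Place the retrieved extracts ahead of the instruction so they are read first."""
    lines = "\n".join(f"- {text}" for text in neighbors)
    return (
        "Read the participant extracts below before answering.\n"
        f"{lines}\n"
        "Propose one code that is supported by these extracts. "
        "Do not introduce ideas that are absent from them."
    )


cluster_centroid = (0.8, 0.3, 0.0)  # hypothetical query point
neighbors = nearest_extracts(cluster_centroid, k=3)
request = {"prompt": build_code_prompt(neighbors), "temperature": 0}
print(request["prompt"])
```

The three safeguards map onto the sketch directly: the extracts are retrieved before any code is proposed, the instruction restricts the model to what those extracts support, and the temperature setting is passed with the request.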

Theme Synthesis with Full Traceability

Codes are organized into themes, with every connection preserved. Ask about any theme and trace it back to the specific participant utterances that contributed to it.

Full evidence chain

Theme

“Communication Gaps During Care”

Code

“Uncertainty about wait status”

Cluster

847 extracts about waiting + communication

Sample extracts

“No updates during wait”, “Didn't know if forgotten”

Source utterances

Patient 142, Patient 891, and 845 others
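The evidence chain above amounts to a set of lookups from each level to the one beneath it. The following minimal sketch assumes simple dictionary link tables; the IDs, table names, and the `trace` helper are hypothetical, and only two of the extracts from the example are included.

```python
# Hypothetical link tables; each level stores the IDs of the level beneath it.
themes = {"Communication Gaps During Care": ["Uncertainty about wait status"]}
codes = {"Uncertainty about wait status": ["cluster-waiting-communication"]}
clusters = {"cluster-waiting-communication": ["e1", "e2"]}
extracts = {
    "e1": {"text": "No updates during wait", "utterance": "u142"},
    "e2": {"text": "Didn't know if forgotten", "utterance": "u891"},
}
utterances = {"u142": "Patient 142", "u891": "Patient 891"}


def trace(theme):
    """Walk theme -> codes -> clusters -> extracts -> utterances and list every path."""
    rows = []
    for code in themes[theme]:
        for cluster in codes[code]:
            for extract_id in clusters[cluster]:
                extract = extracts[extract_id]
                source = utterances[extract["utterance"]]
                rows.append((code, cluster, extract["text"], source))
    return rows


chain = trace("Communication Gaps During Care")
for code, cluster, text, source in chain:
    print(f"{code} <- {cluster} <- {text!r} <- {source}")
```

Because every level keeps explicit links to the one below, a missing link surfaces as a lookup error instead of an untraceable theme.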

Published Research

Peer‑reviewed and validated

The GATOS methodology is documented in peer‑reviewed research.

Nature Portfolio · 2026

GATOS: Generative AI-enabled Theme Organization and Structuring for Qualitative Research

SAGE · 2025

Leveraging Generative Text Models and NLP to Perform Traditional Thematic Data Analysis

SAGE · 2024

Using Generative Text Models to Create Qualitative Codebooks for Student Evaluations of Teaching

Get Started

Ready to experience traceable analysis?

We share anonymized portfolio examples during a consultation to show exactly how GATOS works on real data.

