SYSTEM ARCHITECTURE

6-stage async pipeline.
8,761 in. 185 out.

Python + asyncio. GitHub REST API for sourcing/enrichment. Claude CLI (Haiku for triage, Sonnet for deep research). Resumable, rate-limit-aware, atomic writes throughout.

End-to-end flow

stage 1source.run() → data/raw/{github,hn,reddit,code}.jsonl8761
stage 2filter.run() → data/filtered/candidates.jsonl (priority-sorted)8300
stage 3enrich.run() → data/enriched/{id}.json (per-candidate)2324
stage 4classify.run() → data/classified/{id}.json (Haiku batches of 20)2321
stage 5research.run() → research/{id}.md (Sonnet, 1 per candidate)185
stage 6shortlist.run() → outputs/all_ranked.csv + DM drafts185
1
stage one

Source

8,761

Pull candidates from any place ComfyUI users gather publicly. Each source emits a Candidate record with an id, source-tag, and raw payload.

github_engagement.py

Stargazers, forkers, issuers, commenters, PR authors across 6 ComfyUI orgs.

comfy-org/ComfyUIcomfyanonymous/ComfyUIreplicate/cog-comfyuibentoml/comfy-pack
hn_threads.py

Algolia search for ComfyUI-related HN threads, then traverse comment trees.

"comfyui""comfy ui production""comfy deploy"
reddit_subs.py

Posts + comments from 5 subreddits where deployment questions live.

r/comfyuir/StableDiffusionr/aivideo
github_code_search.py

GitHub code search for code patterns that signal production deployment intent.

"comfyui stripe""comfyui clerk""comfyui fastapi auth"
2
stage two

Filter

8,300

Drop bots/ghosts (dependabot, renovate, [bot] suffixes, "ghost"). Sort by source priority so --max picks highest-signal first.

Source priority weights
SOURCE_PRIORITY = {
    "github_issuer":           10,  # filed issues = real friction
    "github_code_search":       8,  # ships ComfyUI + auth/payments code
    "github_forker":            6,  # forked = building on top
    "hn_thread":                5,  # commented in production threads
    "reddit:":                  4,
    "github_pr_author":         3,
    "github_stargazer":         1,  # weakest signal
}
3
stage three

Enrich

2,324

For each github:* candidate, fetch profile + top 5 repos + READMEs + git tree (for Dockerfile detection) + commit email domains + org memberships + linked blog. Reddit/HN candidates pass through with forum activity.

~16 GitHub API calls per candidate
  • · 1 profile fetch
  • · 1 orgs fetch
  • · 1 repos list (shared)
  • · 5x README fetch
  • · 5x git tree (replaces 2 HEAD checks per repo)
  • · 3x commits sample (commit email domains)
Resilience features
  • · asyncio.Queue worker pool (5 concurrent)
  • · tmp+rename atomic writes
  • · Skip if file exists (resumable)
  • · Rate-limit handler sleeps until reset epoch
  • · Exception caught per-candidate, never aborts pool
PRECOMPUTED FLAGS

During enrichment we compute cheap regex/heuristic signals so the LLM ranks rather than reasons:

has_dockerfile has_payment_keyword has_auth_keyword has_server_framework has_modified_comfy_fork has_company_email has_known_org_membership forum_production_phrases
4
stage four · CLAUDE HAIKU

Classify

2,321

Spawn Claude Code CLI with Haiku 4.5 in batches of 20 candidates. Prompt over stdin (auto-pipe when >60KB to avoid Linux ARG_MAX). Returns JSON with builder_type, production_score (0-10), activity_score (0-10), final_score, evidence, recommended_action.

CLI invocation
claude --model claude-haiku-4-5 \
  --print \
  <<< "$PROMPT_WITH_BATCH"
Builder types
  • founder_builder Building a paid product
  • in_house_builder Engineer at a company shipping ComfyUI
  • commercial_freelancer Selling ComfyUI work
  • active_hobbyist Building, but for fun
  • shopper Browsing — no implementation evidence
SCORE CLAMP (anti-hobbyist-inflation)
if builder_type == "active_hobbyist": production_score = min(production_score, 4)
if builder_type == "shopper":         production_score = min(production_score, 5)
final_score = ((production + activity) / 2) * type_multiplier * confidence_multiplier
type_multiplier = {founder: 1.5, in_house: 2.0, freelancer: 1.5, hobbyist: 0.15, shopper: 0.4}
5
stage five · CLAUDE SONNET

Research

185

Spawn Claude Code CLI with Sonnet 4.6 + WebFetch tool, one process per candidate. Sonnet writes a per-candidate dossier: TL;DR, evidence, ComfyUI activity, production signals, best outreach angle. May DOWNGRADE or UPGRADE the Haiku classification based on triggers.

DOWNGRADE TRIGGERS
  • (a) Personal-use signals
  • (b) Non-AI Patreon / fork w/o modifying
  • (c) "for fun" / "learning" language
  • (d) Personal email only · no company
  • (e) Shopper signals · no implementation
  • (f) Confirms hobbyist on web
UPGRADE TRIGGERS
  • (g) Company website with paid product
  • (h) LinkedIn confirms employer
  • (i) Public funding / press / launch
  • (j) Known production org membership
Output frontmatter
---
candidate_id: github:ltdrdata
builder_type: in_house_builder
confidence: high
production_score: 10
activity_score: 10
final_score: 20.0
recommended_action: priority_outreach
contactable_via: [github]
last_updated: 2026-05-05
---
6
stage six

Shortlist

185

Aggregate frontmatter into a ranked CSV (all_ranked.csv), top-50 markdown table, and per-candidate DM drafts in outputs/outreach_drafts/.

DM template generation

Each DM uses a specific hook from the candidate's research:

  • Most-recent issue/PR title (live work they're doing)
  • Top repo name + tagline
  • Org membership context
  • RunFlow value prop matched to their pain (auth, billing, scaling, model loading)

Tech

Python 3.12 + asyncio

httpx for HTTP. tenacity for retries. asyncio.Queue worker pools so we never materialize 4000+ coroutines at once.

Pydantic v2

Strict models for every stage. YAML frontmatter parsed back into models for Stage 6 aggregation.

Claude CLI

subprocess.exec with stdin pipe. Haiku 4.5 for triage, Sonnet 4.6 for research. All under one Claude Max subscription.

GitHub REST API

Rate-limit-aware: detect remaining=0 + reset epoch, sleep until reset (cap 90 min), retry. Saved 685 candidates in our last run.

Atomic writes

tmp + rename for every JSON/markdown output. Pipeline can crash and resume without corruption.

Cost meter

Optional max_usd budget per stage with CostBudgetExceeded halt. Subscription covered actual spend so always 0.