Two ways to build LLM apps in 2026 beyond linear LangChain chains: LangGraph, which models your app as a stateful graph with conditional edges and checkpointing; and DSPy, which models your app as declarative LM programs with optimizers that compile better prompts for you. They solve different problems. Most teams need one, neither, or — occasionally — both.
A chain runs A → B → C in order. Real LLM apps don't look like that: they loop (draft → critique → redraft), branch on intermediate results, retry on failure, and sometimes pause for a human to approve a step.
You can express all of this with if-statements and a while loop — and for simple cases you should. LangGraph helps when the state machine gets large enough that you need it to be inspectable, checkpointable, and resumable.
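To make that concrete: here is a critique loop in plain Python, no framework. `draft_once` and `critique` are deterministic stubs standing in for real LLM calls; swap in actual model calls and the structure is unchanged.

```python
# Plain-Python version of a draft/critique loop: no framework, just a loop.
# draft_once and critique are hypothetical stubs standing in for LLM calls.

def draft_once(question: str, feedback: str) -> str:
    # Stub: a real implementation would call an LLM with question + feedback.
    return f"answer to {question!r}" + (" (revised)" if feedback else "")

def critique(answer: str) -> str:
    # Stub: a real critic would return "PASS" or "FAIL: <reason>".
    return "PASS" if "revised" in answer else "FAIL: too terse"

def answer_with_critique(question: str, max_iters: int = 3) -> str:
    feedback = ""
    for _ in range(max_iters):
        draft = draft_once(question, feedback)
        verdict = critique(draft)
        if verdict.startswith("PASS"):
            return draft
        feedback = verdict
    return draft  # give up after max_iters and return the last draft
```

This is entirely adequate until you need to pause mid-loop, persist state, or resume after a crash, which is where the graph abstraction starts paying for itself.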
LangGraph is built around three concepts:

- State: a shared TypedDict (or Pydantic model). Each key can have a reducer that controls how updates merge (e.g., add_messages appends instead of overwriting).
- Nodes: plain functions with signature (state) -> partial_state. They can call LLMs, tools, anything; whatever dict they return is merged into the state.
- Edges: fixed transitions between nodes, plus conditional edges that route based on the current state.
pip install langgraph langchain-anthropic

from typing import TypedDict, Literal
from langgraph.graph import StateGraph, START, END
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-opus-4-7")

class State(TypedDict):
    question: str
    draft: str
    feedback: str
    iterations: int
    final: str

def draft_node(state: State) -> dict:
    prompt = (f"Question: {state['question']}\n"
              f"Feedback so far: {state.get('feedback', 'none')}\n"
              "Write an answer.")
    return {"draft": llm.invoke(prompt).content,
            "iterations": state.get("iterations", 0) + 1}

def critique_node(state: State) -> dict:
    prompt = (f"Question: {state['question']}\nAnswer: {state['draft']}\n"
              "Reply with PASS if the answer is accurate and complete; otherwise reply "
              "with FAIL: <reason>.")
    verdict = llm.invoke(prompt).content
    if verdict.startswith("PASS"):
        return {"final": state["draft"]}
    return {"feedback": verdict}

def route(state: State) -> Literal["draft", "done"]:
    if state.get("final"):
        return "done"
    if state.get("iterations", 0) >= 3:
        return "done"
    return "draft"

graph = StateGraph(State)
graph.add_node("draft", draft_node)
graph.add_node("critique", critique_node)
graph.add_edge(START, "draft")
graph.add_edge("draft", "critique")
graph.add_conditional_edges("critique", route, {"draft": "draft", "done": END})

app = graph.compile()
result = app.invoke({"question": "Explain CAP theorem in 4 sentences."})
print(result.get("final") or result["draft"])  # "final" may be unset if the loop hit the cap
Compile the graph with a checkpointer and the entire state at every step is persisted by thread_id. You get three things for free:

- Durability: a long-running graph survives process restarts and resumes from its last checkpoint.
- Time travel: replay or fork execution from any prior checkpoint when debugging.
- Human-in-the-loop: interrupt pauses the graph; an external system inspects state, supplies a value (e.g., approval), and resumes.
from langgraph.checkpoint.sqlite import SqliteSaver
from langgraph.types import interrupt, Command

# Note: in recent langgraph versions from_conn_string is a context manager
# (use `with SqliteSaver.from_conn_string(...) as memory:`).
memory = SqliteSaver.from_conn_string("checkpoints.db")

def approval_node(state: State) -> dict:
    decision = interrupt({"prompt": "Approve sending this email?", "draft": state["draft"]})
    if decision != "yes":
        return {"final": "[user declined]"}
    return {"final": state["draft"]}

graph.add_node("approval", approval_node)
# Assumes a rebuilt graph: the critique's "done" branch now routes here, not to END.
graph.add_edge("critique", "approval")
graph.add_edge("approval", END)

app = graph.compile(checkpointer=memory)
config = {"configurable": {"thread_id": "user-42"}}
state = app.invoke({"question": "Draft a follow-up email about invoice A-482."}, config=config)
# graph paused at the interrupt; later, after the human approves:
state = app.invoke(Command(resume="yes"), config=config)
DSPy (Khattab et al., Stanford) takes a different angle: instead of writing prompts, you declare what your program does (input/output types), and DSPy generates and improves the prompt automatically. The unit of programming is the signature; the runtime unit is the module (Predict, ChainOfThought, ReAct, ...).
pip install dspy-ai

import dspy

dspy.settings.configure(lm=dspy.LM("anthropic/claude-opus-4-7", max_tokens=1024))

class AnswerWithCitations(dspy.Signature):
    """Answer the question using only the provided context. Cite chunks by index."""
    context: list[str] = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="answer including [1], [2] citation markers")

answerer = dspy.ChainOfThought(AnswerWithCitations)
result = answerer(
    context=["[1] Effective 2026, primary caregivers receive 16 weeks paid leave.",
             "[2] Secondary caregivers receive 8 weeks paid leave.",
             "[3] Pet bereavement leave is 3 days."],
    question="How long is paid parental leave for primary caregivers?",
)
print(result.answer)
Notice you wrote no prompt. The signature plus DSPy's prompt template is the prompt — and it can be re-templated and re-optimized as the LM changes.
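Roughly what happens under the hood, as a toy sketch: the signature's docstring becomes the instructions and the declared fields become labeled slots. This is an illustration of the idea, not DSPy's actual template.

```python
# Toy rendering of a signature into a prompt: instructions from the docstring,
# labeled fields from the declared inputs/outputs. Not DSPy's real template.

def render_prompt(instructions: str, inputs: dict[str, str], output_name: str) -> str:
    lines = [instructions, ""]
    for name, value in inputs.items():
        lines.append(f"{name.capitalize()}: {value}")
    lines.append(f"{output_name.capitalize()}:")  # the LM completes this slot
    return "\n".join(lines)

prompt = render_prompt(
    "Answer the question using only the provided context. Cite chunks by index.",
    {"context": "[1] ... [2] ...", "question": "How long is paid parental leave?"},
    "answer",
)
```

Because the prompt is generated from the declaration, changing the template (or letting an optimizer rewrite it) requires no changes to calling code.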
DSPy's superpower is its optimizers (formerly "teleprompters"). They take your DSPy program, a metric, and a small training set, and produce a better version of the program — typically by mining few-shot examples from runs that scored well, and/or by rewriting instructions.
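What "mining few-shot examples from runs that scored well" means, in miniature: run the program over the training set, keep the traces the metric scores as perfect, and reuse them as demos. This is a conceptual sketch, not DSPy's implementation; `run_program` and the metric are deterministic stand-ins.

```python
# Toy bootstrap: keep only the (input, output) pairs the metric scores 1.0
# and use them as few-shot demos going forward.

def run_program(question: str) -> str:
    # Stub for an LLM-backed program; deterministic here for illustration.
    return question.upper()

def metric(example: dict, prediction: str) -> float:
    return float(prediction == example["answer"])

trainset = [
    {"question": "paris", "answer": "PARIS"},
    {"question": "rome", "answer": "Rome"},    # the program will miss this one
    {"question": "tokyo", "answer": "TOKYO"},
]

def bootstrap_demos(trainset: list[dict], max_demos: int = 4) -> list[dict]:
    demos = []
    for ex in trainset:
        pred = run_program(ex["question"])
        if metric(ex, pred) == 1.0:            # keep only runs that scored well
            demos.append({"question": ex["question"], "answer": pred})
        if len(demos) == max_demos:
            break
    return demos
```

Real optimizers add search over instructions and demo combinations on top of this, but the scored-trace harvesting is the core move.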
import dspy
from dspy.teleprompt import MIPROv2

# Training set: list of dspy.Example with the same fields as the signature.
trainset = [
    dspy.Example(context=[...], question="...", answer="...").with_inputs("context", "question"),
    # ...30-200 examples
]

def citation_match(example, pred, trace=None) -> float:
    # Custom metric: 1.0 if the predicted answer contains every cited chunk index
    # that the ground-truth answer contains. extract_citations is your own helper.
    return float(set(extract_citations(pred.answer)) >= set(extract_citations(example.answer)))

optimizer = MIPROv2(metric=citation_match, auto="medium")  # auto sets the trial budget
compiled = optimizer.compile(student=answerer, trainset=trainset)

compiled.save("answerer_compiled.json")  # ship this with your app
The output of compile is the same module with new internal state — better instructions, better demos. You call it identically to the original.
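The metric above leans on an `extract_citations` helper that the snippet leaves undefined. One plausible regex-based sketch (an assumption for illustration, not part of any library):

```python
import re

def extract_citations(text: str) -> set[int]:
    # Pull every [N] citation marker out of an answer, e.g. "see [1] and [2]".
    return {int(m) for m in re.findall(r"\[(\d+)\]", text)}
```

Returning a set makes the superset check in the metric order-insensitive and robust to repeated citations.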
Task: a small RAG agent that retrieves, drafts an answer, and self-critiques once before returning.
LangGraph version — orchestration is explicit; you control every transition.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-opus-4-7")

class State(TypedDict):
    question: str
    contexts: list[str]
    draft: str
    final: str

# my_retriever, make_prompt, critique_prompt, revise are your own helpers.
def retrieve(state): return {"contexts": my_retriever(state["question"])}
def draft(state):    return {"draft": llm.invoke(make_prompt(state)).content}
def critique(state):
    verdict = llm.invoke(critique_prompt(state)).content
    return {"final": state["draft"]} if verdict.startswith("PASS") else {"draft": revise(state, verdict)}

g = StateGraph(State)
for name, fn in [("retrieve", retrieve), ("draft", draft), ("critique", critique)]:
    g.add_node(name, fn)
g.add_edge(START, "retrieve"); g.add_edge("retrieve", "draft")
g.add_edge("draft", "critique"); g.add_edge("critique", END)
app = g.compile()
DSPy version — orchestration is just Python; the prompts are declared and optimizable.
import dspy

dspy.settings.configure(lm=dspy.LM("anthropic/claude-opus-4-7"))

class Answer(dspy.Signature):
    """Answer the question using only the provided context."""
    context: list[str] = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

class Critique(dspy.Signature):
    """Reply PASS if the answer is accurate and grounded, else FAIL: <reason>."""
    context: list[str] = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.InputField()
    verdict: str = dspy.OutputField()

class RagWithCritique(dspy.Module):
    def __init__(self, retrieve):
        super().__init__()
        self.retrieve = retrieve
        self.answerer = dspy.ChainOfThought(Answer)
        self.critic = dspy.Predict(Critique)

    def forward(self, question: str):
        ctx = self.retrieve(question)
        ans = self.answerer(context=ctx, question=question).answer
        verdict = self.critic(context=ctx, question=question, answer=ans).verdict
        if verdict.startswith("FAIL"):
            ans = self.answerer(context=ctx, question=f"{question}\nFix: {verdict}").answer
        return dspy.Prediction(answer=ans)

program = RagWithCritique(retrieve=my_retriever)
print(program(question="...").answer)
The DSPy version is shorter and the prompts can be optimized end-to-end. The LangGraph version is more transparent about state transitions and gives you checkpointing for free. Choose the one whose pain you'd rather have.
Choose LangGraph when:

- Your control flow has cycles, conditional branches, or human-in-the-loop pauses that a linear chain can't express.
- You need durable state: checkpointing, resumability after crashes, time-travel debugging.

Choose DSPy when:

- You maintain prompts across multiple models and re-tune them by hand on every migration.
- You have (or are willing to build) an eval set and a metric worth optimizing against.

Choose neither when:

- The workflow is one Bedrock Converse call or one Anthropic messages.create. Adding a framework for one call is overhead with no upside.
- Your team has no eval set yet, or latency control matters more than orchestration convenience.

Common pattern in production: a single Bedrock or Anthropic call for 80% of requests, LangGraph for the 20% that need a real agent loop, and DSPy in an offline pipeline that compiles the prompts both of those use. None of these is mutually exclusive.
LangGraph wins when your control flow has cycles, conditional branches, or human-in-the-loop checkpoints — anything a DAG-shaped LCEL chain can't express. Concrete examples: an agent that loops "plan → act → reflect" until done, a RAG pipeline that re-queries when faithfulness is low, a multi-step approval workflow that pauses for human input mid-flight. For pure linear pipelines (load → chunk → embed → store) LangGraph is overkill; a function with three calls is clearer. The graph abstraction earns its weight when you're drawing arrows on a whiteboard, not boxes.
DSPy compiles a program (defined as Modules with typed Signatures) by running an optimizer that searches over few-shot examples and prompt phrasings to maximize a metric you provide. The "compiled" artifact is a set of optimized prompts — nothing magical, just text — bound to each Module. The optimizer (BootstrapFewShot, MIPRO, COPRO) treats prompt engineering as a learning problem: given a small training set and a metric (exact-match, F1, RAGAS faithfulness), it iteratively proposes and evaluates prompts. The point is that your code stays in terms of "what" the program should do, and the optimizer fills in the "how" each time you swap models.
LangGraph persists graph state to a checkpointer (in-memory, SQLite, Postgres, Redis) at every node transition. State is keyed by thread_id, so a single conversation across many user turns reads/writes the same checkpoint chain. Two concrete uses: long-running agents survive process restarts because state is durable; human-in-the-loop pauses by interrupting before a node, persisting state, and resuming when the human approves. Postgres is the production default — you get time-travel debugging (replay from any prior checkpoint) and multi-pod horizontal scaling for free.
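The checkpointing model in miniature, framework-free: every transition appends a snapshot keyed by thread_id, and resuming or time-traveling is just reading a snapshot back. A toy sketch of the idea, not LangGraph's actual storage format:

```python
import copy
from collections import defaultdict

# Append-only checkpoint log per thread_id.
checkpoints: dict[str, list[dict]] = defaultdict(list)

def save(thread_id: str, state: dict) -> None:
    checkpoints[thread_id].append(copy.deepcopy(state))  # snapshot, not a reference

def latest(thread_id: str) -> dict:
    # Resuming a thread means loading its most recent snapshot.
    return checkpoints[thread_id][-1]

def time_travel(thread_id: str, step: int) -> dict:
    # Replay from any earlier checkpoint when debugging.
    return checkpoints[thread_id][step]

save("user-42", {"iterations": 1, "draft": "v1"})
save("user-42", {"iterations": 2, "draft": "v2"})
```

Swap the dict for a SQLite or Postgres table keyed by (thread_id, step) and you have the durable, multi-pod version.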
YAML prompts work fine until you swap models — the prompt that was tuned for GPT-4 underperforms on Claude or Llama, and you re-tune by hand for every migration. DSPy decouples the program structure from the prompt strings: you re-run the optimizer against the new model and the metric tells you when you've matched the prior baseline. It also forces you to write an evaluation metric, which you should be doing anyway. The investment pays off when you have multiple models in rotation, frequent prompt regressions, or a real eval set you trust.
Skip both when: the workflow is a single API call with retries (three lines of Python beat a graph definition); your team doesn't yet have an eval set, so DSPy has nothing to optimize against; or extreme latency control matters and the framework call stack is in the way of your profiler. The dirty secret of both is that they shine in demos and complicate debugging in production.
Three tools. First, LangSmith tracing — every node invocation, state mutation, and LLM call is logged with input/output, and you can replay from any step. Second, the checkpointer itself: dump the persisted state for a thread_id and inspect what the graph thought the world looked like when it made its bad decision. Third, run with a synchronous in-memory checkpointer in a notebook and step through node-by-node. The common failure mode is shared-state mutation: two parallel branches writing the same key in the state dict and the merge reducer picking the wrong one. Make every state field a Pydantic model with explicit reducers.
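The shared-state failure mode is easy to reproduce without LangGraph. In this toy sketch, two branches each return a partial update for the same key: a naive dict.update keeps whichever merged last, while an explicit reducer makes the merge deliberate (`merge` is a hypothetical helper, not LangGraph API):

```python
# Two parallel branches both write "notes"; naive merging silently drops one.
branch_a = {"notes": ["checked inventory"]}
branch_b = {"notes": ["emailed supplier"]}

naive: dict = {}
naive.update(branch_a)
naive.update(branch_b)   # last write wins: branch_a's note is gone

# With an explicit per-key reducer, both updates survive:
def merge(state: dict, update: dict, reducers: dict) -> dict:
    out = dict(state)
    for k, v in update.items():
        out[k] = reducers[k](out.get(k), v) if k in reducers else v
    return out

reducers = {"notes": lambda old, new: (old or []) + new}
safe = merge(merge({}, branch_a, reducers), branch_b, reducers)
```

This is exactly the bug that explicit reducers on state fields are meant to prevent: the merge policy becomes part of the schema instead of an accident of execution order.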