February 21, 2026

Milton is an experiment in training a language model entirely on — you guessed it — John Milton’s 80,000-word epic Paradise Lost. First published in 1667 and rearranged into twelve books in 1674, Paradise Lost spans every facet of the human condition across 10,565 lines. By slight coincidence, I happen to collect early printings of Paradise Lost; I think my oldest copy is from the early 1700s. Regardless, Milton is now a 35-million-parameter SLM that lives on my laptop.

Milton terminal chat
(Mostly) AI summarization: to avoid being disingenuous, human-generated text is separated from AI-generated text.

35M parameter transformer, custom BPE tokenizer, trained from scratch on a single text. No pre-training, no fine-tuning, no instruction data. Milton’s entire knowledge of language comes from one poem.

Model

Milton
Parameters: 35.6M
Architecture: Decoder-only transformer
Layers: 8
Embedding dim: 512
Attention heads: 8
FFN: SwiGLU, 2048 hidden
Positional encoding: RoPE
Context window: 512 tokens
Vocabulary: 4,096 BPE tokens trained on the text
Training data: Paradise Lost, Books I–XII (124,831 tokens)

How it works

Paradise Lost is downloaded from Project Gutenberg and cleaned. A byte-pair encoding tokenizer is trained directly on the text — every token in the 4,096-entry vocabulary is a fragment of Milton’s language. No capacity wasted on tokens he never uses.
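
The merge loop at the heart of BPE is small enough to sketch in full. This is not the project's tokenizer.py, just the textbook algorithm it implements: count adjacent symbol pairs, merge the most frequent, and repeat until the vocabulary (initial characters plus merges) is full.

```python
from collections import Counter

def train_bpe(text, vocab_size):
    """Learn BPE merges over whitespace-split words (a simplification;
    real tokenizers also handle punctuation and bytes)."""
    # Each word becomes a tuple of symbols, weighted by its frequency.
    words = Counter(tuple(w) for w in text.split())
    n_symbols = len({c for w in words for c in w})
    merges = []
    while n_symbols + len(merges) < vocab_size:
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_sym = best[0] + best[1]
        # Rewrite every word with the new merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged_sym)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges
```

Trained on the poem, every merge the loop learns is a fragment of Milton's actual spelling habits, which is exactly why none of the 4,096 slots go to waste.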

The model is a standard decoder-only transformer: 8 layers, 512 embedding dim, 8 attention heads, SwiGLU feed-forward networks, rotary positional embeddings, RMSNorm. Weight tying between the embedding and output layers. 35.6M parameters total.
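
The 35.6M figure is easy to verify from the stated dimensions. A rough accounting, assuming the usual four dim×dim attention projections per layer and ignoring the small RMSNorm gains (a few thousand parameters):

```python
# Parameter accounting from the stated config; no biases anywhere,
# and the lm_head is tied to the embedding so it is counted once.
dim, n_layers, ff_dim, vocab = 512, 8, 2048, 4096

emb = vocab * dim          # token embedding (shared with output head)
attn = 4 * dim * dim       # Q, K, V, and output projections per layer
ffn = 3 * dim * ff_dim     # SwiGLU needs three matrices: w1, w2, w3
total = emb + n_layers * (attn + ffn)
print(total)               # 35,651,584 — the reported 35.6M, give or take norms
```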

Architecture

The entire model in ~60 lines. Weight tying between the embedding and output projection means the model’s first and last act are the same matrix — Milton reads and writes with the same hand.

import torch.nn as nn

# RoPE and TransformerBlock are defined elsewhere in model.py.
class Milton(nn.Module):
    """A language model that knows only Paradise Lost."""

    def __init__(
        self,
        vocab_size: int = 4096,
        dim: int = 512,
        n_layers: int = 8,
        n_heads: int = 8,
        ff_dim: int = 2048,
        max_seq_len: int = 512,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.dim = dim
        self.max_seq_len = max_seq_len

        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.rope = RoPE(dim // n_heads, max_seq_len)
        self.drop = nn.Dropout(dropout)

        self.layers = nn.ModuleList(
            [TransformerBlock(dim, n_heads, ff_dim, dropout) for _ in range(n_layers)]
        )
        self.norm = nn.RMSNorm(dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)

        # Weight tying
        self.lm_head.weight = self.tok_emb.weight

The feed-forward layers use SwiGLU — a gated activation that tends to learn sharper, more expressive features than ReLU. Three projections, one line of gating:

import torch.nn.functional as F
from torch import nn

class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout=0.1):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.dropout(self.w2(F.silu(self.w1(x)) * self.w3(x)))

First training

The first training run used causal language modeling on overlapping 512-token windows of the poem, with 15% of sequences reformatted as chat turns using random spans as user prompts. 200 epochs with AdamW, cosine learning rate scheduling, and 0.15 dropout. 75 minutes on an M3 Max (36GB, 30 GPU cores) using Apple’s MPS backend.
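
The windowing step is a one-liner. The stride of 256 (50% overlap) here is an assumption; the post only says the windows overlap:

```python
def make_windows(token_ids, seq_len=512, stride=256):
    """Slice the full token stream into overlapping fixed-length
    training windows. stride < seq_len produces the overlap."""
    return [
        token_ids[i : i + seq_len]
        for i in range(0, len(token_ids) - seq_len + 1, stride)
    ]
```

With 124,831 tokens and this stride, that yields a few hundred windows per epoch before any chat reformatting.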

The loss dropped to 0.003. Milton had memorized Paradise Lost nearly verbatim. But he didn’t respond to what you said — ask about darkness and he’d produce a passage about chariot wheels. The chat training used random spans as “user prompts,” so there was no learned association between your input and topically relevant output. The model just generated from wherever your tokens happened to push the hidden state.

Retraining

Three changes fixed this:

Retrieval-seeded generation. At inference, a TF-IDF retriever indexes Paradise Lost into ~350 passages and finds the most relevant one for your input. The retrieved passage seeds Milton’s generation so he starts from relevant text instead of from nothing. This is the single biggest improvement — Milton speaks about what you asked because his generation begins at the right place in the poem.
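
A minimal version of such a retriever fits in a screenful. This is a stdlib-only sketch of the idea, not the project's retriever.py (which may use a library implementation and different scoring):

```python
import math
from collections import Counter

class PassageRetriever:
    """TF-IDF retrieval over pre-split passages (the post says ~350)."""

    def __init__(self, passages):
        self.passages = passages
        docs = [Counter(p.lower().split()) for p in passages]
        n = len(passages)
        df = Counter(w for d in docs for w in d)
        self.idf = {w: math.log(n / df[w]) for w in df}
        # One sparse TF-IDF vector (a dict) per passage.
        self.vecs = [{w: tf * self.idf[w] for w, tf in d.items()} for d in docs]

    def best(self, query):
        q = Counter(query.lower().split())
        qv = {w: tf * self.idf.get(w, 0.0) for w, tf in q.items()}
        scores = []
        for vec in self.vecs:
            dot = sum(qv[w] * vec.get(w, 0.0) for w in qv)
            norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
            scores.append(dot / norm)
        return self.passages[scores.index(max(scores))]
```

The winning passage is fed to the model as the start of its context, so generation continues from Milton's own words on the topic.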

Topically aligned training data. The random-span chat sequences were replaced with topically aligned ones. For each passage, the most distinctive words (by TF-IDF) are extracted and used as the user prompt, with the passage as Milton’s response. This teaches the model that when it sees keywords like “darkness chaos void,” it should produce text about darkness, chaos, and the void. Chat fraction increased from 15% to 30%.
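
The pairing scheme can be sketched as follows. keyword_prompt is a hypothetical name, and the word-splitting and scoring details are assumptions about what the training script does:

```python
import math
from collections import Counter

def keyword_prompt(passages, k=3):
    """For each passage, rank its words by tf * idf and use the top k
    as the synthetic 'user prompt', with the passage as the response."""
    docs = [Counter(p.lower().split()) for p in passages]
    n = len(passages)
    df = Counter(w for d in docs for w in d)
    pairs = []
    for passage, d in zip(passages, docs):
        ranked = sorted(d, key=lambda w: d[w] * math.log(n / df[w]), reverse=True)
        pairs.append((" ".join(ranked[:k]), passage))
    return pairs
```

Because the prompt is built from the passage's own most distinctive words, the model sees a consistent keyword-to-topic mapping instead of the random spans of the first run.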

More regularization. Dropout increased from 0.15 to 0.2. Word-level dropout added — 5% of input tokens are randomly replaced during training, forcing the model to rely on context rather than memorizing exact sequences. Epochs reduced from 200 to 120, bringing retraining down to ~55 minutes. The final loss settled at 0.128: less faithful to exact quotes, but better at recombining Milton’s language in contextually appropriate ways.
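
Word-level dropout is only a few lines. Replacing dropped tokens with random vocabulary ids is an assumption here; a dedicated mask token would serve the same purpose:

```python
import random

def word_dropout(tokens, vocab_size, p=0.05, rng=None):
    """Replace a fraction p of input token ids with random ids, forcing
    the model to lean on context rather than exact memorized sequences."""
    rng = rng or random.Random()
    return [rng.randrange(vocab_size) if rng.random() < p else t for t in tokens]
```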

Talking to Milton

Milton doesn’t answer questions. He doesn’t follow instructions. He responds to whatever you say with the only language he knows. Ask him something he has no frame for and he responds with whatever attractor in his weight space your words land nearest to.

“what is love”

Raphael’s speech to Adam on the nature of love. The retriever found the passage; Milton continued from there.

What higher in her society thou findest / Attractive, human, rational, love still; / In loving thou dost well, in passion not, / Wherein true love consists not: Love refines / The thoughts, and heart enlarges; hath his seat / In reason, and is judicious; is the scale / By which to heavenly love thou mayest ascend, / Not sunk in carnal pleasure; for which cause, / Among the beasts no mate for thee was found.

Milton, on love

“tell me about hell”

Satan’s soliloquy on Mount Niphates. He asks where he can flee from infinite wrath and arrives at the answer himself.

Me miserable! which way shall I fly / Infinite wrath, and infinite despair? / Which way I fly is Hell; myself am Hell; / And, in the lowest deep, a lower deep / Still threatening to devour me opens wide, / To which the Hell I suffer seems a Heaven. / O, then, at last relent: Is there no place / Left for repentance, none

Milton, on hell

“what is light”

The creation of light from Book VII. God speaks and light springs from the deep, housed in a cloudy tabernacle before the sun exists.

Let there be light, said God; and forthwith Light / Ethereal, first of things, quintessence pure, / Sprung from the deep; and from her native east / To journey through the aery gloom began, / Sphered in a radiant cloud, for yet the sun / Was not; she in a cloudy tabernacle / Sojourned the while. God saw the light was good.

Milton, on light

When the retriever finds a strong match, Milton recites near-verbatim passages — at loss 0.128 he’s still got most of the poem in memory. But push him off-topic or crank the temperature and he starts inventing. The model recombines syllable patterns it learned into words that sound Miltonic but don’t exist.
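
Temperature is just a division before the softmax. A minimal sampler makes the effect concrete: T < 1 sharpens the distribution toward near-verbatim recitation, while T > 1 flattens it so rare token merges get through, which is where the invented words come from.

```python
import math
import random

def sample_token(logits, temperature=1.0, rng=None):
    """Sample one token index from logits after temperature scaling."""
    rng = rng or random
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    r = rng.random() * sum(exps)
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e
        if acc >= r:
            return i
    return len(exps) - 1
```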

“what is the internet” (temperature 1.4)

Asked something entirely outside his world, Milton starts coherent then fragments into invented language — “branchenching,” “stablious,” “Attendantine.” The cadence survives even when meaning doesn’t.

Gliding meteorous, as evening-mist / Risen from a river o’er the marish glides, / And gathers ground fast at the labourer’s heel / Homeward returning. awot under ground / Their branchenching element inoffven sealaid. / Lookly rep stood, and O!; Spirits they move / And stablious, and puresters fire, / Thosebidencedce, while, secure lively brought / In fumanxhalation! and be fair / Attendantine rock thus than Eve renewed.

Milton, on the internet

More interesting than the garble are the near-misses — lines where Milton almost remembers the source and makes a substitution. Paradise Lost reads “He ended, and they both descend the hill.” Milton’s version: “He ended, and they both descend the air.” One word changed. Arguably more Miltonic than Milton.

He also produces phrases that are fully readable English but appear nowhere in the poem:

  • “Sawing each bank with sea and cadence”
  • “Spirits they move”
  • “Grateful to order of birds”

And he coins portmanteau words by fusing Milton’s vocabulary: Rebelligion (rebellion + religion), Mercance (mercy + penance), Createndance (creation + attendance), necessitary (necessary + solitary).

His entire world is one poem. He speaks in its cadence even when he can’t remember its words.

We might keep training Milton. For no reason at all.

Files

get_data.py: Download and clean Paradise Lost from Project Gutenberg
tokenizer.py: Train a BPE tokenizer on the text
model.py: Transformer architecture
train.py: Training loop with word dropout and chat-formatted sequences
chat.py: Terminal chat interface with retrieval-seeded generation
retriever.py: TF-IDF passage retriever over Paradise Lost