May 15, 2026

Building a Sophisticated Design Engine

Generic AI produces generic output. Building a design engine that generates work with aesthetic identity requires a fundamentally different architecture: curated context, structured pipelines, and model selection that compounds specificity at every stage.

Every AI design tool is, at its core, a prompt. Something goes in — a description, an instruction, a set of requirements — and a design comes out. The generation is fast. The output looks like a design. And across most tools, at most levels of zoom, the output looks more or less the same.

This isn't a coincidence. It's an architecture problem. And solving it is the central engineering challenge of building a design tool that produces work worth using.

Why generic tools produce generic output

The homogenisation of AI-generated design isn't a taste problem — it's structural. Research published in Cell Patterns in 2025 found that autonomous AI generation loops — where model output feeds back into model input without human intervention — converge across all runs to approximately 12 dominant visual motifs. The researchers ran 700 generation trajectories from diverse starting prompts. Every single one converged to the same small set of outputs: commercially safe, visually neutral, structurally familiar. They called it "visual elevator music." The effect is not incidental — it's the mathematical consequence of a system optimising for high-probability attractors in its training distribution, in the absence of constraints that push it elsewhere.

Separately, an arXiv study on AI-generated web interfaces found that generation tools trained on popular frameworks systematically reproduce the visual conventions of those frameworks. The output isn't just similar across users — it's similar to every other interface the model has ever seen. Trained on the internet, generating for the internet, the model produces something that looks like the internet averaged out.

The conventional response to this is "better prompts." Describe your aesthetic more precisely. Use more specific language. Add reference images. This helps at the margins. It doesn't solve the structural problem, for a reason that anyone who has tried to specify a visual aesthetic in words will recognise immediately: design identity is not a description. It's an accumulation of specific decisions — this type weight at this scale, this spatial rhythm, this particular way a card sits on a background — and those decisions cannot be fully compressed into natural language. The model's prior is always the training distribution, and a text prompt is a relatively weak signal against it.

The question isn't how to write better prompts. It's how to build a system where the generation is constrained by enough of the right information that the output can't be generic, because the space of possible outputs has been narrowed to something specific.

What actually determines a good design

Before you can build a system that generates good designs, you need a precise understanding of what information good designs require. There are at least four distinct dimensions:

Product context. What the product is, who uses it, what the core flows are, what data needs to be shown, what states exist, what the information hierarchy is. A fintech dashboard and a project management tool are both "dashboards," but the component vocabulary, density, data model, and interaction pattern are completely different. A generation system that doesn't have this information will produce a plausible-looking dashboard that fits neither.

Aesthetic direction. Not "minimal" or "bold" — the full set of micro-decisions that constitute a specific visual language: a particular shade and weight of type, a specific spatial grammar, a precise colour relationship between surface and accent, a characteristic way of handling secondary information. These decisions are not describable from scratch in a user prompt. They can be pre-defined as a named direction ("Swiss Grid Finance," "Soft Daylight Minimalism") and selected from, but they have to be constructed first, with specificity, before they can be applied.

Component logic. What the component vocabulary is and how it holds together as a system. A design where the button style, the input field style, the card style, and the navigation style all derive from the same visual logic feels coherent. A design where each component was independently "minimalist" but in a slightly different way feels assembled, not designed. The component logic has to be consistent across the whole generation — which means it has to be defined before the generation, not discovered during it.

Layout constraints. Information hierarchy, density, spatial rhythm, the relationship between content and whitespace. These are different for different product types, different screen states, and different viewport sizes. A generation system without layout constraints produces layouts that look complete but are wrong: the wrong things are prominent, the hierarchy is unclear, the density is inappropriate for the use case.

Generic tools have none of this. They have the user's description and the training distribution. The output is what the training distribution thinks "SaaS dashboard" looks like. Which is to say: every SaaS dashboard in the training data, averaged.
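One way to picture the gap is to treat the four dimensions as required fields on a single input object. The sketch below is illustrative only: `GenerationContext`, its field names, and `missing_dimensions` are hypothetical and not taken from any real tool. It shows which dimensions a prompt-only request leaves empty.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GenerationContext:
    """Hypothetical container for the four inputs; names are illustrative."""
    product_context: dict      # product, users, flows, data, states
    aesthetic_direction: dict  # type, colour, spacing decisions
    component_logic: dict      # shared rules for buttons, inputs, cards, nav
    layout_constraints: dict   # hierarchy, density, rhythm, viewport rules


def missing_dimensions(ctx: GenerationContext) -> list[str]:
    """Return the names of dimensions that were left empty."""
    return [f.name for f in fields(ctx) if not getattr(ctx, f.name)]


# A freeform prompt supplies, at best, a thin slice of product context.
prompt_only = GenerationContext(
    product_context={"description": "SaaS dashboard"},
    aesthetic_direction={},
    component_logic={},
    layout_constraints={},
)

print(missing_dimensions(prompt_only))
# ['aesthetic_direction', 'component_logic', 'layout_constraints']
```

Anything the check reports as missing is, in effect, filled in by the training distribution.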

The spec as context compression

The questionnaire Mowgli asks before generating anything is not a formality. It's a context extraction mechanism. Its purpose is to systematically extract the product context that a generation system needs — and that a user can't be expected to supply in a freeform prompt, because most of it hasn't been thought through yet.

Questions about who the users are, what the core flows are, what the edge cases are, what the empty states look like — these extract information that determines what screens need to exist, what data needs to appear on them, what states they need to handle. The questionnaire converts the user's mental model of their product into structured data that the generation pipeline can reason from.

This matters for quality in a way that's invisible from the outside. When a screen is generated from a spec that specifies "users in the trial state see an upgrade prompt when they reach the limit," the screen renders that state correctly — the upgrade prompt exists, it's in the right place, it has the right prominence. When a screen is generated from a description, the state either doesn't exist (most common) or is improvised (unreliable). The spec is what makes the output complete rather than approximate.
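As a minimal sketch of why a spec makes completeness checkable, assume the questionnaire answers are compiled into a list of screens with named states. `ScreenSpec`, `uncovered_states`, and the state names are assumptions made for illustration, not Mowgli's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class ScreenSpec:
    """Hypothetical spec entry: one screen and the states it must handle."""
    name: str
    states: list[str] = field(default_factory=list)


def uncovered_states(
    spec: list[ScreenSpec], rendered: dict[str, set[str]]
) -> dict[str, list[str]]:
    """Map each screen to the specified states that were not rendered."""
    gaps: dict[str, list[str]] = {}
    for screen in spec:
        have = rendered.get(screen.name, set())
        missing = [s for s in screen.states if s not in have]
        if missing:
            gaps[screen.name] = missing
    return gaps


spec = [ScreenSpec("usage", ["default", "empty", "trial_limit_reached"])]

# Output generated from a bare description: only the default state exists.
from_description = {"usage": {"default"}}

print(uncovered_states(spec, from_description))
# {'usage': ['empty', 'trial_limit_reached']}
```

Without a spec there is nothing to compare the output against, so a missing state goes unnoticed.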

Research on prompt specificity consistently confirms the mechanism. Prompts that include context, constraints, and structured parameters produce outputs that align with intent significantly more reliably than open-ended prompts. The 2024 Meaning Typed Prompting paper (arXiv 2410.18146) formalises this for structured output generation: integrating types, meanings, and abstract structure into the prompting process significantly improves output clarity and reduces hallucination. The questionnaire is the design-domain version of this principle: extract the structure first, then generate from it.

The moodboard as aesthetic specification

Asking users to describe their aesthetic in a text prompt is asking them to do something genuinely hard. Design identity isn't a sentence. It's a set of specific, mutually reinforcing decisions that only become visible when applied to real content.

The moodboard solves this without asking the user to solve it. Rather than prompting the user to specify an aesthetic, Mowgli generates 16+ pre-built aesthetic directions — each one a complete, internally consistent visual language constructed around the product's DNA. The user reacts to these rather than inventing them. Which of these is right? Which is close but wrong in this specific way?

Each direction is built with precision before it's offered: a specific type system, a specific colour grammar, a specific component aesthetic, a specific spatial logic. It isn't a mood image or a colour palette — it's a named, parametrised aesthetic specification. When the user selects a direction, they're not selecting an inspiration; they're selecting a complete set of generation constraints that will be applied consistently across every screen in the product.

This is what converts "minimal" from a vague aspiration into a specific design decision. "Minimal" in Swiss Grid Finance means a particular thing: high-contrast typography, monochromatic base, tabular information density, precise grid adherence. "Minimal" in Soft Daylight means something entirely different: warm neutrals, generous whitespace, rounded components, calm hierarchical depth. Both are minimal. They're completely different aesthetics. The moodboard forces the distinction to be made before generation, so the generation is executing a specific brief, not improvising around a vague one.
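To make "named, parametrised aesthetic specification" concrete, here is a rough sketch of two directions as data. The field names and every numeric value are invented for illustration; only the qualitative traits (monochromatic versus warm neutrals, tabular versus generous density, rounded components) come from the descriptions above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AestheticDirection:
    """Hypothetical parameter set; a real direction would carry many more."""
    name: str
    palette: str
    corner_radius_px: int   # illustrative value
    density: str
    base_spacing_px: int    # illustrative value


SWISS_GRID_FINANCE = AestheticDirection(
    "Swiss Grid Finance", "monochromatic", 0, "tabular", 4
)
SOFT_DAYLIGHT = AestheticDirection(
    "Soft Daylight Minimalism", "warm neutrals", 12, "generous", 12
)


def differences(a: AestheticDirection, b: AestheticDirection) -> dict:
    """Return the parameters on which two directions disagree."""
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if k != "name" and da[k] != db[k]}


for key, (left, right) in differences(SWISS_GRID_FINANCE, SOFT_DAYLIGHT).items():
    print(f"{key}: {left!r} vs {right!r}")
```

Both directions would be called "minimal" in a prompt, yet in this sketch they disagree on every parameter, which is the distinction the moodboard forces before generation.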

Model selection and the pipeline

The temptation in AI system design is to find the best single model and use it for everything. This produces results that are uniformly competent and uniformly limited by whatever that model is bad at.

Different generation tasks have different requirements. Generating a complete layout from a spec requires strong spatial reasoning and structural coherence. Applying a precise aesthetic to existing structure requires fidelity to aesthetic parameters. Generating realistic, product-specific copy requires domain-specific language understanding. Refining an inconsistency between two screens requires cross-screen reasoning about visual coherence. These are not the same task, and the same model at the same settings is not optimal for all of them.

Mowgli's generation pipeline selects models and configurations for the specific requirements of each stage. The result is not a single model's output; it is the compound of multiple specialised decisions, each made with the right tool. Prompt chaining research has consistently documented why this matters: the AI Chains paper (Wu et al., CHI 2022) found that chaining not only improved output quality but also significantly enhanced system transparency, controllability, and users' sense of collaboration. A 2024 study on multi-dimensional prompt chaining found improvements in response diversity of up to 29% and contextual coherence of up to 28% compared to single-prompt generation.
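A staged pipeline of this kind might be sketched as a list of stages, each pairing a task with its own model label and settings. This is a toy under stated assumptions: the stage names follow the tasks described above, but the model labels, temperatures, and stub functions are placeholders rather than real models or Mowgli's configuration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Stage:
    name: str
    model: str           # placeholder label, not a real model identifier
    temperature: float   # illustrative setting only
    run: Callable[[dict], dict]


def make_stub(key: str, value: str) -> Callable[[dict], dict]:
    """Stand-in for a model call: copies the artifact and adds one key."""
    return lambda artifact: {**artifact, key: value}


STAGES = [
    Stage("layout", "spatial-model", 0.2, make_stub("layout", "12-column grid")),
    Stage("styling", "style-model", 0.1, make_stub("style", "direction applied")),
    Stage("copy", "language-model", 0.7, make_stub("copy", "domain-specific labels")),
    Stage("consistency", "review-model", 0.0, make_stub("review", "cross-screen check")),
]


def run_pipeline(stages: list[Stage], artifact: dict) -> tuple[dict, list[str]]:
    """Feed each stage's output into the next and record which model ran."""
    trace = []
    for stage in stages:
        artifact = stage.run(artifact)
        trace.append(f"{stage.name} -> {stage.model}")
    return artifact, trace


result, trace = run_pipeline(STAGES, {"screen": "usage"})
print(trace)
print(sorted(result))
# ['copy', 'layout', 'review', 'screen', 'style']
```

The trace is the transparency benefit in miniature: each part of the final artifact can be attributed to a specific stage and configuration.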

The pipeline is also where human input re-enters the loop. Before the user commits to the full build, the moodboard preview shows them a real product screen in each aesthetic direction. This isn't just UX convenience; it's an anti-convergence mechanism. The 2025 Cell Patterns research explicitly identifies sustained human-AI interplay as the structural requirement for preserving creative diversity in AI output. The preview is where the human's specific taste can override the model's prior, before that prior has propagated across 50 screens.

Specificity compounds

What makes this architecture work as a whole is that each layer of specificity reduces the generation space for the next. The spec narrows the output from "any app" to "this specific product with these specific flows and states." The moodboard selection narrows it from "any aesthetic" to "this specific visual language with these specific parameters." The pipeline applies each stage's constraints to the input for the next.

By the time a screen is generated, the prompt isn't a natural language description the model interprets loosely. It's a structured specification built from product context, aesthetic parameters, layout constraints, and state definitions — assembled programmatically from everything the user specified through the questionnaire and moodboard. The model's generation space is narrow enough that it can't be generic, because "generic" doesn't fit the brief.
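Programmatic assembly could look something like the following. The JSON shape, the function name `build_screen_brief`, and all example values are assumptions made for illustration; the source does not describe the actual format of the assembled specification.

```python
import json


def build_screen_brief(
    product: dict, direction: dict, layout: dict, screen: str, states: list[str]
) -> str:
    """Assemble a structured brief for one screen (hypothetical shape)."""
    brief = {
        "product": product,
        "aesthetic": direction,
        "layout": layout,
        "screen": {"name": screen, "states": states},
    }
    # sort_keys makes the output deterministic for identical inputs.
    return json.dumps(brief, indent=2, sort_keys=True)


brief = build_screen_brief(
    product={"type": "fintech dashboard", "users": "finance teams"},
    direction={"name": "Swiss Grid Finance", "palette": "monochromatic"},
    layout={"density": "tabular", "grid_columns": 12},
    screen="usage",
    states=["default", "trial_limit_reached"],
)
print(brief)
```

Because every field is filled from earlier answers rather than written by hand, the same inputs always yield the same brief, and nothing in it is left for the model to guess.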

This is the engineering ambition behind Mowgli: not to use AI to generate faster, but to use AI to generate specifically. Speed is already table stakes. What's hard — and what determines whether the output is worth using — is closing the gap between what the user knows about their product and what the generation actually reflects.

The homogenisation of AI-generated design is not an inevitable property of AI. It's a consequence of underspecified systems generating into the training distribution's default. The spec, the moodboard, the pipeline, the model selection — each one is an architecture decision that narrows the space and raises the floor. The goal is a system where every output is recognisably, specifically, this product — not the average of everything a design tool has ever seen.

Sources

Autonomous AI generation loops converging to 12 dominant visual motifs — the structural basis for AI design homogenisation: Autonomous language-image generation loops converge to generic visual motifs — Cell Patterns (2025)
AI generation reinforcing dominant web design conventions from training data: Interrogating Design Homogenization in Web Vibe Coding — arXiv
Prompt chaining improving both output quality and system transparency: AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts — arXiv (Wu et al., CHI 2022)
Multi-dimensional prompt chaining improving response diversity by up to 29% and contextual coherence by up to 28%: Multi-Dimensional Prompt Chaining to Improve Open-Domain Dialogue Generation — arXiv
Meaning Typed Prompting — structured generation integrating types and abstractions into prompts: Meaning Typed Prompting: A Technique for Efficient, Reliable Structured Output Generation — arXiv 2410.18146
Context specificity and structured prompts producing reliably higher-quality AI outputs: Effective Prompts for AI: The Essentials — MIT Sloan Teaching & Learning Technologies
Human-AI interplay as the structural requirement for preserving creative diversity in AI generation: Autonomous language-image generation loops converge to generic visual motifs — Cell Patterns (2025)
Why moodboards create genuine aesthetic reactions rather than abstract preferences: Visual organizing: Balancing coordination and creative freedom via mood boards — ScienceDirect