Lexicons

A lexicon is a JSON file that steers text generation toward specific vocabularies and styles. Without a lexicon, Malarky uses a general-purpose English vocabulary. With one, you control exactly which words appear and how they’re weighted.

Why use a lexicon?

Domain-specific text – Generate corporate, medical, legal, or technical nonsense
Style control – Adjust sentence type distributions and complexity per domain
Weighted vocabulary – Favor certain words over others
Correlated choices – When a business noun is picked, boost business verbs
Quality constraints – Prevent word repetition, limit phrase complexity
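
To make the "correlated choices" idea concrete, here is a minimal sketch of how a boost could work once a tagged noun has been picked. The `WeightedTerm` shape, the `boostByTag` helper, and the multiplier are hypothetical and chosen only for illustration; they are not Malarky's schema or API, and the library may implement correlations differently.

```typescript
// Hypothetical shapes for illustration; not the malarky schema or API.
interface WeightedTerm {
  value: string;
  weight: number;
  tags: string[];
}

// Return a copy of `candidates` in which every term carrying `tag`
// has its weight multiplied by `factor`. The input is not mutated.
function boostByTag(
  candidates: WeightedTerm[],
  tag: string,
  factor: number,
): WeightedTerm[] {
  return candidates.map((term) =>
    term.tags.includes(tag) ? { ...term, weight: term.weight * factor } : term,
  );
}

const verbs: WeightedTerm[] = [
  { value: 'leverage', weight: 2, tags: ['domain:business'] },
  { value: 'wander', weight: 2, tags: [] },
];

// Suppose a noun tagged "domain:business" has just been picked.
const boosted = boostByTag(verbs, 'domain:business', 3);
console.log(boosted.map((t) => `${t.value}=${t.weight}`).join(', '));
// prints "leverage=6, wander=2"
```

After the boost, the business verb is three times as likely to be chosen as the untagged one, which is the effect the bullet describes.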

Minimal example

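A minimal lexicon needs an id, a language, and at least one entry in termSets. The example below also defines an optional archetype named startup, which the loading examples select by name: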

{
  "id": "lexicon.startup",
  "language": "en",
  "termSets": {
    "noun.startup": {
      "pos": "noun",
      "tags": ["domain:startup"],
      "terms": [
        { "value": "disruptor", "weight": 5 },
        { "value": "unicorn", "weight": 3 },
        { "value": "pivot", "weight": 4 },
        { "value": "runway", "weight": 2 }
      ]
    },
    "verb.startup": {
      "pos": "verb",
      "tags": ["domain:startup"],
      "terms": [
        { "value": "disrupt", "weight": 5 },
        { "value": "scale", "weight": 4 },
        { "value": "pivot", "weight": 3 },
        { "value": "iterate", "weight": 3 }
      ]
    }
  },
  "archetypes": {
    "startup": {
      "tags": ["domain:startup"]
    }
  }
}
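
The weight values are easiest to read as relative odds. The sketch below assumes that a term is selected with probability proportional to its weight, which is the usual reading of weighted vocabularies but is an assumption here, not a description of Malarky's internals. The `Term` type and `pickWeighted` helper are hypothetical names.

```typescript
// Hypothetical type and helper for illustration; not part of the malarky API.
interface Term {
  value: string;
  weight: number;
}

// Pick one term with probability proportional to its weight.
// `rand` is injectable so the function can be exercised deterministically.
function pickWeighted(terms: Term[], rand: () => number = Math.random): string {
  const total = terms.reduce((sum, t) => sum + t.weight, 0);
  if (terms.length === 0 || total <= 0) {
    throw new Error('need at least one term with a positive weight');
  }
  let threshold = rand() * total;
  for (const term of terms) {
    threshold -= term.weight;
    if (threshold < 0) return term.value;
  }
  // Fallback for floating-point edge cases.
  return terms[terms.length - 1].value;
}

// The terms from "noun.startup" in the example above.
const nouns: Term[] = [
  { value: 'disruptor', weight: 5 },
  { value: 'unicorn', weight: 3 },
  { value: 'pivot', weight: 4 },
  { value: 'runway', weight: 2 },
];

console.log(pickWeighted(nouns));
```

Under that assumption the weights sum to 14, so "disruptor" would appear about 5/14 (roughly 36%) of the time and "runway" about 2/14 (roughly 14%).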

Loading a lexicon in code

import {
  TextGenerator,
  SimpleFakerAdapter,
  loadLexiconFromString,
} from 'malarky';
import { readFileSync } from 'fs';

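// Read the lexicon JSON from disk and load it
const lexicon = loadLexiconFromString(readFileSync('./startup.json', 'utf-8'));

// Pass the loaded lexicon to the generator
const generator = new TextGenerator({
  fakerAdapter: new SimpleFakerAdapter(),
  lexicon,
});

// Select the "startup" archetype defined in the lexicon
generator.setArchetype('startup');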
console.log(generator.paragraph());

Loading a lexicon from the CLI

malarky paragraph --lexicon ./startup.json --archetype startup

What’s in a lexicon?

A lexicon can contain any of the sections below. All of them are optional; only the top-level id and language fields are required:

Section	Purpose
`termSets`	Named pools of words grouped by part of speech
`patterns`	Syntactic templates for phrases/sentences
`distributions`	Named weight tables that bias choices
`correlations`	Conditional boosts triggered by word choices
`constraints`	Hard/soft rules restricting generation
`invariants`	Conditions that must always hold true
`archetypes`	Style presets combining tags, distributions, overrides
`relations`	Graph connections between terms
`outputTransforms`	Default transform pipelines

See the Schema Reference for full documentation of each section.
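
As a rough orientation, the top-level layout described in the table can be sketched as a TypeScript type. `LexiconOutline` and `missingRequiredFields` are hypothetical names, and every section is typed as `unknown` because the real shapes are defined in the Schema Reference, not here. The helper only checks the two required fields; it is not a substitute for the library's own loading and validation.

```typescript
// Hypothetical outline for illustration; section contents are deliberately
// left as `unknown` because their shapes live in the Schema Reference.
interface LexiconOutline {
  id: string;
  language: string;
  termSets?: Record<string, unknown>;
  patterns?: unknown;
  distributions?: unknown;
  correlations?: unknown;
  constraints?: unknown;
  invariants?: unknown;
  archetypes?: Record<string, unknown>;
  relations?: unknown;
  outputTransforms?: unknown;
}

// Return the names of required top-level fields that are missing or not strings.
function missingRequiredFields(candidate: unknown): string[] {
  const required = ['id', 'language'];
  if (typeof candidate !== 'object' || candidate === null) return required;
  const record = candidate as Record<string, unknown>;
  return required.filter((key) => typeof record[key] !== 'string');
}

const outline: LexiconOutline = { id: 'lexicon.startup', language: 'en' };
console.log(missingRequiredFields(outline).length); // prints 0

// A document without "language" is reported as incomplete.
const incomplete: unknown = JSON.parse('{"id":"lexicon.startup"}');
console.log(missingRequiredFields(incomplete).join(',')); // prints "language"
```

A check like this can give a quick, readable error before a file is handed to the generator, while the full rules stay in the Schema Reference.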