Lexicons
A lexicon is a JSON file that steers text generation toward specific vocabularies and styles. Without a lexicon, Malarky uses a general-purpose English vocabulary. With one, you control exactly which words appear and how they’re weighted.
Why use a lexicon?
- Domain-specific text – Generate corporate, medical, legal, or technical nonsense
- Style control – Adjust sentence type distributions and complexity per domain
- Weighted vocabulary – Favor certain words over others
- Correlated choices – When a business noun is picked, boost business verbs
- Quality constraints – Prevent word repetition, limit phrase complexity
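To make the "weighted vocabulary" idea concrete, here is a minimal sketch of a weighted random pick. It is not Malarky's internal implementation: the `Term` interface and the `pickWeighted` function are hypothetical names, and the sketch assumes that weights act as relative probabilities.

```typescript
// Illustrative only: a generic weighted pick, not Malarky's internal code.
// Assumption: a term's weight is its relative chance of being chosen.
interface Term {
  value: string;
  weight: number;
}

// `random` is injectable so the choice can be made deterministic in tests.
function pickWeighted(
  terms: Term[],
  random: () => number = Math.random,
): string {
  if (terms.length === 0) {
    throw new Error('cannot pick from an empty term list');
  }
  const total = terms.reduce((sum, term) => sum + term.weight, 0);
  if (total <= 0) {
    throw new Error('total weight must be positive');
  }

  // Scale a number in [0, 1) onto the cumulative weight range [0, total),
  // then walk the list until the running remainder drops below zero.
  let remaining = random() * total;
  let chosen = '';
  for (const term of terms) {
    chosen = term.value;
    remaining -= term.weight;
    if (remaining < 0) {
      break;
    }
  }
  // If rounding keeps `remaining` non-negative, the last term is returned.
  return chosen;
}

const startupNouns: Term[] = [
  { value: 'disruptor', weight: 5 },
  { value: 'unicorn', weight: 3 },
  { value: 'pivot', weight: 4 },
  { value: 'runway', weight: 2 },
];

console.log(pickWeighted(startupNouns));
```

Under that assumption, the weights above sum to 14, so "disruptor" would come up in roughly 5 of every 14 picks and "runway" in roughly 2 of every 14.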
Minimal example
A lexicon needs only an id, language, and at least one termSet:
{
  "id": "lexicon.startup",
  "language": "en",
  "termSets": {
    "noun.startup": {
      "pos": "noun",
      "tags": ["domain:startup"],
      "terms": [
        { "value": "disruptor", "weight": 5 },
        { "value": "unicorn", "weight": 3 },
        { "value": "pivot", "weight": 4 },
        { "value": "runway", "weight": 2 }
      ]
    },
    "verb.startup": {
      "pos": "verb",
      "tags": ["domain:startup"],
      "terms": [
        { "value": "disrupt", "weight": 5 },
        { "value": "scale", "weight": 4 },
        { "value": "pivot", "weight": 3 },
        { "value": "iterate", "weight": 3 }
      ]
    }
  },
  "archetypes": {
    "startup": {
      "tags": ["domain:startup"]
    }
  }
}
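Before handing a file to Malarky, it can help to check the three requirements stated above (an id, a language, and at least one termSet). The following is a rough pre-flight sketch, not Malarky's own validator: `MinimalLexicon` and `findMinimalLexiconProblems` are hypothetical names, and the check covers only the fields named in this section.

```typescript
// Hypothetical shape covering only the fields this section calls required.
// The real schema allows more sections; see the Schema Reference.
interface MinimalLexicon {
  id: string;
  language: string;
  termSets: Record<string, unknown>;
}

// Returns a list of human-readable problems; an empty list means the
// JSON text passes this (deliberately shallow) check.
function findMinimalLexiconProblems(raw: string): string[] {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return ['not valid JSON'];
  }
  if (typeof data !== 'object' || data === null || Array.isArray(data)) {
    return ['top level must be a JSON object'];
  }

  const lexicon = data as Partial<MinimalLexicon>;
  const problems: string[] = [];

  if (typeof lexicon.id !== 'string' || lexicon.id === '') {
    problems.push('missing "id"');
  }
  if (typeof lexicon.language !== 'string' || lexicon.language === '') {
    problems.push('missing "language"');
  }

  const termSets = lexicon.termSets;
  if (
    typeof termSets !== 'object' ||
    termSets === null ||
    Object.keys(termSets).length === 0
  ) {
    problems.push('needs at least one termSet');
  }
  return problems;
}
```

Returning a list of problems, rather than throwing on the first one, lets a caller report everything wrong with a file in a single pass.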
Loading a lexicon in code
import {
  TextGenerator,
  SimpleFakerAdapter,
  loadLexiconFromString,
} from 'malarky';
import { readFileSync } from 'fs';

// Read the lexicon file and load it from its JSON text.
const lexicon = loadLexiconFromString(readFileSync('./startup.json', 'utf-8'));

const generator = new TextGenerator({
  fakerAdapter: new SimpleFakerAdapter(),
  lexicon,
});

// Select the "startup" archetype defined in the lexicon.
generator.setArchetype('startup');
console.log(generator.paragraph());
Loading a lexicon from the CLI
malarky paragraph --lexicon ./startup.json --archetype startup
What’s in a lexicon?
A lexicon can contain any of the sections below. All of them are optional; only the top-level id and language fields are required:
| Section | Purpose |
|---|---|
| termSets | Named pools of words grouped by part of speech |
| patterns | Syntactic templates for phrases/sentences |
| distributions | Named weight tables that bias choices |
| correlations | Conditional boosts triggered by word choices |
| constraints | Hard/soft rules restricting generation |
| invariants | Conditions that must always hold true |
| archetypes | Style presets combining tags, distributions, overrides |
| relations | Graph connections between terms |
| outputTransforms | Default transform pipelines |
See the Schema Reference for full documentation of each section.