Synthetic problem generation

Synthetic problem generation lets you quickly produce random grammars and corresponding target individuals to benchmark search behavior without hand-crafting domain grammars.

This is useful for stress-testing operators, budgets, and configuration choices under controlled complexity.

API

Parameters (summary)

seed: Controls reproducibility of the generated grammar.
non_terminals_count: Number of abstract non-terminals to create.
recursive_non_terminals_count: How many of the last non-terminals are allowed to be recursive.
productions_per_non_terminal(rd): Callable returning how many productions each non-terminal has (called per non-terminal with a random.Random).
non_terminals_per_production(rd): Callable returning the arity (number of fields) per production (called per production with a random.Random).
base_types: Set of terminal/base field types to allow as leaves (defaults to {int, bool}).

The function returns (nodes, root) where nodes are the dynamically created classes (non-terminals and productions) and root is the designated root non-terminal.

Minimal example

from geneticengine.grammar.grammar import extract_grammar
from geneticengine.grammar.synthetic_grammar import create_arbitrary_grammar
from geneticengine.random.sources import NativeRandomSource
from geneticengine.representations.tree.initializations import MaxDepthDecider
from geneticengine.representations.tree.treebased import TreeBasedRepresentation

# 1) Generate a random grammar
nodes, root = create_arbitrary_grammar(
    seed=0,
    non_terminals_count=3,
    recursive_non_terminals_count=2,
    productions_per_non_terminal=lambda rd: 2,
    non_terminals_per_production=lambda rd: 1,
)

# 2) Build a Grammar object
G = extract_grammar(nodes, root)

# 3) Create a random target individual (phenotype) from this grammar
r = NativeRandomSource(0)
rep = TreeBasedRepresentation(G, decider=MaxDepthDecider(r, G, G.get_min_tree_depth()))
G_target = rep.create_genotype(r, depth=8)
P_target = rep.genotype_to_phenotype(G_target)
print(P_target)

Using with Genetic Programming

A common pattern is to generate a target individual and then define a distance-based fitness to measure how close candidates are to that target. For example, using string distance over the stringified phenotypes:

from polyleven import levenshtein
from geneticengine.algorithms.gp.gp import GeneticProgramming
from geneticengine.evaluation.budget import EvaluationBudget
from geneticengine.problems import SingleObjectiveProblem

# Assume G and P_target are created as above

def fitness_function(p):
    return levenshtein(str(p), str(P_target))

problem = SingleObjectiveProblem(
    fitness_function=fitness_function,
    minimize=True,
    target=0,
)

r = NativeRandomSource(0)
alg = GeneticProgramming(
    problem=problem,
    budget=EvaluationBudget(100),
    representation=TreeBasedRepresentation(G, decider=MaxDepthDecider(r, G, G.get_min_tree_depth() + 10)),
    population_size=10,
    random=r,
)
ind = alg.search()[0]
print(ind, ind.get_fitness(problem))

Complexity control tips

Grammar size: Increase non_terminals_count and/or productions_per_non_terminal to grow the search space.
Depth/recursion: Control recursion potential with recursive_non_terminals_count, and control tree growth with MaxDepthDecider and the depth you pass to create_genotype.
Arity: Use non_terminals_per_production to increase/decrease branching factor per production.
Leaf variety: Adjust base_types to change the terminal set (e.g., {int, bool, float}).

End-to-end runnable example

See examples/synthetic_grammar_example.py for a complete script with CLI flags to vary grammar size, recursion, and arity.