Evolving Activation Functions: A Personal Exploration with Transformers

⚠️ Disclaimer (Please Read First)

This blog post presents a personal, exploratory experiment using publicly available tools and datasets (e.g., Tatoeba, DEAP, Hugging Face Transformers). It is not a lab-verified study, and results have not undergone peer review or formal statistical validation.

The findings, interpretations, and conclusions shared here are based on limited-scale experiments using Google Colab and consumer-level hardware. They are intended for educational and exploratory purposes only.

There is no guarantee on the accuracy, stability, or reproducibility of the experimental results. Any interpretations or applications are entirely at the reader’s discretion.

Readers are encouraged to replicate, adapt, or challenge the outcomes in more rigorous or production-grade environments.

We often tweak our models by adjusting the data, trying new optimizers, or changing the architecture—but how often do we think about activation functions? This project started as a curiosity: what if I used genetic programming to evolve better activation functions for transformers?

I'm not coming from a lab or a team with massive resources. Just working with Colab GPUs, a lot of tinkering, and DEAP for symbolic regression. Along the way, I ran three experiments:

  1. A GPT2-like model fine-tuned for translation

  2. A T5-small model also for translation

  3. A TinyGPT model trained on simple Q&A data

Surprisingly, I found a few activation functions that consistently beat the default GELU in terms of BLEU score, perplexity, and sometimes even accuracy.


Why Even Touch Activation Functions?

Most people stick with ReLU, GELU, or Swish because they work "well enough." But I was curious: could a model perform better if its nonlinearity was custom-fit to the task or dataset?

So I used DEAP to evolve symbolic expressions for activation functions. Here’s an example of the primitive set I used:

Primitive Set Used

  • Unary functions: log1p_safe, sin, cos, sigmoid_boost

  • Binary functions: safe_add, safe_sub, safe_mul, safe_div

  • Constants: 1.0

Each tree built from these was plugged into PyTorch models in place of GELU.
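To make this concrete, here is a minimal sketch of how such a primitive set can be registered in DEAP. Only the primitive names come from the list above; the safe_* guards and the sigmoid_boost definition are my assumptions, and in practice the compiled expression would be wired to torch ops so it works element-wise on tensors.

```python
import math
from deap import gp

# Assumed "safe" wrappers -- stand-ins for whatever guards the real runs used.
def safe_div(a, b):
    return a / b if abs(b) > 1e-6 else 1.0

def log1p_safe(x):
    return math.log1p(abs(x))

def sigmoid_boost(x):
    return 2.0 / (1.0 + math.exp(-min(max(x, -60.0), 60.0)))  # clipped to avoid overflow

pset = gp.PrimitiveSet("ACT", 1)          # one argument: the pre-activation value
pset.renameArguments(ARG0="x")
pset.addPrimitive(lambda a, b: a + b, 2, name="safe_add")
pset.addPrimitive(lambda a, b: a - b, 2, name="safe_sub")
pset.addPrimitive(lambda a, b: a * b, 2, name="safe_mul")
pset.addPrimitive(safe_div, 2, name="safe_div")
pset.addPrimitive(log1p_safe, 1)
pset.addPrimitive(math.sin, 1, name="sin")
pset.addPrimitive(math.cos, 1, name="cos")
pset.addPrimitive(sigmoid_boost, 1)
pset.addTerminal(1.0)

# A random tree from this set compiles into an ordinary Python callable,
# which can then be wrapped in an nn.Module and applied element-wise.
expr = gp.genHalfAndHalf(pset, min_=1, max_=3)
act_fn = gp.compile(gp.PrimitiveTree(expr), pset)
```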

How I Tried to Be Fair

I wanted a fair comparison, so:

  • I used the default Hugging Face model settings

  • Same seeds, same learning rate, same data splits

  • No hand-picking results—I repeated runs and only kept results that stayed consistent

I couldn’t do full statistical testing (compute is limited), but I hope others can help repeat or verify the work.
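For anyone repeating this, here is a minimal sketch of those controls: fixed seeds plus averaging over repeated runs. The train_and_eval function is hypothetical, standing in for one full training and evaluation pass.

```python
import numpy as np
from transformers import set_seed

def average_over_runs(train_and_eval, seeds=(42, 43, 44)):
    """Repeat training with a few fixed seeds and average the metric dicts,
    instead of hand-picking a single favourable run."""
    results = []
    for seed in seeds:
        set_seed(seed)                    # seeds Python, NumPy, and PyTorch together
        results.append(train_and_eval())  # hypothetical: returns e.g. {"bleu": ..., "ppl": ...}
    return {k: float(np.mean([r[k] for r in results])) for k in results[0]}
```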

Experiment 1: GPT2-like Model for Translation

  • Model: Tiny GPT2 (2-layer, 128-dim)

  • Data: 50k sentence pairs from Tatoeba

  • Eval: BLEU, perplexity, accuracy
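Roughly how an evolved function gets dropped into a model like this (the formula inside EvolvedAct is purely illustrative, since the Experiment 1 activation was later withdrawn; see the note at the end of the post):

```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel

class EvolvedAct(nn.Module):
    """Wrapper around a compiled GP expression (the formula here is illustrative only)."""
    def forward(self, x):
        return torch.log1p(torch.abs(x)) * torch.sigmoid(x)

config = GPT2Config(n_layer=2, n_embd=128, n_head=4)
model = GPT2LMHeadModel(config)

# Swap GELU for the evolved activation in every block's MLP.
for block in model.transformer.h:
    block.mlp.act = EvolvedAct()
```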

Metric        GELU      Custom #1   Custom #2
BLEU          0.0185    0.0188      0.0174
Perplexity    2.2471    2.2321      2.2398
Accuracy      82.45%    83.12%      82.76%




The gains are small but consistent. I ran the evolution on just 50% of the data to save time, then trained the top function on the full set.


Experiment 2: T5-Small

  • Model: T5-small (encoder-decoder)

  • Custom Activation: log1p_safe(safe_cos(gated_gate((safe_tan(softsign(ARG0))), ARG0))) (see the sketch after this list)

  • Data: Same 50k sentence pairs

  • Eval: BLEU, perplexity, training loss
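Here is how I read that expression as a PyTorch module. The gated_gate, safe_tan, and log1p_safe/safe_cos guards below are my assumptions (the exact guarded versions aren't published); only the structure of the formula comes from the bullet above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import T5ForConditionalGeneration

def gated_gate(a, b):
    return torch.sigmoid(a) * b                      # assumed: sigmoid gate on the first input

def safe_tan(x):
    return torch.tan(torch.clamp(x, -1.5, 1.5))      # assumed: keep clear of tan's poles

class EvolvedT5Act(nn.Module):
    """log1p_safe(safe_cos(gated_gate(safe_tan(softsign(x)), x))) under the assumptions above."""
    def forward(self, x):
        inner = gated_gate(safe_tan(F.softsign(x)), x)
        return torch.log1p(torch.abs(torch.cos(inner)))

model = T5ForConditionalGeneration.from_pretrained("t5-small")
# Replace the feed-forward activation in every encoder and decoder block.
for block in list(model.encoder.block) + list(model.decoder.block):
    block.layer[-1].DenseReluDense.act = EvolvedT5Act()
```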

Epoch   Loss (GELU)   Loss (Custom)   BLEU (GELU)   BLEU (Custom)   PPL (GELU)   PPL (Custom)
10      0.8872        0.8593          0.0599        0.0608          2.2584       2.2453
11      0.8471        0.8318          0.0601        0.0614          2.2121       2.1957
12      0.8016        0.7951          0.0620        0.0632          2.1679       2.1485


These small improvements feel promising. Even a fraction of a BLEU point or a few hundredths of perplexity can make a difference over time.


Experiment 3: TinyGPT on Constrained Q&A

  • Custom Activation: gated_gate(safe_sin(relu_like(sigmoid_like(ARG0))), gelu_like(ARG0))





(Figure: plot of the evolved activation used in the TinyGPT experiment.)
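A rough way to reproduce that plot. The readings of the primitives are my assumptions: sigmoid_like as sigmoid, relu_like as ReLU, safe_sin as sin, gelu_like as GELU, and gated_gate(a, b) as sigmoid(a) * b.

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def evolved_tinygpt_act(x):
    # gated_gate(safe_sin(relu_like(sigmoid_like(x))), gelu_like(x)), under the assumed readings
    gate = torch.sin(F.relu(torch.sigmoid(x)))
    return torch.sigmoid(gate) * F.gelu(x)

x = torch.linspace(-6, 6, 400)
plt.plot(x, evolved_tinygpt_act(x), label="evolved")
plt.plot(x, F.gelu(x), linestyle="--", label="GELU")
plt.title("Evolved TinyGPT activation vs. GELU")
plt.legend()
plt.show()
```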

This one was fun to watch. I trained a GPT-style model on structured Q&A examples. Here are some sample predictions:

Sample 1

Input:

Q: What is a hydrogen atom made of?
A:

GELU:

Q: What is a hydrogen atom made of?
A:
Q: What kind of charge does a proton have?
A: A positive charge.

Custom Act:

Q: What is a hydrogen atom made of?
A: Hydrogen is the simplest and most abundant element.
A hydrogen atom consists of one proton and one electron.

Sample 2

Input:

Q: What kind of charge does a proton have?
A:

GELU:

Q: What kind of charge does a proton have?
A:
Q: The lightest and most common element.

Custom Act:

Q: What kind of charge does a proton have?
A: A positive charge.
Q: How do hydrogen atoms bond?
A: By sharing their electrons.

Q&A Metrics

Metric          GELU Baseline   Custom Activation
BLEU            0.0117          0.0600
Perplexity      3557.8637       2683.9513
Fitness Score   0.0175          0.0420

Much better structure and relevance came out of the evolved function, even though the base model is tiny.


Why I Used Thinned Data

Since I was using Colab Pro and had limited GPU credits, I used only 50% of the training data during the function search. Then I used the full set for training and evaluation.

Thinning also helped with faster iteration and is probably a decent stand-in for real-world low-resource setups. I plan to post a follow-up on the thinning technique I used, which keeps the overall statistical structure of the data largely intact.


⚠️ Caveats and Limits

  • These are small wins, not breakthroughs

  • Some evolved formulas are long or messy

  • No cross-task generalization yet—I hope others help test this

  • Formal statistical testing wasn’t done yet due to compute/time


What I Think This Means

  • There’s untapped potential in tweaking transformer internals
    e.g., if a tuned activation resembles a complex polynomial, does that imply the model is too simple—or the opposite?

  • By tweaking the fitness-score weighting of BLEU and perplexity, different attributes of the model can be targeted, e.g. emphasizing BLEU over perplexity (see the sketch after this list)

  • GP might be a fun and useful way to automate model tuning

  • Custom activation functions could play a role in building smaller, smarter models, and the evolved formulas could be simplified further mathematically

  • If we penalize complexity, we might even get human-readable math that helps explain models
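Here is a simplified sketch of the kind of weighting I mean. The weights and exact form are placeholders for illustration, not the fitness actually used in my runs.

```python
def fitness(bleu, perplexity, w_bleu=0.7, w_ppl=0.3):
    """Illustrative composite fitness: reward BLEU, penalize perplexity.
    The weights here are placeholders, not the ones from my experiments."""
    return w_bleu * bleu + w_ppl * (1.0 / perplexity)

# Shifting weight toward the BLEU term steers evolution toward translation quality;
# shifting toward the perplexity term favours lower language-modeling uncertainty.
print(fitness(0.0608, 2.2453))   # example call using the epoch-10 T5 numbers above
```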


Open to Collaborate

If you're into transformers, symbolic regression, or just curious about how far this idea can go—I'm totally open to connecting.

A GitHub repo (or shareable notebook files) is in the works. I’d love help testing this on new models and datasets, or scaling it up. Feel free to reach out! My email address is machinesmartsor@gmail.com.




Final Thought

This started as a weekend curiosity. I didn’t expect much, but the numbers kept nudging me to keep going. I hope this work sparks new ideas—or maybe helps someone else take it further.

Even small tweaks can ripple through a model in interesting ways.

Thanks for reading!

Note: The activation function and plot for Experiment 1 (GPT2-like) were removed due to uncertainty in reproducibility. However, the core metrics are retained for transparency. Experiments 2 and 3 remain fully documented.

๐Ÿ” Update: Subsequent tests revealed BLEU scores can fluctuate across training runs due to instability in the learning dynamics. While each trained model gives consistent answers, final BLEU varies without ensemble or selection strategies. This doesn’t invalidate the idea, but highlights the importance of averaging or picking top-performing runs for reliable evaluation.

Shared Notebook

✅ Overview of Shared Experiment 2 Structure

Notebook Filename         Purpose                                               Suggested Title (in .ipynb or blog)
Experiment 2 notebook 1   GP search for best activation function                🔬 Shared Experiment 2 — Custom Activation Search for T5
Experiment 2 notebook 2   T5 training (toggle default vs custom activation)     🧠 Shared Experiment 2 — T5 Training with Optional Custom Function
Experiment 2 notebook 3   Translation tests comparing saved default vs custom   🧪 Shared Experiment 2 — Translation Output Comparison


📘 Glossary of Terms

BLEU Score (Bilingual Evaluation Understudy)
A metric for evaluating the quality of machine-translated text by comparing it to one or more reference translations.
🔹 Higher is better (range: 0 to 1)
🔹 Used widely in NLP translation tasks

Perplexity
A measurement of how well a probability model (like a language model) predicts a sample.
🔹 Lower is better
🔹 Indicates uncertainty: high perplexity = high uncertainty
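As a quick reference, perplexity is just the exponential of the average token-level cross-entropy loss. A generic sketch (not tied to the exact evaluation code used in the experiments):

```python
import torch

# Perplexity from a mean cross-entropy loss (natural-log base).
mean_loss = torch.tensor(0.81)        # example value only
perplexity = torch.exp(mean_loss)     # ~2.25; lower loss -> lower perplexity
```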

Accuracy
The ratio of correct predictions to total predictions.
🔹 In classification: correct labels / total labels
🔹 In generation: may be approximated by matching target tokens (but not ideal for text generation tasks)

Activation Function
A mathematical function applied to neural network outputs to introduce non-linearity, enabling learning of complex patterns.
🔹 Examples: ReLU, GELU, Swish, Sigmoid

Custom Activation (via Genetic Programming)
An evolved mathematical function generated automatically using symbolic expressions to replace standard activation functions.
🔹 Tuned specifically for a task or dataset
🔹 May involve complex formulas like log1p_safe(cos(cos(x)))

Genetic Programming (GP)
An evolutionary algorithm that evolves computer programs or symbolic formulas (like custom activation functions) by simulating natural selection.
🔹 Uses a population of candidate solutions
🔹 Applies crossover, mutation, and selection

Symbolic Regression
A type of regression where the algorithm searches for a mathematical expression that best fits a dataset.
🔹 GP is often used for this
🔹 Produces human-readable formulas

DEAP (Distributed Evolutionary Algorithms in Python)
A Python framework for evolutionary computing, including genetic algorithms and genetic programming.

License and Attribution

License
This work — including all code, data, and experimental results — is licensed under the
Creative Commons Attribution–NonCommercial 4.0 International License (CC BY-NC 4.0).


You are free to:

  • Share — copy and redistribute the material in any medium or format

  • Adapt — remix, transform, and build upon the material

Under the following terms:

  • 🧾 Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made.

  • 🚫 NonCommercial — You may not use the material for commercial purposes.

Attribution Required

Please credit the original author as follows:

Based on the work by Paul KP Fung (https://artisticsciencedream.blogspot.com/)
Licensed under CC BY-NC 4.0

Includes Third-Party Materials:

Tatoeba Dataset
This project uses sentence data from the Tatoeba Project, available under the Creative Commons Attribution 2.0 France (CC BY 2.0 FR) license.
© Tatoeba contributors – Used with permission under license terms.

DEAP (Distributed Evolutionary Algorithms in Python)
This project also uses the DEAP library, which is released under the LGPL 3.0 License.
© DEAP Developers – Used in accordance with the license.

AI Assistance Acknowledgment

Parts of this project were developed with the assistance of ChatGPT, an AI language model by OpenAI.
ChatGPT was used for:

  • Code refactoring and documentation

  • License drafting and formatting

  • Structural suggestions for organizing experiments

  • Statistical sanity checks against reported metrics

While the ideas, experiments, and final implementation are original, AI assistance played a supportive role in improving clarity and productivity.





