Evolving Activation Functions: A Personal Exploration with Transformers

⚠️ Disclaimer (Please Read First)

This blog post presents a personal, exploratory experiment using publicly available tools and datasets (e.g., Tatoeba, DEAP, Hugging Face Transformers). It is not a lab-verified study, and results have not undergone peer review or formal statistical validation.

The findings, interpretations, and conclusions shared here are based on limited-scale experiments using Google Colab and consumer-level hardware. They are intended for educational and exploratory purposes only.

There is no guarantee on the accuracy, stability, or reproducibility of the experimental results. Any interpretations or applications are entirely at the reader’s discretion.

Readers are encouraged to replicate, adapt, or challenge the outcomes in more rigorous or production-grade environments.

We often tweak our models by adjusting the data, trying new optimizers, or changing the architecture—but how often do we think about activation functions? This project started as a curiosity: what if I used genetic programming to evolve better activation functions for transformers?

I'm not coming from a lab or a team with massive resources. Just working with Colab GPUs, a lot of tinkering, and DEAP for symbolic regression. Along the way, I ran three experiments:

  1. A GPT2-like model fine-tuned for translation

  2. A T5-small model also for translation

  3. A TinyGPT model trained on simple Q&A data

Surprisingly, I found a few activation functions that consistently beat the default GELU in terms of BLEU score, perplexity, and sometimes even accuracy.


Why Even Touch Activation Functions?

Most people stick with ReLU, GELU, or Swish because they work "well enough." But I was curious: could a model perform better if its nonlinearity was custom-fit to the task or dataset?

So I used DEAP to evolve symbolic expressions for activation functions. Here’s an example of the primitive set I used:

Primitive Set Used

  • Unary functions: log1p_safe, sin, cos, sigmoid_boost

  • Binary functions: safe_add, safe_sub, safe_mul, safe_div

  • Constants: 1.0

Each tree built from these was plugged into PyTorch models in place of GELU.
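To make this concrete, here is a minimal sketch of how such a primitive set can be registered in DEAP. Only the primitive names come from the list above; the safe_* guards and the sigmoid_boost definition are my assumptions, and in practice the compiled expression would be wired to torch ops so it works element-wise on tensors.

```python
import math
from deap import gp

# Assumed "safe" wrappers -- stand-ins for whatever guards the real runs used.
def safe_div(a, b):
    return a / b if abs(b) > 1e-6 else 1.0

def log1p_safe(x):
    return math.log1p(abs(x))

def sigmoid_boost(x):
    return 2.0 / (1.0 + math.exp(-min(max(x, -60.0), 60.0)))  # clipped to avoid overflow

pset = gp.PrimitiveSet("ACT", 1)          # one argument: the pre-activation value
pset.renameArguments(ARG0="x")
pset.addPrimitive(lambda a, b: a + b, 2, name="safe_add")
pset.addPrimitive(lambda a, b: a - b, 2, name="safe_sub")
pset.addPrimitive(lambda a, b: a * b, 2, name="safe_mul")
pset.addPrimitive(safe_div, 2, name="safe_div")
pset.addPrimitive(log1p_safe, 1)
pset.addPrimitive(math.sin, 1, name="sin")
pset.addPrimitive(math.cos, 1, name="cos")
pset.addPrimitive(sigmoid_boost, 1)
pset.addTerminal(1.0)

# A random tree from this set compiles into an ordinary Python callable,
# which can then be wrapped in an nn.Module and applied element-wise.
expr = gp.genHalfAndHalf(pset, min_=1, max_=3)
act_fn = gp.compile(gp.PrimitiveTree(expr), pset)
```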

How I Tried to Be Fair

I wanted a fair comparison, so:

  • I used the default Hugging Face model settings

  • Same seeds, same learning rate, same data splits

  • No hand-picking results—I repeated runs and only kept results that stayed consistent

I couldn’t do full statistical testing (compute is limited), but I hope others can help repeat or verify the work.
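For anyone repeating this, here is a minimal sketch of those controls: fixed seeds plus averaging over repeated runs. The train_and_eval function is hypothetical, standing in for one full training and evaluation pass.

```python
import numpy as np
from transformers import set_seed

def average_over_runs(train_and_eval, seeds=(42, 43, 44)):
    """Repeat training with a few fixed seeds and average the metric dicts,
    instead of hand-picking a single favourable run."""
    results = []
    for seed in seeds:
        set_seed(seed)                    # seeds Python, NumPy, and PyTorch together
        results.append(train_and_eval())  # hypothetical: returns e.g. {"bleu": ..., "ppl": ...}
    return {k: float(np.mean([r[k] for r in results])) for k in results[0]}
```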

Experiment 1: GPT2-like Model for Translation

  • Model: Tiny GPT2 (2-layer, 128-dim)

  • Data: 50k sentence pairs from Tatoeba

  • Eval: BLEU, perplexity, accuracy
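Roughly how an evolved function gets dropped into a model like this (the formula inside EvolvedAct is purely illustrative, since the Experiment 1 activation was later withdrawn; see the note at the end of the post):

```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel

class EvolvedAct(nn.Module):
    """Wrapper around a compiled GP expression (the formula here is illustrative only)."""
    def forward(self, x):
        return torch.log1p(torch.abs(x)) * torch.sigmoid(x)

config = GPT2Config(n_layer=2, n_embd=128, n_head=4)
model = GPT2LMHeadModel(config)

# Swap GELU for the evolved activation in every block's MLP.
for block in model.transformer.h:
    block.mlp.act = EvolvedAct()
```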

Metric        GELU      Custom #1   Custom #2
BLEU          0.0185    0.0188      0.0174
Perplexity    2.2471    2.2321      2.2398
Accuracy      82.45%    83.12%      82.76%




The gains are small but consistent. I ran the evolution on just 50% of the data to save time, then trained the top function on the full set.


Experiment 2: T5-Small

  • Model: T5-small (encoder-decoder)

  • Custom Activation: log1p_safe(safe_cos(gated_gate((safe_tan(softsign(ARG0))), ARG0))) (see the sketch after this list)

  • Data: Same 50k sentence pairs

  • Eval: BLEU, perplexity, training loss
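Here is how I read that expression as a PyTorch module. The gated_gate, safe_tan, and log1p_safe/safe_cos guards below are my assumptions (the exact guarded versions aren't published); only the structure of the formula comes from the bullet above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import T5ForConditionalGeneration

def gated_gate(a, b):
    return torch.sigmoid(a) * b                      # assumed: sigmoid gate on the first input

def safe_tan(x):
    return torch.tan(torch.clamp(x, -1.5, 1.5))      # assumed: keep clear of tan's poles

class EvolvedT5Act(nn.Module):
    """log1p_safe(safe_cos(gated_gate(safe_tan(softsign(x)), x))) under the assumptions above."""
    def forward(self, x):
        inner = gated_gate(safe_tan(F.softsign(x)), x)
        return torch.log1p(torch.abs(torch.cos(inner)))

model = T5ForConditionalGeneration.from_pretrained("t5-small")
# Replace the feed-forward activation in every encoder and decoder block.
for block in list(model.encoder.block) + list(model.decoder.block):
    block.layer[-1].DenseReluDense.act = EvolvedT5Act()
```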

Epoch   Loss (GELU)   Loss (Custom)   BLEU (GELU)   BLEU (Custom)   PPL (GELU)   PPL (Custom)
10      0.8872        0.8593          0.0599        0.0608          2.2584       2.2453
11      0.8471        0.8318          0.0601        0.0614          2.2121       2.1957
12      0.8016        0.7951          0.0620        0.0632          2.1679       2.1485


These small improvements feel promising. Even a fraction of a BLEU point or a few hundredths of perplexity can make a difference over time.


Experiment 3: TinyGPT on Constrained Q&A

  • Custom Activation: gated_gate(safe_sin(relu_like(sigmoid_like(ARG0))), gelu_like(ARG0))





(Figure: plot of the evolved activation used in the TinyGPT experiment.)
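A rough way to reproduce that plot. The readings of the primitives are my assumptions: sigmoid_like as sigmoid, relu_like as ReLU, safe_sin as sin, gelu_like as GELU, and gated_gate(a, b) as sigmoid(a) * b.

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def evolved_tinygpt_act(x):
    # gated_gate(safe_sin(relu_like(sigmoid_like(x))), gelu_like(x)), under the assumed readings
    gate = torch.sin(F.relu(torch.sigmoid(x)))
    return torch.sigmoid(gate) * F.gelu(x)

x = torch.linspace(-6, 6, 400)
plt.plot(x, evolved_tinygpt_act(x), label="evolved")
plt.plot(x, F.gelu(x), linestyle="--", label="GELU")
plt.title("Evolved TinyGPT activation vs. GELU")
plt.legend()
plt.show()
```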

This one was fun to watch. I trained a GPT-style model on structured Q&A examples. Here are some sample predictions:

Sample 1

Input:

Q: What is a hydrogen atom made of?
A:

GELU:

Q: What is a hydrogen atom made of?
A:
Q: What kind of charge does a proton have?
A: A positive charge.

Custom Act:

Q: What is a hydrogen atom made of?
A: Hydrogen is the simplest and most abundant element.
A hydrogen atom consists of one proton and one electron.

Sample 2

Input:

Q: What kind of charge does a proton have?
A:

GELU:

Q: What kind of charge does a proton have?
A:
Q: The lightest and most common element.

Custom Act:

Q: What kind of charge does a proton have?
A: A positive charge.
Q: How do hydrogen atoms bond?
A: By sharing their electrons.

Q&A Metrics

Metric          GELU Baseline   Custom Activation
BLEU            0.0117          0.0600
Perplexity      3557.8637       2683.9513
Fitness Score   0.0175          0.0420

Much better structure and relevance came out of the evolved function, even though the base model is tiny.


Why I Used Thinned Data

Since I was using Colab Pro and had limited GPU credits, I used only 50% of the training data during the function search. Then I used the full set for training and evaluation.

Thinning also helped with faster iteration and is probably a decent stand-in for real-world low-resource setups. I plan to post a follow-up on the thinning technique I used, which keeps the overall statistical structure of the data largely intact.


⚠️ Caveats and Limits

  • These are small wins, not breakthroughs

  • Some evolved formulas are long or messy

  • No cross-task generalization yet—I hope others help test this

  • Formal statistical testing wasn’t done yet due to compute/time


What I Think This Means

  • There’s untapped potential in tweaking transformer internals
    e.g., if a tuned activation resembles a complex polynomial, does that imply the model is too simple—or the opposite?

  • By tweaking the fitness-score weighting of BLEU and perplexity, different attributes of the model can be targeted, e.g. emphasizing BLEU over perplexity (see the sketch after this list)

  • GP might be a fun and useful way to automate model tuning

  • Custom activation functions could play a role in building smaller, smarter models, and the evolved formulas could be simplified further mathematically

  • If we penalize complexity, we might even get human-readable math that helps explain models
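Here is a simplified sketch of the kind of weighting I mean. The weights and exact form are placeholders for illustration, not the fitness actually used in my runs.

```python
def fitness(bleu, perplexity, w_bleu=0.7, w_ppl=0.3):
    """Illustrative composite fitness: reward BLEU, penalize perplexity.
    The weights here are placeholders, not the ones from my experiments."""
    return w_bleu * bleu + w_ppl * (1.0 / perplexity)

# Shifting weight toward the BLEU term steers evolution toward translation quality;
# shifting toward the perplexity term favours lower language-modeling uncertainty.
print(fitness(0.0608, 2.2453))   # example call using the epoch-10 T5 numbers above
```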


Open to Collaborate

If you're into transformers, symbolic regression, or just curious about how far this idea can go—I'm totally open to connecting.

A GitHub repo (or shareable notebook files) is in the works. I’d love help testing this on new models and datasets, or scaling it up. Feel free to reach out! My email address is machinesmartsor@gmail.com.




Final Thought

This started as a weekend curiosity. I didn’t expect much, but the numbers kept nudging me to keep going. I hope this work sparks new ideas—or maybe helps someone else take it further.

Even small tweaks can ripple through a model in interesting ways.

Thanks for reading!

Note: The activation function and plot for Experiment 1 (GPT2-like) were removed due to uncertainty in reproducibility. However, the core metrics are retained for transparency. Experiments 2 and 3 remain fully documented.

๐Ÿ” Update: Subsequent tests revealed BLEU scores can fluctuate across training runs due to instability in the learning dynamics. While each trained model gives consistent answers, final BLEU varies without ensemble or selection strategies. This doesn’t invalidate the idea, but highlights the importance of averaging or picking top-performing runs for reliable evaluation.

Shared Notebook

✅ Overview of Shared Experiment 2 Structure

Notebook Filename         Purpose                                               Suggested Title (in .ipynb or blog)
Experiment 2 notebook 1   GP search for best activation function                🔬 Shared Experiment 2 — Custom Activation Search for T5
Experiment 2 notebook 2   T5 training (toggle default vs custom activation)     🧠 Shared Experiment 2 — T5 Training with Optional Custom Function
Experiment 2 notebook 3   Translation tests comparing saved default vs custom   🧪 Shared Experiment 2 — Translation Output Comparison


📘 Glossary of Terms

BLEU Score (Bilingual Evaluation Understudy)
A metric for evaluating the quality of machine-translated text by comparing it to one or more reference translations.
🔹 Higher is better (range: 0 to 1)
🔹 Used widely in NLP translation tasks

Perplexity
A measurement of how well a probability model (like a language model) predicts a sample.
🔹 Lower is better
🔹 Indicates uncertainty: high perplexity = high uncertainty
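As a quick reference, perplexity is just the exponential of the average token-level cross-entropy loss. A generic sketch (not tied to the exact evaluation code used in the experiments):

```python
import torch

# Perplexity from a mean cross-entropy loss (natural-log base).
mean_loss = torch.tensor(0.81)        # example value only
perplexity = torch.exp(mean_loss)     # ~2.25; lower loss -> lower perplexity
```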

Accuracy
The ratio of correct predictions to total predictions.
🔹 In classification: correct labels / total labels
🔹 In generation: may be approximated by matching target tokens (but not ideal for text generation tasks)

Activation Function
A mathematical function applied to neural network outputs to introduce non-linearity, enabling learning of complex patterns.
🔹 Examples: ReLU, GELU, Swish, Sigmoid

Custom Activation (via Genetic Programming)
An evolved mathematical function generated automatically using symbolic expressions to replace standard activation functions.
🔹 Tuned specifically for a task or dataset
🔹 May involve complex formulas like log1p_safe(cos(cos(x)))

Genetic Programming (GP)
An evolutionary algorithm that evolves computer programs or symbolic formulas (like custom activation functions) by simulating natural selection.
🔹 Uses a population of candidate solutions
🔹 Applies crossover, mutation, and selection

Symbolic Regression
A type of regression where the algorithm searches for a mathematical expression that best fits a dataset.
🔹 GP is often used for this
🔹 Produces human-readable formulas

DEAP (Distributed Evolutionary Algorithms in Python)
A Python framework for evolutionary computing, including genetic algorithms and genetic programming.

License and Attribution

License
This work — including all code, data, and experimental results — is licensed under the
Creative Commons Attribution–NonCommercial 4.0 International License (CC BY-NC 4.0).


You are free to:

  • Share — copy and redistribute the material in any medium or format

  • Adapt — remix, transform, and build upon the material

Under the following terms:

  • 🧾 Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made.

  • 🚫 NonCommercial — You may not use the material for commercial purposes.

Attribution Required

Please credit the original author as follows:

Based on the work by Paul KP Fung (https://artisticsciencedream.blogspot.com/)
Licensed under CC BY-NC 4.0

Includes Third-Party Materials:

Tatoeba Dataset
This project uses sentence data from the Tatoeba Project, available under the Creative Commons Attribution 2.0 France (CC BY 2.0 FR) license.
© Tatoeba contributors – Used with permission under license terms.

DEAP (Distributed Evolutionary Algorithms in Python)
This project also uses the DEAP library, which is released under the LGPL 3.0 License.
© DEAP Developers – Used in accordance with the license.

AI Assistance Acknowledgment

Parts of this project were developed with the assistance of ChatGPT, an AI language model by OpenAI.
ChatGPT was used for:

  • Code refactoring and documentation

  • License drafting and formatting

  • Structural suggestions for organizing experiments

  • Statistical sanity checks against reported metrics

While the ideas, experiments, and final implementation are original, AI assistance played a supportive role in improving clarity and productivity.





