Experimenting with Language Models: A Hands-On Journey Link to heading
Quick Take Link to heading
I discovered that building small language models on consumer hardware provides incredible insights into how LLMs actually work. Without specialized equipment, I was able to train models in under 30 minutes and observe how training data quantity dramatically affects output quality. The version of the code I used while writing this post is available at https://github.com/jjshanks/tiny-stories-experiment/blob/9884f73a9f8ba626cfc4d49768f190d4720309d1/train_tiny_stories.py.
Introduction Link to heading
I never fully understood what LLMs were doing under the hood until I watched a tech talk by 3blue1brown that connected these concepts to my undergrad information retrieval background. While information retrieval isn’t exactly the same, the vector/matrix parts felt familiar enough that something finally clicked. After spending time coding with AI assistance, I wanted to understand the parameters that influence model building and usage more deeply.
As a visual/hands-on learner, I needed to create something that would let me experiment with various parameters easily. My primary requirement was simple: I wanted to train a model on my laptop in 30 minutes or less. I wasn’t aiming to build something production-ready—just to gain a better understanding of the fundamentals.
Basic Workflow Link to heading
At a very high level, the major steps I followed for creating a model (using clean datasets) were the following, with a minimal code sketch after the list:
- Load dataset
- Create training and validation subsets
- Tokenize data
- Train model
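Here is a minimal sketch of those four steps, assuming the Hugging Face datasets and transformers libraries and a small GPT-2-style model; the configuration values are illustrative rather than my script’s exact defaults.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    GPT2Config,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
)

# 1. Load dataset
dataset = load_dataset("roneneldan/TinyStories", split="train")

# 2. Create training and validation subsets
dataset = dataset.select(range(50_000))               # cap the number of examples
splits = dataset.train_test_split(test_size=0.1, seed=42)

# 3. Tokenize data
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = splits.map(tokenize, batched=True, remove_columns=["text"])

# 4. Train model (a tiny GPT-2 configuration so it fits on a laptop)
config = GPT2Config(n_layer=4, n_head=4, n_embd=256, vocab_size=tokenizer.vocab_size)
model = GPT2LMHeadModel(config)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```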
Datasets Link to heading
I initially explored various datasets including code and technical writing collections. However, many proved troublesome to set up due to login requirements and other restrictions. For a learning exercise, these complications weren’t worth the effort. I eventually settled on TinyStories, which required no extra steps to use and provided a good general-purpose corpus for learning.
Tokenize Data Link to heading
When I tokenize data, I’m converting raw text into vocabulary units that language models can actually process. It’s similar to parsing sentences into words, but tokens might be smaller than words (like prefixes), larger (common phrases), or handle punctuation and special characters differently. This conversion step is critical because it’s how the model “sees” the text.
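To make that concrete, here is a tiny demonstration of what a GPT-2-style tokenizer (the one I assume in the sketches throughout this post; token boundaries differ between tokenizers) does to a story opening.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Once upon a time, there was a little girl named Lily."
ids = tokenizer.encode(text)

# Common words become single tokens and a leading "Ġ" marks a preceding space,
# so the output looks roughly like ['Once', 'Ġupon', 'Ġa', 'Ġtime', ',', ...];
# rarer words such as "Lily" may be split into several sub-word tokens.
print(tokenizer.convert_ids_to_tokens(ids))
print(f"{len(text.split())} words became {len(ids)} tokens")
```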
Caching Link to heading
Caching quickly became essential for maintaining my sanity during this project. It served three crucial purposes:
- Experimental integrity: When comparing models with different parameters, I needed certainty that the input data was processed identically.
- Environmental considerations: Redundant computation wastes electricity, which adds up over dozens of runs, even on my modest laptop.
- Debugging assistance: Cached checkpoints became invaluable tools for isolating exactly which part of the pipeline caused unexpected behaviors.
The caching strategy broke down into these areas (with a rough sketch of the two biggest wins after the list):
- Training and validation datasets: A minor win, but for larger runs that used filtering, this avoided redoing the same work.
- Tokenizing datasets: A medium win; tokenization isn’t the fastest operation, and it takes enough time that avoiding retokenizing already-processed datasets made things run more smoothly.
- Model checkpoints (epochs): A huge win! Training is the most expensive part and scales with the number of epochs. For example, training on 50,000 examples might take 10 minutes per epoch, so three epochs would take about 30 minutes. If I initially trained with 2 epochs and then wanted to compare to a 3-epoch model, I’d spend 50 minutes total on training. By caching checkpoints after each epoch, I could start the 3-epoch training from the 2nd checkpoint, waiting only 10 minutes for results while spending just 30 minutes overall on training. This made iteration much faster.
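The sketch below continues the workflow code above (reusing its splits, tokenize, model, and tokenizer); the cache paths, cache keys, and resume logic are assumptions for illustration, not necessarily how my script organizes them.

```python
import os

from datasets import load_from_disk
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments
from transformers.trainer_utils import get_last_checkpoint

# Tokenized-dataset cache: key it on the settings that affect preprocessing so
# that runs with identical parameters are guaranteed to see identical inputs.
num_examples, max_length = 50_000, 256
token_cache = f"cache/tokenized_{num_examples}_{max_length}"
if os.path.isdir(token_cache):
    tokenized = load_from_disk(token_cache)
else:
    tokenized = splits.map(tokenize, batched=True, remove_columns=["text"])
    tokenized.save_to_disk(token_cache)

# Checkpoint cache: save after every epoch and resume from the newest checkpoint,
# so extending a 2-epoch run to 3 epochs only costs one more epoch of training.
ckpt_dir = f"checkpoints/{num_examples}_examples"
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir=ckpt_dir, num_train_epochs=3, save_strategy="epoch"),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
last_ckpt = get_last_checkpoint(ckpt_dir) if os.path.isdir(ckpt_dir) else None
trainer.train(resume_from_checkpoint=last_ckpt)
```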
Bias Link to heading
One major area I wanted to explore was how different datasets impacted output. For instance, if I only train on examples containing the word “boy” but make my prompt about a girl, how does the output compare to models trained without filtering or on examples filtered for “girl”?
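With the datasets library this kind of filtering is nearly a one-liner; the sketch below assumes the TinyStories text lives in a field called "text" and that the rest of the pipeline is unchanged.

```python
from datasets import load_dataset

dataset = load_dataset("roneneldan/TinyStories", split="train")

# Keep only stories that mention "boy"; train on this subset and compare its
# outputs against an unfiltered model or one filtered on "girl" instead.
boy_only = dataset.filter(lambda example: "boy" in example["text"].lower())
print(f"{len(dataset)} examples -> {len(boy_only)} after filtering")
```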
I believe understanding how model inputs affect outputs is critical for truly grasping how models work and can inform how to use them effectively.
Other Parameters Link to heading
There are so many inputs to the system that I ended up with about 20 different parameters for my script, each potentially impacting the output. I established reasonable defaults aligned with my goal of running on consumer hardware in under 30 minutes. So far, I’ve primarily focused on `num_examples` and `epochs` to build a foundation of understanding.
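For reference, those two parameters look roughly like this in an argparse-based script; the names match what I describe above, but the defaults shown here are illustrative rather than my script’s exact values.

```python
import argparse

# A small slice of the roughly 20 command-line parameters; the defaults are
# illustrative, chosen to fit the laptop-in-under-30-minutes constraint.
parser = argparse.ArgumentParser(description="Train a tiny story model")
parser.add_argument("--num_examples", type=int, default=50_000,
                    help="how many training examples to sample from the dataset")
parser.add_argument("--epochs", type=int, default=3,
                    help="number of passes over the training data")
parser.add_argument("--max_length", type=int, default=256,
                    help="maximum tokens per example after tokenization")
args = parser.parse_args()
```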
Initial Results Link to heading
Prompt: Once upon a time, there was a little
Examples | Epochs | Result |
---|---|---|
500 | 1 | Once upon a time, there was a little girlloaded cloud was was so, " tied€, sang sang but, a |
500 | 2 | Once upon a time, there was a little girl and cloud. Hicks back insideapped but inside smile strange when inside the go |
500 | 3 | Once upon a time, there was a little though to. message the and the. had a to the but but it. |
5000 | 1 | Once upon a time, there was a little girl named in the, she was. She was’s so it. He was |
5000 | 2 | Once upon a time, there was a little girl named Lily. She. He wanted to a big little boy. It was |
5000 | 3 | Once upon a time, there was a little girl named Lily. She loved to play with her mommy and they was it |
50000 | 1 | Once upon a time, there was a little boy named Timmy. Tim for a big house in the garden. The little |
50000 | 2 | Once upon a time, there was a little girl named Lily. She loved to play outside in the sky. One day, |
50000 | 3 | Once upon a time, there was a little girl named Lily. She loved to play outside with her friends. One day, |
500000 | 1 | Once upon a time, there was a little girl named Lily. She loved to play with her toys and make a big mess |
500000 | 2 | Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, |
500000 | 3 | Once upon a time, there was a little girl named Lily. She loved to play outside and run around. One day, |
I noticed how models trained with only 500 examples produced mostly incoherent text even after 3 epochs, while models with 50,000+ examples created grammatically correct sentences with consistent character names.
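For completeness, here is a minimal sketch of how continuations like these can be generated with transformers; the model path is hypothetical (it assumes the trained model was saved with trainer.save_model()), and the sampling settings are assumptions rather than my script’s exact values.

```python
from transformers import AutoTokenizer, GPT2LMHeadModel

# Hypothetical path to a model saved after training, e.g. via trainer.save_model().
model = GPT2LMHeadModel.from_pretrained("models/tiny_stories_50000")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "Once upon a time, there was a little"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a continuation; max_new_tokens controls how far past the prompt the
# model writes, which is the main difference between these short results and
# the longer generations in the next section.
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```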
Longer Generation Link to heading
Examples | Epochs | Result |
---|---|---|
50000 | 1 | Once upon a time, there was a little girl named Lily. She loved to play in the park. One day, she was very very excited to go on the grass. It was so happy. One day the garden was walking in a |
50000 | 2 | Once upon a time, there was a little girl named Lily. She loved to play outside in the park. One day, she decided to help her mommy to go to the garden. Her mom said, “What are you like it?” Her |
50000 | 3 | Once upon a time, there was a little girl named Lily. She loved to play outside and play with her friends. One day, Lily’s mom gave her a big hug and gave it to her. Lily was very happy because she loved it. |
500000 | 1 | Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, green flower on the ground. It was so pretty and had many colors. Lily asked |
500000 | 2 | Once upon a time, there was a little boy named Timmy. Tim had a toy car that he loved to play with every day. One day, Tim went to the park to find a new toy. He saw a boy crying and didn’t |
500000 | 3 | Once upon a time, there was a little girl named Lily. She loved to play outside and explore the world around her. One day, she saw a big, tall tree with many branches. Lily wanted to climb the tree, |
Longer Prompt Link to heading
Examples | Epochs | Result |
---|---|---|
50000 | 1 | Paul and Susan went to the park to meet up with Phil. It was a nice summer day and they didn’t need to be home until dinner. But then, they were playing with a big house. He wanted to see it, a bit careful. The little girl felt so excited and said, “It’s find a lot of the |
50000 | 2 | Paul and Susan went to the park to meet up with Phil. It was a nice summer day and they didn’t need to be home until dinner. They went outside to play and see what they heard a long time. No, what’s too!” said. “What is that? Jack asked, “No! I can I |
50000 | 3 | Paul and Susan went to the park to meet up with Phil. It was a nice summer day and they didn’t need to be home until dinner. They were very excited to eat something special. Let’s go for a walk!” she said. “Don’t worry, I will be careful with you, Mum said, |
500000 | 1 | Paul and Susan went to the park to meet up with Phil. It was a nice summer day and they didn’t need to be home until dinner. When they got there, they saw a big tree with a shiny red coat. They wanted to climb it, but they couldn’t. He got closer and closer, and there was nothing. |
500000 | 2 | Paul and Susan went to the park to meet up with Phil. It was a nice summer day and they didn’t need to be home until dinner. Paul saw a big tree with a tall branch. He asked Susan, “What is that? Jimmy said, It’s a tree. Let’s see it! Maria said |
500000 | 3 | Paul and Susan went to the park to meet up with Phil. It was a nice summer day and they didn’t need to be home until dinner. Mama had a surprise for Paul and she gave him a big hug. He said he had to stay close to her and take a nap. icky looked around, but couldn |
These outputs tell a clear story about how models develop with more training data. The progression is almost like watching a child learn language: at 500 examples, the model barely strings together related words. By 5,000, it’s forming simple sentences and consistently using character names like “Lily.” The 50,000 example mark is when things get interesting - suddenly there’s narrative structure, with actions following logically from introductions. And at 500,000 examples, we see surprisingly sophisticated elements emerge: descriptive settings, character motivations, and coherent multi-sentence narratives. It’s remarkable how quantitative changes in training data lead to qualitative shifts in capability.
Conclusion Link to heading
Working through these experiments has solidified my understanding of language model training in ways that reading papers never could. I found it remarkable that by manipulating just two variables - training examples and epochs - I could observe such dramatic shifts in output quality. The progression from incoherent text to structured narratives wasn’t just a matter of improvement, but qualitative leaps at specific thresholds. This hands-on experience has given me a much more intuitive feel for the relationship between data volume, training time, and model capability - connections I understood theoretically before, but now comprehend on a deeper level after seeing the actual outputs change before my eyes.
Next Steps Link to heading
I really want to dig into bias next, and then explore more of the details that shape model creation, like dimensions, layers, and learning rate, followed by generation parameters like temperature and top-p.