Month of AI
1. Introduction
Generative AI models have been on fire lately. Proponents peddle them as
the next internet-scale revolution. Startups and big tech companies
can’t resist the urge to throw more money at them. Most of this has
given me FOMO over AI. Perhaps there is something I don’t
understand. In my latest attempt, I plugged in models from
Huggingface, which reminded me of a more modern but janky scikit-learn.
2. Background
Round 1: In college (around 2015-2016) I messed around with some basic
machine learning. I implemented neural network inference and training
in C++/Python/Julia. I plugged in several models from scikit-learn. I
bred feature selectors using genetic algorithms. I submitted mediocre
models to Kaggle, an online machine learning contest platform. I even
skimmed the top placements’ creative solutions. It was fun and all, but
that was mostly the end of my machine learning explorations.
Round 2: Over the years I didn’t use much machine learning. If
anything, I’ve been looking for ways to reduce the technological
distractions around me. When GitHub Copilot came out for technical
preview around mid-2021, I signed up immediately. My expectations were
low, but I was curious to see how well it would do, and I came away
impressed by the technical achievement. As a tool though, I found the
code generation somewhat distracting (expected), but it was great at
generating docstrings on all my public structs and functions, a
pleasant surprise.
2.1. Round 3: What I Did
I delved into AI for a month or so.
I used Huggingface’s inference API to run some models. The inference API (documentation) makes it fast and easy to run models, even on potato hardware. However, not all model types are available through the inference API; I especially wanted to try the text-to-vector models. After reading the docs for downloading and running models locally, I managed to slap together a half-baked search engine. The final result of vector search was unimpressive.
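For flavor, here is a minimal sketch of the kind of local text-to-vector search I cobbled together. The sentence-transformers library and the specific model name are stand-ins here, not necessarily what I used:

```python
# A minimal sketch of a sentence-to-vec search index; assumes the
# sentence-transformers package is installed. The model name is a
# stand-in, not necessarily the one I used.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

docs = [
    "How to install Arch Linux on a Raspberry Pi",
    "Quantizing large language models down to 4 bits",
    "A recipe for sourdough bread",
]

# Embed the corpus once; normalized vectors make dot product == cosine.
doc_vecs = model.encode(docs, normalize_embeddings=True)

def search(query: str, k: int = 2):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q
    return [(float(scores[i]), docs[i]) for i in np.argsort(-scores)[:k]]

print(search("running llms on small hardware"))
```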
The next stop was running LLMs. I sudo pacman -S ollama my way into
running the Phi-3 mini model. The model ran quickly and with a small
memory footprint thanks to Q4 quantization, which crams 16 bits of
information into just 4 bits. This allowed me to run Phi-3 mini even on
a Raspberry Pi 5. Next, I ran sudo pacman -S ollama-rocm and the models
ran faster on my GPU. I really did not expect this to work at all. At
the largest, I was able to run a variant of the Codestral 22B model.
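Once ollama is up, poking at a model programmatically is simple too. A sketch against ollama’s REST API, assuming the server is listening on its default port (11434) and the phi3 model has been pulled:

```python
# Querying a local ollama server; assumes it is listening on the default
# port (11434) and that `ollama pull phi3` has been run.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "phi3", "prompt": "Why is the sky blue?", "stream": False},
    timeout=300,
)
print(resp.json()["response"])
```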
2.2. Takeaways
2.2.1. LLMs Can Run On Lots of Hardware
I purchased an AMD card last year expecting to do no ML work on it. When models were able to run on the GPU, they ran pretty well. I was able to run a Q4-quantized Llama 3 8B model at 80 tokens/second; this is plenty fast.
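Some napkin math on why Q4 makes an 8B model comfortable on a consumer card:

```python
# Rough weight-memory footprint of an 8B-parameter model, ignoring
# overhead such as the KV cache and per-block quantization scales.
params = 8e9
for bits in (16, 4):
    gib = params * bits / 8 / 2**30
    print(f"{bits:>2}-bit weights: ~{gib:.1f} GiB")
# 16-bit weights: ~14.9 GiB
#  4-bit weights: ~3.7 GiB
```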
I was also able to run Phi-3 mini on the Raspberry Pi 5. Although the speeds were not acceptable for interactive use cases, the Pi 5 is usable for some small-scale offline processing.
2.2.2. Running Models is Easyish
Although some of the APIs are ugly, HuggingFace has succeeded at democratizing models. It’s easy to find models and use them out of the box. If even that’s too hard, plenty of companies provide API endpoints that offer ML as a service.
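“Out of the box” is barely an exaggeration. A sketch with the transformers library, which even picks a default model for the task if you don’t name one (assumes transformers and a torch backend are installed):

```python
from transformers import pipeline

# Downloads a default model for the task on first run, then just works.
classifier = pipeline("sentiment-analysis")
print(classifier("Running models locally is easier than I expected."))
```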
2.2.3. Model Documentation is Trash
> MumboJumbo uses FooBar and Baz to reach SOTA performance on the FizzBuzz eval and reaches a Big O(3.14) on Wumpus.
Most HuggingFace model pages are terrible, especially if you aren’t constantly following the ML space. At the very least, HuggingFace forces a category onto models, such as text-to-speech or text-generation, but here is my wishlist for a model card:
- Performance characteristics
- Model size
- Demo
- Brief description
- Caveats
- Special tokens (like FIM_PREFIX)
2.2.4. Text Vectorization Alone is Crap
Listen to any layman interview and you will hear someone jump to the brilliant idea that ML must just produce the best search. In practice, my search engine fell short with sentence-to-vec alone. I got good improvements by adding keyword search plus classical natural language processing techniques such as synonyms and stemming. A more insightful discussion of vector embeddings for search can be found in Lex Fridman’s interview with Perplexity’s CEO.
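To make that concrete, here is a toy sketch of the blend that worked better for me: a keyword-overlap score (a crude stand-in for real keyword search like BM25 with stemming and synonyms) mixed with embedding similarity. The embed() stub and the 50/50 weighting are illustrative, not what I actually shipped:

```python
import numpy as np

def keyword_score(query: str, doc: str) -> float:
    # Crude keyword overlap; real keyword search (e.g. BM25) would weight
    # terms by rarity and apply stemming/synonyms.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def embed(text: str) -> np.ndarray:
    # Stand-in for a real sentence embedding model.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(8)
    return v / np.linalg.norm(v)

def hybrid_score(query: str, doc: str, alpha: float = 0.5) -> float:
    # Blend keyword overlap with cosine similarity between embeddings.
    cosine = float(embed(query) @ embed(doc))
    return alpha * keyword_score(query, doc) + (1 - alpha) * cosine

docs = ["stemming and synonyms for search", "vector embeddings for search"]
print(sorted(docs, key=lambda d: -hybrid_score("vector search", d)))
```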