Introduction

Generative AI models have been on fire lately. Proponents peddle them as the next “internet”-level revolution, and startups and big tech companies can’t resist the urge to throw more money at them. All of this has given me FOMO over AI: perhaps there is something I don’t understand. In my latest attempt to find out, I plugged in models from Huggingface, which reminded me of a more modern but janky scikit-learn.

Background

Round 1: In college (around 2015–2016) I messed around with some basic machine learning. I implemented neural network inference and training in C++/Python/Julia. I plugged in several models from scikit-learn. I bred feature selectors using genetic algorithms. I submitted mediocre models to Kaggle, an online machine learning contest platform, and even skimmed the creative solutions of the top placements. It was fun and all, but that was mostly the end of my machine learning explorations.

Round 2: Over the years I didn’t use much machine learning. If anything, I’ve been looking for ways to reduce the technological distractions around me. When GitHub Copilot came out for technical preview around mid-2021, I signed up immediately. My expectations were low, but I was curious to see how well it would do, and I came away impressed by GitHub Copilot’s technical achievement. As a tool, though, I found the code generation somewhat distracting (expected), but it was great at generating docstrings on all my public structs and functions, an unexpected surprise.

Round 3: What I Did

I delved into AI for a month or so.

I used Huggingface’s inference API to run some models. The inference API (documentation) makes it fast and easy to run models, even on potato hardware. However, not all model types are available through the inference API, and I especially wanted to try the text-to-vector models. After reading the docs on downloading and running models locally, I managed to slap together a half-baked search engine. The final vector search results were unimpressive.
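
The sentence-to-vec part of such a search engine amounts to a standard pattern: embed the documents and the query, then rank by cosine similarity. A minimal sketch of the idea, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model (an illustrative choice, not necessarily the exact model I used):

    # Minimal vector-search sketch using a locally downloaded sentence-transformer.
    # Assumes: pip install sentence-transformers; the model is fetched on first run.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    docs = [
        "How to install packages on Arch Linux with pacman",
        "Baking sourdough bread at home",
        "Running small language models on a Raspberry Pi",
    ]
    doc_vecs = model.encode(docs, convert_to_tensor=True, normalize_embeddings=True)

    query_vec = model.encode("run an LLM on a Pi", convert_to_tensor=True,
                             normalize_embeddings=True)
    # With normalized vectors, cosine similarity is just a dot product.
    scores = util.cos_sim(query_vec, doc_vecs)[0]
    for doc, score in sorted(zip(docs, scores.tolist()), key=lambda x: -x[1]):
        print(f"{score:.3f}  {doc}")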

The next stop was running LLMs. I sudo pacman -S ollama my way into running the Phi-3 mini model. The model ran quickly and with a small memory footprint thanks to Q4 quantization cramming 16 bits of information into just 4 bits. This breakthrough allowed me to run Phi-3 mini even on a Raspberry Pi 5. Next, I ran sudo pacman -S ollama-rocm and the models ran even faster on my GPU; I really did not expect this to work at all. At the largest, I was able to run a variant of the Codestral 22B model.
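
For the curious, ollama also serves a local HTTP API (on port 11434 by default), so scripting against Phi-3 mini is mostly plumbing. A rough sketch, assuming the ollama service is running and the phi3 model has already been pulled:

    # Query a local ollama server. Assumes the service is running and
    # `ollama pull phi3` has been done; the endpoint and fields follow
    # ollama's /api/generate interface.
    import json
    import urllib.request

    payload = json.dumps({
        "model": "phi3",
        "prompt": "Explain Q4 quantization in one sentence.",
        "stream": False,  # one JSON response instead of a token stream
    }).encode()

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])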

Takeaways

LLMs Can Run On Lots of Hardware

I purchased an AMD card last year expecting to do no ML work on it. When models were able to run on the GPU, they ran pretty well. I was able to run a Q4-quantized Llama 3 8B model at 80 tokens/second; this is plenty fast.

I was also able to run Phi-3 mini on the Raspberry Pi 5. Although the speeds were not acceptable for interactive use cases, the Pi 5 is usable for some small-scale offline processing.

Running Models is Easyish

Although some of the APIs are ugly, HuggingFace has succeeded at democratizing models. It’s easy to find models and use them out of the box. If even that’s too hard, plenty of companies provide an API endpoint that offers ML as a service.
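
“Out of the box” really does mean a handful of lines with the transformers pipeline API. For example (the model name here is just a common default for sentiment analysis, not a recommendation):

    # Run a sentiment-analysis model pulled from the Hugging Face Hub.
    # Assumes: pip install transformers torch; the model downloads on first use.
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis",
                          model="distilbert-base-uncased-finetuned-sst-2-english")
    print(classifier("Running models locally turned out to be surprisingly painless."))
    # [{'label': 'POSITIVE', 'score': 0.99...}]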

Model Documentation is Trash

    MumboJumbo uses FooBar and Baz to reach SOTA performance on the FizzBuzz eval and reaches a Big O(3.14) on Wumpus.

Most HuggingFace model pages are terrible, especially if you aren’t constantly following the ML space. At the very least, HuggingFace forces a category onto every model, such as text-to-speech or text-generation, but here is my wishlist for a model card:

  • Performance characteristics
  • Model size
  • Demo
  • Brief description
  • Caveats
  • Special tokens (like FIM_PREFIX)

Text Vectorization Alone is Crap

Listen to any layman interview and you will hear someone jump to the brilliant idea that ML must surely produce the best search. In practice, my search engine fell short with just sentence-to-vec. I got good improvements by adding keyword search plus classical natural language processing techniques such as synonyms and stemming. A more insightful discussion of vector embeddings for search can be found in Lex Fridman’s interview with Perplexity’s CEO.
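
One way to picture “adding keyword search” is blending a lexical score with the embedding similarity before ranking. The toy token-overlap scorer and the 50/50 weighting below are made up for illustration and are not the exact code I ran:

    # Toy hybrid ranking: blend a keyword-overlap score with a vector score.
    # The naive tokenizer and the alpha=0.5 weighting are illustrative, not tuned.
    def keyword_score(query: str, doc: str) -> float:
        q, d = set(query.lower().split()), set(doc.lower().split())
        return len(q & d) / len(q) if q else 0.0

    def hybrid_score(query: str, doc: str, vector_score: float,
                     alpha: float = 0.5) -> float:
        # vector_score is assumed to be a cosine similarity in [0, 1]
        return alpha * keyword_score(query, doc) + (1 - alpha) * vector_score

In a sketch like this, stemming and synonym expansion would slot into keyword_score; that classical glue is where the good improvements came from.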