Tabby's philosophy is to achieve a completion rate comparable to Codex/Copilot with a model of under 1B parameters; with BF16/FP16 support, that keeps VRAM requirements to 2 GB or less. This may seem impossible given a model roughly 10x smaller than Codex, but it is definitely achievable, especially in an on-premises environment where customers want to keep their code behind a firewall.
Related research includes [1] (hint: combining code search with an LLM).
That doesn't answer the question: can anyone without more VRAM than sense actually run it as-is, or should we wait until they reach their allegedly impossible aspirational goal?
The very first line of a post like this should be the required specs and whether the trained model weights are actually available; otherwise it's just straight-up clickbait.
Models are usually trained in fp16, meaning 16 bits = 2 bytes per parameter. So a 1B model would take 2GB to begin with, and can be optimised down from there with sparsification and/or quantization. A 50% reduction in size might be noticeably worse than the original, but still useful for boilerplate or other highly predictable patterns.
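The arithmetic above can be sketched as a quick back-of-the-envelope calculation. This is a hypothetical helper, not anything from Tabby itself, and it only counts the weights; real inference also needs memory for activations and the KV cache.

```python
# Approximate VRAM needed just to hold model weights, by precision.
# Hypothetical illustration; ignores activations, KV cache, and framework overhead.

def weight_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """Return approximate weight memory in GB (using 1 GB = 1e9 bytes)."""
    return params_billions * bytes_per_param  # 1e9 params * bytes / 1e9 bytes-per-GB

for label, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"1B params @ {label}: ~{weight_vram_gb(1.0, nbytes):.1f} GB")
```

So a 1B model is roughly 2 GB in fp16/bf16, and int8 quantization (the 50% reduction mentioned above) brings it to about 1 GB.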