JS-heavy approaches are not compatible with long-term performance goals
[https://sgom.es/posts/2026-02-13-js-heavy-approaches-are-not-compatible-with-long-term-performance-goals/] - - public:mzimmerm
GGML quantized models. They would let you leverage CPU and system RAM, instead of having to rely on a GPU’s. This could save you a fortune, especially if go for some used AMD Epyc platforms. This could be more viable for the larger models, especially the 30B/65B parameters models which would still press or exceed the VRAM on the P40.