Local Installation (AI)
Ollama - pulls and runs models locally; it works with GGUF-quantized models (the same format llama.cpp uses) and exposes a simple CLI and local HTTP API.
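As a minimal sketch of that workflow, using the official `ollama` Python client (pip install ollama); a locally running Ollama server is assumed, and the model name "llama3" is illustrative:

```python
# Minimal sketch: pull a model, then chat with it through the `ollama` client.
# Assumes a local Ollama server is running; "llama3" is an illustrative name.
import ollama

ollama.pull("llama3")  # downloads the model if it is not already present

response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "What does GGUF quantization do?"}],
)
print(response["message"]["content"])
```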
MSTY - a desktop app that enhances Ollama with a point-and-click interface for downloading and chatting with local models.
llama.cpp with Q4_K_M quantization - a C/C++ inference engine for GGUF models; Q4_K_M is the commonly recommended 4-bit quantization variant, trading a small quality loss for a much smaller memory footprint.
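A minimal sketch of loading a Q4_K_M model through the llama-cpp-python bindings (pip install llama-cpp-python); the GGUF path is a placeholder for whatever model file you have downloaded:

```python
# Minimal sketch: run a Q4_K_M-quantized GGUF model via llama-cpp-python.
# The model path below is a placeholder; any Q4_K_M GGUF file works.
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-3-8b.Q4_K_M.gguf", n_ctx=2048)

out = llm("Q: Why quantize model weights? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```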
Automatic1111 WebUI - the popular Stable Diffusion web interface; its --medvram flag reduces VRAM usage by moving model components between GPU and system RAM (a memory-management option rather than 4-bit quantization).
LangChain - an open-source framework that simplifies building applications on top of large language models (LLMs). It provides tools for combining models with other resources, such as databases, APIs, and document stores, to create flexible, context-aware systems.
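A minimal sketch of LangChain's composition style, assuming the langchain-ollama integration package (pip install langchain-ollama) and a locally pulled model named "llama3":

```python
# Minimal sketch: compose a prompt template with a local model via LangChain.
# Assumes `pip install langchain-ollama` and an Ollama server with "llama3".
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama

prompt = ChatPromptTemplate.from_template("Explain {topic} in one paragraph.")
llm = ChatOllama(model="llama3")

chain = prompt | llm  # pipe the prompt template into the model
print(chain.invoke({"topic": "4-bit quantization"}).content)
```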
ComfyUI (better for LCM-LoRA & efficiency) - an open-source, node-based graphical interface for generative AI models, particularly image generation and processing. It lets users wire up complex model pipelines visually, without writing code.
Replit - a browser-based, cloud-hosted IDE; not a local install, but handy for prototyping the same code.
CPU: a multi-core CPU (e.g., Intel Core i7/i9 with 6+ cores). CPU-only: a PC with 64 GB RAM, a decent multi-core CPU, and an SSD could run a 4-bit quantized 70B model at 1-2 tokens/second, holding the weights in system RAM. Without a GPU, inference leans heavily on RAM bandwidth, so a platform with multiple memory channels (e.g., dual-channel DDR4/DDR5) helps.
GPU-Assisted: an NVIDIA RTX 3090 (24 GB VRAM), 32 GB system RAM, and a mid-tier CPU. Offloading model layers to the GPU can reach 5-10 tokens/second depending on optimization; see the estimate sketch below.
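These token-rate figures follow from a back-of-the-envelope model: generating one token streams roughly the full set of quantized weights through memory once, so throughput is bounded by bandwidth divided by model size. A sketch under that assumption (the bandwidth numbers are nominal spec values):

```python
# Roofline-style estimate: tokens/sec ~ memory bandwidth / bytes per token,
# assuming each generated token reads the full quantized weight set once.
def tokens_per_second(params_billion, bits_per_weight, bandwidth_gb_s):
    model_gb = params_billion * bits_per_weight / 8  # weight footprint in GB
    return bandwidth_gb_s / model_gb

# 70B model at 4-bit (~35 GB of weights):
print(tokens_per_second(70, 4, 50))   # ~1.4 tok/s on ~50 GB/s dual-channel DDR4-class RAM
print(tokens_per_second(70, 4, 936))  # ~27 tok/s upper bound at RTX 3090 VRAM bandwidth;
                                      # in practice 35 GB > 24 GB VRAM, so layers spill to RAM
```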
Quantization (1-bit, 4-bit, or 8-bit precision) reduces the memory footprint; software like llama.cpp or Ollama can prepare and run quantized models on lower-end hardware.
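A quick arithmetic check on why this works, counting weights only (activation memory and quantization-format overhead are ignored):

```python
# Weight footprint in GB: parameters (billions) * bits per weight / 8.
def weight_gb(params_billion, bits):
    return params_billion * bits / 8

for bits in (16, 8, 4, 1):  # FP16 baseline vs. the precisions mentioned above
    print(f"70B model at {bits}-bit: ~{weight_gb(70, bits):.0f} GB")
# -> 140, 70, 35, and 9 GB: only the quantized variants fit in 64 GB of RAM.
```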
High-speed internet connection (model downloads commonly run into tens of gigabytes).