Ask HN: How are you managing LLM inference at the edge?
6 points by gray_amps | 1 comment on Hacker News.
I’m building a system to run small LLMs on-device (mobile, IoT, on-prem servers) and would love to hear how others have tackled the challenges.

Context:
- Use cases: offline chatbots, smart cameras, local data privacy
- Models: 7–13B parameter quantized models (e.g. Llama 2, Vicuna)
- Constraints: limited RAM/flash, CPU-only or tiny GPU, intermittent connectivity

Questions:
- What runtimes or frameworks are you using (ONNX Runtime, TVM, custom C++)?
- How do you handle model loading, eviction, and batching under tight memory?
- Any clever tricks for quantization, pruning, or kernel fusion that boost perf?
- How do you monitor and update models securely in the field?

Looking forward to your benchmarks, war stories, and code pointers!
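For concreteness, here is roughly the kind of baseline I have in mind: a minimal sketch (not production code) that lazily loads a quantized GGUF model with llama-cpp-python on CPU and evicts the least-recently-used model when a simple budget is exceeded. The model filename, thread count, context size, and the evict-by-model-count policy are all placeholders/assumptions, not a recommendation.

import collections

from llama_cpp import Llama


class ModelPool:
    """Keep at most `max_loaded` models resident; evict the least recently used."""

    def __init__(self, max_loaded: int = 1):
        self.max_loaded = max_loaded
        self.loaded: "collections.OrderedDict[str, Llama]" = collections.OrderedDict()

    def get(self, path: str) -> Llama:
        if path in self.loaded:
            self.loaded.move_to_end(path)  # mark as most recently used
            return self.loaded[path]
        while len(self.loaded) >= self.max_loaded:
            _, evicted = self.loaded.popitem(last=False)  # drop LRU model
            del evicted  # release weights (mmap'd pages can be reclaimed)
        model = Llama(
            model_path=path,
            n_ctx=2048,      # small context to bound KV-cache memory
            n_threads=4,     # CPU-only: roughly match physical cores
            use_mmap=True,   # let the OS page weights in/out of flash
        )
        self.loaded[path] = model
        return model


pool = ModelPool()
llm = pool.get("llama-2-7b-chat.Q4_K_M.gguf")  # placeholder filename
out = llm("Summarize today's sensor log in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])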