<p>
This video shows a <a href="https://en.wikipedia.org/wiki/Large_language_model">large language model</a> (LLM), running on my workstation, using <a href="https://www.gnu.org/software/emacs/">Emacs</a> to determine my location, retrieve weather data, and email me the results:
</p>

<video autoplay loop muted disablepictureinpicture
       class="video" src="/static/media/llm.mp4"
       type="video/mp4">
  Your browser does not support video.
</video>

<p>
With <a href="https://karthinks.com">karthink</a>'s <a href="https://github.com/karthink/gptel">gptel</a> package and some custom code, Emacs is capable of:
</p>

<ul>
  <li>Querying models from hosted providers (<a href="https://www.anthropic.com/">Anthropic</a>, <a href="https://openai.com/">OpenAI</a>, <a href="https://openrouter.ai/">OpenRouter</a>), or local models (<a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a>, <a href="https://ollama.com/">ollama</a>, etcetera).</li>
  <li>Switching rapidly between models and configurations, with only a few keystrokes.</li>
  <li>Saving conversations to the local filesystem, and using them as context for other conversations.</li>
  <li>Including files, buffers, and terminals as context for queries.</li>
  <li>Searching the web and reading web pages.</li>
  <li>Searching, reading, and sending email.</li>
  <li>Consulting agendas, projects, and tasks.</li>
  <li>Executing Emacs Lisp code and shell commands.</li>
  <li>Generating images via the <a href="https://www.comfy.org/">ComfyUI</a> API.</li>
  <li>Geolocating the device and checking the current date and time.</li>
  <li>Reading <a href="https://en.wikipedia.org/wiki/Man_page">man</a> pages.</li>
  <li>Retrieving the user's name and email.</li>
</ul>

<p>
Because LLMs understand and write <a href="https://en.wikipedia.org/wiki/Emacs_Lisp">Emacs Lisp</a> code, they can help extend their own capabilities; the improvements are recursive. Below, I note some of the setup required to enable this functionality.
</p>

<h2>Emacs</h2>

<p>
With <code><a href="https://www.gnu.org/software/emacs/manual/html_node/use-package/">use-package</a></code>, <a href="https://melpa.org/">MELPA</a>, and <a href="https://www.passwordstore.org/">pass</a> for password management, a minimal configuration for <code>gptel</code> looks like this:
</p>

<pre><code>(use-package gptel
  :commands (gptel gptel-send gptel-send-region gptel-send-buffer)
  :config
  (setq gptel-api-key (password-store-get "open-ai/emacs")
        gptel-curl--common-args
        '("--disable" "--location" "--silent" "--compressed" "-XPOST" "-D-")
        gptel-default-mode 'org-mode)
  :ensure t)
</code></pre>

<p>
This is enough to start querying <a href="https://openai.com/api/">OpenAI's API</a> from Emacs.
</p>
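<p>
Beyond the interactive commands, <code>gptel-request</code> exposes the same machinery to Emacs Lisp. A minimal sketch, sending a one-off prompt to the currently selected backend and echoing the reply (the prompt text is arbitrary):
</p>

<pre><code>;; Query the current backend and model programmatically.
(gptel-request
 "In one sentence, what is a quine?"
 :callback (lambda (response info)
             (if (stringp response)
                 (message "%s" response)
               (message "gptel-request failed: %s" (plist-get info :status)))))
</code></pre>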
<p>
To use Anthropic's API:
</p>

<pre><code>(gptel-make-anthropic "Anthropic"
 :key (password-store-get "anthropic/api/emacs")
 :stream t)
</code></pre>

<p>
I prefer OpenRouter, to access models across providers:
</p>

<pre><code>(gptel-make-openai "OpenRouter"
 :endpoint "/api/v1/chat/completions"
 :host "openrouter.ai"
 :key (password-store-get "openrouter.ai/keys/emacs")
 :models '(anthropic/claude-opus-4.5
           anthropic/claude-sonnet-4.5
           anthropic/claude-3.5-sonnet
           cohere/command-a
           deepseek/deepseek-r1-0528
           deepseek/deepseek-v3.1-terminus:exacto
           google/gemini-3-pro-preview
           mistralai/devstral-medium
           mistralai/magistral-medium-2506:thinking
           moonshotai/kimi-k2-0905:exacto
           moonshotai/kimi-k2-thinking
           openai/gpt-5.1
           openai/gpt-5.1-codex
           openai/gpt-5-pro
           perplexity/sonar-deep-research
           qwen/qwen3-max
           qwen/qwen3-vl-235b-a22b-thinking
           qwen/qwen3-coder:exacto
           z-ai/glm-4.6:exacto)
 :stream t)
</code></pre>

<p>
The choice of model depends on the task and its budget. Even where those two parameters are comparable, it is sometimes useful to switch models. One may have a blind spot, where another will have insight.
</p>

<p>
With <code>gptel</code>, it is easy to switch models mid-conversation, or use the output from one model as context for another. For example, I've used <a href="https://www.perplexity.ai/">Perplexity's</a> <a href="https://openrouter.ai/perplexity/sonar-deep-research">Sonar Deep Research</a> to create briefings, then used another LLM to summarize findings or answer specific questions, augmented with web search.
</p>

<h3>Tools</h3>

<p>
Tools augment a model's perception, memory, or capabilities. The <code>gptel-make-tool</code> function allows one to define tools for use by an LLM.
</p>

<p>
When making tools, one can leverage Emacs' existing functionality. For example, the <code>read_url</code> tool uses <code><a href="https://www.gnu.org/software/emacs/manual/html_node/url/Retrieving-URLs.html">url-retrieve-synchronously</a></code>, while <code>get_user_name</code> and <code>get_user_email</code> read <code><a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/User-Identification.html#index-user_002dfull_002dname">user-full-name</a></code> and <code><a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/User-Identification.html#index-user_002dmail_002daddress">user-mail-address</a></code>. <code>now</code>, used to retrieve the current date and time, uses <code><a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/Time-Parsing.html#index-format_002dtime_002dstring">format-time-string</a></code>:
</p>

<pre><code>(gptel-make-tool
 :name "now"
 :category "time"
 :function (lambda () (format-time-string "%Y-%m-%d %H:%M:%S %Z"))
 :description "Retrieves the current local date, time, and timezone."
 :include t)
</code></pre>
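<p>
The geolocation tool mentioned earlier can be approximated the same way. The following is only a sketch: it assumes <code>curl</code> is installed and uses the <a href="https://ipinfo.io/">ipinfo.io</a> service, which infers a coarse location from the device's public IP address; the tool name and service choice here are illustrative:
</p>

<pre><code>(gptel-make-tool
 :name "geolocate"
 :category "location"
 :function
 (lambda ()
   ;; Coarse, IP-based location; assumes curl and the ipinfo.io service.
   (shell-command-to-string "curl -s https://ipinfo.io/json"))
 :description "Retrieve the device's approximate location from its public IP address.")
</code></pre>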
<p>
Similarly, if Emacs is <a href="https://www.gnu.org/software/emacs/manual/html_node/emacs/Sending-Mail.html">configured to send mail</a>, the tool definition is straightforward:
</p>

<pre><code>(gptel-make-tool
 :name "mail_send"
 :category "mail"
 :confirm t
 :description "Send an email with the user's Emacs mail configuration."
 :function
 (lambda (to subject body)
   (with-temp-buffer
     (insert "To: " to "\n"
             "From: " user-mail-address "\n"
             "Subject: " subject "\n\n"
             body)
     (sendmail-send-it)))
 :args
 '((:name "to"
    :type string
    :description "The recipient's email address.")
   (:name "subject"
    :type string
    :description "The subject of the email.")
   (:name "body"
    :type string
    :description "The body of the email text.")))
</code></pre>

<p>
For more complex functionality, I prefer writing shell scripts, for several reasons:
<ul>
  <li>The tool definitions are simpler. For example, my <code>qwen-image</code> script includes a large JSON for the ComfyUI flow. I prefer to leave it outside my Emacs configuration.</li>
  <li>Tools are accessible to LLMs that may not be running in the Emacs environment (agents, one-off scripts).</li>
  <li>Fluency. LLMs seem better at writing bash (or Python, or Go) than Emacs Lisp, so it is easier to lean on this inherent expertise in developing the tools themselves.</li>
</ul>
</p>

<img class="img-center" src="/static/media/drawing-hands.jpg">
<div class="caption">
  <p>M.C. Escher, <i>Drawing Hands</i> (1948)</p>
</div>

<h4>Web Search</h4>

<p>
For web search, for example, I initially used the tool described in the <code>gptel</code> <a href="https://github.com/karthink/gptel/wiki/Tools-collection">wiki</a>:
</p>

<pre><code>(defvar brave-search-api-key (password-store-get "search.brave.com/api/emacs")
  "API key for accessing the Brave Search API.")

(defun brave-search-query (query)
  "Perform a web search using the Brave Search API with the given QUERY."
  (let ((url-request-method "GET")
        (url-request-extra-headers
         `(("X-Subscription-Token" . ,brave-search-api-key)))
        (url (format "https://api.search.brave.com/res/v1/web/search?q=%s"
                     (url-encode-url query))))
    (with-current-buffer (url-retrieve-synchronously url)
      (goto-char (point-min))
      (when (re-search-forward "^$" nil 'move)
        (let ((json-object-type 'hash-table))
          (json-parse-string
           (buffer-substring-no-properties (point) (point-max))))))))

(gptel-make-tool
 :name "brave_search"
 :category "web"
 :function #'brave-search-query
 :description "Perform a web search using the Brave Search API"
 :args (list '(:name "query"
               :type string
               :description "The search query string")))
</code></pre>

<p>
However, there are times I want to inspect the search results. I use this script:
</p>
<pre><code>#!/usr/bin/env bash

set -euo pipefail

API_URL="https://api.search.brave.com/res/v1/web/search"

check_deps() {
  for cmd in curl jq pass; do
    command -v "${cmd}" >/dev/null || {
      echo "missing: ${cmd}" >&2
      exit 1
    }
  done
}

perform_search() {
  local query="${1}"
  local res

  res=$(curl -s -G \
    -H "X-Subscription-Token: $(pass "search.brave.com/api/emacs")" \
    -H "Accept: application/json" \
    --data-urlencode "q=${query}" \
    "${API_URL}")

  if echo "${res}" | jq -e . >/dev/null 2>&1; then
    echo "${res}"
  else
    echo "error: failed to retrieve valid JSON res: ${res}" >&2
    exit 1
  fi
}

main() {
  check_deps

  if [ $# -eq 0 ]; then
    echo "Usage: ${0} &lt;query&gt;" >&2
    exit 1
  fi

  perform_search "${*}"
}

main "${@}"
</code></pre>

<p>
The script can be called manually from a shell: <code>brave-search 'quine definition' | jq -C | less</code>.
</p>

<p>
The tool definition condenses to:
</p>

<pre><code>(gptel-make-tool
 :name "brave_search"
 :category "web"
 :function
 (lambda (query)
   (shell-command-to-string
    (format "brave-search %s"
            (shell-quote-argument query))))
 :description "Perform a web search using the Brave Search API"
 :args
 (list '(:name "query"
         :type string
         :description "The search query string")))
</code></pre>

<h4>Context</h4>

<p>
One limitation that I have run into with tools is context overflow — when retrieved data exceeds an LLM's context window.
</p>

<p>
For example, this tool lets an LLM read <code>man</code> pages, helping it correctly recall command flags:
</p>

<pre><code>(gptel-make-tool
 :name "man"
 :category "documentation"
 :function
 (lambda (page_name)
   (shell-command-to-string
    (concat "man --pager cat " page_name)))
 :description "Read a Unix manual page."
 :args
 '((:name "page_name"
    :type string
    :description
    "The name of the man page to read. Can optionally include a section number, for example: '2 read' or 'cat(1)'.")))
</code></pre>

<p>
It broke when calling the <a href="https://www.gnu.org/software/units/">GNU units</a> <code>man</code> page, which exceeds 40,000 tokens on my system. This was unfortunate, since some conversions, like temperature, are unintuitive:
</p>

<pre><code>units 'tempC(100)' tempF
</code></pre>

<p>
With <code>gptel</code>, one fallback is Emacs' built-in <code>man</code> functionality. The appropriate region can be selected and added as context with <code>-r</code> in the transient menu. In some cases, this is faster than a tool call.
</p>

<video autoplay loop muted disablepictureinpicture
       class="video" src="/static/media/llm-temp.mp4"
       type="video/mp4">
  Your browser does not support video.
</video>

<p>
I ran into a similar problem with the <code>read_url</code> tool (also found in the <a href="https://github.com/karthink/gptel/wiki/Tools-collection">gptel wiki</a>). It can break if the response is larger than the context window.
</p>

<pre><code>(gptel-make-tool
 :name "read_url"
 :category "web"
 :function
 (lambda (url)
   (with-current-buffer
       (url-retrieve-synchronously url)
     (goto-char (point-min))
     (forward-paragraph)
     (let ((dom (libxml-parse-html-region
                 (point) (point-max))))
       (run-at-time 0 nil #'kill-buffer
                    (current-buffer))
       (with-temp-buffer
         (shr-insert-document dom)
         (buffer-substring-no-properties
          (point-min)
          (point-max))))))
 :description "Fetch and read the contents of a URL"
 :args (list '(:name "url"
               :type string
               :description "The URL to read")))
</code></pre>
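<p>
One crude mitigation is to cap how much text a tool can return. The helper below is an illustrative sketch (the name and the 20,000-character limit are arbitrary); a tool like <code>read_url</code> or <code>man</code> could pass its result through it before returning:
</p>

<pre><code>;; Illustrative only: cap tool output before it reaches the model.
(defun my/truncate-for-context (text &optional max-chars)
  "Return TEXT truncated to MAX-CHARS characters (default 20000)."
  (let ((limit (or max-chars 20000)))
    (if (> (length text) limit)
        (concat (substring text 0 limit) "\n[...truncated...]")
      text)))
</code></pre>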
<p>
When I have run into this problem, the issue was bloated functional content — JavaScript and CSS. If the content is not dynamically generated, one can fall back to Emacs' web browser, <code><a href="https://www.gnu.org/software/emacs/manual/html_mono/eww.html">eww</a></code>. The buffer or selected regions can be added as context. A more sophisticated tool could help in these cases. Long term, I hope that LLMs will steer the web back towards readability, either by acting as an aggregator and filter, or as evolutionary pressure in favor of static content.
</p>

<h4>Security</h4>

<p>
The <code><a href="https://github.com/karthink/gptel/wiki/Tools-collection#run_command">run_command</a></code> tool, also found in the <code>gptel</code> tool collection, enables shell command execution and requires careful consideration. A compromised model could issue malicious commands, or a poorly formatted command could have unintended consequences. <code>gptel</code>'s <code>:confirm</code> key can be used to inspect and approve tool calls.
</p>

<pre><code>(gptel-make-tool
 :name "run_command"
 :category "command"
 :confirm t
 :function
 (lambda (command)
   (with-temp-message
       (format "Executing command: %s" command)
     (shell-command-to-string command)))
 :description
 "Execute a shell command; returns the output as a string."
 :args
 '((:name "command"
    :type string
    :description "The complete shell command to execute.")))
</code></pre>

<p>
Inspection limits the LLM's ability to operate asynchronously, without human intervention. There are a few solutions to this problem, the easiest being to offer tools with more limited scope.
</p>

<video autoplay loop muted disablepictureinpicture
       class="video" src="/static/media/llm-inspect.mp4"
       type="video/mp4">
  Your browser does not support video.
</video>

<h3>Presets</h3>

<p>
With <code>gptel</code>'s transient menu, only a few keystrokes are needed to add, edit, or remove context, switch the model one wants to query, change the input and output, or edit the system message. Presets accelerate switching between settings, and are defined with <code>gptel-make-preset</code>.
</p>

<p>
For example, with <a href="https://huggingface.co/openai/gpt-oss-120b">GPT-OSS 120B</a> (one of OpenAI's <a href="https://openai.com/open-models/">open weights</a> models), a system prompt is necessary to minimize the use of tables and excessive text styling. A preset can load the appropriate settings:
</p>

<pre><code>(gptel-make-preset 'assistant/gpt
 :description "GPT-OSS general assistant."
 :backend "llama.cpp"
 :model 'gpt
 :include-reasoning nil
 :system
 "You are a large language model queried from Emacs. Your conversation with the user occurs in an org-mode buffer.

- Use org-mode syntax only (no Markdown).
- Use tables ONLY for tabular data with few columns and rows.
- Avoid extended text in table cells. If cells need paragraphs, use a list instead.
- Default to plain paragraphs and simple lists.
- Minimize styling. Use *bold* or /italic/ only where emphasis is essential. Use ~code~ for technical terms.
- If citing facts or resources, output references as org-mode links.
- Use code blocks for calculations or code examples.")
</code></pre>

<p>
From the transient menu, this preset can be selected with two keystrokes: <code>@</code> and then <code>a</code>.
</p>
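<p>
Presets can bundle more than a system message. For example, a web-research preset might pair a hosted model with the search tools defined earlier (a sketch, which assumes the <code>:tools</code> key selects tools by name):
</p>

<pre><code>(gptel-make-preset 'research
 :description "Web research with search and page reading."
 :backend "OpenRouter"
 :model 'openai/gpt-5.1
 :tools '("brave_search" "read_url"))
</code></pre>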
<h4>Memory</h4>

<p>
Presets can be used to implement read-only memory for an LLM. This preset uses <a href="https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking">Qwen3 VL 30B-A3B</a> with a <code>memory.org</code> file automatically included in the context:
</p>

<pre><code>(gptel-make-preset 'assistant/qwen
 :description "Qwen Emacs assistant."
 :backend "llama.cpp"
 :model 'qwen3_vl_30b-a3b
 :context '("~/memory.org"))
</code></pre>

<p>
The file can include any information that should always be included as context. One could also grant LLMs the ability to append to <code>memory.org</code>, though I am skeptical that they would do so judiciously.
</p>

<h2>Local LLMs</h2>

<p>
Running LLMs on one's own devices offers some advantages over third-party providers:
<ul>
  <li>Redundancy: they work offline, even if providers are experiencing an outage.</li>
  <li>Privacy: queries and data remain on the device.</li>
  <li>Control: you know exactly which model is running, with what settings, at what quantization.</li>
</ul>
</p>

<p>
The main trade-off is intelligence, though for many purposes, the gap is closing fast. Local models excel at summarizing data, language translation, image and PDF extraction, and simple research tasks. I rely on hosted models primarily for complex coding tasks, or when a larger effective context is required.
</p>

<h3>llama.cpp</h3>

<p>
<a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a> makes it easy to run models locally:
</p>

<pre><code>git clone https://github.com/ggml-org/llama.cpp.git

cd llama.cpp

cmake -B build

cmake --build build --config Release

mv build/bin/llama-server ~/.local/bin/ # Or elsewhere in PATH.

llama-server -hf unsloth/Qwen3-4B-GGUF:q8_0
</code></pre>

<p>
This will build <code>llama.cpp</code> with support for CPU-based inference, move <code>llama-server</code> into <code>~/.local/bin/</code>, and then download and run <a href="https://unsloth.ai/">Unsloth</a>'s <code>Q8</code> quantization of <a href="https://huggingface.co/Qwen/Qwen3-4B">Qwen3 4B</a>. The <code>llama.cpp</code> <a href="https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md">documentation</a> explains how to build for GPUs and other hardware — not much more work than the default build.
</p>

<p><code>llama-server</code> offers a web interface, available at port 8080 by default.</p>

<video autoplay loop muted disablepictureinpicture
       class="video" src="/static/media/llm-ls.mp4"
       type="video/mp4">
  Your browser does not support video.
</video>

<h3>Weights</h3>

<p>
Part of the art of using LLMs is selecting an appropriate model. Some factors to consider are available hardware, intended use (task, language), and desired pricing (input and output costs). Some models offer specialized capabilities — <a href="https://ai.google.dev/gemma/docs/core">Gemma3</a> and <a href="https://github.com/QwenLM/Qwen3-VL">Qwen3-VL</a> offer multimodal input, <a href="https://deepmind.google/models/gemma/medgemma/">MedGemma</a> specializes in medical knowledge, and <a href="https://mistral.ai/">Mistral</a>'s <a href="https://mistral.ai/news/devstral">Devstral</a> focuses on agentic use.
</p>

<p>
For local use, hardware tends to be the main limiter.
One has to fit the model into available memory, and consider the acceptable performance for one's use case. A rough guideline is to use the smallest model and quantization that can handle the task — or, from the opposite direction, to look for the largest model that fits into available memory. The rule of thumb is that a <code>Q8_0</code> quantization uses about one byte per parameter, so an 8 billion parameter model will use about 8 GB of RAM or VRAM. A <code>Q4_0</code> quant would use half that — 4 GB — while 16-bit weights would need 16 GB.
</p>

<p>
My workstation, laptop, and mobile (<code>llama.cpp</code> can be used from <code><a href="https://termux.dev/en/">termux</a></code>) all run different classes of weights. On my mobile device, I have about 12 GB of RAM, but background utilization is already around 8 GB. So, when necessary, I use 4B models at <code>Q8_0</code> or less: Gemma3, Qwen3-VL, and MedGemma. If a laptop has 16 GB of RAM with 2 GB in use, 8B models might run well enough. The workstation, which has a GPU, can run larger models, faster. There are other tricks one can use — <a href="https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention">flash attention</a>, <a href="https://research.google/blog/looking-back-at-speculative-decoding/">speculative decoding</a>, MoE offloading — to optimize performance across different hardware configurations.
</p>

<h3>llama-swap</h3>

<p>
One current limitation of <code>llama.cpp</code> is that unless you load multiple models at once, switching models requires manually starting a new instance of <code>llama-server</code>. To swap models on demand, <code><a href="https://github.com/mostlygeek/llama-swap">llama-swap</a></code> can be used.
</p>

<p>
<code>llama-swap</code> uses a YAML configuration file, which is <a href="https://github.com/mostlygeek/llama-swap/wiki/Configuration">well documented</a>. I use something like the following:
</p>

<pre><code>logLevel: debug

macros:
  "models": "/home/llama-swap/models"

models:
  gemma3:
    cmd: |
      llama-server
      --ctx-size 0
      --gpu-layers 888
      --jinja
      --min-p 0.0
      --model ${models}/gemma-3-27b-it-ud-q8_k_xl.gguf
      --mmproj ${models}/mmproj-gemma3-27b-bf16.gguf
      --port ${PORT}
      --repeat-penalty 1.0
      --temp 1.0
      --top-k 64
      --top-p 0.95
    ttl: 900
    name: "gemma3_27b"
  gpt:
    cmd: |
      llama-server
      --chat-template-kwargs '{"reasoning_effort": "high"}'
      --ctx-size 0
      --gpu-layers 888
      --jinja
      --model ${models}/gpt-oss-120b-f16.gguf
      --port ${PORT}
      --temp 1.0
      --top-k 0
      --top-p 1.0
    ttl: 900
    name: "gpt-oss_120b"
  qwen3_vl_30b-a3b:
    cmd: |
      llama-server
      --ctx-size 131072
      --gpu-layers 888
      --jinja
      --min-p 0
      --model ${models}/qwen3-vl-30b-a3b-thinking-ud-q8_k_xl.gguf
      --mmproj ${models}/mmproj-qwen3-vl-30ba3b-bf16.gguf
      --port ${PORT}
      --temp 0.6
      --top-k 20
      --top-p 0.95
    ttl: 900
    name: "qwen3_vl_30b-a3b-thinking"
</code></pre>

<h3>nginx</h3>

<p>
Since my workstation has a GPU and can be accessed on the local network or via <a href="https://www.wireguard.com/">WireGuard</a> from other devices, I use <code><a href="https://nginx.org/">nginx</a></code> as a reverse proxy in front of <code>llama-swap</code>, with certificates generated by <code><a href="https://certbot.eff.org/">certbot</a></code>.
For streaming LLM responses, <code>proxy_buffering off;</code> and <code>proxy_cache off;</code> are essential settings.
</p>

<pre><code>user http;
worker_processes 1;
worker_cpu_affinity auto;

events {
    worker_connections 1024;
}

http {
    charset utf-8;
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    server_tokens off;
    types_hash_max_size 4096;
    client_max_body_size 32M;

    # MIME
    include mime.types;
    default_type application/octet-stream;

    # logging
    access_log /var/log/nginx/access.log;
    error_log /var/log/nginx/error.log warn;

    include /etc/nginx/conf.d/*.conf;
}
</code></pre>

<p>Then, for <code>/etc/nginx/conf.d/llama-swap.conf</code>:</p>

<pre><code>server {
    listen 80;
    server_name llm.dwrz.net;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl;
    http2 on;
    server_name llm.dwrz.net;

    ssl_certificate /etc/letsencrypt/live/llm.dwrz.net/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.dwrz.net/privkey.pem;

    location / {
        proxy_buffering off;
        proxy_cache off;
        proxy_pass http://localhost:11434;
        proxy_read_timeout 3600s;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
</code></pre>

<h3>Emacs Configuration</h3>

<p>
<code>llama-server</code> offers an <a href="https://platform.openai.com/docs/api-reference/introduction">OpenAI-compatible API</a>. <code>gptel</code> can be configured to use local models with something like the following:
</p>

<pre><code>(gptel-make-openai "llama.cpp"
 :stream t
 :protocol "http"
 :host "localhost"
 :models
 '((gemma3
    :capabilities (media tool json url)
    :mime-types ("image/jpeg"
                 "image/png"
                 "image/gif"
                 "image/webp"))
   gpt
   (medgemma_27b
    :capabilities (media tool json url)
    :mime-types ("image/jpeg"
                 "image/png"
                 "image/gif"
                 "image/webp"))
   (qwen3_vl_30b-a3b
    :capabilities (media tool json url)
    :mime-types ("image/jpeg"
                 "image/png"
                 "image/gif"
                 "image/webp"))
   (qwen3_vl_32b
    :capabilities (media tool json url)
    :mime-types ("image/jpeg"
                 "image/png"
                 "image/gif"
                 "image/webp"))))
</code></pre>

<h2>Techniques</h2>

<p>
Having covered setup and configuration, here are some practical ways I use Emacs with LLMs, demonstrated with examples.
</p>

<h3>Simple Q&A</h3>

<p>
With the <code>gptel</code> transient menu, press <code>m</code> to prompt from the minibuffer and <code>e</code> to output the answer to the echo area, then <code>Enter</code> to submit the prompt.

<video autoplay loop muted disablepictureinpicture
       class="video" src="/static/media/llm-qa.mp4"
       type="video/mp4">
  Your browser does not support video.
</video>
</p>

<h3>Brief Conversations</h3>

<p>
For brief multi-turn conversations that require no persistence, <code>gptel</code> can be used in the <code>*scratch*</code> buffer. Context can be added via the transient menu with <code>-b</code>, <code>-f</code>, or <code>-r</code> as necessary. The conversation is not persisted unless the buffer is saved.
</p>
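<p>
Since <code>gptel-send</code> sends the region, or the buffer up to point, a global binding makes this workflow quick. The key below is an arbitrary choice:
</p>

<pre><code>;; Send the region or buffer-up-to-point from any buffer.
(global-set-key (kbd "C-c g") #'gptel-send)
</code></pre>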
<h3>Image-to-Text</h3>

<p>
With multimodal LLMs like Gemma3 and Qwen3-VL, one can extract text and tables from images.

<video autoplay loop muted disablepictureinpicture
       class="video" src="/static/media/llm-itt.mp4"
       type="video/mp4">
  Your browser does not support video.
</video>
</p>

<h3>Text-to-Image</h3>

<p>
Here, a local LLM retrieves a URL, reads its contents, and then generates an image with ComfyUI.

<video autoplay loop muted disablepictureinpicture
       class="video" src="/static/media/llm-image.mp4"
       type="video/mp4">
  Your browser does not support video.
</video>

The result:
<img class="img-center" src="/static/media/comfy-ui-dream.png">
</p>

<h3>Research</h3>

<p>
If I know I will need to reference a topic later, I usually start out with an <code><a href="https://orgmode.org/">org-mode</a></code> file. In this case, I tend to use links to construct context.
</p>

<h3>Translation</h3>

<p>
For small or unimportant text, Google Translate via the command line with <code><a href="https://github.com/soimort/translate-shell">translate-shell</a></code> works well enough. Otherwise, I find the translation output from local LLMs is typically more sensitive to context.

<video autoplay loop muted disablepictureinpicture
       class="video" src="/static/media/llm-translate.mp4"
       type="video/mp4">
  Your browser does not support video.
</video>
</p>

<h3>Code</h3>

<p>
My experience using LLMs for code has been mixed. For scripts and small programs, iterating in a single conversation works well. However, with larger codebases, few models contribute meaningfully. While hosted models are typically stronger in this use case, I surmise aggressive quantization has reduced their reliability. I have come to distrust the initial output from any model.
</p>

<p>
So far, I have had limited success with agents — which often burn through tokens to understand context, but still manage to miss important nuance. This experience has made me hesitant to add tool support for file operations.
</p>

<p>
Instead, I provide context through <code>org-mode</code> links in project-specific files. I have the LLM walk through potential changes, which I review and implement by hand. Generally, this approach saves time, but often, I still work faster on my own.
</p>

<h2>Conclusion</h2>

<p>
I first used Emacs as a text editor 20 years ago. For over a decade, I have used it daily — for writing and coding, task and finance management, email, as a calculator, and to interact with local and remote hosts. I continue to discover new functionality and techniques, and I was surprised to see how well this 50-year-old program has adapted to the frontier of technology. Despite flaws and limitations, its endurance reflects its foundational design.
</p>

<p>
The barrier to entry for Emacs is high.
For everyday users, comparable power and flexibility could be unlocked with support for:
<ul>
  <li>Notebooks featuring executable code blocks.</li>
  <li>Links to local and remote content, including other conversations.</li>
  <li>Switching models and providers, including local models.</li>
  <li>Mail and task integration.</li>
  <li>Offline operation with local models.</li>
  <li>Remote access — Emacs can be used remotely via SSH or TRAMP.</li>
</ul>
</p>

<p>
So far, my experiments with LLMs have left me with both concern and optimism. Local inference reveals the energy requirements, yet daily limitations make me skeptical of imminent superintelligence. In the same way that calculators outperform humans at arithmetic, LLMs may offer areas of comparative advantage. The key question is which tasks we can delegate reliably and efficiently, such that the effort of building scaffolding, maintaining guardrails, and managing operations costs less than doing the work ourselves.
</p>