<p>
This video shows a <a href="https://en.wikipedia.org/wiki/Large_language_model">large language model</a> (LLM), running on my workstation, using <a href="https://www.gnu.org/software/emacs/">Emacs</a> to determine my location, retrieve weather data, and email me the results:
</p>

<video autoplay loop muted disablepictureinpicture
       class="video" src="/static/media/llm.mp4"
       type="video/mp4">
  Your browser does not support video.
</video>

<p>
With <a href="https://karthinks.com">karthink</a>'s <a href="https://github.com/karthink/gptel">gptel</a> package and some custom code, Emacs is capable of:
</p>

<ul>
  <li>Querying models from hosted providers (<a href="https://www.anthropic.com/">Anthropic</a>, <a href="https://openai.com/">OpenAI</a>, <a href="https://openrouter.ai/">OpenRouter</a>), or local models (<a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a>, <a href="https://ollama.com/">ollama</a>, etcetera).</li>
  <li>Switching rapidly between models and configurations, with only a few keystrokes.</li>
  <li>Saving conversations to the local filesystem, and using them as context for other conversations.</li>
  <li>Including files, buffers, and terminals as context for queries.</li>
  <li>Searching the web and reading web pages.</li>
  <li>Searching, reading, and sending email.</li>
  <li>Consulting agendas, projects, and tasks.</li>
  <li>Executing Emacs Lisp code and shell commands.</li>
  <li>Generating images via the <a href="https://www.comfy.org/">ComfyUI</a> API.</li>
  <li>Geolocating the device and checking the current date and time.</li>
  <li>Reading <a href="https://en.wikipedia.org/wiki/Man_page">man</a> pages.</li>
  <li>Retrieving the user's name and email.</li>
</ul>

<p>
Because LLMs understand and write <a href="https://en.wikipedia.org/wiki/Emacs_Lisp">Emacs Lisp</a> code, they can help extend their own capabilities; the improvements are recursive. Below, I note some of the setup required to enable this functionality.
</p>

<h2>Emacs</h2>

<p>
With <code><a href="https://www.gnu.org/software/emacs/manual/html_node/use-package/">use-package</a></code>, <a href="https://melpa.org/">MELPA</a>, and <a href="https://www.passwordstore.org/">pass</a> for password management, a minimal configuration for <code>gptel</code> looks like this:
</p>

<pre><code>(use-package gptel
  :commands (gptel gptel-send gptel-send-region gptel-send-buffer)
  :config
  (setq gptel-api-key (password-store-get "open-ai/emacs")
        gptel-curl--common-args
        '("--disable" "--location" "--silent" "--compressed" "-XPOST" "-D-")
        gptel-default-mode 'org-mode)
  :ensure t)
</code></pre>

<p>
This is enough to start querying <a href="https://openai.com/api/">OpenAI's API</a> from Emacs.
</p>
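<p>
Beyond the interactive commands, <code>gptel-request</code> exposes the same machinery to Emacs Lisp. A minimal sketch, sending a one-off prompt to the currently selected backend and echoing the reply (the prompt text is arbitrary):
</p>

<pre><code>;; Query the current backend and model programmatically.
(gptel-request
 "In one sentence, what is a quine?"
 :callback (lambda (response info)
             (if (stringp response)
                 (message "%s" response)
               (message "gptel-request failed: %s" (plist-get info :status)))))
</code></pre>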
<p>
To use Anthropic's API:
</p>

<pre><code>(gptel-make-anthropic "Anthropic"
 :key (password-store-get "anthropic/api/emacs")
 :stream t)
</code></pre>

<p>
I prefer OpenRouter, to access models across providers:
</p>

<pre><code>(gptel-make-openai "OpenRouter"
 :endpoint "/api/v1/chat/completions"
 :host "openrouter.ai"
 :key (password-store-get "openrouter.ai/keys/emacs")
 :models '(anthropic/claude-opus-4.5
           anthropic/claude-sonnet-4.5
           anthropic/claude-3.5-sonnet
           cohere/command-a
           deepseek/deepseek-r1-0528
           deepseek/deepseek-v3.1-terminus:exacto
           google/gemini-3-pro-preview
           mistralai/devstral-medium
           mistralai/magistral-medium-2506:thinking
           moonshotai/kimi-k2-0905:exacto
           moonshotai/kimi-k2-thinking
           openai/gpt-5.1
           openai/gpt-5.1-codex
           openai/gpt-5-pro
           perplexity/sonar-deep-research
           qwen/qwen3-max
           qwen/qwen3-vl-235b-a22b-thinking
           qwen/qwen3-coder:exacto
           z-ai/glm-4.6:exacto)
 :stream t)
</code></pre>

<p>
The choice of model depends on the task and its budget. Even where those two parameters are comparable, it is sometimes useful to switch models. One may have a blind spot, where another will have insight.
</p>

<p>
With <code>gptel</code>, it is easy to switch models mid-conversation, or use the output from one model as context for another. For example, I've used <a href="https://www.perplexity.ai/">Perplexity's</a> <a href="https://openrouter.ai/perplexity/sonar-deep-research">Sonar Deep Research</a> to create briefings, then used another LLM to summarize findings or answer specific questions, augmented with web search.
</p>

<h3>Tools</h3>

<p>
Tools augment a model's perception, memory, or capabilities. The <code>gptel-make-tool</code> function allows one to define tools for use by an LLM.
</p>

<p>
When making tools, one can leverage Emacs' existing functionality. For example, the <code>read_url</code> tool uses <code><a href="https://www.gnu.org/software/emacs/manual/html_node/url/Retrieving-URLs.html">url-retrieve-synchronously</a></code>, while <code>get_user_name</code> and <code>get_user_email</code> read <code><a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/User-Identification.html#index-user_002dfull_002dname">user-full-name</a></code> and <code><a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/User-Identification.html#index-user_002dmail_002daddress">user-mail-address</a></code>. <code>now</code>, used to retrieve the current date and time, uses <code><a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/Time-Parsing.html#index-format_002dtime_002dstring">format-time-string</a></code>:
</p>

<pre><code>(gptel-make-tool
 :name "now"
 :category "time"
 :function (lambda () (format-time-string "%Y-%m-%d %H:%M:%S %Z"))
 :description "Retrieves the current local date, time, and timezone."
 :include t)
</code></pre>
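<p>
The geolocation tool mentioned earlier can be approximated the same way. The following is only a sketch: it assumes <code>curl</code> is installed and uses the <a href="https://ipinfo.io/">ipinfo.io</a> service, which infers a coarse location from the device's public IP address; the tool name and service choice here are illustrative:
</p>

<pre><code>(gptel-make-tool
 :name "geolocate"
 :category "location"
 :function
 (lambda ()
   ;; Coarse, IP-based location; assumes curl and the ipinfo.io service.
   (shell-command-to-string "curl -s https://ipinfo.io/json"))
 :description "Retrieve the device's approximate location from its public IP address.")
</code></pre>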
<p>
Similarly, if Emacs is <a href="https://www.gnu.org/software/emacs/manual/html_node/emacs/Sending-Mail.html">configured to send mail</a>, the tool definition is straightforward:
</p>

<pre><code>(gptel-make-tool
 :name "mail_send"
 :category "mail"
 :confirm t
 :description "Send an email with the user's Emacs mail configuration."
 :function
 (lambda (to subject body)
   (with-temp-buffer
     (insert "To: " to "\n"
             "From: " user-mail-address "\n"
             "Subject: " subject "\n\n"
             body)
     (sendmail-send-it)))
 :args
 '((:name "to"
    :type string
    :description "The recipient's email address.")
   (:name "subject"
    :type string
    :description "The subject of the email.")
   (:name "body"
    :type string
    :description "The body of the email text.")))
</code></pre>

<p>
For more complex functionality, I prefer writing shell scripts, for several reasons:
<ul>
  <li>The tool definitions are simpler. For example, my <code>qwen-image</code> script includes a large JSON for the ComfyUI flow. I prefer to leave it outside my Emacs configuration.</li>
  <li>Tools are accessible to LLMs that may not be running in the Emacs environment (agents, one-off scripts).</li>
  <li>Fluency. LLMs seem better at writing bash (or Python, or Go) than Emacs Lisp, so it is easier to lean on this inherent expertise in developing the tools themselves.</li>
</ul>
</p>

<img class="img-center" src="/static/media/drawing-hands.jpg">
<div class="caption">
  <p>M.C. Escher, <i>Drawing Hands</i> (1948)</p>
</div>

<h4>Web Search</h4>

<p>
For web search, for example, I initially used the tool described in the <code>gptel</code> <a href="https://github.com/karthink/gptel/wiki/Tools-collection">wiki</a>:
</p>

<pre><code>(defvar brave-search-api-key (password-store-get "search.brave.com/api/emacs")
  "API key for accessing the Brave Search API.")

(defun brave-search-query (query)
  "Perform a web search using the Brave Search API with the given QUERY."
  (let ((url-request-method "GET")
        (url-request-extra-headers
         `(("X-Subscription-Token" . ,brave-search-api-key)))
        (url (format "https://api.search.brave.com/res/v1/web/search?q=%s"
                     (url-encode-url query))))
    (with-current-buffer (url-retrieve-synchronously url)
      (goto-char (point-min))
      (when (re-search-forward "^$" nil 'move)
        (let ((json-object-type 'hash-table))
          (json-parse-string
           (buffer-substring-no-properties (point) (point-max))))))))

(gptel-make-tool
 :name "brave_search"
 :category "web"
 :function #'brave-search-query
 :description "Perform a web search using the Brave Search API"
 :args (list '(:name "query"
               :type string
               :description "The search query string")))
</code></pre>

<p>
However, there are times I want to inspect the search results. I use this script:
</p>
<pre><code>#!/usr/bin/env bash

set -euo pipefail

API_URL="https://api.search.brave.com/res/v1/web/search"

check_deps() {
  for cmd in curl jq pass; do
    command -v "${cmd}" >/dev/null || {
      echo "missing: ${cmd}" >&2
      exit 1
    }
  done
}

perform_search() {
  local query="${1}"
  local res

  res=$(curl -s -G \
    -H "X-Subscription-Token: $(pass "search.brave.com/api/emacs")" \
    -H "Accept: application/json" \
    --data-urlencode "q=${query}" \
    "${API_URL}")

  if echo "${res}" | jq -e . >/dev/null 2>&1; then
    echo "${res}"
  else
    echo "error: failed to retrieve valid JSON res: ${res}" >&2
    exit 1
  fi
}

main() {
  check_deps

  if [ $# -eq 0 ]; then
    echo "Usage: ${0} &lt;query&gt;" >&2
    exit 1
  fi

  perform_search "${*}"
}

main "${@}"
</code></pre>

<p>
The script can be called manually from a shell: <code>brave-search 'quine definition' | jq -C | less</code>.
</p>

<p>
The tool definition condenses to:
</p>

<pre><code>(gptel-make-tool
 :name "brave_search"
 :category "web"
 :function
 (lambda (query)
   (shell-command-to-string
    (format "brave-search %s"
            (shell-quote-argument query))))
 :description "Perform a web search using the Brave Search API"
 :args
 (list '(:name "query"
         :type string
         :description "The search query string")))
</code></pre>

<h4>Context</h4>

<p>
One limitation that I have run into with tools is context overflow — when retrieved data exceeds an LLM's context window.
</p>

<p>
For example, this tool lets an LLM read <code>man</code> pages, helping it correctly recall command flags:
</p>

<pre><code>(gptel-make-tool
 :name "man"
 :category "documentation"
 :function
 (lambda (page_name)
   (shell-command-to-string
    (concat "man --pager cat " page_name)))
 :description "Read a Unix manual page."
 :args
 '((:name "page_name"
    :type string
    :description
    "The name of the man page to read. Can optionally include a section number, for example: '2 read' or 'cat(1)'.")))
</code></pre>

<p>
It broke when calling the <a href="https://www.gnu.org/software/units/">GNU units</a> <code>man</code> page, which exceeds 40,000 tokens on my system. This was unfortunate, since some conversions, like temperature, are unintuitive:
</p>

<pre><code>units 'tempC(100)' tempF
</code></pre>

<p>
With <code>gptel</code>, one fallback is Emacs' built-in <code>man</code> functionality. The appropriate region can be selected and added as context with <code>-r</code> in the transient menu. In some cases, this is faster than a tool call.
</p>

<video autoplay loop muted disablepictureinpicture
       class="video" src="/static/media/llm-temp.mp4"
       type="video/mp4">
  Your browser does not support video.
</video>

<p>
I ran into a similar problem with the <code>read_url</code> tool (also found in the <a href="https://github.com/karthink/gptel/wiki/Tools-collection">gptel wiki</a>). It can break if the response is larger than the context window.
</p>

<pre><code>(gptel-make-tool
 :name "read_url"
 :category "web"
 :function
 (lambda (url)
   (with-current-buffer
       (url-retrieve-synchronously url)
     (goto-char (point-min))
     (forward-paragraph)
     (let ((dom (libxml-parse-html-region
                 (point) (point-max))))
       (run-at-time 0 nil #'kill-buffer
                    (current-buffer))
       (with-temp-buffer
         (shr-insert-document dom)
         (buffer-substring-no-properties
          (point-min)
          (point-max))))))
 :description "Fetch and read the contents of a URL"
 :args (list '(:name "url"
               :type string
               :description "The URL to read")))
</code></pre>
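<p>
One crude mitigation is to cap how much text a tool can return. The helper below is an illustrative sketch (the name and the 20,000-character limit are arbitrary); a tool like <code>read_url</code> or <code>man</code> could pass its result through it before returning:
</p>

<pre><code>;; Illustrative only: cap tool output before it reaches the model.
(defun my/truncate-for-context (text &optional max-chars)
  "Return TEXT truncated to MAX-CHARS characters (default 20000)."
  (let ((limit (or max-chars 20000)))
    (if (> (length text) limit)
        (concat (substring text 0 limit) "\n[...truncated...]")
      text)))
</code></pre>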
<p>
When I have run into this problem, the issue was bloated functional content — JavaScript and CSS. If the content is not dynamically generated, one can fall back to Emacs' web browser, <code><a href="https://www.gnu.org/software/emacs/manual/html_mono/eww.html">eww</a></code>. The buffer or selected regions can be added as context. A more sophisticated tool could help in these cases. Long term, I hope that LLMs will steer the web back towards readability, either by acting as an aggregator and filter, or as evolutionary pressure in favor of static content.
</p>

<h4>Security</h4>

<p>
The <code><a href="https://github.com/karthink/gptel/wiki/Tools-collection#run_command">run_command</a></code> tool, also found in the <code>gptel</code> tool collection, enables shell command execution and requires careful consideration. A compromised model could issue malicious commands, or a poorly formatted command could have unintended consequences. <code>gptel</code>'s <code>:confirm</code> key can be used to inspect and approve tool calls.
</p>

<pre><code>(gptel-make-tool
 :name "run_command"
 :category "command"
 :confirm t
 :function
 (lambda (command)
   (with-temp-message
       (format "Executing command: %s" command)
     (shell-command-to-string command)))
 :description
 "Execute a shell command; returns the output as a string."
 :args
 '((:name "command"
    :type string
    :description "The complete shell command to execute.")))
</code></pre>

<p>
Inspection limits the LLM's ability to operate asynchronously, without human intervention. There are a few solutions to this problem, the easiest being to offer tools with more limited scope.
</p>

<video autoplay loop muted disablepictureinpicture
       class="video" src="/static/media/llm-inspect.mp4"
       type="video/mp4">
  Your browser does not support video.
</video>

<h3>Presets</h3>

<p>
With <code>gptel</code>'s transient menu, only a few keystrokes are needed to add, edit, or remove context, switch the model one wants to query, change the input and output, or edit the system message. Presets accelerate switching between settings, and are defined with <code>gptel-make-preset</code>.
</p>

<p>
For example, with <a href="https://huggingface.co/openai/gpt-oss-120b">GPT-OSS 120B</a> (one of OpenAI's <a href="https://openai.com/open-models/">open weights</a> models), a system prompt is necessary to minimize the use of tables and excessive text styling. A preset can load the appropriate settings:
</p>

<pre><code>(gptel-make-preset 'assistant/gpt
 :description "GPT-OSS general assistant."
 :backend "llama.cpp"
 :model 'gpt
 :include-reasoning nil
 :system
 "You are a large language model queried from Emacs. Your conversation with the user occurs in an org-mode buffer.

- Use org-mode syntax only (no Markdown).
- Use tables ONLY for tabular data with few columns and rows.
- Avoid extended text in table cells. If cells need paragraphs, use a list instead.
- Default to plain paragraphs and simple lists.
- Minimize styling. Use *bold* or /italic/ only where emphasis is essential. Use ~code~ for technical terms.
- If citing facts or resources, output references as org-mode links.
- Use code blocks for calculations or code examples.")
</code></pre>

<p>
From the transient menu, this preset can be selected with two keystrokes: <code>@</code> and then <code>a</code>.
</p>
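<p>
Presets can bundle more than a system message. For example, a web-research preset might pair a hosted model with the search tools defined earlier (a sketch, which assumes the <code>:tools</code> key selects tools by name):
</p>

<pre><code>(gptel-make-preset 'research
 :description "Web research with search and page reading."
 :backend "OpenRouter"
 :model 'openai/gpt-5.1
 :tools '("brave_search" "read_url"))
</code></pre>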
<h4>Memory</h4>

<p>
Presets can be used to implement read-only memory for an LLM. This preset uses <a href="https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking">Qwen3 VL 30B-A3B</a> with a <code>memory.org</code> file automatically included in the context:
</p>

<pre><code>(gptel-make-preset 'assistant/qwen
 :description "Qwen Emacs assistant."
 :backend "llama.cpp"
 :model 'qwen3_vl_30b-a3b
 :context '("~/memory.org"))
</code></pre>

<p>
The file can include any information that should always be included as context. One could also grant LLMs the ability to append to <code>memory.org</code>, though I am skeptical that they would do so judiciously.
</p>

<h2>Local LLMs</h2>

<p>
Running LLMs on one's own devices offers some advantages over third-party providers:
<ul>
  <li>Redundancy: they work offline, even if providers are experiencing an outage.</li>
  <li>Privacy: queries and data remain on the device.</li>
  <li>Control: you know exactly which model is running, with what settings, at what quantization.</li>
</ul>
</p>

<p>
The main trade-off is intelligence, though for many purposes, the gap is closing fast. Local models excel at summarizing data, language translation, image and PDF extraction, and simple research tasks. I rely on hosted models primarily for complex coding tasks, or when a larger effective context is required.
</p>

<h3>llama.cpp</h3>

<p>
<a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a> makes it easy to run models locally:
</p>

<pre><code>git clone https://github.com/ggml-org/llama.cpp.git

cd llama.cpp

cmake -B build

cmake --build build --config Release

mv build/bin/llama-server ~/.local/bin/ # Or elsewhere in PATH.

llama-server -hf unsloth/Qwen3-4B-GGUF:q8_0
</code></pre>

<p>
This will build <code>llama.cpp</code> with support for CPU-based inference, move <code>llama-server</code> into <code>~/.local/bin/</code>, and then download and run <a href="https://unsloth.ai/">Unsloth</a>'s <code>Q8</code> quantization of <a href="https://huggingface.co/Qwen/Qwen3-4B">Qwen3 4B</a>. The <code>llama.cpp</code> <a href="https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md">documentation</a> explains how to build for GPUs and other hardware — not much more work than the default build.
</p>

<p><code>llama-server</code> offers a web interface, available at port 8080 by default.</p>

<video autoplay loop muted disablepictureinpicture
       class="video" src="/static/media/llm-ls.mp4"
       type="video/mp4">
  Your browser does not support video.
</video>

<h3>Weights</h3>

<p>
Part of the art of using LLMs is selecting an appropriate model. Some factors to consider are available hardware, intended use (task, language), and desired pricing (input and output costs). Some models offer specialized capabilities — <a href="https://ai.google.dev/gemma/docs/core">Gemma3</a> and <a href="https://github.com/QwenLM/Qwen3-VL">Qwen3-VL</a> offer multimodal input, <a href="https://deepmind.google/models/gemma/medgemma/">MedGemma</a> specializes in medical knowledge, and <a href="https://mistral.ai/">Mistral</a>'s <a href="https://mistral.ai/news/devstral">Devstral</a> focuses on agentic use.
</p>

<p>
For local use, hardware tends to be the main limiter.
One has to fit the model into available memory, and consider the acceptable performance for one's use case. A rough guideline is to use the smallest model and quantization that can handle the task — or, from the opposite direction, to look for the largest model that fits into available memory. The rule of thumb is that a <code>Q8_0</code> quantization uses about one byte per parameter, so an 8 billion parameter model will use about 8 GB of RAM or VRAM. A <code>Q4_0</code> quant would use half that — 4 GB — while 16-bit weights would need 16 GB.
</p>

<p>
My workstation, laptop, and mobile (<code>llama.cpp</code> can be used from <code><a href="https://termux.dev/en/">termux</a></code>) all run different classes of weights. On my mobile device, I have about 12 GB of RAM, but background utilization is already around 8 GB. So, when necessary, I use 4B models at <code>Q8_0</code> or less: Gemma3, Qwen3-VL, and MedGemma. If a laptop has 16 GB of RAM with 2 GB in use, 8B models might run well enough. The workstation, which has a GPU, can run larger models, faster. There are other tricks one can use — <a href="https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention">flash attention</a>, <a href="https://research.google/blog/looking-back-at-speculative-decoding/">speculative decoding</a>, MoE offloading — to optimize performance across different hardware configurations.
</p>

<h3>llama-swap</h3>

<p>
One current limitation of <code>llama.cpp</code> is that unless you load multiple models at once, switching models requires manually starting a new instance of <code>llama-server</code>. To swap models on demand, <code><a href="https://github.com/mostlygeek/llama-swap">llama-swap</a></code> can be used.
</p>

<p>
<code>llama-swap</code> uses a YAML configuration file, which is <a href="https://github.com/mostlygeek/llama-swap/wiki/Configuration">well documented</a>. I use something like the following:
</p>

<pre><code>logLevel: debug

macros:
  "models": "/home/llama-swap/models"

models:
  gemma3:
    cmd: |
      llama-server
      --ctx-size 0
      --gpu-layers 888
      --jinja
      --min-p 0.0
      --model ${models}/gemma-3-27b-it-ud-q8_k_xl.gguf
      --mmproj ${models}/mmproj-gemma3-27b-bf16.gguf
      --port ${PORT}
      --repeat-penalty 1.0
      --temp 1.0
      --top-k 64
      --top-p 0.95
    ttl: 900
    name: "gemma3_27b"
  gpt:
    cmd: |
      llama-server
      --chat-template-kwargs '{"reasoning_effort": "high"}'
      --ctx-size 0
      --gpu-layers 888
      --jinja
      --model ${models}/gpt-oss-120b-f16.gguf
      --port ${PORT}
      --temp 1.0
      --top-k 0
      --top-p 1.0
    ttl: 900
    name: "gpt-oss_120b"
  qwen3_vl_30b-a3b:
    cmd: |
      llama-server
      --ctx-size 131072
      --gpu-layers 888
      --jinja
      --min-p 0
      --model ${models}/qwen3-vl-30b-a3b-thinking-ud-q8_k_xl.gguf
      --mmproj ${models}/mmproj-qwen3-vl-30ba3b-bf16.gguf
      --port ${PORT}
      --temp 0.6
      --top-k 20
      --top-p 0.95
    ttl: 900
    name: "qwen3_vl_30b-a3b-thinking"
</code></pre>

<h3>nginx</h3>

<p>
Since my workstation has a GPU and can be accessed on the local network or via <a href="https://www.wireguard.com/">WireGuard</a> from other devices, I use <code><a href="https://nginx.org/">nginx</a></code> as a reverse proxy in front of <code>llama-swap</code>, with certificates generated by <code><a href="https://certbot.eff.org/">certbot</a></code>.
For streaming LLM responses, <code>proxy_buffering off;</code> and <code>proxy_cache off;</code> are essential settings.
</p>

<pre><code>user http;
worker_processes 1;
worker_cpu_affinity auto;

events {
    worker_connections 1024;
}

http {
    charset utf-8;
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    server_tokens off;
    types_hash_max_size 4096;
    client_max_body_size 32M;

    # MIME
    include mime.types;
    default_type application/octet-stream;

    # logging
    access_log /var/log/nginx/access.log;
    error_log /var/log/nginx/error.log warn;

    include /etc/nginx/conf.d/*.conf;
}
</code></pre>

<p>Then, for <code>/etc/nginx/conf.d/llama-swap.conf</code>:</p>

<pre><code>server {
    listen 80;
    server_name llm.dwrz.net;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl;
    http2 on;
    server_name llm.dwrz.net;

    ssl_certificate /etc/letsencrypt/live/llm.dwrz.net/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.dwrz.net/privkey.pem;

    location / {
        proxy_buffering off;
        proxy_cache off;
        proxy_pass http://localhost:11434;
        proxy_read_timeout 3600s;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
</code></pre>

<h3>Emacs Configuration</h3>

<p>
<code>llama-server</code> offers an <a href="https://platform.openai.com/docs/api-reference/introduction">OpenAI-compatible API</a>. <code>gptel</code> can be configured to use local models with something like the following:
</p>

<pre><code>(gptel-make-openai "llama.cpp"
 :stream t
 :protocol "http"
 :host "localhost"
 :models
 '((gemma3
    :capabilities (media tool json url)
    :mime-types ("image/jpeg"
                 "image/png"
                 "image/gif"
                 "image/webp"))
   gpt
   (medgemma_27b
    :capabilities (media tool json url)
    :mime-types ("image/jpeg"
                 "image/png"
                 "image/gif"
                 "image/webp"))
   (qwen3_vl_30b-a3b
    :capabilities (media tool json url)
    :mime-types ("image/jpeg"
                 "image/png"
                 "image/gif"
                 "image/webp"))
   (qwen3_vl_32b
    :capabilities (media tool json url)
    :mime-types ("image/jpeg"
                 "image/png"
                 "image/gif"
                 "image/webp"))))
</code></pre>

<h2>Techniques</h2>

<p>
Having covered setup and configuration, here are some practical ways I use Emacs with LLMs, demonstrated with examples.
</p>

<h3>Simple Q&A</h3>

<p>
With the <code>gptel</code> transient menu, press <code>m</code> to prompt from the minibuffer and <code>e</code> to output the answer to the echo area, then <code>Enter</code> to submit the prompt.

<video autoplay loop muted disablepictureinpicture
       class="video" src="/static/media/llm-qa.mp4"
       type="video/mp4">
  Your browser does not support video.
</video>
</p>

<h3>Brief Conversations</h3>

<p>
For brief multi-turn conversations that require no persistence, <code>gptel</code> can be used in the <code>*scratch*</code> buffer. Context can be added via the transient menu with <code>-b</code>, <code>-f</code>, or <code>-r</code> as necessary. The conversation is not persisted unless the buffer is saved.
</p>
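<p>
Since <code>gptel-send</code> sends the region, or the buffer up to point, a global binding makes this workflow quick. The key below is an arbitrary choice:
</p>

<pre><code>;; Send the region or buffer-up-to-point from any buffer.
(global-set-key (kbd "C-c g") #'gptel-send)
</code></pre>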
<h3>Image-to-Text</h3>

<p>
With multimodal LLMs like Gemma3 and Qwen3-VL, one can extract text and tables from images.

<video autoplay loop muted disablepictureinpicture
       class="video" src="/static/media/llm-itt.mp4"
       type="video/mp4">
  Your browser does not support video.
</video>
</p>

<h3>Text-to-Image</h3>

<p>
Here, a local LLM retrieves a URL, reads its contents, and then generates an image with ComfyUI.

<video autoplay loop muted disablepictureinpicture
       class="video" src="/static/media/llm-image.mp4"
       type="video/mp4">
  Your browser does not support video.
</video>

The result:
<img class="img-center" src="/static/media/comfy-ui-dream.png">
</p>

<h3>Research</h3>

<p>
If I know I will need to reference a topic later, I usually start out with an <code><a href="https://orgmode.org/">org-mode</a></code> file. In this case, I tend to use links to construct context.
</p>

<h3>Translation</h3>

<p>
For small or unimportant text, Google Translate via the command line with <code><a href="https://github.com/soimort/translate-shell">translate-shell</a></code> works well enough. Otherwise, I find the translation output from local LLMs is typically more sensitive to context.

<video autoplay loop muted disablepictureinpicture
       class="video" src="/static/media/llm-translate.mp4"
       type="video/mp4">
  Your browser does not support video.
</video>
</p>

<h3>Code</h3>

<p>
My experience using LLMs for code has been mixed. For scripts and small programs, iterating in a single conversation works well. However, with larger codebases, few models contribute meaningfully. While hosted models are typically stronger in this use case, I surmise aggressive quantization has reduced their reliability. I have come to distrust the initial output from any model.
</p>

<p>
So far, I have had limited success with agents — which often burn through tokens to understand context, but still manage to miss important nuance. This experience has made me hesitant to add tool support for file operations.
</p>

<p>
Instead, I provide context through <code>org-mode</code> links in project-specific files. I have the LLM walk through potential changes, which I review and implement by hand. Generally, this approach saves time, but often, I still work faster on my own.
</p>

<h2>Conclusion</h2>

<p>
I first used Emacs as a text editor 20 years ago. For over a decade, I have used it daily — for writing and coding, task and finance management, email, as a calculator, and to interact with local and remote hosts. I continue to discover new functionality and techniques, and I was surprised to see how well this 50-year-old program has adapted to the frontier of technology. Despite flaws and limitations, its endurance reflects its foundational design.
</p>

<p>
The barrier to entry for Emacs is high.
For everyday users, comparable power and flexibility could be unlocked with support for:
<ul>
  <li>Notebooks featuring executable code blocks.</li>
  <li>Links to local and remote content, including other conversations.</li>
  <li>Switching models and providers, including local models.</li>
  <li>Mail and task integration.</li>
  <li>Offline operation with local models.</li>
  <li>Remote access — Emacs can be used remotely via SSH or TRAMP.</li>
</ul>
</p>

<p>
So far, my experiments with LLMs have left me with both concern and optimism. Local inference reveals the energy requirements, yet daily limitations make me skeptical of imminent superintelligence. In the same way that calculators outperform humans at arithmetic, LLMs may offer areas of comparative advantage. The key question is which tasks we can delegate reliably and efficiently, such that the effort of building scaffolding, maintaining guardrails, and managing operations costs less than doing the work ourselves.
</p>