
      1 <div class="wide64">
      2   <p>
      3     I first used <a href="https://www.gnu.org/software/emacs/">Emacs</a> as a text editor 20 years ago. For over a decade, I have used it daily — for writing and coding, task and finance management, email, as a calculator, and to interact with local and remote hosts. I continue to discover new functionality and techniques, and was surprised to see how this 50-year-old program has adapted to the frontier of technology.
      4   </p>
      5   <p>
      6     This video shows a <a href="https://en.wikipedia.org/wiki/Large_language_model">large language model</a> (LLM), running on my workstation, using Emacs to determine my location, retrieve weather data, and email me the results. By "<a href="https://arxiv.org/abs/2201.11903">thinking</a>", the LLM determines how to chain available tools to achieve the desired result.
      7   </p>
      8 </div>
      9 <video autoplay controls loop muted disablepictureinpicture
     10        class="video video-wide" src="/static/media/llm.mp4"
     11        type="video/mp4">
     12   Your browser does not support video.
     13 </video>
     14 <div class="wide64">
     15   <p>
     16     With <a href="https://karthinks.com">karthink</a>'s <a href="https://github.com/karthink/gptel">gptel</a> package and some custom code, Emacs is capable of:
     17   </p>
     18   <ul>
     19     <li>Querying models from hosted providers (<a href="https://www.anthropic.com/">Anthropic</a>, <a href="https://openai.com/">OpenAI</a>, <a href="https://openrouter.ai/">OpenRouter</a>), or local models (<a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a>, <a href="https://ollama.com/">ollama</a>).</li>
     20     <li>Switching between models and configurations with only a few keystrokes.</li>
     21     <li>Saving conversations to the local filesystem, and using them as context for other conversations.</li>
     22     <li>Including files, buffers, and terminals as context for queries.</li>
     23     <li>Searching the web and reading web pages.</li>
     24     <li>Searching, reading, and sending email.</li>
     25     <li>Consulting agendas, projects, and tasks.</li>
     26     <li>Executing Emacs Lisp code and shell commands.</li>
     27     <li>Generating images via the <a href="https://www.comfy.org/">ComfyUI</a> API.</li>
     28     <li>Geolocating the device and checking the current date and time.</li>
     29     <li>Reading <a href="https://en.wikipedia.org/wiki/Man_page">man</a> pages.</li>
     30     <li>Retrieving the user's name and email.</li>
     31   </ul>
     32   <p>
     33     Because LLMs understand and write <a href="https://en.wikipedia.org/wiki/Emacs_Lisp">Emacs Lisp</a> code, they can extend their own capabilities; the improvements are recursive. Below, I note some of the setup required to enable this functionality.
     34   </p>
     35 </div>
     36 
     37 <div class="wide64">
     38   <h2>Emacs</h2>
     39   <p>
     40     With <code><a href="https://www.gnu.org/software/emacs/manual/html_node/use-package/">use-package</a></code>, <a href="https://melpa.org/">MELPA</a>, and <a href="https://www.passwordstore.org/">pass</a> for password management, a minimal configuration for <code>gptel</code> looks like this:
     41   </p>
     42   <pre><code>(use-package gptel
     43  :commands (gptel gptel-send gptel-send-region gptel-send-buffer)
     44  :config
     45  (setq gptel-api-key (password-store-get "open-ai/emacs")
     46        gptel-curl--common-args
     47        '("--disable" "--location" "--silent" "--compressed" "-XPOST" "-D-")
     48        gptel-default-mode 'org-mode)
     49  :ensure t)</code></pre>
     50   <p>
     51     This is enough to start querying <a href="https://openai.com/api/">OpenAI's API</a> from Emacs.
     52   </p>
     53   <p>
     54     To use Anthropic's API:
     55   </p>
     56   <pre><code>(gptel-make-anthropic "Anthropic"
     57  :key (password-store-get "anthropic/api/emacs")
     58  :stream t)</code></pre>
     59   <p>
     60     I prefer OpenRouter, which provides access to models across providers:
     61   </p>
     62   <pre><code>(gptel-make-openai "OpenRouter"
     63  :endpoint "/api/v1/chat/completions"
     64  :host "openrouter.ai"
     65  :key (password-store-get "openrouter.ai/keys/emacs")
     66  :models '(anthropic/claude-opus-4.5
     67            anthropic/claude-sonnet-4.5
     68            anthropic/claude-3.5-sonnet
     69            cohere/command-a
     70            deepseek/deepseek-r1-0528
     71            deepseek/deepseek-v3.1-terminus:exacto
     72            google/gemini-3-pro-preview
     73            mistralai/devstral-medium
     74            mistralai/magistral-medium-2506:thinking
     75            moonshotai/kimi-k2-0905:exacto
     76            moonshotai/kimi-k2-thinking
     77            openai/gpt-5.1
     78            openai/gpt-5.1-codex
     79            openai/gpt-5-pro
     80            perplexity/sonar-deep-research
     81            qwen/qwen3-max
     82            qwen/qwen3-vl-235b-a22b-thinking
     83            qwen/qwen3-coder:exacto
     84            z-ai/glm-4.6:exacto)
     85  :stream t)</code></pre>
     86   <p>
     87     The choice of model depends on the task and its budget. Even where those two parameters are comparable, it is sometimes useful to switch models. One may have a blind spot, where another will have insight.
     88   </p>
     89   <p>
     90     With <code>gptel</code>, it is easy to switch models mid-conversation, or use the output from one model as context for another. For example, I've used <a href="https://www.perplexity.ai/">Perplexity's</a> <a href="https://openrouter.ai/perplexity/sonar-deep-research">Sonar Deep Research</a> to create briefings, then used another LLM to summarize findings or answer specific questions, augmented with web search.
     91   </p>
     92 </div>
     93 
     94 <div class="wide64">
     95   <h3>Tools</h3>
     96   <p>
     97     Tools augment a model's perception, memory, or capabilities. The <code>gptel-make-tool</code> function allows one to define tools for use by an LLM.
     98   </p>
     99   <p>
    100     When making tools, one can leverage Emacs' existing functionality. For example, the <code>read_url</code> tool uses <code><a href="https://www.gnu.org/software/emacs/manual/html_node/url/Retrieving-URLs.html">url-retrieve-synchronously</a></code>, while <code>get_user_name</code> and <code>get_user_email</code> read <code><a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/User-Identification.html#index-user_002dfull_002dname">user-full-name</a></code> and <code><a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/User-Identification.html#index-user_002dmail_002daddress">user-mail-address</a></code>. <code>now</code>, used to retrieve the current date and time, uses <code><a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/Time-Parsing.html#index-format_002dtime_002dstring">format-time-string</a></code>:
    101   </p>
    102   <pre><code>(gptel-make-tool
    103  :name "now"
    104  :category "time"
    105  :function (lambda () (format-time-string "%Y-%m-%d %H:%M:%S %Z"))
    106  :description "Retrieves the current local date, time, and timezone."
    107  :include t)</code></pre>
    108   <p>
    109     Similarly, if Emacs is <a href="https://www.gnu.org/software/emacs/manual/html_node/emacs/Sending-Mail.html">configured to send mail</a>, the tool definition is straightforward:
    110   </p>
    111   <pre><code>(gptel-make-tool
    112  :name "mail_send"
    113  :category "mail"
    114  :confirm t
    115  :description "Send an email with the user's Emacs mail configuration."
    116  :function
    117  (lambda (to subject body)
    118    (with-temp-buffer
    119      (insert "To: " to "\n"
    120              "From: " user-mail-address "\n"
    121              "Subject: " subject "\n\n"
    122              body)
    123      (sendmail-send-it)))
    124  :args
    125  '((:name "to"
    126           :type string
    127           :description "The recipient's email address.")
    128    (:name "subject"
    129           :type string
    130           :description "The subject of the email.")
    131    (:name "body"
    132           :type string
    133           :description "The body of the email text.")))</code></pre>
    134   <p>
    135     For more complex functionality, I prefer writing shell scripts, for several reasons:
    136     <ul>
    137       <li>The tool definitions are simpler. For example, my <code>qwen-image</code> script includes a large <code>JSON</code> object for the ComfyUI workflow, which I prefer to leave outside my Emacs configuration (a sketch of such a tool follows this list).</li>
    138       <li>Tools are accessible to LLMs that may not be running in the Emacs environment (agents, one-off scripts).</li>
    139       <li>Fluency. LLMs seem better at writing bash (or Python, or Go) than Emacs Lisp, so it is easier to lean on this inherent expertise when developing the tools themselves.</li>
    140     </ul>
    141   </p>
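  <p>
    For example, a tool wrapping the <code>qwen-image</code> script might look like the following sketch. It assumes the script accepts a text prompt as its only argument and prints the path of the generated image; adjust it to match the script's actual interface.
  </p>
  <pre><code>(gptel-make-tool
 :name "generate_image"
 :category "image"
 :confirm t
 :function
 (lambda (prompt)
   ;; Assumes qwen-image takes a prompt and prints the output path.
   (shell-command-to-string
    (format "qwen-image %s" (shell-quote-argument prompt))))
 :description "Generate an image from a text prompt via ComfyUI."
 :args
 '((:name "prompt"
          :type string
          :description "A text description of the image to generate.")))</code></pre>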
    142   <img class="img-center" src="/static/media/drawing-hands.jpg">
    143   <div class="caption">
    144     <p>M.C. Escher, <i>Drawing Hands</i> (1948)</p>
    145   </div>
    146 </div>
    147 
    148 <div class="wide64">
    149   <h4>Web Search</h4>
    150   <p>
    151     For example, for web search, I initially used the tool described in the <code>gptel</code> <a href="https://github.com/karthink/gptel/wiki/Tools-collection">wiki</a>:
    152   </p>
    153   <pre><code>(defvar brave-search-api-key (password-store-get "search.brave.com/api/emacs")
    154   "API key for accessing the Brave Search API.")
    155 
    156 (defun brave-search-query (query)
    157   "Perform a web search using the Brave Search API with the given QUERY."
    158   (let ((url-request-method "GET")
    159         (url-request-extra-headers
    160          `(("X-Subscription-Token" . ,brave-search-api-key)))
    161         (url (format "https://api.search.brave.com/res/v1/web/search?q=%s"
    162                      (url-encode-url query))))
    163     (with-current-buffer (url-retrieve-synchronously url)
    164       (goto-char (point-min))
    165       (when (re-search-forward "^$" nil 'move)
    166         (let ((json-object-type 'hash-table))
    167           (json-parse-string
    168            (buffer-substring-no-properties (point) (point-max))))))))
    169 
    170 (gptel-make-tool
    171  :name "brave_search"
    172  :category "web"
    173  :function #'brave-search-query
    174  :description "Perform a web search using the Brave Search API"
    175  :args (list '(:name "query"
    176                      :type string
    177                      :description "The search query string")))</code></pre>
    178   <p>
    179     However, there are times when I want to inspect the search results, so I use this script instead:
    180   </p>
    181   <pre><code>#!/usr/bin/env bash
    182 
    183 set -euo pipefail
    184 
    185 API_URL="https://api.search.brave.com/res/v1/web/search"
    186 
    187 check_deps() {
    188   for cmd in curl jq pass; do
    189     command -v "${cmd}" >/dev/null || {
    190       echo "missing: ${cmd}" >&2
    191       exit 1
    192     }
    193   done
    194 }
    195 
    196 perform_search() {
    197   local query="${1}"
    198   local res
    199 
    200   res=$(curl -s -G \
    201              -H "X-Subscription-Token: $(pass "search.brave.com/api/emacs")" \
    202              -H "Accept: application/json" \
    203              --data-urlencode "q=${query}" \
    204              "${API_URL}")
    205   if echo "${res}" | jq -e . >/dev/null 2>&1; then
    206     echo "${res}"
    207   else
    208     echo "error: failed to retrieve valid JSON res: ${res}" >&2
    209     exit 1
    210   fi
    211 }
    212 
    213 main() {
    214   check_deps
    215 
    216   if [ $# -eq 0 ]; then
    217     echo "Usage: ${0} <query>" >&2
    218     exit 1
    219   fi
    220 
    221   perform_search "${*}"
    222 }
    223 
    224 main "${@}"</code></pre>
    225   <p>
    226     The script can be called manually from a shell: <code>brave-search 'quine definition' | jq -C | less</code>.
    227   </p>
    228   <p>
    229     The tool definition condenses to:
    230   </p>
    231   <pre><code>(gptel-make-tool
    232  :name "brave_search"
    233  :category "web"
    234  :function
    235  (lambda (query)
    236    (shell-command-to-string
    237     (format "brave-search %s"
    238             (shell-quote-argument query))))
    239  :description "Perform a web search using the Brave Search API"
    240  :args
    241  (list '(:name "query"
    242                :type string
    243                :description "The search query string")))</code></pre>
    244 </div>
    245 <div class="wide64">
    246   <h4>Context</h4>
    247   <p>
    248     One limitation that I have run into with tools is context overflow — when retrieved data exceeds an LLM's context window.
    249   </p>
    250   <p>
    251     For example, this tool lets an LLM read <code>man</code> pages, helping it correctly recall command flags:
    252   </p>
    253   <pre><code>(gptel-make-tool
    254  :name "man"
    255  :category "documentation"
    256  :function
    257  (lambda (page_name)
    258    (shell-command-to-string
    259     (concat "man --pager cat " page_name)))
    260  :description "Read a Unix manual page."
    261  :args
    262  '((:name "page_name"
    263           :type string
    264           :description
    265           "The name of the man page to read. Can optionally include a section number, for example: '2 read' or 'cat(1)'.")))</code></pre>
    266 
    267   <p>
    268     It broke when reading the <a href="https://www.gnu.org/software/units/">GNU units</a> <code>man</code> page, which exceeds 40,000 tokens on my system. This was unfortunate, since some conversions, like temperature, are unintuitive:
    269   </p>
    270 
    271   <pre><code>units 'tempC(100)' tempF</code></pre>
    272   <p>
    273     With <code>gptel</code>, one fallback is Emacs' built-in <code>man</code> functionality. The appropriate region can be selected with <code>-r</code> in the transient menu. In some cases, this is faster than a tool call.
    274   </p>
    275 </div>
    276 <video autoplay controls loop muted disablepictureinpicture
    277        class="video" src="/static/media/llm-temp.mp4"
    278        type="video/mp4">
    279   Your browser does not support video.
    280 </video>
    281 <div class="wide64">
    282   <p>
    283     I ran into a similar problem with the <code>read_url</code> tool (also found on the <a href="https://github.com/karthink/gptel/wiki/Tools-collection">gptel wiki</a>). It can break if the response is larger than the context window.
    284   </p>
    285   <pre><code>(gptel-make-tool
    286   :name "read_url"
    287   :category "web"
    288   :function
    289   (lambda (url)
    290     (with-current-buffer
    291         (url-retrieve-synchronously url)
    292       (goto-char (point-min)) (forward-paragraph)
    293       (let ((dom (libxml-parse-html-region
    294                   (point) (point-max))))
    295         (run-at-time 0 nil #'kill-buffer
    296                      (current-buffer))
    297         (with-temp-buffer
    298           (shr-insert-document dom)
    299           (buffer-substring-no-properties
    300            (point-min)
    301            (point-max))))))
    302   :description "Fetch and read the contents of a URL"
    303   :args (list '(:name "url"
    304                       :type string
    305                       :description "The URL to read")))</code></pre>
    306   <p>
    307     When I have run into this problem, the issue was bloated functional content — JavaScript and CSS. If the content is not dynamically generated, one can fall back to Emacs' web browser, <code><a href="https://www.gnu.org/software/emacs/manual/html_mono/eww.html">eww</a></code>. The buffer or selected regions can be added as context. A more sophisticated tool could help in these cases. Long term, I hope that LLMs will steer the web back towards readability, either by acting as an aggregator and filter, or as evolutionary pressure in favor of static content.
    308   </p>
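  <p>
    In the meantime, one mitigation (a sketch, not something I currently run) is to cap how much text the tool can return, so that an oversized page degrades into a truncated excerpt instead of a failed request. The 100,000-character budget below is an arbitrary placeholder.
  </p>
  <pre><code>(defvar my/read-url-max-chars 100000
  "Maximum number of characters a read_url call may return.")

(gptel-make-tool
 :name "read_url_truncated"
 :category "web"
 :function
 (lambda (url)
   (with-current-buffer (url-retrieve-synchronously url)
     (goto-char (point-min)) (forward-paragraph)
     (let ((dom (libxml-parse-html-region (point) (point-max))))
       (run-at-time 0 nil #'kill-buffer (current-buffer))
       (with-temp-buffer
         (shr-insert-document dom)
         ;; Return at most the configured character budget.
         (buffer-substring-no-properties
          (point-min)
          (min (point-max)
               (+ (point-min) my/read-url-max-chars)))))))
 :description "Fetch a URL and return at most a fixed amount of text."
 :args (list '(:name "url"
                     :type string
                     :description "The URL to read")))</code></pre>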
    309 </div>
    310 
    311 <div class="wide64">
    312   <h4>Security</h4>
    313   <p>
    314     The <code><a href="https://github.com/karthink/gptel/wiki/Tools-collection#run_command">run_command</a></code> tool, also found in the <code>gptel</code> tool collection, enables shell command execution, and requires care. A compromised model could issue malicious commands, or a poorly prepared command could have unintended consequences. <code>gptel</code>'s <code>:confirm</code> key can be used to inspect and approve tool calls.
    315   </p>
    316 
    317   <pre><code>(gptel-make-tool
    318  :name "run_command"
    319  :category "command"
    320  :confirm t
    321  :function
    322  (lambda (command)
    323    (with-temp-message
    324        (format "Executing command: =%s=" command)
    325      (shell-command-to-string command)))
    326  :description
    327  "Execute a shell command; returns the output as a string."
    328  :args
    329  '((:name "command"
    330           :type string
    331           :description "The complete shell command to execute.")))</code></pre>
    332 
    333   <p>
    334     Inspection limits the LLM's ability to operate asynchronously, without human intervention. There are a few solutions to this problem, the easiest being to offer tools with more limited scope.
    335   </p>
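  <p>
    For instance, rather than exposing arbitrary shell access, a read-only tool can cover a common case without confirmation. A sketch:
  </p>
  <pre><code>(gptel-make-tool
 :name "list_directory"
 :category "filesystem"
 :function
 (lambda (directory)
   ;; Read-only: returns file names and never executes anything.
   (mapconcat #'identity
              (directory-files (expand-file-name directory))
              "\n"))
 :description "List the files in a directory."
 :args
 '((:name "directory"
          :type string
          :description "The path of the directory to list.")))</code></pre>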
    336 </div>
    337 
    338 <video autoplay controls loop muted disablepictureinpicture
    339        class="video" src="/static/media/llm-inspect.mp4"
    340        type="video/mp4">
    341   Your browser does not support video.
    342 </video>
    343 
    344 <div class="wide64">
    345   <h3>Presets</h3>
    346   <p>
    347     With <code>gptel</code>'s transient menu, only a few keystrokes are needed to add, edit, or remove context, switch the model one wants to query, change the input and output, or edit the system message. Presets accelerate switching between settings, and are defined with <code>gptel-make-preset</code>.
    348   </p>
    349   <p>
    350     For example, with <a href="https://huggingface.co/openai/gpt-oss-120b">GPT-OSS 120B</a> (one of OpenAI's <a href="https://openai.com/open-models/">open weights</a> models), a system prompt is necessary to minimize the use of tables and excessive text styling. A preset can load the appropriate settings:
    351   </p>
    352   <pre><code>(gptel-make-preset 'assistant/gpt
    353   :description "GPT-OSS general assistant."
    354   :backend "llama.cpp"
    355   :model 'gpt
    356   :include-reasoning nil
    357   :system
    358   "You are a large language model queried from Emacs. Your conversation with the user occurs in an org-mode buffer.
    359 
    360 - Use org-mode syntax only (no Markdown).
    361 - Use tables ONLY for tabular data with few columns and rows.
    362 - Avoid extended text in table cells. If cells need paragraphs, use a list instead.
    363 - Default to plain paragraphs and simple lists.
    364 - Minimize styling. Use *bold* or /italic/ only where emphasis is essential. Use ~code~ for technical terms.
    365 - If citing facts or resources, output references as org-mode links.
    366 - Use code blocks for calculations or code examples.")</code></pre>
    367   <p>
    368     From the transient menu, this preset can be selected with two keystrokes: <code>@</code> and then <code>a</code>. Alternatively, the preset can be applied by naming it in the prompt, like so: <code>@assistant/gpt When is the solstice this year?</code>
    369   </p>
    370 </div>
    371 
    372 <div class="wide64">
    373   <h4>Memory</h4>
    374   <p>
    375     Presets can be used to implement read-only memory for an LLM. This preset uses <a href="https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking">Qwen3 VL 30B-A3B</a> with a <code>memory.org</code> file automatically included in the context:
    376   </p>
    377 
    378   <pre><code>(gptel-make-preset 'assistant/qwen
    379  :description "Qwen Emacs assistant."
    380  :backend "llama.cpp"
    381  :model 'qwen3_vl_30b-a3b
    382  :context '("~/memory.org"))</code></pre>
    383 
    384   <p>
    385     The file can hold any information that should always be available as context. One could also grant LLMs the ability to append to <code>memory.org</code>, though I am skeptical that they would do so judiciously.
    386   </p>
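  <p>
    For what it is worth, such a tool is simple to define. A sketch, with <code>:confirm t</code> so that every write can be reviewed:
  </p>
  <pre><code>(gptel-make-tool
 :name "memory_append"
 :category "memory"
 :confirm t
 :function
 (lambda (entry)
   (with-temp-buffer
     (insert "\n" entry "\n")
     (append-to-file (point-min) (point-max)
                     (expand-file-name "~/memory.org"))))
 :description "Append a note to the user's memory.org file."
 :args
 '((:name "entry"
          :type string
          :description "The note to remember, in org-mode syntax.")))</code></pre>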
    387 </div>
    388 
    389 <div class="wide64">
    390   <h2>Local LLMs</h2>
    391   <p>
    392     Running LLMs on one's own devices offers some advantages over third-party providers:
    393     <ul>
    394       <li>Redundancy: they work offline, even if providers are experiencing an outage.</li>
    395       <li>Privacy: queries and data remain on the device.</li>
    396       <li>Control: you know exactly which model is running, with what settings, at what quantization.</li>
    397     </ul>
    398   </p>
    399   <p>
    400     The main trade-off is intelligence, though for many purposes, the gap is closing fast. Local models excel at summarizing data, language translation, image and PDF extraction, and simple research tasks. I rely on hosted models primarily for complex coding tasks, or when a larger effective context is required.
    401   </p>
    402   <h3>llama.cpp</h3>
    403   <p>
    404     <a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a> makes it easy to run models locally:
    405   </p>
    406   <pre><code>git clone https://github.com/ggml-org/llama.cpp.git
    407 
    408 cd llama.cpp
    409 
    410 cmake -B build
    411 
    412 cmake --build build --config Release
    413 
    414 mv build/bin/llama-server ~/.local/bin/ # Or elsewhere in PATH.
    415 
    416 llama-server -hf unsloth/Qwen3-4B-GGUF:q8_0</code></pre>
    417   <p>
    418     This will build <code>llama.cpp</code> with support for CPU-based inference, move <code>llama-server</code> into <code>~/.local/bin/</code>, and then download and run <a href="https://unsloth.ai/">Unsloth</a>'s <code>Q8</code> quantization of <a href="https://huggingface.co/Qwen/Qwen3-4B">Qwen3 4B</a>. The <code>llama.cpp</code> <a href="https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md">documentation</a> explains how to build for GPUs and other hardware — not much more work than the default build.
    419   </p>
    420   <p><code>llama-server</code> offers a web interface, available at port 8080 by default.</p>
    421 </div>
    422 
    423 <video autoplay controls loop muted disablepictureinpicture
    424        class="video" src="/static/media/llm-ls.mp4"
    425        type="video/mp4">
    426   Your browser does not support video.
    427 </video>
    428 
    429 <div class="wide64">
    430   <h3>Weights</h3>
    431   <p>
    432     Part of the art of using LLMs is selecting an appropriate model. Some factors to consider are available hardware, intended use (task, language), and desired pricing (input and output costs). Some models offer specialized capabilities — <a href="https://ai.google.dev/gemma/docs/core">Gemma3</a> and <a href="https://github.com/QwenLM/Qwen3-VL">Qwen3-VL</a> offer multimodal input, <a href="https://deepmind.google/models/gemma/medgemma/">Medgemma</a> specializes in medical knowledge, and <a href="https://mistral.ai/">Mistral</a>'s <a href="https://mistral.ai/news/devstral">Devstral</a> focuses on agentic use.
    433   </p>
    434   <p>
    435     For local use, hardware tends to be the main limiter. One has to fit the model into available memory, and consider the acceptable performance for one's use case. A rough guideline is to use the smallest model and quantization that can handle the task at hand. Or, from the opposite direction, to look for the largest model that fits into available memory. The rule of thumb is that a <code>Q8_0</code> quantization uses about as much memory as there are parameters, so an 8 billion parameter model will use about 8 GB of RAM or VRAM. A <code>Q4_0</code> quant would use half that — 4 GB — while a 16-bit model would need 16 GB.
    436   </p>
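  <p>
    The arithmetic behind the rule of thumb is just bits per weight times parameter count; a rough estimate, ignoring the KV cache and other runtime overhead:
  </p>
  <pre><code>;; Approximate weight memory in GB: parameters (in billions)
;; multiplied by bits per weight, divided by 8 bits per byte.
(defun llm-weight-gb (params-billions bits-per-weight)
  (/ (* params-billions bits-per-weight) 8.0))

(llm-weight-gb 8 8)  ; Q8_0  => 8.0 GB
(llm-weight-gb 8 4)  ; Q4_0  => 4.0 GB
(llm-weight-gb 8 16) ; BF16  => 16.0 GB</code></pre>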
    437   <p>
    438     My workstation, laptop, and mobile (<code>llama.cpp</code> can be used from <code><a href="https://termux.dev/en/">termux</a></code>) all run different classes of weights. On my mobile device, I have about 12GB of RAM, but background utilization is already around 8GB. So, when necessary, I use 4B models at <code>Q8_0</code> or less: Gemma3, Qwen3-VL, and Medgemma. If a laptop has 16GB of RAM with 2GB in use, 8B models might run well enough. The workstation, which has a GPU, can run larger models, with longer context, faster. There are other tricks one can use — <a href="https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention">flash attention</a>, <a href="https://research.google/blog/looking-back-at-speculative-decoding/">speculative decoding</a>, MoE offloading — to optimize performance across different hardware configurations.
    439   </p>
    440 </div>
    441 
    442 <div class="wide64">
    443   <h3>llama-swap</h3>
    444   <p>
    445     One current limitation of <code>llama.cpp</code> is that unless you load multiple models at once, switching models requires manually starting a new instance of <code>llama-server</code>. To swap models on demand, <code><a href="https://github.com/mostlygeek/llama-swap">llama-swap</a></code> can be used.
    446   </p>
    447   <p>
    448     <code>llama-swap</code> uses a <code>YAML</code> configuration file, which is <a href="https://github.com/mostlygeek/llama-swap/wiki/Configuration">well documented</a>. I use something like the following:
    449   </p>
    450   <pre><code>logLevel: debug
    451 
    452 macros:
    453   "models": "/home/llama-swap/models"
    454 
    455 models:
    456   gemma3:
    457     cmd: |
    458       llama-server
    459       --ctx-size 0
    460       --gpu-layers 888
    461       --jinja
    462       --min-p 0.0
    463       --model ${models}/gemma-3-27b-it-ud-q8_k_xl.gguf
    464       --mmproj ${models}/mmproj-gemma3-27b-bf16.gguf
    465       --port ${PORT}
    466       --repeat-penalty 1.0
    467       --temp 1.0
    468       --top-k 64
    469       --top-p 0.95
    470     ttl: 900
    471     name: "gemma3_27b"
    472   gpt:
    473     cmd: |
    474       llama-server
    475       --chat-template-kwargs '{"reasoning_effort": "high"}'
    476       --ctx-size 0
    477       --gpu-layers 888
    478       --jinja
    479       --model ${models}/gpt-oss-120b-f16.gguf
    480       --port ${PORT}
    481       --temp 1.0
    482       --top-k 0
    483       --top-p 1.0
    484     ttl: 900
    485     name: "gpt-oss_120b"
    486   qwen3_vl_30b-a3b:
    487     cmd: |
    488       llama-server
    489       --ctx-size 131072
    490       --gpu-layers 888
    491       --jinja
    492       --min-p 0
    493       --model ${models}/qwen3-vl-30b-a3b-thinking-ud-q8_k_xl.gguf
    494       --mmproj ${models}/mmproj-qwen3-vl-30ba3b-bf16.gguf
    495       --port ${PORT}
    496       --temp 0.6
    497       --top-k 20
    498       --top-p 0.95
    499     ttl: 900
    500     name: "qwen3_vl_30b-a3b-thinking"</code></pre>
    501 </div>
    502 <div class="wide64">
    503   <h3>nginx</h3>
    504   <p>
    505     Since my workstation has a GPU and can be accessed on the local network or via <a href="https://www.wireguard.com/">WireGuard</a> from other devices, I use <code><a href="https://nginx.org/">nginx</a></code> as a reverse proxy in front of <code>llama-swap</code>, with certificates generated by <code><a href="https://certbot.eff.org/">certbot</a></code>. For streaming LLM responses, <code>proxy_buffering off;</code> and <code>proxy_cache off;</code> are essential settings.
    506   </p>
    507 
    508   <pre><code>user http;
    509 worker_processes 1;
    510 worker_cpu_affinity auto;
    511 
    512 events {
    513     worker_connections 1024;
    514 }
    515 
    516 http {
    517     charset utf-8;
    518     sendfile on;
    519     tcp_nopush on;
    520     tcp_nodelay on;
    521     server_tokens off;
    522     types_hash_max_size 4096;
    523     client_max_body_size 32M;
    524 
    525     # MIME
    526     include mime.types;
    527     default_type application/octet-stream;
    528 
    529     # logging
    530     access_log /var/log/nginx/access.log;
    531     error_log /var/log/nginx/error.log warn;
    532 
    533     include /etc/nginx/conf.d/*.conf;
    534 }</code></pre>
    535 
    536   <p>Then, for <code>/etc/nginx/conf.d/llama-swap.conf</code>:</p>
    537 
    538   <pre><code>server {
    539 	listen 80;
    540 	server_name llm.dwrz.net;
    541 	return 301 https://$server_name$request_uri;
    542 }
    543 
    544 server {
    545 	listen 443 ssl;
    546 	http2 on;
    547 	server_name llm.dwrz.net;
    548 
    549 	ssl_certificate /etc/letsencrypt/live/llm.dwrz.net/fullchain.pem;
    550 	ssl_certificate_key /etc/letsencrypt/live/llm.dwrz.net/privkey.pem;
    551 
    552 	location / {
    553 		proxy_buffering off;
    554 		proxy_cache off;
    555 		proxy_pass http://localhost:11434;
    556 		proxy_read_timeout 3600s;
    557 		proxy_set_header Host $host;
    558 		proxy_set_header X-Real-IP $remote_addr;
    559 		proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    560 		proxy_set_header X-Forwarded-Proto $scheme;
    561 	}
    562 }</code></pre>
    563 </div>
    564 <div class="wide64">
    565   <h3>Emacs Configuration</h3>
    566 
    567   <p>
    568     <code>llama-server</code> offers an <a href="https://platform.openai.com/docs/api-reference/introduction">OpenAI</a>-compatible API. <code>gptel</code> can be configured to use local models with something like the following:
    569   </p>
    570 
    571   <pre><code>(gptel-make-openai "llama.cpp"
    572   :stream t
    573   :protocol "http"
    574   :host "localhost"
    575   :models
    576   '((gemma3
    577      :capabilities (media tool json url)
    578      :mime-types ("image/jpeg"
    579                   "image/png"
    580                   "image/gif"
    581                   "image/webp"))
    582     gpt
    583     (medgemma_27b
    584      :capabilities (media tool json url)
    585      :mime-types ("image/jpeg"
    586                   "image/png"
    587                   "image/gif"
    588                   "image/webp"))
    589     (qwen3_vl_30b-a3b
    590      :capabilities (media tool json url)
    591      :mime-types ("image/jpeg"
    592                   "image/png"
    593                   "image/gif"
    594                   "image/webp"))
    595     (qwen3_vl_32b
    596      :capabilities (media tool json url)
    597      :mime-types ("image/jpeg"
    598                   "image/png"
    599                   "image/gif"
    600                   "image/webp"))))</code></pre>
    601 </div>
    602 <div class="wide64">
    603   <h2>Techniques</h2>
    604   <p>
    605     With the setup and configuration covered, here are some practical ways I use Emacs with LLMs, demonstrated with examples:
    606   </p>
    607 </div>
    608 <div class="wide64">
    609   <h3>Simple Q&A</h3>
    610   <p>
    611     With the <code>gptel</code> transient menu, press <code>m</code> to prompt from the minibuffer, and <code>e</code> to output the answer to the echo area, then <code>Enter</code> to input the prompt.
    612   </p>
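  <p>
    The same round trip can be scripted with <code>gptel-request</code>, <code>gptel</code>'s programmatic entry point. A minimal sketch:
  </p>
  <pre><code>;; Ask a one-off question and show the answer in the echo area.
(gptel-request "What is a quine?"
  :callback (lambda (response info)
              (if (stringp response)
                  (message "%s" response)
                (message "gptel error: %s" (plist-get info :status)))))</code></pre>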
    613 </div>
    614 
    615 <video autoplay controls loop muted disablepictureinpicture
    616        class="video" src="/static/media/llm-qa.mp4"
    617        type="video/mp4">
    618   Your browser does not support video.
    619 </video>
    620 
    621 <div class="wide64">
    622   <h3>Brief Conversations</h3>
    623 
    624   <p>
    625     For brief multi-turn conversations that require no persistence, <code>gptel</code> can be used in the <code>*scratch*</code> buffer. Context can be added via the transient menu, <code>-b</code>, <code>-f</code>, or <code>-r</code> as necessary. The conversation is not persisted unless the buffer is saved.
    626   </p>
    627 
    628   <h3>Image-to-Text</h3>
    629   <p>
    630     With multimodal LLMs like Gemma3 and Qwen3-VL, one can extract text and tables from images.
    631   </p>
    632 </div>
    633 
    634 <video autoplay controls loop muted disablepictureinpicture
    635        class="video" src="/static/media/llm-itt.mp4"
    636        type="video/mp4">
    637   Your browser does not support video.
    638 </video>
    639 
    640 <div class="wide64">
    641   <h3>Text-to-Image</h3>
    642   <p>
    643     My primary use case is to revisit themes from some of my dreams. Here, a local LLM retrieves a URL, reads its contents, and then generates an image with ComfyUI:
    644   </p>
    645 </div>
    646 <video autoplay controls loop muted disablepictureinpicture
    647        class="video" src="/static/media/llm-image.mp4"
    648        type="video/mp4">
    649   Your browser does not support video.
    650 </video>
    651 
    652 <div class="wide64">
    653   <p>
    654     The result:
    655     <img class="img-center" src="/static/media/comfy-ui-dream.png">
    656   </p>
    657 </div>
    658 
    659 <div class="wide64">
    660   <h3>Research</h3>
    661   <p>
    662     If I know I will need to reference a topic later, I usually start out with an <code><a href="https://orgmode.org/">org-mode</a></code> file. In this case, I tend to use links to construct context, something like this:
    663 
    664     <img class="img-center" src="/static/media/llm-links.png">
    665   </p>
    666 </div>
    667 
    668 <div class="wide64">
    669   <h3>Rewrites</h3>
    670   <p>
    671     Although I don't use it very often, <code>gptel</code> comes with rewrite functionality, activated when the transient menu is called on a selected region. It can be used on both text and code, and the output can be <code>diff</code>ed, iterated on, accepted, or rejected. Additionally, it can serve as a kind of autocomplete, by having an LLM implement the skeleton of a function or code block.
    672   </p>
    673 </div>
    674 
    675 <div class="wide64">
    676   <h3>Translation</h3>
    677   <p>
    678     For small or unimportant text, Google Translate via the command line with <code><a href="https://github.com/soimort/translate-shell">translate-shell</a></code> works well enough. Otherwise, I find the translation output from local LLMs is typically more sensitive to context.
    679   </p>
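  <p>
    A preset keeps translation a few keystrokes away. A sketch, reusing one of the local models defined above:
  </p>
  <pre><code>(gptel-make-preset 'translate
 :description "Translate text into English."
 :backend "llama.cpp"
 :model 'qwen3_vl_30b-a3b
 :system
 "Translate the user's text into English. Preserve the original formatting and tone. Output only the translation.")</code></pre>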
    680 </div>
    681 
    682 <video autoplay controls loop muted disablepictureinpicture
    683        class="video" src="/static/media/llm-translate.mp4"
    684        type="video/mp4">
    685   Your browser does not support video.
    686 </video>
    687 
    688 <div class="wide64">
    689   <h3>Code</h3>
    690   <p>
    691     My experience using LLMs for code has been mixed. For scripts and small programs, iterating in a single conversation works well. However, with larger codebases, I have not found that LLMs can contribute meaningfully and reliably. This used to be an area of relative strength for hosted models, but I surmise aggressive quantization has begun to reduce their effectiveness.
    692   </p>
    693 
    694   <p>
    695     So far, I have had limited success with agents. My experience has been that they burn through tokens to understand context, but still manage to miss important nuance. This experience has made me hesitant to add tool support for file operations. I am actively exploring some techniques on this front.
    696   </p>
    697 
    698   <p>
    699     For now, I have come to distrust the initial output from any model. Instead, I provide context through <code>org-mode</code> links in project-specific files. I have one or more LLMs walk through potential changes, which I review and implement by hand. Generally, this approach saves time, but often, I still work faster on my own.
    700   </p>
    701 </div>
    702 
    703 <div class="wide64">
    704   <h2>Reflections</h2>
    705   <blockquote>
    706     <p>
    707       <i>
    708         The question of whether a computer can think is no more interesting than
    709         the question of whether a submarine can swim.
    710       </i>
    711     </p>
    712 
    713     <p>
    714       Edsger Dijkstra
    715     </p>
    716   </blockquote>
    717 
    718   <p>
    719     Despite encountering frustrations with LLM use, it is hard to shake
    720     the feeling of experiencing a leap in capability. There is something
    721     magical to the technology, especially when run locally — the coil whine of
    722     the GPU evoking the spirit of Rodin's
    723     <a href="https://en.wikipedia.org/wiki/The_Thinker"><i>Thinker</i></a>.
    724     Learning <a href="https://www.3blue1brown.com/topics/neural-networks">how
    725     LLMs work</a> has offered <a href="https://arxiv.org/abs/2007.09560">another
    726     lens</a> through which to view the world.
    727   </p>
    728 
    729   <p>
    730     My hope is that time will distribute and democratize the technology, in terms of hardware (for local use) and software (system integration). For most users, the barrier to entry for Emacs is high. Other frontends could unlock comparable power and flexibility with support for:
    731     <ul>
    732       <li>Assisting the user in developing custom tools</li>
    733       <li>Notebooks featuring executable code blocks</li>
    734       <li>Links for local and remote content, including other conversations</li>
    735       <li>Switching models and providers at any point</li>
    736       <li>Mail and task integration</li>
    737       <li>Offline operation with local models</li>
    738       <li>Remote access — Emacs can be accessed via <code><a href="https://www.openssh.org/">SSH</a></code>, <code>gptel</code> files via <code><a href="https://www.gnu.org/software/tramp/">TRAMP</a></code></li>
    739     </ul>
    740   </p>
    741 
    742   <p>
    743     There are many topics of concern and discussion around LLMs. From my work with them so far, I am more anxious about some than others. Local inference alone reveals how much energy these models can require. On the other hand, the limitations of the technology leave me extremely skeptical of imminent superintelligence. But what we have now, limitations included, is useful — and has potential.
    744   </p>
    745 </div>