src

Go monorepo.
git clone git://code.dwrz.net/src

commit f809c1e7e078404680920043f020957bcf32ee37
parent bd9052786265ea8defb772f2e72177ff363f7b11
Author: dwrz <dwrz@dwrz.net>
Date:   Tue, 25 Nov 2025 17:01:08 +0000

Add 2025-11-24 entry

Diffstat:
A cmd/web/site/entry/static/2025-11-24/2025-11-24.html | 698 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 698 insertions(+), 0 deletions(-)

diff --git a/cmd/web/site/entry/static/2025-11-24/2025-11-24.html b/cmd/web/site/entry/static/2025-11-24/2025-11-24.html
@@ -0,0 +1,698 @@

<p>
  This video shows a <a href="https://en.wikipedia.org/wiki/Large_language_model">large language model</a> (<code>LLM</code>), running on my workstation, using <a href="https://www.gnu.org/software/emacs/">Emacs</a> to determine my location, retrieve the weather forecast, and email me the results.
</p>

<video autoplay loop muted
       class="video" src="/static/media/llm.mp4"
       type="video/mp4">
  Your browser does not support video.
</video>

<p>
  With <a href="https://karthinks.com">karthink</a>'s <a href="https://github.com/karthink/gptel">gptel</a> package and some custom code, Emacs is capable of:
</p>

<ul>
  <li>Querying models from hosted providers (<a href="https://www.anthropic.com/">Anthropic</a>, <a href="https://openai.com/">OpenAI</a>, <a href="https://openrouter.ai/">OpenRouter</a>), or local models (<a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a>, <a href="https://ollama.com/">ollama</a>, etcetera).</li>
  <li>Switching rapidly between models and configurations.</li>
  <li>Saving conversations and using them as context for other conversations.</li>
  <li>Including files, buffers, and terminals as context for queries.</li>
  <li>Searching the web and reading web pages.</li>
  <li>Searching, reading, and sending email.</li>
  <li>Consulting agendas, projects, and tasks.</li>
  <li>Executing Emacs Lisp code and shell commands.</li>
  <li>Generating images via <a href="https://www.comfy.org/">ComfyUI</a>.</li>
  <li>Geolocating the device and checking the current date and time.</li>
  <li>Reading <a href="https://en.wikipedia.org/wiki/Man_page">man</a> pages.</li>
  <li>Retrieving the user's name and email.</li>
</ul>

<p>
  Because LLMs are able to understand and write <a href="https://en.wikipedia.org/wiki/Emacs_Lisp">Emacs Lisp</a> code, one can use them to further extend their own capabilities; the improvements are recursive. Below, I note some of the setup required to enable this functionality.
</p>

<h2>Emacs</h2>

<p>
  With <code><a href="https://www.gnu.org/software/emacs/manual/html_node/use-package/">use-package</a></code>, <a href="https://melpa.org/">MELPA</a>, and <a href="https://www.passwordstore.org/">pass</a> for password management, a minimal configuration for <code>gptel</code> looks like this:
</p>

<pre><code>(use-package gptel
  :commands (gptel gptel-send gptel-send-region gptel-send-buffer)
  :config
  (setq gptel-api-key (password-store-get "open-ai/emacs")
        gptel-curl--common-args
        '("--disable" "--location" "--silent" "--compressed" "-XPOST" "-D-")
        gptel-default-mode 'org-mode)
  :ensure t)
</code></pre>

<p>
  This is enough to start querying <a href="https://openai.com/api/">OpenAI's API</a> from Emacs.
</p>
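<p>
  The transient menu (<code>gptel-menu</code>) is the main entry point for the workflows described below. As a convenience, it can be bound to a key; the binding and the default model here are illustrative choices, not part of the configuration above:
</p>

<pre><code>;; A minimal convenience sketch; adjust the key and the model to taste.
(global-set-key (kbd "C-c g") #'gptel-menu) ; Open the gptel transient menu.
(setq gptel-model 'gpt-4o-mini)             ; Default model for new requests.
</code></pre>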
<p>
  To use Anthropic's API:
</p>

<pre><code>(gptel-make-anthropic "Anthropic"
  :key (password-store-get "anthropic/api/emacs")
  :stream t)
</code></pre>

<p>
  I use OpenRouter, which grants access to models across providers:
</p>

<pre><code>(gptel-make-openai "OpenRouter"
  :endpoint "/api/v1/chat/completions"
  :host "openrouter.ai"
  :key (password-store-get "openrouter.ai/keys/emacs")
  :models '(anthropic/claude-opus-4.1
            anthropic/claude-sonnet-4.5
            anthropic/claude-3.5-sonnet
            cohere/command-a
            deepseek/deepseek-r1-0528
            deepseek/deepseek-v3.1-terminus:exacto
            google/gemini-3-pro-preview
            mistralai/devstral-medium
            mistralai/magistral-medium-2506:thinking
            moonshotai/kimi-k2-0905:exacto
            moonshotai/kimi-k2-thinking
            openai/gpt-5.1
            openai/gpt-5.1-codex
            openai/gpt-5-pro
            perplexity/sonar-deep-research
            qwen/qwen3-max
            qwen/qwen3-vl-235b-a22b-thinking
            qwen/qwen3-coder:exacto
            z-ai/glm-4.6:exacto)
  :stream t)
</code></pre>

<p>
  The choice of model depends on the task and its budget. Even where those two parameters are comparable, it is sometimes useful to switch from one model to another. One may have a blind spot, where another will have insight.
</p>

<p>
  With <code>gptel</code>, it is easy to switch models at any point in a conversation, or to feed the output of one as context for another. For example, I've used <a href="https://www.perplexity.ai/">Perplexity's</a> <a href="https://openrouter.ai/perplexity/sonar-deep-research">Sonar Deep Research</a> to create briefings on a topic, and then used another LLM to summarize findings or answer specific questions, augmented with further web search.
</p>

<h3>Tools</h3>

<p>
  Tools augment a model's perception, memory, or capabilities. The <code>gptel-make-tool</code> function allows one to define tools for use by an LLM.
</p>

<p>
  In making a tool, one can rely on the extensive functionality already offered by Emacs. For example, the <code>read_url</code> tool uses <code><a href="https://www.gnu.org/software/emacs/manual/html_node/url/Retrieving-URLs.html">url-retrieve-synchronously</a></code>, while <code>get_user_name</code> and <code>get_user_email</code> read the <code><a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/User-Identification.html#index-user_002dfull_002dname">user-full-name</a></code> and <code><a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/User-Identification.html#index-user_002dmail_002daddress">user-mail-address</a></code> variables. <code>now</code>, used to retrieve the current date and time, uses <code><a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/Time-Parsing.html#index-format_002dtime_002dstring">format-time-string</a></code>:
</p>

<pre><code>(gptel-make-tool
 :name "now"
 :category "time"
 :function (lambda () (format-time-string "%Y-%m-%d %H:%M:%S %Z"))
 :description "Retrieves the current local date, time, and timezone."
 :include t)
</code></pre>
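<p>
  The user identification tools are little more than wrappers around those variables; a minimal version might look like this:
</p>

<pre><code>(gptel-make-tool
 :name "get_user_name"
 :category "user"
 :function (lambda () user-full-name)
 :description "Retrieve the user's full name."
 :include t)

(gptel-make-tool
 :name "get_user_email"
 :category "user"
 :function (lambda () user-mail-address)
 :description "Retrieve the user's email address."
 :include t)
</code></pre>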
<p>
  Similarly, if Emacs is <a href="https://www.gnu.org/software/emacs/manual/html_node/emacs/Sending-Mail.html">configured to send mail</a>, the tool definition in Emacs Lisp is straightforward:
</p>

<pre><code>(gptel-make-tool
 :name "mail_send"
 :category "mail"
 :confirm t
 :description "Send an email with the user's Emacs mail configuration."
 :function
 (lambda (to subject body)
   (with-temp-buffer
     (insert "To: " to "\n"
             "From: " user-mail-address "\n"
             "Subject: " subject "\n\n"
             body)
     (sendmail-send-it)))
 :args
 '((:name "to"
    :type string
    :description "The recipient's email address.")
   (:name "subject"
    :type string
    :description "The subject of the email.")
   (:name "body"
    :type string
    :description "The body of the email text.")))
</code></pre>

<p>
  For more complex functionality, my preference has been to write shell scripts. There are a few advantages to this approach:
</p>

<ul>
  <li>Tools are easier to develop and debug, since they are easily invoked manually.</li>
  <li>The tool definitions are simpler. For example, my <code>qwen-image</code> script includes a large JSON for the <code>ComfyUI</code> flow. I prefer to leave it outside my Emacs configuration.</li>
  <li>Tools are accessible to LLMs that may not be running in the Emacs environment (agents, one-off scripts).</li>
  <li>Fluency. LLMs seem better at writing bash (or Python, or Go) than Emacs Lisp, so it is easier to lean on this inherent expertise in developing the tools themselves.</li>
</ul>
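<p>
  As a small illustration of the pattern, such a tool can be little more than a single shell command. The sketch below uses the public <a href="https://wttr.in/">wttr.in</a> service to fetch a forecast; the tool name and the service are stand-ins, not the exact tool shown in the video above:
</p>

<pre><code>(gptel-make-tool
 :name "weather"
 :category "web"
 :function
 (lambda (location)
   (shell-command-to-string
    (format "curl -s 'https://wttr.in/%s?format=3'"
            (url-hexify-string location))))
 :description "Retrieve a one-line weather report for a location."
 :args
 '((:name "location"
    :type string
    :description "A city name, airport code, or coordinates.")))
</code></pre>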
<h4>Web Search</h4>

<p>
  For example, for web search, I initially used the tool described in the <a href="https://github.com/karthink/gptel/wiki/Tools-collection">gptel wiki</a>:
</p>

<pre><code>(defvar brave-search-api-key (password-store-get "search.brave.com/api/emacs")
  "API key for accessing the Brave Search API.")

(defun brave-search-query (query)
  "Perform a web search using the Brave Search API with the given QUERY."
  (let ((url-request-method "GET")
        (url-request-extra-headers
         `(("X-Subscription-Token" . ,brave-search-api-key)))
        (url (format "https://api.search.brave.com/res/v1/web/search?q=%s"
                     (url-encode-url query))))
    (with-current-buffer (url-retrieve-synchronously url)
      (goto-char (point-min))
      (when (re-search-forward "^$" nil 'move)
        (let ((json-object-type 'hash-table))
          (json-parse-string
           (buffer-substring-no-properties (point) (point-max))))))))

(gptel-make-tool
 :name "brave_search"
 :category "web"
 :function #'brave-search-query
 :description "Perform a web search using the Brave Search API"
 :args (list '(:name "query"
               :type string
               :description "The search query string")))
</code></pre>

<p>
  However, there are times I want to inspect the search results, so I refactored to use a script:
</p>

<pre><code>#!/usr/bin/env bash

set -euo pipefail

API_URL="https://api.search.brave.com/res/v1/web/search"

check_deps() {
  for cmd in curl jq pass; do
    command -v "${cmd}" >/dev/null || {
      echo "missing: ${cmd}" >&2
      exit 1
    }
  done
}

perform_search() {
  local query="${1}"
  local res

  res=$(curl -s -G \
    -H "X-Subscription-Token: $(pass "search.brave.com/api/emacs")" \
    -H "Accept: application/json" \
    --data-urlencode "q=${query}" \
    "${API_URL}")

  if echo "${res}" | jq -e . >/dev/null 2>&1; then
    echo "${res}"
  else
    echo "error: failed to retrieve valid JSON res: ${res}" >&2
    exit 1
  fi
}

main() {
  check_deps

  if [ $# -eq 0 ]; then
    echo "Usage: ${0} <query>" >&2
    exit 1
  fi

  perform_search "${*}"
}

main "${@}"
</code></pre>

<p>
  The script can be called manually from a shell: <code>brave-search 'quine definition' | jq -C | less</code>.
</p>

<p>
  The tool definition condenses to:
</p>

<pre><code>(gptel-make-tool
 :name "brave_search"
 :category "web"
 :function
 (lambda (query)
   (shell-command-to-string
    (format "brave-search %s"
            (shell-quote-argument query))))
 :description "Perform a web search using the Brave Search API"
 :args
 (list '(:name "query"
         :type string
         :description "The search query string")))
</code></pre>

<h4>Context</h4>

<p>
  One limitation that I have occasionally run into with tools is context overflow — when the data retrieved by the tool exceeds what can fit into the LLM's context.
</p>

<p>
  For example, the <code>man</code> tool makes it possible for an LLM to read <code>man</code> pages. It can help a model correctly recall flags for a command:
</p>

<pre><code>(gptel-make-tool
 :name "man"
 :category "documentation"
 :function
 (lambda (page_name)
   (shell-command-to-string
    (concat "man --pager cat " page_name)))
 :description "Read a Unix manual page."
 :args
 '((:name "page_name"
    :type string
    :description
    "The name of the man page to read. Can optionally include a section number, for example: '2 read' or 'cat(1)'.")))
</code></pre>

<p>
  This tool broke when calling the <a href="https://www.gnu.org/software/units/">GNU units</a> <code>man</code> page, which, on my system, is currently about 40,000 tokens. This was unfortunate, since converting temperatures with <code>units</code> isn't intuitive:
</p>

<pre><code>units 'tempC(100)' tempF
</code></pre>

<p>
  With <code>gptel</code>, one fallback is Emacs' built-in <code>man</code> functionality. The appropriate region can be selected and added as context with <code>-r</code> in the transient menu. In some cases, this is faster than a tool call.
</p>

<p>
  I ran into a similar problem with the <code>read_url</code> tool (also found on the <a href="https://github.com/karthink/gptel/wiki/Tools-collection">gptel wiki</a>). It can break if the response is larger than the context window.
</p>

<pre><code>(gptel-make-tool
 :name "read_url"
 :category "web"
 :function
 (lambda (url)
   (with-current-buffer
       (url-retrieve-synchronously url)
     (goto-char (point-min)) (forward-paragraph)
     (let ((dom (libxml-parse-html-region
                 (point) (point-max))))
       (run-at-time 0 nil #'kill-buffer
                    (current-buffer))
       (with-temp-buffer
         (shr-insert-document dom)
         (buffer-substring-no-properties
          (point-min)
          (point-max))))))
 :description "Fetch and read the contents of a URL"
 :args (list '(:name "url"
               :type string
               :description "The URL to read")))
</code></pre>

<p>
  When I have run into this problem, the issue was bloated functional content — JavaScript code and CSS. If the content is not dynamically generated, one can fall back to Emacs' web browser, <code><a href="https://www.gnu.org/software/emacs/manual/html_mono/eww.html">eww</a></code>. The buffer or selected regions can be added as context. A more sophisticated tool could help in these cases. Otherwise, I hope that LLMs will help steer the web back towards readability, either by acting as an aggregator and filter, or as an evolutionary pressure in favor of static content.
</p>

<h4>Security</h4>

<p>
  The <code><a href="https://github.com/karthink/gptel/wiki/Tools-collection#run_command">run_command</a></code> tool, also described in the <code>gptel</code> tool collection, confers the ability to execute shell commands. Its use requires care. A compromised model could use it to issue malicious commands; alternatively, a poorly formatted command could have unintended consequences. <code>gptel</code> offers the <code>:confirm</code> key to enable inspection and approval of a tool call.
</p>

<pre><code>(gptel-make-tool
 :name "run_command"
 :category "command"
 :confirm t
 :function
 (lambda (command)
   (with-temp-message
       (format "Executing command: =%s=" command)
     (shell-command-to-string command)))
 :description
 "Execute a shell command; returns the output as a string."
 :args
 '((:name "command"
    :type string
    :description "The complete shell command to execute.")))
</code></pre>

<p>
  Inspection limits the ability of the LLM to operate asynchronously, without human intervention. There are a few solutions to this problem, the easiest being to offer tools with more limited scope.
</p>
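<p>
  For illustration, a read-only tool can skip confirmation, because the command it runs is fixed and harmless. The definition below is a sketch of the idea, not one from the wiki:
</p>

<pre><code>(gptel-make-tool
 :name "disk_usage"
 :category "command"
 :function (lambda () (shell-command-to-string "df -h"))
 :description "Report filesystem disk usage via df."
 :include t)
</code></pre>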
<h3>Presets</h3>

<p>
  <code>gptel</code>'s transient menu makes it fast and easy to manage LLM use. With a few keystrokes, it is possible to add, edit, or remove context, switch the model one wants to query, change the input and output, or edit the system message. This can be further accelerated by defining presets with <code>gptel-make-preset</code>.
</p>

<p>
  For example, with <a href="https://huggingface.co/openai/gpt-oss-120b">GPT-OSS 120B</a> (one of OpenAI's <a href="https://openai.com/open-models/">open weights</a> models), I find a system prompt necessary to minimize the use of tables and excessive text styling. A preset loads the appropriate settings:
</p>

<pre><code>(gptel-make-preset 'assistant/gpt
  :description "GPT-OSS general assistant."
  :backend "llama.cpp"
  :model 'gpt
  :include-reasoning nil
  :system
  "You are a large language model queried from Emacs. Your conversation with the user occurs in an org-mode buffer.

- Use org-mode syntax only (no Markdown).
- Use tables ONLY for tabular data with few columns and rows.
- Avoid extended text in table cells. If cells need paragraphs, use a list instead.
- Default to plain paragraphs and simple lists.
- Minimize styling. Use *bold* or /italic/ only where emphasis is essential. Use ~code~ for technical terms.
- If citing facts or resources, output references as org-mode links.
- Use code blocks for calculations or code examples.")
</code></pre>

<p>
  From the transient menu, this preset can be selected with two keystrokes: <code>@</code> and then <code>a</code>.
</p>

<h4>Memory</h4>

<p>
  Presets can be used to implement read-only memory for an LLM. This preset uses <a href="https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking">Qwen3 VL 30B-A3B</a> with a <code>memory.org</code> file automatically included in the context:
</p>

<pre><code>(gptel-make-preset 'assistant/qwen
  :description "Qwen Emacs assistant."
  :backend "llama.cpp"
  :model 'qwen3_vl_30b-a3b
  :context '("~/memory.org"))
</code></pre>

<p>
  One could grant LLMs the ability to append to <code>memory.org</code> with a tool, though I am skeptical that they would use it judiciously.
</p>
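<p>
  Such a tool would be short to write. The definition below is a sketch; the name and note format are arbitrary, and <code>:confirm</code> keeps a human in the loop:
</p>

<pre><code>(gptel-make-tool
 :name "remember"
 :category "memory"
 :confirm t
 :function
 (lambda (note)
   (append-to-file (concat "\n- " note) nil "~/memory.org")
   "Saved.")
 :description "Append a short note to the user's memory.org file."
 :args
 '((:name "note"
    :type string
    :description "The note to remember, in one or two sentences.")))
</code></pre>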
<h2>Local LLMs</h2>

<p>
  Running LLMs on one's own devices offers some advantages over third-party providers:
</p>

<ul>
  <li>Redundancy: they work offline, even if providers are experiencing an outage.</li>
  <li>Privacy: queries and data remain on the device.</li>
  <li>Control: you know exactly which model is running, with what settings, at what quantization.</li>
</ul>

<p>
  The main trade-off is intelligence, though for many purposes, the gap is closing fast. I've found local models can summarize or transform data effectively, help with language translation and learning, extract data from images and PDFs, and perform simple research tasks. I rely on hosted models primarily for complex coding tasks, or whenever a larger effective context is required.
</p>

<h3>llama.cpp</h3>

<p>
  <a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a> makes it easy to run models locally:
</p>

<pre><code>git clone https://github.com/ggml-org/llama.cpp.git

cd llama.cpp

cmake -B build

cmake --build build --config Release

mv build/bin/llama-server ~/.local/bin/ # Or elsewhere in PATH.

llama-server -hf unsloth/Qwen3-4B-GGUF:q8_0
</code></pre>

<p>
  This will build <code>llama.cpp</code> with support for CPU-based inference, move <code>llama-server</code> into <code>~/.local/bin/</code>, and then download and run <a href="https://unsloth.ai/">Unsloth</a>'s <code>Q8</code> quantization of <a href="https://huggingface.co/Qwen/Qwen3-4B">Qwen3 4B</a>. The <code>llama.cpp</code> <a href="https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md">documentation</a> explains how to build for GPUs and other hardware — not much more work than the default build.
</p>

<h3>Weights</h3>

<p>
  Part of the art of using LLMs is finding the appropriate model to use. Some factors to consider are available hardware, intended use (task, language), and pricing (money paid for input and output). Some models, like <a href="https://ai.google.dev/gemma/docs/core">Gemma3</a> and <a href="https://github.com/QwenLM/Qwen3-VL">Qwen3-VL</a>, offer multimodal support, and can parse images. Others, like Google's <a href="https://deepmind.google/models/gemma/medgemma/">Medgemma</a>, specialize in specific subject areas, or, like <a href="https://mistral.ai/">Mistral</a>'s <a href="https://mistral.ai/news/devstral">Devstral</a>, in agentic use.
</p>

<p>
  For local use, hardware tends to be the main limiter. One has to fit the model into available memory and consider the acceptable speed for one's use case. A rough guideline is to use the smallest model and quantization that can handle the task at hand. From the opposite direction, one can look for the largest model that fits into available memory. The rule of thumb is that a <code>Q8_0</code> quantization uses about as much memory as there are parameters, so an 8 billion parameter model will use about 8 GB of RAM or VRAM. A <code>Q4_0</code> quant would use half that — 4 GB — while at 16-bit, it would use 16 GB.
</p>
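<p>
  As a quick check of that arithmetic from the <code>*scratch*</code> buffer, for a hypothetical 27B-parameter model (the figures ignore the KV cache and other runtime overhead):
</p>

<pre><code>;; Rough memory estimates for a 27B-parameter model, in GB.
(let ((params 27e9))
  (list :q4_0 (/ (* params 0.5) 1e9)   ; ~13.5 GB
        :q8_0 (/ (* params 1.0) 1e9)   ; ~27 GB
        :f16  (/ (* params 2.0) 1e9))) ; ~54 GB
</code></pre>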
<p>
  My workstation, laptop, and mobile (<code>llama.cpp</code> can be built and run in <code><a href="https://termux.dev/en/">termux</a></code>) all run different classes of weights. On my mobile device, I have about 12GB of RAM, but background utilization is already around 8GB. So, when necessary, I use 4B models at <code>Q8_0</code>: Gemma3, Qwen3-VL, and Medgemma. If a laptop has 16GB of RAM with 2GB in use, 8B models might run well enough. The workstation, which has a GPU, can run larger models, faster. There are other tricks one can use — <a href="https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention">flash attention</a>, <a href="https://research.google/blog/looking-back-at-speculative-decoding/">speculative decoding</a>, MoE offloading — to optimize model performance across different hardware configurations.
</p>

<h3>llama-swap</h3>

<p>
  One current limitation of <code>llama.cpp</code> is that, unless you load multiple models at once, switching models requires manually starting a new instance of <code>llama-server</code>. To swap models on demand, <code><a href="https://github.com/mostlygeek/llama-swap">llama-swap</a></code> can be used.
</p>

<p>
  <code>llama-swap</code> uses a YAML configuration file, which is <a href="https://github.com/mostlygeek/llama-swap/wiki/Configuration">well documented</a>. I use something like the following:
</p>

<pre><code>logLevel: debug

macros:
  "models": "/home/llama-swap/models"

models:
  gemma3:
    cmd: |
      llama-server
      --ctx-size 0
      --gpu-layers 888
      --jinja
      --min-p 0.0
      --model ${models}/gemma-3-27b-it-ud-q8_k_xl.gguf
      --mmproj ${models}/mmproj-gemma3-27b-bf16.gguf
      --port ${PORT}
      --repeat-penalty 1.0
      --temp 1.0
      --top-k 64
      --top-p 0.95
    ttl: 900
    name: "gemma3_27b"
  gpt:
    cmd: |
      llama-server
      --chat-template-kwargs '{"reasoning_effort": "high"}'
      --ctx-size 0
      --gpu-layers 888
      --jinja
      --model ${models}/gpt-oss-120b-f16.gguf
      --port ${PORT}
      --temp 1.0
      --top-k 0
      --top-p 1.0
    ttl: 900
    name: "gpt-oss_120b"
  qwen3_vl_30b-a3b:
    cmd: |
      llama-server
      --ctx-size 131072
      --gpu-layers 888
      --jinja
      --min-p 0
      --model ${models}/qwen3-vl-30b-a3b-thinking-ud-q8_k_xl.gguf
      --mmproj ${models}/mmproj-qwen3-vl-30ba3b-bf16.gguf
      --port ${PORT}
      --temp 0.6
      --top-k 20
      --top-p 0.95
    ttl: 900
    name: "qwen3_vl_30b-a3b-thinking"
</code></pre>
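<p>
  Once <code>llama-swap</code> is running, the configured models can be listed from Emacs through its OpenAI-compatible endpoint. This is a quick sketch; it assumes <code>llama-swap</code> is listening on port 11434, as in the proxy configuration below:
</p>

<pre><code>;; List the models that llama-swap advertises.
(with-current-buffer
    (url-retrieve-synchronously "http://localhost:11434/v1/models")
  (goto-char (point-min))
  (re-search-forward "^$")                ; Skip the response headers.
  (json-parse-string
   (buffer-substring-no-properties (point) (point-max))))
</code></pre>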
<h3>nginx</h3>

<p>
  Since my workstation has a GPU and can be accessed via <a href="https://www.wireguard.com/">WireGuard</a> from other devices, it is set up to serve models. For HTTPS, I use <code><a href="https://certbot.eff.org/">certbot</a></code> with an <code><a href="https://nginx.org/">nginx</a></code> reverse proxy, running in front of <code>llama-swap</code>. With <code>nginx</code>, some settings are important for streaming responses from LLMs, namely <code>proxy_buffering off;</code> and <code>proxy_cache off;</code>.
</p>

<pre><code>user http;
worker_processes 1;
worker_cpu_affinity auto;

events {
    worker_connections 1024;
}

http {
    charset utf-8;
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    server_tokens off;
    types_hash_max_size 4096;
    client_max_body_size 32M;

    # MIME
    include mime.types;
    default_type application/octet-stream;

    # logging
    access_log /var/log/nginx/access.log;
    error_log /var/log/nginx/error.log warn;

    include /etc/nginx/conf.d/*.conf;
}
</code></pre>

<p>Then, for <code>/etc/nginx/conf.d/llama-swap.conf</code>:</p>

<pre><code>server {
    listen 80;
    server_name llm.dwrz.net;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl;
    http2 on;
    server_name llm.dwrz.net;

    ssl_certificate /etc/letsencrypt/live/llm.dwrz.net/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.dwrz.net/privkey.pem;

    location / {
        proxy_buffering off;
        proxy_cache off;
        proxy_pass http://localhost:11434;
        proxy_read_timeout 3600s;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
</code></pre>

<h3>Emacs Configuration</h3>

<p>
  <code>llama-server</code> offers an API compatible with the <a href="https://platform.openai.com/docs/api-reference/introduction">OpenAI API</a>. <code>gptel</code> can be configured to use local models with something like the following:
</p>

<pre><code>(gptel-make-openai "llama.cpp"
  :stream t
  :protocol "http"
  :host "localhost"
  :models
  '((gemma3
     :capabilities (media tool json url)
     :mime-types ("image/jpeg"
                  "image/png"
                  "image/gif"
                  "image/webp"))
    gpt
    (medgemma_27b
     :capabilities (media tool json url)
     :mime-types ("image/jpeg"
                  "image/png"
                  "image/gif"
                  "image/webp"))
    (qwen3_vl_30b-a3b
     :capabilities (media tool json url)
     :mime-types ("image/jpeg"
                  "image/png"
                  "image/gif"
                  "image/webp"))
    (qwen3_vl_32b
     :capabilities (media tool json url)
     :mime-types ("image/jpeg"
                  "image/png"
                  "image/gif"
                  "image/webp"))))
</code></pre>

<h2>Techniques</h2>

<p>
  There are a variety of ways I use Emacs with LLMs:
</p>

<h3>Simple Q&A</h3>

<p>
  With the <code>gptel</code> transient menu, press <code>m</code> to prompt from the minibuffer, and <code>e</code> to output the answer to the echo area, then <code>Enter</code> to input the prompt.
</p>

<h3>Brief Conversations</h3>

<p>
  For brief multi-turn conversations that require no persistence, <code>gptel</code> can be used in the <code>*scratch*</code> buffer. I usually pair this with setting any context via the transient menu, with <code>-b</code>, <code>-f</code>, or <code>-r</code> as necessary.
</p>

<h3>Image-to-Text</h3>

<p>
  With multimodal LLMs like Gemma3 and Qwen3-VL, one can extract text and tables from images.
</p>

<h3>Research</h3>

<p>
  If I know I will need to reference a topic later, I usually start out with an <code><a href="https://orgmode.org/">org-mode</a></code> file. In this case, I tend to use links to construct context.
</p>

<h3>Translation</h3>

<p>
  For small or unimportant text, Google Translate via the command line with <code><a href="https://github.com/soimort/translate-shell">translate-shell</a></code> works well enough. Otherwise, I find the translation output from local LLMs is typically better and more aware of the context.
</p>
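<p>
  A preset makes switching into this mode a matter of a few keystrokes. The sketch below is illustrative; the backend and model come from the local setup above:
</p>

<pre><code>(gptel-make-preset 'translate
  :description "Translate text into English."
  :backend "llama.cpp"
  :model 'gemma3
  :system
  "Translate the user's text into English. Preserve tone and register. Output only the translation.")
</code></pre>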
<h3>Code</h3>

<p>
  My experience writing code with LLMs has been mixed. For scripts and small programs, iterating in a single conversation can work well enough. However, at work, with a larger codebase, only a few models have been able to contribute meaningfully. Hosted models have worked well at some points, but lately the quality has dropped off significantly, which I imagine is due to excessive quantization. I believe the quantization of the model should be clearly labeled and priced accordingly. Since that is not the case, I have come to distrust the initial output from any model.
</p>

<p>
  So far, I have also had limited success with agents, which, in some cases, cost more than I would care to spend. I find agents waste too many resources understanding the context, and even then, often fail to capture important nuance. This is one reason I have not yet added tool support for reading or modifying files and directories.
</p>

<p>
  Instead, I have found the middle ground to be manually providing context in project- or task-specific files, using <code>org-mode</code> links. I ask the LLMs to walk me through code changes, which I then review and implement by hand. In some cases, the output is good enough that it saves time. In others, it still ends up being faster to implement on my own, and in a few cases, I wish I had never bothered.
</p>

<h2>Conclusion</h2>

<p>
  I first started using Emacs as my text editor 20 years ago. For over ten years now, I have used it on a daily basis — for writing and coding, email, managing finances and tasks, as my calculator, and for interacting with both local and remote hosts. I continue as a student of this software, discovering new functionality and techniques. I have been surprised by how well this 50-year-old software has adapted to the frontier of technology. Despite flaws and limitations, its fundamental design has ensured its endurance.
</p>

<p>
  The barrier to entry for Emacs is high. For everyday users, the power and flexibility that <code>gptel</code> offers could be unlocked by tools that support:
</p>

<ul>
  <li>Notebooks with support for code and other blocks</li>
  <li>Links for local and remote content</li>
  <li>Referencing conversations</li>
  <li>Switching models and providers, including local models</li>
  <li>Mail and task management integration</li>
  <li>Offline operation — Emacs will work with local models even offline.</li>
  <li>Remote operation — Emacs can be accessed remotely via SSH or TRAMP.</li>
</ul>

<p>
  There are many areas of concern and discussion around LLMs. From my work with them so far, I am more anxious about some than others. Local LLMs reveal the computation and energy requirements for inference alone. On the other hand, barring the <a href="https://en.wikipedia.org/wiki/Eichmann_in_Jerusalem">banality of evil</a>, the limitations I see on a daily basis make me skeptical that we are anywhere close to seeing super-intelligent machines.
  Perhaps in the way that calculators are better than humans at arithmetic, we will see LLMs have areas of comparative strength — and weakness.
</p>

<p>
  My interest is primarily in the potential utility of the technology. It has already proven its ability to understand natural language, take the drudgery out of some work, and serve as a useful second set of eyes. The question for me is the magnitude of the impact. Which of the more advanced tasks will we be able to hand off to LLMs, (a) in a trusted manner, and (b) in such a way that building the required guardrails and scaffolding, and managing their operation, will take less time than doing the task oneself?
</p>