src

Go monorepo.
git clone git://code.dwrz.net/src

commit 247830430a37871a15fbc53f675e9cc4ef442d42
parent 788a3988bceb85f26b4bd0ca98d24c51d1000e2d
Author: dwrz <dwrz@dwrz.net>
Date:   Fri, 28 Nov 2025 20:23:00 +0000

Update Emacs LLM entry

Diffstat:
Mcmd/web/site/entry/static/2025-12-01/2025-12-01.html | 748++++++++++++++++++++++++++++++++++++++++---------------------------------------
Mcmd/web/site/entry/static/2025-12-01/metadata.json | 4++--
2 files changed, 378 insertions(+), 374 deletions(-)

diff --git a/cmd/web/site/entry/static/2025-12-01/2025-12-01.html b/cmd/web/site/entry/static/2025-12-01/2025-12-01.html @@ -1,43 +1,43 @@ -<p> - This video shows a <a href="https://en.wikipedia.org/wiki/Large_language_model">large language model</a> (LLM), running on my workstation, using <a href="https://www.gnu.org/software/emacs/">Emacs</a> to determine my location, retrieve weather data, and email me the results: -</p> - +<div class="wide64"> + <p> + This video shows a <a href="https://en.wikipedia.org/wiki/Large_language_model">large language model</a> (LLM), running on my workstation, using <a href="https://www.gnu.org/software/emacs/">Emacs</a> to determine my location, retrieve weather data, and email me the results: + </p> +</div> <video autoplay loop muted disablepictureinpicture - class="video" src="/static/media/llm.mp4" + class="video video-wide" src="/static/media/llm.mp4" type="video/mp4"> Your browser does not support video. </video> +<div class="wide64"> + <p> + With <a href="https://karthinks.com">karthink</a>'s <a href="https://github.com/karthink/gptel">gptel</a> package and some custom code, Emacs is capable of: + </p> + + <ul> + <li>Querying models from hosted providers (<a href="https://www.anthropic.com/">Anthropic</a>, <a href="https://openai.com/">OpenAI</a>, <a href="https://openrouter.ai/">OpenRouter</a>), or local models (<a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a>, <a href="https://ollama.com/">ollama</a>).</li> + <li>Switching rapidly between models and configurations, with only a few keystrokes.</li> + <li>Saving conversations to the local filesystem, and using them as context for other conversations.</li> + <li>Including files, buffers, and terminals as context for queries.</li> + <li>Searching the web and reading web pages.</li> + <li>Searching, reading, and sending email.</li> + <li>Consulting agendas, projects, and tasks.</li> + <li>Executing Emacs Lisp code and shell commands.</li> + <li>Generating images via the <a href="https://www.comfy.org/">ComfyUI</a> API.</li> + <li>Geolocating the device and checking the current date and time.</li> + <li>Reading <a href="https://en.wikipedia.org/wiki/Man_page">man</a> pages.</li> + <li>Retrieving the user's name and email.</li> + </ul> + <p> + Because LLMs understand and write <a href="https://en.wikipedia.org/wiki/Emacs_Lisp">Emacs Lisp</a> code, they can extend their own capabilities; the improvements are recursive. Below, I note some of the setup required to enable this functionality. 
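  <p>
    As one illustration of that recursion, the Emacs Lisp execution listed above can itself be exposed as a tool. The definition below is an illustrative sketch, not the configuration described later in this entry; the eval_elisp name and its behavior are assumptions:
  </p>
  <pre><code>;; Illustrative sketch — not from the original configuration.
(gptel-make-tool
 :name "eval_elisp"
 :category "emacs"
 :confirm t ; always review model-written code before it is evaluated
 :function (lambda (code)
             ;; Read a single expression from the string, evaluate it,
             ;; and return the printed result to the model.
             (format "%S" (eval (car (read-from-string code)) t)))
 :description "Evaluate a single Emacs Lisp expression and return the result."
 :args (list '(:name "code"
               :type string
               :description "One Emacs Lisp expression to evaluate.")))
  </code></pre>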
+ </p> +</div> -<p> - With <a href="https://karthinks.com">karthink</a>'s <a href="https://github.com/karthink/gptel">gptel</a> package and some custom code, Emacs is capable of: -</p> - -<ul> - <li>Querying models from hosted providers (<a href="https://www.anthropic.com/">Anthropic</a>, <a href="https://openai.com/">OpenAI</a>, <a href="https://openrouter.ai/">OpenRouter</a>), or local models (<a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a>, <a href="https://ollama.com/">ollama</a>).</li> - <li>Switching rapidly between models and configurations, with only a few keystrokes.</li> - <li>Saving conversations to the local filesystem, and using them as context for other conversations.</li> - <li>Including files, buffers, and terminals as context for queries.</li> - <li>Searching the web and reading web pages.</li> - <li>Searching, reading, and sending email.</li> - <li>Consulting agendas, projects, and tasks.</li> - <li>Executing Emacs Lisp code and shell commands.</li> - <li>Generating images via the <a href="https://www.comfy.org/">ComfyUI</a> API.</li> - <li>Geolocating the device and checking the current date and time.</li> - <li>Reading <a href="https://en.wikipedia.org/wiki/Man_page">man</a> pages.</li> - <li>Retrieving the user's name and email.</li> -</ul> - -<p> - Because LLMs understand and write <a href="https://en.wikipedia.org/wiki/Emacs_Lisp">Emacs Lisp</a> code, they can help extend their own capabilities; the improvements are recursive. Below, I note some of the setup required to enable this functionality. -</p> - -<h2>Emacs</h2> - -<p> - With <code><a href="https://www.gnu.org/software/emacs/manual/html_node/use-package/">use-package</a></code>, <a href="https://melpa.org/">MELPA</a>, and <a href="https://www.passwordstore.org/">pass</a> for password management, a minimal configuration for <code>gptel</code> looks like this: -</p> - -<pre><code>(use-package gptel +<div class="wide64"> + <h2>Emacs</h2> + <p> + With <code><a href="https://www.gnu.org/software/emacs/manual/html_node/use-package/">use-package</a></code>, <a href="https://melpa.org/">MELPA</a>, and <a href="https://www.passwordstore.org/">pass</a> for password management, a minimal configuration for <code>gptel</code> looks like this: + </p> + <pre><code>(use-package gptel :commands (gptel gtpel-send gptel-send-region gptel-send-buffer) :config (setq gptel-api-key (password-store-get "open-ai/emacs") @@ -45,26 +45,21 @@ '("--disable" "--location" "--silent" "--compressed" "-XPOST" "-D-") gptel-default-mode 'org-mode) :ensure t) -</code></pre> - -<p> - This is enough to start querying <a href="https://openai.com/api/">OpenAI's API</a> from Emacs. -</p> - -<p> - To use Anthropic's API: -</p> - -<pre><code>(gptel-make-anthropic "Anthropic" + </code></pre> + <p> + This is enough to start querying <a href="https://openai.com/api/">OpenAI's API</a> from Emacs. 
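  <p>
    For quick access, gptel's entry points can also be bound to keys. The bindings below are illustrative choices, not gptel defaults:
  </p>
  <pre><code>;; A sketch: pick any free keys.
(global-set-key (kbd "C-c g") #'gptel)        ; open a dedicated chat buffer
(global-set-key (kbd "C-c RET") #'gptel-send) ; send the region, or the buffer up to point
  </code></pre>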
+ </p> + <p> + To use Anthropic's API: + </p> + <pre><code>(gptel-make-anthropic "Anthropic" :key (password-store-get "anthropic/api/emacs") :stream t) -</code></pre> - -<p> - I prefer OpenRouter, to access models across providers: -</p> - -<pre><code>(gptel-make-openai "OpenRouter" + </code></pre> + <p> + I prefer OpenRouter, to access models across providers: + </p> + <pre><code>(gptel-make-openai "OpenRouter" :endpoint "/api/v1/chat/completions" :host "openrouter.ai" :key (password-store-get "openrouter.ai/keys/emacs") @@ -88,39 +83,34 @@ qwen/qwen3-coder:exacto z-ai/glm-4.6:exacto) :stream t) -</code></pre> - -<p> - The choice of model depends on the task and its budget. Even where those two parameters are comparable, it is sometimes useful to switch models. One may have a blind spot, where another will have insight. -</p> - -<p> - With <code>gptel</code>, it is easy to switch models mid-conversation, or use the output from one model as context for another. For example, I've used <a href="https://www.perplexity.ai/">Perplexity's</a> <a href="https://openrouter.ai/perplexity/sonar-deep-research">Sonar Deep Research</a> to create briefings, then used another LLM to summarize findings or answer specific questions, augmented with web search. -</p> - -<h3>Tools</h3> - -<p> - Tools augment a model's perception, memory, or capabilities. The <code>gptel-make-tool</code> function allows one to define tools for use by an LLM. -</p> - -<p> - When making tools, one can leverage Emacs' existing functionality. For example, the <code>read_url</code> tool uses <code><a href=" https://www.gnu.org/software/emacs/manual/html_node/url/Retrieving-URLs.html">url-retrieve-synchronously</a></code>, while <code>get_user_name</code> and <code>get_user_email</code> read <code><a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/User-Identification.html#index-user_002dfull_002dname">user-full-name</a></code> and <code><a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/User-Identification.html#index-user_002dmail_002daddress">user-mail-address</a></code>. <code>now</code>, used to retrieve the current date and time, uses <code><a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/Time-Parsing.html#index-format_002dtime_002dstring">format_time_string</a></code>: -</p> + </code></pre> + <p> + The choice of model depends on the task and its budget. Even where those two parameters are comparable, it is sometimes useful to switch models. One may have a blind spot, where another will have insight. + </p> + <p> + With <code>gptel</code>, it is easy to switch models mid-conversation, or use the output from one model as context for another. For example, I've used <a href="https://www.perplexity.ai/">Perplexity's</a> <a href="https://openrouter.ai/perplexity/sonar-deep-research">Sonar Deep Research</a> to create briefings, then used another LLM to summarize findings or answer specific questions, augmented with web search. + </p> +</div> -<pre><code>(gptel-make-tool +<div class="wide64"> + <h3>Tools</h3> + <p> + Tools augment a model's perception, memory, or capabilities. The <code>gptel-make-tool</code> function allows one to define tools for use by an LLM. + </p> + <p> + When making tools, one can leverage Emacs' existing functionality. 
For example, the <code>read_url</code> tool uses <code><a href=" https://www.gnu.org/software/emacs/manual/html_node/url/Retrieving-URLs.html">url-retrieve-synchronously</a></code>, while <code>get_user_name</code> and <code>get_user_email</code> read <code><a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/User-Identification.html#index-user_002dfull_002dname">user-full-name</a></code> and <code><a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/User-Identification.html#index-user_002dmail_002daddress">user-mail-address</a></code>. <code>now</code>, used to retrieve the current date and time, uses <code><a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/Time-Parsing.html#index-format_002dtime_002dstring">format_time_string</a></code>: + </p> + <pre><code>(gptel-make-tool :name "now" :category "time" :function (lambda () (format-time-string "%Y-%m-%d %H:%M:%S %Z")) :description "Retrieves the current local date, time, and timezone." :include t) -</code></pre> - -<p> - Similarly, if Emacs is <a href="https://www.gnu.org/software/emacs/manual/html_node/emacs/Sending-Mail.html">configured to send mail</a>, the tool definition is straightforward: -</p> - -<pre><code>(gptel-make-tool + </code></pre> + <p> + Similarly, if Emacs is <a href="https://www.gnu.org/software/emacs/manual/html_node/emacs/Sending-Mail.html">configured to send mail</a>, the tool definition is straightforward: + </p> + <pre><code>(gptel-make-tool :name "mail_send" :category "mail" :confirm t @@ -143,29 +133,27 @@ (:name "body" :type string :description "The body of the email text."))) -</code></pre> - -<p> - For more complex functionality, I prefer writing shell scripts, for several reasons: - <ul> - <li>The tool definitions are simpler. For example, my <code>qwen-image</code> script includes a large JSON for the ComfyUI flow. I prefer to leave it outside my Emacs configuration.</li> - <li>Tools are accessible to LLMs that may not be running in the Emacs environment (agents, one-off scripts).</li> - <li>Fluency. LLMs seem better at writing bash (or Python, or Go) than Emacs Lisp, so it easier to lean on this inherent expertise in developing the tools themselves.</li> - </ul> -</p> - -<img class="img-center" src="/static/media/drawing-hands.jpg"> -<div class="caption"> - <p>M.C. Escher, <i>Drawing Hands</i> (1948)</p> + </code></pre> + <p> + For more complex functionality, I prefer writing shell scripts, for several reasons: + <ul> + <li>The tool definitions are simpler. For example, my <code>qwen-image</code> script includes a large JSON for the ComfyUI flow. I prefer to leave it outside my Emacs configuration.</li> + <li>Tools are accessible to LLMs that may not be running in the Emacs environment (agents, one-off scripts).</li> + <li>Fluency. LLMs seem better at writing bash (or Python, or Go) than Emacs Lisp, so it easier to lean on this inherent expertise in developing the tools themselves.</li> + </ul> + </p> + <img class="img-center" src="/static/media/drawing-hands.jpg"> + <div class="caption"> + <p>M.C. 
Escher, <i>Drawing Hands</i> (1948)</p> + </div> </div> -<h4>Web Search</h4> - -<p> - For example, for web search, I initially used the tool described in the <code>gptel</code> <a href="https://github.com/karthink/gptel/wiki/Tools-collection">wiki</a>: -</p> - -<pre><code>(defvar brave-search-api-key (password-store-get "search.brave.com/api/emacs") +<div class="wide64"> + <h4>Web Search</h4> + <p> + For example, for web search, I initially used the tool described in the <code>gptel</code> <a href="https://github.com/karthink/gptel/wiki/Tools-collection">wiki</a>: + </p> + <pre><code>(defvar brave-search-api-key (password-store-get "search.brave.com/api/emacs") "API key for accessing the Brave Search API.") (defun brave-search-query (query) @@ -190,13 +178,11 @@ :args (list '(:name "query" :type string :description "The search query string"))) -</code></pre> - -<p> - However, there are times I want to inspect the search results. I use this script: -</p> - -<pre><code>#!/usr/bin/env bash + </code></pre> + <p> + However, there are times I want to inspect the search results. I use this script: + </p> + <pre><code>#!/usr/bin/env bash set -euo pipefail @@ -233,24 +219,21 @@ main() { if [ $# -eq 0 ]; then echo "Usage: ${0} <query>" >&2 - exit 1 - fi - - perform_search "${*}" - } - - main "${@}" -</code></pre> - -<p> - Which can be called manually from a shell: <code>brave-search 'quine definition' | jq -C | less</code>. -</p> + exit 1 + fi -<p> - The tool definition condenses to: -</p> + perform_search "${*}" + } -<pre><code>(gptel-make-tool + main "${@}" + </code></pre> + <p> + Which can be called manually from a shell: <code>brave-search 'quine definition' | jq -C | less</code>. + </p> + <p> + The tool definition condenses to: + </p> + <pre><code>(gptel-make-tool :name "brave_search" :category "web" :function @@ -263,19 +246,17 @@ main() { (list '(:name "query" :type string :description "The search query string"))) -</code></pre> - -<h4>Context</h4> - -<p> - One limitation that I have run into with tools is context overflow — when retrieved data exceeds an LLM's context window. -</p> - -<p> - For example, this tool lets an LLM read <code>man</code> pages, helping it correctly recall command flags: -</p> - -<pre><code>(gptel-make-tool + </code></pre> +</div> +<div class="wide64"> + <h4>Context</h4> + <p> + One limitation that I have run into with tools is context overflow — when retrieved data exceeds an LLM's context window. + </p> + <p> + For example, this tool lets an LLM read <code>man</code> pages, helping it correctly recall command flags: + </p> + <pre><code>(gptel-make-tool :name "man" :category "documentation" :function @@ -288,30 +269,28 @@ main() { :type string :description "The name of the man page to read. Can optionally include a section number, for example: '2 read' or 'cat(1)'."))) -</code></pre> + </code></pre> -<p> - It broke when calling the <a href="https://www.gnu.org/software/units/">GNU units</a> <code>man</code> page, which exceeds 40,000 tokens on my system. This was unfortunate, since some coversions, like temperature, are unintuitive: -</p> - -<pre><code>units 'tempC(100)' tempF -</code></pre> - -<p> - With <code>gptel</code>, one fallback is Emacs' built in <code>man</code> functionality. The appropriate region can be selected with <code>-r</code> in the transient menu. In some cases, this is faster than a tool call. 
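  <p>
    Another workaround, sketched here rather than taken from the configuration above, is to cap what the tool may return, so that an oversized page is truncated instead of overflowing the context window. The man_head name and the 20,000-character budget are illustrative:
  </p>
  <pre><code>;; Illustrative sketch — not from the original configuration.
(gptel-make-tool
 :name "man_head"
 :category "documentation"
 :function (lambda (name)
             (let ((page (shell-command-to-string
                          (format "man %s | col -b" (shell-quote-argument name)))))
               ;; Keep only the first 20,000 characters so a huge page,
               ;; like units(1), cannot exhaust the context window.
               (if (> (length page) 20000)
                   (concat (substring page 0 20000) "\n[truncated]")
                 page)))
 :description "Read the start of a man page, truncated to a fixed size."
 :args (list '(:name "name"
               :type string
               :description "The name of the man page to read.")))
  </code></pre>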
-</p> + <p> + It broke when calling the <a href="https://www.gnu.org/software/units/">GNU units</a> <code>man</code> page, which exceeds 40,000 tokens on my system. This was unfortunate, since some conversions, like temperature, are unintuitive: + </p> + <pre><code>units 'tempC(100)' tempF + </code></pre> + <p> + With <code>gptel</code>, one fallback is Emacs' built-in <code>man</code> functionality. The appropriate region can be selected with <code>-r</code> in the transient menu. In some cases, this is faster than a tool call. + </p> +</div> <video autoplay loop muted disablepictureinpicture class="video" src="/static/media/llm-temp.mp4" type="video/mp4"> Your browser does not support video. </video> - -<p> - I ran into a similar problem with the <code>read_url</code> tool (also found on <a href="https://github.com/karthink/gptel/wiki/Tools-collection">gptel wiki</a>). It can break if the response is larger than the context window. -</p> - -<pre><code>(gptel-make-tool +<div class="wide64"> + <p> + I ran into a similar problem with the <code>read_url</code> tool (also found on the <a href="https://github.com/karthink/gptel/wiki/Tools-collection">gptel wiki</a>). It can break if the response is larger than the context window. + </p> + <pre><code>(gptel-make-tool :name "read_url" :category "web" :function @@ -332,19 +311,19 @@ main() { :args (list '(:name "url" :type string :description "The URL to read"))) -</code></pre> - -<p> - When I have run into this problem, the issue was bloated functional content — JavaScript code CSS. If the content is not dynamically generated, one call fallback to Emacs' web browser, <code><a href="https://www.gnu.org/software/emacs/manual/html_mono/eww.html">eww</a></code>. The buffer or selected regions can be added as context. A more sophisticated tool could help in these cases. Long term, I hope that LLMs will steer the web back towards readability, either by acting as an aggregator and filter, or as evolutionary pressure in favor of static content. -</p> - -<h4>Security</h4> + </code></pre> + <p> + When I have run into this problem, the issue was bloated functional content — JavaScript code and CSS. If the content is not dynamically generated, one can fall back to Emacs' web browser, <code><a href="https://www.gnu.org/software/emacs/manual/html_mono/eww.html">eww</a></code>. The buffer or selected regions can be added as context. A more sophisticated tool could help in these cases. Long term, I hope that LLMs will steer the web back towards readability, either by acting as an aggregator and filter, or as evolutionary pressure in favor of static content. + </p> +</div> -<p> - The <code><a href="https://github.com/karthink/gptel/wiki/Tools-collection#run_command">run_command</a></code> tool, also found in the <code>gptel</code> tool collection, enables shell command execution, requires careful consideration. A compromised model could issue malicious commands, or a poorly formatted command could have unintended consequences. <code>gptel</code>'s <code>:confirm</code> key can be used to inspect and approve tool calls. -</p> +<div class="wide64"> + <h4>Security</h4> + <p> + The <code><a href="https://github.com/karthink/gptel/wiki/Tools-collection#run_command">run_command</a></code> tool, also found in the <code>gptel</code> tool collection, enables shell command execution, and requires careful consideration. A compromised model could issue malicious commands, or a poorly prepared command could have unintended consequences.
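  <p>
    One way to reduce the risk is to expose narrower tools instead of arbitrary shell access. The sketch below is illustrative rather than part of the configuration described in this entry; it wraps a single fixed command (du), so the model supplies only an argument, never the program:
  </p>
  <pre><code>;; Illustrative sketch — not from the original configuration.
(gptel-make-tool
 :name "disk_usage"
 :category "command"
 :function (lambda (path)
             ;; Only du is ever run; the path is quoted, and the model
             ;; cannot substitute another program or add flags.
             (shell-command-to-string
              (format "du -sh %s"
                      (shell-quote-argument (expand-file-name path)))))
 :description "Report the disk usage of a single file or directory."
 :args (list '(:name "path"
               :type string
               :description "The file or directory to measure.")))
  </code></pre>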
<code>gptel</code>'s <code>:confirm</code> key can be used to inspect and approve tool calls. + </p> -<pre><code>(gptel-make-tool + <pre><code>(gptel-make-tool :name "run_command" :category "command" :confirm t @@ -359,11 +338,12 @@ main() { '((:name "command" :type string :description "The complete shell command to execute."))) -</code></pre> + </code></pre> -<p> - Inspection limits the LLM's ability to operate asynchronously, without human intervention. There are a few solutions to this problem, the easiest being to offer tools with more limited scope. -</p> + <p> + Inspection limits the LLM's ability to operate asynchronously, without human intervention. There are a few solutions to this problem, the easiest being to offer tools with more limited scope. + </p> +</div> <video autoplay loop muted disablepictureinpicture class="video" src="/static/media/llm-inspect.mp4" @@ -371,17 +351,15 @@ main() { Your browser does not support video. </video> -<h3>Presets</h3> - -<p> - With <code>gptel</code>'s transient menu, only a few keystrokes are need to add, edit, or remove context, switch the model one wants to query, change the input and output, or edit the system message. Presets accelerate switching between settings, and are defined with <code>gptel-make-preset</code>. -</p> - -<p> - For example, with <a href="https://huggingface.co/openai/gpt-oss-120b">GPT-OSS 120B</a> (one of OpenAI's <a href="https://openai.com/open-models/">open weights</a> models), a system prompt is necessary to minimize the use of tables and excessive text styling. A preset can load the appropriate settings: -</p> - -<pre><code>(gptel-make-preset 'assistant/gpt +<div class="wide64"> + <h3>Presets</h3> + <p> + With <code>gptel</code>'s transient menu, only a few keystrokes are need to add, edit, or remove context, switch the model one wants to query, change the input and output, or edit the system message. Presets accelerate switching between settings, and are defined with <code>gptel-make-preset</code>. + </p> + <p> + For example, with <a href="https://huggingface.co/openai/gpt-oss-120b">GPT-OSS 120B</a> (one of OpenAI's <a href="https://openai.com/open-models/">open weights</a> models), a system prompt is necessary to minimize the use of tables and excessive text styling. A preset can load the appropriate settings: + </p> + <pre><code>(gptel-make-preset 'assistant/gpt :description "GPT-OSS general assistant." :backend "llama.cpp" :model 'gpt @@ -396,51 +374,48 @@ main() { - Minimize styling. Use *bold* or /italic/ only where emphasis is essential. Use ~code~ for technical terms. - If citing facts or resources, output references as org-mode links. - Use code blocks for calculations or code examples.") -</code></pre> - -<p> - From the transient menu, this preset can be selected with two keystrokes: <code>@</code> and then <code>a</code>. -</p> - -<h4>Memory</h4> + </code></pre> + <p> + From the transient menu, this preset can be selected with two keystrokes: <code>@</code> and then <code>a</code>. + </p> +</div> -<p> - Presets can be used to implement read-only memory for an LLM. This preset uses <a href="https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking">Qwen3 VL 30B-A3B</a> with a <code>memory.org</code> file automatically included in the context: -</p> +<div class="wide64"> + <h4>Memory</h4> + <p> + Presets can be used to implement read-only memory for an LLM. 
This preset uses <a href="https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking">Qwen3 VL 30B-A3B</a> with a <code>memory.org</code> file automatically included in the context: + </p> -<pre><code>(gptel-make-preset 'assistant/qwen + <pre><code>(gptel-make-preset 'assistant/qwen :description "Qwen Emacs assistant." :backend "llama.cpp" :model 'qwen3_vl_30b-a3b :context '("~/memory.org")) -</code></pre> - -<p> - The file can include any information that should always be included as context. One could also grant LLMs the ability to append to <code>memory.org</code>, though I am skeptical that they would do so judiciously. -</p> - -<h2>Local LLMs</h2> - -<p> - Running LLMs on one's own devices offers some advantages over third-party providers: - <ul> - <li>Redundancy: they work offline, even if providers are experiencing an outage.</li> - <li>Privacy: queries and data remain on the device.</li> - <li>Control: You know exactly which model is running, with what settings, at what quantization.</li> - </ul> -</p> - -<p> - The main trade-off is intelligence, though for many purposes, the gap is closing fast. Local models excel at summarizing data, language translation, image and PDF extraction, and simple research tasks. I rely on hosted models primarily for complex coding tasks, or when a larger effective context is required. -</p> - -<h3>llama.cpp</h3> + </code></pre> -<p> - <a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a> makes it easy to run models locally: -</p> + <p> + The file can include any information that should always be included as context. One could also grant LLMs the ability to append to <code>memory.org</code>, though I am skeptical that they would do so judiciously. + </p> +</div> -<pre><code>git clone https://github.com/ggml-org/llama.cpp.git +<div class="wide64"> + <h2>Local LLMs</h2> + <p> + Running LLMs on one's own devices offers some advantages over third-party providers: + <ul> + <li>Redundancy: they work offline, even if providers are experiencing an outage.</li> + <li>Privacy: queries and data remain on the device.</li> + <li>Control: You know exactly which model is running, with what settings, at what quantization.</li> + </ul> + </p> + <p> + The main trade-off is intelligence, though for many purposes, the gap is closing fast. Local models excel at summarizing data, language translation, image and PDF extraction, and simple research tasks. I rely on hosted models primarily for complex coding tasks, or when a larger effective context is required. + </p> + <h3>llama.cpp</h3> + <p> + <a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a> makes it easy to run models locally: + </p> + <pre><code>git clone https://github.com/ggml-org/llama.cpp.git cd llama.cpp @@ -451,13 +426,12 @@ cmake --build build --config Release mv build/bin/llama-server ~/.local/bin/ # Or elsewhere in PATH. llama-server -hf unsloth/Qwen3-4B-GGUF:q8_0 -</code></pre> - -<p> - This will build <code>llama.cpp</code> with support for CPU based inference, move <code>llama-server</code> into <code>~/.local/bin/</code>, and then download and run <a href="https://unsloth.ai/">Unsloth</a>'s <code>Q8</code> quantization of the <a href="https://huggingface.co/Qwen/Qwen3-4B">Qwen3 4B</a>. The <code>llama.cpp</code> <a href="https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md"> documentation</a> explains how to build for GPUs and other hardware — not much more work than the default build. 
-</p> - -<p><code>llama-server</code> offers a web interface, available at port 8080 by default.</p> + </code></pre> + <p> + This will build <code>llama.cpp</code> with support for CPU based inference, move <code>llama-server</code> into <code>~/.local/bin/</code>, and then download and run <a href="https://unsloth.ai/">Unsloth</a>'s <code>Q8</code> quantization of the <a href="https://huggingface.co/Qwen/Qwen3-4B">Qwen3 4B</a>. The <code>llama.cpp</code> <a href="https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md"> documentation</a> explains how to build for GPUs and other hardware — not much more work than the default build. + </p> + <p><code>llama-server</code> offers a web interface, available at port 8080 by default.</p> +</div> <video autoplay loop muted disablepictureinpicture class="video" src="/static/media/llm-ls.mp4" @@ -465,31 +439,28 @@ llama-server -hf unsloth/Qwen3-4B-GGUF:q8_0 Your browser does not support video. </video> -<h3>Weights</h3> - -<p> - Part of the art of using LLMs is selecting an appropriate model. Some factors to consider are available hardware, intended use (task, language), and desired pricing (input and output costs). Some models offer specialized capabilities — <a href="https://ai.google.dev/gemma/docs/core">Gemma3</a> and <a href=""https://github.com/QwenLM/Qwen3-VL">Qwen3-VL</a> offer multimodal input, <a href="https://deepmind.google/models/gemma/medgemma/">Medgemma</a> specializes in medical knowledge, and <a href=https://mistral.ai/">Mistral</a>'s <a href="https://mistral.ai/news/devstral">Devstral</a> focuses on agentic use. -</p> - -<p> - For local use, hardware tends to be the main limiter. One has to fit the model into available memory, and consider the acceptable performance for one's use case. A rough guideline is to use the smallest model or quantization for the required task. Or, from the opposite direction, to look for the largest model that can fit into available memory. The rule of thumb is that a <code>Q8_0</code> quantization uses about as much memory as there are parameters, so an 8 billion parameter model will use about 8 GB of RAM or VRAM. A <code>Q4_0</code> quant would use half that — 4 GB — while at 16-bit, 16 GB. -</p> - -<p> - My workstation, laptop, and mobile (<code>llama.cpp</code> can be used from <code><a href="https://termux.dev/en/">termux</a></code>) all run different classes of weights. On my mobile device, I have about 12GB of RAM, but background utilization is already around 8GB. So, when necessary, I use 4B models at <code>Q8_0</code> or less: Gemma3, Qwen3-VL, and Medgemma. If a laptop has 16GB of RAM with 2GB in use, 8B models might run well enough. The workstation, which has a GPU, can run larger models, faster. There are other tricks one can use — <a href="https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention">flash attention</a>, <a href="https://research.google/blog/looking-back-at-speculative-decoding/">speculative decoding</a>, MoE offloading — to optimize performance across different hardware configurations. -</p> - -<h3>llama-swap</h3> - -<p> - One current limitation of <code>llama.cpp</code> is that unless you load multiple models at once, switching models requires manually starting a new instance of <code>llama-server</code>. To swap models on demand, <code><a href="https://github.com/mostlygeek/llama-swap">llama-swap</a></code> can be used. 
-</p> - -<p> - <code>llama-swap</code> uses a YAML configuration file, which is <a href="https://github.com/mostlygeek/llama-swap/wiki/Configuration">well documented</a>. I use something like the following: -</p> +<div class="wide64"> + <h3>Weights</h3> + <p> + Part of the art of using LLMs is selecting an appropriate model. Some factors to consider are available hardware, intended use (task, language), and desired pricing (input and output costs). Some models offer specialized capabilities — <a href="https://ai.google.dev/gemma/docs/core">Gemma3</a> and <a href=""https://github.com/QwenLM/Qwen3-VL">Qwen3-VL</a> offer multimodal input, <a href="https://deepmind.google/models/gemma/medgemma/">Medgemma</a> specializes in medical knowledge, and <a href=https://mistral.ai/">Mistral</a>'s <a href="https://mistral.ai/news/devstral">Devstral</a> focuses on agentic use. + </p> + <p> + For local use, hardware tends to be the main limiter. One has to fit the model into available memory, and consider the acceptable performance for one's use case. A rough guideline is to use the smallest model or quantization for the required task. Or, from the opposite direction, to look for the largest model that can fit into available memory. The rule of thumb is that a <code>Q8_0</code> quantization uses about as much memory as there are parameters, so an 8 billion parameter model will use about 8 GB of RAM or VRAM. A <code>Q4_0</code> quant would use half that — 4 GB — while at 16-bit, 16 GB. + </p> + <p> + My workstation, laptop, and mobile (<code>llama.cpp</code> can be used from <code><a href="https://termux.dev/en/">termux</a></code>) all run different classes of weights. On my mobile device, I have about 12GB of RAM, but background utilization is already around 8GB. So, when necessary, I use 4B models at <code>Q8_0</code> or less: Gemma3, Qwen3-VL, and Medgemma. If a laptop has 16GB of RAM with 2GB in use, 8B models might run well enough. The workstation, which has a GPU, can run larger models, faster. There are other tricks one can use — <a href="https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention">flash attention</a>, <a href="https://research.google/blog/looking-back-at-speculative-decoding/">speculative decoding</a>, MoE offloading — to optimize performance across different hardware configurations. + </p> +</div> -<pre><code>logLevel: debug +<div class="wide64"> + <h3>llama-swap</h3> + <p> + One current limitation of <code>llama.cpp</code> is that unless you load multiple models at once, switching models requires manually starting a new instance of <code>llama-server</code>. To swap models on demand, <code><a href="https://github.com/mostlygeek/llama-swap">llama-swap</a></code> can be used. + </p> + <p> + <code>llama-swap</code> uses a YAML configuration file, which is <a href="https://github.com/mostlygeek/llama-swap/wiki/Configuration">well documented</a>. I use something like the following: + </p> + <pre><code>logLevel: debug macros: "models": "/home/llama-swap/models" @@ -540,15 +511,15 @@ models: --top-p 0.95 ttl: 900 name: "qwen3_vl_30b-a3b-thinking" -</code></pre> - -<h3>nginx</h3> - -<p> - Since my workstation has a GPU and can be accessed on the local network or via <a href="https://www.wireguard.com/">WireGuard</a> from other devices, I use <code><a href="https://nginx.org/">nginx</a></code> as a reverse proxy in front of <code>llama-swap</code>, with certificates generated by <code><a href="https://certbot.eff.org/">certbot</a></code>. 
For streaming LLM responses, <code>proxy_buffering off;</code> and <code>proxy_cache off;</code> are essential settings. -</p> + </code></pre> +</div> +<div class="wide64"> + <h3>nginx</h3> + <p> + Since my workstation has a GPU and can be accessed on the local network or via <a href="https://www.wireguard.com/">WireGuard</a> from other devices, I use <code><a href="https://nginx.org/">nginx</a></code> as a reverse proxy in front of <code>llama-swap</code>, with certificates generated by <code><a href="https://certbot.eff.org/">certbot</a></code>. For streaming LLM responses, <code>proxy_buffering off;</code> and <code>proxy_cache off;</code> are essential settings. + </p> -<pre><code>user http; + <pre><code>user http; worker_processes 1; worker_cpu_affinity auto; @@ -575,11 +546,11 @@ http { include /etc/nginx/conf.d/*.conf; } -</code></pre> + </code></pre> -<p>Then, for <code>/etc/nginx/conf.d/llama-swap.conf</code>:</p> + <p>Then, for <code>/etc/nginx/conf.d/llama-swap.conf</code>:</p> -<pre><code>server { + <pre><code>server { listen 80; server_name llm.dwrz.net; return 301 https://$server_name$request_uri; @@ -604,15 +575,16 @@ server { proxy_set_header X-Forwarded-Proto $scheme; } } -</code></pre> - -<h3>Emacs Configuration</h3> + </code></pre> +</div> +<div class="wide64"> + <h3>Emacs Configuration</h3> -<p> - <code>llama-server</code> offers an <a href="https://platform.openai.com/docs/api-reference/introduction">OpenAI API</a> compatible API. <code>gptel</code> can be configured to utilize local models with something like the following: -</p> + <p> + <code>llama-server</code> offers an <a href="https://platform.openai.com/docs/api-reference/introduction">OpenAI API</a> compatible API. <code>gptel</code> can be configured to utilize local models with something like the following: + </p> -<pre><code>(gptel-make-openai "llama.cpp" + <pre><code>(gptel-make-openai "llama.cpp" :stream t :protocol "http" :host "localhost" @@ -642,110 +614,142 @@ server { "image/png" "image/gif" "image/webp")))) -</code></pre> - -<h2>Techniques</h2> - -<p> - Having covered the setup and configuration, here are some practical ways I use Emacs with LLMs, demonstrated with examples: -</p> - -<h3>Simple Q&A</h3> - -<p> - With the <code>gptel</code> transient menu, press <code>m</code> to prompt from the minibuffer, and <code>e</code> to output the answer to the echo area, then <code>Enter</code> to input the prompt. - - <video autoplay loop muted disablepictureinpicture - class="video" src="/static/media/llm-qa.mp4" - type="video/mp4"> - Your browser does not support video. - </video> -</p> - -<h3>Brief Conversations</h3> - -<p> - For brief multi-turn conversations that require no persistence, <code>gptel</code> can be used in the <code>*scratch*</code> buffer. Context can be added via the transient menu, <code>-b</code>, <code>-f</code>, or <code>-r</code> as necessary. The conversation is not persisted unless the buffer is saved. -</p> - -<h3>Image-to-Text</h3> -<p> - With multimodal LLMs like Gemma3 and Qwen3-VL, one can extract text and tables from images. - - <video autoplay loop muted disablepictureinpicture - class="video" src="/static/media/llm-itt.mp4" - type="video/mp4"> - Your browser does not support video. - </video> -</p> - -<h3>Text-to-Image</h3> -<p> - My primary use case is to revisit themes from some of my dreams. 
Here, a local LLM retrieves a URL, reads its contents, and then generates an image with ComfyUI: - <video autoplay loop muted disablepictureinpicture - class="video" src="/static/media/llm-image.mp4" - type="video/mp4"> - Your browser does not support video. - </video> - - The result: - <img class="img-center" src="/static/media/comfy-ui-dream.png"> -</p> - -<h3>Research</h3> -<p> - If I know I well need to reference a topc later, I usually start out with an <code><a href="https://orgmode.org/">org-mode</a></code> file. In this case, I tend to use links to construct context, something like this: - - <img class="img-center" src="/static/media/llm-links.png"> -</p> - -<h3>Rewrites</h3> -<p> - Although I don't use it very often, <code>gptel</code> comes with rewrite functionality, activated when the transient menu is called on a seleted region. It can be used on both text and code, and the output can be <code>diff</code>ed, iterated on, accepted, or rejected. Additionally, it can serve as a kind of autocomplete, by having a LLM implement the skeleton of a function or code block. -</p> - -<h3>Translation</h3> -<p> - For small or unimportant text, Google Translate via the command-line with <code><a href="https://github.com/soimort/translate-shell">translate-shell</a></code> works well enough. Otherwise, I find the translation output from local LLMs is typically more sensitive to context. - - <video autoplay loop muted disablepictureinpicture - class="video" src="/static/media/llm-translate.mp4" - type="video/mp4"> - Your browser does not support video. - </video> -</p> - -<h3>Code</h3> -<p> - My experience using LLMs for code has been mixed. For scripts and small programs, iterating in a single conversation works well. However, with larger codebases, I have not found that LLMs can contribute meaningfully, reliably. This used to be an area of relative strength for hosted models, but I surmise aggressive quantization has begun to reduce their effectiveness. -</p> - -<p> - So far, I have had limited success with agents. My experience has been that they burn through tokens to understand context, but still manage to miss important nuance. This experience has made me hesitant to add tool support for file operations. I am actively exploring some techniques on this front. -</p> - -<p> - For now, I have come to distrust the initial output from any model. Instead, I provide context through <code>org-mode</code> links in project-specific files. I have LLM(s) walk through potential changes, which I review and implement by hand. Generally, this approach saves time, but often, I still work faster on my own. -</p> - -<h2>Conclusion</h2> - -<p> - I first used Emacs as a text editor 20 years ago. For over a decade, I have used it daily — for writing and coding, task and finance management, email, as a calculator, and to interact with local and remote hosts. I continue to discover new functionality and techniques, and was suprised to see how this 50-year old program has adapted to the frontier of technology. Despite flaws and limitations, Emacs' endurance reflects its foundational design. -</p> - -<p> - Unfortunately, the barrier to entry for Emacs is high. 
For everyday users, comparable power and flexibility could be unlocked with support for: - <ul> - <li>Notebooks featuring executable code blocks</li> - <li>Links for local and remote content, including other conversations</li> - <li>Switching models and providers at any point</li> - <li>Mail and task integration</li> - <li>Offline operation with local models</li> - <li>Remote access — Emacs can be accessed via SSH or TRAMP</li> - </ul> -</p> + </code></pre> +</div> +<div class="wide64"> + <h2>Techniques</h2> + <p> + Having covered the setup and configuration, here are some practical ways I use Emacs with LLMs, demonstrated with examples: + </p> +</div> +<div class="wide64"> + <h3>Simple Q&A</h3> + <p> + With the <code>gptel</code> transient menu, press <code>m</code> to prompt from the minibuffer, and <code>e</code> to output the answer to the echo area, then <code>Enter</code> to input the prompt. + </p> +</div> + +<video autoplay loop muted disablepictureinpicture + class="video" src="/static/media/llm-qa.mp4" + type="video/mp4"> + Your browser does not support video. +</video> + +<div class="wide64"> + <h3>Brief Conversations</h3> + + <p> + For brief multi-turn conversations that require no persistence, <code>gptel</code> can be used in the <code>*scratch*</code> buffer. Context can be added via the transient menu, <code>-b</code>, <code>-f</code>, or <code>-r</code> as necessary. The conversation is not persisted unless the buffer is saved. + </p> + + <h3>Image-to-Text</h3> + <p> + With multimodal LLMs like Gemma3 and Qwen3-VL, one can extract text and tables from images. + </p> +</div> -<p> - So far, my experiments with LLMs has left me with concern and optimism. Local inference reveals the energy requirements, yet daily limitations make me skeptical of imminent superintelligence. In the same way that calculators are better than humans, LLMs may offer areas of comparative advantage. The key question is which tasks we can delegate reliably and efficiently, such that the effort of building scaffolding, maintaining guardrails, and managing operations costs less than doing the work ourselves. -</p> +<video autoplay loop muted disablepictureinpicture + class="video" src="/static/media/llm-itt.mp4" + type="video/mp4"> + Your browser does not support video. +</video> + +<div class="wide64"> + <h3>Text-to-Image</h3> + <p> + My primary use case is to revisit themes from some of my dreams. Here, a local LLM retrieves a URL, reads its contents, and then generates an image with ComfyUI: + </p> +</div> +<video autoplay loop muted disablepictureinpicture + class="video" src="/static/media/llm-image.mp4" + type="video/mp4"> + Your browser does not support video. +</video> + +<div class="wide64"> + <p> + The result: + <img class="img-center" src="/static/media/comfy-ui-dream.png"> + </p> +</div> + +<div class="wide64"> + <h3>Research</h3> + <p> + If I know I well need to reference a topc later, I usually start out with an <code><a href="https://orgmode.org/">org-mode</a></code> file. In this case, I tend to use links to construct context, something like this: + + <img class="img-center" src="/static/media/llm-links.png"> + </p> +</div> + +<div class="wide64"> + <h3>Rewrites</h3> + <p> + Although I don't use it very often, <code>gptel</code> comes with rewrite functionality, activated when the transient menu is called on a seleted region. It can be used on both text and code, and the output can be <code>diff</code>ed, iterated on, accepted, or rejected. 
Additionally, it can serve as a kind of autocomplete, by having an LLM implement the skeleton of a function or code block. + </p> +</div> + +<div class="wide64"> + <h3>Translation</h3> + <p> + For small or unimportant text, Google Translate via the command line with <code><a href="https://github.com/soimort/translate-shell">translate-shell</a></code> works well enough. Otherwise, I find the translation output from local LLMs is typically more sensitive to context. + </p> +</div> + +<video autoplay loop muted disablepictureinpicture + class="video" src="/static/media/llm-translate.mp4" + type="video/mp4"> + Your browser does not support video. +</video> + +<div class="wide64"> + <h3>Code</h3> + <p> + My experience using LLMs for code has been mixed. For scripts and small programs, iterating in a single conversation works well. However, with larger codebases, I have not found that LLMs can contribute meaningfully, reliably. This used to be an area of relative strength for hosted models, but I surmise aggressive quantization has begun to reduce their effectiveness. + </p> + + <p> + So far, I have had limited success with agents. My experience has been that they burn through tokens to understand context, but still manage to miss important nuance. This experience has made me hesitant to add tool support for file operations. I am actively exploring some techniques on this front. + </p> + + <p> + For now, I have come to distrust the initial output from any model. Instead, I provide context through <code>org-mode</code> links in project-specific files. I have LLM(s) walk through potential changes, which I review and implement by hand. Generally, this approach saves time, but often, I still work faster on my own. + </p> +</div> + +<div class="wide64"> + <h2>Reflections</h2> + <blockquote> + <p> + <i> + The question of whether a computer can think is no more interesting than + the question of whether a submarine can swim. + </i> + </p> + + <p> + Edsger Dijkstra + </p> + </blockquote> + + <p> + I first used Emacs as a text editor 20 years ago. For over a decade, I have used it daily — for writing and coding, task and finance management, email, as a calculator, and to interact with local and remote hosts. I continue to discover new functionality and techniques, and was surprised to see how this 50-year-old program has adapted to the frontier of technology. + </p> + + <p> + Unfortunately, for most users, the barrier to entry for Emacs is high. For other frontends, comparable power and flexibility could be unlocked with support for: + <ul> + <li>The ability to modify their own environment and capabilities</li> + <li>Notebooks featuring executable code blocks</li> + <li>Links for local and remote content, including other conversations</li> + <li>Switching models and providers at any point</li> + <li>Mail and task integration</li> + <li>Offline operation with local models</li> + <li>Remote access — Emacs can be accessed via <code><a href="https://www.openssh.org/">SSH</a></code>, <code>gptel</code> files via <code><a href="https://www.gnu.org/software/tramp/">TRAMP</a></code></li> + </ul> + </p> + + <p> + There are many topics of concern and discussion around LLMs. From my work with them so far, I am more anxious about some than others. Local inference alone reveals how much energy these models can require. On the other hand, the limitations of the technology leave me extremely skeptical of imminent superintelligence. What we have now, limitations included, is useful — and has potential.
+ </p> +</div> diff --git a/cmd/web/site/entry/static/2025-12-01/metadata.json b/cmd/web/site/entry/static/2025-12-01/metadata.json @@ -1,6 +1,6 @@ { - "cover": "dream.png", + "cover": "drawing-hands.jpg", "date": "2025-12-01T00:00:00Z", "published": true, - "title": "Using LLMs with Emacs" + "title": "Recursive Intelligence: Using LLMs with Emacs" }