ai: OpenWebUI Missing Mac Manual with Ollama and ComfyUI notes v1.8 2025-03-18

Well, Open WebUI is an incredible project that provides a nice graphical front end for local AI development, but the documentation is lacking. You can figure some things out by looking at the documentation, but mainly it is trial and error.

So here are my notes for using the various features that are on the edge as most people are doing docker development on Windows machines. with CUDA, this is targeted at getting it running on Apple Silicon.

The documentation is pretty sparse and the features are coming in fast, the best way to use the latest is just to track their GitHub Discussions group.

Before you start, take advantage of install-ollama.sh

OK, this is a bit of a pain, but we have a set of shell utilities that make maintaining all this stuff easier in Rich Tong’s repo, if you create a mono repo called src and then add bin and lib then you can do a ./install-ai.sh -v and get the latest models curated for free.

Installation with pipx or uvx or Tauri

Sadly, uvx installation doesn’t work. It is missing an installation, but you can do a simple one, brew install ollama && pipx install open-webui --python 3.12 and it will work. Before version 0.5.4 you had to use Python 3.11 but 3.12 works now.

What this does is use brew to install ollama and then pipx is this incredibly wonderful thing that creates a Python virtual environment (venv) and then adds it to a local path useful

The thing on a Mac is that you want to avoid using docker because then you have to split your memory into a docker controlled component and a system one. And you will never get that right.

You can also use brew install uv && uvx open-webui to get to the same place, but I kind of like pipx install because it does the command line munging so you can just do a open-webui serve without needing to know about doing a `uvx open-webui`. If you want to update it you need `pipx update-all`

Finally, there is work going on to use Tauri to package open-webui as well, so it is just a DMG file as Open Webui Desktop, but I couldn’t get this to work.

This does mean that you have to go to great lengths to do pipx installation for the rest and have them running as separate small servers. That is way better and more memory efficient than using docker as you get natural splits between MacOS native applications (like FinalCutPro for instance or ComfyUI or DiffusionBee) and your other tools.

Going Local and OFFLINE_MODE

You probably want to go local since that’s the point of these experiments and there are and there are many parameters you should change. If you are on an airplane or have crappy WiFi start it with `OFFLINE_MODE=true open-webui serve` so the backend doesn’t spend five minutes time out.

The main one is to load the best of the local models. We do have a script that does this, but basically from the command line note that this assumes different memory sizes and is probably already out of date:

# create a new window and start ollama
ollama serve &
# now do the pulls I like to pull the tagged parameters
# so it is easier to know what you are loading
ollama pull llama3.2:3b tulu3:8b
# if you have a 64GB machine
ollama pull qwq:32b qwen2.5:32b llama3.2-vision:11b
# if you have a 128GB machine
ollama pull llama3.3:70b tulu3:70b nemotron:70b llama3.2-vision:90b

Downloading other GGUF models

OK, one confusing this is that even though Hugging Face has 1.2M models up there only a few can be downloaded by Open WebUI. This is because underneath, Ollama is just a wrapper around Llama.cpp only accepts GGUF files that can come from Ollama.com or from Huggingface.co (defaulting to Q4_K_M quantization, but that’s a whole other post), but on Huggingface itself if you click on the model drop down it will generate the proper pull request for you.

As an aside GGUF stands for Georgi Gerganov’s Universal Format. It’s a bit of a pain because the tool to do this is in his llama.cpp library with a neat trick using docker to run this stuff

Dealing with HuggingFace models

Most of the rest of the Hugging Face models are in HF format, so you need to convert them to GGUF and as a community service, publish the conversion. Here we enter utilities hell which I’m going to cover in the next blog post about how to do these things, but this is a great example the major ways are like the Ghost of Christmas past, present, and future:

Past. Pray they have a Homebrew package, but if not it’s painful. Git clone a repo, install naked pip requirements into the system, and pray. This is illustrated in the last command box. The main issues here are that first you are cloning a bunch of stuff you don’t need and have to maintain and also that you have to remember where these things are in README.md or something and then you are doing a pip install into the system environment and who knows what version of python you are using just so you can run a single script convert-hf-to-gguf.py and then manually make Modelfile that explains how the inputs work. Sigh.
Present. The current hotness is stuffing everything inside a docker container and then you get this amazingly complicated command line and you have to understand that there is an internal file system and an external one. And of course, with Docker, you have to allocate separate space for it and you get these huge Docker containers with a full operating system in them basically (duplicated on a Mac) just to get a few lines of Python running. Then there are online looks like gguf-my-repo that are on Hugging Face (more on the future of tools in another post), but of course, there is a way to programmatically do this.
Future. The emergence of npx, uvx, pipx, condax and pkgx is the family of executables that create even lighter-weight environments that are language-specific. Instead of a big docker virtual machine on Apple Silicon, you end up with just enough to run the script which is usually a virtualized environment so multiple versions of Python. Or with tools like Dagger at least they high the containers behind a nice user interface (although of course, the container I tried didn’t work. Of course with things like npmx, uvx, the nice world of brew update gets replaced by an update from each tool, but it is very lightweight!

I can’t say really which is easiest, but since it’s one time, the HuggingFace running application is pretty great and it is free as long as you take less than 120 seconds of CPU time.

# get the hugging face tools
brew install huggingface-cli
# or if you are cool this is sort of super pipx
brew install pkgx
pkgx install huggingface-cli
# the format is _repo_/_model_
ORG=qwen
MODEL=qvq-72b-preview
# if the above fails you need a GGUF conversion
ollama pull "hf.co/$ORG/$MODEL"
MODEL_DIR="$HOME/wsn/data/models"
mkdir -p "$MODEL_DIR"
huggingface-cli download $REPO/$MODEL --local-dir "$MODEL_DIR/$MODEL" --include "*"
#Convert to GGUF sigh co kdf
docker run --rm -v "$MODEL_DIR/$MODEL":/repo ghcr.io/ggerganov/llama.cpp:full --convert "/repo" --outtype f32 --outfile /repo/$MODEL.gguf
# This creates a file fp32 file in $MODEL_DIR/$MODEL.gguf
#Quantize from F32.gguf to Q4_K_M
docker run --rm -v "$MODEL_DIR":/repo ghcr.io/ggerganov/llama.cpp:full --quantize "/repo/$MODEL.gguf" "/repo/$MODEL.Q4_K_M.gguf" "Q4_K_M"
# Or the old way
git clone git@github.com:ggerganov/llama.cpp
cd llama.cp
# yes see the previous posts about uv
uv pip install -r requirements.txt
# now run the conversion
./convert-hf-to-gguf.py $MODEL_DIR/$MODEL --outfile $MODEL_DIR/$MODEL.gguf --output q8_0
# Now you have to creat the Modelfile to match this
# Sigh this more complicated than it looks because the 
# meta data on how system prompts and user prompts
# work is not in the hugging face file itself.

How it all works and logging

Because it is pretty confusing works, but here it is. Note that the easiest way to see the logs is to run all these processes I need a separate terminal window so you can see what is coming out of standard output. Here is how you know each process is working:

open-webui. It will end with uvicorn started
ngrok. This will end with a message saying look at http://127.0.0.1:4040
ollama. Will end with a message telling you how much compute RAM it has (96GB on a 128GB M4 Max by the way)
tika. This will end with a message that says started at http://localhost:9998

flowchart TD User <–Port 8080–> OF[Open WebUI Frontend] –> Context Context –> Marshall <–Port 5173–> OB[Open WebUI Backend] Modelfiles –> Marshall OB –Port 11434–> Ollama –fork–> Llama.cpp Llama.cpp –output–> Ollama –Port 11434 –> OB –5173–> Context –> OF Hf[Huggingface.co hf.co/repo.model] –pull–> Modelfiles Hf –> OS[OpenWebUI Cache] Hf –sentence transformers–> OIC[Open WebUI Cache] O.c[Ollama.com] –pull–> Ollama <—> OS[Ollama Store $HOME/.ollama/models] OB –> Oc[OpenWebUI Connections] Oc –> Oapi[Open AI ./api/v1] Oapi –> OpenAI Oapi –> Grok Oapi –> Cerebras Oapi –> Mistral Oapi –> Deepseek OF –Upload–> Knowledge –#know–> Context Files –> OF –> Context

Making the whole config visible with DATA_DIR set to a git repo if you use git lfs but Google Drive is better

One thing is with pipx installs the data directory is buried way down in `~/.local`. So finding the key files is hard. And if you developing and have a release version then you are constantly swapping things. Fortunately, you can at run time set DATA_DIR to something easier to find to share uploads, vector databases, etc. with `DATA_DIR=~/wsn/data` or put it into a git repo if you want real version control so configs can be shared. This has lots of binaries so if you do this you need git lfs but makes debugging much easier since you can share your entire configuration this way like DATA_DIR=$WS_DIR/git/src/app/open-webui-data` but it will blow up fast or use Google Drive to share this. But this last can cause concurrent update problems.

maybe would recommend instead either making the default location. Symlink of use the DATA_DIR. If you put it into a Shared Drive then different team members can use Aya a demo setup easily.

Backup the Setting Database

This is a real pain if you have to delete over and over, so I just use chezmoi to capture their SQL database. They don’t use INI files, but have an Alembic database that keeps track of the many parameters.

You can go to Lower Left > Admin Settings > Database > Export config to JSON note that the API keys are here in plain text, so do not check this in, you should store it someplace like 1Password if you need it.

Resetting the User Database and Backups

I found that I was locked out of this, the simplest thing to do is to delete all the files where the configurations are kept. You can also send an environment variable to do this:

RESET_DATABASE=1 open-webui serve &

If you are doing a pipx installation, the actual location of the webui.db is buried because it is in the ./data directory of the working venv, this is dependent on the python that you are using, but with pipx it will be in a strange directory buried in pipx:

$HOME/.local/pipx/venvs/open---webui/lib/python3.12/site-packages/open_webui/data
# here are the interesting files
webui.db # the alembic database
uploads  # the files you have uploaded
vectordb  # where your RAG information is stored
cache/audio  # .wav and transcripts
cache/image/generations  # where .pngs live

If you do have a problem with the. user database. Which is did, you can also reset the configuration with RESET__CONFIG_ON_START=1. They talk about a config.json, but I can’t find it anywhere.

Backup of config.json, webui.db and chats

You want to pretty frequently do backups because on each version change, you can lose your configuration and also there are bugs in the system so you can bork your configuration. I try to do this every day or so. This works well, but if you set the DATA_DIR as noted above and use git or something then you don’t have to do this all the time.

Settings > Admin Settings > Database and do both an export config.js which has all your API keys
Export Database which gives you webui.db which has more configurations
Export Chats because you will lose those.
These all have API keys and things, so you can put the chats in a repo, but I would put the config.json and webui.db into 1Password or someplace secure like iCloud Drive. Not in a repo.

The many configuration settings

One note is that while Open WebUI takes in many environment variables, those marked PersistentConfig are only read once and then disappear into the webui.db and you can only change them in the Open WebUI interface. They have this idea of an OPEN_API_KEY for instance but this disappears into the database.

If you care about your settings, you can go to Lower Left > Admin Settings > Database > Export Config to JSON files and save it.

Multiple OpenAI-compatible API points

While you can use functions and pipelines, so many models are available from an OpenAI-compatible interface it makes sense just to have them all here. I can’t find a way to load them programmatically though, so you have to redo this every time you set up a system in Lower Left > Admin Settings > Settings > Connections

Here’s a list that I use. Since Open WebUI doesn’t tell you where the models live, you have to intuit this by different rules for model names to Lower Left > Admin Settings > Settings > Models get more metadata.

They don’t tell you who the provider is, but in the all-important Capabilities at the bottom, you can see if it supports Vision or Citations.

Also, you can’t tell which provider you are using from the user interface, but the syntax of the model namermes is subtly different so that’s a hint. See the third column, but if you see a link icon on the right it is hosted in the cloud while if it has a number then it is local. The local model syntax is easy just look for a number after the model (which is its size), it has the form hf.co/org/rep if it downloads from there or whatever random name they pick on ollama.com, it is typically lower kebab case with the version with a dash, so qwen2.5:72b or llama3.2-vision:90b. The net is that only Cerebras can really confuse you if you follow this decoder ring. Note that in Admin Settings > Model you can see in the subtitle what is running it, so it might be google_genai

Using Connection Prefixes

Note that when you have lots and lots of models, you can use the optional model prefix function, when you create the connection below, just add a prefix like ollama or openai and then in the Model names will become ollama.llama3.2:1b but if you do this then make sure that any backend code or agents you write have this notation as well as the system is doing a straight string match, but this will keep clear

URL	Comment	Model Syntax	Model Prefix
https://api.openai.com/v1	They have lots of old and date versioned models	They use lower kebab case like `gpt-4o-audio-preview-2024-10-01`. Note that they never do `gpt4`, it is `gpt-4`	openai
https://api.groq.com/openai/v1	Very high speed, not as fast as cerebras but way more variety. They have lots of old models. Like cerebra’s these often has short context lengths	Names are lower case in x provider/model[-]version syntax: `llama-3.2-90b-vision-preview`	groq
https://api.deepseek.com/v1	deepseek-chat is V3 and the pricing is incredibly low, so use it! and deepseek-r1 is deepseek-reasoner	They use lower-case kebab so it is `deepseek-chat` and `deepseek-reasoner` without versions	ds
https://openrouter.ai/api/v1	This is the most confusing because they route to every other provider like OpenAI, Amazon, Google, and many other provides using Open Models. If you want free models just search for “Free”	Their syntax is Initial Case as Provider: Model. They are easily confused with the real Google models. For instance, `Google: Gemini 2.0 Flash` looks the same as Google’s base offering, so you need to go look in Model to see what you are using. They have a sea of models like `Deepseek V3` that a similar passthrough. Free ones look like `Meta: Llama3.2 90B Vision Instruct (free)`	or
https://mistral.ai/v1	These are open source models that end with *stral	The model names are in kebab case like `codestral-mamba-latest`	mis
https://api.cerebras.ai/v1	These use a high performance design like grok, usually with limited context windows 8K or less	Note they change their models alot, but the format is lower snakecase so hard to distinguish, `llama-3.3-70b` seems to be the only one up today for instance	cer
https://dashscope-intl.aliyuncs.com/compatible-mode/v1	Alibaba proprietary (and open) like Qwen-Max	they use lower-case without any versions, such as `qwen-max`	Ali
https://api.totalgpt.ai	Infermatic.ai. This came up as an alternative to OpenRouter.ai, but it is expensive so not using it and don’t want to pay $15/month. They do use vLLM underneath
http://localhost:8081	Llama.cpp’s llama-server and the main use is that they do KV caching so are very fast for context-aware RAG. It’s easiest to get into ./ollama/models and find the right blob that is the model weights	You will get the actual file name of the model served.	lcpp
http://localhost:4000	OpenWebUI used to support LiteLLM, but don’t anymore. So you can run LiteLLM separately if you want. LiteLLM which shims 100+ LLMs with the OpenAI call format, but then you need another component that is an LLM proxy with `pipx install 'litellm[proxy]'` and then `LiteLLM --model huggingface/bigcode/starcoder` puts a proxy at the port `4000`		lite
http://localhost:52415/v1	Exo links machines together and runs VLLM between them and uses MLX on Apple machines. You will see in OpenWebUi all the available models. Exo is useful if you want an overflow machine, so for instance if you have ComfyUI and want to run models, just have another machine and the LLMs will overflow to that one. There is no explicit pulling, just select the model and exo will do a download of it.	The models are also in lower snake case and include things like how many bits and it is distilled. When you select a model, it will download it. And you can see this in the exo TU. So they look like `deepseek-r1-distill-qwen-7b`	exo
http://localhost:8080/v1	This is the connection to mlx_lm.server which uses mlx directly	All the models are listed and download automatically. Note these are pure compressed models so are really Q4. Their names are from hg.co so look like org/model so like `mlx-community/DeepSeek-R1-Distill-Llama-70B-4bit`	mlx

Making EXO work locally and remotely

EXO is a very powerful tool that merges the compute in multiple computers to share compute. It uses different models on different architectures and provides each node with a set of layers to run and then transmit the intermediate points values between them with IP networking.

Note you need tail scale to make networking outside of single network work and have to set TAILSCALE_API_KEY and TAILSCALE_TAILNET if you are not on a shared LAN and you start it with these set, then it will use Tailscale to create a network automatically. YOu do not need to use Tailscale directly, it will just do it for exo which si really nice.

Ollama Compatible models

There are fewer of these, but there are a few important cases. Note you can use the model prefix for these so local for 11434 and then remote for others

Ollama APII	Comment	Model Format
http://localhost:11434	Default port changed with OLLAMA_BASE_URL	look for a link icon and lower case with size `llama3.3:70b`
http://richs-macbook-pro-2021.local:11434	With Macs, you can use their lower snake case name with .local	Same format and will duplicate local Ollama so very confusing

Exo and running distributed LLMs

This is a really cool library that automatically networks together Linux or Mac systems into a single LLM. Installation is a little tricky, but you basically have to

Git clone your own fork of https://github.com/exo-explore/exo
Then pick your favorite venv. We use uv, so you do a uv init --no-workspace
The trick is that they have a customer setup.py file that looks for GPUs, nVidia, MLX and so forth and loads the appropriate pip packages. There is no easy way to duplicate this with the pyproject.toml that is created. So to get it started, you have to temporarily rename pyproject.toml and then do the installation.
The other bug that you will find is that if exo Is in the PATH then you will get a strange error 13 because it is confused about the exo in the path, so you need to start it directly from the bin where it is located

mv pyproject.toml pyproject.bak.toml
uv pip install -e .
mv pyproject.bak.toml pyproject.toml
# once the above is done, then exo is run from an absolute path
"$PWD/venv/bin/exo"

Functions: Analyzing with the Code Interpereter in 0.5.10 and Jupyter in 0.5.15 with None, Token (recommended) or Password

The newest version has a code interpreter so that it will automatically run the code. You need to turn on the Code Interpreter at the bottom of the chat and then when it runs, it will say “Analyzing” and show the results of running the code. Pretty cool!

And now you can connect to Jupyter Labs if get it set up, although I’ve not quite gotten it to work, but the basic idea is to install Jupyter-lab and run it where pipx is very convenient and this starts it off on the local host at port 8888:

# this doesn't work mainly because the App needs to be in user space
pipx install jupyter-lab
pipx inject jupyter-lab <Any pip plugins you want>
jupyter-lab

You have two choices, pyodide is fast but limited, and it is sandboxed. You can’t add files and the packages are set ahead of time. You have to git clone the repo or change a script. For more power, you can start a Jupiter server. This is to be in user space, so do a uv sync and then a uv add JupyterLab and any other Python you need.

You have to disable tokens and disable xsrf forgery protection at the command line. Then the run goes over the default port set in admin setting code execution automatically. So you need a magic command line. First, you need to install it into a real environment and then you can disable the Token, set your Token (recommended) or go to the trouble of creating a hashed Password.

# Go to a working directory where you keep artifacts and ideally a repo
cd ~/ws/git/src/user/studio-demo
# Now create a uv virtual environment pyproject.toml
uv init --no-workspace
# if you using asdf direnv and a python version
asdf cmd direnv set python 3.12
# add the jupyterlab extensions you want...
uv add jupyterlab-vim
# Now if you want zero security (not recommended, but it works) where you set Open WebUI > Admin Settings > Code Execution and set the Jupyter Authentication to None
uv run jupyter lab --no-browser --ServerApp.token='' -ServerApp.disable_check_xsrf=true

# If you want a specific token that doesn't vary, this 
# is like a clear text password
uv run jupyter lab --no-browser --ServerApp.token='1234'
# this is deprecated above use a different flag but this does not actually work and creates a random provider
# the token must be very long or it will be ignored 1234 won't work and should be an hexadecimal string
uv run jupyter lab --no-browser --IdentifyProvider.token="b5b1fd4c9d3e988ed075d1079db314d52147c26746a8b0cc"
# note the token is only used once, once it is done, then a cookie is set
# also the token is not echoed onto the console, but replaced by three dots.

# If you want a password, you need to hash it and stuff it into 1Password
# and put the setting into .envrc
# Assuming you created an entry called JupyterLab Local Password Dev and the random password is in the field password
[[ -v JUPYTERLAB_HASHED_PASSWORD ]] || export JUPYTERLAB_HASHED_PASSWORD="$(op item get "JupyterLab Local Password" --field "password" --reveal)"
# now start jupyter with the password, but 
uv run jupyter lab --no-browser --ServerApp.password="$JUPYTERLAB_HASHED_PASSWORD"

You can also set this to automatically run the models by clicking on the “Code Interpreter” button underneath the chat window. If it works, you should see an “analyzing” flashing button underneath and then it will spit code and the result out.

If Open WebUI detects code, I presume by looking for a Markdown tag, then you can also manually run, save and copy it.

I’m not quite clear how this happens and it is hard to see the prompting, but I ‘m guessing some sort of tag is emitted. Not all models support this mode as you can see from this chart, and as models shrink in size, around 7b for Qwen and 3b for Llama they fail to generate, but here’s the detailed chart. Note the new Mistral-Small is pretty good:

Model	Manual Run	Code Interpretor Auto
Qwen2.5:7b,14b,32b	Yes, 7b garbled	Yes, 7b No
Qwen-coder2.5:7b,14b,32b	Yes, 7b garbled	Yes, 7b No
LLama3.2-vision:11b	Yes	Yes
Llama3.3:1b,3b	Yes	Yes, 1b No Atuo
Granite3.2:8b	Yes	No
Deepseek-r1:32b	Yes	No
Command-r7b:7b	Yes	No
Phi4:14b	No, out as XML	No
Olmo2:13b	Yes	No
Dolphin3:8b	Yes	No
Mistral-small:24b	Yes	Yes

but the bigger models like gpt-4o, llama (but not the 1b parameter versions), and anthropic do this automatically.

Notably, but deepseek does not in the full or distilled version. Phi4:14b also does not, although XML is detected, you can save it but not run it.

Functions: Getting Anthropic, Google, and Perplexity Running

Some models are not OpenAI compatible, so you need to find the right Functions to use it. To do this, go to the Lower Left > Admin Settings > Functions and look for the Discover a Function at the bottom, these are the two I enable. You do need a login to Openwebui.com which is different from the localhost. This page is a complete mess of different things from functions to prompts to other stuff, but the best view I think is to go to Models > Functions which will show you the most used functions. Note that the website is slow, it looks like it is doing dynamic generation so it can take seconds to click from one page to another.

What is a function, well basically it is a way to do a single call out from OpenWebUI (this is compared with Pipelines which allow multiple stages and it is more separated).

Once you load them they have this concept of a Valve which is just a variable, so once you load. So what you do once you find a function is to click on the GET button and then it will ask if you want to Import to WebUI and you need to type the URL of your Open WebUI localhost (this is usually http://localhost:8080. This will copy it into your local host and then you choose Save to stick into your environment. Not super clean but it works.

Note that most of these functions are not documented so it is hard to know what depends on what, so you can get quite a few errors, like Visualize for instance requires OpenAI.

Also, note that in the Settings > Models section, you can tell it comes from a function because a anthropic.claude3.5 appears or google_genai.gemini-1.0-pro-latest

Function Name	Settings	Model Syntax
Anthropic	Here is where you get Claude	The names are lower kebob case with version and type like this: `anthropic/claude-3.5-sonnet`
Google GenAI	Note that the GOOGLE_API_KEY import doesn’t work, you will need to add to the manually as a Value to this function	The Models are Initial Caps with a colon and then name like `Google: Gemini 2.0 Flash Thinking Experimental`
Perplexity	I didn’t try the API import, I just added it to the API Key. This is nearly OpenAI compatible but not enough and you have to manually update the functions with new models	These are pretty easy to spot as they have `Perplexity/Sonar Reasoning 128k 8k output` as their header and they include their context lengths which is great.

Adding a new Function like Perplexity but looks like a problem with length of model_id or some other unpredictable behavior

This is also pretty mysterious, but here is what you should do:

Put your function under source code control, for instance create an open-webui-functions repo in your personal area
Now can search the OpenWebUI functions directory for a similar function. Then you can copy the python and put it into a .py file in your repo.
Debug it by adding it to your Open Webui instance
Then when done, go back to OpenWebUi.com and then there is a little plus icon to the right of search. Select function and click that and you will enter the meta data.
You definitely need a unique function name that it will suggest, make sure your function includes the location of the function in GitHub and you are done!

I definitely had some strange problems the next day, I kept on getting 400 errors even for code that was identical. I think there’s a bug where switching functions and out doesn’t work, so you probably want to reboot if you are doing lots of debugging. I had to definitely at least kill open-webui and start over. This is sort of pain because it takes quite a big of time for the backend to load. Even when it says read, I find it takes a minute to two particularly if some services are missing.

The interesting thing is that if I give it the model identifier perplexity it works and perplexity_2025 does not work, we get an 400 Error, but perplexity works, so the theory is that there is a character limit, but it is very strange

Visual Tree of Thought adds mcts to capable models

Note that one of the most popular functions, Visual Tree of Thought will add an mcts for Monte Carlo Tree Search to the above when this is enabled. I couldn’t get this to work

Using Remote Ollama, Ngrok, and OpenWebUI

You can do this in the Ollama API list, so for instance, if you have another MacBook with Ollama running, you get to it with http://richs-macbook-pro-2021.local:11434 as an example works and if you use ngrok, then you can actually go this remotely.

Pretty handy for a quick way to get a departmental server

One of the nice things about OpenWebUI is. that it just calls APIs, in this case Ollama is the default. You can also run. Ollama remotely and then you just start it with OLLAMA__HOST=0.0.0.0 ollama serve and then. it will serve anything on the Internet. This is a little dangerous of course but convenient

Then you just go to Admin Settings > Connection > Ollama Host and add the domain name something like http://richs-macbook-pro-2021.local:11484 and it will serve from there. Very nice for departmental setups. Get a Mac mini M4 Pro and serve. your entire workgroup.

Ngrok which does authentication and has a little server on the host machine is another answer. The only problem is that Ngrok generates an AVT Anti-virus error since it is used in many hacks. What you can do is to create an account on ngrok.com and then create an ngrok server with ngrok http --url _your static domain_ --oauth google --oauth--allow-domain __your domain__ which should protect you.

The setup here is a little more complicated:

You have to log in to ngrok.com and get an account
Then brew install ngrok
Note that many anti-virus programs mark ngrok as a bad program because it is commonly used in hacks. You need to go to your Antivirus and exclude the executable which should be in somewhere like /opt/homebrew/Caskroom/ngrok/<version>/ngrok
Now you need to authenticate with your ngrok config add-authtoken you get this from their console
Then you can run ngrok remote the port 8080 of the Ollama server with ngrok http --url _the static domain_ 8080 --oauth-google --oauth-allow domain=tongfamily.com which says remote port 8080 and protect it with Google authentication and only allow accounts from tongfamily.com

So basically at this point, you are using Open WebUI locally from its point of view and the bugs. with web sockets is not an issue.

Enabling RAG Documents and Web Retrieval

OK, now in Lower Left > Admin Settings > Settings > Documents are about a million configuration settings that enable RAG, the big ones are the embedding mode.

The basic idea is pretty simple, you can use the # notation or choose upload file and it adds it to the chat RAG area. There are two ways to do this.

First, you upload your local documents in the Workspace > Knowledge section. Note that the documentation is actually very out of date here. Knowledge is basically a folder system, so you can turn on different pieces. This allows you to upload directories and sync them, so it’s a nice way to have say a repo with your documents and then. you can sync. Then when you enter # in a chat, it will show you all the available documents you can load. Then it will RAQ the data and the LLM can. use that data. You can RAG a single file or you can RAG an entire Collection. It automatically can add Citations if the LLM. you chose supports it. Tulu3:8B for instance works well.
You can also temporarily load files by choosing the Upload option in any chat. But it is nice to have the documents already there.
You can enable Google Drive by setting GOOGLE_DRIVE_API_KEY and GOOGLE_DRIVE_CLIENT_ID and it will be available. Go to the Google Console and Enable the Google Drive API need for Web Apps, then you create an API key and make sure to Edit the API key to restrict it to just Google Drive. Then. you need a Drive Client Id as well, but there are no specific instructions for this so look below, I figured ouit how it works.
You can do a download of Web source as well with #https://tongfamily.com but this is nearly useless given all the gunk that is in a typical website, they don’t really tell you how to fix this, but there is a huge Web Search section. It will actually show you the document that it pulled, when you hit enter and you can click on the document itself to see what is there

There’s a new RAG and Web search that is agentic in 0.5.15 that I haven’t tried.

How to tell if your download is working, look at the console output

The way that you can tell. if it works is to go to the console and see if the open__webui.env is downloading things when you hit enter and you should see the model getting loaded

How RAG Embedding works and how to defeat it in 0.5.18

They don’t tell you what is going on here, but the RAG system uses a completely different method of dealing with models than the core Chat system, and it is not well documented, but here is what happens as described below. Note that with large context models as of 0.5.18, you can just bypass all of this and the raw model is enabled by changing Admin Settings > Documents (which is a pretty good choice for demos if you have a large memory machine and can preload the documents):

Unlike Ollama, OpenWebUI RAG supports the HuggingFace models, so you don’t need to do any conversion. That’s the good news.
The bad news is that on Apple Silicon at least it looks like these models *do not* use the Neural Engine hardware so are slow
Second is that the huggingface-cli cache all the models it has in. Note that to set this all up, create an HF_TOKEN and use 1Password to retrieve it in .bash_profile or .zshrc.
The cache can get big, it ate me out of lots of disk space and lives in ~/.cache/huggingface/hub so you might want to symlink to your backing storage if it is too big.

The next is that default models are very small at less than 1B so you don’t really see a performance hit and they use the SentenceTransformers library of HuggingFace. The other options are:

SentenceTransformers. The default. They download directly from hugging face, so the syntax to get a new model is org/repo so for instance Nvidia/NV-Embed-V2 is valid. They don’t really tell you what the syntax in the Embedding model line is as aside, so that is it
OpenAI. You can use theirs, which we avoid since we want this to be all local
Ollama. They do allow you to use Ollama for the models as well and the syntax here is just the name of the model. Note that in ollama.com, you can search just for embedding models and some valid names are nomic-embed-txt or bge-large

Hybrid Search separates Embedding from Reranking

Options to improve RAG, you can select Hybrid Search this means that there is a separate model to generate embeddings and to decide which document chunks to fetch and then a much slower but more accurate reranker that takes the bucket of chunks and thinks more about which ones to pick

Note that the reranker doesn’t seem to have a Ollama option, so you are stuck in CPU mode if you use this

As a refresher, there are two parts of RAG, first is the embedding model which converts every word into a multidimensional token. The. idea is that the more dimensions, the more you can find similarities. The best ones have 5,000 dimensions and the job is to find a list of documents that look similar. The idea is to quickly retrieve a lot of documents and then the reranker works slowly to figure out what is the most relevant.

The reranker also know as a cross-encoder takes the query and a document and give s. you a similarity score. You use it to figure out which documents are most relevant. The Top K means you pick the top 3 (if K=3) of these.

Picking RAG models not all of which work

There are a series of models starting with the recommended ones and also looking at the mteb/leaderboard on huggingface and I went through to figure out out what is working and what is not. The way to know if it works is not that obvious, you either watch the console output or when you click on the download, but the success message is misleading you have to wait to see if it save “Embedding Model Set to…” nothing may happen that is there could be an error and you will not know. The testing is laborious, you have to reload a corpus and then see if you get reasonable output when you run the RAG with the pound sign to add a document, but the base models are pretty good delivering a 62.62 score with the base model, all-miniLM-L6-v2

✅sentence-transformers/all-MiniLM-L6-v2 which seemed to work and the performance is documented at Sbert.net and is the default but is not particularly high performing at least as far as getting good results. This is https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 which scores 62.62 (state of the art is more like 72), so not bad for a tiny 23MB model and good for CPU only use. It takes 2 seconds to load a 118Kb document
✅ sentence-transformers/all-mpnet-base-v2. Has the highest performance by a small fraction and it does seem to load OK which as https://huggingface.co/sentence-transformers/all-mpnet-base-v2 scored 57.7, so you should use L6 if you are only using CPU models

Ollama models. You can also have Ollama do the inferencing but you have to have one of the few GGUF models out there from newest to oldest and cross referencing these models with the MTEB Leaderboard, but the net is that you can get to 64.23 score by using bge-large from ollama and if you hosting it remotely it offloads your local machine and its a 563M parameter model so you need quite a bit more to beat a 23MB model and it takes 8 seconds to process a 127K token model.

This link does seem to have its problems, I get network error sometimes and the whole open-webui has to be restarted

granite-embedding. From IBM 30M and 278M which are 57.25 and 56.97 respective, so not that good from https://huggingface.co/ibm-granite/granite-embedding-278m-multilingual and https://huggingface.co/ibm-granite/granite-embedding-30m-english
snowflake-arctic-embed2. Snowflakes latest 57m which looks close to https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0 and shows a 59.48 score. Not that great
https://huggingface.co/BAAI/bge-large-en-v1.5 aka https://ollama.com/library/bge-large scores 64.23% which is the best of the ollama bunch
paraphrase-multilingual. A sentence transformer model 278m or https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2 which is 57.6 so not great
bge-m3. 567m does not have full scores on MLEB at https://huggingface.co/biswa921/bge-m3
mxbai-embed-large. 335m from Mixedbread.ai or https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1 does decent at 63.25
all-minilm. From SBert.net 22m and 33m and judging by model size I’m guessing this is https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2 for the 33MB model scoring 56.53 and https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 for the 22m model. These are the defaults for open-webui
nomic-embed-text. This is a 137M parameter model which is https://huggingface.co/nomic-ai/nomic-embed-text-v1.5 and scores 62.28 which is not bad for a tiny model.

Now for the top ranked models and this uses the internal Open WebUI infrastructure which is nice since there are many more SentenceTransformer formatted models than GGUF and the real attraction is that you can get performance in the 72 range but it is very slow but the 1.5BQwen model is a good compromise delivering 64 score for 13 seconds.

❌ https://huggingface.co/nvidia/NV-Embed-v2. 8B parameters 72.3 score. Note that on model download, I got a timeout error, but this is not exposed in the User Interface it just returns and it looks like the model is loaded. It’s getting a no response returned from Hugging face?
❌ https://huggingface.co/infgrad/jasper_en_vision_language_v1). It is not rejected by huggingface ‘NoneType’ object has no attribute ‘encode’. Should some message about model load failure get surfaced in the UI rather than looking at logs
❌ https://huggingface.co/dunzhang/stella_en_1.5B_v5. 1.5B parameters It says no model found with this name and something about no periods allow in the name. 71.2 score. Again no error message and this fails
✅ https://huggingface.co/Salesforce/SFR-Embedding-2_R. It is very slow but it scores 70.3 on the average but even with GPU work, it takes time to make it return a result because it is pretty heavyweight at 7B parameters. It takes minutes to product a result from a 117K token input file. And it takes a while to run on web results. A bad choice if you pair it with a slow search engine like Google PSE (see below).
✅ https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct. This does load correctly but as a 7.B parameter model it takes some time. Score is 70.24 and takes a few minutes on an M4 Max to process 117KB.
🟡 https://huggingface.co/BAAI/bge-en-icl. This is a heavyweight 7B parameter model need 26GB of storage and it definitely jams Neural Engine on Apple Silicon. 71.7 score and very slow, takes a minute to process 120KB. This appears to work properly and I can see vectors being returned. The tulu3:8b model does generate a full response and does not cite properly. Llama3.2:3b seems to work fine but doesn’t cite.
❌https://huggingface.co/dunzhang/stella_en_400M_v5 is a lightweight high performance model like its 1.5B brother. This fails with a no attribute encode in the console and you can tell it doesn’t work because there is no download.
❌https://huggingface.co/BAAI/bge-multilingual-gemma2. This does download properly and it is a huge 9.2B parameter model scoring 69.88. This seemed to work and then application ran away consuming all available memory. Seems like an OpenWebUI bug.
✅ https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct this is a big father down. There are many other 7B parameter models, but this is just 1.5B at has a 67.13% score, so hopeful it will be great. This took 13 seconds to embed a 117KB document so about 50% longer than bge-large. And it handles citations perfectly.

RAG Benchmarking for embedding and chunk fetching

Nothing is more important than your corpus, so here are suggestions for testing:

Create a single large file with all your relevant data. For instance, we just take our entire website and pour all the markdown files into a single one.
Use a stopwatch, and set the Embedding model (and reranker too). Then go to the chat press + and upload that file. When the cursor stops blinking that will tell you the tokens per second (up to 59K tps on an M4 Max down to 3K tps).
Then when you ask the question, look at the console and run your stopwatch again, when you see the vectors come back in the console, that’s the time to pick the chunks (and if you are running a reranker to run that too).
Now to look at quality, go to the actual document that is returned and it will have the relevance and the content. Eyeball it and see how good it is.
Finally, the international with the model is important so read the text that comes out.

RAG Recommendation

Basically you can see how good these RAG solutions are by taking a big corpus and then when you enter it, see what parts are actually RAG selected. A quick manual look at the footnotes is really interesting.

For most people, if you have a constrained machine, then the default sentence-transformers/all-MiniLM-L6-v2 is a good choice. It scores 62.8/100 and is very fast 2 seconds for 118K Tokens. And are sub-second when doing the RAG search.
For power users https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct, Then for some more accuracy, you get to 64 with ollama, but if you have the hardware, the ✅ is hard to beat at 67 and still processing at 13 seconds for 118K tokens.
If you really need quality, then really big models like Qwen2-7B are not that practical, they take 90 seconds or more to embed a 118K token document to load even on an M4 Max. But the results are more detailed, so there is that. As an aside on our test dataset, although Salesforce was supposed to be better we found Qwen2-7B more accurate.

Hybrid Search and Rerankers

The idea here is pretty simple, have a fast model do the initial embedding and finding of RAG chunks and then have a heavier weight reranker look at the top K chosen chunks and pick the right chunks by priority. It’s hard to judge whether these are better, but pretty attractive to say use a small model like mini

For rerankers, the mteb/leaderboard lets you know which heavy weight model you want here. So you should pick a lightweight initial model like the default sentence-transformers/all-MIniLM-L6-v2 or Alibaba-NLP/gte-Qwen2-1.5b-instrcut and then use a big model to do more work

✅baai/bge-reranker-v2-m3 and sentence-transformers/all-miniLM-L6-v2. This is also the default in the user interface itself and it doesn’t have an MTEB score, so hard to know how good it is but it is a 568M model so a good pair to the default 22M one on small machines. And wow looking at relevance of the chunks it does an awesome job with perhaps a total upload of 2 seconds for 118K Tokens and processing time of 8 seconds for top 8. I don’t know the MTEB score but very usable with GPU acceleration. The quality was OK, it found one good chunk, but the second one was just ok and the order was jumbled
✅Alibaba-NLP/gte-Qwen2-7B-instruct and Alibaba-NLP/gte-Qwen2-1.5B-instruct. This is the performance leader but very heavy which is a big price to pay with no GPU. at 7.6B parameters scores 61.4% and is heavy and the 1.5B is heavy. 14 seconds to upload, then reranking takes 30 seconds and Tulu emits in 2 seconds. The quality of the RAG output is excellent as well.
✅✅✅sentence-transformers/all-miniLM-L6-v2 and Alibaba-NLP/gte-Qwen2-7B-instruct. Given the quality of the Qwen 1.5B/7B was so good compared with L6/rranker, we tried to see if just adding the 7B was the key, so again 2 seconds to upload and the 9 seconds to process top K and 11 seconds for top 8 and the results were also excellent
✅sentence-transformers/all-miniLM-L6-v2 and dunzhang/stella_en_1.5B_v5 comes in again as the best reranker at 61.2% and this seems to work for ranking but not embedding which is interesting. The actual results were really fast 2 seconds and then 5 seconds to return, but the actual chunks returned were not that relevant
✅✅sentence-transformers/all-miniLM-L6-v2 and https://huggingface.co/ibm-granite/granite-embedding-125m-english comes out at 60.33 average which is not as good as Stella but worth considering since it is only a 125M parameter model so you get a 2 second to load and a 3 second readout for top 3 items

The conclusion I that you should probably use the bigger reranker as that gives you better answers even if you have to wait 10 seconds.

With a small machine, all-miniLM-L6v2 with granite/granite-embedding-125-english is good
If you can afford the time and hav e a fast machine, the use of L6-v2 with Qwen2-7B for reranking is really important. To me, having more accurate results matters.

RAG Crashing if Chunks are too Big

I got a hard crash when I set the chunk size too large at 5000 with 500 overlap. It is some sort of PyTorch bug when arrays get too big (true for Automatic1111). I reset it it 1000 characters with 100 character overlap. Hopefully that helps. This seems to be an issue when uploading a big document. There is not a great resource for how to tune this that I can find. The Langchain defaults are 1500 and 100 overlap. A lot of course depends on the context length you are using. In the old days, you old had 2-4K tokens to play with, so 1500 taken up is a lot. But as you can see below you can run into other limitations. At least 5,000 seems like it is too many. The main idea of chunk size is that it should be just about big enough to contain a unit of an idea (like a single comment or answer in a Q&A database and not so much as to confuse things). If it is too big that’s called the needle in the haystack problem. But if you are for instance just looking for keywords if you are splitting words, then it can be as small as 1 character with zero overlap (if its a character splitting, you just get letters), so make sure you have the splitting set to tokens:

WARNI [chromadb.segment.impl.vector.local_persistent_hnsw] Number of requested results 8 is greater than number of elements in index 5, updating n_results = 5
/AppleInternal/Library/BuildRoots/d187755d-b9a3-11ef-83e5-aabfac210453/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:850: failed assertion `[MPSNDArray initWithDevice:descriptor:isTextureBacked:] Error: total bytes of NDArray > 2**32'

OpenWebUI Thrashes with Custom Context Length

Ok, this is an interesting bug, basically Ollama will reload the model with context length changes, so because OpenWebUI runs an Ollama LLM to generate customer titles and completions, it causes the main model to be reloaded over and over. So there is a fix in Ollama which has been floating for two months (yes I asked why it has not been applied). In the meantime, you should turn off the completion and title generation but turning off the generation is good, but maybe it is worth it to fix Ollama.

Ollama is on a remote machine and beware stability issues

You can also change to Ollama itself instead of the included embedding model but there are no guides, but looking the ollama.com site and seeing what is newest and most popular, the ones to try are. The main issue is some stability where it willl fail with network errors. Note that with Ollama, you do not get to do hybrid search in RAG

The slowness of OpenWebUI Sentence Transformers for RAG on the Mac and a fix in 0.5.5 and later

These are much bigger than the other default models, but that’s what a big computer is for and they only run once doing the embedding. The downloads for this take forever.

It turns out that this is because there is no Apple Silicon detection, the so called “mps” device for PyTorch. I just added a PR for this and it seem to work well. This is now fixed on 0.5.4 it says, but when I load 0.5.4, it still uses the CPU, so got to figure that out and a patch is in process.

A Guide: Adding Knowledge aka RAG Corpus

There are a few conveniences that they have which include Workspace > Knowledge that lets you add directories automatically and sync them. Note that if you change the embedding model, you need to reimport all the documents.

This stuff seems a bit buggy by the way I’m not sure what is going on, but you need to look at the backend terminal output to see if things are loading

Tika Document Content Extraction for exotic document ingestion

It’s not super clear how to get the names of these models and it also lets you set the Content Extraction to default or Tika, I’m not sure what this does. It also has a PDF extraction system as well that is not clear how it does it as that’s a hard problem.

The Content Extractor is either default or Apache Tika, they recommend using docker for this, but again I’m looking for native stuff, but you basically get it running on http://localhost:9998 and then watch out. This Tika thing is a super parser that knows thousands of data types. This this is just a Java program, so I’m wondering if there is a. pipx like thing that just installs it. And there is a brew install tika

I’ve not yet seen any gains doing this, but theoretically they have hundreds of plugins to read things, so I added it.

The default scanner is actually very good and I tried these sample documents:

Docx. That is the Office format no problem for both default and tika
Excel XLSX. The thing barfs with a crashed for both default and tika
PowerPoint PPTX. Ingested just the text for both default and tika
Adobe PDF. Ingest fine just the text only for both default and tika does but it needs a JPEG-2000 plugin that seems truly hard to install and requires a bunch of work to load it somewhere where it can be found.
Image PNG. This seems to just get copied up and won’t work for RAG
Archive ZIP of documents and code. Caused real headaches, the system just hangs for default but for Tika it does seem to read it but I can’t see the result. For this it does seem to be using the GPU for something when trying to uncompressed a ZIP file and then all of OpenWebUI hangs, so I think it is a problem.
Video MP3. Seeing if it can find the transcript amd then Tika crashes

So net, net for most common documents Default seems to be fine and I’m not quite sure how to diagnose and fix Tika since its a Java application

RAG Changing Top K and Prompt

There are some other simple heuristics like changing the Top K to 8 from 3 so there is more context for the model. Also changing the prompt might help, but it is all black magic really. The move to 10 will make the reranking job harder but hopefully give it more choices.

Large context models take minutes to load but is amazingly good

If your data set is small, you might just skip all this and just insert the entire file set into the system and see how it behaves. It turns out this is harder than you think because the default context windows is just 2048 tokens and this is silently truncating user inputs. Also a number of optimizations are not taken, you have to set them:

If you upload a file, then it uses RAG by default, so to test the long context, you need to concatenate all the input into out text file. file . -type f =exec cat {} ; | pbcopy is your friend, this loads up the clipboard and then you can paste it in. Or you can choose upload file if you like but it’s harder to see the contents you have to click on the document.
The first thing thing you learn is that Open WebUI by default chops all context lengths to just 2K for performance.
So you have to go into each model in Admin Settings > Models > _Your Model > Advanced and set context length to what ever the real limit is. I can’t quite figure out how to find this in Open WebUI, but ollama info tulu3 for example shows the context length. You can also change this globally if you think all the models you use support say 128K tokens with Settings > General > Advanced Parameters > Show > Context Length
Then you need to optimize the KV Cache and the use of Flash Attention by setting OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q8_0. This halves the memory with no noticeable impact. For small machines, try q4_0 and this works great.

Once you do this, when you Upload a file, you will have the choice:

Upload the a file with the + at the chat
Click on the file uploaded
At the upper right, you can choose “Segmented retrieval” which means use the RAG or the whole document is just inserted. If you have set the context high enough, it is going to take some time. For instance, Tulu3:8B runs at 200 prompt tokens per second so a 118K token document is going to take a while at about two minutes or about 988 tps
I can’t find a way to cache the KV Cache that is created (like Anthropic does), but that will get rid of this. It should be in the Knowledge section I think. Llama.cpp which is used underneath will do runtime caching but although llama-cli supports this open-webui doesn’t use it.
Note that is want knowledge always loaded you can create a model file and then in the model config you can give it a workspace/knowledge so that you don’t have to constantly upload a file.

Llama.cpp KV Cache instead of Ollama

The solution to the problem is prompt caching, you load the initial sequence which with llama.cpp you can do like this, so that you basically provide the initial prompt the first time, it saves the kv-cache and then when you run it again, it doesn’t have to generate this. Llama.cpp provides alot of control and actually manages the cache looking for similarity in both the user and system text. The most interesting one is --lookup-cache-static and --system-prompt-file and this is honored by ./lllama-server so it does process the system prompt for all the slots and then caches it for all the users.

Note there is an unrelated concept where you just store the last prompt and the response and cache that (that’s not really an LLM thing and doesn’t prime a system for additional questions. So if you want this control and don’t need Ollama pulls and so forth just running llama.cpp is pretty cool and it doesn’t seem that hard to enable it. You lose a few things like model enumeration.

Note that with llama-server, you can’t set a preprompt, but it will remember previous prompts so if you feed it with your prompt first, it will be very fast as it is using kv caching.

# run the first time and save content.gguf
./main -c 32768 -m models/mixtral-8x7b-instruct-v0.1.Q8_0.gguf --prompt-cache context.gguf --keep -1 -f initialPrompt.txt
# run the second time
./main -c 32768 -m models/mixtral-8x7b-instruct-v0.1.Q8_0.gguf -ins --prompt-cache context.gguf --prompt-cache-ro --keep -1 -f initialPrompt.txt

MLX_LM Server is 30% faster and MLX training

The MLX stuff is supposed to be 30% faster than running with GGUF so you have to pipx install mlx and then you have to use huggingface-cli to download models and it only runs one model at a time. I did a quick benchmark and at least with Ollama vs MLX it wasn’t much faster. Could be Open WebUI overhead. More testing is needed.

I have not tried it yet, but I like Llama.cpp, MLX is the new Apple-derived framework that uses hugging face models. And it supports writing the cache, which is perfect for fine-tuning use. And it even supports QLoRA, LORA at the edge training. With JSONL training data. And it can preload CAG. Can’t wait to try it. You just brew install MLX-lm and run MLX_lm.server which will create an OpenAI Compatible API at “HTTP://localhost:8080/v1” by default or you can change with mlx_lm.server --port 9000 and then it is http://localhost:9000/v1. Themain problem with Open WebUI integration is that eats the first <think> token so it doesn’t hide it.

One cool thing about MLX is that you can do fast training on a Mac, but I haven’t tried that yet. This is supposed to be 30% faster than GGUF generation with Ollama as this is not Mac specific, so trying Deepseek-r1:70b_q4_K_M with a prewarmed so the model is in memory. It’s pretty hard to compare these, since Q4_K_M is differentially encoded and more like 4.6 bits per weight, but the q4 that mlx uses is just a straight 4 bit precision. Also things like quantization matter. Note that to get the numbers of tokens, you need to turn on “usage” and use the query “python first 10 fibonacci.”, I get a better per token performance by quite a lot, particularly for big models with MLX.

Note that neither Exo or MLX support the thinking token and MLX usage shows the number of tokens but not the timing. Exo does not support usage at all, so you need to time it yourself and then use the GPT-4 token counter to estimate how many tokens.

But the charts show that mlx_lm is about 20%-40% faster than Ollama GGUF. Exo which uses the same MLX is about as fast as raw MLX, but Exo adds 20% overhead, so you are back at the same place. Basically, for maximum performance with long queries MLX is not a bad bet.

Fibbonaci	Prompt/Response Tokens	Time to Last Token	Token/Sec
mlx_lm.server	1/704	1:10m	10.0
ollama	24/393	49sec	8.5
exo (direct chat)	1369	2:27sec	9.5
exo	8/1099	2:14sec	8.2
Taylor Swift
mlx_lm.server	1/673	1:08	9.9
ollama	13/1116	2:39	7.1
exo (direct chat)	4/1957	2:33	9.6
exo	13/568	1:08	8.5

You can add new models with huggingface-cli download _hf.co_url which is from the MLX Community and look for MLX models with lmx_lm.manage --scan and remove wttih mlx_lm.manage --delete _model_name_

Apple Neural Engine in its infancy

Note that none of these support ANE acceleration, although there is a project to use the Apple Neural Engine for small models that I have to track.

Web Search

Very confusing, but basically you have to a few things to enable Web Search, it doesn’t work out of the box:

Add the search engine to the system as noted below which ones are good. I’m not quite sure which one I like yet.
Click on the web search button in the chat. What is not clear is what actually happens when you hit send,, but basically, this will cause it to feed the string you send it to the search engine. The engine will return a bunch of site contents. These searches are returning json, so you basically get the site and some summary data. It then uses RAG to figure out what’s relevant and answer the question. So as an example using the Jina engine (see below for discussions), it will take about 30 seconds, but it send back a set of pages. If your LLM supports citations, then it will add a footnote where it found the data. I installed each and it seems that the basic issue is that when you ask for the top 5 results, they should return website, but the problem is that some of them are very verbose, you do not want the HTML goo just the content. You can actually see what it has sucked in because each page brought in is clickable and you can see the relevance ranking and also the contents. Hint most of the searches are worthless because the summary is so small.
Note that you have to turn on the above Web Search button every time you enter a new query
The other mode is to use the pound sign syntax. This means that if it finds a match in the knowledge base, it will do a RAG. But if you do #_url_ then it will fetch just page and load it as a document to search . If you type #https://tongfamily.com and then #https://tne.ai it will suck in the contents of this as a local file and apply rag to it. And you can click on each document after it finishes spinning to see what is retrieved, then you can run your query. So this is liked controlled searching.
If you install an LLM that does this citations and there are literally a sea of them (see below) so you get a full citation like Bing does. For instance Tulu3:8b supports this and you get nice footnotes

Web Search has an LLM that Interprets your query

OK, this is super confusing, it turns out that if you type, #Richard Tong famous it does not just feed that to the search engine, instead, there is an LLM that reads this and which generates up to five alternative queries and then picks the top one. That is why when you do the search, you will see the search that is actually run can be nothing like what you typed.

I don’t think there is a way to fix this, but it maybe worth a Pull Request to defeat this. I can see why they do this, but for people with good Google-Fu, it’s not very useful. I have to investigate what the “Task Model” is, but I presume that like Sentence Transformers for RAG that it is buried somewhere and has aburied prompt

The same Task model for chat titles

It is running locally with sentence transformers built into openwebui just as RAG embedding and speech models do. I need to investigate what it is and where the prompt is by default. The RAG model also has a prompt. But there a sea of variables you can set for RAG and the chat titles that you can tweak. This is how you get these colorful titles for the chat titles that is basically trying to summarize the first thing you chat.

Cool Trick: Adding it as a Browser Search Engine

OK, this is a cool trick, if you have a custom browser and have going to localhost all the time for searches.

Go to your browser where you can add custom search engines. For instance with Brave, you go to brave://settings/searchEngines and next to Site Search, there is an Add button where you add this which means that if you type :oi your_prompt it will run OpenWebUI against it

Name: Open WebUI
Shortcut: :oi
URL: http://localhost:8080/?q=%s

Search Engine Selection

The main thing here is you want the search engine to return a nice JSON so the LLM can process it. I’m doing this in order of what the pulldown menu is, presuming that’s a measure of what the default should be:

Serxng: This can either be a local server (where you will need docker or you need to git clone the repo). It can also be a free public instance one of the free sites like https://kantan.cat/search?q=<query> with their string leads to rate limit errors. If you set it to one concurrent request then it works properly and returns a document but, I get 403 forbidden. You can also run this as a python application. Basically you need to run your own if you want it to work you have to find a not so busy instance like https://search.rhscz.eu where the main trick is figuring out the Query URL which is https://localhost/search?q=<query>, but you should look at the console, most of these sites return an Error 403 forbidden when you want format=json. A nice option is to locally host Serxng, I looked at this and it is basically a web server that you can brew install and then a python venv so not that hard to run.
Google PSE. This requires API key if you don’t want ads and is limited to 10K queries which is fine for this with JSON Output. So you can get a key from Google and you need to create a Programmable Search Engine as well at this place and you enter the Google PSE Engine ID in Admin Settings > Settings > Web Search > Web Search Engine is PSE. When this is working, you should see in the console web pages being adding to the collections. This whole setup is really slow taking five or more minutes to bring back a result when looking for the top 10 sites (so set it to 3 the default or the top five). It does throw lots of warnings about not using HTTPS, since we only give it an ID, there must be something wrong with the url creation internally.
Brave Search requires a free API key limit to 1/second and 2000 per month, this actually works pretty well, but searching 10 sites is going to take 10 seconds.
Kagi. The search is in closed beta and relatively expensive at $25/1000 calls. It is ad free though.
Mojeek. This is a UK site with no ads that charges 1 pound/1000 calls. I”m trying to get a free site first though to see how well they perform.
Serpstack. This is quite fast and you just need an API key and gives you 100 free queries a month. It’s $30/month for JSON returned data for 5,000 searches. The search was pretty relevant, it found this site, tongfamily.com, creativedestructionlab and linkedin.com plus microsoftalumni.com and pitchoobk.com. All were good references but I got the dreaded return on token bug, where is would just stop at the first generated token. I suspect I’m exceeding the 2K token buffer (see below on how to fix)
SearchAPI. This given the google search engine looks for sites rather than the specific URL. It seams like it is just finding the top hits from the # sign. If you feed it an exact URL it searches with the google engine just that site and does a retrieval. Otherwise it gives you a bunch of results.
Duckduckgo. Like SearchAPI does not return the contents, seems like it is the search page and isn’t fast
Serply. This is a special search engine that uses API keys and is designed specifically for searching for market data. The nice thing is that it returns simple JSON that is just the data of the search which is sort of useful. One problem of course is that it only returns the search results and not the full data in the sites themselves. There are a huge number of different ways to run the search such as news searches and so forth, but unfortunately I get a 'NoneType' object has no attribute 'pop' error.

Changing the context length of 2048 to 131072 needs 100GB to store the context

In this day and age, that is really short. On the other hand even with quantization, in our tests a 118K document with an average of 4.8 characters per English word and 0.75 tokens per word or 4 characters per token or 29.5K tokens takes about 20-30GB of RAM because the embedding vectors are massive (500 to 1.5K vectors of 4-bit) or about 750B/token, so just setting it to 128K context length (131,072 tokens) would need over 100GB just to store the tokens.

You cannot set a default context length as of Open WebUI 0.5.4, so you need to set for each model which is a pain. When I’m doing RAG and things I just set it to 131,072 but you can do better by calculating the maximum size.

Using the Search #url at a time then ask

The syntax is string, the # with a URL, but this is very inconsistent and
seems buggy, but it isn’t, its just not well documented. What you do is that you:

Enter each URL with a pound sign.
Press enter after each one
You should see them become documents
You can click on each and see the contents
When you click, you can use the toggle which is set for RAG or “entire document”.
Then enter your query
You will see document below
When you click on each you will see the RAG section that is used. Each will have a relevance score

Each document added is added to the context when you run a query but sometimes, you actually get the search engine results. And then annotation shows you the part that the RAG on the Web search has selected results. The default is to use “Focused Retrieval”, but in the document you can select “Using Entire Document”.

As an example, enter each of these with a SEND, that’s the main trick. What is happening is that each of these queries is treated as a separate temporary RAG document. You can click on them and choose if you want “Focused Retrieval” or “Whole Document”. This long context models and lots of memory as we said before, it’s often nicer to just stick the whole thing into context (sometimes called CAG for Context Aware Retrieval).

#https://tne.ai/about
#https://tne.ai/solution
#https://tne.ai/app
#https://tne.ai/sys
summarize company

If you want to do a Google like search

Then you should then:

Type the search phrase you want
Clickon the button that says Web search
This then invoices the Task model which generates some possible queries and it selects the first one and runs it so don’t surprised if the query sent is very different.
This will call the default search engine and return up to three sites.
You can set how many site you want in settings

As an example with this prompt and Web Search clicked, you will get the three most relevant sites summarized.https://github.com/open-webui/open-webui/issues/9921

Note there is a P R to make this even better with multiple search generated by the invisible search llm

Richard Tong biography

Local RAG with uploaded documents

This actually works well, you just add documents with the plus and it summarizes, for instance Deepseek Coder 2.5, seems to have issues with RAG but the local models like LLama3.2 work OK. There do see to be some bugs because
sometimes it says it can’t find the document files.

Here is the process:

Upload files
Then you can run a prompt on them.
There is a setting for what RAG to use

Uploading from Google Drive

This is a pain because you have to create a google web app and generate a client id and an API key. I failed at doing this. I thought I gened properly but it says unauthorized? I think the problem is you need picker API and drive API enabled. But here is a complete guide to how to do it:

First you have to go to your Google Workspace as an Admin and enable Apps > Drive and Docs > Features and Applications > Allow users to access Google Drive with the Drive SDK APIs.
Now enable both the Google Drive and the Google Picker APIs in the Google Cloud Console. You must have a project already and then you go to the Enable API section, search for Google Picker and Google Drive and enable them. Go to the API & Services section and there is an Enable APIs and Services button at the top
Next you need a Google API Key that is tied to these, so go to the API Credentials section and click Create Credentials and pick API Key. For security, make sure that the API restrictions are set to Google Drive API and Google Picker API. SAve this in your 1Password and create a .envrc to set GOOGLE_DRIVE_API_KEY where you start open webui.
Now go back to Create Credentials and create an OAuth client ID. Give it a good name and then in Authorized JavaSCript origins, enter the URL that you use to access Open Webui, the default is HTTP://localhost:8080. Copy the Client ID on the right and add to your 1Password and set as GOOGLE_DRIVE_CLIENT_ID. Make sure this is exported where you start open webui
Finally, start Open WebUI and go to Admin Settings > Settings > Documents > Google Drive and select it.
It should all work, now test by going to a New Chat and click on Plus sign, you should see a new Google Drive entry. Click on that, you will be authenticated and should see your Google Drive

Note that the documentation as of 0.5.7 is incorrect, the API key and client id are not persistent, they need to be set everytime you start Open WebUI.

Speech to Text via OpenAI APIs

They have external options like OpenAI which is pretty simple, you give it the normal OpenAi endpoint, https://api.openai.com/v1, the key and then the STT Model is a little mysterious, but there is the set of whisper models from 2022-24, the paper mentions they have large-v2, large-v3 and large-v3-turbo, but its not clear what the names are in OpenAI’s interface, but it does say they are using large-v2 models including whisper-1 and in the API document, they have some of the available parameters and whisper-1 which confusingly is actually whisper V2 is the only one available.

Speech to Text via Local Whisper models

The Whisper models are both Speech to Text ones that are from OpenAi but run locally

It also mentions faster-whisper but again this does to seem to have a list of models seems pretty small at distil-whisper-large-v3 is not found, so I tried some of these

large-v2
distil-whisper-large-v3
small

But these do not seem correct, but looking at the error logs that come out of stderr, but if you say type in “foo”, then you can see what the valid models are which are currently as of December 2024 by looking at the stdout of the open-webui backend.:

tiny.en
tiny
base.en
base
small.en
small
medium.en
medium
large-v1
large-v2
large-v3
large
distil-large-v2
distil-medium.en
distil-small.en
distil-large-v3

I have to say distill-large-v3 sounds pretty good to me although I haven’t found the statistics.

Hugging Face models for STT do not work

And in looking at this YouTube video it’s clear that in addition to these default repos, you can specify any hugging face url so for instance so sentence-transformers/all-mpnet-base-v2 which does try to download but doesn’t actually load, so stick to the ones listed. This is supposed to be the best one and the default. There are other tools that show you how to do voice chat, but I can’t find anything about what models to use.

Text to Speech local is best done with WebAPI and Zoe and Brave/Chrome not Safari

Well again, there are a huge number of different things and it is pretty confusing what is in there right now with the current 0.53 version:

OpenAI. You fill in the API key as usual, but it is not clear what TTS Models you have available the two options appear to be tts-1 and tts-1-hd. The main thing is that tts-1 has lower latency but tts-1-hd has fewer errors. In testing, the delay with tts-1-hd seems minimal compared with LLM processing time so I use tts-1-hd. You also have a collection of voices, alloy, echo, fable, onyx (a lower soothing voice), nova (a higher voice, sounds like HER), and shimmer (a higher voice). I picked nova
Azure Speech and Eleven. These are paid and I didn’t try them.
Web API is another misnomer, this uses the default voice that is available on your Mac. In spelunking around, enabling this is hard, but you go to System Settings > Accessibility > Spoken Content and then the trick is to click on System Voice and find the biggest possible package, this turns out to be Zoe for English. By the way, if you want to do some demos, you can also turn on Speak Selection and then when you hit Option-Esc it will say what you’ve highlighted. It sounds way better than Siri, but the between-word intonation isn’t bad. Right now given the problems below, I’d say this is the best local option. One bug is that it will give many many different names the voice pull down, but leave it at Default to get Zoe. Note that this WebAPI only works properly on Brave, Chrome and Chromium. It does not work on Safari

Here are the best voices for various languages look in System Settings > Spoken Content > System voice and hit the (i) icon

English	Zoe (Premium)
Chinese Mandarin	Li-Mu (Enhanced)
German	Petra (Premium)
Vietnamese	Linh
French	Amelie (Premium)
Hindi	Kiara (Premium)

Koroku instead of Web API TTS for User mode only

Very strangely in 0.16.0, this is actually in Use settings audio and not admin settings, but the quality is supposed to be very good. Will report if it is better than MacOS Zoe. I’ve listened to both and they both sound decent. But definitely not as good as say an Eleven voice with its intonations, but Koroku is a very good idea since it is in the browser and doesn’t require any local Mac setup. But note that this appears to be tied to the language of the browser, which is not like the Web API setting where you can set it independently

One confusing thing is that for Text to Speech, the Admin settings doesn’t have a Koruku setting, but it is true for every user, so you have to set it per user and then the other local option is Koroku, but Eleven and others do not appear, so I think Default now means what ever is the default from the admin as when you select say OpenAI in Admin.

What seems to be happening is that If you choose Web API, you get what the browser thinks Is the standard API for TTS, but it is not the system voice I definitely confirmed. That the default doesn’t have any of the Mac Voice.

On Brave, you see the voice types along with an annotation that tells you the language type, but in Safari, you just get the voice name, so you see lots of duplicates. Also with Safari, you do not see system voices like Zoe but you do with Brave

Also in User mode, you can select “Non Local” voices which seems to give another set of English voices that are settable by the user.

Trying to use TTS with Transformers Local Nightmare with MPS problems fixed in 0.5.20

This was true with Open WebUI 0.5.16, but with 0.5.20, it seems to work better. Note that these voices are all in US English, although they have collection of accents. Note that the code here is <cmu>_<language>_<speaker> so cmu_awb_arctic... means from Carnegie Mellon, was means Scottish Male.

The codes on Hugging Face are presumably initials of the speakers. Note that the Speech T5 does have support for other languages, but only the CMU voice set which is English only is supported:

bdl. US Male
slt. US female. This sounds pretty artificial
jmk. Canadian male
awb. Scottish male
rms. US male
clb. US Female
kip. Indian male

Transformers (local). This looks like another CPU-driven model, again, it’s hard to know what the valid names are here. The user interface says it uses SpeechT5 and CMU Arctic Embeddings. I’m guessing based on the RAG portion that these are huggingface models, but it’s not clear what the names are since the SpeechT5 doesn’t have any huggingface names and the CMU listing is for a dataset. I just tried a file name cmu_us_awb_arctic-wav-arctic_a00012 which sort of worked.
This seems to cause some serious issues as after I try this, I get network error and have to restart everything, but then when I rebooted and tried again, it worked for a bit. As an aside, they have literally 80 pages of models with the last one being cmu_us_slt_arctic-wav-arctic_b0539 which sounds sort of good
The big problem is that this error that ends with Output channels > 65536 not supported by MPS Device. The first one is a lower voice and the second is a higher one. There are a series of active bugs here, but the option to enable CPU doesn’t seem to work. So frustrating, but it appears that pytorch has a problem with convolutions that are larger than 64K elements on Apple Silicon.
They use something called Fast Whisperer which is CUDA-enabled but not MPS so it is CPU-bound on the Mac. And, yes I’m trying to figure out how to make this work. So the speech-to-text is not GPU accelerated. This is a little confusing because the problem above seems to imply they are using MPs

OpenAI-Edge-TTS and OpenedAI

These are two open source models. Sadly the later is now obsolete, but OpenAI Edge TTS does seem to work and provides an OpenAI compatible Speech API. Since the TTS Settings when set to OpenAI allows you to edit the URL, then you can just use if you turn on the server. And most critically you can use an Edge TTS voices, so this can get you to non-English versions like de-DE-AmalaNeural for instance, so here are the settings for each

OpenAI	https://api.openai.com/v1	OpenAI Key
OpenAI Edge-TTS	http://localhost:5050/v1	No key needed

Other Closed Source TTS Models: Deepgram

There are some new closed source like Deepgram with models noted in their main page like. They give you $200 in credits unlike ElevenLabs, but unfortunately you can use it for STT, so just for reference.

aura-asteria-en. American femail
aura-orpheus-en. American male
auro-angus-en. Irish male
aura-arcas-en. American masculine
aura-athena-en. British female
aura-helios-env

There speech to text models are base and nova-3

The Voice call and Microphone Buttons for Voice Assistant

Well, there are two other cool modes that this supports decently and there is a lot of work to make an interactive local assistant. And the documentation is currently blank on how to do text-to-speech 🙁 Note that as of 0.5.10, this interface has disappeared. There is no longer a microphone or a headphone icon.

You can click on the microphone, the main thing here is that the browser is doing this, so you have to wait a little bit before talking and then when done, you click on the stop icon. The STT (Speech to Text works pretty well though)
OpenedAI-Speech. This is another choice that I need to try as it provides a local OpenAI compatible endpoint so you can just use a local application (the same way that Connections provides this for models). It hasn’t been updated in five months though, so I wonder how usable it is. But it provides both tts-1 and tts-1-hd as well as all the voices. And it doesn’t seem to work reliably plus Coqui went bankrupt.
OPenai-edge-tts. This is another project that is similar but just uses edge-tts from Microsoft. You can run with docker or it is just a Python project that you can uv run easily.
The Voice Call is pretty cool. If you have something like a Vision model, you can get both voice and also images into the system. It is definitely not fast enough for typical conversations even on an M4 Max, but we are getting there!

Installation of ComfyUI and its Bizarre UI

If you are going all locally, the one thing that would be nice to have is local image generation. There are plenty of local image recognition models like llama3.2-vision:90b but to do this you have to get ComfyUI running. The setup instructions are a mess, but they are working on a desktop application like DiffusionBee (which is amazing by the way).

The setup is beyond byzantine though as it requires cloning a repo, but they do have a desktop application setup that seems to work now. There are some strange things about this application:

You can download the test application for ComfyUI Desktop.
It takes a good long time for it to compile and run, it takes about 30 seconds to boot up, so be patient
There are some sample templates, but it’s not obvious how to add new models. The guide says you have to spelunker around Civitai and then download, but you have to refresh the application to see these. If you use the flux.1 template, it isn’t compatible
At the upper right, there is a Manager button. yes, I don’t know why that is there, but it is apparently a separate application. Then you get a curated list of models to install, but it wants a checkpoint, so if you filter for that, you can try to install it. The one I use on DiffusionBee is Flux.1 [dev], so I like to give that a try. The Schnell one is the other one that is default downloaded.
After you load a model, you need to hit in the Manager and Refresh to see them in the Load Checkpoint box.
This thing has a huge graphical editor so figuring out the right way to lay things out isn’t easy. But they have some defaults.
The installation happens by default in ~/Documents/ComfyUI so if you have iCloud sync running make sure you have enough space.

They have something called Model Manager which is the icon at the upper right, but the settings are at the lower left. Go figure. The most important of the lower left is the Server Config because while the default port for Comfy is normally 8188, for the Comfy Desktop it is 8000. So beware.

Model Manager is your friend

This turns out to be the place where you can download new components and models. For instance, the default just saves PNG files in ~/Documents/ComfyUI/Outputs as plain PNGs. But you can download something called Image Saver which does more.

This is also the place where you can download additional checkpoints (Models really). And there are a sea of them

Image Names but no meta data

This is one thing that OpenAI and DiffusionBee does very well, it picks sensible names and includes the prompts as JSON, you can’t do this with Comfy without some extra work of course, so you can futz with the Node names to make it better and various configuration strings for the default Save Image.

So for instance, the syntax will give you this string, %node.widget% and you also get %date:yyy-MM-dd% to get the date . T While this doesn’t work well, if you hook it to OpenWebUI then you get very sensible names that have the prompt in them without work

# note the string can includes slashes to create directories
%node_name.widget_name%

Save and Open the Image by Right-clicking

You don’t have a save image instead, you go to the Image output and then right click and you can Open and Save an image. Took me some real time to figure that out.

Sensible Image names and how to Turn fields into Inputs

OK, so one of the things that Comfy does (as all graphical tools do) is to make simple things hard. For instance, if you want to add the metadata like steps model and prompts into a JPEG, you have to do the following:

Go to Model Manager and download one of the many Image Savers. With a project like this there are of course a bunch of these that are all forked from each other or abandoned. In looking at the choice so SaveImage With Meta Data, Comfy Image Saver (not updated in two years) and Comfy UI Image Saver are all the same thing. I finally picked the last one because it was updated last month and has more stars that the other one.
Once you choose download Custom Node, you have to reboot and now in the graph, if you right click you can add it. Of course it’s a mystery which of the hierarchical menus it is, but these normally just add their own set.
Now here is the really confusing part. I was expecting that the meta data would be nicely formatted if you choose JPEg (note if you use the default PNG), there are no EXIF fields so nothing happens. No, instead what happens is all the data is dumped into a single comment field! And it is a gigantic long string. Sigh
Now you have a choice, you can manually make sure the entires like width and height match your inputs or you can go to the very laborious work of abstracting these data points.
There are some strange things like the Scheduler needs the Comfy version to connect and some names do not connect probably because of data times for Scheduler and Sample Selector.
Also for the negative and positive prompts you need to insert a String Literal and convert the text to an Input. Sigh.

Abstracting Datapoints for logging with Turn Widget into Input

This is really mysterious, but the various properties like height and width are called widgets and they live inside their various nodes. However if you right-click on a node, you will see an entry called “Turn Widget into Input”. What this does is remove the variable and it sticks an input node.

Now you right click and add a node from Image Saver which is something like “Width Selector”, you put in 512 and then you have to drag an edge to both the image sizer and the final output. You end up with massive spaghetti as a result. But it works.

Sigh. In some place, if you go to Node > Image Saver > utils, you will find replacements, so for instance they have an Load Checkpoint which already has an output for the Checkpoint name and you can replace it.

Hooking Comfy to OpenWebUI

Ok, here is where the fun begins, you can’t pick different workflows with OpenWebUI, you are only allowed a single one.

The first step is to get a working workflow running in Comfy UI. Which is no mean feat. The main thing we tried was:

Use the Model Manager to load up Flux.1 Dev at the upper left
Then open the Browse Templates and choose Image Generation and pick “Flux Schnell”
Replace the Schnell with the Dev and generate some images

Now that you’ve debugged the workflow, you are ready to load it into OpenWebUI

Go to Settings (on the Desktop Comfy this is complicated). But go to the settings at the lower left and you have to turn on Dev mode
Then you go to Export (API) will give you a JSON file that is the Comfy Workflow
Now go to OpenWebUI and you can add it there. And this is where it gets mysterious
You will have to know the localhost port which is normally 8188 but is set to 8000 for the Comfy Desktop. If you click on the icon next to it it will verify that Comfy is running
The confusing part is that OpenWebUI gives you this crazy table that you need to fill out, it wants to know the location of the prompt, you have to fill in the name of the field and then the node where it lives. OpenWebUI doesn’t parse the JSON so you have to fill it out yourself by reading the JSON

The documentation talks about needing to set the ComfyUI Workflow Nodes which is pretty confusing, so here is the default table of inputs from Open WebUI and the names, so for instance, if you look through the JSON, you should see at some point the prompt text, then look at the number of the node, so if the text, they you can see inputs and the field “text” holds the prompt. OpenWebUI tries to choose reasonable defaults for the field names, so it fills these in, but has no idea where the actual nodes so from this fragment you can see the node numbers:

{
  "6": {
    "inputs": {
      "text": "technology warriors in a counciln",
      "clip": [
        "30",
        1
      ]
    },
...
  "30": {
    "inputs": {
      "ckpt_name": "FLUX1/flux1-dev-fp8.safetensors"
    },
...
 "27": {
    "inputs": {
      "width": 512,
      "height": 512,
      "batch_size": 1
    },

OpenWebUI	Comfy Field Name	Comfy Node
Prompt	text	6
Model	ckpt_name	30
Width	width	27
Height	height	27
Steps	steps	31
Seed	seed	31

Once you have the nodes set (and also make sure the Comfy Field Names are correct, because some workflows can change these.

Now set these parameters

Once you’ve done the node mapping

Set Default Model take the ckpt_name in ComfyUI and copy it to the Set Default Model field, set it, so for instance “flux1-schnell-fp8.safetensors”
Set Image Size note that it wants the notation in “512×512” format. And also for things like Flux1, they are trained on 512 by 512 images so this will be much faster by 2-3x than say setting it at HD like 1920×1080
Set Steps. The Schnell Flux.1 model as the name implies is fast and converges as few as 4 steps. I set it to 8-16 depending on how certain I am about the prompt. More steps take longer. 8 Steps on an M4 Max takes about 4

Using the Text to Image Generation and increasing the num_keep remember

It doesn’t happen automatically, but you should see a small picture icon after every response from the LLM, just click on that and an image should come out taking the LLM output and generating an image, so the workflow is:

Find a nice LLM and tell it “Make an image prompt and add your things”. This will generate a nice long output
Note that you can actually Edit any output easiy just by clicking on the pencil icon at the bottom of any prompt. Fix the prompt and then..
Click on the image icon and out it comes!

You can tell if it is working by going back to the Comfy Desktop and on the left, you should see image generation in progress!

The way to make this work is that you need to increase the num_keep parameter either in the vision model itself or the individual chat. Each chat has its own set of parameters now. The default is just 24. That means it will only remember the last 24 tokens. I reset this to 8196 so that for image generation it remembers

Image and Text to Image

Well, there is a nice workflow with Janus Pro from Deepseek that does this entirely in Comfy. You just add an image and then add prompts and it will generate both a text description and an image. To do this completely in Open WebUI:

Upload an image
Load a vision model like LLama3.2-vision:7B. Note that if the image is large like 12MP, then it can take 50GB or more of memory to run it, so most of the time you can’t use a big model like the 90B version and have an image at the same time. At this point, Ollama starts to thrash and throws some of the work to the CPU
This will generate a description of the image
Now typically you will want to mute the image, so make sure to set the keep_num high so it remembers all the tokens that are generated and just adds a prompt like change the image to winter time make an image generation description
It should then generate something appropriate for image generation. If you click on the image generation button below, it will run it and you will see the mutated image.

Image Recognition and Voice Calling

To make image recognition work, you need a model that understands images, the two that I have are llama3.2-visual:90b and llama3.2-vision:11b

You can add the images by clicking on the plus icon and then doing an upload file and then saying “Describe fashion” and it will say what it sees. You can then hook it up to image generation by modifying the prompt, “like saying Make it summer” and then using image gen on the new output.

You can also start voice calling and you will see that you can also just have a camera image of you and you can ask it to describe what it sees. That’s pretty cool.

Comfy Video generation

Yes you can do this too. Just set up Hunyuan and you can track progress only in Comfy itself by watching the console output.

More Add-ons: Tools, Functions and Pipelines

In addition to these easy customizations, there are sea of Models, Tools, Functions, and PIpelines at openwebui.com. I tried some and found that some of them really mess up the system like it won’t boot, so be careful what you load and always backup everything

Do not understand this yet, but it is supposed to allow you to run Code
in the system. The easiest thing to do is just to use the Model system where
many providers use the OpenAI API and you can plug in there, but here is their nomenclature:

Pipelines. For most of them, you can just use Connections if it is just an API plugin, and Functions if you just need something simple. Pipelines and more complex and technically they just look like an OpenAI API endpoint. There is a bit of a trick to get it to work, you have to install this as an OpenAI-compatible end point with default http://localhost:9998 and then set the key to the magical 0p3n-w3bu! or set PIPELINES_API_KEY to that. Then you have to find some pipelines to load which can be done via GitHub
Tools. These run externally, so they don’t see any of the environment inside OpenWebUI.
Functions. These run “in context” so they can manipulate and act on the different things inside Open WebUI, so they can do things like display and work. You have to manually assign Functions to Models in the Workspace > Model section. They can do things like pre-process data as an Inlet Function or post-process with an Outlet Function. We used the Functions for Google Gemini and Anthropic. But be careful loading these, they are inside the system, and when loading a bunch, I managed to crash Open WebUI
Pipes. They can be Pipe which means that it looks like a single Model.
Manifold is a collection of Models.
Valves are user-configurable data (I know right taking the Pipe metaphor pretty far)

User Valves are things that anyone can set and are just variables, but they like this valve and pipe analogy.

Working Pipelines: Only 5

OK I painfully went through all the example pipelines I could find and tried each. The process is that you. goto the Admin Settings > Pipeline and then paste in the Github URL for each and then see if a Valve setting appears. These are the only ones that seem to work as of v0.5.20. I don’t use Azure or CloudFlare AI, so I don’t know if these are really needed or if they have OpenAI Compatible APIs.

The one interesting pipeline runs MLX (see below), but this doesn’t seem to work anymore (https://github.com/open-webui/pipelines/blob/main/examples/pipelines/providers/mlx_manifold_pipeline.py)

https://github.com/open-webui/pipelines/blob/main/examples/pipelines/providers/azure_deepseek_r1_pipeline.py	This let’s you access the Azure version of Deepseek
https://github.com/open-webui/pipelines/blob/main/examples/pipelines/providers/azure_openai_manifold_pipeline.py	Azure version of OpenAI
https://github.com/open-webui/pipelines/blob/main/examples/pipelines/providers/azure_openai_pipeline.py	I don’t really understand the difference between these two
https://github.com/open-webui/pipelines/blob/main/examples/pipelines/providers/cloudflare_ai_pipeline.py	I didn’t every know that CloudFlare has an AI set of models 🙂
https://github.com/open-webui/pipelines/blob/main/examples/pipelines/providers/litellm_manifold_pipeline.py	LiteLLM is another provider that runs on the client. I’ve loaded it but not sure why you would run it.

Running the Weather Tool with Qwen2.5 or any of the Tool models

OK, here are the ways that this works, its a little complicated:

First, go to ollama.com and click on tool entry and you can see models that support tool calling. These are models like Llama 3.3, and Qwen 2.5. Notably, the newer Deepseek models do not support tool-calling
Now go to Admin Settings > Functions > Discover Functions. Then you to to Open WebUI Community and have to inspect the code and put the address of the system, typically localhost:8080 and then import it.
Enable the tools in Settings > Admin Settings > Models > Qwen2.5:32b > Functions and click on
Now run the model and you have to ask it just so that it calls the tool. The Run Code and Weather functions work but see below
The main issue is that many of the functions like Visualize Data do not support the new interface and you will get the error No module named open_webui.apps.webui. But things like Weather, Stock, do work
Note that the Tools do not appear in the Function area even though you add them there, but these tools will appear as buttons that are above each query box and are set somewhere totally different in Workspace > Functions
Some of these Tools require parameters which are called Valves, so make sure you fill these out. and you have to set a UserValve, but I can’t figure out in 0.5.7 how to set them as no User Values show, so the Finnair report doesn’t work at all. And I couldn’t get the Yahoo one working either, it says it can’t do it and then search just spins.

Creating Custom Models with Ollama and Open WebUI

Ollama has a nice feature that is a lot like Dockerfiles called Modelfile which lets you create your system prompt that you can see with ollama show --model-file llama3.3:70b and you can “call” it too and the syntax uses Jenga templates:

From ollama3.3
Template """
{{- if .System }}
{{ .System }}

Now confusingly, Open WebUI has its own definition of a Modelfile which is completely different and you have to create it strangely but it is much more flexible because you can attach RAG and tools to it.

Goto Settings > Admin Settings > Models then at the upper right click on the “Manage Models”
The goto “Create Model” and you have to add the model name and then you have to type in some minimal JSON to bootstrap it like `{“model”: “my-model”, “from”: “llama3.3”} and click on the upload icon and it will save it. Note that the From is hard-coded, so you can’t switch or edit this. If you make a mistake then you have to find and delete the model
Now when you search for that name, you will get a complete pane that will help you add all that you need.

Writing a Custom Front-end to Open-WebUI backend

The Open WebUI backend is awesome and you can see all the things you can do with it if you are local development mode by looking at the Swagger documentation at http://localhost:8080/docs but one thing they ask about is the API Key. You have to generate this for a specific user by logging in and going to Settings > Admin > API keys > API Key to get an autogenerated one.

Note that this is dynamically set, so if you want a uniform system across lots of development machines, you either have to be very careful and make sure you keep copying from a webui.db or you can hack into the SQLLite database to set it.

Here is where we get into the internals of OpenWebUI. The code is structured as a SvelteKit frontend that calls a Python backend (in ./src and ./backend/open-webui respectively). The main thing to understand is that the Python backend spends alot of time generating calls or accessing through SQLAlchemy a SQL Lite database webui.db and that the code to manipulate the database is in in ./backend/open-webui/models which refers to data models (as in model-view-controller) and not LLM models. So for instance users.py defines the user database. SQLAlchemy is cool in that if you create the model that way

Side Note: Using UV run and uvx and scripts and asdf and direnv interactions

There are so many of these things that require that you clone and then pip install something. A the very least you should just use uv for this so at least you get an environment. We will post more on how to do this, but the main point is that there are that there are three ways to install Python now with uv:

From a python source repo. You create a pyproject.toml with uv init and then you add dependencies with uv add pytorch etc. This creates a pyproject.toml and you can create a virtual environment with uv venv and now when you go to that directory, you can source .venv/bin/activate and it will start the venv for that directory and then deactivate gets you out. Alternatively, you just insert uv run in front of everything and it starts it up. This is required since sourcing is not something that direnv can do since it is run as a subshell so you only get export from it, but you can do even better see below
If you are a lover of asdf and asdf-direnv, you can automate this with .envrc where you insert a layout python and it should pick It up then there is another scheme where you do not have to do this manual activation like pipenv does (but is too slow). Just add layout python to your .envrc. Note that unlike , uv venv this creates a virtual environment along that is python version specific. So it is more general that uv venv you can have more than one version so if you have conflict you have to be clever about what python modules you use. You also need to modify your PS to pickup the VIRTUAL_ENV that is created so you have some idea in the command line what is happening but note that the zsh power line does this automatically and power line BASH as well. Note that if you use asdf and direnv, then you will never use the system python or others while inside your $HOME directory. Note you can use uv directly as well with you just need to add some code to your $XDG_CONFIG/direnv/direnvrc so you can use layout uv in your .envrc and it all works.
Python script with uv.You can have a single script with all its dependencies in a doctoring. So take a single file script and run uv run script.py which works if it has no dependencies or just the standard library ones or if you run uv init --script script.py --python 3.12 if will inject the right doctoring, so uv run script.py creates a venv and just works.
Pip package with cli aka tools. So if you have python pip package properly and it has command line entry points, then you can do a uvx ruff. This works beca8use python packages have entry points as part of their packaging, so uvx --from httpie http works when the entry point is different from the package name. And you can even as for extras with uvx --from mypy[faster-cache] mypy --xml-report report works which is really nice. This works because when you package things, you get entry point specifications. You can create this with uv build and then you can uv publish it. The basic idea is that command line tools can be packaged in your pyproject.toml using typer which is the. new hotness for CLI applications (based on FastAPI for Web applications):

# in ./src/greetings/cli.py
import typer
from .greet import greet # the app lives ./src/greetings/greet.py
app = typer.Typer()
app.command()(greet)
if __name__ == "__main__":
  app()
# make sure there is a null __init__.py
# this supports python -m greetings
# __main__.py
if __name__ == "__main__":
  from greetings.cli import app
  app()
# assumes the cli is in ./src/greetings/cli called a app()
[project.scripts]
greet = "greetings.cli:app"

Side note the frustrations of Unifi Threat blocking

They’ve moved the threat protection logs yet again and buried them, so if you do a git push and it hangs, then go to unifi.ui.com > _Your Controller_ > Network > Insights (the bulb on the left). Then pick Flows from the dropdown and click on Threats

The second bizarre thing is that you can click on things like “Allow Threat Signature”, but if you accidentally click on Block this IP then there is no way to toggle it.

Instead you have to go to the Firewall rules in Network > Settings > Security > Traffic & Firewall Rules and then search for the IP you just blocked

Then you have to scroll all the way down this huge table to find a Manage button. Why this is not at the top is a mystery, then it creates a checkbox and then scroll all the way down again to Remove.