Well, Open WebUI is an incredible project that provides a nice graphical front end for local AI development, but the documentation is really lacking. You can figure some things out by looking the documentation, but mainly it is trial and error.
So here are my notes for using the various features that are on the edge as most people are doing docker development on Windows machines. with CUDA, this is targeted at getting it running on Apple Silicon.
Installation with pipx or uvx or Tauri
Sadly, uvx installation doesn’t work. It is missing install, but you can do a simple, brew install ollama && pipx install open-webui --python 3.11
and it will work. What this does is use brew to install ollama and then pipx is this incredibly wonderful thing that creates a python virtual environment (venv) and then adds it to a local path really useful
The thing on a Mac is that you want to avoid using docker because then you have to split your memory into a docker controlled component and a system one. And you will never get that right.
You can also use brew install uv && uvx open-webui
to get to the same place, but I kind of like pipx install
because it does the command line munging so you can just do a open-webui serve
without needed to know about uvx.
Finally there is work going on to use Tauri to package open-webui as well, so it is just a DMG file as Open Webui Desktop, but I couldn’t get this to work.
This does mean that you have to go to great lengths to do pipx installation for the rest and have them running as separate small servers. That is way better and more memory efficient than using docker as you get natural splits between MacOS native applications (like FinalCutPro for instance or DiffusionBee) and your other tools
Going Totally Local
You probably want to go totally local since that’s the point of these experiments and there are and there are many parameters you should change. The main one is to load the best of the local models. We do have a script that does this, but basically from the command line note that this assumes different memory sizes and is probably already out of date:
# create a new window and start ollama
ollama serve &
# now do the pulls I like to pull the tagged parameters
# so it is easier to know what you are loading
ollama pull llama3.2:3b tulu3:8b
# if you have a 64GB machine
ollama pull qwq:32b qwen2.5:32b llama3.2-vision:11b
# if you have a 128GB machine
ollama pull llama3.3:70b tulu3:70b nemotron:70b llama3.2-vision:90b
Downloading other GGUF models
OK one confusing this is that even though Hugging Face has 1.2M models up there only a few can be downloaded by Open WebUI. This is because underneath, Ollama is just a wrapper around Llama.cpp only accepts GGUF files that can come from Ollama.com or from Huggingface.co (defaulting to Q4_K_M quantization, but that’s a whole other post), but on Huggingface itself if you click on the model drop down it will generate the proper pull request for you.
As an aside GGUF stands for Georgi Gerganov’s Universal Format. It’s a bit of a pain because the tool to do do this is in his llama.cpp library with a neat trick using docker to run this stuff
Dealing with HuggingFace models
Most of the rest of the Hugging Face models are in HF format, so you need to convert them to GGUF and as a community service, publish the conversion. Here we enter utilities hell which I’m going to cover in the next blog post about how to do these things, but this is a great example, basically, the major ways are like the Ghost of Christmas past, present and future:
- Past. Pray they have a Homebrew package, but if not it’s painful. Git clone a repo, install naked pip requirements into the system and pray. This is illustrated the last command box. The main issues here are that first you are cloning a bunch of stuff you don’t need and have to maintain and also that you have to remember where these things are in README.md or something and then you are doing a pip install into the system environment and who knows what version of python you are using just so you can run a single script
convert-hf-to-gguf.py
and then manually make Modelfile that explains how the inputs work. Sigh. - Present. The current hotness is stuffing everything inside a docker container and then you get this amazingly complicated command line and you have to understand that there is an internal file system and an external one. And of course with Docker, you have to allocate separate space for it and you get these huge Docker containers with a full operating system in them basically (duplicated on a Mac) just to get a few lines of python running. Then there are online looks like gguf-my-repo that are on Hugging Face (more on the future of tools in another post), but of course there is way to programmatically do this.
- Future. The emergence of npx, uvx, pipx, condax and pkgx that is the family of executables that create an even lighter weight environments that are language specific. Instead of a big docker virtual machine on Apple Silicon, you end up with just enough to run the script which is usually a virtualized environment so multiple versions of Python. Or with tools like Dagger at least they high the containers behind a nice user interface (although of course the container I tried didn’t work. Of course with things like npc, uvx, the nice world of brew update gets replaced by an update from each tool, but it is very lightweight!
I can’t say really which is easiest, but since it’s one time, the HuggingFace running application is pretty great and it is free as long as you take less than 120 seconds of CPU time.
# get the hugging face tools
brew install huggingface-cli
# or if you are cool this is sort of super pipx
brew install pkgx
pkgx install huggingface-cli
# the format is _repo_/_model_
ORG=qwen
MODEL=qvq-72b-preview
# if the above fails you need a GGUF conversion
ollama pull "hf.co/$ORG/$MODEL"
MODEL_DIR="$HOME/wsn/data/models"
mkdir -p "$MODEL_DIR"
huggingface-cli download $REPO/$MODEL --local-dir "$MODEL_DIR/$MODEL" --include "*"
#Convert to GGUF sigh co kdf
docker run --rm -v "$MODEL_DIR/$MODEL":/repo ghcr.io/ggerganov/llama.cpp:full --convert "/repo" --outtype f32 --outfile /repo/$MODEL.gguf
# This creates a file fp32 file in $MODEL_DIR/$MODEL.gguf
#Quantize from F32.gguf to Q4_K_M
docker run --rm -v "$MODEL_DIR":/repo ghcr.io/ggerganov/llama.cpp:full --quantize "/repo/$MODEL.gguf" "/repo/$MODEL.Q4_K_M.gguf" "Q4_K_M"
# Or the old way
git clone git@github.com:ggerganov/llama.cpp
cd llama.cp
# yes see the previous posts about uv
uv pip install -r requirements.txt
# now run the conversion
./convert-hf-to-gguf.py $MODEL_DIR/$MODEL --outfile $MODEL_DIR/$MODEL.gguf --output q8_0
# Now you have to creat the Modelfile to match this
# Sigh this more complicated than it looks because the
# meta data on how system prompts and user prompts
# work is not in the hugging face file itself.
How it all works and logging
Because it is pretty confusing works, but here it is. Note that the easiest way to see the logs is to run all these processes I need a separate terminal window so you can see what is coming out of standard output. Here is how you know each process is working:
- open-webui. It will end with uvicorn started
- ngrok. This will end with a message saying look at http://127.0.0.1:4040
- ollama. Will end with a message telling you how much compute RAM it has (96GB on a 128GB M4 Max by the way)
- tika. This will end with a message says started at http://localhost:9998
Backup the Setting Database
This is a real pain if you have to delete over and over, so I just use chezmoi to capture their SQL database. They don’t use INI files, but have an Alembic database that keeps track of the many parameters.
You can go to Lower Left > Admin Settings > Database > Export config to JSON
note that the API keys are here in plain text, so do not check this in, you should store it someplace like 1Password if you need it.
Resetting the User Database and Backups
I found that I was locked out of this, the simplest thing to do is to delete all the files where the configurations are kept. You can also send an environment variable to do this:
RESET_DATABASE=1 open-webui serve &
If you are doing a pipx installation, the actual location of the webui.db is really buried because it is in the ./data directory of the working venv, this is dependent on the python tha you are using, but with pipx it will be in a strange directory buried in pipx:
$HOME/.local/pipx/venvs/open---webui/lib/python3.12/site-packages/open_webui/data
# here are the interesting files
webui.db # the alembic database
uploads # the files you have uploaded
vectordb # where your RAG information is stored
cache/audio # .wav and transcripts
cache/image/generations # where .pngs live
If you do have a problem with the. user database. Which is did, you can also reset the configuration with RESET__CONFIG_ON_START=1. They talk about a config.json, but I can’t find it anywhere.
Backup of config.json, webui.db and chats
You want to pretty frequently do backups because on each version change, you can lose your configuration and also there are bugs in the system so you can bork your configuration. I try to do this every day or so.
- Settings > Admin Settings > Database and do both an export config.js which has all your API keys
- Export Database which gives you
webui.db
which has more configurations - Export Chats because you will lose those.
- These all have API keys and things, so you can put the chats in a repo, but I would put the config.json and webui.db into 1Password or someplace secure like iCloud Drive. Definitely not in a repo.
The many configuration settings
One note is that while Open WebUI takes in many environment variables, those marked PersistentConfig are only read once and then disappear into the webui.db and you can only change them in the Open WebUI interface. They have this idea of an OPEN_API_KEY for instance but this disappears into the database.
If you care about your settings, you can go to Lower Left > Admin Settings > Database > Export Config to JSON files
and save it.
Multiple OpenAI Compatible API points
While you can use functions and pipelines, so many models are available from an OpenAI compatible interface it makes sense just to have them all here. I can’t find a way to load them programmatically though, so you have redo this everytime you setup a system in Lower Left > Admin Settings > Settings > Connections
Here’s a list that I use. Since Open WebUI doesn’t tell you where the models live, you have to intuit this by different rules for model names go to Lower Left > Admin Settings > Settings > Models
you get more meta data.
They don’t tell you who the provider is, but in the all important Capabilities
at the bottom you can see if it supports Vision
or Citations
.
Also you can’t tell which provider you are using from the user interface, but the syntax of the model names are subtly different so that’s a hint. See the third column, but if you see a link icon on the right it is hosted in the cloud while if it has a number then it is local. The local model syntax is easy just look for a number after the model (which is its size), it has the form hf.co/org/rep
if it downloads from there or whatever random name they pick on ollama.com, but it is typically lower kebab case with the version with a dash, so qwen2.5:72b
or llama3.2-vision:90b
. The net is that only Cerebras can really confuse you if you follow this decoder ring:
URL | Comment | Model Syntax |
---|---|---|
https://api.openai.com/v1 | They have lots of old and date versioned models | They use lower kebab case like gpt-4o-audio-preview-2024-10-01 . Note that they never do gpt4 , it is gpt-4 |
https://api.groq.com/openai/v1 | Very high speed, not as fast as cerebras but way more variety. They have lots of old models | Names are lower case in provider/model[-]version syntax: llama-3.2-90b-vision-preview llama3-8b-8192 llama-3.3-70b-versatile |
https://api.deepseek.com/v1 | deepseek-chat is V3 and the pricing is incredibly low, so use it! | There are people copying them, but the true models are just deepseek-chat and deepseek coder |
https://openrouter.ai/api/v1 | This is the most confusing because they route to every other provider like OpenAI, Amazon, Google, and many small ones. The tag (free) means no charge, so go for that. They don’t host anything themselves FYI, so Look for open source ones | Their syntax is Initial Case with Provider: Model. So they are easily confused with the real Google models. For instance, Google: Gemini 2.0 Flash looks the same as Googles base offering. They has a sea of models like Deepseek V3 that a similarly passthrough. Free ones look like Meta: Llama3.2 90B Vision Instruct (free) |
https://mistral.ai/v1 | Proprietary models from the French 🙂 search for models ending in stral | The model names are in kebab case like codestral-mamba-latest |
https://api.cerebras.ai/v1 | Accelerate models that are supposed to faster than Groq | The model names are confusing, but the also use lower kebab case like llama3.1-8b, llama3.1-70b and confusingly llama-3.3-70b so a dash in the version |
https://api.totalgpt.ai | Infermatic.ai. This came up as an alternative to OpenRouter.ai, but it is expensive so not using it and don’t want to pay $15/month. They do use vLLM underneath | |
http://localhost:4000 | They used to support LiteLLM, but don’t anymore. So you can run LiteLLM separately if you want. LiteLLM which shims 100+ LLMs with the OpenAI call format, but then you need another component that is an LLM proxy with pipx install 'litellm[proxy]' and then LiteLLM --model huggingface/bigcode/starcoder puts a proxy at port 4000 |
Functions: Getting Anthropic and Google Running
There are some models which are not OpenAI compatible, so you need to find the right Functions to use it. To do this, go to the Lower Left > Admin Settings > Functions
and look for the Discover a Function
at the bottom, these are the two I enable. You do need a login to Openwebui.com which is different from the localhost. This page is a complete mess of different things from functions to prompts to other stuff, but the best view I think is to go to Models > Functions
which will show you the most used functions. Note that the website is really slow, it looks like it is doing dynamic generation so it can take seconds to click from one page to another.
What is a function, well basically it is a way to do a single call out from OpenWebUI (this is compared with Pipelines which allow multiple stages and it is more separated).
Once you load them they have this concept of a Valve which is really just a variable, so once you load. So what you do once you find a function is to click on the GET
button and then it will ask if you want to Import to WebUI
and you need to type the URL of your Open WebUI localhost (this is usually http://localhost:8080
. This will copy it into your local host and then you choose Save to stick into your environment. Not super clean but it works.
Note that most of these functions are not documented so it is hard to know what depends on what, so you can get quite a few errors, like Visualize for instance requires OpenAI.
Also note that in the Settings > Models
section, you can tell it comes from a function because a anthropic.claude3.
5 appears or google_genai.gemini-1.0-pro-latest
Function Name | Settings | Model Syntax |
---|---|---|
Anthropic | Here is where you get Claude | The names are lower kebob case with version and type like this: anthropic/claude-3.5-sonnet |
Google GenAI | Note that the GOOGLE_API_KEY import doesn’t work, you will need to add to the manually. | The Models are Initial Caps with a colon and then name like Google: Gemini 2.0 Flash Thinking Experimental |
Using Remote Ollama, Ngrok and OpenWebUI
You can do this in the Ollama API list, so for instance, if you have another MacBook with Ollama running, you get to it with http://richs-macbook-pro-2021.local:11434
as an example works and if you use ngrok, then you can actually go this remotely.
Pretty handy for a quick way to get a departmental server
One of the nice things about OpenWebUI is. that it just calls APIs, in this case Ollama is the default. You can also run. Ollama remotely and then you just start it with OLLAMA__HOST=0.0.0.0 ollama serve
and then. it will serve anything on the Internet. This is a little dangerous of course but convenient
Then you just go to Admin Settings > Connection > Ollama Host
and add the domain name something like http://richs-macbook-pro-2021.local:11484
and it will serve from there. Very nice for departmental setups. Get a Mac mini M4 Pro and serve. your entire workgroup.
Ngrok which does authentication and has a little server on the host machine is another answer. The only problem is that Ngrok generates an AVT Anti-virus error since it is used in many hacks. What you can do is to create an account on ngrok.com and then create an ngrok server with ngrok http --url _your static domain_ --oauth google --oauth--allow-domain __your domain__
which should protect you.
The setup here is a little more complicated:
- You have to logon to ngrok.com and get an account
- Then
brew install ngrok
- Note that many anti-virus programs mark ngrok as a bad program because it is commonly used in hacks. You need to go to your Antivirus and exclude the executable which should be in somewhere like
/opt/homebrew/Caskroom/ngrok/<version>/ngrok
- Now you need to authenticate with your
ngrok config add-authtoken you get this from their console
- Then you can run ngrok remote the port 8080 of the Ollama server with
ngrok http --url _the static domain_ 8080 --oauth-google --oauth-allow domain=tongfamily.com
which says remote port 8080 and protect it with google authentication and only allow accounts from tongfamily.com
So basically at this point you are using Open WebUI locally from its point of view and the bugs. with web sockets are not an issue.
Enabling RAG Documents and Web Retrieval
OK, now in Lower Left > Admin Settings > Settings > Documents
are about a million configuration settings that enable RAG, the big ones are the embedding mode.
The basic idea is pretty simple, you can use the #
notation or choose upload file and it adds it to the chat RAG area. There are two ways to do this.
- First, you upload your local documents in the Workspace > Knowledge section. Note that the documentation is actually very out of date here. Knowledge is basically a folder system, so you can turn on different pieces. This allows you to upload directories and sync them, so it’s a nice way to have say a repo with your documents and then. you can sync. Then when you enter
#
in a chat, it will show you all the available documents you can load. Then it will RAQ the data and the LLM can. use that data. You can RAG a single file or you can RAG an entire Collection. It automatically can add Citations if the LLM. you chose supports it. Tulu3:8B for instance works well. - You can also temporarily load files by choosing the Upload option in any chat. But it is nice to have the documents already there.
- You can enable Google Drive by setting
GOOGLE_DRIVE_API_KEY
andGOOGLE_DRIVE_CLIENT_ID
and it will be available. Go to the Google Console and Enable the Google Drive API need for Web Apps, then you create an API key and make sure to Edit the API key to restrict it to just Google Drive. Then. you need a Drive Client Id as well, but there are no specific instructions for this :—=( - You can do a download of Web source as well with
#https://tongfamily.com
but this is nearly useless given all the gunk that is in a typical website, they don’t really tell you how to fix this, but there is a huge Web Search section. It will actually show you the document that it pulled, when you hit enter and you can click on the document itself to see what is there
How to tell if your download is working, look at the console output
The way that you can tell. if it works is to go to the console and see if the open__webui.env is downloading things when you hit enter and you should see the model getting loaded
How RAG Embedding actually works
They really don’t tell you want is going on here, but the RAG system uses a completely different method of dealing with models that the core Chat system and it is not well documented, but here is what happens:
- Unlike Ollama, OpenWebUI RAG supports the based HuggingFace models, so you don’t need to do any conversion. That’s the good news.
- The bad news is that on Apple Silicon at least it looks like these models *do not* use the Neural Engine hardware so are really slow
- Second is that the hugging face cli caches all the models it has in. Note that to set this all up, create a HF_TOKEN and use 1Password to retrieve it in .bash_profile or .zshrc.
- The cache can get really big, it ate me out of lots of disk space and lives in
~/.cache/huggingface/hub
so you might want to symlink to your backing storage if it is too big.
The next is that default models are very small at less than 1B so you don’t really see a performance hit and they use the SentenceTransformers library of HuggingFace. The other options are:
- SentenceTransformers. The default. They download directly from hugging face, so the syntax to get a new model is
org/repo
so for instance Nvidia/NV-Embed-V2 is valid. They don’t really tell you what the syntax in the Embedding model line is as aside, so that is it - OpenAI. You can use theirs, which we avoid since we want this to be all local
- Ollama. They do allow you to use Ollama for the models as well and the syntax here is just the name of the model. Note that in ollama.com, you can search just for embedding models and some valid names are
nomic-embed-txt
orbge-large
Hybrid Search separate Embedding from Reranking
Options to improve RAG, you can select Hybrid Search this means that there is a separate model to generate embeddings and to decide which document chunks to fetch and then a much slower but more accurate reranker that takes the bucket of chunks and thinks more about which ones to pick
Note that the reranker doesn’t seem to have a Ollama option, so you are stuck in CPU mode if you use this
As a refresher, there are two parts of RAG, first is the embedding model which converts every word into a multidimensional token. The. idea is that the more dimensions, the more you can find similarities. The best ones have 5,000 dimensions and the job is to find a list of documents that look similar. The idea is to quickly retrieve a lot of documents and then the reranker works slowly to figure out what is the most relevant.
The reranker also know as a cross-encoder takes the query and a document and give s. you a similarity score. You use it to figure out which documents are most relevant. The Top K means you pick the top 3 (if K=3) of these.
Picking RAG models not all of which work
There are a series of models starting with the recommended ones and also looking at the mteb/leaderboard on huggingface and I went through to figure out out what is working and what is not. The way to know if it works is not that obvious, you either watch the console output or when you click on the download, but the success message is misleading you have to wait to see if it save “Embedding Model Set to…” nothing may happen that is there could be an error and you will not know. The testing is laborious, you have to reload a corpus and then see if you get reasonable output when you run the RAG with the pound sign to add a document:
- ✅sentence-transformers/all-MiniLM-L6-v2 which seemed to work and the performance is documented at Sbert.net and is the default but is not particularly high performing
- ✅ sentence-transformers/all-mpnet-base-v2. Has the highest performance by a small fraction and it does seem to load OK.
- https://huggingface.co/BAAI/bge-large-en-v1.5 aka https://ollama.com/library/bge-large scores 64.23% so distinctly lower as a 335M parameter model, but a good choice since it is GPU accelerate. The other ollama models are not treated here
- ❌ https://huggingface.co/nvidia/NV-Embed-v2. 8B parameters 72.3 score. Note that on model download, I got a timeout error, but this is not exposed in the User Interface it just returns and it looks like the model is loaded. It’s getting a no response returned from Hugging face?
- ❌ https://huggingface.co/infgrad/jasper_en_vision_language_v1). It is not rejected by huggingface ‘NoneType’ object has no attribute ‘encode’. Should some message about model load failure get surfaced in the UI rather than looking at logs
- ❌ https://huggingface.co/dunzhang/stella_en_1.5B_v5. 1.5B parameters It says no model found with this name and something about no periods allow in the name. 71.2 score. Again no error message and this fails
- 🟡 https://huggingface.co/BAAI/bge-en-icl. This is a heavyweight 7B parameter model need 26GB of storage and it definitely jams Neural Engine on Apple Silicon. 71.7 score and very slow, takes a minute to process 120KB. This appears to work properly and I can see vectors being returned. The tulu3:8b model does generate a full response and doe snot cite properly. Llama3.2:3b seems to work fine but doesn’t cite.
For rerankers, the mteb/leaderboard ranks, net, net it probably makes sense to pick Stella or bge-large here
- baai/bge-reranker-v2-m3. This is also the default in the user interface itself
- Alibaba-NLP/gte-Qwen2-7B-instruct. This is the performance leader but very heavy which is a big price to pay with no GPU. at 7.6B parameters scores 61.4% and is heavy
- dunzhang/stella_en_1.5B_v5 comes in again as the best reranker at 61.2% and this seems to work for ranking but not embedding.
- https://huggingface.co/BAAI/bge-large-en-v1.5 aka https://ollama.com/library/bge-large scores 60% on reranking, however the Ollama support seemed very unreliable. Works with Knowledge management but does not work with upload. And once it fails with an “NoneType” error the whole thing needs to reboot.
The conclusion is stick with the defaults if it’s working OK. If you want simplicity and don’t care about CPU only then just use stella_en_1.5B_v5.
If you want some more speed, then for the embedding use gbe-large and then Stella for the reranker so faster but less accuracy. I should probably just make a GGUF version of Stella, but that’s another post.
The speed difference is pretty dramatic. For 24K tokens, the creation took less than 20 seconds. The Stella running on the CPU takes about 5 seconds to analyze 7 documents.
The one bug I found is that Ollama does not work properly when you are doing an upload document, it returns a 400, but works ok when you are using Workspace > Knowledge. Sigh, but Stella is lightening fast on a CPU
Using Ollama Encoders stability issues
The main reason you probably don’t want to do this is that Open WebUI stuff is all CPU based. I can tell because it’s pretty slow and the GPUs are not being used at all on stats. The cpu based embedding is slow, but the reranker running on the CPU is basically unusable. It is using the GPU which is nice
You can also change to Ollama itself instead of the included embedding model but there are no guides, but looking the ollama.com site and seeing what is newest and most popular, the ones to try are. The main thing is to make sure that you have pulled these models first. Don’t be like me, otherwise you will get all kinds of upload errors! But the upload works very fast and you can see the GPU running, but the decode doesn’t work at all and it seems like the RAG does break as everything just emits a single letter so I’m not sure about stability
- bge-large. This looks like a good candidate as BGE comes up a lot
- bge-m3. Hard to tell what this is but it popular, m3 stands for multi function, multi lingual and multigranula. This model doesn’t really work and returns a single
T
which not the wonderful although you can see the GPU running - nomic-embed-text. This is the oldest and has by far the most pulls
The slowness of OpenWebUI Sentence Transformers on the Mac and a fix in 0.5.4?
These are much bigger than the other default models, but that’s what a big computer is for and they only run once doing the embedding. The downloads for this take forever.
It turns out that this is because there is no Apple Silicon detection, the so called “mps” device for PyTorch. I just added a PR for this and it seem to work well. This is now fixed on 0.5.4 it says, but when I load 0.5.4, it still uses the CPU, so got to figure that out and a patch is in process.
Adding Knowledge
There are a few conveniences that they have which include Workspace > Knowledge that lets you add directories automatically and sync them. Note that if you change the embedding model, you need to reimport all the documents.
This stuff seems a bit buggy by the way I’m not sure what is going on, but you need to look at the backend terminal output to see if things are loading
Content Extraction with Tika
It’s not super clear how to get the names of these models and it also lets you set the Content Extraction to default or Tika, I’m not sure what this does. It also has a PDF extraction system as well that is not clear how it does it as that’s a hard problem.
The Content Extractor is either default or Apache Tika, they recommend using docker for this, but again I’m looking for native stuff, but you basically get it running on http://localhost:9998
and then watch out. This Tika thing is a super parser that knows thousands of data types. This this is just a Java program, so I’m wondering if there is a. pipx like thing that just installs it. And there is a brew install tika
I’ve not yet seen any gains doing this, but theoretically they have hundreds of plugins to read things, so I added it.
Changing Top K and Prompt
There are some other simple heuristics like changing the Top K to 10 from 3 so there is more context for the model. Also changing the prompt might help, but it is all black magic really.
Experiments
First I tried a bunch of things and ended up
- Sentence Transformers with all-MiniLM-L6-v2 without reranking for Tike content extraction. This seems to work fine
At some point I borked the whole setup and started to just get a single character. A reboot seems to have solved this
- Going fully GPU with Ollama and bge-large with alibaba-nlp/gte-Qwen2-7B-instruct and Tika content extraction. This only leaves the reranker as a CPU based system.
- tika to nvidia/nv-embed-v2 to alibaba-nlp/gtw-qwen2-7b-instruct. This works but is CPU based and didn’t seem to work
- Tried to concatenate an entire website, but this appears to cause the RAG to function so not clear how to test RAG vs just long context of you do not do the steps below…
Large context model slow to load
If your data set is small, you might just skip all this and just insert the entire file set into the system and see how it behaves. It turns out this is harder than you think because the default context windows is just 2048 tokens and this is silently truncating user inputs. Also a number of optimizations are not taken, you have to set them:
- If you upload a file, then it uses RAG by default, so to test the long context, you need to concatenate all the input into out text file.
file . -type f =exec cat {} ; | pbcopy
is your friend, this loads up the clipboard and then you can paste it in. - The first thing thing you learn is that Open WebUI by default chops all context lengths to just 2K for performance.
- So you have to go into each model in Admin Settings > Models > _Your Model > Advanced and set
context length
to what ever the real limit is. I can’t quite figure out how to find this in Open WebUI, butollama info llama3.2
for example shows the context length. You can also change this globally if you think all the models you use support say 128K tokens withSettings > General > Advanced Parameters > Show > Context Length
- Then you need to optimize the KV Cache and the use of Flash Attention by setting OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q8_0. This halves the memory with no noticeable impact. For small machines, try
q4_0
and this works great.
Once you do this, when you Upload a file, you will have the choice:
- Upload the a file with the + at the chat
- Click on the file uploaded
- At the upper right, you can choose “Segmented retrieval” which means use the RAG or the whole document is just inserted. If you have set the context high enough, it is going to take some time. For instance, Tulu3:8B runs at 200 prompt tokens per second so a 20K token document is going to take a while.
- I can’t find a way to cache the KV Cache that is created (like Anthropic does), but that will get rid of this. It should be in the Knowledge section I think.
Web Search
Very confusing, but basically you have to a few things
- Add the search engine to the system
- Click on the web upload button in the chat
- A very strange syntax where
#
says do a web search but this has to be separated by entering each one in turn and then running
Once you have RAG working then you can use Web Search which is a variant of Local RAG although they are not clear about this, but here is how it works:
- If you type
#https://tongfamily.com
it will suck in the contents of this as a local file and apply rag to it. - If you install a Search agent that does this and there are literally a sea of them.
- I installed each and it seems that the basic issue is that when you ask for the top 5 results, they should return website, but the problem is that some of them are very verbose, you do not want the HTML goo just the content.
Cool Trick: Adding it as a Search Engine
OK, this is a cool trick, if you have a custom browser and hate going to localhost all the time:
- Go to your browser where you can add custom search engines. For instance with Brave, you go to http://brave/search-engine
Search Engine Selection
The main thing here is you want the search engine to return a nice JSON so the LLM can process it
- Brave Search requires a free API key limit to 1/second and 2000 per month, this does seem to return the site at all, so depressing
- Serxng: Does not work for running over the web. The Serxng the documentation wants a docker container which is a bunch of trouble, but trying to use one of the free sites like
https://kantan.cat/search?q=<query>
with their string leads to rate limit errors. If you set it to one concurrent request then it works properly and returns a document but, I get 403 forbidden. You can also run this as a python application. Basically you need to run your own if you want it to work - Serpstack. This is quite fast and you just need an API key and it
- SearchAPI. This given the google search engine looks for sites rather than the specific URL. It seams like it is just finding the top hits from the # sign. If you feed it an exact URL it searches with the google engine just that site and does a retrieval. Otherwise it gives you a bunch of results.
- Duckduckgo. Like SearchAPI does not return the contents, seems like it is the search page and isn’t fast
- Serply. This is a special search engine that uses API keys and is designed specifically for searching for market data. The nice thing is that it returns simple JSON that is just the data of the search which is sort of useful. One problem of course is that it only returns the search results and not the full data in the sites themselves. There are a huge number of different ways to run the search such as news searches and so forth, but unfortunately I get a
'NoneType' object has no attribute 'pop'
error.
Using the Search #url at a time then ask
The syntax is string, the #
with a URL, but this is very inconsistent and
seems buggy, but it isn’t, its just not well documented. What you do is that you:
- Enter each URL with a pound sign.
- Press enter after each one
- You should see them become documents
- You can click on each and see the contents
- When you click, you can use the toggle which is set for RAG or “entire document”.
- Then enter your query
- You will see document below
- When you click on each you will see the RAG section that is used. Each will have a relevance score
Each document added is added to the context when you run a query but sometimes, you actually get the search engine results. And then annotation shows you the part that the RAG on the Web search has selected results. The default is to use “Focused Retrieval”, but in the document you can select “Using Entire Document”.
As an example, enter each of these with a SEND, that’s the main trick
#https://tne.ai/about
#https://tne.ai/solution
#https://tne.ai/app
#https://tne.ai/sys
summarize company
If you want to do a Google like search
Then you should then:
- Type the search phrase you want
- Clickon the button that says Web search
- This will call the default search engine and return up to three sites.
- You can set how many site you want in settings
As an example with this prompt and Web Search clicked, you will get the three most relevant sites summarized.
Richard Tong biography
Local RAG with uploaded documents
This actually works well, you just add documents with the plus and it summarizes, for instance Deepseek Coder 2.5, seems to have issues with RAG but the local models like LLama3.2 work OK. There do see to be some bugs because
sometimes it says it can’t find the document files.
Here is the process:
- Upload files
- Then you can run a prompt on them.
- There is a setting for what RAG to use
Speech Models
This part is really undocumented, but the UI says check for the cmu-arctic-xvectors, but I assuming this works the same as the RAG stuff, I’m assuming it wants huggingface syntax org/repo
, but this does not have any models in it.
It also mentions faster-whisper but again this does to seem to have a list of models seems pretty small at distil-whisper-large-v3 is not found:
- large-v2
- distil-whisper-large-v3
- small
But these do not seem correct, but looking at the error logs that come out of stderr, but if you say type in “foo”, then you can see what the valid models are which are currently as of December 2024. The other options are the OpenAI API and a WebAPI for the browser speech system
- tiny.en
- tiny
- base.en
- base
- small.en
- small
- medium.en
- medium
- large-v1
- large-v2
- large-v3
- large
- distil-large-v2
- distil-medium.en
- distil-small.en
- distil-large-v3
I have to say distill-large-v3 sounds pretty good to me.
And in looking at this YouTube video it’s clear that in addition to these default repos, you can specify any hugging face url so for instance so sentence-transformers/all-mpnet-base-v2
which does try to download but doesn’t actually load, so stick to the ones listed. This is supposed to be the best one and the default. There are other tools that show you how to do voice chat, but I can’t find anything about what models to use.
Making Speech to Text work
Well again, there are huge number of different things and its pretty confusing what is in there right now with the current 0.53 version:
- OpenAI. You fill in the API key as usual, but it is not clear what TTS Models you have available but the two options appear to be tts-1 and tts-hd. The main thing is that tts-1 has lower latency but tts-1-hd has fewer errors. In testing, the delay with tts-1-hd seems minimal compared with LLM processing time so I use tts-1-hd. You also have a collection of voices, alloy , echo, fable, onyx (a lower soothing voice), nova (a higher voice, sounds like HER) and shimmer (higher voice). I picked nova
- Transformers (local). This looks like another CPU driven model, again, it’s hard to know what the valid names are here. The user interface says it uses SpeechT5 and CMU Arctic Embeddings. I’m guessing based on the RAG portion that these are huggingface models, but it’s not clear what the names are since the SpeechT5 doesn’t have any huggingface names and the CMU listing is for a dataset. I just tried a file name
cmu_us_awb_arctic-wav-arctic_a0001
2. This seems to cause some serious issues as after I try this, I getnetwork error
and have to restart everything, but then when I rebooted and tried again, it worked! As an aside, they have literally 80 pages of models with the last one beingcmu_us_slt_arctic-wav-arctic_b0539
. It keeps generating this error that ends withOutput channels > 6553
6 not supported by MPS Device. The first one is a lower voice and the second a higher one.
The Voice call and Microphone Buttons for Voice Assistant
Well there are two other cool modes that this supports decently and there is a lot of work to make an interactive local assistant. And the documentation is currently blank on how to do text-to-speech 🙁
- You can click on the microphone, the main thing here is that the browser is doing this, so you have to wait a little bit before talking and then when done, you click on the stop icon. The STT (Speech to Text works pretty well though)
- OpenedAI-Speech. This is another choice that I need to try as it provides a local OpenAI compatible end point so you can just use a local application (the same way that Connections provides this for models). It hasn’t been updated in five months though, so I wonder how usable it is. But it provides both tts-1 and tts-1-hd as well as all the voices. And it doesn’t seem to work reliably plus Coqui went bankrupt.
- OPenai-edge-tts. This is another project that is similar but just uses edge-tts from Microsoft. You can run with docker or it is just a python project that you can
uv run
easily. - The Voice Call is pretty cool. If you have something like a Vision model, you can get both voice and also images into the system. It is definitely not fast enough for typical conversations even on an M4 Max, but we are getting there!
Installation of ComfyUI and its Bizarre UI
If you are going all locally, the one thing that would be nice to have is local image generation. There are plenty of local image recognition models like llama3.2-vision:90b
but to do this you have to get ComfyUI running. The setup instructions are a mess, but they are working on a desktop application like DiffusionBee (which is amazing by the way).
The setup is beyond byzantine though as it requires cloning a repo, but they do have a desktop application setup that seems to work now. There are some strange things about this application:
- You can download the test application for ComfyUI Desktop.
- It takes a good long time for it to actually compile and run, it looks about 30 seconds to boot up, so be patient
- There are some sample templates, but it’s not obvious how to add new models. The guide says you have to spelunker around Civitai and then download, but you have to refresh the application to see these. If you use the flux.1 template, it isn’t compatible
- At the upper right there is a Manager button. yes, I don’t know why that is there, but it apparently a separate application. Then you get a curate list of models to install, but it wants a checkpoint, so if you filter for that, you can try to install it. The one I use on DiffusionBee is Flux.1 [dev], so I like to give that a try. The Schnell one is the other one that is default downloaded.
- After you load a model, you need to hit in the Manager, Refresh to see them in the Load Checkpoint box.
- This thing has a huge graphical editor so figuring out the right way to lay things out isn’t easy. But they have some defaults.
- The installation happens by default in
~/Documents/ComfyUI
so if you have iCloud sync running make sure you have enough space.
Tools, Functions and Pipelines
Do not really understand this yet, but it is supposed to allow you to run Code
in the system. The easiest thing to do is just to use the Model system where
there are many providers that use the OpenAI API and you can plugin there, but here is their nomenclature:
- Pipelines. Most of them you can just use Connections if it is just an API plugin, Functions if you just need something simple. Pipelines and more complex and technically they just look like an OpenAI API end point.
- Tools. These run externally, so they don’t see any of the environment inside OpenWebUI.
- Functions. These run “in context” so they can manipulate and act on the different things inside Open WebUI, so they can do things like display and work. You have to manually assign Functions to Models in the Workspace > Model section. They can do things like pre-process data as an Inlet Function or post process with an Outlet Function. We used the Functions for Google Gemini and Anthropic. But be careful loading these, they are inside the system and when loading a bunch, I managed to crash Open WebUI
- Pipes. They can be Pipe which means that it looks like a single Model.
- Manifold is a collection of Models.
- Valves are user configurable data (I know right taking the Pipe metaphor pretty far)
User Valves are things that anyone can set.
Side Note: Using UV run and uvx and scripts and asdf and direnv interactions
There are so many of these things that require that you clone and then pip install something. A the very least you should just use uv for this so at least you get an environment. We will post more on how to do this, but the main point is that there are that there are three ways to install python now with uv:
- From a python source repo. You create a pyproject.toml with
uv init
and then you add dependencies withuv add pytorch
etc. This creates a pyproject.toml and you can create a virtual environment withuv venv
and now when you go to that directory, you cansource .venv/bin/activate
and it will start the venv for that directory and thendeactivate
gets you out. Alternatively, you just insertuv run
in front of everything and it starts it up. This is required since sourcing is not something that direnv can do since it is run as a subshell so you only get export from it, but you can do even better see below - If you are a lover of asdf and asdf-direnv, you can automate this with
.envrc
where you insert alayout python
and it should pick It up then there is another scheme where you do not have to do this manual activation like pipenv does (but is too slow). Just addlayout python
to your .envrc. Note that unlike ,uv venv this creates a virtual environment along that is python version specific. So it is more general that uv venv
you can have more than one version so if you have conflict you have to be clever about what python modules you use. You also need to modify your PS to pickup the VIRTUAL_ENV that is created so you have some idea in the command line what is happening but note that the zsh power line does this automatically and power line BASH as well. Note that if you use asdf and direnv, then you will never use the system python or others while inside your $HOME directory. Note you can use uv directly as well with you just need to add some code to your $XDG_CONFIG/direnv/direnvrc so you can uselayout uv
in your .envrc and it all works. - Python script with uv.You can have a single script with all its dependencies in a doctoring. So take a single file script and run
uv run script.py
which works if it has no dependencies or just the standard library ones or if you runuv init --script script.py --python 3.12
if will inject the right doctoring, souv run script.py
creates a venv and just works. - Pip package with cli aka tools. So if you have python pip package properly and it has command line entry points, then you can do a
uvx ruff
. This works beca8use python packages have entry points as part of their packaging, souvx --from httpie http
works when the entry point is different from the package name. And you can even as for extras withuvx --from mypy[faster-cache] mypy --xml-report report
works which is really nice. This works because when you package things, you get entry point specifications. You can create this with uv build and then you can uv publish it. The basic idea is that command line tools can be packaged in your pyproject.toml using typer which is the. new hotness for CLI applications (based on FastAPI for Web applications):
# in ./src/greetings/cli.py
import typer
from .greet import greet # the app lives ./src/greetings/greet.py
app = typer.Typer()
app.command()(greet)
if __name__ == "__main__":
app()
# make sure there is a null __init__.py
# this supports python -m greetings
# __main__.py
if __name__ == "__main__":
from greetings.cli import app
app()
# assumes the cli is in ./src/greetings/cli called a app()
[project.scripts]
greet = "greetings.cli:app"
Side note the frustrations of Unifi Threat
They’ve moved the threat protection logs yet again and buried them, so if you do a git push and it hangs, then go to unifi.ui.com > _Your Controller_ > Network > Insights (the bulb on the left). Then pick Flows from the dropdown and click on Threats
The second bizarre thing is that you can click on things like “Allow Threat Signature”, but if you accidentally click on Block this IP
then there is no way to toggle it.
Instead you have to go to the Firewall rules in Network > Settings > Security > Traffic & Firewall Rules
and then search for the IP you just blocked
Then you have to scroll all the way down this huge table to find a Manage
button. Why this is not at the top is a mystery, then it creates a checkbox and then scroll all the way down again to Remove.
Leave a Reply