Also worth checking out is https://github.com/saharNooby/rwkv.cpp, which is based on Georgi's library and supports the RWKV family of models (Apache-2.0 licensed).
I’ve got some of their smaller Raven models running locally on my M1 (only 16GB of RAM).
I'm also in the middle of making it user-friendly to run these models on all platforms (built with Flutter). The first macOS release will be out before this weekend: https://github.com/BrutalCoding/shady.ai
I get around 140 ms per token running a 13B-parameter model on a ThinkPad laptop with a 14-core Intel i7-9750 processor. Because it's CPU inference, the initial prompt processing takes longer than it would on a GPU, so total latency is still higher than I'd like. I'm working on some caching solutions that should make this bearable for things like chat.
This is not true: GPT-3 can perform chain-of-thought reasoning through in-context learning, either with one- or few-shot examples or zero-shot by adding "let's think step by step" to the prompt (less reliable).
GPT-3.5 (what's being used here) is a little better at zero-shot in-context learning because it's been instruction fine-tuned, so it only needs to be given the general format in the context.
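Rough sketch of what the zero-shot variant looks like in practice (the question is made up, and this assumes the openai Python package; the only important part is the trailing "Let's think step by step."):

```python
# Zero-shot chain-of-thought sketch: no worked examples in the context,
# just a nudge to produce intermediate reasoning before the final answer.
# Assumes the `openai` package and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        # Appending "Let's think step by step." is the zero-shot CoT trigger.
        {"role": "user", "content": question + "\n\nLet's think step by step."},
    ],
    temperature=0,
)

print(response.choices[0].message.content)
```

The one/few-shot version just swaps the trigger phrase for a couple of worked question/answer examples placed ahead of the real question.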
I think you're focusing on a few narrow examples where LLMs are underperforming and generalising about the technology as a whole. This ignores the fact that Microsoft already has a successful LLM-based product in the market with GitHub Copilot. It's a real tool (not a party-trick technology) that people actually pay for and use every day.
Search is one application, and it might be crap right now, but for Microsoft it only needs to provide incremental value; for Google it's life or death. Microsoft is still better positioned in both the enterprise (Azure, Office 365, Teams) and developer (GitHub, VS Code) markets.
Copilot mostly spews distracting nonsense, but when it's useful (like with repetitive boilerplate where it doesn't have to "think" much) it's really nice. But if that's the bar, I don't think we're ready for something like search, which is much more difficult and important to get right for the average person to get more good than harm from it.
Few people seem to know this, but you can disable auto-suggest in Copilot, so it only suggests things when you proactively ask it to. I only prompt it when I know it will be helpful and it's a huge time saver when used that way.
Sometimes, Copilot is brilliant. I have encountered solutions that are miles better than anything I had found on the internet or expected to find in the first place.
The problem involved heavy numerical computation with numpy, and it found a library call that covered exactly my case.
I've had similar experiences. Sometimes it just knows what you want and saves you a minute searching. Sometimes way more than a minute.
But I find it also hallucinates in code, coming up with function calls that aren't in the API but would sound like a natural thing to call.
Overall it's a positive though: it's pretty easy to tell, with your other coding tools, when a suggestion is for something made up, and the benefits of it filling in your next little thought are very real.
Google's search results are pretty terrible. I actually have a hard time telling which is a result and which is an ad anymore tbh. I really don't think the bar is that high.
So one question here is: Why reduce the distribution (with long tail or whatever) to a single estimate number? If the distribution represents the range of possible outcomes well, then the single number throws away most of the information in the distribution.
I strongly agree; giving people the distribution conveys a lot of information, especially if everyone is clear on what the parameters of that distribution mean (i.e., what does the low estimate mean?).
At the same time, there are occasions when it can be useful to collapse a distribution to a single number, for some types of reports or for quickly looking across estimates.
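To make the "throws away most of the information" point concrete, here's a toy sketch (the lognormal shape and the numbers are invented purely for illustration): the point estimate looks tidy, but the p10/p90 spread is what a reader actually needs to plan around.

```python
# Toy example: a task estimate modeled as a long-tailed (lognormal)
# distribution. The numbers are made up for illustration only.
import numpy as np

rng = np.random.default_rng(0)
estimate_days = rng.lognormal(mean=np.log(5), sigma=0.6, size=100_000)

point_estimate = estimate_days.mean()
p10, p50, p90 = np.percentile(estimate_days, [10, 50, 90])

print(f"single number (mean): {point_estimate:.1f} days")
print(f"p10 / p50 / p90:      {p10:.1f} / {p50:.1f} / {p90:.1f} days")
# The mean alone can't tell you that the 90th percentile is roughly
# double the median, which is exactly the long-tail information lost
# when the distribution is collapsed to one number.
```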
If you run a Kubernetes cluster for self-hosting software or development, I highly recommend setting up a Tailscale subnet router [1]. This will allow you to access any IP (pods or services) in your cluster from any of your Tailscale-connected computers. You can even configure Tailscale DNS to point to the DNS server in your cluster so you can connect using service names directly, e.g. http://my-service.namespace.svc.cluster.local
pikchr is awesome. A project I did recently was a WASM-compiled pikchr library to generate diagrams directly in the browser [1]. Here's a very early demo of a live editor you can play around with [2].
It's not fully featured yet, but what I'd like to eventually do is set it up in a similar way to the mermaidjs editor [3]. They encode the entire diagram in the URL. That makes it really easy to link to from markdown documents, and it has the nice benefit that the diagram is immutable for a given URL, so you don't need a backend to store anything.
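For anyone curious, the encoding side is simple. Here's a rough sketch of the idea; the editor host is a placeholder, and the exact scheme (deflate + base64 in the URL fragment) is just one reasonable choice, similar in spirit to what the mermaid editor does:

```python
# Sketch of "the whole diagram lives in the URL": compress the pikchr
# source and stash it in the URL fragment, so a given URL always renders
# the same diagram and no backend storage is needed.
# The editor host below is a placeholder, not the real project URL.
import base64
import zlib

def encode_diagram(src: str) -> str:
    return base64.urlsafe_b64encode(zlib.compress(src.encode("utf-8"))).decode("ascii")

def decode_diagram(fragment: str) -> str:
    return zlib.decompress(base64.urlsafe_b64decode(fragment.encode("ascii"))).decode("utf-8")

source = 'box "Hello"; arrow; box "pikchr"'
url = f"https://pikchr-editor.example/#{encode_diagram(source)}"
print(url)

# Round-trip check: the fragment alone reconstructs the diagram source.
assert decode_diagram(url.split("#", 1)[1]) == source
```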
Cool trick, thanks for sharing. I don't get why there isn't a suitable snscanf function that takes the buffer length as an argument and returns the number of bytes parsed.
If you're getting into this stuff, a great resource I found is the ZipCPU Tutorial [0] by Dan Gisselquist. The tutorial covers both Verilog design and formal verification methods. It uses open source tools like Verilator and SymbiYosys, so getting started is pretty easy.
I just want to share with people that, in no uncertain terms, Dan Gisselquist/ZipCPU is transphobic, and many members of the Open FPGA community are not fans of him.
- Human Reading Speed (English): ~250 words per minute
- Human Speaking Speed (English): ~150 words per minute
These should be treated like the Doherty Threshold [1] for generative content.
[1] https://lawsofux.com/doherty-threshold/
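Back-of-the-envelope, those speeds translate into a generation budget per token (the ~1.3 tokens per English word used below is an assumed rule of thumb for GPT-style tokenizers, not a measurement):

```python
# Rough conversion from human reading/speaking speed to the token
# throughput a model needs to stay ahead of the user.
# TOKENS_PER_WORD is an assumed rule-of-thumb figure, not a measurement.
TOKENS_PER_WORD = 1.3

for label, words_per_minute in [("reading", 250), ("speaking", 150)]:
    tokens_per_second = words_per_minute * TOKENS_PER_WORD / 60
    print(f"{label}: ~{tokens_per_second:.1f} tok/s "
          f"(~{1000 / tokens_per_second:.0f} ms per token budget)")
```

Under those assumptions, generation at roughly 185 ms per token or faster keeps pace with a reader, and voice interfaces get a looser budget of around 300 ms per token.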