Hacker News | joerick's comments

It should land in 3.15, so October next year. https://peps.python.org/pep-0790/

I've been wondering for a while if a program could "learn" covariance somehow. Through real-world usage.

Otherwise, it feels to me that it'd be consistently wrong to model the variables as independent. And any program of notable size is gonna be far too big to consider correlations between all the variables.

As for how one might do the learning, I don't know yet!


If we assume no conspiracy, what's the sentencing like for perjury due to carelessness?


Google does have an API for this. It has limits, but it's perfectly good for personal use.

https://developers.google.com/custom-search/v1/overview


Unfortunately 100 queries per day is quite low for LLMs, which tend to average 5-10 searches per prompt in my experience. And paying for the search API doesn’t seem to be worth it compared to something like a ChatGPT subscription.


You're not limited to 100 queries per day though. You're limited to 10,000 queries per day.


> Custom Search JSON API provides 100 search queries per day for free. If you need more, you may sign up for billing in the API Console. Additional requests cost $5 per 1000 queries, up to 10k queries per day.

What I meant is that paying for 10k queries per day does not make sense compared to a simple $20/month ChatGPT subscription.


At $5 per 1000, 10k queries would cost $50/day


If you actually did 10k queries a day with any free search service you'd quickly find yourself banned.

ChatGPT deep research does a lot of searches, but it's also heavily rate-limited even on paid accounts.

Building a search index, running a lot of servers, storing and querying all that data costs money. Even a cheap search API is gonna cost a bit for 10k queries if the results are any good.

Kagi charges me 15 USD/month, and last month I did 863 searches with it; more than worth it for the result quality. At that volume the Google API would actually be cheaper for me. I'm pretty sure Kagi would kick me out if I did 10k searches a day, even though I'm paying for "unlimited searches".

A similar API from Bing costs between 15 and 25 USD per 1000 searches. Exa.ai costs 5 USD per 1000 searches, rising to 25 USD if you want more than 25 results per query.

Good web search is quite expensive.


The Google programmable search engine is unlimited and can search the web: https://programmablesearchengine.google.com/about/


That's intended for human users. If you try to use it for automated requests, you'll get banned for botting fairly quickly.


They have a 10k/day API. I'm sure that's enough for one person.


That's the one that's described upthread as "paying for the search API doesn’t seem to be worth it."


Beautifully put.


Pretty fascinating from an information theory point of view. Surprising that it works at all. Is this, like, the JPEG of uniformly distributed, uncorrelated data?


We don't know. They basically look for sequences that approximate NN weights well, in the same way sinusoidal functions work well with "natural" images, but not with graphics with hard edges.


You might find The Library of Babel fascinating [1, 2]

1: https://libraryofbabel.info/

2: https://news.ycombinator.com/item?id=9480949


I came across this page in a discussion about dithering. It was posted by @rikroots.

Fascinating stuff. I love to see the application of the 2D Fourier transform for analysis. Added bonus - they tile brilliantly.


Hahaha. We used to get desyncs in networked games of Generals pretty regularly. I remember if a game took more than 30-40 minutes I'd start to get a spidey sense that things were about to go wrong.


Generals was one of our favorite LAN games. It seemed that as time went on this problem got worse and worse somehow, to the point where we gave up trying to play it at all. I have fresh hope again that this might one day be fixed!


When Generals was released, my high school was encouraging everyone enrolled in specific classes to get laptops as part of a pilot program (with some of the money coming from the government). I started bringing a network switch to school so we could play during free periods/lunch (the wifi was problematic).


The thing that puzzles me about embeddings is that they're so untargeted, they represent everything about the input string.

Is there a method for dimensionality reduction of embeddings for different applications? Let's say I'm building a system to find similar tech support conversations and I am only interested in the content of the discussion, not the tone of it.

How could I derive an embedding that represents only content and not tone?


You can do math with word embeddings. A famous example (which I now see has also been mentioned in the article) is to compute the "woman vector" by subtracting "man" from "woman". You can then add the "woman vector" to e.g. the "king" vector to obtain a vector which is somewhat close to "queen".

To adapt this to your problem of ignoring writing style in queries, you could collect a few text samples with different writing styles but same content to compute a "style direction". Then when you do a query for some specific content, subtract the projection of your query embedding onto the style direction to eliminate the style:

    query_without_style = query - dot(query, style_direction) * style_direction
I suspect this also works with text embeddings, but you might have to train the embedding network in some special way to maximize the effectiveness of embedding arithmetic. Vector normalization might also be important, or maybe not. Probably depends on the training.

Another approach would be to compute a "content direction" instead of a "style direction" and eliminate every aspect of a query that is not content. Depending on what kind of texts you are working with, data collection for one or the other direction might be easier or have more/fewer biases.

And if you feel especially lazy when collecting data to compute embedding directions, you can generate texts with different styles using e.g. ChatGPT. This will probably not work as well as carefully handpicked texts, but you can make up for it with volume to some degree.
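A minimal numpy sketch of the projection formula above. The random vectors here are stand-ins for real model embeddings, and the paired plain/formal samples are assumed data; with a real embedding model you'd encode your collected style-varied texts instead:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed data: paired samples with the same content but different style.
# In practice these would come from an embedding model, not a RNG.
plain = rng.normal(size=(10, 64))           # plain-style embeddings
formal = plain + rng.normal(size=(10, 64))  # formal-style counterparts

# The average difference between the pairs estimates a "style direction".
style_direction = (formal - plain).mean(axis=0)
style_direction /= np.linalg.norm(style_direction)  # unit length

def remove_style(query):
    """Subtract the query's projection onto the style direction."""
    return query - np.dot(query, style_direction) * style_direction

query = rng.normal(size=64)
cleaned = remove_style(query)
# cleaned is now orthogonal to the style direction (dot product ~ 0)
```

Normalizing the direction to unit length matters here; otherwise the subtracted projection is scaled wrongly.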


Interesting, but your hypothesis assumes that 'tone' is one-dimensional, i.e. that there is a single axis you can remove. I think tone is very multidimensional; I'd expect to be removing multiple 'directions' from the embedding.


No, I don’t think the author is saying one-dimensional - the vectors are represented by magnitudes in almost all of the embedding dimensions.

They are still a “direction” in the way that [0.5, 0.5] in x,y space is a 45 degree angle, and in that direction it has a magnitude of around 0.7

So of course you could probably define some other vector space where many of the different labeled vectors are translated to magnitudes in the original embedding space, letting you do things like have a “tone” slider.


I think GP is saying that GGP assumes "tone" is one direction, in the sense there exists a vector V representing "tone direction", and you can scale "tone" independently by multiplying that vector with a scalar - hence, 1 dimension.

I'd say this assumption is both right and wrong. Wrong, because it's unlikely there's a direction in embedding space corresponding to a platonic ideal of "tone". Right, because I suspect that, for sufficiently large embedding space (on the order of what goes into current LLMs), any continuous concept we can articulate will have a corresponding direction in the embedding space, that's roughly as sharp as our ability to precisely define the concept.


I would say rather that the "standard example" is simplified, but it does capture an essential truth about the vectors. The surprise is not that the real world is complicated and nothing is simply expressible as a vector and that treating it as such doesn't 100% work in every way in every circumstance all of the time. That's obvious. Everyone who might work with embeddings gets it, and if they don't, they soon will. The surprise is that it does work as well as it does and does seem to be capturing more than a naive skepticism would expect.


You could of course compute multiple "tone" directions for every "tone" you can identify and subtract all of them. It might work better, but it will definitely be more work.
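A hedged numpy sketch of subtracting several directions at once. Since estimated tone directions usually overlap, orthonormalizing them first (via QR) avoids double-subtracting shared components. The random vectors below are stand-ins for real embeddings and direction estimates:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 64

# Assumed: several estimated "tone" directions (generally not orthogonal).
tone_dirs = rng.normal(size=(3, dim))

# Orthonormalize them so each component is removed exactly once.
q, _ = np.linalg.qr(tone_dirs.T)  # q has shape (dim, 3), orthonormal columns

def remove_subspace(v, basis):
    """Subtract the projection of v onto the span of the basis columns."""
    return v - basis @ (basis.T @ v)

v = rng.normal(size=dim)
cleaned = remove_subspace(v, q)
# cleaned has no component along any of the estimated tone directions
```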


Though not exactly what you are after, Contextual Document Embeddings (https://huggingface.co/jxm/cde-small-v1), which generate embeddings based on "surrounding context" might be of some interest.

With 281M params it's also relatively small (at least for an embedding model) so one can play with it relatively easily.


Depends on the nature of the content you’re working with, but I’ve had some good results using an LLM during indexing to generate a search document by rephrasing the original text in a standardized way. Then you can search against the embeddings of that document, and perhaps boost based on keyword similarity to the original text.


This is also often referred to as Hypothetical Document Embeddings (https://arxiv.org/abs/2212.10496).


Do you have examples of this? Please say more!


Nice workaround. I just wish there was a less 'lossy' way to go about it!


Could you explicitly train a set of embeddings that performs that step in the process? For example, when computing the loss, you compare the difference against the normalized text rather than the original. Or alternatively, do this as fine-tuning. Then you would have embeddings optimized for the characteristics you care about.


Normal full-text search stuff helps reduce the search space too - e.g. lemmatization, stemming, and query simplification all long predate LLMs.


There are a few things you can do. If these access patterns are well known ahead of time, you can train subdomain behavior into the embedding models by using prefixing. E.g. "content: fixing a broken printer", "tone: frustration about broken printer", and "fixing a broken printer" can all be served by a single model.

We have customers doing this in production in other contexts.

If you have fundamentally different access patterns (e.g. doc -> doc retrieval instead of query -> doc retrieval) then it's often time to just maintain another embedding index with a different model.
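A minimal sketch of the prefixing idea above. The prefix labels are assumptions, and the commented-out `embed` call stands in for whatever embedding model API you actually use:

```python
# Hypothetical sketch of aspect prefixing. The prefix labels here are
# assumptions; a real system would use whatever labels the model was
# trained with.
def with_prefix(aspect: str, text: str) -> str:
    # One model can serve several access patterns by tagging the input.
    return f"{aspect}: {text}"

queries = [
    with_prefix("content", "fixing a broken printer"),
    with_prefix("tone", "frustration about broken printer"),
]
# Each prefixed string would then go through the same embedding model:
# vectors = [embed(q) for q in queries]
print(queries[0])  # content: fixing a broken printer
```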


They don't represent everything. In theory they do but in reality the choice of dimensions is a function of the model itself. It's unique to each model.


Yeah, 'everything' as in 'everything that the model cares about' :)


I've just begun to dabble with embeddings and LLMs, but recently I've been thinking about trying to use principal component analysis[1] to either project to desirable subspaces, or project out undesirable subspaces.

In your case it would be to take a bunch of texts which roughly mean the same thing but with variance in tone, compute PCA of the normalized embeddings, take the top axis (or top few) and project it out (i.e. subtract the projection) from the embeddings of the documents you care about before doing the cosine similarity.

Something along those lines.

Could be it's a terrible idea, haven't had time to do much with it yet due to work.

[1]: https://en.wikipedia.org/wiki/Principal_component_analysis
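A hedged sketch of that PCA idea via SVD. The random matrix stands in for real embeddings of same-content/different-tone texts, and treating the top principal components as "tone" is exactly the assumption being tested:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed: embeddings of texts with the same content but varied tone.
# In practice these would come from an embedding model.
same_content = rng.normal(size=(20, 64))

# PCA via SVD of the mean-centered matrix; the top right-singular
# vectors are the principal axes (here, presumed "tone" variation).
centered = same_content - same_content.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
tone_axes = vt[:2]  # top two principal components, shape (2, 64)

def project_out(v, axes):
    """Subtract the projection of v onto each (orthonormal) axis."""
    return v - axes.T @ (axes @ v)

doc = rng.normal(size=64)
cleaned = project_out(doc, tone_axes)
# cleaned has no component along the presumed tone axes
```

The rows of `vt` from `np.linalg.svd` are already orthonormal, so no extra normalization step is needed before projecting.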


Agreed. This is the biggest problem I've hit with off-the-shelf embeddings. Need a way to decompose embeddings.


You could fine-tune the embedding model to reduce cosine distance according to a more task-specific notion of similarity.


The touchscreen might be better than you're giving it credit for. Direct interaction (vs the indirectness of mice, keyboards, and gestures) is one benefit that I can't see done better any other way. I think people will always have some kind of handheld touchscreen as a result.


Hmm, touchscreens are a good counterpoint, but I do think they might be replaceable with AR glasses. It doesn’t seem difficult to mimic a feedback mechanism with only visual feedback and/or a light vibration from the goggles themselves.

I actually think it may be the opposite: direct input like keyboards might be around longer simply because they don’t require you to be looking at your hands to use them. Both touchscreens and AR goggles do.

