imjonse 10 minutes ago [-]
"The TurboQuant paper (ICLR 2026) contains serious issues in how it describes RaBitQ, including incorrect technical claims and misleading theory/experiment comparisons.
We flagged these issues to the authors before submission. They acknowledged them, but chose not to fix them. The paper was later accepted and widely promoted by Google, reaching tens of millions of views.
We’re speaking up now because once a misleading narrative spreads, it becomes much harder to correct. We’ve written a public comment on openreview (https://openreview.net/forum?id=tO3ASKZlok).
We would greatly appreciate your attention and help in sharing it."
https://x.com/gaoj0017/status/2037532673812443214
I think the biggest issue isn’t the tool itself, but access and stability.
I had more trouble finding reliable AI accounts than using them tbh
konaraddi 2 hours ago [-]
> applying this compression algorithm at scale may significantly relax the memory bottleneck issue.
I don’t think they’re going to downsize, though. I think the big players will just use the freed-up memory for more workflows or larger models, because they want to scale up. It’s a cat-and-mouse game for the best models.
miohtama 13 minutes ago [-]
It will also help with local inference, making AI without big players possible.
Verdex 1 hour ago [-]
Known in the business as 'pulling a jevons'
chr15m 10 minutes ago [-]
Is this something that will show up in Ollama any time soon to increase context size of local models?
fph 4 hours ago [-]
Despite the shortage, RAM is still cheaper than mathematicians.
captainbland 2 hours ago [-]
I don't know, I think if you weighed up the costs of AI related datacentre spend vs. the average mathematics academic's salary you could come to a different conclusion.
Verdex 1 hour ago [-]
It's also less frustrating to organize world wide ram production and logistics than to deal with a single mathematician.
Constantly sitting around trying to solve problems that nobody has made headway on for hundreds of years. Or inventing theorems around 15th century mysticism that won't be applicable for hundreds of years.
Now if you'll excuse me I need to multiply some numbers by 3 and divide them by 2 ... I'm so close guys.
Eddy_Viscosity2 51 minutes ago [-]
The comment feels a bit like Verdex may have dated a mathematician at some point and it went sour.
mandeepj 2 hours ago [-]
But not everyone has to pay mathematicians, like RAM :-)
_fizz_buzz_ 1 hour ago [-]
Doubt it. You have to pay these mathematicians once and then you can deploy to millions of sites.
Almondsetat 2 hours ago [-]
At the same time, processing is much cheaper than memory
3yr-i-frew-up 3 hours ago [-]
[dead]
simne 44 minutes ago [-]
Sure, we need better math; that's obvious.
Unfortunately, nobody at the big companies knows which math will win, so the competition won't end.
So researchers will try one solution, then another, and so on, until they find something that works well, or until semiconductor production (Moore's Law) delivers enough chips to run current models fast enough.
I believe somebody already has the silver bullet, the ideal AI algorithm that will take us all to AGI once some big company scales it up, but that isn't obvious at the moment.
exabrial 20 minutes ago [-]
I was thinking it needs speciality hardware. Sort of like how GPUs were born…
barbegal 52 minutes ago [-]
Does the KV cache really grow to use more memory than the model weights? The reduction in overall RAM relies on the KV cache being a substantial proportion of the memory usage but with very large models I can't see how that holds true.
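For a rough sense of scale, here is a back-of-the-envelope sketch in Python. The model shape (80 layers, 8 KV heads with grouped-query attention, head dim 128, 70B fp16 weights) and the batch/context combinations are illustrative assumptions on my part, not figures from the article:

    n_layers   = 80
    n_kv_heads = 8       # grouped-query attention (assumed)
    head_dim   = 128
    n_params   = 70e9    # hypothetical 70B-class dense model
    bytes_w    = 2       # fp16 weights
    bytes_kv   = 2       # fp16 K/V entries

    def kv_cache_bytes(batch, seq_len):
        # K and V tensors per layer, each of shape [batch, n_kv_heads, seq_len, head_dim]
        return 2 * n_layers * n_kv_heads * head_dim * batch * seq_len * bytes_kv

    weights_gb = n_params * bytes_w / 1e9
    for batch, seq in [(1, 8_192), (32, 32_768), (64, 131_072)]:
        kv_gb = kv_cache_bytes(batch, seq) / 1e9
        print(f"batch={batch:3d} seq={seq:7d}  KV ~{kv_gb:8.1f} GB  weights ~{weights_gb:.0f} GB")

Under these assumptions, a single user at modest context has a KV cache that is a rounding error next to ~140 GB of weights, but at serving scale (large batches, long contexts) the cache can dominate, which is presumably the regime the article cares about.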
Lerc 6 hours ago [-]
This is one of the basic avenues for advancement.
Compute, bytes of RAM used, bytes in the model, bytes accessed per iteration, bytes of data used for training.
You can trade the balance if you can find another way to do things; extreme quantisation is just one direction to try. KANs were aiming for more compute and fewer parameters. The recent optimisation projects have been pushing at these various properties. Sometimes gains in one come at the cost of another, but that needn't always be the case.
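As a toy illustration of trading one axis for another (my own example, not from the article): symmetric int8 quantization of a weight matrix halves the bytes stored and moved versus fp16, at the price of an extra dequantize step, i.e. a little more compute per access.

    import numpy as np

    w = np.random.randn(4096, 4096).astype(np.float16)

    # quantize: one shared scale, 1 byte per weight instead of 2
    scale = float(np.abs(w).max()) / 127.0
    w_q = np.round(w.astype(np.float32) / scale).astype(np.int8)

    # dequantize on use: the extra compute paid for the saved bytes
    w_hat = w_q.astype(np.float32) * scale

    err = np.abs(w.astype(np.float32) - w_hat).mean()
    print(f"{w.nbytes/1e6:.1f} MB -> {w_q.nbytes/1e6:.1f} MB, mean abs error {err:.4f}")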
mustyoshi 1 hour ago [-]
The drop in memory stocks seems counterintuitive to me.
The demand for memory isn't going to go down, we'll just be able to do more with the same amount of memory.
abdelhousni 3 hours ago [-]
The same could be said about other IT domains... When you see single webpages that weigh tens of MB, you wonder how we got here.
Yokohiii 2 hours ago [-]
Detachment from reality. Code elegance is more important than anything else. As simple as that.
Aardwolf 34 minutes ago [-]
Even if it has better math, wouldn't there then just be more of them, still requiring as much RAM?
BTW what caused previous computation booms, such as internet / cloud / ... to not increase hardware prices by that much?
I mean, AI is a type of computation, but before AI we also needed data centers for types of computation, and more of them would have done more computation too. Both for AI and pre-AI, more hardware = more computation = more advantage, and one is only limited by how much you can buy. All those pre-AI things were also a big hype at some point in time.
What exactly is the thing that's different about AI computation and pre-AI computation that makes one want to buy so much that it increases consumer hardware prices now?
And might this ever stop, or will humanity have expensive hardware from now on?
alienbaby 2 hours ago [-]
I've thought for a while that the real gains now will come not from throwing more hardware at the problem, but from advances in mathematical techniques that make things far more efficient.
Bydgoszczo 21 minutes ago [-]
And maverick 2
LoganDark 5 hours ago [-]
We will not see memory demand decrease because this will simply allow AI companies to run more instances. They still want an infinite amount of memory at the moment, no matter how AI improves.
jurgenburgen 4 hours ago [-]
If models become more efficient we will move more of the work to local devices instead of using SaaS models. We’re still in the mainframe era of LLM.
throwatdem12311 3 hours ago [-]
The hyperscalers do not want us running models at the edge and they will spend infinite amounts of circular fake money to ensure hardware remains prohibitively expensive forever.
topspin 2 hours ago [-]
> they will spend infinite amounts of circular fake money
> forever
If that's the plan (there is no plan) then it expires at some point, because it's a spiral and such spirals always bottom out.
throwatdem12311 2 hours ago [-]
And when that happens people STILL won’t be able to afford the hardware.
Imustaskforhelp 2 hours ago [-]
> of circular fake money
Oh, it gets worse than that. The money that kicked all of this off for OpenAI was borrowed from Japanese banks at cheap interest rates (by SoftBank, for the Stargate project). The Japanese banks can lend it because of Japanese savers and companies, and the collateral is stock whose price is inflated by people investing their hard-earned money into the markets.
So in a way they are using real, hard-earned money to fund all of this; they are using your money to attack you behind your back.
I once wrote a really long comment about the shaky finances of Stargate, I feel like suggesting it here: https://news.ycombinator.com/item?id=47297428
> and they will spend infinite amounts of circular fake money to ensure hardware remains prohibitively expensive forever.
That's ridiculous, "infinite money" isn't a thing. They will spend as much as they can not because they want to keep local solutions out, but because it enables them to provide cheaper services and capture more of the market. We all eventually benefit from that.
delecti 48 minutes ago [-]
As I understand this advancement, this doesn't let you run bigger models, it lets you maintain more chat context. So Anthropic and OpenAI won't need as much hardware running inference to serve their users, but it doesn't do much to make bigger models work on smaller hardware.
Though I'm not an expert, maybe my understanding of the memory allocation is wrong.
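A rough way to see that point, under assumed numbers (a hypothetical 8B fp16 model with 32 layers, 8 KV heads, head dim 128, on a 24 GB card); the compression ratios here are placeholders, not the paper's actual figures:

    hbm_bytes    = 24e9
    weight_bytes = 8e9 * 2                  # 8B params at fp16
    kv_per_token = 2 * 32 * 8 * 128 * 2     # K+V, 32 layers, 8 KV heads, head dim 128, fp16

    budget = hbm_bytes - weight_bytes       # whatever is left over goes to the KV cache
    for label, scale in [("fp16 KV", 1.0), ("4-bit KV", 4/16), ("~1-bit KV", 1/16)]:
        print(f"{label:10s} -> roughly {budget / (kv_per_token * scale) / 1e3:.0f}k tokens of context")

The weights term doesn't move, so under these assumptions the savings show up as longer context (or more concurrent sequences), not a bigger model.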
mustyoshi 1 hour ago [-]
I don't see how we'll ever get to widespread local LLM.
The power efficiency alone is a strong enough pressure to use centralized model providers.
My 3090 running 24b or 32b models is fun, but I know I'm paying way more per token in electricity, on top of lower quality tokens.
It's fun to run them locally, but for anything actually useful it's cheaper to just pay API prices currently.
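Napkin math for the electricity side, with all three inputs being assumptions (GPU draw, tokens/s for a quantized 24B-32B model at batch 1, and the power tariff); compare the result with whatever your API of choice charges per million tokens:

    gpu_kw         = 0.35    # RTX 3090 under inference load (assumption)
    tokens_per_sec = 30      # quantized 24B-32B model, batch 1 (assumption)
    usd_per_kwh    = 0.30

    kwh_per_mtok = gpu_kw * (1e6 / tokens_per_sec) / 3600
    print(f"~{kwh_per_mtok:.1f} kWh, ~${kwh_per_mtok * usd_per_kwh:.2f} of electricity per million tokens")

Whether that beats API pricing depends on your tariff, your actual throughput, and whether you count hardware amortisation and the quality gap per token.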
singpolyma3 1 hour ago [-]
Until you put up your solar and then power is almost free...
vidarh 29 minutes ago [-]
The amortised cost including the panels and labour is nowhere near "almost free".
Ray20 2 hours ago [-]
> If models become more efficient
Then we can make them even bigger.
Imustaskforhelp 2 hours ago [-]
> Then we can make them even bigger.
But what if, for most intents and purposes, small models become "good enough"?
I've seen people here and on r/localllama run small models, sometimes several of them, to iterate quickly, with a larger model plugged in to fix whatever remains.
Larger/SOTA models would still see some demand, but I don't think it would be nearly as much as people expect. We all still feel that different models are good for different tasks, and a good recommendation is to benchmark models against your own use cases; sometimes a small model is good enough within your particular domain to be worth keeping in your toolset.
Almondsetat 2 hours ago [-]
Because the true goal is AGI, not just nice little tools to solve subsets of problems. The first company which can achieve human level intelligence will just be able to self-improve at such a rate as to create a gigantic moat
9rx 1 hour ago [-]
> The first company which can achieve human level intelligence will just be able to...
They say prostitution is the oldest industry of all. We know how to achieve human-level intelligence quite well. The outstanding challenge is figuring out how to produce an energy efficient human-level intelligence.
Ray20 1 hour ago [-]
> But what if it becomes "good enough", that for most intents and purposes, small models can be "good enough"
It's simple: then we'll make our intents and purposes bigger.
DeathArrow 4 hours ago [-]
I don't think we are there yet. Models running in data centers will still be noticeably better as efficiency will allow them to build and run better models.
Not many people today would be happy with models comparable to what was SOTA two years ago.
To run models locally with results as good as the models running in data centers, we need both that efficiency and for AI improvement to hit a wall.
Neither of those conditions seems likely in the near future.
ssyhape 3 hours ago [-]
I like the mainframe comparison but isn't there a key difference? Mainframes died because hardware got cheap -- that's predictable. LLM efficiency improving enough to run locally needs algorithmic breakthroughs, which... aren't.
My gut says we'll end up with a split. Stuff where latency matters (copilot, local agents) moves to the edge once models actually fit on a laptop. But training and big context windows stay in the cloud because that's where the data lives.
One thing I keep going back and forth on: is MoE "better math" or just "better engineering"? Feels like that distinction matters a lot for where this all goes.
redrove 4 hours ago [-]
I disagree. I think a sharp drop in memory requirements of at least an order of magnitude will cause demand to adjust accordingly.
cyanydeez 3 hours ago [-]
Department of Transportation always thinks adding more lanes will reduce traffic.
It doesn't, it induces demand. Why? Because there's always too many people with cars who will fill those lanes.
nkmnz 2 hours ago [-]
Citation needed. I've heard this quite often, but so far, I haven't seen proof of the stated causality.
PS: This doesn't mean that better public transportation could deliver more bang for the buck than the n-th additional car lane. But never ever have I heard from anybody that they chose to buy a car or use an existing car more often because an additional lane has been built.
j16sdiz 2 hours ago [-]
Have you tried the "References" section of the Wikipedia article? For example: https://en.wikipedia.org/wiki/Induced_demand#cite_note-vande...
You've never heard anyone choose to take side streets instead of the highway because of traffic jams? No one ever goes out of their way to avoid heavily trafficked areas?
> If I were Google, I wouldn’t release research that exposes a competitive advantage.
Isn't that a classic tit-for-tat decision, heading for a loss?
Excellence and prestige are valuable too. You get those expensive ML researchers at a small discount, better public and professional perception, and so on. Judging by Google's public communication, which isn't completely sociopathic, they know this war won't be won in one night, and they are the only sustainably funded company in the competition. Their business is certainly at risk, but they can either thrash around or focus. They decided to focus.
amelius 3 hours ago [-]
Can we say something about the compression factor for pure knowledge of these models?
Doesn't seem relevant here. TurboQuant isn't a domain-specific technique of the kind the Bitter Lesson [1] talks about; it's a general optimisation for transformers that helps leverage computation more effectively.
[1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html