The compute scheduling part of the paper is also very good: the way they balanced load to keep compute and communication in check.
There is also a lot of thought put into all the tiny bits of optimization to reduce memory usage, using FP8 effectively without significant loss of precision or dynamic range.
None of the techniques by themselves are really mind-blowing, but the whole of it is very well done.
The DeepSeekV3 paper is really a good read: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSee...
When everyone kind of ignores performance because compute is cheap and speed will double anyway in 18 months (note: hasn't been true for 15 years), the willingness to optimize is almost a secret weapon. The first 50% or so are usually not even difficult, because there is so much low-hanging fruit, and in most environments there's a lot of helpful tooling to measure exactly which parts are slow.
Compute has been more than doubling because people have been spending silly money on it. How long ago would a proposal for a $10m cluster for ML have been thought surreal by any funding agency? Certainly less than 10 years ago. Now people are talking of spending billions and billions.
Madness.
I think the AI megacorps' plan was always SaaS. Their focus was never on self-hosting, so optimization was useless: their customers would pay for unoptimized services whether they wanted to or not.
Making AI practical for self-hosting was the real disruption of DeepSeek.
When people are talking about $100M-$1B frontier model training runs, then obviously efficiency matters!
Sure, training costs will go down with time, but if you are only using 10% of the compute of your competition (TFA: DeepSeek vs. LLaMa), then you could be saving hundreds of millions per training run!
I was more stating the perception that compute is cheap than the fact that compute is cheap - often enough it isn't! But carelessness about performance happens, well, by default really.
At my org this is a crazy problem. Before I arrived, people would throw all kinds of compute at problems. They still do. When you've got AWS over there ready to gobble up whatever tasks you've got, and the org is willing to pay, things get really sloppy.
It's also a science-based organization like OpenAI. Very intelligent people, but they aren't programmers first.
The secret is to basically use RL to create a model that will generate synthetic data. Then you use the synthetic dataset to fine-tune a pretrained model. The secret is basically synthetic data imo: https://medium.com/thoughts-on-machine-learning/the-laymans-...
Keep in mind: America made them do this.
Why is it that the larger models are better at understanding and following more and more complex instructions, and generally just smarter?
With DeepSeek we can now run on non-GPU servers with a lot of RAM. But surely quite a lot of the 671 GB or whatever is knowledge that is usually irrelevant?
I guess what I sort of am thinking of is something like a model that comes with its own built in vector db and search as part of every inference cycle or something.
But I know that there is something about the larger models that is required for really intelligent responses. Or at least that is what it seems because smaller models are just not as smart.
If we could figure out how to change it so that you would rarely need to update the background knowledge during inference and most of that could live on disk, that would make this dramatically more economical.
Maybe a model could have retrieval built in, and trained on reducing the number of retrievals the longer the context is. Or something.
> But surely quite a lot of the 671 GB or whatever is knowledge that is usually irrelevant?
Small correction: it's 671B parameters, not 671 gigabytes. Doing some rudimentary math, if you want to run the entire model in memory at FP8 (1 byte per parameter) with ~20% overhead, it would take 671e9 × 1 byte × 1.2 ≈ 805 GB, or about 750 GiB.
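For anyone who wants to redo that back-of-the-envelope estimate, here is the arithmetic as a tiny Python sketch (the 20% overhead factor is just the rough assumption from above, not a measured number):

    params = 671e9           # 671B parameters
    bytes_per_param = 1      # FP8 = 8 bits = 1 byte per parameter
    overhead = 1.2           # assumed ~20% extra for activations, KV cache, etc.

    total_bytes = params * bytes_per_param * overhead
    print(f"{total_bytes / 1e9:.1f} GB")     # ~805.2 GB (decimal gigabytes)
    print(f"{total_bytes / 2**30:.1f} GiB")  # ~749.9 GiB (binary gibibytes)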
It's a MoE model, so you don't actually need to load all ~750 GiB at once.
I think maybe what you are asking is "Why do more params make a better model?"
Generally speaking, it's because if you have more units of representation (params), you can encode more information about the relationships in the data used to train the model.
Think of it like building a LEGO city.
A model with fewer parameters is like having a small LEGO set with fewer blocks. You can still build something cool, like a little house or a car, but you're limited in how detailed or complex it can be.
A model with more parameters is like having a giant LEGO set with thousands of pieces in all shapes and colours. Now, you can build an entire city with skyscrapers, parks, and detailed streets.
---
In terms of "is a lot of it irrelevant?": this is a hot area of research!
It's currently very difficult to know which parameters are relevant and which aren't. There is an area of research called mechanistic interpretability that aims to illuminate this; if you are interested, Anthropic's "Golden Gate Claude" work is a good starting point.
To extend your LEGO metaphor to the question of "is a lot of this irrelevant?": does your LEGO model of a city need to have interior floors, furniture, and fixtures in order to satisfy your requirements? Perhaps in some cases, but not in most.
I know it's a MoE and I didn't need the five year old explanation of why larger models are smarter. I'm also aware of interpretability research. You should read my question much more carefully and think about it harder.
This is, more or less, what the mixture-of-experts (MoE) approach is picking away at. The difference is that rather than trying to break it out by how rare or common the info is, it's broken out by specialization. There isn't as much focus on keeping the inactive portions on disk, because it's more economical to host it all, but in a way that lets you use parallelism of requests across the experts. This has the added effect that you can constantly select the best expert as the answer is generated without losing efficient hosting.
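For concreteness, the routing idea looks roughly like this, as a minimal PyTorch sketch (shapes, names, and the top-2 choice are illustrative, not DeepSeek's actual implementation):

    import torch
    import torch.nn.functional as F

    def moe_forward(x, router, experts, k=2):
        """Route each token to its top-k experts and mix their outputs.
        x: [tokens, hidden]; router: Linear(hidden, n_experts); experts: list of FFNs."""
        scores = router(x)                        # [tokens, n_experts]
        weights, idx = scores.topk(k, dim=-1)     # pick k experts per token
        weights = F.softmax(weights, dim=-1)      # normalize only the chosen scores

        out = torch.zeros_like(x)
        for slot in range(k):
            for e, expert in enumerate(experts):
                mask = idx[:, slot] == e          # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

Only the selected experts run for a given token, which is why the active compute per token is a small fraction of the total parameter count.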
I know what MoE is. Maybe read my comments more carefully and give me the benefit of the doubt.
My comment would've done an astoundingly bad job at introducing you to what mixture of experts is, had that been its goal. It's really about why the MoE-style enhancements don't target how to keep parts on disk when optimizing the model to be most economical to host. There's really not any doubt in that, it's just an observation as to why they optimize the way they do.
If you were put off by defining terms on first use: that's just good form, not something related to you.
Yesterday, when I started evaluating DeepSeek-R1 V3, it was insanely better at code generation with elaborate prompts. I asked it to write me some boilerplate code in Python using the ebaysdk library to pull a list of all products sold by user $name, and it spat it out; just a few tweaks and it was ready to go.
I tried the same thing on the 7B and 32B models today; neither is as effective as codellama.
I think people didn't understand my comment. I am very aware of this already.
I think you failed to convey what you meant to with your comment.
If you want your contribution to the discussion to be meaningful, you may want to give it another go.
I am intrigued. What did you use to run your deepseek instance?
I'm sure that a smaller generalist model with RAG would work for many cases, especially where the RAG is just looking up some facts or technique, but would you really want a smart high school kid who's googled brain surgery to be operating on your brain? Books are useful for looking up facts, but there's no substitute for experience/training in actually getting good at something.
If you google LLM, you'll see the first L stands for "large."
You would need to extract logical patterns and concepts somehow, not just word relationships. I know what you mean, this introduces another level of abstraction between relationships. If there is no way to extract these patterns, or if there are no real logical patterns present but only statistical relationships (larger model = more relationships = better prompt following etc) between words without any real 'emergent abilities' then Transformers are essentially a dead end in the context of AGI.
none of these techniques except MLA are new
They're not new in the same way Attention wasn't new when the transformer paper was written.
No one (publically) had really pushed any of these techniques far, especially not for such a big run.
no one publicly pushes any techniques very far except for meta and it’s true they continue to train dense models for whatever reason.
the transformer was an entirely new architecture, very different step change than this
e: and alibaba
They likely continue to train dense models because they are far easier to fine tune and this is a huge use case for the Llama models
It probably also has to do with their internal infra. If it were just about dense models being easier for the OSS community to use & build on, they should probably be training MoEs and then distilling to dense.
There is a big difference between inventing a technique and productising it.
One issue is that a lot of techniques proposed (especially from academic research) are hard to validate at scale given the resources required. At least DeepSeek helps a little in that regard.
[dead]
There's new stuff in the lower layers. Some of the math is, interesting? A novel method of scaling mantissas and exponents. Yes, some of the operations have to use higher precision. Yes, some values like optimizer states, gradients and weights still need higher precision. But what they can do in 8 they do in 8. Of course, like everyone, they're reduced to begging Nvidia to quantize on the global-to-shared transfer in order to realize the true potential of what they're trying to do. But I mean, hey, that's where we all are, and most papers I read don't have nearly as many interesting and novel techniques in them.
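To make the scaling idea concrete, here is a minimal sketch of block-wise FP8 quantization: one scale factor per block, so each block makes good use of the FP8 range. The block size, the e4m3 format, and the layout are illustrative rather than exactly what the paper does, and it assumes a PyTorch recent enough to have float8 dtypes:

    import torch

    def quantize_blockwise_fp8(w, block=128):
        """Quantize a 2-D weight matrix to FP8 (e4m3) with one scale per block.
        Assumes both dimensions are multiples of `block`."""
        FP8_MAX = 448.0                               # largest finite e4m3 value
        rows, cols = w.shape
        q = torch.empty(rows, cols, dtype=torch.float8_e4m3fn)
        scales = torch.empty(rows // block, cols // block)

        for i in range(0, rows, block):
            for j in range(0, cols, block):
                blk = w[i:i+block, j:j+block]
                scale = blk.abs().max().clamp(min=1e-12) / FP8_MAX
                scales[i // block, j // block] = scale
                q[i:i+block, j:j+block] = (blk / scale).to(torch.float8_e4m3fn)
        return q, scales                              # dequantize: q.float() * per-block scale

The per-block scale is what preserves dynamic range: an outlier only hurts the precision of its own block instead of the whole tensor.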
I think recomputing MLA and RMSNorm on backprop is something few would have done.
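The general pattern, recomputing cheap activations during the backward pass instead of storing them, is the standard checkpointing trick. A hedged sketch of what that looks like in PyTorch (not DeepSeek's code; torch.nn.RMSNorm needs a fairly recent PyTorch):

    import torch
    from torch.utils.checkpoint import checkpoint

    class Block(torch.nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.norm = torch.nn.RMSNorm(dim)
            self.ff = torch.nn.Linear(dim, dim)

        def forward(self, x):
            # Don't keep the normalized activations around for backprop;
            # recompute them in the backward pass. A little extra compute
            # in exchange for a large cut in activation memory.
            return x + checkpoint(lambda t: self.ff(self.norm(t)), x, use_reentrant=False)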
Dispensing with tensor parallelism by kind of overlapping forward and backprop. That would not have been intuitive to me. (I do, however, make room for the possibility that I'm just not terribly good at this anymore.)
I don't know? I just think there's a lot of new takes in there.
I think some are reading my comment as critical of DeepSeek, but I'm more trying to say it is an infrastructural/engineering feat more so than an architectural innovation. This article doesn't even mention FP8. These have been by far the most interesting technical reports I've read in a while.
Flash attention was also built from techniques that were already common in other areas of optimized software, yet the big guys weren't doing those optimizations when it came out, and it significantly improved everything.
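The core of it is an old streaming trick: compute softmax-weighted sums block by block so the full n×n score matrix never has to exist. A rough, purely illustrative sketch of that idea (single head, no masking, nothing like the real fused kernel):

    import torch

    def attention_tiled(q, k, v, block=64):
        """Online-softmax attention over key/value blocks; never builds [n, n]."""
        n, d = q.shape
        scale = d ** -0.5
        out = torch.zeros_like(q)
        row_max = torch.full((n, 1), float("-inf"))
        row_sum = torch.zeros(n, 1)

        for start in range(0, k.shape[0], block):
            kb, vb = k[start:start+block], v[start:start+block]
            scores = (q @ kb.T) * scale                              # [n, block]
            new_max = torch.maximum(row_max, scores.max(-1, keepdim=True).values)
            fix = torch.exp(row_max - new_max)                       # rescale old running state
            p = torch.exp(scores - new_max)
            row_sum = row_sum * fix + p.sum(-1, keepdim=True)
            out = out * fix + p @ vb
            row_max = new_max
        return out / row_sum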
yes, i agree that low-level & infra work is where a lot of deepseek's improvement came from
I hate it so much that HN automatically removes some words in headlines like „how.“ You can add them after posting though for a while by editing the headline.
Perhaps an FAQ, but why the weird quote characters?
Because I‘m German and that‘s the way we use them in Germany. So my German mobile keyboard does this automatically, yes. Oftentimes I change it in English messages, sometimes it slips.
I'm also German and have never seen such weird quotes. Maybe this is some weird Windows charset issue, but it's definitely not a widespread way of quoting text.
It definitely exists in German as well as in some other European countries (https://en.m.wikipedia.org/wiki/Quotation_mark). I checked a few German newspapers; some use it, some don't.
I don't know why, but in my browser the closing quote is shown correctly (forward slanting) in the edit box, but backwards slanting when submitted. Weird.
„how“
Different fonts, most likely (I can't say for sure because I customized the CSS for HN using the Stylus browser extension)
I'm guessing they're using a mobile device with a keyboard that does this automatically.
Wait until you see how the French do quotes
Mandarin Chinese keyboards「have entered the chat」.
Do Chinese people actually do that? I thought 「」 has ever so slightly different bearing than "" but then again I don't speak Chinese
Sometimes. It's formal, but text autocomplete systems can sometimes insert them. The other form of quotes is similar to English quotes, but looks a bit different.
Has DeepSeek tackled the very weird hallucination problem at all? Reducing hallucinations now seems to be the remaining fundamental issue that needs scientific research; everything else feels like an engineering problem.
To me, the second biggest problem is that the models aren't really conversational yet. They can maintain some state between prompt and response, but in normal human-human interactions responses can be interrupted by either party with additional detail or context provided.
"Write Python code for the game of Tetris" resulting in working code that resembles Tetris is great. But the back and forth asking for clarification or details (or even post-solution adjustments) isn't there. The models dwell and draw almost entirely self-referentially from their own reasoning through the entire exchange.
"Do you want to keep score?" "How should scoring work?" "Do you want aftertouch?" "What about pushing down, should it be instantaneous or at some multiple of the normal drop speed?" "What should that multiple be?"
as well as questions from the prompter that inquire about capabilities and possibilities: "Can you add one 5-part piece that shows up randomly, on average every 100 pieces?" or "Is it possible to make the drop speed function as an acceleration rather than a linear drop speed?"... These are somewhat possible, but sometimes require the model to re-reason the entire solution again.
So right now, even the best models may or may not provide working code that generates something that resembles a Tetris game, but have no specifics beyond what some internal self-referential reasoning provides, even if that reasoning happens in stages.
Such a capability would help users of these models troubleshoot or fix specific problems or express specific desires... the Tetris game works but has no left-hand L blocks, for example. Or the scoring makes no sense. Everything happens in a sort of highly superficial approach where the reasoning is used to fill in gaps in the top-down understanding of the problem the model is working on.
I have my own automated LLM developer tool. I give a project description, and the script repeatedly asks the LLM for a code attempt, runs the code, returns the output to the LLM, asking if it passes/fails the project description, repeating until it judges the output as a pass. Once/if it thinks the code is complete, it asks the human user to provide feedback or press enter to accept the last iteration and exit.
For example, I can ask it to write a python script to get the public IP, geolocation, and weather, trying different known free public APIs until it succeeds. But the first successful try was dumping a ton of weather JSON to the console, so I gave feedback to make it human readable with one line each for IP, location, and weather, with a few details for the location and weather. That worked, but it used the wrong units for the region, so I asked it to also use local units, and then both the LLM and myself judged that the project was complete. Now, if I want to accomplish the same project in fewer prompts, I know to specify human-readable output in region-appropriate units.
This only uses text based LLMs, but the logical next step would be to have a multimodal network review images or video of the running program to continue to self-evaluate and improve.
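Roughly the shape of that loop, as a sketch; the `llm` callable, the prompts, and the PASS/FAIL convention here are placeholders for whatever API and wording the actual tool uses:

    import subprocess, sys

    def run_candidate(code: str) -> str:
        """Execute a candidate script and capture whatever it prints."""
        proc = subprocess.run([sys.executable, "-c", code],
                              capture_output=True, text=True, timeout=60)
        return proc.stdout + proc.stderr

    def develop(llm, project_description: str) -> str:
        history = [{"role": "user", "content": project_description}]
        while True:
            code = llm(history + [{"role": "user", "content": "Write the complete script."}])
            output = run_candidate(code)
            verdict = llm(history + [{"role": "user", "content":
                f"Script:\n{code}\n\nOutput:\n{output}\n\n"
                "Does this satisfy the project description? Answer PASS or FAIL."}])
            if "PASS" in verdict:
                feedback = input("Feedback (or press Enter to accept): ")
                if not feedback:
                    return code
                history.append({"role": "user", "content": feedback})
            else:
                history.append({"role": "user", "content":
                    f"The output was:\n{output}\nPlease fix the script."})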
are you familiar with
https://gorilla.cs.berkeley.edu/leaderboard.html
I was not, thanks for showing me!
> The models dwell and draw almost entirely self-referentially from their own reasoning through the entire exchange.
Sounds like a deficiency in theory of the mind.
Maybe explains some of the outputs I've seen from deepseek where it conjectures about the reasons why you said whatever you said. Perhaps this is where we're at for mitigations for what you've noticed.
This sounds like a prompting issue.
If your prompt instructs the model to ask such questions along the way, the model will, in fact, do so!
But yes, it would be nice if the model were smart enough to realize when it's in a situation where it should ask the user a few questions, and when it should just get on with things.
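For example, something as small as a system message along these lines already changes the behaviour noticeably (the wording and the chat-message format here are just an illustration, not any particular API):

    messages = [
        {"role": "system", "content":
            "Before writing any code, ask the user clarifying questions about "
            "requirements (scoring, controls, drop speed, etc.). Only start coding "
            "once the user confirms the spec is complete."},
        {"role": "user", "content": "Write Python code for the game of Tetris."},
    ]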
>Everything else feels like an engineering problem.
That's probably the key to understanding why the hallucination "problem" isn't going to be fixed: for language models, as probabilistic models, it's an inherent feature, and they were never designed to be expert systems in the first place.
Building a knowledge representation system that can properly model the world goes more into the foundations of mathematics and logic than engineering; the current frameworks like FOL are very lacking, and there aren't many people in the world working on such problems.
Hallucinations are a fundamental property of transformers; they can be minimized but never eliminated.
https://www.mdpi.com/1999-4893/13/7/175
> the open-domain Frame Problem is equivalent to the Halting Problem and is therefore undecidable.
Diaconescu's theorem will help you understand where Rice's theorem comes into play here.
Littlestone and Warmuth's work will explain where PAC learning really depends on a many-to-one reduction that is similar to fixed points.
Viewing supervised learning as parametric linear regression, thus dependent on IID, and unsupervised learning as clustering, thus dependent on AC, will help with the above.
Both IID and AC imply PEM; that's another lens.
Basically, for problems like protein folding, whose rules have the Markovian and ergodic properties, it will work reliably well for science.
The three basic properties (confident, competent, and inevitably wrong) will always be with us.
Doesn't mean that we can't do useful things with them, but if you are waiting for the hallucinations problem to be 'solved' you will be waiting for a very long time.
What this new combo of elements does do is seriously help with being able to leverage base models to do very powerful things, while not waiting for some huge groups to train a general model that fits your needs.
This is a 'no effective procedure/algorithm exists' problem. Leveraging LLMs for frontier search will open up possible paths, but the limits of the tool will still be there.
The stability of planetary orbits is an example of another limit of math, but JPL still does a great job regardless.
Obviously someone may falsify this paper... but the safe bet is that it holds.
https://arxiv.org/abs/2401.11817
Heck, Laplacian determinism has been falsified, but as scientists are more interested in finding useful models, that doesn't mean it isn't useful.
"All models are wrong, some are useful" is the TL;DR.
The problem is confabulations. In my benchmark (https://github.com/lechmazur/confabulations/), you see models produce non-existent answers in response to misleading questions that are based on provided text documents. This can be addressed.
> the open-domain Frame Problem is equivalent to the Halting Problem and is therefore undecidable.
Thank you. Code-as-data problems are innate to the von Neumann architecture, but I could never articulate how LLMs are so huge they are essentially Turing-complete and computationally equivalent.
You _can_ combinate through them, just not in our universe.
this is very wrong. LLMs are very much not Turing complete, but they are algorithms on a computer so they definitely can't compute anything uncomputable
"Essentially"
This is the #1 pedantic thing HN frequenters flail over.
Turing built a limited machine, bound to a finite tape.
Bravo.
Turing machines are typically described as having an infinite tape. It may not be able to access all of that tape in finite time, but the machine is not bound to a finite tape.
But it doesn't matter, it is an abstract model of computation.
But it doesn't matter: the Church–Turing thesis states that a function on the natural numbers can be calculated by an effective method if and only if it is computable by a Turing machine.
It doesn't matter if you put the algorithm on paper, on 1 or k tapes etc...
Rice's theorem, which I mentioned above, is like the Scott–Curry theorem in lambda calculus. Lambda calculus is Turing complete; that is, it is a universal model of computation that can be used to simulate any Turing machine.
The analogous problems with 'trivial properties' in TMs end up being recursively inseparable sets in lambda calculus.
There's nothing weird about so-called hallucination ("confabulation" would be a better term); it's the expected behavior. If your use case cannot deal with it, it's not a good use case for these models.
And yes, if you thought this means these models are being commonly misapplied, you'd be correct. This will continue until the bubble bursts.
There was an amusing tongue-in-cheek comment from a recent guest (Prof. Rao) on MLST .. he said that reasoning models no longer hallucinate - they gaslight you instead... give a wrong answer and try to convince you why it's right! :)
hallucinations decrease with scale and reasoning, the model just gets better and stops making stuff up.
o1 still hallucinates badly though.
False: facts only need to be seen once, and one misstep in reasoning and your CoT is derailed.
> one mis-step in reasoning and your CoT is derailed.
tell me you've never seen reasoning traces without telling me
From what I see, the Deepseek R1 model seems to be better calibrated (knowing what it knows) than any other model, at least on the HLE benchmark: https://lastexam.ai/