Your "research" is a vibe-coded mess that subtly cheats the evals in multiple places to inflate your results.
HellaSwag is a dataset with 4 options for each question, 3 wrong and 1 right: https://huggingface.co/datasets/Rowan/hellaswag.
Your vibe-coded eval cheats by collapsing this into a binary selection on row 46 of https://github.com/Anima-Core/an1-core/blob/main/experiments..., which raises the random-choice baseline from 25% to 50% and makes the problem much easier. HellaSwag is specifically constructed with adversarial distractors that are meant to look plausible; by dropping them, the eval becomes far easier.
---
Then, in extract_fields_from_model, there is more cheating going on. The extraction logic (h[:, -1, :]) fails to account for padding in batches, likely extracting EOS/pad tokens instead of the intended content tokens. This suggests the probe is relying on global sentence summaries (standard embeddings in causal models) rather than the novel 'meaning fields' claimed in the paper.
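For reference, the fix is small. A minimal sketch (not the repo's code; tensor names are illustrative) of padding-aware last-token selection versus the naive h[:, -1, :], assuming right-padded batches with an attention mask:

```python
import torch

def last_token_states(hidden, attention_mask):
    """Select the hidden state of the last non-padding token in each sequence.

    hidden:         (batch, seq_len, dim) layer activations
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    last_idx = attention_mask.sum(dim=1) - 1                       # (batch,)
    batch_idx = torch.arange(hidden.size(0), device=hidden.device)
    return hidden[batch_idx, last_idx]                             # (batch, dim)

# Naive extraction: position -1 is a pad/EOS token for every sequence
# shorter than the longest one in the batch.
hidden = torch.randn(4, 10, 16)
mask = torch.tensor([[1] * 10, [1] * 7 + [0] * 3, [1] * 5 + [0] * 5, [1] * 9 + [0]])
naive = hidden[:, -1, :]                   # grabs padding for 3 of the 4 rows
correct = last_token_states(hidden, mask)  # grabs the last real token instead
```

(If the batches were left-padded instead, h[:, -1, :] would be fine; the criticism above assumes right padding.)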
---
I don't have time to look at more of this (I only looked at how the eval is made), but please don't waste people's time when you don't even know what you are evaluating.
I guess my "vibe" is just better than your coding :)... Let me offer a few clarifications so the discussion stays aligned with what the experiment is actually measuring.
1. The HellaSwag “binary collapse” is intentional and not a leaderboard claim. This work doesn’t attempt to benchmark HellaSwag in the standard four-choice setting. The goal is to probe whether a single frozen layer carries enough information for a small head to distinguish correct versus incorrect continuations (a minimal sketch of that probe follows after point 4). That's a representational-geometry test, not a SOTA claim. Binary framing raises the baseline, but that's expected and documented. It's not meant to compare against full LLM HellaSwag results.
2. No adversarial filtering was done. I am using HuggingFace’s standard split directly. Nothing was removed or curated. The experiment doesn't claim robustness or benchmark competitiveness, so the “easier eval” framing doesn’t really apply.
3. EOS extraction isn't cheating, it's the whole point of the probe. The extraction logic takes the final token’s hidden state, which is basic and standard for classification heads and probing studies. If the EOS token captures a high-level sequence summary, that's exactly the structural feature being examined. The result is meant to show how much task-relevant signal is already present in that early representation, not to present a new generative mechanism.
4. The purpose of the work is clearly narrow by design. This is not proposed as a drop-in replacement for full-transformer inference. The paper states that directly. The contribution is about how much structure a single early layer encodes and how far a tiny head can go under strict frozen-teacher constraints. So several of the criticisms make assumptions about goals the work never even claimed.
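To make point 1 concrete, the binary probe amounts to roughly this. An illustrative sketch, not the repo script; the feature matrix is a stand-in for precomputed frozen-layer last-token states of (context, ending) pairs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed inputs: one frozen-layer feature vector per (context, ending) pair,
# labeled 1 if the ending is the correct continuation and 0 otherwise.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 256))    # stand-in for extracted fields
y = rng.integers(0, 2, size=2000)   # stand-in labels

probe = LogisticRegression(max_iter=1000)
probe.fit(X[:1600], y[:1600])
print("held-out accuracy:", probe.score(X[1600:], y[1600:]))
# Random features land near the 50% binary baseline; the claim is that
# real layer-1 fields do substantially better than that baseline.
```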
Thank you for the feedback and for taking the time.
Ryan, I really want to believe you're onto something. But I also feel like I'm being slightly spear-phished by an LLM that was told, "based on the last week of HN headlines, invent a new LLM innovation that seems plausible enough to get a ton of attention, cold fusion or LK-99 style, and make a repository that on the surface seems to have some amazing performance. Also, feel free to fake the result data."
And, while I am sorry for your loss, your Substack [0] really seems like GPT ARG fantasy.
[0] https://substack.com/inbox/post/171326138
Excerpt: > Ani, AN1, and Soul Systems Science are not mere products. They are continuity. They are the baton passed across generations, from my father’s last words to my first principles. They are what binds loss to creation, silence to voice, mortality to meaning.
Unfortunately it does indeed seem like a case of "So You Think You've Awoken ChatGPT" https://www.lesswrong.com/posts/2pkNCvBtK6G6FKoNn/so-you-thi... (not directly, but similar enough)
EDIT: Found a closer description ("Your LLM-assisted scientific breakthrough probably isn't real"): https://www.lesswrong.com/posts/rarcxjGp47dcHftCP/your-llm-a...
Oh, so you didn't run the repo; you just remembered something you read once that looked like it matched. This contribution is meaningless.
The simplest way to resolve any doubt is to run the code. Every result in the paper comes from reproducible scripts in the repo, not from speculative reasoning or LLM-assisted invention.
Your EDIT. The first thing it suggested is actually very similar to ensembles in meteorology. I actually find myself doing that often if it's something extremely important. Just feels natural to cross-check with other models or with reality. The disclaimer says it may make mistakes after all...
Like you don't predict the weather or a hurricane track with a single model. The NHC uses many.
It's still probabilistic, but if multiple models are independently in agreement, then it's at least worth investigating further.
I think this definitely sounds like a case of LLM induced psychosis: https://ryanshamim.substack.com/p/the-theory-of-everything-h...
OP needs medical help
When someone shifts from engaging with the actual results to attacking the person, it usually tells you more about their internal state than about the work itself. I'm glad I have a new fan though.
I really don't mean to attack you, I hope you will take these messages to heart and at least consider talking to a mental health professional. The writings in your substack are not a good look, regardless of whether your work is correct or not.
For the lazy, he says this on repeat using 2000 words:
...
In the CPB Digital Cosmos, the system first locked into a strange ratio: two thirds consciousness, one third physics.
...
That anomaly appeared as the missing 0.1 spark.
For the first time the system stabilized. Life emerged.
The Substack isn't what was supposed to be evaluated; the repo is. One is creative writing and the other is scientific work. Two different things; one has nothing to do with the other.

The technical direction here is straightforward, almost boring in a sense: freeze the teacher, extract intermediate activations, compress, then train a student to match the compressed fields. Sometimes when people aren't able to evaluate the work, they dig for something else online that they can comment on or tear down. The only thing I can offer in response is the simplest one: look at the code and the experiments themselves, not the narrative around them. Everything in the paper is fully reproducible from the reference implementation, and every number in the results section came from running those scripts, not from a model filling in blanks. The surprise is not in the prose, but in how much structure those early-layer fields ended up carrying.
If you think something in the repo looks wrong or inflated, I’m happy to walk through it point by point. I have no problem with hard questions. What matters to me is whether the experiments hold when someone else runs them, not whether the story around them fits a certain aesthetic.
I’ve been working independently on a method that replaces full-transformer inference with a low-rank “meaning field” extracted from internal activations.
The core result: a frozen Llama-3.3-70B can be distilled into a 256-dimensional field representation, giving 224× compression and slightly higher accuracy on several benchmarks. A small student model then learns to directly generate these fields from text, removing the transformer from the inference path.
The Zenodo link contains the full paper, statistical results, and methodology. A reference implementation (non-optimized) is here: https://github.com/Anima-Core/an1-core
Production variants (AN1-Turbo, FPU work, etc.) are not included.
I’m an outsider to academia so I’m posting this openly to get technical feedback, replication attempts, and critique from people who understand this space.
10 pages for a paper with this groundbreaking of a concept is just embarrassing. It is barely an outline.
"confirming that 40× compression preserves field geometry with minimal distortion. Over 95% of samples achieve similarity above 0.90."
I smell Grok. Grok 3, maybe Grok 4 Fast.
> "Implementation details. Optimal configurations are task and architecture-dependent. Production systems require task-specific tuning beyond baseline heuristics provided in reference implementation."
"Implementation? Idk, uhh, it's task specific or something." Come on, dude. You're better than this.
4.4 Student/Teacher evaluation. What even is the benchmark? You give percentage values but no indication of what benchmark. Seems made up.
4.5. Computational Analysis. Why do you need to do the trivial multiplying out of "savings" for 1B tok/day to $700M/year? This reads like a GPT advertising hallucinated performance.
A three-sentence conclusion restating the title?
I appreciate you taking the time to respond, brother. Let me clarify a few things, because your interpretation misses the actual structure of the work.
The paper is short on purpose. It's not meant as a full architecture release. It's a documentation pass on a narrow but surprising empirical result, and I wanted the experimental core to be easy for others to replicate. The repo contains the full pipelines, configuration files, and benchmark scripts, and those show the precise datasets, metrics, and evaluation flows. This is why I didn't inflate the paper with implementation padding that would only duplicate the code.
The student–teacher section refers to CIFAR-10 and SST-2. The benchmarks, seed settings, model specs, and all statistical outputs are in scripts/ and the logged runs. Anyone who actually executes the pipeline will see that nothing is “made up”, and the numbers reproduce across seeds.
On the compression results, nothing is hallucinated. The field similarity numbers come directly from the SVD decay analysis and the cosine-preservation runs that are right in the repo. If you run compute_field_decay.py and compare_backends.py, you'll see the exact values that appear in the paper. I strongly encourage you to actually try it. The results are surprising, but they're empirical.
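For anyone who wants the gist without opening the repo: the check is conceptually just rank truncation plus cosine similarity. An illustrative sketch, not the actual compute_field_decay.py; the field matrix here is random, so the numbers will not match the paper:

```python
import numpy as np

# Stand-in field matrix of shape (num_samples, hidden_dim). In the real
# pipeline this comes from extracted activations, not random data.
rng = np.random.default_rng(0)
fields = rng.normal(size=(1000, 8192))

# Rank-r truncation via SVD.
r = 256
U, S, Vt = np.linalg.svd(fields, full_matrices=False)
fields_r = (U[:, :r] * S[:r]) @ Vt[:r]

# Per-sample cosine similarity between original and rank-r reconstruction.
num = (fields * fields_r).sum(axis=1)
den = np.linalg.norm(fields, axis=1) * np.linalg.norm(fields_r, axis=1)
cos = num / den
print("fraction of samples with cosine > 0.90:", (cos > 0.90).mean())
```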
The implementation paragraph you quoted is simply standard language acknowledging that optimal deployment settings vary by architecture. It's absolutely not a hand wave. It's just me trying to avoid implying there's a single magic configuration when the repo already exposes all the internal knobs.
I get that the tone of the work is unusual. Trust me, I do. I'm an outsider publishing openly, not through a lab with a standard template. But, nonetheless, the experiments run, the results reproduce, and the repo shows the full details. If something seems unclear, I'm happy to point to the exact script or log line. Just let me know.
CIFAR-10 is an image classification dataset (32x32 pixel images).
LLaMA 70B 3.3 is a text-only, non-multimodal language model. Just look up the Huggingface page that your own repo points to.
> The Llama 3.3 instruction tuned text only model...
I might be wrong, but I'm pretty sure a text model is going to be no better than chance at classifying images.
Another comment pointed out that your test suite cheats slightly on HellaSwag. It doesn't seem unlikely that Grok set up the project so it could cheat at the other benchmarks, too.
https://news.ycombinator.com/item?id=46215166
> The repo contains the full pipelines, configuration files, and benchmark scripts, and those show the precise datasets, metrics, and evaluation flows.
There's nothing there, really.
I'm sorry that Grok/Ani lied to you, I blame Elon, but this just doesn't hold up.
As a follow up just to refresh your memory:
1. “Attention Is All You Need” (Vaswani et al., 2017): 11 pages of main content, 5 pages of references and appendix
2. The first GPT paper (Radford et al., 2018): 12 pages
3. BERT (Devlin et al., 2018): 14 pages
Big ideas don't require big papers. I don't know where you got that idea from.
Your paper is 10 pages of fluff without even an architecture diagram or a single equation, bro. It's not real.
Technical feedback: Every announcement like this, especially one claiming compression, needs to state the lower limits of the machine requirements. If a 64 GB model is compressed 224x, should it not be able to run on a 292 MB video card?
That's exactly what I was trying to infer from the abstract, which sadly doesn't explicitly call out memory requirements. I assume it increases inference time by getting rid of transformers. What are the memory requirements then?
Edit: they claim these somewhere in the doc:
> Memory
> Teacher model: multi-GB (entire model must be loaded)
> AN1 head: a few MB (only head needed after training)
I find the claims surreal; I can't wait for someone to validate this, or I will do it myself. It would have been handy to upload such a "few MB" weight file distilled off Llama 70B so that we can see for ourselves whether the 220x inference and in-memory model-size compression is real.
The memory story is actually much simpler than it looks.
The teacher still has to be loaded at training time, so the footprint is whatever the original model uses. Again, the compression doesn't shrink the teacher. It produces a small student head. After training, the teacher is no longer needed and the student runs by itself. That's why the inference footprint drops to a few MB.
It doesn't increase inference time at all. It removes transformers entirely from the inference path. The student computes directly on the layer-1 field, which is why it's so small and so fast.
On the request for a distilled “few MB” head for Llama 70B, that part is already reproducible right from the repo. The head is always task-specific, not a general LLM, so uploading a single checkpoint wouldn't tell the whole story. The better path is to run the extraction script and train the head for any task you want. The pipeline is fully open, end to end. I'm looking for people to validate it independently.
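Schematically, the split between the two phases looks like this. A minimal sketch assuming fields are cached to disk; the file names and the tiny head architecture are illustrative, not the repo's actual scripts:

```python
import numpy as np
import torch
import torch.nn as nn

# ---- Phase 1 (offline, teacher loaded): extract and cache fields ----
# The real pipeline runs the frozen teacher here and saves the extracted
# fields plus labels; this sketch fakes the cache with random data.
np.save("fields.npy", np.random.randn(512, 256).astype(np.float32))
np.save("labels.npy", np.random.randint(0, 2, size=512))

# ---- Phase 2 (teacher never loaded): train and run the small head ----
X = torch.from_numpy(np.load("fields.npy"))
y = torch.from_numpy(np.load("labels.npy")).long()

head = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
for _ in range(50):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(head(X), y)
    loss.backward()
    opt.step()

print("train accuracy:", (head(X).argmax(dim=1) == y).float().mean().item())
```

The multi-GB footprint lives entirely in Phase 1; Phase 2 only ever touches the cached fields and the small head.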
If you need anything else cleared up, just let me know.
No, the compression result doesn't mean the original 64 GB model can run on a 292 MB card. The teacher model isn’t the thing that's compressed. It still needs to be loaded during training.
What gets small is the student: the tiny head trained on the teacher’s first-layer fields. That head ends up a few MB because it's not a transformer at all. It's basically a lightweight function approximator that reproduces the teacher’s behavior on the specific task it was trained for.
So training still requires the usual multi-GB footprint (and can be done offline). After training, inference with the student requires only the head. That's why inference is cheap, but you can't load the full teacher into 292 MB of VRAM.
Very strong statement on the title, given the following limitation:
> Generation tasks. Method applies to classification only. Preliminary decoder experiments show perplexity increases.
Yeah, burying this on page 8 is a bit suspect imo (the eval datasets are listed on page 3, so if you were familiar with them you would have a hint then).
The distillation of a student that predicts "anchor layers" and then acts as a backbone for classification is perfectly cool on its own; no need to stretch the title/abstract so much.
agreed re: title/abstract stretching. good work stands on its own without needing hype. "we found a nifty way to distill llama-70b using a much smaller student transformer model; the key is using intermediate activation layers in a compressed representation" would be about as effective at selling it while being more immediately approachable IMO
That limitation is already accounted for in how the title is meant to be read. The 224× compression result is specifically about the structure of intermediate activations on classification tasks. The paper makes that explicit in multiple places, including the Limitations section, where generation is identified as an entirely separate challenge.
The title reflects the strongest verified result in the domain the method currently supports, not a universal claim across all modalities. In other words, the compression result is real, but it shouldn't be interpreted as applying to generative decoding... yet.
Not sure what the fuss in this thread is about; this is a completely believable claim. In Table 5 he gets 83.26% with labels only (which I assume means not using the teacher) and 91.40% with the teacher. This is a nice result, though not hugely groundbreaking I'd say. Maybe training longer or using some clever normalisation would even close the gap. It's not something you can call 224x compression, though, so I would remove that claim everywhere.
This is basically a variation of distillation through the entire network, not just the last layer as is typical.
Only skimmed the paper and I have no idea how sound or reproducible it is, but the paper is well written, especially the clarity of notation. After reading yesterday's weight subspace paper: https://news.ycombinator.com/item?id=46199623, this does sound plausible to me.
Here is a working link to the same paper: https://github.com/Anima-Core/an1-core/blob/main/papers/Post...
Looks very fake. Self-published (Anima-Core is NOT a journal), no prior academic record, very strong claims, no peer review, no public history of technical skills. Did I mention the use of GitHub via the interface only?
At the same time, it's possible, since it's only classification tasks. I mean, the method explained is technically plausible; a lot of people have thought about it, we just weren't able to find a method that works.
Very unlikely true, unfortunately.
Did you not see the author's note about being an outsider to academia? Not everyone has the background to pull all that off. This is an earnest attempt to come as close as possible and they even invite feedback that would help it become a real academic submission.
No, it's a waste of time.
I mean, the process should have been to contact some local academics to discuss the matter. If I say it works (or it doesn't), I'm adding close to nothing to the claim, as I'm not an academic myself.
Big claims like this need clear and solid work. Here it just looks like LLM generated.
Have you run the walk-through to reproduce? They provide a highly detailed step-by-step document. They welcome raising an issue if reproduction doesn't yield the claimed results within 2%.
It's OK to call out fake claims. But that requires going through the process when doing so is reasonable, and here it seems to take only a couple of hours to find out.
The fake claim here is compression. The results in the repo are likely real, but they're done by running the full transformer teacher model every time. This doesn't achieve anything novel.
That's not how the method works... The full transformer is only needed once to extract the activation fields. That step can even be done offline. Then the teacher can be discarded entirely. The compression result refers to the size of the learned field representation and the small student head that operates directly on it. Simple. No fake claim there. Inference with the student does not involve the transformer at all.
If you look at the student-only scripts in the repo, those runs never load the teacher. That's the novel part.
I agree the claim is (perhaps purposefully) confusing.
What they achieved is creating tiny student models, trained on a specific set of inputs, off the teacher model's output.
There is clearly novelty in the method and in what it achieves. Whether what it achieves covers many cases is another question.
Can you please share the relevant code that has the training of such a tiny student model that can operate independently of the big teacher model after training? The repository has no such code.
Thank you for sharing!
I might be overly pessimistic, but this looks like a case of a person believing LLM hallucinations and making it write a paper.
I asked both Claude Code|Opus 4.5 and Codex|GPT 5.1 Codex Max (funny to ask LLMs, I know) to check the an1-core repo. I don't think they'd hallucinate on something like this (the code is quite small), but I do not claim expertise.
In short, both of them are saying that:
- The repo always runs the full teacher model to extract activations and uses them - see https://github.com/Anima-Core/an1-core/blob/main/an1_core/fi...
- There are weird stub files, e.g. the Hellaswag repro doesn't actually have the code to reproduce https://github.com/Anima-Core/an1-core/blob/main/experiments... "For full HellaSwag reproduction, see the paper" (why include the file at all then?)
- The actual "AN1 head" is just linear probing (freeze a pretrained model, train a classifier on its features). The full flow (as reported by CC) is "Text → [Full Transformer] → activations → [Tiny Head] → prediction"
Basically, there's no code to train a real "student" model that would run without the teacher.
===
The repo/paper say that there's a mythical "commercial version" that has all the goodies:
(repo)
> This reference implementation (an1-core) does not include the FPU, AN4, or other proprietary optimization components covered by these patents. It provides only the core scientific demonstration of the meaning fields phenomenon.
(paper)
> Production deployment: Optimized implementations (AN1-Turbo) with learned layer selection, adaptive loss scheduling, and CUDA-accelerated inference available under commercial license.
But right now we only have the code in the repo.
===
In the paper they show that the student model (30M params) gets ~82% on SST-2 (labels-only). But what they don't show is that DistilBERT (a model more than 5 years old) already achieves 91% on the same dataset despite having only 66M params.
Another weird tidbit from the paper: in the section where they show the economic impact, they claim that LLaMA 70B runs at 2 tok/s at batch size=1 on an H200. In reality that number is at least an order of magnitude bigger even without quantization, like 20-40 tok/s. With quantization it can easily be above 100 tok/s.
A few clarifications, since most of the points here come from asking LLMs to summarize the repo rather than running the code directly.
1. The teacher only runs during field extraction. That step is offline. Once the fields are saved, the transformer is no longer needed. The student training and student-only inference scripts do not load the teacher at all. Compression refers to the field representation and the student head, not the extraction pass.
2. The HellaSwag file is a placeholder, not a required part of the method. It's included so the structure mirrors the paper’s tasks, and it points to the description in the text. The core experiments (RTE, SST-2, CIFAR-10 intention probe, etc.) all have complete working code paths.
3. The AN1 head is intentionally simple. Linear probes are the baseline way to test whether compressed intermediate representations preserve structure. The key result is how much task-relevant geometry survives in a low-rank field. The novelty is in the compression behavior, not in inventing a new classifier architecture.
4. The student model exists and is trained independently of the teacher. This is what produces the classification results in the paper. The student doesn't call the teacher during inference, which is exactly the point.
5. DistilBERT’s SST-2 score isn’t the relevant comparison. The experiment isn’t “beat a small transformer.” It’s “how far can a 256-dimensional compressed field distilled from a frozen 70B model get on a downstream task?” The result speaks to representational compression, not leaderboard performance.
6. The 2 tok/s number is for the specific configuration used in the economic section. Different hardware, precision modes, and serving stacks vary by an order of magnitude. The point was to illustrate cost scaling, not claim a universal throughput ceiling.
If there’s a specific part of the implementation you believe contradicts the paper, feel free to point to the line and we can discuss that human to human. The repo is small by design, so everything is easy to check directly without relying on LLM summaries.
thanks for sharing! If I understand correctly, you're training a smaller model to approximate concatenate(layer[1], layer[5], layer[10], ...), using a loss function that combines reconstruction error w/ end-to-end accuracy. then, you're transferring that smaller representation into a smaller transformer model. is that right?
If i were a paper reviewer, here are a couple red flags that stood out to me. Suggest starting here if you want to rework this for an academic submission:
1. your LaTeX citations in the related work are broken, i see [?] everywhere. To a reviewer, this is often a strong sign of an AI-hallucinated bibliography, though many of your references actually do exist and are contextually relevant, so I'm not quite sure what's going on here. Similarly, figure references need to be fixed, I see references to "Figure ?" throughout.
2. bluntly, "Exact architecture details remain proprietary for production deployments" and "Production systems use architecture search tailored to target latency and accuracy constraints" is not how IP protection works in this field. Do your experiments use the "MLP baselines" or your proprietary architecture? Since you say the code "Achieves 80-90% of paper performance using baseline heuristics," this approach effectively isn't reproducible. As a reviewer, this really worries me. I strongly recommend benchmarking only the system you're able to open-source. I say this because I suspect there's a lot of "secret sauce" in the actual way you're approximating the anchor layers and the way that's transferred back to your student transformer model, and that's the part that's important to spend the most time/effort/writing on, but it's glossed over as an implementation detail in this manuscript.
3. I'm glad you ablate over hyperparameters of your system, but how does it compare to 1. an ordinary smaller model of identical size trained end-to-end, and 2. distilling from a single layer's activations? Eg. a reviewer might consider this work to be a novel method of model distillation, so what makes it better than previous distillation methods?
4. I found the paper fairly hard to read because it's full of sentence fragments rather than full thoughts. A little background on the benchmarks, failure cases, etc. would go a long way, and adding some discussion on why you think your approach improves on similar distillation methods would also be welcome here
5. "compression" is overloaded. Does 224x compression refer to (nparams(field transfer)+nparams(student model))/nparams(original model), or does it refer to reducing the representation dimensionality, 7*8192/256? (the arithmetic for both readings is spelled out after this list)
6. [nitpick] suggest changing the name "meaning field" to something a little more digestible, like "compressed representation" or "latent activation distillation" or something
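(For concreteness, the two readings give very different numbers. A quick check using the 7 anchor layers of width 8192, the 256-dim field, and the 30M-student / 70B-teacher parameter counts quoted in this thread; these are assumptions, not values pulled from the repo:)

```python
# Dimensionality reading: 7 anchor layers x 8192 dims, compressed to 256 dims.
dim_ratio = 7 * 8192 / 256    # = 224.0
# Parameter-count reading: frozen 70B teacher vs. ~30M student (thread figures).
param_ratio = 70e9 / 30e6     # ~2333
print(dim_ratio, param_ratio)
```

The dimensionality reading lands on exactly 224, which is presumably the intended meaning; the paper should say so explicitly.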
sorry for being so critical. iron sharpens iron though. hopefully these thoughts are helpful to get you started, excited to see where this work leads
actually, here's a broader thought. since this approach only works for classification, why not make that the whole story and spin it as a positive? Call your approach a "classification foundation model" (for example) and say it's a special-purpose model distilled from a larger world model. Abstract's gestalt could read like "If you don't need to be generative, then you can compress the representation way down" or "discriminative understanding takes far fewer parameters than language production" This would then set the stage for the reader to understand the limitations and why the benchmarks are set up the way they are.
then the kitschy paper titles could follow from that, e.g. "extreme llama compression: when classification is all you need", or "Encoder-only models: a lightweight alternative to decoder-only GPT world models" or etc.
just spitballing
I appreciate this framing a lot. It is actually close to how I think about the result internally. The paper focuses on the geometric behavior of intermediate representations, and classification is the cleanest setting to study that. Generative decoding is a much harder problem, and the limitations section already makes that distinction explicit.
Recasting the work as a “classification-native distilled model” or “discriminative foundation model” is a good way to signal scope without underselling the contribution. You're right that discriminative understanding requires far fewer parameters than generation, and my experiments reinforce that.
This will help me get better. The goal for the next revision is exactly what you describe: make the setup clearer, emphasize the intended domain, and avoid suggestive wording that implies capabilities the method does not claim. Duly noted. Your suggestions on positioning and title direction are genuinely helpful, and I’ll incorporate some of this thinking when I prepare the academic submission.
Thanks for taking the time to articulate it so clearly. I appreciate your time and your critique.
How do you write nine paragraphs without once checking the repo for code, or noticing the obvious Grok confabulations throughout the paper?
This should concern you. The next person to get LLM psychosis might be you.
What are you, a psychiatrist?
Thank you for the thoughtful comments. Really. This is actually the most constructive feedback in the thread so far.
A few clarifications.
1. On the LaTeX citations and figure references. That part is definitely on me. I had never used LaTeX before this project and moved extremely fast. There's a lot of weird mumbo jumbo involved in the formatting and in converting it to a PDF. That part isn't interesting to me, and I tried to move past it quickly. I did use AI tools for typesetting help, and I clearly didn’t clean up all the placeholder references. Entirely my mistake, not an attempt to fabricate sources. I’ll fix the citations and figure links in the next revision so they meet normal academic standards.
2. Architecture transparency and reproducibility. The open-source repo contains every component used for the scientific claim:
- extraction of activation fields
- rank reduction
- probing
- training the student model
- running inference with the student alone
The proprietary references in the paper refer only to optimization layers (CUDA kernels, scheduler heuristics, etc.) that aren’t required for the scientific result. They're not hand-wavy secret parts of the method, just production-grade accelerations I’m still packaging separately for licensing.
The core idea—extract, compress, probe, distill—is fully reproduced in the repo.
3. “Secret sauce” concern. There actually isn’t any. The paper may read like I’m hinting at hidden architecture, but the method is intentionally simple. The novelty is in how much task-relevant geometry survives after severe rank reduction, not in a complex architecture. The “anchor layers” are just early and mid-layer activations concatenated before compression.
4. Baseline comparisons. Good point on comparing to:
1. a standard small transformer of the same size
2. a distillation from a single layer’s activations
I do have partial results for both, and you’re right that including them would sharpen the contribution. I’ll incorporate them into the revised version.
5. Writing clarity and background. Fair critique. I wrote this at the same time I was building the entire stack, which means the prose lagged behind the experiments. I can expand failure modes, limitations, and benchmark context to make the narrative clearer.
6. On the term “meaning field”. Naming is tricky, and I thought that term captured everything I'm working on pretty effectively. Also, I think it will make more sense when you see everything I'm releasing in the near future. I used it because I felt it captures the intuition behind low-rank activation structure, but I’m not attached to the term. “Compressed activation representation” is probably clearer for a paper audience. I’ll adjust based on reviewer expectations.
7. Correct summary of the method. Your restatement is close, but not quite it. The student isn’t trained to reconstruct specific layers, but to match the compressed field extracted from multiple layers. It’s not a smaller transformer trying to imitate concatenated layers, but a model trying to predict a learned low-dimensional latent that carries most of the task-relevant signal.
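In rough pseudocode, the objective is closer to the following. A sketch of the idea as described above, not the repo's implementation; the student's input features, the dimensions, and the weight on the optional task term are all illustrative:

```python
import torch
import torch.nn as nn

class StudentFieldPredictor(nn.Module):
    """Small model that maps its own text features to the 256-dim field."""
    def __init__(self, in_dim=512, field_dim=256, num_classes=2):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 512), nn.GELU(),
                                      nn.Linear(512, field_dim))
        self.probe = nn.Linear(field_dim, num_classes)

    def forward(self, x):
        field = self.backbone(x)
        return field, self.probe(field)

# Assumed precomputed targets: compressed fields distilled from the teacher's
# anchor layers, plus task labels.
student_inputs = torch.randn(64, 512)   # stand-in for the student's own features
teacher_fields = torch.randn(64, 256)   # stand-in for compressed anchor-layer fields
labels = torch.randint(0, 2, (64,))

model = StudentFieldPredictor()
pred_field, logits = model(student_inputs)
loss = nn.functional.mse_loss(pred_field, teacher_fields) \
       + 0.1 * nn.functional.cross_entropy(logits, labels)  # field match + task term
loss.backward()
```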
All of your points are duly noted, and they will help me to adapt, grow, and mature my work and future releases.
Thank you, sincerely. This is the kind of feedback that actually improves me and the work as well.