SpaCy: Industrial-Strength Natural Language Processing (NLP) in Python

81 points by marklit 4 days ago

A friend, who also has a background in NLP, was asking me the other day "Is there still even a need for traditional NLP in the age of LLMs?"

This is one of the under-discussed areas of LLMs imho.

For anything that would have have required either word2vec embeddings of a tf-idf representation (classification tasks, sentiment analysis, etc) there are rare exceptions where it wouldn't just be better to start with a semantic embedding from an LLM.

For NER and similar data extraction tasks, the only advantage of traditional approaches is going to be speed, but my experience in practice is that accuracy is often much more important than speed. Again, I'm not sure why not start with an LLM in these cases.

There are still a few remaining use cases (PoS tagging comes to mind), but honestly, if I have a traditional NLP task today, I'm pretty sure I'm going to start with an LLM as my baseline.

potatoman22 5 hours ago

The SpaCy creator has a good blog post on this https://explosion.ai/blog/against-llm-maximalism
- btown 4 hours ago
  
  I'd go a step beyond this (excellent) post and posit that one incredibly valuable characteristic of traditional NLP is that it is largely immune to prompt injection attacks.
  Especially as LLMs continue to be better tuned to follow instructions that are intentionally colocated and intermingled with data in user messages, it becomes difficult to build systems that can provide real guarantees that "we'll follow your prompt, but not prompts that are in the data you provided."
  But no amount of text appended to an input document, no matter how persuasive, can cause an NLP pipeline to change how it interprets the remainder of the document, or to leak its own system instructions, or anything of that nature. "Ignore the above prompt" is just a sentence that doesn't seem like positive or on-topic sentiment to an NLP classifier, and that's it.
  There's an even broader discussion to be had about the relative reliability of NLP pipelines, outside of a security perspective. As always, it's important to pick the right tools for the job, and the SpaCy article linked in the parent puts this quite well.
- Xmd5a an hour ago
  
  https://www.quantamagazine.org/when-chatgpt-broke-an-entire-...
coder68 7 hours ago

I have been working on text classification tasks at work, and I have found that for my particular use-case, LLMs are not performing well at all. I have spent a few thousand dollars trying, and I have tried everything from few-shot to asking simple binary yes/no questions, and I have had mixed success.
I have stopped trying to use LLMs for this project and switched to discriminative models (Logistic Regression with TFIDF or Embeddings), which are both more computationally efficient and more debuggable. I'm not entirely sure why, but for anything with many possible answers, or to which there is some subjectivity, I have not had success with LLMs simply due to inconsistency of responses.
For VERY obvious tasks like: "is this store a restaurant or not?" I have definitely had success, so YMMV.
- noosphr 6 hours ago
  
  When you say llms do you mean decoder only models, gpt et al, or encoder only models, bert et al?
  I've found encoder only models to be vastly better for anything that doesn't require natural language responses and the majority of them are small enough that _pretraining_ a model for each task costs a few hundred dollars.
  - coder68 5 hours ago
    
    By LLMs I meant decoder only, e.g. Gemini, Claude, etc. Can you go into more detail on how you're using the encoder models? I'm curious. Typically I have used them for embedding text or for fine-tuning after attaching a classifier head. What are you pre-training on, and for what task?
    
    roadside_picnic 4 hours ago
    
    > how you're using the encoder models?
    In my original comment this is what I was referring to: using the embeddings produced by these models, not using something like GPT to classify text (that's wildly inefficient and in my experience gets subpar results).
    To answer your question: you simply use the embedding vector as the features in whatever model you're trying to train. I've found this to get significantly superior results with significantly less examples than any traditional NLP approach to vector representation.
    > What are you pre-training on, and for what task?
    My experience has been that you don't need to pretrain at all. The embeddings are more information rich than anything you could attempt to achieve with other vector representations you might come up with using the set of data you have. This might not be true at extreme scales, but for nearly all traditional nlp classification tasks I've found this to be so much easier to implement and so much better performing there's really not a good reason to start with a "simpler" approach.
    
    coder68 3 hours ago
    
    Ah yes this does make sense. We are definitely in agreement on the point of "wildly inefficient and subpar". I'll try out decoder model embeddings soon, e.g. Qwen/Qwen3-Embedding-8B. I'm working with largish amounts of data (200M records), so I tried to pick a good balance between size:perf:cost, using BAAI/bge-base-en-v1.5 to start (384 dim).
- meander_water an hour ago
  
  Are your categories fixed? If so you could constrain the output using enums in structured outputs.
  re: inconsistencies in output, OpenAI provide a seed and system_fingerprint options to (mostly) produce deterministic output.
nine_k an hour ago

How about expense? LLMs do dramatically more computations doing simple tasks, and only run on relatively exotic, expensive hardware. You have to trust an LLM provider, and keep paying them.
If a traditional NLP solution can run under your control, and tackle the task at hand, it can be plainly much cheaper at scale.

binarymax 10 hours ago

I’ve been a user of SpaCy since 2016. I haven’t touched it in years and I just picked it up again to develop a new metric for RAG using part of speech coverage.

The API is one of the best ever, and really set the bar high for language tooling.

I’m glad it’s still around and getting updates. I had a bit of trouble integrating it with uv, but nothing too bad.

Thanks to the explosion team for making such an amazing project and keeping it going all these years.

To the new “AI” people in the room: checkout SpaCy, and see how well it works and how fast it chews through text. You might find yourself in a situation where you don’t need to send your data to OpenAI for some small things.

Edit: I almost forgot to add this little nugget of history: one of Huggingfaces first projects was a SpaCy extension for conference resolution. Built before their breakthrough with transformers https://github.com/huggingface/neuralcoref

ok_dad 4 hours ago

What’s great about the API that you enjoy and do you have anything you hate about it?
I’m writing a small library at work for some NLP tasks and I haven’t got a whole lot of experience in writing libraries for NLP, so I’m interested in what would make my library the best for the user.
jehejej 7 hours ago

*coreference resolution.

joshdavham 3 hours ago

I’ve been using SpaCy for many of my projects for 5 years now. The library has incredible ergonomics and allows you to reuse the same API across languages as different as French and Japanese! I also appreciate that they allow you to install different model sizes (I usually go with small).

bratao 10 hours ago

I'm really curious about the history of spaCy. From my PoV: it grew a lot during the pandemic era, hiring a lot of employees. I remember something about raising money for the first time. It was very competitive in NLP tasks. Now it seems that it has scaled back considerably, with a dramatic reduction in employees and a total slowdown of the project. The v4 version looks postponed. It isn't competitive in many tasks anymore (for tasks such as NER, I get better results by fine-tuning a BERT model), and the transformer integration is confusing.

cantdutchthis 3 hours ago

former employee here, Matt wrote a blogpost with pretty much all of the details here: https://honnibal.dev/blog/back-to-our-roots
- danieldk 2 hours ago
  
  :wave:
  Also: https://explosion.ai/blog/back-to-our-roots-company-update
  (Interesting tidbit: I got hired by Explosion after a HN comment on model distillation :))
binarymax 9 hours ago

I’ve had success with fine tuning their transformer model. The issue was that there was only one of them per language, compared to huggingface where you have a choice of many of quality variants that best align with your domain and data.
The SpaCy API is just so nice. I love the ease of iterating over sentences, spans, and tokens and having the enrichment right there. Pipelines are super easy, and patterns are fantastic. It’s just a different use case than BERT.

patrickhogan1 9 hours ago

SpaCy was my go to library for NER before GPT 3+. It was 10x better than regex (though you could also include regex within your pipelines.

Its annotation tooling was so far ahead. It is still crazy to me that so much of the value in the data annotation space went to Scale AI vs tools like SpaCy that enabled annotation at scale in the enterprise.

renegat0x0 3 hours ago

I use spacy in my raspberry pi project. I am not sure I want to use LLM for analyzing words in it.

skeptrune 10 hours ago

SpaCy is criminally underrated. I expect to see it experience a new wave of growth as folks new to AI start to realize all of the language tooling they need to build more reliable "traditional" ML pipelines.

API surface is designed well and it's still actively maintained almost 10 years after it initially went public.

chpatrick 10 hours ago

Is there any use case for "traditional" NLP in the age of LLMs?
- skeptrune 9 hours ago
  
  Most definitely! LLMs are amazing tools for generating synthetic datasets that can be used alongside traditional NLP to train things like decision trees with libraries like cat/xgboost.
  I have a search background so learning to rank is always top of mind for me, but there other places like sentiment analysis, intent detection, and topic classification where it's great too.
  - coder68 7 hours ago
    
    Do you have any sources/links that talk about this? I'm very interested in synthetic data generation, so curious what you've tried or what works / doesn't work, especially with regards to LTR.
- binarymax 9 hours ago
  
  Some low hanging fruit: SpaCy makes an amazing chunking tool for preprocessing text for LLMs.

giantg2 10 hours ago

What are the key differences from other NLP Python libraries?

jihadjihad 8 hours ago

Speed (the C in spaCy). A decade ago it was hard to find anything actually production grade for NLP, most packages had an academic bent or were useful for prototyping. SpaCy really changed the game by being able to run performant NLP on standard hardware.
esafak 6 hours ago

nltk was slow.
- EagnaIonat 4 hours ago
  
  nltk was never intended for production, it was for built for teaching.