Babawomba 2 days ago

The material is definitely practical—Kafka, Docker, Kubernetes, and Jenkins are all industry-standard tools, and the focus on MLOps is refreshing. It’s great to see a course bridge the gap between ML and actual production systems, not just stop at building models. Love that they're also tackling explainability, fairness, and monitoring. These are the things that often get overlooked in practice.

Is it too entry-level? Looking at the labs, a lot of this seems like stuff a mid-level software engineer (or even a motivated beginner) could pick up on their own with tutorials. Git, Flask, container orchestration... all useful, but pretty basic for anyone who's already worked in production environments. The deeper challenges—like optimizing networking for distributed training or managing inference at scale—don’t seem to get as much attention. Maybe it comes up in the group projects?

Also wondering about the long-term relevance of some of the tools they’re using. Jenkins? Sure, it’s everywhere, but wouldn’t it make sense to introduce something more modern like GitHub Actions or ArgoCD for CI/CD? Same with Kubernetes—obviously a must-know, but what about alternatives or supplementary tools for edge deployments or serverless systems? Feels like an opportunity to push into the future a bit more.

  • underdeserver 2 days ago

    Too entry level? Even if every tool is entry level, tying them all together and actually making it work is hard. I'd say it's mid-to-late B.Sc. material.

    Relevance? Is there really a huge conceptual difference between Jenkins and the other CI/CD frameworks? If not, if I were them I would just choose a random popular one, and it seems to me that's just what they did.

    • ggddv 2 days ago

      It’s kind of funny how all those supposedly complicated technologies are actually pretty simple once you understand why you are using them. Docker is the best example: it’s hard to understand what is happening unless you understand the problem it’s solving.

  • kkukshtel a day ago

    I think what you're missing here is that this is now _the_ entry point for year 1 CS students. People come in wanting to do ML. 20 years ago people came in, learned to write databases with Java, and used similarly "will probably be deprecated" tools. This is just the new starting point.

  • amelius 2 days ago

    > Also wondering about the long-term relevance of some of the tools they’re using.

    That's what I was wondering about too. It seems to me that eventually someone will build a tool that runs any neural network on any hardware, whether local on one machine, or distributed in the cloud.

belter 2 days ago

This seems to have very little on Data Quality, and it's in Chapter 16... How much practical experience in industry do the authors have? Because 90% of your time will be spent on Data Quality and Data Cleansing...

  • szvsw 2 days ago

    Arguably that’s a separate (obviously critical) concern. I think it’s worth it to abstract that away as just a step that exists in the pipeline with its own set of concerns/challenges/methods etc that really requires its own deeper study to do well.

    For instance, my ML work is almost entirely in the context of engineering simulation regression/surrogate development, where data quality/cleaning is hardly an issue at all - all of the work is on the dataset generation side and on the model selection/training/deployment side.

    Every job is different!

  • bigmutant a day ago

    Agree, Data Quality in-the-wild is a huge concern. I've led efforts to establish Lineage/Quality in large orgs and doing this after-the-fact is a massive undertaking. Having this "up-front" before all the data pipelines (origination, transformation, pre-processing) calcify saves a lot of headache down the road.

golergka 2 days ago

Fascinating; I just looked through the labs, and as a fullstack developer without that much experience in LLMs, it looks like I'm already closely familiar with half of them (git, flask, kafka, kubernetes) and the other half is just... code. No crazy math that I've come to associate with ML.

Does it mean that ML ops is a field that's actually not that hard to approach for a regular developer without a PhD?

  • itissid 2 days ago

    You can do a lot of work on MLOps and get very far without knowing much about ML. On a team with senior ML engineers, you are helping them scale and build stuff.

    Like, say you want to generate tons of synthetic data using simulations: you are likely to be more interested in questions of batching, encoding formats, data loading, etc. than in the actual process of generating unbiased datasets.

    If you need to collect and sample data from crowdsourcing, you likely need to know less about reservoir sampling itself than about how to do it online so it's fast, or how to be efficient with the $$$/compute spent on implementing the solution, etc.
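
    For reference, the textbook streaming algorithm itself (reservoir sampling, "Algorithm R") is only a few lines; the hard part is the engineering around it. A minimal sketch in Python, with illustrative names:

        import random

        def reservoir_sample(stream, k):
            # Keep a uniform random sample of k items from a stream of unknown length.
            reservoir = []
            for i, item in enumerate(stream):
                if i < k:
                    reservoir.append(item)
                else:
                    # Item i survives with probability k/(i+1); if it does, it evicts a random slot.
                    j = random.randint(0, i)
                    if j < k:
                        reservoir[j] = item
            return reservoir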

    • jms55 2 days ago

      Reservoir sampling as in the stuff that's used in ReSTIR for graphics?

      It's funny to me where statistics ends up sometimes.

  • Mr-Frog 2 days ago

    the PhD requirement is either for actual research positions or a gatekeeping function to help companies narrow down their massive candidate pools

  • absolutelastone 2 days ago

    I don't think "Ops" roles generally require a PhD.

  • thecleaner 2 days ago

    Quite right. It's just software engineering with a fancy name. This work classification is only slightly better thought out than DevOps. In most companies, ML engineers are engineers who understand software and some parts of ML; in the best cases they are good at both, in the worst cases terrible at both.

dexwiz 2 days ago

Is there somewhere I could follow along with other non-students?

daft_pink 2 days ago

Can anyone sign up or do we have to get accepted into one of the top computer science programs in the country?

stressinduktion 2 days ago

Does anyone know about literature or courses regarding building machine learning cluster infrastructure? I am mainly interested in building and scaling up the storage infrastructure, networking and scheduling approaches.

  • thecleaner 2 days ago

    Nothing fancy. The core principles are the same; you'll need to adapt them depending on the kinds of workload changes that ML introduces. For most ML systems:

    1. Storage infra: assuming storage for models or even data, use any blob storage like S3, or a shared networked file system like EFS, Lustre, etc.
    2. Networking: if you're talking about networking for large GPU clusters, I am not aware of any definitive resource on this.
    3. Scheduling: this is honestly a solved problem at this point and anything works - write your own coordinator that periodically runs Docker-image-based jobs (you can hook one up quite quickly using some sort of system for metadata and triggers powered by message queuing), use Airflow, or use AWS Batch for large-scale jobs (a minimal coordinator is sketched below).
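
    To make the "write your own coordinator" option concrete, here is a minimal sketch in Python (the image names and intervals are hypothetical; a real coordinator would add retries, logging, and the metadata/message-queue triggers mentioned above):

        import subprocess
        import time

        # Hypothetical jobs: a Docker image plus how often it should run, in seconds.
        JOBS = [
            {"image": "registry.example.com/feature-pipeline:latest", "every_s": 3600},
            {"image": "registry.example.com/model-train:latest", "every_s": 86400},
        ]

        def run_job(image):
            # Run the containerized job to completion; raise if it fails.
            subprocess.run(["docker", "run", "--rm", image], check=True)

        def main():
            last_run = {job["image"]: 0.0 for job in JOBS}
            while True:
                now = time.time()
                for job in JOBS:
                    if now - last_run[job["image"]] >= job["every_s"]:
                        run_job(job["image"])
                        last_run[job["image"]] = time.time()
                time.sleep(30)  # polling interval

        if __name__ == "__main__":
            main()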

    You missed model serving (I think?). It's tough and latency-sensitive, especially for recommender systems, and prone to latency spikes and traffic spikes. Even with well-written Python code you can run into limitations quite quickly.

    • stressinduktion 2 days ago

      Thanks! :)

      > Nothing fancy.

      Well, right now I am seeing lots of low-level innovation in networking/storage - RoCE, Infiniband, Tesla's ttpoe, the recent addition of devmem-tcp to the Linux kernel (https://docs.kernel.org/networking/devmem.html) - and was wondering whether there are approaches for plugging something like that together at a higher level, and what the considerations are. I assume EFS or S3 might be too expensive for a (large) training infrastructure, but I could be wrong.

      > You missed model serving (I think ?).

      I think I have a better grasp on the engineering challenges there and could imagine an architecture to scale that out (I believe!).

astahlx 2 days ago

Great to see this course here. Christian is also a great person and does great work. I know some of the beginnings of this course and book and can highly recommend it.

golly_ned 2 days ago

I've worked on ML platforms and systems for 9.5 years at every scale. The material looks great.

  • SnorkelTan 2 days ago

    What would you recommend for a backend developer looking to make the switch from REST CRUD apps to ML platforms?

    • golly_ned a day ago

      It depends on your starting point. A baseline level of ML knowledge is needed. Beyond that, ML platforms account for three basic functions: features/data, model training, and model hosting.

      So do an end-to-end project where you:

      - start from a CSV dataset, with the goal of predicting some output column. A classic example is predicting whether a household's income is >$50K or not from census information.

      - transform/clean the data in a jupyter notebook and engineer features for input into a model. Export the features to disk into a format suitable for training.

      - train a simple linear model using a chosen framework: a regressor if you're predicting a numerical field, a classifier if it's categorical.

      - iterate on model evaluation metrics through more feature engineering, scoring the model on unseen data to see its actual performance.

      - export the model in such a way it can be loaded or hosted. The format largely depends on the framework.

      - construct a Docker container that exposes the model over HTTP, with a handler that receives prediction requests and transforms them into model inputs, plus a client that sends requests to that model.

      That'll basically give you an end-to-end run of the entire MLE lifecycle (a minimal sketch follows below). Every other part of development is a series of concentric loops between these steps, scaled out to ridiculous proportions in several dimensions: number of features, size of dataset, steps in the data/feature processing pipeline that generates training datasets, model architecture and hyperparameters, latency/availability requirements for model servers...
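
      A minimal sketch of that loop, assuming pandas, scikit-learn, joblib, and Flask (the CSV and column names are illustrative, not from any specific course lab):

          import joblib
          import pandas as pd
          from flask import Flask, jsonify, request
          from sklearn.linear_model import LogisticRegression
          from sklearn.metrics import roc_auc_score
          from sklearn.model_selection import train_test_split

          FEATURES = ["age", "education_num", "hours_per_week"]  # illustrative feature columns

          # 1. Load the CSV and build a feature matrix plus the binary target.
          df = pd.read_csv("census.csv")
          X, y = df[FEATURES], (df["income"] == ">50K").astype(int)

          # 2. Train a simple linear classifier and score it on held-out data.
          X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
          model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
          print("held-out AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

          # 3. Export the model so a serving container can load it.
          joblib.dump(model, "model.joblib")

          # 4. Serve predictions over HTTP (this app is what the Docker image would run).
          app = Flask(__name__)
          served_model = joblib.load("model.joblib")

          @app.route("/predict", methods=["POST"])
          def predict():
              payload = request.get_json()  # e.g. {"age": 37, "education_num": 13, "hours_per_week": 40}
              row = pd.DataFrame([payload])[FEATURES]
              return jsonify({"prob_income_gt_50k": float(served_model.predict_proba(row)[0, 1])})

          if __name__ == "__main__":
              app.run(host="0.0.0.0", port=8080)

      The client side is then just an HTTP POST with a JSON body against /predict, from curl or a few lines of requests code.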

      For bonus points:

      - track metrics and artifacts using a local mlflow deployment (a tiny sketch follows this list).

      - compare performance for different models.

      - examine feature importance to remove unnecessary (or net-negative) features.

      - use an NN model and train on GPU. Use profiling tools (depends on the framework) and Nvidia Nsight to examine performance. Optimize.

      - host a big model on GPU. Profile and optimize.
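
      For the mlflow bullet above, local tracking is only a few lines (the values here are illustrative; by default runs land in a local ./mlruns directory):

          import mlflow
          import mlflow.sklearn

          with mlflow.start_run(run_name="logreg-baseline"):
              mlflow.log_param("max_iter", 1000)          # hyperparameters you varied
              mlflow.log_metric("holdout_auc", 0.87)      # illustrative evaluation metric
              mlflow.sklearn.log_model(model, "model")    # `model` from the sketch above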

      IMO: the biggest missing piece for ML systems/platform engineers is how to feed GPUs. If you can right-size workloads and feed a GPU with MLE workloads you'll get hired. MLE workloads vary wildly (ratio of data volume in vs. compute; size of model; balancing CPU compute for feature processing with GPU compute for model training). We're all working under massive GPU scarcity.

      • pl4nty a day ago

        this is really helpful, thanks. how much are third-party models changing these workflows (LLMs etc)? would you still spend as much time on feature engineering and evaluation? I'm wondering whether any saved time would be refocused on hosting, especially optimizing GPU utilization

      • zi_ a day ago

        this is unbelievably helpful, wow -- thank you!

        curious: which part of the pipeline does the majority of 'business' value come from?

        • jononor a day ago

          For the majority of use cases I have seen: solving a sufficiently large pain point, understanding/formulating the problem, having/getting the right data, and fitting well into the users' workflow.

          All the technology challenges are actually on the "cost" side of the equation. Meaning that the aim, with respect to business value, should be to do as little of it as possible (but not less!). For some use cases this can still be quite a lot... but more often it's "all the pieces need to be in place for the whole to work at all" rather than "each piece needs to be super optimized".

    • Superbowl5889 2 days ago

      I am looking for suggestions too. I am thinking of doing internships on the side, which will at least get me into the ecosystem.

doctorpangloss 2 days ago

I like the idea of learning a single “Kubernetis”

thecleaner 2 days ago

Maybe I am underestimating the course complexity, but this sounds like an entry-level course. Up until the model explainability tools, most of the stuff looks fairly straightforward tbh. Although they're using industry-standard tools for most use cases, which is good I think.