This is a graduate-level course on computer networks, offering an in-depth exploration of selected advanced topics in networked systems. We will discuss the latest developments across the entire networking stack, the interactions between networks and high-level applications, and their connections with other system components such as computing and storage.
In this year's edition, we will use machine learning as a prime example to understand its unique requirements and challenges in the context of networking. As machine learning applications increasingly rely on larger models and faster accelerators, the demand for enhanced networking capabilities becomes imperative. Throughout this course, we will study cutting-edge networking solutions and principles for co-designing networks with computing and storage, to meet the evolving needs of machine learning applications. The course will include lectures, in-class presentations, paper discussions, and a research project.
- Instructor: Minlan Yu
- Guest instructor: Prof. Tushar Krishna (who will join some class discussions and provide feedback for course projects)
- Lecture time: TuTh 11:15 am to 12:30 pm
- Location: SEC 1.402
- Office hours: Tu 10-11 am, SEC 4.415
- Teaching fellows: Qianru Lao qianrulao@fas.harvard.edu; Shiji Xin sxin@fas.harvard.edu
- Prerequisite: This course has no prerequisites. Since this course will focus on reading papers on the latest topics in networking, you will need to be able to pick up the relevant background for each topic from textbooks or online materials.
- Recommended prep: system programming at the level of CS 61, CS 143, or CS 145.
If you are thinking of attending the class, please check the infrastructure page and set up your infrastructure as soon as possible.
There are no required textbooks for the course. You will read papers before each class to get the most out of the class. For background, you are encouraged to refer to the following books:
- For basic networking concepts, you can refer to the textbook (K&R) Computer Networking: A Top-Down Approach by Jim Kurose and Keith Ross. The latest edition is the 8th, but earlier editions are fine.
- An alternative book is Computer Networks: A Systems Approach, by Larry Peterson and Bruce Davie. You can find an online version here.
- On the ML side, Prof. Vijay Reddi is developing a book on Machine Learning Systems.
- Please feel free to contact me if some concepts are difficult to understand; I'll provide more supplemental materials.
- Project: 50% (1% project proposal, 4% initial project presentation, 5% mid-term report, 5% final project presentation, 35% final report and code)
- Reviews: 35%
- Class presentation: 10%
- Class participation: 5% (including class attendance, in-class discussion, and online discussion on Ed)
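As a quick sanity check on the breakdown above, the sketch below combines the four component weights into a final percentage. The weights are copied from the list; the function name, the 0-to-1 score scale, and the example student are hypothetical and only illustrate the arithmetic, not how the course staff actually computes grades.

```python
# Weights copied from the grading breakdown (percent of the final grade).
WEIGHTS = {
    "project": 50,
    "reviews": 35,
    "class_presentation": 10,
    "class_participation": 5,
}

# Project sub-weights, also copied from the breakdown.
PROJECT_PARTS = {
    "proposal": 1,
    "initial_presentation": 4,
    "midterm_report": 5,
    "final_presentation": 5,
    "final_report_and_code": 35,
}


def final_grade(component_scores):
    """Combine per-component scores (fractions in [0, 1]) into a percentage.

    Hypothetical helper: the 0-to-1 scale is an assumption for illustration.
    """
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)


# Hypothetical student, for illustration only.
example = {
    "project": 0.9,
    "reviews": 1.0,
    "class_presentation": 0.8,
    "class_participation": 1.0,
}
print(final_grade(example))
```

The component weights sum to 100, and the project sub-weights sum to the 50% project share.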
Please see the detailed requirements after the syllabus.
We have again introduced a group of new papers this year. The papers we read emphasize distributed systems and networking in the ML area. Review submission starts with the 9/12 class.
- 9/3 Tu: Introduction (Minlan)
- 9/5 Th: Background on model and hardware, high-level course project ideas (Minlan)
- Optional reading: The Llama 3 Herd of Models
- 9/10 Tu: Data Parallelism and Sharding (Tushar)
- Reading: PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
- Optional reading: Meta blog post
- 9/12 Th: Model Parallelism and Pipelining (Yao Xiao, Zicheng Ma)
- 9/17 Tu: Parameter Server vs All Reduce (Gili Rusak, Yepeng Huang)
- 9/19 Th: Collective Communication Optimizations (Xingyu Xiang, Shi Feng)
- 9/24 Tu: LLM training (Javid Lakha, Leslie Gu)
- 9/26 Th: LLM serving (Wanxin Xie, Hanying Feng)
- 10/1 Tu: Tutorials by TFs: Optional homework project walkthrough; ASTRA-sim Distributed Machine Learning System simulator
- 10/3 Th: Course project pitch presentation
- 10/8 Tu: Throughput-latency tradeoffs (Taj Gulati, Cassie Dai)
- 10/10 Th: Distributed serving (Han Qi)
- 10/15 Tu: NCCL as a service (Dennis Eum)
- 10/17 Th: Flow scheduling (Vignav Ramesh, Naomi Bashkansky)
- 10/22 Tu: RDMA (Shirley Zhu, Shreeja Kikkisetti)
- 10/24 Th: Congestion control (Giovanni D'Antonio)
- Reading: FASTFLOW: Flexible Adaptive Congestion Control for High-Performance Datacenters (please note that version 1 of this paper is titled: SMaRTT-REPS: Sender-based Marked Rapidly-adapting Trimmed & Timed Transport with Recycled Entropies)
- Optional reading: NVIDIA SpectrumX White Paper: Just check the adaptive routing part
- 10/29 Tu: Ethics
- 10/31 Th: Checkpointing (Jason Wang, Edward Kang)
- 11/5 Tu: Fault tolerance (Dagim Gebrie)
- 11/7 Th: Diagnosis (Tong Ding)
- 11/12 Tu: Data ingestion (Romeo Dean)
- 11/14 Th: LLM training in Production (Shiyu Ma, Inaki Arango)
- Reading: MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
- Optional reading: The Llama 3 Herd of Models
- 11/19 Tu: TPU (Emma Yang, Ayush Noori)
- 11/21 Th: Sustainable AI: Environmental Implications, Challenges and Opportunities (invited talk from Carole-Jean Wu, Meta)
- 11/26 Tu: Final project presentation (batch I)
- 11/28 Th: no class: Thanksgiving
- 12/3 Tu: Final project presentation (batch II)
- 12/11 Final Project Deadline (updated based on school examination group and dates)
- The reviews aim to help you become comfortable reading research papers on networking and systems.
- Students are expected to write reviews for the papers discussed in each class. Scores will be based on the top 90% of the reviews, meaning it is acceptable to miss THREE reviews throughout the course.
- Reviews are due by noon one day before class (Monday noon for Tuesday classes; Wednesday noon for Thursday classes). This allows the presenter to collect all your questions for class discussion. For lectures with guest speakers, the TF will collect the questions. Please raise your questions during class.
- Reviews submitted within a week after the deadline receive half credit. Reviews submitted later than that receive no credit.
- Detailed review questions are available in HotCRP. In addition to the general review questions, each paper may have a specific question.
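The late-submission and missed-review rules above can be sketched as follows. This is one possible reading, not the official grading procedure: it assumes each review is marked on a 0-to-1 scale, treats a missed review as a score of 0, and interprets "acceptable to miss THREE reviews" as dropping the three lowest adjusted scores. The function and parameter names are hypothetical.

```python
def review_total(scores, days_late, dropped=3):
    """Average review score after late penalties and dropped reviews.

    scores: per-review marks in [0, 1] (0 for a missed review).
    days_late: days past the deadline for each review (0 means on time).
    dropped: number of lowest scores ignored (assumed to be three).
    """
    adjusted = []
    for score, late in zip(scores, days_late):
        if late <= 0:
            adjusted.append(score)        # on time: full credit
        elif late <= 7:
            adjusted.append(score * 0.5)  # within a week: half credit
        else:
            adjusted.append(0.0)          # later than a week: no credit
    keep = max(len(adjusted) - dropped, 0)
    kept = sorted(adjusted, reverse=True)[:keep]
    return sum(kept) / len(kept) if kept else 0.0


# Hypothetical record: five submitted reviews (one 3 days late, one 10 days
# late) followed by three missed reviews.
scores = [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
days_late = [0, 0, 0, 3, 10, 0, 0, 0]
print(review_total(scores, days_late))
```

In this example the three missed reviews are dropped, while the two late reviews still count at half credit and zero credit respectively.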
- The goal of the presentation and in-class discussion is to learn how to form your own opinions about a paper.
- Depending on the number of students, each student will give one to three talks during the course.
- The speaker should send their slides to me three days before the presentation. In class, we expect you to know all the details of the paper and be able to answer questions during the discussion. If you have any questions about the paper, feel free to reach out to me before the class.
- Some authors share slides online, and some conferences share conference talk videos. You are encouraged to check out these resources or reuse them for your presentation with clear citations. However, be aware that conference talks are often short and focus more on the motivation than on the technical details. They may also highlight only the benefits of their approaches (everyone likes their own work). So, if you reuse the slides, please add more technical details, ensure you understand the content thoroughly, and share your own opinions of the work (not just the authors').
- The presentation should cover the major content of the paper, including motivation (what problem the paper is solving; why this problem wasn't solved before), challenges (why this problem is difficult to solve), system design (how the authors address the challenges), evaluation (does it demonstrate that the problems/challenges are solved?), and your personal opinions of the paper.
- The talk should be around 45-50 minutes, excluding the review questions and discussions. This is longer than a normal conference talk to allow for more context on problem settings and detailed system design.
- Additionally, read all the reviews submitted by your classmates, list their questions in your slides, and lead the discussion of these questions in class.
- Be prepared to answer detailed questions about the paper during the discussion.
- The presentation will be graded based on both content (your understanding of the paper) and presentation (your delivery of the knowledge).
The semester-long project is an open-ended systems research project. Project topics are of your choice but should be related to networking. Projects should be done in groups of two or three and include a systems-building component. Note that we do not consider the number of students in a group in grading. Selected projects can be submitted as peer-reviewed workshop papers or posters.
- 9/8 Sun at noon: Form groups for course projects
- 9/22 Sun at noon: Course project proposal
- 9/23-27: Schedule individual meetings with Minlan to get feedback on your project proposal
- 10/3 Th: Course project pitch presentation
- 11/3 Sun at noon: Midterm project report due
- 11/4-8: Schedule individual meetings with Minlan to get feedback on your midterm report
- 11/26 Tu, 12/3 Tu: Final project presentation
- 12/11 Final project due at noon
- 12/12 Review of other students' projects due at midnight
The project proposal serves as a checkpoint, providing a basis for your meetings with Minlan and your pitch presentations. Please check out the guidelines for pitch presentation below on what to write in your project proposal. You will receive the full 1% grade if you submit your proposal on time. Unfortunately, late submissions will not be accepted, and there is no opportunity to make up the grade. After submission, you can keep updating your proposal and bring your latest one to your meeting with Minlan.
Each group should deliver a 5-minute talk on their project ideas. Be mindful about the scope of your project to ensure it can be completed by your team within two and a half months. The presentation should include the following points (one slide per question):
- What problem are you solving?
- Why is it an important problem?
- What potential challenges might you face in solving the problem?
- What is your plan for the midterm report and division of work within the team?
Your grade depends on how concrete your problem and execution plan are.
The midterm report should be about 2-4 pages and serve as a starting point for your final project report (see detailed requirements for the final report below). To achieve a high score for your midterm report, it is important to deliver an initial evaluation of your system. You don't need to complete the entire system; instead, focus on identifying the most critical component/question in your project and provide an initial quantitative evaluation. The midterm report should include the following:
- Describe the problem you plan to solve, why it is novel/unique, and the major challenges (similar to your project pitch presentation, but feel free to adapt it based on your new understanding of the problem).
- Describe the detailed design of your project and what you have implemented/evaluated so far.
- Provide one evaluation figure about your initial system. (This will be the focus of your meeting with Minlan.)
- Discuss the remaining challenges, how you plan to address them, and your plan for the remaining time.
This presentation should resemble a workshop talk. You might consider covering the following content (not necessarily in the same order):
- What problem are you solving?
- Why is it an important problem?
- What is your basic solution to the problem?
- What are the challenges in the problem?
- How did you solve these challenges? Or how do you plan to solve them?
- Your preliminary evaluation results
- What do you plan to improve for the final report?
The report should be similar in spirit to a workshop paper, spanning six pages of double-column, single-spaced, 10-point font, excluding references. Here is an example LaTeX framework for formatting and building your paper. As shown in the framework, you may consider the following sections for your report (adapted from Eddie's version):
- Title: Something grabby that correctly describes a part of the contribution.
- Abstract: A paragraph or two that concisely describes the motivation for the work (the problem addressed), the contribution of the work, and a highlight of your results.
- Introduction: The introduction often covers the following questions: what problem are you trying to solve? Why is your problem important? What are the key challenges in solving your problem? What are your high-level ideas for addressing these challenges? What is your key design/system architecture? What are your key findings and evaluation results?
- Design: Start with the high-level architecture of your system, and then describe the details of your design in enough relevant detail that a skilled system builder could replicate your work. Compare your design choices with alternative approaches to explain why you designed your system this way.
- Evaluation: For systems work, this often includes the following subsections: (1) Experimental setup: Describe how you ran your experiments. What kinds of machines? How much memory? How many trials? How did you prepare the machine before each trial? (2) The experiments themselves, grouped by purpose. Include figures. (3) A summary of the experimental results. Some good evaluations are organized around performance hypotheses: statements that the experiments aim to support or disprove. It is important to discuss the implications of your observed results and why you see such results.
- Related work: Describe related research, especially research closely related to your work. This section serves to provide citations and comparisons. For each group of citations, describe (1) the core idea, (2) what is complementary to your work, (3) what is more advanced than your work, and (4) what is advanced upon by your work. (2)–(4) are optional—some papers will be entirely complementary with or orthogonal to your work.
- Conclusion: Summarize your work and its contributions.
Together with the final report, you should submit the GitHub link of your project code. No need for superb software engineering, but ideally the code should be accompanied by enough documentation that a motivated user could attempt to replicate your results. You will need to demonstrate your product to the TFs at the final office hours.
The first four milestones (initial proposal, pitch presentation, midterm report, final project presentations) are mainly graded based on how well you keep up with the project progress at each stage. You will also get feedback at these milestones on how to improve your projects. The final project will be graded based on motivation, design, the delivered system, and its evaluation.
ChatGPT and other generative AI (GAI) tools are in general not useful for this course's assignments (reviews and project reports). By default, such use is disallowed without the permission of the course staff. Any permitted use must be appropriately acknowledged and cited. It is each student’s responsibility to assess the validity and applicability of any GAI output that is submitted; you bear the final responsibility. Violations of this policy will be considered academic misconduct. We draw your attention to the fact that different classes at Harvard could implement different AI policies, and it is the student’s responsibility to conform to expectations for each course. There are also Harvard guidelines for GAI tools.
I would like to create a learning environment in our class that supports a diversity of thoughts, perspectives and experiences, and honours your identities (including race, gender, class, sexuality, socioeconomic status, religion, ability, etc.). I (like many people) am still in the process of learning about diverse perspectives and identities. If something was said in class (by anyone) that made you feel uncomfortable, please talk to me about it. If you feel like your performance in the class is being impacted by your experiences outside of class, please don’t hesitate to come and talk with me. As a participant in course discussions, you should also strive to honour the diversity of your classmates. (Statement extracted from one by Dr. Monica Linden at Brown University.)
If you have a health condition that affects your learning or classroom experience, please let me know as soon as possible. I will, of course, provide all the accommodations listed in your AEO letter (if you have one), but sometimes we can do even better if a student helps me understand what matters to them. (Statement adapted from one by Prof. Krzysztof Gajos.)