What I Wish to Build

19 Aug, 2021

In my two previous posts, I’ve detailed what I’ve been learning and working on in 2020 and 2021. In this post, I will detail the reasons behind my primary interests as I begin searching for opportunities after completing my master’s degree (expected March 2022).

My primary interests can be divided into four core themes: Large-Scale Machine Learning, ML Developer Tools and MLOps, Distributed Systems, and Platform and Infrastructure. These themes overlap considerably. At this point, I am confident that they will remain my priority long into the future, and I’ll explain my reasoning below.

Large-Scale Machine Learning

Over the last few years, we have seen a lot of progress in large neural networks with billions of parameters. Large language models started this trend and showed the importance of unsupervised and semi-supervised learning. Nowadays it seems that the larger the model and the data, the better the results. This is great, but it makes compute and memory the bottleneck, both technically and socially. Technically speaking, getting hundreds of computers with multiple GPUs to coordinate to train a large neural model and then run inference on it is no easy task. Socially speaking, only those with access to hundreds of computers can even attempt to train such large models. This socio-technical challenge intrigues me, and I am excited about innovating and working on distributed systems problems dedicated to the specific use case of large-scale training and inference. I am also excited by initiatives that support large-scale training in a decentralized way.

ML Developer Tools and MLOps

Research in Deep Learning and Machine Learning has come a long way over the past decade. However, applying that research to real-world problems is not so straightforward. This has given rise to the entire field of MLOps, and a number of startups have emerged that aim to do MLOps or provide MLOps as a service. While MLOps (an amalgamation of ML and DevOps) focuses more on the infrastructure for the ML lifecycle, there’s also a pressing need for better developer tools and libraries for ML. Foundational frameworks like PyTorch and TensorFlow have greatly accelerated research, but better libraries are necessary to make that research more accessible. Think about the Hugging Face Transformers library: with a few lines of Python code, you can use the latest neural networks in NLP with ease. This should be the benchmark for all forms of ML research.

Distributed Systems

Distributed Systems are ubiquitous today. They form the backbone of the two themes I mentioned above. Everyone building software today seems to be using some distributed system directly or indirectly (through a cloud provider). Throughout my Software Engineering experience, I’ve used distributed systems almost every day. From using queueing systems like Kafka while working as a Backend Engineer to provisioning and managing Kubernetes clusters with auto-scaling GPU nodes for ML training, I’ve been surrounded by distributed systems (and I assume most other software engineers are too). I always wondered how they worked under the hood. During my master’s, I got the chance to dive into how they work. I started by auditing the MIT course 6.824 on Distributed Systems and doing its labs. Continuing on this interest, I’ll also be taking a Big Data Systems course in the fall and working on some exciting projects. In parallel, I’ve also been reading up on parallel computing and tools like MPI. So far, I’ve been completely occupied by Distributed Systems, and I feel the excitement of a child when learning about them. I am also very excited to actually build such systems. There are so many aspects to get right, and there are so many challenges. But these challenges are something I want to think about all the time, and so working on and building Distributed Systems remains my primary goal. Building systems specific to any one of the other themes would be the perfect match.

Platform and Infrastructure

This theme is closely tied to Distributed Systems. Having built and maintained infrastructure on AWS and GCP, I vividly remember the pain points and the things that went well. This experience will help me think from a customer’s viewpoint while building distributed systems. On the other hand, knowledge of Distributed Systems will help me build reliable, observable and available infrastructure. Working on projects that allow me to build infrastructure as well as the underlying systems will help me better understand the complex processes in play and get a solid grip on the nuances and intricacies of complex software systems.

Summary

I’ve highlighted the four themes that I’m most interested in. However, this does not imply that these are the only topics I’d like to work on. Any software system solving a problem has thousands of moving parts, and in order to architect and build such systems it is necessary to explore as much as possible. For instance, building distributed systems requires a solid understanding of networking, so it helps to have explored networking; large-scale ML requires a basic understanding of the underlying algorithms; and any system requires a frontend for users to interact with (CLI, GUI, etc.). I recently watched the Tesla AI Day event and was amazed by their AI capabilities. But I was equally amazed by the complex system that enables such innovation. My ultimate dream is to be involved in building such a system. While following that journey, the pieces will automatically fall into place.