Making Sense of Big Data

Simplify data management by unifying the file format across different kinds of machine learning workloads

Photo by Maksim Shutov on Unsplash

Machine learning is all about the data. To successfully train a sophisticated model you will need a high-quality training dataset: one that is sufficiently large, accurately labeled, and correctly represents the distribution of data samples in the real world. However, no less important is proper management of the data. By data management we are referring to how and where the data is stored, the ways in which it is accessed, and the transformations it undergoes during the development life cycle. The focus of this post is on the file format used to store the training data and the implications…
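As a concrete illustration (not taken from the post itself), here is a minimal sketch of one file-format option such a discussion typically covers: serializing training samples into TFRecord files and reading them back with tf.data. The feature names and file name are hypothetical placeholders.

```python
import tensorflow as tf

# A minimal sketch of one common training-data file format: TFRecord.
# The feature names ("image", "label") are hypothetical placeholders.

def serialize_sample(image_bytes, label):
    features = {
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=features)).SerializeToString()

# Write a small TFRecord file.
with tf.io.TFRecordWriter("train-0000.tfrecord") as writer:
    for image_bytes, label in [(b"\x00" * 16, 0), (b"\x01" * 16, 1)]:
        writer.write(serialize_sample(image_bytes, label))

# Read it back as a tf.data.Dataset during training.
feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}
dataset = tf.data.TFRecordDataset(["train-0000.tfrecord"]).map(
    lambda record: tf.io.parse_single_example(record, feature_spec)
)
```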

Making Sense of Big Data

Reduce CPU Load on the Training Instance by Processing Data During its Retrieval

Photo by Erin Minuskin on Unsplash

Two months ago (in March of 2021) AWS announced the Amazon S3 Object Lambda feature, a new capability that enables one to process data that is being retrieved from Amazon S3 before it reaches the calling application. The announcement highlights how this feature can be used to provide different views of the data to different clients and describes its advantages over other solutions in terms of complexity and cost:

To provide different views of data to multiple applications, there are currently two options. You either create, store, and maintain additional derivative copies of the data, so that each application has…
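To make the idea concrete, here is a minimal sketch, not taken from the post, of what an S3 Object Lambda handler can look like: it fetches the requested object through the presigned URL provided in the event, applies a transformation, and returns the result to the caller. The upper-casing transformation is just a placeholder for real preprocessing.

```python
import boto3
import urllib3

# Sketch of an S3 Object Lambda handler: fetch the original object via the
# presigned URL in the event, transform it, and return the result to the caller.
# The transformation (upper-casing text) is a placeholder for real preprocessing.

http = urllib3.PoolManager()
s3 = boto3.client("s3")

def lambda_handler(event, context):
    ctx = event["getObjectContext"]
    original = http.request("GET", ctx["inputS3Url"]).data

    transformed = original.upper()  # placeholder transformation

    s3.write_get_object_response(
        Body=transformed,
        RequestRoute=ctx["outputRoute"],
        RequestToken=ctx["outputToken"],
    )
    return {"statusCode": 200}
```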

What to look out for when scaling your training to multiple workers

Photo by Laura Ockel on Unsplash

These days, data-distributed training is all the rage. In data-distributed training, learning is performed on multiple workers in parallel. The workers can reside on one or more training machines. Each worker starts off with its own identical copy of the full model and performs each training step on a different subset (local batch) of the training data. After each training step, it publishes its resulting gradients and updates its own model, taking into account the combined knowledge learned by all of the models. Denoting the number of workers by k and the local batch size by b…
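As a rough illustration of this scheme (not taken from the post), the sketch below uses Horovod to average each worker's gradients across all k workers before applying the update; the model and learning rate are hypothetical.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

# Sketch of data-distributed training: each worker computes gradients on its
# local batch, and the gradients are averaged across all workers before the
# model update. Model and learning rate are placeholders.

hvd.init()

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.SGD(0.01 * hvd.size())  # scale LR by worker count
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    # Average this worker's gradients with those of all other workers.
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```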

Making Sense of Big Data

Dynamically Adapt your Training Session Based on Worker System Availability

Photo by Jason Leung on Unsplash

Horovod is a popular framework for running distributed training on multiple GPU workers and across multiple hosts. Elastic Horovod is an exciting new feature of Horovod that introduces support for fault tolerance, enabling training to continue uninterrupted even in the face of failing or resuming hosts. In this post I will explain how Elastic Horovod can be used to reduce cost in a distributed training environment and demonstrate the configuration steps required to run it on Amazon Elastic Compute Cloud (Amazon EC2) spot instances.

The post is divided into four parts. We begin in part 1 by describing how to reduce training costs…
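For orientation, here is a minimal sketch, under my own assumptions rather than the post's actual code, of the Elastic Horovod pattern: the training state is wrapped in an object that is periodically committed, so the job can resume when workers (for example, spot instances) join or leave. Such a script is typically launched with horovodrun together with a host-discovery script.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Sketch of the Elastic Horovod pattern: wrap training state in an object that
# can be committed and restored, so training survives workers joining or leaving.
# Model, learning rate, and epoch count are placeholders.

hvd.init()
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01))
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")

@hvd.elastic.run
def train(state):
    for epoch in range(state.epoch, 10):
        # one epoch of model.fit(...) would go here
        state.epoch = epoch + 1
        state.commit()  # checkpoint state so a rescaled job resumes from here

state = hvd.elastic.KerasState(model, epoch=0)
train(state)
```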

Back to Basics: Rethinking Development Best Practices in the Age of Cloud Computing

Photo by Felipe Furtado on Unsplash

My previous posts have been mostly technical, covering a range of topics on training in the cloud and advanced TensorFlow development. This post is different. You might consider it more of an opinion piece.

In the course of my career I have rarely seen tensions run higher than when discussing the topic of software development practices. Even the most introverted engineer will suddenly come alive. Peers, who otherwise work in unison, will fall into heated argument. …

Making Sense of Big Data

Maximize Training Resource Utilization, Accelerate Learning, Save Money

Can you find the bottleneck? Photo by viswanath muddada on Unsplash

In a previous post, I spoke about the importance of profiling the runtime performance of your DNN training sessions as a means of making the most of your training resources, accelerating your training, and saving money. I described a typical training pipeline (see the diagram below), reviewed some of the potential performance bottlenecks, and surveyed some of the tools available for identifying such bottlenecks. In this post I would like to expand on one of the more common performance bottlenecks, the CPU bottleneck, and some of the ways to overcome it. …
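As a small, hedged example of the kind of mitigation such a discussion covers (not the post's own code), the sketch below parallelizes and prefetches tf.data preprocessing so that CPU work overlaps the GPU compute of the previous step; the parser and file pattern are hypothetical.

```python
import tensorflow as tf

# Sketch of relieving a CPU bottleneck in the input pipeline: parallelize the
# per-sample work and prefetch so preprocessing overlaps GPU compute.
# The parse function and file pattern are hypothetical placeholders.

def parse_fn(record):
    return tf.io.parse_single_example(
        record,
        {"image": tf.io.FixedLenFeature([], tf.string),
         "label": tf.io.FixedLenFeature([], tf.int64)},
    )

dataset = (
    tf.data.TFRecordDataset(tf.data.Dataset.list_files("train-*.tfrecord"))
    .map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
```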

Making Sense of Big Data

How to Increase Your Efficiency and Reduce Cost When Training in the Cloud

This blog post accompanies a talk I gave at AWS re:Invent 2020, in which I described some of the ways in which my team at Mobileye (officially known as Mobileye, an Intel Company) uses Amazon SageMaker Debugger in its daily DNN development.

Monitoring the Learning Process

A critical part of training machine learning models, and particularly deep neural networks (DNNs), is monitoring one’s learning process. (This is sometimes called babysitting one’s learning process.) …
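For illustration only (not taken from the talk or the post), here is a minimal sketch of attaching the SageMaker Debugger smdebug hook to a Keras training session so that selected tensors are saved at regular intervals for monitoring; the output path, save interval, and collections are assumptions.

```python
import smdebug.tensorflow as smd
import tensorflow as tf

# Sketch of hooking SageMaker Debugger (smdebug) into a Keras training session
# so that selected tensors are periodically saved for monitoring.
# The output path, interval, and collections are hypothetical.

hook = smd.KerasHook(
    out_dir="/opt/ml/output/tensors",
    save_config=smd.SaveConfig(save_interval=100),
    include_collections=["losses", "metrics"],
)

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
model.compile(
    optimizer=hook.wrap_optimizer(tf.keras.optimizers.Adam()),
    loss="sparse_categorical_crossentropy",
)

# Pass the hook as a Keras callback so it can capture tensors during fit():
# model.fit(train_dataset, epochs=2, callbacks=[hook])
```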

How to Implement a Non-trivial TensorFlow Keras Loss Function

Photo by Kristopher Roller on Unsplash

One of the main ingredients of a successful deep neural network is the model loss function. At Mobileye (officially known as Mobileye, an Intel Company), we spend a lot of time cultivating our loss functions and fine-tuning them to the precise problems that we are trying to solve. While we naturally desire as much flexibility as possible when it comes to defining loss functions, it should come as no surprise that high-level training frameworks and APIs might impose certain restrictions. In this post, I will describe the challenge of defining a non-trivial model loss function when using the high-level…
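To give a flavor of the problem (this is not the post's solution), the sketch below shows one common workaround when a loss needs more than the standard (y_true, y_pred) pair: packing extra per-sample data into y_true and unpacking it inside a custom tf.keras.losses.Loss. The packing scheme and weighting are hypothetical.

```python
import tensorflow as tf

# Sketch of a non-trivial Keras loss that needs more than (y_true, y_pred):
# an extra per-sample weight is packed into y_true and unpacked inside the loss.
# The packing scheme is a hypothetical example.

class WeightedLoss(tf.keras.losses.Loss):
    def call(self, y_true, y_pred):
        # y_true carries [label, per_sample_weight] in its last dimension.
        labels = y_true[..., :1]
        weights = y_true[..., 1:]
        per_sample = tf.keras.losses.mean_squared_error(labels, y_pred)
        return per_sample * tf.squeeze(weights, axis=-1)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss=WeightedLoss())
```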

How to Capture and Record Arbitrary Tensors in TensorFlow 2

Photo by Markus Spiske on Unsplash

In previous posts, I have told you about how my team at Mobileye (officially known as Mobileye, an Intel Company) has tackled some of the challenges that came up while using TensorFlow to train deep neural networks. In particular, I have covered topics such as performance profiling and debugging. This post addresses an additional component of training machine learning models: monitoring the learning process.

Monitoring the learning process is an important, and often time-consuming, part of DNN training, during which we track a variety of tensors, metrics, and statistics in order to understand how our training is progressing…
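As a hedged illustration (not the method described in the post), one simple way to capture an arbitrary intermediate tensor in TensorFlow 2 is a pass-through Keras layer that records a statistic of whatever flows through it, so it is reported alongside the other training metrics; the layer and statistic below are hypothetical.

```python
import tensorflow as tf

# Sketch of capturing an arbitrary intermediate tensor: a pass-through layer
# that records a statistic of the tensor flowing through it without altering
# the computation. The statistic and metric name are hypothetical.

class TensorMonitor(tf.keras.layers.Layer):
    def call(self, inputs):
        self.add_metric(tf.reduce_mean(inputs), name="hidden_activation_mean")
        return inputs

inputs = tf.keras.Input(shape=(16,))
hidden = tf.keras.layers.Dense(32, activation="relu")(inputs)
hidden = TensorMonitor()(hidden)
outputs = tf.keras.layers.Dense(1)(hidden)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```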

How to Debug a TensorFlow Training Program Without Losing Your Mind

Photo by David Clode on Unsplash

If debugging is the process of removing software bugs, then programming must be the process of putting them in.
Edsger Dijkstra (from https://www.azquotes.com/quote/561997)

In some of my previous posts (here, here, and here), I told you a bit about how my team at Mobileye (officially known as Mobileye, an Intel Company) uses TensorFlow, Amazon SageMaker, and Amazon S3 to train our deep neural networks on large quantities of data. In this post, I want to talk about debugging in TensorFlow.

It is well known that program debugging is an integral part of software development, and that the time that…
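For context (not taken from the post), here is a minimal sketch of two simple techniques that often help when debugging TensorFlow training code: inserting tf.print inside graph-compiled functions, and compiling a Keras model with run_eagerly=True so that ordinary Python debugging tools work.

```python
import tensorflow as tf

# Sketch of two simple debugging aids: tf.print inside a graph-compiled
# function, and forcing Keras to run eagerly so breakpoints and print
# statements behave as in ordinary Python code (at the cost of speed).

@tf.function
def train_step(x):
    y = tf.square(x)
    tf.print("train_step y stats:", tf.reduce_min(y), tf.reduce_max(y))
    return y

train_step(tf.constant([1.0, 2.0, 3.0]))

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse", run_eagerly=True)
```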

Chaim Rand

I am a Machine Learning Algorithm Developer working on Autonomous Vehicle technologies at Mobileye, an Intel Company.
