Hands-on Tutorials

How to use cloud training resources to scale up your training capacity

Whether you are an algorithm developer in a growing startup, a data scientist in a university research lab, or a Kaggle hobbyist, there may come a point in time when the training resources that you have onsite no longer meet your training demands. In this post we target development…

Tips and Tricks

Methods for Streaming Training Data from Amazon S3 to Amazon SageMaker — Part 2

Last year we published a blog post in which we surveyed different methods for streaming training data stored in Amazon S3 into an Amazon SageMaker training session. We highlighted some of the strengths and weaknesses of the different options and examined their ability to address some specific needs such as:

Thoughts and Theory

Harnessing the Power of Dedicated DNN Training Chips

One of the driving forces behind the success of deep learning over the past decade has been the immense computing power offered by Graphics Processing Units (GPUs). Although originally designed for rendering images to display devices, their highly parallel structure enabled training speed-ups of orders of magnitude. Over time GPUs…

Making Sense of Big Data

A Simple Technique that Can Save You Bucketloads of Money and How to Combine it with Mixed Precision Learning

Motivated by the desire to accelerate the speed of learning, a common practice in the world of deep learning today is to distribute training activity across multiple workers (e.g. GPUs). …
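For orientation, here is a minimal sketch (not taken from the post itself) of enabling mixed precision training in TensorFlow Keras, the technique the title refers to combining with the cost-saving approach; the model and layer choices below are placeholders.

```python
import tensorflow as tf

# Illustrative sketch: enable mixed precision in Keras.
# Compute runs largely in float16 while variables remain float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    # Keep the final softmax in float32 for numerical stability.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])

# Under the mixed_float16 policy, Keras applies loss scaling to the
# optimizer automatically.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```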

Making Sense of Big Data

Simplify data management by unifying the file format across different kinds of machine learning workloads

Machine learning is all about the data. To successfully train a sophisticated model you will need a high quality training dataset; a dataset that is sufficiently large, accurately labeled, and correctly represents the distribution of data samples in the real world. However, no less important is proper management of the…

What to look out for when scaling your training to multiple workers

These days, data distributed training is all the rage. In data distributed training, learning is performed on multiple workers in parallel. The multiple workers can reside on one or more training machines. Each worker starts off with its own identical copy of the full model and performs each training step…
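As a rough illustration of this setup, here is a minimal TensorFlow sketch using tf.distribute.MirroredStrategy, one common way of replicating the model across local GPU workers; the toy model and dataset are placeholders, not code from the post.

```python
import tensorflow as tf

# Illustrative sketch of data distributed (data-parallel) training:
# each replica holds an identical copy of the model, processes a
# different slice of every global batch, and gradients are averaged
# across replicas at each step.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# A toy dataset stands in for a real input pipeline; model.fit splits
# the global batch across the replicas automatically.
x = tf.random.normal((1024, 32))
y = tf.random.normal((1024, 1))
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(64)

model.fit(dataset, epochs=2)
```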

Making Sense of Big Data

Dynamically Adapt your Training Session Based on Worker System Availability

Horovod is a popular framework for running distributed training on multiple GPU workers and across multiple hosts. Elastic Horovod is an exciting new feature of Horovod that introduces support for fault-tolerance, enabling training to continue uninterrupted, even in the face of failing or resuming hosts. …
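For context, here is a rough sketch of the Elastic Horovod pattern for Keras, adapted from the Horovod documentation rather than from the post itself; the model, dataset, and hyperparameters are placeholders.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(20,))])
optimizer = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(optimizer=optimizer, loss="mse")

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((256, 20)), tf.random.normal((256, 10)))).batch(32)

# The state object captures the model and training progress so it can be
# restored on the surviving workers when hosts are added or removed.
state = hvd.elastic.KerasState(model, batch=0, epoch=0)

callbacks = [
    hvd.elastic.CommitStateCallback(state),
    hvd.elastic.UpdateEpochStateCallback(state),
]

@hvd.elastic.run
def train(state):
    # Training resumes from the last committed epoch after a reset event.
    model.fit(dataset, initial_epoch=state.epoch, epochs=4,
              callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)

train(state)
```

Such a script would typically be launched with horovodrun using the elastic options (for example --min-np, --max-np, and a host discovery script) so that workers can join and leave mid-training.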

Back to Basics: Rethinking Development Best Practices in the Age of Cloud Computing

My previous posts have been mostly technical, covering a range of topics on training in the cloud and advanced TensorFlow development. This post is different. You might consider it more of an opinion piece.

In the course of my career I have rarely seen tensions run higher than when discussing…

Making Sense of Big Data

Maximize Training Resource Utilization, Accelerate Learning, Save Money

In a previous post, I spoke about the importance of profiling the runtime performance of your DNN training sessions as a means of making the most of your training resources, accelerating your training, and saving money. I described a typical training pipeline (see the diagram below), reviewed some of the…
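As a simple illustration (not the profiling setup from the post), one common way to capture a runtime profile of a few training steps is the TensorBoard Keras callback; the log directory, model, and step range below are placeholders.

```python
import tensorflow as tf

# Illustrative sketch: profile training steps 10 through 20 so that
# data-input, CPU, and GPU activity can be inspected in the TensorBoard
# Profile tab afterwards.
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir="./logs",          # placeholder path
    profile_batch=(10, 20),
)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

x = tf.random.normal((2048, 32))
y = tf.random.normal((2048, 1))
model.fit(x, y, batch_size=64, epochs=1, callbacks=[tb_callback])
```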

Chaim Rand

I am a Machine Learning Algorithm Developer working on Autonomous Vehicle technologies at Mobileye, an Intel Company.
