1. Play with your model and training hyperparameters. You might be able to use a lighter model without a significant degradation in performance, for example by decreasing the network's depth, width, number of filters, or floating-point precision. Together with methods #2 and #4 below, this will let you increase your batch size and your inference bandwidth.
2. Explore network architectures that are optimized for lighter hardware, such as SqueezeNet.
3. NVIDIA offers a network inference optimizer called TensorRT that is designed for exactly this need: optimizing your trained network for deployment.
4. Knowledge distillation. This is a less conventional way to speed up inference, but it should work: a well-known paper by Geoffrey Hinton et al. shows how to distill the knowledge of a large network and compress it into a smaller one. This will degrade performance somewhat, but the paper showed that the accuracy penalty is small compared to the savings in model complexity (see the sketch after this list).
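To make the distillation idea in #4 concrete, here is a minimal sketch of a distillation training step, assuming PyTorch. The temperature `T`, the mixing weight `alpha`, and the `teacher`/`student` models are illustrative assumptions, not values prescribed by the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Weighted mix of a soft-target term (teacher guidance) and a hard-target term."""
    # Soften both distributions with temperature T and compare them via KL divergence.
    # Scaling by T*T keeps this term's gradients comparable to the hard-target term.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Hypothetical training step: `teacher` is the large pretrained model,
# `student` is the smaller network you actually deploy.
def train_step(student, teacher, optimizer, images, labels):
    teacher.eval()
    with torch.no_grad():          # the teacher is frozen, used only to produce targets
        teacher_logits = teacher(images)
    student_logits = student(images)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After training, only the smaller student is shipped, which is where the batch-size and bandwidth gains mentioned in #1 come from.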