Play with your model and training hyperparameters. You may be able to use a lighter model without a significant drop in accuracy, for example by decreasing the network's depth, width, number of filters, or floating-point precision. Together with methods #2 and #4, this will let you increase your batch size and raise your inference throughput.
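As a rough illustration of why these knobs matter, here is a sketch (using a hypothetical conv-layer weight tensor, not any specific model) of how width and float precision each scale a layer's memory footprint:

```python
import numpy as np

# Hypothetical weights for one conv layer: 64 filters of size 3x3 over 32 input channels.
w_fp32 = np.zeros((64, 32, 3, 3), dtype=np.float32)

# Halving the floating-point precision halves the memory footprint.
w_fp16 = w_fp32.astype(np.float16)

# Halving the layer width (32 filters instead of 64) halves the parameter count again.
w_narrow = w_fp32[:32]

print(w_fp32.nbytes, w_fp16.nbytes, w_narrow.nbytes)  # 73728 36864 36864
```

The same savings apply to the activations each layer produces, which is what actually limits your batch size at inference time.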
Explore network architectures that are optimized for lighter hardware, such as SqueezeNet.
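To see where SqueezeNet's savings come from, here is a back-of-the-envelope comparison (channel sizes are illustrative, loosely following the SqueezeNet design) of parameter counts for a plain 3x3 convolution versus a "Fire" module producing the same number of output channels:

```python
# Parameter counts, ignoring biases.

def conv3x3_params(c_in, c_out):
    # A standard 3x3 convolution: one 3x3 kernel per (input, output) channel pair.
    return 9 * c_in * c_out

def fire_params(c_in, squeeze, expand1x1, expand3x3):
    # Fire module: a 1x1 "squeeze" layer feeding parallel 1x1 and 3x3 "expand" layers.
    return c_in * squeeze + squeeze * expand1x1 + 9 * squeeze * expand3x3

standard = conv3x3_params(128, 128)   # 147456 parameters
fire = fire_params(128, 16, 64, 64)   # 12288 parameters, same 128 output channels
print(standard, fire)                 # roughly a 12x reduction
```

Because the 3x3 kernels only ever see the small squeezed channel count, the expensive term shrinks by roughly the squeeze ratio.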
NVIDIA offers a network inference optimizer called TensorRT that is designed for exactly this need: optimizing your trained network for deployment.
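As a rough sketch of the workflow: you export the trained model to a portable format such as ONNX, then build an optimized engine on the target GPU, for example with TensorRT's bundled `trtexec` tool (`model.onnx` and `model.plan` below are placeholder filenames):

```shell
# Build a TensorRT engine from an exported ONNX model.
# --fp16 enables half-precision kernels where the hardware supports them.
trtexec --onnx=model.onnx --fp16 --saveEngine=model.plan
```

The engine is built for the specific GPU it runs on, so this step is typically done on (or for) the deployment hardware rather than the training machine.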