This paper presents a novel training data selection method for constructing acoustic models for automatic speech recognition (ASR). Many training data sets have already been developed for acoustic modeling, each created for a specific ASR application so that the acoustic characteristics of the set, e.g. speakers, noise, and recording devices, match those of the application. A mixture of such already-created training sets (an out-of-domain set) forms a large utterance set containing diverse acoustic characteristics. The proposed method selects the most appropriate subset of this out-of-domain set and uses it for supervised training of an acoustic model for a new ASR application. The subset whose acoustic characteristics are most similar to those of the target-domain set (i.e. untranscribed utterances recorded by the target application) is selected on the basis of the proposed joint Kullback-Leibler (KL) divergence of speech and non-speech characteristics. Furthermore, to select one of the many candidate subsets in practical computation time, we also propose a selection algorithm based on submodular optimization that minimizes the joint KL divergence by greedy selection with an optimality guarantee. Experiments on real meeting utterances with deep neural network acoustic models show that the proposed method yields better acoustic models than random or likelihood-based selection.
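
The following is a minimal sketch, not the authors' implementation, of the greedy selection idea described above. It assumes each utterance is summarized by per-utterance sufficient statistics of its speech and non-speech frames, models each domain as a single full-covariance Gaussian, and treats the joint KL divergence as the sum of the speech and non-speech divergences between the selected subset and the target-domain set; all function names and the data layout are hypothetical.

```python
import numpy as np


def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) for full-covariance Gaussians."""
    d = mu0.shape[0]
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0)
                  + diff @ inv1 @ diff
                  - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))


def fit_gaussian(stats):
    """Pool per-utterance (frame count, feature sum, sum of outer products) statistics."""
    n = sum(s[0] for s in stats)
    first = sum(s[1] for s in stats)
    second = sum(s[2] for s in stats)
    mu = first / n
    cov = second / n - np.outer(mu, mu)
    return mu, cov + 1e-6 * np.eye(len(mu))  # small variance floor for stability


def joint_kl(subset, target):
    """Sum of speech and non-speech KL divergences between the subset and the target domain."""
    total = 0.0
    for key in ("speech", "nonspeech"):
        mu, cov = fit_gaussian([utt[key] for utt in subset])
        total += gaussian_kl(mu, cov, *target[key])
    return total


def greedy_select(utterances, target, budget):
    """Greedily add the out-of-domain utterance that most reduces the joint KL divergence."""
    selected, remaining = [], list(range(len(utterances)))
    while remaining and len(selected) < budget:
        best, best_kl = None, None
        for i in remaining:
            kl = joint_kl([utterances[j] for j in selected + [i]], target)
            if best_kl is None or kl < best_kl:
                best, best_kl = i, kl
        selected.append(best)
        remaining.remove(best)
    return selected
```

In this toy formulation, `utterances` is a list of dicts whose `"speech"` and `"nonspeech"` entries hold the per-utterance statistics, and `target` holds the Gaussian parameters estimated from the untranscribed target-domain utterances; the paper's submodular formulation makes this greedy loop efficient with a theoretical quality guarantee, which the naive re-evaluation here does not attempt to reproduce.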