This paper proposes a method for the unsupervised learning of place-names from pairs of a spoken utterance and a localization result, which represents the current location of a mobile robot, without any a priori linguistic knowledge other than a phoneme acoustic model. In previous work, we proposed a lexical learning method based on statistical model selection. That method can learn words that represent a single object, such as proper nouns, but cannot learn words that represent classes of objects, such as common nouns. This paper describes improvements to the method that enable it to learn both the phoneme sequence of each word and the distribution of objects that the word represents.