This paper reports and discusses an implementation of a cognitively inspired, computationally appropriate linguistic encoding of motion events in human-robot dialogue. The proposed encoding is based on a schematic system of attention in spatial language and on the conceptualization of two fundamental cognitive functions in language, the Figure and the Ground, which support both bipartite and tripartite partitioning of spatial scenes.
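To make the encoding concrete, the following is a minimal sketch of how a Talmy-style Figure/Ground motion event might be represented in code. This is purely illustrative and is not the paper's implementation; all class names, fields, and the example sentence are hypothetical. In the bipartite case only a Figure and a Ground are present; the tripartite case adds a secondary reference object.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entity:
    """A perceived object in the spatial scene (hypothetical representation)."""
    name: str

@dataclass
class MotionEvent:
    """Sketch of a Figure/Ground encoding of a motion event.

    Bipartite partitioning: only figure and ground are set.
    Tripartite partitioning: a secondary reference object is also set.
    """
    figure: Entity                              # the moving or located entity
    ground: Entity                              # the primary reference entity
    secondary_ground: Optional[Entity] = None   # present only in tripartite scenes
    path: str = ""                              # schematic path relation, e.g. "past"

    def partitioning(self) -> str:
        return "tripartite" if self.secondary_ground else "bipartite"

# Hypothetical utterance: "The robot moved past the table toward the door."
event = MotionEvent(
    figure=Entity("robot"),
    ground=Entity("table"),
    secondary_ground=Entity("door"),
    path="past",
)
print(event.partitioning())  # -> "tripartite"
```

Dropping the `secondary_ground` field yields the bipartite case, so a single event structure can cover both scene partitionings under these assumptions.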