Speech enhancement employing a dual-path transformer (DPT) with a dilated DenseNet-based encoder and decoder has shown state-of-the-art performance. By applying attention along both the time and frequency paths, the DPT learns the long-term dependencies of speech and the relationships between frequency components. However, the DPT processes each utterance as a batch, attending over all past and future frames, which makes it impractical for real-time applications. To satisfy the real-time requirement, we propose a streaming dual-path transformer (stDPT) with a zero-look-ahead structure. In the training phase, we apply masking techniques to control the context length; in the inference phase, we use caching methods to preserve sequential information. Extensive experiments evaluate performance across different context lengths, and the results verify that the proposed method outperforms current state-of-the-art real-time speech enhancement models.
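
To illustrate how training-time masking and inference-time caching can correspond, the following is a minimal sketch and not the stDPT implementation. It assumes a single attention head along the time path with identity query, key, and value projections, and a context length counted in frames that includes the current frame. The names `masked_attention`, `StreamingAttention`, and `context` are hypothetical and introduced only for this illustration.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(frames, context):
    """Training-style pass over the whole sequence with a banded causal mask.

    Frame t may attend only to frames t - context + 1 .. t, so there is
    zero look-ahead and the context length is bounded (assumed convention).
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for t, query in enumerate(frames):
        scores = []
        for s, key in enumerate(frames):
            visible = t - context < s <= t
            dot = sum(q * k for q, k in zip(query, key))
            # Masked positions get -inf, which softmax maps to weight 0.
            scores.append(scale * dot if visible else -math.inf)
        weights = softmax(scores)
        outputs.append(
            [sum(w * frames[s][d] for s, w in enumerate(weights)) for d in range(dim)]
        )
    return outputs


class StreamingAttention:
    """Inference-style pass: one frame at a time, with a cache of recent frames."""

    def __init__(self, context):
        # The deque drops the oldest frame once `context` frames are stored.
        self.cache = deque(maxlen=context)

    def step(self, frame):
        self.cache.append(frame)
        dim = len(frame)
        scale = 1.0 / math.sqrt(dim)
        scores = [scale * sum(q * k for q, k in zip(frame, key)) for key in self.cache]
        weights = softmax(scores)
        return [sum(w * v[d] for w, v in zip(weights, self.cache)) for d in range(dim)]


if __name__ == "__main__":
    # Toy 3-dimensional "frames"; real inputs would be learned feature vectors.
    frames = [[0.1 * t, 1.0 - 0.2 * t, 0.5] for t in range(6)]
    offline = masked_attention(frames, context=3)
    stream = StreamingAttention(context=3)
    online = [stream.step(f) for f in frames]
    gap = max(abs(a - b) for x, y in zip(offline, online) for a, b in zip(x, y))
    print("offline and streaming outputs agree:", gap < 1e-9)
```

Under these assumptions, the masked full-sequence pass and the frame-by-frame cached pass produce the same outputs, so a model trained with the mask sees the same context it will have when streaming. Frequency-path attention operates within a single frame and would not need such a cache in this simplified picture.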