Data scarcity in EEG-based emotion recognition makes it difficult to build high-accuracy models with existing deep learning methods. To tackle this problem, a dual-encoder variational autoencoder-generative adversarial network (DEVAE-GAN) incorporating spatiotemporal features is proposed to generate high-quality artificial samples. First, EEG data for different emotions are preprocessed into differential entropy (DE) features in five frequency bands and divided into segments with a 5 s time window. Second, each feature segment is represented in two forms: temporal morphology data and spatial morphology data arranged according to the electrode positions. Finally, the proposed dual encoder is trained to extract information from these two representations, concatenate them into a latent variable, and feed it into the decoder to generate artificial samples. To evaluate the effectiveness of this approach, a systematic experimental study was conducted on the SEED dataset: the original training set is augmented with different numbers of generated samples, and the augmented training sets are then used to train a deep neural network as the affective model. The results show that models trained on datasets augmented by the proposed method achieve an average accuracy of 97.21% across all subjects, a 5% improvement over training on the original dataset alone, and the distribution of the generated data is shown to be close to that of the original data. These results demonstrate that the proposed model can learn the distribution of the raw data to generate high-quality artificial samples, which in turn can be used to train high-accuracy affective models.
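For concreteness, the sketch below shows how per-band differential entropy features of the kind described above are commonly computed for one EEG segment; under a Gaussian assumption, DE reduces to 0.5·ln(2πeσ²) of the band-filtered signal. The band edges and the 200 Hz sampling rate are illustrative assumptions, not values stated in this work.

```python
# Hedged sketch: per-band differential entropy (DE) features for one EEG
# segment. Under a Gaussian assumption, DE = 0.5 * ln(2 * pi * e * sigma^2)
# of the band-filtered signal. Band edges and the 200 Hz sampling rate are
# illustrative assumptions only.
import numpy as np
from scipy.signal import butter, filtfilt

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 14),
         "beta": (14, 31), "gamma": (31, 50)}  # assumed band edges (Hz)

def differential_entropy(segment, fs=200):
    """segment: (n_channels, n_samples) -> DE features (n_channels, 5)."""
    feats = []
    for low, high in BANDS.values():
        # Band-pass filter the segment, then take the per-channel variance.
        b, a = butter(4, [low / (fs / 2), high / (fs / 2)], btype="band")
        filtered = filtfilt(b, a, segment, axis=-1)
        var = filtered.var(axis=-1)
        feats.append(0.5 * np.log(2 * np.pi * np.e * var))
    return np.stack(feats, axis=-1)

de = differential_entropy(np.random.randn(62, 5 * 200))  # one 5 s segment
print(de.shape)  # (62, 5): channels x frequency bands
```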
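The dual-encoder backbone itself could be organised as in the following minimal PyTorch sketch: one encoder consumes the temporal morphology data, the other the grid-shaped spatial morphology data, and their outputs are concatenated before the variational bottleneck. All shapes and hyperparameters (62 SEED channels, five bands, five 1 s DE frames per segment, an 8×9 electrode grid, a 64-dimensional latent) are assumptions for illustration, and the adversarial discriminator and training loop are omitted.

```python
# Minimal sketch of a dual-encoder VAE; shapes and layer sizes are assumed,
# and the GAN discriminator is omitted.
import torch
import torch.nn as nn

class DualEncoderVAE(nn.Module):
    def __init__(self, n_channels=62, n_bands=5, n_frames=5, latent_dim=64):
        super().__init__()
        # Temporal encoder: models the sequence of flattened 1 s DE frames.
        self.temporal = nn.GRU(input_size=n_channels * n_bands,
                               hidden_size=128, batch_first=True)
        # Spatial encoder: 2-D convolutions over the electrode-position grid,
        # with the five frequency bands as input channels.
        self.spatial = nn.Sequential(
            nn.Conv2d(n_bands, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # The two feature vectors are concatenated before the VAE bottleneck.
        fused = 128 + 64
        self.fc_mu = nn.Linear(fused, latent_dim)
        self.fc_logvar = nn.Linear(fused, latent_dim)
        # Decoder reconstructs the flattened temporal representation.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, n_frames * n_channels * n_bands),
        )

    def forward(self, x_temporal, x_spatial):
        # x_temporal: (batch, n_frames, n_channels * n_bands)
        # x_spatial:  (batch, n_bands, grid_h, grid_w)
        _, h = self.temporal(x_temporal)           # h: (1, batch, 128)
        t_feat = h.squeeze(0)
        s_feat = self.spatial(x_spatial)
        fused = torch.cat([t_feat, s_feat], dim=1)
        mu, logvar = self.fc_mu(fused), self.fc_logvar(fused)
        # Reparameterisation trick: sample the latent variable.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

# Example: one batch of 4 segments on an assumed 8x9 electrode grid.
model = DualEncoderVAE()
xt = torch.randn(4, 5, 62 * 5)
xs = torch.randn(4, 5, 8, 9)
recon, mu, logvar = model(xt, xs)
print(recon.shape, mu.shape)  # torch.Size([4, 1550]) torch.Size([4, 64])
```

Sampling new latent vectors from the prior and passing them through the trained decoder would then yield the artificial samples used for augmentation.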