In this paper, we propose a multi-level speech emotion recognition (SER) system that captures both acoustic and linguistic emotional features using only speech input. In particular, our approach uses an acoustic feature extractor (HuBERT) to process the input waveform, capturing acoustic emotional features while simultaneously performing automatic speech recognition (ASR) to implicitly learn linguistic information. The ASR-decoded transcriptions are then fed into a linguistic feature extractor (BERT) to explicitly encode linguistic emotional features. To combine these features, we introduce a temporal gated fusion method that dynamically modulates the contribution of each modality, addressing modality incongruity issues. We further integrate multi-attribute learning for emotion-related attributes such as gender and speaking style to enhance SER performance. To address the gradient conflicts inherent in multi-attribute learning, we propose a two-stage fine-tuning framework employing adapters. Additionally, to mitigate the negative impact of ASR errors, we introduce an error correction module and a contrastive learning method that aligns representations learned from ground-truth text with those learned from the decoded transcriptions. Comprehensive experimental results on the IEMOCAP and MELD datasets demonstrate that our method enhances SER performance without requiring textual input. Compared to the acoustic model baseline, our approach achieves a 10.85% improvement in unweighted accuracy on the IEMOCAP dataset.
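
As a rough illustration of the temporal gated fusion idea mentioned above, the PyTorch-style sketch below computes a per-frame sigmoid gate from the concatenated acoustic and linguistic features and uses it as a convex combination of the two streams. The module name, the projection layers, and the assumption that the BERT-derived features have already been aligned to the acoustic frame rate are hypothetical choices for illustration; the paper's exact gating formulation may differ.

```python
import torch
import torch.nn as nn


class TemporalGatedFusion(nn.Module):
    """Minimal sketch of frame-level gated fusion of acoustic and linguistic
    features. The specific gating form (sigmoid gate over concatenated
    features, applied as a convex combination) is an assumption, not
    necessarily the formulation used in the paper."""

    def __init__(self, acoustic_dim: int, linguistic_dim: int, fused_dim: int):
        super().__init__()
        # Project both streams to a shared dimension before fusing.
        self.proj_a = nn.Linear(acoustic_dim, fused_dim)
        self.proj_l = nn.Linear(linguistic_dim, fused_dim)
        # Gate computed per time step from both modalities.
        self.gate = nn.Linear(2 * fused_dim, fused_dim)

    def forward(self, acoustic: torch.Tensor, linguistic: torch.Tensor) -> torch.Tensor:
        # acoustic:   (batch, T, acoustic_dim)   e.g. HuBERT frame features
        # linguistic: (batch, T, linguistic_dim) e.g. BERT features assumed
        #             to be aligned or resampled to the same T frames
        a = self.proj_a(acoustic)
        l = self.proj_l(linguistic)
        # Per-frame gate in [0, 1] controlling each modality's contribution.
        g = torch.sigmoid(self.gate(torch.cat([a, l], dim=-1)))
        # When the modalities disagree (modality incongruity), the gate can
        # down-weight the less reliable stream at the affected frames.
        return g * a + (1.0 - g) * l
```

In this sketch, the gate is recomputed at every frame, so the fusion can lean on the acoustic stream where the decoded transcription is unreliable and on the linguistic stream where the text is informative.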