The 'black box' nature of Convolutional Neural Networks (CNNs) poses significant risks in practical applications, affecting reliability, security, and the assignment of responsibility. Interpretability of CNNs has therefore become an urgent and critical issue in machine learning. Recent research on CNN interpretability has either yielded unstable or inconsistent interpretations, or produced coarse-grained interpretability heatmaps, limiting its applicability in many scenarios. In this work, we propose a novel CNN interpretation method that incorporates a joint evaluation (JE) of multiple feature maps with multi-objective optimization (MOO), termed JE&MOO-CAM. First, a joint evaluation scheme over all feature maps is proposed to preserve complete object instances and increase the overall activation values. Second, the interpretation of CNNs is formulated under the MOO framework to avoid instability and inconsistency in the resulting explanations. Finally, the selection, crossover, and mutation operators of NSGA-II, together with its population initialization, are redesigned to properly reflect the characteristics of CNNs. Experimental results, including qualitative and quantitative assessments as well as a sanity check on three classic CNN models (VGG16, AlexNet, and ResNet50), demonstrate the superior performance of the proposed JE&MOO-CAM. The method not only accurately pinpoints the instances in an image that require explanation but also preserves the integrity of these instances as far as possible, surpassing six leading state-of-the-art methods on four established evaluation criteria.
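The abstract gives no implementation details, so the following is only a minimal, illustrative sketch of the joint-evaluation idea: a single candidate weighting over all feature maps is combined into one class activation map and scored on two objectives, which an NSGA-II-style optimizer could then trade off. The function name `score_candidate` and both objective definitions are assumptions introduced for illustration; they are not the authors' actual formulation.

```python
# Minimal sketch (assumed, not the authors' implementation) of scoring one
# candidate weighting of feature maps on two objectives for an NSGA-II-style
# multi-objective search.
import torch
import torch.nn.functional as F

def score_candidate(model, image, feature_maps, weights, target_class):
    """Return two objectives for one candidate weighting of feature maps.

    image:         (1, 3, H, W) input tensor
    feature_maps:  (C, h, w) activations from a chosen conv layer
    weights:       (C,) candidate weights, e.g. one NSGA-II individual
    """
    # Joint evaluation: combine ALL feature maps at once rather than ranking
    # them individually, so complete object instances can be preserved.
    cam = torch.einsum('c,chw->hw', weights, feature_maps)
    cam = F.relu(cam)
    cam = F.interpolate(cam[None, None], size=image.shape[-2:],
                        mode='bilinear', align_corners=False)[0, 0]
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

    # Objective 1 (assumed): class confidence of the CAM-masked image;
    # higher means the highlighted region actually drives the prediction.
    with torch.no_grad():
        masked_score = torch.softmax(model(image * cam), dim=1)[0, target_class]

    # Objective 2 (assumed): overall activation covered by the explanation.
    total_activation = cam.mean()

    # NSGA-II conventionally minimizes, so negate both objectives.
    return (-masked_score.item(), -total_activation.item())
```

In such a setup, each NSGA-II individual would encode one weight vector over the feature maps, and the redesigned selection, crossover, mutation, and initialization operators mentioned above would operate on these vectors; the exact encoding and operators are described in the paper itself.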