(Text: Xin Xi | Editing: Information | Review: Hong Chen) Recently, ICML 2025 (the Forty-Second International Conference on Machine Learning, rated CCF-A and CAAI-A), a leading international conference in artificial intelligence, announced its paper acceptance results: three papers on machine learning theory from Professor Hong Chen's research group were accepted.
The first paper, "How does Labeling Error Impact Contrastive Learning? A Perspective from Data Dimensionality Reduction", systematically analyzes how classical data dimensionality reduction affects the performance of self-supervised contrastive representation learning. In recent years, contrastive learning has achieved excellent performance in self-supervised representation learning. Although many previous studies have tried to explain its success theoretically, they typically rely on the label consistency assumption, i.e., that the positive pairs generated by data augmentation are semantically consistent. In practice, however, the diversity and randomness of augmentation strategies can introduce labeling error, so this assumption may fail to hold. This paper studies, from a theoretical perspective, how labeling error affects the downstream classification performance of contrastive learning. Specifically, it first reveals several significant negative impacts of labeling error on the downstream classification risk. To mitigate these impacts, the authors propose applying data dimensionality reduction (e.g., singular value decomposition, SVD) to the original data to reduce false positive samples, and verify its effectiveness both theoretically and empirically. Further analysis shows that SVD is a double-edged sword: while it reduces labeling error, it may weaken the connectivity of the augmentation graph and thereby degrade downstream classification accuracy. Jun Chen, a PhD student who enrolled in 2022, is the first author of the paper; Professor Hong Chen is the corresponding author; Professor Yiming Ying of the University of Sydney and others participated in the research.
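To make the dimensionality-reduction step concrete, the sketch below replaces the raw data with its rank-k SVD approximation before augmentation, so that augmented views keep the principal structure of the data and produce fewer false positive pairs. This is an illustrative reading of the idea, not the authors' released code; the rank k, array shapes, and function names are assumptions.

```python
# Illustrative sketch (not the paper's code): rank-k SVD approximation of the
# raw data, applied before contrastive augmentation.
import numpy as np

def svd_low_rank(X: np.ndarray, k: int = 512) -> np.ndarray:
    """Return the best rank-k approximation of X (n_samples x n_features)."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)  # economy-size SVD
    return (U[:, :k] * S[:k]) @ Vt[:k]                # keep the top-k singular directions

# Example: denoise 1,000 flattened 32x32x3 images before building positive pairs.
X = np.random.randn(1_000, 3_072).astype(np.float32)
X_lowrank = svd_low_rank(X, k=512)
print(X_lowrank.shape)  # (1000, 3072) -- same shape, lower effective dimension
```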
The second paper, "On the Generalization Ability of Next-Token-Prediction Pretraining", analyzes how the next-token-prediction pre-training task affects the generalization performance of large language models (LLMs). In recent years, LLMs have performed remarkably well on NLP tasks, especially text generation, which largely relies on pre-training over large-scale unlabeled corpora. Mainstream LLMs (such as GPT, LLaMA, and DeepSeek) are decoder-only models (DOMs), whose pre-training task is to predict the next token from all preceding tokens, i.e., Next-Token-Prediction (NTP). Despite substantial engineering progress in NTP pre-training, its theoretical characterization, and in particular its generalization analysis, remains scarce. From the perspective of statistical learning theory, this paper analyzes the covering number of multi-layer, multi-head DOMs and, based on Rademacher complexity and mixing processes, derives an excess-risk upper bound for DOMs under the NTP pre-training task. The theoretical results show that the pre-training generalization ability of an LLM mainly depends on the number of model parameters, the number of token sequences, and the length of token sequences; when the number of model parameters is increased, the total number of pre-training tokens should be increased in linear proportion. Zhihao Li, a master's student who enrolled in 2023, is the first author of the paper; Professor Hong Chen is the corresponding author; Researcher Feng Zheng of Southern University of Science and Technology and others participated in the research.
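For readers unfamiliar with the NTP objective, the PyTorch-style sketch below shows how a decoder-only model is trained to predict token t+1 from tokens 1..t. Here `model` is a placeholder for any causal language model that returns per-position vocabulary logits; the code illustrates the standard objective rather than the paper's implementation.

```python
# Minimal sketch of the next-token-prediction (NTP) objective for a decoder-only
# model; `model` is an assumed causal LM returning logits of shape
# (batch, seq_len, vocab_size).
import torch
import torch.nn.functional as F

def ntp_loss(model, token_ids: torch.Tensor) -> torch.Tensor:
    """token_ids: (batch, seq_len) integer tensor from a pre-training batch."""
    inputs = token_ids[:, :-1]    # tokens 1 .. m-1 serve as the context
    targets = token_ids[:, 1:]    # token t+1 is predicted from tokens <= t
    logits = model(inputs)        # causal self-attention masks future positions
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch*(m-1), vocab)
        targets.reshape(-1),                  # (batch*(m-1),)
    )
```

Averaging the cross-entropy over all positions of all sequences is what ties the bound's quantities together: the number of sequences N, the sequence length m, and the model's parameter count all enter the effective sample complexity.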
The third paper, "Adversarial Robust Generalization of Graph Neural Networks", investigates the adversarial generalization theory of graph neural networks. Graph neural networks (GNNs) perform well on node classification tasks, yet still suffer severe performance degradation under adversarial attacks. Adversarial training is a widely used tool for enhancing the adversarial robustness of GNNs and has shown remarkable effectiveness in node classification, but a theoretical account of its generalization mechanism is still lacking. By developing a covering-number-based uniform convergence analysis, this paper establishes high-probability generalization bounds for adversarial GNNs; the results apply to a range of typical GNNs and reveal the architectural factors that influence generalization ability. Experiments on benchmark datasets are consistent with the theoretical results. Chang Cao, a master's student who enrolled in 2023, is the first author of the paper; Associate Professor Han Li is the corresponding author; Professor Hong Chen and others participated in the research.
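As background on the adversarial training analyzed here, the sketch below shows one PGD-style training step for GNN node classification under bounded node-feature perturbations: an inner maximization that searches for a worst-case perturbation, followed by an outer minimization that updates the model. The names `gnn`, `features`, `adj`, `labels`, `train_mask` and the perturbation budget are illustrative placeholders, not the paper's code.

```python
# Illustrative PGD-style adversarial training step for GNN node classification.
import torch
import torch.nn.functional as F

def adversarial_train_step(gnn, optimizer, features, adj, labels, train_mask,
                           eps=0.05, alpha=0.01, steps=5):
    # Inner maximization: find a bounded feature perturbation that increases the loss.
    delta = torch.zeros_like(features, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(gnn(features + delta, adj)[train_mask], labels[train_mask])
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    # Outer minimization: update the GNN on the perturbed features.
    optimizer.zero_grad()
    loss = F.cross_entropy(gnn(features + delta.detach(), adj)[train_mask], labels[train_mask])
    loss.backward()
    optimizer.step()
    return loss.item()
```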
The above research was supported by the General Program of the National Natural Science Foundation of China and the Tianyuan Mathematical Fund of the National Natural Science Foundation of China, among others.
Conference website: https://icml.cc
[Paper 1 Abstract]
In recent years, contrastive learning has achieved state-of-the-art performance in self-supervised representation learning. Many previous works have attempted to provide a theoretical understanding of the success of contrastive learning. Almost all of them rely on a default assumption, i.e., the label consistency assumption, which may not hold in practice (the probability of failure is called labeling error) due to the strength and randomness of common augmentation strategies, such as random resized crop (RRC). This paper investigates the theoretical impact of labeling error on the downstream classification performance of contrastive learning. We first reveal several significant negative impacts of labeling error on downstream classification risk. To mitigate these impacts, a data dimensionality reduction method (e.g., singular value decomposition, SVD) is applied to the original data to reduce false positive samples, and both theoretical and empirical evaluations are established. Moreover, we find that SVD acts as a double-edged sword: it may also degrade downstream classification accuracy by reducing the connectivity of the augmentation graph. Based on the above observations, we recommend using a moderate embedding dimension (such as 512 or 1024 in our experiments), data inflation, weak augmentation, and SVD to ensure large graph connectivity and a small labeling error, thereby improving model performance.
[Paper 2 Abstract]
Large language models (LLMs) have demonstrated remarkable potential in handling natural language processing (NLP) tasks and beyond. LLMs can usually be categorized as transformer decoder-only models (DOMs), utilizing Next-Token-Prediction (NTP) as their pre-training methodology. Despite their tremendous empirical successes, the theoretical understanding of how NTP pre-training affects the model's generalization behavior is lacking. To fill this gap, we establish a fine-grained generalization analysis for NTP pre-training based on Rademacher complexity, where the dependence between tokens is also addressed. Technically, a novel decomposition of Rademacher complexity is developed to study DOMs from the perspectives of the representation learner and the token predictor, respectively. Furthermore, upper bounds on the covering number are established for multi-layer and multi-head transformer-decoder models under the Frobenius norm, which theoretically pioneers the incorporation of the mask matrix within the self-attention mechanism. Our results reveal that the generalization ability of NTP pre-training is affected quantitatively by the number of token sequences $N$, the maximum length of sequence $m$, and the count of parameters in the transformer model $\Theta$. Additionally, experiments on public datasets verify our theoretical findings.
[Paper 3 Abstract]
While Graph Neural Networks (GNNs) have shown outstanding performance in node classification tasks, they are vulnerable to adversarial attacks, which are imperceptible changes to input samples. Adversarial training, as a widely used tool to enhance the adversarial robustness of GNNs, has presented remarkable effectiveness in node classification tasks. However, the generalization properties that explain this behavior remain poorly understood from a theoretical viewpoint. To fill this gap, we develop a high-probability generalization bound for general GNNs in adversarial learning through covering number analysis. We estimate the covering number of the GNN model class based on the entire perturbed feature matrix by constructing a cover for the perturbation set. Our results are generally applicable to a series of GNNs. We demonstrate their applicability by investigating the generalization performance of several popular GNN models under adversarial attacks, which reveals the architectural factors influencing the generalization gap. Our experimental results on benchmark datasets provide evidence that supports the established theoretical findings.