AI Daily Digest: 2026-05-12

May 12, 2026 · 6 min · Pan

🔥 HuggingFace Daily Papers

1. Pixal3D: Pixel-Aligned 3D Generation from Images

Dong-Yang Li, Wang Zhao, Yuxin Chen

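This paper targets a core bottleneck of image-to-3D generation, insufficient pixel-level fidelity, and proposes Pixal3D, a pixel-aligned 3D generation paradigm. Unlike conventional methods that generate in a canonical space and inject image cues through attention, Pixal3D generates pixel-aligned 3D content directly in the input view. It introduces a pixel back-projection conditioning mechanism that explicitly lifts multi-scale image features into a 3D feature volume, establishing unambiguous pixel-to-3D correspondences. Experiments show that Pixal3D markedly improves generation fidelity, approaching reconstruction-level quality, while also supporting multi-view fusion and high-fidelity, object-separated scene synthesis.

PDF · arXiv · Code · Project | ❤️ 12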
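
To make "lifting image features into a 3D feature volume" concrete, here is a minimal sketch, not the authors' implementation. It assumes a pinhole camera (`fx`, `fy`, `cx`, `cy` are hypothetical intrinsics), a single-channel feature map at one scale, and nearest-pixel sampling; the function names are illustrative only. Each voxel center is projected into the input view and takes the feature of the pixel it lands on.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def project(point: Point, fx: float, fy: float, cx: float, cy: float) -> Optional[Tuple[int, int]]:
    """Pinhole projection of a camera-space point to the nearest pixel (u, v)."""
    x, y, z = point
    if z <= 0:  # behind the camera: no valid pixel
        return None
    return round(fx * x / z + cx), round(fy * y / z + cy)


def lift_features(feature_map: Sequence[Sequence[float]], voxel_centers: Sequence[Point],
                  fx: float, fy: float, cx: float, cy: float) -> List[float]:
    """Give each voxel the feature of the pixel it projects to (0.0 if out of view)."""
    height, width = len(feature_map), len(feature_map[0])
    volume = []
    for center in voxel_centers:
        uv = project(center, fx, fy, cx, cy)
        if uv is None or not (0 <= uv[0] < width and 0 <= uv[1] < height):
            volume.append(0.0)
        else:
            u, v = uv
            volume.append(feature_map[v][u])  # row index is v, column index is u
    return volume


# Toy 3x3 feature map and a few voxel centers in camera coordinates.
features = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
voxels = [(0.0, 0.0, 1.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0), (5.0, 0.0, 1.0)]
lifted = lift_features(features, voxels, fx=1.0, fy=1.0, cx=1.0, cy=1.0)
```

Voxels that lie on the same camera ray, such as `(1, 1, 1)` and `(2, 2, 2)` here, receive the same pixel's feature, which is one way to read the "unambiguous pixel-to-3D correspondence" in the summary.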

2. Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

Junhao Shen, Teng Zhang, Xiaoyan Zhao

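This paper addresses rigid skill management when LLM agents rely on external skills to solve complex tasks under reinforcement learning, and proposes SLIM, a dynamic skill lifecycle management framework. SLIM models the external skill set as a dynamic variable optimized jointly with the policy. It quantifies each skill's marginal external contribution through leave-one-skill-out validation and introduces three lifecycle operations (retain, retire, and expand) so that the skill set evolves adaptively with the task and training stage. On ALFWorld and SearchQA, SLIM beats the best baseline by 7.1 percentage points on average. Further analysis confirms that policy internalization and external skill invocation can coexist and complement each other, supporting the need for and effectiveness of dynamic skill management.

PDF · arXiv · Code | ❤️ 11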
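
The leave-one-skill-out idea can be sketched in a few lines. This is an illustration under stated assumptions, not SLIM itself: `toy_success_rate` is a hypothetical stand-in for a validation rollout, the zero threshold is arbitrary, and only the retain and retire operations are modeled (expansion is omitted).

```python
from typing import Callable, Dict, FrozenSet, Iterable

Evaluator = Callable[[FrozenSet[str]], float]


def leave_one_out_contributions(skills: Iterable[str], evaluate: Evaluator) -> Dict[str, float]:
    """Marginal contribution of each skill: score with all skills minus score without it."""
    full = frozenset(skills)
    base = evaluate(full)
    return {s: base - evaluate(full - {s}) for s in full}


def lifecycle_decisions(contrib: Dict[str, float], threshold: float = 0.0) -> Dict[str, str]:
    """Retain skills whose marginal contribution exceeds the threshold; retire the rest."""
    return {s: ("retain" if c > threshold else "retire") for s, c in contrib.items()}


# Hypothetical evaluator: a base success rate plus a fixed gain (or loss) per active skill.
TOY_GAINS = {"search": 0.20, "open_door": 0.10, "noisy_hint": -0.05}


def toy_success_rate(active: FrozenSet[str]) -> float:
    return 0.5 + sum(TOY_GAINS[s] for s in active)


contrib = leave_one_out_contributions(TOY_GAINS, toy_success_rate)
decisions = lifecycle_decisions(contrib)
```

In this toy setup the skill with a negative marginal contribution is retired and the other two are retained.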

3. ELF: Embedded Language Flows

Keya Hu, Linlu Qiu, Yiyang Lu

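This paper addresses the limited continuous-space modeling ability of diffusion language models (DLMs) and proposes Embedded Language Flows (ELF), an embedding-space diffusion model based on continuous-time Flow Matching. ELF performs denoising entirely in the continuous space of token embeddings and maps to discrete tokens only at the final time step through a shared-weight network, so it can directly reuse mature diffusion techniques from the image domain, such as classifier-free guidance (CFG). Experiments show that ELF significantly outperforms existing discrete and continuous DLMs in generation quality while using fewer sampling steps, supporting its potential as an efficient paradigm for continuous language generation.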

PDF · arXiv | ❤️ 2

4. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

Chenyang Song, Weilin Zhao, Xu Han

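This paper addresses the storage and memory-access bottlenecks of deploying MoE models on end-side devices and proposes DECO, a sparse MoE architecture that matches dense Transformer performance under the same total parameter count and training tokens. DECO uses a differentiable, flexible ReLU-based routing mechanism and adds expert-level learnable scaling factors to dynamically balance the contributions of routed and shared experts. It also proposes the NormSiLU activation, which normalizes inputs to stabilize SiLU and increase expert activation sparsity, and finds that non-gated MLP experts combined with ReLU routing simplify MoE design. Experiments show that DECO reaches dense-model performance while activating only 20% of experts, clearly outperforms mainstream MoE baselines, and achieves a 3.00× inference speedup on real hardware with a custom acceleration kernel.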

PDF · arXiv | ❤️ 1
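
As a rough picture of ReLU-based routing, the sketch below gates each expert with the ReLU of a router logit, so any expert whose logit is non-positive is skipped entirely. It is an assumption-laden toy, not DECO's architecture: the scalar `routed_scale` and `shared_scale` stand in for the learnable scaling factors, experts are plain Python callables, and NormSiLU is not modeled.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def relu_route(x: Sequence[float], router_weights: Sequence[Sequence[float]]) -> List[float]:
    """One gate per expert: ReLU of the router logit. A zero gate means the expert is skipped."""
    return [max(0.0, dot(x, w)) for w in router_weights]


def moe_forward(x: Sequence[float], router_weights: Sequence[Sequence[float]],
                experts: Sequence[Expert], shared_expert: Expert,
                routed_scale: float = 1.0, shared_scale: float = 1.0) -> Tuple[Vector, int]:
    """Shared expert output plus the gate-weighted sum of the active routed experts."""
    gates = relu_route(x, router_weights)
    out = [shared_scale * v for v in shared_expert(x)]
    active = 0
    for gate, expert in zip(gates, experts):
        if gate == 0.0:
            continue  # sparse: inactive experts are never evaluated
        active += 1
        out = [o + routed_scale * gate * y for o, y in zip(out, expert(x))]
    return out, active


# Toy setup: four experts that each scale the input; only expert 0 gets a positive logit.
x = [1.0, -1.0]
router = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
experts = [lambda v, k=k: [k * t for t in v] for k in (2.0, 3.0, 4.0, 5.0)]
output, n_active = moe_forward(x, router, experts, shared_expert=lambda v: list(v))
```

Unlike top-k routing, the number of active experts here is not fixed in advance; it depends on how many logits are positive for a given input.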

5. DataMaster: Towards Autonomous Data Engineering for Machine Learning

Yaxin Du, Xiyuan Yang, Zhifan Zhou

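This paper addresses the heavy manual effort and lack of systematization and automation in data engineering for machine learning, and proposes DataMaster, a task-conditioned autonomous data engineering framework. Using a tree-structured search (DataTree), a shared external data pool (Data Pool), and a global memory mechanism (Global Memory), the framework autonomously optimizes the full pipeline of external data discovery, selection and composition, and cleaning and transformation, improving downstream model performance without modifying the learning algorithm. Experiments show that DataMaster significantly outperforms manual tuning and baseline methods on multiple benchmark tasks, demonstrating its effectiveness and scalability in settings with open search spaces, branch-dependent optimization, and delayed feedback.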

PDF · arXiv | ❤️ 1

6. RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

Huashuo Lei, Wenxuan Song, Huarui Zhang

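This paper addresses gaps in current robotic memory benchmarks: missing multimodal memory annotations, limited task coverage, and no real-world evaluation. It proposes RoboMemArena, a large-scale robotic memory benchmark covering 26 long-horizon tasks, with an average trajectory length above 1,000 steps and 68.9% of subtasks depending on memory. Its generation pipeline uses vision-language models (VLMs) for subtask design and composition and generates full trajectories from atomic functions, providing memory-related annotations such as subtask instructions and native keyframes, along with real-world physical memory tasks for embodied validation. The authors also propose PrediMem, a dual-system vision-language-action (VLA) model that integrates a VLM planner, a dual-buffer memory bank of recent frames and keyframes, and a predictive coding head, which markedly increases sensitivity to task dynamics. On RoboMemArena, PrediMem outperforms the baseline models across the board, and the experiments offer insight into the architecture design, management mechanisms, and scaling behavior of complex memory systems.

PDF · arXiv · Code · Project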

7. Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

Simon Yu, Derek Chong, Ananjan Nandi

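This paper proposes Shepherd, a runtime substrate for meta-agents. It formalizes a meta-agent's operations on target agents as a functional programming model verified in Lean, and records typed events in a Git-like execution trace that supports forking and replaying from any past state. Using lightweight process and filesystem forking, Shepherd starts isolated environments 5× faster than Docker and achieves a prompt-cache reuse rate above 95% during replay. On CooperBench, multi-benchmark counterfactual optimization, and TerminalBench-2, Shepherd respectively raises the cooperative coding pass rate to 54.7%, improves counterfactual search performance by up to 11 points while cutting time by 58%, and lifts Tree-RL terminal accuracy from 34.2% to 39.4%. The system is open source.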

PDF · arXiv
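
A Git-like trace of typed events can be pictured as an append-only store in which every event points at its parent, so forking is just appending a new child to an older event. The sketch below is a toy under that assumption; the `Event` and `Trace` classes and their fields are hypothetical and are not Shepherd's API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    id: int
    parent: Optional[int]  # None for the root event
    kind: str              # event type, e.g. "message" or "tool_call"
    payload: str


class Trace:
    """Append-only event store; branches share their common prefix, as in a commit graph."""

    def __init__(self) -> None:
        self._events: Dict[int, Event] = {}
        self._next_id = 0

    def append(self, parent: Optional[int], kind: str, payload: str) -> int:
        """Record a new event under `parent`. Using an older parent creates a fork."""
        event_id = self._next_id
        self._events[event_id] = Event(event_id, parent, kind, payload)
        self._next_id += 1
        return event_id

    def history(self, head: int) -> List[Event]:
        """Events from the root to `head`, in execution order."""
        chain: List[Event] = []
        current: Optional[int] = head
        while current is not None:
            event = self._events[current]
            chain.append(event)
            current = event.parent
        return list(reversed(chain))

    def replay(self, head: int) -> List[str]:
        """Payloads along one branch, i.e. what a replay of that branch would re-issue."""
        return [event.payload for event in self.history(head)]


trace = Trace()
root = trace.append(None, "message", "plan")
tests = trace.append(root, "tool_call", "run tests")
patch_a = trace.append(tests, "tool_call", "apply patch A")
patch_b = trace.append(tests, "tool_call", "apply patch B")  # fork from an earlier state
```

Both branches replay the same first two events before diverging, which is the kind of shared prefix a counterfactual search can reuse.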

8. CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

Wenxuan Song, Han Zhao, Fuhao Li

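This paper proposes CapVector to address the limited gains and high adaptation cost of standard supervised fine-tuning (SFT) for pretrained vision-language-action (VLA) models. In parameter space, the method decouples the two optimization goals of auxiliary-objective fine-tuning: general capability enhancement and fitting the task-specific action distribution. It fine-tunes on a small task set with two strategies separately and extracts the parameter difference as a transferable "capability vector." This vector is merged with the pretrained parameters to build a capability-enhanced meta model, and a lightweight orthogonal regularization loss is added, so the approach keeps the simplicity of SFT while approaching the performance of the auxiliary fine-tuning baseline. Experiments show that CapVector generalizes across models, transfers across tasks, and is computationally efficient.

PDF · arXiv · Code · Project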
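
The parameter-difference step can be sketched as simple arithmetic on checkpoints. The code below is an illustration, not the paper's procedure: it assumes the two strategies are auxiliary-objective fine-tuning and plain SFT, takes their difference in that direction, and adds it to the pretrained weights with a hypothetical mixing coefficient `alpha`. The orthogonal regularization loss is not modeled.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # parameter name -> flattened weights


def capability_vector(theta_aux: Params, theta_sft: Params) -> Params:
    """Element-wise difference between two fine-tuned checkpoints of the same model."""
    return {name: [a - s for a, s in zip(theta_aux[name], theta_sft[name])]
            for name in theta_aux}


def merge(theta_pre: Params, vector: Params, alpha: float = 1.0) -> Params:
    """Add the scaled capability vector to the pretrained parameters."""
    return {name: [p + alpha * v for p, v in zip(theta_pre[name], vector[name])]
            for name in theta_pre}


# Toy checkpoints with a single two-element parameter.
pretrained = {"w": [1.0, 2.0]}
sft_only = {"w": [1.5, 2.0]}
aux_tuned = {"w": [1.5, 3.0]}

vector = capability_vector(aux_tuned, sft_only)
meta_model = merge(pretrained, vector, alpha=0.5)
```

Subtracting the two checkpoints cancels the change they share (the first weight) and keeps only what the auxiliary objective added (the second weight).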

🔥 arXiv Daily Papers

📄 arXiv: cs.AI

1. Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits

Logan Mann, Ajit Saravanan, Ishan Dave, Shikhar Shiromani, Saadullah Ismail, Yi Xia, Emily Huang

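This paper presents a systematic attribution study of reliability mechanisms in vision-language models (VLMs), challenging the common intuition that sharper attention maps mean a more trustworthy model. The authors build a unified VLM Reliability Probe (VRP) and jointly analyze attention structure, hidden-state geometry, and generation dynamics on three open-source model families: LLaVA-1.5, PaliGemma, and Qwen2-VL. The experiments show that (i) attention distributions barely predict answer correctness (R² ≈ 0.001), even though attention is causally necessary for feature extraction; (ii) reliability signals emerge mainly in deep hidden states, where a single-layer linear probe reaches AUROC > 0.95 on the POPE benchmark; and (iii) causal neuron ablation reveals architectural differences: late-fusion models depend on fragile bottlenecks, whereas early-fusion models are highly robust. The results indicate that hidden-state geometry, layer-wise confidence margins, and sparse late-stage circuits characterize VLM trustworthiness more reliably than attention sharpness.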
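
For readers unfamiliar with the probe metric, the sketch below shows how a linear probe score and its AUROC can be computed. It is a generic illustration with made-up numbers, not the paper's VRP: the probe is just a dot product with hypothetical weights, and AUROC is computed as the fraction of (correct, incorrect) answer pairs that the score ranks in the right order.

```python
from typing import Sequence


def probe_score(hidden: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Linear probe: a single weighted sum over one layer's hidden state."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy hidden states (2-d) with labels: 1 = answer was correct, 0 = answer was wrong.
hidden_states = [[0.9, 0.0], [0.2, 0.0], [0.8, 0.0], [0.1, 0.0]]
labels = [1, 1, 0, 0]
scores = [probe_score(h, weights=[1.0, 0.0]) for h in hidden_states]
probe_auroc = auroc(scores, labels)
```

An AUROC of 0.5 means the probe ranks no better than chance, and 1.0 means every correct answer scores above every incorrect one.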

2. Spatial Priming Outperforms Semantic Prompting: A Grid-Based Approach to Improving LLM Accuracy on Chart Data Extraction

Andrei Lazarev, Dmitrii Sedov, Alexander Galkin

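This paper addresses the low accuracy of multimodal large language models (LLMs) on non-standardized charts in automated scientific chart data extraction, and systematically compares two strategies: high-level semantic prompting and low-level spatial priming. Semantic methods, such as a metadata-first two-stage framework and chain-of-thought, yielded no statistically significant improvement. In contrast, a simple but effective spatial priming method, overlaying a coordinate grid on the chart image, significantly reduced extraction error: on a synthetic dataset, symmetric mean absolute percentage error (SMAPE) dropped from 25.5% to 19.5% (p < 0.05). The results confirm that explicitly providing spatial context improves current multimodal models on this task more reliably than high-level semantic guidance.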
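
SMAPE has several variants in the literature, and the summary does not say which one the paper uses. The sketch below assumes the common form that divides each absolute error by the mean of the absolute true and predicted values, and treats a pair where both are zero as contributing no error.

```python
from typing import Sequence


def smape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Symmetric mean absolute percentage error, in percent (range 0 to 200)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        if denom > 0:  # both values zero: perfect match, contributes nothing
            total += abs(p - a) / denom
    return 100.0 * total / len(actual)


# Two data points read off a chart versus their true values.
error = smape(actual=[100.0, 200.0], predicted=[110.0, 180.0])
```

Because the denominator uses both values, over- and under-estimates of the same size are penalized more evenly than with plain percentage error.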

3. Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

Juanxi Tian, Fengyuan Liu, Jiaming Han, Yilei Jiang, Yongliang Wu, Yesheng Liu, Haodong Li, Furong Xu, Wanhua Li

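This paper addresses the lack of structure in reward signals used to align multimodal generative models with human preferences and proposes the Auto-Rubric as Reward (ARR) framework. ARR explicitly decomposes implicit preference knowledge into task-relevant, interpretable rubrics, avoiding the reward hacking risk and evaluation bias caused by scalar or pairwise scoring in conventional RLHF. ARR supports zero-shot deployment and few-shot fine-tuning, and is further combined with Rubric Policy Optimization (RPO), which distills the multi-dimensional criteria into a robust binary reward to stabilize policy gradients. On text-to-image and image editing tasks, ARR-RPO significantly outperforms existing pairwise reward models and vision-language model judges, indicating that structured, interpretable explicit criteria address a key bottleneck for efficient and reliable multimodal alignment.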

📄 arXiv: cs.CL

1. SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators

Yada Pruksachatkun, Elaine Wan, Lyanna Chen, Kai-Wei Chang, Chien-Sheng Wu

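This paper proposes SalesSim, a multimodal user-simulation benchmark for retail, designed to evaluate how well multimodal large language models (MLLMs) simulate realistic, persona-driven customer behavior in multi-turn, multimodal, tool-augmented online shopping conversations. SalesSim models user interaction as an agent decision process grounded in real-world constraints, emphasizing background diversity, preference differences, and key rejection conditions. The authors design an evaluation suite centered on "decision alignment," covering persona consistency and conversation quality. Evaluation of 6 mainstream open- and closed-source models reveals marked behavioral deviations: the models converse fluently but show low lexical diversity, over-disclose preferences, and are easily swayed by sales tactics into drifting from their assigned persona, with the best alignment rate below 79%. In response, the authors propose UserGRPO, a multi-objective, multi-turn reinforcement learning method that improves decision alignment (+13.8%) while making conversations more natural, providing a new approach and a reproducible benchmark for building high-fidelity user simulators.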

2. Sanity Checks for Long-Form Hallucination Detection

Geigh Zollicoffer, Minh Vu, Hongli Zhan, Raymond Li, Manish Bhattarai

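This paper addresses hallucination detection in long-form LLM reasoning and proposes a controlled invariance methodology. Two oracle tests, "Force" (forcibly replacing the answer) and "Remove" (removing the answer-announcement step), distinguish whether a detector truly relies on the structure and validity of the reasoning process or merely exploits surface cues from the final answer. Experiments show that existing methods are often confounded by answer-level artifacts. Once these are controlled for, TRACT, a lightweight trace analyzer built on lexical-level features of the reasoning trace (such as trends in uncertainty expressions, step-length dynamics, and cross-response lexical convergence), proves highly robust, matching or even exceeding existing complex models on unperturbed reasoning chains. The study argues that the core challenge today is not a lack of discriminative signal in reasoning traces but the difficulty of disentangling that signal from endpoint cues.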

3. How Much Do Circuits Tell Us? Measuring the Consistency and Specificity of Language Model Circuits

Michael Li, Nishant Subramani

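This paper examines the consistency and specificity of language model "circuits" in mechanistic interpretability: whether circuit components recur within a task (consistency) and whether they are exclusive to a particular task (specificity). Using edge attribution patching, the authors quantify circuit reuse across six tasks and seven models. They find that within-task component reuse is high and critical to performance (ablation causes up to roughly 100% relative accuracy drop). However, circuits overlap heavily across tasks, and ablating one task's circuit harms other tasks to a similar degree, indicating a lack of task specificity. A small number of task-exclusive components exist, but their contribution is limited. The results suggest that circuit discovery at the granularity of attention heads and MLP layers can identify causally important components, but the generality of those components limits fine-grained understanding of and intervention on specific behaviors.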

📄 arXiv: cs.LG

1. Reinforcement learning for inverse structural design and rapid laser cutting of kirigami prototypes

Milad Yazdani, Shahriar Shalileh, Dena Shahriari

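This paper proposes the RL-Kirigami framework to tackle core challenges in the inverse structural design of kirigami metamaterials: modeling nonlinear deformation, satisfying discrete compatibility constraints, and guaranteeing geometric feasibility. The method uses an optimal-transport conditional flow matching (OT-CFM) prior to generate an initial ratio field, then applies Group Relative Policy Optimization (GRPO) reinforcement learning to align with non-differentiable rewards (contour matching, structural feasibility, and ratio-field regularity). Experiments show that a single sample reaches 94.2% sIoU, clearly outperforming traditional solvers and cutting the number of forward simulations from hundreds to one. With GRPO, sIoU rises to 94.91% and total variation (TV) falls from 0.95 to 0.81. The generated layouts are exported as DXF and laser-cut in 50 μm polymer film at 8.0 ± 1.0 minutes per part, validating this fabrication-aware inverse design paradigm under strong geometric constraints.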

2. Path-Based Gradient Boosting for Graph-Level Prediction

Claudio Meggio, Johan Pensar, Riccardo De Bin

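This paper proposes PathBoost, a path-based gradient boosting method for graph-level classification and regression that learns discriminative path features directly from graph structure. Compared with an earlier method specialized for chemistry, PathBoost makes three key improvements: (i) a logistic loss to support binary classification; (ii) a prefix decomposition mechanism that incorporates multi-dimensional node and edge attributes, expanding the path feature space; and (iii) automatic selection of anchor nodes based on the diversity of categorical attributes, removing the need to specify starting nodes by hand. Experiments on multiple benchmark datasets show that PathBoost outperforms graph neural networks and graph kernel methods on half of the tasks and performs comparably on the rest, with a larger advantage on graphs with a higher average number of nodes, indicating the potential of path-driven boosting in interpretability and competitiveness.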

3. Distributional Reinforcement Learning via the Cramér Distance

Vanya Aziz, Ivo Nowak, E. M. T. Hendrix

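This paper proposes a Cramér-distance-based distributional Soft Actor-Critic algorithm (C-DSAC), bringing distributional reinforcement learning into the Soft Actor-Critic (SAC) framework to model the full distribution of state-action values rather than only their expectation. The method fits distributions by minimizing the squared Cramér distance and introduces a "confidence-driven" Q-value update: when the target distribution has high variance (low confidence), it automatically applies more conservative parameter updates, which mitigates value overestimation. Experiments on multiple robotic control benchmarks show that C-DSAC significantly outperforms standard SAC and existing distributional methods, with a larger advantage in more complex environments. The study deepens understanding of convergence and value estimation in distributional RL.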
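
The squared Cramér distance between two distributions is the integral of the squared difference of their CDFs. The sketch below computes it for two categorical distributions on a shared, sorted support; this discretization is an assumption made for illustration, since the summary does not say how C-DSAC parameterizes its value distributions.

```python
from typing import Sequence


def squared_cramer_distance(support: Sequence[float], p: Sequence[float],
                            q: Sequence[float]) -> float:
    """Integral of (F_p - F_q)^2 for two categorical distributions on a sorted support."""
    if not (len(support) == len(p) == len(q)):
        raise ValueError("support, p and q must have the same length")
    cdf_p = 0.0
    cdf_q = 0.0
    total = 0.0
    # Both CDFs are step functions, so the integrand is constant between atoms.
    for i in range(len(support) - 1):
        cdf_p += p[i]
        cdf_q += q[i]
        total += (cdf_p - cdf_q) ** 2 * (support[i + 1] - support[i])
    return total


# All mass at 0 versus all mass at 2: the CDFs differ by 1 over an interval of length 2.
far_apart = squared_cramer_distance([0.0, 1.0, 2.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
```

Unlike a KL-style loss, this distance accounts for how far apart the atoms are, so moving mass to a nearby value costs less than moving it to a distant one.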

📄 arXiv: cs.CV

1. VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving

Rui Zhao, Haofeng Hu, Zhenhai Gao, Jiaqiao Liu, Gao Fei

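This paper addresses the weak generalization of vision-language-action (VLA) models to long-tail scenarios in end-to-end autonomous driving and proposes VLADriver-RAG, a framework that brings retrieval-augmented generation (RAG) into VLA modeling. A "visual-to-scene" mechanism abstracts sensor inputs into spatiotemporal semantic graphs to suppress visual noise, and a "scene-aligned embedding model" built on a Graph Dynamic Time Warping (Graph-DTW) metric ensures that retrieved results are consistent in topological structure rather than merely similar in surface appearance. The retrieved historical priors are fused with the VLA backbone in a query-driven manner to produce decoupled, precise trajectory plans. On the Bench2Drive benchmark, the method sets a new state of the art with a driving score of 89.12.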
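
The summary does not define Graph-DTW, so the sketch below shows only classic dynamic time warping with a pluggable per-frame cost. As a stand-in for a graph distance, each frame is reduced to a set of hypothetical scene labels and compared with Jaccard distance; both choices are assumptions for illustration, not the paper's metric.

```python
from typing import Callable, FrozenSet, List, Sequence, TypeVar

T = TypeVar("T")


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """1 - |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def dtw(seq_a: Sequence[T], seq_b: Sequence[T], cost: Callable[[T, T], float]) -> float:
    """Classic DTW: minimum total cost of a monotone alignment between two sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    table: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(seq_a[i - 1], seq_b[j - 1])
            table[i][j] = step + min(table[i - 1][j],      # seq_b frame repeats
                                     table[i][j - 1],      # seq_a frame repeats
                                     table[i - 1][j - 1])  # both advance
    return table[n][m]


# The same scenario at two speeds: the second sequence lingers on the first frame.
car, both, ped = frozenset({"car"}), frozenset({"car", "pedestrian"}), frozenset({"pedestrian"})
query = [car, both, ped]
stored = [car, car, both, ped]
distance = dtw(query, stored, jaccard_distance)
```

Because DTW lets one frame align with several, the two sequences above match exactly even though they differ in length, which is the property that makes it useful for comparing scenarios that unfold at different speeds.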

2. Benchmarking ResNet Backbones in RT-DETR: Impact of Depth and Regularization under environmental conditions

Pamela Barboza, Víctor Castelli, Belén Pereira, Ricardo Grando, Bruna de Vargas, Augusto Calfani

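This paper studies how environmental variation affects real-time object detection in visual perception for competitive robotics, systematically evaluating the robustness of ResNet backbones of different depths (ResNet18/34/50/101) within the RT-DETR framework under changes in lighting and background contrast. With dropout regularization and a unified training configuration, the study finds that environmental perturbations mainly reduce prediction confidence, while classification accuracy generally stays near 1.00 and inference latency remains largely stable. Under lighting changes, ResNet50 offers the best trade-off (accuracy ≈ 1.00, confidence ≈ 0.869, latency ≈ 0.058–0.059 ms); under background changes, ResNet34 performs better (confidence ≈ 0.887). The results indicate that medium-depth models give the best balance of accuracy, robustness, and efficiency.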

3. Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models

Yuriel Ryan, Hei Man Ip, Adriel Kuek, Paul Pu Liang, Roy Ka-Wei Lee

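Current vision-language models are prone to hallucination and reduced robustness when a modality is ambiguous or corrupted. This paper proposes Self-Captioning Multimodal Interaction Tuning. By systematically analyzing how information between modalities splits into redundant (shared), unique (exclusive), and synergistic (emergent) components, it shows that redundant information plays a key role in improving model reliability. Observing that existing instruction-tuning datasets overly suppress redundancy in order to emphasize visual grounding, the paper designs a training paradigm based on self-generated image captions and introduces a Multimodal Interaction Gate mechanism that dynamically converts unique interactions into redundant ones. Experiments show that the method reduces vision-induced errors by 38.3% and improves cross-modal consistency by 16.8%.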

🔬 OpenReview Recent Papers

1. Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

Ang Lv, Jin Ma, Yiyuan Ma

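This paper addresses the lack of explicit alignment between router decisions and expert capabilities in Mixture-of-Experts (MoE) models and proposes an expert-router coupling (ERC) auxiliary loss. The method treats each expert's router embedding as that expert's own "proxy token." It perturbs the embeddings, feeds them to the experts to obtain intermediate activations, and then imposes two constraints: each expert responds most strongly to its own proxy token, and each proxy token produces its highest activation on its corresponding expert. The ERC loss costs only $O(n^2)$ (where $n$ is the number of experts) and does not grow with batch size. In pretraining MoE LLMs of 3B to 15B parameters on trillions of tokens, ERC significantly improves model performance and supports quantitative monitoring and flexible control of expert specialization.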
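
One way to picture the two constraints is as hinge penalties on an n × n matrix of activation strengths, where entry `[i][j]` is how strongly expert `j` responds to proxy token `i`. The sketch below is an assumed formulation, not the paper's loss: the hinge form and the slack factor `alpha` are hypothetical, and the perturbation step and the experts themselves are left out.

```python
from typing import Sequence


def erc_style_loss(activation: Sequence[Sequence[float]], alpha: float = 1.0) -> float:
    """Sum of hinge penalties over all off-diagonal (token i, expert j) pairs.

    activation[i][j] is the response of expert j to the proxy token of expert i.
    """
    n = len(activation)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            off_diag = activation[i][j]
            # Token side: token i should activate its own expert i the most.
            loss += max(0.0, off_diag - alpha * activation[i][i])
            # Expert side: expert j should respond most to its own token j.
            loss += max(0.0, off_diag - alpha * activation[j][j])
    return loss


# Well-coupled experts (dominant diagonal) versus a router that confuses token 0 with expert 1.
coupled = erc_style_loss([[2.0, 1.0], [1.0, 2.0]])
confused = erc_style_loss([[1.0, 2.0], [0.0, 1.0]])
```

The double loop visits each of the n² token-expert pairs once and never touches the training batch, which matches the cost profile described in the summary.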

2. BridgeDrive: Diffusion Bridge Policy for Closed-Loop Trajectory Planning in Autonomous Driving

3. When do World Models Successfully Learn Dynamical Systems?

4. Constraint-guided Hardware-aware NAS through Gradient Modification

5. A Derandomization Framework for Structure Discovery: Applications in Neural Networks and Beyond

6. DES-LOC: Desynced Low Communication Adaptive Optimizers for Foundation Models

7. LucidFlux: Caption-Free Photo-Realistic Image Restoration via a Large-Scale Diffusion Transformer

8. On Coreset for LASSO Regression Problem with Sensitivity Sampling

9. MT-DAO: Multi-Timescale Distributed Adaptive Optimizers with Local Updates

10. Understanding Cross-layer Contributions to Mixture-of-Experts Routing in LLMs

11. On Universality of Deep Equivariant Networks

12. Read the Room: Video Social Reasoning with Mental-Physical Causal Chains

13. Diffusion & Adversarial Schrödinger Bridges via Iterative Proportional Markovian Fitting

14. InfoBridge: Mutual Information estimation via Bridge Matching

15. Stackelberg Learning from Human Feedback: Preference Optimization as a Sequential Game

📝 Official AI Blogs

1. The new AI-powered Google Finance is expanding to Europe.

2. See what happens when creative legends use AI to make ads for small businesses.

3. 5 gardening tips you can try right in Search

4. Early Indicators of Reward Hacking via Reasoning Interpolation

5. Reward Hacking Research Update

6. Pretraining Data Filtering for Open-Weight AI Safety

7. Introducing Claude Opus 4.7 (Product, Apr 16, 2026): Our latest Opus model brings stronger performance across coding, agents, vision, and multi-step tasks, with greater thoroughness and consistency on the work that matters most.

8. Introducing Claude Design by Anthropic Labs (Product, Apr 17, 2026): Today, we’re launching Claude Design, a new Anthropic Labs product that lets you collaborate with Claude to create polished visual work like designs, prototypes, slides, one-pagers, and more.

9. Project Glasswing (Announcements, Apr 7, 2026): A new initiative that brings together Amazon Web Services, Anthropic, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks in an effort to secure the world’s most critical software.

📬 TLDR AI Picks

1. Interaction Models: A Scalable Approach to Human-AI Collaboration

2. Elon Musk Announces xAI Will Become SpaceXAI Division

3. Google’s Gemini Omni video model surfaces ahead of I/O debut

4. The Inference Shift

5. Foundation Model Scaling

6. TLDR is hiring a Senior Software Engineer, Applied AI ($250k-$350k, Fully Remote)

7. Trajectory Models for Few-Step Diffusion

8. Agentic Test-Time Scaling (GitHub Repo)

📰 TechCrunch AI News

1. Dessn raises $6M for its production focused design tool

2. AI voice startup Vapi hits $500M valuation after winning Amazon Ring over 40 rivals

3. Thinking Machines wants to build an AI that actually listens while it talks

4. Riding an AI rally, Robinhood preps second retail venture IPO

5. GM just laid off hundreds of IT workers to hire those with stronger AI skills
