Annual Report on Research into Reliability Design for Large-Scale High-Throughput Computing Systems

2016-05-14 08:52 Li Xiaowei (李晓维), Yan Guihai (鄢贵海), Han Yinhe (韩银和)
Science and Technology Innovation Herald (科技创新导报), 2016, Issue 9

Li Xiaowei, Yan Guihai, Han Yinhe

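Summary: A high-throughput computing system consists of a massive number of computing nodes and storage nodes interconnected by a network. At this scale, reliability becomes a very serious problem: component failure is the norm, and system design must account for fault tolerance. A new reliability assurance framework for high-throughput computing systems is needed, one that meets the reliability requirements of the different layers of high-throughput computing and supports research on cross-layer reliable computing techniques from the chip level to the system level. Toward this goal, the research proceeds along three lines: fault detection and fault-tolerant design methods for high-throughput processor chips; failure detection and recovery methods for high-throughput computing systems; and a supporting environment for fault self-prediction, self-detection, self-localization, self-isolation, and self-healing (5S) from the chip level to the system level. As of 2013, all work was progressing steadily according to the original plan in the task statement, and some of it had produced interim results. Breakthroughs were achieved on (1) online prediction of NBTI aging faults; (2) system fault prediction techniques such as deep learning; (3) register fault diagnosis; and (4) network-on-chip communication isolation. Six IEEE Transactions papers and one paper in another journal were published or accepted. In terms of coverage, the deployed research topics cover all research plans specified in the task statement, and some topics have been refined further.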

Keywords: reliability design; fault detection; deep learning; online prediction; communication isolation

Abstract: A high-throughput computing system comprises a massive number of computing nodes and storage nodes together with the network that interconnects them. Because of this scale, component malfunction is common, which makes reliability a pressing issue; in other words, system design must take fault tolerance into account. We intend to build a new reliability framework specifically for high-throughput computing systems, in order to accommodate the reliability demands of the various layers of high-throughput computing and to design the corresponding reliable computing techniques across the chip level and the system level. To achieve this objective, the study pursues three lines of research: (1) fault detection and fault tolerance approaches for high-throughput processor chips; (2) failure detection and recovery methods for high-throughput computing systems; and (3) a supporting environment for self-prediction, self-detection, self-localization, self-isolation, and self-healing (5S) from the chip level to the system level. Up to 2013, the work proceeded steadily in line with the task specification, and parts of it reached their preset milestones. Breakthroughs were made in (1) NBTI aging prediction, (2) fault prediction based on deep learning, (3) register fault diagnosis, and (4) on-chip communication isolation techniques, accompanied by publications in high-ranking venues. In terms of coverage, the deployed research topics cover all research plans defined in the proposal, and some of them have been refined further.

Key Words: Reliability design; Fault detection; Deep learning; Online prediction; Communication isolation

Full-text link (real-name registration required): http://www.nstrs.cn/xiangxiBG.aspx?id=50730&flag=1
