
Design and Implementation of a Meteorological Big Data Computing Resource Management System

Design and implementation of a meteorological big data computing resource management system based on a container platform

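  • Abstract: With the meteorological big data cloud platform "Tianqing" now in formal operational service, improving the running efficiency of big data algorithms and optimizing the allocation of computing resources have become urgent problems. Starting from the system's construction requirements, overall architecture, and implementation approach, this paper discusses the design ideas and several key technologies of a meteorological big data computing resource optimization and management system, and reaches the following conclusions. The system is based on a B/S architecture, uses "Tianqing" as its data middle platform, and treats containers as the unit of computing resource supply and optimized management; it comprises two subsystems, processing container scheduling and container platform management. The processing container scheduling subsystem provides account creation, resource authorization, image building, algorithm registration, algorithm loading, task definition, task startup, and task log collection, and users can flexibly trigger tasks on the scheduling platform. Once a task is triggered, the processing container scheduling subsystem creates a container on the container platform, executes the task, and monitors the container's usage in real time during task execution. The container platform management subsystem, built on the Kubernetes container orchestration engine, compares the resources configured for the algorithm with the allocatable resources, assembles the configuration files, dispatches the resource deployment scripts, performs node pre-selection and prioritization, and finally obtains the algorithm's results. Key technologies, including multi-node balanced scheduling, fine-grained matching of algorithm resources, resource isolation for running containers, image storage fault recovery, and container and algorithm fault monitoring, are applied together to improve the scheduling capability, reliability, and resource utilization of container computing resources.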

     

    Abstract: With the Hubei meteorological big data cloud platform "Tianqing" now in formal operational service, improving the running efficiency of big data algorithms and strengthening the allocation of computing resources have become urgent problems. This paper discusses the design ideas and several key technologies of a meteorological big data computing resource optimization and management system in terms of its construction requirements, overall architecture, and implementation approach, and draws the following conclusions. The system is based on a B/S architecture, takes "Tianqing" as its data middle platform, and treats the container as the object of computing resource supply and optimized management; it includes two subsystems, processing container scheduling and container platform management. The processing container scheduling subsystem provides account creation, resource authorization, image building, algorithm registration, algorithm loading, task definition, task startup, task log collection, and other functions, and users can flexibly trigger tasks on the scheduling platform. After a task is triggered, the processing container scheduling subsystem creates a container on the container platform, executes the task, and monitors the container's usage in real time during task execution. The container platform management subsystem, based on the Kubernetes container orchestration engine, compares the resources configured for the algorithm with the allocatable resources on the schedulable nodes, assembles the configuration files, issues the resource deployment scripts, performs node pre-selection and prioritization, and finally obtains the algorithm's results. Key technologies such as multi-node balanced scheduling, fine-grained matching of algorithm resources, resource isolation for running containers, image storage fault recovery, and container and algorithm fault monitoring are used together to improve the scheduling capability, reliability, and resource utilization of container computing resources.
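
    The comparison of an algorithm's configured resources with the allocatable resources on schedulable nodes, followed by node pre-selection and prioritization, can be illustrated with a minimal sketch. It is not the paper's implementation and does not call the Kubernetes API. The class names, field names, and the scoring rule (prefer the node with the most capacity left after placement, a simple form of balanced scheduling) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical view of a schedulable node (not a Kubernetes object)."""
    name: str
    cpu_free_m: int    # allocatable CPU still free, in millicores
    mem_free_mi: int   # allocatable memory still free, in MiB
    cpu_total_m: int   # total allocatable CPU, in millicores
    mem_total_mi: int  # total allocatable memory, in MiB


@dataclass
class Request:
    """Hypothetical resource configuration registered for an algorithm."""
    cpu_m: int
    mem_mi: int


def filter_nodes(nodes: List[Node], req: Request) -> List[Node]:
    """Pre-selection: keep only nodes whose free resources cover the request."""
    return [
        n for n in nodes
        if n.cpu_free_m >= req.cpu_m and n.mem_free_mi >= req.mem_mi
    ]


def score(node: Node, req: Request) -> float:
    """Prioritization (assumed rule): mean fraction of CPU and memory
    that would remain free after placing the request on this node."""
    cpu_left = (node.cpu_free_m - req.cpu_m) / node.cpu_total_m
    mem_left = (node.mem_free_mi - req.mem_mi) / node.mem_total_mi
    return (cpu_left + mem_left) / 2


def pick_node(nodes: List[Node], req: Request) -> Optional[Node]:
    """Return the highest-scoring feasible node, or None if none fits."""
    feasible = filter_nodes(nodes, req)
    if not feasible:
        return None
    return max(feasible, key=lambda n: score(n, req))


if __name__ == "__main__":
    cluster = [
        Node("node-a", 1000, 2048, 4000, 8192),
        Node("node-b", 3000, 6144, 4000, 8192),
        Node("node-c", 500, 8192, 4000, 8192),
    ]
    chosen = pick_node(cluster, Request(cpu_m=800, mem_mi=1024))
    print(chosen.name if chosen else "no feasible node")
```

    In this sketch node-c is dropped at the pre-selection stage because its free CPU is below the request, and the remaining nodes are ranked by remaining capacity. A production scheduler would combine many more criteria than the two shown here.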

     
