Azure AI guide for predictive maintenance solutions


Predictive maintenance (PdM) is a popular application of predictive analytics that can help businesses in several industries achieve high asset utilization and savings in operational costs. This guide brings together business and analytical guidelines and best practices for successfully developing and deploying PdM solutions using Microsoft Azure AI platform technology.

To begin, this guide introduces industry-specific business scenarios and the process of qualifying these scenarios for PdM. It also covers the data requirements and modeling techniques for building PdM solutions. The main content of the guide is the data science process, including the steps of data preparation, feature engineering, model creation, and model operationalization. To complement these key concepts, the guide lists a set of solution templates to help accelerate PdM application development. It also points to useful training resources for practitioners who want to learn more about the AI behind the data science.

Data science guide overview and target audience

The first half of this guide describes typical business problems and the benefits of implementing PdM to address them, and lists some common use cases. Business decision makers (BDMs) will benefit from this content. The second half explains the data science behind PdM, and provides a list of PdM solutions built using the principles outlined in this guide. It also provides learning paths and pointers to training material. Technical decision makers (TDMs) will find this content useful.

Start with ... | If you are ...
Business case for predictive maintenance | a business decision maker (BDM) looking to reduce downtime and operational costs, and improve utilization of equipment
Data Science for predictive maintenance | a technical decision maker (TDM) evaluating PdM technologies to understand the unique data processing and AI requirements for predictive maintenance
Solution templates for predictive maintenance | a software architect or AI developer looking to quickly stand up a demo or a proof-of-concept
Training resources for predictive maintenance | any or all of the above, and want to learn the foundational concepts behind the data science, tools, and techniques

Prerequisite knowledge

The BDM content does not expect the reader to have any prior data science knowledge. For the TDM content, basic knowledge of statistics and data science is helpful. Knowledge of Azure Data and AI services, Python, R, XML, and JSON is recommended. AI techniques are implemented in Python and R packages. Solution templates are implemented using Azure services, development tools, and SDKs.

Business case for predictive maintenance

Businesses require critical equipment to be running at peak efficiency and utilization to realize their return on capital investments. These assets range from aircraft engines, turbines, elevators, and industrial chillers that cost millions, down to everyday appliances like photocopiers, coffee machines, and water coolers.

  • By default, most businesses rely on corrective maintenance, where parts are replaced as and when they fail. Corrective maintenance ensures parts are used completely (therefore not wasting component life), but costs the business in downtime, labor, and unscheduled maintenance requirements (off hours, or inconvenient locations).
  • At the next level, businesses practice preventive maintenance, where they determine the useful lifespan of a part, and maintain or replace it before a failure. Preventive maintenance avoids unscheduled and catastrophic failures. But the high costs of scheduled downtime, under-utilization of the component during its useful lifetime, and labor still remain.
  • The goal of predictive maintenance is to optimize the balance between corrective and preventive maintenance by enabling just-in-time replacement of components. This approach replaces components only when they are close to failure. By extending component lifespans (compared to preventive maintenance) and reducing unscheduled maintenance and labor costs (compared to corrective maintenance), businesses can gain cost savings and competitive advantages.

Business problems in PdM

Businesses face high operational risk due to unexpected failures and have limited insight into the root causes of problems in complex systems. Some of the key business problems are:

  • Detect anomalies in equipment or system performance or functionality.
  • Predict whether an asset may fail in the near future.
  • Estimate the remaining useful life of an asset.
  • Identify the main causes of failure of an asset.
  • Identify what maintenance actions need to be done, and by when, on an asset.

Typical goal statements for PdM are:

  • Reduce operational risk of mission-critical equipment.
  • Increase rate of return on assets by predicting failures before they occur.
  • Control cost of maintenance by enabling just-in-time maintenance operations.
  • Lower customer attrition, improve brand image, and reduce lost sales.
  • Lower inventory costs by predicting the reorder point and reducing inventory levels.
  • Discover patterns connected to various maintenance problems.
  • Provide KPIs (key performance indicators) such as health scores for asset conditions.
  • Estimate remaining lifespan of assets.
  • Recommend timely maintenance activities.
  • Enable just-in-time inventory by estimating order dates for replacement parts.

These goal statements are the starting points for:

  • data scientists to analyze and solve specific predictive problems.
  • cloud architects and developers to put together an end-to-end solution.

Qualifying problems for predictive maintenance

It is important to emphasize that not all use cases or business problems can be effectively solved by PdM. The following qualifying criteria need to be considered during problem selection:

  • The problem has to be predictive in nature; that is, there should be a target or an outcome to predict. The problem should also have a clear path of action to prevent failures when they are detected.
  • The problem should have a record of the operational history of the equipment that contains both good and bad outcomes. The set of actions taken to mitigate bad outcomes should also be available as part of these records. Error reports, maintenance logs of performance degradation, and repair and replacement logs are also important. Records of the repairs undertaken to correct degradation are useful as well.
  • The recorded history should be reflected in relevant data that is of sufficient quality to support the use case. For more information about data relevance and sufficiency, see Data requirements for predictive maintenance.
  • Finally, the business should have domain experts who have a clear understanding of the problem. They should be aware of the internal processes and practices so that they can help the analyst understand and interpret the data. They should also be able to make the necessary changes to existing business processes to help collect the right data for the problems, if needed.

Sample PdM use cases

This section focuses on a collection of PdM use cases from several industries such as Aerospace, Utilities, and Transportation. Each use case starts with a business problem, and then discusses the benefits of a PdM solution; the relevant data for each use case is tabulated later in this guide.

Business problem | Benefits from PdM
Flight delay and cancellations due to mechanical problems. Failures that cannot be repaired in time may cause flights to be canceled, and disrupt scheduling and operations. | PdM solutions can predict the probability of an aircraft being delayed or canceled due to mechanical failures.
Aircraft engine parts failure: Aircraft engine part replacements are among the most common maintenance tasks within the airline industry. Maintenance solutions require careful management of component stock availability, delivery, and planning. | Being able to gather intelligence on component reliability leads to a substantial reduction in investment costs.
ATM failure is a common problem within the banking industry. The problem here is to report the probability that an ATM cash withdrawal transaction gets interrupted due to a paper jam or part failure in the cash dispenser. Based on predictions of transaction failures, ATMs can be serviced proactively to prevent failures from occurring. | Rather than allow the machine to fail midway through a transaction, the desirable alternative is to program the machine to deny service based on the prediction.
Wind turbine failures: Wind turbines are the main energy source in environmentally responsible countries/regions, and involve high capital costs. A key component in wind turbines is the generator motor, whose failure renders the turbine ineffective. It is also highly expensive to fix. | Predicting KPIs such as MTTF (mean time to failure) can help energy companies prevent turbine failures and ensure minimal downtime. Failure probabilities tell technicians which turbines are likely to fail soon, so that they can monitor them and schedule time-based maintenance regimes. Predictive models provide insights into the different factors that contribute to a failure, which helps technicians better understand the root causes of problems.
Circuit breaker failures: Distribution of electricity to homes and businesses requires power lines to be operational at all times to guarantee energy delivery. Circuit breakers help limit or avoid damage to power lines during overloading or adverse weather conditions. The business problem here is to predict circuit breaker failures. | PdM solutions help reduce repair costs and increase the lifespan of equipment such as circuit breakers. They help improve the quality of the power network by reducing unexpected failures and service interruptions.
Transportation and logistics
Elevator door failures: Large elevator companies provide a full stack service for millions of functional elevators around the world. Elevator safety, reliability, and uptime are the main concerns for their customers. These companies track these and various other attributes via sensors, to help them with corrective and preventive maintenance. In an elevator, the most prominent customer problem is malfunctioning elevator doors. The business problem in this case is to provide a knowledge base predictive application that predicts the potential causes of door failures. | Elevators are capital investments with a potential lifespan of 20-30 years. Each potential sale can therefore be highly competitive, and expectations for service and support are high. Predictive maintenance can provide these companies with an advantage over their competitors in their product and service offerings.
Wheel failures: Wheel failures account for half of all train derailments and cost billions to the global rail industry. Wheel failures also cause rails to deteriorate, sometimes causing the rail to break prematurely. Rail breaks lead to catastrophic events such as derailments. To avoid such instances, railways monitor the performance of wheels and replace them in a preventive manner. The business problem here is the prediction of wheel failures. | Predictive maintenance of wheels helps with just-in-time replacement of wheels.
Subway train door failures: A major reason for delays in subway operations is door failures of train cars. The business problem here is to predict train door failures. | Early awareness of a door failure, or of the number of days until a door failure, helps the business optimize train door servicing schedules.

The next section gets into the details of how to realize the PdM benefits discussed above.

Data Science for predictive maintenance

This section provides general guidelines on data science principles and practice for PdM. It is intended to help a TDM, solution architect, or developer understand the prerequisites and process for building end-to-end AI applications for PdM. You can read this section along with a review of the demos and proof-of-concept templates listed in Solution templates for predictive maintenance. You can then use these principles and best practices to implement your PdM solution in Azure.


This guide is NOT intended to teach the reader data science. Several helpful sources are provided for further reading in the section Training resources for predictive maintenance. The solution templates listed in the guide demonstrate some of these AI techniques for specific PdM problems.

Data requirements for predictive maintenance

The success of any learning depends on (a) the quality of what is being taught, and (b) the ability of the learner. Predictive models learn patterns from historical data, and predict future outcomes with a certain probability based on these observed patterns. A model's predictive accuracy depends on the relevancy, sufficiency, and quality of the training and test data. The new data that is 'scored' using this model should have the same features and schema as the training/test data. The feature characteristics (type, density, distribution, and so on) of new data should match those of the training and test data sets. The focus of this section is on such data requirements.

Relevant data

First, the data has to be relevant to the problem. Consider the wheel failure use case discussed above: the training data should contain features related to the wheel operations. If the problem were to predict the failure of the traction system, the training data would have to encompass all the different components of the traction system. The first case targets a specific component, whereas the second case targets the failure of a larger subsystem. The general recommendation is to design prediction systems around specific components rather than larger subsystems, since the latter will have more dispersed data. The domain expert (see Qualifying problems for predictive maintenance) should help in selecting the most relevant subsets of data for the analysis. The relevant data sources are discussed in greater detail in Data preparation for predictive maintenance.

Sufficient data

Two questions are commonly asked with regard to failure history data: (1) "How many failure events are required to train a model?" (2) "How many records are considered 'enough'?" There are no definitive answers, only rules of thumb. For (1), the more failure events there are, the better the model. For (2), the exact number depends on the data and the context of the problem being solved. On the flip side, if a machine fails too often, the business will replace it, which reduces the number of failure instances. Here again, guidance from the domain expert is important. There are also methods for coping with rare events; they are discussed in the section Handling imbalanced data.

Quality data

The quality of the data is critical: each predictor attribute value must be accurate in conjunction with the value of the target variable. Data quality is a well-studied area in statistics and data management, and is therefore out of scope for this guide.


There are several resources and enterprise products to deliver quality data. A sample of references is provided below:

Data preparation for predictive maintenance

Data sources

The relevant data sources for predictive maintenance include, but are not limited to:

  • Failure history
  • Maintenance/repair history
  • Machine operating conditions
  • Equipment metadata

Failure history

Failure events are rare in PdM applications. However, when building prediction models, the algorithm needs to learn a component's normal operational pattern as well as its failure patterns. So the training data should contain a sufficient number of examples from both categories. Maintenance records and parts replacement history are good sources for finding failure events. With the help of some domain knowledge, anomalies in the training data can also be defined as failures.

Maintenance/repair history

The maintenance history of an asset contains details about components replaced, repair activities performed, and so on. These events record degradation patterns. Absence of this crucial information in the training data can lead to misleading model results. Failure history can also be found within maintenance history as special error codes or order dates for parts. Additional data sources that influence failure patterns should be investigated and provided by domain experts.

Machine operating conditions

Sensor-based (or other) streaming data from the equipment in operation is an important data source. A key assumption in PdM is that a machine's health status degrades over time during its routine operation. The data is expected to contain time-varying features that capture this aging pattern, and any anomalies that lead to degradation. The temporal aspect of the data is required for the algorithm to learn the failure and non-failure patterns over time. Based on these data points, the algorithm learns to predict how many more units of time a machine can continue to work before it fails.

Static feature data

Static features are metadata about the equipment. Examples are the equipment make, model, manufactured date, start date of service, location of the system, and other technical specifications.

Examples of relevant data for the sample PdM use cases are tabulated below:

Use case | Examples of relevant data
Flight delay and cancellations | Flight route information in the form of flight legs and page logs. Flight leg data includes routing details such as departure/arrival date, time, airport, layovers, and so on. Page log data includes a series of error and maintenance codes recorded by the ground maintenance personnel.
Aircraft engine parts failure | Data collected from sensors in the aircraft that provide information on the condition of the various parts. Maintenance records help identify when component failures occurred and when the components were replaced.
ATM failure | Sensor readings for each transaction (depositing cash/check) and dispensing of cash. Information on gap measurement between notes, note thickness, note arrival distance, check attributes, and so on. Maintenance records that provide error codes, repair information, and the last time the cash dispenser was refilled.
Wind turbine failure | Sensors monitor turbine conditions such as temperature, wind direction, power generated, generator speed, and so on. Data is gathered from multiple wind turbines in wind farms located in various regions. Typically, each turbine has multiple sensor readings relaying measurements at a fixed time interval.
Circuit breaker failures | Maintenance logs that include corrective, preventive, and systematic actions. Operational data that includes automatic and manual commands sent to circuit breakers, such as open and close actions. Device metadata such as date of manufacture, location, model, and so on. Circuit breaker specifications such as voltage levels, geolocation, and ambient conditions.
Elevator door failures | Elevator metadata such as type of elevator, manufactured date, maintenance frequency, building type, and so on. Operational information such as number of door cycles and average door close time. Failure history with causes.
Wheel failures | Sensor data that measures wheel acceleration, braking instances, driving distance, velocity, and so on. Static information on wheels such as manufacturer and manufactured date. Failure data inferred from a part order database that tracks order dates and quantities.
Subway train door failures | Door opening and closing times, and other operational data such as the current condition of train doors. Static data would include asset identifier, time, and condition value columns.

Data types

Given the above data sources, the two main data types observed in the PdM domain are:

  • Temporal data: Operational telemetry, machine conditions, work order types, and priority codes have timestamps from the time of recording. Failure, maintenance/repair, and usage history also have timestamps associated with each event.
  • Static data: Machine features and operator features are in general static, since they describe the technical specifications of machines or operator attributes. If these features can change over time, they should also have timestamps associated with them.

Predictor and target variables should be preprocessed/transformed into numerical, categorical, and other data types depending on the algorithm being used.

Data preprocessing

As a prerequisite to feature engineering, prepare the data from the various streams to compose a schema from which it is easy to build features. Visualize the data first as a table of records. Each row in the table represents a training instance, and the columns represent predictor features (also called independent attributes or variables). Organize the data such that the last column(s) is the target (dependent variable). For each training instance, assign a label as the value of this column.

For temporal data, divide the duration of sensor data into time units. Each record should belong to a time unit for an asset, and should offer distinct information. Time units are defined based on business needs in multiples of seconds, minutes, hours, days, months, and so on. The time unit does not have to be the same as the frequency of data collection. If the frequency is high, the data may not show any significant difference from one unit to the next. For example, assume that ambient temperature was collected every 10 seconds. Using that same interval for training data only inflates the number of examples without providing any additional information. For this case, a better strategy would be to average the data over 10 minutes or an hour, based on the business justification.
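The averaging strategy above can be sketched with pandas. This is a minimal illustration, not part of the original guide: the column names (asset_id, timestamp, ambient_temp), the helper to_time_units, and the synthetic readings are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical telemetry: ambient temperature sampled every 10 seconds for one hour.
timestamps = pd.date_range(
    "2024-01-01 00:00:00", periods=360, freq=pd.Timedelta(seconds=10)
)
raw = pd.DataFrame({
    "asset_id": "A1",
    "timestamp": timestamps,
    "ambient_temp": 20.0 + 0.01 * np.arange(360),
})

def to_time_units(df, unit="10min"):
    """Average each asset's readings over fixed time units (assumed: 10 minutes)."""
    return (
        df.set_index("timestamp")
          .groupby("asset_id")["ambient_temp"]
          .resample(unit)
          .mean()
          .reset_index()
    )

# One record per asset per 10-minute time unit instead of one per 10 seconds.
per_unit = to_time_units(raw)
```

The unit argument stands in for the business decision about the time unit; passing "1h" instead would produce hourly averages.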

For static data,

  • Maintenance records: Raw maintenance data has an asset identifier and timestamp, with information on the maintenance activities that were performed at a given point in time. Transform maintenance activities into categorical columns, where each category descriptor uniquely maps to a specific maintenance action. The schema for maintenance records would include asset identifier, time, and maintenance action.

  • Failure records: Failures or failure reasons can be recorded as specific error codes or failure events defined by specific business conditions. In cases where the equipment has multiple error codes, the domain expert should help identify the ones that are pertinent to the target variable. Use the remaining error codes or conditions to construct predictor features that correlate with these failures. The schema for failure records would include asset identifier, time, and failure or failure reason, if available.

  • Machine and operator metadata: Merge the machine and operator data into one schema to associate an asset with its operator, along with their respective attributes. The schema for machine conditions would include asset identifier, asset features, operator identifier, and operator features.
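As an illustration of the maintenance-record transformation described in the list above, the following sketch turns a raw maintenance log into categorical indicator columns. The table contents, the column names, and the "maint" prefix are hypothetical; one-hot encoding is only one possible way to represent the categories.

```python
import pandas as pd

# Hypothetical raw maintenance log: asset identifier, time, and maintenance action.
maintenance = pd.DataFrame({
    "asset_id": ["A1", "A1", "A2"],
    "timestamp": pd.to_datetime(["2024-01-03", "2024-02-10", "2024-01-20"]),
    "action": ["replace_bearing", "lubricate", "replace_bearing"],
})

# Treat the action as categorical so each descriptor maps to exactly one action.
maintenance["action"] = maintenance["action"].astype("category")

# Build one indicator column per maintenance action.
indicators = pd.get_dummies(maintenance["action"], prefix="maint").astype(int)

# Resulting schema: asset identifier, time, and one column per maintenance action.
maintenance_features = pd.concat(
    [maintenance[["asset_id", "timestamp"]], indicators], axis=1
)
```

Failure records and machine/operator metadata can be shaped in a similar way, keeping the asset identifier (and timestamp, where present) as the join keys.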

Other data preprocessing steps include handling missing values and normalization of attribute values. A detailed discussion is beyond the scope of this guide; see the next section for some useful references.

With the above preprocessed data sources in place, the final transformation before feature engineering is to join the above tables based on the asset identifier and timestamp. The resulting table has null values in the failure column when the machine is in normal operation. These null values can be imputed with an indicator for normal operation. Use this failure column to create labels for the predictive model. For more information, see the section on modeling techniques for predictive maintenance.
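The join and imputation step might look like the following sketch. The two small tables, the column names, and the "none" indicator for normal operation are assumptions chosen for the example; a real solution would join all of the preprocessed sources in the same way.

```python
import pandas as pd

# Hypothetical preprocessed telemetry: one record per asset per time unit.
telemetry = pd.DataFrame({
    "asset_id": ["A1", "A1", "A1", "A2", "A2"],
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-01", "2024-01-02"]
    ),
    "vibration": [0.21, 0.25, 0.40, 0.19, 0.20],
})

# Hypothetical failure records: asset identifier, time, and failure reason.
failures = pd.DataFrame({
    "asset_id": ["A1"],
    "timestamp": pd.to_datetime(["2024-01-03"]),
    "failure": ["bearing_fault"],
})

# A left join keeps every telemetry record; records without a failure get null.
joined = telemetry.merge(failures, on=["asset_id", "timestamp"], how="left")

# Impute the nulls with an indicator for normal operation (assumed name: "none").
joined["failure"] = joined["failure"].fillna("none")

# Derive a binary label from the failure column for the predictive model.
joined["label"] = (joined["failure"] != "none").astype(int)
```

The binary label is just one choice; the failure reason itself could serve as a multi-class label, depending on the modeling technique.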

Feature engineering

Feature engineering is the first step prior to modeling the data. Its role in the data science process is described here. A feature is a predictive attribute for the model, such as temperature, pressure, vibration, and so on. For PdM, feature engineering involves abstracting a machine's health from historical data collected over a sizable duration. In that sense, it is different from its peers such as remote monitoring, anomaly detection, and failure detection.

Time windows

Remote monitoring entails reporting events as they happen at points in time. Anomaly detection models evaluate (score) incoming streams of data to flag anomalies at points in time. Failure detection classifies failures into specific types as they occur at points in time. In contrast, PdM involves predicting failures over a future time period, based on features that represent machine behavior over a historical time period. For PdM, feature data from individual points in time is too noisy to be predictive. So the data for each feature needs to be smoothed by aggregating data points over time windows.

Lag features

The business requirements define how far the model has to predict into the future. In turn, this duration helps define how far back the model has to look to make these predictions. This "looking back" period is called the lag, and features engineered over this lag period are called lag features. This section discusses lag features that can be constructed from data sources with timestamps, and feature creation from static data sources. Lag features are typically numerical in nature.

The window size is determined via experimentation, and should be finalized with the help of a domain expert. The same caveat holds for the selection and definition of lag features, their aggregations, and the type of windows.

Rolling aggregates

For each record of an asset, a rolling window of size W is chosen as the number of units of time over which to compute the aggregates. Lag features are then computed using the W periods before the date of that record. In Figure 1, the blue lines show sensor values recorded for an asset for each unit of time. They denote a rolling average of feature values over a window of size W=3. The rolling average is computed over all records with timestamps in the range t1 (in orange) to t2 (in green). The value for W is typically in minutes or hours, depending on the nature of the data. But for certain problems, picking a large W (say 12 months) can provide the whole history of an asset until the time of the record.

Figure 1. Rolling aggregate features
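As a rough sketch of the rolling window idea, the function below computes a rolling mean and standard deviation with W=3. It assumes one reading per time unit for a single asset, and it assumes the window ends at (and includes) the current record; excluding the current record would be a small change to the slice. The function name and sample readings are hypothetical.

```python
from statistics import mean, pstdev


def rolling_features(values, w):
    """Rolling mean and standard deviation over the last w readings.

    Assumption: the window ends at (and includes) the current record.
    Records without a full window of history get None.
    """
    features = []
    for i in range(len(values)):
        if i + 1 < w:
            features.append(None)
            continue
        window = values[i + 1 - w : i + 1]
        features.append({"mean": mean(window), "std": pstdev(window)})
    return features


sensor = [10.0, 12.0, 14.0, 13.0, 19.0]  # hypothetical readings, one per time unit
rolled = rolling_features(sensor, w=3)
```

Other aggregates (count, min/max, outlier counts) would follow the same pattern, with a different function applied to `window`.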

Examples of rolling aggregates over a time window are count, average, CUSUM (cumulative sum) measures, and min/max values. In addition, variance, standard deviation, and count of outliers beyond N standard deviations are often used. Examples of aggregates that may be applied for the use cases in this guide are listed below.

  • Flight delay: count of error codes over the last day/week.
  • Aircraft engine part failure: rolling means, standard deviation, and sum over the past day, week, and so on. This metric should be determined along with the business domain expert.
  • ATM failures: rolling means, median, range, standard deviations, count of outliers beyond three standard deviations, upper and lower CUSUM.
  • Subway train door failures: count of events over the past day, week, two weeks, and so on.
  • Circuit breaker failures: failure counts over the past week, year, three years, and so on.

Another useful technique in PdM is to capture trend changes, spikes, and level changes using algorithms that detect anomalies in data.

Tumbling aggregates

For each labeled record of an asset, a look-back period made up of k consecutive, non-overlapping windows of size W is defined. Aggregates are then created over each of the k tumbling windows, denoted W-k, W-(k-1), …, W-2, W-1, for the periods before the record's timestamp. k can be a small number to capture short-term effects, or a large number to capture long-term degradation patterns (see Figure 2).

Figure 2. Tumbling aggregate features

For example, lag features for the wind turbines use case may be created with W=1 and k=3. With W measured in months, this yields a lag feature for each of the past three months, computed from top and bottom outliers.
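The tumbling windows can be sketched as follows, using W=1 and k=3 as in the wind turbine example. The function name, the choice of the mean as the aggregate, and the monthly outlier counts are assumptions for illustration only.

```python
def tumbling_features(values, index, w, k):
    """Mean of each of the k non-overlapping windows of size w that end
    just before position `index`, oldest window first.

    Returns None when there is not enough history for k full windows.
    """
    start = index - w * k
    if start < 0:
        return None
    return [sum(values[s : s + w]) / w for s in range(start, index, w)]


monthly_outliers = [2, 3, 1, 4, 6, 9, 7]  # hypothetical outlier counts per month

# W=1, k=3: one feature for each of the three months before record 6.
lags = tumbling_features(monthly_outliers, index=6, w=1, k=3)

# W=2, k=2: two features, each aggregating a two-month window.
wide_lags = tumbling_features(monthly_outliers, index=6, w=2, k=2)
```

Unlike rolling windows, these windows do not overlap, so each reading contributes to exactly one feature of a given record.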

Static features

Technical specifications of the equipment, such as date of manufacture, model number, and location, are some examples of static features. They are treated as categorical variables for modeling. Some examples for the circuit breaker use case are voltage, current, power capacity, transformer type, and power source. For wheel failures, the type of wheel (alloy versus steel) is an example.

The data preparation efforts discussed so far should lead to the data being organized as shown below. Training, test, and validation data should have this logical schema (this example shows time in units of days).

Asset ID    Time     <Feature Columns>    Label
A123        Day 1    ...                  ...
A123        Day 2    ...                  ...
...         ...      ...                  ...
B234        Day 1    ...                  ...
B234        Day 2    ...                  ...
...         ...      ...                  ...

The last step in feature engineering is the labeling of the target variable. This process is dependent on the modeling technique. In turn, the modeling technique depends on the business problem and the nature of the available data. Labeling is discussed in the next section.

Data preparation and feature engineering are as important as modeling techniques for arriving at successful PdM solutions. The domain expert and the practitioner should invest significant time in arriving at the right features and data for the model. A small sample of the many books on feature engineering is listed below:

  • Pyle, D. Data Preparation for Data Mining (The Morgan Kaufmann Series in Data Management Systems), 1999.
  • Zheng, A., Casari, A. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists, O'Reilly, 2018.
  • Dong, G., Liu, H. (Editors), Feature Engineering for Machine Learning and Data Analytics (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series), CRC Press, 2018.

Modeling techniques for predictive maintenance

This section discusses the main modeling techniques for PdM problems, along with their specific label construction methods. Notice that a single modeling technique can be used across different industries. The modeling technique is paired to the data science problem, rather than to the context of the data at hand.

The choice of labels for the failure cases and the labeling strategy should be determined in consultation with the domain expert.

Binary classification

Binary classification is used to predict the probability that a piece of equipment fails within a future time period, called the future horizon period X. X is determined by the business problem and the data at hand, in consultation with the domain expert. Examples of criteria for choosing X are:

  • The minimum lead time required to replace components, deploy maintenance resources, or perform maintenance to avoid a problem that is likely to occur in that period.
  • The minimum count of events that can happen before a problem occurs.

In this technique, two types of training examples are identified: a positive example, which indicates a failure, with label = 1, and a negative example, which indicates normal operations, with label = 0. The target variable, and hence the label values, are categorical. The model should identify each new example as likely to fail or likely to work normally over the next X time units.

Label construction for binary classification

The question here is: "What is the probability that the asset will fail in the next X units of time?" To answer this question, label the X records prior to the failure of an asset as "about to fail" (label = 1), and label all other records as "normal" (label = 0) (see Figure 3).

Figure 3. Labeling for binary classification
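A minimal sketch of this labeling rule follows. It assumes the records of one asset are consecutive time units identified by position, and that the failure record itself is not labeled "about to fail"; both are assumptions of the example rather than statements from the guide.

```python
def binary_labels(n_records, failure_indices, x):
    """Label = 1 for the x records before each failure, else 0.

    Assumptions: records are consecutive time units for one asset, and
    the failure record itself is not labeled as "about to fail".
    """
    labels = [0] * n_records
    for f in failure_indices:
        for i in range(max(0, f - x), f):
            labels[i] = 1
    return labels


# Ten records, one failure at position 6, horizon X = 3.
labels = binary_labels(n_records=10, failure_indices=[6], x=3)
```

Here the three records before the failure carry label 1 and every other record carries label 0.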

Examples of labeling strategy for some of the use cases are listed below.

  • Flight delays: X may be chosen as one day, to predict delays in the next 24 hours. Then all flights that are within 24 hours before failures are labeled as 1.
  • ATM cash dispense failures: A goal may be to determine the failure probability of a transaction in the next hour. In that case, all transactions that happened within the hour before the failure are labeled as 1. To predict failure probability over the next N currency notes dispensed, all notes dispensed within the last N notes before a failure are labeled as 1.
  • Circuit breaker failures: The goal may be to predict the next circuit breaker command failure. In that case, X is chosen to be one future command.
  • Train door failures: X may be chosen as two days.
  • Wind turbine failures: X may be chosen as two months.

Regression for predictive maintenance

Regression models are used to compute the remaining useful life (RUL) of an asset. RUL is defined as the amount of time that an asset is operational before the next failure occurs. Each training example is a record that belongs to a time unit nY for an asset, where n is the multiple. The model should calculate the RUL of each new example as a continuous number. This number denotes the period of time remaining before the failure.

Label construction for regression

The question here is: "What is the remaining useful life (RUL) of the equipment?" For each record prior to the failure, calculate the label to be the number of units of time remaining before the next failure. In this method, labels are continuous variables (see Figure 4).

Figure 4. Labeling for regression
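The RUL label can be sketched by walking backwards through an asset's records and counting the distance to the next failure. The example assumes consecutive time units identified by position, and it assigns None to records that have no later failure to measure against; the function name and sample positions are hypothetical.

```python
def rul_labels(n_records, failure_indices):
    """Remaining time units until the next failure for each record.

    Records after the last observed failure get None, because there is
    no later failure from which their RUL could be measured.
    """
    labels = [None] * n_records
    next_failure = None
    for i in range(n_records - 1, -1, -1):  # walk backwards in time
        if i in failure_indices:
            next_failure = i
        if next_failure is not None:
            labels[i] = next_failure - i
    return labels


# Eight records with failures at positions 2 and 6.
rul = rul_labels(n_records=8, failure_indices={2, 6})
```

In this sketch the failure records themselves get an RUL of 0, and the last record stays unlabeled.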

For regression, labeling is done with reference to a failure point. The label cannot be calculated without knowing how long the asset survived before a failure. So in contrast to binary classification, assets without any failures in the data cannot be used for modeling. This issue is best addressed by another statistical technique called survival analysis. But potential complications may arise when applying this technique to PdM use cases that involve time-varying data with frequent intervals. For more information on survival analysis, see this one-pager.

Multi-class classification for predictive maintenance

Multi-class classification techniques can be used in PdM solutions for two scenarios:

  • Predict two future outcomes: The first outcome is a range of time to failure for an asset. The asset is assigned to one of multiple possible periods of time. The second outcome is the likelihood of failure in a future period due to one of multiple root causes. This prediction enables the maintenance crew to watch for symptoms and plan maintenance schedules.
  • Predict the most likely root cause of a given failure. This outcome recommends the right set of maintenance actions to fix a failure. A ranked list of root causes and recommended repairs can help technicians prioritize their repair actions after a failure.

Label construction for multi-class classification

The question here is: "What is the probability that an asset will fail in the next nZ units of time, where n is the number of periods?" To answer this question, label the nZ records prior to the failure of an asset using buckets of time (3Z, 2Z, Z). Label all other records as "normal" (label = 0). In this method, the target variable holds categorical values (see Figure 5).

Figure 5. Labeling for multi-class classification for failure time prediction
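One way to sketch the time-bucket labels is to map each record's distance to the failure onto a class number. The numbering used here (class 1 for the bucket closest to the failure, up to class n for the furthest) is an assumption of the example, as are the function name and sample sizes.

```python
def bucket_labels(n_records, failure_index, z, n):
    """Class j (1..n) if the failure is more than (j-1)*z and at most j*z
    time units away; class 0 ("normal") for every other record.
    """
    labels = [0] * n_records
    for i in range(max(0, failure_index - n * z), failure_index):
        distance = failure_index - i          # 1 .. n*z
        labels[i] = (distance + z - 1) // z   # integer ceil(distance / z)
    return labels


# Ten records, failure at position 8, bucket width Z = 2, n = 3 buckets.
buckets = bucket_labels(n_records=10, failure_index=8, z=2, n=3)
```

Records outside the nZ look-back, including the failure record itself in this sketch, remain in the "normal" class.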

The question here is: "What is the probability that the asset will fail in the next X units of time due to root cause/problem Pi?", where i indexes the possible root causes. To answer this question, label the X records prior to the failure of an asset as "about to fail due to root cause Pi" (label = Pi). Label all other records as "normal" (label = 0). In this method too, labels are categorical (see Figure 6).

Figure 6. Labeling for multi-class classification for root cause prediction

The model assigns a failure probability due to each Pi as well as the probability of no failure. These probabilities can be ordered by magnitude to allow prediction of the problems that are most likely to occur in the future.

The question here is: "What maintenance actions do you recommend after a failure?" To answer this question, labeling does not require a future horizon to be picked, because the model is not predicting failure in the future. It is just predicting the most likely root cause once the failure has already happened.

Training, validation, and testing methods for predictive maintenance

The Team Data Science Process provides full coverage of the model train-test-validate cycle. This section discusses aspects unique to PdM.

Cross validation

The goal of cross validation is to define a data set to "test" the model in the training phase. This data set is called the validation set. This technique helps limit problems like overfitting and gives insight into how the model will generalize to an independent data set, that is, an unknown data set, which could be from a real problem. The training and testing routine for PdM needs to take into account the time-varying aspects to better generalize on unseen future data.

Many machine learning algorithms depend on a number of hyperparameters that can change the model performance significantly. The optimal values of these hyperparameters are not computed automatically when training the model. They should be specified by the data scientist. There are several ways of finding good values of hyperparameters.

The most common one is k-fold cross-validation, which splits the examples randomly into k folds. For each set of hyperparameter values, run the learning algorithm k times. At each iteration, use the examples in the current fold as a validation set, and the rest of the examples as a training set. Train the algorithm over the training examples and compute the performance metrics over the validation examples. At the end of this loop, compute the average of the k performance metrics. Then choose the set of hyperparameter values that has the best average performance. The task of choosing hyperparameters is often experimental in nature.

In PdM problems, data is recorded as a time series of events that come from several data sources. These records may be ordered according to the time of labeling. Hence, if the dataset is split randomly into training and validation sets, some of the training examples may be later in time than some of the validation examples. The future performance of hyperparameter values will then be estimated using some data that arrived before the model was trained. These estimates might be overly optimistic, especially if the time series is not stationary and evolves over time. As a result, the chosen hyperparameter values might be suboptimal.

The recommended way is to split the examples into training and validation sets in a time-dependent manner, where all validation examples are later in time than all training examples. For each set of hyperparameter values, train the algorithm over the training data set. Measure the model's performance over the same validation set. Choose the hyperparameter values that show the best performance. Hyperparameter values chosen by a time-dependent train/validation split result in better future model performance than values chosen by randomly split cross-validation.

The final model can be generated by training a learning algorithm over the entire training data using the best hyperparameter values.
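The selection procedure above can be sketched with a deliberately simple learner. Everything in this example is hypothetical: the toy learner flags a failure when a reading exceeds mean + c * standard deviation of the training readings, c is the only hyperparameter, and accuracy is used as the metric purely to keep the sketch short.

```python
from statistics import mean, pstdev


def time_split(examples, cutoff):
    """Training examples are at or before the cutoff, validation after."""
    train = [e for e in examples if e["time"] <= cutoff]
    valid = [e for e in examples if e["time"] > cutoff]
    return train, valid


def fit(train, c):
    """Toy learner: flag a failure when the reading exceeds mean + c * std."""
    readings = [e["x"] for e in train]
    return mean(readings) + c * pstdev(readings)


def accuracy(threshold, examples):
    hits = sum((e["x"] > threshold) == bool(e["y"]) for e in examples)
    return hits / len(examples)


def select_hyperparameter(examples, cutoff, grid):
    """Score every candidate on the same, later-in-time validation set."""
    train, valid = time_split(examples, cutoff)
    scores = {c: accuracy(fit(train, c), valid) for c in grid}
    return max(scores, key=scores.get), scores


examples = [
    {"time": t, "x": x, "y": y}
    for t, (x, y) in enumerate(
        [(1.0, 0), (1.2, 0), (0.9, 0), (3.0, 1), (1.1, 0), (1.0, 0),
         (1.3, 0), (2.5, 1), (1.6, 0), (1.1, 0)]
    )
]
best_c, scores = select_hyperparameter(examples, cutoff=5, grid=[0.0, 1.0, 2.0])
```

With a real learner, `fit` and `accuracy` would be replaced by the training routine and a metric suited to imbalanced data; the time-ordered split is the part this sketch is meant to show.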

Testing for model performance

Once a model is built, an estimate of its future performance on new data is required. A natural estimate is the performance metric of the chosen hyperparameter values computed over the validation set, or an average performance metric computed from cross-validation. However, these estimates are often overly optimistic. The business might also have additional guidelines on how they would like to test the model.

The recommended way for PdM is to split the examples into training, validation, and test data sets in a time-dependent manner. All test examples should be later in time than all the training and validation examples. After the split, generate the model and measure its performance as described earlier.

When time series are stationary and easy to predict, both random and time-dependent approaches generate similar estimates of future performance. But when time series are non-stationary, and/or hard to predict, the time-dependent approach will generate more realistic estimates of future performance.

Time-dependent split

This section describes best practices for implementing a time-dependent split. A time-dependent two-way split between training and test sets is described below.

Assume a stream of timestamped events such as measurements from various sensors. Define features and labels of training and test examples over time frames that contain multiple events. For example, for binary classification, create features based on past events, and create labels based on future events within X units of time in the future (see the sections on feature engineering and modeling techniques). Thus, the labeling time frame of an example comes later than the time frame of its features.

For a time-dependent split, pick a training cutoff time Tc at which to train a model, with hyperparameters tuned using historical data up to Tc. To prevent leakage of future labels that are beyond Tc into the training data, choose the latest time to label training examples to be X units before Tc. In the example shown in Figure 7, each square represents a record in the data set where features and labels are computed as described above. The figure shows the records that should go into training and testing sets for X=2 and W=3:

Figure 7. Time-dependent split for binary classification

The green squares represent records belonging to the time units that can be used for training. Each training example is generated by considering the past three periods for feature generation, and two future periods for labeling before Tc. When any part of the two future periods is beyond Tc, exclude that example from the training data set, because no visibility is assumed beyond Tc.

The black squares represent the records of the final labeled data set that should not be used in the training data set, given the above constraint. These records will also not be used in testing data, since they are before Tc. In addition, their labeling time frames partially depend on the training time frame, which is not ideal. Training and test data should have separate labeling time frames to prevent label information leakage.

The technique discussed so far allows for overlap between training and testing examples that have timestamps near Tc. A solution to achieve greater separation is to exclude examples that are within W time units of Tc from the test set. But such an aggressive split depends on ample data availability.
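The cutoff rules can be sketched as a small assignment function, using X=2 and W=3 as in Figure 7. It assumes integer time units and that a record at time t draws its label from the window (t, t + X]; the function name, the `gap` parameter, and the sample times are hypothetical.

```python
def time_dependent_split(record_times, tc, x, gap=0):
    """Assign each record time to 'train', 'test', or 'excluded'.

    Assumption: a record at time t is labeled from the window (t, t + x].
    - train: the whole labeling window ends at or before tc
    - test: the record is more than `gap` time units after tc
    - excluded: everything else (labels would leak across tc, or the
      record is too close to tc for the stricter separation)
    """
    split = {"train": [], "test": [], "excluded": []}
    for t in record_times:
        if t + x <= tc:
            split["train"].append(t)
        elif t > tc + gap:
            split["test"].append(t)
        else:
            split["excluded"].append(t)
    return split


# Twelve time units, cutoff Tc = 8, horizon X = 2.
split = time_dependent_split(range(1, 13), tc=8, x=2)

# Stricter variant: also drop test records within W = 3 units of Tc.
strict = time_dependent_split(range(1, 13), tc=8, x=2, gap=3)
```

The stricter variant leaves far fewer test records, which illustrates why that split depends on ample data.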

Regression models used for predicting RUL are more severely affected by the leakage problem. Using the random split method leads to extreme overfitting. For regression problems, the split should be such that the records belonging to assets with failures before Tc go into the training set. Records of assets that have failures after the cutoff go into the test set.

Another best practice for splitting data for training and testing is to split by asset ID. The split should be such that none of the assets used in the training set are used in testing the model performance. Using this approach, a model has a better chance of providing more realistic results with new assets.
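A split by asset ID can be sketched by holding out whole assets rather than individual records. The record layout, the 25% test fraction, and the fixed seed are assumptions made for the example.

```python
import random


def split_by_asset(records, test_fraction=0.25, seed=0):
    """Hold out whole assets so that no asset appears in both sets."""
    asset_ids = sorted({r["asset_id"] for r in records})
    random.Random(seed).shuffle(asset_ids)  # seeded for repeatability
    n_test = max(1, round(len(asset_ids) * test_fraction))
    test_ids = set(asset_ids[:n_test])
    train = [r for r in records if r["asset_id"] not in test_ids]
    test = [r for r in records if r["asset_id"] in test_ids]
    return train, test


# Hypothetical records: four assets with three days of data each.
records = [
    {"asset_id": a, "day": d}
    for a in ("A123", "B234", "C345", "D456")
    for d in (1, 2, 3)
]
train, test = split_by_asset(records)
```

Because the assignment is made per asset, every record of a held-out asset lands in the test set.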

Handling imbalanced data

In classification problems, if there are more examples of one class than of the others, the data set is said to be imbalanced. Ideally, the training data contains enough representatives of each class to enable differentiation between the classes. If one class is less than 10% of the data, the data is deemed to be imbalanced. The underrepresented class is called the minority class.

Many PdM problems face such imbalanced datasets, where one class is severely underrepresented compared to the other class, or classes. In some situations, the minority class may constitute only 0.001% of the total data points. Class imbalance is not unique to PdM. Other domains where failures and anomalies are rare occurrences face a similar problem, for example, fraud detection and network intrusion. These failures make up the minority class examples.

With class imbalance in data, the performance of most standard learning algorithms is compromised, since they aim to minimize the overall error rate. For a data set with 99% negative and 1% positive examples, a model can be shown to have 99% accuracy by labeling all instances as negative. But the model will misclassify all positive examples; so even if its accuracy is high, the algorithm is not a useful one. Consequently, conventional evaluation metrics such as overall accuracy or error rate are insufficient for imbalanced learning. When faced with imbalanced datasets, other metrics are used for model evaluation:

  • Precision
  • Recall
  • F1 scores
  • Cost-adjusted ROC (receiver operating characteristic)

For more information about these metrics, see model evaluation.

There are also methods that help remedy the class imbalance problem. The two major ones are sampling techniques and cost-sensitive learning.

Sampling methods

Imbalanced learning involves the use of sampling methods to modify the training data set into a balanced data set. Sampling methods are not to be applied to the test set. Although there are several sampling techniques, the most straightforward ones are random oversampling and undersampling.

Random oversampling involves selecting a random sample from the minority class, replicating these examples, and adding them to the training data set. Consequently, the number of examples in the minority class increases, eventually balancing the number of examples of the different classes. A drawback of oversampling is that multiple instances of certain examples can cause the classifier to become too specific, leading to overfitting. The model may show high training accuracy, but its performance on unseen test data may be suboptimal.

Conversely, random undersampling selects a random sample from the majority class and removes those examples from the training data set. However, removing examples from the majority class may cause the classifier to miss important concepts pertaining to the majority class. Hybrid sampling, where the minority class is oversampled and the majority class is undersampled at the same time, is another viable approach.
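The two random sampling methods can be sketched as follows. The example balances the classes exactly, which is one possible target rather than a requirement, and it assumes the majority class is at least as large as the minority class. The record layout and the seed are hypothetical. In practice these functions would be applied to the training set only.

```python
import random


def random_oversample(minority, majority, rng):
    """Replicate random minority examples until both classes match in size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, majority


def random_undersample(minority, majority, rng):
    """Keep a random subset of the majority, the same size as the minority."""
    return minority, rng.sample(majority, len(minority))


rng = random.Random(42)  # seeded so the sketch is repeatable
failures = [{"id": i, "label": 1} for i in range(3)]
normal = [{"id": i, "label": 0} for i in range(3, 30)]

over_min, over_maj = random_oversample(failures, normal, rng)
under_min, under_maj = random_undersample(failures, normal, rng)
```

Oversampling keeps all 27 normal examples and repeats failures to match; undersampling keeps all 3 failures and discards most normal examples.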

There are many sophisticated sampling techniques. The technique chosen depends on the data properties and the results of iterative experiments by the data scientist.

Cost-sensitive learning

In PdM, failures that constitute the minority class are of more interest than normal examples. So the focus is mainly on the algorithm's performance on failures. Incorrectly predicting a positive class as a negative class can cost more than the reverse. This situation is commonly referred to as unequal loss, or the asymmetric cost of misclassifying elements into different classes. The ideal classifier should deliver high prediction accuracy over the minority class, without compromising on the accuracy for the majority class.

There are multiple ways to achieve this balance. To mitigate the problem of unequal loss, assign a high cost to misclassification of the minority class, and try to minimize the overall cost. Algorithms like SVMs (Support Vector Machines) adopt this method inherently, by allowing the cost of positive and negative examples to be specified during training. Similarly, boosting methods such as boosted decision trees usually show good performance with imbalanced data.
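To make the effect of unequal costs concrete, the sketch below picks a decision threshold that minimizes total misclassification cost on a scored data set. Note that this is cost-based threshold selection applied after scoring, a simpler stand-in for the training-time cost weighting described above, not the same technique. The scores, labels, and cost values are hypothetical.

```python
def total_cost(scores, labels, threshold, fn_cost, fp_cost):
    """Sum misclassification costs when predicting 1 for score >= threshold."""
    cost = 0.0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 0:
            cost += fn_cost   # missed failure
        elif label == 0 and predicted == 1:
            cost += fp_cost   # false alarm
    return cost


def cheapest_threshold(scores, labels, fn_cost, fp_cost):
    """Try each observed score as a threshold and keep the cheapest."""
    candidates = sorted(set(scores))
    return min(candidates,
               key=lambda t: total_cost(scores, labels, t, fn_cost, fp_cost))


scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.80]  # predicted failure probability
labels = [0, 0, 0, 1, 0, 0, 1]                       # 1 = failure

equal = cheapest_threshold(scores, labels, fn_cost=1, fp_cost=1)
skewed = cheapest_threshold(scores, labels, fn_cost=10, fp_cost=1)
```

With equal costs the cheapest threshold is high and one failure is missed; when a missed failure costs ten times a false alarm, the threshold drops so that both failures are caught at the price of two false alarms.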

Model evaluation

Misclassification is a significant problem for PdM scenarios where the cost of false alarms to the business is high. For instance, a decision to ground an aircraft based on an incorrect prediction of engine failure can disrupt schedules and travel plans. Taking a machine offline from an assembly line can lead to loss of revenue. So model evaluation with the right performance metrics against new test data is critical.

Typical performance metrics used to evaluate PdM models are discussed below:

  • Accuracy is the most popular metric used for describing a classifier's performance. But accuracy is sensitive to data distributions, and is an ineffective measure for scenarios with imbalanced data sets. Other metrics are used instead. Tools like the confusion matrix are used to compute and reason about the accuracy of the model.
  • Precision of PdM models relates to the rate of false alarms. Lower precision of the model generally corresponds to a higher rate of false alarms.
  • Recall rate denotes how many of the failures in the test set were correctly identified by the model. Higher recall rates mean the model is successful in identifying the true failures.
  • F1 score is the harmonic mean of precision and recall, with its value ranging from 0 (worst) to 1 (best).
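The metrics above can be computed directly from the confusion matrix counts. The sketch below uses a hypothetical test set of 1 failure among 100 examples to show why accuracy alone is misleading on imbalanced data. Returning 0.0 when a ratio is undefined (for example, precision with no positive predictions) is a convention chosen for this example.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = failure)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    # Undefined ratios are reported as 0.0 (a convention for this sketch).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}


y_true = [1] + [0] * 99  # 1% failures

# A model that never predicts a failure.
always_negative = classification_metrics(y_true, [0] * 100)

# A model that catches the failure but also raises one false alarm.
one_false_alarm = classification_metrics(y_true, [1, 1] + [0] * 98)
```

Both models score 0.99 on accuracy, but only recall, precision, and F1 reveal that the first one never detects a failure.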

对于二元分类:For binary classification,

  • The receiver operating characteristic (ROC) curve is also a popular metric. In ROC curves, model performance is interpreted based on one fixed operating point on the ROC.
  • But for PdM problems, decile tables and lift charts are more informative. They focus only on the positive class (failures), and provide a more complex picture of the algorithm performance than ROC curves.
    • Decile tables are created using test examples in a descending order of failure probabilities. The ordered samples are then grouped into deciles (the 10% of the samples with the highest probability, then 20%, 30%, and so on). The ratio (true positive rate)/(random baseline) for each decile helps estimate the algorithm performance at each decile. The random baseline takes on values 0.1, 0.2, and so on.
    • Lift charts plot the decile true positive rate versus the random true positive rate for all deciles. The first deciles are usually the focus of results, since they show the largest gains. When used for PdM, the first deciles can also be seen as representing the "at risk" group.
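The decile table described above can be sketched as follows. This is one possible reading of the description, assuming that the "true positive rate" at each decile is the cumulative fraction of all failures captured in the top-ranked samples; the function name `decile_table` and the sample scores are hypothetical.

```python
def decile_table(y_true, scores, n_bins=10):
    """Build a cumulative decile table from failure labels and predicted probabilities."""
    # Rank test examples by predicted failure probability, highest first.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    labels = [label for _, label in ranked]
    total_failures = sum(labels)
    n = len(labels)
    rows = []
    for k in range(1, n_bins + 1):
        cutoff = (n * k) // n_bins           # top 10%, 20%, ... of the samples
        captured = sum(labels[:cutoff])      # failures found within that slice
        tpr = captured / total_failures if total_failures else 0.0
        baseline = k / n_bins                # random baseline: 0.1, 0.2, ...
        rows.append({
            "decile": k,
            "true_positive_rate": tpr,
            "random_baseline": baseline,
            "lift": tpr / baseline,
        })
    return rows


# Hypothetical test set: 20 samples, 4 failures, ranked mostly near the top.
scores = [1 - i / 20 for i in range(20)]
y_true = [1, 1, 0, 1, 0, 0, 0, 0, 0, 1] + [0] * 10

table = decile_table(y_true, scores)
for row in table:
    print(row)
```

Plotting the `true_positive_rate` column against the `random_baseline` column for each decile gives the lift chart; a lift well above 1 in the first deciles indicates that the highest-scored samples contain a disproportionate share of the failures.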

Model operationalization for predictive maintenance

The benefit of the data science exercise is realized only when the trained model is made operational. That is, the model must be deployed into the business systems to make predictions based on new, previously unseen data. The new data must exactly conform to the model signature of the trained model in two ways:

  • All the features must be present in every logical instance (say, a row in a table) of the new data.
  • The new data must be pre-processed, and each of the features engineered, in exactly the same way as the training data.
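The first requirement can be checked mechanically before scoring. The sketch below assumes a model signature expressed as a mapping from feature name to expected Python type; the feature names and the `validate_signature` helper are hypothetical and serve only as an illustration.

```python
# Hypothetical model signature: feature name -> expected type.
MODEL_SIGNATURE = {
    "mean_temp_24h": float,
    "max_vibration_24h": float,
    "error_count_24h": int,
}


def validate_signature(record, signature):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, expected_type in signature.items():
        if name not in record or record[name] is None:
            problems.append(f"missing feature: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for feature: {name}")
    return problems


# One conforming row and one row that lacks an engineered feature.
good_row = {"mean_temp_24h": 71.5, "max_vibration_24h": 0.8, "error_count_24h": 2}
bad_row = {"mean_temp_24h": 71.5, "error_count_24h": 2}

good_problems = validate_signature(good_row, MODEL_SIGNATURE)
bad_problems = validate_signature(bad_row, MODEL_SIGNATURE)
```

A common way to meet the second requirement is to route both training data and new data through the same feature engineering code, so that the two paths cannot drift apart.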

The above process is stated in many ways in academic and industry literature. But all the following statements mean the same thing:

  • Score new data using the model
  • Apply the model to new data
  • Operationalize the model
  • Deploy the model
  • Run the model against new data

As stated earlier, model operationalization for PdM is different from that of its peers. Scenarios involving anomaly detection and failure detection typically implement online scoring (also called real-time scoring). Here, the model scores each incoming record and returns a prediction. For anomaly detection, the prediction is an indication that an anomaly occurred (example: one-class SVM). For failure detection, it would be the type or class of failure.

In contrast, PdM involves batch scoring. To conform to the model signature, the features in the new data must be engineered in the same manner as the training data. For the large datasets that are typical for new data, features are aggregated over time windows and scored in batch. Batch scoring is typically done in distributed systems like Spark or Azure Batch. There are a couple of alternatives, both suboptimal:

  • Streaming data engines support aggregation over windows in memory. So it could be argued that they support online scoring. But these systems are suitable for dense data in narrow windows of time, or sparse elements over wider windows. They may not scale well for the dense data over wider time windows, as seen in PdM scenarios.
  • If batch scoring is not available, the solution is to adapt online scoring to handle new data in small batches at a time.
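The pattern of aggregating features over a time window and then scoring in small batches can be sketched in plain Python. Everything here is an illustrative assumption: the reading format (asset ID, timestamp, value), the 24-hour window, the `toy_model` threshold rule that stands in for a trained model, and the helper names. A production system would more likely run the same logic on a distributed engine such as Spark or Azure Batch.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def aggregate_window(readings, window_end, window):
    """Aggregate raw readings per asset over (window_end - window, window_end]."""
    window_start = window_end - window
    grouped = defaultdict(list)
    for asset_id, timestamp, value in readings:
        if window_start < timestamp <= window_end:
            grouped[asset_id].append(value)
    return {
        asset_id: {"mean": sum(values) / len(values), "max": max(values), "count": len(values)}
        for asset_id, values in grouped.items()
    }


def score_in_batches(features, model, batch_size):
    """Score aggregated feature rows a few assets at a time."""
    asset_ids = sorted(features)
    predictions = {}
    for start in range(0, len(asset_ids), batch_size):
        batch = asset_ids[start:start + batch_size]
        for asset_id in batch:
            predictions[asset_id] = model(features[asset_id])
    return predictions


def toy_model(row):
    # Hypothetical stand-in for a trained model: flag a high mean reading.
    return 1 if row["mean"] > 80.0 else 0


now = datetime(2024, 1, 2, 0, 0)
readings = [
    ("pump-1", now - timedelta(hours=1), 90.0),
    ("pump-1", now - timedelta(hours=5), 86.0),
    ("pump-1", now - timedelta(hours=30), 10.0),  # outside the 24-hour window
    ("pump-2", now - timedelta(hours=2), 60.0),
    ("pump-2", now - timedelta(hours=3), 70.0),
    ("pump-3", now - timedelta(hours=4), 75.0),
]

features = aggregate_window(readings, now, timedelta(hours=24))
predictions = score_in_batches(features, toy_model, batch_size=2)
```

The aggregation step must mirror the feature engineering used during training; only the scoring loop changes when moving between a true batch engine and the small-batch fallback.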

Solution templates for predictive maintenance

The final section of this guide provides a list of PdM solution templates, tutorials, and experiments implemented in Azure. In some cases, these PdM applications can be deployed into an Azure subscription within minutes. They can be used as proof-of-concept demos, sandboxes to experiment with alternatives, or accelerators for actual production implementations. These templates are located in the Azure AI Gallery or Azure GitHub. These different samples will be rolled into this solution template over time.

# | Title | Description
2 | Azure Predictive Maintenance Solution Template | An open-source solution template that demonstrates Azure ML modeling and a complete Azure infrastructure capable of supporting predictive maintenance scenarios in the context of IoT remote monitoring.
3 | Deep Learning for Predictive Maintenance | Azure Notebook with a demo solution that uses LSTM (Long Short-Term Memory) networks (a class of recurrent neural networks) for predictive maintenance, with a blog post on this sample.
4 | Azure Predictive Maintenance for Aerospace | One of the first PdM solution templates based on Azure ML v1.0 for aircraft maintenance. This guide originated from this project.
5 | Azure AI Toolkit for IoT Edge | AI in the IoT Edge using TensorFlow; the toolkit packages deep learning models in Azure IoT Edge-compatible Docker containers and exposes those models as REST APIs.
6 | Azure IoT Predictive Maintenance | Azure IoT Suite PCS (Preconfigured Solution). Aircraft maintenance PdM template with IoT Suite. Another document and walkthrough related to the same project.
7 | Predictive Maintenance template using SQL Server R Services | Demo of a remaining useful life scenario based on R Services.
8 | Predictive Maintenance Modeling Guide | Aircraft maintenance dataset feature engineered using R, with experiments and datasets, plus an Azure notebook and experiments in AzureML v1.0.

Training resources for predictive maintenance

Microsoft Azure offers learning paths for the foundational concepts behind PdM techniques, in addition to content and training on general AI concepts and practice.

Training resource | Availability
Learning Path for PdM using Trees and Random Forest | Public
Learning Path for PdM using Deep Learning | Public
AI Developer on Azure | Public
Microsoft AI School | Public
Azure AI Learning from GitHub | Public
LinkedIn Learning | Public
Microsoft AI YouTube Webinars | Public
Microsoft AI Show | Public
LearnAI@MS | Partners
Microsoft Partner Network | Partners

In addition, free MOOCs (massive open online courses) on AI are offered online by academic institutions like Stanford and MIT, and by other educational companies.