
by Nezar Assawiel

How to strategically accomplish your machine learning model’s performance goals

Introduction

Machine learning (ML) development is an iterative process. You have an idea to solve the problem at hand, you build the idea and examine the results. You get another idea to improve the results and so on until you reach the performance goal that deems your model ready for deployment into production — ready to use by end users.

However, there are often many ideas and possibilities you can try to improve the performance and get closer to your goal. For example, you can collect more data, train longer, or try bigger or smaller networks.

Going in the wrong direction during this experimentation process can be costly, especially for large projects. No one wants to spend two months collecting more data to discover later on that the performance gain was negligible and it was not worth a day let alone two months!

Strategically setting and working towards the performance goal of your ML model is vital in speeding up the experimentation process and achieving that goal. In this post, I present some tips that will hopefully help you in this regard.

Expected prior knowledge

This post assumes knowing, at least, the basics of building a ML model. This discussion is not meant to illustrate what a ML model is or how to build one. Rather, the content is on how to strategically improve a ML model during the development process. Specifically, the following concepts and terminologies should be familiar to you:

  • train, dev (development), and test sets: the dev set is also called the validation or hold-out set.
  • evaluation (or performance) metrics: the measures used to indicate how “good” a ML model is at doing its job, such as accuracy, precision, and recall.
  • bias (underfitting) and variance (overfitting) errors.

The importance of orthogonalization

As you know, the sequential steps in developing a ML model are as follows:

  1. Fit the training set well. For example, try a bigger network, try another cost-function optimization method, or try training longer.
  2. Fit the dev set well. For example, try regularization or try collecting more training data.
  3. Fit the test set well. For example, try a bigger dev set.
  4. Perform well in production. If the model falls short here, either the dev set or the cost function of the model needs to change.

During this process, ideally you would like the modifications you try — “model controls” — to be independent. Why?

Consider, for example, the steering wheel of a car. When you steer the wheel left, the car moves left. When you steer it right, the car moves right. What if steering the wheel left moved the car left and increased the speed of the car? It would become much more difficult to control the car, right? Why? Because steering the wheel left is not an independent control of the car any more. It is coupled with another control, the speed control. It is always easier when the controls are independent.

In ML development, early stopping, for example, is a form of regularization used to improve performance on the dev set by halting training before the training set is fully fit. So, early stopping is a control that is not independent from another control, namely how long you train.

For a faster iterative development process, you want your controls to be independent, that is, orthogonalization. In other words, consider avoiding dependent controls like early stopping as much as you can.

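For reference, here is a minimal sketch of what that coupling looks like in code, assuming TensorFlow/Keras (an assumption for illustration; the choice of library is not part of the argument). Note how a single knob ties training duration to dev-set performance:

```python
import tensorflow as tf

# Early stopping halts training when dev-set loss stops improving, so the
# "how long do I train?" knob is no longer independent of dev-set fit.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the dev (validation) loss
    patience=3,                  # stop after 3 epochs without improvement
    restore_best_weights=True,   # roll back to the best dev-set weights
)

# Hypothetical usage with an already-built model and train/dev arrays:
# model.fit(x_train, y_train,
#           validation_data=(x_dev, y_dev),
#           epochs=100,
#           callbacks=[early_stop])
```
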
Strategies

With the previous introduction in mind, here are some tips to set and strategically improve your model’s performance:

(a) Combine multiple evaluation metrics into one

It is likely you have several evaluation metrics to evaluate the performance of your ML model. For example, you might have recall and precision to evaluate a classifier. Recall and precision are competing metrics — typically when one increases, the other decreases. So, how do you choose the best classifier from Table 1 below, for example?

It is a good idea in this case to combine precision and recall into one metric. The F1 score [F1 = (2 × precision × recall) / (precision + recall)] will do, as you might have realized already. As such, Classifier A from Table 1 will have the best F1 score.

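As a concrete sketch, here is how such a combination picks a classifier in Python. The precision and recall numbers are made up for illustration; they are not the values from Table 1, which is not reproduced here:

```python
# Rank classifiers by F1, the harmonic mean of precision and recall.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# Hypothetical (precision, recall) pairs.
classifiers = {"A": (0.95, 0.90), "B": (0.98, 0.85), "C": (0.99, 0.80)}

for name, (p, r) in classifiers.items():
    print(f"Classifier {name}: F1 = {f1(p, r):.3f}")

best = max(classifiers, key=lambda name: f1(*classifiers[name]))
print("Best by F1:", best)  # "A" with these numbers
```
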
Obviously, this process is problem-specific. Your application might require maximizing precision. In this case, Classifier C in Table 1 will be your best choice.

Optimizing and satisficing

You might want to follow the optimizing and satisficing approach. Meaning, you optimize one metric as long as the other metric(s) meet a certain minimum threshold.

Assume the same classifiers from Table 1, now with accuracy and run-time as the two metrics, as shown in Table 2 below. You may be mainly concerned with optimizing one metric — accuracy — as long as the other metric — run-time — meets a certain threshold. In this example, the run-time threshold is 50 ms or less. So, you are looking for the classifier with the highest accuracy whose run-time is 50 ms or less. Thus, from Table 2, you would choose Classifier B.

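A minimal Python sketch of this selection rule follows. The accuracy and run-time numbers are hypothetical stand-ins rather than the actual Table 2 values:

```python
# Optimizing metric: accuracy. Satisficing metric: run-time <= 50 ms.
classifiers = {
    "A": {"accuracy": 0.90, "run_time_ms": 90},
    "B": {"accuracy": 0.92, "run_time_ms": 45},
    "C": {"accuracy": 0.95, "run_time_ms": 80},
}

feasible = {n: m for n, m in classifiers.items() if m["run_time_ms"] <= 50}
best = max(feasible, key=lambda n: feasible[n]["accuracy"])
print("Chosen classifier:", best)  # "B": best accuracy among those <= 50 ms
```
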
However, if your aim is to maximize the performance across all metrics, you would need to combine them.

It is not always easy to combine all metrics into one. There may be many of them, and the relationships between them may not be clear. In such cases, you will need to be creative and careful in combining them! The time you invest in coming up with an all-in-one performance metric is worth it: it will not only speed up the development process, but will also produce a well-performing model in production.

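If you do need a single number, one possible ad-hoc combination (an illustrative assumption, not a standard formula) is to reward accuracy while penalizing run time:

```python
import math

# All-in-one score: higher accuracy is better; slower run time is penalized
# on a log scale. The 0.02 weight is arbitrary and would need tuning.
def combined_score(accuracy: float, run_time_ms: float,
                   penalty: float = 0.02) -> float:
    return accuracy - penalty * math.log(run_time_ms)

print(round(combined_score(0.92, 45), 4))   # 0.8439
print(round(combined_score(0.95, 80), 4))   # 0.8624
```
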
(b) Set the train, dev, and test sets correctly

Choose the right size for the train/dev/test split

You have likely seen the 60%, 20%, 20% split for the train, dev, and test sets, respectively. This works well for small data sets — say 10k data points or fewer. However, when working with large data sets, especially with deep learning, a 98%, 1%, 1% split or similar might be more appropriate. If you have 2 million data points in your dataset, a 1% split is 20k data points, which is enough for the dev and test sets.

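A minimal sketch of such a 98%, 1%, 1% split, assuming scikit-learn is available (any equivalent splitting utility works):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for a 2-million-point dataset.
X = np.random.rand(2_000_000, 10)
y = np.random.randint(0, 2, size=2_000_000)

# First carve off 2% for dev + test, then split that 2% half-and-half.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.02, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_dev), len(X_test))  # 1960000 20000 20000
```
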
In general, you want your dev set to be large enough to capture the changes you make to your model during your experimentation process. You want your test set to be large enough to give you high confidence in the performance of your model.

Make sure the dev and test sets come from the same distribution

While this may seem trivial, I have seen experienced developers forget this important point. Say you have experimented with and iteratively improved a model that predicts default on car loans based on zip code. Don't expect your model to work correctly on a test set from zip-code areas with low average income if, for example, the dev set comes from zip-code areas with high average income. These are two different distributions!

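A minimal sketch of the usual remedy: pool the data from all sources, shuffle, and only then split, so that the dev and test sets are drawn from the same distribution. The arrays below are hypothetical stand-ins for loan records from two zip-code groups:

```python
import numpy as np

rng = np.random.default_rng(0)
high_income = rng.normal(1.0, 0.2, size=(10_000, 5))  # hypothetical records
low_income = rng.normal(0.4, 0.2, size=(10_000, 5))   # hypothetical records

pooled = np.concatenate([high_income, low_income])
rng.shuffle(pooled)                                   # mix the two sources
dev, test = np.split(pooled, [len(pooled) // 2])      # same distribution each
```
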
Make sure the dev and test sets reflect the data your model will encounter in production

For example, if you are doing face recognition, the resolution of your dev/test images should reflect the resolution of the images in production. While this might be a trivial example, you should compare all aspects of the data your application will encounter in production against your train/dev/test data. If your model is doing well on your metric and dev/test sets but not so well in production, you have the wrong metric and/or the wrong dev/test sets!

(c) Identify and tackle the “correct” error first

Bayes’ error and human error

Bayes’ error is the lowest error theoretically achievable by a model — in other words, the irreducible error. Consider, for example, a dog classifier that predicts whether the image at hand is of a dog or another animal. There might be some images so blurry that they are impossible to classify, whether by humans or by the most sophisticated system ever invented. This is Bayes’ error.

Human error, by definition, is larger — worse — than Bayes’ error. However, human error is usually very close to Bayes’ error, since we are really good at recognizing patterns. Thus, human error is usually used as a proxy for Bayes’ error.

Improving below human-level performance

When the performance of your ML model is below human-level performance, you can improve the performance by:

  1. getting more data labeled by humans
  2. analyzing the errors and incorporating the insights into the system. Why does the ML model get this or that wrong while humans get it right?
  3. improving the model itself. Look at whether it underfits — high bias error — or overfits — high variance error — and change the model accordingly.

Once human-level performance is surpassed, improving the performance becomes a much slower and more difficult process as you would expect.

So, how do you define human error exactly? This is what is discussed next!

Defining human error and identifying the error to tackle first

Consider the dog classification problem from before — recognizing if an image is of a dog or not. After doing some research, you might find the human error as follows:

  • average person: 2% error
  • average zoologist: 1% error
  • expert zoologist: 0.6% error
  • a team of expert zoologists: 0.4% error

Now, consider the four cases in Table 3 below.

In Case A, your priority should be the underfitting problem — high bias — since the bias error (5% − 2% = 3%) is larger than the variance error (6% − 5% = 1%). For the human error reference, the largest among the human errors that are smaller than the training error is used first. Thus, your human error reference in this case is that of the average person — 2% — since it is the largest of the human errors smaller than the training error (in this case, all of them are).

In Case B, you might need to improve the variance error first, 9% − 5% = 4%, since it is larger than the bias error of 5% − 2% = 3%.

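The arithmetic behind Cases A and B can be captured in a tiny helper, offered here as a toy illustration of the decision rule rather than a formal diagnostic:

```python
# Decide which error to tackle first, given human, training, and dev errors.
def next_focus(human_err: float, train_err: float, dev_err: float) -> str:
    bias = train_err - human_err      # avoidable bias (underfitting)
    variance = dev_err - train_err    # variance (overfitting)
    return "bias" if bias >= variance else "variance"

print(next_focus(0.02, 0.05, 0.06))  # Case A -> bias
print(next_focus(0.02, 0.05, 0.09))  # Case B -> variance
```
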
In Case C, you have surpassed the performance of the average person and matched that of the average zoologist. So, your new human error reference should be that of the expert zoologist — 0.6% — or even the team of expert zoologists — 0.4%. The variance error in this case is 0.2%, while the bias error is between 0.4% and 0.6%. So, you should work on the bias error first — you need to fit the training data better.

Surpassing human performance

In Case D, you see that the training error is 0.2% while the best human error is 0.4%. Does this mean your model has surpassed human performance, or that it overfits by 0.2%? It is no longer clear whether to focus on the bias error or the variance error. Also, if your model has indeed surpassed human performance and you are still looking to improve it, it becomes unclear which strategy to follow from a human-intuition perspective.

There are many ML models nowadays that surpass human performance, such as product recommendation and online-ad targeting systems. These “models that surpass human performance” tend to be non-natural perception systems, that is, not computer vision, speech recognition or natural language processing systems. The reason for this is that we humans are really good at natural perception tasks.

With big data and deep learning, however, there are natural perception systems that surpass human performance and they are getting better and better. But these are much harder problems than non-natural perception problems.

Originally published on March 24, 2018. Edited: Oct 4, 2018.
