How to Make Better Predictions Using Amazon Forecast
From resource planning and inventory control to financial management and budgeting, forecasting is used widely across different industries. Businesses invest heavily in the hope of ascertaining future trends. The need for accurate and simple forecasting tools and techniques is apparent in modern industries.
Introduction
What is a time series? A series of numbers representing some quantity that changes with time. For example, the revenue from the sale of ginger.
What is meant by time series forecasting? The science of making predictions based on past and present time series data. For example: what will the demand be for the month of November (i.e. on 2019-11-30)?
Amazon Forecast
A fully managed service that uses modern machine learning methodologies to deliver accurate time series forecasts. This means that there are no servers to provision, and no machine learning models to build, train, or deploy. One need not know machine learning at all, as everything happens under the hood.
Fun fact: the e-commerce giant uses the same technology for its own forecasting requirements!
Have you ever used Amazon Forecast to meet your business needs? Have you ever wanted to improve forecasting accuracy? This blog covers several strategies that you can use to obtain better forecasts. Here is the outline:
Prerequisites: basics of Amazon Forecast and what it does to improvise.
Data Wrangling: applying transformations, normalization, and other operations to time series data.
Using Amazon Forecast: overview of the procedure.
Obtaining and Postprocessing Forecasts: using Python to obtain forecasts, plus tips for postprocessing.
Evaluating Forecasts: using suitable metrics for validation.
Limitations of Amazon Forecast
Conclusion
Prerequisites
Dataset group: Target, Metadata, and Related time series datasets
Previously we saw what a time series looks like. Imagine that you run a provision store and you want to forecast the sales of all the categories of products you sell. Amazon Forecast allows you to feed in all these time series at once. See the table below for reference.
Target dataset
Note that you can include up to 10 categorical features besides those specifying name/id, time, and value. This dataset is called the “target dataset”. Moreover, you can include two more datasets: the “Metadata dataset” and the “Related time series dataset”. More details can be found in the Amazon Forecast documentation. Together the three datasets make up a “Dataset group”.
Metadata and Related time series datasets respectively
Algorithms
Amazon Forecast can run 6 different algorithms to make forecasts. Not all of them use all the data provided. However, all of them need the three features specifying name/id, time, and value. The algorithms are:
1. DeepAR+
2. CNN-QR (Convolutional Neural Network - Quantile Regression)
3. Prophet
4. ARIMA (Auto-Regressive Integrated Moving Average)
5. NPTS (Non-Parametric Time Series)
6. ETS (Exponential Smoothing)
We can either run all of them using the AutoML feature or specify one algorithm at a time manually.
Domain
A dataset domain defines a forecasting use case. Amazon Forecast supports the following dataset domains:
· RETAIL: e.g. retail demand
· INVENTORY_PLANNING: e.g. supply chain and inventory planning
· EC2_CAPACITY: e.g. Amazon Elastic Compute Cloud (Amazon EC2) capacity
· WORK_FORCE: e.g. workforce/attrition
· WEB_TRAFFIC: e.g. web traffic
· METRICS: e.g. revenue and cash flow
· CUSTOM: all other use cases
The names of the required features in the dataset change according to the domain chosen. For example, to forecast demand we may select RETAIL, whose required feature names are item_id, timestamp, and demand; the CUSTOM domain uses item_id, timestamp, and target_value instead. More details can be found in the Amazon Forecast documentation.
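To make the idea of required feature names concrete, here is a minimal sketch of a target dataset schema. It assumes the CUSTOM domain with the fields item_id, timestamp, and target_value; the dataset name and the daily frequency are hypothetical choices for illustration, not values taken from this post.

```python
# Sketch only: builds the arguments for a target time series dataset.
# Assumptions: CUSTOM domain, daily data, and a made-up dataset name.

def build_target_schema(id_field="item_id", time_field="timestamp",
                        value_field="target_value"):
    """Return a dict shaped like the Schema argument of CreateDataset."""
    return {
        "Attributes": [
            {"AttributeName": id_field, "AttributeType": "string"},
            {"AttributeName": time_field, "AttributeType": "timestamp"},
            {"AttributeName": value_field, "AttributeType": "float"},
        ]
    }

schema = build_target_schema()

create_dataset_args = {
    "DatasetName": "grocery_target",      # hypothetical name
    "Domain": "CUSTOM",
    "DatasetType": "TARGET_TIME_SERIES",
    "DataFrequency": "D",                 # assumption: one row per item per day
    "Schema": schema,
}

# With AWS credentials configured, these arguments could be passed on as
# boto3.client("forecast").create_dataset(**create_dataset_args)
print([a["AttributeName"] for a in schema["Attributes"]])
```

The attribute order is meant to mirror the column order of the CSV file you upload, so it is worth keeping the two in sync.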
Data Wrangling
Amazon Forecast was developed to deliver accurate forecasts. The use of additional datasets, AutoML, and an appropriate domain also helps. However, there are many techniques to further improve forecasting accuracy. Fortunately, most of these are available in the scikit-learn or SciPy libraries.
The dataset used for demonstration is available on Kaggle. It comes from a retail grocery store and covers the purchases made by customers over a 3 year period (2016-19). You will find features such as UNITS, PRICESELL, NAME, and PAYMENT for each customer, along with the time of the transaction. Here we are concerned with forecasting a feature that was engineered by multiplying “UNITS” and “PRICESELL”. We intend to forecast its values for the categories under “NAME” (e.g. Ginger).
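As a rough sketch of that feature engineering step, the snippet below multiplies UNITS by PRICESELL and aggregates the result per category and per day. The rows are synthetic, DATE is a hypothetical name for the transaction-time column, and the daily aggregation is an assumption; the output columns follow the item_id, timestamp, target_value naming.

```python
import pandas as pd

# Synthetic stand-in for the transactions table (illustrative values only).
# DATE is a hypothetical name for the transaction-time column.
transactions = pd.DataFrame({
    "DATE": pd.to_datetime([
        "2019-01-01 09:15", "2019-01-01 17:40",
        "2019-01-02 11:05", "2019-01-02 12:30",
    ]),
    "NAME": ["Ginger", "Ginger", "Ginger", "Garlic"],
    "UNITS": [2.0, 1.0, 3.0, 4.0],
    "PRICESELL": [10.0, 10.0, 12.0, 5.0],
})

# Engineered feature: revenue per transaction.
transactions["REVENUE"] = transactions["UNITS"] * transactions["PRICESELL"]

# Truncate each transaction time to its day, then sum revenue per category per day.
daily = (
    transactions
    .assign(timestamp=transactions["DATE"].dt.floor("D"))
    .groupby(["NAME", "timestamp"], as_index=False)["REVENUE"]
    .sum()
    .rename(columns={"NAME": "item_id", "REVENUE": "target_value"})
)
daily = daily[["item_id", "timestamp", "target_value"]]
print(daily)
```

Each category under NAME becomes one time series, identified by its item_id.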
1. Transformations
These are mathematical functions applied to time series data in order to make patterns apparent for an algorithm to learn.
Box-Cox Transform: This technique attempts to give the data an approximately normal distribution and to stabilize its variance. The numbers, however, must be positive. The optimal parameter for stabilizing variance and minimizing skewness is estimated by maximum likelihood. Often, a small number such as 0.00001 is added to the data first, so that zeros do not break the natural logarithm (the case where the parameter is 0). This is done to ensure numerical stability.
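A minimal sketch of the Box-Cox step with SciPy might look like the following. The series is synthetic, and the offset of 0.00001 mirrors the small number mentioned above; the inverse is included because forecasts made on transformed data have to be mapped back to the original units.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Synthetic, right-skewed series with a few zeros (illustrative only).
rng = np.random.default_rng(0)
sales = rng.lognormal(mean=3.0, sigma=0.8, size=200)
sales[:5] = 0.0

eps = 1e-5  # small offset so that every value is strictly positive

# With no lambda given, boxcox returns the transformed data and the
# maximum-likelihood estimate of lambda.
transformed, lam = stats.boxcox(sales + eps)

# Undo the transform and the offset to get back to the original scale.
restored = inv_boxcox(transformed, lam) - eps

print(f"estimated lambda: {lam:.3f}")
```

Keep the estimated lambda around; the same value is needed later to invert the forecasts.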
Box-Cox and Yeo-Johnson Transforms respectively
Yeo-Johnson Transform: Similar to the Box-Cox transform but it allows negative numbers.
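For data that can dip below zero, a sketch using scikit-learn's PowerTransformer could look like this. The series is synthetic (think of net revenue after refunds), and standardize=False is a choice made here to show the raw transform without the extra rescaling.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic series that contains negative values (illustrative only).
rng = np.random.default_rng(1)
values = rng.gamma(shape=2.0, scale=50.0, size=300) - 40.0

# scikit-learn expects a 2-D array: one column per series.
pt = PowerTransformer(method="yeo-johnson", standardize=False)
transformed = pt.fit_transform(values.reshape(-1, 1)).ravel()

# The fitted transformer can map values back to the original scale.
restored = pt.inverse_transform(transformed.reshape(-1, 1)).ravel()

print("estimated lambda:", pt.lambdas_[0])
```

Unlike the Box-Cox case, no offset is needed here, since zero and negative inputs are handled directly.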
Quantile Transform: Generally used to give data a uniform or normal distribution by matching the quantiles of the data with those of the target distribution.
Data before and after Quantile Transformation
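The quantile transform can be sketched with scikit-learn's QuantileTransformer. The data is synthetic, and the choices of 100 quantiles and a normal target distribution are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic, heavily skewed series (illustrative only).
rng = np.random.default_rng(2)
values = rng.exponential(scale=100.0, size=500)

# n_quantiles must not exceed the number of samples.
qt = QuantileTransformer(n_quantiles=100, output_distribution="normal",
                         random_state=0)
transformed = qt.fit_transform(values.reshape(-1, 1)).ravel()

# inverse_transform maps values back; it interpolates between the stored
# quantiles, so treat the result as approximate.
restored = qt.inverse_transform(transformed.reshape(-1, 1)).ravel()

print(f"mean {transformed.mean():.2f}, std {transformed.std():.2f}")
```

The transform preserves the rank order of the observations and only reshapes the distances between them.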
2. Scaling, Normalization, and Standardization
Scaling refers to increasing or decreasing the relative size of time series values. For example, scaling down by 100 means dividing by 100. This is especially useful when the numbers are large but have a small variance.
Normalization techniques ensure that the data falls in a certain range. They help reduce numerical complexity and result in shorter algorithm run times. Min-Max scaling is the most popular. It ensures that the numbers fall in [0, 1]. For applications that demand strictly positive values, it may be useful to substitute 0 with an extremely small positive number like 0.00001.
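Here is a small sketch of Min-Max normalization with scikit-learn, including the substitution of 0 with a tiny positive number. The input numbers are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

values = np.array([120.0, 135.0, 150.0, 90.0, 200.0])  # illustrative numbers

# Map the series to the range [0, 1].
scaler = MinMaxScaler(feature_range=(0.0, 1.0))
scaled = scaler.fit_transform(values.reshape(-1, 1)).ravel()

# Replace the zero produced by the minimum with a tiny positive number,
# for downstream steps that need strictly positive input.
eps = 1e-5
scaled_positive = np.maximum(scaled, eps)

# The fitted scaler maps values back to the original units.
restored = scaler.inverse_transform(scaled.reshape(-1, 1)).ravel()

print(scaled_positive)
```

Note that inverting scaled_positive instead of scaled would shift the minimum very slightly, which is usually negligible.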
Min-Max Normalization and Standardization respectively
The well-known technique of z-score standardization ensures that the data has a mean of 0 and a variance of 1. This is achieved by subtracting the mean from each observation and dividing the result by the standard deviation. It is useful when the statistics of a process can be assumed to be fixed. It also helps reduce numerical complexity and results in shorter algorithm run times.
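A short sketch of z-score standardization with scikit-learn, checked against the manual formula; the input numbers are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([120.0, 135.0, 150.0, 90.0, 200.0])  # illustrative numbers

scaler = StandardScaler()
standardized = scaler.fit_transform(values.reshape(-1, 1)).ravel()

# Same computation by hand; np.std defaults to the population standard
# deviation (ddof=0), which is what StandardScaler uses as well.
manual = (values - values.mean()) / values.std()

# Map standardized values (or forecasts) back to the original units.
restored = scaler.inverse_transform(standardized.reshape(-1, 1)).ravel()

print(standardized.round(3))
```

The mean and standard deviation learned from the training series are the "fixed statistics" that get reused when inverting the forecasts.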