Skywalking Scheduled Cleanup of Stored Data Not Working: Root Cause Analysis and Solution
2021-04-26
Symptom: Skywalking Scheduled Data Cleanup Not Working
The Skywalking backend OAP server was deployed in the production environment with Elasticsearch 6 as storage. The symptom: data as old as two weeks stayed in ES and was never cleaned up according to the expected retention of 3 days and 7 days set by the default recordDataTTL and metricsDataTTL values in the configuration file.
- recordDataTTL: The lifecycle of record data. Record data includes traces, top n sampled records, and logs. Unit is day. Minimal value is 2. (SW_CORE_RECORD_DATA_TTL, default 3)
- metricsDataTTL: The lifecycle of metrics data, including the metadata. Unit is day. Recommend metricsDataTTL >= recordDataTTL. Minimal value is 2. (SW_CORE_METRICS_DATA_TTL, default 7)
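For illustration, a minimal sketch of overriding these two retention periods on the OAP deployment through the documented environment keys; the env-list placement is an assumption about a typical Kubernetes deployment, not taken from the environment described here:

env:
  - name: SW_CORE_RECORD_DATA_TTL      # record data (traces, top n records, logs), in days
    value: "3"
  - name: SW_CORE_METRICS_DATA_TTL     # metrics data, in days
    value: "7"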
Checking the cleanup-related settings in the configuration file: by default the data cleanup DataKeeperExecutor should be enabled and should run every 5 minutes.
- enableDataKeeperExecutor: Controller of TTL scheduler. Once disabled, TTL wouldn't work. (SW_CORE_ENABLE_DATA_KEEPER_EXECUTOR, default true)
- dataKeeperExecutePeriod: The execution period of TTL scheduler, unit is minute. Execution doesn't mean deleting data. The storage provider could override this, such as ElasticSearch storage. (SW_CORE_DATA_KEEPER_EXECUTE_PERIOD, default 5)
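Taken together, these four settings live in the core section of application.yml; the sketch below is paraphrased from the documented defaults above and may differ slightly from the shipped 8.4.0 file:

core:
  selector: ${SW_CORE:default}
  default:
    # retention in days for record data and metrics data
    recordDataTTL: ${SW_CORE_RECORD_DATA_TTL:3}
    metricsDataTTL: ${SW_CORE_METRICS_DATA_TTL:7}
    # TTL scheduler: enabled by default, executed every 5 minutes
    enableDataKeeperExecutor: ${SW_CORE_ENABLE_DATA_KEEPER_EXECUTOR:true}
    dataKeeperExecutePeriod: ${SW_CORE_DATA_KEEPER_EXECUTE_PERIOD:5}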
Following the Skywalking OAP server pod log with kubectl logs -f, a record like this appears every five minutes:
2021-04-26 06:05:51,082 - org.apache.skywalking.oap.server.core.storage.ttl.DataTTLKeeperTimer -325937 [pool-10-thread-1] INFO [] - The selected first getAddress is 100.67.187.229_11800. Skip.
The DataTTLKeeperTimer class seen here should be the one doing the scheduled data cleanup, so the next step is to look at the Skywalking 8.4.0 source code to find the cause.
Skywalking 8.4.0 Data Cleanup Flow in the Source Code
In the Skywalking 8.4.0 source of DataTTLKeeperTimer, the delete method that produces this log message can be found. Its comment states that DataTTLKeeperTimer starts on every OAP node, but the deletion is only performed when the current OAP node is the first one in the node list; otherwise the run is skipped.
/**
 * DataTTLKeeperTimer starts in every OAP node, but the deletion only work when it is as the first node in the OAP
 * node list from {@link ClusterNodesQuery}.
 */
private void delete() {
    List<RemoteInstance> remoteInstances = clusterNodesQuery.queryRemoteNodes();
    if (CollectionUtils.isNotEmpty(remoteInstances) && !remoteInstances.get(0).getAddress().isSelf()) {
        log.info("The selected first getAddress is {}. Skip.", remoteInstances.get(0).toString());
        return;
    }

    log.info("Beginning to remove expired metrics from the storage.");
    IModelManager modelGetter = moduleManager.find(CoreModule.NAME).provider().getService(IModelManager.class);
    List<Model> models = modelGetter.allModels();
    models.forEach(this::execute);
}
This matches our test environment: there is only one OAP node, yet the log still shows the skip message. The only explanation is that remoteInstances.get(0).getAddress().isSelf() evaluates to false. Tracing where isSelf() gets its value, it is set while the remoteInstances list is built in queryRemoteNodes() of KubernetesCoordinator, the implementation of the ClusterNodesQuery interface used in Kubernetes cluster mode: each pod's metadata uid (pod.getMetadata().getUid()) is compared with the uid captured when the KubernetesCoordinator instance was created.
public List<RemoteInstance> queryRemoteNodes() {
    try {
        initHealthChecker();
        List<V1Pod> pods = NamespacedPodListInformer.INFORMER.listPods().orElseGet(this::selfPod);
        if (log.isDebugEnabled()) {
            List<String> uidList = pods
                .stream()
                .map(item -> item.getMetadata().getUid())
                .collect(Collectors.toList());
            log.debug("[kubernetes cluster pods uid list]:{}", uidList.toString());
        }
        if (port == -1) {
            port = manager.find(CoreModule.NAME).provider().getService(ConfigService.class).getGRPCPort();
        }
        List<RemoteInstance> remoteInstances =
            pods.stream()
                .filter(pod -> StringUtil.isNotBlank(pod.getStatus().getPodIP()))
                .map(pod -> new RemoteInstance(
                    new Address(pod.getStatus().getPodIP(), port, pod.getMetadata().getUid().equals(uid))))
                .collect(Collectors.toList());
        healthChecker.health();
        return remoteInstances;
    } catch (Throwable e) {
        healthChecker.unHealth(e);
        throw new ServiceQueryException(e.getMessage());
    }
}
To see exactly why pod.getMetadata().getUid() and the uid captured at KubernetesCoordinator construction time differ, the source code was modified to print both the cluster pods uid list and the KubernetesCoordinator uid:
@Override
public List<RemoteInstance> queryRemoteNodes() {
    try {
        initHealthChecker();
        List<V1Pod> pods = NamespacedPodListInformer.INFORMER.listPods().orElseGet(this::selfPod);
        // if (log.isDebugEnabled()) {
        List<String> uidList = pods
            .stream()
            .map(item -> item.getMetadata().getUid())
            .collect(Collectors.toList());
        log.info("[kubernetes cluster pods uid list]:{}", uidList.toString());
        log.info("[KubernetesCoordinator uid: ]:{}", uid);
        // }
        if (port == -1) {
            port = manager.find(CoreModule.NAME).provider().getService(ConfigService.class).getGRPCPort();
        }
        List<RemoteInstance> remoteInstances =
            pods.stream()
                .filter(pod -> StringUtil.isNotBlank(pod.getStatus().getPodIP()))
                .map(pod -> new RemoteInstance(
                    new Address(pod.getStatus().getPodIP(), port, pod.getMetadata().getUid().equals(uid))))
                .collect(Collectors.toList());
        healthChecker.health();
        return remoteInstances;
    } catch (Throwable e) {
        healthChecker.unHealth(e);
        throw new ServiceQueryException(e.getMessage());
    }
}
With the source modified, a custom Skywalking OAP docker image was built manually using the Makefile that ships with the source tree.
Building a Custom Skywalking OAP Image
The image is built by running make docker with the Makefile included in the project; once the build prerequisites are in place, run make docker SKIP_TEST=true to skip the tests while packaging the image.
After the make command finishes, the generated image can be seen with docker images:
$ docker images
REPOSITORY                     TAG         IMAGE ID       CREATED          SIZE
weiwei11/oap                   latest      607ca94fd742   11 minutes ago   537MB
apache/skywalking-ui           8.4.0       5f4d7292cd19   2 months ago
apache/skywalking-oap-server   8.4.0-es6   35183ada1fbf   2 months ago
elasticsearch                  6.5.1       32f93c89076d   2 years ago      773MB
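The next step is to point the test deployment at this image. A rough sketch of the relevant fragment of the deployment spec follows; the container name is a placeholder, and only the image reference comes from the build output above:

# illustrative fragment of the skyoap-test Deployment
spec:
  template:
    spec:
      containers:
        - name: oap                    # placeholder container name
          image: weiwei11/oap:latest   # the custom image, after tagging and pushing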
After tagging the custom image and pushing it to the registry, the yaml of the skyoap-test deployment was modified to use it, as sketched above. The pod log then contains entries like these:
2021-04-25 09:29:54,270 - org.apache.skywalking.oap.server.cluster.plugin.kubernetes.KubernetesCoordinator -16079 [poo
2021-04-25 09:29:54,271 - org.apache.skywalking.oap.server.cluster.plugin.kubernetes.KubernetesCoordinator -16080 [pool-3-thread-1] INFO [] - [KubernetesC
From these (truncated) lines it can be seen that the uid was null when KubernetesCoordinator was created, so even with a single OAP node isSelf() comes out false and the data cleanup mechanism never runs.
Solution
Looking at the Kubernetes-related part of Skywalking's application.yml, there is a uidEnvName setting:
kubernetes:
  namespace: ${SW_CLUSTER_K8S_NAMESPACE:default}
  labelSelector: ${SW_CLUSTER_K8S_LABEL:app=collector,release=skywalking}
  uidEnvName: ${SW_CLUSTER_K8S_UID:SKYWALKING_COLLECTOR_UID}
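Note that this block only takes effect when the cluster module selector is set to kubernetes; a sketch of the surrounding cluster section, abridged from the 8.4.0 defaults and shown here only for context:

cluster:
  selector: ${SW_CLUSTER:standalone}   # must resolve to "kubernetes" for KubernetesCoordinator to be active
  kubernetes:
    namespace: ${SW_CLUSTER_K8S_NAMESPACE:default}
    labelSelector: ${SW_CLUSTER_K8S_LABEL:app=collector,release=skywalking}
    uidEnvName: ${SW_CLUSTER_K8S_UID:SKYWALKING_COLLECTOR_UID}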
And the uid field of KubernetesCoordinator is indeed read through this uidEnvName environment variable when the coordinator is constructed:
public KubernetesCoordinator(final ModuleDefineHolder manager,
                             final ClusterModuleKubernetesConfig config) {
    this.uid = new UidEnvSupplier(config.getUidEnvName()).get();
    this.manager = manager;
}
Searching online for SW_CLUSTER_K8S_UID and SKYWALKING_COLLECTOR_UID turned up the following way (the Kubernetes Downward API) to inject the pod's metadata.uid into the Skywalking OAP server:
- name: SKYWALKING_COLLECTOR_UID
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: metadata.uid
After adding this to the Skywalking OAP server deployment yaml, the scheduled data cleanup runs as expected.
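For reference, a minimal sketch of how this entry sits in the OAP container's env list alongside the cluster-mode selection; everything except SKYWALKING_COLLECTOR_UID and SW_CLUSTER is illustrative:

containers:
  - name: oap                           # placeholder container name
    image: weiwei11/oap:latest          # or the stock apache/skywalking-oap-server image
    env:
      - name: SW_CLUSTER                # select the kubernetes cluster coordinator
        value: kubernetes
      - name: SKYWALKING_COLLECTOR_UID  # consumed via uidEnvName in application.yml
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: metadata.uid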