Skywalking stored data TTL cleanup task not working: finding the cause and the fix (2021-04-26)


Symptom: the Skywalking data TTL cleanup never runs
In the production environment, after deploying the Skywalking back-end OAP server with Elasticsearch 6 as storage, data going back two full weeks stayed in ES. It was never cleaned up according to the retention periods of 3 and 7 days that the default recordDataTTL and metricsDataTTL settings in the configuration file are supposed to enforce.
recordDataTTL: The lifecycle of record data. Record data includes traces, top n sampled records, and logs. Unit is day. Minimal value is 2. (env: SW_CORE_RECORD_DATA_TTL, default 3)
metricsDataTTL: The lifecycle of metrics data, including the metadata. Unit is day. Recommend metricsDataTTL >= recordDataTTL. Minimal value is 2. (env: SW_CORE_METRICS_DATA_TTL, default 7)
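Since the OAP server here is deployed on Kubernetes, these two TTLs can also be set through the environment variables listed above rather than by editing the configuration file. A minimal sketch of the relevant Deployment fragment; the container name and values are illustrative:

# fragment of the OAP Deployment pod spec; container name and values are illustrative
containers:
  - name: oap
    image: apache/skywalking-oap-server:8.4.0-es6
    env:
      - name: SW_CORE_RECORD_DATA_TTL    # days to keep traces, top-n records, logs
        value: "3"
      - name: SW_CORE_METRICS_DATA_TTL   # days to keep metrics
        value: "7"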
The cleanup-related settings in the configuration file show that the DataKeeperExecutor should be enabled by default and run every 5 minutes:
enableDataKeeperExecutor: Controller of TTL scheduler. Once disabled, TTL wouldn't work. (env: SW_CORE_ENABLE_DATA_KEEPER_EXECUTOR, default true)
dataKeeperExecutePeriod: The execution period of TTL scheduler, unit is minute. Execution doesn't mean deleting data. The storage provider could override this, such as ElasticSearch storage. (env: SW_CORE_DATA_KEEPER_EXECUTE_PERIOD, default 5)
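For reference, these four settings sit together in the core section of the OAP server's application.yml. The excerpt below is trimmed to the relevant keys and assumes the stock 8.4.0 layout:

core:
  selector: ${SW_CORE:default}
  default:
    # TTL scheduler switch and execution period (minutes)
    enableDataKeeperExecutor: ${SW_CORE_ENABLE_DATA_KEEPER_EXECUTOR:true}
    dataKeeperExecutePeriod: ${SW_CORE_DATA_KEEPER_EXECUTE_PERIOD:5}
    # retention in days for record data (traces, top-n records, logs) and metrics
    recordDataTTL: ${SW_CORE_RECORD_DATA_TTL:3}
    metricsDataTTL: ${SW_CORE_METRICS_DATA_TTL:7}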
Tracking the skywalking OAP server pod log with kubectl logs -f shows a line like this every five minutes:

2021-04-26 06:05:51,082 - org.apache.skywalking.oap.server.core.storage.ttl.DataTTLKeeperTimer -325937 [pool-10-thread-1] INFO [] - The selected first getAddress is 100.67.187.229_11800. Skip.
DataTTLKeeperTimer is evidently the class that performs the scheduled data cleanup, so the next step is to look for the cause in the Skywalking 8.4.0 source code.
Skywalking 8.4.0 data cleanup flow in the source code
In the Skywalking 8.4.0 source, the delete() method in DataTTLKeeperTimer produces exactly that log line. Its comment explains that DataTTLKeeperTimer starts in every OAP node, but the deletion only runs when the current node is the first node in the OAP node list obtained from ClusterNodesQuery; otherwise it simply skips.
/**
 * DataTTLKeeperTimer starts in every OAP node, but the deletion only work when it is as the first node in the OAP
 * node list from {@link ClusterNodesQuery}.
 */
private void delete() {
    List<RemoteInstance> remoteInstances = clusterNodesQuery.queryRemoteNodes();
    if (CollectionUtils.isNotEmpty(remoteInstances) && !remoteInstances.get(0).getAddress().isSelf()) {
        log.info("The selected first getAddress is {}. Skip.", remoteInstances.get(0).toString());
        return;
    }

    log.info("Beginning to remove expired metrics from the storage.");
    IModelManager modelGetter = moduleManager.find(CoreModule.NAME).provider().getService(IModelManager.class);
    List<Model> models = modelGetter.allModels();
    models.forEach(this::execute);
}
That matches our test environment: there is only one OAP node, yet the log still shows the Skip message. Given the condition above, this can only happen when remoteInstances.get(0).getAddress().isSelf() returns false. isSelf() is assigned when the remoteInstances list is built in queryRemoteNodes() of KubernetesCoordinator, the Kubernetes implementation of the ClusterNodesQuery interface: each pod's metadata UID is compared against the uid captured when the KubernetesCoordinator instance was created.
public List<RemoteInstance> queryRemoteNodes() {
    try {
        initHealthChecker();
        List<V1Pod> pods = NamespacedPodListInformer.INFORMER.listPods().orElseGet(this::selfPod);
        if (log.isDebugEnabled()) {
            List<String> uidList = pods
                .stream()
                .map(item -> item.getMetadata().getUid())
                .collect(Collectors.toList());
            log.debug("[kubernetes cluster pods uid list]:{}", uidList.toString());
        }
        if (port == -1) {
            port = manager.find(CoreModule.NAME).provider().getService(ConfigService.class).getGRPCPort();
        }
        List<RemoteInstance> remoteInstances =
            pods.stream()
                .filter(pod -> StringUtil.isNotBlank(pod.getStatus().getPodIP()))
                .map(pod -> new RemoteInstance(
                    new Address(pod.getStatus().getPodIP(), port, pod.getMetadata().getUid().equals(uid))))
                .collect(Collectors.toList());
        healthChecker.health();
        return remoteInstances;
    } catch (Throwable e) {
        healthChecker.unHealth(e);
        throw new ServiceQueryException(e.getMessage());
    }
}
To see why item.getMetadata().getUid() and the uid captured at KubernetesCoordinator construction time differ, the source was modified to always log the cluster pods uid list and the KubernetesCoordinator uid:
@Override
public List<RemoteInstance> queryRemoteNodes() {
    try {
        initHealthChecker();
        List<V1Pod> pods = NamespacedPodListInformer.INFORMER.listPods().orElseGet(this::selfPod);
        // if (log.isDebugEnabled()) {
        List<String> uidList = pods
            .stream()
            .map(item -> item.getMetadata().getUid())
            .collect(Collectors.toList());
        log.info("[kubernetes cluster pods uid list]:{}", uidList.toString());
        log.info("[KubernetesCoordinator uid: ]:{}", uid);
        // }
        if (port == -1) {
            port = manager.find(CoreModule.NAME).provider().getService(ConfigService.class).getGRPCPort();
        }
        List<RemoteInstance> remoteInstances =
            pods.stream()
                .filter(pod -> StringUtil.isNotBlank(pod.getStatus().getPodIP()))
                .map(pod -> new RemoteInstance(
                    new Address(pod.getStatus().getPodIP(), port, pod.getMetadata().getUid().equals(uid))))
                .collect(Collectors.toList());
        healthChecker.health();
        return remoteInstances;
    } catch (Throwable e) {
        healthChecker.unHealth(e);
        throw new ServiceQueryException(e.getMessage());
    }
}
After this change, a custom skywalking OAP docker image was built manually with the makefile shipped in the source tree.
Building a custom Skywalking OAP image
The image can be packaged with the makefile that ships with the project by running make docker; to skip the tests during packaging, run make docker SKIP_TEST=true instead (a sketch of the full sequence is shown below).
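A minimal sketch of the build steps, assuming a fresh checkout of the v8.4.0 source (the UI and protocol definitions are git submodules) and a local Docker plus JDK/Maven toolchain:

# fetch the 8.4.0 source together with its submodules
git clone --recurse-submodules https://github.com/apache/skywalking.git
cd skywalking
git checkout v8.4.0
git submodule update --init --recursive

# build the docker image(s); SKIP_TEST=true skips the test phase during packaging
make docker SKIP_TEST=true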
Once the make command completes, docker images shows the generated image:
$ docker images
REPOSITORY                      TAG         IMAGE ID       CREATED          SIZE
weiwei11/oap                    latest      607ca94fd742   11 minutes ago   537MB
apache/skywalking-ui            8.4.0       5f4d7292cd19   2 months ago
apache/skywalking-oap-server    8.4.0-es6   35183ada1fbf   2 months ago
elasticsearch                   6.5.1       32f93c89076d   2 years ago      773MB
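The freshly built image then has to be tagged for a registry the cluster can pull from and pushed there; a sketch with an illustrative registry name and tag:

# registry and tag are illustrative
docker tag weiwei11/oap:latest registry.example.com/skywalking/oap:8.4.0-uid-debug
docker push registry.example.com/skywalking/oap:8.4.0-uid-debug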
After tagging and pushing the custom image (see the sketch above) and pointing the skyoap-test deployment yaml at it, the pod log contains lines like these:
2021-04-25 09:29:54,270 - org.apache.skywalking.oap.server.cluster.plugin.kubernetes.KubernetesCoordinator -16079 [poo...
2021-04-25 09:29:54,271 - org.apache.skywalking.oap.server.cluster.plugin.kubernetes.KubernetesCoordinator -16080 [pool-3-thread-1] INFO [] - [KubernetesC...
These show that the KubernetesCoordinator uid is null at construction time, so even with a single OAP node isSelf() evaluates to false and the data cleanup mechanism never fires.
Solution
The Kubernetes-related section of Skywalking's application.yml contains a uidEnvName setting:
kubernetes:
  namespace: ${SW_CLUSTER_K8S_NAMESPACE:default}
  labelSelector: ${SW_CLUSTER_K8S_LABEL:app=collector,release=skywalking}
  uidEnvName: ${SW_CLUSTER_K8S_UID:SKYWALKING_COLLECTOR_UID}
And the uid of KubernetesCoordinator is indeed read from this uidEnvName when the coordinator is constructed:
public KubernetesCoordinator(final ModuleDefineHolder manager,
                             final ClusterModuleKubernetesConfig config) {
    this.uid = new UidEnvSupplier(config.getUidEnvName()).get();
    this.manager = manager;
}
Searching for SW_CLUSTER_K8S_UID and SKYWALKING_COLLECTOR_UID turns up the following way to inject the pod's metadata.uid into the skywalking OAP server as an environment variable:
- name: SKYWALKING_COLLECTOR_UID
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: metadata.uid
After adding this to the skywalking OAP server deployment yaml, the data cleanup mechanism runs as expected.
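For context, a minimal sketch of where this env entry sits in the OAP server Deployment manifest; the Deployment/container names and image are illustrative and the manifest is trimmed to the relevant fields:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: skyoap-test
spec:
  template:
    spec:
      containers:
        - name: oap
          image: apache/skywalking-oap-server:8.4.0-es6
          env:
            # expose the pod's own UID so KubernetesCoordinator can recognize itself
            - name: SKYWALKING_COLLECTOR_UID
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.uid

Note that the pod labels must also match the labelSelector configured in the cluster.kubernetes section (app=collector,release=skywalking by default), otherwise the informer will not list the OAP pods at all.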
