Logstash: An example of importing a CSV file with Logstash
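In today's article, I will show how to use the file input together with the multiline codec to import a CSV file. I covered multiline in an earlier article, and I also have two other articles with examples of importing CSV with Logstash.
CSV files can also be parsed with Filebeat, which I cover in a separate article.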
Preparing the data
In today's exercise, we have the following test data:
multiline.csv
INV-12402400071,05/31/2018,2595,Hy-Vee Wine and Spirits / Denison,"1620 4th Ave, South",Denison,51442,"1620 4th Ave, South Denison 51442(42.01
S29195400002,11/21/2015,2205,Ding's Honk And Holler,900 E WASHINGTON,CLARINDA,51632,"900 E WASHINGTON
CLARINDA 51632
(40.739238, -95.02756)",73,Page,,,255,Wilson Daniels Ltd.,297,Templeton Rye w/Flask,6,750,18.09,27.14,12,325.68,9.00,2.38
S29198800001,11/20/2015,2191,Keokuk Spirits,1013 MAIN,KEOKUK,52632,"1013 MAIN
KEOKUK 52632
(40.39978, -91.387531)",56,Lee,,,255,Wilson Daniels Ltd.,297,Templeton Rye w/Flask,6,750,18.09,27.14,6,162.84,4.50,1.19
S29198800001,11/20/2015,2191,Keokuk Spirits,1013 MAIN,KEOKUK,52632,"1013 MAIN
KEOKUK 52632
(40.39978, -91.387531)",56,Lee,,,255,Wilson Daniels Ltd.,297,Templeton Rye w/Flask,6,750,18.09,27.14,6,162.84,4.50,1.19
INV-12402400071
S29195400002
S29198800001
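The two documents S29195400002 and S29198800001 each span three lines, which clearly differs from the first document. So how do we handle this? First, note that every document starts on a line beginning with INV- or S. In general, the Logstash architecture looks like this:
It has one Input, the events then pass through zero or more filters, and they are finally sent to an Output.
For our case, we can use the following architecture to process the data:
We can use the file input together with multiline, and then pass the data to filters such as csv, mutate, and grok for processing.
First, we create a configuration file called f: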
f
input {
  # Read the csv file. Also use the multiline codec: everything that does not start with S or INV- is part of the prior line, because the address contains line breaks
  file {
    start_position => "beginning"
    path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "^(S|INV-)[0-9][0-9]"
      negate => "true"
      what => "previous"
    }
  }
}
output {
  stdout {
    codec => rubydebug
  }
}
Above, we use the file input to read multiline.csv from the specified location. We use the following codec:
codec => multiline {
  pattern => "^(S|INV-)[0-9][0-9]"
  negate => "true"
  what => "previous"
}
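The pattern matches lines that start with S or INV-, immediately followed by two digits in the range 0-9. Setting negate to true means that lines which do not match are appended to the previous matching line, so that together they form a single document. If this is still unclear, please see the description in my earlier article on multiline.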
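To make the grouping rule concrete, here is a minimal Python sketch of the same logic. It is only an illustration of negate => true with what => previous, not how Logstash is implemented; the function name group_multiline and the shortened sample lines are assumptions made for this example.

```python
import re

# Same pattern as in the Logstash config: a line that starts a new record.
START = re.compile(r"^(S|INV-)[0-9][0-9]")

def group_multiline(lines):
    """Append every line that does NOT match START to the previous event
    (a simplified imitation of negate => true, what => previous)."""
    events = []
    for line in lines:
        if START.search(line) or not events:
            # A matching line (or the very first line) opens a new event.
            events.append(line)
        else:
            # A non-matching line belongs to the event before it.
            events[-1] += "\n" + line
    return events

# Shortened sample lines based on the test data (assumption: trailing fields dropped).
sample = [
    'S29198800001,11/20/2015,2191,Keokuk Spirits,1013 MAIN,KEOKUK,52632,"1013 MAIN',
    "KEOKUK 52632",
    '(40.39978, -91.387531)",56,Lee',
    "INV-12402400071,05/31/2018,2595",
]
events = group_multiline(sample)
print(len(events))
```

The three physical lines of the S29198800001 record end up in one event joined by newline characters, which is why \n shows up in the message field of the Logstash output.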
We run the configuration file above with Logstash:
sudo ./bin/logstash -f f
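The output is:
We can see that although a document is split across three lines, it is still correctly recognized as a single document. We also see \n characters in the document, and we need to remove them in the following processing steps.
Next, we use the csv filter to process the events: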
f
input {
  # Read the csv file. Also use the multiline codec: everything that does not start with S or INV- is part of the prior line, because the address contains line breaks
  file {
    start_position => "beginning"
    path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "^(S|INV-)[0-9][0-9]"
      negate => "true"
      what => "previous"
    }
  }
}
filter {
  # Parse the csv values, define fields as integers and floats
  csv {
    columns => ["InvoiceItemNumber","Date","StoreNumber","StoreName","Address","City","ZipCode","StoreLocation","CountyNumber","County","Category","Categ
    convert => { "StoreNumber" => "integer" "ItemNumber" => "integer" "Category" => "integer" "CountyNumber" => "integer" "VendorNumber" => "integer" "Pack" =
    remove_field => ["message"]
  }
}
output {
  stdout {
    codec => rubydebug
  }
}
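Above, we parse the items of each CSV document into individual fields. We also use convert to turn the numeric fields into numeric types, which makes later analysis easier. Finally, we remove the message field.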
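As a rough analogy for what the csv filter does with one joined event, the following Python sketch parses a message whose quoted field contains line breaks and converts two fields to integers. It uses only the first ten column names from the configuration and a shortened sample message (both are assumptions for brevity); parse_event is a hypothetical helper, not part of Logstash.

```python
import csv
import io

# First ten column names from the config; the remaining columns are omitted here.
COLUMNS = ["InvoiceItemNumber", "Date", "StoreNumber", "StoreName",
           "Address", "City", "ZipCode", "StoreLocation",
           "CountyNumber", "County"]
# Fields converted to integers, similar in spirit to the convert option.
INT_FIELDS = {"StoreNumber", "CountyNumber"}

def parse_event(message):
    """Split one multiline event into named fields and convert the integers."""
    # The csv module keeps newlines that sit inside a quoted field.
    row = next(csv.reader(io.StringIO(message, newline="")))
    doc = dict(zip(COLUMNS, row))
    for name in INT_FIELDS:
        doc[name] = int(doc[name])
    return doc

# Shortened event (assumption: only the first ten fields are kept).
message = ('S29198800001,11/20/2015,2191,Keokuk Spirits,1013 MAIN,KEOKUK,52632,'
           '"1013 MAIN\nKEOKUK 52632\n(40.39978, -91.387531)",56,Lee')
doc = parse_event(message)
print(doc["StoreNumber"], doc["County"])
```

Because the address is quoted, the embedded line breaks stay inside StoreLocation instead of splitting the record, which matches what we observe in the parsed Logstash event.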
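Re-run Logstash and check the result:
Above, we see that County and City are in uppercase letters, and we want to convert them to lowercase. We also find \n characters in StoreLocation. We add mutate to the filter section to handle both: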
f