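Extracting Table Data from PDFs in Java with tabula
Problem: how can specific table data be extracted from a PDF file?
Toolkits tried: pdfbox and tabula. tabula was chosen in the end.
Comparison of the two tools
pdfbox
pdfbox can extract the content of a PDF directly into a String. Code snippet: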
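import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper; // PDFBox 2.x package name

public static void readPdf(String path) {
    // try-with-resources closes the document even if extraction fails
    try (PDDocument document = PDDocument.load(new File(path))) {
        PDFTextStripper textStripper = new PDFTextStripper();
        // Order the text by its position on the page rather than by stream order
        textStripper.setSortByPosition(true);
        String text = textStripper.getText(document);
        System.out.println(text);
    } catch (IOException e) {
        e.printStackTrace();
    }
}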
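However, table data like the following loses its layout: no matter how many empty cells lie between two values, they end up as a single tab character (\t).
input1.pdf
After conversion to a String it looks like this:
pdfbox pros: quick and simple to use; after adding the Maven dependency, getText() extracts the text.
pdfbox cons: the layout is lost when extracting table data that contains consecutive empty cells.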
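The layout loss described above can be illustrated without pdfbox at all. The following is a minimal sketch that only simulates the behaviour (it is not pdfbox code); the class name, the `collapse` helper, and the sample rows are hypothetical. It shows that once runs of empty cells are reduced to one tab, two rows whose values sit in different columns become indistinguishable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CollapsedCellsDemo {

    // Simulates the lossy behaviour: empty cells are dropped and the
    // remaining cells are joined with a single tab character.
    static String collapse(List<String> cells) {
        List<String> nonEmpty = new ArrayList<>();
        for (String cell : cells) {
            if (!cell.isEmpty()) {
                nonEmpty.add(cell);
            }
        }
        return String.join("\t", nonEmpty);
    }

    public static void main(String[] args) {
        // Two hypothetical rows whose second value sits in different columns.
        List<String> rowA = Arrays.asList("2019", "", "", "42");
        List<String> rowB = Arrays.asList("2019", "", "42", "");

        // Both rows collapse to the same text, so the original column
        // of "42" can no longer be recovered from the String.
        System.out.println(collapse(rowA).equals(collapse(rowB))); // prints true
    }
}
```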
tabula
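The focus here is tabula. It is also built on pdfbox underneath, but the tabula wrapper is better suited to extracting tables with complex layouts.
The same PDF table, converted to CSV, looks like this:
output1.csv
The table is reproduced essentially perfectly.
Next, tables with other layouts were converted.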
input2.pdf
output2.csv
input3.pdf
output3.csv
Test results: input1 and input2 are reproduced almost exactly; input3 shows some differences, but the values read back with a BufferedReader essentially match the PDF.
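Reading the generated CSV back with a BufferedReader could look like the sketch below. The class and method names are hypothetical, and it assumes simple rows without quoted fields that contain commas or line breaks; a CSV library would be the safer choice for such files. The sample data in `main` is made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CsvRowReader {

    // Reads every line and splits it on commas. The limit of -1 keeps
    // trailing empty cells, so empty columns are not silently dropped.
    // Assumption: no quoted fields containing commas or line breaks.
    static List<List<String>> readRows(Reader source) {
        List<List<String>> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(Arrays.asList(line.split(",", -1)));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // For a real file, pass a java.io.FileReader for the CSV path instead.
        String sample = "name,,,score\nalice,,,90\n";
        for (List<String> row : readRows(new StringReader(sample))) {
            System.out.println(row.size() + " cells: " + row);
        }
    }
}
```

Because the empty cells survive as empty strings, each value keeps its column index, which is exactly what the plain-text extraction loses.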
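Using tabula
1. Getting tabula
1.1 Getting the source code
Download tabula-java-master.zip, build tabula into a jar with Eclipse, and reference the jar from your own project. Alternatively, download tabula-1.0.2-jar-with-dependencies.jar directly.
1.2 Getting the Windows client tool
Download tabula-win-1.2.0.zip from hnology, unzip it, and run it.
2. Usage
2.1 Reading README.md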
## Usage Examples
`tabula-java` provides a command line application:
$ java -jar target/tabula-1.0.2-jar-with-dependencies.jar --help
usage: tabula [-a <AREA>] [-b <DIRECTORY>] [-c <COLUMNS>] [-d] [-f
<FORMAT>] [-g] [-h] [-i] [-l] [-n] [-o <OUTFILE>] [-p <PAGES>] [-r]
[-s <PASSWORD>] [-t] [-u] [-v]
Tabula helps you extract tables from PDFs
-a,--area <AREA> Portion of the page to analyze. Accepts top,
left,bottom,right.
Example: --area 269.875,12.75,790.5,561.
If all values are between 0-100 (inclusive)
and preceded by '%', input will be taken as
% of actual height or width of the page.
Example: --area %0,0,100,50.
To specify multiple areas, -a option should
be repeated. Default is entire page
-b,--batch <DIRECTORY> Convert all .pdfs in the provided directory.
-c,--columns <COLUMNS> X coordinates of column boundaries. Example
--columns 10.1,20.2,30.3
-d,--debug Print detected table areas instead of
processing.
-f,--format <FORMAT> Output format: (CSV,TSV,JSON). Default: CSV
-g,--guess Guess the portion of the page to analyze per
page.
-h,--help Print this help text.
-i,--silent Suppress all stderr output.
-l,--lattice Force PDF to be extracted using lattice-mode
extraction (if there are ruling lines
separating each cell, as in a PDF of an Excel
spreadsheet)
-n,--no-spreadsheet [Deprecated in favor of -t/--stream] Force PDF
not to be extracted using spreadsheet-style
extraction (if there are no ruling lines
separating each cell)
-o,--outfile <OUTFILE> Write output to <file> instead of STDOUT.
Default: -
-p,--pages <PAGES> Comma separated list of ranges, or all.
Examples: --pages 1-3,5-7, --pages 3 or
--pages all. Default is --pages 1
-r,--spreadsheet [Deprecated in favor of -l/--lattice] Force
PDF to be extracted using spreadsheet-style
extraction (if there are ruling lines
separating each cell, as in a PDF of an Excel
spreadsheet)
-s,--password <PASSWORD> Password to decrypt document. Default is empty
-t,--stream Force PDF to be extracted using stream-mode
extraction (if there are no ruling lines
separating each cell)
-u,--use-line-returns Use embedded line returns in cells. (Only in
spreadsheet mode.)
-v,--version Print version and exit.
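Some of the additional options can be used as needed.
-a: restricts parsing to a rectangular area, similar to pdfbox's PDFTextStripperByArea.addRegion(). -a takes four comma-separated values, which are, in order:
the distance (or percentage) from the top edge of the page to the top edge of the area
the distance (or percentage) from the left edge of the page to the left edge of the area
the distance (or percentage) from the top edge of the page to the bottom edge of the area
the distance (or percentage) from the left edge of the page to the right edge of the area
A leading % means the values are percentages, for example -a %10,0,90,100.
-o: writes the result to a file; followed by the file path
-p: extracts the specified pages; followed by page numbers. Defaults to 1 if omitted
-t: extracts in stream mode; use it when the table contains merged cells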
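The ordering of the four -a values is easy to get wrong, so here is a minimal sketch of how the argument could be assembled. The class and its helpers are hypothetical and not part of tabula. `toAbsolute` shows the arithmetic implied by the help text: top and bottom are taken relative to the page height, left and right relative to the page width. The 612 x 792 page size in `main` is an assumed example value.

```java
public class AreaArgument {

    // Joins the four edges in the order -a expects: top,left,bottom,right.
    static String absolute(double top, double left, double bottom, double right) {
        return top + "," + left + "," + bottom + "," + right;
    }

    // Same order, with the leading '%' that marks the values as percentages.
    static String percent(double top, double left, double bottom, double right) {
        return "%" + absolute(top, left, bottom, right);
    }

    // Converts {top, left, bottom, right} percentages to absolute distances.
    // Top and bottom scale with the page height, left and right with the width.
    static double[] toAbsolute(double[] pct, double pageWidth, double pageHeight) {
        return new double[] {
            pct[0] * pageHeight / 100.0, // top
            pct[1] * pageWidth / 100.0,  // left
            pct[2] * pageHeight / 100.0, // bottom
            pct[3] * pageWidth / 100.0   // right
        };
    }

    public static void main(String[] args) {
        // Percentage form, matching the example -a %10,0,90,100.
        System.out.println("-a " + percent(10, 0, 90, 100));

        // The same area as absolute values on an assumed 612 x 792 page.
        double[] a = toAbsolute(new double[] {10, 0, 90, 100}, 612, 792);
        System.out.println("-a " + absolute(a[0], a[1], a[2], a[3]));
    }
}
```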
2.2 Running from the command line
Run the jar directly from the cmd command-line tool:
java -jar tabula-1.0.2.jar E:\tmp\input\input1.pdf -o E:\tmp\output\output1.csv
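2.3 Calling from within a program
// Backslashes must be escaped inside a Java string literal
String cmd = "java -jar tabula-1.0.2.jar E:\\tmp\\input\\input1.pdf -o E:\\tmp\\output\\output1.csv";
Runtime.getRuntime().exec(cmd);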
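As an alternative to passing one command string to Runtime, the call could be made with ProcessBuilder, which takes each argument separately and makes it easy to wait for the exit code. This is a sketch under assumptions: the class and method names are hypothetical, the jar name and paths are the ones used in this article, and `java` is assumed to be on the PATH.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TabulaRunner {

    // Builds the argument list. Each argument is its own element, so paths
    // that contain spaces need no manual quoting.
    static List<String> buildCommand(String jarPath, String inputPdf, String outputCsv) {
        return new ArrayList<>(Arrays.asList(
                "java", "-jar", jarPath, inputPdf, "-o", outputCsv));
    }

    // Starts the process, waits for it to finish and returns its exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // forward the child's stdout and stderr to this process
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        List<String> command = buildCommand(
                "tabula-1.0.2.jar",
                "E:\\tmp\\input\\input1.pdf",
                "E:\\tmp\\output\\output1.csv");
        int exitCode = run(command);
        System.out.println("tabula exited with code " + exitCode);
    }
}
```

Waiting for the process matters here because the CSV file is only complete once tabula has exited; reading it immediately after starting the process may yield partial output.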