
Extracting Table Data from PDFs in Java with tabula

Updated: 2023-07-20 20:13:33

Problem: how can the data of a specific table be extracted from a PDF file?

Libraries tried: pdfbox and tabula. tabula was chosen in the end.

Comparing the two tools

pdfbox

pdfbox can extract the contents of a PDF directly into a String. Code snippet:

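// Imports assume the PDFBox 2.x package layout
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public static void readPdf(String path) {
    // try-with-resources closes the document even if extraction fails
    try (PDDocument document = PDDocument.load(new File(path))) {
        PDFTextStripper textStripper = new PDFTextStripper();
        // Emit text in reading order based on its position on the page
        textStripper.setSortByPosition(true);
        String text = textStripper.getText(document);
        System.out.println(text);
    } catch (IOException e) {
        e.printStackTrace();
    }
}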

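However, table data like the following loses its layout: no matter how many empty cells sit between two values, they are collapsed into a single tab character (\t).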

input1.pdf


After conversion to a String it looks like this:

pdfbox pros: quick and simple to use. Once the Maven dependency is added, a call to getText() extracts the text.

pdfbox cons: the layout is lost when extracting table data that contains runs of consecutive empty cells.
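The effect of that collapse can be shown without any PDF. The sketch below is only an illustration: the row contents are made up, and the class `TabCollapseDemo` and its `cells` helper are hypothetical names. It compares a four-column row whose two middle cells are empty in both forms: collapsed to a single tab, and written with one delimiter per cell boundary as a CSV file would be.

```java
import java.util.Arrays;

class TabCollapseDemo {
    // Splits one extracted line into cells; limit -1 keeps empty cells.
    static String[] cells(String line, String delimiter) {
        return line.split(delimiter, -1);
    }

    public static void main(String[] args) {
        // Hypothetical row: four columns, the two middle cells are empty.
        String collapsed = "A\tD";   // empty cells collapsed into one tab
        String preserved = "A,,,D";  // one delimiter per cell boundary

        // Two cells: the column that "D" belonged to can no longer be recovered.
        System.out.println(Arrays.toString(cells(collapsed, "\t")));
        // Four cells: "D" is still in the fourth column.
        System.out.println(Arrays.toString(cells(preserved, ",")));
    }
}
```

With the collapsed form, a consumer cannot tell whether "D" belongs to the second, third, or fourth column, which is why the column structure has to be preserved at extraction time.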

tabula

The focus here is tabula. It is also built on pdfbox under the hood, but the wrapper it adds is better suited to extracting tables with complex layouts.

The same PDF table, converted to CSV, looks like this:

output1.csv

The table is reproduced essentially perfectly.

Next, tables with other layouts were converted.

input2.pdf

output2.csv

input3.pdf

output3.csv

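Test results: input1 and input2 are reproduced almost exactly. input3 shows some differences, but the values read back through a BufferedReader essentially match the PDF.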
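Reading the generated CSV back with a BufferedReader might look like the following minimal sketch. The class name `CsvRows` and the sample rows are made up for illustration, and the comma split is a simplification: it assumes no quoted field contains a comma or a line break, which real CSV output may violate.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class CsvRows {
    // Reads every line and splits it on commas; limit -1 keeps empty cells.
    // Simplification: quoted fields containing commas or newlines are not handled.
    static List<String[]> read(Reader source) {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(line.split(",", -1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for a file such as output1.csv;
        // for a real file, pass a java.io.FileReader for its path instead.
        String sample = "name,q1,q2,q3\nalpha,1,,3\nbeta,,,\n";
        for (String[] row : read(new StringReader(sample))) {
            System.out.println(row.length + " cells: " + String.join("|", row));
        }
    }
}
```

Every row keeps four cells, including the rows whose middle or trailing cells are empty, so values can be addressed by column index.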

Using tabula


1. Obtaining tabula

1.1 Getting the source code

Download tabula-java-master.zip, build tabula into a jar with Eclipse, and reference that jar from your own project. Alternatively, download tabula-1.0.2-jar-with-dependencies.jar directly.

1.2 Getting the Windows client tool

Download tabula-win-1.2.0.zip from hnology, unzip it, and run it.

2. Usage

2.1 Reading README.md

## Usage Examples

`tabula-java` provides a command line application:

$ java -jar target/tabula-1.0.2-jar-with-dependencies.jar --help
usage: tabula [-a <AREA>] [-b <DIRECTORY>] [-c <COLUMNS>] [-d] [-f
       <FORMAT>] [-g] [-h] [-i] [-l] [-n] [-o <OUTFILE>] [-p <PAGES>] [-r]
       [-s <PASSWORD>] [-t] [-u] [-v]

Tabula helps you extract tables from PDFs

 -a,--area <AREA>           Portion of the page to analyze. Accepts top,
                            left,bottom,right.
                            Example: --area 269.875,12.75,790.5,561.
                            If all values are between 0-100 (inclusive)
                            and preceded by '%', input will be taken as
                            % of actual height or width of the page.
                            Example: --area %0,0,100,50.
                            To specify multiple areas, -a option should
                            be repeated. Default is entire page
 -b,--batch <DIRECTORY>     Convert all .pdfs in the provided directory.
 -c,--columns <COLUMNS>     X coordinates of column boundaries. Example
                            --columns 10.1,20.2,30.3
 -d,--debug                 Print detected table areas instead of
                            processing.
 -f,--format <FORMAT>       Output format: (CSV,TSV,JSON). Default: CSV
 -g,--guess                 Guess the portion of the page to analyze per
                            page.
 -h,--help                  Print this help text.
 -i,--silent                Suppress all stderr output.
 -l,--lattice               Force PDF to be extracted using lattice-mode
                            extraction (if there are ruling lines
                            separating each cell, as in a PDF of an Excel
                            spreadsheet)
 -n,--no-spreadsheet        [Deprecated in favor of -t/--stream] Force PDF
                            not to be extracted using spreadsheet-style
                            extraction (if there are no ruling lines
                            separating each cell)
 -o,--outfile <OUTFILE>     Write output to <file> instead of STDOUT.
                            Default: -
 -p,--pages <PAGES>         Comma separated list of ranges, or all.
                            Examples: --pages 1-3,5-7, --pages 3 or
                            --pages all. Default is --pages 1
 -r,--spreadsheet           [Deprecated in favor of -l/--lattice] Force
                            PDF to be extracted using spreadsheet-style
                            extraction (if there are ruling lines
                            separating each cell, as in a PDF of an Excel
                            spreadsheet)
 -s,--password <PASSWORD>   Password to decrypt document. Default is empty
 -t,--stream                Force PDF to be extracted using stream-mode
                            extraction (if there are no ruling lines
                            separating each cell)
 -u,--use-line-returns      Use embedded line returns in cells. (Only in
                            spreadsheet mode.)
 -v,--version               Print version and exit.

Some of these optional parameters can be used as the situation requires.

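-a: restricts parsing to a rectangular area of the page, similar to pdfbox's PDFTextStripperByArea.addRegion(). It takes four comma-separated values, which are, in order:

the distance (or percentage) from the top edge of the page to the top edge of the area

the distance (or percentage) from the left edge of the page to the left edge of the area

the distance (or percentage) from the top edge of the page to the bottom edge of the area

the distance (or percentage) from the left edge of the page to the right edge of the area

A leading % marks the values as percentages, for example -a %10,0,90,100.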
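The four values can be assembled programmatically. The sketch below is an assumption-laden illustration, not part of tabula: `AreaArg` and its methods are hypothetical helpers, the 595 x 842 page size (roughly A4 in points) is only an example, and the mapping of top/bottom to page height and left/right to page width is inferred from the help text's wording "% of actual height or width of the page". It also assumes that tabula accepts decimal values such as 10.0 after the % marker, as it does for absolute values in the README example.

```java
class AreaArg {
    // Formats absolute distances in the order the option expects: top,left,bottom,right.
    static String absolute(double top, double left, double bottom, double right) {
        return top + "," + left + "," + bottom + "," + right;
    }

    // Formats percentages (0-100 inclusive) with the leading '%' marker.
    static String percent(double top, double left, double bottom, double right) {
        for (double v : new double[] {top, left, bottom, right}) {
            if (v < 0 || v > 100) {
                throw new IllegalArgumentException("percentage out of range: " + v);
            }
        }
        return "%" + absolute(top, left, bottom, right);
    }

    // Converts percentages to absolute distances (assumed mapping):
    // top and bottom scale with the page height, left and right with the width.
    static double[] toAbsolute(double[] pct, double pageWidth, double pageHeight) {
        return new double[] {
            pct[0] / 100 * pageHeight,
            pct[1] / 100 * pageWidth,
            pct[2] / 100 * pageHeight,
            pct[3] / 100 * pageWidth
        };
    }

    public static void main(String[] args) {
        System.out.println("-a " + percent(10, 0, 90, 100));
        // Example page size only: 595 x 842 points.
        double[] abs = toAbsolute(new double[] {10, 0, 90, 100}, 595, 842);
        System.out.println("-a " + absolute(abs[0], abs[1], abs[2], abs[3]));
    }
}
```

Note that the third and fourth values are positions of the bottom and right edges measured from the top and left of the page, not the height and width of the area.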

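-o: writes the result to a file; followed by the file path

-p: extracts the given pages; followed by a page number, defaulting to 1 if omitted

-t: extracts in stream mode; use it when the table contains merged cells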

2.2 Running from the command line

Run the jar directly from the cmd command-line tool:

java -jar tabula-1.0.2.jar E:\tmp\input\input1.pdf -o E:\tmp\output\output1.csv

2.3 Calling from within a program

String cmd = "java -jar tabula-1.0.2.jar E:\\tmp\\input\\input1.pdf -o E:\\tmp\\output\\output1.csv"; // backslashes escaped for the Java string literal
Runtime.getRuntime().exec(cmd);
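As a possible alternative to a single command string, the invocation can be built as an argument list and started with ProcessBuilder, which also makes it easy to wait for the process and check its exit code. This is a minimal sketch under stated assumptions: `TabulaCommand`, `build`, and `run` are hypothetical names, the jar and file paths are the ones used in the article, and only the -o, -p, and -t options described above are covered.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class TabulaCommand {
    // Builds the argument list. Each argument is a separate element, so paths
    // containing spaces need no additional quoting.
    static List<String> build(String jarPath, String inputPdf, String outputCsv,
                              String pages, boolean stream) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add(inputPdf);
        cmd.add("-o");
        cmd.add(outputCsv);
        if (pages != null) {   // e.g. "1-3,5-7" or "all"
            cmd.add("-p");
            cmd.add(pages);
        }
        if (stream) {          // stream-mode extraction
            cmd.add("-t");
        }
        return cmd;
    }

    // Starts the process, forwards its output to this process, and returns the exit code.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = build("tabula-1.0.2.jar",
                "E:\\tmp\\input\\input1.pdf", "E:\\tmp\\output\\output1.csv", "all", true);
        System.out.println(String.join(" ", cmd));
        // Uncomment once the jar and the input file exist at these paths:
        // int exitCode = run(cmd);
    }
}
```

Waiting for the exit code matters here because the CSV file is only complete once the child process has finished; reading it immediately after starting the process may yield a partial or missing file.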
