使⽤JavaStAX解析超⼤xml(超过60g)⽂件,并将其存⼊数
据库(MySQL)
遇到的问题
本⼈需要解析Stack Overflow的(xml格式)将数据其存⼊数据库,其中关于Stack Overflow帖⼦(Posts)的xml⽂件超过了60G。
那么如何解析那么⼤的xml⽂件呢(Stack Overflow上有解决⽅案-)?
解决⽅案
或许你已经想到了分块读取,然后解析。那么如何分块解析呢?Java中处理xml⽂件有两种处理⽅案(DOM和Event Driven)。DOM需要
将所有⽂件读取,在内存中构建DOM树,显然这种⽅案不⾏;只能选择基于事件的⽅式,本⼈选择了StAX,或许⽤SAX的⼈⽐较多,那么
SAX和StAX的区别是什么呢?,这是说明: 。
你需要了解的是:什么是事件驱动。(懂得这个,也就懂得了StAX为什么是分块处理了)
代码实现
<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="573" ViewCount="37080" Body="" OwnerUrId="8" LastE <row Id="6" PostTypeId="1" AcceptedAnswerId="31" CreationDate="2008-07-31T22:08:08.620" Score="256" ViewCount="16306" Body="" OwnerUrId="9" Las <row Id="7" PostTypeId="2" ParentId="4" CreationDate="2008-07-31T22:17:57.883" Score="401" Body="" OwnerUrId="9" LastEditorUrId="4020527" LastEd <row Id="9" PostTypeId="1" AcceptedAnswerId="1404" CreationDate="2008-07-31T23:40:59.743" Score="1743" ViewCount="480476" Body="" OwnerUrId="1 <row Id="11" PostTypeId="1" AcceptedAnswerId="1248" CreationDate="2008-07-31T23:55:37.967" Score="1348" ViewCount="136033" Body="" OwnerUrId=" <row Id="12" PostT
ypeId="2" ParentId="11" CreationDate="2008-07-31T23:56:41.303" Score="320" Body="" OwnerUrId="1" LastEditorUrId="1271898" Las <row Id="13" PostTypeId="1" CreationDate="2008-08-01T00:42:38.903" Score="539" ViewCount="157009" Body="" OwnerUrId="9" LastEditorUrId="532136 </posts>
数据库设计
由于帖⼦(Post)可以分为两种问题和回答(通过属性PostTypeId的值确定),因为数据量很⼤(4200万⾏左右),所以我设计了两张
表。
-- PostTypeId = 1
-- Id CreationDate Score ViewCount OwnerUrId Tags AnswerCount FavoriteCount CREATE TABLE `questions` (
密不可分的英文`Id` int,
`CreationDate` datetime,
`Score` int,
`ViewCount` int,
`OwnerUrId` int,
`Tags` varchar(250),
`AnswerCount` int,
`FavoriteCount` int,
PRIMARY KEY (`Id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
-- PostTypeId = 2
--Id ParentId CreationDate Score CommentCount
CREATE TABLE `answers` (
`Id` int,
`ParentId` int,白萝卜汤做法
`CreationDate` datetime,
`Score` int,
`OwnerUrId` int,
`CommentCount` int,
PRIMARY KEY (`Id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Java代码
sql.jdbc.PreparedStatement;
l.XMLConstants;
l.stream.XMLInputFactory;
l.stream.XMLStreamException;
l.stream.XMLStreamReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Date;
/
**点子公司
* Desc: Par & Import(to mysql db) Stack Overflow dump xml file
* Created by Myth on 10/12/2018
*/
public class XmlProcessor {
private Connection connection = null;
/**
* Get Db Connection
* @throws ClassNotFoundException
* @throws SQLException
*/
public void openMysqlConnection() throws ClassNotFoundException, SQLException { String driver = "sql.jdbc.Driver";
String url = "jdbc:mysql://localhost:3306/stackoverflow";
String urname = "root";
String password = "123456";
Connection connection = null;
Class.forName(driver);
综的组词connection = Connection(url, urname, password);
}
public void cloConnection() throws SQLException {
}
/**
*
* @param filePath
* @param 每 commitCount ⾏提交⼀次
* @throws SQLException
* @throws FileNotFoundException
* @throws XMLStreamException
*/
public void parPosts(String filePath, int commitCount) throws SQLException, FileNotFoundException, XMLStreamException { // 计时器 starts
Long begin = new Date().getTime();
// 组合sql语句
String prefixQuestions = "INSERT INTO questions VALUES ";
String prefixAnswers = "INSERT INTO answers VALUES ";
StringBuffer suffixQuestions = new StringBuffer();
StringBuffer suffixAnswers = new StringBuffer();
// 设置事务为⾮⾃动提交
// PreparedStatement 执⾏ sql语句
PreparedStatement pst = (PreparedStatement) tion.prepareStatement("");
// 解析xml获得数据
XMLInputFactory inputFactory = wInstance();
inputFactory.tProperty("/xml/jaxp/properties/getEntityCountInfo", "yes");
// 设置entity size , 否则会报 JAXP00010004 错误
爱国福图片inputFactory.tProperty("/xml/jaxp/properties/totalEntitySizeLimit", Integer.MAX_VALUE);
File file = new File(filePath);
InputStream isS= new FileInputStream(file);
XMLStreamReader streamReader = ateXMLStreamReader(isS);
int countRow = 0;
// Q: Id CreationDate Score ViewCount OwnerUrId Tags AnswerCount FavoriteCount
// A: Id ParentId CreationDate Score CommentCount
String id, creationDate, score, viewCount, ownerUrId, tags, answerCount, favoriteCount, parentId, commentCount;
String postTypeId;
String sqlQuestions = null, sqlAnswers = null;
岑参拼音// 存储数据
while(streamReader.hasNext()) {
<();
EventType() == XMLStreamReader.START_ELEMENT){
if (LocalName().equals("row")) {
postTypeId = AttributeValue(null,"PostTypeId");
id = AttributeValue(null,"Id");
creationDate = AttributeValue(null,"CreationDate");
score = AttributeValue(null,"Score");
viewCount = AttributeValue(null,"ViewCount");
ownerUrId = AttributeValue(null,"OwnerUrId");
tags = AttributeValue(null,"Tags");
answerCount = AttributeValue(null,"AnswerCount");
qos限速
favoriteCount = AttributeValue(null,"FavoriteCount");
parentId = AttributeValue(null,"ParentId");
commentCount = AttributeValue(null,"CommentCount");
// 1 Question, 2 Answer
if ("1".equals(postTypeId)) {
suffixQuestions.append("(" + id + "," + "\"" + creationDate + "\"" + "," +
score + "," + viewCount + "," + ownerUrId + "," +
"\"" + tags + "\"" + "," + answerCount + "," + favoriteCount + "),");
} el {
suffixAnswers.append("(" + id + "," + parentId + "," + "\"" + creationDate + "\"" + "," +
score + "," + ownerUrId + "," + commentCount + "),");
}
countRow += 1; // 记录⾏数
if (countRow % commitCount == 0) {
// System.out.print("Count: " + String(count));养鱼塘
// 构建完整sql
sqlQuestions = prefixQuestions + suffixQuestions.substring(0, suffixQuestions.length() - 1); sqlAnswers = prefixAnswers + suffixAnswers.substring(0, suffixAnswers.length() - 1);
// 添加执⾏sql
pst.addBatch(sqlQuestions);
pst.addBatch(sqlAnswers);
// 执⾏操作
// 提交事务
// 清空上⼀次添加的数据
suffixQuestions = new StringBuffer();
suffixAnswers = new StringBuffer();
System.out.println("Committed: " + countRow + " √");
}
}
}
}
if (suffixQuestions.length() != 0) {
sqlQuestions = prefixQuestions + suffixQuestions.substring(0, suffixQuestions.length() - 1);
pst.addBatch(sqlQuestions);
}
if (suffixAnswers.length() != 0) {
sqlAnswers = prefixAnswers + suffixAnswers.substring(0, suffixAnswers.length() - 1);
/
/ System.out.println(suffixAnswers.substring(0, suffixAnswers.length() - 1));
pst.addBatch(sqlAnswers);
}
System.out.println("Committed All: " + countRow + " √");
pst.clo();
// 耗时
Long end = new Date().getTime();
System.out.println("Cast : " + (end - begin) / 1000 + " s");
}
}
⼤约需要10多分钟就可以将全部数据(4200多万⾏)导⼊。
总结
Bug
Exception in thread "main" l.stream.XMLStreamException: ParError at [row,col]:[1077158,4084]
Message: JAXP00010004: The accumulated size of entities is "50,000,001" that exceeded the "50,000,000" limit t by "FEATURE_SECURE_PROCESSING".
原因:java xml解析器⾃带limit(),⽤来控制读⼊的⼤⼩、控制内存..... 因为要处理的xml⽂件过⼤,超过了默认的limit,所以在java代码71⾏,设置了limit的⼤⼩,但是即使设置最⼤,也勉强能够处理4000多万条数据,如果超过这个数量,就不能通过这个⽅法了,可以将超⼤的xml切分成⼏个xml⽂件,然后按照上述处理。
其他问题
使⽤batch 插⼊,可以提⾼效率
使⽤ XMLStreamReader ⽽不⽤ XMLEventReader(区别见官⽅⽂档、教程)提⾼效率