DBLP XML Requests
货代销售
Appendix to the paper“DBLP—Some Lessons Learned”(June17,2009)
Michael Ley
Universit¨at T rier,Informatik
D–54286T rier
Germany
ley@uni-trier.de
ABSTRACT
If your software needs only in a few facts from DBLP,down-loading the lfile may be a too costly burden. The web pages are intended for humans,wrappers are al-ways expod to the risk of formatting changes.In this ap-pendix we describe a very basic API for DBLP.Our example application for the API is a simple crawler whichfinds the shortest path between two DBLP authors in the coauthor graph.In addition the appendix lists code to map person names to DBLP URLs.
DBLP Records
If you know the key of a DBLP record,you may retrieve the record from the URL
dblp.uni-trier.de/rec/l
请造句
<dblp>
<article key="journals/acta/BayerM72"
mdate="2003-11-25">
<author>Rudolf Bayer</author>
<author>Edward M.McCreight</author>
<title>Organization and Maintenance
of Large Ordered Indices</title>
紫外线消毒
...
</article>
</dblp>
Obviously this looks like a version of the lfile which has been shrunken to just one record.But there is an important difference:The header does not refer to an DTD and we do not u symbolic entities.All non-ASCII characters are encoded by numeric entities,in the header the encoding intentionally has not been specified.This pure-ASCII XML document without symbolic entities may be pard very fast by a lightweight parr.The same encoding Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the copyright notice and the title of the publication and its date appear,and notice is given that copying is by permission of Michael Ley.To copy otherwi,or to republish,to post on rvers or to redistribute to lists, requires a fee and/or special permission from the author.
DBLP,University of Trier
Copyright2009Michael Ley.is ud by the other rvices described in the quel of this ction.
Person Search
On dblp.uni-trier.de/DBLP provides a primi-tive form to arch for persons inside the bibliography.If you type in Schek,you will get an answer on page
dblp.uni-trier.de/arch/author?author=Schek This page is an HTML document.Modify the URL by in-rting an‘x’after the question mark:
.../arch/author?xauthor=Schek
and you will get an XML version of the answer page:
<?xml version="1.0"?>
<authors>
<author urlpt="s/Schek:Hans=J=ouml=rg">
Hans-Jörg Schek</author>
<author urlpt="s/Schekelmann:Andr=eacute=">
AndréSchekelmann</author>
</authors>
In the DBLP person arch a query is interpreted as a t of prefixes of name parts.If you enter a few words,you get the names which include the words as prefixes of some name parts.The query string and the names stored in DBLP are broken in parts.The delimiters of this tokenizing are spaces and punctuation marks.The punctuation marks are not relevant for the matching.Ar-b-c.produces the same result as Ar b c.The matching is not ca-nsitive.The order of the query words does not he queries Petra M A and M Petra A are equivalent.For words which end with a$-sign,only exact matches of this word are shown. Try the queries xi li,xi$li xi li$,and xi$li$(xi$li and xi$li are equivalent).
As long as you restrict your query to ASCII(<128)the arch engine matches‘diacritic innsitive’,i.e.the query moller matches moller,m¨o ller,møller,m´o ller,etc.As soon as your query contains any diacritic mark,the matching becomes exact for diacritics.Now Ren´e matches Ren´e but not Rene or Ren`e.
The preferred encoding to transmit the query is UTF-8. As soon as the query contains a byte quen
ce which is ille-gal in UTF-8,the incoming byte quence is interpreted as Latin-1.Additionally,the arch engine understands sym-bolic entities listed in dblp.dtd.The author arch accepts queries using the GET or the POST method.The answer
length has a hard limit:If the query produces more than 1000hits,the result is truncated.
In the XML version the answer is enclod in an authors element.For each hit,it contains an author element with the matching person name.For convenience we have added the attribute urlpt to the author elements.This attribute contains the esntial part of the name-to-URL mapping,by this for many applications it is no longer necessary to im-plement the make file name function described in the ap-pendix.To get the URL of a person page,a DBLPbURL, indices/a-tree/,the urlpt value,and the.html suffix have to be concatenated.
世界上最长寿的人1065岁If this simple arch engine for persons is not sufficient for your purpos,you should implement your own one.The textfile
dblp.uni-trier.de/db/indices/AUTHORS
lists all DBLP person names except homonyms(e next subction).The encoding is pure ASCII,Latin-1characters are reprented by symbolic entities,each line contains one name.
Publications of a Person
Thefirst part of a person page lists all known publications this person is involved in.This information is available in XML,too.Requests to URLs of the form
dblp.uni-trier.de/rec/pers/urlpt/xk result in XML answers like(for urlpt=Wong:Curtis):
<?xml version="1.0"?>
<dblpperson name="Curtis Wong">
<dblpkey>journals/sigmod/Wong08</dblpkey>
<dblpkey>conf/chi/HuynhDBW05</dblpkey>
...
</dblpperson>
In the enclosing dblpperson element the attribute name spec-ifies the person’s name.The keys of the bibliographic records are contained in dblpkey elements.All records this person is involved in either a
s author or in the role of an editor are enumerated.If a“person record”has been created for this person,it’s key is quoted ahead the regular bibliographic records.The entry is marked by an attribute:
<dblpkey type="person record">homepages/... Additional information is provided if homonyms are known: <?xml version="1.0"?>
<dblpperson name="Michael Meier">
<homonym>m/Meier:Michael</homonym>
<homonym>m/Meier0004:Michael</homonym>
<homonym>m/Meier0003:Michael</homonym>
<dblpkey type="person record">
homepages/m/MichaelMeier2</dblpkey>
<dblpkey>conf/edbt/LaunMS08</dblpkey>
<dblpkey>journals/corr/abs-0812-3788</dblpkey> </dblpperson>
Our example was requested from
.../rec/pers/m/Meier0002:Michael/xk
.../arch/author?xauthor=Michael+Meier
DBLP returns only
<?xml version="1.0"?>
<authors>
<author urlpt="m/Meier:Michael">
Michael Meier</author>
</authors>
but the homonym elements shown above make it easy to deal with ambiguous person names.
Coauthors
If you load all bibliographic records of a person,it is a trivial step to get the t of coauthors for this person.But loading all bibliographic records may be expensive.The URLs dblp.uni-trier.de/rec/pers/urlpt/xc
give direct access to the coauthor lists.For example
.../rec/pers/h/Halevy:Alon Y=/xc
returns the XML document
<?xml version="1.0"?>
<coauthors author="Alon Y.Halevy">
<author urlpt="a/Abiteboul:Serge"
count="3">Serge Abiteboul</author>
...
<author urlpt="z/Zhang:Yang"
count="2">Yang Zhang</author>
</coauthors>
The count attribute specifies the number of shared publica-tions of the two persons.
In a future version we plan to extend the XMI API for DBLP.Access to individual BHTfiles is not possible yet, but it may be beneficial for applications which are interested in the tables of contents pages.Indices for DOIs,ISBN and otherfields may be uful.
Shortest Path Algorithm
The java program1documented in this ction demonstrates the u of one of the DBLP XML rvices described above. It computes the shortest path between two DBLP authors in the coauthor graph.The software works like a little web crawler,it loads the information incrementally from the DBLP rver.
The program is structured in two class:The shortest path algorithm is a variant of the classic breasth-first algo-rithm,it is shown infigure1.The class Person shown in figure2hides all DBLP specific implementation details from the algorithms itlf.
The interaction beween both class is very simple:If you have a Person object,you may ask it for its neighbors by applying the method getCoauthors,which returns an array of persons.You can attach a label to each person.A label is an integer.tLabel creates a label,hasLabel tests 1An expanded version of this software is available from dblp.uni-trier.de/xml/docu/
import java.util.Collection;
import java.util.HashSet;
import java.util.Iterator;
public class CoauthorPath{
private Person path[];
public CoauthorPath(Person p1,Person p2)
{shortestPath(p1,p2);}
public Person[]getPath(){return path;}
private void tracing(int position){
Person pNow,pNext;
int direction,i,label;
label=path[position].getLabel();
direction=Integer.signum(label);
label-=direction;
while(label!=0){
pNow=path[position];
Person ca[]=Coauthors();
for(i=0;i<ca.length;i++){
pNext=ca[i];
if(!pNext.hasLabel())
continue;
Label()==label){
position-=direction;
label-=direction;
path[position]=pNext;
break;
}}}}
private void shortestPath(Person p1,Person p2){ Collection<Person>h,
now1=new HashSet<Person>(),
now2=new HashSet<Person>(),
next=new HashSet<Person>();
int direction,label,n;
if(p1==null||p2==null)
return;
if(p1==p2){
p1.tLabel(1);
path=new Person[1];
path[0]=p1;
return;
}
p1.tLabel(1);now1.add(p1);
p2.tLabel(-1);now2.add(p2);
while(true){
if(now1.isEmpty()||now2.isEmpty())
return;
if(now2.size()<now1.size()){
h=now1;now1=now2;now2=h;
}
Iterator<Person>nowI=now1.iterator();
while(nowI.hasNext()){
Person ();
Label();
direction=Integer.signum(label);
Person neighbours[]=Coauthors();
int i;
for(i=0;i<neighbours.length;i++){
Person px=neighbours[i];
if(px.hasLabel()){
if(Integer.Label())
==direction)
continue;
if(direction<0){
Person ph;
ph=px;px=pnow;pnow=ph;
}
//pnow has a positive label,
//px a negative
Label()-px.getLabel();
path=new Person[n];
Label()-1]=pnow;
path[Label()]=px;
Label()-1);
tracing(Label());
return;
}
px.tLabel(label+direction);
next.add(px);
}
}
now1.clear();h=now1;now1=next;next=h;
}
}
}
Figure1:A shortest path algorithm
<;
长治学院是几本public class Person{
private static Map<String,Person>personMap=
new HashMap<String,Person>();
private String name;
private String urlpt;
狮子女和处女男
private Person(String name,String urlpt){
this.name=name;
this.urlpt=urlpt;
personMap.put(name,this);
coauthorsLoaded=fal;
labelvalid=fal;
}
static public Person create(String name,String urlpt){ Person p;
p=archPerson(name);
if(p==null)
p=new Person(name,urlpt);
return p;
}
static public Person archPerson(String name){
(name);
}
private boolean coauthorsLoaded;
private Person coauthors[];
static private SAXParr coauthorParr;
static private CAConfigHandler coauthorHandler;
static private List<Person>plist
=new ArrayList<Person>();
static private class CAConfigHandler
extends DefaultHandler{
private String Value,urlpt;
private boolean insideAuthor;
public void startElement(String namespaceURI,
String localName,String rawName,
Attributes atts)throws SAXException{
if(insideAuthor=rawName.equals("author")){
Value="";
Value("urlpt");
}}
public void endElement(String namespaceURI,
String localName,String rawName)
throws SAXException{
if(rawName.equals("author")&&
Value.length()>0){
plist.add(create(Value,urlpt));
}}
public void characters(char[]ch,int start,int length)
throws SAXException{
if(insideAuthor)
Value+=new String(ch,start,length);
}
public void warning(SAXParException e)
throws SAXException{...}
public void error(SAXParException e)
throws SAXException{...}
public void fatalError(SAXParException e)
throws SAXException{...}
}
static{
播音777try{coauthorParr=SAXParrFactory.
newInstance().newSAXParr();
coauthorHandler=new CAConfigHandler();
"xml/sax/features/validation",
fal);
}catch(ParrConfigurationException e){...
}catch(SAXException e){...}
}
private void loadCoauthors(){
if(coauthorsLoaded)
return;
plist.clear();
try{URL u=new URL(
"dblp.uni-trier.de/rec/pers/"
+urlpt+"/xc");
coauthorParr.par(u.openStream(),
coauthorHandler);
}catch(IOException e){...
}catch(SAXException e){...}
coauthors=new Person[plist.size()];
Array(coauthors);
coauthorsLoaded=true;
}
public Person[]getCoauthors(){
if(!coauthorsLoaded){loadCoauthors();}
return coauthors;
}
private int label;
private boolean labelvalid;
public int getLabel(){return(!labelvalid)?0:label;} public void retLabel(){labelvalid=fal;}
public boolean hasLabel(){return labelvalid;}
public void tLabel(int label){
this.label=label;
labelvalid=true;
}
static public void retAllLabels(){
Iterator<Person>i=personMap.values().iterator();
while(i.hasNext()){
Person ();
p.labelvalid=fal;
p.label=0;
}}
public String toString(){return name;}
}
邓明甲Figure2:The Class“Person”
if a label exists,and getLabel reads a label of a person. The class method retAllLabels deletes all labels from persons.It is obvious that you may generalize this to a simple interface to access any undirected graph(without weights)from your shortest path algorithm.
To run the algorithm,you have to create two persons and to construct a CoauthorPath object:
Person p1,p2;
ate("Jim Gray","g/Gray:Jim");
ate(...);
CoauthorPath cp=new CoauthorPath(p1,p2); Person path[]=cp.getPath();
After this you may print out the path.
Next we look at the class Person more cloly.In the code snipplet above it is noticeable that we u the class method create to produce new Person objects.The class does not have a public constructor becau it caches all objects in a class level Map.The create methodfirst tests if there is already an object for the specified name in personMap.A new object is only created,if this test fails.
After creation a Person object only contains the person’s name and the urlpt.The list of coauthors is only loaded on demand2.The booleanfield coauthorsLoaded contains the required state information.If getCoauthors is called for thefirst time,loadCoauthors is activated to fetch the information from DBLP.
We u a SAX parr to read the XMLfile.Becau the creation of a SAX parr object is an expensive task, the parr is reud.It is created in the static initializa-tion block just above loadCoauthors at class load time. Additionally the coauthorHandlerfield is initialid.In loadCoauthors an URL object is constructed to provide an InputStream for the parr.During the parsing process the SAX parr calls the methods startElement,endElement, and characters of the local CAConfigHandler.We are only interested in author elements.The boolean insideAuthor is t true as soon as we recognize an opening author tag. characters collects the element contents in Value.In the method endElement a Person object is created.The co-authors are temporarily stored in plist.This List is con-verted into thefinal coauthors array after parsing has been completed.
To make the shortest path algorithm practicable over a slow internet connection,we have to minimize the num-ber co-author lists to be loaded.Two known optimiza-tions of the breadth-first algorithm proved to be esntial for arches inside the huge connected component of the DBLP coauthor graph:(1)The arch has to start from both persons,and(2)the algorithm should prefer the side with the lower number of persons to be visited next.The method CoauthorPath.shortestPath is a straightforward implementation of the ideas.
The algorithm labels persons p1with1and p2with−1. The direct neighbors of p1are t2,the direct nei
ghbors of p2are t−2etc.The variables now1and now2contain the ts of persons who form the outer waves of the labeling process.The main loopflips the side where to advance next depending on the size of the ts.During an expan-sion step the still unvisited persons are collected in the t next.
2In the online version of the class the same mechanism is implemented for the list of publications of a person.
If now1or now2become empty,the persons are not con-nected.If the two waves hit each other,the method tracing is called to collect pathes from the meeting point to the starting points.
makefile name
This primitive C program shows the function which calcu-lates the URL of a DBLP author page from a name.For a production version,you should add some tests to prevent buffer overflow attacks.
#include<stdlib.h>
typedef char line[10000];
line name array,file name array;
char map char(char x){
if(isalnum(x))return x;
return(x==’’)?’’:’=’;
}
int is hidden suffix(char*x){
if(x==NULL)return0;
if(!isdigit(x[0])||!isdigit(x[1])||
!isdigit(x[2])||!isdigit(x[3]))return0;
return x[4]==’\0’;
}
char*makefile name(char*name){
int i=0;
char c,*lname,*fname,*help;
strcpy(name array,name);
lname=strrchr(name array,’’);
if(lname){
fname=name array;
*lname++=’\0’;
if(strcmp(lname,"Jr.")==0||
strcmp(lname,"II")==0||
strcmp(lname,"III")==0||
strcmp(lname,"IV")==0||
is hidden suffix(lname)){
help=strrchr(fname,’’);
if(help){
--lname;*lname=’’;
*help=’\0’;
lname=help+1;
}
}
}el{
fname=strrchr(name array,’\0’);
lname=name array;
}
if(lname)
while(c=*lname++)
file name array[i++]=map char(c);
file name array[i++]=’:’;
if(fname)
while(c=*fname++)
file name array[i++]=map char(c);
file name array[i]=’\0’;
return file name array;
}
void print keys url(char*name){