SDET: A Semantic Data Enrichment Tool
Application to Geographical Databases
Khaoula Mahmoudi and Sami Faïz
1 SITIS 2006, Tunisia
{khaoula.mahmoudi, sami.faiz}@
Abstract. The success of information systems relies on their capability to
provide users with the relevant information they need, when they need it.
Meeting these requirements improves response times and the
decision-making process. A geographic information system (GIS) is one
instance of such systems, where data are usually relative to a specific
application. To extend this dataset, we propose an approach to enrich the
geographical database (GDB). This enrichment is carried out by adding
knowledge extracted from web documents to the descriptive data of the GDB.
The knowledge extraction process is performed by generating summaries from
a corpus of on-line documents. This summarization is done in a distributed
fashion by using a set of cooperating agents.
Keywords: Geographic Information System, Geographic Database, Multi-Document Summarization, Multi-Agent Systems, TextTiling, Rhetorical Structure Theory.
1  Introduction
A Geographic Information System (GIS) is a vital tool to query, analyze and map the data needed to help people make better decisions in a host of areas, such as resource management, environmental monitoring, urban planning and emergency response [1], [2], [3]. The data used by a GIS can relate to the street network, land use, cadaster and others. The geographic entities handled by a GIS have two kinds of attributes: alphanumeric (for instance, the name of a country) and spatial (for example, the shape and position of the country). Despite this myriad of information, the data to be included in a GIS depends on its availability and on the scale of the application.
Due to this restriction, we need to provide complementary data sources. For instance, a manager who is planning to build a supermarket needs a panorama of social and economic information about the neighborhood of the planned site, public transportation and so on. Such information is usually not gathered in the same GIS. Besides, Jack Dangermond (the president of a private GIS software company) argued that "The application of GIS is limited only by the imagination of those who use it." Hence, it is of primary interest to provide other data sources to make these systems rich sources of information.
In this context, we propose to enhance the decision-making process by enriching the descriptive component of a GDB. This enrichment will broaden the amount of information that can be stored and made available to GIS users, and thereby increase productivity and efficiency and lead to better decision-making.
In fact, a review of the literature shows that database enrichment is necessary to equip the initial data stored in the GDB with additional information used for a variety of purposes. This enrichment has mainly focused on the spatial aspect of the GDB. One use of this data enrichment is its application within the overall generalization process. In this context, the newly acquired information was used to provide geometrical and procedural knowledge to guide the choice of the generalization solution [4], [5], [6]. With the same goal of adding auxiliary information to the already stored data, but unlike the existing stream of work, we put the emphasis on the descriptive data.
To enrich data stored in the GDB, we extract knowledge from an on-line corpus of textual documents [8], [9], [10].
This is accomplished by using text mining [11], [12], and more precisely multi-document summarization [13], [14], [15]. To obtain the complementary data in a reasonable time, we propose a distributed multi-document summarization approach. In conformity with the multi-agent paradigm [18], [19], we use a set of cooperating agents, namely: an Interface agent, Geographic agents and Task agents. These agents collaborate to lead the system jointly to an optimal solution: an optimal summary.
The approach we propose is modular. It consists of three stages: segment and theme identification, delegation and text filtering.
In this paper, we report mainly the implementation of our approach, which is presented in extensive detail in [8], [9], [10], and the integration of the resulting tool into an existing open GIS.
The remainder of this paper is organized as follows. In the second section, we provide an overview of our data enrichment process. In section 3, we describe our tool: we detail its major functionalities as well as their integration into an open GIS.
2  Our Data Enrichment Process
One of the characteristics of GIS applications is their heavy requirements in data availability. In order to be successful, a GIS must integrate a seemingly endless amount of spatially related information into a single, manageable system. This remains an issue.
In this context and in an attempt to provide timely, accurate and easily accessed information, we propose to enrich the geographic database (GDB) embedded into the GIS. Indeed, more timely and reliable information results in better decisions. Our intention is to support decision makers while making decisions that rely on the information given by a GIS, by providing other complementary data sources.
To enrich data stored in the GDB, we extract knowledge from an on-line corpus of textual documents. This is accomplished by using multi-document summarization [16], [17]. In fact, as the information overload problem grows, the user needs automated tools to retrieve the gist from the increasing amount of information. This is due to the difficulty of fully reading all retrieved documents when dealing with a large corpus.
To obtain the complementary data in a reasonable time, we propose a distributed multi-document summarization approach. This distribution is justified by the fact that summary generation over a set of documents can be seen as a naturally distributed problem. Hence, each document is considered as an agent, working towards a concise representation of its own document which will be included as part of the summary of the whole corpus.
The society of agents shelters three classes of agents: the Interface agent, the Geographic agents and the Task agents. Independently of its type, each agent has a simple structure: its acquaintances (the agents it knows), a local memory gathering its knowledge, and a mailbox storing the received messages that it will process later on. The interaction among agents is performed by means of message passing, in a peer-to-peer fashion.
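The agent structure described above can be illustrated with a minimal Java sketch. It is only an illustration under assumptions: the names SketchAgent, Message, send and processNext are hypothetical and do not come from the SDET implementation, and the "processing" of a message is reduced to storing its content in the local memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical message type: who sent it and what it carries.
final class Message {
    final String sender;
    final String content;

    Message(String sender, String content) {
        this.sender = sender;
        this.content = content;
    }
}

// Hypothetical agent skeleton with the three parts named in the text.
class SketchAgent {
    final String name;
    // Acquaintances: the agents this agent knows.
    final List<SketchAgent> acquaintances = new ArrayList<>();
    // Local memory: knowledge gathered so far, keyed by sender name.
    final Map<String, String> localMemory = new HashMap<>();
    // Mailbox: received messages waiting to be processed later.
    final BlockingQueue<Message> mailbox = new LinkedBlockingQueue<>();

    SketchAgent(String name) {
        this.name = name;
    }

    // Peer-to-peer message passing: put the message in the receiver's mailbox.
    void send(SketchAgent receiver, String content) {
        receiver.mailbox.add(new Message(name, content));
    }

    // Process one pending message, if any; returns false when the mailbox is empty.
    boolean processNext() {
        Message m = mailbox.poll();
        if (m == null) {
            return false;
        }
        localMemory.put(m.sender, m.content);
        return true;
    }
}
```

A thread-safe queue is used for the mailbox so that, in a multithreaded setting, several agents could post messages to the same receiver concurrently.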
Our approach [8], [9], [10] to data enrichment is a three-stage process consisting of: segment and theme identification, delegation and text filtering.
At the first stage, the documents of the corpus are processed in such a way that the portions of the texts that convey the same idea are identified. These portions, known as segments, are marked with their themes. The latter are the most frequent concepts. Thereafter, to each theme tackled throughout the corpus, we assign a delegate agent responsible for filtering the relevant segments. This delegation is carried out by minimizing the work overload and the communication overhead.
The ultimate stage of the overall knowledge extraction process is the text filtering. This filtering is achieved by performing a rhetorical structure analysis of the texts. It consists of eliminating the portions of text that are not essential for the comprehension of the main idea delivered by the text at hand. The retained text portions are included in the final summary.
To accomplish these different steps, an enrichment system was set up. In the subsequent sections, we detail this system.
3  The SDET Tool
To perform the data enrichment process outlined above, we have developed the SDET tool. SDET stands for Semantic Data Enrichment Tool.
In what follows, we first describe the main functionalities of our SDET tool. Second, we report the results of our implementation.
3.1  The Main Functionalities of SDET
The SDET tool was developed to enrich the descriptive data stored in the GDB. This tool consists of a set of functionalities. It deals with: first, the process launching; second, the identification of the segments and their relative themes; and third, the extraction of the gist of the text.
Process Launching. The overall data enrichment process is triggered whenever the GDB data returned in response to the GIS user query is not satisfactory enough (lack of data, …). In such a case, the system creates an Interface agent. The latter receives as input a corpus of documents relative to the geographic entity (or entities) at hand. This on-line corpus is the result of an information retrieval (IR) session. The retrieved web documents are then distributed by the Interface agent among the Task agents for content processing. These agents are created by the Interface agent.
Note that whenever the user is dealing with more than one geographic entity, the Interface agent creates a set of Geographic agents, each of which is responsible for one geographic entity. Hence, each Geographic agent governs a subset of Task agents responsible for processing the relative corpus.
The GIS user can also profit from the results of a previous information retrieval session that are already stored on the disk.
Segment and Theme Identification. At this stage, the web documents distributed among the Task agents are processed. The processing target is to delimit the portions of the texts that are thematically homogeneous and to annotate each of them with its most significant concept, which we call its theme.
To start this stage, each Task agent first cleans the web document in order to keep only the meaningful text. This is achieved by parsing the document with an HTML parser [22]. This parsing makes it possible to discard the formatting tags and other auxiliary information. Furthermore, a stop-list containing the common words (prepositions, articles…) is given as input to filter the parsed texts and to keep only the words having a semantic content.
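The cleaning step can be pictured with the following Java sketch. It rests on loud assumptions: a regular expression stands in for the HTML parser of [22], the stop-list is a tiny illustrative one, tokens are restricted to ASCII letters, and the names CleaningSketch, stripTags and contentWords are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class CleaningSketch {
    // Tiny illustrative stop-list (assumption); a real one would be much longer.
    static final Set<String> STOP_LIST = new HashSet<>(Arrays.asList(
            "a", "an", "the", "of", "in", "on", "is", "and", "to"));

    // Crude stand-in for an HTML parser: replace anything that looks like a tag with a space.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    // Lower-case the text, split on non-letters (ASCII only), and drop stop-list words.
    static List<String> contentWords(String html) {
        List<String> kept = new ArrayList<>();
        String text = stripTags(html).toLowerCase(Locale.ROOT);
        for (String token : text.split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_LIST.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

For instance, calling CleaningSketch.contentWords on a short HTML paragraph returns only its content-bearing words, in their original order, ready for stemming.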
Then, the resulting texts are stemmed to reduce words to a stem which is thought to be identical for all the words that are linguistically and often conceptually related. For example, the words learner, learning, and learned would all map to the common root learn. Stemming was performed using a modified version of the Porter stemmer [23]. After these pre-processing steps, the texts are segmented by adopting the TextTiling algorithm [24]. Hence, the text is seen as a sequence of portions (segments) that are topically homogeneous. The Task agents then annotate these segments with their most salient concepts.
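The intuition behind lexical segmentation can be sketched as follows. This is a deliberately simplified illustration and not the TextTiling algorithm of [24]: it compares adjacent blocks of stemmed words with a cosine similarity and places a boundary wherever the similarity falls below a fixed threshold, whereas TextTiling itself relies on smoothed depth scores. The class name TilingSketch, its method names and the threshold parameter are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TilingSketch {
    // Term-frequency vector of a block of words.
    static Map<String, Integer> counts(List<String> words) {
        Map<String, Integer> c = new HashMap<>();
        for (String w : words) {
            c.merge(w, 1, Integer::sum);
        }
        return c;
    }

    // Cosine similarity between the term-frequency vectors of two blocks.
    static double cosine(List<String> left, List<String> right) {
        Map<String, Integer> a = counts(left);
        Map<String, Integer> b = counts(right);
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    // Returns every index i such that a boundary falls between block i and block i + 1.
    static List<Integer> boundaries(List<List<String>> blocks, double threshold) {
        List<Integer> cuts = new ArrayList<>();
        for (int i = 0; i + 1 < blocks.size(); i++) {
            if (cosine(blocks.get(i), blocks.get(i + 1)) < threshold) {
                cuts.add(i);
            }
        }
        return cuts;
    }
}
```

Two neighbouring blocks that share most of their vocabulary stay in the same segment, while a block with disjoint vocabulary opens a new one.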
To achieve this purpose, each pre-processed document is passed to a tagging program [25]. This program assigns a part-of-speech tag such as noun, verb, pronoun, preposition, adverb, adjective or another lexical class marker to each word in a sentence. From all the part-of-speech tags, we keep only those identified as nouns, which are considered the most meaningful.
The nouns are used as input to search for the major concept of each segment. The concepts are identified using WordNet [26], and more precisely a Java-implemented version of WordNet: JWordNet [27]. The output is then a set of segments annotated with their most significant concepts.
The resulting segments are gathered by the Interface agent (in the case of one geographic entity) or by the concerned Geographic agent (in the case of a set of geographic entities) according to the similarity of their concepts. Therefore, for each concept, a generated document is created. The latter is a set of segments dealing with the same concept.
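The annotation and gathering steps can be pictured with the following Java sketch. It is a simplification under assumptions: the theme of a segment is taken to be its most frequent noun, without any lexical database lookup, and segments are grouped by exact theme equality rather than by concept similarity. The names ThemeSketch, theme and generatedDocuments are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ThemeSketch {
    // Theme of a segment: its most frequent noun; ties go to the alphabetically first noun.
    static String theme(List<String> nouns) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String n : nouns) {
            freq.merge(n, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best; // null for a segment without nouns
    }

    // Gather the segments that share a theme into one "generated document" per theme.
    static Map<String, List<List<String>>> generatedDocuments(List<List<String>> segments) {
        Map<String, List<List<String>>> byTheme = new TreeMap<>();
        for (List<String> segment : segments) {
            String t = theme(segment);
            if (t != null) {
                byTheme.computeIfAbsent(t, k -> new ArrayList<>()).add(segment);
            }
        }
        return byTheme;
    }
}
```

Each key of the resulting map corresponds to one theme, and its value plays the role of the generated document handed to a delegate.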
Afterwards, the Interface agent (or the concerned Geographic agent) assigns to each concept a Task agent as a delegate responsible for filtering the relative generated document by maintaining only the gist of the text.
Extraction of the Gist of the Text. At this stage of the enrichment process, the generated documents are under the jurisdiction of the Task agents known as delegates. The filtering is carried out by building the rhetorical structure trees [28], [29], [30] for the segments selected by the GIS user. The selected segments are split into lexical units (sentences or clauses). By using a database storing the cue phrases (markers like for example, but…), the delegate agent builds an XML file reflecting the rhetorical structure of the text. This XML file is used to facilitate and guide the building of the rhetorical tree. Hence, for each detected cue phrase having some entries in the database, the equivalent leaves and internal nodes are built. Whenever no such cues are detected, the tree building is performed by computing the similarity between the concerned units according to a given formula [9], [10]. According to the similarity value, we decide which unit to include in the summary.
By sweeping the resulting tree in a top-down fashion, we identify the units that are most important for the comprehension of the texts. For instance, if the detected cue phrase is “for example”, the unit to which the cue belongs can be discarded because it is just an exemplification supporting a previous unit.
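The effect of cue phrases on filtering can be illustrated with a flat Java sketch. It does not build a rhetorical tree and does not read an XML file or a cue-phrase database: it only drops the units that open with a cue assumed to introduce supporting material. The class CueFilterSketch, its methods and the cue list (in particular "for instance") are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class CueFilterSketch {
    // Hypothetical cue list: phrases assumed to introduce supporting, discardable units.
    static final List<String> SATELLITE_CUES = Arrays.asList("for example", "for instance");

    // True when the unit opens with one of the cues above (case-insensitive).
    static boolean isSatellite(String unit) {
        String normalized = unit.trim().toLowerCase(Locale.ROOT);
        for (String cue : SATELLITE_CUES) {
            if (normalized.startsWith(cue)) {
                return true;
            }
        }
        return false;
    }

    // Keep only the units that do not open with a satellite cue phrase.
    static List<String> gist(List<String> units) {
        List<String> kept = new ArrayList<>();
        for (String unit : units) {
            if (!isSatellite(unit)) {
                kept.add(unit);
            }
        }
        return kept;
    }
}
```

In a tree-based version, the same decision would be taken while visiting the nodes from the root downwards, so that a discarded node also removes the units below it.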
The units derived from each generated document form a summary of a given theme. Hence, each delegate generates a set of partial summaries to be included as a part of the final summary of the whole corpus of documents.
In the following section, we detail the implementation of our enrichment process. We detail the main functionalities and we show their integration within an open GIS.
3.2  Implementation
The above SDET functionalities have been implemented and integrated into an open GIS. The implementation was performed using the Java programming language. Indeed, to comply with our distributed architecture, we have used Java, which supports multithreaded programming. A multithreaded program contains two or more parts that can run concurrently.
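As an illustration of how multithreading fits a one-task-per-document organization, the following sketch runs a placeholder processing function over several documents on a thread pool from java.util.concurrent. It is not the SDET code: the class ConcurrencySketch, its methods and the word-counting placeholder are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ConcurrencySketch {
    // Placeholder for per-document processing: counts whitespace-separated words.
    static int processDocument(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Run one task per document on a fixed-size pool; results come back in input order.
    static List<Integer> processAll(List<String> documents, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String doc : documents) {
                tasks.add(() -> processDocument(doc));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a document task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because invokeAll returns its futures in the same order as the submitted tasks, each result can be matched back to its document without extra bookkeeping.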
Concerning the GIS platform, and based on a comparative study of the existing open GIS, we have opted for Open JUMP (Java Unified Mapping Platform) [7], [31]. This choice relies on the fact that the Open JUMP GIS provides a highly extensible framework for the development and execution of custom spatial data processing applications. It includes many functions common to other popular GIS products for the analysis and manipulation of geospatial data.
In what follows, we report some screen captures showing the major capabilities of our system.
The enrichment process is launched once a GIS user who is looking for information about the geographic entities is not satisfied. As shown in Fig. 1, the user first has to choose the geographic entity (or entities). In our case, we deal with a Shape File format relative to the geographic entities: the world map. The figure shows the results of querying the GDB about the entity Tunisia.
Fig. 1. The GDB enquiring relative to the geographic entity: country
Since the user is unsatisfied here, he resorts to enriching the results by adding complementary information. Fig. 2 shows the launching of the enrichment process. The user launches an IR session carried out by using the Google SOAP Search API [20], [21]. This search can be generic or can depend on topics specified by a set of keys. The user can also use a corpus resulting from a previous IR session.
The retrieved corpus is then processed by our society of agents.
Fig. 2. Launching of the data enrichment process
Because the architecture of our system is modular, we report the results of each step to maintain a certain interactivity with the GIS user.
