Graph-based Data Mining on Social Networks


Maitrayee Mukherjee Department of Computer Science and Engineering University of Texas at Arlington
Box 19015, Arlington, TX 76019
mukherje@cse.uta.edu
Lawrence B. Holder Department of Computer Science and Engineering University of Texas at Arlington
Box 19015, Arlington, TX 76019
holder@cse.uta.edu
ABSTRACT
In this research, we compare and contrast the salient features of illicit group information with legitimate group data. We describe how the graph-based knowledge discovery system, SUBDUE, when run in unsupervised discovery mode, finds structural patterns embedded within social network data. We also illustrate how SUBDUE, in supervised mode, learns distinguishing patterns between legitimate and covert groups, based only on the communication activities of the group members.
Keywords
Graphs, Social Networks
1. INTRODUCTION
Data mining has emerged as a novel field of research and has valuable applications in the real world. In the early days, data was predominantly stored in the relational format, where each row represented an instance of a class or perhaps one transaction. However, the advent of data stored in HTML/XML texts, ordered/unordered trees, symbolic sequences etc. provided impetus to the study of data mining on semi-structured data [11]. Research on data mining and machine learning of symbolic sequences [10] and ordered tree structures [20] also gained momentum. Multi-relational data mining, whose main scope is to find patterns in expressive logical and relational languages from complex, multi-relational and structured data, has also picked up greatly [12].
The need for mining structured data was apparent to the research community, and one such approach focused on the topological view of data structures. Since the graph has a generic topological structure and is one of the most thoroughly researched data structures in Computer Science and Discrete Mathematics, state-of-the-art techniques in graph-based data mining (GDM) have had profound influence. GDM has tremendous utility because graph-structured data occur widely in practical fields like biology, chemistry, material science and communication networking [16]. Identifying illicit groups has become an important challenge for homeland security. Relationships (e.g., communications) among members of legitimate and illicit groups are another domain that can easily be represented as a graph. We accept the overhead of converting existing datasets to graphs and then do graph-based data mining on them to discover interesting and novel concepts. The world is drowning in data, and security analysts face the stiff challenge of sifting through a plethora of data and zeroing in on the few suspicious bits of information. Hence, effective techniques of social network analysis (SNA) are critical to alleviate the problem of information overload. This is the main motivation of this paper. The communication patterns that we are looking for in this research are not big graphs involving many actors, but small ones that discriminate well between threat and non-threat groups and give us a lead on which networks to put under scrutiny. These kinds of signature patterns do not require knowledge of the entire terrorist network; hence the patterns that we found (described later) may be tested on data with varied levels of "observability". In the absence of real data, we have trained on well-researched, simulated data, with appropriate levels of observability, corruption and noise.
We focus on applying GDM to SNA, where participating actors are represented as vertices and communication links between actors as edges (see Figure 8). We brief the reader on social networks, SNA, graph-based representational techniques, and how social networks of legitimate groups differ from those of illicit ones. We review the theoretical bases of GDM: a paradigm of which the graph-based knowledge discovery system, SUBDUE [2, 3], is an implementation. We discuss two of the key data mining techniques implemented in SUBDUE: unsupervised pattern discovery and supervised concept learning. We illustrate, with examples and experiments, how we identify patterns within social network data represented as labeled graphs with the help of SUBDUE. We bring forth the concepts that SUBDUE learns to discriminate legitimate groups from illicit ones, based only on the communication patterns of the group members. Finally, we conclude and discuss avenues for future research.
2. SOCIAL NETWORKS
2.1 Social Network Data
Social network data focuses on the representation of actors and the relationships (or ties) between them. This kind of data, instead of revolving around a single actor, highlights the dynamics between groups of actors. It differs from conventional data, in databases and spreadsheets, which focuses primarily on individual actors and their attributes [14, 17].
2.2 Social Network Analysis (SNA)
SNA is the mapping and measuring of relationships and flows between people, groups, organizations or other information/knowledge processing entities. SNA uses attributed relational graphs (ARGs), where vertices represent people, organizations or any other object, edges represent relationships between objects, and attributes store the details of each vertex and edge [14, 17].
2.3 Representing Social Networks
Formal methods are necessary to represent social networks because of their ability to depict huge datasets succinctly and systematically. Graphs and matrices are the most popular techniques because they lay down their own protocols, which make depiction easy and help us discern patterns in our data that would not have been possible to describe with words. They also allow the dull and repetitive workload to be shared by computers.
Both graphs and matrices have their own pros and cons. For this research, we have chosen graphs to represent our social networks for their ease of visualization, and also because a graph can easily be represented as an adjacency matrix if required. Sociologists borrowed relevant graphical concepts from mathematicians and coined the resulting graphics "sociograms" [5].
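To make the graph-to-matrix point concrete, here is a minimal Python sketch that stores a directed sociogram as an edge list and derives the corresponding adjacency matrix; the actor names and ties are invented for illustration.

```python
# Minimal sketch: a directed sociogram as an edge list, converted to an
# adjacency matrix. Actor names and ties are hypothetical.
actors = ["Ann", "Bob", "Carol", "Dave"]
ties = [("Ann", "Bob"), ("Bob", "Carol"), ("Carol", "Ann"), ("Ann", "Dave")]

index = {name: i for i, name in enumerate(actors)}
n = len(actors)

# matrix[i][j] == 1 iff actor i has a tie to actor j.
matrix = [[0] * n for _ in range(n)]
for src, dst in ties:
    matrix[index[src]][index[dst]] = 1

for name, row in zip(actors, matrix):
    print(name, row)
```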
Sociograms may be classified based on the different levels of measurement of the ties [5], as described below:
A binary sociogram is one where the only thing that matters is whether a tie is present. An arrow represents the existence of a tie, whereas its absence signifies no relationship whatsoever.
A signed sociogram uses signs on its ties; a plus (+) on the arrow indicates a positive choice, whereas a minus (-) indicates its opposite.
A valued sociogram takes this concept a bit further by allowing us to put a measure of the strength of the relationship on the arrow.
Sociograms may again be classified based on the kinds of ties between actors [5], as described below:
A simplex sociogram is one which depicts a single kind of relation between actors.
In a multiplex sociogram, there are multiple kinds of ties between actors.
In our research, we have opted for binary, simplex, "labeled" sociograms to depict communication patterns between actors. Figure 8 is an apt example.
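As an illustrative sketch, such a labeled sociogram can be written down directly as a labeled digraph, with every vertex labeled "Actor" and every tie labeled "Communication" (the same labeling used later in Figure 6); the actor identifiers below are invented.

```python
import networkx as nx

# Binary, simplex, labeled sociogram: a single kind of tie ("Communication"),
# recorded only as present or absent, with every vertex labeled "Actor".
sociogram = nx.DiGraph()
for actor in range(1, 6):                      # hypothetical actor ids
    sociogram.add_node(actor, label="Actor")
for src, dst in [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)]:
    sociogram.add_edge(src, dst, label="Communication")

print(sociogram.number_of_nodes(), sociogram.number_of_edges())  # 5 5
```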
3. CONTRASTING THREAT GROUP DATA WITH LEGITIMATE NETWORK DATA
Covert networks remain mingled with socially-oriented networks (like families, organizations etc.) in the real world. The watchword for covert networks is "secrecy", and hence discovering such networks (technically, discerning distinctive patterns in the activities and communications of such illegitimate groups) can be very tricky and often misleading, due to the unavailability of authentic data or, in some cases, the availability of "doctored" data. This issue has gained prominence in the recent past and, after the September 11, 2001 tragedy, has been in the limelight so much that it is worthwhile to take a close look at the distinguishing properties of such networks.
(1)  Social network graphs depicting covert threat groups may be incomplete due to missing actors (vertices) and links (edges) that investigators may fail to uncover [8].
(2) Deciding whom to include and whom to exclude is not a problem in legitimate networks like families and organizations, but in terrorist networks it can be, due to the secrecy maintained by the entire network of people. This is referred to as the "fuzzy boundary" problem [8].
(3) In legitimate networks, actors who are highly central are typically the most important ones. On the contrary, peripheral players (or "boundary spanners", as they are typically called) may be huge resources to a terrorist group although they receive very low network centrality scores. This is because they are well-positioned to be innovators, since they have access to ideas and information flowing in other clusters. Similarly, in an organization, the peripheral employees are in a position to combine different ideas and knowledge into new products and services. They may be contractors or vendors who have their own network outside of the company, making them very important resources for fresh information not available inside the company [5, 8].
(4) Terrorist networks are not static: new members are added to fortify the group, and members are removed / killed / captured / compromised, which shrinks the group size [8]. Most organizational networks are also dynamic, but this is hardly the case with families.
(5) Covert networks trade efficiency for secrecy. A strategy of keeping cell members distant from each other, and from other cells, minimizes damage to the network if a cell member is captured or otherwise compromised [8]. Hence the shortest path in the social network graph is not usually the path taken for communication.
(6) Relationships between members belonging to a terrorist network and those not belonging to it are rare and infrequent. Terrorists seldom make friends outside the trusted circle, because eliminating boundary-spanning ties reduces visibility into the network and the chances of leaks out of the network [8].
In the transcript (U.S. Department of Defense, 2001), Osama bin Laden's comment [8] about the September 11, 2001 attack is:
"Those who were trained to fly didn't know the others. One group of people did not know the other group."
Strong ties between previous contacts, which were formed years ago in school, college dormitories and training camps, keep the members linked. Yet, unlike normal social networks, these strong ties remain mostly dormant and therefore hidden to outsiders. In a covert network, because of their low frequency of activation, strong ties may appear to be weak ties. The less active the network, the more difficult it is to discover [8]. Legitimate networks, on the other hand, have a lot of incoming and outgoing ties, as one might expect.
(7) Despite the need for secrecy, covert networks have goals to accomplish. Cell members must optimize between stealth and the need for intense task-based communication. It is during the periods of heightened activity that the networks may shorten and become susceptible to discovery [8].
(8) More often than not, only one of the terrorists (not always the group leader) speaks for the whole group. Choosing somebody other than the leader to be the speaker makes sense in case the leader is exposed. So we may find a multicast/broadcast from one particular person to the entire group, or perhaps to terrorists who spearhead other groups. These multicasts are often noticeable during terrorist activities and make the group vulnerable to discovery.
(9) The role of a "broker" [8] is a very powerful one in a social network, as it ties two hitherto unconnected constituencies/groups together, but of course it is a single point of failure. Broker-type roles are often seen in terrorist networks. Such nodes are also referred to as "cutpoints" [5].
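The cutpoint idea can be made concrete with a standard graph-library call. The sketch below applies NetworkX's articulation_points to a toy two-cell network; the node names are invented and not drawn from any real dataset.

```python
import networkx as nx

# Two tightly knit cells joined only through a single "broker" actor.
G = nx.Graph()
G.add_edges_from([("a1", "a2"), ("a2", "a3"), ("a1", "a3"),   # cell A
                  ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),   # cell B
                  ("a1", "broker"), ("broker", "b1")])        # broker ties

# Articulation points ("cutpoints"): removing any one of them disconnects the graph.
print(sorted(nx.articulation_points(G)))   # ['a1', 'b1', 'broker']
```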
We are aware of graph-based technologies being used for intelligence analysis by a research team at 21st Century Technologies in Austin, TX [1]. Graph-based algorithms enable security analysts to extract, from a deluge of information, a small subset that has suspicious characteristics. In this research, intelligence analysis relies on the fact that legitimate and illegitimate groups tend to exhibit different SNA metric values like geodesics, redundancy, small-world phenomena, betweenness centrality, closeness centrality, clustering coefficient, node degree, diameter, girth etc. These metrics, combined with statistical pattern classification, arm the analyst with a tool for automatically pinpointing threatening group activities within a huge body of evidence.
Our approach is different in that we believe legitimate groups have different communication styles from illegitimate groups. Hence our research is based solely on the communication evidence of group members of social networks, without considering the network metric values and statistical patterns. We train on well-investigated, simulated intelligence analysis data to find discriminating patterns between threat and non-threat groups.
4. GRAPH-BASED DATA MINING (GDM)
4.1 Theoretical Bases of GDM
We briefly review the five theoretical bases of GDM [16].
(1) Sub-graph Categories: Sub-graphs are categorized into various classes (namely general sub-graphs, induced sub-graphs, connected sub-graphs, ordered trees, unordered trees and paths), and the approaches of graph-based data mining strongly depend on the targeted class.
(2) Sub-graph Isomorphism: Sub-graph isomorphism is the mathematical basis of substructure matching and/or counting.
(3) Graph Invariants: Graph invariants are quantities (like the number of vertices, the degree of each vertex and the number of cyclic loops) that characterize the topological structure of a graph, and they help to efficiently reduce the search space of the targeted graph structures. If two graphs are topologically identical, i.e., isomorphic, they also have identical graph invariants, though the reverse property does not hold (see the short sketch after this list).
(4) Mining Measures: There are various measures, similar to those in conventional data mining, to mine substructures in the graph, whose selection depends on the objective and the constraints of the mining approach. Some popular mining measures are support, information entropy, information gain, gini-index and minimum description length (MDL).
(5) Solution Methods: Approximately five types of search methods are used to solve the sub-graph isomorphism problem amidst a number of graphs. They are categorized into (1) heuristic search methods and (2) complete search methods, in terms of the completeness of the search. They are also classified under (1) direct and (2) indirect matching methods, in terms of the sub-graph isomorphism matching problem. The five types of search methods are: (1) conventional greedy search [3, 19], (2) inductive logic programming [13], (3) inductive database [6], (4) complete level-wise search and (5) support vector machine (SVM) [15].
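As a small illustration of the pruning role of invariants (our own sketch, not tied to any particular system), two graphs can be isomorphic only if cheap invariants such as vertex count, edge count and the sorted degree sequence agree, so a mismatch lets us skip the expensive isomorphism test:

```python
import networkx as nx

def invariants_agree(g1, g2):
    """Cheap necessary conditions for isomorphism; any mismatch rules it out
    without running a costly (sub-)graph isomorphism test."""
    return (g1.number_of_nodes() == g2.number_of_nodes()
            and g1.number_of_edges() == g2.number_of_edges()
            and sorted(d for _, d in g1.degree()) == sorted(d for _, d in g2.degree()))

g1 = nx.cycle_graph(4)           # 4-cycle: degrees [2, 2, 2, 2]
g2 = nx.path_graph(4)            # path:    degrees [1, 1, 2, 2]
print(invariants_agree(g1, g2))  # False, so no isomorphism test is needed
```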
4.2 Why We Chose SUBDUE
Several approaches to GDM exist based on the task of identifying frequently occurring sub-graphs in graph transactions, i.e., those sub-graphs meeting a minimum level of support.
The FSG system (part of the PAFI system, University of Minnesota) [9] finds all frequent sub-graphs in large graph databases. FSG starts by finding all frequent single- and double-edge sub-graphs. Then, in each iteration, it generates candidate sub-graphs by expanding the sub-graphs found in the previous iteration by one edge. In each iteration the algorithm checks how many times the candidate sub-graph occurs within an entire graph. Candidates whose frequency is below a user-defined level are pruned. The algorithm returns all sub-graphs occurring more frequently than the given level.
gSpan (University of Illinois at Urbana-Champaign) [18] combines depth-first search and lexicographic ordering to find frequent sub-graphs. Their algorithm starts from all frequent one-edge graphs. The labels on the edges, together with labels on incident vertices, define a code for every such graph. Expansion of the one-edge graphs maps them to longer codes. Since every graph can map to many codes, the codes in the tree structure are not unique. If there are two codes in the code tree that map to the same graph, the branch with the non-minimum (larger) code is pruned during the depth-first search traversal of the code tree. Only the minimum code uniquely defines the graph. Code ordering and pruning reduce the cost of matching frequent sub-graphs in gSpan.
The Apriori-based Graph Mining (AGM) system [7] searches the space of frequent sub-graphs in a bottom-up fashion, beginning with a single vertex and then continually expanding by a single vertex and one or more edges. AGM also employs a canonical coding of graphs in order to support fast sub-graph matching. AGM returns association rules satisfying user-specified levels of support and confidence.
Few graph-based relational learning (GBRL) approaches have been developed to date. Two specific tools, SUBDUE and GBI [19], take a greedy approach to finding sub-graphs that maximize an information-theoretic measure. SUBDUE searches the space of sub-graphs by extending candidate sub-graphs by one edge. Each candidate is evaluated using a minimum description length metric (discussed later), which measures how well the sub-graph compresses the input graph if each instance of the sub-graph were replaced by a single vertex. GBI continually compresses the input graph by identifying frequent triples of vertices, some of which may represent previously compressed portions of the input graph. Candidate triples are evaluated using a measure similar to information gain.
To summarize, we chose SUBDUE mainly for two reasons. First, all three approaches just mentioned, FSG, gSpan and AGM, find frequent substructures in a graph data set. SUBDUE, on the other hand, finds sub-graphs that compress the input data set well but which may not be frequent. We believe sub-graphs that compress well may reveal novel concepts in a database, whereas frequent sub-graphs may not always be interesting.
Second, all three systems are based on graph transactions. SUBDUE does support graph transactions but, on top of that, can also mine one large graph for interesting patterns.
4.3 Brief Introduction to SUBDUE (ailab.uta.edu/subdue)
The knowledge discovery system SUBDUE [2] has been a pioneering work in the field of greedy search-based graph mining (Point 5, Solution Methods, in Section 4.1). Structural data, represented as a labeled graph, serves as the input to SUBDUE, which writes out substructures, again as labeled graphs. A substructure is a connected sub-graph within the input graph (Point 1, Sub-graph Categories, in Section 4.1). A found sub-graph can be considered a concept. An instance of a substructure in an input graph is a set of vertices and edges from the input graph which match, graph-theoretically, the graphical representation of the substructure. The algorithm is based on a computationally-constrained beam search.
For each unique vertex label, a substructure is defined as a vertex with that label, whose instances are all the vertices in the input graph with that label. A substructure is extended in all possible ways by a single edge and a vertex, or by only a single edge if both vertices are already in the sub-graph. At each expansion, candidate substructures are rated according to one of the three evaluation techniques given below. The substructures are kept on a queue and are ordered based on their values, defined below. The search terminates upon reaching a user-specified limit on the number of substructures extended, or upon exhaustion of the search space. SUBDUE has been applied successfully to databases in domains like image analysis, CAD circuit analysis, Chinese character databases, program source code, chemical reaction chains etc.
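To make the expansion and evaluation step just described concrete, the sketch below enumerates the one-edge substructures of a tiny labeled graph and scores each with a crude size-based compression estimate. This is our own simplification for illustration only, not SUBDUE's code or its MDL computation, and the graph is invented.

```python
from collections import Counter
import networkx as nx

# Tiny labeled input graph: vertex attribute "label", edge attribute "label".
G = nx.DiGraph()
people = {1: "Manager", 2: "Developer", 3: "Secretary",
          4: "Manager", 5: "Developer", 6: "Secretary"}
for v, lbl in people.items():
    G.add_node(v, label=lbl)
for src, dst in [(1, 2), (1, 3), (4, 5), (4, 6)]:
    G.add_edge(src, dst, label="supervises")

# First level of the search: every one-edge substructure, described by
# (source label, edge label, target label), with its number of instances.
patterns = Counter(
    (G.nodes[u]["label"], d["label"], G.nodes[v]["label"])
    for u, v, d in G.edges(data=True)
)

def size_value(graph, pattern_size, instances):
    """Rough size-based estimate in the spirit of
    value(S, G) = size(G) / (size(S) + size(G|S)): replacing each instance
    by a single vertex saves roughly (pattern_size - 1) units."""
    size_g = graph.number_of_nodes() + graph.number_of_edges()
    size_g_given_s = size_g - instances * (pattern_size - 1)
    return size_g / (pattern_size + size_g_given_s)

for pattern, count in patterns.items():
    print(pattern, count, round(size_value(G, pattern_size=3, instances=count), 3))
```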
The value of a substructure S in an input graph G, denoted by value(S, G), may be calculated according to any one of the following three evaluation techniques:
(1) Minimum Description Length (MDL)
This is the default evaluation method used by SUBDUE (see Point 4, Mining Measures, in Section 4.1). The MDL principle states that the best theory to describe a dataset is the one that minimizes the description length of the entire set of data. SUBDUE implements this principle in the context of compressing the input graph with the discovered substructure. Once the search terminates and SUBDUE returns the list of best substructures found, the graph can be compressed using the best substructure. The compression procedure replaces all instances of the substructure in the input graph by single vertices, which represent the substructure definition. Incoming and outgoing edges to and from the replaced instances will point to, or originate in, the new vertex that represents the instance. The SUBDUE algorithm may be invoked again (termed iteration) on this compressed graph.
Here, value(S, G) = DL(G) / (DL(S) + DL(G|S)), where DL(G) is the number of bits (description length) required to encode the input graph G, DL(S) is the number of bits required to encode the discovered substructure S, and DL(G|S) is the number of bits required to encode the input graph G after it has been compressed using substructure S.
In supervised concept learning (see Section 4.5), value(S, Gp, Gn) = [DL(Gp) + DL(Gn)] / [DL(S) + DL(Gp|S) + DL(Gn) - DL(Gn|S)], where Gp and Gn are the positive and negative graphs respectively. A small numeric sketch of these value computations is given after the three evaluation techniques.
(2) Size
The size measure is faster to compute than the MDL measure but is less consistent.
Here, value(S, G) = size(G) / (size(S) + size(G|S)), where size(G) = #vertices(G) + #edges(G), and G|S is G compressed with S.
In supervised concept learning (see Section 4.5), value(S, Gp, Gn) = [size(Gp) + size(Gn)] / [size(S) + size(Gp|S) + size(Gn) - size(Gn|S)].
Figure 1. An organization chart showing the company hierarchy (produced using AT&T Graphviz).
(3) Set Cover
The value of a substructure S is computed as the number of positive examples containing S plus the number of negative examples not containing S, divided by the total number of examples.
If this evaluation method is chosen, then the compression done at the end of each iteration in the MDL approach is replaced by simply removing all positive examples containing S. The SUBDUE algorithm may be invoked again (i.e., iterated) on this reduced graph.
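For concreteness, here is a minimal numeric sketch of these value computations; all description lengths and counts below are made up purely for illustration.

```python
def mdl_value(dl_g, dl_s, dl_g_given_s):
    """value(S, G) = DL(G) / (DL(S) + DL(G|S)); values above 1 mean S compresses G."""
    return dl_g / (dl_s + dl_g_given_s)

def mdl_value_supervised(dl_gp, dl_gn, dl_s, dl_gp_given_s, dl_gn_given_s):
    """value(S, Gp, Gn) = [DL(Gp) + DL(Gn)] / [DL(S) + DL(Gp|S) + DL(Gn) - DL(Gn|S)]:
    rewards compressing the positive graph, penalizes compressing the negative one."""
    return (dl_gp + dl_gn) / (dl_s + dl_gp_given_s + dl_gn - dl_gn_given_s)

def set_cover_value(pos_with_s, neg_without_s, total_examples):
    """(#positives containing S + #negatives not containing S) / #examples."""
    return (pos_with_s + neg_without_s) / total_examples

# Hypothetical numbers (bits and example counts):
print(mdl_value(dl_g=1000, dl_s=60, dl_g_given_s=700))                    # ~1.32
print(mdl_value_supervised(800, 750, 60, 520, 740))                       # ~2.63
print(set_cover_value(pos_with_s=9, neg_without_s=8, total_examples=20))  # 0.85
```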
In our experiments, we have mostly relied on the MDL and set cover approaches. The size measure has not been used in our research. Among the key features implemented in SUBDUE, we apply the unsupervised pattern discovery and supervised concept learning techniques in this research.
4.4 Unsupervised Pattern Discovery
Knowledge is discovered in structural data by identifying common substructures (concepts represented as graphs) within the data (see Section 4.3) [2].
Figure 1 shows the hierarchy of an organization, which, after being converted to a labeled graph, is fed to SUBDUE. Figure 2 shows the best substructure reported by SUBDUE in unsupervised discovery mode using the MDL evaluation technique discussed in Section 4.3. This substructure is reported to have three instances in the input graph, which implies that every manager in the organization shown in Figure 1 supervises at least one developer and one secretary.
In the input graph in Figure 1, we see that a manager in this company may also supervise a designer or a tester, but since this is not a pattern reported by SUBDUE, we suspect that some developers in this company double as designers or testers according to managerial wishes.
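To show what "three instances" means in graph terms, the snippet below counts occurrences of the Figure 2 pattern in a hypothetical fragment of such an org chart, using NetworkX's label-aware sub-graph isomorphism matcher; the staff graph is invented and is not the actual Figure 1 data.

```python
import networkx as nx
from networkx.algorithms import isomorphism

# Hypothetical org-chart fragment as a labeled digraph.
G = nx.DiGraph()
staff = {"m1": "Manager", "m2": "Manager", "m3": "Manager",
         "d1": "Developer", "d2": "Developer", "d3": "Developer",
         "s1": "Secretary", "s2": "Secretary", "s3": "Secretary",
         "t1": "Tester"}
for node, lbl in staff.items():
    G.add_node(node, label=lbl)
G.add_edges_from([("m1", "d1"), ("m1", "s1"),
                  ("m2", "d2"), ("m2", "s2"),
                  ("m3", "d3"), ("m3", "s3"), ("m3", "t1")])

# The Figure 2 pattern: a Manager supervising both a Developer and a Secretary.
pattern = nx.DiGraph()
pattern.add_node("M", label="Manager")
pattern.add_node("D", label="Developer")
pattern.add_node("S", label="Secretary")
pattern.add_edges_from([("M", "D"), ("M", "S")])

matcher = isomorphism.DiGraphMatcher(
    G, pattern, node_match=isomorphism.categorical_node_match("label", None))
print(sum(1 for _ in matcher.subgraph_isomorphisms_iter()))   # 3 instances
```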
4.5 Supervised Concept Learning
SUBDUE has been extended to incorporate supervised graph-based concept learning [2, 4], which focuses on the two-class scenario. The inclusion of a negative graph enables SUBDUE to learn by example rather than by observation. Substructures that occur often in the positive graph, but not often in the negative graph, are more likely to represent the target concept.
The negative information may come in two forms. First, the data may be in the form of small graphs, or graph transactions, each labeled either positive or negative. Second, the data may be composed of two large graphs: one positive and one negative.
The first scenario is closest to the standard supervised learning problem in that we have a set of clearly defined examples. Figure 3 shows a set of positive examples, denoted by G+, and Figure 4 shows a set of negative examples, denoted by G-.
One approach to supervised learning, namely the set-covering approach, is to find a sub-graph that appears often in the positive graphs, but not in the negative graphs. This amounts to replacing the information-theoretic measure with a simple error-based measure. This approach will lead the search towards a small sub-graph that discriminates well. However, such a sub-graph does not necessarily compress well, nor represent a characteristic description of the target concept. We can bias the search towards a more characteristic description by using the information-theoretic measure to look for a sub-graph that compresses the positive examples, but not the negative examples. This is the MDL approach and will lead the search towards a larger sub-graph that characterizes the positive examples, but not the negative examples.
Figure 2. Best substructure reported by SUBDUE in Unsupervised Discovery Mode.
Figure 3. Set of positive examples, denoted by G+, for supervised learning.
Figure 4. Set of negative examples, denoted by G-, for supervised learning.
In our example, the best substructure reported by SUBDUE using both approaches happens to be the same and is shown in Figure 5. This "Manager supervises Developer" pattern is present in all the positive examples in Figure 3 but not once in the set of negative examples in Figure 4. This sub-graph both discriminates best and compresses best.
Finally, this process can be iterated in a set-covering approach to learn a disjunctive hypothesis. If using the error measure, then any positive example containing the learned sub-graph would be removed from subsequent iterations. If using the information-theoretic measure, then instances of the learned sub-graph in both the positive and negative examples (even multiple instances per example) are compressed to a single vertex. Please see [4] for more information on graph-based supervised learning.
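The covering loop under the error measure can be sketched as follows. Each example is abstracted here to the set of candidate sub-graph identifiers it contains; the identifiers, examples and scoring are made up for illustration, and this is not SUBDUE's code.

```python
def learn_disjunction(positives, negatives, candidates):
    """Greedy set-covering sketch: repeatedly pick the candidate pattern with the
    best error-based score, then drop the positive examples it covers."""
    hypothesis = []
    remaining = list(positives)
    while remaining:
        def score(c):
            true_pos = sum(c in ex for ex in remaining)
            false_pos = sum(c in ex for ex in negatives)
            return true_pos - false_pos
        best = max(candidates, key=score)
        if score(best) <= 0:
            break                               # nothing discriminates any more
        hypothesis.append(best)
        remaining = [ex for ex in remaining if best not in ex]
    return hypothesis

# Hypothetical transactions: each example is the set of patterns it contains.
positives = [{"mgr->dev", "dev->test"}, {"mgr->dev"}, {"mgr->sec", "broker"}]
negatives = [{"dev->test"}, {"broker"}]
print(learn_disjunction(positives, negatives,
                        ["mgr->dev", "mgr->sec", "broker", "dev->test"]))
# ['mgr->dev', 'mgr->sec']
```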
5. LEARNING STRUCTURAL PATTERNS EMBEDDED IN SOCIAL NETWORKS
5.1 The Concept of a Clique
Structural analysts are very interested in substructures, like dyads, triads and ego-centered groups, which may be present in the network. The solidarity and connections of large social structures can be built up out of such small, tight components in a bottom-up approach.
A clique is a maximal complete sub-graph, expanded to include as many actors as possible, in which every actor has a direct tie with every other member [5].
5.2 Relaxations of the Clique Definition
N-clique – Since the strict definition of a clique may be too strong to capture the meaning of the concept, it may be relaxed a bit to count an actor as a member of a clique if he/she is connected to every other member of the group, possibly at a distance greater than one. Usually a path distance of two is used. This corresponds to the "friend of a friend (FoaF)" concept. This approach to defining substructures is called the N-clique, where N stands for the length of the path allowed to make a connection to all other members. The problem with the N-clique approach is that it tends to find long and stringy groupings rather than the tight and discrete ones of the maximal approach. Sometimes, N-cliques can be found with a property that is undesirable for many purposes: the members of the N-clique are connected by actors who are not, themselves, members of the clique [5].
N-clan – An N-clan is a restricted form of N-clique which forces all ties between members of an N-clique to occur by way of other members of the N-clique. This sometimes has an important bearing on sociological data [5].
K-plexes – The strong assumptions of a “maximal complete sub-graph” may also be relaxed by allowing actors to be members of cliques even if they have ties to all but K other members [5].
K-cores – A K-core is a maximal group of actors, all of whom are connected to at least K other members of the group. Recall that to be included in a K-plex, an actor must be tied to all but K other actors in the group. The K-core approach is more relaxed, allowing actors to join the group if they are connected to at least K members, regardless of how many other members they may not be connected to.
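To pin the K-plex definition down, the sketch below uses the standard formal reading (every member of a group S is adjacent to at least |S| − K other members of S); the toy communication graph is invented.

```python
import networkx as nx

def is_k_plex(G, members, k):
    """True if `members` induces a k-plex in G: every member is tied to at
    least (len(members) - k) of the members (a 1-plex is a clique)."""
    sub = G.subgraph(members)
    need = len(members) - k
    return all(deg >= need for _, deg in sub.degree())

# Toy group of four actors with one missing tie (1-4).
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)])
print(is_k_plex(G, {1, 2, 3, 4}, k=2))   # True: each actor misses at most one tie
print(is_k_plex(G, {1, 2, 3, 4}, k=1))   # False: actors 1 and 4 are not tied
```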
5.3 Unsupervised Discovery of K-plexes
We concentrate on K-plexes for our experiments, with the belief that if SUBDUE can successfully find K-plex structures, it can be used to mine for other such salient patterns embedded in social network data.
We introduce a couple of different classes of K-plex graphs:
1) KPlex_Inexact – A class of K-plex graphs where each instance of the K-plex does not have the same structural pattern embedded.
Figure 5. Best substructure learned by SUBDUE in Supervised Concept Learning mode.
Figure 6. A KPlex_Inexact communication graph showing two instances of a 2-plex (Instance 1 on top, Instance 2 at bottom) (Vertex label: "Actor"; Edge label: "Communication").
