Graph analysis of animals' pedigrees

Title of the paper: Graphical analysis of breeding animals' pedigrees. This contribution examines the possibilities of graphical analysis and display of breeding animals' pedigrees. Depicting kinship as networks can contribute substantially to the understanding of existing kinship relations. Various methods for decomposing extensive pedigrees are presented, which allow a better and easier analysis, e.g. of horse genealogies. The analysis of pedigrees can support the overview of animal families and the estimation of inbreeding coefficients of breeding animals. Identifying families or breeding animals with high genetic weight is made easier. It is shown that the graphical representation of pedigree networks makes it possible to produce clear images of kinship and to recognise existing interrelationships and substructures.


Introduction
The use of visual images is common in many branches of science. Visualization provides powerful tools for the investigation of various relational structures, such as networks of computers, transportation, the Internet, communication, and intra- and inter-organizational networks. Applications arise in economics (project management, work-flow diagrams), computer science (flow graphs of programs, database modelling, algorithm animation), social science (social networks), natural science (large molecules and flow diagrams in chemistry, visualization of excavations in archaeology) and communication networks (links among servers, usage of the Internet, phone calls) (SNYDER and KICK, 1979; FREEMAN, 1988; BATAGELJ and MRVAR, 2000).

Two distinct forms of display can be used to construct images of networks, one based on points and lines (graphs) and the other on matrices. In a matrix display, the rows and columns both represent points, while numbers or symbols in the cells show the connections linking those points. Large networks (hundreds or thousands of points) based on matrix representation cannot be treated efficiently by standard analysis tools. Application of standard network analysis is therefore limited to networks of moderate size (tens or hundreds of points). The overwhelming majority of network images have involved the use of graphs. A graph is defined as a set of points and a set of lines that connect points, written mathematically as G=(V, E) or G(V, E). Depending on the branch of science, there are many synonyms for the term "point" (node, vertex, actor and junction) and for the term "line" (edge, link, branch, tie or arc) (HARARY, 1969; FOULDS, 1992).

In a genealogy or pedigree graph, the points represent persons or animals and the lines represent relationships among them. Graph representations of pedigrees are directed graphs (also known as digraphs) because the lines have directions. Animals' pedigrees are usually large networks, consisting of thousands of points and lines. Standard visualization approaches usually do not give satisfactory results in the case of large graphs. In the present paper, graph analysis has been implemented in an attempt to discern the internal structure(s) of a pedigree. Different methods of network decomposition are used to demonstrate how graph-theoretic analysis of pedigrees can contribute to: i) partitioning the relational structure of a pedigree, ii) estimating the shortest kinship paths among animals, iii) determining all predecessors and successors of selected animals and iv) estimating the inbreeding coefficient of selected individuals.
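As a minimal sketch of the graph view described above, the following Python fragment stores a small pedigree as a directed graph and collects all predecessors (ancestors) and successors (descendants) of a chosen animal by breadth-first search. The toy animals, the parent-to-offspring direction of the arcs and the helper names `build_adjacency` and `reachable` are assumptions made for illustration; they are not taken from the pedigree data or from the software used in this study.

```python
from collections import deque

# Hypothetical toy pedigree; arcs are assumed to point from parent to offspring.
ARCS = [
    ("grandsire", "sire"),
    ("granddam", "sire"),
    ("sire", "foal"),
    ("dam", "foal"),
]

def build_adjacency(arcs):
    """Return (children, parents) adjacency dictionaries for a list of arcs."""
    children, parents = {}, {}
    for parent, child in arcs:
        children.setdefault(parent, set()).add(child)
        children.setdefault(child, set())
        parents.setdefault(child, set()).add(parent)
        parents.setdefault(parent, set())
    return children, parents

def reachable(start, neighbours):
    """Breadth-first search over `neighbours`, returning every point reached.

    In an acyclic pedigree the starting point itself is never reached again,
    so it is not part of the result.
    """
    seen, queue = set(), deque([start])
    while queue:
        point = queue.popleft()
        for nxt in neighbours[point]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

children, parents = build_adjacency(ARCS)
successors_of_sire = reachable("sire", children)    # descendants of "sire"
predecessors_of_foal = reachable("foal", parents)   # ancestors of "foal"
```

Following the arcs forwards yields successors and following them backwards yields predecessors, so one traversal routine serves both purposes.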

Graph theory
Graph theory has a wide repertoire of terms. A good reference on the subject is the book of HARARY (1969). Only a brief description of the terms used in the present study follows here. A sub-graph of a graph G is a subset of its points together with all lines connecting members of the subset. A path is an alternating sequence of points and lines beginning at a point and ending at a point, which does not visit any point more than once. A cycle is a path except that it starts and ends at the same point. A bicomponent of a graph G is a sub-graph of G with three or more points in which any pair of points is connected by two independent paths, hence by a set of edges that form a cycle. The length of a path (or cycle) is defined as the number of lines in it. The shortest path between two points is called a geodesic. The graph-theoretic distance or geodesic distance between two points is defined as the length of the shortest path between them. The diameter of a connected graph is the largest geodesic distance. A graph is connected if there exists a path (of any length) from every point to every other. A connected component is a maximal sub-graph in which all points are reachable from every other. Maximal means that it is the largest possible sub-graph: no point could be added to the sub-graph without violating its property. For directed graphs, like pedigrees, there are strong and weak components. A strong component is a maximal sub-graph in which there is a path from every point to every point following all the lines in the direction they are pointing. A weak component is a maximal sub-graph which would be connected if the direction of the lines were ignored. The distance of a connected graph is the maximum distance between its points. The density of a graph is the number of actually existing lines as a proportion of the theoretically possible lines. In a directed graph, the density (d) is given by the formula
$$d = \frac{k}{n(n-1)}$$
where k and n are the number of lines and points in the graph, respectively. Abstraction is the (recursive) factorization of a large graph into several smaller graphs. One of the approaches to support abstraction is to find clusters, i.e. subsets of points, extract and show the points that belong to the same cluster, shrink the points in clusters and show the relations among clusters. A clique is a subset of the graph whose members are more closely and intensely tied to one another than they are to other members of the graph.

Network analysis

Decomposition
The main goal of network analysis is to discern the fundamental structure(s) of networks in a way that: i) allows the knowing of their structure and ii) facilitates the understanding of network phenomena. The most used tool is called blockmodelling, which groups the units of a network according to well-specified criteria. Blockmodelling seeks to cluster together units having substantially similar patterns of relationships within the network. A blockmodel consists of structures obtained by identifying all units from the same cluster of the clustering C. The graph version of the blockmodel is the reduced graph. Blockmodelling is an empirical procedure based on the idea that units in a network can be grouped according to the extent to which they are equivalent, according to some meaningful definition of equivalence. Two definitions of equivalence have been treated extensively in the last three decades: structural and regular equivalence. In structural equivalence (LORRAIN and WHITE, 1971), units are equivalent if they are connected to the rest of the network in identical ways. In regular equivalence, two units are regularly equivalent if they are equivalently connected to equivalent others, i.e. they have the same type of ties to equivalent others ([...], 1978). Further work dealt with automorphic equivalence (BORGATTI and EVERETT, 1992). Points are automorphically equivalent if the graph can be permuted in such a way that exchanging the two points has no effect on the distances among all points in the graph. Reduction is the recursive deletion of all points of the network that have only 0 or 1 neighbor points.
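The density and the reduction described above can be sketched in a few lines of Python. The density is taken here as d = k / (n(n-1)), which is an assumption about the formula, although it reproduces the densities reported in the Results for sub-graphs with 28 vertices and 32 arcs (4.23%) and with 20 vertices and 23 arcs (6.05%). The function names and the toy arc list are hypothetical, and neighbors are counted ignoring the direction of the lines, which is also an assumption.

```python
def density(n_points, n_lines):
    """Density of a directed graph without loops: d = k / (n (n - 1))."""
    if n_points < 2:
        return 0.0
    return n_lines / (n_points * (n_points - 1))

def reduce_graph(arcs):
    """Recursively delete points that have only 0 or 1 neighbor points.

    Direction is ignored when counting neighbors (an assumption).
    Returns the set of points that survive the reduction.
    """
    neighbours = {}
    for u, v in arcs:
        neighbours.setdefault(u, set())
        neighbours.setdefault(v, set())
        if u != v:  # self-loops do not create a neighbor
            neighbours[u].add(v)
            neighbours[v].add(u)
    changed = True
    while changed:
        changed = False
        for point in list(neighbours):
            if len(neighbours[point]) <= 1:
                # Remove the point and detach it from its remaining neighbor.
                for other in neighbours.pop(point):
                    neighbours[other].discard(point)
                changed = True
    return set(neighbours)

# Hypothetical toy graph: a closed loop a-b-d-c-a with two pendant points e and f.
TOY_ARCS = [("e", "a"), ("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "f")]
core = reduce_graph(TOY_ARCS)
```

On the toy graph the two pendant points are stripped and only the closed loop remains, which is the kind of structure that survives reduction in a pedigree with inbreeding.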

Connections
In a pedigree network, individuals may have many or few genetic ties. Furthermore, they may be 'sources' of genetic ties, 'sinks' (receive ties but do not send them) or both. The sum of connections from point v to others is called the out-degree of the point. The sum of connections to point v from other points is called the in-degree of the point. Animals with unusually high out-degree are influential or central animals in the pedigree; this is expressed by their degree of centrality. The number and kind of ties that individuals have are keys to determining their embeddedness in the pedigree. The local structure of ties can reveal the way the individuals are embedded in a pedigree. One of the most common approaches here has been to look at dyads, i.e. sets of two points, and triads, i.e. sets of three points (WASSERMAN and FAUST, 1988).
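The in-degree and out-degree described above can be counted directly from the list of arcs. The sketch below is only an illustration: the toy animals, the assumption that arcs point from parent to offspring, and the function names `degrees` and `sources_and_sinks` are hypothetical.

```python
from collections import Counter

# Hypothetical toy pedigree; arcs are assumed to point from parent to offspring.
ARCS = [
    ("sire", "foal1"),
    ("dam", "foal1"),
    ("sire", "foal2"),
    ("dam2", "foal2"),
    ("foal1", "grandfoal"),
]

def degrees(arcs):
    """Return (out_degree, in_degree) counters for a list of arcs."""
    out_degree, in_degree = Counter(), Counter()
    for source, target in arcs:
        out_degree[source] += 1   # tie sent by the source point
        in_degree[target] += 1    # tie received by the target point
    return out_degree, in_degree

def sources_and_sinks(arcs):
    """Sources only send ties; sinks only receive ties."""
    out_degree, in_degree = degrees(arcs)
    points = set(out_degree) | set(in_degree)
    sources = {p for p in points if in_degree[p] == 0}
    sinks = {p for p in points if out_degree[p] == 0}
    return sources, sinks

out_degree, in_degree = degrees(ARCS)
sources, sinks = sources_and_sinks(ARCS)
```

With parent-to-offspring arcs, founders appear as sources and animals without recorded offspring appear as sinks; an animal such as `foal1` that both receives and sends ties is neither.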

Index of connectedness (P)
$$P = \frac{k + m - n}{k + n - 2M}$$
where: n = number of vertices, m = number of lines, k = number of weakly connected components, M = number of maximal vertices (vertices having out-degree 0, $M \ge 1$), with $0 \le P \le 1$. A graph having only one vertex has P=0. The highest connectedness (P=1) occurs for graphs representing matings between full-sibs.

Data
Pedigree data were available on the Holsteiner stallion Libero through Data Horse Ltd. The data trace back 6 generations of the animal. They were retrieved via the Internet from the web site of the above company (http://www.horses.nl/klaus2/6generat_e.htm). The total number of animals in the pedigree was 122 (121 ancestors and the animal Libero). The total number of sires and dams was 58 and 63, respectively. There were 61, 30, 16, 8, 4, 2 and 1 animals in generations 1 to 7, respectively, the last generation being Libero himself. Two animals (Loretto and Fanal) were used in two subsequent generations, 1 and 2.
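Returning to the index of connectedness defined earlier in this section, a short Python check may help to fix the notation. It evaluates P = (k + m - n) / (k + n - 2M) for the two sub-graphs discussed in the Results. The values k = 1 (one weakly connected component) and M = 1 (one vertex with out-degree 0) are assumptions, since they are not stated in the text; they are chosen because they reproduce the reported values P=0.18519 and P=0.21053. The function name is hypothetical.

```python
def connectedness_index(n, m, k, M):
    """Index of connectedness P = (k + m - n) / (k + n - 2M).

    n: number of vertices, m: number of lines,
    k: number of weakly connected components,
    M: number of maximal vertices (out-degree 0).
    A graph with a single vertex gives 0/0 and is assigned P = 0.
    """
    denominator = k + n - 2 * M
    if denominator == 0:
        return 0.0
    return (k + m - n) / denominator

# Sub-graph with 28 vertices and 32 arcs; k = 1 and M = 1 are assumed.
p_first = connectedness_index(n=28, m=32, k=1, M=1)
# Sub-graph with 20 vertices and 23 arcs; k = 1 and M = 1 are assumed.
p_second = connectedness_index(n=20, m=23, k=1, M=1)
```

Under these assumptions the two values are 5/27 and 4/19, which round to the figures given in the Results.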

Pedigree analysis and display
Network analysis of the pedigree was performed with the computer program Pajek (BATAGELJ and MRVAR, 2000). Blockmodelling followed the algorithms described by DOREIAN et al. (1994). In most large networks the number of lines is of the same order as the number of vertices n. Such networks are considered sparse networks. The efficiency of an algorithm is described by its time complexity T(n) and space complexity S(n), where T(n) and S(n) are estimates of the time and memory space needed to run it on instances of size n, respectively. Given the capabilities of present-day computers, space complexity for storing sparse networks is no longer crucial. Having much faster computers, however, does not help much in the case of high-order time complexities. Most of the algorithms implemented in Pajek have subquadratic time complexities, O(n), O(n log n) or O(n√n), or are restricted to small sets of selected vertices. Graph representation of the whole pedigree as well as of its clusters was also performed with the Pajek program. Several standard algorithms for automatic graph drawing were used, such as spring embedders based on minimization of the total energy of the system (KAMADA and KAWAI, 1989) and drawing in layers. The inbreeding coefficient of animal X was calculated by applying the classical formula
$$F_X = \sum_{i=1}^{k} \left(\frac{1}{2}\right)^{n_1 + n_2 + 1} \left(1 + F_{CA}\right) \qquad (1)$$
where: CA = a common ancestor of the sire and dam of X, k = the number of common ancestors in X's pedigree, n_1 = the number of generations separating the common ancestor from the sire of X, n_2 = the number of generations separating the common ancestor from the dam of X, F_CA = the inbreeding coefficient of the common ancestor.
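The classical inbreeding formula, a sum over common ancestors of (1/2)^(n_1 + n_2 + 1) (1 + F_CA), can be evaluated with a very small Python function. The sketch below is illustrative only: the function name and the tuple layout are hypothetical. The example uses one common ancestor three generations away from each parent, as described in the Results for Ladykiller and Phalaris, and assumes F_CA = 0 for the common ancestor, which is not stated in the text but reproduces the reported value.

```python
def inbreeding_coefficient(common_ancestors):
    """Sum (1/2)**(n1 + n2 + 1) * (1 + F_CA) over the common ancestors.

    common_ancestors: iterable of (n1, n2, f_ca) tuples, where n1 and n2 are
    the generations separating the common ancestor from the sire and the dam
    of X, and f_ca is the inbreeding coefficient of that common ancestor.
    """
    return sum(0.5 ** (n1 + n2 + 1) * (1.0 + f_ca)
               for n1, n2, f_ca in common_ancestors)

# One common ancestor, three generations from each parent, F_CA = 0 (assumed).
f_single = inbreeding_coefficient([(3, 3, 0.0)])

# Offspring of a full-sib mating: two non-inbred common ancestors (the shared
# sire and dam of the parents), each one generation away from both parents.
f_full_sibs = inbreeding_coefficient([(1, 1, 0.0), (1, 1, 0.0)])
```

The first call gives (1/2)^7 = 0.0078125; the second gives the textbook value of 0.25 for the offspring of full-sibs.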

Results and Discussion
[...] The diameter of graph E is 6, estimated on the path that connects the [...]. Most of the triadic connections of graph E are of type 1 (Table 2), which denotes a loose graph. This result is in agreement with the density value estimated above. There are 14722, 6 and 63 relations of type 2, 3 and 4, respectively. Triads of type 3 denote a common parent whose offspring are half-sibs. Triads of type 4 denote matings between two animals, sire and dam, resulting in full-sibs. Those triads can be considered as representing families. Furthermore, there are 130 triadic relations that include members of three generations, e.g. grandparent sire, parent sire, sire.

There exist two bi-components in graph E (Fig. 3). The larger one (C1), shown to the right of the graph root, contains 20 animals, including Libero (code number 122); the figures before the animal names are the input code numbers of the animals. C1: {(51)Loretto, (26)Lorbeer, (74)Oina, (98)Fangball, (89)Oresta, (106)Lowevenjager, (115)Loni, (119)Gelonika, (107)Emmia, (90)Fant I, (58)Fanal, (111)Schneenelke, (120)Landgraf I, (117)Warthburg, (99)Blümchen, (86)Lohengrin, (105)Valet, (114)Komet, (121)Oktave, (122)Libero}. The other bi-component, C2, is smaller and contains 8 animals: C2: {(1)Phalaris, (62)Fairway, (92)Blue Peter, (108)Sailing Light, (116)Ladykiller, (109)Lone Beech, (94)Loaningdale, (66)Colorado}. The two bi-components were detected as p-cliques after blockmodelling of the original pedigree E. The respective cliques are displayed in sub-graph E1 (Fig. 4). In this reduced graph, the number of vertices and the number of arcs are 28 and 32, respectively, resulting in a density of 4.23% (Table 1). The index of connectedness of this sub-graph is P=0.18519. By definition, animals in a clique are more closely and intensely related to one another than they are to other members of the pedigree. Indeed, the sum of all triadic genetic relations in graph E1, expressed as a ratio of the total number of triads, has increased to 31.62% (Table 2).

Centers detected in sub-graph E1, with their respective degrees of centrality (c), are shown in Table 3. Animals with c higher than 1 are: Schneenelke (c=2.0), Landgraf I (c=2.5), Fangball (c=3.0), Loni (c=3.0), Ladykiller (c=3.0), Gelonika (c=3.0) and Loretto (c=4.5). Loretto has the highest degree of centrality; he has out- and in-degrees of 3 and 1, respectively. He is the most connected animal in sub-graph E1, implying a central point.

The left part of sub-graph E1 is a cycle starting from animal Ladykiller and ending at the same animal after visiting the animals Sailing Light, Blue Peter, Fairway, Phalaris, Colorado, Loaningdale and Lone Beech. Phalaris is the common ancestor (generation 1), followed by Colorado and Fairway (generation 2), Blue Peter and Loaningdale (generation 3), Sailing Light and Lone Beech (generation 4) and Ladykiller (generation 5). Finally, Ladykiller is the sire of Landgraf I, the sire of Libero. All animals in this clique (bi-component C2) are therefore direct or indirect descendants of animal Phalaris. The inbreeding coefficient (F_X) of animal Ladykiller can easily be obtained by applying formula 1, resulting in F_X=0.0078125.

The other clique (bi-component C1) is more complex and consists of 20 animals. This clique is shown in sub-graph E2 (Figure 5). Since animal Libero is in this partition, a further analysis is needed. The number of vertices and arcs in sub-graph E2 is 20 and 23, respectively. The density of this sub-graph is thus d=0.0605 (6.05%) (Table 1). The index of connectedness of this sub-graph is P=0.21053. It is thus a denser graph than E1. The lines without arrows from Loretto to Gelonika, Loni, Fangball and Schneenelke represent indirect connections between Loretto and other animals (note, for instance, the path: Loretto, Lorbeer, Oina, Fangball).

In this paper, a new perspective for analyzing animals' pedigrees, based on graph theory, is presented. Visual representation of pedigrees and the associated graph analysis have proved successful in detecting the internal structure of pedigrees of moderate size. Visualization of pedigrees has provided a useful tool for detecting animals' families as well as isolating single animals with major gene contributions to the

Fig. 2: Animals on all paths from animal Phalaris to animal Libero

Table 2
Type and number of triadic relations in the three graphs

Graph E of the pedigree of animal Libero in genealogy layers (the number of vertices and arcs in the graph was 122 and 126, respectively)

Table 3
Centers and degrees of centrality (c) in sub-graph E1

Sub-graph E2: representation of the clique containing animal Libero (lines without arrows represent indirect connections between animals)