Protein structural annotation and classification is an important problem in bioinformatics. We report on the development of an efficient subgraph mining technique and its application to finding characteristic substructural patterns within protein structural families. In our method, protein structures are represented by graphs where the nodes are residues and the edges connect residues found within a certain distance of each other. Application of subgraph mining to proteins is challenging for a number of reasons: (1) protein graphs are large and complex, (2) current protein databases are large and continue to grow rapidly, and (3) only a small fraction of the frequent subgraphs among the huge pool of all possible subgraphs could be significant in the context of protein classification.

To address these challenges, we have developed an information-theoretic model called coherent subgraph mining. From information theory, the entropy of a random variable X measures the information content carried by X, and the Mutual Information (MI) between two random variables X and Y measures the correlation between X and Y. We define a subgraph X as coherent if it is strongly correlated with every sufficiently large sub-subgraph Y embedded in it. Based on the MI metric, we have designed a search scheme that only reports coherent subgraphs.

To determine the significance of coherent protein subgraphs, we have conducted an experimental study in which all coherent subgraphs were identified in several protein structural families annotated in the SCOP database (Murzin et al., 1995). The Support Vector Machine algorithm was used to classify proteins from different families under the binary classification scheme. We find that this approach identifies spatial motifs unique to individual SCOP families and affords excellent discrimination between families.
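
The two ingredients named in the abstract, a residue contact graph and the MI between a subgraph and one of its sub-subgraphs, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The 8.0 angstrom cutoff, the function names, and the modelling of X and Y as binary "occurs in this protein" indicators over a database are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
from itertools import combinations


def contact_graph(coords, cutoff=8.0):
    """Return the set of edges (i, j), i < j, linking residues whose
    representative atoms lie within `cutoff` of each other.
    The cutoff value is an assumption, not taken from the abstract."""
    edges = set()
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        if math.dist(a, b) <= cutoff:
            edges.add((i, j))
    return edges


def mutual_information(x, y):
    """Mutual information in bits between two equal-length binary vectors.
    Here x[k] = 1 is assumed to mean "subgraph X occurs in protein k",
    and likewise y[k] for a sub-subgraph Y."""
    n = len(x)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = sum(1 for xi, yi in zip(x, y) if xi == a and yi == b) / n
            p_a = sum(1 for xi in x if xi == a) / n
            p_b = sum(1 for yi in y if yi == b) / n
            if p_ab > 0:  # zero-probability cells contribute nothing
                mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Three hypothetical residues on a line: only the first two are in contact.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_edges = contact_graph(toy_coords)

# Hypothetical occurrence vectors over four proteins.
x_occurs = [1, 1, 0, 0]
y_always_with_x = [1, 1, 0, 0]   # Y appears exactly where X does
y_unrelated = [1, 0, 1, 0]       # Y appears independently of X
mi_high = mutual_information(x_occurs, y_always_with_x)
mi_zero = mutual_information(x_occurs, y_unrelated)
```

Under these assumptions, a sub-subgraph that appears only where X appears gives a high MI, while one that appears independently of X gives an MI of zero. A coherence test in the spirit of the abstract would require the MI to exceed some threshold for every sufficiently large sub-subgraph of X.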
Accurate Classification of Protein Structural Families Using Coherent Subgraph Analysis