Background Measuring the structural diversity of compound databases is relevant in drug discovery and many other areas of chemistry. The Consensus Diversity Plot (CDP) is proposed as a novel method to represent in low dimensions the diversity of chemical libraries, considering multiple molecular representations simultaneously. We illustrate the application of CDPs to classify eight compound data sets and two subsets with different sizes and compositions using molecular scaffolds, structural fingerprints, and physicochemical properties.

Conclusions CDPs are general data mining tools that represent in two dimensions the global diversity of compound data sets using multiple metrics. These plots can be constructed using single or combined measures of diversity. An online version of the CDPs is freely available at: https://consensusdiversityplots-difacquim-unam.shinyapps.io/RscriptsCDPlots/.

Graphical Abstract The Consensus Diversity Plot is a novel data mining tool that represents in two dimensions the global diversity of compound data sets using multiple metrics.

Electronic supplementary material The online version of this article (doi:10.1186/s13321-016-0176-9) contains supplementary material, which is available to authorized users.

The term scaffold is used extensively to describe the core structure of a molecule. Different approaches to obtain the scaffold of a molecule in a consistent manner have been examined elsewhere [23, 24]. In this work the scaffolds were derived with the strategy described by Johnson and Xu. The definition of scaffold used in this study is illustrated in Additional file 1: Figure S1. In this study both acyclic and cyclic systems (hereafter referred to as chemotypes) were considered. However, to further characterize the behavior of the data sets containing more acyclic systems (GRAS and Carcinogenic), the diversity of these data sets was also assessed after removing acyclic systems.
These subsets are hereafter referred to as [...] and [...] number of the most populated scaffolds [30]. The Shannon entropy (SE) of a population of $P$ compounds distributed in $n$ systems is defined as:

$$SE = -\sum_{i=1}^{n} p_i \log_2 p_i, \qquad p_i = \frac{c_i}{P}$$

where $p_i$ is the estimated probability of the occurrence of a specific chemotype in a population of $P$ compounds containing a total of $n$ acyclic and cyclic systems, and $c_i$ is the number of molecules containing a particular chemotype. The value of SE ranges from 0, when all the compounds have the same chemotype (i.e. minimum diversity), to $\log_2 n$, when all the compounds are evenly distributed among the $n$ acyclic and/or cyclic systems (i.e. maximum diversity). To normalize SE to the different $n$, the scaled Shannon entropy (SSE) is defined as:

$$SSE = \frac{SE}{\log_2 n}$$

Several values of $n$ (e.g. 5) were considered. In a previous work a limited and arbitrary number of most populated cyclic systems was explored [11].

Structural fingerprints For all pairs of compounds, the pairwise structural diversity was assessed with Molecular ACCess System (MACCS) keys (166 bits) [12] and Extended Connectivity Fingerprints (ECFP_4) [13] using the Tanimoto similarity coefficient [31]. The fingerprints were calculated with MayaChemTools (http://www.mayachemtools.org/) and R Studio scripts [32]. MACCS keys/Tanimoto is a broadly used method to assess the diversity of compound data sets. However, CDPs can be generated using any other fingerprint representation or a combination of them. Similarity coefficients other than Tanimoto [33] can also be used.

Physicochemical properties Six properties of pharmaceutical relevance were calculated with MOE: hydrogen bond donors (HBD), hydrogen bond acceptors (HBA), the octanol/water partition coefficient (logP), molecular weight (MW), topological polar surface area (TPSA), and number of rotatable bonds (RTB). In MOE the six properties have the following notation: a_don, a_acc, SlogP, Weight, TPSA, and b_rotN, respectively.
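The pairwise fingerprint comparison described above can be illustrated with a minimal Python sketch. It assumes each fingerprint is given as the set of its on-bit indices; the function names and the toy fingerprints are hypothetical and are not taken from the MayaChemTools or R scripts used in this work.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        # Assumption: two empty fingerprints are treated as identical.
        return 1.0
    shared = len(a & b)
    # Shared on-bits divided by on-bits present in either fingerprint.
    return shared / (len(a) + len(b) - shared)


def pairwise_similarities(fingerprints):
    """Tanimoto similarity for every unordered pair of compounds."""
    return [tanimoto(a, b) for a, b in combinations(fingerprints, 2)]


# Hypothetical toy fingerprints (on-bit indices), not real MACCS or ECFP_4 data.
fps = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 4, 5}), frozenset({7, 8})]
sims = pairwise_similarities(fps)
```

For a data set of m compounds this yields m(m-1)/2 values, whose distribution can then be summarized to describe fingerprint-based diversity.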
These molecular descriptors have been used to measure the diversity of compound databases [34-36]. The distance (or [...] Let $x_i^A$ be the N-dimensional vector of physicochemical properties for molecule $i$ in data set $A$, and let $x_j^B$ be the N-dimensional vector of physicochemical properties for molecule $j$ in data set $B$. (Here, $N = 6$.) Let the number of molecules in data sets $A$ and $B$ be $n_A$ and $n_B$, respectively. Then the [...] between data sets $A$ and $B$ is [...] known as the most populated scaffolds. Taking together all the SSE values at different numbers of scaffolds (SSE5-SSE70), it can be concluded that the scaffold diversity of the eight data sets as captured by SSE.
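The SSE values at different numbers of scaffolds referred to above (SSE5 to SSE70) could be computed along the following lines. This is a minimal sketch rather than the authors' implementation. It assumes each compound is labeled with a chemotype identifier, that $P$ is the number of compounds covered by the $n$ selected chemotypes, and that SE is divided by $\log_2$ of the number of chemotypes actually used. The function name and the toy labels are hypothetical.

```python
import math
from collections import Counter


def scaled_shannon_entropy(chemotypes, n=None):
    """SSE over the n most populated chemotypes (all chemotypes if n is None)."""
    counts = [count for _, count in Counter(chemotypes).most_common(n)]
    total = sum(counts)  # P: compounds covered by the selected chemotypes
    k = len(counts)      # number of chemotypes actually used (may be below n)
    if k < 2:
        return 0.0       # a single chemotype corresponds to minimum diversity
    se = -sum((c / total) * math.log2(c / total) for c in counts)
    return se / math.log2(k)  # scale SE to the range 0 to 1


# Hypothetical chemotype labels for a toy data set of 12 compounds.
labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 2 + ["D"] * 2
sse_top2 = scaled_shannon_entropy(labels, n=2)
sse_all = scaled_shannon_entropy(labels)
```

In this toy set the two most populated chemotypes are equally populated, so `sse_top2` reaches the maximum of 1, while `sse_all` is slightly lower because the four chemotypes are unevenly populated.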