Shen HIA Models

May 12, 2013

Human Intestinal Absorption (HIA) is a measure of the fraction of orally dosed compound that enters the bloodstream in the hepatic portal vein. Good HIA is a necessary, but not sufficient, requirement for good oral bioavailability. This is because oral bioavailability is a measure of the percentage of dose reaching the systemic circulation, which includes the effect of other processes, most significantly first pass metabolism by the liver.

Shen et al. [1] published a paper describing the generation and validation of QSAR models of HIA and blood-brain barrier (BBB) penetration. The data sets with which these models were built and validated were provided in the supplementary information to that paper, and the HIA data have been used to build the models described herein. Models of the BBB data from [1] are described in another article.

Full details of the data, methods, results and use of these models are provided below.

Data

Shen et al. [1] used a data set of 578 compounds classified as high or low based on a threshold of 30%. This was divided into a training set of 480 compounds (407 high and 73 low) and a test set of 98 compounds (93 high and 5 low). These data sets are included in the supporting information, as described below.

However, we noted that the test set contains only 5 poorly absorbed compounds out of 98, and this bias limits the statistical significance of the independent test of the model. Furthermore, we identified a number of compounds for which the structures in the supporting information of [1] were incorrect. Therefore, following correction of the structures, the full data set of 578 compounds was split into independent training, validation and test sets in the proportions 70:15:15, using the clustering method of the Auto-Modeller and the default cluster size of 0.7 (Tanimoto similarity). The resulting data sets are summarised in Table 1 and are included in the supporting information, as described below.

Table 1 Overview of data set split generated using the Auto-Modeller.

| Data set   | Number High | Number Low |
|------------|-------------|------------|
| Training   | 355         | 50         |
| Validation | 70          | 16         |
| Test       | 75          | 12         |
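As a quick consistency check on Table 1, the sketch below recomputes each subset's size, its share of the full data set, and its fraction of low-absorption compounds. The counts are copied from Table 1; the variable and function names are illustrative only and are not part of the Auto-Modeller.

```python
# Counts taken from Table 1 as (number high, number low) for each subset.
splits = {
    "Training": (355, 50),
    "Validation": (70, 16),
    "Test": (75, 12),
}

# Size of the full data set implied by the table.
total = sum(high + low for high, low in splits.values())


def summarise(splits, total):
    """Return {name: (size, share of full set, fraction low)} per subset."""
    summary = {}
    for name, (high, low) in splits.items():
        size = high + low
        summary[name] = (size, size / total, low / size)
    return summary


summary = summarise(splits, total)
for name, (size, share, frac_low) in summary.items():
    print(f"{name:<10} n={size:3d}  share={share:.1%}  low={frac_low:.1%}")
```

The subsets contain 405, 86 and 87 compounds, which sum to 578 and match the intended 70:15:15 proportions. Each subset also holds at least 12 low-absorption compounds, compared with 5 in the original test set.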

Methods

The Auto-Modeller was first applied to the original data sets provided by Shen et al. [1] to allow direct comparison with the models reported in that paper. It was subsequently applied to the revised data set split described above.

In both cases, the default descriptors and parameters for descriptor selection were used and models were generated using the decision tree (DT) and random forest (RF) methods.

Details of the parameters and descriptors used are provided in the supporting information, as described below.

Results

Table 2 compares the random forest (RF) model generated by the Auto-Modeller with the best model of Shen et al. [1] and with a model of the same data set reported in previous work by Hou et al. [2].

Table 2 Results of the RF model generated with the Auto-Modeller using the data set split in [1], compared with the best models published in [1] and [2]. TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives. k is the kappa statistic as described in Section 6.8.7 of the StarDrop Reference Guide.

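| Model              | Data set | TP  | TN | FP | FN | k    |
|--------------------|----------|-----|----|----|----|------|
| Shen et al.        | Training | 405 | 68 | 5  | 2  | 0.94 |
| Shen et al.        | Test     | 92  | 5  | 0  | 1  | 0.90 |
| Hou et al.         | Training | 398 | 69 | 4  | 9  | 0.90 |
| Hou et al.         | Test     | 91  | 5  | 0  | 2  | 0.82 |
| Auto-Modeller (RF) | Training | 407 | 73 | 0  | 0  | 1    |
| Auto-Modeller (RF) | Test     | 92  | 4  | 1  | 1  | 0.79 |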
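The k values in Table 2 can be checked from the confusion matrix counts. The sketch below assumes that k is Cohen's kappa; the definition in the StarDrop Reference Guide is not reproduced here, but this formula gives the tabulated test set values to two decimal places. The function name is illustrative.

```python
def cohen_kappa(tp, tn, fp, fn):
    """Cohen's kappa for a 2x2 confusion matrix (assumed definition of k)."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n  # fraction of compounds classified correctly
    # Agreement expected by chance, from the actual and predicted class totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    return (observed - expected) / (1 - expected)


# Test set counts (TP, TN, FP, FN) copied from Table 2.
test_counts = {
    "Shen et al.": (92, 5, 0, 1),
    "Hou et al.": (91, 5, 0, 2),
    "Auto-Modeller (RF)": (92, 4, 1, 1),
}

kappas = {model: cohen_kappa(*counts) for model, counts in test_counts.items()}
for model, k in kappas.items():
    print(f"{model:<20} k = {k:.2f}")
```

Moving one low-absorption compound from TN to FP is enough to take k from 0.90 to 0.79, which is the whole gap between the Shen et al. and Auto-Modeller rows on the test set.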

As noted above, the independent test set contains only 5 poorly absorbed compounds out of 98, and this bias limits the statistical significance of the independent test of the model; the difference in test set performance between these models is accounted for by a single mispredicted compound.
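One way to put a number on this limitation is a confidence interval for the fraction of low-absorption compounds that a model classifies correctly, TN / (TN + FP). The sketch below uses a 95% Wilson score interval; the choice of interval and confidence level is an assumption made for illustration and is not part of the original analysis.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against floating point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


# The original test set has only 5 low-absorption compounds.
all_correct = wilson_interval(5, 5)  # e.g. TN=5, FP=0
one_missed = wilson_interval(4, 5)   # e.g. TN=4, FP=1

print(f"5 of 5 correct: {all_correct[0]:.2f} to {all_correct[1]:.2f}")
print(f"4 of 5 correct: {one_missed[0]:.2f} to {one_missed[1]:.2f}")
```

With only five low-absorption compounds, the two intervals overlap over most of their range, so the test set cannot separate a model that classifies all five correctly from one that misses a single compound.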

The best performing models generated by the Auto-Modeller from the revised data set split are summarised in Table 3. The best performing model on the validation set was a DT (DTModel15); however, the RF model also performed well on the validation set and may be preferable because RF models are more robust than individual decision trees.

Table 3 Results of the best DT and RF models generated with the Auto-Modeller using the revised data set split. TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives. k is the kappa statistic as described in Section 6.8.7 of the StarDrop Reference Guide.

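| Model     | Data set      | TP  | TN | FP | FN | k    |
|-----------|---------------|-----|----|----|----|------|
| DTModel15 | Training      | 355 | 45 | 5  | 0  | 0.94 |
| DTModel15 | Validation    | 68  | 13 | 3  | 2  | 0.80 |
| DTModel15 | External Test | 72  | 9  | 3  | 3  | 0.71 |
| RF        | Training      | 355 | 50 | 0  | 0  | 1.00 |
| RF        | Validation    | 67  | 13 | 3  | 3  | 0.77 |
| RF        | External Test | 74  | 8  | 4  | 1  | 0.73 |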
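Because the classes are imbalanced, it can help to read Table 3 as per-class rates alongside k. The sketch below derives sensitivity, specificity and balanced accuracy from the external test set counts. These derived metrics are not reported in the original results, and the function and variable names are illustrative.

```python
def class_rates(tp, tn, fp, fn):
    """Return (sensitivity, specificity, balanced accuracy) for a 2x2 matrix."""
    sensitivity = tp / (tp + fn)  # fraction of high compounds predicted high
    specificity = tn / (tn + fp)  # fraction of low compounds predicted low
    return sensitivity, specificity, (sensitivity + specificity) / 2


# External test set counts (TP, TN, FP, FN) copied from Table 3.
external_test = {
    "DTModel15": (72, 9, 3, 3),
    "RF": (74, 8, 4, 1),
}

rates = {model: class_rates(*counts) for model, counts in external_test.items()}
for model, (sens, spec, balanced) in rates.items():
    print(f"{model:<10} sensitivity={sens:.2f}  specificity={spec:.2f}  "
          f"balanced accuracy={balanced:.2f}")
```

On the external test set, DTModel15 classifies 9 of the 12 low-absorption compounds correctly and RF classifies 8 of 12, while RF misses only 1 of the 75 high-absorption compounds compared with 3 for DTModel15.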

 

 

Using the HIA Models

The models can be downloaded for use within StarDrop from the following links:

HIA_Shen_training.aim : The RF model generated with the data set split in Shen et al. [1]

HIA_Shen_full_set_DT15.aim : The DT model generated using the revised split of the full set published in Shen et al. [1]

HIA_Shen_full_set_RF.aim : The RF model generated using the revised split of the full set published in Shen et al. [1]

To use these within StarDrop, download the files and save them in a convenient place. Load them into StarDrop using the button on the Models tab. Alternatively, to have StarDrop load the models automatically when it starts, select the File->Preference menu option and add the directory containing the model files under Models in the File Locations tab.

References

[1] Shen et al. J. Chem. Inf. Model. 2010, 50(6), pp. 1034–1041

[2] Hou et al. J. Chem. Inf. Model. 2007, 47(1), pp. 208–218

Supporting Information

The data sets and detailed outputs from the modelling process may be downloaded in a .zip archive.