Human Intestinal Absorption (HIA) is a measure of the fraction of an orally dosed compound that enters the bloodstream in the hepatic portal vein. Good HIA is a necessary, but not sufficient, requirement for good oral bioavailability. This is because oral bioavailability measures the percentage of the dose reaching the systemic circulation, which also reflects other processes, most significantly first-pass metabolism by the liver.

Shen et al. [J. Chem. Inf. Model. 2010, 50(6), pp. 1034–1041] published a paper describing the generation and validation of QSAR models of HIA and blood-brain barrier (BBB) penetration. The data sets used to build and validate these models were provided in the supplementary information to that paper, and the HIA data have been used to build the models described herein. Models of the BBB data from ref. 1 are described in another article.

Full details of the data, methods, results and use of these models are provided below.

Data

Shen et al. 1 used a data set of 578 compounds classified as high or low based on a threshold of 30%. This was divided into a training set of 480 compounds (407 high and 73 low) and a test set of 98 compounds (93 high and 5 low). These data sets are included in the supporting information, as described below.
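The classification step described above can be illustrated with a minimal sketch. The 30% threshold comes from the text; the function name, the example values and the choice to count a value of exactly 30% as high are assumptions made for illustration only.

```python
from collections import Counter

HIA_THRESHOLD = 30.0  # percent absorbed; threshold quoted in the text


def classify_hia(percent_absorbed, threshold=HIA_THRESHOLD):
    """Label a compound 'high' or 'low'.

    Assumption: a value exactly at the threshold is treated as 'high'.
    """
    return "high" if percent_absorbed >= threshold else "low"


# Hypothetical %HIA values, not taken from the Shen et al. data set.
example_values = [95.0, 12.0, 30.0, 64.5, 3.0]
labels = [classify_hia(v) for v in example_values]
counts = Counter(labels)
print(labels)
print(counts["high"], "high,", counts["low"], "low")
```

Counting the labels in this way makes any class imbalance in a data set easy to see before modelling.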

However, we noted that the test set contains only 5 poorly absorbed compounds from a total of 98, and this bias limits the statistical significance of the independent test of the model. Furthermore, we identified a number of compounds for which the structures in the supporting information of ref. 1 were incorrect. Therefore, following correction of the structures, the full data set of 578 compounds was split into independent training, validation and test sets in the proportions 70:15:15, using the clustering method of the Auto-Modeller and the default cluster size of 0.7 (Tanimoto similarity). The resulting data sets are summarised in Table 1 and are included in the supporting information, as described below.

Data set      Number High   Number Low
Training      355           50
Validation    70            16
Test          75            12
Table 1 Overview of data set split generated using the Auto-Modeller.
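The idea of a cluster-based split can be sketched as follows. This is not the Auto-Modeller's algorithm, which is not described here; it is a generic leader-clustering sketch that assumes each compound is represented by a set of fingerprint bits. The function names and the toy fingerprints are hypothetical. Only the 0.7 Tanimoto threshold and the 70:15:15 proportions come from the text.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def leader_cluster(fingerprints, threshold=0.7):
    """Greedy leader clustering: a compound joins the first cluster whose
    leader it matches at >= threshold, otherwise it starts a new cluster."""
    clusters = []  # list of (leader fingerprint, member indices)
    for idx, fp in enumerate(fingerprints):
        for leader, members in clusters:
            if tanimoto(fp, leader) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append((fp, [idx]))
    return [members for _, members in clusters]


def split_by_cluster(fingerprints, fractions=(0.70, 0.15, 0.15),
                     threshold=0.7, seed=0):
    """Assign whole clusters to training/validation/test sets so that
    similar compounds are never divided between sets."""
    clusters = leader_cluster(fingerprints, threshold)
    random.Random(seed).shuffle(clusters)
    targets = [f * len(fingerprints) for f in fractions]
    sets = [[], [], []]
    for members in clusters:
        # Place the cluster in the set furthest below its target size.
        deficits = [targets[i] - len(sets[i]) for i in range(3)]
        sets[deficits.index(max(deficits))].extend(members)
    return sets


# Hypothetical toy fingerprints (sets of bit positions).
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {10, 11, 12}, {10, 11, 12, 13},
           {20, 21}, {30, 31, 32}, {30, 31, 32, 33}, {40}, {50, 51},
           {60, 61, 62}]
train, valid, test = split_by_cluster(toy_fps)
print(len(train), len(valid), len(test))
```

Keeping clusters intact means the validation and test sets contain compounds that are structurally distinct from those used for training, which gives a more demanding test than a purely random split.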

Methods

The Auto-Modeller was first applied to the original data sets provided by Shen et al. to allow direct comparison with the models reported in that paper. The Auto-Modeller was subsequently applied to the revised data set split, as described above.

In both cases, the default descriptors and parameters for descriptor selection were used and models were generated using the decision tree (DT) and random forest (RF) methods.

Details of the parameters and descriptors used are provided in the supporting information, as described below.

Results

Table 2 Results of the RF model generated with the Auto-Modeller using the data set split in ref. 1, compared with the best models published in refs 1 and 2. TP is number of true positives, TN is number of true negatives, FP is number of false positives and FN is number of false negatives. k is the kappa statistic as described in Section 6.8.7 of the StarDrop Reference Guide.

                     Training                        Test
Model                TP    TN    FP    FN    k       TP    TN    FP    FN    k
Shen et al.          405   68    5     2     0.94    92    5     0     1     0.90
Hou et al.           398   69    4     9     0.90    91    5     0     2     0.82
Auto-Modeller (RF)   407   73    0     0     1.00    92    4     1     1     0.79
Table 2 compares the random forest (RF) model generated by the Auto-Modeller with the best model of Shen et al. and a model of the same data set reported in previous work by Hou et al. 2.

As noted above, the independent test set contains only 5 poorly absorbed compounds from a total of 98, and this bias limits the statistical significance of the independent test of the model; the difference in performance between these models on the test set is accounted for by only a single mispredicted compound.
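The sensitivity of the kappa statistic to a single compound in such an unbalanced test set can be checked with a short calculation. The sketch below assumes that the k reported in the tables is Cohen's kappa computed from the confusion matrix; the function name is illustrative, and the counts are copied from the test-set columns of Table 2.

```python
def cohen_kappa(tp, tn, fp, fn):
    """Cohen's kappa for a two-class confusion matrix."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals of each class.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: only one class present, kappa undefined
    return (observed - expected) / (1.0 - expected)


# Test-set counts (TP, TN, FP, FN) copied from Table 2.
test_counts = {
    "Shen et al.": (92, 5, 0, 1),
    "Hou et al.": (91, 5, 0, 2),
    "Auto-Modeller (RF)": (92, 4, 1, 1),
}
test_kappa = {name: round(cohen_kappa(*c), 2) for name, c in test_counts.items()}
for name, k in test_kappa.items():
    print(f"{name}: k = {k:.2f}")
# Shen et al.: k = 0.90
# Hou et al.: k = 0.82
# Auto-Modeller (RF): k = 0.79
```

Under this assumption the calculation reproduces the tabulated values, and shows that one additional misclassified low-absorption compound out of five lowers kappa from 0.90 to 0.79.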

The best performing models generated by the Auto-Modeller from the revised data set split are summarised in Table 3. The best performing model on the validation set was a DT (DTModel15); however, the RF model also performed well on the validation set and may be preferable due to the improved robustness of RF models over individual decision trees.

             Training                        Validation                      External test
Model        TP    TN    FP    FN    k       TP    TN    FP    FN    k       TP    TN    FP    FN    k
DTModel15    355   45    5     0     0.94    68    13    3     2     0.80    72    9     3     3     0.71
RF           355   50    0     0     1.00    67    13    3     3     0.77    74    8     4     1     0.73
Table 3 Results of the DT and RF models generated with the Auto-Modeller using the revised data set split. TP is number of true positives, TN is number of true negatives, FP is number of false positives and FN is number of false negatives. k is the kappa statistic as described in Section 6.8.7 of the StarDrop Reference Guide.

Installing and using the models

HIA Shen training model

Download HIA Shen training model

The RF model generated with the data set split in Shen et al. 1

HIA Shen full set DT15 model

Download HIA Shen full set DT15 model

The DT model generated using the revised split of the full set published in Shen et al. 1

HIA Shen full set RF model

Download HIA Shen full set RF model

The RF model generated using the revised split of the full set published in Shen et al. 1

Supporting information

The data sets and detailed outputs from the modelling process may be downloaded.

Download data set and outputs

How to use the script

You must first create a free Mcule account on mcule.com, and request a token from support@mcule.com. Once generated, your API token will be available on mcule.com/accounts/api-access/.

Before your first search, you will be prompted to enter the API token. All of the following scripts are available under the Custom Scripts->Mcule menu.

With Query from selection and Draw molecule you can run a search for the queries that you have selected, or for a drawn structure. You can specify the search type (exact, similarity, substructure), the collection (full, in_stock), set the similarity threshold, and adjust the maximum number of hits returned for each query. Optionally, you can also fetch pricing information for the hits.

The Price details for Mcule ID list script allows you to fetch pricing information for a list of Mcule IDs pasted into a text box.

With the Request quote for Mcule ID column you can generate a URL to paste into your browser to start a quote generation. You can update your API token using the Update API token script.

References

1 Shen et al. J. Chem. Inf. Model. 2010, 50(6), pp. 1034–1041

2 Hou et al. J. Chem. Inf. Model. 2007, 47(1), pp. 208–218