Human Intestinal Absorption (HIA) is a measure of the fraction of orally dosed compound that enters the bloodstream in the hepatic portal vein. Good HIA is a necessary, but not sufficient, requirement for good oral bioavailability. This is because oral bioavailability is a measure of the percentage of dose reaching the systemic circulation, which includes the effect of other processes, most significantly first pass metabolism by the liver.
Shen et al. [J. Chem. Inf. Model. 2010, 50(6), pp. 1034–1041] published a paper describing the generation and validation of QSAR models of HIA and blood-brain-barrier penetration (BBB). The data sets with which these models were built and validated were provided in the supplementary information to this paper, and the HIA data have been used to build the models described herein. Models of the BBB data from 1 are described in another article.
Full details of the data, methods, results and use of these models are provided below.
Data
Shen et al. 1 used a data set of 578 compounds classified as high or low based on a threshold of 30%. This was divided into a training set of 480 compounds (407 high and 73 low) and a test set of 98 compounds (93 high and 5 low). These data sets are included in the supporting information, as described below.
However, we noted that the test set contains only 5 poorly absorbed compounds from a total of 98, and this bias limits the statistical significance of the independent test of the model. Furthermore, we identified a number of compounds for which the structures in the supporting information of 1 were incorrect. Therefore, following correction of the structures, the full data set of 578 compounds was split into independent training, validation and test sets in the proportions 70:15:15, using the clustering method of the Auto-Modeller and the default cluster size of 0.7 (Tanimoto similarity). The resulting data sets are summarised in Table 1 and are included in the supporting information, as described below.
Data set | Number High | Number Low |
---|---|---|
Training | 355 | 50 |
Validation | 70 | 16 |
Test | 75 | 12 |
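The idea of a cluster-based split can be illustrated with a minimal sketch. This is not the Auto-Modeller's algorithm, whose internals are not described here. It assumes a simple leader clustering on Tanimoto similarity, with fingerprints represented as sets of "on" bit positions, and the function names `tanimoto`, `leader_clusters` and `split_by_cluster` are hypothetical. The point it shows is that whole clusters, rather than individual compounds, are assigned to each set, so close analogues do not end up on both sides of the split.

```python
import random


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def leader_clusters(fps, threshold=0.7):
    """Leader clustering (an assumption, not the Auto-Modeller's method).

    Each fingerprint joins the first cluster whose leader (first member)
    is at least `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []
    for i, fp in enumerate(fps):
        for cluster in clusters:
            if tanimoto(fp, fps[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def split_by_cluster(clusters, fractions=(0.70, 0.15, 0.15), seed=0):
    """Assign whole clusters to training, validation and test sets."""
    rng = random.Random(seed)
    order = list(clusters)
    rng.shuffle(order)
    total = sum(len(c) for c in order)
    sets = [[] for _ in fractions]
    for cluster in order:
        # Put the cluster in whichever set is furthest below its target size.
        deficits = [f * total - len(s) for f, s in zip(fractions, sets)]
        sets[deficits.index(max(deficits))].extend(cluster)
    return sets


if __name__ == "__main__":
    # Toy fingerprints only; real use would take fingerprints from structures.
    toy_rng = random.Random(1)
    toy = [frozenset(toy_rng.sample(range(30), 8)) for _ in range(100)]
    train, val, test = split_by_cluster(leader_clusters(toy, 0.7))
    print(len(train), len(val), len(test))
```

Because clusters are kept intact, the final set sizes only approximate the requested proportions, which is consistent with the 405:86:87 split shown in Table 1.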
Methods
The Auto-Modeller was applied to the original data sets provided by Shen et al. to allow direct comparison with the models reported in that paper. The Auto-Modeller was subsequently applied to the revised data set split, as described above.
In both cases, the default descriptors and parameters for descriptor selection were used and models were generated using the decision tree (DT) and random forest (RF) methods.
Details of the parameters and descriptors used are provided in the supporting information, as described below.
Results
Table 2. Results of the RF model generated with the Auto-Modeller using the data set split in 1, compared with the best models published in 1 and 2. TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives. k is the kappa statistic as described in Section 6.8.7 of the StarDrop Reference Guide.
Model | Training TP | Training TN | Training FP | Training FN | Training k | Test TP | Test TN | Test FP | Test FN | Test k |
---|---|---|---|---|---|---|---|---|---|---|
Shen et al. | 405 | 68 | 5 | 2 | 0.94 | 92 | 5 | 0 | 1 | 0.90 |
Hou et al. | 398 | 69 | 4 | 9 | 0.90 | 91 | 5 | 0 | 2 | 0.82 |
Auto-Modeller (RF) | 407 | 73 | 0 | 0 | 1 | 92 | 4 | 1 | 1 | 0.79 |
As noted above, the independent test set contains only 5 poorly absorbed compounds from a total of 98, and this bias limits the statistical significance of the independent test of the model; the difference in performance between these models on the test set is accounted for by only a single mispredicted compound.
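The sensitivity of kappa to a single compound in such an imbalanced test set can be checked directly from the confusion-matrix counts. The sketch below assumes that the kappa statistic referred to in the StarDrop Reference Guide is Cohen's kappa; the test-set values in Table 2 are consistent with that assumption. The function name `cohen_kappa` is hypothetical.

```python
def cohen_kappa(tp: int, tn: int, fp: int, fn: int) -> float:
    """Cohen's kappa for a two-class confusion matrix (assumed definition)."""
    n = tp + tn + fp + fn
    # Observed agreement: fraction of compounds classified correctly.
    observed = (tp + tn) / n
    # Agreement expected by chance, from the actual and predicted class totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Test-set counts taken from Table 2.
    shen = cohen_kappa(tp=92, tn=5, fp=0, fn=1)
    auto_modeller = cohen_kappa(tp=92, tn=4, fp=1, fn=1)
    print(f"Shen et al.: {shen:.2f}, Auto-Modeller (RF): {auto_modeller:.2f}")
```

Moving one poorly absorbed compound from TN to FP lowers kappa from 0.90 to 0.79, because each of the 5 low-absorption compounds carries a large share of the chance-corrected agreement.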
The best performing models generated by the Auto-Modeller from the revised data set split are summarised in Table 3. The best performing model on the validation set was a DT (DTModel15); however, the RF model also performed well on the validation set and may be preferable due to the improved robustness of RF models over individual decision trees.
Model | Training TP | Training TN | Training FP | Training FN | Training k | Validation TP | Validation TN | Validation FP | Validation FN | Validation k | External test TP | External test TN | External test FP | External test FN | External test k |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DTModel15 | 355 | 45 | 5 | 0 | 0.94 | 68 | 13 | 3 | 2 | 0.80 | 72 | 9 | 3 | 3 | 0.71 |
RF | 355 | 50 | 0 | 0 | 1.00 | 67 | 13 | 3 | 3 | 0.77 | 74 | 8 | 4 | 1 | 0.73 |
Installing and using the models
HIA Shen training model
Download HIA Shen training model
The RF model generated with the data set split in Shen et al. 1
HIA Shen full set DT15 model
Download HIA Shen full set DT15 model
The DT model generated using the revised split of the full set published in Shen et al. 1
HIA Shen full set RF model
The RF model generated using the revised split of the full set published in Shen et al. 1
Supporting information
The data sets and detailed outputs from the modelling process may be downloaded.
References
1 Shen et al. J. Chem. Inf. Model. 2010, 50(6), pp. 1034–1041
2 Hou et al. J. Chem. Inf. Model. 2007, 47(1), pp. 208–218