Whether you are an aspiring data scientist or a veteran, this article is for you! In this article, we will build a simple regression model in Python. To spice things up a bit, we will not use the ubiquitous Boston Housing dataset; instead, we will use a simple bioinformatics dataset. In particular, we will use the Delaney Solubility dataset, which captures an important physicochemical property in computational drug discovery.
The aspiring data scientist will find the step-by-step tutorial particularly accessible, while the veteran data scientist may find a new, challenging dataset on which to try out their state-of-the-art machine learning algorithm or workflow.
1. What Are We Building Today?
A regression model! And we are going to use Python to do that. While we’re at it, we are going to use a bioinformatics dataset (technically, it’s a cheminformatics dataset) for the model building.
In particular, we are going to predict the LogS value, which represents the aqueous solubility of small molecules. The aqueous solubility value is a relative measure of how readily a molecule dissolves in water. It is an important physicochemical property of effective drugs.
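To make the scale of LogS concrete, here is a minimal sketch. It assumes that LogS is the base-10 logarithm of the solubility expressed in mol/L, which is the convention used in the ESOL paper. The function names and the example solubility values are hypothetical and chosen only for illustration.

```python
import math


def solubility_to_logs(mol_per_litre: float) -> float:
    """Convert a molar solubility (mol/L) to LogS, assuming a base-10 logarithm."""
    if mol_per_litre <= 0:
        raise ValueError("solubility must be a positive number")
    return math.log10(mol_per_litre)


def logs_to_solubility(logs: float) -> float:
    """Convert a LogS value back to a molar solubility (mol/L)."""
    return 10.0 ** logs


# Hypothetical solubilities, not taken from the Delaney dataset.
for solubility in (1.0, 0.01, 1e-6):
    print(f"{solubility:g} mol/L -> LogS = {solubility_to_logs(solubility):.1f}")
```

Under this convention, each drop of one LogS unit corresponds to a tenfold drop in solubility, so more negative values mean less soluble molecules.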
What better way to get acquainted with what we are building today than a cartoon illustration?
2. Delaney Solubility Dataset
2.1. Data Understanding
As the name implies, the Delaney solubility dataset comprises the aqueous solubility values, along with the corresponding chemical structures, for a set of 1,144 molecules. For those outside the field of biology, there are some terms that we will spend some time clarifying.
Molecules, sometimes referred to as small molecules or compounds, are chemical entities made up of atoms. Let’s use an analogy here and think of atoms as Lego blocks, where 1 atom is 1 Lego block. When we use several Lego blocks to build something, whether it be a house, a car or some abstract entity, the constructed entities are comparable to molecules. Thus, we can refer to the specific arrangement and connectivity of the atoms that form a molecule as its chemical structure.
So how does each of the entities that you are building differ? Well, they differ in the spatial connectivity of the blocks (i.e. how the individual blocks are connected). In chemical terms, molecules differ in their chemical structures. Thus, if you alter the connectivity of the blocks, you effectively alter the entity that you are building. For molecules, if atom types (e.g. carbon, oxygen, nitrogen, sulfur, phosphorus, fluorine, chlorine, etc.) or groups of atoms (e.g. hydroxy, methoxy, carboxy, ether, etc.) are altered, then the molecule is altered too and becomes a new chemical entity (i.e. a new molecule is produced).
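The Lego analogy can be sketched in a few lines of Python. This is a toy representation, not how cheminformatics toolkits store molecules: each molecule is a list of heavy atoms (hydrogens omitted) plus a list of bonds given as index pairs. The dictionary layout and the helper names `composition` and `bond_types` are assumptions made for illustration. Ethanol and dimethyl ether are used because they are built from the same blocks but connected differently.

```python
from collections import Counter

# Heavy atoms only; each bond is a pair of indices into the atom list.
ethanol = {"atoms": ["C", "C", "O"], "bonds": [(0, 1), (1, 2)]}         # C-C-O
dimethyl_ether = {"atoms": ["C", "O", "C"], "bonds": [(0, 1), (1, 2)]}  # C-O-C


def composition(molecule):
    """Count how many atoms of each element the molecule contains."""
    return Counter(molecule["atoms"])


def bond_types(molecule):
    """Count bonds by the pair of elements they join, ignoring direction."""
    atoms = molecule["atoms"]
    return Counter(tuple(sorted((atoms[i], atoms[j]))) for i, j in molecule["bonds"])


# Same set of blocks ...
print(composition(ethanol) == composition(dimethyl_ether))
# ... but a different connectivity, hence a different molecule.
print(bond_types(ethanol) == bond_types(dimethyl_ether))
```

The first comparison shows that both molecules use two carbons and one oxygen, while the second shows that the bonds joining them differ: ethanol has one carbon-carbon and one carbon-oxygen bond, whereas dimethyl ether has two carbon-oxygen bonds.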
To become an effective drug, a molecule needs to be taken up and distributed in the human body, and this is directly governed by its aqueous solubility. Solubility is therefore an important property that researchers take into consideration in the design and development of therapeutic drugs. A potent drug that is unable to reach its intended target owing to poor solubility would be a poor drug candidate.
2.2. Retrieving the Dataset
The aqueous solubility dataset compiled by Delaney in the research paper entitled ESOL: Estimating Aqueous Solubility Directly from Molecular Structure is available as a Supplementary file. For your convenience, we have also downloaded the entire Delaney solubility dataset and made it available on the Data Professor GitHub.
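Once the file has been downloaded, reading it could look roughly like the sketch below. Everything specific here is an assumption: the column names (`compound_id`, `smiles`, `logS`), the helper `read_solubility_table`, and the two sample rows, whose values are made-up placeholders and not copied from the Delaney dataset. The sketch uses an in-memory sample so that it runs without the real file.

```python
import csv
import io

# Placeholder sample standing in for the downloaded file. The column names
# and the values are hypothetical, not taken from the Delaney dataset.
SAMPLE_CSV = """compound_id,smiles,logS
example_1,CCO,0.5
example_2,c1ccccc1,-1.5
"""


def read_solubility_table(handle, target_column="logS"):
    """Parse CSV rows into dicts and convert the target column to float."""
    rows = []
    for row in csv.DictReader(handle):
        row[target_column] = float(row[target_column])
        rows.append(row)
    return rows


rows = read_solubility_table(io.StringIO(SAMPLE_CSV))
print(len(rows), "molecules loaded")
print(rows[0])
```

For the real dataset, open the downloaded file with `open(path, newline="")` and pass the file handle to the same function, adjusting `target_column` to whatever the solubility column is called in that file.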



