Machine learning has several applications in diverse fields, ranging from healthcare to natural language processing. Dr. Ragothanam Yennamalli, a computational biologist and Kolabtree freelancer, examines how machine learning and AI are being applied in biology and genomics.
Machine learning and artificial intelligence have stormed the world and changed the way we work and live. Advances in these areas have led many to either praise or decry them. However, for a computational biologist like me, these are not new terms. AI and ML, as they’re popularly called, have applications and benefits across a wide range of industries. Most notably, they are revolutionizing the way biological research is performed, leading to innovations across healthcare and biotechnology.
What is machine learning?
Machine learning and statistics are closely knit, because the methods used in most machine learning approaches, such as regression analysis, originate in statistics. While machine learning methods have many applications, their use on biological data over the last 30 years or so has been in gene prediction, functional annotation, systems biology, microarray data analysis, pathway analysis, and similar areas.
A machine learning program tries to identify patterns in a given data set and then uses them to recognize similar patterns in another data set. The processes involved are quite similar to predictive modelling and data mining: they search data for patterns and adjust the program’s actions accordingly.
Most of us encounter machine learning and AI through online shopping, where recommendations related to our purchases are suggested by recommendation engines built on machine learning. Other applications include spam filtering, security threat detection, fraud detection, and personalized news feeds.
Machine learning is broadly categorized into three types: supervised learning, unsupervised learning, and reinforcement learning.
Supervised learning: Supervised machine learning algorithms require external assistance, usually from a human expert who provides curated inputs paired with the desired outputs, so that the algorithm’s predictions can be checked during training. The expert or data scientist also determines the features or patterns that the model will use. Once training is complete, the model can be applied to other data for prediction and classification. It is called supervised because the algorithm learns from the training data set, akin to a teacher supervising a student’s learning.
Further, supervised learning is divided into two categories, classification and regression. In classification, the output variable is categorized into classes such as ‘red’ or ‘green’ or ‘disease’ or ‘non-disease’. In regression, the output variable is a real value such as ‘dollars’ or ‘weight’.
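The difference between the two categories can be sketched in a few lines of Python. This is only a toy illustration: the numbers are made up, and the two methods chosen here (a least-squares line for regression, a nearest-neighbour rule for classification) are assumptions for the example, not methods prescribed above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def nearest_neighbor_label(x, train_xs, train_labels):
    """Classify x with the label of the closest training point."""
    best = min(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x))
    return train_labels[best]


# Regression: the output is a real value (a weight in kg).
heights_cm = [150, 160, 170, 180]   # made-up training inputs
weights_kg = [50, 58, 66, 74]       # made-up training outputs
slope, intercept = fit_line(heights_cm, weights_kg)
predicted_weight = slope * 175 + intercept

# Classification: the output is a class label.
marker_levels = [0.2, 0.4, 1.8, 2.1]  # hypothetical biomarker readings
labels = ["non-disease", "non-disease", "disease", "disease"]
predicted_class = nearest_neighbor_label(1.9, marker_levels, labels)

print(predicted_weight, predicted_class)
```

In both cases the model is fitted on labeled examples; only the type of the output differs.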
So, a supervised classifier is trained on a training set and evaluated on a test set. The most important step is how one builds the training set; in most cases, its quality makes or breaks the model. One should also consider the negative data included in the training set, since a good negative data set can be difficult to identify.
For example, to train a machine to predict whether two proteins interact (protein-protein interactions, or PPI), I would need a positive set of protein sequences/structures that have been proven to interact physically (by methods such as X-ray crystallography or NMR) and a negative set of protein sequences/structures that are known to function without interacting with a partner. In this case, the negative set is large relative to the positive set, since the known PPIs are far fewer than the proteins in an organism’s proteome. Thus, critically analyzed data is needed, and assembling it takes time.
Unsupervised learning: Unsupervised learning algorithms require no external assistance. The program automatically searches for features or patterns in the data and groups the data into clusters. When new data is introduced, it uses the previously learned features to classify it. This approach is very useful in the era of big data because it draws on large amounts of data without needing them to be labeled. It is called unsupervised learning because there is no teacher or supervision involved.
Unsupervised learning methods include clustering, hierarchical clustering, and the Gaussian mixture model. In clustering, relations among similar data points are found and the points are grouped into clusters. In hierarchical clustering, the data is first grouped into small clusters by some similarity measure, and these sub-clusters are then grouped again based on a similar criterion. In the Gaussian mixture model, each mixture component represents a distinct cluster.
Reinforcement learning: In reinforcement learning, decisions are made on the basis of which actions yield the more positive outcome. The learner does not know in advance which action to take; it finds out by performing actions and observing the results. This kind of learning therefore depends on trial and error.
The most promising implementation of machine learning and artificial intelligence is in personalized medicine and in precision medicine. In recent years, many startups have focused on this and have developed pipelines. It remains to be seen whether these translate into products that benefit the general public in the long run.
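One practical consequence of such an imbalanced data set is that the training and test sets should both keep the same proportion of positives and negatives. A minimal sketch of a stratified split follows; the pair names, the set sizes (20 positives, 180 negatives), and the helper `stratified_split` are all hypothetical and serve only to illustrate the idea.

```python
import random


def stratified_split(examples, labels, test_fraction=0.25, seed=0):
    """Split examples into train/test sets, preserving the class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for example, label in zip(examples, labels):
        by_class.setdefault(label, []).append(example)

    train, test = [], []
    for label, members in by_class.items():
        members = members[:]          # copy so the caller's list is untouched
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_fraction))
        test += [(m, label) for m in members[:n_test]]
        train += [(m, label) for m in members[n_test:]]
    return train, test


# Hypothetical data: few known interacting pairs, many non-interacting ones.
positives = [f"pos_pair_{i}" for i in range(20)]
negatives = [f"neg_pair_{i}" for i in range(180)]
examples = positives + negatives
labels = [1] * len(positives) + [0] * len(negatives)

train, test = stratified_split(examples, labels)
print(len(train), len(test))
```

Splitting each class separately guarantees that the rare positive class appears in both sets, which a purely random split of a small data set does not.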
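As a rough illustration of clustering, the sketch below implements k-means, one common clustering algorithm (chosen here as an assumption; the text above does not name a specific one). The six 2D points are made up, and starting from the first k points is a simplification to keep the example deterministic.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on 2D points, initialised from the first k points."""
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for idx, (x, y) in enumerate(points):
            assignment[idx] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2
                + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return assignment, centroids


# Two visually obvious groups, with no labels supplied.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (8.2, 7.9), (7.9, 8.1)]
assignment, centroids = kmeans(points, k=2)
print(assignment)
print(centroids)
```

No labels are given to the algorithm; the grouping emerges from the distances between points alone.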
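The trial-and-error idea can be sketched with a multi-armed bandit, a standard toy problem for reinforcement learning. The epsilon-greedy strategy, the reward probabilities, and the function name below are assumptions made for this illustration.

```python
import random


def epsilon_greedy_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Learn which action pays best purely by trying actions."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    counts = [0] * n_actions      # how often each action was tried
    values = [0.0] * n_actions    # running average reward per action
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(n_actions)
        else:
            # Exploit: take the action that has looked best so far.
            action = max(range(n_actions), key=lambda a: values[a])
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Incremental update of the average reward for this action.
        values[action] += (reward - values[action]) / counts[action]
    return values, counts


# The learner never sees these probabilities; it only sees rewards.
values, counts = epsilon_greedy_bandit([0.1, 0.5, 0.9])
best_action = max(range(len(values)), key=lambda a: values[a])
print(best_action, counts)
```

The learner starts with no knowledge of which action is best and ends up favouring the one that has given the most positive outcomes.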
Applications of Machine Learning in Biology
Identifying gene coding regions
In genomics, next-generation sequencing has rapidly advanced the field by making it possible to sequence a genome in a short time. An active area of machine learning is therefore the identification of gene coding regions in a genome. Gene prediction tools that involve machine learning can be more sensitive than typical homology-based sequence searches.
In proteomics, we touched upon PPI earlier. Beyond that, the use of machine learning in structure prediction has pushed accuracy from 70% to more than 80%. Machine learning in text mining is also quite promising: training sets are used to identify novel drug targets from multiple journal articles and from searches of secondary databases.
Deep learning is a more recent subfield of machine learning that extends neural networks. The “deep” refers to the number of layers through which data is transformed, so a deep learning model is essentially a neural network with many layers. The nodes in these layers try to mimic how the human brain solves problems. Neural networks were already used in machine learning, but neural network-based algorithms need refined, meaningful data extracted from raw data sets before they can perform analysis, and the growing volume of genome sequencing data has made it difficult to extract that information first. The multiple layers in a deep network filter the information, pass it from layer to layer, and so refine the output.
Deep learning algorithms extract features from large data sets, such as a group of images or genomes, and build a model from the extracted features. Once built, the model can be used to analyze other data sets. Today, scientists use deep learning to classify cellular images, analyze genomes, discover drugs, and find out how image and genome data are linked with electronic medical records. Deep learning is now an active field in computational biology, where it is applied to high-throughput biological data to gain a better understanding of high-dimensional data sets. It is used in regulatory genomics to identify regulatory variants, to predict the effect of mutations from DNA sequence, and to analyze whole cells, populations of cells, and tissues.
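To make the idea of data being transformed layer by layer concrete, here is a minimal forward pass through a small multi-layer network in plain Python. The weights are hand-picked for the example rather than learned, and the layer sizes and activation functions (ReLU in the hidden layers, sigmoid at the output) are assumptions.

```python
import math


def dense_relu(inputs, weights, biases):
    """One fully connected layer: weighted sums followed by ReLU."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, hidden_layers, output_weights, output_bias):
    """Pass the input through every hidden layer, then a sigmoid output."""
    activations = inputs
    for weights, biases in hidden_layers:
        activations = dense_relu(activations, weights, biases)
    z = sum(w * a for w, a in zip(output_weights, activations)) + output_bias
    return sigmoid(z)


# Two hidden layers with hand-picked (not learned) weights.
hidden_layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0], [-1.0, 2.0]], [0.5, 0.0]),
]
output = forward([1.0, 2.0], hidden_layers, [1.0, -1.0], 1.0)
print(output)
```

In a real deep learning model the weights are learned from data and there may be many more layers, but the flow of information from one layer to the next is the same.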
AI in healthcare
Machine learning and AI are being used extensively by hospitals and health service providers to improve patient satisfaction, deliver personalized treatments, make accurate predictions and enhance the quality of life. They are also being used to make clinical trials more efficient and to help speed up the process of drug discovery and delivery.
To quote the work by Google employing AI in healthcare data [17, 18]:
Doctors are already inundated with alerts and demands on their attention — could models help physicians with tedious, administrative tasks so they can better focus on the patient in front of them or ones that need extra attention? Can we help patients get high-quality care no matter where they seek it?
And from the patient’s point of view:
When will I be able to go home? Will I get better? Will I have to come back to the hospital?
Machine Learning Tools used in Biology
CellProfiler: A few years ago, software for biological image analysis measured only a single parameter from a group of images. In 2005, computational biologist Anne Carpenter, from MIT and Harvard, released CellProfiler, software for quantitatively measuring individual features such as the number of fluorescent cells in a microscopy field. Today, CellProfiler can produce thousands of features by implementing deep learning techniques.
DeepVariant: Deep learning is used extensively in tools for mining genome data. Verily Life Sciences and Google developed DeepVariant, a deep learning-based tool that predicts a common type of genetic variation more accurately than conventional tools.
Atomwise: Drug discovery is another field to which deep learning is contributing significantly. Atomwise, a San Francisco-based biotech company, has developed an algorithm that converts molecules into 3D pixels. This representation captures the 3D structure of proteins and small molecules with atomic precision. Using these features, the algorithm can predict small molecules that are likely to interact with a given protein.
Different types of deep learning methods exist, such as the deep neural network (DNN), recurrent neural network (RNN), convolutional neural network (CNN), deep autoencoder (DA), deep Boltzmann machine (DBM), deep belief network (DBN), and deep residual network (DRN). In biology, DNN, RNN, CNN, DA, and DBM are the most commonly used. Translating biological data into validated biomarkers that reveal disease state is a key task in biomedicine, and DNNs play a significant role in identifying potential biomarkers from genome and proteome data. Deep learning also plays an important role in drug discovery.
A CNN is used in the recently developed computational tool DeepCpG to predict DNA methylation states in single cells. In DNA methylation, methyl groups attach to the DNA molecule and alter its function without causing any change in the sequence. DeepCpG is also used to predict known motifs that are responsible for methylation variability. When evaluated on five different types of methylation data, DeepCpG gave more accurate results than other methods. DNA methylation is one of the most widely studied epigenetic markers.
TensorFlow is a recently developed deep learning framework from Google researchers that accelerates DNN design and training. It brings several improvements, such as graph visualization and compilation time. A main advantage is that it ships with a supporting tool, TensorBoard, for visualizing model training progress, including for complex models.
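The idea of turning a molecule into 3D pixels can be illustrated with a very simple voxel grid. This is a sketch of the general technique only, not Atomwise's actual algorithm; the coordinates, grid size, and the function `voxelize` are hypothetical.

```python
def voxelize(atoms, grid_size=4, resolution=1.0):
    """Count atoms per voxel; atoms are (x, y, z) coordinates in angstroms."""
    grid = [
        [[0 for _ in range(grid_size)] for _ in range(grid_size)]
        for _ in range(grid_size)
    ]
    for x, y, z in atoms:
        i = int(x // resolution)
        j = int(y // resolution)
        k = int(z // resolution)
        # Atoms outside the grid are ignored in this simplified sketch.
        if 0 <= i < grid_size and 0 <= j < grid_size and 0 <= k < grid_size:
            grid[i][j][k] += 1
    return grid


# Hypothetical atom positions; the last one falls outside the grid.
atoms = [(0.2, 0.3, 0.1), (0.8, 0.4, 0.9), (2.5, 1.1, 3.7), (5.0, 0.0, 0.0)]
grid = voxelize(atoms)
total_atoms_in_grid = sum(v for plane in grid for row in plane for v in row)
print(grid[0][0][0], grid[2][1][3], total_atoms_in_grid)
```

A grid like this can be fed to a neural network in the same way as the pixels of an image, with an extra spatial dimension. Real systems typically also record the atom type in each voxel rather than a plain count.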
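To give a feel for how a convolutional layer reads a DNA sequence, the sketch below one-hot encodes a sequence and slides a single filter along it. This is not DeepCpG's implementation: the filter is set by hand to respond to the dinucleotide CG, whereas a trained CNN learns its filters from data, and the example sequence is made up.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode each base as a 4-element vector (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence]


def conv1d(encoded, kernel):
    """Slide the kernel along the sequence and score every window."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(len(BASES)):
                score += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(score)
    return scores


# A hand-set filter that gives its maximum score (2.0) on the window "CG".
cg_filter = one_hot("CG")
scores = conv1d(one_hot("ACGTCGA"), cg_filter)
cpg_positions = [i for i, s in enumerate(scores) if s == 2.0]
print(scores)
print(cpg_positions)
```

Each filter acts as a motif detector, which is why convolutional networks can both make predictions from sequence and point to the motifs that drive them.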
In conclusion, AI and machine learning are changing the way biologists carry out research, interpret it, and apply it to solve problems. As science grows increasingly interdisciplinary, it is inevitable that biology will continue to borrow from machine learning, or better still, that machine learning will lead the way.
Acknowledgement: The author would like to thank Mr. Arvind Yadav for assisting in this blogpost.
References and Further Reading:
- Raina, C. K. (2016). A review on machine learning techniques. International Journal on Recent and Innovation Trends in Computing and Communication, 4(3), 395-399.
- Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255-260.
- Praveena, M., & Jaiganesh, V. (2017). A literature review on supervised machine learning algorithms and boosting process. International Journal of Computer Applications, 169(8), 32-35.
- Forsberg, F., & Alvarez Gonzalez, P. (2018). Unsupervised Machine Learning: An Investigation of Clustering Algorithms on a Small Dataset.
- Gosavi, A. (2009). Reinforcement learning: A tutorial survey and recent advances. INFORMS Journal on Computing, 21(2), 178-192.
- Angermueller, C., Pärnamaa, T., Parts, L., & Stegle, O. (2016). Deep learning for computational biology. Molecular systems biology, 12(7), 878.
- Webb, S. (2018). Deep learning for biology. Nature, 554(7693), 555-557.
- Mahmud, M., Kaiser, M. S., Hussain, A., & Vassanelli, S. (2018). Applications of deep learning and reinforcement learning to biological data. IEEE transactions on neural networks and learning systems, 29(6), 2063-2079.
- Mamoshina, P., Vieira, A., Putin, E., & Zhavoronkov, A. (2016). Applications of deep learning in biomedicine. Molecular pharmaceutics, 13(5), 1445-1454.
- Angermueller, C., Lee, H. J., Reik, W., & Stegle, O. (2017). DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome biology, 18(1), 67.
- Rampasek, L., & Goldenberg, A. (2016). Tensorflow: Biology’s gateway to deep learning?. Cell systems, 2(1), 12-14.
- Rajkomar et al. (2018). Scalable and accurate deep learning with electronic health records. npj Digital Medicine, 1(1).