Demystifying Big Data: Skytree Brings Machine Learning to the Masses

by Greg Emmerich, UW Madison M.S. Biotechnology Program. Advanced Biotechnology: Global Perspectives. Thesis Paper. April 16th, 2013.


The Digital Revolution has created a knowledge-based society reliant upon a high-tech global economy.  The pace of innovation has been exponential, leaving some to wonder what possibilities the future may hold.

Big Data is the term given for collections of data sets that are too large and complex for traditional hands-on data management and processing.  The term comes from the realm of information technology, but across an increasing number of fields, scientists are encountering situations that fit the category of Big Data.  Astronomy, genetics, and proteomics are a few of the fields beginning to feel the pressure for managing their data effectively.

There are numerous technical challenges going into setting up a system to process Big Data in reasonable amounts of time.  Machine learning algorithms present great potential in their ability to tease out hidden relationships among data sets and make predictions, but these analyses require distributed computing clusters capable of communicating intermediate results between tasks.

There are numerous commercial offerings for processing Big Data using machine learning, but Skytree rises above the pack with its customized, state-of-the-art machine learning algorithms and new data representations.

Numerous social, legal, and ethical concerns should be taken into account as machine learning is applied to more areas.  The use of advanced analytics will soon become the norm in biotechnology, giving small laboratories and pharmaceutical companies alike the ability to harness the power of machine learning.

Table of Contents

Introduction and Market Analysis
1.1         Statement of the technology and its significance
1.2         Commercial products / technologies / applications
1.3         Brief competitive analysis
1.3.1     Intellectual property summary
1.3.2     Blocking and competing patents
1.3.3     Options / strategy
1.4         Working hypothesis for research
Technology Overview
2.1         Science of data structure systems and machine learning algorithms
2.1.1     Computer Science Background
2.1.2     Machine Learning Background
2.1.3     Machine Learning Discussion
2.2         Practical applications to Biotechnology
Critical Analysis & Recommendations
3.1         Findings and their Implications
3.2         Recommended applications
3.3         Implementation strategies
3.3.1     Challenges and considerations
3.3.2     Global and international considerations
Methods – Resources used for Research
Summary and Conclusions

1      Introduction and Market Analysis

Robots will take over the world. Dystopian alternate realities aside, the conclusion seems inevitable, and the field of biotechnology will serve as the “App Store” for human biology. Would you like your genome analyzed? Your symptoms diagnosed? Your antibody drug designed? There’s an algorithm for that.  The shocker? That future may not be too far away.

Looking at the course of human history, it can be observed that there has been one major factor driving us towards this future: communication. Technological advancement relies upon the communication of ideas and knowledge.  Machine learning is the next great leap forward for biotech, allowing scientists to gain insights about disease never before thought possible.  The pace of communication advancements has been exponential, and parallels the rate of cosmic evolution:

Considering the Big Bang as January 1st and the present as midnight on December 31st, the Sun didn’t form until September 1st. Multicellular life began on Earth December 5th, and primates evolved by December 30th. Cave paintings date back some two minutes ago, the alphabet originated 11 seconds ago, and Christopher Columbus sailed to North America one second ago. The telephone? A short 0.3 seconds ago. The World Wide Web? About the length of a single beat of a hummingbird’s wings, or 0.06 seconds.

The telescopic history of the universe is brilliantly illustrated by Carl Sagan’s cosmic calendar.  Accumulated knowledge sows the seeds for a future harvest of even greater knowledge.  Collectively, ideas building upon ideas generate greater technological capacity.  Advancements have lowered the barriers of time and geography for communication, giving humans an unprecedented level of connection to each other and access to the vast wealth of information in the world.

The crescendo of these knowledge feedback loops is predicted to be a technological singularity, in which an emerging artificial superintelligence surpasses human intelligence. Plenty of futuristic fantasies from decades past like flying cars and underwater cities have not come true, so attributing a hypothetical superintelligence to increased computing power alone may be an over-simplification of the problem. As we have seen with the internet, the sheer power of total access has given way to specific applications that create more significant value for the individual. The emphasis here is not on the information itself, but on the access and interpretation of information. Real progress is being made on both of these fronts in the field of machine learning.

This paper will give a background on what machine learning is and then provide examples to give an appreciation for its prowess. Focus will then shift to the capabilities of Skytree, a new start-up company that provides Big Data management and machine learning analysis.  Skytree will be the prime example covered throughout this paper. To give perspective on Skytree, the science of machine learning will be more thoroughly discussed along with the capabilities of competing technologies.  Further examples will then be given for how machine learning specifically relates to biotechnology. To conclude, a broader picture will be drawn about the implementation of machine learning in biotechnology, as well as the related social, legal, and ethical concerns.

1.1    Statement of the technology and its significance

Statistical analysis offers a range of functions that can be applied to data in order to draw conclusions, from simple functions like standard deviation to complex ones like multivariate regression.  Machine learning is a form of advanced analytics, applied through algorithms, that discovers patterns and makes predictions from complex data.  These algorithms can process large amounts of data more efficiently than humans can, although they are much more computationally intensive than simple analytics.  Computing power has been doubling roughly every 18 months, in line with Moore’s Law.  This is fortunate because data generation has also increased at an exponential rate: the Sloan Digital Sky Survey collected more data in its first few weeks back in 2000 than had been collected in the entire history of astronomy. Over 10 years it amassed 140 terabytes of information. The next-generation Large Synoptic Survey Telescope, due in 2016, will collect that amount of data every five days (Economist, 2010).

This is the era of Big Data.  Big Data is the term given for collections of data sets that are too large and complex for traditional hands-on data management and processing.  There has been much progress on this front, but the elephant in the room is how to apply the advanced analytics of machine learning to Big Data.  Skytree makes progress here through their new data structures upon which their custom, advanced algorithms can process 10-10,000x faster than competing technologies in K-means clustering and SVM (Gray, 2013).  Furthermore, Skytree’s technology is readily accessible to organizations without expertise in data science.  As more scientists in other fields start to apply machine learning, there will be great increases to the pace of innovation.  Machine learning is going to have huge, positive effects on cost reduction, timeliness, and accuracy for all different areas of the biotechnology and pharmaceutical industries.

1.2    Commercial products / technologies / applications

Machine learning algorithms have already penetrated people’s daily lives, whether or not they realize it. Perhaps most easily recognized is the online search engine, of which Google’s PageRank is the best known of the search algorithms.  Google has also successfully experimented with applying machine learning algorithms to a self-driving car.  Entertainment companies like Netflix, Pandora and GoodReads as well as commercial sites like Amazon regularly track user activity in order to personalize recommendations for which movies to watch, new music to listen to, new books to read, and complementary products to snag.  Face recognition technology helps tag friends in photos on Facebook.  Apple’s Siri uses voice recognition to improve the user experience for their iPhones.  The Roomba uses spatial mapping to vacuum floors.  Even in games like chess and Jeopardy, humans are bested by robots.

Behind the scenes, robots (both of the physical and software varieties) have helped improve the American economy.  Companies in the US are staying more competitive against outsourcing to China and India through machine learning. Quiet Logistics uses robots to greatly improve the efficiency of storing and retrieving physical goods in warehouses. Increased automation has done the same for all sorts of companies, including biotech with liquid handling and pharmacies with filling prescriptions. Rethink Robotics has unveiled Baxter, which can learn new manufacturing tasks in a matter of minutes, and work safely around humans with its spatial recognition (Radliffe and Gavrilovic, 2013).  Wall Street has taken on high-frequency algorithmic trading with companies like Renaissance Technologies.

Machine learning has helped keep people healthy and safe.  Doctors will be working alongside computers with vast databases of medical knowledge to link symptoms to disease and help guide treatment. TUG robots work at 150 hospitals, where they deliver meals and medication, take dirty linen to get cleaned, and avoid collisions in the hallways. Big Jim is a bomb-defusing, tear gas and rubber bullet-wielding 485-pound member of the Lane County, Oregon police department.

Still think there are sacred areas that are just too human for robots to take over?  Shimon, a Georgia Tech robot, duped audiences into believing they were listening to a live musician. MindMentor was developed to help patients seeking a psychologist.  With lower barriers to engagement, nearly half of patients going through a one- to two-hour session claimed their problems were solved. The Popchilla helps children with autism develop social skills and respond to facial expressions. RUBI helped toddlers increase their word mastery by 25%. The Vangobot takes digital images and paints a replica through influences like postimpressionism and pop art, and its works have been sold at Crate & Barrel. The cleverly named Data custom tailors its delivery of jokes based on audience feedback. China’s Dalu Robot Restaurant has replaced 80 waiters with 20 robots and 0 of the drama (Kelly and Dutton, 2013).

Genome interpretation, biomarker discovery, data analysis and drug design are just a few of the ways machine learning algorithms will improve biotechnology. These will be further explored throughout the paper.

1.3    Brief competitive analysis

Data processing is a very large industry with historical trends of high competition. The net 2012 revenues were $83.8 billion, with a five-year 2.4% annual growth rate that is expected to continue as more companies outsource their IT needs to third parties. Mergers and acquisitions are expected to continue to be frequent.  Cloud computing is the fastest growing segment of the data processing industry. Small companies are especially likely to benefit from cloud computing’s ease of use and costs that scale with usage. The major revenue comes from large clients with massive amounts of data and little internal expertise. The increasing complexity of data and the exponential costs to manage that data make outsourcing to companies with expertise in data management especially attractive (Krabeepetcharat, 2013).

There is a significant number of direct competitors for Skytree. A total of 12 were investigated, although numerous other companies offer features similar to those listed here (see an extensive list of database solutions). Table 1 summarizes these competitors.  Some of the technologies these companies use, along with various open source options for machine learning like Hadoop, are discussed further in section 2.1.3.

Table 1: Direct Competitors

Company | Commercialization State | Technology Capabilities
Aster | 16+ customers; Teradata Aster Discovery Platform | Visual, interactive, fast, big analytic applications with minimal time and effort. SQL MapReduce, Hadoop
Greenplum | 23+ customers; Greenplum UAP | Massively parallel processing, Hadoop distribution, collaboration/sharing, and MapReduce M5
Cloudera | 48 customers; also open source versions | Hadoop, real-time query
Netezza | Acquired by IBM in 2010 for $1.7 B | Data warehouse with simple optimization and minimal maintenance
BloomReach | 160 customers; Relevance Engine | Hadoop, Lucene, Monte Carlo simulations and large-scale image processing
Continuuity | Private beta for App-Fabric | Make it easier to build applications that can leverage both cloud computing and big data technologies
Odiago | Private beta for Wibidata. Customers: Wikipedia, RichRelevance, FoneDoktor and Atlassian | Helps websites better analyze their user data to build more-targeted features. Uses Hadoop and HBase
Platfora | Fractal Cache and Adaptive Job Synthesis technology. Customer: and others | Hadoop-based analytics and Fractal Cache technology with intuitive interface, data visualizations, and predictive analytics
Cloudscale | Patented Cloudblocks technology | MapReduce programming model focusing on real-time processing and fault tolerance
VoltDB | Enterprise and Community edition VoltDB; 12+ customers | newSQL distributed database system
Clustrix | DBaaS; 14+ customers | Scale-out SQL database that integrates with Amazon Web Services, or a public cloud database, localized data centers, or the stand-alone software.
Cloudant | DBaaS; 4+ customers | Erlang, Java, Scala, C; Apache CouchDB, MapReduce, and JSON store

Services offered by companies in the data processing industry are highly similar, which increases competition. As long as an algorithm or software is accomplishing a technical challenge in a novel way, it is patentable in Europe. Software patents are more lenient in the US, Japan and elsewhere, additionally covering software that solves business challenges. However, any protection gleaned from patents is quite limited since experts in this field can find numerous ways to get around particular claims to accomplish the same goals, as evidenced by the amount of alleged infringement in the “smartphone wars” between Apple, Nokia, Samsung, and several other companies. Patents are still useful for larger companies that have the resources to actively seek out and enforce their rights.  Reputation is perhaps an overlooked competitive factor in the wake of such immense legal battles, but it is important for determining success in the marketplace.

1.3.1    Intellectual property summary

Skytree currently does not have any granted patents or applications, although Chief Technology Officer Alexander Gray is one of the inventors for an ovarian cancer biomarker patent application through Georgia Tech Research Corporation.

1.3.2    Blocking and competing patents

Table 2 gives an incredibly thin slice of the immense pie that is machine learning and data management patents. Total software patents have been steadily increasing since 1986, as seen in Figure 1. Skytree would need to conduct a more precise patent search based upon their specific technology to see if any existing patents would limit their freedom to operate.

Table 2 – Sample of machine learning patent landscape

Assignee | Filing Date | Patent Number | Patent Title | Claims
IBM | Oct. 1993 | US 5461699 | Forecasting using a neural network and a statistical forecast | A statistical model learning from historical data to compute a forecast
Paul J. Werbos | Jun. 1997 | US 6169981 | 3-brain architecture for an intelligent decision and control system | Computer neural network to control external devices, data visualization and data mining
IBM | Sep. 1997 | US 6182123 | Interactive computer network and method of operation | Distributed data processing
Sony | Aug. 2002 | US 6434540 | Hardware or software architecture implementing self-biased conditioning | System responds to inputs to trigger modified actions
Japan Science and Technology Corp. | Mar. 2003 | US 6529887 | Agent learning machine | Unsupervised machine learning system
Cloudscale Inc. | Dec. 2008 | US 8069190 | System and methodology for parallel stream processing | Continuous parallel processing using clustered computing
Dillenberger et al. | Jun. 2009 | US 8386930 | Contextual data center management utilizing a virtual environment | Data center management with machine learning techniques


Figure 1:
US Patents Graph
Geoff Dallimore, 2007, Wikimedia Commons

1.3.3    Options / strategy

Skytree should consider pursuing patent protection if no prior art encompasses their unique algorithms. Without knowing the details of how their technology works, it is hard to compare to the vast amount of machine learning and data management patents that have been granted. Skytree should move quickly to file for a patent, but they should thoroughly cover all aspects of their technology to protect themselves long into the future.

Seeking acquisition by a large technology corporation would seem to be a promising long-term strategy. This was the case for Netezza being acquired by IBM.  All signs point towards a continuously growing market demand, which would make Skytree’s technology an extremely valuable asset for a number of large companies. Considering the fierce competition in the industry, they will likely only be interested if they know the technology they are acquiring is protected through patents.

1.4    Working hypothesis for research

Machine learning will be essential for the biotechnology industry to adopt in order to stay competitive, and services like those Skytree offers will help accelerate the pace of innovation.

2      Technology Overview

Machine learning is an advanced field, and there are many underlying technologies at play necessary for these algorithms to function such as data structures, networking technology, distributed computing, and cloud-based databases.  The science behind these enabling technologies will be discussed, followed by examples of machine learning algorithms to illustrate just how they function.  Real-world application of machine learning algorithms requires customizing the algorithms to the particular situation at hand.  Skytree excels at this point by reducing the burden of customization on their clients.

2.1    Science of data structure systems and machine learning algorithms

2.1.1    Computer Science Background

Computer science has been essential for progress in every other modern scientific discipline. Computer science also has come to change the daily lives of billions of people and the culture of nations worldwide.  A large part of this can be attributed to recent advances in processing power, together with decreases in the size and cost of computation.  To understand the significance of computational power for machine learning, the basics of data structures will first be explained.

A data structure is simply an organization of information in a computer.  Efficient data structures are paramount for designing efficient algorithms.  This importance is amplified as datasets become larger.  Common data structures are the queue, stack, linked list, tree, heap and dictionary (Mehlhorn and Sanders, 2008).  A queue is “first in first out”: new items are added to the back end, and only the oldest (front-end) item can be processed or deleted.  A queue serves as a kind of buffer for processing incoming data. A stack is the opposite of a queue, or “last in first out.” Linked lists consist of a sequence of nodes (data entries) that are each linked to the next node.  Linked lists are quite common data structures because new data points are easy to insert and delete.  A tree is a data structure where the data items are connected to a base root item.  A node that has other nodes linked beneath it is called a parent (a “branch”), the nodes linked beneath it are called its children, and endpoint nodes with no children are called “leaves.”  A heap is a specialized tree where every parent node has a key value at least as extreme as those of its children, so the root holds either the highest or the lowest key value. A key value could be numerical or alphabetical. Heaps are useful for data sorting and selection, and graph algorithms.  A dictionary, or associative array, is a collection of (key, value) pairs, such as (lastName, Emmerich).
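As a rough illustration of the structures described above, the following Python sketch exercises each one with the standard library. The variable names and sample values are made up for illustration, and the min-heap shown is only one of the two heap orderings.

```python
from collections import deque
import heapq

# Queue: first in, first out. Items leave in the order they arrived.
queue = deque(["a", "b", "c"])
first_out = queue.popleft()          # oldest item

# Stack: last in, first out. The newest item leaves first.
stack = ["a", "b", "c"]
last_out = stack.pop()               # newest item


# Linked list: each node stores a value and a link to the next node.
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


head = Node("b", Node("c"))
head = Node("a", head)               # inserting at the front changes one link

linked_values = []
node = head
while node is not None:              # walk the chain from head to tail
    linked_values.append(node.value)
    node = node.next

# Heap: heapq keeps the smallest key at the root (a min-heap).
heap = []
for key in [5, 1, 4, 2]:
    heapq.heappush(heap, key)
smallest = heapq.heappop(heap)

# Dictionary (associative array): a collection of (key, value) pairs.
record = {"lastName": "Emmerich"}
```

Here `first_out` is "a" while `last_out` is "c", which is the whole difference between a queue and a stack.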

Interactions with databases fall into four main groups: data definition, update, retrieval and administration.  Data definition consists of creating new data structures, deleting them, or modifying them.  The data itself (within database structures) can similarly be updated by inserting new, deleting, or modifying data. Retrieval can be used for processing data elements (like with machine learning algorithms) or just simply for end-user queries. Administration is a little more complex: database managers can create new users and modify their accounts, monitor system performance, and maintain data integrity and security.  An even more complicated administrator task is concurrency control.  When a program or algorithm is processing numerous operations, they may be using the same data for those operations.  It is important that there are rules and methodologies for maintaining consistency and correctness, even though this reduces performance speed (Mehlhorn and Sanders, 2008).
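The first three kinds of interaction can be sketched with Python's built-in `sqlite3` module and an in-memory database. The table name, column names, and values are hypothetical, and administration tasks such as user management are not shown because they depend on the database system in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()

# Data definition: create a new structure to hold the data.
cur.execute("CREATE TABLE proteins (name TEXT PRIMARY KEY, size_kda REAL)")

# Update: insert, modify, and delete data within that structure.
cur.execute("INSERT INTO proteins VALUES (?, ?)", ("protein_a", 8.8))
cur.execute("INSERT INTO proteins VALUES (?, ?)", ("protein_b", 14.3))
cur.execute("UPDATE proteins SET size_kda = ? WHERE name = ?", (9.0, "protein_a"))
cur.execute("DELETE FROM proteins WHERE name = ?", ("protein_b",))
conn.commit()

# Retrieval: query the data for an end user or a downstream algorithm.
rows = cur.execute(
    "SELECT name, size_kda FROM proteins ORDER BY name"
).fetchall()
conn.close()
```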

When databases grow in size, it becomes increasingly important to dedicate more resources towards their operation.  Database servers are groups of computers dedicated to hosting and running databases, with accelerated capabilities for multiprocessing.  These can be run with standard operating systems or custom interfaces to control the computers’ hardware to allocate memory access (RAM), schedule and switch between tasks (CPU), and coordinate activities with other devices. Multitasking is a bit of an illusion because single CPUs can really only process one task at a time, but the scheduling of tasks through time-sharing allows multiple tasks to be completed basically at the same time. Preemptive multitasking sets limits to how much time each task can use the CPU.

Memory management is also very important for maintaining optimal performance. Virtual memory allows for operating systems to use the same memory locations for different tasks. It does this by virtualizing the physical memory to give it a virtual “address”, whereby the system can more easily adjust the allocated memory range of the task. That is, memory can be swapped between different areas. Memory that is accessed less frequently can be stored on the slower-access hard disk so that other programs needing rapid memory access have more to work with.  Memory virtualization is also very beneficial when dealing with networked computers (Mehlhorn and Sanders, 2008).

Networking allows computers to gain access to the resources of other computers (hardware processing power as well as data) as if those resources belonged locally to that computer.  Sharing memory allows for increased performance and memory utilization efficiency on all the computers.  The basis behind parallel computing is that complex tasks can be broken down into smaller pieces and then solved simultaneously (in parallel) among networked computers. In a Beowulf configuration (see Figure 2 below), the individual computer nodes are controlled by a single server (the “master”).  Significant benefits from this include increased performance, lower cost, and their simplicity in scaling computing power to meet customized needs.  Various security measures limit the access from external sources in order to protect the computational system.

Figure 2: Clustered Computers in Beowulf Configuration

Beowulf clustered computers diagram

Mukarramahmad, 2008, Wikimedia Commons
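The divide-and-combine idea behind parallel computing can be sketched in a few lines of Python. This is only an analogy: threads on one machine stand in for the networked nodes, and the function names are invented for this example. A "master" function splits the work, hands the pieces to workers, and combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, n_chunks):
    """Break a list into at most n_chunks pieces of nearly equal size."""
    size = max(1, (len(values) + n_chunks - 1) // n_chunks)
    return [values[i:i + size] for i in range(0, len(values), size)]


def worker_sum_of_squares(chunk):
    """The small task each worker solves independently."""
    return sum(v * v for v in chunk)


def master_sum_of_squares(values, n_workers=4):
    """Split the task, farm the pieces out, then combine the partial results."""
    chunks = split_into_chunks(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(worker_sum_of_squares, chunks))
    return sum(partial_results)


total = master_sum_of_squares(list(range(1, 101)))
```

On a real cluster the workers would be separate machines and the communication cost between them would matter. That cost is why tasks that split cleanly into independent pieces benefit most.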

Cloud computing works in much the same way as traditional computing. There is still a server running some sort of distributed computing and database management, and there is remote access through a network (i.e. the internet). It can be thought of much like accessing a regular website: a user makes a request to the server each time they visit a new web page, and the information that is relayed back is encoded in languages like HTML, CSS and JavaScript, which web browsers interpret before using the local computer’s hardware to display a page that matches that information.  With cloud computing, the same thing happens except the information request could be more significant, like a request to update a database or to run an algorithm.  The benefit of cloud computing is that it reduces the burden of technical challenges on a company, especially smaller companies that do not have the resources to acquire the necessary infrastructure.  Also, keeping the software in a centralized location allows updates to be implemented instantaneously across all platforms, without requiring users to download program updates.  The benefits of cloud computing are only beginning to be felt as companies start to develop the capabilities and specific applications of cloud computing.  For example, many websites now offer cloud storage for individuals’ documents, pictures, music, and other files, which individuals can access from any platform with an internet connection.  Figure 3 gives a conceptual overview of cloud computing.

Figure 3: Cloud Computing Overview

Cloud computing diagram

Sam Johnston, Wikimedia Commons, 2009

2.1.2    Machine Learning Background

To carry on with the theme of similarity between technologies, machine learning algorithms are not all that different from normal database functions.  They are, however, much more computationally intensive. Oftentimes the nature of learning algorithms is exploratory: they seek to discover patterns in a sea of data to help make future predictions. Because these algorithms are “learning”, they are essentially modifying their own functions over time. An example of this would be adding instances of what constitutes “spam” for an email filtering algorithm. Other examples will be given later on.  Machine learning algorithms first take an initial set of data as a “learning set” to generate a mathematical model, and then a “validation set” to check that the model is functioning as it should. If successful, the model is applied to unknown data sets to draw out generalizations.  Performance at these tasks should improve as more data is passed through the algorithm.  Machine learning can fall under the labels of “supervised” or “unsupervised”, with the difference being whether the conclusions are known (categorizing “spam” or “not spam”) or unknown (data mining to discover consumer shopping habits).
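To make the learning set and validation set idea concrete, here is a deliberately tiny supervised spam filter in Python. The messages, labels, and word-count scoring rule are all invented for illustration. Real filters use far more data and more careful statistics.

```python
from collections import Counter


def train(learning_set):
    """Build the model: count how often each word appears under each label."""
    spam_words, ham_words = Counter(), Counter()
    for text, label in learning_set:
        target = spam_words if label == "spam" else ham_words
        target.update(text.lower().split())
    return spam_words, ham_words


def classify(model, text):
    """Label a message by which class its words were seen in more often."""
    spam_words, ham_words = model
    words = text.lower().split()
    spam_score = sum(spam_words[w] for w in words)
    ham_score = sum(ham_words[w] for w in words)
    return "spam" if spam_score > ham_score else "not spam"


def accuracy(model, validation_set):
    """Fraction of held-out messages the model labels correctly."""
    correct = sum(classify(model, text) == label for text, label in validation_set)
    return correct / len(validation_set)


# Hypothetical labelled examples used to build the model.
learning_set = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting notes for the lab", "not spam"),
    ("lab results for the assay", "not spam"),
]
# Held-out examples used only to check the model.
validation_set = [
    ("claim your free prize", "spam"),
    ("notes for the assay meeting", "not spam"),
]

model = train(learning_set)
validation_accuracy = accuracy(model, validation_set)
```

Adding further labelled messages to `learning_set` and retraining is the "learning" step: the counts that drive the model change as more instances are seen.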

Linear or one-dimensional mathematics represents the easiest statistical analyses to undertake, such as computing the mean, counting the number of data instances (length), and computing covariances. The more dimensions and dependencies between the variables, the more complex the algorithm becomes and the more time is necessary to run it.  Advanced analytics include predictive analytics, data mining, pattern recognition, and multivariate statistics. Machine learning algorithms are powerful methods to perform advanced analytics.  Advanced analytics may be used on the small scale with a wide host of different software platforms, but their intensive computational requirements make them poorly suited to handling large amounts of data (Exforsys, 2006).

Big Data is the term given for collections of data sets that are too large and complex for traditional hands-on data management and processing. The term Big Data comes from the realm of information technology, but across an increasing number of fields, scientists are encountering situations that fit the category of Big Data.  Astronomy, proteomics, genetics, and meteorology are a few of the fields beginning to feel the pressure for managing their data effectively.

The problems with Big Data come in three flavors of “V” – volume, variety, and velocity.  They are, respectively, the sheer amount of data to be processed, the nature of the data being unstructured or structured, and the need to process this data in reasonable amounts of time in order to be useful (Gray, 2012).  Unstructured data is especially challenging for algorithm design because there isn’t a defined place to look for known values.

Other issues with machine learning algorithms are interpretability, incorporation of domain knowledge, handling missing data, and uncertainty estimates, which are all mostly related to which particular algorithm is chosen. The tasks that machine learning can accomplish fall into seven major categories (Mehlhorn and Sanders, 2008).

1) Querying: interrogation of data is often multivariate, meaning a search on a protein molecule library could look for protein size, heat stability, and renal clearance at once. An example of this would be the nearest-neighbor algorithm, where a search could look for the proteins closest to a start point with heat stability around 37° C and size around 8.8 kDa.
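A nearest-neighbor query of this kind can be sketched as a sort by distance from the target values. The protein library below is hypothetical, and the sketch measures plain Euclidean distance on the raw numbers. In practice the features would usually be rescaled first so that one unit (degrees) does not dominate another (kDa).

```python
import math

# Hypothetical library entries: (name, heat stability in °C, size in kDa).
library = [
    ("protein_a", 36.5, 9.0),
    ("protein_b", 42.0, 8.7),
    ("protein_c", 37.2, 15.0),
    ("protein_d", 55.0, 30.0),
]


def nearest_proteins(library, target_temp, target_size, n=1):
    """Return the n entries closest to the target in (temperature, size) space."""
    def distance(entry):
        _, temp, size = entry
        return math.hypot(temp - target_temp, size - target_size)

    return sorted(library, key=distance)[:n]


closest = nearest_proteins(library, target_temp=37.0, target_size=8.8, n=2)
closest_names = [name for name, _, _ in closest]
```

Sorting the whole library is fine for a handful of entries. For very large libraries, tree-based indexes are the usual way to avoid comparing against every item.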

2) Density estimation: this is used when the input (or independent variable) is not known but is instead created from information derived from the data. This requires larger data sets to structure the model before estimates can be made. An example of density estimation is with testing blood glucose levels to determine if patients are likely to have diabetes (see figure 4).

Figure 4: Diabetes density estimation

Density estimation with diabetes

In the top graph, the red line represents the measured blood glucose level (BGL) distribution of Pima Indians with diabetes, the blue line represents measured patients without diabetes, and the black line represents the net distribution for all patients. To calculate the probability that a particular BGL indicates diabetes (db), this graph is transformed using Bayes’ rule. For each point, say at a BGL of 150, the probability is estimated: (0.01*0.4) / [(0.01*0.4) + (0.005*0.6)] = 0.57, or written in words, (probability of 150 BGL given a person has db * % of population with db) / [(p 150|db * % with db) + (p 150|no db * % no db)]. Worth noting is that the Pima Indians have one of the highest incidence rates for diabetes among population groups. (image: Wile Heresiarch, 2004, Wikimedia Commons)
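The Bayes' rule calculation in the caption can be checked with a short Python function. The function and parameter names are chosen here for illustration. The inputs are the caption's own figures: densities of 0.01 and 0.005 at a BGL of 150, and a 40% share of the population with diabetes.

```python
def posterior_diabetes(p_bgl_given_db, p_bgl_given_no_db, prior_db):
    """Probability of diabetes given a BGL reading, by Bayes' rule."""
    prior_no_db = 1.0 - prior_db
    numerator = p_bgl_given_db * prior_db
    # Total probability of seeing this reading, with or without diabetes.
    evidence = numerator + p_bgl_given_no_db * prior_no_db
    return numerator / evidence


p_db_at_150 = posterior_diabetes(0.01, 0.005, prior_db=0.4)
```

The result is 0.004 / 0.007, or about 0.57. The denominator must include both terms, otherwise the ratio would not be a probability.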

3) Classification: perhaps the easiest to understand, classification is readily used in many areas like qualifying an email as “spam” or “not spam” based on the number of occurrences of certain words, or in classifying blood type by expression of particular antigens. Some important terminology here is that observations are called “instances”, their variables are called “features”, and the categories that could be predicted by the algorithm are called “classes.”  This method requires a training set of data to learn what features each class has, which is known as “supervised learning.” Examples of classification algorithms are decision trees (think of “20 questions”), neural networks (like biological neurons) and k-nearest neighbor classifiers (see figure 5).

Figure 5: K-nearest neighbor example

K-nearest neighbor classification

Given data points previously classified as a blue square or red triangle, what should this new green circle be classified as? With k=3, the circle would be considered a red triangle, but if k was increased to the 5 nearest neighbors, the circle would be considered a blue square. (Antti Ajanki, 2007, Wikimedia Commons)
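The effect of k described in the caption can be reproduced with a minimal k-nearest neighbor classifier. The coordinates below are invented so that the majority flips between k=3 and k=5. They are not the positions in the figure.

```python
import math
from collections import Counter


def knn_classify(training_points, query, k):
    """Label the query by majority vote among its k nearest training points."""
    by_distance = sorted(
        training_points, key=lambda item: math.dist(item[0], query)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical previously classified points: ((x, y), class).
training_points = [
    ((1.0, 0.0), "red triangle"),
    ((0.0, 1.2), "red triangle"),
    ((-1.1, 0.0), "blue square"),
    ((0.0, -2.0), "blue square"),
    ((-2.1, 0.0), "blue square"),
    ((3.0, 3.0), "red triangle"),
]
query = (0.0, 0.0)   # the new "green circle"

label_k3 = knn_classify(training_points, query, k=3)
label_k5 = knn_classify(training_points, query, k=5)
```

With k=3 the two closest triangles outvote one square. With k=5 the next two neighbors are both squares and the vote goes the other way, so the choice of k is itself a modelling decision.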

 4) Regression: linear regression is a straightforward comparison of how a dependent variable y is affected by (correlates to) an independent variable x, and is calculated by minimizing the sum of squared distances from the data points to the regression line (see figure 6). Another example of this is kernel regression, which finds the nonlinear relationship between two random variables.

Figure 6: Linear regression

Linear Regression

Linear regression, or finding the line of best fit, takes the square of the distance (red line) from each point to the estimated line (blue) and minimizes the sum of these squared distances. (modified from Schutz, 2010, Wikimedia Commons)
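A minimal sketch of the line of best fit, assuming ordinary least squares (the distance being squared is the vertical distance from each point to the line) and using made-up data points. The closed-form slope and intercept below are the standard textbook solution, not something specific to any package discussed in this paper.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Quantity that the fitted line minimizes."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical measurements.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

slope, intercept = fit_line(xs, ys)
print(round(slope, 2), round(intercept, 2))  # 1.94 0.15
```

Any other line, for example one with a slightly different slope, gives a larger value of `sum_squared_residuals` on the same data.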

5) Dimension reduction: principal component analysis (PCA) is used to reveal the inner structure of data in a way that most clearly explains the variance of the data.  PCA can be more easily understood by figuring out how to take a picture of a teapot in order to get the most valuable information about it (see figure 7).

Figure 7: PCA teapot example

Teapot Dimension Reduction

The goal of PCA is to get the most informative viewpoint of the data, the one that shows the maximum spread of the data, like this picture of a teapot. The object/data set is rotated about its center to get the largest spread across the x-axis, and then again for the y-axis. The spread along each of these axes is given by an eigenvalue of the data’s covariance matrix (James X. Li, 2009, VisuMap Tech.)
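For two-dimensional data the rotation described above can be worked out by hand, since a 2x2 covariance matrix has a closed-form eigen-decomposition. The sketch below assumes that simplified 2D case; the function name and the sample points are hypothetical, and real analyses would normally rely on a numerical library instead.

```python
import math


def pca_2d(points):
    """Return ((largest, smallest) eigenvalues, unit vector of the first component)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample covariance matrix [[a, b], [b, c]] of the centered data.
    a = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    spread = math.hypot((a - c) / 2, b)
    largest, smallest = half_trace + spread, half_trace - spread
    # Eigenvector belonging to the largest eigenvalue.
    if b != 0:
        vx, vy = b, largest - a
    elif a >= c:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    return (largest, smallest), (vx / norm, vy / norm)


# Hypothetical points lying roughly along a diagonal.
eigenvalues, direction = pca_2d([(0, 0.1), (1, 0.9), (2, 2.2), (3, 2.8)])
print(eigenvalues, direction)
```

The first eigenvalue is much larger than the second here, which says that a single rotated axis (the "best camera angle") captures almost all of the variance.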

6) Clustering: the most common algorithm for clustering is k-means, which breaks data sets into k groups (see figure 8).

Figure 8: K-means clustering example

k-means clustering example

To start, k random “seed” data points are generated (black circles) among a data distribution (red circles). A perpendicular line is drawn at the linear midpoint between each pair of these k seeds. The data points that fall into the same area as a seed are clustered into the same group. Each seed is then moved to the centroid of its cluster’s data points, and the boundaries between the clusters are recalculated. (Jigsaw Academy, 2012)
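The procedure in the caption can be sketched as follows. Assigning each point to its nearest seed is equivalent to drawing the perpendicular boundary at the midpoint between seeds. As assumptions for reproducibility, the seeds are passed in explicitly instead of being generated at random, the loop repeats until the centroids stop moving, and the data points are invented.

```python
import math


def k_means(points, seeds, max_iterations=100):
    """Cluster `points` around len(seeds) centroids; returns (centroids, clusters)."""
    centroids = [tuple(map(float, s)) for s in seeds]
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:  # nothing moved, so the clustering is stable
            break
        centroids = updated
    return centroids, clusters


# Hypothetical data: two visually separate groups, with deliberately poor seeds.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, seeds=[(1, 1), (2, 1)])
print(centroids)
print(clusters)
```

Even though both seeds start inside the lower-left group, the second centroid is pulled toward the upper-right points after the first update, and the two groups separate within a few iterations.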

7) Testing and matching: evolutionary taxonomy was one of the earliest applications using a minimum spanning tree (MST). This was used to quantify similarity between bacteria strains based on the number of features they shared. Strains that have more similar features are more closely related evolutionarily. Figure 9 shows what a MST could look like.

Figure 9: Minimum Spanning Tree

Minimum Spanning Tree

A minimum spanning tree connects every vertex of a graph using the set of edges with the smallest possible total weight (Dcoetzee, 2005, Wikimedia Commons)
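One common way to build a minimum spanning tree is Kruskal's algorithm, sketched below. The paper does not say which algorithm the taxonomy work used, so this is only an illustration; the graph is hypothetical, and the edge weights could be read as dissimilarity scores between four strains.

```python
def minimum_spanning_tree(vertex_count, edges):
    """Kruskal's algorithm. `edges` is a list of (weight, u, v) tuples."""
    parent = list(range(vertex_count))

    def find(v):
        # Follow parent links to the representative of v's component,
        # shortening the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for weight, u, v in sorted(edges):  # cheapest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # skip edges that would close a cycle
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


# Hypothetical graph on vertices 0..3.
edges = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]
tree = minimum_spanning_tree(4, edges)
print(tree)                        # [(1, 0, 1), (2, 1, 3), (3, 1, 2)]
print(sum(w for w, _, _ in tree))  # 6
```

A spanning tree on n vertices always has n - 1 edges, so the three edges kept here are enough to connect all four vertices.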

2.1.3    Machine Learning Discussion

The value of the seven listed machine learning algorithms comes from the more robust analyses they can perform. However, because the running time of most advanced analytics is a function of the number of data points squared (or worse, cubed), that value becomes hard to realize when the algorithms are applied to Big Data: the time needed to run them grows polynomially with the size of the data set and quickly becomes impractical on a single machine (Mehlhorn and Sanders, 2008; Gray, 2012). This means that massively parallel software running on tens to thousands of servers is necessary in order to deal with such large amounts of data.
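To get a feel for the scaling problem, consider a method that has to compare every data point with every other one. The short sketch below only counts those comparisons; the data set sizes are arbitrary examples, not figures from the cited sources.

```python
def pairwise_comparisons(n):
    """Number of distinct pairs among n data points: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"{n:>13,} points -> {pairwise_comparisons(n):,} pairs")
```

Doubling the number of points roughly quadruples the work for such a method, and a method whose cost grows with the cube of the data size fares worse still, which is why naive implementations stop being practical long before the data stops growing.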

There are several options available to handle the challenges of applying advanced analytics to Big Data. Current platforms for doing so include Hadoop, MATLAB, NoSQL, Microsoft Excel DataScope, the programming language R, Statistical Analysis System (SAS), IBM’s Statistical Product and Service Solutions (SPSS), and those provided by the listed competitors (Rodrigues, 2012). The most advanced and developed packages are MATLAB and R (see Appendix: Table 3 for a full comparison). Current hardware solutions include data warehouses, graphics processing units, and parallel tasking. Ultimately, which programming language to use in which kind of physical set-up depends on the specific application for machine learning. There is crossover between the different approaches and there is admittedly an overwhelming amount of personal opinion as to which is the best (O’Connor et al., 2009). The problem for all of these approaches with Big Data is that they were not designed for scalability from the ground up. The transport of data from wherever it is stored to the computer cluster for processing poses technical challenges that are costly to solve (Lith and Mattsson, 2010).

Hadoop is by far the most common tool of choice when it comes to machine learning, but the model has significant flaws with performance on advanced analytics. Hadoop is an open source implementation of the MapReduce model popularized by Google. The basis behind this is that tasks are broken down into smaller pieces and distributed to computer clusters (the “map”), and then the results are collected and unified (the “reduce”). Hadoop has proven useful and improves performance on some less complex tasks, which is not to understate the value of simple analytics. These tasks are deemed “embarrassingly parallel” when they are easily split into parallel tasks with minimal communication between each of those components. BLAST searches are examples of these tasks (Lin et al., 2005). Advanced analytics can require functions like complex joins, real-time processing, or interactive analysis. Hadoop is 10-10,000 times slower at these functions than next-generation Big Data architectures (McColl, 2010). Consider the following to illustrate: data is often asymmetrically distributed, meaning certain parts may be more closely connected than other parts. Since Hadoop is data agnostic (doesn’t care about the structure of the data), data that should be processed together will be separated onto multiple computer clusters, which creates major inefficiencies for a system that isn’t designed to communicate partial results to other units (Mone, 2013).
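The map and reduce steps can be illustrated with a toy word count that runs in a single Python process. This is not Hadoop's API: the function names, the intermediate "shuffle" step, and the input chunks are all assumptions made for the sketch, and nothing here is actually distributed.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs, independently of other chunks."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group the intermediate pairs by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse the values collected for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for a block of data held on a different machine.
chunks = ["big data big clusters", "data about data"]
counts = reduce_phase(shuffle(map_phase(chunk) for chunk in chunks))
print(counts)  # {'big': 2, 'data': 3, 'clusters': 1, 'about': 1}
```

Word counting is embarrassingly parallel because no chunk needs to know anything about the others until the very end. The difficulty described above arises when an algorithm needs intermediate results from other chunks in the middle of the computation.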

Other machine learning platforms suffer from similar shortcomings. MATLAB’s Statistics Toolbox can perform decision trees and k-nearest neighbor, and its Neural Network Toolbox can implement and visualize feedforward networks. However, the program is slow when it comes to scalability (Yura, 2010; Malangi, 2010; Hungry, 2010). Distributed non-relational databases, or NoSQL, have horizontal scaling capabilities (more computer nodes can easily be added to the system) but cannot guarantee the reliability of ACID (atomicity, consistency, isolation, durability). Specifically, consistency is of concern, and its absence increases the complexity of development (Pritchett, 2008). Microsoft knows its market well with a claim like “complex data analytics via familiar Excel interface.” That familiarity may be the biggest asset for the company as scientists and others who are longtime users of Excel start discovering the more advanced capabilities of the program. Microsoft focuses on user interface and experience, which is crucial when targeting non computer science-oriented markets. However, Excel will be significantly slower with large datasets and has incomplete statistics support (O’Connor et al., 2009). The SAS and SPSS data analysis packages are more limited in the analysis they can perform (see Appendix: Table 3). The SAS language is more restrictive, and the SPSS language is very simplistic (Stanford Consulting, 2013). The programming language R is best compared to MATLAB. While MATLAB is not open source like R, there is an open source alternative called Octave which covers the majority of MATLAB’s features. R has a steep learning curve and has no graphical interface or wrapper to call different algorithms and unify their output (Bischl et al., 2011). R is very popular, like MATLAB, and both platforms have seen consistent development of advanced machine learning capabilities.

Graphics processing units (GPUs) have recently been used in place of typical CPUs for algorithm computation. This is termed GPGPU (general purpose computing on GPU), and an open standard for programming them is OpenCL. The advantage of this is that GPUs are inherently more parallel than CPUs since they were designed to process graphical rendering (independent vertices and fragments), which does not require communication between the computational nodes. This is another example of an “embarrassingly parallel” specialization, which is a disadvantage when considering advanced analytics and the need to communicate intermediate results between nodes. Algorithms must be specially designed around GPUs to meet this constraint (Raina et al., 2009).

2.2    Practical applications to Biotechnology

Biotechnology is starting to feel the burden of massive datasets and the need of more sophisticated tools for managing and processing data.  Whole genome sequencing is perhaps in the forefront of this area. For more information about advanced analytical approaches to genome sequencing, see Emmerich, 2012. Other in vitro diagnostics are beginning to benefit from machine learning on large datasets.  Feature selection is very important to the efficiency and success of machine learning algorithms, and in the case of biotechnology, domain knowledge of genetics or relevant biomarkers for a particular disease will greatly influence algorithm design.  Protein engineering represents a significant area for improvement using machine learning.  Protein engineering is useful for drug development, improving protein function for industrial applications, and in deriving molecular tools to elucidate biological functions.  Three case studies will be discussed, including a group from UC Santa Cruz using machine learning to engineer more robust proteinase K derivatives, two companies (PreDx and Stemina Biomarker Discovery) using biomarkers to predict future onset of diabetes and pharmaceutical prenatal toxicity, and a review on machine learning’s impact on metabolomics.

The typical paradigm of blindly synthesizing a massive library of candidate drugs or proteins and then performing inexpensive, high-throughput assays upon it is starting to change. A pitfall of the traditional approach comes from the weak correlation of initial assay results to in vivo protein function. More robust molecular activity assays are more expensive, but when a massively high-throughput approach is not needed, this option becomes viable. The quicker biotech and pharmaceutical companies can narrow down their library to higher probability leads, the more cost-efficient they become. A case study for this is proteinase K. Proteinase K digests (breaks down) other proteins, which has found great utility in DNA purification, where it protects the DNA from degradation by digesting the nucleases that would otherwise break it down. Jun Liao et al. were able to increase both the heat stability and functional activity of proteinase K using machine learning in combination with synthetic biology. Using eight algorithms (ridge regression, Lasso, PLSR, SVMR, LPSVMR, LPBoostR, matching loss regression and ORMR), the researchers took an iterative approach of 1) selecting amino acid substitutions, 2) designing protein variants, 3) synthesizing respective genes, 4) expressing those proteins, 5) measuring functional activity, and 6) assessing the contributions of the amino acid substitutions to altered activity (Liao et al., 2007). This approach could prove useful for a host of other proteins and molecular targets in the drug discovery process, and as demonstrated, can create valuable insights with remarkably quick turnaround and at low cost.

There are several other current applications of machine learning to biotechnology. Rilonacept, an interleukin 1 inhibitor, was engineered by Regeneron Pharmaceuticals using fusion proteins to treat cryopyrin-associated periodic syndrome (drug brand name Arcalyst). Biomarkers are increasingly becoming a hot area of research, and for good reason: biomarkers can predict the impending onset of disease, as is the case at the company PreDx, which is working to develop early detection of type 2 diabetes. Another company, Stemina Biomarker Discovery, is measuring the toxicity of pharmaceutical compounds on fetal development. Animals have different responses to teratogens than humans do, which creates a need for improvement in this area. A study by Stemina used human embryonic stem cells (hESC) to more accurately predict toxicity, although their cardioTOX assay uses induced pluripotent stem cells (iPSC) (West et al., 2010).

Other traditional approaches to molecular biology are beginning to be overturned. In order to draw supported inferences from experiments, life scientists have been forced to minimize variables in tests and assays. In reality, a biological system is an intricately connected and finely tuned collection of genes, proteins, and metabolites. The metabolome, the full complement of metabolites within an organism, is highly regulated. That is to say, nothing in biology happens in isolation, so the traditional reductionist approach has many shortcomings. Technical limitations are being greatly reduced with advances in systems biology and machine learning. Machine learning presents numerous methods for multivariate analysis. Popular algorithms are discriminant analysis, partial least squares, artificial neural networks, evolutionary computation (genetic algorithms), and classification and regression trees. The levels of variation to account for in metabolome studies include higher-level population selection, sample preparation, instrumentation, and algorithm design (Hollywood et al., 2006). A related issue in advancing the field of metabolomics is standardizing and validating data in a centralized database that curates the most accurate information about biomarkers for all researchers to access. Advances in IT will be essential to make this a viable option in terms of efficiency of data transfer and in terms of comparing large datasets.

The practical application of machine learning to biotechnology is that personalized healthcare is increasingly becoming a reality, and the only limitation for future applications of machine learning is individual creativity.  The same methods described for improving protein function and studying the metabolome can be applied towards agriculture, for example.

3      Critical Analysis & Recommendations

3.1    Findings and their Implications

Machine learning algorithms are more efficient and cost-effective than humans at a wide variety of structured tasks.  The areas in which machine learning is proving to be useful are ever-increasing in the field of biotechnology.  Anyone wishing to implement machine learning must first carefully consider the nature of the problem and all the requirements they have for machine learning performance. This will ultimately dictate which algorithm(s) to choose, which hardware configuration to set up, and which software platform to pick to manage and implement the machine learning.  This paper presented a brief overview of those considerations.

An explosion in the number of companies offering Big Data services creates significant competition for Skytree.  Educating customers about the benefits of their platform over competitors will be necessary for success.  The most effective way to do so will be to increase the number of studies conducted using machine learning, clearly demonstrating the significant practical benefits of selecting such approaches.

The era of personalized medicine and rapid commercialization of scientific discoveries is quickly approaching. Machine learning will help reduce development costs while increasing accuracy.

3.2    Recommended applications

  1. Improved analysis of human genomes and biomarkers to predict disease onset
  2. Accelerating the speed and accuracy of drug development
  3. Improved identification of genes conferring drought and insect resistance in agriculture
  4. More robust data classification and clustering

3.3    Implementation strategies

3.3.1    Challenges and considerations

Certain areas of biotechnology will be able to more quickly adopt machine learning than other areas.  Individual researchers, for example, are starting to gain increased access to advanced analytics for small scale experiments without needing to devote the considerable amount of time to becoming an expert in algorithm design.  Genome wide association studies (GWAS) are at the cusp of implementation challenges, regularly feeling the impacts of massive data sets on system efficiency and performance.  Many of these challenges were discussed in the machine learning section of the technology overview.  Companies like Skytree are democratizing Big Data, allowing individuals to gain insights from their data without having to worry about maintaining expensive infrastructure or staying up-to-date on the latest advanced machine learning algorithms.

Other challenges for implementation continue to be worked on. The physical location of both the computer clusters on which machine learning algorithms are run and the databases themselves makes a difference in terms of performance. It would be inefficient to constantly be sending massive files over remote connections. Other challenges include handling missing or incomplete data during analysis, and standardizing algorithm design in order to reduce the level of variability between machine learning studies.

3.3.2    Global and international considerations

Geographical and social concerns

Modern computing is fundamentally networked and increasingly cloud-based, which breaks down any regional barriers to implementation of machine learning. About 34% of the world’s population use the internet, or some 2.4 billion people, meaning that any single one of them could be collecting data about any of the others (Internet World Stats, 2012). Language barriers come into play slightly, although English makes up over half the content of all websites.  Programming languages themselves do not change geographically.

The threat of job displacement is very real for Americans, and is likely to continue as developing nations become more educated and as robots start performing jobs with more specialized tasks.  This can be very disconcerting for many people, but the less obvious benefit of this is the development of new jobs unlocking possibilities current generations couldn’t even imagine (Kelly and Dutton, 2013). No one from the industrial revolution could have predicted that people would spend hours upon hours behind a glowing screen, punching buttons with their fingers at lightning speed, creating a way for people to talk to and see each other from across the other side of the globe. To say that the world we live in now is better than life was during the industrial revolution may be somewhat of a value judgment, but for meaningful measurements like life expectancy (38 to 78 years) and the poverty rate (30-50% to 14%), it is clear that technological progress has done some good things (Moore and Simon, 2000).

As individuals gain greater access to information about their health, it becomes increasingly important that they take an objective approach to understanding the science behind that information. Genetic test results are not 100% accurate, so there is a chance for false positives and negatives.  Also, the evidence linking gene to disease is not always so clear, which could lead to mistaken interpretations of test results. A measure of discretion should be taken as to what information is revealed to individuals when there is no course for treatment for an identified future disease.

Legal and regulatory concerns

The United States Congress passed the Genetic Information Nondiscrimination Act nearly unanimously in May of 2008 (the vote was 414-1, with Ron Paul, who has a history of going against the grain, the lone dissenter) to protect individuals from having their genetic information used against them through discrimination in health care and employment. While this act is the cornerstone for genetic privacy, it is not all-encompassing. For example, the bill does not mention online security of genetic information, which is a significant concern considering it is not uncommon to hear about individuals having their email accounts hacked into. Furthermore, what protection does this act yield in social contexts? High-profile celebrities and public figures would be obvious targets, and once the information is in the public domain that a person has a gene for some debilitating disease in their future, how will that damage their reputation? Discrimination is not always so clear-cut. It will be important to see how the courts interpret the components of the act when presiding over any future genetic discrimination cases so that any nuances not clarified in the original legislation may have a chance to be amended. Future regulations about personalized healthcare should foremost address the needs of the individual before coming to bear on organizations, or else risk being called hypocritical.

Other important laws include HIPAA and ECPA.  The Health Insurance Portability and Accountability Act (HIPAA) provides privacy for any potentially identifying health information, and requires authorization to use that information by practitioners. The Electronic Communications Privacy Act (ECPA) protects electronic communication from unlawful interception.

Ethical concerns

Applying machine learning to biotechnology will bring an increased vigor of ethical questions along with it.  The usual suspects are protecting individual privacy of genomic data and ensuring the safety of genetically modified food.  If anything, machine learning will help to quell these concerns by providing more robust analysis of the data, thereby generating more accurate conclusions about the likelihood a gene is linked to a disease or that no measured adverse health effects were observed with GMO food.

Ethical concerns in regard to advancing robotics and technology could parallel dystopian fictional works like The Matrix, 1984, and Brave New World. Many unanswered questions exist around these concerns, such as what constitutes consciousness? Is it merely an emergent property of the physical structure and interconnections of the human brain? If so, it would stand to reason that technological progress could start to mimic the structure of the human brain, thus creating an artificially sentient being. Should an artificial intelligence be able to improve upon its knowledge through learning algorithms such as the ones described herein, theoretically it could approach a superintelligent state that would vastly exceed human intelligence. Thinking much further than this on the subject would only yield wild conjectures, but it doesn’t seem apparent that an artificial intelligence would become malevolent if it could not experience emotion and be dealt injustice.

The other half of the ethical concerns about advanced technology resides in political systems and human shortcomings. Data is being collected about individuals on an ever-increasing scale, from what websites they visit, to the purchases they make, to the things they talk about. Enacted US laws may provide some protection of privacy, but this is limited. Organizations like Google are using the data to improve their services and promise to “not be evil.” What would happen should they turn to “the dark side of the force?” The increased globalization of the world has decreased the likelihood that a totalitarian dictatorship who compulsively monitors and controls its citizens would be able to take over the world.  Nonetheless, privacy concerns will continue to be an issue in new and unexpected ways, highlighting the importance of flexibility in laws and regulations.

4      Methods – Resources used for Research

Research was conducted entirely online, through computer and data science literature reviews and a wide array of technology blogs and forums. This field has taken strongly to the sharing and discussion of technical information online, and it is not uncommon to read group conversations between many highly educated and experienced computer scientists who fundamentally disagree with each other on points that would seem insignificant to an outsider. This incredibly robust culture of critical, constructive discussion has served to advance the field considerably, and it is a source of inspiration for other scientific disciplines. A more collaborative future for all of science will be shepherded in by these master architects of reality.


5      Summary and Conclusions

The success of machine learning algorithms, like all innovations, is built upon the successful communication of ideas and knowledge. Their efficiency depends on the nature of the dataset, the desired analysis to be performed, and the robustness of the computational infrastructure. Numerous companies have offerings in the realm of Big Data and machine learning, but Skytree differentiates itself with its customized state-of-the-art machine learning algorithms. The democratization of Big Data will allow people of all backgrounds to be able to glean significant insight from complex problems, not just large corporations with extravagant budgets.

Biotechnology will be one of numerous areas in which machine learning algorithms come to dramatically change the nature of what is possible. The areas of genome interpretation, disease prediction, protein engineering, drug discovery, and agricultural development will be the first to see an increase in the development of machine learning algorithms. This could advance to the point where 1) massive servers and databases store individuals’ genomic data, dietary trends, exercise logs, etc., 2) that data is accessible through the cloud, 3) applied through machine learning algorithms and web-based software, 4) all to give real-time, personalized recommendations for what items on a restaurant’s menu are best to order, 5) brought to you by a robo-waitress. And yes, 6) you can even have a cherry on top.

The cumulative effect of all these technological and biological advancements could be described as the beginning of the robot takeover, where everyday jobs utilize robots and software in ways that are hard to imagine now. When the ethics of new technology are balanced with increased access, their value can be properly realized. What these great technological strides in communication have given us are improved ways to share the human experience, and for “the cosmos to know itself.”

6      Appendix

For brevity, Table 3 (Comparison of R, MATLAB, SAS, STATA, and SPSS) was removed, but can be found at the original website.

7      References (34)

Anglade, Tom. (2011) Understanding MapReduce – with Mike Miller. Vol. 8, Retrieved from

Bischl, Bernd et al. (2011) Machine learning in R (mlr). Retrieved from

Dillenberger, Donna N. et al. (2009) Contextual data center management utilizing a virtual environment. US Patent 8386930.

Economist, The. (2010) Data, data everywhere. Special report: managing information. Retrieved from

Emmerich, Greg. (2012) Knome: A model for personalized medicine. UW Madison MS Biotech Program. Retrieved from

Exforsys Inc. (2006) How data mining is evolving. Data mining tutorial: Ch. 1-17. Retrieved from

Gray, Alexander. (2012) How to do massive-data machine learning. Los Angeles Tech Community Talks. Video from

Gray, Alexander. (2013) Skytree Server: Machine Learning Engine. Data Sheet. Retrieved from

Hollywood, Katherine, et al. (2006) Metabolomics: Current technologies and future trends. Proteomics, 6: 4716-4726.

Hungry. (2010) What are the advantages/disadvantages between R and MATLAB with respect to machine learning? StackOverflow forum thread. Retrieved from

IBM. (1993) Forecasting using a neural network and a statistical forecast. US Patent 5461699.

IBM. (1997) Interactive computer network and method of operation. US Patent 6182123.

Internet World Statistics. (2012) World internet usage and population statistics. Miniwatts Marketing Group. Retrieved from

Japan Science and Technology Corp. (2003) Agent learning machine. US Patent 6529887.

Kelly, Kevin and Judy Dutton. (2013) Better than human: robots are coming to take our jobs. We should be happy about it. Wired Magazine, 21-01.

Krabeepetcharat, Andrew. (2013) Data processing and hosting services in the US. IBISWorld Industry Report 51821.

Liao, Jun, Manfred Warmuth et al. (2007) Engineering proteinase K using machine learning and synthetic genes. BMC Biotechnology, 7:16.

Lin, Heshan et al. (2005) Efficient data access for parallel BLAST. IEEE International Parallel and Distributed Processing Symposium (IPDPS); Denver, CO.

Lith, Adam and Jakob Mattsson. (2010) Investigating storage solutions for large data: A comparison of well performing and scalable data storage solutions for real time extraction and batch insertion of data. Chalmers University of Technology, Dept. CS & Eng. Retrieved from

Malangi. (2010) High volume SVM (machine learning) system. StackOverflow forum thread. Retrieved from

McColl, Bill. (2010) Beyond Hadoop: Next-generation Big Data architectures. Gigaom. Retrieved from

Mehlhorn, Kurt and Peter Sanders. (2008) Algorithms and Data Structures: The Basic Toolbox. Springer Publishing. Retrieved from

Mone, Gregory. (2013) Beyond Hadoop. Communications of the ACM, Vol. 56 No. 1, Pages 22-24.

Moore, Stephen and Julian L Simon. (2000) It’s getting better all the time. Cato Institute, Pages 3-9.

O’Connor, Brendan, Lukas Biewald, Pete Skomoroch et al. (2009) Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata. AI and Social Science blog thread. Retrieved from

Pritchett, Dan. (2008) BASE: An ACID Alternative. ACM Queue Vol. 6, No. 3. Retrieved from

Radliffe, Harry and Maria Gavrilovic. (2013) Are robots hurting job growth? CBS Interactive. 60 Minutes: March of the Machines.

Raina, Rajat, Anand Madhavan and Andrew Ng. (2009) Large-scale deep unsupervised learning using graphical processors.  Proc. 26th Nat. Int. Conf. on Machine Learning, Montreal, Canada.

Rodrigues, Thoran. (2012) 10 emerging technologies for Big Data: interview with Dr. Satwant Kaur. Tech Republic: Big Data Analytics. Retrieved from

Sony. (2002) Hardware or software architecture implementing self-biased conditioning. US Patent 6434540.

Stanford Consulting. (2013) Statistical Software. Retrieved from on 4/14/13

Werbos, Paul J. (1997) 3-brain architecture for an intelligent decision and control system. US Patent 6169981.

West, Paul, April Weir et al. (2010) Predicting human developmental toxicity of pharmaceuticals using human embryonic stem cells and metabolomics. Tox & Appl Pharmaco, 247: 18-27.

Yura. (2010) Which programming language has the best repository of machine learning libraries? MetaOptimize forum thread. Retrieved from

  1. Capstone research paper guidelines:

    1) Define an important problem in biotechnology
    2) Determine the most effective resources to research the problem
    3) Interpret and critically analyze results, observations & information
    4) Describe research findings clearly and concisely
    5) Discuss global / international considerations
    6) Recognize and articulate unanswered questions
    7) State obstacles and challenges facing the solution to the problem
