The present decade has seen a great deal of change in the technology sector. Computing has moved steadily from PCs to the cloud, and the Internet of Things has taken the world by storm. This makes the future all the more exciting, because we can readily speculate about which aspects of existing technology call for a revamp. The harder question is how to build the solutions.

Let us tackle it systematically. As computing moves to the cloud, data management becomes the next hurdle. Data can be reorganised and structured to facilitate process automation. Combine this with IoT, and the result is a fully automated system working on its own intelligence. For that, the machine has to be 'taught', much like a human brain. There is little doubt that machine learning algorithms are essential to the next big thing.

Machine learning by itself is a broad term. At its core, it refers to computer programs that teach themselves to handle new information. Machine learning algorithms can be classified loosely into three types. Supervised learning algorithms predict a dependent variable from a given set of independent variables; the model is trained until it achieves a certain level of accuracy. Unsupervised learning algorithms cluster a given set of values into groups. Reinforcement learning algorithms teach a machine to take specific decisions by letting it train itself in its environment through trial and error.

Common Machine Learning Algorithms Explained

Some of the most important machine learning algorithms are explained below. Between them, they cover a wide range of data and problem types.

  1. Linear Regression
    This algorithm estimates real-valued outputs from predictors, or independent variables. A relationship between the dependent and independent variables is established by fitting a best-fit line given by the equation y = ax + b, where a is the slope and b is the intercept.
    Consider an agricultural example. The expected yield of a crop can be predicted by analysing the temperatures in the area during the season and the water requirements of that variety of crop. Here, temperature and amount of water are the independent variables and the crop yield is the dependent variable, as in the sketch below.
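A rough sketch of that example, assuming scikit-learn's LinearRegression; the figures for temperature, water, and yield are invented purely for illustration.

```python
# Linear regression on the crop-yield example; all numbers are
# invented purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Independent variables: [mean season temperature (deg C), water supplied (mm)]
X = np.array([[22, 450], [25, 500], [28, 420], [30, 380], [24, 480]])
# Dependent variable: crop yield (tonnes per hectare)
y = np.array([3.1, 3.8, 3.3, 2.9, 3.6])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # the a's and b in y = a1*x1 + a2*x2 + b
print(model.predict([[26, 460]]))      # predicted yield for a new season
```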
  2. Decision Tree
    It is a supervised learning algorithm that is used in classification problems. A very popular algorithm among data scientists for data mining, the decision tree works by splitting a population into as many distinct groups as possible. Classification is done by using the most significant attributes (independent variables) at each level to form homogeneous groups. An added advantage is that it works for categorical as well as continuous dependent variables.
    The figure below illustrates a typical decision tree application, in which a population of customers is observed and classified to better understand routine customer patterns and improve the existing business model. To decide how to split heterogeneous groups, measures such as Gini impurity, information gain, and variance reduction are used. A code sketch follows the figure.

[Figure: a decision tree splitting a customer population into homogeneous groups]
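A minimal sketch of such a classifier, assuming scikit-learn and a fabricated customer data set; criterion="gini" selects Gini impurity as the splitting measure ("entropy" would use information gain instead).

```python
# Decision tree on an invented customer population.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age, visits per year, average spend]
X = [[25, 4, 30], [40, 24, 80], [35, 12, 55], [60, 2, 20], [30, 18, 65], [50, 6, 35]]
y = ["occasional", "regular", "regular", "occasional", "regular", "occasional"]

# Split on the most significant attribute at each level, measured by Gini impurity
tree = DecisionTreeClassifier(criterion="gini", max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "visits", "spend"]))
print(tree.predict([[45, 20, 70]]))   # classify a new customer
```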

  3. Naive Bayes
    Based on Bayes' theorem, it is a classification technique that assumes the predictors are completely independent of one another. In other words, the algorithm works on the assumption that one feature of an item is unrelated to any other feature. For example, a fruit that is round, yellow and about 5 cm in diameter may be considered a lemon. A Naive Bayes classifier would treat this fruit as a lemon even if there were correlations between its features.
    In many applications, maximum likelihood is used for parameter estimation in Naive Bayes models; that is, you can work with this algorithm without adopting fully Bayesian methods. It has also been known to outperform far more sophisticated classifiers despite its naive design and supposedly oversimplified assumptions. Moreover, it requires only a small amount of training data to estimate the parameters for classification. The fruit example is sketched below.
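A minimal sketch of the fruit example, assuming a Gaussian Naive Bayes classifier from scikit-learn and an invented feature encoding (is_round, is_yellow, size in cm); the training data is fabricated for illustration.

```python
# Toy Naive Bayes classifier for the fruit example; the feature
# encoding and all data points are invented for illustration.
from sklearn.naive_bayes import GaussianNB

# Features: [is_round, is_yellow, size_cm]
X = [[1, 1, 5], [1, 1, 6], [1, 0, 7], [0, 1, 20], [1, 0, 8], [0, 1, 19]]
y = ["lemon", "lemon", "orange", "banana", "orange", "banana"]

clf = GaussianNB().fit(X, y)

# Each feature contributes to the class score independently of the others
print(clf.predict([[1, 1, 5]]))         # expected: ['lemon']
print(clf.predict_proba([[1, 1, 5]]))   # class probabilities
```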
  4. K-Means
    It is an unsupervised learning algorithm used for clustering. It is a simple method that partitions a given data set into a chosen number of clusters, k. Data points are assigned to clusters in such a way that: 1) points in the same cluster are as similar as possible, and 2) points in different clusters are as dissimilar as possible.

    [Figure: K-Means clusters and their centroids]
    For example, if a hospital network wants to open emergency units across a state, the accident-prone regions must first be identified and located. These units should be within a minimum distance of the accident zones yet a significant distance from each other. Clusters can therefore be formed so that their centroids define the placement of the emergency units, as in the sketch below.
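A minimal sketch of the emergency-unit example, assuming scikit-learn; the accident coordinates and the choice of k = 2 are fabricated for illustration.

```python
# K-Means on invented accident locations; cluster centroids suggest
# where emergency units could be placed.
import numpy as np
from sklearn.cluster import KMeans

accidents = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.1],
                      [8.0, 8.0], [8.5, 7.7], [7.8, 8.3]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(accidents)
print(kmeans.cluster_centers_)   # candidate emergency-unit locations
print(kmeans.labels_)            # cluster assignment of each accident
```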
  5. Boosting Algorithms
    These algorithms are employed to reduce bias and variance in supervised learning, or to make predictions that demand high predictive power. Gradient boosting and AdaBoost are the two most popular algorithms in this category. Both combine multiple weak or average learners into a single strong one.

    [Figure: boosting combines several weak learners into one strong learner]

    A typical application of AdaBoost is face detection, where the algorithm is used to learn a classification boundary, here the boundary of the face. The RGB image is first converted to grayscale and a threshold is assumed for the face boundary; each patch of the image can then be analysed. A generic boosting sketch is given below.
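As a rough illustration of the general idea (not of face detection itself), here is a minimal AdaBoost sketch assuming scikit-learn and synthetic data; all numbers are chosen purely for demonstration.

```python
# Boosting sketch: AdaBoost combines many weak learners (by default,
# one-split decision "stumps") into a stronger classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic two-class data standing in for any real task
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 50 sequential weak learners, each focusing on the errors of its predecessors
boosted = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(boosted.score(X, y))   # accuracy of the combined strong learner
```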

Nowadays, the machine learning algorithms for a given application are available off the shelf in software suites. With better software, faster distributed systems and larger data sets, a data scientist may no longer be necessary to monetise machine learning; someone who can code an application and call an API may suffice. Only time will tell how application developers use machine learning to their advantage.