What is Data?
What is machine learning?
Learning system model
Training and testing
Performance
Applications
Types of Machine learning
Learning techniques
Linear Regression – Step By Step
In simpler terms, a machine “learns” by looking for patterns in large amounts of data; when it finds one, it adjusts its program to reflect the “truth” of what it found. The more data the machine is exposed to, the “smarter” it gets, and once it has seen enough patterns it begins to make predictions. Unlike humans, however, machines cannot generalize knowledge or transfer learning from one application to another.
In the 20th century, computer programmers had to get their machines to do things by tapping out lines of code specifying exactly what needed to be done. Machine learning shifts some of that work away from humans, letting the computer figure things out for itself.
Why do we use train and test sets?
Creating a train and test split of your dataset is one method to quickly evaluate the performance of an algorithm on your problem.
The training dataset is used to prepare a model, to train it.
We pretend the test dataset is new data whose output values are withheld from the algorithm. We gather predictions from the trained model on the inputs of the test dataset and compare them to the withheld output values of the test set.
Comparing the predictions and existing outputs on the test dataset allows us to compute a performance measure for the model on the test dataset. This is an estimate of the skill of the algorithm trained on the problem when making predictions on unseen data.
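The train/test procedure described above can be sketched in a few lines of Python. This is a minimal illustration using an invented toy dataset and a trivial rule as the "model"; the function names are mine, not from any particular library.

```python
import random

def train_test_split(data, test_ratio=0.3, seed=42):
    """Shuffle a copy of the data and split it into train and test portions."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

# Toy labeled dataset: (feature, label) pairs, invented for illustration.
dataset = [(i, i % 2) for i in range(20)]
train, test = train_test_split(dataset)

# A trivial "model" that happens to be the true rule for this toy data.
def model(x):
    return x % 2

# Score the model on the held-out test set: this is the performance estimate.
accuracy = sum(model(x) == y for x, y in test) / len(test)
print(accuracy)
```

In practice one would train a real model on the training portion rather than hard-coding the rule, but the shape of the procedure is the same: fit on train, predict on test, compare.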
When we evaluate an algorithm, we are in fact evaluating all steps in the procedure, including how the training data was prepared (e.g. scaling), the choice of algorithm (e.g. kNN), and how the chosen algorithm was configured (e.g. k=3).
The performance measure calculated on the predictions is an estimate of the skill of the whole procedure.
We generalize the performance measure from:
“the skill of the procedure on the test set”
to
“the skill of the procedure on unseen data”.
This is quite a leap and requires that:
The procedure is sufficiently robust that the estimate of skill is close to what we actually expect on unseen data.
The choice of performance measure accurately captures what we are interested in measuring in predictions on unseen data.
The choice of data preparation is well understood and repeatable on new data, and reversible if predictions need to be returned to their original scale or related to the original input values.
The choice of algorithm makes sense for its intended use and operational environment (e.g. complexity or chosen programming language).
A lot rides on the estimated skill of the whole procedure on the test set.
In fact, using the train/test method to estimate the skill of the procedure on unseen data often has high variance (unless we have a great deal of data to split). This means that when the evaluation is repeated, it gives different results, often very different results.
The outcome is that we may be quite uncertain about how well the procedure actually performs on unseen data and how one procedure compares to another.
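This variance is easy to demonstrate: repeating the split with different random shuffles gives a different score each time. The sketch below uses invented data and a deliberately simple majority-class "model" so that only the split changes between runs.

```python
import random

def split_score(data, seed):
    """One train/test split followed by a majority-class 'model'."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.7)
    train, test = shuffled[:cut], shuffled[cut:]
    # Predict the most common training label for every test point.
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return sum(label == majority for _, label in test) / len(test)

# Imbalanced toy data: 12 examples of class 0, 8 of class 1.
data = [(i, 0 if i < 12 else 1) for i in range(20)]
scores = [split_score(data, seed) for seed in range(10)]
print(min(scores), max(scores))  # the estimate typically varies run to run
```

With a small dataset like this, the spread between the lowest and highest score is usually substantial, which is exactly the uncertainty described above.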
Supervised learning is learning in which we teach or train the machine using data that is well labeled, meaning each example is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data), and the supervised learning algorithm uses what it learned from the training data (the set of training examples) to produce the correct outcome.
For instance, suppose you are given a basket filled with different kinds of fruits. The first step is to train the machine on all the different fruits, one by one, like this:
If the shape of the object is rounded with a depression at the top, and its color is red, then it will be labelled as Apple.
If the shape of the object is a long curving cylinder and its color is green-yellow, then it will be labelled as Banana.
Now suppose that, after training, you give the machine a new fruit from the basket (say a banana) and ask it to identify it.
Supervised learning is classified into two categories of algorithms:
Classification: A classification problem is when the output variable is a category, such as “red” or “blue”, or “disease” and “no disease”.
Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.
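The fruit example above can be turned into a tiny classification sketch. Here a nearest-neighbour rule stands in for the learning algorithm; the features (roundness, redness on a 0-1 scale) and the training examples are invented for illustration.

```python
# Labeled training data for the fruit example: ((roundness, redness), label).
training_data = [
    ((0.9, 0.9), "Apple"),   # round, red
    ((0.95, 0.8), "Apple"),
    ((0.1, 0.1), "Banana"),  # long, green-yellow
    ((0.2, 0.15), "Banana"),
]

def classify(features):
    """Label a new fruit by its closest training example (1-nearest-neighbour)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(training_data, key=lambda pair: dist(pair[0], features))
    return nearest[1]

print(classify((0.15, 0.2)))  # → Banana
```

A new fruit that is long and green-yellow lands near the banana examples, so the model labels it Banana, which is exactly the supervised workflow: learn from labeled examples, then predict on new data.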
Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns, and differences, without any prior training on the data.
Unlike supervised learning, no teacher is provided, which means no training answers are given to the machine. The machine is therefore left to find the hidden structure in the unlabeled data by itself.
For instance, suppose the machine is given images containing both dogs and cats, none of which it has ever seen before.
The machine has no idea of the features of dogs and cats, so it cannot label the images as “dogs” and “cats”. But it can categorize them according to their similarities, patterns, and differences: it can easily split the pictures into two parts, one part containing all the pictures with dogs and the other all the pictures with cats, without having learned anything beforehand (no training data or examples).
Unsupervised learning is classified into two categories of algorithms:
Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.
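Clustering is easy to sketch in miniature. Below is a one-dimensional k-means (k=2) on invented numbers that form two obvious groups; it assumes both clusters stay non-empty, which holds for this toy data but would need guarding in general.

```python
# Minimal 1-D k-means sketch (k=2): group unlabeled numbers by similarity.
def kmeans_1d(points, iters=10):
    centers = [min(points), max(points)]  # simple initialisation
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[], []]
        for p in points:
            idx = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            clusters[idx].append(p)
        # Move each center to the mean of its cluster (assumed non-empty).
        centers = [sum(c) / len(c) for c in clusters]
    return clusters, centers

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]  # two obvious groups
clusters, centers = kmeans_1d(points)
print(clusters)
```

No label was ever supplied: the grouping emerges purely from the similarity of the values, which is the essence of clustering.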
The basic difference between Supervised and Unsupervised learning is that Supervised Learning datasets have an output label associated with each tuple while Unsupervised Learning datasets do not.
The most basic disadvantage of any Supervised Learning algorithm is that the dataset has to be hand-labeled, either by a Machine Learning Engineer or a Data Scientist. This is a very costly process, especially when dealing with large volumes of data. The most basic disadvantage of any Unsupervised Learning is that its range of applications is limited.
To counter these disadvantages, the concept of Semi-Supervised Learning was introduced. In this type of learning, the algorithm is trained on a combination of labeled and unlabeled data. Typically, this combination contains a very small amount of labeled data and a very large amount of unlabeled data. The basic procedure is that the programmer first clusters similar data using an unsupervised learning algorithm and then uses the existing labeled data to label the rest of the unlabeled data.
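The procedure just described can be sketched with a simple pseudo-labeling step: a handful of labeled points propagate their labels to nearby unlabeled points. All of the data and labels below are invented for illustration; proximity here is just absolute distance on a single feature.

```python
# A very small labeled set and a larger unlabeled set (toy 1-D features).
labeled = [(1.0, "A"), (9.0, "B")]
unlabeled = [0.8, 1.3, 8.5, 9.4, 9.1]

def nearest_label(x, examples):
    """Give x the label of the closest labeled example."""
    return min(examples, key=lambda pair: abs(pair[0] - x))[1]

# Propagate labels from the labeled set to the unlabeled points,
# then combine both into one larger training set.
pseudo_labeled = [(x, nearest_label(x, unlabeled and labeled)) for x in unlabeled]
full_training_set = labeled + pseudo_labeled
print(full_training_set)
```

The result is a training set far larger than the hand-labeled portion, which is the whole point: most of the labeling cost is replaced by structure discovered in the unlabeled data.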
One may imagine the three types of learning algorithms as follows: supervised learning, where a student is under the supervision of a teacher at both home and school; unsupervised learning, where a student has to figure out a concept by himself; and semi-supervised learning, where a teacher teaches a few concepts in class and gives homework questions based on similar concepts.
Reinforcement Learning is a type of Machine Learning paradigm in which a learning algorithm is trained not on a preset dataset but through a feedback system of rewards. These algorithms are touted as the future of Machine Learning because they can reduce the cost of collecting and cleaning data.