An Insight into Data Mining Algorithms

One of the most instructive lessons is that simple ideas often work very well, and I strongly recommend the adoption of a simplicity-first methodology when analyzing practical datasets.

There are many different kinds of simple structure that datasets can exhibit.

In one dataset, there might be a single attribute that does all the work, while the others are irrelevant or redundant.

Inferring rudimentary rules

In any event, it is always a good plan to try the simplest things first.

The idea is this:

we make rules that test a single attribute and branch accordingly.

Each branch corresponds to a different value of the attribute.

The best classification to give each branch is obvious: use the class that occurs most often in the training data.
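To make this concrete, here is a minimal sketch using Weka's OneR classifier, which implements the 1R idea described above. The file name weather.nominal.arff is an assumption; substitute the path of your own nominal dataset.

```java
import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OneRExample {
    public static void main(String[] args) throws Exception {
        // Load the dataset (the path is illustrative)
        Instances data = new DataSource("weather.nominal.arff").getDataSet();
        // Assume the class attribute (e.g., play) is the last column
        data.setClassIndex(data.numAttributes() - 1);

        // 1R: pick one attribute, create a branch per value, and let each
        // branch predict the majority class for that value
        OneR oneR = new OneR();
        oneR.buildClassifier(data);

        // Print the single-attribute rule set that 1R selected
        System.out.println(oneR);
    }
}
```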

Missing values and numeric attributes

Although a very rudimentary learning method, 1R does accommodate both missing values and numeric attributes.

It deals with these in simple but effective ways.

Missing is treated as just another attribute value.

For example, if the weather data had contained missing values for the outlook attribute, a rule set formed on outlook would have four branches, one each for sunny, overcast, and rainy, and a fourth for missing.

Statistical modeling

The 1R method uses a single attribute as the basis for its decisions and chooses the one that works best.

Another simple technique is to use all attributes and allow them to make contributions to the decision that are equally important and independent of one another.
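This is the assumption underlying the Naive Bayes method. Below is a minimal sketch with Weka's NaiveBayes classifier; the dataset path is again an assumption.

```java
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Every attribute contributes to the decision; the contributions are
        // treated as equally important and independent of one another
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);

        // Print the per-class conditional probability tables
        System.out.println(nb);
    }
}
```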

Constructing decision trees

Decision tree algorithms are based on a divide-and-conquer approach to the classification problem.

They work from the top down, seeking at each stage an attribute to split on that best separates the classes, and then recursively process the sub-problems that result from the split.
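As a sketch of this divide-and-conquer approach, the snippet below builds a tree with Weka's J48 learner (an implementation of C4.5) and prints it; the dataset path is an assumption.

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DecisionTreeExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // J48 repeatedly chooses the attribute that best separates the
        // classes, splits on it, and recurses on each resulting subset
        J48 tree = new J48();
        tree.buildClassifier(data);

        // Print the learned tree in textual form
        System.out.println(tree);
    }
}
```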

Conclusion

This strategy generates a decision tree, which can, if necessary, be converted into a set of classification rules, although the conversion is not trivial if it is to produce effective rules.

5 Main Types of Knowledge Representation in Machine Learning

There are many different ways of representing the patterns that can be discovered by machine learning, and each one dictates the kind of technique that can be used to infer that output structure from data.

Once you understand how the output is represented, you have come a long way toward understanding how it can be generated.

In this article, I talk about the main types of representation:

Decision tables

Decision Table Example

The simplest, most rudimentary way of representing the output from machine learning is to make it just the same as the input: a table.

Decision trees

Decision Tree Example

A divide-and-conquer approach to the problem of learning from a set of independent instances leads naturally to a style of representation called a decision tree.

Classification rules

Classification Rules Example

Classification rules are a popular alternative to decision trees.

The antecedent, or precondition, of a rule is a series of tests, just like the tests at nodes in decision trees. The consequent, or conclusion, gives the class or classes that apply to instances covered by that rule, or perhaps a probability distribution over the classes.
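To illustrate this antecedent/consequent structure, the sketch below learns a rule set with Weka's PART rule learner and prints it; each printed rule is a conjunction of tests followed by a class. The dataset path is an assumption, and PART is just one of several rule learners you could use here.

```java
import weka.classifiers.rules.PART;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RuleLearningExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // PART produces a decision list: each rule's antecedent is a set of
        // attribute tests, and its consequent is the predicted class value
        PART rules = new PART();
        rules.buildClassifier(data);
        System.out.println(rules);
    }
}
```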

Association rules

Association rules are really no different from classification rules except that they can predict any attribute, not just the class, and this gives them the freedom to predict combinations of attributes too.

To reduce the number of rules that are produced, in cases where several rules are related it makes sense to present only the strongest one to the user.

For example, with the weather data, we can extract this rule:

If temperature = cool then humidity = normal
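Association rules like this can be mined automatically. A minimal sketch using Weka's Apriori implementation follows; the dataset path is an assumption, and the number of rules to keep is an arbitrary illustrative choice.

```java
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AssociationRulesExample {
    public static void main(String[] args) throws Exception {
        // No class index is set: association rules may predict any attribute
        Instances data = new DataSource("weather.nominal.arff").getDataSet();

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);          // present only the strongest rules
        apriori.buildAssociations(data);
        System.out.println(apriori);
    }
}
```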

Rules with exceptions

Returning to classification rules, a natural extension is to allow them to have exceptions.

Then incremental modifications can be made to a rule set by expressing exceptions to existing rules rather than reengineering the entire set.

Instead of changing the tests in the existing rules, an expert might be consulted to explain why the new instance violates them; the explanations can then be used to extend only the relevant rules.

Clusters

Clusters Example

When clusters rather than a classifier are learned, the output takes the form of a diagram that shows how the instances fall into clusters.

In the simplest case this involves associating a cluster number with each instance, which might be depicted by laying the instances out in two dimensions and partitioning the space to show each cluster.
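Here is a minimal clustering sketch with Weka's SimpleKMeans, which does exactly this: it associates a cluster number with each instance. The iris.arff file name is an assumption, and the class attribute is removed first because clustering is unsupervised.

```java
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClusteringExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();

        // Clustering is unsupervised: drop the class label (last attribute)
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances unlabeled = Filter.useFilter(data, remove);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(3);   // ask for three clusters
        kmeans.buildClusterer(unlabeled);

        // Associate a cluster number with each instance
        for (int i = 0; i < unlabeled.numInstances(); i++) {
            System.out.println("instance " + i + " -> cluster "
                    + kmeans.clusterInstance(unlabeled.instance(i)));
        }
    }
}
```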

Conclusion

Knowledge representation is a key topic in classical artificial intelligence and is well represented by a comprehensive series of papers edited by Brachman and Levesque.

I mentioned the problem of dealing with conflict among different rules.

Various ways of doing this, called conflict resolution strategies, have been developed for use with rule-based programming systems. 

These are described in books on rule-based programming, such as the one by Brownston et al.

Data Mining: The Impact of the Input

I think with any software system, understanding what the inputs and outputs are is far more important than knowing what goes on in between, and Data Mining is no exception.

The input takes the form of concepts, instances, and attributes.

So, in this article I explain these terms and talk about preparing data.

What’s a concept?


Four basically different styles of learning appear in data mining applications. 

In classification learning, the learning scheme is presented with a set of classified examples from which it is expected to learn a way of classifying unseen examples.

In association learning, any association among features is sought, not just ones that predict a particular class value.

 In clustering, groups of examples that belong together are sought. In numeric prediction, the outcome to be predicted is not a discrete class but a numeric quantity. 

Regardless of the type of learning involved, we call the thing to be learned the concept and the output produced by a learning scheme the concept description.

What’s in an example?


The input to a machine learning scheme is a set of instances.

 These instances are the things that are to be classified, associated, or clustered. 

Although until now we have called them examples, henceforth we will use the more specific term instances to refer to the input. 

Each instance is an individual, independent example of the concept to be learned.

What’s in an attribute?


Each individual, independent instance that provides the input to machine learning is characterized by its values on a fixed, predefined set of features or attributes.

The instances are the rows of the table, and the attributes are its columns.
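Here is a minimal sketch of how instances and attributes appear programmatically in Weka; the ARFF file name is an assumption.

```java
import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InspectDataset {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();

        // Columns: the fixed, predefined set of attributes
        for (int i = 0; i < data.numAttributes(); i++) {
            Attribute att = data.attribute(i);
            System.out.println("attribute " + i + ": " + att.name());
        }

        // Rows: the individual, independent instances
        System.out.println(data.numInstances() + " instances");
        System.out.println("first instance: " + data.instance(0));
    }
}
```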

Preparing the input


Preparing input for a data mining investigation usually consumes the bulk of the effort invested in the entire data mining process. 

Although this article is not really about the problems of data preparation, I want to give you a feeling for the issues involved so that you can appreciate the complexities.

Bitter experience shows that real data is often of disappointingly low quality, and careful checking, a process that has become known as data cleaning, pays off many times over.

Gathering the data together

Integrating data from different sources usually presents many challenges, not deep issues of principle but nasty realities of practice.

Different departments will use different styles of record keeping, different conventions, different time periods, different degrees of data aggregation, different primary keys, and will have different kinds of error.

 The data must be assembled, integrated, and cleaned up.

The idea of company-wide database integration is known as data warehousing.

Data warehouses provide a single consistent point of access to corporate or organizational data, transcending departmental divisions. 

They are the place where old data is published in a way that can be used to inform business decisions.

Missing values

Most datasets encountered in practice contain missing values.

Missing values are frequently indicated by out-of-range entries, perhaps a negative number (e.g., -1) in a numeric field that is normally only positive or a 0 in a numeric field that can never normally be 0.

For nominal attributes, missing values may be indicated by blanks or dashes.

Sometimes different kinds of missing values are distinguished (e.g., unknown vs. unrecorded vs. irrelevant values) and perhaps represented by different negative integers (-1, -2, etc.).
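How you handle missing values depends on the learning scheme: some treat missing as a value in its own right, while others need the gaps filled in first. One simple option, sketched below with Weka's ReplaceMissingValues filter, is to substitute the mean (for numeric attributes) or mode (for nominal ones); the file name is illustrative.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class MissingValuesExample {
    public static void main(String[] args) throws Exception {
        // In ARFF files, missing values are written as "?"
        Instances data = new DataSource("weather_with_missing.arff").getDataSet();

        // Replace each missing entry with the mean (numeric attributes)
        // or mode (nominal attributes) of that attribute
        ReplaceMissingValues filter = new ReplaceMissingValues();
        filter.setInputFormat(data);
        Instances cleaned = Filter.useFilter(data, filter);

        System.out.println(cleaned);
    }
}
```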

Inaccurate values

Data goes stale. Many items change as circumstances change.

For example, items in mailing lists (names, addresses, telephone numbers, and so on) change frequently.

 You need to consider whether the data you are mining is still current.

Conclusion:

Data cleaning is a time-consuming and labor-intensive procedure but one that is absolutely necessary for successful data mining. 

With a large dataset, people often give up.

How can they possibly check it all?

 Instead, you should sample a few instances and examine them carefully. 

You’ll be surprised at what you find. Time looking at your data is always well spent.

Data Mining is an Effective Tool for Decision-Making

The problem with actual real-life datasets is that they are often proprietary. No one is going to share their customer and product choice database with you so that you can understand the details of their data mining application and how it works. Corporate data is a valuable asset, one whose value has increased enormously with the development of data mining techniques such as those that will be described in this article.

So, in this article I will go through some examples of data mining applications.

First Example:

It is about the weather problem: 

Weather Dataset Example

By analyzing this table, we can extract the patterns below:

If outlook = sunny and humidity = high then play = no

If outlook = rainy and windy = true then play = no

If outlook = overcast then play = yes

If humidity = normal then play = yes

If none of the above then play = yes

These rules are meant to be interpreted in order: the first one, then, if it doesn’t apply, the second, and so on. A set of rules that are intended to be interpreted in sequence is called a decision list. Interpreted this way, the rules correctly classify all of the examples in the table, whereas taken individually, out of context, some of the rules are incorrect.

For example, the rule:

 if humidity = normal then play = yes 

gets one of the examples wrong (check which one). The meaning of a set of rules depends on how it is interpreted.

Not surprisingly, the rules we have seen so far are classification rules: they predict the classification of the example in terms of whether to play or not. It is equally possible to disregard the classification and just look for any rules that strongly associate different attribute values. These are called association rules. Many association rules can be derived from the weather data in the table above. Some good ones are as follows:

If temperature = cool then humidity = normal

If humidity = normal and windy = false then play = yes

If outlook = sunny and play = no then humidity = high

If windy = false and play = no then outlook = sunny and humidity = high

Second Example: Irises, a Classic Numeric Dataset

The dataset is provided by the Kaggle website; you can check it via the link below.

Iris Flower Dataset: Iris flower data set used for multi-class classification (www.kaggle.com)

The iris dataset, which dates back to seminal work by the eminent statistician R.A. Fisher in the mid-1930s and is arguably the most famous dataset used in data mining, contains 50 examples each of three types of plant: Iris setosa, Iris versicolor, and Iris virginica. 

There are four attributes: sepal length, sepal width, petal length, and petal width (all measured in centimeters).

Unlike previous datasets, all attributes have values that are numeric.

The following set of rules might be learned from this dataset:

If petal length < 2.45 then Iris setosa

If sepal width < 2.10 then Iris versicolor

If sepal width < 2.45 and petal length < 4.55 then Iris versicolor

If sepal width < 2.95 and petal width < 1.35 then Iris versicolor

If petal length ≥ 2.45 and petal length < 4.45 then Iris versicolor

If sepal length ≥ 5.85 and petal length < 4.75 then Iris versicolor

If sepal width < 2.55 and petal length < 4.95 and petal width < 1.55 then Iris versicolor

If petal length ≥ 2.45 and petal length < 4.95 and petal width < 1.55 then Iris versicolor

If sepal length ≥ 6.55 and petal length < 5.05 then Iris versicolor

If sepal width < 2.75 and petal width < 1.65 and sepal length < 6.05 then Iris versicolor

If sepal length ≥ 5.85 and sepal length < 5.95 and petal length < 4.85 then Iris versicolor

If petal length ≥ 5.15 then Iris virginica

If petal width ≥ 1.85 then Iris virginica

If petal width ≥ 1.75 and sepal width < 3.05 then Iris virginica

If petal length ≥ 4.95 and petal width < 1.55 then Iris virginica
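Rule sets like this can be learned and evaluated automatically. The sketch below uses Weka's JRip (RIPPER) rule learner on the iris data and estimates its accuracy with 10-fold cross-validation; the iris.arff path is an assumption (Weka distributions usually include the file in their data directory).

```java
import weka.classifiers.Evaluation;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class IrisRulesExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // JRip learns a compact rule set for the three iris species
        JRip ripper = new JRip();
        ripper.buildClassifier(data);
        System.out.println(ripper);

        // Estimate accuracy on unseen flowers with 10-fold cross-validation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new JRip(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```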

Third Example: Loan Company

The illustrations that follow tend to stress the use of learning in performance situations, in which the emphasis is on ability to perform well on new examples.

When you apply for a loan, for example, you have to fill out a questionnaire that asks for relevant financial and personal information. This information is used by the loan company as the basis for its decision as to whether to lend you money. Such decisions are typically made in two stages.

First, statistical methods are used to determine clear accept and reject cases. The remaining borderline cases are more difficult and call for human judgment.

For example, one loan company uses a statistical decision procedure to calculate a numeric parameter based on the information supplied in the questionnaire. Applicants are accepted if this parameter exceeds a preset threshold and rejected if it falls below a second threshold. This accounts for 90% of cases, and the remaining 10% are referred to loan officers for a decision. On examining historical data on whether applicants did indeed repay their loans, however, it turned out that half of the borderline applicants who were granted loans actually defaulted. Although it would be tempting simply to deny credit to borderline customers, credit industry professionals pointed out that, if only their repayment future could be reliably determined, it is precisely these customers whose business should be wooed: they tend to be active customers of a credit institution because their finances remain in a chronically volatile condition. A suitable compromise must be reached between the viewpoint of a company accountant, who dislikes bad debt, and that of a sales executive, who dislikes turning business away.

Introduction to Data Mining

We are overwhelmed with data. The amount of data in the world, in our lives, seems to go on and on increasing, and there’s no end in sight. Omnipresent personal computers make it too easy to save things that previously we would have trashed. Inexpensive multi-gigabyte disks make it too easy to postpone decisions about what to do with all this stuff: we simply buy another disk and keep it all.

The World Wide Web overwhelms us with information. Meanwhile, every choice we make is recorded. And all these are just personal choices: they have countless counterparts in the world of commerce and industry. We would all testify to the growing gap between the generation of data and our understanding of it.

As the volume of data increases, inexorably, the proportion of it that people understand decreases, alarmingly. Lying hidden in all this data is information, potentially useful information, that is rarely made explicit or taken advantage of.

People have been seeking patterns in data since human life began. Hunters seek patterns in animal migration behavior, farmers seek patterns in crop growth, politicians seek patterns in voter opinion, and lovers seek patterns in their partners’ responses. A scientist’s job is to make sense of data,to discover the patterns that govern how the physical world works and encapsulate them in theories that can be used for predicting what will happen in new situations.The entrepreneur’s job is to identify opportunities, that is, patterns in behavior that can be turned into a profitable business, and exploit them.

Economists, statisticians, forecasters, and communication engineers have long worked with the idea that patterns in data can be sought automatically, identified, validated, and used for prediction.

As the world grows in complexity, overwhelming us with the data it generates, data mining becomes our only hope for elucidating the patterns that underlie it. Intelligently analyzed data is a valuable resource. It can lead to new insights and, in commercial settings, to competitive advantages.

Data mining is about solving problems by analyzing data already present in databases.

A database of customer choices, along with customer profiles, holds the key to this problem. Patterns of behavior of former customers can be analyzed to identify distinguishing characteristics of those likely to switch products and those likely to remain loyal. Once such characteristics are found, they can be put to work to identify present customers who are likely to jump ship. This group can be targeted for special treatment, treatment too costly to apply to the customer base as a whole. More positively, the same techniques can be used to identify customers who might be attracted to another service the enterprise provides, one they are not presently enjoying, and to target them for special offers that promote this service.

In today’s highly competitive, customer-centered, service-oriented economy, data is the raw material that fuels business growth, if only it can be mined.

How are the patterns expressed? Useful patterns allow us to make nontrivial predictions on new data. There are two extremes for the expression of a pattern:

as a black box whose innards are effectively incomprehensible and as a transparent box whose construction reveals the structure of the pattern.

Both, we are assuming, make good predictions. The difference is whether or not the patterns that are mined are represented in terms of a structure that can be examined, reasoned about, and used to inform future decisions.

 Such patterns we call structural because they capture the decision structure in an explicit way. In other words, they help to explain something about the data.

Structural patterns

Rules that merely reproduce the training examples do not really generalize from the data; they merely summarize it. In most learning situations, the set of examples given as input is far from complete, and part of the job is to generalize to other, new examples.

Real-life datasets invariably contain examples in which the values of some features, for some reason or other, are unknown; for example, measurements were not taken or were lost.

Machine learning

Earlier we defined data mining operationally as the process of discovering patterns, automatically or semi-automatically, in large quantities of data, where the patterns must be useful. An operational definition can be formulated in the same way for learning.

Things learn when they change their behavior in a way that makes them perform better in the future.

This ties learning to performance rather than knowledge. You can test learning by observing the behavior and comparing it with past behavior. This is a much more objective kind of definition and appears to be far more satisfactory.

Data mining

Data mining is a practical topic and involves learning in a practical, not a theoretical, sense.

We are interested in techniques for finding and describing structural patterns in data as a tool for helping to explain that data and make predictions from it. The data will take the form of a set of examples.

Examples of customers who have switched loyalties, for instance, or situations in which certain kinds of contact lenses can be prescribed. The output takes the form of predictions about new examples: a prediction of whether a particular customer will switch, or of what kind of lens will be prescribed under given circumstances.

People frequently use data mining to gain knowledge, not just predictions. Gaining knowledge from data certainly sounds like a good idea if you can do it.

To conclude: to learn more about data mining, I have made a video defining it, and I hope you will find it useful 🙂

Optimize, Manage, and Deploy the ML Model in an Effective Way

Reducing the training data will eventually reduce accuracy. Finding the right balance is a trade-off decision, and you can use a sensitivity analysis to help you choose the most efficient point along the curves.

Sensitivity analysis is a technique, borrowed from financial modeling, that determines how target variables are affected by changes in other variables, known as input variables.

Optimizing model size for devices involves performing a sensitivity analysis for the critical parameter(s) of the chosen algorithm. Create models to observe their size, and then choose tangent points along the sensitivity curves for the optimum tradeoff. Machine Learning (ML) environments like Weka make it easy to experiment with parameters to optimize your models.

One of the huge advantages of Deep Learning (DL) algorithms is that, generally, their size does not scale linearly with the size of the dataset, as is the case for the Random Forest algorithm. DL algorithms such as CNNs and RNNs use hidden layers; as the dataset grows in size, the number of hidden layers does not. DL models get smarter without growing proportionally in size.

Model Version Control

Once created, you should treat your ML models as valuable assets. Although you did not write code in the creation process, you should consider them as code equivalents when managing them. This implies that ML models should be placed under version control in a similar manner to your application source code.

Whether or not you store the actual model (a serialized Java object in the case of Weka’s model export) depends on whether the model is reproducible deterministically. Ideally, you should be able to reproduce any of your models from the input components, including:

  • Dataset
  • Input configuration including filters or preprocessing
  • Algorithm selection
  • Algorithm parameters

For deterministic models that are reproducible, it is not necessary to store the model itself. Instead, you can just choose to store the input components. When creation times are long, such as with the KNN algorithm for large datasets, it can make sense to store the model itself, along with the input components.
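When you do choose to store the model itself, Weka can export and reload it as a serialized Java object. A minimal sketch follows; the file names are illustrative.

```java
import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class ModelPersistenceExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(data);

        // Export the trained model as a serialized Java object; version this
        // file, or the input components that reproduce it, with your code
        SerializationHelper.write("weather-j48.model", tree);

        // Later, or in another process: load the model and reuse it
        Classifier restored = (Classifier) SerializationHelper.read("weather-j48.model");
        double prediction = restored.classifyInstance(data.instance(0));
        System.out.println("predicted class: "
                + data.classAttribute().value((int) prediction));
    }
}
```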

The following tools are free and open source, and promise to allow you to seamlessly deploy and manage models in a scalable, reliable, and cost-optimized way:

https://dataversioncontrol.com

https://datmo.com

These tools support integration with cloud providers such as AWS and GCP. They solve the version control problem by guaranteeing reproducibility for all of your model-based assets.

Updating Models

One of the key aspects to consider when you begin to deploy your ML app is how you are going to update the model in the future. One of the solutions is to simply load the ML model directly from the project’s asset directory when the app starts.

This is the easiest approach when starting with ML application development, but it is the least flexible when it comes time to upgrade your application-model combination in the future.

A more flexible architecture is to abstract the model from the app. This provides the opportunity to update the model in the future without the need to rebuild the application.
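One way to abstract the model from the app is to hide the model source behind a small interface, so that swapping the bundled model for a newly downloaded one never touches the prediction code. The sketch below is only an illustration of that idea: the ModelProvider interface and the file name are hypothetical, not part of Weka or Android.

```java
import weka.classifiers.Classifier;
import weka.core.SerializationHelper;

import java.io.FileInputStream;
import java.io.InputStream;

// Hypothetical abstraction: the app asks a ModelProvider for a classifier
// and never cares whether it was bundled with the app or fetched later.
interface ModelProvider {
    Classifier loadModel() throws Exception;
}

// One implementation reads a serialized Weka model from local storage;
// an alternative implementation could stream an updated model from a server.
class LocalFileModelProvider implements ModelProvider {
    private final String path;

    LocalFileModelProvider(String path) {
        this.path = path;
    }

    @Override
    public Classifier loadModel() throws Exception {
        try (InputStream in = new FileInputStream(path)) {
            return (Classifier) SerializationHelper.read(in);
        }
    }
}

public class ModelLoadingExample {
    public static void main(String[] args) throws Exception {
        // The path is illustrative; point it at your own serialized model
        ModelProvider provider = new LocalFileModelProvider("models/weather-j48.model");
        Classifier model = provider.loadModel();
        System.out.println("loaded model: " + model.getClass().getSimpleName());
    }
}
```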

Conclusion:

The best practices for creating and handling prebuilt models for on-device ML applications:

  • Optimal model size depends on the input dataset size, attribute complexity, and target device hardware capabilities.
  • Prepare a model sensitivity analysis plotting model accuracy vs model size.

4 Critical Factors for a Machine Learning Model

In Machine Learning application development, the model is one of your key assets. You must carefully consider how to handle the model, because it can grow to be very large, and you need to start by making sure the models you create can physically reside on your target device.

In this article I am going to discuss four factors you must consider in the model integration phase. These factors are training time, test time, accuracy, and size.

Model training time

Training time is important. However, when you are deploying static models within applications at the edge, the priority is low because you can always apply more resources, potentially even in the cloud, to train the model.

Model test time

If an algorithm produces a complex model requiring relatively long testing times, this could result in latency or performance issues on the device when making predictions.

Model accuracy

Model accuracy must be sufficient to produce results required by your well-defined problem.

Model size

When deploying pre-trained Machine Learning models onto devices, the size of the model must be consistent with the memory and processing resources of the target device.

To understand how the factors interrelate, you can perform a sensitivity analysis, which determines how target variables are affected by changes in other variables, known as input variables.

Consider the Random Forest algorithm. The number of iterations, i, is a key variable because it determines how many trees the algorithm produces. More iterations mean more trees, which results in each of the following (a sketch follows the list below):

  • Higher degree of accuracy
  • Longer creation time
  • Larger model size
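Below is a minimal sensitivity-analysis sketch for these trade-offs, varying the number of trees and recording accuracy, training time, and serialized size. The iris.arff path is an assumption, and note that recent Weka versions expose the tree count as setNumIterations while older versions call it setNumTrees.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.util.Random;

public class SensitivityAnalysis {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Vary the number of iterations (trees) and record the trade-offs
        for (int iterations : new int[] {10, 50, 100, 200}) {
            RandomForest rf = new RandomForest();
            rf.setNumIterations(iterations);   // setNumTrees on older Weka

            long start = System.currentTimeMillis();
            rf.buildClassifier(data);
            long trainMillis = System.currentTimeMillis() - start;

            // Accuracy estimate via 10-fold cross-validation
            RandomForest cvModel = new RandomForest();
            cvModel.setNumIterations(iterations);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(cvModel, data, 10, new Random(1));

            // Approximate model size by serializing it in memory
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(rf);
            }

            System.out.printf("trees=%d accuracy=%.2f%% train=%dms size=%dKB%n",
                    iterations, eval.pctCorrect(), trainMillis, bytes.size() / 1024);
        }
    }
}
```

Plotting accuracy and size against the number of trees gives the sensitivity curves from which you can pick the tangent point that best fits your device constraints.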

Monetizing your application with ML

It is amazing how many apps are available on the app stores today. In fact, there are so many, it has become difficult to cut through the noise and establish a presence. A small percentage of apps on the app stores today use Machine Learning (ML), but this is changing.

Machine learning is the future of app development. You must learn to design ML performance into the app, including considerations for model size, model accuracy, and prediction latency.

These final two ML-Gates (Model Integration/Deployment) represent the “business end” of the ML development pipeline. They represent the final steps in the pipeline where you realize the benefit of all the hard work performed in the earlier phases when you were working with data, algorithms, and models. Model integration and deployment are the most visible stages, the stages that enable you to monetize your applications.

Managing Models

In ML application development, the model is one of your key assets. You must carefully consider how to handle the model, including

  • Model sizing considerations
  • Model version control
  • Updating models

Models can grow to be very large, and you need to start by making sure the models you create can physically reside on your target device.

Device Constraints

When you use ML models from the cloud providers, you simply rely on network connectivity and a cloud provider API to access models and make predictions. Storing prebuilt models on devices is a different approach, requiring you to understand the limitations of the target device.

It is common on Android devices to see applications with sizes greater than 300 MB. This does not mean you should create models with sizes to match. Huge models are difficult to manage. The primary downside of huge models is the time it takes to load them. With Android, the best approach is to load models on a background thread, and you would like the loading operation to be complete within a few seconds.

Model accuracy, model training time, and model testing time vary from one classification algorithm to another. There is an additional factor, model size, which is equally important to consider.

Weka Explorer: How Powerful It Is!

Weka (Waikato Environment for Knowledge Analysis) was developed at the University of Waikato, New Zealand. It is free software licensed under the GNU General Public License. The Explorer is the main Weka interface; the figure below shows it.

Weka-Explorer, How powerful it is !

Across the top of the Explorer, you will see tabs for each of the key steps you need to accomplish during the model creation phase:

Preprocess: Filter is the word used by Weka for its set of data preprocessing routines. You apply filters to your data to prepare it for classification or clustering.

Classify: The Classify tab allows you to select a classification algorithm, adjust the parameters, and train a classifier that can be used later for predictions.

Cluster: The Cluster tab allows you to select a clustering algorithm, adjust its parameters, and cluster an unlabeled dataset.

Select attributes: The Select attributes tab allows you to choose the attributes that are most useful for prediction.

Visualize: The Visualize tab provides a visualization of the dataset. A matrix of visualizations in the form of 2D plots represents each pair of attributes.

Weka Filters 

Within Weka, you have an additional set of internal filters you can use to prepare your data for model building. Weka, like all good ML environments, contains a wealth of Java classes for data preprocessing. If you do not find the filter you need, you can modify the Java code of an existing Weka filter to create your own custom filter.
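As an illustration, the sketch below applies one of Weka's built-in filters programmatically; a custom filter would be used through the same setInputFormat plus Filter.useFilter pattern. The file name and bin count are illustrative choices.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class FilterExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();

        // Discretize numeric attributes into a fixed number of bins,
        // a common preprocessing step before classification or clustering
        Discretize discretize = new Discretize();
        discretize.setBins(5);
        discretize.setInputFormat(data);

        Instances discretized = Filter.useFilter(data, discretize);
        System.out.println(discretized.toSummaryString());
    }
}
```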

Weka Explorer Key Options 

Explorer is where the magic happens. You use the Explorer to classify or cluster. Note that the Classify and Cluster tabs are disabled in the Weka Explorer until you have opened a dataset using the Preprocess tab. Within the Classify and Cluster tabs at the top of the Weka Explorer are three important configuration sections you will frequently use in Weka:

  • Algorithm options
  • Test options
  • Attribute predictor selection (label) for classification 

There is a lot more to learn about the Explorer module than what I have covered in this article, but you already know enough to analyze your data using preprocessing, classification, clustering, and association with the Weka Explorer module.

If you plan to do any complicated data analysis that requires software flexibility, I recommend using Weka’s Simple CLI interface. You will have a few new tools to learn, but practice makes perfect.

 Good luck with your data analysis 🙂