Exploration on Car Insurance Data Using Supervised Learning

Sushmitha S. P., PG Scholar, Department of Computer Science, Stella Maris College, Chennai, [email protected]
Renuka Devi D., Assistant Professor, Department of Computer Science, Stella Maris College, Chennai, [email protected]

Abstract — The insurance sector by nature involves a rigorous collection of data.


Insurance business data statistics are used to measure risk, which we can achieve using big data analytics. Big data analytics is used to discover the features of potential clients, allowing the insurance business to achieve higher forecasting accuracy. The key objective of this paper is to analyze and understand the needs and purchase plans of clients, to find who buys the car insurance service offered in a campaign. So, we apply different classification algorithms in R to large-scale insurance data to improve performance and predictive modeling. We collected the data from Kaggle Datasets: The Home of Data Science & Machine Learning. We use the confusion matrix, precision, recall and F-measure to estimate the performance of the algorithms.

The final result shows which algorithm outperforms the other classification algorithms, in terms of both accuracy and performance, at predicting who buys the car insurance service from the insurance data.

Keywords – Supervised learning, R tool, big data, classification algorithms, sampling technique.

I. INTRODUCTION

A huge amount of data is generally referred to as big data. It is enormous in size, diverse in variety, and arrives at high velocity. This huge amount of information is useless unless the data is examined to uncover new correlations, customer insights and other useful information that can help an organization take more informed business decisions. Big data is widely applied in all sectors: healthcare, insurance, finance and many more.

Big data in the insurance sector is one of the most promising applications. The traditional marketing system of insurance is an offline sales business: companies generally sell insurance policies by calling and visiting customers.

This fixed marketing system achieved good results in the past. But currently many new private insurance companies have entered the marketplace, which creates healthier competition. On the other hand, the willingness of people to pay for insurance services has also increased. Therefore, understanding the needs and purchase plans of clients is extremely essential for insurance companies to raise sales volume.

Big data technology supports the insurance companies' transformation. The lack of principle and innovation in traditional marketing, badly structured insurance data and unclear customer purchasing characteristics lead to imbalanced data, which makes the classification of users and the recommendation of insurance products difficult. Decision-making tasks are difficult with an imbalanced data distribution. To solve this problem, we usually use resampling methods, which construct balanced training datasets. This improves the performance of the predictive model.

The main purpose of this paper is to identify potential customers with the help of big data technology.

This paper not only provides a good strategy for identifying potential clients but also acts as a good reference for classification problems. We propose a supervised learning algorithm, an ensemble of decision trees, to analyze potential customers and their major characteristics.

This paper is organized as follows. Section II introduces the current research status of machine learning; Section III puts forward the classification model and intelligent recommendation algorithm based on the XGBoost algorithm for insurance business data, and analyzes its efficiency; Section IV presents the experiment and results; Section V puts forward the analysis of the results; Section VI gives the conclusion and future work.

II. RELATED WORK

The classification problem for US bank insurance business data has an imbalanced data distribution.

This means the ratio between the positive and negative proportions is extremely unbalanced, so prediction models generated directly by supervised learning algorithms like SVM or logistic regression are biased toward the larger class. For example, the ratio between the positive and negative classes may be 100:1; such a model does not help in prediction. An imbalanced class distribution will affect the performance of a classification problem.

Thus, some techniques should be applied to deal with this problem. One approach to handling an unbalanced class distribution is sampling techniques [2], which rebalance the dataset. Sampling techniques are broadly classified into two types: under sampling and over sampling. Under sampling is applied to the major class to reduce it (e.g., Random Under Sampling), and over sampling is applied to add samples to the minor class (e.g., Random Over Sampling). The drawback of ROS is redundancy in the dataset, which again leads to a classification problem: the classifier may still not recognize the minor class significantly. To overcome this problem, SMOTE (Synthetic Minority Over-sampling Technique) is used.
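To make the SMOTE idea concrete, here is a minimal sketch, in Python rather than the R used in the paper, and on hypothetical toy points: a synthetic sample is placed at a random position on the segment between a minority sample and one of its k nearest minority neighbours.

```python
import random

def smote(minority, k=2, n_new=4, seed=0):
    """Generate synthetic minority samples by interpolating between a
    sample and one of its k nearest minority neighbours (the SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (excluding x)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
new_points = smote(minority)
print(len(new_points))  # 4
```

Because each synthetic point lies between two existing minority points, the new samples stay inside the region the minority class already occupies, unlike plain random duplication.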

SMOTE creates additional samples that are close and similar to their near neighbors among the samples of the minor class, rebalancing the dataset with the help of K-Nearest Neighbors (KNN) [2]. Sampling methods can also be divided into non-heuristic and heuristic methods. Non-heuristic methods randomly remove samples from the majority class in order to reduce the degree of imbalance [10]. Heuristic sampling distinguishes samples based on a nearest-neighbor algorithm [7].

Another difficulty in classification problems is data quality, in particular the existence of missing data. Frequent occurrence of missing data will give biased results. Mostly, dataset attributes are dependent on each other; thus, identifying the correlation between those attributes can be used to discover the missing data values.

One approach, replacing the missing values with probable values, is called imputation [6]. One of the challenges in big data is data quality: we need to ensure the quality of the data, otherwise it can mislead us into wrong predictions. One significant data quality problem is missing data. Imputation is a method for handling missing data that reconstructs the missing values with estimated ones. The imputation method has the advantage of handling missing data without the help of learning algorithms, and it also allows the researcher to select a suitable imputation method for the particular circumstance [3].
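As a concrete instance of imputation, the simplest scheme to illustrate is mean/mode imputation; a small Python sketch on toy columns (not the paper's dataset):

```python
from statistics import mean, mode

def impute(column, kind="mean"):
    """Replace missing entries (None) with the mean (numeric columns)
    or the mode (categorical columns) of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed) if kind == "mean" else mode(observed)
    return [fill if v is None else v for v in column]

print(impute([18, None, 30, 24]))                     # mean of 18, 30, 24 is 24
print(impute(["no", "yes", None, "no"], kind="mode"))  # mode is "no"
```

Predictive-model imputation (as used later in this paper with KNN) replaces the single global fill value with a per-row estimate learned from the non-missing attributes.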

There are many imputation methods for missing value treatment (some widely used ones are case substitution, mean and mode imputation, and predictive models). In this paper we built a predictive model for missing value treatment. There are a variety of machine learning algorithms to crack both classification and regression problems. Machine learning is the practice of designing classifiers that have the capability to repeatedly learn and perform without being explicitly programmed. Machine learning algorithms are classified into three types: supervised learning, unsupervised learning and reinforcement learning.

In this paper, we propose supervised machine learning algorithms to build the model. Some supervised learning algorithms are: regression, decision tree, random forest, KNN, logistic regression, etc. [8]. A decision tree in machine learning can be used for both classification and regression. In decision analysis, a decision tree can be used to visually and unambiguously represent decisions. The tree has two significant entities, known as decision nodes and leaves. The leaves are the verdict or the final result, and the decision nodes are where the data is split.

A classification tree is a type of decision tree where the outcome is a variable like 'fit' or 'unfit'; here the decision variable is categorical. One of the best ensemble methods is the random forest, which is used for both classification and regression [5].

A random forest is a collection of many decision trees, each grown to its full size, and it has the advantage of automatic feature selection [4]. Gradient boosting seeks to consecutively decrease the error with each successive model until one final model is produced. The key intent of every machine learning algorithm is to construct the strongest predictive model while also accounting for computational efficiency. This is where the XGBoost algorithm comes into play.

XGBoost (eXtreme Gradient Boosting) is a direct application of gradient boosting to decision trees. It provides a more regularized model formalization to control over-fitting, which gives improved performance [8].

III. METHODOLOGY

Classification model: The traditional sales approach for insurance products is an offline process and it has the following disadvantages: (1) lack of a customer evaluation system, so the influence weights of the characteristics of potential customers are unknown; (2) the data accumulated in this way is usually seriously flawed, which indirectly affects the accuracy of the classification model [4]. For a set of classification models, the distribution of classes and the correlation of features affect the forecast results.

Imbalanced data classification and dependent attributes of the insurance dataset cause serious deviations in the classification model's results. We can handle these kinds of problems with different sampling methods and supervised learning algorithms. In this article, we use an over sampling approach with supervised learning algorithms on a car insurance dataset to build the best predictive model. The imbalanced data classification problem is resolved with the over sampling method, and finally we build the model with supervised learning algorithms using the training dataset. Finally, the predictive model is validated with the test dataset, and the performance of the algorithms is evaluated using the confusion matrix method on the test dataset.

Precision, recall and F-measure are also calculated as additional performance metrics for the accuracy of the algorithms. The taxonomy of the proposed classification model is given below.

Figure 1. Taxonomy of the proposed methodology

A. Dataset:

The key objective of this paper is to analyze and understand the needs and purchase plans of clients, to find who buys the car insurance service of the campaign. So, we apply different classification algorithms in R to large-scale insurance data to improve performance and predictive modeling. We collected the data from Kaggle Datasets: The Home of Data Science & Machine Learning.

This dataset was collected from a bank in the US. In addition to common services, this bank also provides car insurance services. The bank arranges promotions such as a campaign every year to catch the attention of new customers.

The bank has provided details about potential customers, along with the bank staff's call duration times for promoting the available car insurance offer. The data covers 4,000 customers who were contacted in the last campaign, and the outcome of the campaign, that is, whether or not the customer bought the insurance product, is known.

B. Preprocessing:

Data is usually collected for unspecified applications.

Data quality is one of the major issues that needs to be addressed in the process of big data analytics. Problems that affect data quality include: 1. noise and outliers; 2. missing values; 3. duplicate data. Preprocessing is a method used to make data more appropriate for data analytics. Data cleaning is a process to handle the missing data.

We used an analytical model as the imputation method to predict the missing values from the non-missing data. Here, we used the KNN algorithm to estimate the missing data; it estimates a missing value with the help of its nearest neighbors' values. Data transformation is another preprocessing method, used to normalize data. Normalization is a process in which we transform a complex dataset into a simpler one. Here, we used min-max normalization to normalize the data. It scales the data between 0 and 1: x' = (x − min) / (max − min), where x is the vector that we are going to normalize.
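Spelling out the normalization just described (min and max are the minimum and maximum of the vector x), a minimal Python sketch on a toy vector:

```python
def min_max(x):
    """Min-max normalization: x' = (x - min(x)) / (max(x) - min(x)),
    scaling every value into the range [0, 1]."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]

print(min_max([20, 35, 50, 65]))  # smallest value maps to 0.0, largest to 1.0
```

The same rescaling is what R packages typically apply column-wise before distance-based steps such as KNN imputation, since unscaled columns would otherwise dominate the distance.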

Here min and max are the minimum and maximum values in x, given its range. Once the dataset is pre-processed, it is ready for data partition.

C. Data Partition:

In this step, we split the data into the separate roles of learning (train) and testing rather than working with all of the data at once. The training data contains the input data with the expected output. Roughly 60% of the original dataset constitutes the training dataset, and the other 40% is considered the testing dataset; the test data validates the core model and checks the accuracy of the model. Here, we partitioned the original insurance dataset into train and test sets with a 60/40 split.
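A 60/40 random partition of this kind can be sketched as follows (the row count of 4000 matches the dataset described earlier; the helper itself is hypothetical, not the paper's R code):

```python
import random

def split(rows, train_frac=0.6, seed=1):
    """Randomly partition rows into ~60% train and ~40% test."""
    rows = rows[:]                     # copy so the caller's order is kept
    random.Random(seed).shuffle(rows)  # shuffle before cutting
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

train, test = split(list(range(4000)))
print(len(train), len(test))  # 2400 1600
```

Shuffling before cutting matters: taking the first 60% of an unshuffled file would bake any ordering of the records (e.g. by campaign date) into the split.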

D. Supervised Learning Algorithms:

Supervised learning is a machine learning technique that infers a function from training data without being explicitly programmed. Learning is said to be supervised when the desired outcome is already known.

After partitioning, the next step is to build the model with the training sample set. Our target variable is chosen first: we selected car insurance as the target variable, and the other attributes in the dataset are taken as predictors to develop the predictive model. We want to build a model that predicts who buys the car insurance service during a campaign. In this problem, we need to separate out the clients who buy car insurance from those who do not, based on the most significant key variables. In this paper, we use the Random Forest and Extreme Gradient Boosting algorithms to build the model.

We then evaluate which algorithm gives better performance.

Random Forest: Before we move to the random forest, it is necessary to have a look at the decision tree. What is a decision tree? A decision tree can be used for both classification and regression. Unlike linear models, tree-based predictive models give high accuracy.

The decision tree is frequently used in classification problems. It separates out the clients based on the predictor variables and identifies the variable which creates the most uniform sets of clients. Here, our decision variable is categorical. Why random forest? The random forest is one of the most frequently used predictive models and machine learning techniques. In a normal decision tree, one decision tree is built to identify the potential client, but in the random forest algorithm, a number of decision trees are built during the process to identify the potential client. A vote from each of the decision trees is considered in deciding the final class of an object.
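The voting step can be sketched as follows; the three 'trees' below are hypothetical hand-written rules on a made-up call-duration attribute, standing in for real trained trees:

```python
from collections import Counter

def forest_predict(trees, x):
    """Each tree casts a vote; the majority class is the forest's prediction."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Toy stand-ins for trained decision trees (hypothetical rules):
trees = [
    lambda x: "buy" if x["duration"] > 300 else "no",
    lambda x: "buy" if x["duration"] > 200 else "no",
    lambda x: "no",
]
print(forest_predict(trees, {"duration": 350}))  # "buy" (2 of 3 trees vote buy)
```

In a real random forest each tree is additionally trained on a bootstrap sample with a random subset of features, which is what makes the individual votes decorrelated.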

Model description: Sampling is one of the methods in preprocessing. It selects a subset of the original samples and is mainly used to balance the data for classification. In our random forest model, we used an under sampling approach to balance the data. Under sampling condenses the majority group to make its occurrence closer to that of the infrequent group.
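Random under sampling of this kind can be sketched as follows (class labels and counts are illustrative, not the paper's dataset):

```python
import random

def under_sample(majority, minority, seed=0):
    """Randomly keep only as many majority rows as there are minority rows."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + minority

majority = [("no", i) for i in range(100)]   # 100 non-buyers
minority = [("yes", i) for i in range(10)]   # 10 buyers
balanced = under_sample(majority, minority)
print(len(balanced))  # 20 rows, 10 per class
```

The price of under sampling is that most majority-class rows are discarded, which is why it is usually reserved for cases where the majority class is large enough to spare the data.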

The original insurance data is balanced with under sampling, and we then use this sample in the Random Forest algorithm to build the model. The algorithm randomly generates n trees to build an effective model.

Extreme Gradient Boosting: Another classifier is extreme gradient boosting. XGBoost yields a highly accurate predictive model.

This algorithm works about ten percent faster than existing algorithms. It can be used for regression and classification, and also for ranking. One of the most interesting things about XGBoost is its regularized boosting method, which helps to lessen over-fitting.

Over-fitting is the phenomenon in which the learning model fits the given training data so tightly that it becomes inaccurate in predicting the outcome of the test data.

Model description: In our model, we first used an over sampling method to balance the classes. Sampling techniques can be used to get better forecast performance in the case of imbalanced classes using R and the caret package.

Over sampling randomly duplicates samples from the class with few instances. Here, we used the over sampling method on the train set to improve the performance of the model. Once the balanced samples are collected, we pass them to XGBoost as the train set and build the model. XGBoost builds a binary classification model with the insurance data. After this, the model is validated with the test set.
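The random duplication that over sampling performs can be sketched as follows (illustrative class counts, mirroring the under sampling sketch earlier):

```python
import random

def over_sample(majority, minority, seed=0):
    """Randomly duplicate minority rows until both classes are equal in size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

majority = [("no", i) for i in range(100)]
minority = [("yes", i) for i in range(10)]
balanced = over_sample(majority, minority)
print(len(balanced))  # 200 rows, 100 per class
```

Note that only the training set is resampled; the test set must keep its natural class distribution, otherwise the reported accuracy would not reflect real campaign data.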

The resulting model produces much better prediction performance than the random forest model.

E. Model Evaluation:

Performance analysis of classification problems includes the matrix analysis of the predicted results.

In this paper, we used the following metrics to evaluate the performance of the classification algorithms: precision, recall and F-measure. Precision is the fraction of predicted positive instances that are relevant; it is also called the positive predictive value. Recall is the fraction of relevant instances that have been retrieved, out of the total number of relevant instances.
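Both quantities, together with the F-measure defined next, follow directly from the confusion-matrix counts; a minimal sketch with made-up counts (not the paper's results):

```python
def scores(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)  # fraction of predicted positives that are correct
    recall = tp / (tp + fn)     # fraction of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

print(scores(tp=80, fp=20, fn=20))  # all three come out to 0.8 for these counts
```

Because F1 is a harmonic mean, it is pulled toward the smaller of precision and recall, which is exactly the behavior wanted on imbalanced data where plain accuracy can look deceptively high.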

The F1-measure is the harmonic mean (the number of observations divided by the sum of the reciprocals of the observations) of precision and recall, F1 = 2PR / (P + R), and corresponds to the overall performance. Here, TP is true positive, FP is false positive, TN is true negative and FN is false negative.

Table 1: Confusion Matrix

                   Predicted Positive   Predicted Negative
Actual Positive    TP                   FN
Actual Negative    FP                   TN

IV. EXPERIMENT AND RESULT

We used KNN for missing data treatment, and after all preprocessing we built the predictive models with XGBoost and Random Forest for the business case. The comparison table for the two models is given below.

Table 2: Performance comparison of XGBoost and Random Forest

Algorithm        Precision   Recall   F1     Accuracy
Random Forest    0.81        0.80     0.76   0.76
XGBoost          0.86        0.86     0.86   0.86

The result above shows that the XGBoost algorithm outperformed the random forest.

V. ANALYSIS OF THE RESULT

Figure 1: Effect of missing values before imputation
Figure 2: Important features that impact the target variable, using the Random Forest algorithm

Figure 3: AUC curve for the Random Forest algorithm
Figure 4: AUC curve for the Extreme Gradient Boosting algorithm
Figure 5: Important features that impact the target variable, using the Gradient Boosting algorithm
Figure 6: Overall performance analysis

VI. CONCLUSION

This paper analyzed the imbalanced distribution of insurance business data, summarized the preprocessing algorithms for imbalanced datasets, and proposed a random forest algorithm based on R which can be used in the large-scale imbalanced classification of insurance business data. The experimental results showed which algorithm is more suitable for identifying how many people will buy the insurance product in a campaign.

Here, the XGBoost algorithm outperformed the other decision tree algorithm, Random Forest. Our future work includes combining the proposed algorithm with deep learning.

References:

1. E. Ramentol, Y. Caballero, R. Bello, and F. Herrera, "SMOTE-RSB*: A hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory," Knowl. Inf. Syst., vol. 33, no. 2, pp. 245-265, 2012.
2. Maryam Farajzadeh-Zanjani, Roozbeh Razavi-Far, and Mehrdad Saif, "Efficient Sampling Techniques for Ensemble Learning and Diagnosing Bearing Defects under Class Imbalanced Condition."
3. Gustavo E. A. P. A. Batista and Maria Carolina Monard, "An Analysis of Four Missing Data Treatment Methods for Supervised Learning."
4. Weiwei Lin, Ziming Wu, Longxin Lin, Angzhan Wen, and Jin Li, "An Ensemble Random Forest Algorithm for Insurance Big Data Analysis," 2017.
5. Eesha Goel and Er. Abhilasha, "Random Forest: A Review," 2017.
6. Conception of Data Preprocessing and Partitioning Procedure for Machine Learning. Available: http://www.academia.edu/9517738/conception_of_data_preprocessing_and_partitioning_procedure_for_machine_learning_algorithm
7. Down-Sampling Using Random Forests. Available: https://www.r-bloggers.com/down-sampling-using-random-forests/
8. Boosting in Machine Learning and the Implementation of XGBoost. Available: https://towardsdatascience.com/boosting-in-machine-learning-and-the-implementation-of-xgboost-in-python
9. Tianqi Chen and Tong He, "xgboost: eXtreme Gradient Boosting," January 4, 2017.
10. Jorma Laurikkala, "Improving Identification of Difficult Small Classes by Balancing Class Distribution," 2001.
