Abstract: With the growth of big data in the biomedical and healthcare communities, accurate analysis of medical data benefits early disease detection, patient care, and community services. However, analysis accuracy is reduced when the medical data is incomplete. We evaluate the modified prediction models on real-life hospital data collected in central China from 2013 to 2015.
To overcome the difficulty of incomplete data, we use a latent factor model to reconstruct the missing data. We propose a new convolutional neural network based multimodal disease risk prediction (CNN-MDRP) algorithm using structured and unstructured data from hospitals. To the best of our knowledge, no existing work has focused on both data types in the area of medical big data analytics. Compared to several typical prediction algorithms, the prediction accuracy of our proposed algorithm reaches 94.8%, with a convergence speed faster than that of the CNN-based unimodal disease risk prediction (CNN-UDRP) algorithm.

I. INTRODUCTION

According to a report by McKinsey, 50% of Americans have one or more chronic diseases, and 80% of American medical care spending goes to chronic disease treatment. With the improvement of living standards, the incidence of chronic disease is increasing. The United States spends an average of 2.7 trillion USD annually on chronic disease treatment, which amounts to 18% of its annual GDP. The healthcare problem of chronic diseases is also very important in many other countries. With the growth in medical data, collecting electronic health records is increasingly convenient. In addition, prior work first presented a bio-inspired high-performance heterogeneous vehicular telemetric paradigm, in which mobile users' health-related real-time big data can be collected through the deployment of advanced heterogeneous vehicular networks.
Moreover, the first paper proposing a healthcare cyber-physical system innovatively brought forward the concept of prediction-based healthcare applications, including health risk assessment. These models are valuable in clinical situations and are widely studied. However, these schemes have the following characteristics and defects: the data set is typically small, it covers patients and diseases with specific conditions, and the characteristics are selected through experience. Such pre-selected characteristics may not keep up with changes in the disease and its influencing factors.

II. METHODS

In this section, we introduce the data imputation, the CNN-based unimodal disease risk prediction (CNN-UDRP) algorithm, and the CNN-based multimodal disease risk prediction (CNN-MDRP) algorithm.
1) Data Imputation
Patients' examination data contain a large number of missing values due to human error, so we need to fill in the missing values. Before data imputation, we first identify uncertain or incomplete medical data and then modify or delete them to improve the data quality. Then, we use data integration for data pre-processing. We can integrate the medical data to guarantee data atomicity; for example, we integrate height and weight to obtain the body mass index (BMI). For data imputation, we use the latent factor model, which explains the observable variables in terms of the latent variables. Accordingly, assume that $R_{m \times n}$ is the data matrix in our healthcare model.
The row dimension $m$ represents the total number of patients, and the column dimension $n$ represents the number of feature attributes of each patient. Assuming that there are $k$ latent factors, the original matrix $R$ can be approximated as

$$R_{m \times n} \approx P_{m \times k} Q_{n \times k}^{T}.$$

Thus, each element value can be written as $\hat{r}_{uv} = p_u^{T} q_v$, where $p_u$ is the vector of the user factor, which indicates the patient's preference for these latent factors, and $q_v$ is the vector of the feature attribute factor. The values of $p_u$ and $q_v$ in the above formula are unknown. To solve for them, we transform the problem into an optimization problem:

$$\min_{\{p_u\},\{q_v\}} \sum_{(u,v)} \left[ \left( r_{uv} - p_u^{T} q_v \right)^2 + \lambda_1 \lVert p_u \rVert^2 + \lambda_2 \lVert q_v \rVert^2 \right],$$

where the sum runs over the observed entries, $r_{uv}$ is the real data, $p_u$ and $q_v$ are the parameters to be solved, and $\lambda_i$, $i = 1, 2$, is a regularization constant that prevents overfitting during the operation process.
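As an illustration of how such a factorization might be fitted and used for imputation, the following is a minimal NumPy sketch, not the authors' implementation. It assumes missing entries are encoded as NaN, a squared-error objective with L2 regularization on the observed entries, and per-entry stochastic gradient updates. The function name `impute_latent_factor` and all hyperparameter defaults (`k`, `lam1`, `lam2`, `lr`, `epochs`) are hypothetical choices made for the example.

```python
import numpy as np

def impute_latent_factor(R, k=2, lam1=0.01, lam2=0.01, lr=0.01, epochs=200, seed=0):
    """Fill NaN entries of R with p_u^T q_v, where P and Q are fitted on observed entries.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    m, n = R.shape
    P = 0.1 * rng.standard_normal((m, k))  # patient factors, one row p_u per patient
    Q = 0.1 * rng.standard_normal((n, k))  # feature factors, one row q_v per attribute
    observed = np.argwhere(~np.isnan(R))   # (u, v) index pairs of known entries

    for _ in range(epochs):
        rng.shuffle(observed)              # visit observed entries in random order
        for u, v in observed:
            err = R[u, v] - P[u] @ Q[v]    # r_uv - p_u^T q_v
            p_old = P[u].copy()
            # Gradient steps; the constant factor 2 is absorbed into lr.
            P[u] += lr * (err * Q[v] - lam1 * P[u])
            Q[v] += lr * (err * p_old - lam2 * Q[v])

    R_hat = P @ Q.T
    # Keep observed values as they are and fill only the missing ones.
    return np.where(np.isnan(R), R_hat, R)

# Toy example: 3 patients, 3 attributes, two missing values.
R = np.array([[1.0, 2.0, np.nan],
              [2.0, np.nan, 6.0],
              [3.0, 6.0, 9.0]])
R_filled = impute_latent_factor(R)
```

Only the missing entries are overwritten, so the observed measurements pass through unchanged. In practice the attributes would likely need to be scaled to comparable ranges before factorization.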
This optimization problem can be solved with the stochastic gradient descent method.

As discussed above, CNN-UDRP uses only the text data to predict whether the patient is at high risk of cerebral infarction. To handle both structured and unstructured text data, we design the CNN-MDRP algorithm based on CNN-UDRP. For the fully connected layer, the computation is similar to that of the CNN-UDRP algorithm. Because the number of features changes, the corresponding weight matrix and bias become $W^{3}_{new}$ and $b^{3}_{new}$, respectively.
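One way to read the change in the fully connected layer is that features from both data types are joined into a single vector, so the weight matrix must grow to match. The sketch below illustrates that reading under explicit assumptions: the two feature sets are fused by simple concatenation, the fully connected layer feeds a softmax over two risk classes directly, and all sizes (100 text features, 10 structured features) are hypothetical. The names `softmax` and `fused_risk_probabilities` are made up for this example.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fused_risk_probabilities(text_feats, struct_feats, W3_new, b3_new):
    """Concatenate text and structured features, apply one dense layer, then softmax.

    Concatenation is an assumed fusion method; the source only states that the
    weight matrix and bias change because the number of features changes.
    """
    fused = np.concatenate([text_feats, struct_feats], axis=-1)
    return softmax(fused @ W3_new + b3_new)

# Hypothetical sizes: 4 patients, 100 CNN text features, 10 structured features.
rng = np.random.default_rng(0)
text_feats = rng.standard_normal((4, 100))
struct_feats = rng.standard_normal((4, 10))
W3_new = 0.01 * rng.standard_normal((110, 2))  # rows grow from 100 to 110
b3_new = np.zeros(2)

probs = fused_risk_probabilities(text_feats, struct_feats, W3_new, b3_new)
```

Each row of `probs` holds the predicted probabilities of the two risk classes for one patient. The weights here are random placeholders; in the described method they would be learned during training.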
We also use a softmax classifier. Next, we describe how to train the CNN-MDRP algorithm. The training process is divided into two parts.

1.1) Training word embedding
Word vector training requires a pure corpus, which is generated from the hospital dataset after cleaning.

1.2) Training parameters of CNN-MDRP
A gradient method is used to train the parameters.
After training the parameters, we obtain the risk assessment of the patient's disease.

III. CONCLUSION

In this paper, we propose a new convolutional neural network based multimodal disease risk prediction (CNN-MDRP) algorithm using structured and unstructured data from hospitals.
To the best of our knowledge, no existing work has focused on both data types in the area of medical big data analytics. Compared to several typical prediction algorithms, the prediction accuracy of our proposed algorithm reaches 94.8%, with a convergence speed faster than that of the CNN-based unimodal disease risk prediction (CNN-UDRP) algorithm. The accuracy of risk prediction depends on the diversity of features in the hospital data: the better the feature description of the disease, the higher the accuracy.