
Information Fusion

journal homepage: www.elsevier.com/locate/inffus

A survey on deep learning for big data

Qingchen Zhang (a,b), Laurence T. Yang (*,a,b), Zhikui Chen (c), Peng Li (c)

a School of Electronic Engineering, University of Electronic Science and Technology of China, China

b Department of Computer Science, St. Francis Xavier University, Antigonish, Canada

c School of Software Technology, Dalian University of Technology, Dalian, China

ARTICLE INFO

Keywords:

Deep learning

Big data

Stacked auto-encoders

Deep belief networks

Convolutional neural networks

Recurrent neural networks

ABSTRACT

Deep learning, currently one of the most remarkable machine learning techniques, has achieved great success in many applications such as image analysis, speech recognition and text understanding. It uses supervised and unsupervised strategies to learn multi-level representations and features in hierarchical architectures for the tasks of classification and pattern recognition. Recent developments in sensor networks and communication technologies have enabled the collection of big data. Although big data provides great opportunities for a broad range of areas including e-commerce, industrial control and smart medicine, it poses many challenging issues for data mining and information processing due to its characteristics of large volume, large variety, large velocity and large veracity. In the past few years, deep learning has played an important role in big data analytic solutions. In this paper, we review the emerging research on deep learning models for big data feature learning. Furthermore, we point out the remaining challenges of big data deep learning and discuss future topics.

1. Introduction

Recently, cyber-physical-social systems, together with sensor networks and communication technologies, have made great progress, enabling the collection of big data [1,2]. Big data can be defined by four characteristics, i.e., large volume, large variety, large velocity and large veracity, which is usually called the 4V's model [3–5]. The most remarkable characteristic of big data is its large volume, which implies an explosion in the amount of data. For example, Flickr generates about 3.6 TB of data and Google processes about 20,000 TB of data every day. The National Security Agency reports that approximately 1.8 PB of data is gathered on the Internet every day. One distinctive characteristic of big data is its large variety, which refers to the different types of data formats, including text, images, videos, graphics, and so on. Most traditional data is structured and is easily stored in two-dimensional tables. However, more than 75% of big data is unstructured. Typical unstructured data is multimedia data collected from the Internet and mobile devices [6]. Large velocity means that big data is generated quickly and needs to be processed in real time. The real-time analysis of big data is crucial for e-commerce to provide online services. Another important characteristic of big data is its large veracity, which refers to the existence of a huge number of noisy, incomplete, inaccurate, imprecise and redundant objects [7]. The size of big data continues to grow at an unprecedented rate and will reach 35 ZB by 2020. However, merely having massive data is inadequate. For most applications, such as industry and medicine, the key is to find and extract valuable knowledge from big data to support prediction services. Take, for example, physical devices in industrial manufacturing that occasionally suffer mechanical malfunctions. If we can analyze the collected device parameters effectively before the devices break down, we can take immediate action to avoid a catastrophe. While big data provides great opportunities for a broad range of areas including e-commerce, industrial control and smart medicine, it poses many challenging issues for data mining and information processing. In fact, it is difficult for traditional methods to analyze and process big data effectively and efficiently because of its large variety and large veracity.

Deep learning is playing an important role in big data solutions since it can harvest valuable knowledge from complex systems [8]. In particular, deep learning has become one of the most active research topics in the machine learning community since it was presented in 2006 [9–11]. In fact, deep learning can be traced back to the 1940s. However, traditional training strategies for multi-layer neural networks tend to result in a locally optimal solution or cannot guarantee convergence. Therefore, multi-layer neural networks were not widely applied, even though it was recognized that they could achieve better performance for feature and representation learning. In 2006, Hinton et al. [12] proposed a two-stage strategy, pre-training and fine-tuning, for training deep learning models effectively, leading to the first breakthrough of deep learning. In addition, the increase of computing power and data size has also contributed to the popularity of deep learning. With the arrival of the era of big data, a large number of samples can be collected to train the parameters of deep learning models. Meanwhile, training a large-scale deep learning model requires high-performance computing systems. Take, for example, the large-scale deep belief network with more than 100 million free parameters and millions of training samples developed by Raina et al. [13]. With a GPU-based framework, the training time for such a model is reduced from several weeks to about one day. Typically, deep learning models use unsupervised pre-training and supervised fine-tuning to learn hierarchical features and representations of big data in deep architectures for the tasks of classification and recognition [14]. Deep learning has achieved state-of-the-art performance in a broad range of applications such as computer vision [15,16], speech recognition [17,18] and text understanding [19,20].

http://dx.doi.org/10.1016/j.inffus.2017.10.006
Received 30 August 2017; Accepted 17 October 2017
* Corresponding author.
E-mail address: [email protected] (L.T. Yang).
Information Fusion 42 (2018) 146–157
Available online 11 November 2017
1566-2535/ © 2017 Elsevier B.V. All rights reserved.

In the past few years, deep learning has made great progress in big data feature learning [21–23]. Compared to conventional shallow machine learning techniques such as the support vector machine and Naive Bayes, deep learning models can take advantage of many samples to extract high-level features and to learn hierarchical representations by combining the low-level input more effectively for big data with the characteristics of large variety and large veracity. In this paper, we review the emerging research work on deep learning models for big data feature learning. In Section 2, we first present the four most typical deep learning models, i.e., the stacked auto-encoder, the deep belief network, the convolutional neural network and the recurrent neural network, which are also the most widely used for big data feature learning. Afterwards, we provide an overview of deep learning models for big data according to the 4V's model, including large-scale deep learning models for huge amounts of data, multi-modal deep learning models and the deep computation model for heterogeneous data, incremental deep learning models for real-time data and reliable deep learning models for low-quality data. Finally, we discuss the remaining challenges of deep learning on big data and point out the potential trends.

2. Typical deep learning models

Since deep learning was presented in Science magazine in 2006, it has become an extremely hot research topic in the machine learning community. Various deep learning models have been developed in the past few years. The most typical deep learning models include the stacked auto-encoder (SAE), the deep belief network (DBN), the convolutional neural network (CNN) and the recurrent neural network (RNN), which are also the most widely used models. Most other deep learning models can be regarded as variants of these four deep architectures. In the following parts, we review the four typical deep learning models briefly.

2.1. Stacked auto-encoder (SAE)

A stacked auto-encoder model is usually constructed by stacking several auto-encoders, which are among the most typical feed-forward neural networks [24–26]. A basic auto-encoder has two stages, i.e., an encoding stage and a decoding stage, as presented in Fig. 1.

In the encoding stage, the input x is transformed to the hidden layer h via the encoding function f:

$$h = f(W^{(1)} x + b^{(1)}). \tag{1}$$

Afterwards, in the decoding stage, the hidden representation h is reconstructed back to the original input, and the reconstruction is denoted by y:

$$y = g(W^{(2)} h + b^{(2)}). \tag{2}$$

Typically, the encoding function and the decoding function are non-linear mapping functions. Four widely used non-linear activation functions are the sigmoid function $f(x) = 1/(1 + e^{-x})$, the tanh function $f(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$, the softsign function $f(x) = x/(1 + |x|)$ and the ReLU (Rectified Linear Unit) function $f(x) = \max(0, x)$. The graphs of the four non-linear activation functions are presented in Fig. 2.
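
As a minimal illustrative sketch (not code from the surveyed works), the four activation functions above can be written in plain Python. The function names are chosen here for illustration only, and the branch inside `sigmoid` is an implementation detail added to avoid floating-point overflow.

```python
import math


def sigmoid(x: float) -> float:
    """Sigmoid: f(x) = 1 / (1 + e^(-x)), evaluated in a numerically stable way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # avoids overflow of exp(-x) for very negative x
    return e / (1.0 + e)


def tanh(x: float) -> float:
    """Hyperbolic tangent: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))."""
    return math.tanh(x)


def softsign(x: float) -> float:
    """Softsign: f(x) = x / (1 + |x|)."""
    return x / (1.0 + abs(x))


def relu(x: float) -> float:
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)


if __name__ == "__main__":
    for x in (-2.0, 0.0, 2.0):
        print(x, sigmoid(x), tanh(x), softsign(x), relu(x))
```

All four map a real number to a real number, so any of them can serve as f or g in Eqs. (1) and (2).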

$\theta = \{W^{(1)}, b^{(1)}; W^{(2)}, b^{(2)}\}$ is the parameter set of the basic auto-encoder, and it is usually trained by minimizing the loss function $J_\theta$ with regard to m training samples:

$$J_\theta = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - x^{(i)} \right)^2, \tag{3}$$

where $x^{(i)}$ denotes the ith training sample and $y^{(i)}$ denotes its reconstruction.

Obviously, the parameters of the basic auto-encoder are trained in an unsupervised manner. The hidden layer h is viewed as the extracted feature or the hidden representation of the input data x. When the size of h is smaller than that of x, the basic auto-encoder can be viewed as an approach for data compression.

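The basic auto-encoder model has some variants. For example, a regularization term named weight decay is usually integrated into the loss function to prevent over-fitting:

$$J_\theta = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - x^{(i)} \right)^2 + \lambda \sum_{j=1}^{2} \left( W^{(j)} \right)^2, \tag{4}$$

where $\left( W^{(j)} \right)^2$ denotes the sum of the squared entries of $W^{(j)}$ and $\lambda$ is a hyper-parameter used to control the strength of the weight decay.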
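
The following is a minimal sketch of the forward pass in Eqs. (1) and (2) and of the loss in Eqs. (3) and (4), written in plain Python for readability. It assumes the sigmoid function for both f and g, and the helper names (`affine`, `forward`, `loss`) are hypothetical; it is not taken from any of the surveyed implementations.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, b):
    """Compute W x + b for a matrix W (list of rows) and vectors x, b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def forward(x, W1, b1, W2, b2):
    """Eq. (1): h = f(W1 x + b1); Eq. (2): y = g(W2 h + b2), with f = g = sigmoid (assumed)."""
    h = [sigmoid(z) for z in affine(W1, x, b1)]
    y = [sigmoid(z) for z in affine(W2, h, b2)]
    return h, y


def loss(samples, W1, b1, W2, b2, lam=0.0):
    """Mean squared reconstruction error (Eq. 3) plus weight decay (Eq. 4)."""
    total = 0.0
    for x in samples:
        _, y = forward(x, W1, b1, W2, b2)
        total += sum((yk - xk) ** 2 for yk, xk in zip(y, x))
    # Weight decay: sum of squared entries of both weight matrices.
    decay = sum(w * w for W in (W1, W2) for row in W for w in row)
    return total / len(samples) + lam * decay
```

With `lam=0.0` the function reduces to Eq. (3). A training procedure would minimize this value with respect to the four parameter arrays, for example by gradient descent, which is omitted here.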

Another representative variant is the sparse auto-encoder [27,28]. To make the learned features sparse, the sparse auto-encoder adds a sparsity constraint to the hidden units, leading to the corresponding loss function:

$$J_\theta = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - x^{(i)} \right)^2 + \sum_{j=1}^{n} KL(p \,\|\, p_j), \tag{5}$$

where n denotes the number of neurons in the hidden layer and the second term denotes the KL-divergence. Specifically, the KL-divergence with regard to the jth neuron is defined as:

$$KL(p \,\|\, p_j) = p \log \frac{p}{p_j} + (1 - p) \log \frac{1 - p}{1 - p_j}, \tag{6}$$

where p denotes a predefined sparsity parameter that is close to 0 and $p_j$ denotes the average activation value of the jth neuron in the hidden layer over all the training samples. Generally, a small p value close to 0 will result in a very sparse hidden representation learned by the auto-encoder.

Fig. 1. Basic auto-encoder.

Fig. 2. Functional graph of non-linear activation functions.
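
The sparsity term of Eqs. (5) and (6) can be sketched as follows. This is an illustrative example only: the function names are hypothetical, and the hidden activations are assumed to lie strictly between 0 and 1 (as with a sigmoid hidden layer) so that the logarithms are defined.

```python
import math


def kl_bernoulli(p, p_j):
    """Eq. (6): KL divergence between Bernoulli(p) and Bernoulli(p_j).

    Both arguments must lie strictly between 0 and 1.
    """
    return p * math.log(p / p_j) + (1.0 - p) * math.log((1.0 - p) / (1.0 - p_j))


def sparsity_penalty(p, hidden_activations):
    """Second term of Eq. (5): sum over hidden neurons of KL(p || p_j).

    hidden_activations is a list with one hidden vector per training sample;
    p_j is the average activation of neuron j over all samples.
    """
    m = len(hidden_activations)
    n = len(hidden_activations[0])
    averages = [sum(h[j] for h in hidden_activations) / m for j in range(n)]
    return sum(kl_bernoulli(p, p_j) for p_j in averages)
```

The penalty is zero when every average activation equals p and grows as the averages move away from p, which is what pushes the hidden representation towards sparsity when p is small.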

Several auto-encoders can be stacked to construct a deep learning model, called a stacked auto-encoder, to learn hierarchical features or representations for the input, as presented in Figs. 3 and 4.

The stacked auto-encoder is typically trained in two stages, i.e., pre-training and fine-tuning. As shown in Fig. 3, let $h_0 = X$ and $h_i$ denote the input layer and the ith hidden layer, respectively. In the pre-training stage, each auto-encoder is trained in an unsupervised layer-wise manner from bottom to top. In detail, the first auto-encoder takes $h_0 = X$ as input and $Y_0$ as output to train the parameters of the first hidden layer, and then $h_1$ is fed as the input to train the parameters of the second hidden layer. This operation is repeated until the parameters of all the hidden layers are trained. After pre-training, the learned parameters are used as the initial parameters of the stacked auto-encoder. In the fine-tuning stage, some labeled samples are used as the supervised targets to fine-tune the parameters from top to bottom, as presented in Fig. 4. According to Hinton et al. [12], this two-stage training strategy can effectively avoid local optima and achieve better convergence for deep learning models.

2.2. Deep belief network (DBN)

The first deep learning model to be trained successfully is the deep belief network [12,29]. Unlike the stacked auto-encoder, the deep belief network is built by stacking several restricted Boltzmann machines. A restricted Boltzmann machine consists of two layers, i.e., a visible layer v and a hidden layer h, as presented in Fig. 5 [30,31].

A typical restricted Boltzmann machine uses Gibbs sampling to train the parameters. Specifically, the restricted Boltzmann machine uses the conditional probability p(h|v) to calculate the value of each unit in the hidden layer and then uses the conditional probability p(v|h) to calculate the value of each unit in the visible layer. This process is performed repeatedly until convergence.

The joint distribution of the restricted Boltzmann machine with regard to all the units is defined as:

$$p(v, h; \theta) = \frac{\exp(-E(v, h; \theta))}{Z}, \tag{7}$$

where $Z = \sum_{v} \sum_{h} \exp(-E(v, h; \theta))$ is used for normalization. E denotes the energy function with the Bernoulli distribution, which is calculated via:

$$E(v, h; \theta) = -\sum_{i=1}^{I} \sum_{j=1}^{J} w_{ij} v_i h_j - \sum_{i=1}^{I} b_i v_i - \sum_{j=1}^{J} a_j h_j, \tag{8}$$

where I and J denote the number of visible units and hidden units, respectively, and $\theta = \{W, b, a\}$ denotes the parameter set of the restricted Boltzmann machine.

The sampling probability of each unit is calculated as follows:

$$p(h_j = 1 \mid v; \theta) = f\left( \sum_{i=1}^{I} w_{ij} v_i + a_j \right), \tag{9}$$

$$p(v_i = 1 \mid h; \theta) = f\left( \sum_{j=1}^{J} w_{ij} h_j + b_i \right), \tag{10}$$

where f is typically a sigmoid function.
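
To make the alternating sampling concrete, the sketch below performs one Gibbs step using Eqs. (9) and (10): it samples the hidden units given the visible units and then resamples the visible units given the hidden sample. It is a simplified illustration with hypothetical function names; the weight updates used in actual training (for example, contrastive divergence) are not shown.

```python
import math
import random


def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def hidden_probs(v, W, a):
    """Eq. (9): p(h_j = 1 | v) = f(sum_i w_ij v_i + a_j), with W[i][j] = w_ij."""
    return [sigmoid(sum(W[i][j] * v[i] for i in range(len(v))) + a[j])
            for j in range(len(a))]


def visible_probs(h, W, b):
    """Eq. (10): p(v_i = 1 | h) = f(sum_j w_ij h_j + b_i)."""
    return [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))) + b[i])
            for i in range(len(b))]


def sample(probs, rng):
    """Draw one independent Bernoulli sample per probability."""
    return [1 if rng.random() < p else 0 for p in probs]


def gibbs_step(v, W, a, b, rng):
    """One alternating step: sample h from p(h | v), then v from p(v | h)."""
    h = sample(hidden_probs(v, W, a), rng)
    v_new = sample(visible_probs(h, W, b), rng)
    return h, v_new
```

Here `a` holds the hidden biases and `b` the visible biases, following the notation of Eq. (8). Passing a seeded `random.Random` instance as `rng` makes a run reproducible.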

An important variant is the restricted Boltzmann machine with the Gauss–Bernoulli distribution, whose energy function is calculated as follows [32]:

$$E(v, h; \theta) = -\sum_{i=1}^{I} \sum_{j=1}^{J} w_{ij} v_i h_j + \frac{1}{2} \sum_{i=1}^{I} (v_i - b_i)^2 - \sum_{j=1}^{J} a_j h_j. \tag{11}$$

The corresponding conditional probability of each visible unit is calculated via:

$$p(v_i \mid h; \theta) = \mathcal{N}\left( \sum_{j=1}^{J} w_{ij} h_j + b_i,\; 1 \right), \tag{12}$$

where $v_i$ denotes a real value that follows the Gaussian distribution with a mean of $\sum_{j=1}^{J} w_{ij} h_j + b_i$ and a variance of 1. The restricted Boltzmann machine with the Gauss–Bernoulli distribution can transform real-valued random variables into binary variables.

Fig. 3. Stacked auto-encoder for pre-training.

Fig. 4. Stacked auto-encoder for fine-tuning.

Fig. 5. Restricted Boltzmann machine.

Several restricted Boltzmann machines can be stacked into a deep learning model, called a deep belief network, as presented in Fig. 6 [12]. Similar to the stacked auto-encoder, the deep belief network is also trained by a two-stage strategy. The pre-training stage trains the initial parameters in a greedy layer-wise unsupervised manner, while the fine-tuning stage uses a supervised strategy to fine-tune the parameters with regard to the labeled samples by adding a softmax layer on top. Deep belief networks have a wide range of applications in image classification [33,34], acoustic modeling [35] and so on [36–38].

2.3. Convolutional neural network (CNN)

The convolutional neural network is the most widely used deep learning model in feature learning for large-scale image classification and recognition [39–43]. A convolutional neural network consists of three types of layers, i.e., the convolutional layer, the subsampling layer (pooling layer) and the fully-connected layer, as presented in Fig. 7 [44].

The convolutional layer uses the convolution operation to achieve weight sharing, while the subsampling layer is used to reduce the dimension. Take a 2-dimensional image x as an example. The image is first decomposed into a sequential input $x = \{x_1, x_2, \ldots, x_N\}$. To share the weights, the convolutional layer is defined as:

$$y_j = f\left( \sum_{i} K_{ij} \otimes x_i + b_j \right), \tag{13}$$

where $y_j$ denotes the jth output of the convolutional layer and $K_{ij}$ denotes the convolutional kernel applied to the ith input map $x_i$. $\otimes$ denotes the discrete convolution operator and $b_j$ denotes the bias. In addition, f denotes the non-linear activation, typically a scaled hyperbolic tangent function.

The subsampling layer aims to reduce the dimension of the feature map. It is typically implemented by an average pooling operation or a max pooling operation. Afterwards, several fully-connected layers and a softmax layer are typically placed on top for classification and recognition.
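
A small sketch of a convolutional layer in the sense of Eq. (13), followed by max pooling, is given below. It is illustrative only: it computes a single output map, uses "valid" borders with a flipped kernel (true discrete convolution, whereas many frameworks implement cross-correlation under the same name), and defaults to an unscaled tanh activation. These choices and all function names are assumptions of this example.

```python
import math


def conv2d_valid(x, k):
    """Discrete 2-D convolution of image x with kernel k ('valid' borders, kernel flipped)."""
    kh, kw = len(k), len(k[0])
    out = []
    for r in range(len(x) - kh + 1):
        row = []
        for c in range(len(x[0]) - kw + 1):
            s = 0.0
            for u in range(kh):
                for v in range(kw):
                    s += k[kh - 1 - u][kw - 1 - v] * x[r + u][c + v]
            row.append(s)
        out.append(row)
    return out


def conv_layer(inputs, kernels, bias, f=math.tanh):
    """Eq. (13) for one output map j: y_j = f(sum_i K_ij (*) x_i + b_j)."""
    maps = [conv2d_valid(x, k) for x, k in zip(inputs, kernels)]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[f(sum(m[r][c] for m in maps) + bias) for c in range(cols)]
            for r in range(rows)]


def max_pool(x, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [[max(x[r + u][c + v] for u in range(size) for v in range(size))
             for c in range(0, len(x[0]) - size + 1, size)]
            for r in range(0, len(x) - size + 1, size)]
```

Because the same kernel entries are reused at every image position, the number of trainable weights depends on the kernel size rather than the image size, which is the weight sharing described above.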

The deep convolutional neural network usually includes several convolutional layers and subsampling layers for feature learning on large-scale images.

In recent years, convolutional neural networks have also achieved great success in language processing, speech recognition and so on [45–50].

2.4. Recurrent neural network (RNN)

Traditional deep learning models such as stacked auto-encoders, deep belief networks and convolutional neural networks do not take time series into account, so they are not suitable for learning features from time series data. Take a natural language sentence, which is a typical kind of time series data, as an example. Since each word is closely related to the other words in a sentence, one or more previous words should be used as inputs when using the current word to predict the next word. Obviously, feed-forward deep learning models cannot work well for this task since they do not store the information of previous inputs.

The recurrent neural network is a typical sequential learning model. It learns features for series data by keeping a memory of previous inputs in the internal state of the neural network. A directed cycle is introduced to construct the connections between neurons, as presented in Fig. 8.

A recurrent neural network includes input units $\{x_0, x_1, \ldots, x_t, x_{t+1}, \ldots\}$, output units $\{y_0, y_1, \ldots, y_t, y_{t+1}, \ldots\}$ and hidden units $\{s_0, s_1, \ldots, s_t, s_{t+1}, \ldots\}$. As shown in Fig. 8, at time step t, the recurrent neural network takes the current sample $x_t$ and the previous hidden representation $s_{t-1}$ as input to obtain the current hidden representation $s_t$:

$$s_t = f(x_t, s_{t-1}), \tag{14}$$

where f denotes the encoder function.

One widely used recurrent neural network is the vanilla one, which at time step t is defined by the following forward pass:

$$s_t = f(W_{sx} x_t + W_{ss} s_{t-1} + b_s), \tag{15}$$

$$y_t = g(W_{ys} s_t + b_y), \tag{16}$$

where f and g denote the encoder and decoder, respectively, and $\theta = \{W_{sx}, W_{ss}, b_s; W_{ys}, b_y\}$ denotes the parameter set.
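
The forward pass of Eqs. (15) and (16) can be sketched in a few lines. In this illustrative example the encoder f is assumed to be tanh and the decoder g the identity; the text does not fix either choice, and the function names are hypothetical.

```python
import math


def matvec(W, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rnn_forward(xs, W_sx, W_ss, b_s, W_ys, b_y, s0):
    """Run Eqs. (15)-(16) over the sequence xs, starting from hidden state s0."""
    s, ys = list(s0), []
    for x in xs:
        pre = [a + b + c
               for a, b, c in zip(matvec(W_sx, x), matvec(W_ss, s), b_s)]
        s = [math.tanh(z) for z in pre]                           # Eq. (15), f = tanh (assumed)
        ys.append([a + b for a, b in zip(matvec(W_ys, s), b_y)])  # Eq. (16), g = identity (assumed)
    return ys, s
```

The hidden state `s` is the only quantity carried from one time step to the next, so it is the memory through which earlier inputs influence later outputs.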

Therefore, the recurrent neural network captures the dependency between the current sample $x_t$ and the previous one $x_{t-1}$ by integrating the previous hidden representation $s_{t-1}$ into the forward pass. From a theoretical point of view, the recurrent neural network can capture arbitrary-length dependencies. However, it is difficult for the recurrent neural network to capture long-term dependencies because the gradient vanishes when the parameters are trained with the back-propagation strategy. To tackle this problem, some models, such as long short-term memory, have been presented to prevent the gradient from vanishing or exploding [51–54].

Multiple recurrent neural networks can be stacked into a deep learning model. The recurrent neural network and its variants have achieved superior performance in many applications such as natural language processing, speech recognition and machine translation [55–59].

Fig. 6. Deep belief network.

Fig. 7. Convolutional neural network.

Fig. 8. Recurrent neural network.

3. Deep learning models for big data feature learning

Big data is typically defined by the following four characteristics:

volume, variety, velocity and veracity. In this section, we review the

deep learning models for big data feature learning from four aspects,

i.e., deep learning models for huge amounts of data, deep learning

models for heterogeneous data, deep learning models for real-time data

and deep learning models for low-quality data.

3.1. Deep learning models for huge amounts of data

First and foremost, huge amounts of data pose a big challenge to deep learning models. A big dataset often includes a great many samples, each with a large number of attributes. Furthermore, there are many classes of samples in a big dataset. In order to learn features and representations for large amounts of data, some large-scale deep learning models have been developed. Generally, a large-scale deep learning model involves a few hidden layers, each with a large number of neurons, leading to millions of parameters. Training such large-scale deep learning models is typically a difficult task.

In recent years, many algorithmic methods have been presented to train large-scale models, which can be roughly grouped into three categories, i.e., parallel deep learning models, GPU-based implementations, and optimized deep learning models.

One of the most representative parallel deep learning models is the deep stacking network presented by Deng et al. [60]. A deep stacking network is composed of several modules. Fig. 9 shows a specific example of a deep stacking network with three modules.

In a deep stacking network, each module is itself a neural network with a hidden layer and two sets of weights, as presented in Fig. 9. The lowest module consists of three layers from bottom to top. The bottom is a linear layer that uses the original data as input, while the hidden layer is a non-linear one with some hidden neurons. Similar to most deep learning models, the deep stacking network uses the sigmoid function to map the input to the hidden layer through a weight matrix and a bias vector. The top is also a linear layer, composed of C output neurons that denote the classification targets.

The original data is concatenated with the output layer(s) of the previous module(s), and the concatenated vector is used as the input of each module above the lowest one. For example, if each original data object is represented by an n-dimensional vector and there are c class types, the dimension D of the input vector of the mth module, counting from the bottom up, is $D = n + c \times (m - 1)$.
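
As a quick worked example of this formula, the helper below computes the input dimension of each module. The function name and the numbers used in the usage note (n = 784, c = 10) are hypothetical and chosen only for illustration.

```python
def dsn_input_dim(n: int, c: int, m: int) -> int:
    """Input dimension of the m-th module (1-based, counted from the bottom).

    Each module sees the n original features plus the c outputs of every
    module below it, so D = n + c * (m - 1).
    """
    if m < 1:
        raise ValueError("modules are numbered from 1")
    return n + c * (m - 1)
```

With n = 784 and c = 10, the three modules of a network like the one in Fig. 9 would take inputs of dimension 784, 794 and 804, respectively.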

The deep stacking network is efficient to train since its training can be parallelized. Furthermore, a tensor deep stacking network was presented to further improve the training efficiency on CPU clusters [61].

Recently, a software framework called DistBelief was developed to train large-scale deep learning models on a large number of machines in parallel [62–64]. DistBelief can efficiently train large-scale models with billions of free parameters and huge amounts of data by combining data parallelism and model parallelism. To achieve model parallelism, a large deep learning model is partitioned into small blocks and each block is assigned to a computer for training. Fig. 10 presents one example of DistBelief with four blocks [8].

DistBelief needs to transfer data among the computers to train the deep learning models, which results in a great deal of communication, especially for fully-connected networks such as the stacked auto-encoder and the deep belief network. In spite of this, DistBelief still improves the training efficiency significantly by partitioning a large deep model into 144 blocks, as reported in [62].

DistBelief achieves data parallelism by implementing two optimization procedures, i.e., Downpour and Sandblaster. The former is used for online optimization while the latter is used for batch optimization.

DistBelief has obtained a high speedup for training several large-scale deep learning models. For example, on 81 machines it achieved a speedup of 12× over a single machine for a convolutional neural network with 1.7 billion parameters and 16 million images. Besides, it also achieved a significant improvement in training efficiency for another deep learning architecture with 14 million images, each with a size of 200 × 200 pixels, on 1000 machines, each with 16 CPU cores. Therefore, DistBelief is very suitable for big data feature learning since it is able to scale up over many computers, which is its most remarkable advantage [8].

The deep stacking network and DistBelief typically use multiple CPU cores to improve the training efficiency of large-scale deep learning models. Some details about the use of multiple CPU cores for scaling up deep belief networks, such as implementing data layout and using SSE2 instructions, were discussed in [65].

More recently, some large-scale deep learning frameworks based on graphics processing units (GPUs) have been explored. GPUs typically offer great computing power and a large memory bandwidth, so they are suitable for parallel computing for large-scale deep learning models. Some experiments have demonstrated great advances in large-scale deep learning frameworks based on GPUs.

For example, Raina et al. [13] developed a GPU-based deep learning framework for training large-scale deep belief networks and sparse coding in parallel, with more than 100 million parameters and millions of training objects. To improve the efficiency of parallelizing the learning models, Raina et al. [13] used some specialized strategies in their learning framework [66,67]. For instance, they put the parameters and some training objects into the global memory to reduce data transfer. Besides, they implemented parallel Gibbs sampling of the hidden and visible neurons by producing two sampling matrices, i.e., p(h|x) and p(x|h). With this framework, a deep belief network constructed from multiple restricted Boltzmann machines, each with 45 million free parameters and 1 million training objects, was sped up by a factor of 70×.

Fig. 9. A deep stacking network with three modules.

Fig. 10. Example of DistBelief with four blocks.

Another recently developed large-scale deep learning system is the commodity off-the-shelf high performance computing system, which is composed of 16 GPU servers. Each server consists of 4 NVIDIA GTX680 GPUs, each with 4 GB of memory. This system trains large deep learning models by implementing CUDA kernels with effective memory usage and efficient computation [68]. For instance, Coates et al. [68] make full use of matrix sparseness and local receptive fields to improve the efficiency of large matrix multiplications. Compared with DistBelief, which needs 16 thousand CPU cores to train a large deep learning model with 10 million images and a billion parameters, the commodity off-the-shelf high performance computing system achieved almost the same training efficiency (e.g., about 3 days) for training the same model on three computers.

FPGA-based approaches have also been explored for large-scale deep learning models in recent years [69,70]. Chen and Lin review the recent progress in large-scale deep learning frameworks in [8].

Generally, large-scale deep learning models can only be trained on high-performance computing servers equipped with multiple CPU cores or GPUs, which limits the application of such models on low-end devices. Based on recent research indicating that the parameters of large-scale deep learning models, especially those with fully-connected layers, are highly redundant, some methods have been presented to improve the training efficiency by compressing the parameters significantly without a large accuracy drop.

One typical and straightforward method for compressing deep learning models is the use of a low-rank representation of the parameter matrices [71–73]. For example, Sainath et al. [72] applied low-rank factorization to a deep neural network with 5 hidden layers for speech recognition. Specifically, they decomposed the last weight matrix into two smaller matrices to compress the parameters significantly, since more than half of the parameters are included in the final layer in deep learning models for speech recognition. In detail, let A denote the m × n weight matrix with rank r. Sainath et al. [72] decomposed A into B and C, namely A = B × C, which are of dimension m × r and r × n, respectively. Obviously, if mr + rn < mn, the parameters of the deep learning model are compressed. Furthermore, if we want to compress the parameters of the final layer to a fraction p of their original number, i.e., mr + rn < pmn, the following condition must be satisfied:

$$r < \frac{pmn}{m + n}. \tag{17}$$

The low-rank factorization of the deep learning model is able to constrain the space of search directions, which helps to optimize the objective function more efficiently. Experiments indicated that this approach could reduce the number of parameters of the deep learning models by up to 50%, which leads to about a 50% speedup without a large loss of accuracy.
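
The parameter counts behind Eq. (17) can be checked with a short calculation. The sketch below is illustrative: the function names are hypothetical, p is interpreted as the target fraction of the original parameter count, and the layer size in the usage note is an arbitrary example rather than a configuration reported in [72].

```python
import math


def full_params(m: int, n: int) -> int:
    """Number of entries in the original m x n weight matrix A."""
    return m * n


def low_rank_params(m: int, n: int, r: int) -> int:
    """Number of entries in B (m x r) plus C (r x n), where A = B x C."""
    return r * (m + n)


def max_rank(m: int, n: int, p: float) -> int:
    """Largest integer r satisfying Eq. (17): r < p * m * n / (m + n)."""
    return math.ceil(p * m * n / (m + n)) - 1
```

For a hypothetical 512 × 2048 layer and p = 0.5, `max_rank` gives r = 204, so the two factors hold 204 × (512 + 2048) = 522,240 values, just under half of the 1,048,576 entries of the full matrix.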

Chen et al. [74] employed the hashing trick to compress large-scale deep learning models. In detail, they employed a hash function to gather the network connections into several hash groups randomly, and the connections in one group share the same weight. Fig. 11 shows one example of a network with one hidden layer [74].

In the neural network in Fig. 11, the connections between the input layer and the hidden layer are represented by $V^1$ while the connections between the hidden layer and the output layer are represented by $V^2$. With a hash function $h^l(\cdot, \cdot)$ that projects an index (i, j) to a natural number, the item $V^l_{ij}$ is assigned to the item of $w^l$ indexed by $h^l(i, j)$:

$$V^l_{ij} = w^l_{h^l(i, j)}. \tag{18}$$

Thus, the connections can be gathered into 3 groups. Furthermore, the connections marked by the same color share the same weights, which are denoted as $w^1$ and $w^2$. Therefore, the parameters of the neural network in Fig. 11 are compressed to 1/4, i.e., 24 parameters are represented by 6 parameters. Experiments demonstrated that this approach could compress the parameters of a neural network by a factor of up to 8× on MNIST without an accuracy drop.
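
The weight sharing of Eq. (18) can be sketched as below. This is an illustration under stated assumptions: CRC32 from the Python standard library stands in for the hash function (it is not claimed to be the one used in [74]), i indexes the input unit and j the output unit, and all function names are hypothetical.

```python
import zlib


def bucket(i: int, j: int, layer: int, num_buckets: int) -> int:
    """Hash the connection index (i, j) of a layer to a bucket, playing the role of h^l(i, j)."""
    key = f"{layer}:{i},{j}".encode()
    return zlib.crc32(key) % num_buckets


def virtual_weight(w, i, j, layer):
    """Eq. (18): V^l_ij = w^l[h^l(i, j)], where w holds the stored (shared) weights."""
    return w[bucket(i, j, layer, len(w))]


def hashed_layer_forward(x, w, n_out, layer):
    """Linear layer whose len(x) * n_out virtual weights are all drawn from w."""
    return [sum(virtual_weight(w, i, j, layer) * x[i] for i in range(len(x)))
            for j in range(n_out)]
```

For a layer with 4 inputs and 4 outputs, `hashed_layer_forward(x, w, 4, 1)` with a 3-element `w` uses 16 virtual connections backed by only 3 stored values. Because the hash is deterministic, the mapping never needs to be stored.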

More recently, tensor decomposition schemes have been used to compress the parameters of large-scale deep learning models 75,76. For example, Novikov et al. 75 proposed a tensorizing learning model based on the tensor-train network. In order to use the tensor-train network to compress the parameters, they converted the neural network to the tensor format. Given a neural network with an N-dimensional input ($N = \prod_{k=1}^{d} n_k$) and an M-dimensional hidden layer ($M = \prod_{k=1}^{d} m_k$), they defined a bijection $\mu(l) = (\mu_1(l), \mu_2(l), \ldots, \mu_d(l))$ to convert the input vector into the tensor format, where $l \in \{1, 2, \ldots, N\}$ and $\mu_k(l) \in \{1, 2, \ldots, n_k\}$ denote the coordinate of the input vector b and the coordinate of the corresponding tensor B, respectively. Therefore, the input tensor B can be obtained by $B(\mu(l)) = b_l$. Similarly, they converted the hidden vector into the tensor format. Furthermore, they converted the weight matrix $w \in \mathbb{R}^{M \times N}$ into the tensor format W using the bijections $\nu(t) = (\nu_1(t), \nu_2(t), \ldots, \nu_d(t))$ and $\mu(l) = (\mu_1(l), \mu_2(l), \ldots, \mu_d(l))$, which project the index (t, l) of w into the corresponding index of the tensor format W.

Afterwards, they converted the weight tensor W into the tensor-train format G:

$$w(t, l) = W\big((\nu_1(t), \mu_1(l)), \ldots, (\nu_d(t), \mu_d(l))\big) = G_1(\nu_1(t), \mu_1(l)) \cdots G_d(\nu_d(t), \mu_d(l)), \tag{19}$$

where $G_k(\nu_k(t), \mu_k(l))$ denotes the core matrix of the tensor-train representation for W with the index $(\nu_k(t), \mu_k(l))$.

Therefore, a linear projection $y = wx + b$ in a fully-connected layer can be transformed into the tensor-train form:

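$$Y(i_1, i_2, \ldots, i_d) = \sum_{j_1, \ldots, j_d} G_1(\nu_1(t), \mu_1(l)) \cdots G_d(\nu_d(t), \mu_d(l))\, X(j_1, \ldots, j_d) + B(i_1, \ldots, i_d), \tag{20}$$

where $i_k = \nu_k(t)$ and $j_k = \mu_k(l)$, so the sum over $j_1, \ldots, j_d$ runs over all coordinates of the input.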

This method could reduce the computational complexity of the

forward pass and improve the training efficiency in the back-propagation

procedure. Table 1 summarizes the computational complexity and

storage complexity of an M × N tensor-train layer (TT) compared with

the original fully-connected layer (FC) 75.
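As an illustration of Eqs. (19) and (20), the following NumPy sketch applies a tensor-train layer without ever forming the dense matrix, and checks the result against the dense product. The core layout `(r_{k-1}, m_k, n_k, r_k)`, the row-major index bijections, the mode sizes and the function names are assumptions made for this example only.

```python
import numpy as np

def tt_to_dense(cores):
    """Rebuild the dense M x N matrix from the TT cores (for checking only)."""
    d = len(cores)
    full = cores[0]                                   # (1, m_1, n_1, r_1)
    for core in cores[1:]:
        # Contract the trailing rank index with the next core's leading one.
        full = np.tensordot(full, core, axes=([full.ndim - 1], [0]))
    full = full.reshape(full.shape[1:-1])             # drop the boundary ranks
    # Axes are (m_1, n_1, ..., m_d, n_d); regroup as (m_1..m_d, n_1..n_d).
    perm = list(range(0, 2 * d, 2)) + list(range(1, 2 * d, 2))
    full = full.transpose(perm)
    M = int(np.prod(full.shape[:d]))
    N = int(np.prod(full.shape[d:]))
    return full.reshape(M, N)

def tt_matvec(cores, x):
    """Compute W x core by core, as in Eq. (20) without the bias term."""
    n_shape = [core.shape[2] for core in cores]
    state = x.reshape([1] + n_shape)                  # (r_0, n_1, ..., n_d)
    for core in cores:
        # Sum over the shared rank index and the current input mode j_k.
        state = np.tensordot(core, state, axes=([0, 2], [0, 1]))
        # Result is (m_k, r_k, rest...); move m_k to the end.
        state = np.moveaxis(state, 0, -1)
    return state.reshape(-1)                          # (m_1 * ... * m_d,)

rng = np.random.default_rng(0)
m_modes, n_modes, rank = (4, 4, 4), (4, 4, 4), 2
ranks = [1, rank, rank, 1]
cores = [rng.normal(size=(ranks[k], m_modes[k], n_modes[k], ranks[k + 1]))
         for k in range(3)]

x = rng.normal(size=int(np.prod(n_modes)))
bias = rng.normal(size=int(np.prod(m_modes)))

y_tt = tt_matvec(cores, x) + bias
y_dense = tt_to_dense(cores) @ x + bias

tt_params = sum(core.size for core in cores)
dense_params = int(np.prod(m_modes)) * int(np.prod(n_modes))
```

With these assumed sizes, the three cores hold 128 values while the equivalent dense 64 × 64 matrix holds 4096.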

Besides, Lebedev et al. 76 employed the canonical polyadic decomposition to compress the parameters of convolutional neural networks and achieved a significant speedup in inference time.

Fig. 11. Example of a neural network which is compressed by the hashing trick.

Q. Zhang et al. Information Fusion 42 (2018) 146–157


3.2. Deep learning models for heterogeneous data

A distinct characteristic of big data is its variety, implying that big data is collected in various formats, including structured, unstructured and semi-structured data, from a large number of sources. In particular, a great number of objects in big datasets are multi-modal. For example, a webpage typically contains images and text simultaneously. Another multi-modal example is a multimedia object such as a video clip, which includes still images, text and audio. Each modality of a multi-modal object has characteristics different from those of the others, leading to the complexity of heterogeneous data. Therefore, heterogeneous data poses another challenge for deep learning models.

Some multi-modal deep learning models have been proposed for heterogeneous data representation learning. For example, Ngiam et al. 77 developed a multi-modal deep learning model for feature learning on audio-video objects. Fig. 12 shows the architecture of the multi-modal deep learning model.

Ngiam et al. 77 used the restricted Boltzmann machines to learn

features and representations for audio and video separately. The

learned features are concatenated into a vector as the joint representation

of the multi-modal object. Afterwards, the joint representation

vector is used as the input of a deep auto-encoder model

for the tasks of classification or recognition.

Srivastava and Salakhutdinov developed another multi-modal deep learning model, called the bi-modal deep Boltzmann machine, for feature learning on text-image objects, as presented in Fig. 13 78.

In this model, two deep Boltzmann machines are built to learn features for the text modality and the image modality, respectively. Similarly, the learned features of the text and the image are concatenated into a vector as the joint representation. In order to perform the classification task, a classifier such as the support vector machine could be trained with the joint representation as input.
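The concatenation step shared by these models can be sketched as follows. The single-layer sigmoid encoders merely stand in for trained modality-specific models (restricted or deep Boltzmann machines in the cited works), and all dimensions and names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, b):
    """Stand-in for a trained modality-specific feature extractor."""
    return sigmoid(x @ W + b)

rng = np.random.default_rng(0)
n_objects = 5
text = rng.random((n_objects, 20))    # hypothetical text inputs
image = rng.random((n_objects, 30))   # hypothetical image inputs

# One encoder per modality, with its own parameters.
W_text, b_text = rng.normal(size=(20, 8)), np.zeros(8)
W_image, b_image = rng.normal(size=(30, 12)), np.zeros(12)

h_text = encode(text, W_text, b_text)
h_image = encode(image, W_image, b_image)

# Joint representation: per-object concatenation of the modality features.
joint = np.concatenate([h_text, h_image], axis=1)
```

The `joint` array would then be passed to a classifier of the reader's choice; that step is omitted here.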

Another multi-modal deep learning model, called the multi-source deep learning model, was presented by Ouyang et al. 79 for human pose estimation. Different from the above two multi-modal models, the multi-source deep learning model aims to learn non-linear representations from different information sources, such as human body articulation and clothing, for human pose estimation. In this model, each information source is used as the input of a deep learning model with two hidden layers for extracting features separately. The extracted features are then fused for the joint representation.

Other representative multi-modal deep learning models include heterogeneous deep neural networks combined with conditional random fields for Chinese dialogue act recognition 80, a multi-modal deep neural network with sparse group lasso for heterogeneous feature selection 81 and so on 82–84. Although they have different architectures, their ideas are similar. Specifically, multi-modal deep learning models first learn features for each single modality and then combine the learned features as the joint representation for each multi-modal object. Finally, the joint representation is used as the input of a logistic regression layer or a deep learning model for the tasks of classification or recognition. Multi-modal deep learning models achieved better performance than traditional deep neural networks such as stacked auto-encoders and deep belief networks for heterogeneous data feature learning.

However, they concatenated the learned features of each modality in a linear way, so they are far from effective at capturing the complex correlations over different modalities for heterogeneous data. To tackle this problem, Zhang et al. 85,86 presented a tensor deep learning model, called the deep computation model, for heterogeneous data. Specifically, they designed a tensor auto-encoder by extending the stacked auto-encoder model to the tensor space based on the tensor data representation. In the tensor auto-encoder model, the input layer X, the hidden layer H, and the parameters $\theta = \{W^{(1)}, b^{(1)}; W^{(2)}, b^{(2)}\}$ are represented by tensors. Besides, the tensor distance is used to reveal the complex features of heterogeneous data in the tensor space, which yields a loss function of the tensor auto-encoder model with m training objects:

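$$J_{TAE}(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(\frac{1}{2}\left(h_{W,b}(x) - y\right)^{T} G \left(h_{W,b}(x) - y\right)\right) + \frac{\lambda}{2}\left(\sum_{p=1}^{J_1 \times \cdots \times J_N}\sum_{i_1=1}^{I_1}\cdots\sum_{i_N=1}^{I_N}\left(W^{(1)}_{p i_1 \cdots i_N}\right)^{2} + \sum_{q=1}^{I_1 \times \cdots \times I_N}\sum_{j_1=1}^{J_1}\cdots\sum_{j_N=1}^{J_N}\left(W^{(2)}_{q j_1 \cdots j_N}\right)^{2}\right), \tag{21}$$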

where G denotes the metric matrix of the tensor distance and the second term is used to avoid over-fitting.
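The structure of this loss can be sketched in NumPy as below. The metric matrix `G` is supplied by the caller, because its construction is not given in this section; with `G` set to the identity the first term reduces to the ordinary squared reconstruction error. The function name `tae_loss` and the toy shapes are assumptions.

```python
import numpy as np

def tae_loss(outputs, targets, G, W1, W2, lam):
    """Tensor-distance reconstruction term plus weight decay, as in Eq. (21).

    outputs, targets: arrays of shape (m, ...), one tensor per training object.
    G: (D, D) metric matrix, where D is the number of entries per object.
    """
    m = outputs.shape[0]
    diff = (outputs - targets).reshape(m, -1)         # unfold each object
    # (1/m) * sum_i 0.5 * diff_i^T G diff_i
    recon = 0.5 * np.einsum("id,de,ie->", diff, G, diff) / m
    # (lambda/2) * (sum of squared entries of W1 and of W2)
    decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return recon + decay

# Toy check: two 2x2 objects, identity metric.
outputs = np.ones((2, 2, 2))
targets = np.zeros((2, 2, 2))
G = np.eye(4)
W1 = np.ones(3)
W2 = np.ones(2)
loss = tae_loss(outputs, targets, G, W1, W2, lam=0.1)
```

Here each object contributes 0.5 × 4 = 2 to the reconstruction term and the decay term equals 0.5 × 0.1 × 5 = 0.25.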

Furthermore, they built a deep computation model by stacking

multiple tensor auto-encoder models. Experiments demonstrated that

the deep computation model achieved about 2%-4% higher classification

accuracy than multi-modal deep learning models for heterogeneous

data.

3.3. Deep learning models for real-time data

High velocity is another important characteristic of big data, which requires big data to be analyzed in real time. Big data is usually collected at a high speed, posing a challenge for real-time processing. Unfortunately, most deep learning models have high computational complexity since they typically involve a large number of parameters for big data feature learning, especially large-scale deep neural networks. Therefore, it is difficult for traditional deep learning models to learn features and representations for big data in real time.

Table 1
Comparison of computational complexity and storage complexity between TT and FC, where r denotes the maximal rank of the tensor-train network.

Operation           Computational complexity        Storage complexity
FC forward pass     $O(MN)$                         $O(MN)$
TT forward pass     $O(d r^2 m \max\{M, N\})$       $O(r \max\{M, N\})$
FC backward pass    $O(MN)$                         $O(MN)$
TT backward pass    $O(d^2 r^4 m \max\{M, N\})$     $O(r^3 \max\{M, N\})$

Fig. 12. Architecture of the multi-modal deep learning model.

Fig. 13. Architecture of the bi-modal deep Boltzmann machine.

In recent years, many incremental learning methods have been presented for high-velocity data feature learning. One kind of incremental learning method is online learning, which only updates the parameters when new objects arrive while preserving the network structure.

Fu et al. 87 presented an incremental back-propagation learning

model based on a concept of bound. A validity bound is defined as a

range of the weights that represent the knowledge learned by a neural

network. To limit the weights in the validity bound in the updating

procedure, a scaling factor s is introduced into weight modifications,

yielding a learning rule in the kth iteration:

$$\Delta W_{ji}(k) = s(k)\,\eta\,\delta_j(k)\,O_i(k), \tag{22}$$

where $\eta$ and $\delta_j$ denote the learning rate and the error gradient of the neuron j, respectively, and $O_i$ denotes the activation level of the neuron i.

Then, s should be set according to:

$$s(k) = \min\left(1, \frac{B_p - \sum_{t=1}^{k-1} \Delta W_{ji}(t)}{\eta\,\delta_j(k)\,O_i(k)}\right)_{j,i}, \tag{23}$$

where $B_p$ denotes a pre-defined bound on the weight modification for an object p.
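A minimal sketch of one bounded update in the spirit of Eqs. (22) and (23) is given below. It works with magnitudes and keeps the scaling factor between 0 and 1, which is an assumption added here so that the sketch stays well defined for negative gradients; the function name and the toy values are likewise hypothetical.

```python
import numpy as np

def bounded_update(delta, activations, accumulated, bound, eta):
    """Scale a back-propagation step so the total change stays within a bound.

    delta: error gradients of the receiving neurons j, shape (J,)
    activations: activation levels of the sending neurons i, shape (I,)
    accumulated: sum of earlier modifications for this object, shape (J, I)
    """
    proposed = eta * np.outer(delta, activations)     # eta * delta_j * O_i
    # Room left before each weight hits the bound (assumption: magnitudes).
    remaining = np.maximum(bound - np.abs(accumulated), 0.0)
    magnitude = np.abs(proposed)
    ratios = np.full_like(magnitude, np.inf)
    nonzero = magnitude > 0
    ratios[nonzero] = remaining[nonzero] / magnitude[nonzero]
    s = min(1.0, float(ratios.min()))                 # one shared scaling factor
    return s * proposed, s

delta = np.array([1.0, -2.0])
activations = np.array([0.5, 1.0])
accumulated = np.zeros((2, 2))

update, s = bounded_update(delta, activations, accumulated, bound=0.1, eta=0.1)
loose_update, loose_s = bounded_update(delta, activations, accumulated,
                                       bound=1.0, eta=0.1)
```

With the tight bound the largest proposed change (0.2 in magnitude) is halved so that it just reaches the bound; with the loose bound the step is left unscaled.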

Fu et al. 87 applied the bounded weight adjustment to implement an incremental learning model. Specifically, to prevent over-training, the weights are not modified when the newly arriving object is already covered by the knowledge of the current network. However, it is difficult to define the bound beforehand since it is sensitive to the previous knowledge. Besides, this method is only suitable for two-layer neural networks, so it is difficult to apply the bounded weight modification to incremental deep learning models.

Wan and Banta 88 proposed an online learning model based on parameter updating. Specifically, an incremental auto-encoder model was presented by updating the current parameters $\theta$ to $\theta + \Delta\theta$ to adapt to the newly arriving objects. $\theta$ denotes the previous knowledge trained on the old objects while $\Delta\theta$ denotes the parameter increment on the newly arriving objects. Given a newly arriving object, to achieve the goal of adaptation, which requires that the updated parameters learn the new objects effectively, an objective function $J_{adaptation}$ is defined:

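$$J_{adaptation} = \frac{1}{2} \Delta x^{T} \Omega\, \Delta x, \tag{24}$$

where $\Omega$ denotes a weight matrix defined as $\Omega = \left(I - \mu \frac{\partial \hat{x}(x, \theta)}{\partial \theta} \left(\frac{\partial \hat{x}(x, \theta)}{\partial \theta}\right)^{T}\right)^{-1}$ and $\Delta x = \hat{x}(x, \theta + \Delta\theta) - x$ denotes the post-reconstruction error.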

To achieve the goal of preservation, which requires that the updated model can still learn the old objects, an objective function $J_{preservation}$ is defined:

$$J_{preservation} = \frac{1}{2\mu} \Delta\theta^{T} \Delta\theta. \tag{25}$$

Furthermore, to obtain a tradeoff between adaptation and preservation, the global objective function is defined as:

$$J(x, \theta + \Delta\theta) = J_{adaptation} + J_{preservation}. \tag{26}$$

Then, the parameter increment $\Delta\theta$ can be obtained by minimizing the above function with regard to the newly arriving objects.

Online learning methods are efficient for big data feature learning since they do not need to re-train the parameters on the old training objects 89–91. This scheme is particularly suitable for big data because the data is too large to hold in memory. However, this strategy usually performs poorly for dynamic big data whose distribution changes drastically over time.

To tackle this problem, another kind of incremental learning method based on structural modification has been presented. For example, a structure-based incremental auto-encoder model was implemented by adding one or more neurons into the hidden layer to adapt to newly arriving objects 92–94. Fig. 14 shows an example of this model in which only one neuron is added.

After the network structure is modified, the parameters should also be updated accordingly. Let $\theta = \{W^{(1)}, b^{(1)}; W^{(2)}, b^{(2)}\}$ represent the parameters of a neural network with an n-dimensional input and an m-dimensional hidden layer. Thus, the parameters have the forms:

$$W^{(1)} \in \mathbb{R}^{m \times n},\quad b^{(1)} \in \mathbb{R}^{m},\quad W^{(2)} \in \mathbb{R}^{n \times m},\quad b^{(2)} \in \mathbb{R}^{n}. \tag{27}$$

If only one neuron is added into the hidden layer, the weight matrices $W^{(1)}$ and $W^{(2)}$ gain one row and one column, respectively. Also, the bias vector $b^{(1)}$ gains one element. Therefore, the forms of the parameters are updated as:

$$W^{(1)} \in \mathbb{R}^{(m+1) \times n},\quad b^{(1)} \in \mathbb{R}^{m+1},\quad W^{(2)} \in \mathbb{R}^{n \times (m+1)},\quad b^{(2)} \in \mathbb{R}^{n}. \tag{28}$$

If p new neurons are added into the hidden layer, the initial parameters are set to:

$$W^{(1)}_{0} = \begin{bmatrix} W^{(1)} \\ 0(p) \end{bmatrix},\quad b^{(1)}_{0} = \begin{bmatrix} b^{(1)} \\ 0(p) \end{bmatrix},\quad W^{(2)}_{0} = \begin{bmatrix} W^{(2)} & 0(p) \end{bmatrix}. \tag{29}$$

Furthermore, the final parameters can be trained by the learning

algorithms such as the back-propagation algorithm.
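The structural modification can be sketched in NumPy as follows, assuming that the new rows, bias entries and columns are initialized to zero before further training. The sigmoid activations, the sizes and the function names are choices made for this example only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reconstruct(x, W1, b1, W2, b2):
    """Single-hidden-layer auto-encoder: encode, then decode."""
    return sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)

def add_hidden_neurons(W1, b1, W2, p):
    """Grow the hidden layer by p neurons with zero-initialized parameters."""
    n = W1.shape[1]
    W1_new = np.vstack([W1, np.zeros((p, n))])            # p extra rows
    b1_new = np.concatenate([b1, np.zeros(p)])            # p extra entries
    W2_new = np.hstack([W2, np.zeros((W2.shape[0], p))])  # p extra columns
    return W1_new, b1_new, W2_new

rng = np.random.default_rng(0)
n, m, p = 5, 3, 2
W1, b1 = rng.normal(size=(m, n)), rng.normal(size=m)
W2, b2 = rng.normal(size=(n, m)), rng.normal(size=n)

W1_new, b1_new, W2_new = add_hidden_neurons(W1, b1, W2, p)

x = rng.random(n)
before = reconstruct(x, W1, b1, W2, b2)
after = reconstruct(x, W1_new, b1_new, W2_new, b2)
```

Because the new columns of the decoder weights are zero, the reconstruction is unchanged immediately after growing the layer; subsequent training then adjusts all parameters.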

3.4. Deep learning models for low-quality data

Another emerging challenge for big data feature learning arises from its veracity. Specifically, low-quality data is prevalent in big data, which contains a large number of incomplete objects, noise, inaccurate objects, imprecise objects and redundant objects. Low-quality data results from many causes. For example, a large amount of data is collected from sensors. If some sensors are broken, we may collect some incomplete objects. Transmission faults in the network may also introduce noise into big data. Topics related to big data quality have been studied in the literature 95–98.

Fig. 14. Structure-based incremental learning method with adding one hidden neuron.

Most deep learning models do not take low-quality data into account. In other words, most deep learning models are designed for high-quality data feature learning. In the past few years, some methods have been proposed to learn features for low-quality data. Vincent et al. 99 presented a denoising auto-encoder model which could learn features and representations for data with noise. Specifically, the denoising auto-encoder model trains the parameters by reconstructing the original input from the corrupted input, as shown in Fig. 15.

To reconstruct the original input, the objective function of the denoising auto-encoder model is defined as:

$$J = \sum_{t} E_{q(\tilde{x}|x^{(t)})}\left[L\left(x^{(t)}, g_{\theta}(f_{\theta}(\tilde{x}))\right)\right], \tag{30}$$

where $\tilde{x}$ denotes a corrupted instance of the original input $x^{(t)}$ drawn from the corruption process $q(\tilde{x}|x^{(t)})$, and $E_{q(\tilde{x}|x^{(t)})}[\cdot]$ averages over the instances $\tilde{x}$.

The parameters of the objective function can be trained by the gradient descent strategy. $\tilde{x}$ can be obtained by adding isotropic Gaussian noise or pepper noise. Furthermore, Vincent et al. 100 developed stacked denoising auto-encoder models for feature learning on data with noise.
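A small NumPy sketch of this training scheme is shown below: the encoder sees a freshly corrupted input at every step, while the loss compares the reconstruction with the clean input. The squared-error loss, the linear decoder, the Gaussian noise level, the learning rate and the random toy data are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dae_loss_and_grads(x, x_tilde, W1, b1, W2, b2):
    """Squared-error L(x, g(f(x_tilde))) for one batch, with its gradients."""
    h = sigmoid(x_tilde @ W1 + b1)      # f: encoder applied to the corrupted input
    z = h @ W2 + b2                     # g: linear decoder
    err = z - x                         # compared with the clean input
    m = x.shape[0]
    loss = 0.5 * np.sum(err ** 2) / m
    dz = err / m
    dW2 = h.T @ dz
    db2 = dz.sum(axis=0)
    da = (dz @ W2.T) * h * (1.0 - h)    # back through the sigmoid
    dW1 = x_tilde.T @ da
    db1 = da.sum(axis=0)
    return loss, (dW1, db1, dW2, db2)

rng = np.random.default_rng(0)
x = rng.random((64, 6))                 # clean toy inputs
params = [rng.normal(scale=0.1, size=(6, 4)), np.zeros(4),
          rng.normal(scale=0.1, size=(4, 6)), np.zeros(6)]
sigma, lr = 0.1, 0.2                    # noise level and step size (assumed)

losses = []
for step in range(200):
    # Draw a new corrupted copy: isotropic Gaussian noise on every attribute.
    x_tilde = x + rng.normal(scale=sigma, size=x.shape)
    loss, grads = dae_loss_and_grads(x, x_tilde, *params)
    losses.append(loss)
    for param, grad in zip(params, grads):
        param -= lr * grad              # plain gradient descent, in place
```

Sampling a new `x_tilde` at each step is a simple Monte Carlo stand-in for the expectation over the corruption process in Eq. (30).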

Zhang et al. 101 proposed an imputation auto-encoder model to

learn features for incomplete objects, as shown in Fig. 16.

The simulated incomplete object $\tilde{x}$ is obtained by setting a part of the attribute values of the original object x to 0. The imputation auto-encoder model takes the incomplete object $\tilde{x}$ as input and outputs the reconstructed object z:

$$z = g_{\theta}(f_{\theta}(\tilde{x})). \tag{31}$$

The parameters $\theta$ are trained by minimizing the following objective function:

$$J = L(x, z). \tag{32}$$

Furthermore, they built a deep imputation network for incomplete

objects feature learning by stacking several imputation auto-encoders,

as presented in Fig. 17.

More recently, Wang and Tao presented a non-local auto-encoder to learn reliable features for corrupted data 102. Their work is motivated by the neurological observation that similar inputs should stimulate human brains to produce similar responses. Therefore, the neural networks should yield similar hidden representations for similar input objects. In detail, suppose that $h_1$, $h_2$ and $h_3$ are the learned representations of $x_1$, $x_2$ and $x_3$, respectively. If

xx xx 12 13 ?
