Information Fusion
A survey on deep learning for big data
Qingchen Zhang a,b, Laurence T. Yang *,a,b, Zhikui Chen c, Peng Li c
a School of Electronic Engineering, University of Electronic Science and Technology of China, China
b Department of Computer Science, St. Francis Xavier University, Antigonish, Canada
c School of Software Technology, Dalian University of Technology, Dalian, China
Keywords:
Deep learning
Big data
Stacked auto-encoders
Deep belief networks
Convolutional neural networks
Recurrent neural networks
Deep learning, currently one of the most remarkable machine learning techniques, has achieved great success in many applications such as image analysis, speech recognition and text understanding. It uses supervised and unsupervised strategies to learn multi-level representations and features in hierarchical architectures for the tasks of classification and pattern recognition. Recent developments in sensor networks and communication technologies have enabled the collection of big data. Although big data provides great opportunities for a broad range of areas including e-commerce, industrial control and smart medical care, it poses many challenging issues for data mining and information processing due to its characteristics of large volume, large variety, large velocity and large veracity. In the past few years, deep learning has played an important role in big data analytic solutions. In this paper, we review the emerging research on deep learning models for big data feature learning. Furthermore, we point out the remaining challenges of big data deep learning and discuss future topics.
1. Introduction
Recently, cyber-physical-social systems, together with sensor networks and communication technologies, have made great progress, enabling the collection of big data [1,2]. Big data can be defined by its four characteristics, i.e., large volume, large variety, large velocity and large veracity, which is usually called the 4V’s model [3–5]. The most remarkable characteristic of big data is large volume, which implies explosive growth in the amount of data. For example, Flickr generates about 3.6 TB of data and Google processes about 20,000 TB of data every day. The National Security Agency reports that approximately 1.8 PB of data is gathered on the Internet every day. One distinctive characteristic of big data is large variety, which indicates the different types of data formats including text, images, videos, graphics, and so on. Most traditional data is in a structured format and is easily stored in two-dimensional tables. However, more than 75% of big data is unstructured. Typical unstructured data is multimedia data collected from the Internet and mobile devices [6]. Large velocity means that big data is generated fast and needs to be processed in real time. The real-time analysis of big data is crucial for e-commerce to provide online services. Another important characteristic of big data is large veracity, which refers to the existence of a huge number of noisy objects, incomplete objects, inaccurate objects, imprecise objects and redundant objects [7]. The size of big data is continuing to grow at an unprecedented rate and will reach 35 ZB by 2020. However, having massive data alone is inadequate. For most applications, such as industry and medicine, the key is to find and extract valuable knowledge from big data to support prediction services. Take physical devices that occasionally suffer mechanical malfunctions in industrial manufacturing as an example. If we can analyze the collected device parameters effectively before the devices break down, we can take immediate action to avoid a catastrophe. While big data provides great opportunities for a broad range of areas including e-commerce, industrial control and smart medical care, it poses many challenging issues for data mining and information processing. In fact, it is difficult for traditional methods to analyze and process big data effectively and efficiently due to the large variety and the large veracity.
Deep learning is playing an important role in big data solutions since it can harvest valuable knowledge from complex systems [8]. In particular, deep learning has become one of the most active research topics in the machine learning community since it was presented in 2006 [9–11]. In fact, deep learning can be traced back to the 1940s. However, traditional training strategies for multi-layer neural networks often result in a locally optimal solution or cannot guarantee convergence. Therefore, multi-layer neural networks did not find wide application even though it was recognized that they could achieve better performance for feature and representation learning. In 2006, Hinton et al. [12] proposed a two-stage strategy, pre-training and fine-tuning, for training deep learning models effectively, causing the first breakthrough of deep learning. In addition,
Received 30 August 2017; Accepted 17 October 2017
* Corresponding author.
E-mail address: [email protected] (L.T. Yang).
Information Fusion 42 (2018) 146–157
Available online 11 November 2017
1566-2535/ © 2017 Elsevier B.V. All rights reserved.
the increase in computing power and data size also contributes to the popularity of deep learning. With the arrival of the big data era, a large number of samples can be collected to train the parameters of deep learning models. Meanwhile, training a large-scale deep learning model requires high-performance computing systems. Take, for example, the large-scale deep belief network with more than 100 million free parameters and millions of training samples developed by Raina et al. [13]. With a GPU-based framework, the training time for such a model is reduced from several weeks to about one day. Typically, deep learning models use an unsupervised pre-training and a supervised fine-tuning strategy to learn hierarchical features and representations of big data in deep architectures for the tasks of classification and recognition [14]. Deep learning has achieved state-of-the-art performance in a broad range of applications such as computer vision [15,16], speech recognition [17,18] and text understanding [19,20].
In the past few years, deep learning has made great progress in big data feature learning [21–23]. Compared to conventional shallow machine learning techniques such as the support vector machine and Naive Bayes, deep learning models can take advantage of many samples to extract high-level features and to learn hierarchical representations by combining the low-level input more effectively for big data with the characteristics of large variety and large veracity. In this paper, we review the emerging research work on deep learning models for big data feature learning. In Section 2, we first present the four most typical deep learning models, i.e., the stacked auto-encoder, the deep belief network, the convolutional neural network and the recurrent neural network, which are also the most widely used for big data feature learning. Afterwards, we provide an overview of deep learning models for big data according to the 4V’s model, including large-scale deep learning models for huge amounts of data, multi-modal deep learning models and the deep computation model for heterogeneous data, incremental deep learning models for real-time data and reliable deep learning models for low-quality data. Finally, we discuss the remaining challenges of deep learning on big data and point out potential trends.
2. Typical deep learning models
Since deep learning was presented in Science magazine in 2006, it has become an extremely active research topic in the machine learning community. Various deep learning models have been developed in the past few years. The most typical deep learning models include the stacked auto-encoder (SAE), the deep belief network (DBN), the convolutional neural network (CNN) and the recurrent neural network (RNN), which are also the most widely used models. Most other deep learning models can be viewed as variants of these four deep architectures. In the following parts, we briefly review the four typical deep learning models.
2.1. Stacked auto-encoder (SAE)
A stacked auto-encoder model is usually constructed by stacking several auto-encoders, which are the most typical feed-forward neural networks [24–26]. A basic auto-encoder has two stages, i.e., an encoding stage and a decoding stage, as presented in Fig. 1.
In the encoding stage, the input x is transformed to the hidden layer h via the encoding function f:
$$h = f\left(W^{(1)} x + b^{(1)}\right). \tag{1}$$
Afterwards, the hidden representation h is reconstructed back to the original input, denoted by y, in the decoding stage:
$$y = g\left(W^{(2)} h + b^{(2)}\right). \tag{2}$$
Typically, the encoding function and the decoding function are non-linear mapping functions. Four widely used non-linear activation functions are the Sigmoid function $f(x) = 1/(1 + e^{-x})$, the tanh function $f(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$, the softsign function $f(x) = x/(1 + |x|)$ and the ReLU (Rectified Linear Unit) function $f(x) = \max(0, x)$. The graphs of the four non-linear activation functions are presented in Fig. 2.
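The four activation functions listed above can be written down directly. The following is a minimal sketch using only the Python standard library; the function names are chosen here for illustration, and the sigmoid is split into two branches purely to avoid floating-point overflow for large negative inputs.

```python
import math


def sigmoid(x: float) -> float:
    """Sigmoid: 1 / (1 + e^-x), with outputs in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # Algebraically identical form that avoids overflow in exp(-x).
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x: float) -> float:
    """tanh: (e^x - e^-x) / (e^x + e^-x), with outputs in (-1, 1)."""
    return math.tanh(x)


def softsign(x: float) -> float:
    """Softsign: x / (1 + |x|), with outputs in (-1, 1)."""
    return x / (1.0 + abs(x))


def relu(x: float) -> float:
    """ReLU: max(0, x)."""
    return max(0.0, x)
```

The Sigmoid, tanh and softsign functions saturate for inputs of large magnitude, whereas ReLU grows linearly for positive inputs.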
$\theta = \{W^{(1)}, b^{(1)}; W^{(2)}, b^{(2)}\}$ is the parameter set of the basic auto-encoder, and it is usually trained by minimizing the loss function $J_\theta$ with regard to m training samples:
$$J_\theta = \frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - x^{(i)}\right)^2, \tag{3}$$
where $x^{(i)}$ denotes the ith training sample and $y^{(i)}$ denotes its reconstruction.
Obviously, the parameters of the basic auto-encoder are trained in
an unsupervised strategy. The hidden layer h is viewed as the extracted
feature or the hidden representation for the input data x. When the size
of h is smaller than that of x, the basic auto-encoder can be viewed as an
approach for data compression.
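The encoding stage, the decoding stage and the reconstruction loss described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited works: both f and g are taken to be the Sigmoid function, matrices are plain lists of rows, and all function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Return W v + b, where W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]


def autoencoder_forward(x, W1, b1, W2, b2):
    """Encode x into h, then decode h into the reconstruction y."""
    h = [sigmoid(z) for z in affine(W1, x, b1)]  # encoding stage
    y = [sigmoid(z) for z in affine(W2, h, b2)]  # decoding stage
    return h, y


def reconstruction_loss(samples, W1, b1, W2, b2):
    """Mean squared reconstruction error over the m training samples."""
    total = 0.0
    for x in samples:
        _, y = autoencoder_forward(x, W1, b1, W2, b2)
        total += sum((y_k - x_k) ** 2 for y_k, x_k in zip(y, x))
    return total / len(samples)


# Illustrative sizes (assumptions): 4 inputs compressed into 2 hidden units.
rng = random.Random(0)
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]
b1 = [0.0, 0.0]
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(4)]
b2 = [0.0, 0.0, 0.0, 0.0]
samples = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
loss = reconstruction_loss(samples, W1, b1, W2, b2)
```

Because the hidden layer here has fewer units than the input, h acts as a compressed code for x. Training would adjust the parameters to reduce `loss`, which is omitted in this sketch.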
The basic auto-encoder model has some variants. For example, a regularization term named weight decay is usually integrated into the loss function to prevent over-fitting:
$$J_\theta = \frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - x^{(i)}\right)^2 + \lambda \sum_{j=1}^{2} \left(W^{(j)}\right)^2, \tag{4}$$
where $\lambda$ is a hyper-parameter used to control the strength of the weight decay.
Another representative variant is the sparse auto-encoder [27,28]. To make the learned features sparse, the sparse auto-encoder adds a sparsity constraint on the hidden units, leading to the corresponding loss function:
$$J_\theta = \frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - x^{(i)}\right)^2 + \sum_{j=1}^{n} KL(p \,\|\, p_j), \tag{5}$$
where n denotes the number of neurons in the hidden layer and the second term denotes the KL-divergence. Specifically, the KL-divergence with regard to the jth neuron is defined as:
$$KL(p \,\|\, p_j) = p \log \frac{p}{p_j} + (1 - p) \log \frac{1 - p}{1 - p_j}, \tag{6}$$
where p denotes a predefined sparsity parameter that is close to 0 and $p_j$ denotes the average activation value of the jth neuron in the hidden layer over all the training samples. Generally, a small p value close to 0 will result in a very sparse hidden representation learned by the auto-encoder.
Fig. 1. Basic auto-encoder.
Fig. 2. Functional graph of non-linear activation functions.
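The sparsity penalty of the sparse auto-encoder can be computed as sketched below. The sketch assumes that hidden activations lie strictly between 0 and 1 (for example, Sigmoid outputs), so that the logarithms are defined; the function names are illustrative.

```python
import math


def kl_bernoulli(p, p_j):
    """KL-divergence between Bernoulli(p) and Bernoulli(p_j)."""
    return p * math.log(p / p_j) + (1.0 - p) * math.log((1.0 - p) / (1.0 - p_j))


def sparsity_penalty(p, hidden_activations):
    """Sum the KL term over all n hidden neurons.

    hidden_activations holds one hidden vector per training sample;
    p_j is the mean activation of neuron j over those samples.
    """
    m = len(hidden_activations)
    n = len(hidden_activations[0])
    mean_act = [sum(h[j] for h in hidden_activations) / m for j in range(n)]
    return sum(kl_bernoulli(p, p_j) for p_j in mean_act)
```

The penalty is zero for a neuron whose average activation equals p and grows as the average activation moves away from p, which is what pushes the hidden representation towards sparsity when p is small.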
Several auto-encoders can be stacked to construct a deep learning model, called a stacked auto-encoder, to learn hierarchical features or representations for the input, as presented in Figs. 3 and 4.
The stacked auto-encoder is typically trained in two stages, i.e., pre-training and fine-tuning. As shown in Fig. 3, let $h_0 = X$ and $h_i$ denote the input layer and the ith hidden layer, respectively. In the pre-training stage, each auto-encoder model is trained in an unsupervised layer-wise manner from bottom to top. In detail, the first auto-encoder takes $h_0 = X$ as input and $Y_0$ as output to train the parameters of the first hidden layer, and then $h_1$ is fed as the input to train the parameters of the second hidden layer. This operation is repeated until the parameters of all the hidden layers are trained. After pre-training, the learned parameters are used as the initial parameters of the stacked auto-encoder. In the fine-tuning stage, some labeled samples are used as supervised targets to fine-tune the parameters from top to bottom, as presented in Fig. 4. According to Hinton et al. [12], this two-stage training strategy can avoid local optima effectively and achieve better convergence for deep learning models.
2.2. Deep belief network (DBN)
The first deep learning model to be trained successfully is the deep belief network [12,29]. Unlike the stacked auto-encoder, the deep belief network is built by stacking several restricted Boltzmann machines. The restricted Boltzmann machine consists of two layers, i.e., a visible layer v and a hidden layer h, as presented in Fig. 5 [30,31].
A typical restricted Boltzmann machine uses Gibbs sampling to train the parameters. Specifically, the restricted Boltzmann machine uses the conditional probability p(h|v) to calculate the value of each unit in the hidden layer and then uses the conditional probability p(v|h) to calculate the value of each unit in the visible layer. This process is performed repeatedly until convergence.
The joint distribution of the restricted Boltzmann machine with regard to all the units is defined as:
$$p(v, h; \theta) = \frac{\exp(-E(v, h; \theta))}{Z}, \tag{7}$$
where $Z = \sum_{v} \sum_{h} \exp(-E(v, h; \theta))$ is used for normalization. E denotes the energy function, which for the Bernoulli distribution is calculated via:
$$E(v, h; \theta) = -\sum_{i=1}^{I} \sum_{j=1}^{J} w_{ij} v_i h_j - \sum_{i=1}^{I} b_i v_i - \sum_{j=1}^{J} a_j h_j, \tag{8}$$
where I and J denote the number of visible units and hidden units, respectively, and $\theta = \{W, b, a\}$ denotes the parameter set of the restricted Boltzmann machine.
The sampling probability of each unit is calculated as follows:
$$p(h_j = 1 \mid v; \theta) = f\left(\sum_{i=1}^{I} w_{ij} v_i + a_j\right), \tag{9}$$
$$p(v_i = 1 \mid h; \theta) = f\left(\sum_{j=1}^{J} w_{ij} h_j + b_i\right), \tag{10}$$
where f is typically a Sigmoid function.
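One step of the alternating sampling described above can be sketched as follows for binary units. This is a minimal illustration with hypothetical function names: the weight matrix is stored as `W[i][j]` with i indexing visible units and j indexing hidden units, and the parameter update that a full training procedure would perform after sampling is deliberately omitted.

```python
import math
import random


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hidden_probs(v, W, a):
    """p(h_j = 1 | v) for every hidden unit j."""
    return [sigmoid(sum(W[i][j] * v[i] for i in range(len(v))) + a[j])
            for j in range(len(a))]


def visible_probs(h, W, b):
    """p(v_i = 1 | h) for every visible unit i."""
    return [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))) + b[i])
            for i in range(len(b))]


def sample_bernoulli(probs, rng):
    """Draw a binary value for each unit from its activation probability."""
    return [1 if rng.random() < p else 0 for p in probs]


def gibbs_step(v, W, a, b, rng):
    """Sample h given v, then sample a new v given h."""
    h = sample_bernoulli(hidden_probs(v, W, a), rng)
    v_new = sample_bernoulli(visible_probs(h, W, b), rng)
    return h, v_new
```

Because there are no connections within a layer, all hidden units can be sampled independently given v, and all visible units independently given h, which is what makes this step easy to parallelize.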
An important variant is the restricted Boltzmann machine with the Gauss–Bernoulli distribution, whose energy function is calculated as follows [32]:
$$E(v, h; \theta) = -\sum_{i=1}^{I} \sum_{j=1}^{J} w_{ij} v_i h_j + \frac{1}{2} \sum_{i=1}^{I} (v_i - b_i)^2 - \sum_{j=1}^{J} a_j h_j. \tag{11}$$
The corresponding conditional probability of each visible unit is calculated via:
$$p(v_i \mid h; \theta) = N\left(\sum_{j=1}^{J} w_{ij} h_j + b_i,\ 1\right), \tag{12}$$
where $v_i$ denotes a real value that follows the Gaussian distribution with a mean of $\sum_{j=1}^{J} w_{ij} h_j + b_i$ and a variance of 1. The restricted Boltzmann machine with the Gauss–Bernoulli distribution can transform real-valued random variables into binary variables.
Fig. 3. Stacked auto-encoder for pre-training.
Fig. 4. Stacked auto-encoder for fine-tuning.
Fig. 5. Restricted Boltzmann machine.
Several restricted Boltzmann machines can be stacked into a deep learning model, called a deep belief network, as presented in Fig. 6 [12]. Similar to the stacked auto-encoder, the deep belief network is also trained by a two-stage strategy. The pre-training stage is used to train the initial parameters in a greedy layer-wise unsupervised manner, while the fine-tuning stage uses a supervised strategy to fine-tune the parameters with regard to the labeled samples by adding a softmax layer on top. Deep belief networks have a wide range of applications in image classification [33,34], acoustic modeling [35] and so on [36–38].
2.3. Convolutional neural network (CNN)
The convolutional neural network is the most widely used deep learning model in feature learning for large-scale image classification and recognition [39–43]. A convolutional neural network consists of three types of layers, i.e., the convolutional layer, the subsampling layer (pooling layer) and the fully-connected layer, as presented in Fig. 7 [44].
The convolutional layer uses the convolution operation to achieve weight sharing while the subsampling layer is used to reduce the dimension. Take a 2-dimensional image x as an example. The image is first decomposed into a sequential input $x = \{x_1, x_2, \ldots, x_N\}$. To share the weights, the convolutional layer is defined as:
$$y_j = f\left(\sum_{i} K_{ij} \otimes x_i + b_j\right), \tag{13}$$
where $y_j$ denotes the jth output of the convolutional layer and $K_{ij}$ denotes the convolutional kernel applied to the ith input map $x_i$. $\otimes$ denotes the discrete convolution operator and $b_j$ denotes the bias. In addition, f denotes the non-linear activation, typically a scaled hyperbolic tangent function.
The subsampling layer aims to reduce the dimension of the feature
map. It can typically be implemented by an average pooling operation
or a max pooling operation. Afterwards, several fully-connected layers
and a softmax layer are typically put on the top layer for classification
and recognition.
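The convolutional layer and the subsampling layer described above can be sketched as follows for small 2-dimensional maps stored as lists of rows. This is an illustrative sketch under stated assumptions: only the "valid" output region is computed, the activation defaults to an unscaled `math.tanh` rather than a scaled hyperbolic tangent, pooling uses non-overlapping windows, and all function names are hypothetical.

```python
import math


def conv2d_valid(x, k):
    """Discrete 2-D convolution (kernel flipped), 'valid' region only."""
    H, W = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    out = []
    for r in range(H - kh + 1):
        row = []
        for c in range(W - kw + 1):
            s = 0.0
            for u in range(kh):
                for v in range(kw):
                    s += k[kh - 1 - u][kw - 1 - v] * x[r + u][c + v]
            row.append(s)
        out.append(row)
    return out


def conv_layer(inputs, kernels, bias, f=math.tanh):
    """One output map: f(sum over input maps of K_i convolved with x_i, plus bias)."""
    maps = [conv2d_valid(x, k) for x, k in zip(inputs, kernels)]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[f(sum(m[r][c] for m in maps) + bias) for c in range(cols)]
            for r in range(rows)]


def max_pool(x, size=2):
    """Max pooling over non-overlapping size-by-size windows."""
    return [[max(x[r + u][c + v] for u in range(size) for v in range(size))
             for c in range(0, len(x[0]) - size + 1, size)]
            for r in range(0, len(x) - size + 1, size)]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
pooled = max_pool(image)  # [[6, 8], [14, 16]]
```

The same kernel is reused at every position of the input map, which is the weight sharing referred to above; replacing `max` with an average in `max_pool` would give average pooling.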
The deep convolutional neural network usually includes several
convolutional layers and subsampling layers for feature learning on
large-scale images.
In recent years, convolutional neural networks have also achieved great success in language processing, speech recognition and so on.
2.4. Recurrent neural network (RNN)
The traditional deep learning models such as stacked auto-encoders, deep belief networks and convolutional neural networks do not take time series into account, so they are not suitable for learning features from time series data. Take a natural language sentence, which is a typical kind of time series data, as an example. Since each word is closely related to other words in a sentence, the previous one or more words should be used as inputs when using the current word to predict the next word. Obviously, feed-forward deep learning models cannot work well for this task since they do not store the information of previous inputs.
The recurrent neural network is a typical sequential learning model. It learns features for series data through a memory of previous inputs that is stored in the internal state of the neural network. A directed cycle is introduced to construct the connections between neurons, as presented in Fig. 8.
A recurrent neural network includes input units $\{x_0, x_1, \ldots, x_t, x_{t+1}, \ldots\}$, output units $\{y_0, y_1, \ldots, y_t, y_{t+1}, \ldots\}$ and hidden units $\{s_0, s_1, \ldots, s_t, s_{t+1}, \ldots\}$. As shown in Fig. 8, at the time step t, the recurrent neural network takes the current sample $x_t$ and the previous hidden representation $s_{t-1}$ as input to obtain the current hidden representation $s_t$:
$$s_t = f(x_t, s_{t-1}), \tag{14}$$
where f denotes the encoder function.
One widely used recurrent neural network is the vanilla one, which at the time step t is defined by the following forward pass:
$$s_t = f(W_{sx} x_t + W_{ss} s_{t-1} + b_s), \tag{15}$$
$$y_t = g(W_{ys} s_t + b_y), \tag{16}$$
where f and g denote the encoder and decoder, respectively, and $\theta = \{W_{sx}, W_{ss}, b_s; W_{ys}, b_y\}$ denotes the parameter set.
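The forward pass of the vanilla recurrent neural network can be sketched as follows. The sketch makes two assumptions that the text leaves open: the encoder f is taken to be tanh and the decoder g is taken to be the identity. Matrices are lists of rows and the function names are illustrative.

```python
import math


def matvec(W, v):
    """Return the matrix-vector product W v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(x_t, s_prev, W_sx, W_ss, b_s, W_ys, b_y):
    """One time step: update the hidden state, then emit an output."""
    pre = [p + q + r for p, q, r in
           zip(matvec(W_sx, x_t), matvec(W_ss, s_prev), b_s)]
    s_t = [math.tanh(z) for z in pre]                       # f = tanh (assumed)
    y_t = [p + q for p, q in zip(matvec(W_ys, s_t), b_y)]   # g = identity (assumed)
    return s_t, y_t


def rnn_forward(xs, s0, params):
    """Run the recurrence over a whole input sequence xs."""
    s, ys = s0, []
    for x_t in xs:
        s, y_t = rnn_step(x_t, s, *params)
        ys.append(y_t)
    return s, ys


# One-dimensional toy parameters (assumptions) that make the memory visible.
params = ([[1.0]], [[1.0]], [0.0], [[1.0]], [0.0])
final_state, outputs = rnn_forward([[1.0], [0.0]], [0.0], params)
```

In this toy run the second input is zero, yet the second output is non-zero because the hidden state still carries information about the first input.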
Therefore, the recurrent neural network captures the dependency between the current sample $x_t$ and the previous one $x_{t-1}$ by integrating the previous hidden representation $s_{t-1}$ into the forward pass. From a theoretical point of view, the recurrent neural network can capture arbitrary-length dependencies. However, it is difficult for the recurrent neural network to capture long-term dependencies because the gradient vanishes when the parameters are trained with the back-propagation strategy. To tackle this problem, some models, such as long short-term memory, have been presented that prevent the gradient from vanishing or exploding [51–54].
Fig. 6. Deep belief network.
Fig. 7. Convolutional neural network.
Fig. 8. Recurrent neural network.
Multiple recurrent neural networks can be stacked into a deep learning model. The recurrent neural network and its variants have achieved superior performance in many applications such as natural language processing, speech recognition and machine translation [55–59].
3. Deep learning models for big data feature learning
Big data is typically defined by the following four characteristics:
volume, variety, velocity and veracity. In this section, we review the
deep learning models for big data feature learning from four aspects,
i.e., deep learning models for huge amounts of data, deep learning
models for heterogeneous data, deep learning models for real-time data
and deep learning models for low-quality data.
3.1. Deep learning models for huge amounts of data
First and foremost, huge amounts of data pose a big challenge for deep learning models. A big dataset often includes a great many samples, each with a large number of attributes. Furthermore, there are many class types of samples in a big dataset. In order to learn features and representations for large amounts of data, some large-scale deep learning models have been developed. Generally, a large-scale deep learning model involves a few hidden layers, each with a large number of neurons, leading to millions of parameters. Training such large-scale deep learning models is typically a difficult task.
In recent years, many algorithmic methods have been presented to train large-scale models, which can be roughly grouped into three categories, i.e., parallel deep learning models, GPU-based implementations, and optimized deep learning models.
One of the most representative parallel deep learning models is the deep stacking network, which was presented by Deng et al. [60]. A deep stacking network is composed of several modules. Fig. 9 shows a specific example of the deep stacking network with three modules.
In a deep stacking network, each module is also a neural network with a hidden layer and two sets of weights, as presented in Fig. 9. The lowest module consists of three layers from bottom to top. The bottom is a linear layer which uses the original data as input while the hidden layer is a non-linear one with some hidden neurons. Similar to most deep learning models, the deep stacking network uses the Sigmoid function to map the input to the hidden layer by a weight matrix and a bias vector. The top is also a linear layer, composed of c output neurons which denote the targets of classification.
The original data is concatenated with the previous output layer(s) and the concatenated vector is used as the input of each module above the lowest module. For example, if each original data object is represented by an n-dimensional vector and there are c class types, the dimension D of the input vector of the mth module, counting from bottom to top, is $D = n + c \times (m - 1)$.
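The input construction described above can be sketched in a few lines. The function names and the example sizes (a 784-dimensional input with 10 classes) are illustrative assumptions, not values taken from the cited work.

```python
def dsn_input_dim(n, c, m):
    """Input dimension of the m-th module (m = 1 is the lowest module)."""
    if m < 1:
        raise ValueError("modules are counted from 1")
    return n + c * (m - 1)


def dsn_module_input(x, previous_outputs):
    """Concatenate the raw input with the outputs of all lower modules."""
    v = list(x)
    for out in previous_outputs:
        v.extend(out)
    return v


# Third module of a stack fed with 784-dimensional inputs and 10 classes.
x = [0.0] * 784
lower_outputs = [[0.0] * 10, [0.0] * 10]
third_input = dsn_module_input(x, lower_outputs)
```

The lowest module sees only the n original attributes, and each additional module adds c more input dimensions, one block for every module beneath it.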
The deep stacking network is efficient to train since its training can be parallelized. Furthermore, a tensor deep stacking network was presented to further improve the training efficiency on CPU clusters [61].
Recently, a software framework called DistBelief was developed to train large-scale deep learning models on a large number of machines in parallel [62–64]. DistBelief trains large-scale models with billions of free parameters and huge amounts of data efficiently by combining data parallelism and model parallelism. To achieve model parallelism, a large deep learning model is partitioned into small blocks and each block is assigned to a computer for training. Fig. 10 presents one example of DistBelief with four blocks [8].
DistBelief needs to transfer data among the computers to train the deep learning models, which results in a great deal of communication, especially for fully-connected networks such as the stacked auto-encoder and the deep belief network. In spite of this, DistBelief still improves the training efficiency significantly by partitioning a large deep model into 144 blocks, as reported in [62].
DistBelief achieves data parallelism by implementing two optimization procedures, i.e., Downpour and Sandblaster. The former is used for online optimization while the latter is used for batch optimization.
DistBelief has obtained a high speedup for training several large-scale deep learning models. For example, it achieved a 12× speedup over a single machine when training a convolutional neural network with 1.7 billion parameters on 16 million images using 81 machines. Besides, it also achieved a significant improvement in training efficiency for another deep learning architecture with 14 million images, each with a size of 200 × 200 pixels, on 1000 machines, each with 16 CPU cores. Therefore, DistBelief is very suitable for big data feature learning since it is able to scale up over many computers, which is its most remarkable advantage [8].
The deep stacking network and DistBelief typically use multiple CPU cores to improve the training efficiency for large-scale deep learning models. Some details about the use of multiple CPU cores for scaling up deep belief networks, such as implementing the data layout and using SSE2 instructions, were discussed in [65].
More recently, some large-scale deep learning frameworks based on graphics processing units (GPUs) have been explored. GPUs are typically equipped with great computing power and a large memory bandwidth, so they are suitable for parallel computing for large-scale deep learning models. Some experiments have demonstrated the great advantages of large-scale deep learning frameworks based on GPUs.
For example, Raina et al. [13] developed a GPU-based deep learning framework for the parallel training of large-scale deep belief networks and sparse coding with more than 100 million parameters and millions of training objects. In order to improve the efficiency of parallelizing the learning models, Raina et al. [13] used some specialized strategies in their learning framework [66,67]. For instance, they put the parameters and some training objects into the global memory to reduce the data transfer. Besides, they implemented a parallel Gibbs sampling of hidden and visible neurons by producing two sampling matrices, i.e., p(h|x) and p(x|h). With this framework, the training of a deep belief network constructed from multiple restricted Boltzmann machines, each with 45 million free parameters and 1 million training objects, was sped up by a factor of 70×.
Fig. 9. A deep stacking network with three modules.
Fig. 10. Example of DistBelief with four blocks.
Another recently developed large-scale deep learning system is the commodity off-the-shelf high performance computing system, which is composed of 16 GPU servers. Each server consists of 4 NVIDIA GTX680 GPUs, each with 4 GB of memory. This system trains large deep learning models by implementing CUDA kernels with effective memory usage and efficient computation [68]. For instance, Coates et al. [68] make full use of matrix sparseness and local receptive fields to improve the calculation efficiency of large matrix multiplication. Compared with DistBelief, which needs 16 thousand CPU cores to train a large deep learning model with 10 million images and a billion parameters, the commodity off-the-shelf high performance computing system achieved almost the same training efficiency (e.g., about 3 days) for training the same model on three computers.
FPGA-based approaches have also been explored for large-scale deep learning models in recent years [69,70]. Chen and Lin review the recent progress in large-scale deep learning frameworks in [8].
Generally, large-scale deep learning models can only be trained on high-performance computing servers equipped with multiple CPU cores or GPUs, limiting the application of such models on low-end devices. Based on recent research indicating that the parameters of large-scale deep learning models, especially those with fully-connected layers, are highly redundant, some methods have been presented to improve the training efficiency by compressing the parameters significantly without a large accuracy drop.
One typical and straightforward method for compressing deep learning models is the use of a low-rank representation of the parameter matrices [71–73]. For example, Sainath et al. [72] applied low-rank factorization to a deep neural network with 5 hidden layers for speech recognition. Specifically, they decomposed the last weight matrix into two smaller matrices to compress the parameters significantly, since more than half of the parameters are contained in the final layer of deep learning models for speech recognition. In detail, let A denote the m × n weight matrix with rank r. Sainath et al. [72] decomposed A into B and C, namely $A = B \times C$, which are of dimension m × r and r × n, respectively. Obviously, if $mr + rn < mn$, the parameters of the deep learning model are compressed. Furthermore, if we want the factorized final layer to hold fewer than a fraction p of the original mn parameters, i.e., $mr + rn < pmn$, the following condition must be satisfied:
$$r < \frac{pmn}{m + n}. \tag{17}$$
The low-rank factorization of the deep learning model is able to constrain the space of search directions, which helps to optimize the objective function more efficiently. Some experiments indicated that it could reduce the number of parameters of deep learning models by up to 50%, which leads to about a 50% speedup without a large loss of accuracy.
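The parameter counting behind the rank condition can be checked with a short sketch. Here p is read as the fraction of the original m × n parameters that the two factors may occupy, and the matrix size used in the example is a hypothetical choice for illustration only.

```python
def lowrank_param_count(m, n, r):
    """Parameters held by an m-by-r factor plus an r-by-n factor."""
    return m * r + r * n


def max_rank_for_fraction(m, n, p):
    """Largest integer r with m*r + r*n strictly below p*m*n."""
    bound = p * m * n / (m + n)
    r = int(bound)
    if r == bound:
        r -= 1  # keep the inequality strict
    return max(r, 0)


# Hypothetical final layer: 1024 hidden units feeding 2048 output targets.
m, n = 1024, 2048
r = max_rank_for_fraction(m, n, 0.5)    # 341
compressed = lowrank_param_count(m, n, r)
```

With these sizes the full matrix holds 2,097,152 parameters, while the two factors at rank 341 hold 1,047,552, just under half.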
Chen et al. [74] employed the hashing trick to compress large-scale deep learning models. In detail, they employed a hash function to gather the network connections into several hash groups randomly, and the connections in one group share the same weight. Fig. 11 shows one example of a network with one hidden layer [74].
In the neural network in Fig. 11, the connections between the input layer and the hidden layer are represented by $V^1$ while the connections between the hidden layer and the output layer are represented by $V^2$. With a hash function $h^l(\cdot, \cdot)$ that projects an index (i, j) to a natural number, the item $V^l_{ij}$ is assigned to an item of $w^l$ indexed by $h^l(i, j)$:
$$V^l_{ij} = w^l_{h^l(i, j)}. \tag{18}$$
Thus, the connections can be gathered into 3 groups. Furthermore, the connections marked by the same color share the same weights, which are denoted as $w^1$ and $w^2$. Therefore, the parameters of the neural network in Fig. 11 are compressed to 1/4, i.e., 24 parameters are represented by 6 parameters. Experiments demonstrated that this approach could compress the parameters of a neural network by a factor of up to 8× on MNIST without an accuracy drop.
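The weight sharing induced by the hashing trick can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of [74]: `zlib.crc32` stands in for the hash function, i indexes the output unit and j the input unit, and the function names are hypothetical.

```python
import zlib


def bucket(layer, i, j, n_buckets):
    """Map connection (i, j) of a layer to one of n_buckets shared weights."""
    key = f"{layer}:{i}:{j}".encode()
    return zlib.crc32(key) % n_buckets


def virtual_weight(w, layer, i, j):
    """Look up the shared weight behind the virtual matrix entry V[i][j]."""
    return w[bucket(layer, i, j, len(w))]


def hashed_matvec(w, layer, x, n_out):
    """Multiply the virtual n_out-by-len(x) matrix with x without storing it."""
    return [sum(virtual_weight(w, layer, i, j) * x[j] for j in range(len(x)))
            for i in range(n_out)]


# A virtual 4-by-4 layer (16 connections) backed by only 3 real weights.
w1 = [0.5, -1.0, 2.0]
hidden = hashed_matvec(w1, 1, [1.0, 2.0, 3.0, 4.0], 4)
```

Only the short vector `w1` has to be stored and trained; the full connection matrix is recomputed on the fly from the hash function, which is where the memory saving comes from.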
More recently, tensor decomposition schemes have been used to compress the parameters of large-scale deep learning models [75,76]. For example, Novikov et al. [75] proposed a tensorizing learning model based on the tensor-train network. In order to use the tensor-train network to compress the parameters, they converted the neural network to the tensor format. Given a neural network with an N-dimensional input ($N = \prod_{k=1}^{d} n_k$) and an M-dimensional hidden layer ($M = \prod_{k=1}^{d} m_k$), they defined a bijection $\mu(l) = (\mu_1(l), \mu_2(l), \ldots, \mu_d(l))$ to convert the input vector into the tensor format, where $l \in \{1, 2, \ldots, N\}$ and $\mu_k(l) \in \{1, 2, \ldots, n_k\}$ denote the coordinate of the input vector b and the coordinates of the corresponding tensor B, respectively. Therefore, the input tensor B can be obtained by $B(\mu(l)) = b_l$. Similarly, they converted the hidden vector into the tensor format. Furthermore, they converted the weight matrix $w \in \mathbb{R}^{M \times N}$ into the tensor format W using the bijections $\nu(t) = (\nu_1(t), \nu_2(t), \ldots, \nu_d(t))$ and $\mu(l) = (\mu_1(l), \mu_2(l), \ldots, \mu_d(l))$, which project the index (t, l) of w into the corresponding index of the tensor format W.
Afterwards, they converted the weight tensor W into the tensor-train format G:
$$w(t, l) = W\big((\nu_1(t), \mu_1(l)), \ldots, (\nu_d(t), \mu_d(l))\big) = G_1[\nu_1(t), \mu_1(l)] \cdots G_d[\nu_d(t), \mu_d(l)], \tag{19}$$
where $G_k[\nu_k(t), \mu_k(l)]$ denotes the core matrix of the tensor-train representation of W with the index $(\nu_k(t), \mu_k(l))$.
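Reading a single weight out of the tensor-train format amounts to multiplying one small matrix per core. The sketch below assumes 0-based indices, a row-major index mapping for both bijections, and cores stored as `cores[k][i_k][j_k]`, each entry being a small matrix whose first and last ranks are 1; these storage choices and the function names are illustrative.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def unravel(index, dims):
    """Map a flat 0-based index to row-major coordinates over dims."""
    coords = []
    for d in reversed(dims):
        coords.append(index % d)
        index //= d
    return coords[::-1]


def tt_element(cores, t, l, m_dims, n_dims):
    """Recover w[t][l] as the product of one matrix slice per core."""
    row_coords = unravel(t, m_dims)
    col_coords = unravel(l, n_dims)
    acc = [[1.0]]
    for G, i_k, j_k in zip(cores, row_coords, col_coords):
        acc = matmul(acc, G[i_k][j_k])
    return acc[0][0]


# Two rank-1 cores: the represented 4-by-4 matrix is the Kronecker product of A and B.
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.0, 1.0], [1.0, 0.0]]
cores = [[[[[A[i][j]]] for j in range(2)] for i in range(2)],
         [[[[B[i][j]]] for j in range(2)] for i in range(2)]]
value = tt_element(cores, 3, 2, (2, 2), (2, 2))  # A[1][1] * B[1][0]
```

In this toy case 8 stored numbers stand in for a 16-entry matrix; with more cores and small ranks the saving grows, since storage scales with the core sizes rather than with M × N.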
Therefore, a linear projection y wx b = + in a fully-connected lay
can be transformed into the tensor-train form:
… = ? ?
…+ …
… Yi i i G ? t ? l
G ? t ? l Xj j Bi i
( , , , ) ( ( ), ( )
( ( ), ( ) ( , , ) ( , , ) . d j j
d d d d d
1 2 , , 1 1 2
1 1
1 d
This method could reduce the computational complexity of the
forward pass and improve the training efficiency in the back-propagation
procedure. Table 1 summarizes the computational complexity and
storage complexity of an M × N tensor-train layer (TT) compared with
the original fully-connected layer (FC) 75.
Besides, Lebedev et al. 76 employed the canonical polyadic decomposition
to compress the parameters of convolutional neural networks
and achieved a significant speedup in inference time.
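To make the index bijection and the tensor-train product concrete, the following NumPy sketch may help. It is an illustration under stated assumptions rather than the implementation of Novikov et al. 75: indices are 0-based, the bijections are taken to be row-major unravelling (`np.unravel_index`), each core is stored as a 4-way array of shape (r_{k-1}, m_k, n_k, r_k) with boundary ranks equal to 1, and the helper names `tt_matvec` and `tt_to_dense` are hypothetical.

```python
import numpy as np

def tt_matvec(cores, x, in_modes):
    """Compute W @ x from tensor-train cores without forming W.

    cores[k] has shape (r_k, m_k, n_k, r_{k+1}); this layout is an
    assumption of the sketch, with r_0 = r_d = 1.
    """
    # Reshape the input vector into a tensor, with a leading rank axis of size 1.
    t = x.reshape((1,) + tuple(in_modes))
    for core in cores:
        # Contract the current rank axis and the current input mode.
        t = np.tensordot(core, t, axes=([0, 2], [0, 1]))
        # Move the new output mode m_k to the end; the next rank axis leads.
        t = np.moveaxis(t, 0, -1)
    return t.reshape(-1)

def tt_to_dense(cores, in_modes, out_modes):
    """Rebuild the dense M x N matrix from the cores (for checking only)."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=1)  # chain the rank indices
    full = full.reshape(full.shape[1:-1])        # drop the boundary ranks
    d = len(cores)
    # Axes are (m_1, n_1, ..., m_d, n_d); group output modes, then input modes.
    perm = list(range(0, 2 * d, 2)) + list(range(1, 2 * d, 2))
    full = full.transpose(perm)
    return full.reshape(int(np.prod(out_modes)), int(np.prod(in_modes)))

rng = np.random.default_rng(0)
in_modes, out_modes = (4, 4, 4), (2, 3, 4)   # N = 64, M = 24
ranks = [1, 3, 3, 1]
cores = [rng.standard_normal((ranks[k], out_modes[k], in_modes[k], ranks[k + 1]))
         for k in range(3)]
x = rng.standard_normal(64)

y_tt = tt_matvec(cores, x, in_modes)
W = tt_to_dense(cores, in_modes, out_modes)
n_tt_params = sum(core.size for core in cores)
```

With these mode sizes the three cores hold 180 numbers, against 1536 entries for the dense 24 × 64 matrix, which is the kind of saving that Table 1 expresses asymptotically.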
Fig. 11. Example of a neural network which is compressed by the hashing trick.
Q. Zhang et al. Information Fusion 42 (2018) 146–157
3.2. Deep learning models for heterogeneous data
A distinct characteristic of big data is its variety, meaning that big
data is collected in various formats, including structured,
unstructured and semi-structured data, from a large number of
sources. In particular, a great number of objects in big datasets are
multi-modal. For example, a webpage typically contains images and text
simultaneously. Another multi-modal example is a multimedia object
such as a video clip, which includes still images, text and audio. Each
modality of a multi-modal object has characteristics different from
the others, leading to the complexity of heterogeneous data. Therefore,
heterogeneous data poses another challenge for deep learning models.
Some multi-modal deep learning models have been proposed for
heterogeneous data representation learning. For example, Ngiam et al.
77 developed a multi-modal deep learning model for feature learning on
audio-video objects. Fig. 12 shows the architecture of the multi-modal
deep learning model.
Ngiam et al. 77 used restricted Boltzmann machines to learn
features and representations for audio and video separately. The
learned features are concatenated into a vector as the joint
representation of the multi-modal object. Afterwards, the joint
representation vector is used as the input of a deep auto-encoder model
for the tasks of classification or recognition.
Srivastava and Salakhutdinov developed another multi-modal deep
learning model, called the bi-modal deep Boltzmann machine, for feature
learning on text-image objects, as presented in Fig. 13 78.
In this model, two deep Boltzmann machines are built to learn
features for the text modality and the image modality, respectively.
Similarly, the learned features of the text and the image are
concatenated into a vector as the joint representation. To perform the
classification task, a classifier such as the support vector machine
can be trained with the joint representation as input.
Another multi-modal deep learning model, called the multi-source deep
learning model, was presented by Ouyang et al. 79 for human pose
estimation. Different from the two multi-modal models above, the
multi-source deep learning model aims to learn non-linear representations
from different information sources, such as human body articulation
and clothing, for human pose estimation. In this model, each information
source is used as the input of a deep learning model with two hidden
layers that extracts features separately. The extracted features are then
fused into the joint representation.
Other representative multi-modal deep learning models include
heterogeneous deep neural networks combined with conditional
random fields for Chinese dialogue act recognition 80, a multi-modal
deep neural network with sparse group lasso for heterogeneous feature
selection 81 and so on 82–84. Although they have different architectures,
their ideas are similar. Specifically, multi-modal deep learning
models first learn features for each single modality and then combine the
learned features as the joint representation of each multi-modal object.
Finally, the joint representation is used as the input of a logistic regression
layer or a deep learning model for the tasks of classification or recognition.
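The common pipeline described above (per-modality feature learning, concatenation, then a classifier) can be sketched as follows. This is a minimal illustration, not any of the cited models: the modality encoders are single sigmoid layers with random, untrained weights standing in for trained restricted Boltzmann machines or deep Boltzmann machines, and all sizes and names (`encode`, `W_img`, `W_txt`, `w_out`) are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(x, W, b):
    """One sigmoid layer used as a stand-in for a modality-specific encoder."""
    return sigmoid(x @ W + b)

rng = np.random.default_rng(0)
n_objects = 5
image = rng.standard_normal((n_objects, 32))   # image-modality inputs
text = rng.standard_normal((n_objects, 20))    # text-modality inputs

# Step 1: learn features for each modality separately (weights untrained here).
W_img, b_img = rng.standard_normal((32, 8)), np.zeros(8)
W_txt, b_txt = rng.standard_normal((20, 6)), np.zeros(6)
h_img = encode(image, W_img, b_img)
h_txt = encode(text, W_txt, b_txt)

# Step 2: concatenate the learned features into the joint representation.
joint = np.concatenate([h_img, h_txt], axis=1)

# Step 3: feed the joint representation to a logistic regression layer.
w_out, b_out = rng.standard_normal(joint.shape[1]), 0.0
prob = sigmoid(joint @ w_out + b_out)
```

Because the fusion step is a plain concatenation, any interaction between modalities has to be learned by the layers placed on top of `joint`.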
Multi-modal deep learning models achieved better performance
than traditional deep neural networks such as stacked auto-encoders
and deep belief networks for heterogeneous data feature learning.
However, they concatenate the learned features of each modality in a
linear way, so they are far from effective at capturing the complex
correlations across different modalities of heterogeneous data. To tackle
this problem, Zhang et al. 85,86 presented a tensor deep learning
model, called the deep computation model, for heterogeneous data.
Specifically, they designed a tensor auto-encoder by extending the
stacked auto-encoder model to the tensor space based on the tensor
data representation. In the tensor auto-encoder model, the input layer
X, the hidden layer H, and the parameters
$\theta = \{W^{(1)}, b^{(1)}; W^{(2)}, b^{(2)}\}$ are
represented by tensors. Besides, the tensor distance is used to reveal the
complex features of heterogeneous data in the tensor space, which
yields the following loss function of the tensor auto-encoder for m
training objects:
$$ J_{TAE}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ \frac{1}{2} \left( h_{W,b}(x^{(i)}) - y^{(i)} \right)^{T} G \left( h_{W,b}(x^{(i)}) - y^{(i)} \right) \right] + \frac{\lambda}{2} \left( \sum_{p=1}^{J_1 \times \cdots \times J_N} \sum_{i=1}^{I_1 \times \cdots \times I_N} \left( W^{(1)}_{pi} \right)^2 + \sum_{q=1}^{I_1 \times \cdots \times I_N} \sum_{j=1}^{J_1 \times \cdots \times J_N} \left( W^{(2)}_{qj} \right)^2 \right), \tag{21} $$
where G denotes the metric matrix of the tensor distance and the second
term is used to avoid over-fitting.
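A small numerical sketch of a loss of this form is given below. It rests on assumptions that the text above does not fix: tensors are vectorized in row-major order, and the metric matrix G is built from a Gaussian of the distance between entry positions, which is one common way to define a tensor distance and is assumed here rather than taken from 85,86. The function names `metric_matrix` and `tae_loss` and the parameter `sigma` are hypothetical.

```python
import numpy as np

def metric_matrix(shape, sigma=1.0):
    """Metric matrix G over the entries of a tensor with the given shape.

    Assumed form: g_lm = exp(-||p_l - p_m||^2 / (2 sigma^2)) / (2 pi sigma^2),
    where p_l is the position (multi-index) of the l-th entry.
    """
    coords = np.indices(shape).reshape(len(shape), -1).T.astype(float)
    sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)

def tae_loss(outputs, targets, G, W1, W2, lam):
    """Reconstruction term under the metric G plus a weight-decay term."""
    m = outputs.shape[0]
    diff = (outputs - targets).reshape(m, -1)           # vectorize each tensor
    recon = 0.5 * np.einsum('ia,ab,ib->', diff, G, diff) / m
    decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return recon + decay

rng = np.random.default_rng(0)
shape = (2, 3)                                          # a tiny second-order tensor
G = metric_matrix(shape)
outputs = rng.standard_normal((4,) + shape)             # m = 4 reconstructed objects
targets = rng.standard_normal((4,) + shape)
W1, W2 = rng.standard_normal((5, 6)), rng.standard_normal((6, 5))
loss = tae_loss(outputs, targets, G, W1, W2, lam=1e-3)
```

Setting G to the identity matrix recovers the usual squared-error reconstruction term, so the off-diagonal entries of G are what couple different positions of the tensor in the loss.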
Furthermore, they built a deep computation model by stacking
multiple tensor auto-encoder models. Experiments demonstrated that
the deep computation model achieved about 2%-4% higher classification
accuracy than multi-modal deep learning models for heterogeneous data.
3.3. Deep learning models for real-time data
Table 1
Comparison of computational complexity and storage complexity between TT and FC,
where r denotes the maximal rank of the tensor-train network.
| Operation | Computational complexity | Storage complexity |
|---|---|---|
| FC forward pass | $O(MN)$ | $O(MN)$ |
| TT forward pass | $O(d r^2 m \max\{M, N\})$ | $O(r \max\{M, N\})$ |
| FC backward pass | $O(MN)$ | $O(MN)$ |
| TT backward pass | $O(d^2 r^4 m \max\{M, N\})$ | $O(r^3 \max\{M, N\})$ |
Fig. 12. Architecture of the multi-modal deep learning model.
Fig. 13. Architecture of the bi-modal deep Boltzmann machine.
High velocity is another important characteristic of big data, which
requires big data to be analyzed in real time. Big data is usually
collected at a high speed, posing a challenge for real-time processing.
Unfortunately, most deep learning models have high computational
complexity since they typically involve a large number of parameters
for big data feature learning, especially large-scale deep neural networks.
Therefore, it is difficult for traditional deep learning models to
learn features and representations for big data in real time.
In recent years, many incremental learning methods have been
presented for high-velocity data feature learning. One kind of
incremental learning method is online learning, which only updates the
parameters when new objects arrive while preserving the network structure.
Fu et al. 87 presented an incremental back-propagation learning
model based on the concept of a validity bound. A validity bound is
defined as a range of the weights that represent the knowledge learned
by a neural network. To keep the weights within the validity bound in
the updating procedure, a scaling factor s is introduced into the weight
modifications, yielding a learning rule in the kth iteration:
$$ \Delta W_{ji}(k) = s(k) \eta \delta_j(k) O_i(k), \tag{22} $$
where $\eta$ and $\delta_j$ denote the learning rate and the error
gradient of the neuron j, and $O_i$ denotes the activation level of the
neuron i.
Then, s should be set according to:
$$ s(k) = \min_{j,i} \left( 1, \frac{B_p - \sum_{t=1}^{k-1} \Delta W_{ji}(t)}{\eta \delta_j(k) O_i(k)} \right), \tag{23} $$
where $B_p$ denotes a pre-defined bound on the weight modification for an
object p.
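The scaled update can be sketched numerically as follows. Several details here are assumptions of the sketch rather than statements about Fu et al. 87: magnitudes are compared through absolute values, the minimum is taken over all weights so that a single scalar s(k) is applied, and s(k) is clipped at zero once the bound is used up. The function name `bounded_update` is hypothetical.

```python
import numpy as np

def bounded_update(accumulated, delta, out_prev, eta, bound):
    """Return the scaled weight update and the scaling factor s(k).

    accumulated: sum of the earlier updates for the current object, shape (J, I)
    delta:       error gradients of the J output-side neurons
    out_prev:    activation levels of the I input-side neurons
    """
    raw = eta * np.outer(delta, out_prev)        # unscaled back-propagation step
    room = bound - np.abs(accumulated)           # what is left of the bound
    with np.errstate(divide='ignore', invalid='ignore'):
        ratio = np.where(raw != 0, room / np.abs(raw), np.inf)
    s = max(0.0, min(1.0, float(ratio.min())))   # clip into [0, 1] (assumption)
    return s * raw, s

acc = np.zeros((2, 2))
delta = np.array([1.0, -0.5])
out_prev = np.array([0.5, 1.0])

step1, s1 = bounded_update(acc, delta, out_prev, eta=1.0, bound=0.1)
acc = acc + step1
step2, s2 = bounded_update(acc, delta, out_prev, eta=1.0, bound=0.1)
```

In this toy run the first step is shrunk so that the largest accumulated change just reaches the bound, and the second step is suppressed entirely because no room is left for that weight.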
Fu et al. 87 applied the bounded weight adjustment to implement
an incremental learning model. Specifically, to prevent over-training,
the weights are not modified when the newly arriving object is already
covered by the knowledge of the current network. However, it is difficult
to define the bound beforehand since it is sensitive to the previous
knowledge. Besides, this method is only suitable for two-layer neural
networks, so it is difficult to apply the bounded weight modification to
incremental deep learning models. Wan and Banta 88 proposed an online
learning model based on parameter updating. Specifically, an incremental
auto-encoder model was presented that updates the current parameters
$\theta$ to $\theta + \Delta\theta$ to adapt to the newly arriving objects.
$\theta$ denotes the previous knowledge trained on the old objects, while
$\Delta\theta$ denotes the parameter increment for the newly arriving
objects. Given a newly arriving object, to achieve the goal of adaptation,
which requires that the updated parameters learn the new objects
effectively, an objective function $J_{adaptation}$ is defined:
$$ J_{adaptation} = \frac{1}{2} \Delta x^{T} \Omega \Delta x, \tag{24} $$
where $\Omega$ denotes a weight matrix defined as
$$ \Omega = \left( I - \mu \left( \frac{\partial \hat{x}(x, \theta)}{\partial \theta} \right) \left( \frac{\partial \hat{x}(x, \theta)}{\partial \theta} \right)^{T} \right)^{-1} $$
and $\Delta x = \hat{x}(x, \theta + \Delta\theta) - x$ denotes the
post-reconstruction error.
To achieve the goal of preservation, which requires that the updated model
can still learn the old objects, an objective function $J_{preservation}$
is defined:
$$ J_{preservation} = \frac{1}{2\mu} \Delta\theta^{T} \Delta\theta. \tag{25} $$
Furthermore, to obtain a tradeoff between adaptation and preservation,
the global objective function is defined as:
$$ J(x, \theta + \Delta\theta) = J_{adaptation} + J_{preservation}. \tag{26} $$
Then, the parameter increment $\Delta\theta$ can be obtained by minimizing
the above function with regard to the newly arriving objects.
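The trade-off between adapting to a new object and staying close to the old parameters can be illustrated with a simplified version of this objective. The sketch below is not Wan and Banta's algorithm: it assumes a first-order linearization of the reconstruction around the current parameters (with Jacobian `A`), replaces the weight matrix in the adaptation term by the identity, and uses the hypothetical names `parameter_increment` and `objective`. Under these assumptions the minimizer has a closed form.

```python
import numpy as np

def parameter_increment(A, r, mu):
    """Minimize 0.5*||r + A d||^2 + ||d||^2 / (2*mu) over the increment d.

    A:  Jacobian of the reconstruction w.r.t. the parameters (assumed given)
    r:  current reconstruction error on the new object
    mu: trade-off weight; larger mu favours adaptation over preservation
    """
    n_params = A.shape[1]
    return -np.linalg.solve(A.T @ A + np.eye(n_params) / mu, A.T @ r)

def objective(A, r, mu, d):
    post_error = r + A @ d                     # linearized post-update error
    adaptation = 0.5 * post_error @ post_error
    preservation = d @ d / (2.0 * mu)
    return adaptation + preservation

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
r = rng.standard_normal(6)

d_cautious = parameter_increment(A, r, mu=0.1)   # stays close to old parameters
d_eager = parameter_increment(A, r, mu=10.0)     # fits the new object harder
```

A small `mu` keeps the increment short at the cost of a larger remaining error on the new object, while a large `mu` does the opposite, which mirrors the roles of the adaptation and preservation terms.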
Online learning methods are efficient for big data feature learning
since they do not need to re-train the parameters on the old training
objects 89–91. This scheme is particularly suitable for big data because
the data is too large to hold in memory. However, this
strategy usually performs poorly on dynamic big data whose distribution
changes drastically over time.
To tackle this problem, another kind of incremental learning method
based on structural modification has been presented. For example, a
structure-based incremental auto-encoder model was implemented by
adding one or more neurons to the hidden layer to adapt to newly arriving
objects 92–94. Fig. 14 shows an example of this model in which
only one neuron is added.
After the network structure is modified, the parameters should
also be updated accordingly. Let
$\theta = \{W^{(1)}, b^{(1)}; W^{(2)}, b^{(2)}\}$ represent the
parameters of a neural network with an n-dimensional input and an
m-dimensional hidden layer. Thus, the parameters have the forms:
$$ W^{(1)} \in R^{m \times n}, \quad b^{(1)} \in R^{m}, \quad W^{(2)} \in R^{n \times m}, \quad b^{(2)} \in R^{n}. \tag{27} $$
If only one neuron is added to the hidden layer, the weight matrices
$W^{(1)}$ and $W^{(2)}$ gain one row and one column, respectively. Also,
the bias vector $b^{(1)}$ gains one element. Therefore, the forms of the
parameters are updated as:
$$ W^{(1)} \in R^{(m+1) \times n}, \quad b^{(1)} \in R^{m+1}, \quad W^{(2)} \in R^{n \times (m+1)}, \quad b^{(2)} \in R^{n}. \tag{28} $$
If p new neurons are added into the hidden layer, the initial parameters
are set to:
? ?
? =
? = ?
W p
0 0
( ) ( )
( ) .
(2) (29)
Furthermore, the final parameters can be trained by the learning
algorithms such as the back-propagation algorithm.
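The change of parameter shapes when hidden neurons are added can be sketched as below. The initialization of the new entries with zeros is an assumption of this sketch, chosen because zero columns in the decoder leave the reconstruction of old objects unchanged before re-training; the sigmoid activations and the function names `add_hidden_neurons` and `reconstruct` are likewise hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def add_hidden_neurons(W1, b1, W2, b2, p):
    """Grow the hidden layer of an auto-encoder by p neurons.

    W1: (m, n) encoder weights, b1: (m,) encoder bias,
    W2: (n, m) decoder weights, b2: (n,) decoder bias.
    New entries are set to zero (an assumption of this sketch).
    """
    n = W1.shape[1]
    W1_new = np.vstack([W1, np.zeros((p, n))])             # p extra rows
    b1_new = np.concatenate([b1, np.zeros(p)])             # p extra elements
    W2_new = np.hstack([W2, np.zeros((W2.shape[0], p))])   # p extra columns
    return W1_new, b1_new, W2_new, b2.copy()

def reconstruct(x, W1, b1, W2, b2):
    h = sigmoid(W1 @ x + b1)
    return sigmoid(W2 @ h + b2)

rng = np.random.default_rng(0)
n, m, p = 5, 3, 2
W1, b1 = rng.standard_normal((m, n)), rng.standard_normal(m)
W2, b2 = rng.standard_normal((n, m)), rng.standard_normal(n)
x = rng.standard_normal(n)

before = reconstruct(x, W1, b1, W2, b2)
W1g, b1g, W2g, b2g = add_hidden_neurons(W1, b1, W2, b2, p)
after = reconstruct(x, W1g, b1g, W2g, b2g)
```

After growing the layer, the enlarged parameters can be passed to any gradient-based training routine to adapt to the newly arriving objects.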
3.4. Deep learning models for low-quality data
Another emerging challenge for big data feature learning arises
from its veracity. In particular, low-quality data is prevalent in big
data, which contains a large number of incomplete objects,
noisy objects, inaccurate objects, imprecise objects and redundant
objects. Low-quality data has many causes. For example,
a large amount of data is collected from sensors. If some sensors are
broken, we may collect incomplete objects. Transmission faults
in the network may also introduce noise into big data. Topics related to
big data quality have been studied in the literature 95–98.
Fig. 14. Structure-based incremental learning method with adding one hidden neuron.
Most deep learning models do not take low-quality data into
account. In other words, most deep learning models are designed for
feature learning on high-quality data. In the past few years, some methods
have been proposed to learn features for low-quality data. Vincent et al.
99 presented a denoising auto-encoder model which can learn
features and representations for noisy data. Specifically, the
denoising auto-encoder model trains the parameters by reconstructing the
original input from the corrupted input, as shown in Fig. 15.
To reconstruct the original input, the objective function of the
denoising auto-encoder model is defined as:
$$ J = \sum_{t} E_{q(\tilde{x} | x^{(t)})} \left[ L(x^{(t)}, g(f(\tilde{x}))) \right], \tag{30} $$
where $\tilde{x}$ denotes the corrupted instances of the original input
$x^{(t)}$ based on the corruption process $q(\tilde{x} | x^{(t)})$ and
$E_{q(\tilde{x} | x^{(t)})}[\cdot]$ averages over the instances $\tilde{x}$.
The parameters of the objective function can be trained by the
gradient descent strategy. $\tilde{x}$ can be obtained by adding isotropic
Gaussian noise or salt-and-pepper noise. Furthermore, Vincent et al. 100
developed stacked denoising auto-encoder models for feature learning on
noisy data.
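The corruption process and the resulting objective can be illustrated with the sketch below. It makes several assumptions that go beyond the text: the loss L is taken to be the squared error, the expectation over corruptions is estimated by a fixed number of random draws, inputs are assumed to lie in [0, 1] for the salt-and-pepper case, and the encoder and decoder are passed in as plain functions. All function names are hypothetical.

```python
import numpy as np

def corrupt_gaussian(x, sigma, rng):
    """Add isotropic Gaussian noise with standard deviation sigma."""
    return x + sigma * rng.standard_normal(x.shape)

def corrupt_salt_and_pepper(x, fraction, rng):
    """Set a random fraction of the entries to 0 or 1 (inputs assumed in [0, 1])."""
    x_tilde = x.copy()
    mask = rng.random(x.shape) < fraction
    x_tilde[mask] = rng.integers(0, 2, size=int(mask.sum()))
    return x_tilde

def dae_loss(X, encode, decode, corrupt, n_draws, rng):
    """Monte Carlo estimate of the denoising objective with squared error."""
    total = 0.0
    for x in X:
        for _ in range(n_draws):
            z = decode(encode(corrupt(x, rng)))   # reconstruct from corrupted input
            total += np.sum((x - z) ** 2)         # compare with the CLEAN input
    return total / n_draws

rng = np.random.default_rng(0)
X = rng.random((4, 5))
identity = lambda v: v                            # trivial stand-in encoder/decoder

clean = dae_loss(X, identity, identity,
                 lambda x, g: corrupt_gaussian(x, 0.0, g), 3, rng)
noisy = dae_loss(X, identity, identity,
                 lambda x, g: corrupt_gaussian(x, 0.5, g), 3, rng)
```

With the identity map as both encoder and decoder the loss is zero without noise and positive with noise, which shows why a denoising auto-encoder cannot simply copy its input and has to learn structure in the data instead.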
Zhang et al. 101 proposed an imputation auto-encoder model to
learn features for incomplete objects, as shown in Fig. 16.
The simulated incomplete object $\tilde{x}$ is obtained by setting a part
of the attribute values of the original object x to 0. The imputation
auto-encoder model takes the incomplete object $\tilde{x}$ as input and
outputs the reconstructed object z:
$$ z = g_{\theta}(f_{\theta}(\tilde{x})). \tag{31} $$
The parameters $\theta$ are trained by minimizing the following objective
function:
$$ J = L(x, z). \tag{32} $$
Furthermore, they built a deep imputation network for feature learning on
incomplete objects by stacking several imputation auto-encoders,
as presented in Fig. 17.
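A single imputation auto-encoder of this kind can be trained in a few lines, as the sketch below suggests. It is a simplified illustration and not the model of Zhang et al. 101: it assumes a sigmoid encoder, a linear decoder, a squared-error loss, a fixed set of simulated incomplete objects and plain full-batch gradient descent, and the names `make_incomplete` and `train_imputation_ae` as well as all hyper-parameters are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def make_incomplete(X, missing_rate, rng):
    """Simulate incomplete objects by setting a random part of the attributes to 0."""
    mask = rng.random(X.shape) < missing_rate
    return np.where(mask, 0.0, X)

def train_imputation_ae(X, n_hidden, missing_rate=0.3, lr=0.1, epochs=300, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W1, b1 = 0.1 * rng.standard_normal((n, n_hidden)), np.zeros(n_hidden)
    W2, b2 = 0.1 * rng.standard_normal((n_hidden, n)), np.zeros(n)
    X_tilde = make_incomplete(X, missing_rate, rng)   # input: incomplete objects
    history = []
    for _ in range(epochs):
        H = sigmoid(X_tilde @ W1 + b1)                # encoder f
        Z = H @ W2 + b2                               # decoder g (linear)
        err = Z - X                                   # target: ORIGINAL objects
        history.append(0.5 * np.mean(np.sum(err ** 2, axis=1)))
        # Back-propagation of the mean squared-error loss.
        dZ = err / m
        gW2, gb2 = H.T @ dZ, dZ.sum(axis=0)
        dH = (dZ @ W2.T) * H * (1.0 - H)
        gW1, gb1 = X_tilde.T @ dH, dH.sum(axis=0)
        W1 -= lr * gW1
        b1 -= lr * gb1
        W2 -= lr * gW2
        b2 -= lr * gb2
    return (W1, b1, W2, b2), history

rng = np.random.default_rng(1)
X = rng.random((50, 6))
params, history = train_imputation_ae(X, n_hidden=4)
```

The learned hidden activations can then serve as the input of the next imputation auto-encoder when several of them are stacked.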
More recently, Wang and Tao presented a non-local auto-encoder to
learn reliable features for corrupted data 102. Their work is motivated
by the neurological observation that similar inputs should stimulate
the human brain to produce similar responses. Therefore, a
neural network should yield similar hidden representations for
similar input objects. In detail, suppose that h1, h2 and h3 are the
learned representations of x1, x2 and x3, respectively. If
xx xx 12 13 ?
