ARTICLE IN PRESS, JID: JSS [m5G; December 8, 2014; 12:13]
The Journal of Systems and Software 000 (2014) 1–9
Contents lists available at ScienceDirect
The Journal of Systems and Software
journal homepage: www.elsevier.com/locate/jss
Semantic based representing and organizing surveillance big data using
video structural description technology
Zheng Xu b,a,∗ , Yunhuai Liu a, Lin Mei a, Chuanping Hu a, Chen Lan a
a The Third Research Institute of Ministry of Public Security, Shanghai, China
b Tsinghua University, China
ARTICLE INFO
Article history:
Received 23 September 2013
Revised 27 May 2014
Accepted 13 July 2014
Available online xxx
Keywords:
Video structural description
Surveillance big data
Big data representing and organizing
ABSTRACT
Big data is an emerging paradigm applied to datasets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. For example, the data volume of all video surveillance devices in Shanghai, China, is up to 1 TB every day. Thus, it is important to accurately describe video content and to enable the organizing and searching of potential videos in order to detect and analyze related surveillance events. Unfortunately, raw data and low-level features cannot meet the needs of such video-based tasks. In this paper, a semantic based model is proposed for representing and organizing video big data. The proposed surveillance video representation method defines a number of concepts and their relations, which allows users to use them to annotate related surveillance events. The defined concepts include persons, vehicles, and traffic signs, which can be used for annotating and representing video traffic events unambiguously. In addition, the spatial and temporal relations between objects in an event are defined, which can be used for annotating and representing the semantic relations between objects in related surveillance events. Moreover, a semantic link network is used for organizing video resources based on their associations. In the application, one case study is presented to analyze the surveillance big data.
© 2014 Elsevier Inc. All rights reserved.
1. Introduction

Big data is an emerging paradigm applied to datasets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time (Wigan and Clarke, 2013). Such datasets often come from various sources (Variety) yet are unstructured, such as social media, sensors, scientific applications, surveillance, video and image archives, Internet texts and documents, Internet search indexing, medical records, business transactions and web logs; and are of large size (Volume) with fast data in/out (Velocity). More importantly, big data has to be of high value (Value). Various technologies are being discussed to support the handling of big data, such as massively parallel processing databases (Yuan et al., 2013), scalable storage systems (Zhang et al., 2013a), cloud computing platforms (Liu et al., 2013), and MapReduce (Zhang et al., 2013b). Distributed systems are a classical research discipline investigating various distributed computing technologies and applications such as cloud computing (Yan et al., 2013a, 2013b; Lizhe et al., 2010) and MapReduce (Ze et al., 2014; Dan et al., 2013). With new paradigms and technologies, distributed systems research keeps producing innovative outcomes from both industry and academia.

∗ Corresponding author at: Tsinghua University, China. Tel.: +86 13817917970.
E-mail address: [email protected] (Z. Xu).
http://dx.doi.org/10.1016/j.jss.2014.07.024
0164-1212/© 2014 Elsevier Inc. All rights reserved.
Please cite this article as: Z. Xu et al., Semantic based representing and organizing surveillance big data using video structural description technology, The Journal of Systems and Software (2014), http://dx.doi.org/10.1016/j.jss.2014.07.024

Recent research shows that videos "in the wild" are growing at a staggering rate (Cisco Visual Networking Index, 2013; Great Scott, 2013). For example, with the rapid growth of video resources on the World Wide Web, on YouTube1 alone, 35 h of video are uploaded every minute, and over 700 billion videos were watched in 2010. A vast amount of videos with no metadata has emerged. Thus, automatically understanding raw videos solely based on their visual appearance becomes an important yet challenging problem. The rapidly increasing number of video resources has brought an urgent need to develop intelligent methods to represent and annotate video events. Typical applications in which video events are represented and annotated include criminal investigation systems (Wu and Wang, 2010), video surveillance (Liu et al., 2009), intrusion detection systems (Zhang et al., 2008), video resource browsing and indexing systems (Yu et al., 2012), sports event detection (Xu et al., 2008), and many others. These urgent needs have posed challenges for video resources management, and have attracted the research

1 www.youtube.com.
of the multimedia analysis and understanding community. Overall, the goal is to enable users to search for related events among the huge number of video resources. The ultimate goal of extracting video events brings the challenge of building an intelligent method to automatically detect and retrieve video events.

In fact, the huge volume of newly emerging video surveillance data has become a new application field of big data. The processing and analysis of video surveillance data follow the 4 V features of big data.
(1) Variety: The video surveillance data comes from different devices such as traffic cameras, hotel cameras and so on. Besides being of different types, these devices are also located in different regions. The distributed nature of video surveillance data augments the variety of the resources. For example, in criminal investigation systems, video surveillance data from different surveillance devices are processed and analyzed to detect related people, cars, or things. The variety of video surveillance devices brings big challenges for storing and managing distributed video surveillance data.
(2) Volume: With the rapid development of surveillance devices (for example, the number of surveillance devices in Shanghai is up to 200,000), the volume of video surveillance data becomes big data. The data volume of all video surveillance data in Shanghai is up to 1 TB every day. The whole volume of all video surveillance data in Shanghai Pudong is up to 25 PB. The huge volume of video surveillance data brings big challenges for processing and analyzing distributed video surveillance data.
(3) Velocity: The video surveillance devices have fast data in/out. They usually work 24 h per day and collect real-time videos. The real-time collected videos are usually uploaded to a storage server or data center. The velocity of collecting video surveillance data is faster than that of processing and analyzing it. The high velocity of video surveillance devices brings big challenges for processing and analyzing video surveillance data; for example, the speed of processing and analyzing video surveillance data is much lower than that of collecting it.
(4) Value: The video surveillance data usually has high value. For example, in criminal investigation systems, the video surveillance can help the police find a suspect. In a traffic surveillance system, the video data can detect illegal vehicles or people. On the other hand, the huge volume brings challenges for mining the value from the video surveillance data. The phenomenon of "high volume, low value" also exists in video surveillance big data.
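The volume figures above can be checked with simple arithmetic; a rough sketch (the one-year horizon is an illustrative assumption, not from the paper):

```python
# Back-of-envelope check of the volumes quoted above. The daily intake
# (1 TB) and device count (200,000) come from the text; the one-year
# accumulation horizon is an illustrative assumption.
DAILY_INTAKE_TB = 1.0      # total surveillance video per day in Shanghai
DEVICES = 200_000          # number of surveillance devices in Shanghai

per_device_gb = DAILY_INTAKE_TB * 1024 / DEVICES  # GB per device per day
yearly_tb = DAILY_INTAKE_TB * 365                 # TB accumulated per year

print(f"{per_device_gb:.5f} GB/device/day, {yearly_tb:.0f} TB/year")
```

Even at this modest per-device rate, the accumulated volume quickly reaches the petabyte scale cited for Shanghai Pudong.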
In this paper, a semantic based model for representing and organizing video resources is proposed for bridging the gap between low-level representative features and high-level semantic content in terms of object, event, and spatial and temporal relation extraction. The proposed model is named Video Structural Description (VSD). In order to meet the representing and annotating needs for objects, events, and spatial–temporal relations during the video understanding process, a wide-domain applicable traffic ontology that uses objects and spatial/temporal relations in an event is developed. In order to organize the video resources based on their associations, a semantic link network (Zhuge, 2011) based method is used. The major contributions of this paper are summarized as follows.
(1) A whole framework for building domain ontology of VSD is
proposed. The basic concepts, events, and relations of a given
domain are defined. Moreover, a rule construction standard
which is domain independent is given to construct domain
ontologies. Domain ontologies are enriched by including addi-
tional rule definitions.
(2) The proposed method defines a number of concepts and their relations, which allows users to use them to detect video traffic events. A number of concepts including person, vehicle, and traffic signs are given, which can be used by users for annotating and representing video traffic events unambiguously. In addition, the spatial and temporal relations in an event are proposed, which can be used for annotating and representing the semantic relations between objects in video traffic events.
(3) In order to organize the video resources, semantic link net-
work based method is used. The semantic link network
model can mine and organize video resources based on their
associations.
(4) A semantic video annotation tool is implemented for annotat-
ing and organizing video resources based on the video anno-
tation ontology. The annotation tool allows annotators to use
domain specific vocabularies from traffic field to describe the
video resources. These annotated video resources are man-
aged based on the semantic relation between annotations.
A semantic-based video organizing platform is provided for
searching videos. It supports reasoning operation of the anno-
tations of video resources.
The organization of the paper is as follows. In Section 2, the related work of the proposed work is given. The proposed VSD framework is given in Section 3. In Section 4, the ontology of the traffic events domain is built. In Section 5, the semantic link network model is proposed to mine and organize video resources based on their associations. In Sections 6 and 7, the application and case study for mining video surveillance data are given. Finally, the conclusions and future research directions are discussed.

2. Related work

The key issue in semantic content extraction from videos is the representation of the semantic content. Many researchers have studied this from different aspects. A simple representation method may be to associate video events with low-level features (texture, shape, color, etc.) using frames or shots from videos. These simple methods do not use any relations between features, such as spatial or temporal relations. Obviously, using spatial or temporal relations between objects in videos is important for achieving accurate extraction of events. Systems such as BilVideo (Donderler et al., 2005), extended-AVIS (Sevilmis et al., 2008), multiView (Fan et al., 2001) and classView (Fan et al., 2004) used spatial and temporal relations but do not have ontology-based models for semantic content representation. Bai et al. (2007) presented a semantic based framework using domain ontology. Their work is used to represent video events with temporal description logic. However, the event extraction is manual and event descriptions only use temporal information. Nevatia and Natarajan (2005) gave an ontology model using spatial temporal relations to extract complex events where the extraction process is manual. In Bagdanov et al. (2007), each defined concept is related to a corresponding visual concept with only temporal relations for soccer videos. Nevatia and Natarajan (2005) built an event ontology for natural representation of complex spatial temporal events given simpler events. A Video Event Recognition Language (VERL) (Nevatia et al., 2005) that allows users to define events without interacting with the low level processing is defined. VERL is intended to be a language for representing events for the purpose of designing the ontology of the domain, and Video Event Markup Language (VEML) is used to manually annotate VERL events in videos. The lack of low level processing and the use of manual annotation are the drawbacks of this study. Akdemir et al. (2008) present a systematic approach to address
Fig. 1. The hierarchical structure of VSD.
the problem of designing ontologies for visual activity recognition. The general ontology design principles are adapted to the specific domain of human activity ontologies using spatial temporal relations between contextual entities. However, most of the contextual entities which are utilized as critical entities in spatial and temporal relations must be manually provided for activity recognition. Some researches pay attention to the symbolic representation, i.e. semantic relations between visual symbols. Marszalek et al. (2007) used semantic hierarchies from WordNet to integrate prior knowledge about inter-class relationships into visual appearance learning. Deng et al. (2009) launched ImageNet, aiming at building synsets in WordNet with an average of 500–1000 images selected manually by humans. Yao et al. (2010) presented an image parsing to text description (I2T) framework that generates text descriptions of image and video content based on image understanding.
3. The overview of video structural description

Video structural description (VSD) aims at parsing video content into text information, using spatiotemporal segmentation (Chen and Ahuja, 2012), feature selection (Javed et al., 2012), object recognition (Choi et al., 2012), and semantic web technology (Luo et al., 2011; Xu et al., 2011; Liu et al., 2010, 2011; Plebani and Pernici, 2009). The parsed text information preserves the semantics of the video content, which can be understood by humans and machines. Generally speaking, the definition of VSD includes two aspects. Firstly, VSD aims at extracting the semantic content from the video. Relying on the standard video content description mechanism, the objects and their features in the video are recognized and expressed in the form of text. Secondly, VSD aims at organizing the video resources with their semantic relations. With the semantic links across multiple cameras, it is possible to use data mining methods for effective analysis and semantic retrieval of videos. Moreover, semantic linking between the video resources and other information systems becomes possible. VSD is the foundation of building the next generation of intelligent and semantic video surveillance networks. VSD also makes systematical, interconnected, and diverse applications on video surveillance systems possible. With the help of VSD, the simple data acquisition mode of a video surveillance system can be transferred to an integrated mode of data acquisition, content processing, and semantic information services. The key issue and main innovation of VSD is the integration of video understanding and semantic web technologies. The semantic web technologies are used for representing and organizing the huge number of video resources.
3.1. The hierarchical structure of VSD

VSD is set as a hierarchical semantic data model including three different layers. The different layers of VSD are illustrated in Fig. 1.
(1) Pattern recognition layer: In this layer, VSD technology aims to extract and represent the content of the videos. For example, the people, vehicles, and traffic signs of a traffic video are extracted. Different from existing video content extraction and representation methods, VSD uses domain ontologies including basic concepts, events, and relations. These domain ontologies can be used by users for annotating and representing video traffic events unambiguously. In addition, the spatial and temporal relations are defined in event and concept definitions, which can be used by users for annotating and representing the semantic relations between objects in video traffic events.
(2) Video resources layer: In the pattern recognition layer, VSD extracts and represents the content of a single video. In the video resources layer, VSD technology aims at linking the video resources with their semantic relations. Similar to the World Wide Web, which uses hyperlinks to link resources, VSD uses semantic links instead of hyperlinks to link video resources.
(3) User demands layer: The pattern recognition layer and video resources layer focus on processing video resources using their semantics. The user demands layer focuses on processing the needs of users and returning the related resources. In the user demands layer, the video resources are clustered and integrated according to the user's needs.
From Fig. 1, the bottom layer consists of different objects. These objects, recognized by related pattern recognition methods, compose single videos. The middle layer consists of different videos. These videos consist of the different objects from the bottom layer. Semantic relations also exist between video resources. In the top layer, users can search, annotate, and browse the related video resources. For example, if a user wants to know the vehicles which cross the red traffic light in a video, the video resources layer can return the related videos.
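The three layers described above can be sketched as a minimal data model; the class and function names below are illustrative, not from the paper:

```python
from dataclasses import dataclass, field

# Bottom layer: an object recognized in a video and mapped to an
# ontology concept, carrying its visual attributes.
@dataclass
class RecognizedObject:
    concept: str                              # e.g. "motor vehicle"
    attributes: dict = field(default_factory=dict)

# Middle layer: a video composed of recognized objects; semantic links
# between videos would be added on top of this.
@dataclass
class Video:
    video_id: str
    objects: list                             # RecognizedObject instances

# Top layer: a user query resolved against the video resources layer.
def search(videos, concept):
    """Return videos containing at least one object of the given concept."""
    return [v for v in videos if any(o.concept == concept for o in v.objects)]

vids = [Video("v1", [RecognizedObject("motor vehicle")]),
        Video("v2", [RecognizedObject("person")])]
print([v.video_id for v in search(vids, "motor vehicle")])  # ['v1']
```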
3.2. The supporting technologies of VSD

In this section, the supporting technologies of VSD are introduced. These technologies are used in the different layers of VSD, which can achieve the ultimate goal of VSD. The supporting technologies are listed as follows.
(1) Computer vision: Computer vision is a field that includes
methods for acquiring, processing, analyzing, and understand-
ing images. A theme in the development of this field has been
to duplicate the abilities of human vision by electronically per-
ceiving and understanding an image. This image understanding
can be seen as the disentangling of symbolic information from
image data using models constructed with the aid of geome-
try, physics, statistics, and learning theory. The computer vision
technologies can be used in the pattern recognition layer. For
example, the car and people of a traffic video can be detected
by the object detection technologies from computer vision
field.
(2) Semantic web: The Semantic Web (Berners-Lee et al., 2001; Ma
et al., 2010; Zhuge, 2009) is a collaborative movement led by
the international standards body, the World Wide Web Consor-
tium (W3C). The standard promotes common data formats on
the World Wide Web. By encouraging the inclusion of semantic
Fig. 2. An example of representing a key frame of a video using the domain ontology.
Fig. 3. An example ontology.
content in web pages, the Semantic Web aims at converting the
current web dominated by unstructured and semi-structured
documents into a “web of data”. The semantic web technol-
ogy can be used in the pattern recognition layer. For example,
with the help of the specific domain ontologies, the objects and
relations of videos can be detected accurately.
(3) Semantic link network: A semantic link network (SLN) is a re-
lational network consisting of the following main parts: a set of
semantic nodes, a set of semantic links between the nodes, and
a semantic space. Semantic nodes can be anything. The seman-
tic link between nodes is regulated by the attributes of nodes or
generated by interactions between nodes. The semantic space
includes a classification hierarchy of concepts and a set of rules for reasoning and inferring semantic links, for influencing nodes and links, for networking, and for evolving the network. The
semantic link network can be used in the video resources
layer. For example, with the help of the semantic link net-
work model, the videos can be organized with their semantic
relations.
(4) Cloud computing: Cloud computing is a colloquial expres-
sion used to describe a variety of different computing con-
cepts that involve a large number of computers that are con-
nected through a real-time communication network. In sci-
ence, cloud computing is a synonym for distributed computing
over a network and means the ability to run a program on many
connected computers at the same time. The cloud computing
technologies can be used in the video application layer. For example, with the help of the cloud computing technologies, the huge number of videos can be managed and indexed efficiently and robustly.
4. The bottom layer – building domain ontology for representing
video surveillance data
In this section, the domain ontology of traffic events is built. Since
the number of traffic videos is huge, the standard ontology can help
to represent videos accurately and efficiently.
4.1. Basic definitions
Concepts, objects, attributes, spatial relations, temporal relations,
and events are basic components of the proposed ontology frame-
work. In this section, we give basic definitions of these components.
Moreover, we add ontology constraints, which are used to give the standard when building ontologies. Figs. 2 and 3 give an example of representing a key frame of a video using the domain ontology.
Definition 1. Domain Ontology (DO): Domain ontology is the stan-
dard representation of a special domain, including concepts, objects,
attributes, spatial relations, temporal relations, and events. The do-
main ontology can be denoted as
DO = {Concept, Attribute, Object, Temporal relation, Spatial relation, Event} (1)
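Definition 1 can be read as a six-part container; a minimal sketch in Python, with field names following Eq. (1) (the class itself is illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class DomainOntology:
    """The six components of the domain ontology in Eq. (1)."""
    concepts: set = field(default_factory=set)           # Concept
    attributes: set = field(default_factory=set)         # Attribute
    objects: list = field(default_factory=list)          # Object
    temporal_relations: set = field(default_factory=set) # Temporal relation
    spatial_relations: set = field(default_factory=set)  # Spatial relation
    events: list = field(default_factory=list)           # Event

# A fragment of the traffic domain ontology built in Section 4.2.
traffic = DomainOntology(
    concepts={"motor vehicle", "traffic light", "stop line"},
    temporal_relations={"before", "during", "overlap", "equal", "meet"},
)
print("traffic light" in traffic.concepts)  # True
```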
Definition 2. Concept (C): Concept is the standard taxonomy of the objects of a special domain. Concepts are similar to the nodes of WordNet.2 The concept can be denoted as

Concept = {c1, c2, . . . , cm} (2)

where m means the number of concepts of the domain ontology.

Definition 3. Object (O): Object is the extracted component from a video. The extracted object is mapped to a concept. The object can be denoted as

Object = {o1, o2, . . . , on}, ∀oi ∃cj ∈ Concept : oi ⇒ cj (3)

2 www.wordnet.princeton.edu.
Fig. 4. The hierarchical structure of traffic signs.
where n means the number of objects of the domain ontology, and oi ⇒ cj means the mapping operation from object to concept.

Definition 4. Attribute (A): Attribute is the visual feature of the objects from a video. The attribute can be denoted as

Attribute = {a1, a2, . . . , ak} (4)

where k means the number of attributes of the domain ontology; thus, an object can be represented as a vector of attributes:

oi = {a1, a2, . . . , ak} (5)

Definition 5. Temporal Relation (TR): Temporal relation is the timing relation between the different time intervals of a video. The temporal relation can be denoted as

TR = {before, during, overlap, equal, meet} (6)

Suppose two time intervals 〈t1, t2〉 and 〈t3, t4〉; the temporal relations can be denoted as

before(〈t1, t2〉, 〈t3, t4〉) → t2 < t3
during(〈t1, t2〉, 〈t3, t4〉) → t1 > t3 ∧ t2 < t4
overlap(〈t1, t2〉, 〈t3, t4〉) → t1 < t3 ∧ t3 < t2
equal(〈t1, t2〉, 〈t3, t4〉) → t1 = t3 ∧ t2 = t4
meet(〈t1, t2〉, 〈t3, t4〉) → t2 = t3
(7)

Definition 6. Spatial Relation (SR): Spatial relation is the position relation between the different objects of a video. The spatial relation can be denoted as

SR = {inside, touch, partially inside, right, left, above, below, far, near} (8)

Suppose the coordinates of two objects are

oi = 〈[x1, y1], [x2, y2]〉, oj = 〈[x3, y3], [x4, y4]〉 (9)

It is noted that the shape of an object is a rectangle, which means that [x1, y1] and [x2, y2] are the upper-left and lower-right coordinates of the object. The spatial relations can be denoted as

inside(oi, oj) → x1 < x3 ∧ x4 < x2 ∧ y1 < y3 ∧ y4 < y2
touch(oi, oj) → x2 = x3 ∨ y2 = y3
partially inside(oi, oj) → (x1 < x3 ∧ (x2 < x4 ∨ y1 > y3 ∨ y2 < y4)) ∨ (y1 < y3 ∧ (y2 < y4 ∨ x1 > x3 ∨ x2 < x4))
right(oi, oj) → x2 < x3
left(oi, oj) → x1 > x4
above(oi, oj) → y2 < y3
below(oi, oj) → y1 > y4
far(oi, oj) → x3 − x2 > α ∨ x1 − x4 > α ∨ y3 − y2 > α ∨ y1 − y4 > α
near(oi, oj) → x3 − x2 < α ∨ x1 − x4 < α ∨ y3 − y2 < α ∨ y1 − y4 < α
(10)

where α is a threshold to distinguish the far and near relations.

Definition 7. Event (E): An event is the combination of objects and their spatial–temporal relations, which can be denoted as

Event = (object, spatial relation, temporal relation) (11)

4.2. Case study on representing video traffic events

In this section, the proposed ontology building method is applied to the traffic events field. The events, objects, and spatial–temporal relations are built together. The traffic events are all extracted from the illegal actions of vehicles listed on the web site of the ministry of public security. These illegal traffic events consist of two parts:

(1) Vehicles: Vehicles are the core components of the illegal traffic events. The concepts of all potential vehicles which may appear in the illegal traffic events are built.
(2) Traffic signs: Traffic signs are the basic components of the illegal traffic events. All of the potential traffic signs which may appear in the illegal traffic events are built. Fig. 4 gives an illustration of the defined objects of the traffic domain.
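The temporal relations of Eq. (7) and a few of the spatial relations of Eq. (10) translate directly into predicates; a minimal Python sketch, where intervals are (start, end) pairs, rectangles are ((x1, y1), (x2, y2)) upper-left/lower-right corner pairs, and alpha is the far/near threshold of Definition 6:

```python
# Temporal relations of Eq. (7): each interval is a (start, end) pair.
def before(a, b):  return a[1] < b[0]
def during(a, b):  return a[0] > b[0] and a[1] < b[1]
def overlap(a, b): return a[0] < b[0] and b[0] < a[1]
def equal(a, b):   return a[0] == b[0] and a[1] == b[1]
def meet(a, b):    return a[1] == b[0]

# A few spatial relations of Eq. (10): each object is a rectangle
# ((x1, y1), (x2, y2)) with upper-left and lower-right corners.
def inside(oi, oj):
    (x1, y1), (x2, y2) = oi
    (x3, y3), (x4, y4) = oj
    return x1 < x3 and x4 < x2 and y1 < y3 and y4 < y2

def right(oi, oj):
    return oi[1][0] < oj[0][0]  # x2 < x3

def far(oi, oj, alpha):
    (x1, y1), (x2, y2) = oi
    (x3, y3), (x4, y4) = oj
    return (x3 - x2 > alpha or x1 - x4 > alpha or
            y3 - y2 > alpha or y1 - y4 > alpha)

print(before((0, 2), (3, 5)), inside(((0, 0), (10, 10)), ((2, 2), (8, 8))))
# True True
```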
Fig. 5. The ontology of example event1.
Fig. 6. The ontology of example event2.
Two examples of the illegal traffic events are given. In total, we build 215 illegal traffic events.3

Example Event 1: A motor vehicle crosses a red traffic light at the crossroads.

In this example, the objects contain motor vehicles, a traffic light, and a stop line. Besides the objects, the spatial and temporal relations between them should be considered. Fig. 5 shows the ontology of this event. From Fig. 5, we can see that event 1 has a temporal relation between two different times. At time1, the motor vehicle is not inside the stop line. At the next time2, the motor vehicle is inside the stop line. Through the ontology of event 1, we can detect whether a car crosses the red light or not.

Example Event 2: A vehicle overtakes the front vehicle on the right side.

In this example, the objects only contain motor vehicles. Besides the motor vehicles, the spatial and temporal relations between them should be considered. Fig. 6 shows the ontology of this event. From Fig. 6, we can see that event 2 has temporal relations between three different times. At time1, motor vehicle1 is below motor vehicle2. At the next time2, motor vehicle1 is at the right side of motor vehicle2. At the last time3, motor vehicle1 is above motor vehicle2.
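Example Event 1 can be sketched as a rule over two consecutive annotated time steps; the frame annotation keys below are illustrative, not the paper's vocabulary:

```python
# Event 1 of Fig. 5 as a rule: at time1 the motor vehicle is not inside
# the stop line area, at time2 it is, and the light is red throughout.
def crosses_red_light(frames):
    """frames: time-ordered dicts such as
    {"light": "red", "vehicle_inside_stop_line": False}."""
    for t1, t2 in zip(frames, frames[1:]):
        if (t1["light"] == "red" and t2["light"] == "red"
                and not t1["vehicle_inside_stop_line"]
                and t2["vehicle_inside_stop_line"]):
            return True
    return False

frames = [{"light": "red", "vehicle_inside_stop_line": False},
          {"light": "red", "vehicle_inside_stop_line": True}]
print(crosses_red_light(frames))  # True
```

Event 2 would follow the same pattern, with a three-step sequence of the spatial relations below, right, and above between the two vehicles.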
5. The middle layer – using semantic link network for organizing
video surveillance big data
In this section, the semantic link network is used for organizing traffic video resources. The semantic link network has been verified in large-scale resource environments (Luo et al., 2011).
5.1. The introduction of Semantic Link Network
The Semantic Link Network (SLN) (Zhuge, 2009) was proposed
as a semantic data model for organizing various Web resources by
extending the Web’s hyperlink to a semantic link. SLN is a directed
network consisting of semantic nodes and semantic links. A seman-
tic node can be a concept, an instance of concept, a schema of data
set, a URL, any form of resources, or even an SLN (Zhuge, 2011). A
semantic link reflects a kind of relational knowledge represented
as a pointer with a tag describing such semantic relations as cause-effect, implication, subtype, similar, instance, sequence, reference, and equal. The semantics of tags are usually common sense and can be regulated by their category, relevant reasoning rules, and use cases. A set of general semantic relation reasoning rules was suggested in Zhuge (2010) and Zhuge (2012). A relation could have a reverse relation. Relations and their corresponding reverse relations are knowledge for supporting semantic relation reasoning. SLN is a self-organized network since any node can link to any other node via a semantic link. SLN has been used to improve the efficiency of query routing in P2P networks (Zhuge et al., 2008), and it has been adopted as one of the major mechanisms of organizing resources for the Knowledge Grid.

3 The 215 illegal traffic events are obtained from www.shjtaq.com/zwfg/dmb2012.htm.

5.2. Using SLN for Organizing Traffic Videos

Since the SLN model focuses on Web resources, the model should be revised when used on video resources. Some related definitions are given first.

Definition 8. Object Relation (OR): Object relation (OR(oi, oj)) is the semantic relation between the different objects of a video, for example, the same car in different videos.

In the VSD technology, the object relation between two objects is detected by their attributes. For example, if two cars with the same color appear in different videos, the object relation between these videos is detected.

Definition 9. Video Spatial Relation (VSR): Video spatial relation (VSR(vi, vj)) is the spatial relation between videos.

In the VSD technology, since the videos are obtained from the surveillance equipment, the spatial information can be obtained easily. For example, if two videos are from nearby crossroads, the video spatial relation between these videos is detected.

Definition 10. Video Temporal Relation (VTR): Video temporal relation (VTR(vi, vj)) is the temporal relation between videos.

In the VSD technology, since the videos are obtained from the surveillance equipment, the time information can be obtained easily. For example, if two videos are from related times, the video temporal relation between these videos is detected.
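Definitions 8–10 suggest how a semantic link network over videos could be assembled; a sketch in which the attribute matching and the distance/time thresholds are simplified assumptions:

```python
# Build semantic links between videos using the relations of
# Definitions 8-10. Each video carries a set of object descriptions,
# a 1-D camera location, and a timestamp; thresholds are illustrative.
def build_links(videos, dist_thresh=1.0, time_thresh=600):
    links = []
    for i, vi in enumerate(videos):
        for vj in videos[i + 1:]:
            # OR: objects with the same attributes appear in both videos.
            if set(vi["objects"]) & set(vj["objects"]):
                links.append((vi["id"], "object", vj["id"]))
            # VSR: the two cameras are spatially close.
            if abs(vi["loc"] - vj["loc"]) < dist_thresh:
                links.append((vi["id"], "spatial", vj["id"]))
            # VTR: the two recordings are temporally related.
            if abs(vi["time"] - vj["time"]) < time_thresh:
                links.append((vi["id"], "temporal", vj["id"]))
    return links

videos = [{"id": "v1", "objects": {"red car"}, "loc": 0.0, "time": 0},
          {"id": "v2", "objects": {"red car"}, "loc": 0.5, "time": 120}]
print(build_links(videos))
```

The resulting triples (source video, relation tag, target video) are exactly the tagged pointers that an SLN stores, over which the reasoning rules mentioned above could then operate.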
Fig. 7. The annotation interface for users.
6. The top layer – the application on annotating and searching
traffic events

In this section, the applications on annotating and searching of
video resources are given.
6.1. The video annotation ontology
The video annotation ontology and annotation instances are
stored in a Resource Description Framework (RDF)4 scheme, and
the ontologies reuse a number of RDF vocabularies. These ontology
vocabularies are extracted from the following knowledge
repository.
(1) The traffic law of China: We analyze the traffic law of China
and extract the basic concepts from it, for example, the traffic
light, car, people, road line, and so on. These basic concepts are
provided for users when they annotate the video resources.
Since the video resources are all about traffic events, these
ontologies are sufficient for users.
(2) The basic features of car: We give the basic features of a car,
such as color, shape and so on.
(3) The basic features of a person: We give the basic features of a
person, such as clothing color, hair style and so on.
These basic concepts are built as the annotation ontologies by
Protégé5 and TBC.6 Protégé is an ontology building tool developed
by Stanford University. This tool can simply generate the ontology
based on our selected features from traffic law, car, and person. Moreover,
Protégé can support SWRL7 based semantic reasoning, which
can facilitate the semantic mining procedures in the other modules.
TBC is a free ontology generating platform based on the Eclipse development
environment, which provides the instances and meta-fields of
ontologies.
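To make the RDF scheme concrete, the sketch below serializes one annotation instance as N-Triples. The namespace URI and the property names (hasColor, hasStyle) are illustrative assumptions; the actual vocabulary is the one extracted from the traffic law, car, and person ontologies:

```python
# Serialize a single annotated car as N-Triples lines.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
VSD = "http://example.org/vsd#"  # hypothetical ontology namespace

def annotate_car(ann_id, color, style):
    """Return N-Triples lines describing one annotated car instance."""
    subj = "<%s%s>" % (VSD, ann_id)
    return [
        "%s <%s> <%sCar> ." % (subj, RDF_TYPE, VSD),
        '%s <%shasColor> "%s" .' % (subj, VSD, color),
        '%s <%shasStyle> "%s" .' % (subj, VSD, style),
    ]
```

Triples in this shape can be loaded into any RDF store and queried with SPARQL, which is how the ontology-based annotations become searchable.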
6.2. The video annotation and searching module
The video annotation module provides the core function for users.
Users can use this module to annotate video resources. Of course, the
annotation concepts should follow the ontologies. The annotation
procedure for a user is as follows.
(1) Select or upload video resources: Users can choose to annotate an
existing video resource or upload their own video resources. It
is noted that the users of the proposed annotation tool are all
policemen, so the uploaded videos are also about traffic events.
(2) According to the given ontologies, users select the appropriate
concepts to annotate the videos. For example, if a video con-
tains a car, users should annotate the color, style, and other
features of it.
(3) In a video, users can annotate different frames at different
timestamps.
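The three-step procedure above can be sketched as a simple annotation record. The field names are illustrative assumptions rather than the tool's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """One user annotation, following the three-step procedure above."""
    video_id: str
    timestamp: str   # frame time inside the video, e.g. "08:02:43" (step 3)
    bbox: tuple      # (x, y, width, height) of the annotated rectangle
    concept: str     # ontology concept, e.g. "Car" or "Person" (step 2)
    attributes: dict = field(default_factory=dict)  # e.g. {"color": "red"}

def validate(ann, ontology_concepts):
    """Step 2 requires that annotations use concepts from the given ontologies."""
    return ann.concept in ontology_concepts
```

Validating each annotation against the ontology concepts is what keeps user-supplied metadata consistent with the RDF scheme of Section 6.1.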
Fig. 7 shows the annotation interface for users. From Fig. 7, we
can see the annotation interface contains the following parts.
(1) Annotation part: The annotation part is in the middle of the
annotation interface. Users can draw a rectangle around her/his
interested parts. For example, in Fig. 7, users annotate the person
in the car.
(2) Input part: The input part is on the right of the annotation
interface. Users can input the detailed features of the provided
attributes. For example, in Fig. 7, users annotate the hair style and
cloth color of the person.
(3) Time scroll part: The time scroll part is at the bottom of the
annotation interface. Users can scroll a video forward or backward. For
example, in Fig. 7, users annotate the image at 8:02:43 of
the video.

4 www.w3.org/RDF/, 2013.
5 http://protege.stanford.edu/, 2013.
6 www.docjar.org, 2013.
7 www.w3.org/Submission/SWRL/, 2013.
The video searching module provides the search function for users.
Users can use this module to search video resources. Of course, the
search concepts should follow the ontologies. The searching interface
contains the following parts.
(1) The queries input part: The queries input part is in the front of
the searching module. Users can input the searching queries
in this part. For example, the user searches the query “car
light”.
(2) The searching results part: The searching results part is in the
middle of the searching interface. Users can browse the searching
results in this part. For example, if users search for "car light"
in the searching module, the returned results are all annotated
image or video resources containing the concept "car light" in the
annotated meta-data.
The video searching module is implemented on the Virtuoso8
database. Java is used to add, delete, and revise the database. Different
from LarKC,9 the Virtuoso database offers better performance
and a friendlier interface.
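A search like the "car light" example above would ultimately be issued as a SPARQL query against the store. The sketch below composes such a query; the vsd: prefix and property names are illustrative assumptions rather than the deployed Virtuoso schema:

```python
def build_search_query(concept):
    """Compose a SPARQL query for videos annotated with a given concept.
    The vocabulary (vsd:annotates, vsd:hasConcept) is hypothetical."""
    query = """PREFIX vsd: <http://example.org/vsd#>
SELECT ?video WHERE {{
  ?ann vsd:annotates ?video ;
       vsd:hasConcept "{0}" .
}}""".format(concept)
    return query
```

In the deployed system, the Java layer would send a query of this shape to Virtuoso's SPARQL endpoint and render the matching image or video resources.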
7. Case study
This case study aims at finding illegal cars using other cars' licenses.
The detailed information of this study is listed as follows.
(1) Task: Finding the illegal cars using other licenses. In China, each
car should have a sole license and a sole license number. For
example, the license number of a car in Shanghai is A-86812.
Since the license number is the sole identifier of a car, some

8 www.virtuoso.com/, 2013.
9 www.larkc.eu/, 2013.
owners of cars may use other licenses in order to avoid the
punishment of the ministry of public security.
(2) Data set: 1.19 billion records from the traffic speed cameras. The
data consist of three pieces of information that are important for solving
the task: the license number, the GIS information of the
car, and the capture time of the traffic speed camera. Overall,
from the data set, we can know the appearing time and place
of a car with a given license number.
(3) Data processing: Ten servers are used to store and process these
1.19 billion records. 380 blocks are used to store the data, and
each block stores 3 million records. The total storage
space is up to 103 GB. The time of copying the data to the ten servers
is up to 200 min.
(4) MapReduce: We use the map function to classify the cars into
the different time and the reduce function to classify the cars
by the license number. The MapReduce framework is used to
process the cars by the same license number. The time of the
MapReduce process is up to 50 min.
(5) Rules: We set the rule for detecting the cars with illegal license
numbers as "the distance between cars with the same license
number should be lower than 15 km within a time interval of
10 min." In other words, if the time interval between two sightings
of the same car is 10 min, the distance between them should be lower
than 15 km. For example, a car with the license number A-86812
appears at place A at 10:00 on 2013.9.13. Another car
with the license number A-86812 appears at place B at
10:05 on 2013.9.13. The distance between places A and B is
longer than 15 km. Obviously, a car can hardly travel 15 km in
5 min.
(6) Results: 394 cars are selected as candidates for illegal cars
according to the defined rules. These candidates are compared
with the car information database of the ministry of public
security. For example, the brand of a candidate car with the
license number A-86812 is BMW, but the brand information in
the car information database of the ministry of public security
is Audi. Thus, we can say that this candidate car uses the license
number of another car.
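The rule in step (5) can be sketched as a distance/time check over a plate's consecutive sightings. The great-circle distance is one reasonable way to interpret the GIS information; the data layout here is an illustrative assumption:

```python
import math

def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def suspicious(sightings, max_km=15.0, max_min=10.0):
    """Apply the rule from step (5): within max_min minutes, two sightings
    of the same license number cannot be more than max_km apart.
    Each sighting is (time_in_minutes, (lat, lon))."""
    sightings = sorted(sightings)
    for (t1, p1), (t2, p2) in zip(sightings, sightings[1:]):
        if t2 - t1 <= max_min and haversine_km(p1, p2) > max_km:
            return True
    return False
```

A plate seen twice in Shanghai 5 min apart but over 30 km away would be flagged, mirroring the A-86812 example; a plate whose sightings are far apart only across a long time gap would not.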
In the above case study, the VSD technologies are used to detect
the basic information of a car such as the brand, the color, and the
license number. The MapReduce technology is used for processing
the original data.
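The map and reduce steps from step (4) can be sketched in-process as follows. This is a toy stand-in for the actual ten-server deployment, and the record fields are illustrative assumptions:

```python
from collections import defaultdict

def map_phase(records):
    """Map step: emit (license_number, (time, gps)) pairs, as in step (4)."""
    for rec in records:
        yield rec["license"], (rec["time"], rec["gps"])

def reduce_phase(pairs):
    """Reduce step: group all sightings of the same license number together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)
```

Once grouped, each license number's sighting list can be checked against the rule of step (5) independently, which is what makes the task parallelize well across servers.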
8. Conclusion
The increasing need for video based applications raises the importance
of parsing and organizing the content in videos. However,
accurate understanding and managing of video content at the semantic
level is still insufficient. In this paper, a semantic based model
named Video Structural Description (VSD) for representing and or-
ganizing the content in videos is proposed. Video structural descrip-
tion aims at parsing video content into the text information, which
uses spatiotemporal segmentation, feature selection, object recogni-
tion, and semantic web technology. In this paper, a semantic based
model has been proposed for representing and organizing video big
data. The proposed surveillance video representation method de-
fines a number of concepts and their relations, which allows users
to use them to annotate related surveillance events. The defined
concepts include person, vehicles, and traffic signs, which can be used
for annotating and representing video traffic events unambiguously.
In addition, the spatial and temporal relation between objects in an
event has been defined, which can be used for annotating and repre-
senting the semantic relation between objects in related surveillance
events. Moreover, semantic link network has been used for organiz-
ing video resources based on their associations. In the application,
one case study has been presented to analyze the surveillance big
data.
Acknowledgements

This work was supported in part by the National Science and Technology
Major Project under Grant 2013ZX01033002-003, in part by
the National High Technology Research and Development Program
of China (863 Program) under Grants 2013AA014601 and 2013AA014603,
in part by the National Key Technology Support Program under Grant
2012BAH07B01, in part by the National Science Foundation of China
under Grant 61300202, and in part by the Science Foundation of
Shanghai under Grant 13ZR1452900.
References

Akdemir, U., Turaga, P., Chellappa, R., 2008. An ontology based approach for activity recognition from video. In: Proceedings of the ACM International Conference on Multimedia, pp. 709–712.
Bagdanov, A., Bertini, M., Del Bimbo, A., Torniai, C., Serra, G., 2007. Semantic annotation and retrieval of video events using multimedia ontologies. In: Proceedings of IEEE International Conference on Semantic Computing.
Bai, L., Lao, S., Jones, G., Smeaton, A., 2007. Video semantic content analysis based on ontology. In: Proceedings of the 11th International Machine Vision and Image Processing Conference, pp. 117–124.
Berners-Lee, T., Hendler, J., Lassila, O., 2001. The semantic web. Sci. Am. 284 (5), 34–43.
Chen, H., Ahuja, N., 2012. Exploiting nonlocal spatiotemporal structure for video segmentation. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 741–748.
Choi, M., Torralba, A., Willsky, A., 2012. A tree-based context model for object recognition. IEEE Trans. Pattern Anal. Mach. Intell. 34 (2), 240–252.
2013. Cisco Visual Networking Index: Forecast and Methodology, 2009–2014. Available: http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/whitepaper_c11-481360_ns827_Networking_Solutions_WhitePaper.html.
Dan, C., Zhixin, L., Lizhe, W., Minggang, D., Jingying, C., Hui, L., 2013. Natural disaster monitoring with wireless sensor networks: a case study of data-intensive applications upon low-cost scalable systems. ACM/Springer Mob. Netw. Appl. 18 (5), 651–663.
Deng, J., Socher, R., Li, L.-J., Fei-Fei, L., 2009. ImageNet: a large-scale hierarchical image database. In: IEEE Proc. CVPR.
Donderler, M., Saykol, E., Arslan, U., Ulusoy, O., Gudukbay, U., 2005. Bilvideo: design and implementation of a video database management system. Multimed. Tools Appl. 27 (1), 79–104.
Fan, J., Aref, W., Elmagarmid, A., Hacid, M., Marzouk, M., Zhu, X., 2001. Multiview: multilevel video content representation and retrieval. J. Electron. Imaging 10 (4), 895–908.
Fan, J., Elmagarmid, A., Zhu, X., Aref, W., Wu, L., 2004. Classview: hierarchical video shot classification, indexing, and accessing. IEEE Trans. Multimed. 6 (1), 70–86.
2013. Great Scott! Over 35 hours of video uploaded every minute to Youtube. The Official YouTube Blog. Available: http://youtube-global.blogspot.com/2010/11/great-scott-over-35-hours-of-video.html.
Javed, K., Babri, H., Saeed, M., 2012. Feature selection based on class-dependent densities for high-dimensional binary data. IEEE Trans. Knowl. Data Eng. 24 (3), 465–477.
Liu, L., Li, Z., Delp, E., 2009. Efficient and low-complexity surveillance video compression using backward-channel aware Wyner-Ziv video coding. IEEE Trans. Circuits Syst. Video Technol. 19 (4), 452–465.
Liu, Y., Zhang, Q., Lionel, M.N., 2010. Opportunity-based topology control in wireless sensor networks. IEEE Trans. Parallel Distrib. Syst. 21 (3), 405–416.
Liu, Y., Zhu, Y., Lionel, M., Ni, G.X., 2011. A reliability-oriented transmission service in wireless sensor networks. IEEE Trans. Parallel Distrib. Syst. 22 (12), 2100–2107.
Liu, X., Yang, Y., Yuan, D., Chen, J., 2013. Do we need to handle every temporal violation in scientific workflow systems. ACM Trans. Softw. Eng. Methodol. (early access).
Lizhe, W., von Laszewski, G., Younge, A.J., Xi, H., Kunze, M., Jie, T., Cheng, F., 2010. Cloud computing: a perspective study. N. Gener. Comput. 28 (2), 137–146.
Luo, X., Zheng, X., Yu, J., Chen, X., 2011. Building association link network for semantic link on web resources. IEEE Trans. Autom. Sci. Eng. 8 (3), 482–494.
Ma, H., Zhu, J., Lyu, M., King, I., 2010. Bridging the semantic gap between image contents and tags. IEEE Trans. Multimed. 12 (5), 462–473.
Marszalek, M., Schmid, C., Inria, M., 2007. Semantic hierarchies for visual object recognition. In: IEEE Proc. CVPR.
Nevatia, R., Natarajan, P., 2005. EDF: a framework for semantic annotation of video. In: Proceedings of the 10th IEEE International Conference on Computer Vision Workshops, 1876 pp.
Nevatia, R., Hobbs, J., Bolles, R., Smith, J., 2005. VERL: an ontology framework for representing and annotating video events. IEEE Multimed. 12 (4), 76–86.
Plebani, P., Pernici, B., 2009. URBE: web service retrieval based on similarity evaluation. IEEE Trans. Knowl. Data Eng. 21 (11), 1629–1642.
Sevilmis, T., Bastan, M., Gudukbay, U., Ulusoy, O., 2008. Automatic detection of salient objects and spatial relations in videos for a video database system. Image Vis. Comput. 26 (10), 1384–1396.
Wigan, M., Clarke, R., 2013. Big data's big unintended consequences. Computer 46 (6), 46–53.
Wu, L., Wang, Y., 2010. The process of criminal investigation based on grey hazy set. In: 2010 IEEE International Conference on System Man and Cybernetics, pp. 26–28.
Xu, C., Zhang, Y., Zhu, G., Rui, Y., Lu, H., Huang, Q., 2008. Using webcast text for semantic event detection in broadcast sports video. IEEE Trans. Multimed. 10 (7), 1342–1355.
Xu, Z., Luo, X., Wang, L., 2011. Incremental building association link network. Comput. Syst. Sci. Eng. 26 (3), 153–162.
Yan, M., Lizhe, W., Dingsheng, L., Tao, Y., Peng, L., Wanfeng, Z., 2013a. Distributed data structure templates for data-intensive remote sensing applications. Concurr. Comput.: Pract. Exp. 25, 1784–1797.
Yan, M., Lizhe, W., Zomaya, A.Y., Dan, C., Ranjan, R., 2013b. Task-tree based large-scale mosaicking for remote sensed imageries with dynamic dag scheduling. IEEE Trans. Parallel Distrib. Comput. doi:10.1109/TPDS.2013.272.
Yao, B., Yang, X., Lin, L., Lee, M., Zhu, S., 2010. I2T: image parsing to text description. Proc. IEEE 98 (8), 1485–1508.
Yu, H., Pedrinaci, C., Dietze, S., Domingue, J., 2012. Using linked data to annotate and search educational video resources for supporting distance learning. IEEE Trans. Learn. Technol. 5 (2), 130–142.
Yuan, D., Yang, Y., Liu, X., Li, W., Cui, L., Xu, M., Chen, J., 2013. A highly practical approach towards achieving minimum datasets storage cost in the cloud. IEEE Trans. Parallel Distrib. Syst. 24 (6), 1234–1244.
Ze, D., Xiaomin, W., Lizhe, W., Xiaodao, C., Ranjan, R., Zomaya, A., Dan, C., 2014. Parallel processing of dynamic continuous queries over streaming data flows. IEEE Trans. Parallel Distrib. Syst. doi:10.1109/TPDS.2014.2311811 (forthcoming).
Zhang, J., Zulkernine, M., Haque, A., 2008. Random-forests-based network intrusion detection systems. IEEE Trans. Syst. Man Cybern. C: Appl. Rev. 38 (5), 649–659.
Zhang, X., Liu, C., Nepal, S., Pandev, S., Chen, J., 2013a. A privacy leakage upper-bound constraint based approach for cost-effective privacy preserving of intermediate datasets in cloud. IEEE Trans. Parallel Distrib. Syst. 24 (6), 1192–1202.
Zhang, X., Yang, T., Liu, C., Chen, J., 2013b. A scalable two-phase top-down specialization approach for data anonymization using MapReduce on cloud. IEEE Trans. Parallel Distrib. Syst. (early access).
Zhuge, H., 2009. Communities and emerging semantics in semantic link network: discovery and learning. IEEE Trans. Knowl. Data Eng. 21 (6), 785–799.
Zhuge, H., 2010. Interactive semantics. Artif. Intell. 174, 190–204.
Zhuge, H., 2011. Semantic linking through spaces for cyber-physical-socio intelligence: a methodology. Artif. Intell. 175, 988–1019.
Zhuge, H., 2012. The Knowledge Grid – Toward Cyber-Physical Society, second ed. World Scientific Publishing Co., Singapore.
Zhuge, H., Chen, X., Sun, X., Yao, E., 2008. HRing: a structured P2P overlay based on harmonic series. IEEE Trans. Parallel Distrib. Syst. 19 (2), 145–158.
Zheng Xu was born in Shanghai, China, in 1984. He received the diploma and PhD degrees from the School of Computing Engineering and Science, Shanghai University, Shanghai, in 2007 and 2012, respectively. He is currently working in the Third Research Institute of the Ministry of Public Security and Tsinghua University, China. His current research interests include topic detection and tracking, semantic web and web mining.
Yunhuai Liu is a professor in the Third Research Institute of the Ministry of Public Security, China. He received the PhD degree from Hong Kong University of Science and Technology (HKUST) in 2008. His main research interests include wireless sensor networks, pervasive computing, and wireless networks. He has authored or co-authored more than 50 publications, and his publications have appeared in IEEE Transactions on Parallel and Distributed Systems, IEEE Journal of Selected Areas in Communications, IEEE Transactions on Mobile Computing, IEEE Transactions on Vehicular Technology, etc.
Lin Mei received his PhD degree from Xian Jiaotong University, China. He is currently working in the Third Research Institute of the Ministry of Public Security, China. He is the dean professor of the Department of Internet of Things.
Chuanping Hu received his PhD degree from Tongji University, China. He is currently working in the Third Research Institute of the Ministry of Public Security, China. He is the dean professor of the Third Research Institute of the Ministry of Public Security.
Lan Chen is a PhD candidate at Beihang University, China. She is currently working in the Third Research Institute of the Ministry of Public Security, China.