Vol. 4, No. 9 September 2013 ISSN 2079-8407 Journal of Emerging Trends in Computing and Information Sciences

©2009-2013 CIS Journal. All rights reserved.

http://www.cisjournal.org

The Capture and Reduction Technology of Image Data based on HTTP Protocol

1 Wu yan lun, 2 Zhang xiao hong, 3 Peng cui

1, 2, 3 School of Information Engineering, Southwest University of Science and Technology, Mianyang, SiChuan, China

ABSTRACT

This article builds on network protocols to restore original network data, focusing on the image resources transmitted while users access the network. It covers four aspects: raw network packet capture, packet parsing, TCP fragment reassembly, and image restoration. The system is a network data restoration system built on the VS platform with MFC and a MySQL database. The packet capture module uses WinPcap to capture network packets while keeping packet loss to a minimum. The data parsing module extracts the SEQ and ACK fields of the TCP protocol, along with other important fields, and categorizes all of the packets; the SEQ and ACK values are then used to reassemble the TCP data stream. Finally, the data of all reassembled TCP streams is collected to restore the image.

Keywords: network packet; TCP fragment reassembly; Image Restore; database

1. INTRODUCTION

HTTP (Hypertext Transfer Protocol) is a stateless, request/response application-layer protocol that usually runs over a TCP connection. HTTP/1.1 adds a persistent connection mechanism, and the vast majority of Web applications are built on HTTP.

HTTP follows the client/server model. When a client requests a service from the server, it only needs to send the request method and path. The common request methods are GET, HEAD, and POST, and each method specifies a different type of interaction between the client and the server. Because the protocol is simple, an HTTP server program can be small and communication is fast. HTTP is also flexible: it allows any type of data object to be transmitted, with the type labeled by the Content-Type field. Without a persistent connection, each connection handles a single request: once the server has processed the client's request and received the client's acknowledgment, it disconnects, which saves transmission time.

HTTP is a stateless protocol, meaning that it keeps no memory of earlier transactions. If later processing needs earlier information, that information must be retransmitted, which can increase the amount of data transferred per connection. On the other hand, when the server does not need earlier information, it responds faster.

2. NETWORK PACKET CAPTURE AND PACKET ANALYSIS

2.1 Network Packet Capture

The raw packet capture module is implemented by calling the open-source packet capture library WinPcap. WinPcap is free, open software for direct network programming on Windows: a packet capture and analysis system for the Win32 platform. It contains a kernel-level packet filter driver, a low-level dynamic link library (Packet.dll), and a high-level, system-independent library compatible with libpcap, and it gives applications direct access to packets. WinPcap sends and receives packets independently of the host's protocol stack, such as TCP/IP. This means that it cannot block or manipulate the traffic generated by other programs on the same host; it can only "sniff" the packets on the physical line.

WinPcap consists of three parts. The first module is the Netgroup Packet Filter (NPF), a virtual device driver. Its function is to filter packets and pass them intact to the user-mode modules. The second module, Packet.dll, provides a common interface for the Win32 platform; programs that call Packet.dll can run on different versions of Windows without recompilation. The third module, wpcap.dll, does not depend on the operating system and provides higher-level, more abstract functions.

Packet.dll maps directly onto the kernel calls, while wpcap.dll provides friendlier and more powerful function calls. WinPcap's advantage is that it offers a standard capture interface compatible with libpcap, so many network analysis tools originally written for UNIX can be ported and new network analysis tools can be developed quickly. It also takes performance and efficiency into account, including kernel-level filtering in NPF, a kernel-level statistics mode, and the ability to send packets.

2.2 HTTP Protocol Network Packet Extractions

The raw packets captured by WinPcap are parsed layer by layer to separate out the HTTP segments. As shown in Figure 1, each raw network packet is split and filtered from the outermost layer inward, finally yielding the HTTP data segment.

The outermost header of a raw network packet is the Ethernet frame header, which holds the MAC addresses of the sender and the receiver. It consists of the destination MAC address, the source MAC address, and a type/length field, 14 bytes in total. The type field of this data link layer header is examined, and only packets of type 0x0800 are kept; these carry IP packets (network layer). Following the IP packet format shown in Figure 2, the IP header (20 bytes) is parsed to obtain the IP addresses of the sender and the receiver: the source IP address and the destination IP address. The protocol field of the IP header is then checked, and only TCP packets are kept (transport layer). Following the TCP packet format shown in Figure 3, the key fields are extracted from the TCP header (20 bytes): source port, destination port, sequence number (SEQ), and acknowledgment number (ACK). Finally, only packets whose source or destination port is 80 are kept; these are the HTTP packets.
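The layer-by-layer filtering described above can be sketched in a few lines. This is a minimal illustration, not the system's actual code (which the article builds with VS, MFC, and WinPcap). It assumes an Ethernet II frame and IP and TCP headers without options, matching the 14-, 20-, and 20-byte sizes given in the text. The function name `parse_headers` and the dictionary keys are hypothetical, and the value 6 for TCP in the IP protocol field comes from the IP specification rather than from this article.

```python
import struct

ETH_LEN, IP_LEN, TCP_LEN = 14, 20, 20   # header sizes stated in the text
ETHERTYPE_IP = 0x0800                   # data link type kept by the filter
IPPROTO_TCP = 6                         # assumption: standard IP protocol number for TCP
HTTP_PORT = 80


def parse_headers(frame: bytes):
    """Return the key fields of an HTTP-related packet, or None if it is filtered out."""
    if len(frame) < ETH_LEN + IP_LEN + TCP_LEN:
        return None

    # Ethernet header: destination MAC, source MAC, type field.
    dst_mac, src_mac, eth_type = struct.unpack("!6s6sH", frame[:ETH_LEN])
    if eth_type != ETHERTYPE_IP:
        return None

    # IP header: protocol at offset 9, addresses at offsets 12 and 16.
    ip = frame[ETH_LEN:ETH_LEN + IP_LEN]
    if ip[9] != IPPROTO_TCP:
        return None
    src_ip, dst_ip = ip[12:16], ip[16:20]

    # TCP header: ports, SEQ and ACK occupy the first 12 bytes.
    tcp_start = ETH_LEN + IP_LEN
    src_port, dst_port, seq, ack = struct.unpack("!HHII", frame[tcp_start:tcp_start + 12])
    if HTTP_PORT not in (src_port, dst_port):
        return None

    return {
        "src_mac": ":".join(f"{b:02x}" for b in src_mac),
        "dst_mac": ":".join(f"{b:02x}" for b in dst_mac),
        "src_ip": ".".join(str(b) for b in src_ip),
        "dst_ip": ".".join(str(b) for b in dst_ip),
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "payload": frame[tcp_start + TCP_LEN:],   # what remains after the 54-byte header
    }
```

A real capture would also need to read the header-length fields, since IP or TCP options make the headers longer than 20 bytes; the fixed offsets here simply mirror the sizes used in the text.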

Fig 1: Original Network Packet Parsing Process

At this point, layer-by-layer parsing of the raw network packet has separated out the important fields: source MAC address, destination MAC address, source IP address, destination IP address, source port, destination port, sequence number, and acknowledgment number. After the 54-byte header (14 + 20 + 20 bytes) is removed from the raw packet, what remains is the part we need: the image data transmitted over HTTP.

2.3 Related Fields and Data Storage Section

To facilitate the later data reassembly, the important fields of each packet are stored in a MySQL database. The MAC and IP addresses identify the network locations of the client and the server, and the source and destination port numbers identify the ports that the server and the client opened for the transmission. The SEQ and ACK values mark the position of each packet in the transmission.

Querying the MySQL database quickly reveals the different requests that the client sent to the server. Based on the properties of a request, all packets that the server sent in response to it can be separated out from the database.

Finally, the data part of each HTTP packet is stored in a response file on the hard disk, so that the later reassembly and restoration can be done directly by reading binary files.
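The storage step might look roughly like the sketch below. The article uses MySQL but does not give its table layout, so the table name, column names, and helper functions here are all hypothetical, and Python's built-in `sqlite3` module stands in for MySQL so that the example runs without a database server.

```python
import sqlite3

# Hypothetical layout: one row per captured packet, holding the fields
# named in the text plus the path of the file that stores the data part.
SCHEMA = """
CREATE TABLE packets (
    id           INTEGER PRIMARY KEY,
    src_mac      TEXT, dst_mac  TEXT,
    src_ip       TEXT, dst_ip   TEXT,
    src_port     INTEGER, dst_port INTEGER,
    seq          INTEGER, ack   INTEGER,
    data_len     INTEGER,
    payload_file TEXT
)
"""

INSERT = """
INSERT INTO packets (src_mac, dst_mac, src_ip, dst_ip, src_port, dst_port,
                     seq, ack, data_len, payload_file)
VALUES (:src_mac, :dst_mac, :src_ip, :dst_ip, :src_port, :dst_port,
        :seq, :ack, :data_len, :payload_file)
"""


def open_store(path=":memory:"):
    """Open the (stand-in) database and create the packet table."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_packet(conn, fields):
    """Insert one packet's important fields, given as a dict keyed by column name."""
    conn.execute(INSERT, fields)


def client_requests(conn, server_port=80):
    """List packets that carry data from a client to the server port, in capture order."""
    rows = conn.execute(
        "SELECT id, seq, data_len FROM packets "
        "WHERE dst_port = ? AND data_len > 0 ORDER BY id",
        (server_port,),
    )
    return rows.fetchall()
```

Keeping the data part in a file and only its path in the table follows the split described above between database fields and binary files on disk.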

Fig 2: IP Packet Format

Fig 3: TCP Packet Format

3. PACKETS RESTRUCTURING AND RESTORE IMAGES

3.1 Extraction GET Request Packet

After the first pass of packet parsing, the important fields are stored in the corresponding table of the MySQL database, and the data part of each packet is stored in a binary file on the hard disk.

To reassemble the packets, the first step is to extract the request packets that the client sent to the server. In an HTTP environment these are mainly GET requests. Whether a packet is a request from the client is determined by checking whether the raw packet data contains the "GET" field.

Displaying the TCP packet contents in hexadecimal (Figure 4) and analyzing them shows that an HTTP GET request packet contains three important identifying fields: URI, Referrer, and HOST. Each appears in the TCP data area with a fixed, characteristic value and format. The uniform resource identifier (URI) has the fingerprint "0x47, 0x45, 0x54, 0x20, (***), 0x20", that is, "GET " followed by (***) and a space, where (***) is the URI of the requested resource. The Referrer identifies the URI from which the current resource was referenced; its fingerprint is "0x52, 0x65, 0x66, 0x65, 0x72, 0x65, 0x72, 0x3a, 0x20, (***), 0x0d, 0x0a", that is, "Referer: " (the spelling used by the HTTP header) followed by (***) and CRLF, where (***) is the referring URI. HOST gives the network host and port number of the requested resource and is used to locate the resource on the network; its fingerprint is "0x48, 0x6f, 0x73, 0x74, 0x3a, 0x20, (***), 0x0d, 0x0a", that is, "Host: " followed by (***) and CRLF, where (***) is the specific host.

As Figure 4 shows, the 54 bytes before "GET" are the packet header, whose important fields are all stored in the MySQL database. A GET request yields a good deal of information: protocol version, browser version, fonts, language, and so on. The information after "GET" shows that the requested content is an image, logo4w.png, and the string after the Referrer field shows the network address being accessed: http://www.google.com.hk.
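The three fingerprints can be matched directly against the data part of a packet. The sketch below expresses them as byte-level regular expressions; the function names and the list of image extensions are assumptions made for illustration, since the article does not say how it decides that a URI names a picture.

```python
import re

# Byte patterns corresponding to the fingerprints in the text.
GET_RE = re.compile(rb"GET (\S+) ")                 # 0x47 0x45 0x54 0x20 (***) 0x20
REFERER_RE = re.compile(rb"Referer: ([^\r\n]*)\r\n")
HOST_RE = re.compile(rb"Host: ([^\r\n]*)\r\n")

# Assumption: image requests are recognised by file extension.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".bmp")


def _text(match):
    """Decode a captured group, or return None when the field is absent."""
    return match.group(1).decode("latin-1") if match else None


def parse_get(payload: bytes):
    """Return URI, Referer and Host of a GET request, or None for other packets."""
    request = GET_RE.match(payload)      # "GET " must open the data part
    if request is None:
        return None
    return {
        "uri": _text(request),
        "referer": _text(REFERER_RE.search(payload)),
        "host": _text(HOST_RE.search(payload)),
    }


def is_image_request(uri: str) -> bool:
    """Check the path part of the URI (query string removed) for an image extension."""
    return uri.split("?", 1)[0].lower().endswith(IMAGE_EXTENSIONS)
```

Anchoring the GET pattern at the start of the data part avoids treating a response body that happens to contain the text "GET" as a request.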

Fig 4: Hexadecimal display of packet

3.2 Extraction Response Packet

Once it is determined that the content requested by a GET request is an image, the key fields of the request packet are retrieved from the MySQL database and used as query parameters to extract from the tables all packets that the server sent in response to that request. The selection criteria are as follows. The request packet has the six-tuple <source IP, source MAC address, source port, destination IP, destination MAC address, destination port>, and the six-tuple of each response packet is its reverse. The ACK number of the response packets equals the SEQ number of the request packet plus the data length of the request packet. Because all response packets that the server sends for the same request carry the same ACK number, the six-tuple together with the calculated ACK number selects all response packets for the request. Figure 5 lists the GET request packets sent from the client to the server.
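The two selection criteria can be written down as a short sketch. The names `SixTuple`, `response_tuple`, `expected_ack`, and `is_response` are hypothetical; the default header length of 54 bytes is the figure given in Section 2.2, and the reduction modulo 2^32 reflects the 32-bit width of TCP sequence numbers, which the article does not discuss.

```python
from typing import NamedTuple

HEADER_LEN = 54   # Ethernet (14) + IP (20) + TCP (20) bytes, as in Section 2.2


class SixTuple(NamedTuple):
    src_ip: str
    src_mac: str
    src_port: int
    dst_ip: str
    dst_mac: str
    dst_port: int


def response_tuple(request: SixTuple) -> SixTuple:
    """The six-tuple of a response is the request's six-tuple with both ends swapped."""
    return SixTuple(request.dst_ip, request.dst_mac, request.dst_port,
                    request.src_ip, request.src_mac, request.src_port)


def expected_ack(seq: int, packet_len: int, header_len: int = HEADER_LEN) -> int:
    """ACK carried by the responses: request SEQ plus request data length."""
    data_len = packet_len - header_len
    return (seq + data_len) % 2**32    # sequence numbers are 32-bit and wrap around


def is_response(pkt_tuple: SixTuple, pkt_ack: int,
                request: SixTuple, request_seq: int, request_len: int) -> bool:
    """True when a packet matches both criteria for the given request."""
    return (pkt_tuple == response_tuple(request)
            and pkt_ack == expected_ack(request_seq, request_len))
```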

Parsing the 535th captured packet identifies it as a GET request for an image, logo4w.png. Its SEQ number is 348371802 and the packet length is 755 bytes. Applying the formula above (the data length is the packet length minus the 54-byte header), the ACK number of every response packet sent by the server for this request is 348371802 + 755 - 54 = 348372503. Figure 6 shows all response packets that the server sent to the client for this request.

3.3 TCP Fragment Reassembly

Because packets may be retransmitted or lost during transmission, the response packets must first be reassembled. The packets are sorted by SEQ number, which is the order in which the data was sent. Assume that the current TCP segment has SEQ value seq1 = 100 and data length (datalen) len1 = 100.

The next packet may then fall into one of several cases, each of which must be handled. Let the SEQ value of the next packet be seq2 and its data length be len2. If seq2 = 200, this is the normal, expected next packet. If seq2 = 100 and len2 = 100, the packet is a complete duplicate of the previous packet and should be dropped. If seq2 = 100 and len2 = 50, the packet is a fragment of the previous packet and should also be discarded. If seq2 = 150 and len2 = 30, the packet overlaps the previous packet; the duplicated part is removed, the packet's data length is updated to cover only the remaining data, and the updated packet is kept.
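The cases above reduce to comparing each packet's SEQ value with the next expected SEQ value. The following sketch is one possible way to express that rule, not the article's implementation; the function name `reassemble` is hypothetical. Two points are assumptions: the sketch stops at the first gap (a missing packet), a situation the text does not cover, and it ignores wraparound of the 32-bit sequence number. Note also that a packet with seq2 = 150 and len2 = 30 lies entirely inside data already held, so trimming leaves nothing to add, whereas a hypothetical packet with seq2 = 150 and len2 = 100 would contribute its last 50 bytes.

```python
def reassemble(segments):
    """Merge (seq, data) pairs into one contiguous byte string.

    Duplicates and fragments of data already held are dropped, and a
    packet that partly overlaps earlier data is trimmed to its new bytes.
    """
    merged = bytearray()
    next_seq = None                      # SEQ value expected for the next new byte

    for seq, data in sorted(segments, key=lambda item: item[0]):
        if next_seq is None:
            next_seq = seq               # the lowest SEQ starts the stream
        if seq > next_seq:
            break                        # assumption: stop at a gap (lost packet)

        overlap = next_seq - seq         # bytes of this packet already held
        if overlap >= len(data):
            continue                     # full duplicate or fragment: drop it

        merged += data[overlap:]         # keep only the part that is new
        next_seq = seq + len(data)

    return bytes(merged)
```

For the normal case the overlap is zero and the whole data part is appended, so the same few lines handle every case in the paragraph above.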

Fig 5: Important fields of the GET request packets

Fig 6: Important fields of the response packets

3.4 Restore Image Files

After reassembly, the TCP segments are arranged in order of their SEQ values, so that the SEQ value of each packet plus its data length equals the SEQ value of the next packet. This forms a set of packet fragments, and the concatenation of the data in every fragment of the set is the complete image data.

When the TCP connection is closed, the server has sent all packets of its response to the client's request. The data parts of the TCP segments processed above are then taken out and assembled into the complete file data, which finishes the reassembly for the request.

When the server responds to a GET request, the end of the response is marked in one of two ways, determined by which header field is returned for the request: "Transfer-Encoding" or "Content-Length". With the former, the end of data transmission is identified by the byte string "0x0D, 0x0A, 0x0D, 0x0A". With the latter, the "Content-Length" field is followed by a number, so once the length of the received data reaches that number, all the data has been sent.
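A completion check along these lines could be sketched as follows. It assumes that the two fields are read from the header of the server's response and that the header ends at the first blank line; the function names are hypothetical. For the "Transfer-Encoding" case the sketch only looks for the terminating byte string named in the text and does not decode the transfer encoding itself.

```python
END_MARK = b"\r\n\r\n"    # 0x0D, 0x0A, 0x0D, 0x0A


def split_response(raw: bytes):
    """Split reassembled response bytes into (header fields, body).

    Returns None while the blank line that ends the header has not arrived.
    """
    head, separator, body = raw.partition(END_MARK)
    if not separator:
        return None

    fields = {}
    for line in head.split(b"\r\n")[1:]:          # skip the status line
        name, colon, value = line.partition(b":")
        if colon:
            key = name.strip().lower().decode("latin-1")
            fields[key] = value.strip().decode("latin-1")
    return fields, body


def response_complete(fields: dict, body: bytes) -> bool:
    """Apply the two end-of-transfer rules described in the text."""
    if "content-length" in fields:
        return len(body) >= int(fields["content-length"])
    if "transfer-encoding" in fields:
        return body.endswith(END_MARK)
    return False    # neither field present: rely on the connection closing
```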

The whole merging process is carried out on binary files. The name and type of the image are extracted from the information after the "GET" field and used to name the image file. All the data parts are then written in order, in binary form, to the image file, which completes the restoration of the image.
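The final write-out step is small enough to show in full. In this sketch the function names, the fallback file name, and the sample bytes in `demo` are hypothetical; the only behavior taken from the text is that the file is named after the last part of the requested URI and that the data parts are written in order in binary mode.

```python
import os
import posixpath
import tempfile


def image_name(uri: str) -> str:
    """Take the file name from the URI path, ignoring any query string."""
    path = uri.split("?", 1)[0]
    return posixpath.basename(path) or "unnamed.bin"   # hypothetical fallback name


def write_image(directory: str, uri: str, parts) -> str:
    """Write the ordered data parts to one binary file and return its path."""
    target = os.path.join(directory, image_name(uri))
    with open(target, "wb") as out:
        for part in parts:
            out.write(part)
    return target


def demo():
    """Write two sample parts to a temporary directory and read the result back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = write_image(tmp, "/logo4w.png", [b"\x89PNG", b"\r\n\x1a\n"])
        with open(path, "rb") as restored:
            return os.path.basename(path), restored.read()
```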

Figure 7 and Figure 8 show the image files restored when accessing http://www.google.com.hk.

Fig 7: logo4w.png

Fig 8: nav_logo143.png

4. CONCLUSION

Capturing raw network packets with the WinPcap library is fast and has a high success rate, with almost no packet loss. The raw packets are parsed according to the IP, TCP, and HTTP packet formats, and the important fields are extracted and stored in the MySQL database, which greatly reduces the burden of the later data reassembly and makes it more efficient.

Through detailed analysis of the data portion of the response packets, TCP segment reassembly is completed on the basis of the important field information; this effectively removes duplicated data and improves the accuracy of the merged data. The data of all reassembled TCP segments is then written to a binary file in order, which completes the restoration of the image file and, in turn, helps reconstruct the network user's behavior.