FTP versus HTTPS in EOSDIS Data Access
WGISS 40 – September 30, 2015
Andrew Mitchell
Agenda
• User Registration System – URS
– Earthdata Login
• Requiring Registration for Data Access at EOSDIS
– FTP/HTTP Comparison
• URS Guidance and Policy
• FTP retirement at Data Centers
– Lessons Learned
• Backup: File Transfer Protocol (FTP/HTTP)
– Engineering Perspective
– Performance Study
NASA USER REGISTRATION – EARTHDATA LOGIN
Earthdata Login
Capturing User’s Area of Interest
Study Areas & Application Domains
NASA - Primary study area*: Air sea interaction, Atmospheric aerosols, Biological Oceanography, Clouds, Cryospheric studies, Geophysics, Global biosphere, Human dimensions of global change, Hydrologic cycle, Land processes, Physical Oceanography, Polar processes, Radiation budget, Sea ice, Tropospheric chemistry, Upper atmospheric composition, Upper atmospheric dynamics, Other
ESA - Primary Application Domain*: Atmosphere, Sea-Ice, Geodesy, Geology, Hazards, Hydrology, Ice, Land Environment, Methods, Oceanography, Renewable Resources, Topographic Mapping, Other, Calibration/Validation, Coastal Zones
Federated User Identity Study
• Performing a study of other (non-OAuth2) Single Sign-On technologies that will allow Earthdata Login to become interoperable with user registration systems from other systems and agencies.
Architecture
• LDAP store
• LDAP proxy (via LDAP store)
• HTTP-accessible RESTish API
• FTP clients
• HTTP clients
• Web-based user maintenance
REQUIRING REGISTRATION FOR DATA ACCESS AT EOSDIS
FTP and HTTP comparison
Impact of requiring authentication with FTP at DAACs
Advantages
• Minimal impact to existing users
• Minimal impact to data centers
• No changes to firewall rules or similar configuration
• *Direct support for anonymous access
• Maturity of capability / protocol
Disadvantages
• Multiple flavors deployed at the data centers (5 different ftp servers)
• No direct support for LDAP authentication on some of the flavors
• Not authenticated securely: some flavors unable to support secure authentication
• Prohibited at LP DAAC due to DoI regulations
• Does not integrate well with REST API for support of OpenID or OGC
Impact of requiring authentication with HTTP at DAACs
Advantages
• Comprehensive support from the user community: protocol is well established and mature, all data centers use the same http server (apache)
• Modules can be applied to support many extensions and metrics gathering unavailable to certain ftpds
• Easily accommodates a REST API and provides well established LDAP modules for simple configuration and integration
• Permitted as a transfer protocol by the DoI
• Supports a secure authentication mechanism (https)
Disadvantages
• End user scripts will have to change, as will manual access to the files they access
• Data center configurations will have to change (on the firewall and the apache server)
• DAACs custom code will have to change
• Data Center customizations and extensions will need to be modified
URS GUIDANCE & POLICY
Guidance for EOSDIS DAACs, Subsystems And Applications
Purpose: To provide guidance and clarify the integration requirements for the URS into EOSDIS systems and components.
Scope: This guidance applies to all EOSDIS DAACs, subsystems (ECHO, GCMD, Earthdata, GIBS, etc.) and related EOSDIS services and applications (including Reverb, ASTER GDEM Explorer, ASF Vertex, etc.).
• Guidance: URS will be implemented by DAACs, subsystems and related services for the following capabilities:
– Downloading science data files from HTTP, HTTPS and FTP services.
– Web services and tools allowing access to science data files (e.g. OPeNDAP, Web Coverage Services, analysis tools, DAAC-unique ordering tools).
– Online collaboration and comment tools (e.g. Wikis, Forums, Code Repositories).
– Other tools and services that currently have optional or required user registration.
• Registration is NOT required:
– Read-access to Web pages and documentation.
– Data discovery services such as Reverb, Earth Data Search Client (ESDC), Global Change Master Directory keyword services, CMR and DAAC unique search clients.
• Note: This portion of the policy applies up until the point where science data downloads are performed or write operations such as saving search parameters, inputting or updating metadata records are performed.
Evolution and Transition Planning
• URS is available and this guidance will go into immediate effect.
– A staggered approach will be used to implement URS throughout DAACs, subsystems and applications.
– Schedules and transition plans for implementation will be negotiated between affected systems and ESDIS.
• Milestones and Timeline
– In 2015, HTTPS Access with URS 4 (SSO) must be available for all current equivalent FTP/HTTP Access.
– DAACs, subsystems and applications are allowed to run HTTPS access and FTP/HTTP* access in parallel.
FTP RETIREMENT AT DATA CENTERS
Lessons Learned
Near Real Time Data Access (LANCE)
HTTPS File Distribution Requirements for LANCE
• LANCE Elements shall integrate with the URS and restrict access to NRT data to users with valid URS accounts.
• URL structure should be decided by the data providers.
• From a user's perspective, it should be possible to get all the files simply by using curl or wget.
– e.g.: wget -r https://foo.nasa.gov/data/OMI/OMTO3/2007/05/11
– which would download all the OMTO3 data files and the Manifest for the date 2007/05/11.
– To get the entire month use: wget -r https://foo.nasa.gov/data/OMI/OMTO3/2007/05
– To get the entire year use: wget -r -nd https://foo.nasa.gov/data/OMI/OMTO3/2007
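The day/month/year URL pattern in the wget examples can be sketched as a small helper. This is only an illustration: the host foo.nasa.gov is the placeholder used on the slide, and the function names lance_url and wget_command are hypothetical, not part of any LANCE tooling. Actual URL structures are decided by each data provider.

```python
from datetime import date
from typing import List, Optional

# Placeholder base URL taken from the slide's example; not a real endpoint.
BASE_URL = "https://foo.nasa.gov/data"


def lance_url(instrument: str, product: str, year: int,
              month: Optional[int] = None, day: Optional[int] = None) -> str:
    """Build a year, month, or day directory URL following the slide's pattern."""
    if day is not None and month is None:
        raise ValueError("day requires month")
    if month is not None and not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    parts = [BASE_URL, instrument, product, f"{year:04d}"]
    if month is not None:
        parts.append(f"{month:02d}")
    if day is not None:
        date(year, month, day)  # raises ValueError for impossible dates
        parts.append(f"{day:02d}")
    return "/".join(parts)


def wget_command(url: str) -> List[str]:
    """Return the recursive wget invocation shown on the slide, as an argument list."""
    return ["wget", "-r", url]


if __name__ == "__main__":
    print(" ".join(wget_command(lance_url("OMI", "OMTO3", 2007, 5, 11))))
```

Returning the command as an argument list rather than a string keeps it safe to pass to a process launcher without shell quoting concerns.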
LP DAAC migration to HTTP
• The LP DAAC switched from FTP to HTTP for data access on June 4, 2013. This change was advertised on the LP DAAC Web site as a News item. Users who do not regularly visit our page are encouraged to subscribe to the RSS News Feed (https://lpdaac.usgs.gov/news_feed) so as not to miss future announcements.
• The News Item for the FTP to HTTP switch is available at (https://lpdaac.usgs.gov/lp_daac_discontinue_anonymous_ftp_june_4_2013). Note: The cURL command handles HTTP and has been used by some to update their scripted access to Data Pool.
• LP DAAC provides a good model for HTTPS data distribution: https://lpdaac.usgs.gov/data_access/data_pool
User Feedback
“I think that the data should be delivered by a ftp server, because in my case, here in PARAGUAY the internet signal is not stable. During downloads, my connection was interrupted many times forcing me to restart the request process and download it again.”
“We used to receive order by email as ftp, currently it is only http, which is taking more time in downloading, can we go back to ftp option ?”
“The problem I have with the http protocol is I don't know how to automate my wget script to get new data. With ftp I can use a wildcard at the end of the full file path. With the current naming of the .hdf files, MYD11C1.A2013153.005.2013155051730.hdf I don't know the filenames ahead of time, so I cannot even use a brute force, name every file to get approach. Is there some way you can recommend to automatically get these data? Can I request an automatic push to my incoming ftp site?”
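One way to approximate FTP-style wildcards over HTTP is to fetch the directory's HTML index page and filter its links client-side. The sketch below assumes the server returns an Apache-style listing with plain anchor tags. The function name match_listing, the sample listing, and the example.invalid base URL are all hypothetical. Fetching the page with Earthdata Login credentials is left out.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def match_listing(listing_html: str, base_url: str, pattern: str) -> List[str]:
    """Return absolute URLs of listed files whose names match a shell-style pattern."""
    parser = _LinkCollector()
    parser.feed(listing_html)
    matches = []
    for href in parser.hrefs:
        filename = href.rsplit("/", 1)[-1]  # empty for directory links
        if filename and fnmatchcase(filename, pattern):
            matches.append(urljoin(base_url, href))
    return matches


if __name__ == "__main__":
    # Hypothetical listing; real index pages differ between servers.
    sample = (
        '<a href="?C=N;O=D">Name</a>'
        '<a href="/data/">Parent Directory</a>'
        '<a href="MYD11C1.A2013153.005.2013155051730.hdf">data</a>'
        '<a href="MYD11C1.A2013153.005.2013155051730.hdf.xml">metadata</a>'
    )
    base = "https://example.invalid/pool/2013.06.02/"
    for url in match_listing(sample, base, "MYD11C1.A2013153.*.hdf"):
        print(url)
```

The matched URLs can then be handed to wget or curl one by one, which restores the "wildcard at the end of the path" workflow without knowing the production timestamp in the filename.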
Summary
• Many of our users use scripts to get data from our anonymous FTP servers, so this transition will require social as well as technical changes.
• We are gathering use cases and lessons learned from other DAACs in addition to providing ‘recipes’, reference software to automate authenticated HTTPS downloads, bulk download web clients, user tutorials and documentation.
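As one possible 'recipe' for users on unstable connections, an interrupted HTTPS download can be resumed with an HTTP Range request instead of restarting from zero. The sketch below only covers the decision logic. The function names are hypothetical, and the actual authenticated request is omitted. It relies on standard HTTP semantics: status 206 means the server honored the range, while 200 means it sent the whole file again.

```python
from typing import Dict


def resume_headers(bytes_on_disk: int) -> Dict[str, str]:
    """Request headers that ask the server to skip bytes already downloaded."""
    if bytes_on_disk < 0:
        raise ValueError("bytes_on_disk must be non-negative")
    if bytes_on_disk == 0:
        return {}  # nothing saved yet: plain full download
    return {"Range": f"bytes={bytes_on_disk}-"}


def file_mode_for_status(status: int) -> str:
    """Choose how to open the local file based on the server's response status."""
    if status == 206:
        return "ab"  # partial content: append to what we already have
    if status == 200:
        return "wb"  # server ignored the range: start the file over
    raise ValueError(f"unexpected HTTP status {status}")


if __name__ == "__main__":
    print(resume_headers(1048576), file_mode_for_status(206))
```

Command-line tools offer the same behavior, for example wget -c and curl -C -, so scripted users may not need custom code at all.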
Summary
• URS is also being enhanced to work with multiple web services (e.g. OGC, OAI-PMH, OPeNDAP, REST/SOAP).
• How to get HTTPS directory listings fast: https://wiki.earthdata.nasa.gov/display/HDD/HTTP+Data+Distribution+Home
• Some DAACs will be exempt from the HTTP requirement (via waivers)
– Our CDDIS DAAC is serving over 1.8M files and 380 Gbytes/day to over 13K distinct users via FTP.
FILE TRANSFER PROTOCOL ENGINEERING PERSPECTIVE
Backup - FTP versus HTTP
FTP/HTTP Comparison

FTP: Contains notion of file format: allows transfers of data to be ASCII or binary (+)
HTTP: Always sends data binary (neutral)

FTP: No metadata is provided with files (-)
HTTP: HTTP provides metadata with files (+)

FTP: Does not provide headers since no metadata is transferred (-)
HTTP: Transfers with headers that contain information such as last modified date, character encoding, server name and version, etc. (+)

FTP: FTP allows requesting multiple files to get transferred in parallel using the same control connection (-)
HTTP: Supports pipelining: clients are able to ask for the next transfer before the previous one has ended (+)

FTP: Since FTP doesn't utilize pipelining, new TCP connections are required for each transfer, so performance metrics are affected (-)
HTTP: Pipelining allows multiple documents to get sent without a round-trip delay between documents, which helps with speed optimization (+)

FTP: Clients send commands to which the server responds, and a single transfer can involve a long series of commands. Each command costs a round trip, so retrieving a single FTP file can easily take up to 10 round trips. (-)
HTTP: Uses one request and one response for each document (+)

FTP: Uses two connections, where the second connection uses dynamic port numbers. Requires firewall admins to understand FTP at the application protocol layer to work well (-)

FTP: If both parties are behind Network Address Translation (NAT), you cannot use FTP (-)

FTP: Since firewalls need to understand FTP to open ports for the secondary connection, there is a huge problem with encryption (FTP-SSL, or FTPS): the control connection is sent encrypted, so firewalls cannot interpret the commands that deal with creating the second connection (-)

FTP: Not as many options available to prevent FTP from sending passwords as plain text (-)
HTTP: HTTP does not send passwords as plain text (+)
FTP/HTTP Comparison (cont'd)

FTP: Resumed transfers for FTP that start beyond the 2GB position have been known to cause trouble (-)
HTTP: Supports more advanced byte ranges (+)

FTP: FTP must create a new connection for each new data transfer. Repeatedly doing this is bad for performance due to new handshakes/connections all the time (-)
HTTP: Client can maintain a single connection to a server and keep using that for any number of transfers (+)

FTP: Does not use chunked encoding (-)
HTTP: Utilizes chunked encoding, where the sender transmits a stream of data blocks until there is no more data to send, then sends a zero-size chunk to signal the end of it. (+)

FTP: FTP uses plain closing of the connection, which makes it more difficult to detect premature connection shutdowns (-)
HTTP: Chunked encoding makes it possible to detect premature connection shutdowns (+)

FTP: FTP offers an official "built-in" run length encoding that compresses the amount of data to send, but not by a great enough amount on ordinary binary data (neutral)
HTTP: Allows client and server to negotiate and choose among several compression algorithms (+)

FTP: FTP supports "third party transfers" wherein a client is allowed to ask a server to send data to a third host, a host that isn't the same as the client. This is typically disabled in modern FTP servers due to security implications (-)
HTTP: Does not support "third party transfers" (FXP) (+)

FTP: Many FTP servers do not have the ability to support IPv6 (-)
HTTP: HTTP supports IPv6 (+)

FTP: Cannot do name-based virtual hosting at all (-)
HTTP: Easily host many sites on the same server that are all differentiated by name (+)

FTP: FTP has commands for listing directory contents of the remote server (+)
HTTP: Concept does not exist in HTTP (-)

FTP: FTP has not been standardized for proxies, so this functionality is generally done in lots of different ad-hoc approaches (-)
HTTP: HTTP has built-in support for proxies natively. (+)

Legend: Performance (speed), Security
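The chunked encoding row in the comparison can be illustrated with a minimal encoder and decoder for the HTTP/1.1 chunk framing: each block is preceded by its size in hexadecimal, and a zero-size chunk marks the end. This is a simplified sketch, not a full implementation. Trailers after the final chunk are ignored, and the names encode_chunked, decode_chunked and TruncatedBody are made up for the example.

```python
from typing import Iterable


class TruncatedBody(Exception):
    """Raised when the stream ends before the zero-size terminating chunk."""


def encode_chunked(blocks: Iterable[bytes]) -> bytes:
    """Frame data blocks as HTTP/1.1 chunks, ending with a zero-size chunk."""
    out = bytearray()
    for block in blocks:
        if not block:
            continue  # an empty block would look like the end marker
        out += f"{len(block):X}\r\n".encode("ascii") + block + b"\r\n"
    out += b"0\r\n\r\n"
    return bytes(out)


def decode_chunked(data: bytes) -> bytes:
    """Reassemble the body, raising TruncatedBody on a premature end."""
    body = bytearray()
    pos = 0
    while True:
        eol = data.find(b"\r\n", pos)
        if eol == -1:
            raise TruncatedBody("stream ended before a chunk-size line")
        # Drop any chunk extension after ';' and parse the hex size.
        size = int(data[pos:eol].split(b";", 1)[0].strip(), 16)
        pos = eol + 2
        if size == 0:
            return bytes(body)  # terminating chunk: transfer is complete
        chunk = data[pos:pos + size]
        if len(chunk) < size or data[pos + size:pos + size + 2] != b"\r\n":
            raise TruncatedBody("chunk shorter than its announced size")
        body += chunk
        pos += size + 2


if __name__ == "__main__":
    wire = encode_chunked([b"hello ", b"world"])
    print(decode_chunked(wire))
```

Because the receiver must see the zero-size chunk, a connection that drops mid-transfer is distinguishable from a completed one, which is the detection advantage the comparison attributes to HTTP.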
FILE TRANSFER PROTOCOL PERFORMANCE STUDY
Backup
Study Background
• Sending files over a high-speed network doesn’t guarantee that the end-to-end performance will match the network capacity or meet user expectations. When transferring data, network latency (round-trip time or RTT) and packet loss can impact the transmission rate in conjunction with the file transfer protocol used, and the characteristics and tuning parameters of the end systems.
• EOSDIS performed a study of a set of file transfer protocols from ESDIS Networks to determine how each one performed in different network environments
– All protocols studied use TCP for transport
Study Summary
• High speed networks don’t come with high speed end-to-end performance guarantees
– File transfer protocol performance impacted by file size, host buffer size and TCP behavior
• Network latency (round-trip time, RTT) and packet loss
• Most common file transfer protocols were designed when network capacity was much less than today
– FTP over TCP/IP was developed in the 1980s
– Single TCP stream
• New file transfer protocols are designed to better adapt to changes in high speed network environments
– Multiple, parallel TCP streams
• Other strategies are being employed to increase performance
– Increasing packet size
– Encrypting only sensitive data
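The effect of buffer size, RTT and packet loss on a single TCP stream can be estimated with two standard rules of thumb. Neither is taken from the EOSDIS study itself: the window limit (throughput cannot exceed buffer size divided by RTT) and the Mathis et al. approximation for loss-limited TCP. The sketch below treats parallel streams as simply multiplying the per-stream bound up to link capacity, which is a deliberate simplification. All function names and example values are illustrative assumptions.

```python
import math


def window_limited_bps(window_bytes: float, rtt_s: float) -> float:
    """At most one window of data can be in flight per round trip."""
    return window_bytes * 8 / rtt_s


def loss_limited_bps(mss_bytes: float, rtt_s: float, loss_rate: float) -> float:
    """Mathis et al. approximation: rate ~ (MSS / RTT) * sqrt(3/2) / sqrt(p)."""
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5 / loss_rate)


def transfer_bound_bps(link_bps: float, streams: int, window_bytes: float,
                       mss_bytes: float, rtt_s: float, loss_rate: float) -> float:
    """Rough upper bound for `streams` parallel TCP connections on one link."""
    per_stream = min(window_limited_bps(window_bytes, rtt_s),
                     loss_limited_bps(mss_bytes, rtt_s, loss_rate))
    return min(link_bps, streams * per_stream)


if __name__ == "__main__":
    # Illustrative WAN case: 1 Gbit/s link, 64 KiB buffers, 100 ms RTT, 0.01% loss.
    for n in (1, 8):
        bound = transfer_bound_bps(1e9, n, 65536, 1460, 0.1, 1e-4)
        print(f"{n} stream(s): about {bound / 1e6:.1f} Mbit/s")
```

Under these assumed numbers a single stream is held far below link capacity by the buffer size and RTT alone, which is consistent with the study's point that multi-stream protocols help most on long-RTT paths.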
Study Conclusions
• No single file transfer protocol works best in every network environment
• Data delivery requirements should be used to determine choice of file transfer protocol
– Multi-stream protocols (bbFTP and GridFTP) are best at sending larger files over WANs (long RTT, higher packet loss)
– Efficient, single stream protocols (FTP, HTTP) work best at sending smaller files over LANs (short RTT, lower packet loss)
– Encryption processing software overhead lowers throughput
• Increased CPU load