Path

A path identifies where a file is stored and how to reach it, so that the streaming pipeline knows where to read the data from or write it to.

Create a path with IOStreams.path, passing the file name, which may also be a URI, followed by any arguments specific to that storage location. IOStreams infers the storage mechanism from the URI scheme, so the same call returns a local file path, an S3 path, an SFTP path, and so on, all sharing the same interface.
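To illustrate the idea of inferring the storage mechanism from the URI scheme, here is a minimal sketch that uses only the Ruby standard library. It is not how IOStreams is implemented: the STORAGE_BY_SCHEME table and the infer_storage helper are hypothetical names introduced for this example, and the returned symbols stand in for whatever path class a real library would return.

```ruby
require "uri"

# Hypothetical lookup table mapping a URI scheme to a storage mechanism.
# A file name without a scheme (nil) is treated as a local file.
STORAGE_BY_SCHEME = {
  nil     => :file,
  "file"  => :file,
  "s3"    => :s3,
  "sftp"  => :sftp,
  "http"  => :http,
  "https" => :http
}.freeze

# Returns a symbol naming the storage mechanism implied by the file name.
# Raises ArgumentError when the scheme is not in the table.
def infer_storage(file_name)
  scheme = URI.parse(file_name).scheme
  STORAGE_BY_SCHEME.fetch(scheme) do
    raise ArgumentError, "Unsupported URI scheme: #{scheme}"
  end
end

puts infer_storage("somewhere/example.csv")             # file
puts infer_storage("s3://bucket-name/path/example.csv") # s3
puts infer_storage("sftp://hostname/path/example.csv")  # sftp
```

Because the caller only ever passes a file name, switching storage locations amounts to changing the string, which is the property the paragraph above describes.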

IOStreams supports accessing files in the following places, each covered below: local files, AWS S3 (s3://), SFTP (sftp://), and HTTP (http://, https://).

Are you using another cloud provider and want to add support for your favorite? Check out the supplied IOStreams S3 path provider for an example of what is required. Pull requests are welcome.

File

The simplest case is a file on the local disk:

path = IOStreams.path("somewhere/example.csv")

Optional Arguments:

AWS S3 (s3://)

When the supplied file name includes the s3:// URI scheme, IOStreams returns an S3 path. For example, if AWS is configured locally:

path = IOStreams.path("s3://bucket-name/path/example.csv")

Required Arguments:

Optional Arguments:

Writer specific options:

SFTP (sftp://)

When the supplied file name includes the sftp:// URI scheme, IOStreams returns an SFTP path:

path = IOStreams.path("sftp://hostname/path/example.csv")

Read a file from a remote SFTP server:

IOStreams.path("sftp://example.org/path/file.txt", 
               username: "jbloggs", 
               password: "secret").
  reader do |input|
    puts input.read
  end

Raises Net::SFTP::StatusException when the file cannot be read.

Write to a file on a remote SFTP server:

IOStreams.path("sftp://example.org/path/file.txt", 
               username: "jbloggs", 
               password: "secret").
  writer do |output|
    output.write('Hello World')
  end

Display the contents of a remote file, supplying the username and password in the URL:

IOStreams.path("sftp://jack:OpenSesame@test.com:22/path/file_name.csv").reader do |io|
  puts io.read
end

Use an identity file instead of a password to authenticate:

path = IOStreams.path("sftp://test.com/path/file_name.csv", 
                      username: "jack", 
                      ssh_options: {IdentityFile: "~/.ssh/private_key"})
path.reader do |io|
  puts io.read
end

Pass in the IdentityKey itself instead of a password to authenticate. For example, retrieve the identity key stored in Secret Config:

identity_key = SecretConfig.fetch("suppliers/sftp/identity_key")

path = IOStreams.path("sftp://test.com/path/file_name.csv", 
                      username: "jack", 
                      ssh_options: {IdentityKey: identity_key})
path.reader do |io|
  puts io.read
end

Required Arguments:

Optional Arguments:

HTTP (http://, https://)

Read a remote file over HTTP or HTTPS using an HTTP GET request.

IOStreams.path('https://www5.fdic.gov/idasp/Offices2.zip').read

Notes:

Required Arguments:

Optional Arguments:

path = IOStreams.path("http://hostname/path/example.csv")

Security: untrusted URLs (SSRF)

Reading an HTTP(S) path causes the application to issue a request to the host named in the URL. When the URL, or any part of it, can be influenced by untrusted input, an attacker can point it at internal services or cloud metadata endpoints (Server-Side Request Forgery).

Because redirect targets are chosen by the remote server, validating only the URL that is passed in is not sufficient: a trusted (or compromised) server can redirect the request to an internal address. IOStreams provides a few controls to reduce this exposure:

IOStreams.path("https://supplier.example.com/report.csv", allow_hosts: ["supplier.example.com"]).read
IOStreams.path(untrusted_url, http_redirect_count: 0).read
IOStreams.path(untrusted_url, maximum_file_size: 50 * 1024 * 1024).read

Basic authentication credentials are only ever sent to the original host. They are not resent when a redirect points at a different scheme, host, or port, so a redirect cannot leak them to another server. For stronger guarantees, route these downloads through an egress proxy or network policy that blocks private, loopback, and link-local (cloud metadata) addresses.
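The two checks described above can be sketched with the Ruby standard library. This is an illustration of the technique only, not the IOStreams implementation: internal_address? and same_origin? are hypothetical helper names, and the list of blocked ranges is an assumption about what an egress policy would typically deny.

```ruby
require "ipaddr"
require "uri"

# Address ranges that an outbound download should normally never reach.
BLOCKED_RANGES = [
  "127.0.0.0/8",    # IPv4 loopback
  "10.0.0.0/8",     # private
  "172.16.0.0/12",  # private
  "192.168.0.0/16", # private
  "169.254.0.0/16", # link-local, includes common cloud metadata endpoints
  "::1/128",        # IPv6 loopback
  "fc00::/7",       # IPv6 unique local
  "fe80::/10"       # IPv6 link-local
].map { |cidr| IPAddr.new(cidr) }.freeze

# True when the literal IP address falls inside one of the blocked ranges.
def internal_address?(address)
  ip = IPAddr.new(address)
  BLOCKED_RANGES.any? { |range| range.include?(ip) }
end

# True when both URLs share the same scheme, host, and port, which is the
# condition under which resending credentials after a redirect is safe.
def same_origin?(original_url, redirect_url)
  a = URI.parse(original_url)
  b = URI.parse(redirect_url)
  [a.scheme, a.host, a.port] == [b.scheme, b.host, b.port]
end

puts internal_address?("169.254.169.254") # true
puts internal_address?("8.8.8.8")         # false
puts same_origin?("https://supplier.example.com/a", "https://supplier.example.com/b") # true
puts same_origin?("https://supplier.example.com/a", "http://supplier.example.com/a")  # false
```

Note that internal_address? only accepts literal IP addresses. A host name would have to be resolved first, and the resolved address is the one that must be checked, since a public-looking name can resolve to an internal address.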

Similarly, when using https:

path = IOStreams.path("https://hostname/path/example.csv")

This time IOStreams infers that the file lives on an HTTP server and returns an IOStreams::Paths::HTTP.

Path Operations

Paths support common file operations, regardless of where the file is stored:

path = IOStreams.path("sample/example.csv")

# Does the file exist?
path.exist?
# => true

# Size of the file in bytes.
path.size
# => 64

# Delete the file.
path.delete

# Move the file to another path, returning the target path.
path.move_to("sample/moved.csv")

# Create the directory path, when it does not already exist.
IOStreams.path("sample/data").mkpath

Inspect the components of a path’s file name:

# The last component of the path.
IOStreams.path("/home/gumby/work/ruby.rb").basename
# => "ruby.rb"

# Remove a specific suffix from the file name.
IOStreams.path("/home/gumby/work/ruby.rb").basename(".rb")
# => "ruby"

# Remove any extension by supplying ".*".
IOStreams.path("/home/gumby/work/ruby.rb").basename(".*")
# => "ruby"

# The directory portion of the path.
IOStreams.path("a/b/d/test.rb").dirname
# => "a/b/d"

# The extension, including the leading period.
IOStreams.path("a/b/d/test.rb").extname
# => ".rb"

# The extension, without the leading period.
IOStreams.path("a/b/d/test.rb").extension
# => "rb"

Notes:

Iterate over the files in a path using a wildcard pattern:

IOStreams.path("sample").each_child("*.csv") do |child|
  puts child
end

# Recursively, including sub-directories:
IOStreams.path("sample").each_child("**/*.csv") do |child|
  puts child
end

each_child is also available directly on IOStreams when the pattern includes the full path:

IOStreams.each_child("sample/**/*.csv") { |child| puts child }
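For readers unfamiliar with the wildcard syntax, the difference between "*.csv" and "**/*.csv" can be seen with the standard library's Dir.glob. This sketch assumes that the patterns above follow the usual Ruby glob conventions for local files; csv_matches is a hypothetical helper written for this example, not part of IOStreams.

```ruby
require "tmpdir"
require "fileutils"

# Returns the CSV files directly inside dir, and those found recursively.
# Paths are relative to dir and sorted for stable output.
def csv_matches(dir)
  {
    top_level: Dir.glob("*.csv", base: dir).sort,
    recursive: Dir.glob("**/*.csv", base: dir).sort
  }
end

Dir.mktmpdir do |dir|
  # Build a small directory tree to search.
  FileUtils.mkdir_p(File.join(dir, "archive"))
  FileUtils.touch(File.join(dir, "example.csv"))
  FileUtils.touch(File.join(dir, "notes.txt"))
  FileUtils.touch(File.join(dir, "archive", "old.csv"))

  matches = csv_matches(dir)
  p matches[:top_level] # ["example.csv"]
  p matches[:recursive] # ["archive/old.csv", "example.csv"]
end
```

The "**/" prefix matches zero or more directory levels, which is why the recursive pattern also returns the top-level file.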

Notes:

Using root paths

Roots allow paths to reference a particular root directory, so that all path names are appended to that root. By using IOStreams.join instead of IOStreams.path, the storage location is no longer embedded in the application code. Instead, it is configured once at startup.

The primary purpose of roots is to allow the exact same code to run in production and development, yet use completely different data sources in each. For example, in production the root can point to an S3 bucket, while in development it points to the local file system.

Roots are configured via an initializer at startup. Multiple roots can be set up, for example one for input files, another for output files, another for reports, and so on. During development the roots can all point to a common location, while in production they could be completely different S3 buckets.

For example, inside an initializer:

IOStreams.add_root(:default, "tmp/export")
IOStreams.add_root(:ftp, "tmp/ftp")

:default is used whenever a root is not supplied when calling IOStreams.join:

# Uses the :default root: "tmp/export/sample/example.csv"
path = IOStreams.join("sample", "example.csv")

# Uses the :ftp root: "tmp/ftp/sample/example.csv"
path = IOStreams.join("sample", "example.csv", root: :ftp)
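The behavior of named roots can be modeled in a few lines of plain Ruby. The RootRegistry class below is a hypothetical stand-in written for this explanation, not the IOStreams implementation; it only builds the resulting path string, whereas IOStreams.join returns a path object.

```ruby
# Hypothetical registry that maps root names to base locations.
class RootRegistry
  def initialize
    @roots = {}
  end

  # Register a base location under a name, dropping any trailing slash.
  def add_root(name, location)
    @roots[name.to_sym] = location.chomp("/")
  end

  # Append the elements to the named root, defaulting to :default.
  # Raises ArgumentError when the root has not been registered.
  def join(*elements, root: :default)
    base = @roots.fetch(root) do
      raise ArgumentError, "Unknown root: #{root.inspect}"
    end
    ([base] + elements).join("/")
  end
end

roots = RootRegistry.new
roots.add_root(:default, "tmp/export")
roots.add_root(:ftp, "tmp/ftp")

puts roots.join("sample", "example.csv")             # tmp/export/sample/example.csv
puts roots.join("sample", "example.csv", root: :ftp) # tmp/ftp/sample/example.csv
```

The calling code names only the root and the relative elements, so pointing a root at a different location changes every path built from it without touching the callers.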

The following code:

path = IOStreams.path("tmp/export", "sample", "example.csv")
path.writer(:line) do |io|
  io << "Welcome"
  io << "To IOStreams"
end

With the :default root set to "tmp/export" as above, this can be reduced to:

path = IOStreams.join("sample", "example.csv")
path.writer(:line) do |io|
  io << "Welcome"
  io << "To IOStreams"
end

Most importantly, the root path information and storage mechanism are externalized from the application code.

For example, to make the above code write to S3 in production, change the initializer to:

IOStreams.add_root(:default, "s3://my-app-bucket-name/export")
IOStreams.add_root(:ftp, "s3://my-app-ftp-bucket-name/ftp")

The code calling IOStreams.join does not change at all. See Config for more examples.