Understanding Distributed File Systems: Concepts and Requirements
Introduction to Distributed File Systems
- A distributed file system enables programs to store and access remote files exactly as they do local ones, allowing users to access files from any computer on a network.
- File systems were originally developed for centralized computer systems and desktop computers as an operating system facility providing a convenient programming interface to disk storage.
- A well-designed file service provides access to files stored at a server with performance and reliability similar to, and in some cases better than, files stored on local disks.
- The concentration of persistent storage at a few servers reduces the need for local disk storage and (more importantly) enables economies to be made in the management and archiving of the persistent data owned by an organization.
Characteristics of File Systems
- File systems are responsible for the organization, storage, retrieval, naming, sharing, and protection of files.
- Files are stored on disks or other non-volatile storage media.
- Files contain both data and attributes; the attributes are held as a single record containing information such as the length of the file, timestamps, file type, owner’s identity, and access control lists.
- A directory is a file, often of a special type, that provides a mapping from text names to internal file identifiers.
- Directories may include the names of other directories, leading to the familiar hierarchic filenaming scheme and the multi-part pathnames for files used in UNIX and other operating systems.
- The term metadata is often used to refer to all of the extra information stored by a file system that is needed for the management of files.
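As a rough illustration of the attribute record described above, the following Python sketch groups the listed attributes into one structure. The class name, field names, and default values are assumptions chosen for illustration; real file systems define their own layouts.

```python
from dataclasses import dataclass, field
import time


@dataclass
class FileAttributes:
    """Hypothetical attribute record; all field names are illustrative."""
    length: int = 0
    creation_time: float = field(default_factory=time.time)
    modify_time: float = field(default_factory=time.time)
    file_type: str = "regular"
    owner: str = ""
    # Maps a user name to the set of rights granted to that user.
    access_control_list: dict = field(default_factory=dict)


attrs = FileAttributes(
    owner="alice",
    access_control_list={"alice": {"read", "write"}},
)
print(attrs.owner, attrs.length)
```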
File System Operations
- The primitives of conventional file systems indicate the operations that file services are expected to support and serve as a point of comparison for the file service interfaces.
Distributed File System Requirements
- Initially, distributed file systems offered access transparency and location transparency; performance, scalability, concurrency control, fault tolerance, and security requirements emerged and were met in subsequent phases of development.
Transparency
- The file service is usually the most heavily loaded service in an intranet, so its functionality and performance are critical.
Concurrent File Updates
- The need for concurrency control for access to shared data in many applications is widely accepted, and techniques are known for its implementation, but they are costly.
File Replication
- In a file service that supports replication, a file may be represented by several copies of its contents at different locations.
Hardware and Operating System Heterogeneity
- The service interfaces should be defined so that client and server software can be implemented for different operating systems and computers.
Fault Tolerance
- The central role of the file service in distributed systems makes it essential that the service continue to operate in the face of client and server failures.
- The servers can be stateless, so that they can be restarted and the service restored after a failure without any need to recover previous state.
Consistency
- Consistency here means one-copy update semantics: a model for concurrent access to files in which the file contents seen by all of the processes accessing or updating a given file are those that they would see if only a single copy of the file contents existed.
Security
- Virtually all file systems provide access-control mechanisms based on the use of access control lists.
- In distributed file systems, there is a need to authenticate client requests so that access control at the server is based on correct user identities and to protect the contents of request and reply messages with digital signatures and (optionally) encryption of secret data.
Efficiency
- A distributed file service should offer facilities that are of at least the same power and generality as those found in conventional file systems and should achieve a comparable level of performance.
File Service Architecture
- An architecture that offers a clear separation of the main concerns in providing access to files is obtained by structuring the file service as three components – a flat file service, a directory service, and a client module.
- The flat file service and the directory service each export an interface for use by client programs, and their RPC interfaces, taken together, provide a comprehensive set of operations for access to files.
- The client module provides a single programming interface with operations on files similar to those found in conventional file systems.
- The design is open in the sense that different client modules can be used to implement different programming interfaces, simulating the file operations of a variety of different operating systems and optimizing the performance for different client and server hardware configurations.
- Directory service: The directory service provides a mapping between text names for files and their unique file identifiers (UFIDs). It provides the functions needed to generate directories, to add new file names to directories, and to obtain UFIDs from directories.
- Client module: A client module runs in each client computer, integrating and extending the operations of the flat file service and the directory service under a single application programming interface that is available to user-level programs in client computers.
Flat File Service Interface
- The Read and the Write operations require a parameter i specifying a position in the file.
The Read Operation
- Copies the sequence of n data items beginning at item i from the specified file into Data, which is then returned to the client.
The Write Operation
- Copies the sequence of data items in Data into the specified file beginning at item i, replacing the previous contents of the file at the corresponding position and extending the file if necessary.
- Create: Creates a new, empty file and returns the UFID that is generated.
- Delete: Removes the specified file.
- GetAttributes and SetAttributes: Enable clients to access the attribute record.
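The operations listed above can be sketched as a small in-memory service. This is a minimal sketch under stated assumptions, not a definitive implementation: the class name, the use of random hex strings as UFIDs, zero-based positions, zero-padding of gaps, and the rule that clients cannot set the length attribute are all choices made for this example.

```python
import uuid


class FlatFileService:
    """In-memory sketch; files are byte sequences keyed by UFID."""

    def __init__(self):
        self._data = {}   # UFID -> bytearray holding the file contents
        self._attrs = {}  # UFID -> dict standing in for the attribute record

    def create(self):
        ufid = uuid.uuid4().hex  # assumption: a random hex string as UFID
        self._data[ufid] = bytearray()
        self._attrs[ufid] = {"length": 0}
        return ufid

    def read(self, ufid, i, n):
        # Return up to n items starting at position i (zero-based here).
        return bytes(self._data[ufid][i:i + n])

    def write(self, ufid, i, data):
        buf = self._data[ufid]
        if i > len(buf):
            # Assumption: a gap before position i is padded with zero bytes.
            buf.extend(b"\x00" * (i - len(buf)))
        # Slice assignment overwrites existing items and extends the file
        # when the data runs past the current end.
        buf[i:i + len(data)] = data
        self._attrs[ufid]["length"] = len(buf)

    def delete(self, ufid):
        del self._data[ufid]
        del self._attrs[ufid]

    def get_attributes(self, ufid):
        return dict(self._attrs[ufid])

    def set_attributes(self, ufid, attrs):
        # Assumption: length is maintained by the service, not by clients.
        for key, value in attrs.items():
            if key != "length":
                self._attrs[ufid][key] = value


fs = FlatFileService()
f = fs.create()
fs.write(f, 0, b"hello")
fs.write(f, 3, b"XYZAB")           # overwrites "lo" and extends the file
print(fs.read(f, 0, 8))            # b'helXYZAB'
print(fs.get_attributes(f))        # {'length': 8}
```

Note that every call names the file and the position explicitly, so the service itself keeps no per-client open-file state.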
UNIX File System
- A seek operation is provided to enable the read/write pointer to be explicitly repositioned.
- Stateless servers: Because Read and Write take an explicit position instead of relying on a read/write pointer, the flat file service interface is suitable for implementation by stateless servers. Stateless servers can be restarted after a failure and resume operation without any need for clients or the server to restore any state.
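One way to see why an explicit position suits stateless servers is to compare what happens when a request is repeated, for example after a lost reply. The sketch below contrasts a pointer-based write with a positional write. Both classes are hypothetical stand-ins written for this illustration, not part of any real file service.

```python
class PointerFile:
    """UNIX-style sketch: a read/write pointer is kept between calls."""

    def __init__(self):
        self.data = bytearray()
        self.pos = 0

    def seek(self, pos):
        self.pos = pos

    def write(self, chunk):
        self.data[self.pos:self.pos + len(chunk)] = chunk
        self.pos += len(chunk)  # state that must survive between requests


class PositionalFile:
    """Flat-file-style sketch: every write names its own position."""

    def __init__(self):
        self.data = bytearray()

    def write(self, i, chunk):
        self.data[i:i + len(chunk)] = chunk


p = PointerFile()
p.write(b"ab")
p.write(b"ab")          # the same request repeated
print(bytes(p.data))    # b'abab': the repeat changed the outcome

q = PositionalFile()
q.write(0, b"ab")
q.write(0, b"ab")       # the same request repeated
print(bytes(q.data))    # b'ab': the repeat had no further effect
```

With the positional form, the server holds no pointer that would be lost in a crash, and a client can safely resend a request it is unsure about.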
Access Control
- In the UNIX file system, the user’s access rights are checked against the access mode (read or write) requested in the open call, and the file is opened only if the user has the necessary rights.
- The user identity (UID) used in the access rights check is retrieved during the user’s earlier authenticated login and cannot be tampered with in non-distributed implementations.
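The open-time check described above can be sketched as follows. This is a simplified illustration: the function name, the dictionary-based access list, and the returned handle are assumptions, and real UNIX systems use owner/group/other permission bits held in the file's attributes.

```python
def open_file(access_list, uid, mode):
    """Return a handle only if uid holds the requested access mode.

    access_list is a hypothetical mapping from UID to the set of
    modes ("read", "write") that user is permitted.
    """
    if mode not in access_list.get(uid, set()):
        raise PermissionError(f"user {uid} may not {mode} this file")
    # A plain dict stands in for the open-file handle.
    return {"uid": uid, "mode": mode}


acl = {1000: {"read", "write"}, 1001: {"read"}}

handle = open_file(acl, 1001, "read")   # permitted
try:
    open_file(acl, 1001, "write")       # refused: no write right
except PermissionError as err:
    print(err)
```

The check is only as trustworthy as the UID passed in, which is why the source of that identity matters once requests arrive over a network.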
Directory Service Interface
- The primary purpose of the directory service is to provide a service for translating text names to UFIDs.
- In order to do so, it maintains directory files containing the mappings between text names for files and UFIDs.
- Each directory is stored as a conventional file with a UFID, so the directory service is a client of the file service.
- We define only operations on individual directories. For each operation, a UFID for the file containing the directory is required (in the Dir parameter).
- AddName: Adds an entry to a directory and increments the reference count field in the file’s attribute record.
- UnName: Removes an entry from a directory and decrements the reference count. If this causes the reference count to reach zero, the file is removed.
- GetName: Enables clients to examine the contents of directories and to implement pattern-matching operations on file names such as those found in the UNIX shell. It returns all or a subset of the names stored in a given directory, selected by pattern matching against a regular expression supplied by the client.
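The directory operations above can be sketched with dictionaries. This is a minimal sketch under assumptions: the method names are Python renderings of the operations, directories are held in memory instead of in files, a plain dictionary stands in for the reference count field of each attribute record, and removal is recorded in a set where a real service would call Delete on the flat file service.

```python
import re


class DirectoryService:
    """In-memory sketch of a directory service with reference counts."""

    def __init__(self):
        self.dirs = {}        # directory UFID -> {name: file UFID}
        self.ref_counts = {}  # file UFID -> count (stands in for attributes)
        self.removed = set()  # UFIDs whose last name has been removed

    def add_name(self, dir_ufid, name, file_ufid):
        entries = self.dirs.setdefault(dir_ufid, {})
        if name in entries:
            raise FileExistsError(name)
        entries[name] = file_ufid
        self.ref_counts[file_ufid] = self.ref_counts.get(file_ufid, 0) + 1

    def un_name(self, dir_ufid, name):
        file_ufid = self.dirs[dir_ufid].pop(name)  # KeyError if absent
        self.ref_counts[file_ufid] -= 1
        if self.ref_counts[file_ufid] == 0:
            del self.ref_counts[file_ufid]
            self.removed.add(file_ufid)

    def get_names(self, dir_ufid, pattern):
        # The whole name must match the client's regular expression.
        rx = re.compile(pattern)
        entries = self.dirs.get(dir_ufid, {})
        return sorted(n for n in entries if rx.fullmatch(n))


ds = DirectoryService()
ds.add_name("d1", "a.txt", "f1")
ds.add_name("d1", "alias.txt", "f1")   # second name for the same file
ds.add_name("d1", "b.log", "f2")
print(ds.get_names("d1", r".*\.txt"))  # ['a.txt', 'alias.txt']

ds.un_name("d1", "a.txt")              # f1 still has one name left
ds.un_name("d1", "alias.txt")          # count reaches zero, f1 is removed
print(ds.removed)                      # {'f1'}
```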
Hierarchic File System
- A hierarchic file system such as the one that UNIX provides consists of a number of directories arranged in a tree structure.
- Any file or directory can be referenced using a pathname – a multi-part name that represents a path through the tree.
- The root has a distinguished name, and each file or directory has a name in a directory.
- The UNIX file-naming scheme is not a strict hierarchy – files can have several names, and they can be in the same or different directories.
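Resolving a pathname can be sketched as one directory lookup per component, starting at the root. The example below is an illustration only: the UFID strings, the directory contents, and the `lookup` and `resolve` helpers are hypothetical. The last directory gives one file two names to show that the scheme is not a strict hierarchy.

```python
ROOT = "ufid-root"

# Each directory maps text names to UFIDs (illustrative data).
directories = {
    "ufid-root": {"home": "ufid-home"},
    "ufid-home": {"alice": "ufid-alice"},
    "ufid-alice": {"notes.txt": "ufid-notes", "todo.txt": "ufid-notes"},
}


def lookup(dir_ufid, name):
    """Translate one name within one directory to a UFID."""
    try:
        return directories[dir_ufid][name]
    except KeyError:
        raise FileNotFoundError(name) from None


def resolve(pathname):
    """Walk the tree from the root, one lookup per pathname component."""
    ufid = ROOT
    for part in pathname.strip("/").split("/"):
        if part:  # skip empty components such as the one produced by "/"
            ufid = lookup(ufid, part)
    return ufid


print(resolve("/home/alice/notes.txt"))  # ufid-notes
print(resolve("/home/alice/todo.txt"))   # ufid-notes (same file, other name)
```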
File Groups
- A file group is a collection of files located on a given server.
- A server may hold several file groups, and groups can be moved between servers, but a file cannot change the group to which it belongs.
- File groups were originally introduced to support facilities for moving collections of files stored on removable media between computers.
- In a distributed file service, file groups support the allocation of files to file servers in larger logical units and enable the service to be implemented with files stored on several servers.
- In a distributed file system that supports file groups, the representation of UFIDs includes a file group identifier component, enabling the client module in each client computer to take responsibility for dispatching requests to the server that holds the relevant file group.
- File group identifiers must be unique throughout a distributed system.
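The dispatching role of the file group identifier can be sketched as follows. This is an illustration under assumptions: the two-field UFID layout, the group names, the server host names, and the location table are all hypothetical, and how a real client module learns group locations is not covered here.

```python
from typing import NamedTuple


class UFID(NamedTuple):
    """Hypothetical UFID layout: a group identifier plus a file number."""
    group_id: str
    file_number: int


# Hypothetical table held by the client module: group -> current server.
group_locations = {
    "group-A": "server1.example",
    "group-B": "server2.example",
}


def server_for(ufid):
    """Pick the server to contact from the group part of the UFID."""
    return group_locations[ufid.group_id]


report = UFID("group-A", 17)
print(server_for(report))   # server1.example

# Moving a whole group to another server changes only the table entry;
# the UFID, and so the file's group membership, stays the same.
group_locations["group-A"] = "server3.example"
print(server_for(report))   # server3.example
```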