Cloud Computing: Architecture, Security, and Programming

Architectural Design of Compute and Storage Clouds: This section presents basic cloud design principles. We start with a basic cloud architecture for processing massive amounts of data with a high degree of parallelism. Then we study virtualization support, resource provisioning, infrastructure management, and performance modeling.

A) Generic Cloud Architecture Design: An Internet cloud is envisioned as a public cluster of servers provisioned on demand to perform collective web services or distributed applications using data-center resources. In this section, we discuss cloud design objectives and then present a basic cloud architecture design.

● Cloud Platform Design Goals: Scalability, virtualization, efficiency, and reliability are the four major design goals of a cloud computing platform. Clouds support Web 2.0 applications. Cloud management receives the user request, finds the correct resources, and then calls the provisioning services, which invoke the resources in the cloud. The cloud management software needs to support both physical and virtual machines. Security in shared resources and shared access to data centers poses another design challenge.

● Enabling Technologies for Clouds: The key driving forces behind cloud computing are the ubiquity of broadband and wireless networking, falling storage costs, and progressive improvements in Internet computing software. Cloud users can demand more capacity at peak demand, reduce costs, experiment with new services, and remove unneeded capacity, whereas service providers can increase system utilization via multiplexing, virtualization, and dynamic resource provisioning.

● A Generic Cloud Architecture: The Internet cloud is envisioned as a massive cluster of servers. These servers are provisioned on demand to perform collective web services or distributed applications using data-center resources.
The cloud platform is formed dynamically by provisioning or deprovisioning servers, software, and database resources. Servers in the cloud can be physical machines or VMs. User interfaces are used to request services. The provisioning tool carves out the cloud system to deliver the requested service.

B) Layered Cloud Architectural Development: The architecture of a cloud is developed at three layers: infrastructure, platform, and application. These three development layers are implemented with virtualization and standardization of hardware and software resources provisioned in the cloud. The services of public, private, and hybrid clouds are conveyed to users through networking support over the Internet and the intranets involved. The infrastructure layer serves as the foundation for building the platform layer. In turn, the platform layer is a foundation for implementing the application layer for SaaS applications. Different types of cloud services demand application of these resources separately.

● Market-Oriented Cloud Architecture: As consumers rely on cloud providers to meet more of their computing needs, they will require their providers to maintain a specific level of QoS in order to meet their objectives and sustain their operations. Cloud providers consider and meet the different QoS parameters of each individual consumer as negotiated in specific SLAs.

● Quality of Service Factors: The data center comprises multiple computing servers that provide resources to meet service demands. When a cloud is a commercial offering that enables crucial business operations of companies, there are critical QoS parameters to consider in a service request, such as time, cost, reliability, and trust/security.
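
The QoS factors above can be made concrete with a small admission check. The following is a minimal sketch, not a description of any provider's SLA machinery: the class names, field names, and units are all hypothetical, chosen only to mirror the four parameters named in the text (time, cost, reliability, trust/security).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QoSTerms:
    """Limits negotiated in an SLA (hypothetical fields and units)."""
    max_response_s: float   # time: worst acceptable completion time, seconds
    max_cost: float         # cost: budget for the request
    min_reliability: float  # reliability: required success probability, 0..1
    min_trust: float        # trust/security: required trust score, 0..1


@dataclass(frozen=True)
class Offer:
    """What a provider estimates it can deliver for one service request."""
    response_s: float
    cost: float
    reliability: float
    trust: float


def violations(sla: QoSTerms, offer: Offer) -> List[str]:
    """Return the names of the QoS parameters that the offer fails to meet."""
    failed = []
    if offer.response_s > sla.max_response_s:
        failed.append("time")
    if offer.cost > sla.max_cost:
        failed.append("cost")
    if offer.reliability < sla.min_reliability:
        failed.append("reliability")
    if offer.trust < sla.min_trust:
        failed.append("trust/security")
    return failed


def admit(sla: QoSTerms, offer: Offer) -> bool:
    """Accept the request only if every negotiated parameter is satisfied."""
    return not violations(sla, offer)
```

A provider-side resource manager could call `admit` before provisioning and use `violations` to report which SLA term blocked a request.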
C) Virtualization Support and Disaster Recovery: One very distinguishing feature of cloud computing infrastructure is the use of system virtualization and the modification of provisioning tools. Virtualization of servers on a shared cluster can consolidate web services. Because VMs are the containers of cloud services, the provisioning tools first find the corresponding physical machines and deploy the VMs to those nodes before scheduling the service to run on the virtual nodes.

● Hardware Virtualization: In many cloud computing systems, virtualization software is used to virtualize the hardware. System virtualization software is a special kind of software that simulates the execution of hardware and can run even unmodified operating systems. Virtualization software is also used as the platform for developing new cloud applications, which lets developers use any operating systems and programming environments they like.

● Virtualization Support in Public Clouds: Armbrust et al. [4] have assessed three public clouds in the context of virtualization support: AWS, Microsoft Azure, and GAE. AWS provides extreme flexibility (VMs) for users to execute their own applications. GAE provides limited application-level virtualization, so users can build applications based only on the services created by Google.

● Storage Virtualization for Green Data Centers: IT power consumption in the United States has more than doubled to 3 percent of the total energy consumed in the country. The large number of data centers in the country has contributed to this energy crisis to a great extent. Recent surveys from both IDC and Gartner confirm that virtualization has had a great impact on cost reduction through reduced power consumption in physical computing systems.

● Virtualization for IaaS: VM technology has become ubiquitous. This has enabled users to create customized environments atop physical infrastructure for cloud computing. Use of VMs in clouds has the following distinct benefits:
o System administrators can consolidate the workloads of underutilized servers onto fewer servers.
o VMs can run legacy code without interfering with other APIs.
o VMs can improve security by creating sandboxes for running applications of questionable reliability.
o Virtualized cloud platforms can apply performance isolation, letting providers offer some guarantees and better QoS to customer applications.

● VM Cloning for Disaster Recovery: VM technology requires an advanced disaster recovery scheme. One scheme is to recover one physical machine by another physical machine. The second scheme is to recover one VM by another VM. For physical-machine recovery, the total recovery time is attributed to the hardware configuration, installing and configuring the OS, installing the backup agents, and the long time needed to restart the physical machine.

Inter-cloud Resource Management: This section characterizes the various cloud service models and their extensions. The cloud service trends are outlined, and cloud resource management and intercloud resource exchange schemes are reviewed.

● Extended Cloud Computing Services: There are six layers of cloud services, ranging from hardware, network, and collocation to infrastructure, platform, and software applications. We have already introduced the top three service layers as SaaS, PaaS, and IaaS, respectively. The cloud platform provides PaaS, which sits on top of the IaaS infrastructure. The top layer offers SaaS, which must be implemented on the cloud platforms provided. The implication is that one cannot launch SaaS applications without a cloud platform.

● Cloud Service Tasks and Trends: Cloud services are introduced in five layers. The top layer is for SaaS applications, further subdivided into five application areas, mostly for business applications. One approach is to widen market coverage by investigating customer behaviors and revealing opportunities through statistical analysis. SaaS tools also apply to distributed collaboration and to financial and human resources management.

● Software Stack for Cloud Computing: Despite the various types of nodes in the cloud computing cluster, the overall software stacks are built from scratch to meet rigorous goals. Developers have to consider how to design the system to meet critical requirements such as high throughput, high availability (HA), and fault tolerance. Even the operating system might be modified to meet the special requirements of cloud data processing.

● Runtime Support Services: As in a cluster environment, there are also runtime support services in the cloud computing environment. Cluster monitoring is used to collect the runtime status of the entire cluster. The scheduler queues the tasks submitted to the whole cluster and assigns them to the processing nodes according to node availability.
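
The runtime support services described above (a monitor that reports node status and a scheduler that queues tasks and assigns them by node availability) can be sketched as follows. This is a toy, single-process illustration under stated assumptions: the class `ClusterScheduler`, its method names, and the "free slots" model of availability are hypothetical and do not correspond to any particular cloud runtime.

```python
from collections import deque


class ClusterScheduler:
    """FIFO task queue plus a table of free slots per node (hypothetical model)."""

    def __init__(self, free_slots):
        # free_slots: {node name: number of tasks the node can still accept},
        # standing in for the status a cluster monitor would collect.
        self.free_slots = dict(free_slots)
        self.pending = deque()

    def submit(self, task):
        """Queue a task submitted to the whole cluster."""
        self.pending.append(task)

    def report(self, node, slots):
        """Monitoring update: record the node's current number of free slots."""
        self.free_slots[node] = slots

    def dispatch(self):
        """Assign queued tasks, in arrival order, to nodes that have free slots."""
        assignments = []
        while self.pending:
            # Pick the node with the most free slots; stop if none is available.
            node = max(self.free_slots, key=self.free_slots.get, default=None)
            if node is None or self.free_slots[node] <= 0:
                break
            task = self.pending.popleft()
            self.free_slots[node] -= 1
            assignments.append((task, node))
        return assignments
```

Tasks that cannot be placed stay in the queue and are picked up by a later `dispatch` call once `report` shows that a node has capacity again.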

Data Security in the Cloud:

An Introduction to the Idea of Data Security: Taking information and making it secure, so that only you or certain others can see it, is obviously not a new concept. However, it is one that we have struggled with in both the real world and the digital world. In the real world, even information under lock and key is subject to theft and is certainly open to accidental or malicious misuse. In the digital world, this analogy of lock-and-key protection of information has persisted, most often in the form of container-based encryption.
But even our digital attempts at protecting information have proved less than robust, because of the limitations inherent in protecting a container rather than the content of that container. This limitation has become more evident as we move into the era of cloud computing: information in a cloud environment has much more dynamism and fluidity than information that is static on a desktop or in a network folder, so we now need to think of a new way to protect information. If we can treat data security as more of a risk mitigation exercise and build systems that work with humans (i.e., are human-centric), then perhaps the software we design for securing data in the cloud will be successful.

1) The Current State of Data Security in the Cloud: At the time of writing, cloud computing is at a tipping point. Many argue for its use because of the improved interoperability and cost savings it offers. On the other side of the argument are those who say that cloud computing cannot be used in any pervasive manner until we resolve the security issues inherent in allowing a third party to control our information. These security issues began life with a focus on securing access to the data centers in which cloud-based information resides. As I write, the IT industry is beginning to wake up to the idea of content-centric or information-centric protection as an inherent part of a data object. This new view of data security has not developed out of cloud computing; instead, it is a development out of the idea of the "de-perimeterization" of the enterprise.

CryptDB: A Weaker Attacker Model
CryptDB is an implementation that allows query processing over encrypted databases. The database is managed by the cloud provider, but database items are encrypted with keys known only to the data owner. SQL queries run over the encrypted database using a collection of operations such as equality checks and order comparisons, and CryptDB uses encryption schemes that allow such comparisons to be made on ciphertexts. CryptDB represents a weaker attacker model because it assumes the existence of a trusted cloud-based application server and proxy. Nevertheless, it occupies an interesting position on the trade-off between functionality and confidentiality from cloud providers. In this paper, we go into the details of CryptDB.

Processing a query in CryptDB involves four steps:
● The application issues a query, which the proxy intercepts and rewrites: it anonymizes each table and column name and, using the master key MK, encrypts each constant in the query with the encryption scheme best suited to the desired operation.
● The proxy checks whether the DBMS server should be given keys to adjust encryption layers before executing the query. If so, it issues an UPDATE query at the DBMS server that invokes a user-defined function (UDF) to adjust the encryption layer of the appropriate columns.
● The proxy forwards the encrypted query to the DBMS server, which executes it using standard SQL (occasionally invoking UDFs for aggregation or keyword search).
● The DBMS server returns the (encrypted) query result, which the proxy decrypts and returns to the application.

1) Onion Encryption Layers: Each data item in the database is encrypted in layers. There are four main goals to achieve, and for each goal there is a separate stack of layers, called an onion: the EQ, ORD, SEARCH, and ADD onions. The EQ onion adjusts layers for equality queries, while the ORD onion adjusts the order leakage for queries that include comparisons. The SEARCH onion is used to search text in the database without revealing the text to the server; it does not apply to integer values. Finally, the ADD onion supports adding encrypted values and applies only to integer values. Each onion has different layers, each encrypted with a different algorithm.
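
To illustrate the bookkeeping behind the second query-processing step and the onion idea, the sketch below tracks, per column, which layer of each onion is currently exposed and which layers would have to be peeled before a given SQL operation can run. It performs no cryptography. The layer order inside each onion (for example, RND outside DET outside JOIN) follows the published CryptDB design as understood here and is not spelled out in this text, so treat it, along with the names `ONIONS`, `REQUIRED`, and `Column`, as assumptions.

```python
# Layers listed outermost (strongest) first; the ordering is an assumption
# taken from the CryptDB design, not from this text.
ONIONS = {
    "EQ": ["RND", "DET", "JOIN"],
    "ORD": ["RND", "OPE", "OPE-JOIN"],
    "SEARCH": ["SEARCH"],
    "ADD": ["HOM"],
}

# Onion and layer the server must see to run an operation (illustrative subset).
REQUIRED = {
    "=": ("EQ", "DET"),
    "GROUP BY": ("EQ", "DET"),
    "EQUI-JOIN": ("EQ", "JOIN"),
    "<": ("ORD", "OPE"),
    "ORDER BY": ("ORD", "OPE"),
    "LIKE": ("SEARCH", "SEARCH"),
    "SUM": ("ADD", "HOM"),
}


class Column:
    """Tracks the currently exposed layer of each onion for one column."""

    def __init__(self, name):
        self.name = name
        # Every onion starts at its outermost layer (index 0).
        self.level = {onion: 0 for onion in ONIONS}

    def current_layer(self, onion):
        return ONIONS[onion][self.level[onion]]

    def adjust_for(self, operation):
        """Return the layers to peel for `operation` and record the new level.

        Layers are only ever removed; a column already peeled past the
        needed layer requires no further adjustment.
        """
        onion, needed = REQUIRED[operation]
        target = ONIONS[onion].index(needed)
        peeled = ONIONS[onion][self.level[onion]:target]
        self.level[onion] = max(self.level[onion], target)
        return peeled
```

In this model, a first equality query on a column peels RND to expose DET, and repeating the query needs no further adjustment, which matches the idea that the proxy only issues an UPDATE when a layer actually has to change.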

2) DET: Deterministic (DET) encryption has a slightly weaker guarantee than RND, yet it still provides strong security: it leaks only which encrypted values correspond to the same data value, by deterministically generating the same ciphertext for the same plaintext. This encryption layer allows the server to perform equality checks, which means it can perform selects with equality predicates, equality joins, GROUP BY, COUNT, DISTINCT, etc. In cryptographic terms, DET should be a pseudo-random permutation (PRP) [20]. For 64-bit and 128-bit values, we use a block cipher with a matching block size (Blowfish and AES, respectively); we make the usual assumption that the AES and Blowfish block ciphers are PRPs. We pad smaller values out to 64 bits, but for data that is longer than a single 128-bit AES block, the standard ...

3) RND: Random (RND) provides the maximum security in CryptDB, namely indistinguishability under an adaptive chosen-plaintext attack (IND-CPA). The scheme is probabilistic, meaning that two equal values are mapped to different ciphertexts with overwhelming probability. RND does not allow any computation to be performed efficiently on the ciphertext. An efficient construction of RND is to use a block cipher such as AES or Blowfish in CBC mode together with a random initialization vector (IV). (We mostly use AES, except for integer values, where we use Blowfish for its 64-bit block size because the 128-bit block size of AES would make the ciphertext significantly longer.) Since, in this threat model, CryptDB assumes the server does not change results, CryptDB does not require a stronger IND-CCA2 construction (which would be secure under a chosen-ciphertext attack).

4) JOIN: A separate encryption scheme is necessary to allow equality joins between two columns, because we use different keys for DET to prevent cross-column correlations. JOIN supports all operations allowed by DET and also enables the server to determine repeating values between two columns.
OPE-JOIN enables joins by order relations. We provide a new cryptographic scheme for JOIN.

5) OPE (Order-Preserving Encryption): OPE allows order relations between data items to be established based on their encrypted values, without revealing the data itself. The server can therefore also perform ORDER BY, MIN, MAX, SORT, etc. OPE is a weaker encryption scheme than DET because it reveals order. Thus, the CryptDB proxy will only reveal OPE-encrypted columns to the server if users request order queries on those columns. OPE has provable security guarantees:
● The encryption is equivalent to a random mapping that preserves order.
● The scheme we use is the first provably secure such scheme. Until CryptDB, there was no implementation of the scheme, nor any measure of its practicality.
● The direct implementation of the scheme took 25 ms per encryption of a 32-bit integer on an Intel 2.8 GHz Q9550 processor.

6) SEARCH: SEARCH is used to perform searches on encrypted text to support operations such as MySQL's LIKE operator. We implemented the cryptographic protocol of Song et al. [46], which was not previously implemented by the authors; we also use their protocol in a different way, which results in better security guarantees. For each column needing SEARCH, we split the text into keywords using standard delimiters (or using a special keyword extraction function specified by the schema developer). We then remove repetitions in these words, randomly permute the positions of the words, and encrypt each of the words using Song et al.'s scheme, padding each word to the same size. SEARCH is nearly as secure as RND: the encryption does not reveal to the DBMS server whether a certain word repeats in multiple rows, but it leaks the number of keywords encrypted with SEARCH; an adversary may be able to estimate the number of distinct or duplicate words (e.g., by comparing the size of the SEARCH and RND ciphertexts for the same data). The server would learn the same information when returning the result set to the users, so the overall search scheme reveals the minimum amount of additional information needed to return the result.

7) HOM (Homomorphic Encryption): HOM is a secure probabilistic encryption scheme (IND-CPA secure) that allows the server to perform computations on encrypted data, with the final result decrypted at the proxy. While fully homomorphic encryption is prohibitively slow, homomorphic encryption for specific operations is efficient. To support summation, we implemented the Paillier cryptosystem. With Paillier, multiplying the encryptions of two values results in an encryption of the sum of the values. HOM can also be used for computing averages, by having the DBMS server return the sum and the count separately, and for incrementing values (e.g., SET id=id+1), on which we elaborate shortly. With HOM, the ciphertext is 2048 bits. In theory, it should be possible to pack multiple values from a single row into one HOM ciphertext for that row, using the scheme of Ge and Zdonik, which would result in an amortized space overhead of 2× (e.g., a 32-bit value occupies 64 bits) for a table with many HOM-encrypted columns. However, we have not implemented this optimization in our prototype. This optimization would also complicate partial-row UPDATE operations that reset some, but not all, of the values packed into a HOM ciphertext.
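
The additive property used by HOM can be checked with a toy Paillier implementation. The sketch below is for illustration only and is not CryptDB's code: it uses two small, fixed primes (the 10,000th and 100,000th primes) instead of the roughly 1024-bit primes a real deployment would need, fixes the generator to g = n + 1, and the function names are hypothetical.

```python
import math
import secrets


def keygen(p, q):
    """Return (public modulus n, private key) for primes p and q, with g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+); valid because g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Probabilistic encryption: c = (n + 1)^m * r^n mod n^2 for a random r."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # uniform in 1..n-1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    """Recover m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n."""
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add(n, c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return (c1 * c2) % (n * n)


if __name__ == "__main__":
    n, priv = keygen(104729, 1299709)  # toy primes, far too small for real use
    total = add(n, encrypt(n, 1200), encrypt(n, 345))
    print(decrypt(n, priv, total))  # sum of the two plaintexts
```

Because ciphertexts live modulo n squared, a 1024-bit modulus n gives 2048-bit ciphertexts, which is consistent with the ciphertext size quoted above. An increment such as SET id=id+1 corresponds to `add(n, c, encrypt(n, 1))`, and an average is obtained by decrypting the homomorphic sum at the proxy and dividing by a separately returned count.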

Cloud Programming and Software Environments: We introduce major cloud programming paradigms: MapReduce, Bigtable, Twister, Dryad, DryadLINQ, Hadoop, Sawzall, and Pig Latin. We use concrete service examples to explain the implementation and application requirements in the cloud. We review core service models and access technologies. Cloud services provided by Google App Engine, Amazon Web Services, and Microsoft Windows Azure are illustrated by example applications. In particular, we illustrate how to program GAE, AWS EC2, S3, EBS, and others. We review the open-source Eucalyptus, Nimbus, and OpenNebula and the startup Manjrasoft's Aneka system for cloud computing.

1) Features of Cloud and Grid Platforms: In this section, we summarize important features in real cloud and grid platforms. In four tables, we cover the capabilities, traditional features, data features, and features for programmers and runtime systems to use. The entries in these tables are source references for anyone who wants to program the cloud efficiently.

A) Cloud Capabilities and Platform Features: Commercial clouds need broad capabilities. These capabilities offer cost-effective utility computing with the elasticity to scale up and down in power. In addition to this key distinguishing feature, commercial clouds offer a growing number of additional capabilities commonly termed "Platform as a Service" (PaaS). For Azure, current platform features include Azure Table, queues, blobs, Database SQL, and Web and Worker roles.

B) Traditional Features Common to Grids and Clouds: In this section, we concentrate on features related to workflow, data transport, security, and availability concerns that are common to today's computing grids and clouds.
● Workflow: A recent entry is Trident [2] from Microsoft Research, which is built on top of Windows Workflow Foundation. If Trident runs on Azure or just any old Windows machine, it will run workflow proxy services on external (Linux) environments. Workflow links multiple cloud and noncloud services in real applications on demand.
● Data Transport: The cost (in time and money) of data transport into (and, to a lesser extent, out of) commercial clouds is often cited as a difficulty in using clouds. If commercial clouds become an important component of the national cyberinfrastructure, we can expect that high-bandwidth links will be made available between clouds and TeraGrid.
● Security, Privacy, and Availability: The following techniques are related to security, privacy, and availability requirements for developing a healthy and dependable cloud programming environment:
o Use virtual clustering to achieve dynamic resource provisioning with minimum overhead cost.
o Use stable and persistent data storage with fast queries for information retrieval.
o Use special APIs for authenticating users and sending e-mail using commercial accounts.

Parallel and Distributed Programming Paradigms: We define a parallel and distributed program as a parallel program running on a set of computing engines or a distributed computing system. The term combines two fundamental notions in computer science: distributed computing system and parallel computing. A distributed computing system is a set of computational engines connected by a network to achieve a common goal of running a job or an application. Parallel computing is the simultaneous use of more than one computational engine (not necessarily connected via a network) to run a job or an application. For instance, parallel computing may use either a distributed or a nondistributed computing system, such as a multiprocessor platform. Running a parallel program on a distributed computing system (parallel and distributed programming) has several advantages for both users and distributed computing systems.

A) Parallel Computing and Programming Paradigms: Consider a distributed computing system consisting of a set of networked nodes or workers. The system issues in running a typical parallel program in either a parallel or a distributed manner include the following:
● Partitioning: This is applicable to both computation and data, as follows.
o Computation partitioning: This splits a given job or program into smaller tasks. Partitioning greatly depends on correctly identifying portions of the job or program that can be performed concurrently. In other words, once parallelism is identified in the structure of the program, it can be divided into parts to be run on different workers. Different parts may process different data or a copy of the same data.
o Data partitioning: This splits the input or intermediate data into smaller pieces. Similarly, once parallelism is identified in the input data, it can also be divided into pieces to be processed on different workers. Data pieces may be processed by different parts of a program or by a copy of the same program.
● Mapping: This assigns either the smaller parts of a program or the smaller pieces of data to underlying resources. This process aims to assign such parts or pieces appropriately so that they run simultaneously on different workers, and it is usually handled by resource allocators in the system.
● Synchronization: Because different workers may perform different tasks, synchronization and coordination among workers are necessary so that race conditions are prevented and data dependency among different workers is properly managed. Multiple accesses to a shared resource by different workers may raise race conditions, whereas data dependency arises when a worker needs the processed data of other workers.
● Communication: Because data dependency is one of the main reasons for communication among workers, communication is always triggered when intermediate data is sent to workers.
● Scheduling: For a job or program, when the number of computation parts (tasks) or data pieces exceeds the number of available workers, a scheduler selects a sequence of tasks or data pieces to be assigned to the workers.

B) MapReduce, Twister, and Iterative MapReduce: MapReduce is a software framework that supports parallel and distributed computing on large data sets. This software framework abstracts the data flow of running a parallel program on a distributed computing system by providing users with two interfaces in the form of two functions: Map and Reduce.
Users can override these two functions to interact with and manipulate the data flow of running their programs.
● Formal Definition of MapReduce: The MapReduce software framework provides users with an abstraction layer over the data flow and flow of control, and hides the implementation of all data flow steps such as data partitioning, mapping, synchronization, communication, and scheduling.
● MapReduce Logical Data Flow: The input data to both the Map and the Reduce functions has a particular structure. The same holds for the output data. The input data to the Map function is in the form of a (key, value) pair.
● Strategy to Solve MapReduce Problems: As mentioned earlier, after all the intermediate data is grouped, the values of all occurrences of the same key are sorted and grouped together. As a result, after grouping, each key becomes unique in all intermediate data. Therefore, finding unique keys is the starting point for solving a typical MapReduce problem.

C) Hadoop Library from Apache: Hadoop is an open source implementation of MapReduce coded and released in Java (rather than C) by Apache. The Hadoop implementation of MapReduce uses the Hadoop Distributed File System (HDFS) as its underlying layer rather than GFS. The Hadoop core is divided into two fundamental layers: the MapReduce engine and HDFS. The MapReduce engine is the computation engine running on top of HDFS as its data storage manager. The following two sections cover the details of these two fundamental layers.
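
The logical data flow described above for MapReduce (map over (key, value) pairs, group intermediate values by key so that each key is unique, then reduce per key) can be sketched in a few lines. This is a single-process illustration of the programming model only; it is not Hadoop's Java API, and the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical. The example job is the usual word count.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: (line number, line text) -> list of intermediate (word, 1) pairs."""
    return [(word.lower(), 1) for word in value.split()]


def reduce_fn(key, values):
    """Reduce: (word, [1, 1, ...]) -> (word, total count)."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Run the map, group, and reduce phases sequentially in one process."""
    # Map phase: apply map_fn to every input (key, value) pair.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Grouping phase: collect all values of the same key, so each key is unique.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # Reduce phase: one reduce_fn call per unique key, in sorted key order.
    return [reduce_fn(k, groups[k]) for k in sorted(groups)]


if __name__ == "__main__":
    lines = [(0, "the cloud stores data"), (1, "the cloud runs the job")]
    for word, count in run_mapreduce(lines, map_fn, reduce_fn):
        print(word, count)
```

In a real framework the map calls run on different workers over partitioned input, and the grouping step involves the synchronization and communication that the framework hides from the user; only the two user-supplied functions would look the same.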