Fragment Reconstruction: Providing Global Cache Coherence in a Transactional Storage System
Atul Adya Miguel Castro Barbara Liskov Umesh Maheshwari Liuba Shrira
Laboratory for Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139
Abstract
Cooperative caching is a promising technique to avoid the increasingly formidable disk bottleneck problem in distributed storage systems; it reduces the number of disk accesses by servicing client cache misses from the caches of other clients. However, existing cooperative caching techniques do not provide adequate support for fine-grained sharing. In this paper, we describe a new storage system architecture, split caching, and a new cache coherence protocol, fragment reconstruction, that combine cooperative caching with efficient support for fine-grained sharing and transactions. We also present the results of performance studies that show that our scheme introduces little overhead over the basic cooperative caching mechanism and provides better performance when there is fine-grained sharing.
1 Introduction
This paper describes how to integrate two techniques that have been shown to improve performance in distributed client/server storage systems: cooperative caching [8, 11, 12, 20] and efficient support for fine-grained sharing [1, 4, 6, 9, 10, 15]. Cooperative caching combines client caches so that misses at one client can be serviced from the caches of other clients. It takes advantage of fast local area networks to reduce access latency, replacing slow disk accesses by significantly faster fetches from the memory of other clients, and it reduces the load on the server processors and disks, thus improving scalability.
However, existing cooperative caching techniques use pages as the unit of consistency and do not provide adequate support for fine-grained sharing, although studies in databases [4] and DSM systems [6, 15, 16] have shown that such support is important to avoid performance problems caused by false sharing. This paper presents a new storage system architecture for client/server transactional storage systems, split caching, and a new cache consistency protocol, fragment reconstruction, that together preserve the benefits of both cooperative caching and efficient support for fine-grained sharing. The new architecture is designed for an environment where clients and servers are connected by a high-speed network, and machines trust each other and are willing to cooperate, e.g., a corporate intranet.

Copyright 1997 IEEE. Published in the Proceedings of the International Conference on Distributed Computing Systems (ICDCS'97), May 27-30, 1997, Baltimore, Maryland.
Split caching divides functionality between clients and servers: client caches are used to avoid disk reads, while server caches are used to avoid disk writes. Disk reads are avoided by using the combined memory of the clients as a cooperative cache. Disk writes are avoided by using the entire server main memory as an object cache, called the m-cache [13, 19], in which the server stores new versions of recently modified objects. When a transaction commits, a client sends its modified objects to the server, which stores them in the m-cache, replacing any previously cached versions of those objects. The modifications are installed into their disk pages in the background when the m-cache fills up. Since the m-cache stores modifications more compactly than a page cache, it can absorb more modifications and delay installations for a longer time, which in turn reduces the number of writes. However, to install a modification, the system first needs the containing page. The disk reads needed to obtain containing pages are called installation reads. Split caching fetches the containing pages from the cooperative cache, thus providing the benefits of the m-cache without the cost of the installation reads.
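The write-back behavior of the m-cache can be sketched as follows. The code below is an illustrative sketch with invented names, not our actual implementation: objects are cached individually at commit, and only when the cache fills are they grouped by containing page and installed, with one read per page (the installation read).

```python
# Illustrative sketch of an m-cache: a write-back cache of modified
# objects, installed into their containing pages only when it fills up.
# All names here are hypothetical, not Thor's actual code.

class MCache:
    def __init__(self, capacity):
        self.capacity = capacity          # max number of cached objects
        self.mods = {}                    # oid -> latest committed version

    def commit(self, modified_objects):
        """Store new versions at commit; overwrite older cached versions.
        Returns True when the cache is over capacity and should be drained."""
        self.mods.update(modified_objects)
        return len(self.mods) > self.capacity

    def install(self, read_page, write_page, page_of):
        """Drain the cache: group objects by containing page, read each
        page once (an installation read, unless the page is obtained from
        a client cache), patch in the new versions, and write it back."""
        by_page = {}
        for oid, version in self.mods.items():
            by_page.setdefault(page_of(oid), {})[oid] = version
        for pid, updates in by_page.items():
            page = read_page(pid)         # installation read
            page.update(updates)
            write_page(pid, page)
        self.mods.clear()
```

Because each entry is a single object rather than a whole page, the same memory absorbs many more modifications before installation is forced.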
Fragment reconstruction ensures client cache consistency. It supports fine-grained concurrency control techniques such as adaptive call-back locking [4] and optimistic control [1]; such techniques improve performance because they avoid conflicts due to false sharing. When a transaction commits, copies of its modified objects in other client caches become obsolete, causing containing pages to become fragments with holes where the obsolete objects are. We use fragment reconstruction to bring fragments up to date by filling the holes with the latest object versions stored in the m-cache. This is done lazily, when the client holding a fragment needs the missing objects, or when the fragment is needed to service another client's cache miss or to install modifications on disk. Laziness reduces communication cost when pages are modified repeatedly, and it is safe (i.e., causes no loss in reliability) because the m-cache is recoverable from the transaction log.
To evaluate the effectiveness of our techniques, we implemented them in Thor [17] and ran a set of experiments using the multi-user OO7 benchmark [5]. Our results show that our approach preserves the benefits of both cooperative caching and support for fine-grained sharing: it adds almost no overhead to the basic cooperative caching mechanism, and it substantially improves performance when fine-grained sharing affects the pages that clients fetch from the cooperative cache. Furthermore, the results indicate that many disk reads can be avoided by fetches from the cooperative cache, thus substantially off-loading work from the server processors and disks.
The paper is organized as follows. Section 2 discusses related work. Section 3 describes the system model; Section 4 describes split caching and fragment reconstruction. Section 5 evaluates the effectiveness of the new approach. We close with a discussion of what we have accomplished.

2 Related Work
Our proposal builds on previous work on cooperative caching and support for fine-grained sharing. Cooperative caching has been studied in several contexts: distributed virtual memory [11], file systems [8, 7, 20], and a transactional database [12]. All of these studies show that cooperative caching can reduce access latency and improve system scalability in workloads with coarse-grained sharing. These studies are complementary to ours; we explain how to extend the benefits of cooperative caching to workloads with fine-grained sharing. We can use the techniques developed in [8, 11, 20] to perform page replacement in the cooperative cache.
Cache coherence work in client/server databases [4, 9] has addressed the performance problems caused by false sharing. The study by Carey et al. [4] describes a concurrency control and coherence protocol that supports fine-grained sharing efficiently. A coherence protocol using a cache of recent updates similar to the m-cache is studied in [9]. However, neither of these studies integrates fine-grained sharing support with cooperative caching.
DSM is similar to cooperative caching; it allows clients to fetch data from the memory of other clients. Some DSM systems [6, 15] provide efficient support for fine-grained sharing. However, these systems do not deal with access to large on-disk databases; they assume infinite client caches, and they do not address the problems of efficiently updating on-disk data and reducing the latency of capacity misses. Furthermore, they do not support transactions.
The work closest to ours is the log-based coherence study by Feeley et al. [10], which extends DSM to support transactional updates to a persistent store. One key difference is that they associate mutexes with segments and require segments to be large to reduce the time overhead of acquiring mutexes. This coarse-grained concurrency control can cause severe performance degradation due to lock conflicts in workloads with fine-grained sharing. Our fine-grained optimistic concurrency control algorithm avoids these problems. The log-based coherence protocol in [10] ensures cache consistency by propagating fine-grained updates to all cached copies of a segment eagerly when a transaction commits, but the authors acknowledge that eager propagation does not scale to a large number of clients. In contrast, our coherence protocol uses a scalable lazy invalidation policy. Furthermore, like other DSM systems, their system assumes infinite client caches.
The study presented in [18] proposes sending pages from clients to the server at commit time to avoid installation reads. Split caching allows the server to delay installations for a longer time, thus reducing their number while still avoiding most installation reads by fetching pages from the cooperative cache.
3 Base System Architecture
Our work is done in the context of Thor, a client/server object-oriented database system [17]. This section describes the system architecture before we extended it to support split caching and fragment reconstruction.
Servers provide persistent storage for objects and clients cache copies of these objects. Applications run at the clients and interact with the system by making calls on methods of cached objects. All method calls occur within atomic transactions. Clients communicate with servers only to fetch pages or to commit a transaction.
The servers have a disk for storing persistent objects, a stable transaction log, and volatile memory. The disk is organized as a collection of pages, which are the units of disk access. The stable log holds commit information and object modifications for committed transactions. The server memory consists of a page cache and the m-cache. The page cache contains pages recently read from the server disk. The m-cache holds recently modified objects that have not yet been written back to their pages on disk.
3.1 Fetches
When a client C accesses an object x that is not present in its cache, it fetches the page p containing x from p's server. At this point, the server adds an entry to its directory, indicating that client C is now caching p; the directory keeps track of which pages are cached by each client.
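The directory can be thought of as a map from page to the set of clients caching it. A minimal sketch (the class and method names are our own illustration, not Thor's actual data structure):

```python
# Hypothetical server directory: tracks which clients cache which pages,
# so the server later knows whom to notify about obsolete objects.

class Directory:
    def __init__(self):
        self.cachers = {}                      # page id -> set of client ids

    def record_fetch(self, client, page):
        """Called when a page is shipped to a client."""
        self.cachers.setdefault(page, set()).add(client)

    def record_eviction(self, client, page):
        """Called when a client reports it dropped the page."""
        self.cachers.get(page, set()).discard(client)

    def clients_caching(self, page):
        return self.cachers.get(page, set())
```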
3.2 Transactions and Cache Coherence
Transactions are serialized using optimistic concurrency control [1, 14]. The client keeps track of objects that are read and modified by its transaction; it sends this information, along with new copies of modified objects, to the servers when it tries to commit the transaction. The servers determine whether the commit is possible, using a two-phase commit protocol if the transaction used objects at multiple servers. If the transaction commits, the new copies of modified objects are appended to the log and also inserted in the m-cache, but they are not immediately installed in their containing disk pages.
Since objects are not locked before being used, a transaction commit can cause caches to contain obsolete objects. Servers will abort a transaction that used obsolete objects. However, to reduce the probability of aborts, servers notify clients when their objects become obsolete by sending them invalidation messages; a server uses its directory plus information about the committing transaction to determine what invalidation messages to send. Invalidation messages are small because they simply identify obsolete objects. Furthermore, they are sent in the background, batched and piggybacked on other messages.
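Deriving the invalidation sets from the directory can be sketched as follows; the function and parameter names are illustrative assumptions, not our actual interface. Note that because messages identify objects rather than pages, a client sharing only the page (false sharing) is told precisely which objects went stale:

```python
# Sketch: computing per-client invalidation messages after a commit.
# `directory` maps page id -> set of clients caching it; `page_of` maps
# an object id to its containing page. Names are hypothetical.

def invalidations(directory, page_of, committer, written_oids):
    """Return {client: set of obsolete oids} for every client (other
    than the committer) caching a page that holds a written object."""
    out = {}
    for oid in written_oids:
        for client in directory.get(page_of(oid), set()):
            if client != committer:
                out.setdefault(client, set()).add(oid)
    return out
```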
When a client receives an invalidation message, it removes obsolete objects from its cache and aborts the current transaction if it used them. The client continues to retain pages containing invalidated objects; these pages are now fragments with holes in place of the invalidated objects. Performing invalidation on an object basis means that false sharing does not cause unnecessary aborts; keeping fragments in the client cache means that false sharing does not lead to unnecessary cache misses. Invalidation messages prevent some aborts, and accelerate those that must happen, thus wasting less work and offloading detection of aborts from servers to clients.
When a transaction aborts, its client restores the cached copies of modified objects to the state they had before the transaction started; this is possible because a client makes a copy of an object the first time it is modified by a transaction. If the copy has been evicted due to cache management, the modified object becomes a hole and its page becomes a fragment.
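The abort path relies on these per-transaction before-images. A sketch of the client-side bookkeeping, with invented names (we model a lost before-image by deleting the object, i.e., turning it into a hole):

```python
# Sketch of client-side undo via before-images: the first write to an
# object within a transaction saves a copy, and abort restores the
# copies. If a before-image was evicted, the object becomes a hole.

class ClientCache:
    def __init__(self, objects):
        self.objects = dict(objects)      # oid -> current value
        self.before = {}                  # oid -> value at txn start

    def write(self, oid, value):
        # Copy-on-first-write: save the committed value exactly once.
        self.before.setdefault(oid, self.objects[oid])
        self.objects[oid] = value

    def abort(self, evicted=()):
        for oid, old in self.before.items():
            if oid in evicted:
                del self.objects[oid]     # before-image lost: a hole
            else:
                self.objects[oid] = old
        self.before.clear()
```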
3.3 Installation
When the m-cache fills up, the server installs some of the cached modifications in their disk pages. The server reads the pages into memory (these are installation reads), installs the modified objects in their pages, and writes the pages to disk. It then removes the modified objects from the m-cache, freeing up space for future transactions. If the server crashes, the m-cache is reconstructed at recovery by scanning the log.
Since installation is lazy, pages on disk can contain obsolete versions of some of their objects. Therefore, before sending a page to a client in response to a fetch request, the system retrieves new versions of the page's objects from the m-cache and installs them in the page.
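This freshening step is a simple overlay of m-cache versions onto the stale page image. A sketch, modeling pages as object-id-to-value maps (the names are illustrative assumptions):

```python
# Sketch: before a stale on-disk page is sent to a client, the server
# overlays the newer object versions held in the m-cache. Pages are
# modeled as oid -> value dicts; names are hypothetical.

def freshen(page_objects, mcache_mods, page_oids):
    """Return a copy of the page with any m-cache versions applied;
    the on-disk image itself is left untouched."""
    fresh = dict(page_objects)
    for oid in page_oids:
        if oid in mcache_mods:
            fresh[oid] = mcache_mods[oid]
    return fresh
```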
The m-cache architecture improves the efficiency of disk writes for fine-grained updates [13, 19]. It avoids installation reads at commit time and stores modifications in a compact form, since only the modified objects are stored. Nevertheless, installation reads can consume a significant portion of the available disk bandwidth [13, 19].
4 The Split Caching Architecture
The split caching architecture differs from what was described above in two important ways. First, it uses server memory only for the m-cache; there is no server page cache. Second, it implements cooperative caching. The cooperative cache is used both to service fetches and to avoid installation reads; it decreases fetch latency and makes the system more scalable by decreasing the load on the server disk.
Fragment reconstruction ensures cache consistency while allowing clients and servers to fetch fragments from the cooperative cache. To allow a fetch or an installation to use a copy of a page obtained from a client C, the system must ensure that C's copy can be brought up to date using information in the m-cache. To determine whether information in the m-cache is sufficient for this purpose, we augment the servers' directory information to record a status for each cached page. A page is complete at client C if C has the latest versions of all its objects, i.e., the page is not a fragment; it is reconstructible if the m-cache contains new versions for all holes in C's copy of the page (so that applying them to this copy will make it complete); otherwise, it is unreconstructible. Only complete and reconstructible pages are used to service fetches and avoid installation reads.

4.1 Fetches
When a client A requests a page p from the server, the server reads p from disk only if no client cache contains a complete or reconstructible copy of p. If such a client B exists, the server forwards the request to B along with any updates in the m-cache for p, and updates the directory to record p as complete at A and B. (It selects a client B containing a complete copy of p if possible, because in this case no updates from the m-cache are sent to B.) B merges the updates into its fragment and sends the page to client A. This situation is illustrated in Figure 1. B includes only the latest committed versions of objects in the page; if its current transaction has modified the page, it uses the copies it made when the objects were first modified to restore the page to its committed state. (These copies were discussed in Section 3.2.)

If client B does not have page p or does not have the committed state for one of p's objects, it informs the server that it cannot service the fetch. The server marks p as absent or unreconstructible for B, and obtains it from another client if possible. To improve performance, clients inform the server when they evict pages or copies of modified objects by piggybacking that information on messages sent to the server.
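The server's choice among sources for a page fetch is a preference order: a complete copy first (no m-cache updates need travel), then a reconstructible one (ship the updates along), then the disk. A sketch with hypothetical status names:

```python
# Sketch of the server's routing decision for a page-fetch request.
# status[(client, page)] is "complete", "reconstructible", or absent.
# Returns (source_kind, source_client, updates_to_send).

def route_fetch(status, clients, page, requester, mcache_updates):
    complete = [c for c in clients
                if c != requester and status.get((c, page)) == "complete"]
    recon = [c for c in clients
             if c != requester and status.get((c, page)) == "reconstructible"]
    if complete:
        return ("forward", complete[0], {})            # no updates needed
    if recon:
        return ("forward", recon[0], mcache_updates)   # send updates too
    return ("disk", None, mcache_updates)              # fall back to disk
```

In either forwarding case the server then records p as complete at both the source and the requester, as described above.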
We also provide a second kind of fetch request, which indicates that the client just needs to fill in the holes in a page. In such a case, the client need not receive the whole page, and the special fetch allows us to avoid a disk read or extra network communication. If the page is reconstructible, the server sends the updates in the m-cache to the client and marks the page as complete; otherwise, it treats the request as a page-fetch request.

Figure 1: Fetch from cooperative cache.
4.2 Transaction Commits
Commits are handled as described in Section 3.2 with one addition: when a transaction committed by a client A modifies an object on page p, the server modifies the directory to mark all complete copies of p at clients other than A as reconstructible.
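This commit-time bookkeeping amounts to a single status downgrade; a sketch, with invented names (the new versions now live in the m-cache, so the downgraded copies remain repairable):

```python
# Sketch: at commit, every complete copy of a modified page at clients
# other than the committer becomes reconstructible. status maps
# (client, page) -> "complete" | "reconstructible" | "unreconstructible".

def mark_after_commit(status, committer, modified_pages, clients):
    for page in modified_pages:
        for c in clients:
            if c != committer and status.get((c, page)) == "complete":
                status[(c, page)] = "reconstructible"
```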
4.3 Installation
Client caches are used to avoid installation reads, in a manner similar to what happens for fetches. If a client B has a complete or reconstructible copy of page p, the server sends it an installation fetch message that includes updates for p. B completes the page, i.e., fills the holes with the received updates, and sends it back to the server, which marks B as having a complete copy of p. If no client B has a complete or reconstructible copy, the page is read from disk and the changes are installed in it in the usual way.
Then the server writes the page to disk and removes its updates from the m-cache. Additionally, it modifies the directory to mark all reconstructible copies of p as unreconstructible. This is necessary because after discarding the modified objects from the m-cache, the server can no longer bring those copies up to date using the m-cache.
Figure 2 illustrates using page p from client B's cache to avoid an installation read. It also shows client C's entry being marked unreconstructible after the installation has been performed.
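The installation-time directory update mirrors the commit-time one in the opposite direction: once the updates leave the m-cache, remaining reconstructible copies can no longer be repaired. A sketch with the same hypothetical status names as above:

```python
# Sketch: after installing page p (using client `source`'s copy) and
# discarding p's updates from the m-cache, the source holds a complete
# copy, while every other reconstructible copy becomes unreconstructible.
# Complete copies elsewhere stay complete: they already have the latest
# versions and need nothing from the m-cache.

def mark_after_install(status, source, page, clients):
    for c in clients:
        if c == source:
            status[(c, page)] = "complete"
        elif status.get((c, page)) == "reconstructible":
            status[(c, page)] = "unreconstructible"
```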
4.4 Discussion
Our scheme preserves correctness because only complete and reconstructible pages are used in fetches from the cooperative cache; this is equivalent to fetching pages from disk in the earlier system. Unreconstructible pages are never used: servers cause pages to become unreconstructible only by removing objects from the m-cache; they record this information in directories and never request unreconstructible pages from clients. Clients cause pages to become unreconstructible only by evicting copies of modified objects; they keep track of this information and refuse to satisfy client and installation fetch requests with unreconstructible (or missing) pages.
Sending modifications to the client as part of fetching the page from its cache is similar to update propagation [3] but is not exactly the same. The server sends the m-cache updates relatively lazily, i.e., the updates may be sent after many updates to the same page have occurred. Furthermore, the server does not update all cached copies of the page but only the copy it uses to satisfy the client or installation fetch.
It is possible to reduce the number of modified objects sent from the m-cache to the clients. This can be done by augmenting the status information in the server directories to record the latest transaction whose modifications are reflected in the client's cache, and sending only the modifications made by transactions that committed later. We rejected this optimization because our current scheme is simpler and requires less storage at servers.
Figure 2: Installation fetch from cooperative cache.
5 Performance Evaluation
This section evaluates the costs and benefits of fragment reconstruction and split caching. We do not attempt to show the benefits of cooperative caching, since that has been shown by others [8, 11, 12, 20].
The key performance goals are reductions in client cache miss latency and in the latency to install modified objects. Therefore, we start (Section 5.1) by presenting a simple analytic model of the latency of different types of client and installation fetches. Section 5.2 uses micro-benchmarks run in our prototype and published performance numbers for fast networks and disks to estimate values for the variables used in the model. These values are used to compute an estimated latency for the different types of fetches and show that our techniques introduce only a small overhead over the basic cooperative caching mechanism.
The average fetch latency is determined not only by the latency of each type of fetch but also by the number of fetches of each type. Therefore, we ran the multi-user OO7 benchmark [2] in a version of our prototype instrumented to count the number of fetches of each type. The counts obtained were fed to the analytic model and used to predict average fetch latency. The results are presented in Section 5.3. They show that our techniques retain the benefits of cooperative caching in workloads with coarse-grained sharing and can significantly improve performance in workloads with fine-grained sharing; they reduce average client fetch latency by up to a factor of 3 and the number of disk reads to service fetches by up to a factor of 3.5. Furthermore, installation fetches from the cooperative cache reduce the number of installation reads by up to a factor of 52 and the average installation fetch latency by up to a factor of 11. The significant reductions in the number of disk reads show that our techniques improve the scalability of the system when disk I/O is the performance bottleneck.
We chose the experimental methodology just described instead of directly measuring elapsed times because our machines are connected by a slow Ethernet network; elapsed times measured in our environment would be dominated by the network overheads. Furthermore, cooperative caching was designed for a large client population, but we have only a small number of client machines. We simulate several client machines using a single machine, which prevents us from collecting meaningful elapsed times. Our experimental methodology allows us to circumvent these problems while still obtaining sound results.
5.1 Analytic Model
This section presents a simple analytic model to estimate the cost of fetches and installation of modified objects in the database. Using the notation shown in Figure 3, the fetch and installation costs for the various cases in our scheme are:

1. Modified objects fetched from the m-cache: N(s_r) + N(s_m)

2. Page fetch from the server disk: N(s_r) + D(s_p) + N(s_p)

3. Page fetch from a client with a complete page: 2 N(s_r) + P(unswiz) + N(s_p)

4. Page fetch from a client with a reconstructible page: N(s_r) + N(s_r + s_m) + P(recon) + P(unswiz) + N(s_p)

5. Installation fetch from the server disk: D(s_p)

6. Installation fetch from a client with a complete page: N(s_r) + P(unswiz) + N(s_p)

7. Installation fetch from a client with a reconstructible page: N(s_r + s_m) + P(recon) + P(unswiz) + N(s_p)

   D(x)    Disk time taken to read x bytes
   N(x)    Network latency for a message with x bytes (including processing overheads)
   P(job)  Processor time taken to perform job
   s_p     Size of a page
   s_m     Average size of a message containing modified objects
   s_r     Size of a request message (without data)

Figure 3: Parameters of the model.
Here, P(recon) is the cost to reconstruct a fragment by filling its holes. P(unswiz) is the cost of unswizzling. Like many other object-oriented databases, Thor uses pointer swizzling [17] to make code at clients run faster: persistent references stored in cached copies of objects are replaced by virtual memory pointers before they are used. Swizzled references in a page must be unswizzled into persistent references before the page can be shipped to another client.
Our cache coherence mechanism introduces extra processing overhead at the server to maintain the status of pages cached by clients; we ignore these costs because they are negligible compared to the total fetch or installation costs.
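Once D, N, and P are instantiated, such a model can be evaluated mechanically. As an illustration, the sketch below composes plausible per-case fetch costs from the model's parameters; both the cost expressions and the linear cost functions are illustrative assumptions, not our measured formulas or values:

```python
# Sketch: evaluating a fetch-cost model in the style of Section 5.1.
# D, N, and P are supplied as functions; the compositions and constants
# here are placeholders for illustration only.

def fetch_costs(D, N, P, s_p, s_m, s_r):
    return {
        # request + reply carrying only the modified objects
        "m_cache":         N(s_r) + N(s_m),
        # request + disk read + full-page reply
        "disk":            N(s_r) + D(s_p) + N(s_p),
        # request + forwarded request + unswizzle + full-page reply
        "client_complete": 2 * N(s_r) + P("unswiz") + N(s_p),
        # as above, but the forward carries updates and B reconstructs
        "client_recon":    N(s_r) + N(s_r + s_m) + P("recon")
                           + P("unswiz") + N(s_p),
    }
```

With any cost functions where a disk read dwarfs a network round trip, the model reproduces the qualitative ordering discussed in Section 5.2: m-cache fetches are cheapest, reconstructible-page fetches cost only slightly more than complete-page ones, and disk fetches dominate.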
5.2 Latency of Different Fetch Types
This section presents results of microbenchmarks that measure the processing costs defined in our analytic model in Section 5.1. The results show that the overhead introduced by our cache consistency mechanism on the fetch path is small. The experiments were run on an unloaded DEC Alpha 3000/400 (133 MHz) workstation running DEC/OSF1; we used a page size of 8 KB.
Figure 4: Elapsed time unswizzling a page, as a function of the percentage of swizzled objects in the page (y-axis: unswizzling time in µs).

Figure 5: Elapsed time reconstructing a page, as a function of the number of reconstructed objects (y-axis: reconstruction time in µs).
Figures 4 and 5 present the unswizzling and reconstruction costs in our prototype as the number of objects to be unswizzled or reconstructed is increased. Reconstruction times are low because they only involve setting up an I/O vector for a readv routine to read the new versions of modified objects into the cache. Unswizzling costs are high because unswizzling is memory intensive. However, in the experiments described in the next section, we observed an average percentage of swizzled objects per page of 25%; the unswizzling cost for this percentage is about 320 µs. Furthermore, the unswizzling cost is not directly related to support for fine-grained sharing; this cost will be incurred by any cooperative caching system that uses pointer swizzling. On the other hand, unswizzling does increase the cost of fetching pages (whether complete or reconstructible) from client caches. Our algorithm also incurs a cost for restoring objects modified by the current transaction to their committed state; we ignore this cost because restoring rarely occurred in our experiments.
Figures 6 and 7 present the predicted fetch and installation fetch latencies using the analytic model, the previous graphs, and the worst-case values for the number of objects unswizzled or reconstructed that we observed in the experiments described in Section 5.3. The network times are for a fast implementation of TCP over ATM, U-Net [22], and the disk times are for a modern disk, a Barracuda 4 [21].
Figure 6: Predicted fetch latencies (y-axis: fetch time in µs).

Figure 7: Predicted installation fetch latencies (y-axis: installation fetch time in µs). Note that the y-axis is broken to represent the true disk times.

Fetches from disk have the same latency with or without our techniques, as do fetches of complete pages from other client caches. Split caching and fragment reconstruction introduce two new fetch types: fetches of modified objects from the m-cache and fetches of reconstructible pages from other client caches. The first type of fetch is the fastest, and can therefore reduce fetch overheads relative to other systems when there is true fine-grained sharing. Fetching reconstructible pages from other client caches is only slightly more expensive than fetching complete pages because reconstruction messages were smaller than 512 bytes in our experiments and the processing cost of reconstructing a page is lower than 14 µs.
We ignored contention for the disk and network in our model. Our scheme does not increase disk contention relative to other systems. Since reconstruction messages and replies to fetches from the m-cache are small, we also expect contention on the network to be similar in our prototype and in other cooperative caching systems.
Our scheme slows down clients that service fetch requests. These clients need to send and receive messages, reconstruct pages, and unswizzle pages. The only cost that is related to support for fine-grained sharing is the reconstruction cost, and it is low.
Therefore, we conclude that split caching and fragment reconstruction do not introduce any significant overhead