accept()able Strategies for Improving Web Server Performance
Tim Brecht
University of Waterloo
brecht@cs.uwaterloo.ca
David Pariag
University of Waterloo
db2pariag@cs.uwaterloo.ca
Louay Gammo
University of Waterloo
lgammo@cs.uwaterloo.ca
Abstract
This paper evaluates techniques for improving the performance of three architecturally different web servers. We study strategies for effectively accepting incoming connections under conditions of high load. Our experimental evaluation shows that the method used to accept new connection requests can significantly impact server performance. By modifying each server's accept strategy, we improve the performance of the kernel-mode TUX server, the multi-threaded Knot server and the event-driven µserver. Under two different workloads, we improve the throughput of these servers by as much as 19%–36% for TUX, 0%–32% for Knot, and 39%–71% for the µserver. Interestingly, the performance improvements realized by the user-mode µserver allow it to obtain performance that rivals an unmodified TUX server.
1 Introduction
Internet-based applications have experienced incredible growth in recent years and all indications are that such applications will continue to grow in number and importance. Operating system support for such applications is the subject of much activity in the research community, where it is commonly believed that existing implementations and interfaces are ill-suited to network-centric applications [4] [29] [22].
In many systems, once client demand exceeds the server's capacity the throughput of the server degrades sharply, and may even approach zero. This is reflected in long (and unpredictable) client wait times, or even a complete lack of response for some clients. Ironically, it is precisely during these periods of high demand that quality of service matters most. Breaking news, changes in the stock market, and even the Christmas shopping season can generate flash crowds or even prolonged periods of overload. Unfortunately, over-provisioning of server capacity is nei-
existing connections. That is, a balance must be maintained between accepting new connections and working on existing connections.
The different servers that we examine can significantly improve their throughput by finding this balance.
Contrary to previous findings, we are able to demonstrate that a user-level server is able to serve an in-memory static SPECWeb99-like workload at a rate that compares very favourably with the kernel-mode TUX server.
2 Background and Related Work
Current approaches to implementing high-performance Internet servers require special techniques for dealing with high levels of concurrency. This point is illustrated by first considering the logical steps taken by a web server to handle a single client request, as shown in Figure 1.
1. Wait for and accept an incoming network connection.
2. Read the incoming request from the network.
3. Parse the request.
4. For static requests, check the cache and possibly open and read the file.
5. For dynamic requests, compute the result.
6. Send the reply to the requesting client.
7. Close the network connection.
Figure 1: Logical steps required to process a client request.

Note that almost all Internet servers and services follow similar steps. For simplicity, the example in Figure 1 does not handle persistent or pipelined connections (although all servers used in our experiments handle persistent connections).
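For concreteness, the following minimal sketch walks through the steps of Figure 1 for a single, non-persistent connection using the standard socket API. It is an illustration only, not code from any of the servers studied here; the port number, buffer size, and canned response are assumptions, and error handling, parsing, and caching are omitted or simplified.

/* Minimal iterative sketch of the steps in Figure 1 (illustrative only). */
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void) {
    int lfd = socket(AF_INET, SOCK_STREAM, 0);         /* listening socket        */
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);                       /* assumed port            */
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return 1;
    if (listen(lfd, 128) < 0)                          /* bounds the accept queue */
        return 1;
    for (;;) {
        int cfd = accept(lfd, NULL, NULL);             /* step 1: accept          */
        if (cfd < 0)
            continue;
        char req[4096];
        ssize_t n = read(cfd, req, sizeof(req) - 1);   /* step 2: read request    */
        if (n > 0) {
            req[n] = '\0';                             /* step 3: parse (omitted) */
            /* steps 4/5: locate or compute the reply; a canned response here.    */
            const char *reply =
                "HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok";
            write(cfd, reply, strlen(reply));          /* step 6: send the reply  */
        }
        close(cfd);                                    /* step 7: close           */
    }
}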
Several of these steps can block because of network or disk I/O, or because the web server must interact with another process. Consequently, a high performance server must be able to concurrently process partially completed connections by quickly identifying those connections that are ready to be serviced (i.e., those for which the application would not have to block). This means the server must be able to efficiently multiplex several thousand simultaneous connections [4] and to dispatch network I/O events at high rates.
Research into improving web server performance tends to focus on improving operating system support for web servers, or on improving the server's architecture and design. We now briefly describe related work in these areas.
2.1 Operating System Improvements
Significant research [3][2][4][18][21][22][7] has been conducted into improving web server performance by improving both operating system mechanisms and interfaces for obtaining information about the state of socket and file descriptors. These studies have been motivated by the overhead incurred by select, poll, and similar system calls under high loads. As a result, much research has focused on developing improvements to select, poll and sigwaitinfo by reducing the amount of data that needs to be copied between user space and kernel space or by reducing the amount of work that must be done in the kernel (e.g., by only delivering one signal per descriptor in the case of sigwaitinfo).
Other work [20] has focused on reducing data copying costs by providing a unified buffering and caching system. In contrast to previous research on improving the operating system, this paper presents strategies for accepting new connections which improve server performance under existing operating systems, and which are relevant to both user-mode and kernel-mode servers.
2.2 Server Application Architecture
One approach to multiplexing a large number of connections is to use a SPED (single process event-driven) architecture, which uses a single process in conjunction with non-blocking socket I/O and an event notification mechanism such as select to deliver high throughput, especially on in-memory workloads [19]. The event notification mechanism is used to determine when a network-related system call can be made without blocking. This allows the server to focus on those connections that can be serviced without blocking its single process.
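As a rough illustration of the SPED approach (not code from any of the servers evaluated here), the sketch below uses select to find descriptors that are readable and services only those, so the single process never blocks on an idle connection. For brevity it tracks only read interest; the extern variables and the accept_new_connections and handle_readable helpers are assumptions made for the sketch.

/* Sketch of a SPED-style event loop built around select (illustrative). */
#include <sys/select.h>

extern int listen_fd;                     /* assumed: non-blocking listening socket */
extern int conn_fds[FD_SETSIZE];          /* assumed: active connection descriptors */
extern int num_conns;
void accept_new_connections(void);        /* assumed helpers, defined elsewhere     */
void handle_readable(int fd);

void sped_loop(void) {
    for (;;) {
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(listen_fd, &readable);
        int maxfd = listen_fd;
        for (int i = 0; i < num_conns; i++) {
            FD_SET(conn_fds[i], &readable);
            if (conn_fds[i] > maxfd)
                maxfd = conn_fds[i];
        }
        /* Block only here: wait until at least one descriptor is ready. */
        if (select(maxfd + 1, &readable, NULL, NULL, NULL) <= 0)
            continue;
        if (FD_ISSET(listen_fd, &readable))
            accept_new_connections();          /* new connections are pending   */
        for (int i = 0; i < num_conns; i++)
            if (FD_ISSET(conn_fds[i], &readable))
                handle_readable(conn_fds[i]);  /* service without blocking      */
    }
}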
Of course, a single process cannot leverage the processing power of multiple processors. However, in multiprocessor environments multiple copies of a SPED server can be used to obtain excellent performance [31].
The multi-process (MP) and multi-threaded (MT) models [19] offer an alternative approach to multiplexing simultaneous connections by utilizing a thread (or process) per TCP connection. In this approach, connections are multiplexed by context-switching from a thread that can no longer process its connection because it will block, to another thread that can process its connection without blocking. Unfortunately, threads and processes can consume large amounts of resources, and architects of early systems found it necessary to restrict the number of executing threads [12][4].
The Flash server implements a hybrid of the SPED and MP models called the AMPED (asymmetric multi-process event-driven) architecture [19]. This architecture builds on the SPED model by using several helper processes to perform disk accesses on behalf of the main event-driven process. This approach was found to perform very well on a variety of workloads and in particular it outperformed the MP and MT models.
More recent work has revived the debate on event-driven versus multi-threaded architectures. Some papers [30][9][31] conclude that event-driven architectures afford higher performance. Others [26][27] argue that highly efficient implementations of threading libraries allow high performance while providing a simpler programming model. Our work in this paper uses servers that are implemented using both event-driven and multi-threaded architectures. We demonstrate that improved accept strategies can increase throughput in either type of server.
2.3 Kernel-mode Servers
In light of the considerable demands placed on the operating system by web servers, some people [23][11] have argued that the web server should be implemented in the kernel as an operating system service. Recent work [14] has found that there is a significant gap in performance between kernel-mode and user-mode servers. Our findings in this paper challenge these results. In fact, on a static, memory-based, SPECWeb99-like workload the performance of our user-mode µserver compares very favourably with the kernel-mode TUX server.
2.4 Accept Strategies
In early web server implementations, the strategy for accepting new connections was to accept one connection each time the server obtained notification that there were pending connections available. Recent work by Chandra and Mosberger [7] discovered that a simple modification to a select-based web server (with a stock operating system) outperformed operating system modifications they and other researchers [21] had performed in order to improve event dispatch scalability. They referred to this server as a multi-accept server because upon learning of a pending connection, the server attempts to accept as many incoming connections as possible by repeatedly calling accept until the call fails (and errno is set to EWOULDBLOCK) or the limit on the maximum number of open connections is reached. This multi-accept behaviour means that the server periodically attempts to drain the entire accept queue. Their experiments demonstrate that this aggressive strategy towards accepting new connections improved event dispatch scalability for workloads that request a single one byte file or a single 6 KB file.
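The multi-accept behaviour described above can be sketched as follows; this is our illustration of the policy, not code from Chandra and Mosberger's server. It assumes a non-blocking listening socket, so accept fails with EWOULDBLOCK (or EAGAIN) once the accept queue is empty; the register_connection helper and the open-connection counters are assumptions.

/* Sketch of the multi-accept policy: drain the accept queue completely. */
#include <errno.h>
#include <sys/socket.h>

extern int max_open_conns;                 /* assumed limit on open connections */
extern int num_open_conns;
void register_connection(int fd);          /* assumed helper                    */

void multi_accept(int listen_fd) {
    for (;;) {
        if (num_open_conns >= max_open_conns)
            break;                         /* respect the open-connection limit */
        int fd = accept(listen_fd, NULL, NULL);
        if (fd < 0) {
            /* EWOULDBLOCK/EAGAIN: the accept queue has been drained. */
            if (errno == EWOULDBLOCK || errno == EAGAIN)
                break;
            break;                         /* other errors: give up for now     */
        }
        num_open_conns++;
        register_connection(fd);           /* hand the connection to the server */
    }
}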
In this paper we explore more representative workloads and demonstrate that their multi-accept approach can actually lead to poor performance because of an imbalance that is created by an over-emphasis on accepting new connections at the expense of processing existing connections. We devise a simple mechanism to permit us to implement and tune a variety of accept strategies, and to experimentally evaluate the impact of different accept strategies on three server architectures. We demonstrate that a carefully tuned accept policy can significantly improve performance across all three of the server architectures.
More recent work [26][27] has also noted that the strategy used to accept new connections can significantly impact performance. Our work specifically examines different strategies used under a variety of servers in order to understand how to choose a good accept strategy.
3 Improving Accept Strategies
In order for a client to send a request to the server it must first establish a TCP connection to the server. This is done using the TCP three-way handshake [25]. Once the three-way handshake succeeds, the kernel adds a socket to the accept queue (sometimes referred to as the listen queue) [5]. Each time the server invokes the accept system call, a socket is removed from the front of the accept queue, and an associated file descriptor is returned to the server.
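To make this connection-establishment path concrete, the sketch below shows how a server typically creates its listening socket; the backlog argument to listen bounds the accept queue described above, and marking the socket non-blocking lets accept report EWOULDBLOCK when the queue is empty. This is an illustrative example only; the make_listener function and its parameters are not taken from the servers studied here.

/* Sketch: creating a non-blocking listening socket whose accept queue
 * (listen queue) is bounded by the backlog passed to listen(). */
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/socket.h>

int make_listener(unsigned short port, int backlog) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;
    /* Connections that complete the three-way handshake wait here until the
     * server calls accept(); the backlog bounds this accept queue. */
    if (listen(fd, backlog) < 0)
        return -1;
    /* Non-blocking mode makes accept() fail with EWOULDBLOCK when the queue
     * is empty instead of blocking the server. */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    return fd;
}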
We have configured our Linux kernel with SYN
processing existing connections.
4 The Web Servers
This section provides background information on each of the servers investigated. We describe the architecture of each server, as well as its procedure for accepting new connections. Lastly, we describe any modifications we have made to the base server behaviour.
4.1 The µserver
The micro-server (µserver) [6][10] is a single process event-driven web server. Its behaviour can be carefully controlled through the use of more than fifty command-line parameters, which allow us to investigate the effects of several different server configurations using a single web server. The µserver uses either the select, poll, or epoll system call (chosen through command line options) in concert with non-blocking socket I/O to process multiple connections concurrently.
The µserver operates by tracking the state of each active connection (states roughly correspond to the steps in Figure 1). It repeatedly loops over three phases. The first phase (which we call the getevents-phase) determines which of the connections have accrued events of interest. In our experiments this is done using select. The second phase (called the accept-phase) is entered if select reports that connections are pending on the listening socket. The third phase (called the work-phase) iterates over each of the non-listening connections that have events of interest that can be processed without blocking. Based on the state of the connection, the µserver calls the appropriate function to perform the work. A key point is that for the µserver options used in our experiments the work-phase does not consider any of the new connections accumulated in the immediately preceding accept-phase. That is, it only works on connections when select informs it that work can proceed on that connection without blocking.
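The three phases can be summarised with the following sketch. It is our paraphrase of the µserver's structure, not its actual source; the helper functions and data structures are assumptions introduced for the illustration.

/* Sketch of the µserver-style main loop: getevents-, accept-, and work-phase. */
#include <sys/select.h>

extern int listen_fd;
extern fd_set readable;                /* assumed: filled in by the getevents-phase  */
int  get_events(void);                 /* assumed: wraps select() over all fds       */
void accept_phase(int listen_fd);      /* accepts up to accept-limit connections     */
int  next_ready_connection(void);      /* next non-listening fd with a pending event,
                                          or -1 when none remain                     */
void process_connection(int fd);       /* read/parse/reply based on connection state */

void server_loop(void) {
    for (;;) {
        /* getevents-phase: find connections with events of interest. */
        if (get_events() <= 0)
            continue;
        /* accept-phase: only entered if the listening socket is readable. */
        if (FD_ISSET(listen_fd, &readable))
            accept_phase(listen_fd);
        /* work-phase: service only connections select() reported as ready;
         * connections accepted above are not considered until select()
         * reports events for them in a later iteration. */
        for (int fd = next_ready_connection(); fd >= 0; fd = next_ready_connection())
            process_connection(fd);
    }
}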
The µserver is based on the multi-accept server written by Chandra and Mosberger [7]. That server implements an accept policy that drains its accept queue when it is notified of a pending connection request. In contrast, the µserver uses a parameter that permits us to accept up to a pre-defined number of the currently pending connections. This defines an upper limit on the number of connections accepted consecutively. For ease of reference, we call this parameter the accept-limit parameter, and refer to it throughout the rest of this paper (the same name is also used in referring to modifications we make to the other servers we examine). Parameter values range from one to infinity (Inf). A value of one forces the server to accept a single connection, while Inf causes the server to accept all currently pending connections.
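A sketch of an accept-phase governed by the accept-limit parameter is shown below; this is our illustration, not the µserver's source code, and the register_connection helper is an assumption. With accept_limit set to a very large value the loop degenerates to the multi-accept policy of Section 2.4, while a value of one accepts a single connection per notification.

/* Sketch: accept at most accept_limit connections per notification, or stop
 * earlier if the accept queue is drained (EWOULDBLOCK/EAGAIN). */
#include <errno.h>
#include <sys/socket.h>

extern int accept_limit;                   /* 1 .. "infinity" (e.g., INT_MAX)    */
void register_connection(int fd);          /* assumed helper                     */

void accept_phase(int listen_fd) {
    for (int accepted = 0; accepted < accept_limit; accepted++) {
        int fd = accept(listen_fd, NULL, NULL);
        if (fd < 0) {
            if (errno == EWOULDBLOCK || errno == EAGAIN)
                break;                     /* no more pending connections        */
            break;                         /* other errors: stop for this round  */
        }
        register_connection(fd);
    }
}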
Our early investigations [6] revealed that the accept-limit parameter could significantly impact the µserver's performance. This motivated us to explore the possibility of improving the performance of other servers, as well as quantifying the performance gains under more representative workloads. As a result, we have implemented accept-limit mechanisms in two other well-known web servers. We now describe these servers and mechanisms.
4.2 Knot
Knot [26] is a multi-threaded web server which makes use of the Capriccio [27] threading package. Knot is a simple web server. It derives many benefits from the Capriccio threading package, which provides lightweight, co-operatively scheduled, user-level threads. Capriccio features a number of different thread schedulers, including a resource-aware scheduler which adapts its scheduling policies according to the application's resource usage. Knot operates in one of two modes [26] which are referred to as Knot-C and Knot-A.
Knot-C uses a thread-per-connection model, in which the number of threads is fixed at runtime (via a command-line parameter). Threads are pre-forked during initialization. Thereafter, each thread executes a loop in which it accepts a single connection and processes it to completion. Knot-A creates a single acceptor thread which loops attempting to accept new connections. For each connection that is accepted, a new worker thread is created to completely process that connection.
Knot-C is meant to favour the processing of existing connections over the accepting of new connections, while Knot-A is designed to favour the accepting of new connections. By having a fixed number of threads, and using one thread per connection, Knot-C contains a built-in mechanism for limiting the number of concurrent connections in the server. In contrast, Knot-A allows increased concurrency by placing no limit on the number of concurrent threads or connections.
Our preliminary experiments revealed that Knot-C performs significantly better than Knot-A, especially under overload, where the number of threads (and open connections) in Knot-A becomes very large. Our comparison agrees with findings by the authors of Knot [26], and as a result we focus our tuning efforts on Knot-C.
We modified Knot-C to allow each of its threads to accept multiple connections before processing any of the new connections. This was done by implementing a new function that is a modified version of the accept call in the Capriccio library. This call loops to accept up to accept-limit new connections provided that they can be accepted without blocking. If the call to accept would block and at least one connection has been accepted, the call returns and the processing of the accepted connections proceeds. Otherwise, the thread is put to sleep until a new connection request arrives. After accepting new connections, each thread fully processes the accepted connections before admitting any new connections. Therefore, in our modified version of Knot, each thread oscillates between an accept-phase and a work-phase. As in the µserver, the accept-limit parameter ranges from 1 to infinity. The rest of this paper uses the accept-limit parameter to explore the performance of our modified version of Knot-C. Note that the default Knot behaviour is obtained when the accept-limit is set to 1.
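Conceptually, each worker thread in our modified Knot-C behaves like the following sketch. The real implementation uses Capriccio's user-level threads and its modified accept call; here we paraphrase the logic with hypothetical helper names and a hypothetical batch-size bound.

/* Sketch of a modified Knot-C worker thread: an accept-phase that gathers up to
 * accept_limit connections without blocking, followed by a work-phase that
 * processes each accepted connection to completion. */
enum { MAX_BATCH = 1024 };                 /* assumed upper bound on one batch      */

extern int accept_limit;
int  accept_batch(int listen_fd, int *fds, int limit);  /* assumed: returns count;
                                                           blocks only if nothing has
                                                           been accepted yet          */
void process_to_completion(int fd);                     /* read, reply, close         */

void knot_worker(int listen_fd) {
    int fds[MAX_BATCH];
    for (;;) {
        /* accept-phase: at least one connection, at most accept_limit. */
        int n = accept_batch(listen_fd, fds,
                             accept_limit < MAX_BATCH ? accept_limit : MAX_BATCH);
        /* work-phase: fully process the batch before accepting again. */
        for (int i = 0; i < n; i++)
            process_to_completion(fds[i]);
    }
}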
4.3 TUX
TUX [23][15] (which is also referred to as the Red Hat Content Accelerator) is an event-driven kernel-mode web server for Linux developed by Red Hat. It is compiled as a kernel-loadable module (similar to many Linux device drivers), which can be loaded and unloaded on demand. TUX's kernel-mode status affords it many I/O advantages including true zero-copy disk reads, zero-copy network writes, and zero-copy request parsing. In addition, TUX accesses kernel data structures (e.g., the listening socket's accept queue) directly, which allows it to obtain events of interest with relatively low overhead compared to user-level mechanisms like select. Lastly, TUX avoids the overhead of kernel crossings that user-mode servers must incur when making system calls. This is important in light of the large number of system calls needed to process a single HTTP request.
A look at the TUX source code provides detailed insight into TUX's structure. TUX's processing revolves around two queues. The first queue is the listening socket's accept queue. The second is the work
pending queue before starting the next accept-phase. Note that new events can be added to each queue while TUX removes and processes them.
We modified TUX to include an accept-limit parameter, which governs the number of connections that TUX will accept consecutively. Since TUX is a kernel-loadable module, it does not accept traditional command line parameters. Instead, the new parameter was added to the Linux /proc filesystem, in the /proc/sys/net/tux subdirectory. The /proc mechanism is convenient in that it allows the new parameter to be read and written without restarting TUX. It gives us a measure of control over TUX's accept policy, and allows us to compare different accept-limit values with the default policy of accepting all pending connections.
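Because the parameter lives in /proc, it can be changed at run time simply by writing a new value to the corresponding file. The sketch below shows one way to do this from C; note that the file name accept_limit is a hypothetical example, since the text above only identifies the /proc/sys/net/tux subdirectory, not the exact entry name.

/* Sketch: adjusting the TUX accept-limit at run time via /proc.
 * The file name "accept_limit" is a hypothetical example. */
#include <stdio.h>

int set_tux_accept_limit(int limit) {
    FILE *f = fopen("/proc/sys/net/tux/accept_limit", "w");
    if (f == NULL)
        return -1;                  /* module not loaded or insufficient permission */
    fprintf(f, "%d\n", limit);
    return fclose(f);
}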
Note that there is an important difference between how the µserver and TUX operate. In the µserver the work-phase processes a fixed number of connections (determined by select). In contrast, TUX's work
WAIT state before commencing the next experiment. Prior to running experiments, all non-essential Linux services (e.g., sendmail, dhcpd, cron, etc.) are shut down. This eliminated interference from daemons (and periodic jobs) which might confound results.
Prior to determining which accept-limit values to include in each graph, a number of alternatives were run and examined. The final values presented in each graph were chosen in order to highlight the interesting accept policies.
The following sections describe our experimental environment and the parameters used to configure each server.
5.1 Environment
Our experimental environment is made up of two separate client-server clusters. The first cluster (Cluster 1) contains a single server and eight clients. The server contains dual Xeon processors running at 2.4 GHz, 1 GB of RAM, a high-speed (10,000 RPM) SCSI disk, and two Intel e1000 Gbps Ethernet cards. The clients are identical to the server with the exception of their disks, which are EIDE. The server and clients are connected with a 24-port Gbps switch. Since the server has two cards, we avoid network bottlenecks by partitioning the clients into different subnets. In particular, the first four clients communicate with the server's first ethernet card, while the remaining four use a different IP address linked to the second ethernet card. Each client runs Red Hat 9.0, which uses the 2.4.20-8 Linux kernel. The server also uses the 2.4.20-8 kernel, but not the binary that is distributed by Red Hat. Instead, the Red Hat sources were re-compiled after we incorporated our changes to TUX. The resulting kernel was used for all experiments on this machine. The aforementioned kernel is a uni-processor kernel that does not provide SMP support. The reasons for this are twofold. Firstly, the Capriccio threading package does not currently include SMP support. Secondly, we find it instructive to study the simpler single-processor problem before considering complex SMP interactions.
The second machine cluster (Cluster 2) also consists of a single server and eight clients. The server contains dual Xeon processors running at 2.4 GHz, 4 GB of RAM, high-speed SCSI drives and two Intel e1000 Gbps Ethernet cards. The clients are dual-processor Pentium III machines running at 550 MHz. Each client has 256 MB of RAM, an EIDE disk, and one Intel e1000 Gbps Ethernet card. The server runs a Linux 2.4.19 uni-processor kernel, while the clients use the 2.4.7-10 kernel that ships with Red Hat 7.1. This cluster of machines is networked using a separate 24-port Gbps switch. Like the first cluster, the clients are divided into two groups of four, with each group communicating with a different server NIC. In addition to the Gbps network, all machines are connected via a separate 100 Megabit network which is used for experimental control (starting and stopping web servers, and copying experimental results). Each cluster is completely isolated from other network traffic.
Cluster 1 is used to run all µserver and TUX experiments, while Cluster 2 is used to run all Knot experiments. Because our clusters are slightly different, we do not directly compare results taken from different clusters. Instead, each graph presents data gathered from a single cluster. Ideally, we would use one cluster for all our experiments, but the number of experiments required necessitated the use of two clusters.
5.2 Web Server Configuration
In the interest of making fair and scientific comparisons, we carefully configured TUX and the µserver to use the same resource limits. TUX was configured to use a single kernel thread. This enables comparisons with the single process µserver, and was also recommended in the TUX user manual [23]. The TUX accept queue backlog was set to 128 (via the /proc/sys/net/tux/max
SETSIZE. For TUX this was done through /proc/sys/net/tux/max