XrdAdaptor/doc/multisource_algorithm_design.md

0001 # Multi-source algorithm design
0002
0003 ## Introduction
0004
0005 The existing AAA infrastructure has a relatively high penalty for poor redirection decisions.  A client which is redirected to a "poor" server in terms of network locality (in terms of high latency or low bandwidth) will continue to use that server as long as it doesn't fail outright, even if there is a much better source available.
0006
0007 Improving network locality argues for improving the redirector's redirection logic.  This is certainly needed, however network locality is not the only concern.  A site's storage can degrade in the middle of a file transfer (due to a device loss in the RAID, or sudden spurt of new transfers), making a "reasonable choice" at redirection time perform poorly overall.
0008
0009 To reduce the penalty for poor redirection decisions, we aim to upgrade the CMSSW XrdAdaptor to be multi-source capable.  By actively using multiple sources and probing for additional ones, the client will be able to discover the fastest sources, even if the designation of "fastest" changes through the job's lifetime.
0010
0011 ## Design Goals
0012
0013 Design goals of the multi-source Xrootd client:
0014 0. Determine a metric for quality of the source server.
0015 1. Actively balance transfers over multiple links in order to determine several high-quality sources of the file.
0016 2. Recover from transient IO errors at a single source.
0017 3. Probe for additional possible sources during the file transfer.
0018 4. Minimize the impact on the source site versus a single-source client. Understand both average case and the worst case scenarios.
0019    - In particular, actively utilizing too many sources can cause TCP windows to stay small and OS read-ahead to be inefficient.
0020    - Any speculative probes for additional sources should be a small percentage of the total traffic.
0021    - Any retry mechanisms must have reasonable delays prior to failure.
0022 5. Have the number of requests per source be proportional to source quality.
0023    - Servers or sites experiencing a "soft failure" causing a degrade in quality will receive the least amount of traffic.
0024
0025 ## Implementation
0026
0027 ### Quality metric
0028
0029 A source's quality is to be defined, per-file, to be the average request response time over the last 5 time intervals. Each time interval is one minute long; time intervals with no data points are discarded.
0030
0031 If there is no previously recorded data for a given source, it is assumed to have an average of 260ms (assumes a 256KB request, 1MB/s server speed, and 10ms of latency) in one time interval. When a new file is opened on a source, the prior average for that source is used for the first time interval.
0032
0033 Notes:
0034 - The request splitting algorithm outlined below will split all client requests into a series of requests at most 256KB (similar to what the Xrootd client does internally).  Since the request has a maximum size, it makes looking at the unweighted time per request more reasonable.
0035 - It may seem strange to the reader to not differentiate between bandwidth and latency, or somehow factoring in request size to the quality metric.  I believe it is acceptable to ignore this as the distribution of small and large requests will remain approximately constant throughout the job lifetime.
0036
0037 ### Source selection algorithm
0038
0039 The client will maintain a set of up to two "active servers" and an arbitrary number of inactive servers. When the client opens a file, the initial data server it receives from the redirector becomes the first active server.
0040
0041 For a 5-second grace period, this initial server remains the only active one. (The use of "5 seconds" as the grace period is motivated by the redirector's implementation; any internal file location requests triggered by the initial file open should be finished after 5 seconds.) After the grace period, the client enters source search mode.
0042
0043 When in source search mode, the client will try trigger a new file open at most once every 5 seconds. When trying a new file-open, it explicitly requests the redirector avoid any previously-used sources for that file. Only one file open operation may be in progress at a time. When a file-open is complete, that server is added to the set of active servers.  When a "file not found" is returned by the redirector, a new file open will not be performed for at least 2 minutes.
0044
0045 When an active source's quality goes above 5130 (256kb request, 50kb/s bandwidth, 10ms latency), the source is moved to the inactive set if it is not the only active server.
0046
0047 If an active source's quality is a factor 10 worse than the other active source, it is moved to the inactive set.
0048
0049 When there is only one source in the active set, a source is promoted from the inactive set if its quality is below 5130 and is not factor 10 worse than the other active source. If no server in the inactive set is eligible, the client re-enters search mode for additional sources.
0050
0051 If a source encounters an error (either a file IO error or a disconnect), then it is marked as disabled and removed from both active and inactive sets.
0052
0053 If an inactive source's quality metric is better than an active source's metric, the two are swapped. This swap is not performed if the inactive source itself has been removed from the active set in the last two minutes. The "Active probe algorithm" section below describes one mechanism for updating an inactive source's quality metric.
0054
0055 ### Request splitting algorithm
0056
0057 When a client performs a new request, the request is balanced amongst the active servers using the following algorithm:
0058
0059 1. Source A removes 256KB from the beginning of the request and places it at the end of its request queue.
0060 2. Source B removes 256KB from the end 256KB of the request and places it at the beginning of its request queue.
0061 3. Steps (1) and (2) are repeated until the original request has been completely split into the two queues.
0062
0063 After each client request has been fully split, the labels "A" and "B" are swapped. This allows a series of small client requests to be load-balanced between the two sources.
0064
0065 Examples:
0066
0067 For example, suppose a client requests to read 1024KB starting at offset 0. The client request and queues look like:
0068 ```
0069 A: []
0070 B: []
0071 client request: [1024KB @ 0]
0072 ```
0073
0074 After steps 1 and 2, the sources have the following queues:
0075 ```
0076 A: [256KB @ 0]
0077 B: [256KB @ 768KB]
0078 client request: [512KB @ 256KB]
0079 ```
0080
0081 After steps 1 and 2 are applied again, we have the following queues:
0082 ```
0083 A: [256KB @ 0, 256KB @ 256KB]
0084 B: [256KB @ 512KB, 256KB @ 768KB]
0085 client request: []
0086 ```
0087
0088 Consider the following example for a vectored read request:
0089 Start:
0090 ```
0091 A: []
0092 B: []
0093 client request: [192KB @ 0, 128KB @ 256KB, 128KB @ 512KB, 192KB @ 768KB]
0094 ```
0095
0096 Iteration 1:
0097 ```
0098 A: [192KB @ 0, 64KB @ 256KB]
0099 B: [64KB @ 576KB, 192KB @ 768KB]
0100 client request: [64KB @ 320KB, 64KB @ 512KB]
0101 ```
0102
0103 Iteration 2:
0104 ```
0105 A: [192KB @ 0, 64KB @ 256KB, 64KB @ 320KB, 64KB @ 512KB]
0106 B: [64KB @ 576KB, 192KB @ 768KB]
0107 client request: []
0108 ```
0109
0110 Notes:
0111 - The algorithm halts because, after each iteration, the client request is reduced by 512KB and the algorithm terminates once the client request is empty.
0112 - The request is split between the two clients; the difference in number of bytes assigned is at max 256KB.
0113 - 256KB is a natural unit of size as the Xrootd client break requests into this size by default.
0114 - If the client request, when performed in order, results in the server performing non-overlapping reads with monotonically-increasing offsets, then the split requests will also have this property.
0115 - TODO: We should have the initial assignment done with respect to the quality metric.  When the two servers are heavily out-of-balance, one may steal a lot of work from the other.  When work is stolen, it is performed via backward reads, which won't cause ideal filesystem performance.
0116
0117 ### Load-balance algorithm
0118
0119 When a client request is made, it is split into two sets of requests as described previously if there are two active servers. Each source performs the IO operations in its queue in order to completion.
0120
0121 If two sources are active and one source has already finished its queue, it may steal work from the end of the other source's queue if the other source has not already started on that IO operation.
0122
0123 If one source has not completed its request in more than 4 times the quality metric and the other source is idle, then the other source may speculatively start the same IO operation. The results of this "speculative read" are stored in a separate, statically-allocated 256KB buffer; this means only one speculative read at a time is allowed. The first request to complete is returned to the client.
0124
0125 If an IO error occurs on one active source, the same IO operation is inserted into the other source's queue. If the IO operation fails in the other active source, it is repeated immediately on all inactive source. The first inactive source to successfully complete the IO is swapped into the active set, removing the currently worst-performing active source.
0126
0127 Notes:
0128 - Due to the way the quality metric is calculated, it could take a few minutes for an active source that starts moving data at 1 byte / sec to become inactive. Should we re-weight the time intervals to give more weight to recent results.
0129   - TODO: calculate the worst case over-read for very slow sources if they get their work stolen all the time.
0130   - The over-read is probably not as bad as having to take quite some time to give up on the source. We can probably introduce a penalty for sources that are the "victim" of a successful speculative read.
0131
0132 ### Active probe algorithm
0133
0134 If the client is not in search mode, for every 1024KB of data read, generate a random number such that there will be a .25% chance the request will also be issued as a speculative read to an inactive source. If it has been at least 2 minutes since the last file-open request to the redirector, there is also a .25% chance the request will trigger a file-open and a speculative read in the redirector.
0135
0136 The results of the speculative read are kept in the same statically-allocated buffer as used by the load-balance algorithm. Only one speculative reads can be issued at a time, including reads from the load-balance algorithm. The timing results of the read are used to update the quality metric for the inactive or new sources. If the request is finished prior to the active source, the results of the speculative read will be used to fulfill the client request.
0137
0138 Notes:
0139 - Assuming a client request is 256KB, the active probe algorithm should be no more than 2% of the total traffic.
0140