build model of computron-to-wallclock relationship #3459

Closed · 8 tasks done · warner opened this issue Jul 9, 2021 · 21 comments
Labels: cosmic-swingset (package: cosmic-swingset), enhancement (New feature or request), SwingSet (package: SwingSet)

warner (Member) commented Jul 9, 2021

What is the Problem Being Solved?

To support #3460 (in particular to build the runPolicy object it uses), we want to build a model that can look at the metering results for a given delivery, and return an estimate of the amount of wallclock time that would have taken on our validators.

We want to build that model from the slogfile corpus we collected during the most recent testnet phase (the "stress test" phase 4). This contains 564GB of data from 117 participants, spread across 172 slogfiles, with varying degrees of overlap. Some validators will be faster or slower than others. For each delivery in the corpus, we want to get statistics on the range of wallclock times taken by these validators. The model should probably return the time taken by the slowest validator which still managed to keep up with the load we applied.
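A minimal pandas sketch of that per-delivery aggregation (the parquet file and column names are assumptions about how the parsed corpus might be laid out):

```python
import pandas as pd

# Hypothetical shape: one row per (validator, delivery), with wallclock duration.
deliveries = pd.read_parquet('deliveries.parquet')

# Spread of wallclock time per chain-wide delivery, across validators.
spread = (deliveries
          .groupby(['vatID', 'deliveryNum'])['dur']
          .describe(percentiles=[0.5, 0.9]))

# One proxy for "slowest validator that still kept up": the 90th percentile.
estimate = spread['90%']
```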

I'm hoping that we can find a roughly linear fit between our metering values (computrons consumed) and the wallclock time of each delivery: a + bx where x is the computron count and a/b are the deduced constants. We also know the number of syscalls made during each crank, so we might improve the linear regression with a + bx + cy where y is the number of syscalls. Finally, many of these deliveries use a variable number of arguments (e.g. the number of virtual purses whose balances were updated by the cosmos-sdk "VBank" module). So the model might get even better if we sum some function of the delivery arguments and feed that into the model.

The over-engineered approach to this would involve some form of machine-learning tool that gets the complete set of deliveries and syscalls (including arguments) as training data, and emits a compact function which gets a single delivery-plus-syscalls. The practical approach will be to compute a linear regression over x. Once we've got a model, we'll want to feed it into the simulator described in #3460 to see how badly it manages to limit blocks to the P block time target. If it's anywhere close, we should stop there and not try to improve it with more data or more sophisticated analysis.
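Continuing the hypothetical `deliveries` frame above, the practical approach could be a few lines of ordinary least squares (a sketch, not the notebook's actual code):

```python
import numpy as np

x = deliveries['compute'].to_numpy(dtype=float)   # computrons per delivery
y = deliveries['syscalls'].to_numpy(dtype=float)  # syscalls per delivery
t = deliveries['dur'].to_numpy(dtype=float)       # wallclock seconds

# a + b*x
(a, b), *_ = np.linalg.lstsq(np.column_stack([np.ones_like(x), x]), t, rcond=None)

# a + b*x + c*y, adding the syscall count as a second regressor
(a2, b2, c2), *_ = np.linalg.lstsq(np.column_stack([np.ones_like(x), x, y]), t, rcond=None)
```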

For the next testnet phase, we want:

  • choose a target time P
    • 10s compute should be 2/3rds throughput, with 5s voting
  • add runPolicy object to controller.run() (#3460): build a runPolicy object, based upon what we learn from this model, which tries to finish blocks when their estimated/predicted wallclock time reaches P

It's ok if the result is not exact: some variation of runtime is entirely acceptable, and we'll be tuning this model (and our choice of P) over time. We should expect that both P and the constants that define the runPolicy function will be changed by governance votes over the life of the chain.
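The runPolicy itself is a JavaScript object (see #3460); as a rough Python sketch of just the decision logic, with hypothetical names:

```python
# Sketch only: the real runPolicy in #3460 is consulted by the kernel
# after each crank; this mimics the stopping rule.
class EstimatedTimePolicy:
    def __init__(self, a, b, target_seconds):
        self.a, self.b = a, b          # constants from the fitted model
        self.target = target_seconds   # the block time target P
        self.predicted = 0.0

    def crank_complete(self, computrons):
        """Return True to keep running cranks, False to end the block."""
        self.predicted += self.a + self.b * computrons
        return self.predicted < self.target
```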

tasks:

@warner warner added the enhancement New feature or request label Jul 9, 2021
@warner warner added this to the Testnet: Metering Phase milestone Jul 9, 2021
@warner warner added SwingSet package: SwingSet cosmic-swingset package: cosmic-swingset labels Jul 9, 2021
@dckc dckc self-assigned this Jul 13, 2021
dckc (Member) commented Jul 13, 2021

2000x inaccuracy?

> I'm hoping that we can find a roughly linear fit between our metering values (computrons consumed) and the wallclock time of each delivery: a + bx where x is the computron count and a/b are the deduced constants.

When I fit the data to a line like that, I see a 2000x spread between the fastest and slowest delivery.

I'm not confident about the analysis just yet, though.

What's good enough?

> We also know the number of syscalls made during each crank, so we might improve the linear regression with a + bx + cy where y is the number of syscalls.

Under what conditions might we refine the model this way?

Slowest validator?

> The model should probably return the time taken by the slowest validator which still managed to keep up with the load we applied.

How can I tell which one that is?

PostScript: Discussion notes

## reduced data set

# chain-wide deliveries
# chain_id, crankNum -> blockHeight, vatID, deliveryNum, kd, compute

# chain_id, vatID, deliveryNum -> blockHeight, kd, compute
# except vatTP?

# per-validator data
# chain_id, crankNum, run (slogfile, kernel-start) -> dur
# global crankNum -> vatID, deliveryNum

c500[['crankNum', 'vatID', 'deliveryNum']].set_index('crankNum')  # crankNum -> vatID, deliveryNum

# ignore un-full blocks?
# histogram of block durations; interval between...
# {"time":1625059432.2093444,"type":"cosmic-swingset-end-block-start","blockHeight":58394,"blockTime":1625059394}
# {"time":1625059432.2096362,"type":"cosmic-swingset-end-block-finish","blockHeight":58394,"blockTime":1625059394}

# "blockTime":1625059381 <- consensus block time is median of block times (?)   


# vatID, deliveryNum -> args / syscalls
# watch out for GC esp.

# c.run(runPolicy)
# simple model: kernel says how many computrons
# refinement: computrons, syscalls

# fitness: block distribution... 10s blocks...
#   blocks that aren't too big (latency, validator variance risk)
#   cpu that isn't idle (throughput)
# an ideal: median block time 10s
# 80 20 %ile


# importing a contract is an outlier
# median validator - existing distribution of deliveries / compute -> blocks
#  supplement: study wallclock stuff

dckc (Member) commented Jul 15, 2021

Very slow blocks in agorictest-16

Note: these slow blocks are all after the official phase 4 end time, 2021-07-02 19:00:00.

| blockHeight | blockTime | delta (s) |
| --- | --- | --- |
| 87413 | 2021-Jul-03 21:54:54 | 82 |
| 88179 | 2021-Jul-04 0:00:42 | 96 |
| 88232 | 2021-Jul-04 0:10:04 | 91 |
| 88263 | 2021-Jul-04 0:16:34 | 101 |
| 88282 | 2021-Jul-04 0:20:53 | 81 |
| 88296 | 2021-Jul-04 0:23:45 | 120 |
| 91374 | 2021-Jul-04 8:44:42 | 77 |
| 91485 | 2021-Jul-04 9:05:17 | 77 |
| 91558 | 2021-Jul-04 9:18:48 | 224 |
| 91566 | 2021-Jul-04 9:23:13 | 98 |
| 91835 | 2021-Jul-04 10:09:29 | 124 |
| 100113 | 2021-Jul-05 0:00:07 | 96 |

agorictest-16 block-to-block time deltas

[images: three plots of the block-to-block time delta distribution]

| stat | delta (s) |
| --- | --- |
| count | 39974 |
| mean | 7.59544 |
| std | 5.80749 |
| min | 1 |
| 25% | 6 |
| 50% | 6 |
| 75% | 6 |
| max | 224 |

Phase 4 officially ended 2021-07-02 19:00:00. If we take only blocks before the official end, we get a nicer picture:

[image: block-to-block time deltas, blocks before the official end]

| stat | delta (s) |
| --- | --- |
| count | 12480 |
| mean | 6.92332 |
| std | 2.87267 |
| min | 1 |
| 25% | 6 |
| 50% | 6 |
| 75% | 7 |
| max | 37 |

Analysis code

See testnet-notes/nb4

dckc (Member) commented Jul 15, 2021

172 slogfiles, ~500MB ea

[image: distribution of slogfile sizes (b64) and line counts]

b64 is bytes / 64, to make the scale comparable to lines

| stat | st_size (bytes) | lines |
| --- | --- | --- |
| count | 172 | 172 |
| mean | 4.8E+08 | 9969421 |
| std | 4.6E+08 | 9925292 |
| min | 1839335 | 2665 |
| 25% | 3.1E+07 | 1632114 |
| 50% | 4.9E+08 | 4733944 |
| 75% | 8.9E+08 | 1.8E+07 |
| max | 1.8E+09 | 3.5E+07 |

dckc (Member) commented Jul 20, 2021

Status: 0aab592 in https://github.com/Agoric/testnet-notes/commits/main/nb4:

feat: find compute meter mismatches (WIP)

It looked like ElectricCoinCo-agorictest16-chain.slog.gz had different compute meter results from about 17 other validators; but it turns out that those results were from agorictest-15 whereas the others were from agorictest-16. The connection from delivery back to blockTime (which distinguishes agorictest-15 from agorictest-16) and run had been lost.

I tried to re-attach deliveries to blockHeight and runs based on line numbers, but that seems to be O(n^3) where n (number of lines in the dataset) is in the millions, so not practical.
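An alternative that avoids line-number matching: slogfiles are chronological, so a single O(n) pass can carry the block context along. A sketch using the event types quoted in the notes above:

```python
import gzip
import json

def deliveries_with_block(path):
    """One pass over a slogfile: attach each delivery to its enclosing block."""
    block = None
    with gzip.open(path, 'rt') as f:
        for line in f:
            entry = json.loads(line)
            t = entry.get('type')
            if t == 'cosmic-swingset-end-block-start':
                block = (entry['blockHeight'], entry['blockTime'])
            elif t == 'deliver-result' and block is not None:
                yield block, entry
```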

TODO: rebuild delivery data, this time including blockTime and run context.

dckc (Member) commented Jul 21, 2021

@ski I just ran across a performance chart you made in #2630 (comment) and it makes me wonder about feeding these data into some tools that pros like yourself use.

dckc (Member) commented Jul 22, 2021

TODO: separate runs by chain

I figured out in discussion with @ski that if I do this, I don't have to re-crunch all the delivery data. It was missing blockTime but it does have run:

| | time | type | crankNum | vatID | deliveryNum | kd | line | dr | syscalls | method | compute | dur | file_id | blockHeight | run |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2021-Jul-02 4:16:39 | deliver-result | 17590 | v14 | 5015 | ['message', 'ko33', {'method': 'deliverInbound... | 2387778 | ['ok', None, {'meterType': 'xs-meter-8', 'comp... | 42 | deliverInboundMessages | 773365 | 0.06711 | 1.2E+15 | 66371 | 455 |
| 1 | 2021-Jul-02 4:16:39 | deliver-result | 17591 | v14 | 5016 | ['message', 'ko33', {'method': 'deliverInbound... | 2387808 | ['ok', None, {'meterType': 'xs-meter-8', 'comp... | 14 | deliverInboundAck | 290649 | 0.04505 | 1.2E+15 | 66371 | 455 |
| 2 | 2021-Jul-02 4:16:39 | deliver-result | 17592 | v14 | 5017 | ['message', 'ko33', {'method': 'deliverInbound... | 2387831 | ['ok', None, {'meterType': 'xs-meter-8', 'comp... | 9 | deliverInboundMessages | 184700 | 0.03628 | 1.2E+15 | 66371 | 455 |

@rowgraus rowgraus assigned warner and unassigned mhofman Jul 23, 2021
dckc (Member) commented Jul 24, 2021

tonight's approach:

  • agorictest-15/
    • ...
  • agorictest-16/
    • vat12/
      • validator-10-run-3-summary.csv.gz
      • validator-10-run-3-detail.csv.gz
      • validator-12-run-3-summary.csv.gz
      • validator-12-run-3-detail.csv.gz

where the summary files should be identical across all validators and the detail files have stuff that varies by validator.

dckc (Member) commented Jul 24, 2021

Ran out of memory: processing whole runs at a time in memory doesn't work; some are too big to fit.

Also, even the summaries were rarely identical, because the runs cover different timespans.

dckc (Member) commented Jul 28, 2021

Simple simulation of 10M computrons per block

Using data from just one run (legio-agorictest16-chain.slog.gz) with 102,869 deliveries from blocks 64,628 to 69,425...

This is a histogram of actual computrons per block (excluding executeContract deliveries) compared with a simulated always-busy chain with a threshold of 1e7 (10M) computrons per block:

[image: histogram of computrons per block, actual vs. simulated 10M-threshold]

The histogram of block durations looks like:

[image: histogram of block durations]

Breakdown of rate (compute / duration) by method:

| method | count | mean | std |
| --- | --- | --- | --- |
| executeContract | 15 | 8.0E+07 | 1.2E+07 |
| inbound | 615 | 6.7E+07 | 3.7E+07 |
| makeMintAndIssuer | 7 | 4.3E+07 | 6889523 |
| makeFakePriceAuthority | 5 | 4.2E+07 | 3856267 |
| makeAddTypeInvitation | 5 | 4.0E+07 | 1376630 |
| ... | ... | ... | ... |
| getDebtAmount | 203 | 1066019 | 334573 |
| getPayouts | 238 | 1031555 | 2547606 |
| getPromise | 1 | 942617 | NaN |
| getPayout | 1722 | 750406 | 457742 |
| addEgress | 0 | NaN | NaN |

p.s. I hope the sim function in my notebook does something reasonable.
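For reference, a guess at what such a sim might do (not the notebook's actual code): greedily pack the observed per-delivery computron counts into simulated blocks, starting a new block once the running total crosses the threshold:

```python
def sim(computrons_per_delivery, threshold=10_000_000):
    """Pack deliveries into simulated always-busy blocks."""
    blocks, current = [], 0
    for c in computrons_per_delivery:
        current += c
        if current >= threshold:   # block is full: close it
            blocks.append(current)
            current = 0
    if current:
        blocks.append(current)     # final partial block
    return blocks                  # computrons per simulated block

sim([5_000_000, 3_000_000, 4_000_000])  # -> [12_000_000]
```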

warner (Member, Author) commented Jul 29, 2021

@dckc and I spent some analysis time this afternoon.

Suppose we crunched our data and magically extracted a stable, consistent ratio from computron to wallclock time that was exactly accurate for a single validator. Now suppose we measure that ratio across a range of validators, both the fast ones and the slow ones. Imagine we get some sort of normal distribution, or maybe normal plus a long tail.

Now how should we use this to choose a target computron limit? What are our goals? One goal is to allow results to be returned in a timely fashion (reducing average latency, or reducing 90th-percentile latency). But long blocks have another consequence: they threaten validator liveness. This diagram tries to capture that consequence:

[image: validator-speeds diagram]

As the set of validators works on executing the transactions in the recently-finalized block, the faster ones will finish that execution before the slower ones. They cannot vote on the next block until they've finished execution. The voting time itself is not significant (and doesn't depend upon the swingset work being done), so their vote on the second block will be delivered some fairly constant amount of time after they finish execution. So these votes will arrive spread out over time.

The proposer of block-2 will wait until it sees a 2/3rds majority and then wait for the timeout_commit timer to expire (which defaults to 5 seconds) for more votes to arrive. Once that timer expires, the block proposer finalizes the block (all subsequent votes are ignored) and publishes block-2. Everybody then starts executing the txns inside block-2, and the cycle repeats.

But, the poor slow validators in the region on the right will still be working on the block-1 execution. If block-2 does not have much work to do, they'll be able to catch up, because of that 5 second delay. But if their excess execution time (beyond the 2/3rds percentile of all validators) is more than 5 seconds, they won't be able to catch up, and their votes for block-2 will be ignored too. If the chain continues to be busy, and never gives these validators a rest, they'll eventually miss so many blocks that they're kicked out of the validator pool. The remaining validator set will be smaller, so the median and 2/3rds percentile time will be slightly smaller, which might put the next-slowest validator at risk of falling behind.

Longer blocks (more computrons) will scale the X-axis of that shape, but the 5s timeout does not change. This may cause more validators to fall into the "can't catch up" zone, hastening their departure.

So one of our goals in limiting the block duration is to reduce the chances that slower validators will be kicked out.

We might want to have the scheduler pay attention to the consensus block time and, if it appears to be elevated, reduce the amount of swingset work we do (reduce the duty cycle) for some number of blocks until it falls back down to the target.

dckc (Member) commented Aug 2, 2021

(WIP notes...)

Deliveries with little variation

In one slogfile (run1 below) we found that getPayout usually cost 381240 computrons:

_run1[_run1.method == 'getPayout'].compute.hist()

[image: histogram of getPayout compute values]

Restricting to deliveries where _run1.compute == 31654, we see an increasing trend in wall-clock time:

[image: wall-clock time trend]

including a rather dramatic peak:

[image: peak in wall-clock time]

later:

[image: later wall-clock times]

run1: legio-agorictest16-chain.slog.gz

run of 31M lines; 1GB file (compressed):

index                                    445
time           2021-07-01 18:55:21.136728320
type                    import-kernel-finish
parent                             Legiojuve
name        legio-agorictest16-chain.slog.gz
st_size                           1094680419
lines                               31407771
chain                                   16.0

dckc (Member) commented Aug 3, 2021

One busy block suggests: 8-10M computrons to 10 sec

In agorictest-16, consensus blockTime for blockHeight 78842 was 35 sec.

A 2:1 ratio of compute time to voting time (i.e. 2/3rds utilization) suggests breaking it into three 10-second blocks.

A plot of cumulative computrons vs. time suggests an 8-10M computron threshold would do it, at least for this block:

[image: cumulative computrons vs. time within the block]

based on investigation of Agoric_by_QuantNode-agorictest16.slog.gz with @warner
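A back-of-the-envelope check of that threshold, assuming the ~0.8-1.0M computrons/sec slope read off the plot:

```python
compute_sec = 35                  # observed compute time for this busy block
target_sec = 10                   # per-block compute target (2:1 compute:voting)
rate_lo, rate_hi = 0.8e6, 1.0e6   # computrons/sec, eyeballed from the plot

print(compute_sec / target_sec)                    # ~3.5, i.e. about 3 blocks
print(target_sec * rate_lo, target_sec * rate_hi)  # 8M to 10M computrons
```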

Which vats use how much of the time?

[image: per-vat share of execution time]

dckc (Member) commented Aug 4, 2021

Visualizing JavaScript messaging in blocks

Block 78851 is a normal 6 second block with 72 deliveries. In Agoric_by_QuantNode-agorictest16.slog.gz, it starts at line 12199139 and goes for 9072 lines. The causal_slog.py script turns this into a graphviz diagram:

[image: graphviz diagram bk78851.gv]

Larger Blocks

Block 78840 took 35 seconds; what's going on in there?

https://gist.github.com/dckc/e701a84cfa6e22de06966a46bed5cf1a
https://gist.githubusercontent.com/dckc/e701a84cfa6e22de06966a46bed5cf1a/raw/a1933606b286a193f708fe324607e048e9db9df1/bk1.gv.svg

dckc (Member) commented Aug 6, 2021

Run Policy Simulation

We take the deliveries and the block in which they arrived. In the simulation, when the compute meter goes over the threshold (8M), we start a new block:

[image: simulated block boundaries at the 8M computron threshold]

Using time on both axes:

[image: the same simulation with time on both axes]

Zooming out to all 300K deliveries from 2021-07-01 10:15 to 2021-07-03 00:27, we get a max compute duration in a block of 13.7 sec simulated vs. 37.4 observed:

[image: simulated vs. observed block compute durations across all 300K deliveries]

The run_policy_simulation notebook has more details.

dckc (Member) commented Aug 6, 2021

Latency: Usually 0 blocks, but max 72!

The 8M computron run policy usually delays deliveries by 2 or fewer blocks but in some cases it seems to be as many as 72. I'm looking into what's going on.

count    293100.000000
mean          0.663733
std           3.921317
min           0.000000
25%           0.000000
50%           0.000000
75%           1.000000
max          72.000000
dtype: float64

[image: histogram of delivery delays in blocks]
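A sketch of how that per-delivery delay could be computed (the file name and the sim_block column, holding each delivery's simulated block, are hypothetical):

```python
import pandas as pd

# One row per delivery: observed blockHeight vs. simulated block assignment.
df = pd.read_parquet('sim_deliveries.parquet')

delay = df['sim_block'] - df['blockHeight']   # 0 = same block as observed
print(delay.describe())
delay.hist(bins=72)
```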

dckc (Member) commented Aug 7, 2021

Strange run of fromBridge, inbound deliveries

The crank range examined:

| crankNum | blockHeight | blockTime |
| --- | --- | --- |
| 103200 | 68817 | 2021-07-02 02:33:54 |
| 105500 | 69707 | 2021-07-02 05:07:10 |

| method | compute count | compute median | compute mean | duration count | duration median | duration mean |
| --- | --- | --- | --- | --- | --- | --- |
| fromBridge | 592 | 1.0E+07 | 1.4E+07 | 592 | 0.55812 | 0.8516 |
| inbound | 601 | 1.4E+07 | 2.0E+07 | 601 | 0.20442 | 0.32188 |
import pandas as pd

x = pd.read_sql('''
select *
from j_delivery
where crankNum between 103200 and 105500
and file_id = 3288529541296525
''', _db4)

x[x.compute > 1000000].groupby('method')[['compute']].aggregate(['count', 'median', 'mean'])

dckc (Member) commented Aug 9, 2021

In each block, balances for hundreds of accounts are updated in the bank vat (escalated to its own issue: #3629):

| blockTime | blockHeight | crankNum | type | dt | method | updated |
| --- | --- | --- | --- | --- | --- | --- |
| 2021-Jul-02 2:33:54 | 68817 | 103198 | deliver | message | inbound | 482 |
| 2021-Jul-02 2:33:54 | 68817 | 103199 | deliver | message | fromBridge | 482 |
| 2021-Jul-02 2:33:54 | 68817 | 103200 | deliver | notify | None | NaN |
| 2021-Jul-02 2:34:00 | 68818 | 103201 | deliver | message | inbound | 508 |
| 2021-Jul-02 2:34:00 | 68818 | 103202 | deliver | message | fromBridge | 508 |
| 2021-Jul-02 2:34:00 | 68818 | 103203 | deliver | notify | None | NaN |
| 2021-Jul-02 2:34:06 | 68819 | 103204 | deliver | message | inbound | 503 |
| 2021-Jul-02 2:34:06 | 68819 | 103205 | deliver | message | fromBridge | 503 |
| 2021-Jul-02 2:34:06 | 68819 | 103206 | deliver | notify | None | NaN |
| 2021-Jul-02 2:34:14 | 68820 | 103207 | deliver | message | inbound | 670 |

dckc (Member) commented Aug 11, 2021

Up to 5000 accounts updated per delivery

In just the first 100 deliveries in blocks 68817 to 69707:

[image: accounts updated per delivery, first 100 deliveries]

count      44.000000
mean      887.272727
std      1071.550981
min       221.000000
25%       483.000000
50%       507.000000
75%       770.000000
max      5000.000000

1000 deliveries: stable trend

[image: accounts updated per delivery, first 1000 deliveries]

count     516.000000
mean     2482.430233
std      1696.908192
min        40.000000
25%       784.000000
50%      2485.500000
75%      4094.000000
max      5001.000000
| | blockHeight | blockTime | crankNum | vatID | deliveryNum | tag | target | method | updated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 68817 | 2021-Jul-02 2:33:54 | 103198 | 11 | 3165 | message | ko35 | inbound | 482 |
| 1 | 68817 | 2021-Jul-02 2:33:54 | 103199 | 1 | 11644 | message | ko38 | fromBridge | 482 |
| 2 | 68817 | 2021-Jul-02 2:33:54 | 103200 | 11 | 3166 | notify | None | None | NaN |
| 3 | 68818 | 2021-Jul-02 2:34:00 | 103201 | 11 | 3167 | message | ko35 | inbound | 508 |
| 4 | 68818 | 2021-Jul-02 2:34:00 | 103202 | 1 | 11645 | message | ko38 | fromBridge | 508 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 69176 | 2021-Jul-02 3:53:33 | 104195 | 11 | 3679 | message | ko35 | inbound | 5000 |
| 996 | 69176 | 2021-Jul-02 3:53:33 | 104196 | 1 | 11907 | message | ko38 | fromBridge | 5000 |
| 997 | 69176 | 2021-Jul-02 3:53:33 | 104197 | 11 | 3680 | notify | None | None | NaN |
| 998 | 69177 | 2021-Jul-02 3:54:01 | 104198 | 11 | 3681 | message | ko35 | inbound | 727 |
| 999 | 69177 | 2021-Jul-02 3:54:01 | 104199 | 1 | 11908 | message | ko38 | fromBridge | 727 |
  • 5fbc01c feat: bank_trace etc

dckc (Member) commented Aug 11, 2021

Early vs. Late Simulation: Computrons go 3.7x faster early

early: crankNum 20000 to 70000
block 66411 to 67134 in agorictest-16

[image: early-period simulation]

late: crankNum >= 250000
block 78286 to 78912

[image: late-period simulation]

rate is compute / duration of a delivery

>>> rate_lo_median = 1.738564e+06
>>> rate_hi_median = 4.711452e+05
>>> round(rate_lo_median / rate_hi_median, 1)
3.7

dckc (Member) commented Aug 11, 2021

Vat Creation: 0.385 sec => charge 300K computrons for create-vat

| quantity | value |
| --- | --- |
| time to compute in a block (s) | 10.0 |
| time to create vats in a block (s) | 3.0 |
| computron threshold in a block | 8000000.0 |
| time to create a vat (mean observed, s) | 0.385451 |
| computrons per vat creation (projected) | 308361 |
count    8.000000
mean     0.385451
std      0.003590
min      0.380300
25%      0.382785
50%      0.385533
75%      0.388407
max      0.390025
| | blockHeight | blockTime | type | dur (s) |
| --- | --- | --- | --- | --- |
| 0 | 64628 | 2021-Jul-01 19:00:00 | create-vat | 0.39003 |
| 1 | 64628 | 2021-Jul-01 19:00:00 | create-vat | 0.3803 |
| 2 | 64628 | 2021-Jul-01 19:00:00 | create-vat | 0.3893 |
| 3 | 64628 | 2021-Jul-01 19:00:00 | create-vat | 0.38811 |
| 4 | 64628 | 2021-Jul-01 19:00:00 | create-vat | 0.38628 |
| 5 | 64628 | 2021-Jul-01 19:00:00 | create-vat | 0.38317 |
| 6 | 64628 | 2021-Jul-01 19:00:00 | create-vat | 0.38479 |
| 7 | 64628 | 2021-Jul-01 19:00:00 | create-vat | 0.38164 |
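The projected charge follows directly from the numbers above:

```python
threshold = 8_000_000        # computrons per block
compute_time = 10.0          # seconds of compute per block
vat_create_dur = 0.385451    # mean observed create-vat time, seconds

rate = threshold / compute_time        # 800,000 computrons per second
print(round(vat_create_dur * rate))    # -> 308361 computrons per create-vat
```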

dckc (Member) commented Aug 11, 2021

Let's declare victory on this much; we have a working model.

We can track work on examining the remaining slogfiles in #3660.

@dckc dckc closed this as completed Aug 11, 2021