Meeting 6 - October 22nd, 2025

What Got Done

More Load Testing

To see the full load testing results, go to the load test runs tab.
getversion
- Never fails
getclient
- Fails at 10 users after around 60k req
- When throttled, fails somewhere between 100-500 req/s at 10 users
- At 500 req/s, still failed after 60k req
createkey
- Fails with 10 users, no throttle, between 2500-3000 iterations
- Throttling helps, but will still freeze eventually
getkey
- Need help with environment variables

Looking at System while running

The test I chose to repeat while watching the system was createkey: 10 users, 20,000 iterations, 250 req/s. When I last ran this, it failed after ~60K iterations.

The only change I made to the test was increasing the iterations to 50,000.

I used top, filtered to tms_server's PID, to view diagnostics.

Results:

Got through almost 200K iterations. I did clear the sqlite db between these runs, so I expect it was less full during this run. Maybe that's why it got through almost 200K as opposed to 60K
Lasted 7 minutes of CPU runtime
Was sleeping more than it was running
MEM stable at 15M
CPU usage ranged between 20-60% the entire time until it crashed and instantly dropped to 0%

Thoughts:

My initial thought is that, because tms_server is sleeping much more than it is running, the server is I/O bound on writes to sqlite rather than CPU or memory bound
I'm guessing that it can compute much faster than it can write out the results, so a queue forms, and at some point the queue gets too long for sqlite to handle and it crashes
This would explain why throttling helps: it gives tms_server more time to write out results
I feel like this also explains why, regardless of certain factors, different tests fail in similar spots (e.g. different tests both failing after 50k iterations)
All tests with 1 user, regardless of how many iterations, never failed, which also supports this hypothesis
Once we add more users, only 1 can write to sqlite at a time, so the others wait and a queue forms
I then removed throttling and it failed after ~8 seconds of CPU run time, which further supports the current hypothesis
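
The single-writer part of this hypothesis can be checked in isolation. Below is a minimal sketch in Python (used only for illustration, this is not tms_server code) showing that while one sqlite connection holds a write transaction, a second connection cannot start its own. The table name `keys`, the file name, and the function name are hypothetical, and `timeout=0` is chosen so the second writer fails immediately instead of waiting.

```python
import os
import sqlite3
import tempfile


def second_writer_is_blocked(db_path: str) -> bool:
    """Return True if a second connection cannot start a write
    while the first connection holds an open write transaction."""
    # isolation_level=None gives autocommit mode, so transactions are
    # controlled explicitly with BEGIN / ROLLBACK below.
    writer_a = sqlite3.connect(db_path, isolation_level=None)
    # timeout=0: do not wait for the lock, fail right away if it is taken.
    writer_b = sqlite3.connect(db_path, isolation_level=None, timeout=0)
    try:
        # Hypothetical table, only here so there is something to write to.
        writer_a.execute(
            "CREATE TABLE IF NOT EXISTS keys (id INTEGER PRIMARY KEY, v TEXT)"
        )
        writer_a.execute("BEGIN IMMEDIATE")  # first writer takes the write lock
        writer_a.execute("INSERT INTO keys (v) VALUES ('a')")
        try:
            writer_b.execute("BEGIN IMMEDIATE")  # second writer asks for it too
        except sqlite3.OperationalError:
            # sqlite reports that the database is locked by another writer
            return True
        writer_b.execute("ROLLBACK")
        return False
    finally:
        if writer_a.in_transaction:
            writer_a.execute("ROLLBACK")
        writer_a.close()
        writer_b.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(second_writer_is_blocked(os.path.join(tmp, "demo.sqlite")))
```

With a nonzero timeout the second writer would wait for the lock instead of failing, which is the kind of waiting that could build up into a queue under many concurrent users. Whether tms_server actually hits this path is still an assumption to verify.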

A Solution?

After this, I thought about whether there would be an attainable solution if this is the problem
I thought about the extreme case: say this server becomes immensely popular
I ran a test with 1000 users, all trying to do 100,000 iterations, but limited them to 10 req/s
The server runs with no issues, even with that many concurrent users, as long as we limit the number of requests they make
Can we set a global/server-wide maximum request rate, and then spread that among the users currently making requests?
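
One way to picture the global limit idea is a single token bucket shared by all clients, plus an even split of the budget across active users. This is a rough sketch in Python for illustration only, not how tms_server is implemented. `GlobalRateLimiter`, `per_user_share`, and all the numbers are hypothetical, and the class is not thread-safe as written.

```python
import time
from typing import Callable


class GlobalRateLimiter:
    """Token bucket shared by every client: roughly `rate` requests
    per second overall, with short bursts of up to `burst` requests."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = burst     # maximum tokens that can be stored
        self.tokens = burst       # start with a full bucket
        self.clock = clock        # injectable so it can be tested
        self.last = clock()

    def try_acquire(self) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill based on elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def per_user_share(global_rate: float, active_users: int) -> float:
    """Evenly split the global budget among the users currently active."""
    if active_users <= 0:
        return global_rate
    return global_rate / active_users


if __name__ == "__main__":
    limiter = GlobalRateLimiter(rate=10, burst=5)
    accepted = sum(limiter.try_acquire() for _ in range(20))
    print(f"accepted {accepted} of 20 back-to-back requests")
    print(f"per-user share: {per_user_share(100.0, 10)} req/s")
```

The actual global rate would have to come from the load test results. An even split is only the simplest policy, and a rejected request could just as well be delayed as refused.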

Roadblocks and Questions

Need environment variables for getkey

Next Steps

If we think this is the issue, modify tms_server to limit the number of requests it accepts globally
This would require more testing to figure out more accurately how much it can handle
We would need to note whether adding things like TLS back into the picture would affect this limit
Would also need to account for testing being done locally, not where the server would be hosted at TACC
Would be interesting to test this on a TACC cluster, which would likely have much better I/O than I am getting running locally