Meeting 6 - October 22nd, 2025¶
What Got Done¶
More Load Testing¶
-
To see full results of load testing. Go to the load test runs tab
-
getversion- Never fails
-
getclient- Fails at 10 users after around 60k req
- When throttled, fails somewhere between 100-500 req/s at 10 user
- At 500 req/s, still failed after 60k req
-
createkey- Fails with 10 users, no throttle, between 2500-3000 iterations
- Throttling helps, but will still freeze eventually
-
getkey- Need help with environment variables
Looking at System while running¶
The test I chose to repeat while looking at system was createkey, 10 users, 20,000 iterations, 250 req/s. When I last ran this, it failed after ~60K iterations.
The only difference I made in the test was increased iteration to 50,000
I used top and isolated tms_server's PID to view diagnostics
Results:
- Got through almost 200k iterations, I did clear the sqlite db inbetween these runs. I expect that it was less full during this run. Maybe thats why it was able to get through almost 200K as opposed to 60K
- Lasted 7 minutes of CPU runtime
- Was sleeping more that it was running
- MEM stable at 15M
- CPU usage ranged between 20-60% the entire time until it crashed and instantly dropped to 0%
Thoughts:
- My inital thoughts are that because tms_server is sleeping much more than it is running. The server is I/O bounded writing things to sqlite and not CPU or memory bounded
- I'm guessing that it can compute much faster than it can output the results, a queue forms, then at some point the queue gets too long for sqlite to handle and then crashes
- This would explain why throttling helps because it gives tms_server more time to output results
- I feel like this also explains why regardless of certain factors, different test will fail in similar spots (i.e. different test both failing after 50k iterations)
- All tests for 1 user, regardless of how many iterations, never failed. Also supports this hypothesis
- Once we add more users, only 1 can write the sqlite at a time, others wait, queue forms
- I then removed throttling and it failed after about ~8 seconds of CPU run time, further supports current hypothesis
A Solution?¶
- After this, I thought about if there would be an obtainable solution if this is the problelm
- I thought about the extreme case, say this server becomes immesly popular
- I ran a test with 1000 users, all trying to do 100000 iterations, but limited them to 10 req/s
- The server runs with no issues, even with that many concurrent users as long as we limit the amount of requests they do
- Can we make a global/server max amount of requests that can happen, and then spread that among the current users doing requests?
Road Blocker and Questions¶
- Need environment variables for
getkey
Next Steps¶
- If we think this is the issue, modifying tms_server to limit the amount of global requests it takes
- This would require more testing to more accuretly figuring how much it can handle
- We would need to note if adding things like tls back in the picture would affect this limit
- Would also need to account for testing being done locally, not where the server would be hosted at TACC
- Would be interesting to test this on a TACC cluster, which would likly have much better I/O than I am getting running locally