[Neo-users] Current tests
Vincent Pelletier
vincent at nexedi.com
Fri Apr 23 14:40:19 CEST 2010
Hi.
We are currently running "real-world" tests on NEO to see how fast and how
stable it currently is. Here are the results of tests conducted last night.
Note that the setup below is a very early layout; it will evolve as we find
and fix issues or limits.
Test environment:
- 3 blades in a single enclosure, each blade having:
  - 32 GB RAM
  - 8 cores (2 quad-core AMD CPUs)
  - 2 x 1 TB hard disks, no RAID
- gigabit network between blades, no trunking, switched
NEO cluster:
- 4 master nodes (on a single blade at the moment)
- 6 storage nodes (2 per blade), 1 replica
- 24 Zope processes (8 per blade) acting as client nodes
Application:
- ERP5: 12 + 1 activity processes (i.e., processes performing asynchronous
actions generated by user actions), 11 "user" processes (i.e., nodes which
are directly accessed by the test suites).
- ERP5 catalog stored in MySQL on one blade
Test suite:
mechanize-based scripts creating, editing and validating documents.
All documents are created in a single BTreeFolder2 container (an ERP5 Module).
Results:
Throughput of 5200 documents/hour, including the entire validation.
For reference, we achieved around 6000 documents/hour on ZEO when creating
them in 4 different BTreeFolder2 containers, each in a different FileStorage,
to reduce the effect of the transaction-level lock.
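To put the figures in perspective, here is a quick back-of-the-envelope
computation (only the numbers stated in this report are used; nothing else is
measured):

```python
# Back-of-the-envelope arithmetic on the figures above.  5200 and 6000
# documents/hour are the measured NEO and ZEO throughputs; 2.5 hours is
# the run duration before the storage node crash described below.
neo_rate = 5200          # documents/hour on NEO
zeo_rate = 6000          # documents/hour on ZEO (4 folders, 4 FileStorages)

per_second = neo_rate / 3600.0                  # ~1.44 documents/second
gap = (zeo_rate - neo_rate) / float(zeo_rate)   # NEO is ~13% below ZEO
docs_before_crash = neo_rate * 2.5              # ~13000 documents written

print("%.2f docs/s, %.1f%% below ZEO, %d docs in 2.5 h"
      % (per_second, gap * 100, docs_before_crash))
```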
After two and a half hours of testing, 2 storage nodes crashed on an
assertion failure (which we are investigating now), causing the cluster to
drop out of the running state into the verification state.
After restarting all processes, the cluster went back to the running state.
There was no data loss as far as we could tell:
- We checked the consistency of current objects (i.e., the latest versions of
objects) by reindexing the whole site (an ERP5 feature iterating over objects
to index them - this goes through most ZODB objects).
- We have not checked non-current versions (i.e., transaction data vs. object
data consistency) yet.
Sadly, due to a human error (the wrong storage nodes were dropped during
later, unrelated operations), the dataset was lost.
Conclusions and remarks on the test setup:
- It is already very good news to get such performance compared with ZEO in
such a "high-level" test, which involves much complexity outside of the
storage part.
- If confirmed the next time a similar event occurs, it is good news that no
data was lost during the downtime.
- We must make it harder for a human to break an otherwise working cluster.
- It is not optimal to have MySQL used both by storage nodes and for the ERP5
index; we will reduce the number of storage nodes to 4, with MySQL on the
remaining blade used only for the ERP5 index.
- It is not satisfactory to go out of service after losing just 2 storage
nodes out of 6. We will investigate the automated partition distribution to
see whether this is a corner case (and how it can be solved) or a hard
restriction (in which case we will document it).
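To illustrate why losing 2 nodes out of 6 can be fatal with 1 replica, here
is a toy simulation. It is NOT NEO's actual partition table algorithm, just a
round-robin sketch: with 2 copies per partition, any partition whose 2 copies
happen to live on the 2 failed nodes becomes unreadable.

```python
from itertools import combinations

# Toy model: PARTITIONS partitions, each stored on COPIES of the NODES
# storage nodes (1 replica = 2 copies per partition).  Copies are assigned
# round-robin; this is an illustration, not NEO's real allocation.
NODES = 6
COPIES = 2
PARTITIONS = 12

table = {p: {(p + i) % NODES for i in range(COPIES)}
         for p in range(PARTITIONS)}

def survives(failed):
    """True if every partition still has at least one live copy."""
    return all(copies - failed for copies in table.values())

# Count which 2-node failures take the cluster out of service.
pairs = list(combinations(range(NODES), 2))
fatal = [f for f in pairs if not survives(set(f))]
print("fatal 2-node failures: %d of %d" % (len(fatal), len(pairs)))
# With this naive round-robin layout, 6 of the 15 possible 2-node
# failures (the adjacent pairs) leave some partition with no copy.
```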
- We will use haproxy in our next tests to even out the load on Zope
processes, as our monitoring showed the test suites (which access Zope
directly) were mostly hitting a subset of them.
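As an illustration of what we have in mind, a minimal haproxy front-end could
look like the following (a sketch only: addresses, ports and the balancing
algorithm are placeholders, not our actual configuration):

```
# Hypothetical haproxy front-end spreading test-suite traffic over the
# 24 Zope processes (8 per blade).  All addresses/ports are placeholders.
listen zope-farm 0.0.0.0:8000
    balance roundrobin
    server blade1-zope1 10.0.0.1:8080 check
    server blade1-zope2 10.0.0.1:8081 check
    # ... one "server" line per Zope process, 24 in total
```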
Regards,
--
Vincent Pelletier