Exablox testing journal

MattSpeller

Connected the delinquent "vancouver" one back up to a clean 100 Mbit LAN, will see if anything changes.

MattSpeller

Note: all errors mentioned below still exist on the "vancouver" unit as I post this, including the wacky dedupe ratio, 100GB greater capacity in use, and the file system inconsistency error.

scottalanmiller

Time to get support to take a gander, I would say.

MattSpeller

@scottalanmiller I'll give it 24h to sort itself out - if it was the hub, that'll give it time to heal. I really don't expect anyone's tested it with such poor networking gear, and while I didn't expect it to cause issues, it's outside of what I'd expect them to support.

JaredBusch

@MattSpeller No, that is exactly what you are paying for. Make use of it. By not contacting support you are wasting time and money.

MattSpeller

@JaredBusch If it was in production, 100% agree. I unfortunately don't have time to poke at it today so I'll let it sit.

SeanExablox

@MattSpeller said:

Have not yet contacted support as I suspect this may be related to the really crappy networking I gave it. I will let it sit and stew for another day as I have no time to poke it today.

Support will be reaching out to you to look at the secondary. Better to be proactive than reactive.

SeanExablox

@MattSpeller said:

Seems like the errors from last week have gotten strange. I've only been loading data to the "victoria" unit, however the "vancouver" one is showing greater storage used, along with a really sweet amount of dedupe

You'll see really good dedupe ratios like this, as it's strictly a calculation based on how much data is written to the Ring and then to the physical drives. Over the week and the unit's lifetime you'll get a more 'reasonable' ratio.
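For anyone following along, here is a minimal sketch of why an early ratio can look inflated. It assumes the ratio is simply data written to the Ring divided by data that actually lands on the physical drives; the exact formula Exablox uses is not stated in this thread, and the function name and byte counts below are made up for illustration.

```python
def dedupe_ratio(bytes_written_to_ring: int, bytes_on_physical_drives: int) -> float:
    """Assumed definition: logical data written divided by physical data stored."""
    if bytes_on_physical_drives <= 0:
        raise ValueError("physical bytes must be positive")
    return bytes_written_to_ring / bytes_on_physical_drives


GB = 10**9

# Hypothetical early snapshot: lots written, little on disk so far.
early = dedupe_ratio(500 * GB, 50 * GB)

# Hypothetical later snapshot: physical usage has caught up with what was written.
later = dedupe_ratio(2000 * GB, 1400 * GB)

print(f"early ratio: {early:.1f}:1")
print(f"later ratio: {later:.1f}:1")
```

With a small denominator, even a modest change in physical usage swings the ratio a lot, which would fit the ratio settling down over time.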

MattSpeller

Contact established with Support at 1:30pm, email at 4:42pm requesting permission to reboot, I replied immediately, box rebooted at 4:46 with all errors cleared. I'm a happy camper.

MattSpeller

So, error came back maybe 30 minutes after it was cleared. I rebooted the unit again and it seems to be gone for good. Support reached out to me again and suggested I run the software upgrade they released - it's now underway on both.

scottalanmiller

Cool, thanks for all of the feedback.

MattSpeller

@scottalanmiller Helps me later when I go to write it up for the boss

MattSpeller

An interesting find developed over the long weekend. The troubled OneBlox unit finally stopped talking some time Friday evening. I came in this morning and traced it down to a faulty ethernet cable. How much this has contributed to its chain of errors is unknown, and I am happy to give it the benefit of the doubt moving forward.

MattSpeller

To speed up testing I've installed a gigabit switch; this absolutely will change their performance.

scottalanmiller

@MattSpeller said:

To speed up testing I've installed a gigabit switch; this absolutely will change their performance.

Welcome to 2003.

nadnerB

@scottalanmiller said:

Welcome to 2003.

Where everything is fine and dandy... Until July

MattSpeller

Having some further errors on the OneBlox units; apparently they're not talking to each other right now. While I find this slightly irritating, I want to point out the following email chain. Check out the times.

Matthew Speller
Feb 13, 11:27
Hello Support,
Could I please get a hand resolving this?

Bob Gardiner (Exablox Support)
Feb 13, 11:30
I will take a look and let you know what I find.
Bob

nadnerB

Hey @MattSpeller here's some interesting reading:
http://www.theregister.co.uk/2015/02/13/exablox_not_good_for_restore/

Have you tested the restore times?
Did you find them adequate?

MattSpeller

@nadnerB Not yet, though today may be the day.

MattSpeller

Far from an exhaustive test, but I managed to toss around 100GB to get some BASIC performance numbers. Data used is my ISO cache, so really big files and not very many of them. Test setup was on gigabit LAN, with an external eSATA 7200rpm drive connected to a Latitude 6410. Virtually no other traffic exists on this switch. I also double checked the drives in the Exablox: Seagate 4TB 5900rpm consumer grade junk.

Results:
33MB/s up
20MB/s down
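To put those two numbers in context, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 GB = 1000 MB), a sustained rate for the whole transfer, and a theoretical gigabit ceiling of 125 MB/s with no protocol overhead; none of those assumptions were measured in the test, and the function name is made up for illustration.

```python
def transfer_seconds(size_gb: float, rate_mb_per_s: float) -> float:
    """Seconds to move size_gb at a sustained rate, using 1 GB = 1000 MB."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_gb * 1000 / rate_mb_per_s


# Theoretical gigabit ceiling: 1000 Mbit/s divided by 8 bits per byte.
GIGABIT_LINE_RATE_MB_S = 1000 / 8

for label, rate in (("up", 33), ("down", 20)):
    minutes = transfer_seconds(100, rate) / 60
    share = rate / GIGABIT_LINE_RATE_MB_S
    print(f"{label}: {minutes:.0f} min per 100 GB, {share:.0%} of gigabit line rate")
```

Under those assumptions, 100GB takes a bit under an hour to write and closer to an hour and a half to read back, which is the number to keep in mind for the restore question.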

Obviously this is very primitive testing; the article linked by Brendan (and the test linked inside it) is much more exhaustive and includes queue depth >1. While these numbers are unimpressive, I feel it's important to point out they make zero claims about its performance. It's also not currently capable of hosting VMs (lacks SMB3), so that demanding use case is off the table.

Other points of interest while I'm on the topic:

File system reclamation has been "paused" (deleting data does not free up space) for quite some time. Support tells me it's a software issue and will be patched in the next release. Having just updated to 2.3 Merlot I'm not hopeful that will be soon, however I'm unaware of their release schedule and may be overly pessimistic due to lack of coffee.
Support continues to be exceptionally prompt and friendly.
Unfortunately, I've had to contact support (or they've contacted me) several times to clear errors. Notably most errors have been with replication & reclamation.
On the troublesome "Vancouver" unit, I just yanked a drive, counted to ten, and put it back in. Got an email alert promptly afterwards: "The filesystem rebalancer is currently running to protect data health. Do not remove drives or OneBlox from the Ring while this is in progress." - fair play. Will update later when it finishes if I remember.