Date:      Fri, 11 Jul 2014 02:14:52 +0300
From:      Alexander Motin <mav@FreeBSD.org>
To:        Kashyap Desai <kashyap.desai@avagotech.com>
Cc:        FreeBSD-scsi <freebsd-scsi@freebsd.org>
Subject:   Re: SSDs performance on head/freebsd-10 stable using FIO
Message-ID:  <53BF1E6C.5030806@FreeBSD.org>
In-Reply-To: <9f138f242e278476e5c542d695e58bc8@mail.gmail.com>
References:  <8fbe38cdad1e66717a9de7fdf63812c2@mail.gmail.com> <53BE8784.8060503@FreeBSD.org> <9f138f242e278476e5c542d695e58bc8@mail.gmail.com>

On 10.07.2014 16:28, Kashyap Desai wrote:
> From: Alexander Motin [mailto:mavbsd@gmail.com] On Behalf Of Alexander Motin
>> On 10.07.2014 15:00, Kashyap Desai wrote:
>>> I have 8 SSDs in my setup, and all 8 SSDs are behind LSI's 12Gb/s
>>> MegaRAID controller as JBOD. I also found that FIO can be used in
>>> async mode after loading the "aio" kernel module.
>>>
>>> Using a single SSD, I am able to see 110K-130K IOPS. This IOPS
>>> count matches what I see on a Linux machine.
>>>
>>> Now, I am not able to scale IOPS on my machine beyond 200K. I see
>>> the CPU is almost fully occupied, with no idle time, once IOPS
>>> reach 200K.
>>>
>>> If you have any pointers to try, I can run some experiments on my
>>> setup.
>>
>> With such results I would immediately start profiling with
>> pmcstat. Quite likely you are hitting some new lock congestion.
>> Start with a simple `pmcstat -n 100000000 -TS unhalted-cycles`. It
>> is hard to say for sure what went wrong there without more data, so
>> just a couple of
> I have attached the profile output for the command mentioned above.
> I will dig further and see whether this is a theoretical limit for
> CAM-attached HBAs.

The first thing I noticed in this profile output is a bunch of TLB
shootdowns. You cannot reach reasonable performance from user level
without the HBA supporting unmapped I/O. Both the mps and mpr drivers
support it, but for some reason mrsas still does not. Even at non-peak
I/O rates on a multi-core system, TLB shootdowns in such a case can
eat an additional 30% of CPU time.
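
For reference, a SIM advertises unmapped I/O support by setting
PIM_UNMAPPED in its XPT_PATH_INQ handler and then loading DMA maps
with bus_dmamap_load_ccb() on the data path. A rough sketch only --
the softc/command/callback names here are made up, this is not actual
mrsas code:

    case XPT_PATH_INQ: {
            struct ccb_pathinq *cpi = &ccb->cpi;

            /* ... fill in the usual path inquiry fields ... */
            cpi->hba_misc |= PIM_UNMAPPED;  /* we accept unmapped I/O */
            cpi->ccb_h.status = CAM_REQ_CMP;
            xpt_done(ccb);
            break;
    }

    /*
     * Data path: load the CCB itself, so busdma can walk the
     * physical pages of an unmapped bio without mapping them into
     * the kernel address space.
     */
    error = bus_dmamap_load_ccb(sc->data_dmat, cmd->dmamap, ccb,
        xdriver_load_cb, cmd, BUS_DMA_NOWAIT);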

Another thing I see is the mentioned congestion on the driver's CAM
SIM lock. You need either multiple cards or multiqueue support.

>> thoughts:
>>
>> First of all, I've never tried aio in my benchmarks, only
>> synchronous ones. Try running 8 instances of `dd if=/dev/daX
>> of=/dev/null bs=512` per SSD at the same time, just as I did. You
>> may vary the number of dd's, but keep the total below 256, or you
>> may need to increase the nswbuf limit set in
>> kern_vfs_bio_buffer_alloc().
> 
> I also ran multiple dd instances and I am seeing IOPS throttle
> somewhere around 200K.
> 
> Do we have any mechanism to check the CAM layer's maximum IOPS
> without involving an actual device? Something like a _null_ device
> driver which just sends the command straight back to the CAM layer?

There is no such driver now. Such a test would radically change the
timing of operations, and I am not sure how useful the results would
be.
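
For what it's worth, the skeleton of such a null SIM would be quite
small. A rough sketch of the idea only -- the nullsim names are
invented, and the bookkeeping a real SIM needs (XPT_PATH_INQ,
XPT_GET_TRAN_SETTINGS, etc.) is mostly omitted:

    static void
    nullsim_action(struct cam_sim *sim, union ccb *ccb)
    {
            switch (ccb->ccb_h.func_code) {
            case XPT_SCSI_IO:
                    /* Complete immediately, never touch hardware. */
                    ccb->csio.resid = 0;
                    ccb->ccb_h.status = CAM_REQ_CMP;
                    break;
            default:
                    ccb->ccb_h.status = CAM_REQ_INVALID;
                    break;
            }
            xpt_done(ccb);
    }

    static void
    nullsim_poll(struct cam_sim *sim)
    {
    }

    /* In attach: one devq/SIM pair, 256 outstanding commands. */
    devq = cam_simq_alloc(256);
    sim = cam_sim_alloc(nullsim_action, nullsim_poll, "nullsim",
        softc, 0 /* unit */, &softc->sim_lock, 256, 256, devq);
    xpt_bus_register(sim, NULL /* no device_t */, 0);

But as said above, completing in the submission context like this
skews timings so much that the number it produces would not say much
about a real HBA.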

>> Second, you are using a single HBA, and that should create
>> significant congestion around its CAM SIM lock. The proper solution
>> would be to add multiple-queue support to the driver, and we have
>> discussed it with Scott Long for quite some time, but that requires
>> more work (I hope you may be interested in it ;) ). Or you may just
>> insert 3-4 HBAs. I reached my million IOPS with four 2008/2308
>> 6Gbps HBAs and 16 SATA SSDs.
> 
> I remember this part and would be glad to contribute to this work.
> As part of it we have initiated a multiple-MSI-X implementation in
> <mrsas>, which will have one reply queue per MSI-X vector.

Cool!

> Do we really need multiple submission queues in the low-level
> driver? I thought there would be a CAM interface for multiqueue
> which _all_ low-level drivers need to hook into.

For now CAM is still oriented around a single submission queue, but it
allows a driver to have multiple completion queues. So I would start
by implementing the latter, each bound to its own MSI-X interrupt and
calling completion without taking the SIM lock or holding any other
locks during the upcall. CAM provides a way to avoid an extra context
switch in that case, which could be very useful.
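
A sketch of the shape I mean, assuming one handler per MSI-X vector
(the xdriver queue structures are invented for illustration, and the
xpt_done_direct() entry point is assumed to be the CAM mechanism for
skipping the completion-thread context switch):

    /*
     * Per-vector interrupt handler: each MSI-X vector drains only
     * its own reply queue and completes CCBs without taking the SIM
     * lock or any other driver lock.
     */
    static void
    xdriver_complete_intr(void *arg)
    {
            struct xdriver_queue *q = arg;  /* one per MSI-X vector */
            union ccb *ccb;

            while ((ccb = xdriver_next_done(q)) != NULL) {
                    ccb->ccb_h.status = CAM_REQ_CMP;
                    /*
                     * Process the completion in this context instead
                     * of queueing it to a CAM completion thread,
                     * avoiding the extra context switch.
                     */
                    xpt_done_direct(ccb);
            }
    }

    /* One bus_setup_intr(9) call per vector at attach time: */
    error = bus_setup_intr(dev, q->irq_res,
        INTR_TYPE_CAM | INTR_MPSAFE, NULL, xdriver_complete_intr,
        q, &q->intr_cookie);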

-- 
Alexander Motin


