Date: Mon, 14 Jul 2014 14:06:15 +0530
From: Kashyap Desai <kashyap.desai@avagotech.com>
To: Alexander Motin <mav@freebsd.org>
Cc: FreeBSD-scsi <freebsd-scsi@freebsd.org>
Subject: RE: SSDs performance on head/freebsd-10 stable using FIO
Message-ID: <a4e86127552716ba989836fbcfc7676b@mail.gmail.com>
In-Reply-To: <53BF1E6C.5030806@FreeBSD.org>
References: <8fbe38cdad1e66717a9de7fdf63812c2@mail.gmail.com>
 <53BE8784.8060503@FreeBSD.org>
 <9f138f242e278476e5c542d695e58bc8@mail.gmail.com>
 <53BF1E6C.5030806@FreeBSD.org>
> -----Original Message-----
> From: Alexander Motin [mailto:mavbsd@gmail.com] On Behalf Of Alexander
> Motin
> Sent: Friday, July 11, 2014 4:45 AM
> To: Kashyap Desai
> Cc: FreeBSD-scsi
> Subject: Re: SSDs performance on head/freebsd-10 stable using FIO
>
> On 10.07.2014 16:28, Kashyap Desai wrote:
> > From: Alexander Motin [mailto:mavbsd@gmail.com] On Behalf Of Alexander
> >> On 10.07.2014 15:00, Kashyap Desai wrote:
> >>> I have 8 SSDs in my setup, all behind LSI's 12Gb/s MegaRAID
> >>> controller as JBOD. I also found that FIO can be used in async mode
> >>> after loading the "aio" kernel module.
> >>>
> >>> Using a single SSD, I am able to see 110K-130K IOPS. This IOPS count
> >>> matches what I see on a Linux machine.
> >>>
> >>> However, I am not able to scale IOPS on my machine beyond 200K. I
> >>> see the CPU almost fully occupied, with no idle time, once IOPS
> >>> reach 200K.
> >>>
> >>> If you have any pointers to try, I can run some experiments on my
> >>> setup.
> >>
> >> Getting such results, I would immediately start profiling with
> >> pmcstat. Quite likely you are hitting some new lock congestion. Start
> >> with a simple `pmcstat -n 100000000 -TS unhalted-cycles`. It is hard
> >> to say for sure what went wrong there without more data, so just a
> >> couple of
> > I have attached the profile output for the command mentioned above. I
> > will dig further and see whether this is the theoretical limit for a
> > CAM-attached HBA.
>
> The first thing I noticed in this profile output is a bunch of TLB
> shootdowns. You cannot reach reasonable performance from user level
> without the HBA driver supporting unmapped I/O. Both mps and mpr
> support it, but for some reason mrsas still does not. Even at non-peak
> I/O rates on a multi-core system, TLB shootdowns in such a case can eat
> an additional 30% of CPU time.

Thanks! For this part, I can try it in mrsas. Can you help me understand
what you mean by unmapped I/O? (See the sketch below.)

> Another thing I see is the mentioned congestion on the driver's CAM SIM
> lock. You need either multiple cards or multiqueue.
>
> >> thoughts:
> >>
> >> First of all, I've never tried aio in my benchmarks, only synchronous
> >> ones. Try running 8 instances of `dd if=/dev/daX of=/dev/null bs=512`
> >> per SSD at the same time, just as I did. You may vary the number of
> >> dd's, but keep the total below 256, or you may need to increase the
> >> nswbuf limit in kern_vfs_bio_buffer_alloc().
> >
> > I also ran multiple dd instances and still see IOPS throttle somewhere
> > around 200K.
> >
> > Do we have any mechanism to check the CAM layer's maximum IOPS without
> > involving an actual device? Something like a _null_ device driver
> > which just sends the command straight back to the CAM layer?
>
> There is no such thing now. Such a test would radically change the
> timings of operation, and I am not sure how useful the results would be.
>
> >> Second, you are using a single HBA, which should create significant
> >> congestion around its CAM SIM lock. The proper solution would be to
> >> add multiple-queue support to the driver, and we have discussed it
> >> with Scott Long for quite some time, but that requires more work (I
> >> hope you may be interested in it ;) ). Or you may just insert 3-4
> >> HBAs. I was reaching my million IOPS with four 2008/2308 6Gbps HBAs
> >> and 16 SATA SSDs.
> >
> > I remember this part and would be really glad to contribute to this
> > work. As part of it we have initiated a multiple-MSI-X implementation
> > in <mrsas>, which will have one reply queue per MSI-X vector.
>
> Cool!
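To illustrate the unmapped I/O question above, here is a rough sketch of
what such support looks like in a CAM SIM driver on FreeBSD 10; the
"xxx_" names are invented placeholders, not the real mrsas code.
"Unmapped" means the ccb's payload may arrive as bare vm_page_t's with
no kernel virtual address, so the driver must never dereference
csio->data_ptr and should let bus_dmamap_load_ccb() build the S/G list:

/*
 * Sketch only, hypothetical "xxx_" names.  Because the driver never
 * touches the payload through KVA, no transient mapping has to be
 * created and torn down, which is what eliminates the cross-CPU TLB
 * shootdowns seen in the profile.  The two required pieces are:
 * (1) advertise PIM_UNMAPPED, (2) map data via bus_dmamap_load_ccb().
 */
#include <sys/param.h>
#include <sys/errno.h>
#include <sys/bus.h>
#include <machine/bus.h>

#include <cam/cam.h>
#include <cam/cam_ccb.h>
#include <cam/cam_sim.h>
#include <cam/cam_xpt_sim.h>

struct xxx_cmd {				/* hypothetical per-I/O state */
	bus_dma_tag_t	dmat;
	bus_dmamap_t	dmamap;
};

static struct xxx_cmd *xxx_get_cmd(struct cam_sim *sim);   /* hypothetical */
static void xxx_dma_cb(void *arg, bus_dma_segment_t *segs, /* builds the   */
    int nseg, int error);				    /* h/w S/G list */

static void
xxx_action(struct cam_sim *sim, union ccb *ccb)
{
	struct xxx_cmd *cmd;
	int error;

	switch (ccb->ccb_h.func_code) {
	case XPT_PATH_INQ:
		/* (1) Tell CAM this SIM accepts unmapped data buffers. */
		ccb->cpi.hba_misc |= PIM_UNMAPPED;
		ccb->ccb_h.status = CAM_REQ_CMP;
		xpt_done(ccb);
		break;
	case XPT_SCSI_IO:
		cmd = xxx_get_cmd(sim);
		/*
		 * (2) bus_dmamap_load_ccb() handles CAM_DATA_VADDR,
		 * CAM_DATA_SG and CAM_DATA_BIO (possibly unmapped)
		 * transfers uniformly, so the driver code is the same
		 * whether or not the buffer has a kernel mapping.
		 */
		error = bus_dmamap_load_ccb(cmd->dmat, cmd->dmamap, ccb,
		    xxx_dma_cb, cmd, BUS_DMA_NOWAIT);
		if (error != 0 && error != EINPROGRESS) {
			ccb->ccb_h.status = CAM_REQ_CMP_ERR;
			xpt_done(ccb);
		}
		break;
	default:
		ccb->ccb_h.status = CAM_REQ_INVALID;
		xpt_done(ccb);
		break;
	}
}

User-level reads then reach the HBA without the pages ever being mapped
into kernel VA, which is where the ~30% of CPU time mentioned above goes
back to doing real work.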
> > Do we really need multiple submission queues at the low-level driver
> > level? I thought there would be a CAM interface for multiqueue which
> > _all_ low-level drivers need to hook into.
>
> For now CAM is still oriented toward a single submission queue, but it
> allows a driver to have multiple completion queues. So I would start by
> implementing the latter, each bound to its own MSI-X interrupt and
> calling completion without taking the SIM lock or holding any other
> locks during the upcall. CAM provides a way to avoid an extra context
> switch in that case, which could be very useful.
>
> --
> Alexander Motin
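For the completion side Alexander describes, a minimal sketch (again
with invented "xxx_" names and an assumed reply-ring helper) of one
interrupt handler per MSI-X vector, using the xpt_done_direct()
interface available in head at the time:

/*
 * Sketch only.  Each MSI-X vector owns a private reply ring, so the
 * handlers can run on all CPUs concurrently without the shared SIM
 * lock, and xpt_done_direct() completes the ccb inline in the
 * interrupt thread instead of queueing it to the camisr software
 * interrupt, saving one context switch per I/O.
 */
#include <sys/param.h>
#include <cam/cam.h>
#include <cam/cam_ccb.h>
#include <cam/cam_xpt_sim.h>

struct xxx_softc;			/* hypothetical controller softc */

struct xxx_queue {
	struct xxx_softc *sc;		/* owning controller */
	int		  qidx;		/* reply ring == MSI-X vector index */
};

/* Hypothetical: pop the next completed ccb off this queue's reply ring. */
union ccb *xxx_next_completed(struct xxx_queue *q);

static void
xxx_msix_intr(void *arg)	/* one instance per vector, each wired   */
{				/* to its own IRQ via bus_setup_intr()   */
	struct xxx_queue *q = arg;
	union ccb *ccb;

	while ((ccb = xxx_next_completed(q)) != NULL) {
		ccb->ccb_h.status = CAM_REQ_CMP;
		xpt_done_direct(ccb);	/* inline completion, no SIM lock */
	}
}

Submission would still funnel through the single SIM queue for now,
which is why starting with the completion side is suggested above.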