Date:      Thu, 15 Aug 2013 13:40:25 -0700
From:      aurfalien <aurfalien@gmail.com>
To:        Charles Swiger <cswiger@mac.com>
Cc:        frank2@fjl.co.uk, freebsd-questions@freebsd.org
Subject:   Re: copying millions of small files and millions of dirs
Message-ID:  <8AB33749-728B-48FD-B17F-72FE54BD564A@gmail.com>
In-Reply-To: <B09E50DD-81F8-4EE6-8295-0DD56A5A97A9@mac.com>
References:  <7E7AEB5A-7102-424E-8B1E-A33E0A2C8B2C@gmail.com> <520D33D6.8050607@fjl.co.uk> <B09E50DD-81F8-4EE6-8295-0DD56A5A97A9@mac.com>


On Aug 15, 2013, at 1:22 PM, Charles Swiger wrote:

> [ ...combining replies for brevity... ]
>
> On Aug 15, 2013, at 1:02 PM, Frank Leonhardt <frank2@fjl.co.uk> wrote:
>> I'm reading all this with interest. The first thing I'd have tried
>> would be tar (and probably netcat), but I'm probably a bit of a
>> dinosaur. (If someone wants to buy me some really big drives I promise
>> I'll update.) If it's really NFS or nothing I guess you couldn't open
>> a socket anyway.
>
> Either tar via netcat or SSH, or dump / restore via a similar pipeline,
> are quite traditional.  tar is more flexible for partial filesystem
> copies, whereas dump / restore is more oriented towards complete
> filesystem copies.  If the destination starts off empty, they're
> probably faster than rsync, but rsync does delta updates, which is a
> huge win if you're going to be copying changes onto a slightly older
> version.
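
For reference, the tar-over-SSH pipeline mentioned above can be sketched
roughly as follows. This is only an illustration and was not posted in the
thread: the helper name `tar_over_ssh`, the host name, and both paths are
hypothetical, and the helper only builds the command string rather than
running it.

```python
import shlex


def tar_over_ssh(src_dir: str, host: str, dest_dir: str) -> str:
    """Return a shell pipeline that streams src_dir to dest_dir on host."""
    # Sender: archive the contents of src_dir to stdout.
    send = f"tar -cf - -C {shlex.quote(src_dir)} ."
    # Receiver: unpack from stdin into dest_dir, preserving permissions.
    recv = f"tar -xpf - -C {shlex.quote(dest_dir)}"
    # The remote command is quoted once more so ssh receives it as one argument.
    return f"{send} | ssh {shlex.quote(host)} {shlex.quote(recv)}"


# Hypothetical source path, host, and destination path.
print(tar_over_ssh("/data/projects", "backuphost", "/tank/projects"))
```

The netcat variant has the same shape, with the ssh stage swapped for a
listener on the receiving side.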

Yep, so it looks like it is what it is, as the data set is changing
while I do the base sync.  So I'll have to do several more passes to
pick up newcomers, etc.

> Anyway, you're entirely right that the capabilities of the source
> matter a great deal.
> If it could do zfs send / receive, or similar snapshot mirroring, that
> would likely do better than userland tools.
>
>> I'd be interested to know whether tar is still worth using in this
>> world of volume managers and SMP.
>
> Yes.
>
> On Aug 15, 2013, at 12:14 PM, aurfalien <aurfalien@gmail.com> wrote:
> [ ... ]
>>>>>> Doin 10Gb/jumbos but in this case it don't make much of a hoot
>>>>>> of a diff.
>>>>>
>>>>> Yeah, probably not-- you're almost certainly I/O bound, not
>>>>> network bound.
>>>>
>>>> Actually it was network bound via 1 rsync process, which is why I
>>>> broke up 154 dirs into 7 batches of 22 each.
>>>
>>> Oh.  Um, unless you can make more network bandwidth available,
>>> you've saturated the bottleneck.
>>> Doing a single copy task is likely to complete faster than splitting
>>> up the job into subtasks in such a case.
>>
>> Well, using iftop, I am now at least able to get ~1Gb with 7 scripts
>> going, where before it was in the 10Ms with 1.
>
> 1 gigabyte of data per second is pretty decent for a 10Gb link; 10
> MB/s obviously wasn't close to saturating a 10Gb link.

Cool.  Looks like I am doing my best, which is what I wanted to know.  I
chose to do 7 rsync scripts as it evenly divides into 154 parent dirs :)
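
The 7 x 22 split described above can be sketched as follows. This is a
minimal illustration, not the scripts actually used: the `make_batches`
helper and the directory names are made up, and each resulting batch would
be handed to its own rsync script.

```python
from typing import List


def make_batches(dirs: List[str], n_batches: int) -> List[List[str]]:
    """Split dirs into n_batches contiguous groups of near-equal size."""
    if n_batches <= 0:
        raise ValueError("n_batches must be positive")
    size, extra = divmod(len(dirs), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # The first `extra` batches take one additional entry.
        end = start + size + (1 if i < extra else 0)
        batches.append(dirs[start:end])
        start = end
    return batches


# Hypothetical names standing in for the 154 parent dirs.
parent_dirs = [f"dir{i:03d}" for i in range(154)]
batches = make_batches(parent_dirs, 7)
print([len(b) for b in batches])  # seven batches of 22 each
```

With a count that does not divide evenly, the leftover dirs are spread over
the first few batches, so no single script ends up with a much longer list.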

You should see how our backup system deals with this: Atempo Time
Navigator, or Tina as it's called.

It takes an hour just to lay down the dirs on tape before even starting
to back up, craziness.  And that's just for 1 parent dir having an avg of
500,000 dirs.  Actually I'm prolly wrong, as the initial creation is
125,000 dirs, of which a few are symlinks.

Then it grows from there.  Looking at the Tina stats, we see a million
objects or more.

- aurf


