Date:      Thu, 18 Apr 2024 07:21:48 +0000
From:      bugzilla-noreply@freebsd.org
To:        ports-bugs@FreeBSD.org
Subject:   [Bug 278424] deskutils/py-paperless-ngx: man page doesn't mention NLTK's Snowball Stemmer
Message-ID:  <bug-278424-7788@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=278424

            Bug ID: 278424
           Summary: deskutils/py-paperless-ngx: man page doesn't mention
                    NLTK's Snowball Stemmer
           Product: Ports & Packages
           Version: Latest
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: Individual Port(s)
          Assignee: grembo@FreeBSD.org
          Reporter: freebsd.bugzilla@mail.tinsuke.com
             Flags: maintainer-feedback?(grembo@FreeBSD.org)

The man page states, about setting up NLTK:

> NLTK DATA
>     In order to process scanned documents using machine learning,
>     paperless-ngx requires NLTK (natural language toolkit) data.  The
>     required files can be downloaded by using these commands:
>
>           /usr/local/bin/python3.9 -m nltk.downloader \
>             stopwords punkt -d /var/db/paperless/nltkdata

It is missing the "snowball_data" package, which also needs to be downloaded.
The package is referred to in the project's documentation
(https://docs.paperless-ngx.com/setup/#bare_metal):

> Optional: If using the NLTK machine learning processing (see
> PAPERLESS_ENABLE_NLTK for details), download the NLTK data for the Snowball
> Stemmer, Stopwords and Punkt tokenizer to your PAPERLESS_DATA_DIR/nltk.
> Refer to the NLTK instructions for details on how to download the data.

I can't vouch for how useful the stemmer data is in practice, but judging by
its description it sounds very useful
(https://github.com/snowballstem/snowball?tab=readme-ov-file#what-is-stemming):

> What is Stemming?
> Stemming maps different forms of the same word to a common "stem" - for
> example, the English stemmer maps connection, connections, connective,
> connected, and connecting to connect. So a search for connected would also
> find documents which only have the other forms.

I suggest adding "snowball_data" to the man page's sample NLTK download
command so that it is in line with the project's docs and more useful to users
of this port (thanks for it, btw!).

-- 
You are receiving this mail because:
You are the assignee for the bug.


