Date: Thu, 18 Apr 2024 07:21:48 +0000
From: bugzilla-noreply@freebsd.org
To: ports-bugs@FreeBSD.org
Subject: [Bug 278424] deskutils/py-paperless-ngx: man page doesn't mention NLTK's Snowball Stemmer
Message-ID: <bug-278424-7788@https.bugs.freebsd.org/bugzilla/>
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=278424

            Bug ID: 278424
           Summary: deskutils/py-paperless-ngx: man page doesn't mention
                    NLTK's Snowball Stemmer
           Product: Ports & Packages
           Version: Latest
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: Individual Port(s)
          Assignee: grembo@FreeBSD.org
          Reporter: freebsd.bugzilla@mail.tinsuke.com
             Flags: maintainer-feedback?(grembo@FreeBSD.org)

The man page states, about setting up NLTK:

> NLTK DATA
>      In order to process scanned documents using machine learning,
>      paperless-ngx requires NLTK (natural language toolkit) data. The
>      required files can be downloaded by using these commands:
>
>            /usr/local/bin/python3.9 -m nltk.downloader \
>              stopwords punkt -d /var/db/paperless/nltkdata

It is missing the "snowball_data" file to be downloaded. The file is referred
to in the project's doc (https://docs.paperless-ngx.com/setup/#bare_metal):

> Optional: If using the NLTK machine learning processing (see
> PAPERLESS_ENABLE_NLTK for details), download the NLTK data for the Snowball
> Stemmer, Stopwords and Punkt tokenizer to your PAPERLESS_DATA_DIR/nltk.
> Refer to the NLTK instructions for details on how to download the data.

I can't vouch for how handy it is to have that in NLTK or not, but it sounds
very useful from its description
(https://github.com/snowballstem/snowball?tab=readme-ov-file#what-is-stemming):

> What is Stemming?
> Stemming maps different forms of the same word to a common "stem" - for
> example, the English stemmer maps connection, connections, connective,
> connected, and connecting to connect. So a search for connected would also
> find documents which only have the other forms.

I suggest "snowball_data" is added to the man page's sample NLTK download
command so it is in line with the project's docs and can be useful to users of
this port (thanks for it, btw!).
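
For illustration, here is a rough sketch of how the amended sample command
might look. The interpreter path and data directory are taken from the man
page excerpt above, and "snowball_data" is the package name suggested in this
report. The variable names and the nltk_download_cmd helper are hypothetical
and exist only to keep the sketch readable. It prints the command for review
rather than running it.

```shell
#!/bin/sh
# Sketch only: the man page's sample command with snowball_data added.
# PYTHON and NLTK_DIR mirror the values quoted from the man page; the
# variable and function names are hypothetical.
PYTHON="/usr/local/bin/python3.9"
NLTK_DIR="/var/db/paperless/nltkdata"
PACKAGES="stopwords punkt snowball_data"

# Assemble the download command as a single line of text.
nltk_download_cmd() {
    echo "$PYTHON -m nltk.downloader $PACKAGES -d $NLTK_DIR"
}

# Print the command so it can be reviewed before being run by hand.
nltk_download_cmd
```

Actually running the printed command needs network access and write
permission on the data directory.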
-- 
You are receiving this mail because:
You are the assignee for the bug.