Debian Bug report logs - #960694
licensecheck: file selection: detect and skip binary files by default

Package: src:licensecheck; Maintainer for src:licensecheck is Debian Perl Group <[email protected]>;

Reported by: Gianfranco Costamagna <[email protected]>

Date: Wed, 29 Jun 2016 08:27:02 UTC

Severity: wishlist

Found in version licensecheck/3.0.0-1

Blocking fix for 472199: licensecheck: output formats: DEP-5 output is imperfect, 595272: licensecheck: output formats: DEP-5 output is imperfect


Report forwarded to [email protected], Debian Perl Group <[email protected]>:
Bug#828941; Package src:licensecheck. (Wed, 29 Jun 2016 08:27:06 GMT) (full text, mbox, link).


Acknowledgement sent to Gianfranco Costamagna <[email protected]>:
New Bug report received and forwarded. Copy sent to Debian Perl Group <[email protected]>. (Wed, 29 Jun 2016 08:27:06 GMT) (full text, mbox, link).


Message #5 received at [email protected] (full text, mbox, reply):

From: Gianfranco Costamagna <[email protected]>
To: "[email protected]" <[email protected]>
Subject: licensecheck: use binwalk to parse binary blobs?
Date: Wed, 29 Jun 2016 08:24:43 +0000 (UTC)
Source: licensecheck
Severity: wishlist
Version: 3.0.0-1

Hi, as discussed on IRC, it might be useful to use binwalk (now with a Python library/binding)
to spot what is hidden or embedded in binary blobs, and then use the correct tool to search
for copyrights/licenses.

Or, as Jonas suggested on IRC, use it in --strict mode when parsing has failed, to tell
the user what the blob contained and help explain why licensecheck failed to parse it.

Pabs suggested the hachoir tool.


cheers,

Gianfranco



Changed Bug title to 'licensecheck: use binwalk or hachoir to inspect binary blobs?' from 'licensecheck: use binwalk to parse binary blobs?'. Request was from Jonas Smedegaard <[email protected]> to [email protected]. (Wed, 29 Jun 2016 09:15:14 GMT) (full text, mbox, link).


Merged 828941 828948 Request was from Gianfranco Costamagna <[email protected]> to [email protected]. (Fri, 02 Aug 2019 11:06:08 GMT) (full text, mbox, link).


Information forwarded to [email protected], Debian Perl Group <[email protected]>:
Bug#828941; Package src:licensecheck. (Wed, 13 Nov 2019 13:33:03 GMT) (full text, mbox, link).


Acknowledgement sent to Andrej Shadura <[email protected]>:
Extra info received and forwarded to list. Copy sent to Debian Perl Group <[email protected]>. (Wed, 13 Nov 2019 13:33:04 GMT) (full text, mbox, link).


Message #14 received at [email protected] (full text, mbox, reply):

From: Andrej Shadura <[email protected]>
To: [email protected]
Subject: Re: licensecheck: use binwalk to parse binary blobs?
Date: Wed, 13 Nov 2019 14:29:13 +0100

Using hachoir directly may slow down licensecheck significantly, but
maybe licensecheck could invoke it optionally, when asked to. Jonas
(in a private email) suggested that slowness is already quite a
serious issue, so this path needs to be taken with caution.

However, I have this idea.

My personal issue with licensecheck is that it tries to parse binary
files, but parses them as text, thus dumping huge lumps of binary junk
into the generated copyright file:

Files: ./data/icons/hicolor/48x48/apps/com.github.maoschanz.drawing.png
 ./help/C/figures/icon.png
Copyright: ^@CC Attribution-ShareAlike
http:creativecommons.org/licenses/by-sa/4.0/ÃTb^E^@^@^E<U+0091>IDATh<U+0081>í<U+0098>kl^TU^X<U+0086><U+009F>3{ë^h<U+008B>
  n¸¤¼Ñ^E;{Ö<%¥Ì^Qz'ü<U+0083>E<U+0082>ë½<U+009D>
  r^RtÅ|вxÒk³Ú ð¼
License: CC-BY-SA
 FIXME

This output is useless and wrong, and it takes a long time to generate
since binary files are typically bigger than text files.

What if we tried to detect binary files before parsing them?
A very dumb algorithm would:
1) Check the first 8-16-32-whatever-sensible bytes for magic sequences
of file types that might contain copyright/license metadata, e.g. PNG,
JPEG, SVG… (we need to keep this list short)
2) If something is detected, parse it in a special way; Perl seems to
have a lot of modules for that
3) If nothing is found but the file looks binary (TBD how we detect
this), use hachoir or whatever is suitable, if available; otherwise
say UNKNOWN
4) Never dump binary stuff
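
The detection part of the steps above could be sketched roughly as
follows. This is only an illustration, written in Python rather than
licensecheck's Perl. The function name classify_file, the 32-byte sniff
length and the NUL-byte test for "looks binary" are all assumptions made
for the example, not anything licensecheck implements. SVG is left out
because it is XML text and has no fixed magic sequence.

```python
# Magic sequences for a deliberately short list of formats that may carry
# copyright/license metadata (standard PNG and JPEG file signatures).
MAGIC = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
}

SNIFF_BYTES = 32  # assumption: how many leading bytes to inspect


def classify_file(path):
    """Return a known format name, 'binary', or 'text' for the file at path."""
    with open(path, "rb") as handle:
        head = handle.read(SNIFF_BYTES)
    for magic, name in MAGIC.items():
        if head.startswith(magic):
            return name
    # Assumed heuristic: a NUL byte near the start suggests binary content.
    if b"\x00" in head:
        return "binary"
    return "text"
```

A caller would hand "png" or "jpeg" files to a dedicated parser, report
UNKNOWN for "binary", and run the usual text scan only for "text". The
NUL-byte test is crude: UTF-16 text would be classified as binary, for
example.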

At worst, a filter to remove non-ASCII stuff from binary-looking files
would be very useful.
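
Such a fallback filter might look like the sketch below (again Python,
purely illustrative; the name strip_binary_junk is hypothetical). It
replaces every run of control or non-ASCII characters with a space and
then collapses whitespace.

```python
import re

# Matches runs of anything that is not a tab or printable ASCII.
_NON_PRINTABLE = re.compile(r"[^\t\x20-\x7e]+")


def strip_binary_junk(text):
    """Replace control/non-ASCII runs with a space, then collapse whitespace."""
    cleaned = _NON_PRINTABLE.sub(" ", text)
    return " ".join(cleaned.split())
```

Note that a filter this blunt also removes legitimate non-ASCII
characters, such as the © sign or accented letters in author names, so
it would only make sense for files already judged to be binary.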

-- 
Cheers,
  Andrej



Information forwarded to [email protected], Debian Perl Group <[email protected]>:
Bug#828941; Package src:licensecheck. (Wed, 13 Nov 2019 15:48:08 GMT) (full text, mbox, link).


Acknowledgement sent to Jonas Smedegaard <[email protected]>:
Extra info received and forwarded to list. Copy sent to Debian Perl Group <[email protected]>. (Wed, 13 Nov 2019 15:48:08 GMT) (full text, mbox, link).


Message #19 received at [email protected] (full text, mbox, reply):

From: Jonas Smedegaard <[email protected]>
To: [email protected], Andrej Shadura <[email protected]>
Subject: Re: Bug#828941: licensecheck: use binwalk to parse binary blobs?
Date: Wed, 13 Nov 2019 16:44:38 +0100

Licensecheck currently expects to be handed only source code.

I agree that it makes great sense to expand to handle other file types 
as well, but *how* to handle other files depends on why you are running 
licensecheck at all.

The original purpose, as authored by KDE developers, was checking
conformance with a narrow subset of licenses.  Extending it to cover
binary files would then probably mean handing a select few well-known
extensions over to well-defined parsers, with everything else either
skipped or treated as an error, depending on the narrower use case.

The common use nowadays, for Debian packaging, is to detect as many
copyright and licensing hints as possible.  Extending it to cover
binary files would then probably mean consulting libfile-libmagic-perl
and/or file extensions, and maintaining a list of more detailed parsers
to hand files over to based on those.  Which detailed parser(s) to use,
and how persistently to drill into content, again depends on the
narrower use case.
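
As a rough sketch of such a dispatch (Python for illustration only; the
PARSERS table, the placeholder functions scan_text and scan_png, and the
skip_unknown switch are all hypothetical), using the extension-based
mimetypes module from the Python standard library as a stand-in for a
libmagic lookup:

```python
import mimetypes


def scan_text(path):
    """Placeholder for the existing text-based scan."""
    return f"text scan of {path}"


def scan_png(path):
    """Placeholder for a dedicated PNG metadata parser."""
    return f"PNG metadata scan of {path}"


# Hypothetical mapping from MIME type to detailed parser.
PARSERS = {
    "image/png": scan_png,
}


def dispatch(path, skip_unknown=True):
    """Pick a parser from the file extension; skip or reject unknown types."""
    mime, _encoding = mimetypes.guess_type(path)
    if mime in PARSERS:
        return PARSERS[mime](path)
    if mime is not None and mime.startswith("text/"):
        return scan_text(path)
    if skip_unknown:
        return None  # silently skipped
    raise ValueError(f"no parser for {path} ({mime})")
```

The skip_unknown switch mirrors the two use cases: skipping suits broad
hint collection, while raising an error suits strict conformance checks.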

Should licensecheck detect or ignore or declare "None" for PDF content? 
PDF metadata fields? RDF resource embedded in PDF headers? metadata 
embedded in ICC profile embedded in PNG object embedded in PDF object?

I think that parsing binary data in licensecheck should be optional, to 
limit complexity for those using it only to process text-based 
source code.

I think it should be configurable which parsers to use when, and offer 
some high-level "profiles" for common use-cases.

For use right now, I recommend combining licensecheck with the helper 
scripts that are part of cdbs (but *without* build-depending on or 
otherwise using cdbs).  For examples of using those helper scripts to 
pre-parse some binary files and skip select other ones, while not 
accidentally silencing unknown file types introduced later, see the 
file debian/copyright-check in the source code of ghostscript (or 
pandoc or valentina), and the files /usr/lib/cdbs/license-miner and 
/usr/lib/cdbs/licensecheck2dep5 in package cdbs.


 - Jonas

-- 
 * Jonas Smedegaard - idealist & Internet-arkitekt
 * Tlf.: +45 40843136  Website: http://dr.jones.dk/

 [x] quote me freely  [ ] ask before reusing  [ ] keep private



Changed Bug title to 'licensecheck: detect and parse binary files with binwalk or hachoir' from 'licensecheck: use binwalk or hachoir to inspect binary blobs?'. Request was from Jonas Smedegaard <[email protected]> to [email protected]. (Fri, 15 May 2020 12:06:06 GMT) (full text, mbox, link).


Added indication that bug 828941 blocks 472199,595272 Request was from Jonas Smedegaard <[email protected]> to [email protected]. (Fri, 15 May 2020 12:57:05 GMT) (full text, mbox, link).


Disconnected #828941 from all other report(s). Request was from Jonas Smedegaard <[email protected]> to [email protected]. (Fri, 15 May 2020 14:09:02 GMT) (full text, mbox, link).


Bug 828941 cloned as bugs 960694, 960695 Request was from Jonas Smedegaard <[email protected]> to [email protected]. (Fri, 15 May 2020 14:09:03 GMT) (full text, mbox, link).


Added indication that bug 960694 blocks Request was from Jonas Smedegaard <[email protected]> to [email protected]. (Fri, 15 May 2020 14:09:04 GMT) (full text, mbox, link).


Changed Bug title to 'licensecheck: detect and skip binary files by default' from 'licensecheck: detect and parse binary files with binwalk or hachoir'. Request was from Jonas Smedegaard <[email protected]> to [email protected]. (Fri, 15 May 2020 14:09:06 GMT) (full text, mbox, link).


Changed Bug title to 'licensecheck: file selection: detect and skip binary files by default' from 'licensecheck: detect and skip binary files by default'. Request was from Jonas Smedegaard <[email protected]> to [email protected]. (Tue, 08 Apr 2025 11:09:01 GMT) (full text, mbox, link).



Debian bug tracking system administrator <[email protected]>. Last modified: Tue May 13 12:50:04 2025; Machine Name: buxtehude
