Debian Bug report logs - #566645
/usr/bin/uniq: uniq tells 2 lines with different invalid utf-8 characters are duplicate

Package: coreutils; Maintainer for coreutils is Michael Stone <[email protected]>; Source for coreutils is src:coreutils.

Reported by: Stephane Chazelas <[email protected]>

Date: Sun, 24 Jan 2010 11:33:02 UTC

Severity: normal

Found in version coreutils/8.4-1


Report forwarded to [email protected], Michael Stone <[email protected]>:
Bug#566645; Package coreutils. (Sun, 24 Jan 2010 11:33:05 GMT).


Acknowledgement sent to Stephane Chazelas <[email protected]>:
New Bug report received and forwarded. Copy sent to Michael Stone <[email protected]>. (Sun, 24 Jan 2010 11:33:05 GMT).


Message #5 received at [email protected]:

From: Stephane Chazelas <[email protected]>
To: Debian Bug Tracking System <[email protected]>
Subject: /usr/bin/uniq: uniq tells 2 lines with different invalid utf-8 characters are duplicate
Date: Sun, 24 Jan 2010 11:31:40 +0000
Package: coreutils
Version: 8.4-1
Severity: normal
File: /usr/bin/uniq


~$ locale charmap
UTF-8
~$ locale collate-codeset
UTF-8
~$ sort .zsh-history|uniq -D|sed -n l
cd Pyr\202n\202es$
cd Pyr\351n\351es$


The two lines differ only in their invalid UTF-8 bytes, yet
uniq reports them as duplicates.
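
For contrast, here is a minimal sketch (not part of the original report) of the same comparison forced into the C locale, where lines are compared byte by byte. It assumes a POSIX shell whose printf accepts octal escapes; the variable name dups_c is only illustrative. The two lines are rebuilt from the sed output above, using bytes 0x82 (octal 202) and 0xE9 (octal 351).

```shell
# Recreate the two history lines from the report. They differ only in the
# bytes 0x82 (octal 202) and 0xE9 (octal 351), neither of which forms a
# valid UTF-8 sequence in this position.
# uniq -d prints one copy of each repeated line; LC_ALL=C forces a
# byte-wise comparison, so nothing should be reported as repeated.
dups_c=$(printf 'cd Pyr\202n\202es\ncd Pyr\351n\351es\n' | LC_ALL=C uniq -d | wc -l | tr -d ' ')
echo "lines reported as duplicate under LC_ALL=C: $dups_c"
```

Under byte-wise comparison the count is 0, which suggests the behavior reported here comes from locale-aware collation rather than from the line contents themselves.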

"sort -u" and "comm" also treat them as identical:
~$ echo '\0300\n\0301' | sort -u | sed -n l
\300$
~$ sed -n l a
cd Pyr\202n\202es$
~$ sed -n l b
cd Pyr\351n\351es$
~$ comm -12 a b | sed -n l
cd Pyr\351n\351es$
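
The same contrast can be sketched for "sort -u" and "comm" (again not part of the original report). The file names a and b mirror the ones above, while the scratch directory and the variable names sorted_c and common_c are assumptions made for illustration. Only the C-locale results are shown, since the UTF-8 result may depend on the libc and coreutils versions in use.

```shell
# Work in a scratch directory; the file names a and b mirror the report.
tmpdir=$(mktemp -d)
printf 'cd Pyr\202n\202es\n' > "$tmpdir/a"
printf 'cd Pyr\351n\351es\n' > "$tmpdir/b"

# sort -u in the C locale compares bytes, so both single-byte lines survive.
sorted_c=$(printf '\300\n\301\n' | LC_ALL=C sort -u | wc -l | tr -d ' ')

# comm -12 prints only the lines common to both files; byte-wise there are none.
common_c=$(LC_ALL=C comm -12 "$tmpdir/a" "$tmpdir/b" | wc -l | tr -d ' ')

echo "sort -u kept $sorted_c lines; comm -12 found $common_c common lines"
rm -rf "$tmpdir"
```

With LC_ALL=C, sort -u keeps both lines and comm -12 finds no common line, unlike the UTF-8 locale output shown above.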

If this is expected behavior, it should be better documented:
"Comparisons honor the rules specified by the `LC_COLLATE'
locale category." does not adequately cover such unintuitive
behavior.

-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (50, 'experimental')
Architecture: i386 (i686)

Kernel: Linux 2.6.32-trunk-686 (SMP w/1 CPU core)
Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_US.ISO-8859-15 (charmap=UTF-8) (ignored: LC_ALL set to en_US.UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages coreutils depends on:
ii  libacl1                       2.2.49-1   Access control list shared library
ii  libc6                         2.10.2-5   Embedded GNU C Library: Shared lib
ii  libselinux1                   2.0.89-4   SELinux runtime shared libraries

coreutils recommends no packages.

coreutils suggests no packages.

-- debconf-show failed




Information forwarded to [email protected], Michael Stone <[email protected]>:
Bug#566645; Package coreutils. (Sun, 24 Jan 2010 11:42:05 GMT).


Acknowledgement sent to Stephane Chazelas <[email protected]>:
Extra info received and forwarded to list. Copy sent to Michael Stone <[email protected]>. (Sun, 24 Jan 2010 11:42:05 GMT).


Message #10 received at [email protected]:

From: Stephane Chazelas <[email protected]>
To: [email protected]
Subject: Re: /usr/bin/uniq: uniq tells 2 lines with different invalid utf-8 characters are duplicate
Date: Sun, 24 Jan 2010 11:39:19 +0000

Sorry,

the email address I used to submit the bug is incorrect. It
should have been [email protected].

-- 
Stephane




Debian bug tracking system administrator <[email protected]>. Last modified: Tue May 13 08:37:11 2025; Machine Name: buxtehude
