'tr' character class support?

Eric Luehrsen ericluehrsen at gmail.com
Fri Jul 10 21:36:03 EDT 2020

On 7/10/20 8:15 PM, Jordan Geoghegan wrote:
> On 2020-07-10 16:59, Rosen Penev wrote:
>> On Fri, Jul 10, 2020 at 4:17 PM Jordan Geoghegan <jordan at geoghegan.ca> wrote:
>>> On 2020-07-10 14:54, Rosen Penev wrote:
>>>> On Fri, Jul 10, 2020 at 2:29 PM Jordan Geoghegan <jordan at geoghegan.ca> wrote:
>>>>> On 2020-07-10 14:15, Magnus Kroken wrote:
>>>>>> Hi Jordan
>>>>>> On 10.07.2020 22:45, Jordan Geoghegan wrote:
>>>>>>> Hey folks,
>>>>>>> Does the 'tr' utility support character classes in OpenWRT? I was
>>>>>>> playing around with an OpenWRT x86_64 VM and I noticed that 'tr'
>>>>>>> doesn't seem to support character classes.
>>>>>>> The command " echo HELLO | tr '[:upper:]' '[:lower:]' "  does not
>>>>>>> convert the text to lowercase as it should (and as required by
>>>>>>> POSIX).
>>>>>> This would be expected behavior. OpenWrt disables tr character classes
>>>>>> in BusyBox by default, see [1]:
>>>>>>           bool
>>>>>>           default n
>>>>>>           bool
>>>>>>           default n
>>>>>> I don't know what the size cost in the BusyBox binary is, but that
>>>>>> will likely be the deciding factor for such a change.
>>>>>> 1:
>>>>>> https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=package/utils/busybox/Config-defaults.in 
>>>>>> Regards,
>>>>>> Magnus Kroken
>>>>> Hi Magnus,
>>>>> Thanks for confirming that so quickly.
>>>>> I obviously understand that space saving is essential to OpenWRT, but
>>>>> POSIX does require[1] that 'tr' support character classes:
>>>> awk '{print toupper($0)}' is an alternative.
>>> Yes, but this means that any script expecting tr to work correctly could
>>> explode, as tr silently ignores the character class and treats all the
>>> characters literally.
>> git grep upper | grep tr\ | wc -l
>> 3
>> In the packages feed. All those results are things that run on the
>> host, not on OpenWrt.
>> tr a-z A-Z works as an alternative and is used in many places.
> tr a-z A-Z is bad practice as it can behave unexpectedly in different 
> locales; I've also heard tales of folks with Turkish locales having 
> issues with '0-9' for example.
> Is a couple kb of space worth such a loss in portability (not to mention 
> deviating heavily from POSIX)?
>>>>> [:class:]
>>>>>     Represents all characters belonging to the defined character
>>>>>     class, as defined by the current setting of the LC_CTYPE locale
>>>>>     category. The following character class names shall be accepted
>>>>>     when specified in string1:
>>>>>         alnum    blank   digit   lower   punct   upper
>>>>>         alpha    cntrl   graph   print   space   xdigit
>>>>> 1: https://www.unix.com/man-page/posix/1posix/tr/

Unless there is an overwhelming size cost, basic POSIX binaries should 
be provided "POSIX'ly correct" by default. Applying experimental theory, 
a discipline's standard is the null hypothesis (H0), which is the default 
decision. A deviation from the standard, and especially _shorting_ the 
standard, is the alternative hypothesis (H1) and requires good data with 
clear separation to accept. (Standards often permit well-formulated 
extensions to them.)
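
In the meantime, a script that has to run on both kinds of build could 
probe 'tr' before relying on it. The following is only a rough sketch: 
the helper names tr_has_classes and to_lower are made up for 
illustration and are not part of BusyBox or OpenWrt, and the fallback 
assumes an awk with tolower() is installed.

```sh
#!/bin/sh

# Hypothetical helper: succeeds only if the local tr expands POSIX
# character classes. A tr built without class support treats the
# bracket expression as literal characters and leaves 'AZ' unchanged.
tr_has_classes() {
    [ "$(printf 'AZ' | tr '[:upper:]' '[:lower:]')" = "az" ]
}

# Hypothetical helper: lowercase stdin, falling back to awk when tr
# lacks character class support (assumes awk provides tolower()).
to_lower() {
    if tr_has_classes; then
        tr '[:upper:]' '[:lower:]'
    else
        awk '{ print tolower($0) }'
    fi
}

echo HELLO | to_lower
```

The probe costs one extra process per call, so a longer script would 
probably run it once and cache the result in a variable.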

