fluencelabs/musl

mirror of https://github.com/fluencelabs/musl synced 2025-05-03 19:12:16 +00:00

Author	SHA1	Message	Date
Szabolcs Nagy	4260dfe1ec	regcomp: propagate allocation failures The error code of an allocating function was not checked in tre_add_tag.	2015-09-24 02:33:18 -04:00
Rich Felker	1507ebf837	byte-based C locale, phase 1: multibyte character handling functions this patch makes the functions which work directly on multibyte characters treat the high bytes as individual abstract code units rather than as multibyte sequences when MB_CUR_MAX is 1. since MB_CUR_MAX is presently defined as a constant 4, all of the new code added is dead code, and optimizing compilers' code generation should not be affected at all. a future commit will activate the new code. as abstract code units, bytes 0x80 to 0xff are represented by wchar_t values 0xdf80 to 0xdfff, at the end of the surrogates range. this ensures that they will never be misinterpreted as Unicode characters, and that all wctype functions return false for these "characters" without needing locale-specific logic. a high range outside of Unicode such as 0x7fffff80 to 0x7fffffff was also considered, but since C11's char16_t also needs to be able to represent conversions of these bytes, the surrogate range was the natural choice.	2015-06-16 05:28:48 +00:00
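The byte-to-code-unit mapping described in the commit above can be sketched in a few lines of C. The helper names below are hypothetical and are not taken from musl's source; only the ranges (bytes 0x80 to 0xff, wchar_t values 0xdf80 to 0xdfff) come from the commit message.

```c
#include <wchar.h>

/* Hypothetical helpers illustrating the mapping; not musl's actual code. */

/* Map a high byte (0x80..0xff) to its abstract code unit (0xdf80..0xdfff). */
static wchar_t byte_to_codeunit(unsigned char b)
{
    return (wchar_t)(0xdf00 | b);
}

/* Nonzero if wc is one of the 128 abstract code units. */
static int is_codeunit(wchar_t wc)
{
    return (unsigned long)wc - 0xdf80UL < 0x80UL;
}

/* Recover the original byte from an abstract code unit. */
static unsigned char codeunit_to_byte(wchar_t wc)
{
    return (unsigned char)(wc & 0xff);
}
```

Because the whole range sits inside the surrogates block, a value produced this way can never collide with a valid Unicode scalar value.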
Szabolcs Nagy	c498efe117	regex: fix character class repetitions Internally regcomp needs to copy some iteration nodes before translating the AST into TNFA representation. Literal nodes were not copied correctly: the class type and list of negated class types were not copied so classes were ignored (in the non-negated case an ignored char class caused the literal to match everything). This affects iterations when the upper bound is finite, larger than one or the lower bound is larger than one. So eg. the EREs [[:digit:]]{2} [^[:space:]ab]{1,4} were treated as .{2} [^ab]{1,4} The fix is done with minimal source modification to copy the necessary fields, but the AST preparation and node handling code of tre will need to be cleaned up for clarity.	2015-03-27 20:24:30 -04:00
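A small check along the following lines would exercise the two EREs named in the commit above. The helper name ere_matches is an assumption made for this sketch; the patterns are anchored so that a class silently widened to "." would show up as a false match.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if s matches the ERE, 0 if it does not, -1 if the pattern
 * fails to compile. Hypothetical helper for illustration only. */
static int ere_matches(const char *pattern, const char *s)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With a conforming regcomp, ere_matches("^[[:digit:]]{2}$", "ab") should be 0; under the bug described above the class was ignored and the same call would have reported a match.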
Szabolcs Nagy	32dee9b9b1	do not treat \0 as a backref in BRE The valid BRE backref tokens are \1 .. \9, and 0 is not a special character either so \0 is undefined by the standard. Such undefined escaped characters are treated as literal characters currently, following existing practice, so \0 is the same as 0.	2015-03-23 12:28:49 -04:00
Rich Felker	7c8c86f630	suppress backref processing in ERE regcomp one of the features of ERE is that it's actually a regular language and does not admit expressions which cannot be matched in linear time. introduction of \n backref support into regcomp's ERE parsing was unintentional.	2015-03-20 18:28:37 -04:00
Rich Felker	39dfd58417	fix memory-corruption in regcomp with backslash followed by high byte the regex parser handles the (undefined) case of an unexpected byte following a backslash as a literal. however, instead of correctly decoding a character, it was treating the byte value itself as a character. this was not only semantically unjustified, but turned out to be dangerous on archs where plain char is signed: bytes in the range 252-255 alias the internal codes -4 through -1 used for special types of literal nodes in the AST.	2015-03-20 18:06:04 -04:00
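The aliasing hazard in the commit above comes from reading a high byte through a signed char. The sketch below uses hypothetical helper names and assumes the usual two's-complement conversion (the result of converting an out-of-range value to signed char is implementation-defined in C).

```c
/* Hypothetical helpers; they only illustrate the signedness issue. */

/* Value of byte b when viewed through a signed char, as happens with
 * plain char on archs where char is signed. */
static int byte_as_signed(unsigned char b)
{
    return (signed char)b;
}

/* Safe way to read a byte value from a string: go through unsigned char. */
static int byte_as_unsigned(const char *p)
{
    return (unsigned char)*p;
}
```

Bytes 252 to 255 come out as -4 to -1 in the signed view, which is exactly the range the commit says was reserved for special literal nodes.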
Nagy Szabolcs	efa9d396f9	implement FNM_CASEFOLD extension to fnmatch function	2014-12-17 14:54:37 -05:00
Szabolcs Nagy	ec1aed0a14	rewrite the regex pattern parser in regcomp The new code is a bit simpler and the generated code is about 1KB smaller (on i386). The basic design was kept including internal interfaces, TNFA generation was not touched. The old tre parser had various issues: [^aa-z]: negated overlapping ranges in a bracket expression were handled incorrectly (eg [^aa-z] was handled as [^a] instead of [^a-z]). a{,2}: missing lower bound in a counted repetition should be an error, but it was accepted with broken semantics: a{,2} was treated as a{0,3}, the new parser rejects it. a{999,}: large min count was not rejected (a{5000,} failed with REG_ESPACE due to reaching a stack limit), the new parser enforces the RE_DUP_MAX limit. \xff: regcomp used to accept a pattern with illegal sequences in it (treated them as empty expression so p\xffq matched pq), the new parser rejects such patterns with REG_BADPAT or REG_ERANGE. [^b-fD-H] with REG_ICASE: old parser turned this into [^b-fB-F] because of the negated overlapping range issue (see above), the new parser treats it as [^b-hB-H], POSIX seems to require [^d-fD-F], but practical implementations do case-folding first and negate the character set later instead of the other way around. (Supporting the posix way efficiently would require significant changes so it was left as is, it is unclear if any application actually expects the posix behaviour, this issue is raised on the austingroup tracker: http://austingroupbugs.net/view.php?id=872 ). Another case-insensitive matching issue is that unicode case folding rules can group more than two characters together while towupper and towlower can only work for a pair of upper and lower case characters, this is a limitation of POSIX so it is not fixed. Invalid bracket and brace expressions may return different error codes now (REG_ERANGE instead of REG_EBRACK or REG_BADBR instead of REG_EBRACE), otherwise the new parser should be compatible with the old one. regcomp should be able to handle arbitrary pattern input if the pattern length is limited, the only exception is the use of large repetition counts (eg. (a{255}){255}) which require exp amount of memory and there is no easy workaround.	2014-09-13 00:20:55 +02:00
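Since the commit above notes that invalid bracket and brace expressions may now produce different error codes, portable callers are probably best off testing only for a nonzero result from regcomp. A minimal sketch, with a hypothetical helper name:

```c
#include <regex.h>

/* Compile pattern as an ERE and return regcomp's status code:
 * 0 on success, some REG_* error otherwise. Hypothetical helper. */
static int ere_compile_status(const char *pattern)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);

    if (rc == 0)
        regfree(&re); /* only a successfully compiled regex is freed */
    return rc;
}
```

A caller that needs a message can pass the returned code to regerror rather than comparing it against one specific REG_* constant.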
Szabolcs Nagy	546f6b322b	fix memory leak in regexec when input contains illegal sequence	2014-09-05 15:12:34 -04:00
Rich Felker	c5b8f19305	add support for LC_TIME and LC_MESSAGES translations for LC_MESSAGES, translation of strerror and similar literal message functions is supported. for messages in other places (particularly the dynamic linker) that use format strings, translation is not yet supported. in order to make it possible and safe, such messages will need to be refactored to separate the textual content from the format. for LC_TIME, the day and month names and strftime-style format strings provided by nl_langinfo are supported for translation. however there may be limitations, as some of the original C-locale nl_langinfo strings are non-unique and thus perhaps non-suitable as keys. overall, the locale support activated by this commit should not be seen as complete and polished but as a basis for beginning to test locale functionality and implement locales.	2014-07-26 05:36:25 -04:00
Rich Felker	72ed3d47e5	fix crash in regexec for nonzero nmatch argument with REG_NOSUB per POSIX, the nmatch and pmatch arguments are ignored when the regex was compiled with REG_NOSUB.	2014-07-17 19:56:27 -04:00
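The call pattern that triggered the crash fixed above looks roughly like the following sketch: a regex compiled with REG_NOSUB is executed with a nonzero nmatch and a real pmatch array. The helper name is hypothetical.

```c
#include <regex.h>

/* Return 1 on match, 0 on no match, -1 on compile failure.
 * Hypothetical helper for illustration. */
static int nosub_match(const char *pattern, const char *s)
{
    regex_t re;
    regmatch_t pm[2];
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    /* nmatch is nonzero here, but per POSIX both nmatch and pmatch are
     * ignored because the regex was compiled with REG_NOSUB. */
    rc = regexec(&re, s, 2, pm, 0);
    regfree(&re);
    return rc == 0;
}
```

The contents of pm must not be relied on after such a call; only the return value is meaningful.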
Szabolcs Nagy	571744447c	include cleanups: remove unused headers and add feature test macros	2013-12-12 05:09:18 +00:00
Rich Felker	a4e10e304d	implement FNM_LEADING_DIR extension flag in fnmatch previously this flag was defined and accepted as a no-op, possibly breaking some software that uses it. given the choice to remove the definition and possibly break applications that were already working, or simply implement the feature, the latter turned out to be easy enough to make the decision easy. in the case where the FNM_PATHNAME flag is also set, this implementation is clean and essentially optimal. otherwise, it's an inefficient "brute force" implementation. at some point, when cleaning up and refactoring this code, I may add a more direct code path for handling FNM_LEADING_DIR in the non-FNM_PATHNAME case, but at this point my main interest is avoiding introducing new bugs in the code that implements the standard fnmatch features specified by POSIX.	2013-12-02 02:08:41 -05:00
Rich Felker	6ec82a3b58	fix fnmatch corner cases related to escaping the FNM_PATHNAME logic for advancing by /-delimited components was incorrect when the / character was escaped (i.e. \/), and a final \ at the end of pattern was not handled correctly.	2013-12-01 14:36:22 -05:00
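The escaped-slash case from the commit above can be checked with a short wrapper around fnmatch. The helper name path_match is an assumption of this sketch; the trailing-backslash case is left out because its expected result differs between implementations.

```c
#include <fnmatch.h>

/* Return 1 if s matches pat under FNM_PATHNAME, 0 otherwise.
 * Hypothetical helper for illustration. */
static int path_match(const char *pat, const char *s)
{
    return fnmatch(pat, s, FNM_PATHNAME) == 0;
}
```

In C source the pattern a\/b is written "a\\/b"; it should match the string "a/b" just as the unescaped pattern "a/b" does, while "a*" should not match "a/b" because "*" does not cross a "/" under FNM_PATHNAME.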
Szabolcs Nagy	da0fcdb8e9	fix the end of string matching in fnmatch with FNM_PATHNAME a '/' in the pattern could be incorrectly matched against the terminating null byte in the string causing arbitrarily long sequence of out-of-bounds access in fnmatch("/","",FNM_PATHNAME)	2013-12-01 17:32:48 +00:00
Szabolcs Nagy	1e81fa4524	fix allocation sizes in regcomp sizeof had incorrect argument in a few places, the size was always large enough so the issue was not critical.	2013-10-07 13:25:19 +00:00
Rich Felker	ae4b0b96d6	revert regex "cleanup" that seems unjustified and may break backtracking it's not clear to me at the moment whether the code that was removed (and which is now being re-added) is needed, but it's far from being a no-op, and i don't want to risk breaking regex in this release.	2013-02-01 01:10:59 -05:00
Szabolcs Nagy	f05f59b804	remove unused "params" related code from regex some structs and functions had reference to the params feature of tre that is not used by the code anymore	2013-01-15 01:05:29 +01:00
Szabolcs Nagy	dd95916382	regex: remove an unused local variable from regexec pos_start local variable is not used in tre_tnfa_run_backtrack	2013-01-14 00:06:49 +01:00
Rich Felker	400c5e5c83	use restrict everywhere it's required by c99 and/or posix 2008 to deal with the fact that the public headers may be used with pre-c99 compilers, __restrict is used in place of restrict, and defined appropriately for any supported compiler. we also avoid the form [restrict] since older versions of gcc rejected it due to a bug in the original c99 standard, and instead use the form *restrict.	2012-09-06 22:44:55 -04:00
Rich Felker	8b4c232efe	fix regex on arm TRE has a broken assumption that wchar_t is signed, which is a sane expectation, but not required by the standard, and false on ARM's ABI. i leave tre_char_t as wchar_t for now, since a pointer to it is directly passed to functions that need pointer to wchar_t. it does not seem to break anything. and since the maximum unicode scalar value is 0x10ffff, just use that explicitly rather than using the max value of any particular C type.	2012-05-25 10:45:05 -04:00
Rich Felker	13b2945a3c	remove some no-op end of string tests from regex parser these are cruft from the original code which used an explicit string length rather than null termination. i blindly converted all the checks to null terminator checks, without noticing that in several cases, the subsequent switch statement would automatically handle the null byte correctly.	2012-05-13 17:20:01 -04:00
Rich Felker	e9cddc8e32	another BRE fix: in ^*, * is literal i don't understand why this has to be conditional on being in BRE mode, but enabling this code unconditionally breaks a huge number of ERE test cases.	2012-05-13 17:16:10 -04:00
Rich Felker	952700e8c3	fix error checking for \ at end of regex (this was broken previously)	2012-05-07 17:55:13 -04:00
Rich Felker	1736148210	fix copy and paste error in regex code causing mishandling of \) in BRE	2012-05-07 17:50:32 -04:00
Rich Felker	a5a4778335	fix regex breakage in last commit (failure to handle empty regex, etc.)	2012-05-07 17:43:38 -04:00
Rich Felker	d7a90b35b9	fix ugly bugs in TRE regex parser 1. * in BRE is not special at the beginning of the regex or a subexpression. this broke ncurses' build scripts. 2. \\( in BRE is a literal \ followed by a literal (, not a literal \ followed by a subexpression opener. 3. the ^ in \\(^ in BRE is a literal ^ only at the beginning of the entire BRE. POSIX allows treating it as an anchor at the beginning of a subexpression, but TRE's code for checking if it was at the beginning of a subexpression was wrong, and fixing it for the sake of supporting a non-portable usage was too much trouble when just removing this non-portable behavior was much easier. this patch also moved lots of the ugly logic for empty atom checking out of the default/literal case and into new cases for the relevant characters. this should make parsing faster and make the code smaller. if nothing else it's a lot more readable/logical. at some point i'd like to revisit and overhaul lots of this code...	2012-05-07 14:50:49 -04:00
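Points 1 and 2 of the commit above can be exercised with a small BRE check like the one below. The helper name is hypothetical, and the expected results assume a POSIX-conforming regcomp.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if s matches the BRE, 0 if not, -1 on compile failure.
 * Hypothetical helper for illustration. */
static int bre_matches(const char *pattern, const char *s)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_NOSUB) != 0) /* no REG_EXTENDED: BRE */
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Here the BRE *a should match only text containing a literal "*a", and the BRE \\( (written "\\\\(" in C source) should match a backslash followed by an opening parenthesis.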
Rich Felker	45b38550ee	new fnmatch implementation unlike the old one, this one's algorithm does not suffer from potential stack overflow issues or pathologically bad performance on certain patterns. instead of backtracking, it uses a matching algorithm which I have not seen before (unsure whether I invented or re-invented it) that runs in O(1) space and O(nm) time. it may be possible to improve the time to O(n), but not without significantly greater complexity.	2012-04-28 18:05:29 -04:00
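To give a feel for matching in O(1) space and O(nm) time without recursion, here is a sketch of a different, well-known technique: an iterative matcher that remembers only the most recent "*" and restarts from it on a mismatch. This is not the algorithm from the commit above and it handles only literals, "?" and "*"; the function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Match str against pat, where '?' matches any one character and '*'
 * matches any run of characters. Uses two saved pointers, no recursion. */
static bool wildcard_match(const char *pat, const char *str)
{
    const char *star = NULL;   /* pattern position just after the last '*' */
    const char *resume = NULL; /* text position where that '*' stops matching */

    while (*str) {
        if (*pat == '*') {
            star = ++pat;      /* remember the star, try matching zero chars */
            resume = str;
        } else if (*pat == '?' || *pat == *str) {
            pat++;
            str++;
        } else if (star) {
            pat = star;        /* mismatch: let the last '*' eat one more char */
            str = ++resume;
        } else {
            return false;
        }
    }
    while (*pat == '*')        /* trailing stars match the empty string */
        pat++;
    return *pat == '\0';
}
```

Only the last star ever needs to be revisited, which is why two saved pointers suffice and the stack depth stays constant regardless of the pattern.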
Rich Felker	2b87a5db82	update fnmatch to POSIX 2008 semantics an invalid bracket expression must be treated as if the opening bracket were just a literal character. this is to fix a bug whereby POSIX left the behavior of the "[" shell command undefined due to it being an invalid bracket expression.	2012-04-26 12:24:44 -04:00
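The POSIX 2008 rule in the commit above can be illustrated with a short fnmatch wrapper; the helper name is an assumption of this sketch.

```c
#include <fnmatch.h>

/* Return 1 if s matches pat with no flags set, 0 otherwise.
 * Hypothetical helper for illustration. */
static int glob_match(const char *pat, const char *s)
{
    return fnmatch(pat, s, 0) == 0;
}
```

Under these semantics the pattern "[" is an invalid bracket expression, so its "[" is taken literally and matches the one-character string "[", while a well-formed pattern such as "[abc]" still behaves as a bracket expression.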
Rich Felker	b9dd43db04	fix signedness error handling invalid multibyte sequences in regexec the "< 0" test was always false due to use of an unsigned type. this resulted in infinite loops on 32-bit machines (adding -1U to a pointer is the same as adding -1) and crashes on 64-bit machines (offsetting the string pointer by 4gb-1b when an illegal sequence was hit).	2012-04-14 22:32:42 -04:00
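The class of bug fixed above, testing an unsigned result with "< 0", can be avoided by comparing against the error sentinels explicitly. The sketch below is a generic mbrtowc loop with a hypothetical function name, not the code from regexec.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the multibyte characters in s under the current locale.
 * Returns -1 on an illegal or incomplete sequence. Hypothetical helper. */
static long count_chars(const char *s)
{
    mbstate_t st;
    size_t len = strlen(s);
    long n = 0;

    memset(&st, 0, sizeof st);
    while (len > 0) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, s, len, &st);

        /* r is unsigned, so "r < 0" would always be false; compare with
         * the (size_t)-1 and (size_t)-2 sentinels instead. */
        if (r == (size_t)-1 || r == (size_t)-2)
            return -1;
        if (r == 0)
            break; /* defensive: a null character was decoded */
        s += r;
        len -= r;
        n++;
    }
    return n;
}
```

If the error return were missed, adding r to the pointer would move it by a huge or effectively negative offset, which matches the loops and crashes the commit describes.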
Rich Felker	386b34a07b	remove invalid code from TRE TRE wants to treat + and ? after a +, ?, or * as special; ? means ungreedy and + is reserved for future use. however, this is non-conformant. although redundant, these redundant characters have well-defined (no-op) meaning for POSIX ERE, and are actually _literal_ characters (which TRE is wrongly ignoring) in POSIX BRE mode. the simplest fix is to simply remove the unneeded nonstandard functionality. as a plus, this shaves off a small amount of bloat.	2012-04-13 19:50:58 -04:00
Rich Felker	b6dbdc69b6	fix broken regerror (typo) and missing message	2012-04-13 18:40:38 -04:00
Rich Felker	ad47d45e9d	upgrade to latest upstream TRE regex code (0.8.0) the main practical results of this change are 1. the regex code is no longer subject to LGPL; it's now 2-clause BSD 2. most (all?) popular nonstandard regex extensions are supported I hesitate to call this a "sync" since both the old and new code are heavily modified. in one sense, the old code was "more severely" modified, in that it was actively hostile to non-strictly-conforming expressions. on the other hand, the new code has eliminated the useless translation of the entire regex string to wchar_t prior to compiling, and now only converts multibyte character literals as needed. in the future i may use this modified TRE as a basis for writing the long-planned new regex engine that will avoid multibyte-to-wide character conversion entirely by compiling multibyte bracket expressions specific to UTF-8.	2012-03-20 19:44:05 -04:00
Rich Felker	d0678b58ab	make glob mark symlinks-to-directories with the GLOB_MARK flag POSIX is unclear on whether it should, but all historical implementations seem to behave this way, and it seems more useful to applications.	2012-01-23 19:51:34 -05:00
Rich Felker	787c2648a9	support GLOB_PERIOD flag (GNU extension) to glob function patch by sh4rm4	2012-01-22 15:49:42 -05:00
Rich Felker	32aea2087a	duplicate re_nsub in LSB/glibc ABI compatible location	2011-06-16 16:53:11 -04:00
Rich Felker	da88b16a22	fix handling of d_name in struct dirent basically there are 3 choices for how to implement this variable-size string member: 1. C99 flexible array member: breaks using dirent.h with pre-C99 compiler. 2. old way: length-1 string: generates array bounds warnings in caller. 3. new way: length-NAME_MAX string. no problems, simplifies all code. of course the usable part in the pointer returned by readdir might be shorter than NAME_MAX+1 bytes, but that is allowed by the standard and doesn't hurt anything.	2011-06-06 18:04:28 -04:00
Rich Felker	0dc99ac413	safety fix for glob's vla usage: disallow patterns longer than PATH_MAX this actually inadvertently disallows some valid patterns with redundant / or * characters, but it's better than allowing unbounded vla allocation. eventually i'll write code to move the pattern to the stack and eliminate redundancy to ensure that it fits in PATH_MAX at the beginning of glob. this would also allow it to be modified in place for passing to fnmatch rather than copied at each level of recursion.	2011-06-05 19:29:52 -04:00
Rich Felker	a6c399cf62	eliminate (harmless in this case) vla usage in fnmatch.c	2011-06-05 13:30:56 -04:00
Rich Felker	74f75541ff	fix bug in TRE found by clang (typo && instead of &)	2011-04-07 23:13:47 -04:00
Rich Felker	0b44a0315b	initial check-in, version 0.5.0	2011-02-12 00:22:29 -05:00

41 Commits