doc.go (5465B)
/*
Package uniseg implements Unicode Text Segmentation, Unicode Line Breaking, and
string width calculation for monospace fonts. Unicode Text Segmentation conforms
to Unicode Standard Annex #29 (https://unicode.org/reports/tr29/) and Unicode
Line Breaking conforms to Unicode Standard Annex #14
(https://unicode.org/reports/tr14/).

In short, using this package, you can split a string into grapheme clusters
(what people would usually refer to as a "character"), into words, and into
sentences. Or, in its simplest case, this package allows you to count the number
of characters in a string, especially when it contains complex characters such
as emojis, combining characters, or characters from Asian, Arabic, Hebrew, or
other languages. Additionally, you can use it to implement line breaking (or
"word wrapping"), that is, to determine where text can be broken over to the
next line when the width of the line is not big enough to fit the entire text.
Finally, you can use it to calculate the display width of a string for monospace
fonts.

# Getting Started

If you just want to count the number of characters in a string, you can use
[GraphemeClusterCount]. If you want to determine the display width of a string,
you can use [StringWidth]. If you want to iterate over a string, you can use
[Step], [StepString], or the [Graphemes] type (more convenient but less
performant). These provide all information at once: grapheme clusters, word
boundaries, sentence boundaries, line breaks, and monospace character widths.
The specialized functions [FirstGraphemeCluster],
[FirstGraphemeClusterInString], [FirstWord], [FirstWordInString],
[FirstSentence], and [FirstSentenceInString] can be used if only one type of
information is needed.

# Grapheme Clusters

Consider the rainbow flag emoji: 🏳️‍🌈. On most modern systems, it appears as one
character. But its string representation actually has 14 bytes, so counting
bytes (or using len("🏳️‍🌈")) will not work as expected. Counting runes won't,
either: The flag has 4 Unicode code points, thus 4 runes. The stdlib function
utf8.RuneCountInString("🏳️‍🌈") and the expression len([]rune("🏳️‍🌈")) will both
return 4.

The [GraphemeClusterCount] function will return 1 for the rainbow flag emoji.
The [Graphemes] type and a variety of functions in this package will allow you
to split strings into their grapheme clusters.

# Word Boundaries

Word boundaries are used in a number of different contexts. The most familiar
ones are selection (double-click mouse selection), cursor movement ("move to
next word" control-arrow keys), and the dialog option "Whole Word Search" for
search and replace. This package provides methods for determining word
boundaries.

# Sentence Boundaries

Sentence boundaries are often used for triple-click or some other method of
selecting or iterating through blocks of text that are larger than single words.
They are also used to determine whether words occur within the same sentence in
database queries. This package provides methods for determining sentence
boundaries.

# Line Breaking

Line breaking, also known as word wrapping, is the process of breaking a section
of text into lines such that it will fit in the available width of a page,
window or other display area. This package provides methods to determine the
positions in a string where a line must be broken, may be broken, or must not be
broken.

# Monospace Width

Monospace width, as referred to in this package, is the width of a string in a
monospace font. This is commonly used in terminal user interfaces or text
displays or editors that don't support proportional fonts. A width of 1
corresponds to a single character cell. The C function [wcswidth()] and its
implementations in other programming languages are in widespread use for the
same purpose. However, there is no standard for the calculation of such widths,
and this package differs from wcswidth() in a number of ways, presumably to
generate more visually pleasing results.

To start, we assume that every code point has a width of 1, with the following
exceptions:

  - Code points with grapheme cluster break properties Control, CR, LF, Extend,
    and ZWJ have a width of 0.
  - U+2E3A, Two-Em Dash, has a width of 3.
  - U+2E3B, Three-Em Dash, has a width of 4.
  - Characters with the East-Asian Width properties "Fullwidth" (F) and "Wide"
    (W) have a width of 2. (Properties "Ambiguous" (A) and "Neutral" (N) both
    have a width of 1.)
  - Code points with grapheme cluster break property Regional Indicator have a
    width of 2.
  - Code points with grapheme cluster break property Extended Pictographic have
    a width of 2, unless their Emoji Presentation flag is "No", in which case
    the width is 1.

For Hangul grapheme clusters composed of conjoining Jamo and for Regional
Indicators (flags), all code points except the first one have a width of 0. For
grapheme clusters starting with an Extended Pictographic, any additional code
point will force a total width of 2, except if the Variation Selector-15
(U+FE0E) is included, in which case the total width is always 1. Grapheme
clusters ending with Variation Selector-16 (U+FE0F) have a width of 2.

Note that whether these widths appear correct depends on your application's
render engine, to which extent it conforms to the Unicode Standard, and its
choice of font.

[wcswidth()]: https://man7.org/linux/man-pages/man3/wcswidth.3.html
*/
package uniseg