# Unicode Text Segmentation for Go

[![Go Reference](https://pkg.go.dev/badge/github.com/rivo/uniseg.svg)](https://pkg.go.dev/github.com/rivo/uniseg)
[![Go Report](https://img.shields.io/badge/go%20report-A%2B-brightgreen.svg)](https://goreportcard.com/report/github.com/rivo/uniseg)

This Go package implements Unicode Text Segmentation according to [Unicode Standard Annex #29](https://unicode.org/reports/tr29/), Unicode Line Breaking according to [Unicode Standard Annex #14](https://unicode.org/reports/tr14/) (Unicode version 15.0.0), and monospace font string width calculation similar to [wcwidth](https://man7.org/linux/man-pages/man3/wcwidth.3.html).

## Background

### Grapheme Clusters

In Go, [strings are read-only slices of bytes](https://go.dev/blog/strings). They can be turned into Unicode code points using a `for ... range` loop or by conversion: `[]rune(str)`. However, multiple code points may be combined into one user-perceived character, which the Unicode specification calls a "grapheme cluster". Here are some examples:

|String|Bytes (UTF-8)|Code points (runes)|Grapheme clusters|
|-|-|-|-|
|Käse|6 bytes: `4b 61 cc 88 73 65`|5 code points: `4b 61 308 73 65`|4 clusters: `[4b],[61 308],[73],[65]`|
|🏳️‍🌈|14 bytes: `f0 9f 8f b3 ef b8 8f e2 80 8d f0 9f 8c 88`|4 code points: `1f3f3 fe0f 200d 1f308`|1 cluster: `[1f3f3 fe0f 200d 1f308]`|
|🇩🇪|8 bytes: `f0 9f 87 a9 f0 9f 87 aa`|2 code points: `1f1e9 1f1ea`|1 cluster: `[1f1e9 1f1ea]`|

This package provides tools to iterate over these grapheme clusters. This may be used to determine the number of user-perceived characters, to split strings in their intended places, or to extract individual characters which form a unit.

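The byte and code point columns of the table can be reproduced with the standard library alone. The following is a minimal sketch; the helper name `counts` is made up for illustration, and the strings are written with escapes so the exact code points are visible:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts returns the number of UTF-8 bytes and the number of code points
// (runes) in s. Neither value is the number of grapheme clusters.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	samples := []string{
		"Ka\u0308se",                       // "Käse" with a combining diaeresis
		"\U0001F3F3\uFE0F\u200D\U0001F308", // rainbow flag
		"\U0001F1E9\U0001F1EA",             // German flag
	}
	for _, s := range samples {
		b, r := counts(s)
		fmt.Printf("%+q: %d bytes, %d code points\n", s, b, r)
	}
}
```

For the three samples this reports 6 bytes and 5 code points, 14 bytes and 4 code points, and 8 bytes and 2 code points. The cluster counts of 4, 1, and 1 cannot be derived from these numbers; they require the segmentation rules that this package implements.
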
### Word Boundaries

Word boundaries are used in a number of different contexts. The most familiar ones are selection (double-click mouse selection), cursor movement ("move to next word" control-arrow keys), and the dialog option "Whole Word Search" for search and replace. They are also used in database queries, to determine whether elements are within a certain number of words of one another. Searching may also use word boundaries in determining matching items. This package provides tools to determine word boundaries within strings.

### Sentence Boundaries

Sentence boundaries are often used for triple-click or some other method of selecting or iterating through blocks of text that are larger than single words. They are also used to determine whether words occur within the same sentence in database queries. This package provides tools to determine sentence boundaries within strings.

### Line Breaking

Line breaking, also known as word wrapping, is the process of breaking a section of text into lines such that it will fit in the available width of a page, window or other display area. This package provides tools to determine where a string may or may not be broken and where it must be broken (for example after newline characters).

### Monospace Width

Most terminals or text displays / text editors using a monospace font (for example source code editors) use a fixed width for each character. Some characters such as emojis or characters found in Asian and other languages may take up more than one character cell. This package provides tools to determine the number of cells a string will take up when displayed in a monospace font. See the [package documentation](https://pkg.go.dev/github.com/rivo/uniseg#hdr-Monospace_Width) for more information.

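Line breaking and cell widths come together in word wrapping. The sketch below shows only the greedy packing step and uses the standard library alone: it assumes the text has already been split into unbreakable segments, each with a known cell width and a flag for mandatory breaks. The `segment` type and the `wrap` function are hypothetical names for illustration, not part of this package, and trailing whitespace is counted toward the width for simplicity.

```go
package main

import (
	"fmt"
	"strings"
)

// segment is a hypothetical record for one unbreakable piece of text,
// as a line segmentation pass might produce it.
type segment struct {
	text      string // the segment, including any trailing whitespace
	width     int    // number of monospace cells the segment occupies
	mustBreak bool   // true if a line break is mandatory after this segment
}

// wrap greedily packs segments into lines of at most maxWidth cells.
// A segment wider than maxWidth ends up on a line of its own.
func wrap(segments []segment, maxWidth int) []string {
	var lines []string
	var line strings.Builder
	lineWidth := 0
	flush := func() {
		lines = append(lines, line.String())
		line.Reset()
		lineWidth = 0
	}
	for _, s := range segments {
		// Start a new line if this segment would overflow the current one.
		if line.Len() > 0 && lineWidth+s.width > maxWidth {
			flush()
		}
		line.WriteString(s.text)
		lineWidth += s.width
		if s.mustBreak {
			flush()
		}
	}
	if line.Len() > 0 {
		flush()
	}
	return lines
}

func main() {
	segments := []segment{
		{"Hello, ", 7, false},
		{"wide ", 5, false},
		{"world!\n", 6, true},
		{"Bye.", 4, true},
	}
	for _, l := range wrap(segments, 12) {
		fmt.Printf("%q\n", l)
	}
	// "Hello, wide "
	// "world!\n"
	// "Bye."
}
```

In a real program the segments would come from a line segmentation pass and the widths from a monospace width calculation, both of which this package provides.
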
## Installation

```bash
go get github.com/rivo/uniseg
```

## Examples

### Counting Characters in a String

```go
n := uniseg.GraphemeClusterCount("🇩🇪🏳️‍🌈")
fmt.Println(n)
// 2
```

### Calculating the Monospace String Width

```go
width := uniseg.StringWidth("🇩🇪🏳️‍🌈!")
fmt.Println(width)
// 5 (two cells per emoji cluster, one for "!")
```

### Using the [`Graphemes`](https://pkg.go.dev/github.com/rivo/uniseg#Graphemes) Type

This is the most convenient method of iterating over grapheme clusters:

```go
gr := uniseg.NewGraphemes("👍🏼!")
for gr.Next() {
	fmt.Printf("%x ", gr.Runes())
}
// [1f44d 1f3fc] [21]
```

### Using the [`Step`](https://pkg.go.dev/github.com/rivo/uniseg#Step) or [`StepString`](https://pkg.go.dev/github.com/rivo/uniseg#StepString) Function

This avoids allocating a new `Graphemes` object, but it requires handling states and boundaries yourself:

```go
str := "🇩🇪🏳️‍🌈"
state := -1 // -1 is the initial state for the first call
var c string
for len(str) > 0 {
	// The third return value (the boundaries) is ignored here.
	c, str, _, state = uniseg.StepString(str, state)
	fmt.Printf("%x ", []rune(c))
}
// [1f1e9 1f1ea] [1f3f3 fe0f 200d 1f308]
```

### Advanced Examples

The [`Graphemes`](https://pkg.go.dev/github.com/rivo/uniseg#Graphemes) type offers the most convenient way to access all functionality of this package. But in some cases, it may be better to use the specialized functions directly. For example, if you're only interested in word segmentation, use [`FirstWord`](https://pkg.go.dev/github.com/rivo/uniseg#FirstWord) or [`FirstWordInString`](https://pkg.go.dev/github.com/rivo/uniseg#FirstWordInString):

```go
str := "Hello, world!"
state := -1 // -1 is the initial state for the first call
var c string
for len(str) > 0 {
	c, str, state = uniseg.FirstWordInString(str, state)
	fmt.Printf("(%s)\n", c)
}
// (Hello)
// (,)
// ( )
// (world)
// (!)
```

Similarly, use

- [`FirstGraphemeCluster`](https://pkg.go.dev/github.com/rivo/uniseg#FirstGraphemeCluster) or [`FirstGraphemeClusterInString`](https://pkg.go.dev/github.com/rivo/uniseg#FirstGraphemeClusterInString) for grapheme cluster determination only,
- [`FirstSentence`](https://pkg.go.dev/github.com/rivo/uniseg#FirstSentence) or [`FirstSentenceInString`](https://pkg.go.dev/github.com/rivo/uniseg#FirstSentenceInString) for sentence segmentation only, and
- [`FirstLineSegment`](https://pkg.go.dev/github.com/rivo/uniseg#FirstLineSegment) or [`FirstLineSegmentInString`](https://pkg.go.dev/github.com/rivo/uniseg#FirstLineSegmentInString) for line breaking / word wrapping (although using [`Step`](https://pkg.go.dev/github.com/rivo/uniseg#Step) or [`StepString`](https://pkg.go.dev/github.com/rivo/uniseg#StepString) is preferred as it will observe grapheme cluster boundaries).

If you're only interested in the width of characters, use [`FirstGraphemeCluster`](https://pkg.go.dev/github.com/rivo/uniseg#FirstGraphemeCluster) or [`FirstGraphemeClusterInString`](https://pkg.go.dev/github.com/rivo/uniseg#FirstGraphemeClusterInString). They are much faster than [`Step`](https://pkg.go.dev/github.com/rivo/uniseg#Step), [`StepString`](https://pkg.go.dev/github.com/rivo/uniseg#StepString), or the [`Graphemes`](https://pkg.go.dev/github.com/rivo/uniseg#Graphemes) type because they do not include the logic for word / sentence / line boundaries.

Finally, if you need to reverse a string while preserving grapheme clusters, use [`ReverseString`](https://pkg.go.dev/github.com/rivo/uniseg#ReverseString):

```go
fmt.Println(uniseg.ReverseString("🇩🇪🏳️‍🌈"))
// 🏳️‍🌈🇩🇪
```

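To see why a cluster-aware reversal matters, compare it with the naive approach of reversing code points. The sketch below uses only the standard library; `reverseRunes` is a hypothetical helper written for this illustration and is not part of the package:

```go
package main

import "fmt"

// reverseRunes reverses s code point by code point. This is the naive
// approach and does NOT preserve grapheme clusters.
func reverseRunes(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

func main() {
	s := "Ka\u0308se" // "Käse" with a combining diaeresis (U+0308) after the "a"
	fmt.Printf("%+q\n", reverseRunes(s))
	// "es\u0308aK"
}
```

After the naive reversal the combining diaeresis follows the "s" instead of the "a", so the text renders with the mark on the wrong letter. Reversing whole grapheme clusters keeps `[61 308]` together as one unit.
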
## Documentation

Refer to https://pkg.go.dev/github.com/rivo/uniseg for the package's documentation.

## Dependencies

This package does not depend on any packages outside the standard library.

## Sponsor this Project

[Become a Sponsor on GitHub](https://github.com/sponsors/rivo?metadata_source=uniseg_readme) to support this project!

## Your Feedback

Add your issue on GitHub, preferably before submitting any PRs. Feel free to get in touch if you have any questions.