An implementation of grapheme cluster boundaries from [Unicode text segmentation](https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) (UAX 29), for Unicode version 15.0.0.

[Documentation](https://pkg.go.dev/github.com/clipperhouse/uax29/v2/graphemes)

## Quick start

```
go get "github.com/clipperhouse/uax29/v2/graphemes"
```

```go
import "github.com/clipperhouse/uax29/v2/graphemes"

text := "Hello, 世界. Nice dog! 👍🐶"

tokens := graphemes.FromString(text)

for tokens.Next() {             // Next() returns true until end of data
	fmt.Println(tokens.Value()) // Do something with the current grapheme
}
```

_A grapheme is a “single visible character”, which might be as simple as a single letter, or as complex as an emoji that consists of several Unicode code points._

## Conformance

We use the Unicode [test suite](https://unicode.org/reports/tr41/tr41-26.html#Tests29).

## APIs

### If you have a `string`

```go
text := "Hello, 世界. Nice dog! 👍🐶"

tokens := graphemes.FromString(text)

for tokens.Next() {             // Next() returns true until end of data
	fmt.Println(tokens.Value()) // Do something with the current grapheme
}
```

### If you have an `io.Reader`

`FromReader` embeds a [`bufio.Scanner`](https://pkg.go.dev/bufio#Scanner), so just use those methods.

```go
r := getYourReader() // from a file or network maybe
tokens := graphemes.FromReader(r)

for tokens.Scan() {            // Scan() returns true until error or EOF
	fmt.Println(tokens.Text()) // Do something with the current grapheme
}

if err := tokens.Err(); err != nil { // Check the error
	log.Fatal(err)
}
```

### If you have a `[]byte`

```go
b := []byte("Hello, 世界. Nice dog! 👍🐶")

tokens := graphemes.FromBytes(b)

for tokens.Next() {             // Next() returns true until end of data
	fmt.Println(tokens.Value()) // Do something with the current grapheme
}
```

### Benchmarks

On a Mac M2 laptop, we see around 200 MB/s, or around 100 million graphemes per second, and no allocations.

```
goos: darwin
goarch: arm64
pkg: github.com/clipperhouse/uax29/graphemes/comparative
cpu: Apple M2
BenchmarkGraphemes/clipperhouse/uax29-8    173805 ns/op    201.16 MB/s    0 B/op    0 allocs/op
BenchmarkGraphemes/rivo/uniseg-8          2045128 ns/op     17.10 MB/s    0 B/op    0 allocs/op
```

### Invalid inputs

Invalid UTF-8 input is considered undefined behavior. We test to ensure that bad inputs will not cause pathological outcomes, such as a panic or infinite loop. Callers should expect “garbage-in, garbage-out”.

Your pipeline should probably include a call to [`utf8.Valid()`](https://pkg.go.dev/unicode/utf8#Valid).