An implementation of grapheme cluster boundaries from [Unicode text segmentation](https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) (UAX 29), for Unicode version 15.0.0.

[![Documentation](https://pkg.go.dev/badge/github.com/clipperhouse/uax29/v2/graphemes.svg)](https://pkg.go.dev/github.com/clipperhouse/uax29/v2/graphemes)
![Tests](https://github.com/clipperhouse/uax29/actions/workflows/gotest.yml/badge.svg)
![Fuzz](https://github.com/clipperhouse/uax29/actions/workflows/gofuzz.yml/badge.svg)

## Quick start

```
go get "github.com/clipperhouse/uax29/v2/graphemes"
```

```go
package main

import (
	"fmt"

	"github.com/clipperhouse/uax29/v2/graphemes"
)

func main() {
	text := "Hello, 世界. Nice dog! 👍🐶"

	tokens := graphemes.FromString(text)

	for tokens.Next() {             // Next() returns true until end of data
		fmt.Println(tokens.Value()) // Do something with the current grapheme
	}
}
```

_A grapheme is a “single visible character”, which might be as simple as a single letter, or as complex as an emoji that consists of several Unicode code points._
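
To make the definition above concrete, here is a minimal sketch using only the standard library. It does not call this package; the `sizes` helper and the sample strings are illustrative assumptions. Each sample renders as one visible character, yet its byte count and code point count differ, which is why counting bytes or runes is not a substitute for grapheme segmentation.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes is a hypothetical helper: it reports how many bytes and how many
// code points (runes) a string holds.
func sizes(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	samples := []struct {
		name string
		text string
	}{
		{"plain letter", "a"},
		{"e + combining acute accent", "e\u0301"},
		{"thumbs up + skin tone modifier", "\U0001F44D\U0001F3FD"},
	}

	// Each sample is displayed as a single visible character.
	for _, s := range samples {
		b, r := sizes(s.text)
		fmt.Printf("%-32s bytes=%d code points=%d\n", s.name, b, r)
	}
}
```

The last two samples each contain two code points, so a loop over runes would split them in the middle of what a reader sees as one character.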

## Conformance

We use the Unicode [test suite](https://unicode.org/reports/tr41/tr41-26.html#Tests29).

![Tests](https://github.com/clipperhouse/uax29/actions/workflows/gotest.yml/badge.svg)
![Fuzz](https://github.com/clipperhouse/uax29/actions/workflows/gofuzz.yml/badge.svg)

## APIs

### If you have a `string`

```go
text := "Hello, 世界. Nice dog! 👍🐶"

tokens := graphemes.FromString(text)

for tokens.Next() {             // Next() returns true until end of data
	fmt.Println(tokens.Value()) // Do something with the current grapheme
}
```

### If you have an `io.Reader`

`FromReader` embeds a [`bufio.Scanner`](https://pkg.go.dev/bufio#Scanner), so use that type's methods: `Scan`, `Text`, and `Err`.

```go
r := getYourReader() // from a file or network maybe
tokens := graphemes.FromReader(r)

for tokens.Scan() {            // Scan() returns true until error or EOF
	fmt.Println(tokens.Text()) // Do something with the current grapheme
}

if err := tokens.Err(); err != nil { // Check the error
	log.Fatal(err)
}
```

### If you have a `[]byte`

```go
b := []byte("Hello, 世界. Nice dog! 👍🐶")

tokens := graphemes.FromBytes(b)

for tokens.Next() {             // Next() returns true until end of data
	fmt.Println(tokens.Value()) // Do something with the current grapheme
}
```

### Benchmarks

On a Mac M2 laptop, we see around 200 MB/s, or around 100 million graphemes per second, with no allocations.

```
goos: darwin
goarch: arm64
pkg: github.com/clipperhouse/uax29/graphemes/comparative
cpu: Apple M2
BenchmarkGraphemes/clipperhouse/uax29-8     173805 ns/op   201.16 MB/s   0 B/op   0 allocs/op
BenchmarkGraphemes/rivo/uniseg-8           2045128 ns/op    17.10 MB/s   0 B/op   0 allocs/op
```
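
As a reading aid for the figures above, the following sketch derives two numbers the output does not state directly: the approximate input size per operation and the relative speed of the two results. It relies on Go's benchmark convention that 1 MB is 1e6 bytes; the helper names `bytesPerOp` and `speedup` are assumptions made for illustration.

```go
package main

import "fmt"

// bytesPerOp derives the input size from one benchmark line:
// throughput in MB/s (1 MB = 1e6 bytes) multiplied by time per operation in ns.
func bytesPerOp(mbPerSec, nsPerOp float64) float64 {
	return mbPerSec * 1e6 * nsPerOp * 1e-9
}

// speedup reports how many times faster the first result is than the second.
func speedup(fastNsPerOp, slowNsPerOp float64) float64 {
	return slowNsPerOp / fastNsPerOp
}

func main() {
	// Figures copied from the benchmark output above.
	fmt.Printf("input size: ~%.0f bytes\n", bytesPerOp(201.16, 173805))
	fmt.Printf("speedup:    ~%.1fx\n", speedup(173805, 2045128))
}
```

Both benchmark lines work out to an input of roughly 35 KB per operation, and the ratio of the two timings is a little under 12.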

### Invalid inputs

Invalid UTF-8 input is considered undefined behavior. We test to ensure that bad inputs will not cause pathological outcomes, such as a panic or infinite loop. Callers should expect “garbage-in, garbage-out”.

Your pipeline should probably include a call to [`utf8.Valid()`](https://pkg.go.dev/unicode/utf8#Valid).
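
One possible shape for that validation step, as a sketch using only the standard library. The `prepare` helper and its policy of replacing invalid sequences with U+FFFD (rather than rejecting the input) are assumptions, not something this package prescribes; rejecting invalid input outright is an equally reasonable choice.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// prepare is a hypothetical pre-processing step. Valid input is returned
// unchanged; otherwise each run of invalid bytes is replaced with U+FFFD.
func prepare(b []byte) (out []byte, wasValid bool) {
	if utf8.Valid(b) {
		return b, true
	}
	return bytes.ToValidUTF8(b, []byte("\uFFFD")), false
}

func main() {
	good := []byte("Hello, 世界")
	bad := []byte{'h', 'i', 0xff, '!'} // 0xff can never appear in valid UTF-8

	for _, in := range [][]byte{good, bad} {
		out, ok := prepare(in)
		fmt.Printf("valid=%t output=%q\n", ok, out)
	}
}
```

The returned bytes are always valid UTF-8, so they can then be passed to `graphemes.FromBytes` without relying on undefined behavior.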