Alexander Schaaf

Automate parsing Kindle highlights using Go

Syncing highlights from my Kindle without relying on Amazon’s Whispersync involves copying your clippings file from the Kindle, either using tools such as Calibre, or manually. And the Kindle just dumps all highlights into one file - which is great if you just want to search over all your highlights. But I find myself often just wanting to conveniently look at my highlights from single book.

So as an exercise in further learning Go I decided to write a little program that can be easily executed whenever I connect my kindle. An advantage of using a compiled language like Go is that provides us with a nice binary program we can compile for any common platflorm and easily automate it to run whenever we connect a Kindle - without having to manage Python environments.

So let’s start out with our main function and load open the clippings text file from the connected Kindle device. You may have to modify the path depending on your device and operating system.

package main

func main() {
    file, err := os.Open("/Volumes/Kindle/documents/My Clippings.txt")
    if err != nil {
		log.Fatal(err)
	}
    file.Close()
}

We can now use a Scanner from Go’s bufio package, and use it’s Split read the file line by line. We can iterate to the next line by calling scanner.Scan() and access the line’s text using scanner.Text().

func main() {
    file, err := os.Open("/Volumes/Kindle/documents/My Clippings.txt")
    if err != nil {
		log.Fatal(err)
	}

    scanner := bufio.NewScanner(file)
    scanner.Split(bufio.ScanLines)

    for scanner.Scan() {
        line := scanner.Text()
        fmt.Println(line)
    }
    file.Close()
}

So now that we have access to every line in the file, we can have a look at the structure of the text file to figure out how to parse it. Here are two highlights:

The Color of Magic (Terry David John Pratchett)
- Your Highlight at location 2-3 | Added on Sunday, 14 February 2021 09:24:24

The Colour of Magic is Terry Pratchett’s maiden voyage through the bizarre land of Discworld.
==========
The Color of Magic (Terry David John Pratchett)
- Your Highlight at location 4-5 | Added on Sunday, 14 February 2021 09:32:41

“All wizards get like that… it’s the quicksilver fumes. Rots their brains. Mushrooms, too.”
==========

The individual highlights are separated by ==========. Each first line of a highlight contains the book title and author. In the second line we find the highlight’s location and when it was created. The actual highlight comes afterwards.

So, naturally we want to save all this info. We could use a map as a data structure for the highlights, but a simple data struct can handles much more intuitively and with autocompletion:

type HighlightData struct {
	Author    string
	Book      string
	Timestamp string
	Location  string
	Text      string
}

We also define a constant for the separator:

const separator string = "=========="

Now, for the parsing we always want to know the current line, but not of the humongous My Clippings.txt file itself, but within each highlight. That way we can easily determine where the metadata is located. We can use a simple counter variable to keep track of the line we are in within each individual highlight. When we reach a seperator line, we reset the counter to zero, otherwise we increment:

    // ...
    var counter int
    for scanner.Scan() {
        line := scanner.Text()

        if line == separator {
            counter = 0
            continue
        }
        counter++
    }
    file.Close()

Parsing author and book title

Alright, next up is parsing the first line of each highlight: book title and author. The function parseAuthorTitle takes in the target line as a string, splits it at the opening parenthesis into book title and author. We then just need to trim the spaces and trailing parenthesis off, and return the two strings.

// example line:
// The Color of Magic (Terry David John Pratchett)
func parseAuthorTitle(line string) (string, string) {
	splitLines := strings.Split(line, "(")
	bookRaw := splitLines[0]
	authorRaw := splitLines[1]

	book := strings.TrimSpace(bookRaw)
	author := strings.Trim(authorRaw, ")")
	return author, book
}

Okay, now that we have our first part of parsed data and our data structure defined, we need a convenient way of saving it. For that we create an empty slice to append our highlights to (var highlights []HighlightData) and instantiate our first empty highlight var highlight HighlightData = HighlightData{}, which we will fill with our first clipping. Once we’ve parsed all lines of the first highlight, we’ll hit a separator. We then append the current highlight object to our slice of highlights and overwrite our highlight variable with a new empty HighlightData object which we can fill with the next highlight.

    // ...
    var counter int

    var highlights []HighlightData
    var hl HighlightData = highlight{}

    for scanner.Scan() {
        line := scanner.Text()

        if counter == 0 {
            hl.author, hl.book = parseAuthorTitle(line)
        }

        if line == separator {
            highlights = append(highlights, hl)
			hl = HighlightData{}
            counter = 0
            continue
        }
        counter++
    }
    file.Close()

Parsing location and timestamp

Next up is the second line in each highlight, containing its book location and timestamp. So we create a new function called parseLocDatetime, which also takes the line as an input and outputs the location and timestamp as strings. We split the line at the pipe | into the location on the left and timestamp on the right.

We then use a regular expression to extract just the location numbers. [\d]+-[\d]+ will do just fine for that. For the timestamp, we can just remove the Added on and be done with it. If you want to actually parse the timestamp string into a timestamp-like object, this would be the place to do it. But I’m fine with the timestamp as it is.

// example line
// - Your Highlight at location 2-3 | Added on Sunday, 14 February 2021 09:24:24
func parseLocDatetime(line string) (string, string) {
	header := strings.Split(line, " | ")
	locRaw := header[0]
	re := regexp.MustCompile(`[\d]+-[\d]+`)
	locRange := re.FindString(locRaw)

	location := locRange

	dateRaw := header[1]
	timestamp := dateRaw[7:]
	return location, timestamp
}

Adding the function call to our parsing loop, only triggering it in the second line of each clipping:

    // ...
    var counter int

    var highlights []HighlightData
    var hl HighlightData = highlight{}

    for scanner.Scan() {
        line := scanner.Text()

        if counter == 0 {
            hl.author, hl.book = parseAuthorTitle(line)
        } else if counter == 1 {
            hl.location, hl.timestamp = parseLocDatetime(line)
        }

        if line == separator {
            highlights = append(highlights, hl)
			hl = HighlightData{}
            counter = 0
            continue
        }
        counter++
    }
    file.Close()

Next up is the actually highlighted text - which is much simpler to parse, as we just need to append each text line to our highlight.Text property. We can do that in the else block:

    // ...
    var counter int

    var highlights []HighlightData
    var hl HighlightData = HighlightData{}

    for scanner.Scan() {
        line := scanner.Text()

        if counter == 0 {
            hl.author, hl.book = parseAuthorTitle(line)
        } else if counter == 1 {
            hl.location, hl.timestamp = parseLocDatetime(line)
        } else {
            if line == separator {
                highlights = append(highlights, hl)
			    hl = HighlightData{}
                counter = 0
                continue
            }
            hl.text = hl.text + line
        }
        counter++
    }
    file.Close()

Note that for this to work properly, we have to first check if the line is a separator, as we really don’t want that in our highlight text. So we move the code checking for the separator into the else block in front of where we append the text.

And with that we’re all done with the actual parsing code. But what’s missing is saving it to disk in convenient file structure.

Saving it all to disk

I want my clippings to be stored on my NAS server in a kindle-clippings folder, with a folder for each author and a markdown file for each book. For that let’s code up a saveHighlight function that takes as arguments the highlight to save, and the clippings root folder.

As a first step, we check if the folder exists - and create it if not:

func saveHighlight(hl HighlightData, loc string) {
	if _, err := os.Stat(loc); os.IsNotExist(err) {
		os.Mkdir(loc, os.ModePerm)
	}
}

Then we do the same thing with the author folder:

func saveHighlight(hl HighlightData, loc string) {
	if _, err := os.Stat(loc); os.IsNotExist(err) {
		os.Mkdir(loc, os.ModePerm)
	}

    authorFolder := highlightsFolder + "/" + highlight.Author
	if _, err := os.Stat(authorFolder); os.IsNotExist(err) {
		os.Mkdir(authorFolder, os.ModePerm)
	}
}

And then check if the book file exists. If not we create it with both the book title and author at the top:

func saveHighlight(hl HighlightData, loc string) {
	if _, err := os.Stat(loc); os.IsNotExist(err) {
		os.Mkdir(loc, os.ModePerm)
	}

    authorFolder := highlightsFolder + "/" + highlight.Author
	if _, err := os.Stat(authorFolder); os.IsNotExist(err) {
		os.Mkdir(authorFolder, os.ModePerm)
	}

    bookFile := authorFolder + "/" + highlight.Book + ".md"
	if _, err := os.Stat(bookFile); os.IsNotExist(err) {
		err := ioutil.WriteFile(
            bookFile,
            []byte("# "+highlight.Book+"\n## "+highlight.Author),
            0755) // unix file permissions code
		if err != nil {
			log.Fatal(er)
		}
	}
}

What we really don’t want to do next is to just overwrite our highlights files everytime we run the program. Rather, let’s check if the file already contains our highlight text. If so, we return, if not we can then go ahead and append the highlight. We do this by reading in the entire file, turn it into a big string and use strings.Contains to do the check.

func saveHighlight(highlight HighlightData, highlightsFolder string) {
	if _, err := os.Stat(highlightsFolder); os.IsNotExist(err) {
		os.Mkdir(highlightsFolder, os.ModePerm)
	}

	authorFolder := highlightsFolder + "/" + highlight.Author
	if _, err := os.Stat(authorFolder); os.IsNotExist(err) {
		os.Mkdir(authorFolder, os.ModePerm)
	}

	bookFile := authorFolder + "/" + highlight.Book + ".md"
	if _, err := os.Stat(bookFile); os.IsNotExist(err) {
		err := ioutil.WriteFile(
            bookFile,
            []byte("# "+highlight.Book+"\n## "+highlight.Author),
            0755)
		if err != nil {
			log.Fatal(err)
		}
	}

	fileBytes, err := ioutil.ReadFile(bookFile)
	if err != nil {
		log.Fatal(err)
	}
	fileContent := string(fileBytes)
	if strings.Contains(fileContent, highlight.Text) {
		return
	}
}

After that we can go ahead with appending the highlight metadata and text:

func saveHighlight(highlight HighlightData, highlightsFolder string) {
	if _, err := os.Stat(highlightsFolder); os.IsNotExist(err) {
		os.Mkdir(highlightsFolder, os.ModePerm)
	}

	authorFolder := highlightsFolder + "/" + highlight.Author
	if _, err := os.Stat(authorFolder); os.IsNotExist(err) {
		os.Mkdir(authorFolder, os.ModePerm)
	}

	bookFile := authorFolder + "/" + highlight.Book + ".md"
	if _, err := os.Stat(bookFile); os.IsNotExist(err) {
		err := ioutil.WriteFile(
            bookFile,
            []byte("# "+highlight.Book+"\n## "+highlight.Author),
            0755)
		if err != nil {
			log.Fatal(err)
		}
	}

	fileBytes, err := ioutil.ReadFile(bookFile)
	if err != nil {
		log.Fatal(err)
	}
	fileContent := string(fileBytes)
	if strings.Contains(fileContent, highlight.Text) {
		return
	}

    file, err := os.OpenFile(bookFile, os.O_APPEND|os.O_WRONLY, 0644)
    if err != nil {
        log.Fatal(err)
    }

    if _, err = file.WriteString("\n\n### " + highlight.Timestamp); err != nil {
        panic(err)
    }
    if _, err = file.WriteString("\n#### " + strings.Title(highlight.Location)); err != nil {
        panic(err)
    }
    if _, err = file.WriteString("\n\n" + highlight.Text); err != nil {
        panic(err)
    }
}

With that done, we iterate over all highlights in the main function and pass every highlight into our newly written saveHighlight function, which will write all the contents to the given folder path:

    // ...
    for _, highlight := range highlights {
            saveHighlight(highlight, "~/kindle-highlights")
    }

An example file with one highlight looks like the following:

# The Color of Magic
## Terry David John Pratchett

### Sunday, 14 February 2021 09:24:24
#### Location 2-3

The Colour of Magic is Terry Pratchett’s maiden voyage through the bizarre land of Discworld.

The only thing missing is to allow passing a destination folder parameter to the main function so that we can conveniently use it as a command line tool without having to always edit the filepath in the source code. We can easily access the raw command-line arguments using os.Args, which is of type slice. Its first value is the program path. So if we want to call our program using go run main.go "~/kindle-highlights" (or programName "~/kindle-highlights" when compiled), we can access the path with os.Args[1]:

    // ...
    for _, highlight := range highlights {
		saveHighlight(highlight, os.Args[1])
	}

Awesome! That’s it. Compile it and run it whenever you want to sync your highlights. Or write some code to automate running the script whenever a USB device mounts!

All the code

package main

import (
	"bufio"
	"io/ioutil"
	"log"
	"os"
	"regexp"
	"strings"
)

const separator string = "=========="

type HighlightData struct {
	author    string
	book      string
	timestamp string
	location  string
	text      string
}

func parseAuthorTitle(line string) (string, string) {
	splitLines := strings.Split(line, "(")
	bookRaw := splitLines[0]
	authorRaw := splitLines[1]

	book := strings.TrimSpace(bookRaw)[3:]
	author := strings.Trim(authorRaw, ")")
	return author, book
}

func parseLocDatetime(line string) (string, string) {
	header := strings.Split(line, " | ")
	locRaw := header[0]
	re := regexp.MustCompile(`location [\d]+-[\d]+`)
	location := re.FindString(locRaw)

	dateRaw := header[1]
	timestamp := dateRaw[9:]
	return location, timestamp
}

func saveHighlight(highlight HighlightData, highlightsFolder string) {
	if _, err := os.Stat(highlightsFolder); os.IsNotExist(err) {
		os.Mkdir(highlightsFolder, os.ModePerm)
	}

	authorFolder := highlightsFolder + "/" + highlight.Author
	if _, err := os.Stat(authorFolder); os.IsNotExist(err) {
		os.Mkdir(authorFolder, os.ModePerm)
	}

	bookFile := authorFolder + "/" + highlight.Book + ".md"
	if _, err := os.Stat(bookFile); os.IsNotExist(err) {
		err := ioutil.WriteFile(
			bookFile,
			[]byte("# "+highlight.Book+"\n## "+highlight.Author),
			0755)
		if err != nil {
			log.Fatal(err)
		}
	}

	fileBytes, err := ioutil.ReadFile(bookFile)
	if err != nil {
		log.Fatal(err)
	}
	fileContent := string(fileBytes)
	if strings.Contains(fileContent, highlight.Text) {
		return
	}

	file, err := os.OpenFile(bookFile, os.O_APPEND|os.O_WRONLY, 0644)
	if err != nil {
		log.Fatal(err)
	}

	if _, err = file.WriteString("\n\n### " + highlight.Timestamp); err != nil {
		panic(err)
	}
	if _, err = file.WriteString("\n#### " + strings.Title(highlight.Location)); err != nil {
		panic(err)
	}
	if _, err = file.WriteString("\n\n" + highlight.Text); err != nil {
		panic(err)
	}

	file.Close()
}

func main() {
	file, err := os.Open("/Volumes/Kindle/documents/My Clippings.txt")
	if err != nil {
		log.Fatal(err)
	}

	scanner := bufio.NewScanner(file)

	scanner.Split(bufio.ScanLines)

	var counter int

	var highlights []HighlightData
	var hl HighlightData = HighlightData{}
	for scanner.Scan() {
		line := scanner.Text()
		if counter == 0 {
			hl.author, hl.book = parseAuthorTitle(line)
		} else if counter == 1 {
			hl.location, hl.timestamp = parseLocDatetime(line)
		} else {
			if line == separator {
				highlights = append(highlights, hl)
				hl = HighlightData{}
				counter = 0
				continue
			}
			hl.text = hl.text + line
		}
		counter++
	}

	file.Close()

	for _, highlight := range highlights {
		saveHighlight(highlight, os.Args[1])
	}
}