Context (Kanji)	Context (POJ)
去日本食壽司	khì Ji̍t-pún tsia̍h sú-sih
香港、澳門...、臺灣佮日本	Hiong-káng, Ò-mn̂g...Tâi-uân kah Ji̍t-pún
的時，日本義工共臺灣人	ê sî, Ji̍t-pún gī-kang kā Tâi-uân-lâng

Kanji	POJ
紲落	suà-lo̍h
來看	lâi-khuànn
新竹市	Sin-tik-tshī
明仔載	bîn-á-tsài
二十六	gī-tsap-lak
號	hō
的	ê
天氣	thinn-khì


kanji_array:	`["做工課", "的", "Lín--sàng", "。"]`
roman_array:	`["tsò-khang-khuè", "ê", "Lín--sàng", "."]`
roman_kanji_array:	`[["tsò-khang-khuè", "做工課"], ["ê", "的"], ["Lín--sàng", "Lín--sàng"], [".", "。"]]`
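
The relationship between the three arrays above can be sketched in plain Ruby. This is a minimal illustration, assuming both token arrays have already been split into the same number of tokens; `align_tokens` is a hypothetical helper name and is not taken from the talk's code.

```ruby
# Hypothetical helper: pair each romanized token with the kanji token
# at the same index. Assumes the earlier phases produced two arrays
# with the same number of tokens.
def align_tokens(roman_array, kanji_array)
  unless roman_array.size == kanji_array.size
    raise ArgumentError,
          "token count mismatch: #{roman_array.size} vs #{kanji_array.size}"
  end

  roman_array.zip(kanji_array)
end

kanji_array = ["做工課", "的", "Lín--sàng", "。"]
roman_array = ["tsò-khang-khuè", "ê", "Lín--sàng", "."]

roman_kanji_array = align_tokens(roman_array, kanji_array)
roman_kanji_array.each { |roman, kanji| puts "#{roman}\t#{kanji}" }
```

Raising on a length mismatch is one possible design choice: it surfaces a segmentation problem early instead of silently pairing a token with `nil`.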

RubyWorld Conference 2025

Parslet DSL Fundamentals

Basic Syntax

`rule()` - Define Rules

rule(:letter) { match['a-zA-Z'] }
rule(:digit) { match['0-9'] }

Meaning: Define reusable parser rules

`match[]` - Character Classes

match['a-z']           # a-z
match['a-zA-Z0-9']     # alphanumeric
match['\u0300-\u036F'] # tone marks

Meaning: Matches one character from the class, same as regex [...]
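
To see what the tone-mark range covers, here is the same character class written as a plain Ruby Regexp. This is a sketch under one assumption: the text is decomposed (NFD) before matching, so that a precomposed letter such as "ò" becomes a base letter plus a combining mark. Whether the talk's normalization phase does this is not stated; `tone_marks` is a hypothetical helper.

```ruby
# The same range as match['\u0300-\u036F'], written as a plain Regexp.
COMBINING_MARK = /[\u0300-\u036F]/

# Hypothetical helper: return the combining marks found in a syllable.
# NFD normalization splits precomposed letters such as "ò" into a base
# letter plus a combining mark, so the range can see the tone mark.
def tone_marks(syllable)
  syllable.unicode_normalize(:nfd).scan(COMBINING_MARK)
end

p tone_marks("tsò")   # one mark: U+0300 (combining grave accent)
p tone_marks("sih")   # no marks
```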

`str()` - String Matching

str('-')      # hyphen
str('--')     # double hyphen
str(' - ')    # space-hyphen-space

Meaning: Exact string match

Combinations

`>>` - Sequence

# A followed by B
rule(:word) { letter >> letter }

Meaning: Ordered concatenation (AND)

`|` - Alternative

# A or B (order matters!)
rule(:token) do
  double_hyphen_word |  # try first
  hyphenated_word       # try later
end

Important: PEG ordered choice takes the first alternative that matches, so the more specific pattern must come first

`.repeat` - Repetition

match['a-z'].repeat      # 0 or more
match['a-z'].repeat(1)   # 1 or more

AST Construction

`.as(:symbol)` - Naming

# Give token a type
rule(:word) {
  letter.repeat(1).as(:word)
}

# Output AST (the value is a Parslet::Slice; shown here as a plain string)
{ word: "hello" }

Meaning: Name for AST identification

`root()` - Start Rule

# Specify parser entry point
# (token and space? are rules defined elsewhere in the grammar)
rule(:sentence) {
  token >> space?
}
root(:sentence)

Meaning: Which rule to start parsing from

Parsing Taiwanese Like Code

3-Phase Analysis of POJ Romanization with Ruby

RubyWorld Conference 2025

Shimane Prefectural Industrial Trade Hall "Kunibiki Messe" Nov. 7, 2025

Self Introduction

10-Year Story with RubyCity Matsue

About 5xRuby

5xRuby's Business

1. Contract Development Services

2. SOSI Product

Agenda

Today's Content

Slide Materials

https://rwc2025.ryudo.tw

The Story of No Bidders

Peculiarities of Taiwan Government Projects

Technical Constraints

Process Issues

Hidden Costs

Lessons from 8 Consecutive Losses

Reasons for Losses (Non-Technical)

9th Time: Surprising Turn of Events

Truth After Winning the Bid

What is POJ (Taiwanese Romanization)?

What is Tâi-lô (Taiwanese Romanization)?

Romanization for Taiwanese Language

Not Mandarin Chinese

Writing Systems: Japanese vs Taiwanese

Japanese System

Taiwanese System

Real Example of Word Segmentation Alignment

Word Segmentation Alignment Implementation

Implementation Overview: 3 Phases

Phase 1: Normalization (WASH)

washed_kanji - Kanji Side

washed_roman - POJ Side

Phase 2-1: splitted_kanji - Kanji Splitting

Implementation Code

Processing Explanation

Example

Phase 2-2: splitted_roman - POJ Splitting

Implementation Code

Processing Explanation

Example

Phase 3: Alignment and Validation

Processing Explanation

Encounter with Parser

Insights from Kaneko-san's Talk

Conference Driven Development

Discovering the Parslet Gem

Why Parslet?

Parslet's Design Philosophy

Parslet DSL Fundamentals

Basic Syntax

rule() - Define Rules

match[] - Character Classes

str() - String Matching

Combinations

>> - Sequence

| - Alternative

.repeat - Repetition

AST Construction

.as(:symbol) - Naming

root() - Start Rule

Regexp → Parslet Conversion (From GSUB Patterns to Parser Rules)

Punctuation Handling

GSUB Approach

Parslet Approach

Regexp → Parslet Conversion (Hyphen Handling and Syllable-based Kanji Matching)

Preserving Hyphens (Page 17)

GSUB Approach

Parslet Approach

Comparison with Ruby Parser

Ruby Parser (Prism)

Taiwanese Parser (RomanParserPure)

Kanji Processing Depends on POJ Parser

POJ Parser (Complex)

Kanji Processing (Simple)

Try RomanParserPure Implementation

Project Results
