Parsing Taiwanese Like Code

3-Phase Analysis of POJ Romanization with Ruby

Mu-Fan Teng

RubyWorld Conference 2025

Shimane Prefectural Industrial Trade Hall "Kunibiki Messe" Nov. 7, 2025

Self Introduction

Mu-Fan Teng

  • Known as Ryudo Awaru (竜堂 終) in Japan
  • Founder of 5xRuby CO., LTD
  • Ruby Evangelist in Taiwan
  • Chief Organizer of RubyConf Taiwan
  • Third time speaking at RubyWorld (2015, 2023, 2025)

10-Year Story with RubyCity Matsue

2015: 🌸 First Meeting
  • First time speaking at RWC
  • Met RubyCity Matsue

2023: 💝 Mutual Exchange
  • Mutual exchange with RubyCity
  • Meeting with mayor at city hall
  • Returned to RWC stage

2024: 🤝 Deepening Bond
  • Mayor Kamimori visited 5xRuby
  • Strengthening ties with RubyCity

2025: 💍 Official Partnership
  • MOU signed at RubyConf Taiwan × COSCUP 2025
  • Formalized bond with RubyCity

About 5xRuby

"Creating beloved products with beloved technology"

  • Founded: 2014 (Taipei)
  • Expertise: Software development centered on Ruby/Rails
  • Track Record: Specializing in startup systems, also handling government collaborations


5xRuby's Business

1. Contract Development Services

  • Taiwan's largest Ruby development company (Founded 2014)
  • Infrastructure operations for both cloud and on-premise
  • International expansion including Japan, US, and Singapore
  • Long-term partnerships from startups to listed companies
  • https://5xruby.com/en

2. SOSI Product

  • Secure remote access management system
  • Bastion server functionality
  • Browser-based VDI solution
  • https://www.sosi.com.tw

Agenda

Today's Content

  1. The Story of No Bidders

    • Why did nobody dare to bid?
  2. What is POJ (Taiwanese Romanization)?

    • Romanization system for Taiwanese
  3. Word Segmentation Alignment Implementation

    • Implementation using GSUB
  4. Encounter with Parser

    • Re-implementation with Parslet
  5. Project Results

    • Ruby's strengths and achievements
  6. Conclusion

    • Summary

Slide Materials

https://rwc2025.ryudo.tw

The Story of No Bidders

Why did nobody dare to bid?


Peculiarities of Taiwan Government Projects

Technical Constraints

  • Dependency on Microsoft products
  • .NET/MS-SQL/Windows Server
  • Ruby/Rails tends to lose bids

Process Issues

  • Inadequate RFP (Request for Proposal)
  • Lack of expertise from project managers
  • Gap between RFP and actual needs

Hidden Costs

  • Extensive documentation requirements
  • Security audits and vulnerability assessments
  • Mandatory on-site operational support

Lessons from 8 Consecutive Losses

Reasons for Losses (Non-Technical)

  • Requirements based on Microsoft products
  • "Compatibility requirements" with existing systems
  • Opaque evaluation criteria
  • The deciding factor was technology stack constraints, not price competition

9th Time: Surprising Turn of Events

  • Competitors: Zero
  • "Why is nobody bidding?"
  • Even the project manager was confused: "Are you sure about this?"
  • What happened?

Truth After Winning the Bid

"Word segmentation is too complex,
nobody dared to touch it"

What is POJ (Taiwanese Romanization)?

Understanding through similarities with Japanese


What is Tâi-lô (Taiwanese Romanization)?

Context (Kanji)             Context (POJ)
日本食壽司                  khì Ji̍t-pún tsia̍h sú-sih
香港、澳門...、臺灣佮日本   Hiong-káng, Ò-mn̂g...Tâi-uân kah Ji̍t-pún
的時,日本義工共臺灣人      ê sî, Ji̍t-pún gī-kang kā Tâi-uân-lâng

Romanization for Taiwanese Language

  • Official Name: Taiwan Minnanyu Luomazi Pinyin Fang'an
  • Abbreviation: Tâi-lô
  • Established: October 2006, promulgated by Taiwan's Ministry of Education
  • Status: Official writing system for Taiwanese

Not Mandarin Chinese

  • Taiwanese: Minnan language family
  • Features:
    • 9 tones
    • Unique consonant and vowel systems
    • Nasalization notation
  • History: Developed based on POJ (Pe̍h-ōe-jī) incorporating IPA (International Phonetic Alphabet) elements

Writing Systems: Japanese vs Taiwanese

Japanese System

Kanji → Hiragana

Correspondence:

  • One group of words → One group of kana

Examples:

生活    → せいかつ
新幹線  → しんかんせん
東京駅  → とうきょうえき

Taiwanese System

Kanji → POJ

Correspondence:

  • One group of words → One group of POJ

Examples:

紲落      → suà-lo̍h
新竹市    → Sin-tik-tshī
明仔載    → bîn-á-tsài

Common Point: One group of Kanji ↔ One group of phonetic symbols

→ That's why "Word Segmentation Alignment" is necessary!


Real Example of Word Segmentation Alignment

Input Data (Before Segmentation):

  • Kanji: 紲落來看新竹市明仔載二十六號的天氣
  • POJ: suà-lo̍h lâi-khuànn Sin-tik-tshī bîn-á-tsài gī-tsap-lak hō ê thinn-khì

Expected Output (After Word Segmentation Alignment):

Kanji     POJ
紲落      suà-lo̍h
來看      lâi-khuànn
新竹市    Sin-tik-tshī
明仔載    bîn-á-tsài
二十六    gī-tsap-lak
號        hō
的        ê
天氣      thinn-khì

Word Segmentation Alignment Implementation

3-Phase Processing Flow

Implementation Overview: 3 Phases


Phase 1: Normalization (WASH)

washed_kanji - Kanji Side

def washed_kanji
  KANJI_GSUB_PATTERNS.reduce(kanji) do |ks, (mt, kp)|
    ks.gsub(mt, kp)
  end
end

Processing:

  • Uses dozens of regex patterns for string processing
  • Insert spaces around punctuation
  • Separate periods, commas, parentheses, etc.

Example:

Input:  做工課的Lín--sàng。
Output: 做工課的Lín--sàng 。
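
A minimal sketch of the reduce-over-patterns idea, assuming a two-entry table. SAMPLE_KANJI_PATTERNS and wash are hypothetical names for illustration; the real KANJI_GSUB_PATTERNS table is not shown in these slides.

```ruby
# Hypothetical subset of a washing table (insertion order is the application order).
SAMPLE_KANJI_PATTERNS = {
  /([。,、!?])/ => ' \1 ', # pad full-width punctuation with spaces
  /\s+/ => ' '              # collapse runs of whitespace
}.freeze

# Apply every pattern in turn, feeding each result into the next gsub.
def wash(text, patterns)
  patterns.reduce(text) { |acc, (pattern, replacement)| acc.gsub(pattern, replacement) }.strip
end

puts wash('做工課的Lín--sàng。', SAMPLE_KANJI_PATTERNS)
# prints: 做工課的Lín--sàng 。
```

Because each gsub sees the output of the previous one, the order of the table entries is part of the behavior.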

washed_roman - POJ Side

def washed_roman
  ROMAN_GSUB_PATTERNS.reduce(roman) do |rs, (mt, rp)|
    rs.gsub(mt, rp)
  end
end

Processing:

  • Normalize punctuation with 65+ patterns
  • ✅ Hyphens are preserved - syllable separators
  • ✅ Double hyphens (--) also preserved - inter-word pause

Example:

Input:  tsò-khang-khuè ê Lín--sàng.
Output: tsò-khang-khuè ê Lín--sàng .

Phase 2-1: splitted_kanji - Kanji Splitting

Implementation Code

def splitted_kanji
  combine_one_word(
    washed_kanji.scan(RXP_SPK).map do |spka|
      spka.split(/\s/)
    end.flatten.join(' ')
  ).split
end

# RXP_SPK - Identifies CJK and non-CJK characters
RXP_SPK = /[\p{Han}\p{Katakana}\p{Hiragana}
  \p{Hangul}\u3000-\u303F\uFF00-\uFFEF]|
  [^\p{Han}\p{Katakana}\p{Hiragana}
  \p{Hangul}\u3000-\u303F\uFF00-\uFFEF]+/x

# combine_one_word - Special combination handling
def combine_one_word(text)
  ONE_KANJI_WORDS.reduce(text) do |ks, (mt, kp)|
    ks.gsub(mt, kp)
  end
end

Processing Explanation

  1. Character scan with RXP_SPK

    • CJK characters (Kanji, Hiragana, etc.)
    • Non-CJK characters (POJ, numbers, etc.)
    • Character by character or grouped non-CJK sequences
  2. Special handling with combine_one_word

    • Apply ONE_KANJI_WORDS patterns
    • Combine specific symbol sequences
  3. Split by spaces

  4. Edge Case handling:

    • Lín--sàng is recognized as a single token

Example

Input: 做工課的Lín--sàng 。
Output:
["做", "工", "課", "的", "Lín--sàng", "。"]
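
A simplified, runnable sketch of the scan step, assuming only Han characters and the two CJK punctuation ranges matter. TOKEN_RXP and kanji_tokens are hypothetical names; the combine_one_word step is left out.

```ruby
# One CJK character per token, or one run of non-CJK, non-space characters.
TOKEN_RXP = /[\p{Han}\u3000-\u303F\uFF00-\uFFEF]|[^\p{Han}\u3000-\u303F\uFF00-\uFFEF\s]+/

def kanji_tokens(text)
  text.scan(TOKEN_RXP)
end

p kanji_tokens('做工課的Lín--sàng 。')
# => ["做", "工", "課", "的", "Lín--sàng", "。"]
```

The embedded romanized name stays in one piece because the second alternative consumes the whole non-CJK run.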


Phase 2-2: splitted_roman - POJ Splitting

Implementation Code

def splitted_roman
  washed_roman
    .split(/\s/)
    .compact_blank
end

Simple! Just 3 lines

Processing Explanation

  1. Simple: Split by spaces

    • Phase 1 pre-processing makes Phase 2 simple
    • Can split by spaces alone
  2. Important design:

    • ✅ Hyphens are syllable boundary markers
    • ✅ Don't split by hyphens
    • ✅ Preserve double hyphens (--)
  3. compact_blank (from Active Support) removes empty elements

Example

Input: tsò-khang-khuè ê Lín--sàng .
Output:
["tsò-khang-khuè", "ê", "Lín--sàng", "."]

Syllable counts:

  • tsò-khang-khuè = 3 syllables
  • Lín--sàng = 2 syllables (-- not counted in syllables)
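
A small sketch of the counting rule, assuming syllables are whatever remains after splitting on runs of hyphens. syllable_count is a hypothetical helper. Note that a plain split('-') would report 3 for Lín--sàng, because the double hyphen leaves an empty string in the middle.

```ruby
# Count syllables, treating "-" and "--" alike as separators.
def syllable_count(word)
  word.split(/-+/).reject(&:empty?).size
end

p 'Lín--sàng'.split('-').size      # => 3 (empty string between the two hyphens)
p syllable_count('Lín--sàng')      # => 2
p syllable_count('tsò-khang-khuè') # => 3
```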

Phase 3: Alignment and Validation

def roman_kanji_array
  spk = splitted_kanji.dup
  splitted_roman.map do |rword|
    if rword == '--' || (SP_MIRRORS.key?(rword) &&
        #... Edge Case handling
        [rword, spss]
      end
    end
  end
end

def set_arrays
  rka = roman_kanji_array.transpose
  assign_attributes(
    roman_array: rka[0],
    kanji_array: rka[1]
  )
  self.arrays_balanced = [
    roman_array.size.positive?,
    roman_array.size == kanji_array.size,
    kanji_array.join.size ==
      washed_kanji.delete(' ').size
  ].all?
end

Processing Explanation

  1. Syllable-based matching

    • Hyphen = syllable separator
    • tsò-khang-khuè (3 syllables) → 3 Kanji characters
  2. Edge Case handling:

    • When romanized text appears on the Kanji side, keep it as-is
    • Double hyphen (--) not counted in syllables
  3. Balance validation (3 conditions)

    • ✅ Array is not empty
    • ✅ Roman and Kanji element counts match
    • ✅ Total Kanji character count matches original text

Example

kanji_array: ["做工課", "的", "Lín--sàng", "。"]
roman_array: ["tsò-khang-khuè", "ê", "Lín--sàng", "."]
roman_kanji_array: [["tsò-khang-khuè", "做工課"], ["ê", "的"], ["Lín--sàng", "Lín--sàng"], [".", "。"]]
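
Since the edge-case branch of roman_kanji_array is elided above, here is a standalone sketch of the alignment idea only, not the production method. align and syllable_count are hypothetical names, and two assumptions are made: every romanized token consumes at least one Kanji token, and a romanized token that appears verbatim on the Kanji side is paired with itself.

```ruby
# Syllables are the non-empty pieces between hyphen runs; never fewer than one.
def syllable_count(word)
  [word.split(/-+/).reject(&:empty?).size, 1].max
end

# Pair each romanized token with as many Kanji tokens as it has syllables.
def align(roman_tokens, kanji_tokens)
  queue = kanji_tokens.dup
  pairs = roman_tokens.map do |rword|
    kanji =
      if queue.first == rword
        queue.shift # romanized text mirrored on the Kanji side
      else
        queue.shift(syllable_count(rword)).join
      end
    [rword, kanji]
  end
  # Balanced when every Kanji token was consumed and no pair came up empty.
  { pairs: pairs, balanced: queue.empty? && pairs.none? { |_, k| k.empty? } }
end

result = align(%w[tsò-khang-khuè ê Lín--sàng .], %w[做 工 課 的 Lín--sàng 。])
p result[:pairs]
# => [["tsò-khang-khuè", "做工課"], ["ê", "的"], ["Lín--sàng", "Lín--sàng"], [".", "。"]]
p result[:balanced] # => true
```

Leftover Kanji tokens, or a romanized token with nothing left to pair with, make the result unbalanced, which is the kind of mismatch the three validation conditions are meant to catch.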

Encounter with Parser

From 2024 Implementation to 2025 Insight


Insights from Kaneko-san's Talk

"Understanding Ruby Grammar Through Conflicts"

Parser's 3-Phase Processing

  1. Lexical Analysis
  2. Syntax Analysis
  3. Semantic Analysis

💡 "What I was doing in this project was... a Parser!"
→ "Let's re-implement it as a Parser"

Conference Driven Development

Implementing word segmentation alignment using a parser approach


Discovering the Parslet gem

A DSL library for writing Parsers in Ruby

Why Parslet?

  • PEG Parser: Parsing Expression Grammar
  • Ruby DSL: Define Parser using Ruby syntax
  • Clear Structure: Naturally implements 3-phase design
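
# Parslet basics
require 'parslet'

class MyParser < Parslet::Parser
  # Phase 1: token-level rules
  rule(:word)  { match['a-z'].repeat(1) }
  rule(:space) { match('\s').repeat(1) }

  # Phase 2: combine rules into a grammar
  rule(:sentence) { (word >> space.maybe).repeat(1) }

  root(:sentence)
end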

Parslet's Design Philosophy

Parslet makes developers aware of 3 Phases:

Phase 1: Lexical Analysis

  • Define token types with rule()
  • Character patterns with match[], str()

Phase 2: Syntax Analysis

  • Combine rules with >>, |
  • Automatically construct AST

Phase 3: Semantic Analysis

  • Transform with Transform class
  • AST → Final data structure

Parslet DSL Fundamentals

Basic Syntax

rule() - Define Rules

rule(:letter) { match['a-zA-Z'] }
rule(:digit) { match['0-9'] }

Meaning: Define reusable parser rules

match[] - Character Classes

match['a-z']           # a-z
match['a-zA-Z0-9']     # alphanumeric
match['\u0300-\u036F'] # tone marks

Meaning: Same as regex [...]

str() - String Matching

str('-')      # hyphen
str('--')     # double hyphen
str(' - ')    # space-hyphen-space

Meaning: Exact string match

Combinations

>> - Sequence

# A followed by B
rule(:word) { letter >> letter }

Meaning: Ordered concatenation (AND)

| - Alternative

# A or B (order matters!)
rule(:token) do
  double_hyphen_word |  # try first
  hyphenated_word       # try later
end

Important: PEG choice is ordered; the first alternative that matches wins
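
To see why the order matters, here is a sketch of ordered choice using plain Ruby regexps rather than Parslet. ordered_choice and the two patterns are hypothetical stand-ins for the double_hyphen_word and hyphenated_word rules.

```ruby
# Stand-ins for the two rules; both are anchored at the start of the input.
HYPHENATED    = /\A[\p{L}\p{M}]+(?:-[\p{L}\p{M}]+)*/
DOUBLE_HYPHEN = /\A[\p{L}\p{M}]+--[\p{L}\p{M}]+/

# Try each rule in order and commit to the first one that matches.
def ordered_choice(input, rules)
  rules.each do |name, pattern|
    m = pattern.match(input)
    return [name, m[0]] if m
  end
  nil
end

p ordered_choice('Lín--sàng', [[:hyphenated_word, HYPHENATED], [:double_hyphen_word, DOUBLE_HYPHEN]])
# => [:hyphenated_word, "Lín"]  (stops before the double hyphen)
p ordered_choice('Lín--sàng', [[:double_hyphen_word, DOUBLE_HYPHEN], [:hyphenated_word, HYPHENATED]])
# => [:double_hyphen_word, "Lín--sàng"]
```

With the general rule first, the choice succeeds early on a shorter prefix and the more specific rule is never tried, so the specific alternative has to come first.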

.repeat - Repetition

match['a-z'].repeat      # 0 or more
match['a-z'].repeat(1)   # 1 or more

AST Construction

.as(:symbol) - Naming

# Give token a type
rule(:word) {
  letter.repeat(1).as(:word)
}

# Output AST
{ word: "hello" }

Meaning: Name for AST identification

root() - Start Rule

# Specify parser entry point
rule(:sentence) {
  token >> space?
}
root(:sentence)

Meaning: Which rule to start parsing from


Regexp → Parslet Conversion (From GSUB Patterns to Parser Rules)

Punctuation Handling

GSUB Approach

# Part of 65+ patterns
ROMAN_GSUB_PATTERNS = {
  /,/ => ' , ',      # spaces around comma
  /\./ => ' . ',     # spaces around period
  /!/ => ' ! ',      # spaces around exclamation
  /\?/ => ' ? ',     # spaces around question mark
  # ... 60+ more patterns
}

# Application
text = "suà-lo̍h,lâi-khuànn"
ROMAN_GSUB_PATTERNS.each do |pattern, replacement|
  text = text.gsub(pattern, replacement)
end
# => "suà-lo̍h , lâi-khuànn"

Feature: Surround symbols with spaces → split later

Parslet Approach

# Directly recognize punctuation as tokens
rule(:punctuation) do
  str('...') | str('⋯⋯') | str('……') |  # multi-char first
  match[',.:;()!??!/~、─…⋯'] |         # single chars
  match["\"'\u201C\u201D\u2018\u2019"] |  # quotes
  match['\u3000-\u303F']                  # CJK symbols
end

# Token rules
rule(:token) do
  hyphenated_word.as(:word) |
  punctuation.as(:punct)
end

Input: "suà-lo̍h,lâi-khuànn"

Output (AST):

[
  { word: "suà-lo̍h" },
  { punct: "," },
  { word: "lâi-khuànn" }
]

Feature: Structured as tokens → no split needed
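
The same token-first idea can be tried without installing Parslet. The sketch below swaps in StringScanner from Ruby's standard library; WORD, PUNCT, and tokenize are hypothetical names, and the punctuation set is reduced to a handful of ASCII marks.

```ruby
require 'strscan'

WORD  = /[\p{L}\p{M}]+(?:--?[\p{L}\p{M}]+)*/ # syllables joined by - or --
PUNCT = /\.\.\.|[,.:;()!?]/                   # multi-character mark first

# Emit typed tokens directly instead of padding with spaces and splitting later.
def tokenize(text)
  scanner = StringScanner.new(text)
  tokens = []
  until scanner.eos?
    if scanner.skip(/\s+/)
      next
    elsif (word = scanner.scan(WORD))
      tokens << { word: word }
    elsif (punct = scanner.scan(PUNCT))
      tokens << { punct: punct }
    else
      raise ArgumentError, "unexpected input at byte #{scanner.pos}"
    end
  end
  tokens
end

p tokenize('suà-lo̍h,lâi-khuànn').map(&:keys).flatten
# => [:word, :punct, :word]
```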


Regexp → Parslet Conversion (Hyphen Handling and Syllable-based Kanji Matching)

Preserving Hyphens

GSUB Approach

# Step 1: Normalize punctuation
text = "suà-lo̍h lâi-khuànn"
# Preserve hyphens (important!)

# Step 2: Split by spaces
tokens = text.split(/\s/)
# => ["suà-lo̍h", "lâi-khuànn"]

# Step 3: Count syllables
syllables = "suà-lo̍h".split('-').size
# => 2

# Step 4: Take kanji characters by syllable count
kanji_chars = ["紲", "落", "來", "看"]
combined = kanji_chars.shift(syllables).join
# => "紲落"

Principle: Hyphen = syllable separator

Parslet Approach

# Recognize hyphenated word as single token
rule(:hyphenated_word) do
  syllable >>
  (single_hyphen >> syllable).repeat
end

# "suà-lo̍h" → { word: "suà-lo̍h" }

Syllable count calculation:

# Phase 3: Transform
rule(word: simple(:w)) do
  syllables = w.to_s.split('-').size
  # => 2
end

Kanji matching:

# Syllable count = Kanji character count
"suà-lo̍h".split('-').size  # => 2
"紲落".chars.size            # => 2
# ✅ Match!

Principle: Parser preserves syllable structure → automatic matching
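
A plain-Ruby sketch of what the Phase 3 transform could compute, assuming Ruby 3.0+ pattern matching and an AST shaped like the { word: ... } / { punct: ... } hashes shown earlier. AlignedToken, syllable_count, and transform are hypothetical names, and giving punctuation a width of one Kanji character is an assumption.

```ruby
AlignedToken = Struct.new(:text, :type, :kanji_chars)

def syllable_count(word)
  word.split(/-+/).reject(&:empty?).size
end

# Turn AST nodes into tokens that know how many Kanji characters they cover.
def transform(ast)
  ast.map do |node|
    case node
    in { word: String => text }
      AlignedToken.new(text, :word, syllable_count(text))
    in { punct: String => text }
      AlignedToken.new(text, :punct, 1) # assumed: one mark pairs with one character
    end
  end
end

ast = [{ word: 'suà-lo̍h' }, { punct: ',' }, { word: 'lâi-khuànn' }]
p transform(ast).map(&:kanji_chars) # => [2, 1, 2]
```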


Comparison with Ruby Parser

Ruby Parser (Prism)

# Input
"def foo(x); x + 1; end"

Phase 1: Lexical

[DEF][IDENTIFIER][LPAREN][IDENTIFIER]
[RPAREN][SEMICOLON][IDENTIFIER][PLUS]
[INTEGER][SEMICOLON][END]

Phase 2: Syntax

DefNode(
  name: :foo,
  parameters: ParametersNode(...),
  body: StatementsNode(...)
)

Phase 3: Semantic

  • Type checking
  • Scope analysis
  • Code generation

Taiwanese Parser (RomanParserPure)

# Input
"suà-lo̍h lâi-khuànn"

Phase 1: Lexical

[suà-lo̍h][lâi-khuànn]

Phase 2: Syntax

{
  sentence: [
    { word: "suà-lo̍h" },
    { word: "lâi-khuànn" }
  ]
}

Phase 3: Semantic

  • AST transformation
  • Array generation
["suà-lo̍h", "lâi-khuànn"]

Note: Experimental implementation (educational purpose)


Kanji Processing Depends on POJ Parser

One-way dependency: Parse complex structure first

POJ Parser (Complex)

# RomanParserPure - Implemented with Parslet
roman_array = [
  "suà-lo̍h",      # 2 syllables
  "lâi-khuànn",    # 2 syllables
  "Sin-tik-tshī"   # 3 syllables
]

# Syllable count calculation
"suà-lo̍h".split('-').size  # => 2
"Sin-tik-tshī".split('-').size  # => 3

Complex processing:

  • ✅ Hyphen semantic analysis (syllable vs inter-word)
  • ✅ Tone mark recognition (Unicode combining characters)
  • ✅ Double hyphen (--) special handling
  • ✅ Grammar rule definition and AST construction
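
Tone marks can arrive either precomposed (à as one code point) or as a base letter plus a combining mark, so it helps to normalize before matching. A minimal sketch, assuming the marks of interest fall in U+0300 to U+036F; tone_marks and strip_tone_marks are hypothetical helpers.

```ruby
COMBINING_MARKS = /[\u0300-\u036F]/

# List the combining marks in a syllable after decomposing it (NFD).
def tone_marks(syllable)
  syllable.unicode_normalize(:nfd).scan(COMBINING_MARKS)
end

# Remove the marks to get the bare letters.
def strip_tone_marks(syllable)
  syllable.unicode_normalize(:nfd).gsub(COMBINING_MARKS, '').unicode_normalize(:nfc)
end

p tone_marks("su\u00E0")          # => ["̀"] (U+0300, combining grave accent)
p strip_tone_marks("lo\u030Dh")   # => "loh"
p strip_tone_marks("khu\u00E0nn") # => "khuann"
```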

Kanji Processing (Simple)

# Just follow POJ syllable counts
kanji = "紲落來看新竹市"

# 1. "suà-lo̍h" = 2 syllables
#    → 2 Kanji chars: "紲落"
# 2. "lâi-khuànn" = 2 syllables
#    → 2 Kanji chars: "來看"
# 3. "Sin-tik-tshī" = 3 syllables
#    → 3 Kanji chars: "新竹市"

kanji_array = ["紲落", "來看", "新竹市"]

Simple processing:

  • ✅ Syllable count = character count correspondence
  • ✅ Pattern matching (for Edge Cases)
  • ✅ No independent syntax parsing needed

Try RomanParserPure Implementation

Published on GitHub: Test data and verification scripts

https://github.com/ryudoawaru/rwc2025-slide

What's included:

  • Complete RomanParserPure implementation
  • WASHING_PATTERNS (65+ rules)
  • 3000 real corpus data records

🧪 Test Results

$ ruby test_parser.rb

================================================================================
Testing RomanParserPure with 3000 records
================================================================================
[██████████████████████████████████████████████████] 100.0% (3000/3000)

================================================================================
Final Results
================================================================================
Total records:    3000
Parse success:    3000 (100.0%)
Parse errors:     0 (0.0%)
================================================================================

🎉 PERFECT! 100% success rate achieved!

Key Points:

  • ✅ 100% parse success rate - All 3000 records parsed without errors
  • ✅ Zero errors, fully functional - Theory meets practice
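
For readers who want to measure a success rate on their own data, here is a hypothetical harness sketch. It is not the repository's test_parser.rb; run_suite and the toy validity check are assumptions made for illustration.

```ruby
# Run the block on every record and collect the ones that raise.
def run_suite(records)
  failures = []
  records.each do |record|
    begin
      yield record
    rescue StandardError => e
      failures << { record: record, error: e.message }
    end
  end
  total = records.size
  success = total - failures.size
  rate = total.zero? ? 0.0 : (success * 100.0 / total).round(1)
  { total: total, success: success, failures: failures, rate: rate }
end

# Toy "parser": accept only letters, combining marks, and hyphens.
report = run_suite(['suà-lo̍h', 'lâi-khuànn', '???']) do |text|
  raise ArgumentError, "cannot parse #{text.inspect}" unless text.match?(/\A[\p{L}\p{M}-]+\z/)
end

puts format('Parse success: %d (%.1f%%)', report[:success], report[:rate])
# prints: Parse success: 2 (66.7%)
```

A harness like this reports whether parsing completed, which is a separate question from whether each alignment is linguistically correct.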

Project Results

Taiwanese Language Education System Built with Ruby


Taiwanese Corpus System

  • Official Name: 臺灣台語語料庫 應用檢索系統 (Taiwanese Language Corpus Application Search System)
  • Public URL: https://tggl.naer.edu.tw
  • Client: Ministry of Education (Taiwan's counterpart to Japan's MEXT) / National Academy for Educational Research


Main Feature 1: Corpus Search

Integrated search system for Kanji, POJ, and audio files

Features:

  • Simultaneous display of Kanji and Tâi-lô (POJ)
  • Audio file playback
  • Context display
  • Advanced search filters

Main Feature 2: Textbook Vocabulary

Database of Taiwanese vocabulary used in Taiwan's textbooks


Main Feature 3: Grammar Point Search

Search for important Taiwanese grammar patterns and example sentences

Conclusion

Universality of Compiler Theory

  • Parser for programming languages → Natural language processing
  • Ruby's 3-phase analysis → Taiwanese word segmentation alignment

With the right tools and understanding of principles, complex problems can be solved

Learn from conferences and challenge new domains

  • Apply existing knowledge to new problems
  • The path of growth as an engineer

Thank You for Your Attention

🎪 Joint Booth Exhibition

Ruby Taiwan Community & RUBYCITY MATSUE
Exhibition Hall, 1st Floor

Please visit us!