Genomics & Informatics: Organizing an in-class hackathon to correct PDF-to-text conversion errors of Genomics & Informatics 1.0

Greetings

⚠️ This is an old website ⚠️

Please visit my new website at
https://www.dgp.toronto.edu/~suhyeon423/

I'm a 4th-year Ph.D. candidate in Computer Science at the University of Toronto, working with Prof. Khai Truong in the DGP lab.

My research lies at the intersection of human–AI interaction, accessibility, and creativity support, with a special focus on improving music accessibility for d/Deaf and hard-of-hearing (DHH) individuals. My projects include song signing (CHI ’23) to explore how music is experienced and expressed within Deaf culture, and ELMI (CHI ’25), an LLM-supported English lyrics to ASL gloss translation system.

I completed my B.Sc. in Computer Science and Engineering at Ewha Womans University, where I was advised by Prof. Uran Oh (Human-Computer Interaction Lab) and Prof. Hyokyung Bahn (Distributed Computing and Operating Systems Lab).

Additionally, I worked as a research intern at Samsung AI Center Toronto, where I was mentored by Iqbal Mohomed, and at NAVER AI LAB under the supervision of Young-Ho Kim. Most recently, I interned at Adobe Research in the STORIE Lab, supervised by Anh Truong and Justin Salamon.

💼 I am currently exploring academic (Assistant Professor) and industry (Research Scientist) positions starting in Fall 2026.

Google Scholar | LinkedIn | suhyeon.yoo[at]mail.utoronto.ca

October 10, 2020

Genomics & Informatics: Organizing an in-class hackathon to correct PDF-to-text conversion errors of Genomics & Informatics 1.0

https://genominfo.org/journal/view.php?number=620

This paper describes a community effort to improve earlier versions of the full-text corpus of Genomics & Informatics by semi-automatically detecting and correcting PDF-to-text conversion errors and optical character recognition errors during the first hackathon of the Genomics & Informatics Annotation Hackathon (GIAH) event. Extracting text from multi-column biomedical documents such as Genomics & Informatics is notoriously difficult. The hackathon was piloted as part of a coding competition of the ELTEC College of Engineering at Ewha Womans University to enable researchers and students to create or annotate their own versions of the Genomics & Informatics corpus, to gain and create knowledge about corpus linguistics, and to acquire tangible and transferable skills along the way. The projects proposed during the hackathon draw on an internal database containing different versions of the corpus and annotations.

Conclusion

In this paper, we listed issues associated with upgrading the G&I corpus and discussed methodological strategies for developing the next version of the corpus based on a semi-automatic approach. Beyond manual corrections, pattern matching techniques and machine learning methods produced noteworthy results and greatly improved the error correction rate.
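To give a feel for what pattern matching can do against PDF-to-text errors, here is a minimal Python sketch. It is not the code used in the hackathon. The two error classes it handles (typographic ligatures and words hyphenated across a line break) are my own assumptions about typical conversion errors, and the names `LIGATURES`, `HYPHEN_BREAK`, and `correct_text` are hypothetical.

```python
import re

# Assumed ligature map: the paper does not list which characters were affected.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# A lowercase fragment, a hyphen, a line break, then a lowercase continuation.
HYPHEN_BREAK = re.compile(r"([a-z]+)-\n([a-z]+)")


def correct_text(text):
    """Return (corrected_text, log); log holds (before, after, count) tuples."""
    log = []

    # Replace each ligature character with its plain-letter expansion.
    for ligature, plain in LIGATURES.items():
        count = text.count(ligature)
        if count:
            log.append((ligature, plain, count))
            text = text.replace(ligature, plain)

    # Join words that were split across a line break, recording each join.
    def join(match):
        joined = match.group(1) + match.group(2)
        log.append((match.group(0), joined, 1))
        return joined

    text = HYPHEN_BREAK.sub(join, text)
    return text, log


if __name__ == "__main__":
    sample = "The identi\ufb01cation of gene expres-\nsion patterns"
    fixed, changes = correct_text(sample)
    print(fixed)
    for before, after, count in changes:
        print(repr(before), "->", repr(after), "x", count)
```

The de-hyphenation rule will also join genuine compounds that happen to break at the hyphen (for example "well-" followed by "known"), so the change log is there for a human to review each proposal rather than trusting it blindly.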

This is a progress report, and the current debate regarding our post-processing procedures focuses on how to ensure the quality of this semi-automatically modified corpus. It is taken as axiomatic that any correction must be confirmed by at least two, and usually more, people acting independently, so that their modification decisions can be compared. We suggest that a couple more rounds of the GIAH hackathon be organized to construct the future G&I 2.0 corpus. A semi-automatic method should be designed to build and improve the corpus, with a diminishing amount of manual checking.
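The rule that a correction needs independent confirmation can be sketched in a few lines of Python. This is an illustration under my own assumptions, not the paper's procedure: the data layout (one dictionary of proposals per reviewer), the function name `reconcile`, and the strict rule that all reviewers of an item must agree are hypothetical choices.

```python
from collections import Counter


def reconcile(decisions, min_confirmations=2):
    """Compare independent reviewers' proposals.

    decisions: {reviewer_name: {item_id: proposed_text}}
    Returns (accepted, disputed). An item is accepted only when every
    reviewer who looked at it proposed the same text and at least
    min_confirmations reviewers did so (an assumed rule).
    """
    # Collect all proposals per item across reviewers.
    proposals = {}
    for items in decisions.values():
        for item_id, text in items.items():
            proposals.setdefault(item_id, []).append(text)

    accepted, disputed = {}, {}
    for item_id, texts in proposals.items():
        counts = Counter(texts)
        top_text, top_votes = counts.most_common(1)[0]
        if len(counts) == 1 and top_votes >= min_confirmations:
            accepted[item_id] = top_text
        else:
            # Conflicting or under-reviewed items go back for discussion.
            disputed[item_id] = dict(counts)
    return accepted, disputed


if __name__ == "__main__":
    decisions = {
        "reviewer_a": {"s1": "expression", "s2": "gene"},
        "reviewer_b": {"s1": "expression", "s2": "genes"},
        "reviewer_c": {"s3": "protein"},
    }
    accepted, disputed = reconcile(decisions)
    print("accepted:", accepted)
    print("disputed:", disputed)
```

Items with conflicting proposals or too few reviewers land in `disputed`, which is where the remaining manual checking would concentrate.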

Suhyeon Yoo
