Skip to content

Markdown support #274

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 54 commits into
base: develop
Choose a base branch
from
Open

Markdown support #274

wants to merge 54 commits into from

Conversation

Matistjati
Copy link
Contributor

@Matistjati Matistjati commented Aug 17, 2024

supported for both pdf and html:

  • problem name and id are automatically inserted at the top of the statement. Spec that we automatically prepend it from problem.yaml
  • samples
  • interactive samples
  • {{nextsample}} and {{remainingsamples}}
  • commonmark-style images
  • "common" footnotes. extended commonmark syntax
  • "common" tables. extended commonmark syntax
  • inline and display math with $a+b$ and $$a+b$$
  • svgs, both for md and html (svgs are not sanitized)

A PR of the spec can be found here Kattis/problem-package-format#386.

TODO testing:

  • are the dependencies correct on all systems (pandoc + rsvg-convert)?
  • do we catch all images so their source can be sanitized?

not planned to be added in this PR (maybe ever):

  • tex -> html using pandoc (remove plastex)
  • tex -> pdf (we will keep using pdflatex for this)
  • pdf images (they look terrible)
  • webp images (poor support)
  • SVG sanitization (we'll do that on the Kattis side, don't want to add a huge dependency for this)

@Matistjati Matistjati marked this pull request as ready for review March 12, 2025 22:59
@Matistjati
Copy link
Contributor Author

@pehrsoderman want to take a look? I think this PR is ready for a review. Here are the statements i tested it on:

https://github.com/Matistjati/kattis-markdown-examples

In general, most statements don't look perfect right off the bat, mostly because they come from their own markdown dialect. However, I believe that all ugliness can be fixed by small tweaks.

Copy link
Contributor

@gkreitz gkreitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a pass over this. As mentioned on slack, it seems like pandoc makes no security guarantees on html output and recommends using a sanitizer (https://pandoc.org/chunkedhtml-demo/19-a-note-on-security.html). That's gonna get messy, I'm not a huge fan of any of the python options. If anyone happens to look at this PR and has any recommendations, feel free to chime in.

Looking through, I think I spotted a number of places where we need to be more careful with escaping stuff we inject (problem name, sample, and just as an extra precaution, problem id). I'm not quite sure how to do that cleanly, t.b.h. If I understand the code correctly, it seems like when pandoc goes from markdown to pdf, it goes via LaTeX, which means that inserting plain LaTeX in the markdown works. That dual-step translation makes me wonder how to sanely escape stuff like \ or $ when we inject it (or, as an extreme example, the sample of a problem about parsing latex :)).

# Add problem name and id to the top
problem_id = os.path.basename(problem_root)
statement_md = r'\centerline{\large %s}' % f"Problem id: {problem_id}" + statement_md
statement_md = r'\centerline{\huge %s}' % problem_name + statement_md
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to escape LaTeX command characters in strings we inject this way. This seems to break if the name contains characters like }, \, or $.

config = verifyproblem.ProblemConfig(prob)
if not config.check(None):
raise Exception("Invalid problem.yaml")
names = config.get("name")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will break after #286, but you're probably aware of that. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, talked with Rasmus and Hugo to hopefully work something out

temp_file.write(sample)
temp_file.flush()
command = ["pandoc", temp_file.name, "-t" , "markdown"]
return subprocess.run(command, capture_output=True, text=True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit of a corner case, but what happens if the sample contains latex (or, more likely, latex control characters like $)? I'm assuming pandoc won't escape that when converting HTML to markdown, so won't that then later mess up the PDF rendering once you've injected this markdown into the document and are trying to pandoc into a pdf?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Painstakingly, this has been fixed.

@Matistjati
Copy link
Contributor Author

Matistjati commented Apr 7, 2025

TODO:

  • investigate under which conditions Pandoc will make web request more thoroughly.

Known issues (that will either be wontfix or fixed in follow-up pr):

  • (HTML) If you place {{remainingsamples}} in the text, you end up with something like
    image
  • SVGs are currently not supported. Will be fixed in a follow-up PR.
  • Does not support multipass samples
  • if problem id has _, it breaks pdf rendering. wontfix, invalid problem name according to spec
  • samples can overflow in normal tex -> pdf rendering. would be nice to fix, but low priority
    image

Copy link
Contributor

@gkreitz gkreitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took another pass over it, as requested. This is getting close to ready for merging. The main problem I see are a couple of places where we need to escape latex characters (the rendering path of markdown -> latex -> pdf that pandoc does is funky...).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants