we were importing recipes from Joy of Cooking and two chapters showed up completely empty: Stuffings & Casseroles (28 recipes) and Savory Sauces (205 recipes). the parser found the chapter files, found the recipe titles, but extracted zero ingredients and zero instructions.
these weren’t cross-reference chapters or glossary sections. stuffings and sauces are real recipes. 205 sauces can’t all be empty.
the parser logic
the recipe parser identified elements by CSS class:
# yield line (e.g., "4 servings")
p.noindentl
# ingredient container
ul.nonlist
# ingredient items
li.r-item
# instructions
p.noindent / p.indent
this worked perfectly for chapters 1 through 15. every recipe had ingredients, instructions, yield. great.
chapters 16 and 17? the parser found <h3 class="h3rec"> tags (recipe titles) but nothing matching the ingredient or instruction classes. zero content per recipe.
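to make the failure mode concrete, here's a minimal sketch of class-based extraction. it is not the actual parser: `parse_recipe_v1`, `classes_of`, and `text_of` are made-up names, and it assumes a namespace-free fragment parsed with the standard library's `xml.etree.ElementTree` (real EPUB XHTML puts a namespace on every tag). the point is that an element with an unrecognized class is skipped silently, so you get an empty recipe rather than an error.

```python
import xml.etree.ElementTree as ET


def classes_of(el):
    """Return the element's class attribute as a list of class names."""
    return el.get("class", "").split()


def text_of(el):
    """Return all text inside the element, stripped of surrounding whitespace."""
    return "".join(el.itertext()).strip()


def parse_recipe_v1(fragment):
    """Hypothetical v1 logic: only one set of class names is recognized."""
    root = ET.fromstring(fragment)
    recipe = {"yield": None, "ingredients": [], "instructions": []}
    for el in root.iter():
        classes = classes_of(el)
        if el.tag == "p" and "noindentl" in classes:
            recipe["yield"] = text_of(el)
        elif el.tag == "ul" and "nonlist" in classes:
            # Ingredients only count when they sit inside a known container.
            for li in el.iter("li"):
                if "r-item" in classes_of(li):
                    recipe["ingredients"].append(text_of(li))
        elif el.tag == "p" and ("noindent" in classes or "indent" in classes):
            recipe["instructions"].append(text_of(el))
    return recipe


known = (
    '<div><p class="noindentl">4 servings</p>'
    '<ul class="nonlist"><li class="r-item">1 cup flour</li></ul>'
    '<p class="noindent">Preheat the oven.</p></div>'
)
# Same structure, but with placeholder class names the parser has never seen.
unknown = (
    known.replace("noindentl", "x-yield")
    .replace("nonlist", "x-list")
    .replace("noindent", "x-step")
)

print(parse_recipe_v1(known))
print(parse_recipe_v1(unknown))
```

the second call returns a recipe with no yield, no ingredients, and no instructions, which looks exactly like an empty chapter from the outside.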
the discovery
opened a chapter 16 xhtml file and compared it to a chapter 6 file. same general structure. same h3 recipe headers. but:
chapters 1-15:
<p class="noindentl">4 servings</p>
<ul class="nonlist">
<li class="r-item">1 cup flour</li>
</ul>
<p class="noindent">Preheat the oven to 350°F.</p>
chapters 16+:
<p class="r-serve">4 servings</p>
<ul class="ingredients-list">
<li class="r-item">1 cup flour</li>
</ul>
<p class="r-noind">Preheat the oven to 350°F.</p>
completely different class names for the same semantic elements. noindentl became r-serve. nonlist became ingredients-list. noindent became r-noind. the only thing consistent across both schemes was li.r-item for individual ingredient items.
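a quicker way to spot this kind of divergence than eyeballing two files is to inventory the `tag.class` pairs in each and diff the sets. this is a rough sketch, not something from the original parser: `class_inventory` is a made-up helper, and the two fragments are the snippets above with the namespace dropped and the instruction text shortened.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def class_inventory(fragment):
    """Count every tag.class pair that appears in an XHTML fragment."""
    counts = Counter()
    for el in ET.fromstring(fragment).iter():
        for name in el.get("class", "").split():
            counts[f"{el.tag}.{name}"] += 1
    return counts


early = (
    '<div><p class="noindentl">4 servings</p>'
    '<ul class="nonlist"><li class="r-item">1 cup flour</li></ul>'
    '<p class="noindent">Preheat the oven.</p></div>'
)
late = (
    '<div><p class="r-serve">4 servings</p>'
    '<ul class="ingredients-list"><li class="r-item">1 cup flour</li></ul>'
    '<p class="r-noind">Preheat the oven.</p></div>'
)

early_keys = set(class_inventory(early))
late_keys = set(class_inventory(late))
only_early = sorted(early_keys - late_keys)
only_late = sorted(late_keys - early_keys)
shared = sorted(early_keys & late_keys)

print("only early:", only_early)
print("only late: ", only_late)
print("shared:    ", shared)
```

on these two fragments the only shared pair is `li.r-item`, which matches what the side-by-side comparison showed.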
why?
my best guess: the EPUB was produced by converting from a layout tool (InDesign or similar), and different sections of the book were formatted by different people or at different times. the class names are clearly auto-generated from style names — noindentl looks like “no indent, left” while r-serve looks like “recipe, serving.” same visual result, different naming conventions in the source stylesheet.
it’s the kind of thing you’d never notice reading the book. the rendered output looks identical. only a parser trying to match class names cares.
the fix
parser v3 added helper methods that match both naming conventions:
def _is_yield_class(self, classes):
    return 'noindentl' in classes or 'r-serve' in classes

def _is_ingredient_container(self, classes):
    return 'nonlist' in classes or 'ingredients-list' in classes

def _is_instruction_class(self, classes):
    return bool({'noindent', 'indent', 'r-noind'} & set(classes))
with both schemes handled, all 31 chapters parse with full content. 2,591 out of 2,591 recipes with ingredients and instructions.
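if a third naming scheme ever turns up, one alternative shape for the same fix is a single lookup table instead of one helper per role. to be clear, this is not how parser v3 is written: `ROLE_CLASSES`, `role_of`, and `parse_recipe` are hypothetical names, and the sketch assumes namespace-free fragments parsed with `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Every known class name for each semantic role, across both naming schemes.
ROLE_CLASSES = {
    "yield": {"noindentl", "r-serve"},
    "ingredient_container": {"nonlist", "ingredients-list"},
    "instruction": {"noindent", "indent", "r-noind"},
}


def role_of(classes):
    """Return the semantic role for a list of class names, or None if unknown."""
    for role, names in ROLE_CLASSES.items():
        if names & set(classes):
            return role
    return None


def parse_recipe(fragment):
    """Extract one recipe from a namespace-free XHTML fragment."""
    recipe = {"yield": None, "ingredients": [], "instructions": []}
    for el in ET.fromstring(fragment).iter():
        role = role_of(el.get("class", "").split())
        if role == "yield":
            recipe["yield"] = "".join(el.itertext()).strip()
        elif role == "ingredient_container":
            # li.r-item is the one class both schemes agree on.
            recipe["ingredients"] += [
                "".join(li.itertext()).strip()
                for li in el.iter("li")
                if "r-item" in li.get("class", "").split()
            ]
        elif role == "instruction":
            recipe["instructions"].append("".join(el.itertext()).strip())
    return recipe


scheme_a = (
    '<div><p class="noindentl">4 servings</p>'
    '<ul class="nonlist"><li class="r-item">1 cup flour</li></ul>'
    '<p class="noindent">Preheat the oven.</p></div>'
)
scheme_b = (
    '<div><p class="r-serve">4 servings</p>'
    '<ul class="ingredients-list"><li class="r-item">1 cup flour</li></ul>'
    '<p class="r-noind">Preheat the oven.</p></div>'
)

# Both dialects should produce the same recipe.
print(parse_recipe(scheme_a) == parse_recipe(scheme_b))
```

the trade-off is that adding a new class name becomes a one-line change to the table, at the cost of the helpers no longer being individually named.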
the lesson: when your parser works on 85% of the data, don’t assume the other 15% is actually empty. it might just be speaking a different dialect. ≽^•⩊•^≼
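one cheap way to act on that is a post-import check that flags any chapter where titles were found but nothing else was. this is only a sketch: `suspicious_chapters` is a made-up name, the recipe dict shape is an assumption, and the data below is invented purely to show the check.

```python
def suspicious_chapters(chapters):
    """Return names of chapters that have recipe titles but no extracted content."""
    flagged = []
    for name, recipes in chapters.items():
        has_content = any(r["ingredients"] or r["instructions"] for r in recipes)
        if recipes and not has_content:
            flagged.append(name)
    return flagged


# Made-up miniature data, only to show the shape of the check.
chapters = {
    "chapter a": [
        {"title": "recipe 1", "ingredients": ["1 cup flour"], "instructions": ["Preheat the oven."]},
    ],
    "chapter b": [
        {"title": "recipe 2", "ingredients": [], "instructions": []},
        {"title": "recipe 3", "ingredients": [], "instructions": []},
    ],
    "chapter c": [],  # no titles found either, so there is nothing to flag
}

flagged = suspicious_chapters(chapters)
print(flagged)
```

a chapter where every single recipe is empty is much more likely to be a parser blind spot than a real property of the book.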
nyan