The technical goal of their project was clear: achieve accurate transcription of menu photos into structured menu data while keeping latency and cost low enough for production at scale.
Thanks to the DoorDash team for sharing their experience with the community, and thank you for expanding on it for us. Two questions came to mind after reading:
1- How much does the ratio of human supervision decrease once the pipeline is in production? And what has been the business outcome of developing this pipeline?
2- I wondered whether the two models could be connected in tandem (i.e. Model 2 -> Model 1) to mitigate each other's errors, rather than working in parallel. Would this improve the overall number of transcriptions validated by the guardrail model?
An API call to a Vision Language Model would do the task.
Am I missing something?
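To make the tandem question concrete, here is a minimal sketch of one reading of it: run each transcription through the guardrail check and retry the transcription model only on rejected results, instead of running the two models independently in parallel. All function names below are hypothetical stand-ins, not DoorDash's actual APIs.

```python
def transcribe(image: str) -> str:
    """Hypothetical stand-in for the transcription model
    (e.g. an API call to a Vision Language Model)."""
    return f"structured menu data from {image}"


def guardrail_ok(transcription: str) -> bool:
    """Hypothetical stand-in for the guardrail model's validation."""
    return "menu" in transcription


def tandem_pipeline(images: list[str]) -> list[str]:
    """Chain the models (guardrail feeding back into transcription):
    only transcriptions that fail validation get a second pass."""
    validated = []
    for image in images:
        text = transcribe(image)
        if not guardrail_ok(text):
            # Retry once; in practice the retry could be conditioned
            # on the guardrail's rejection reason.
            text = transcribe(image)
        if guardrail_ok(text):
            validated.append(text)
    return validated


print(tandem_pipeline(["photo1.jpg", "photo2.jpg"]))
```

The trade-off, of course, is latency: the tandem retry adds a sequential round trip for rejected items, whereas the parallel setup keeps latency flat.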
Thank you!