
Turning Photos into Flavors — Multimodal Input with Gemini and Flutter’s AI Dart SDK

Sylvia Dieckmann
9 min read · Apr 3, 2024


In my last article in the “WineSnob series”, I described refactoring the WineSnob, my slightly silly AI demo app, to use Flutter’s new AI Dart SDK. I also used the opportunity to switch from the PaLM model to Gemini Pro. However, v2 of the WineSnob still only supported text input.

Today I am tackling v3, where I add support for multimodal input. Rather than entering the name, vintage, and other wine specifications as text, the user should be able to simply upload a picture of the wine bottle. The model will then identify the wine in the picture and generate tasting notes.

My primary goal with this project is to demonstrate the steps needed to integrate multimodal interactions into a Flutter app using the AI Dart SDK. But of course, I am also curious whether extracting wine details from a picture rather than from text input affects the quality of my results. V3 of the WineSnob therefore continues to support text-only interactions in parallel with multimodal prompts to allow for a casual quality evaluation of generated tasting notes.

Previously in this WineSnob series

The changes covered in this post build on v2 of the WineSnob. V2 allowed the user to manually enter a wine description. The user input was dropped into a longer string that added some context to the query, and the completed prompt was sent to the Gemini Pro model using the AI Dart SDK. Once the generated response was returned, it was displayed on the page and the user was offered the opportunity to comment on and save the interaction.

Add Multi-Modal Interactions (“Multi Oracle”) to the WineSnob Flutter App

V3 challenge: With this rewrite, the user should be given a choice between text-only and multimodal input.

Step 1: Clean up prompt management for text-only flow

My initial goal with the WineSnob had been to evaluate different prompt strategies (structured vs. freeform, one-shot vs. few-shot, …) and to get user feedback on the results. Therefore, I initially laid out the project to pull preconfigured prompts out of Firestore and to let the user switch between them. Each prepared template contained a placeholder into which the user input was dropped to make up the final prompt.

As the focus shifted over time, keeping prepared prompt templates in Firestore was adding unnecessary complexity and had to go.

In this rewrite, I am redesigning the page to be clearer about the interaction between user input (“Joostenberg Bakermat 2020”) and prompt template (“Write tasting notes for <Joostenberg Bakermat 2020>. The tasting notes should be in the style of a wine critic…”). The app now offers a default template but allows the user to override the suggestion. At query time, the input is dropped into the template string (called scaffold in some places) to form the final prompt.

When the user decides to save an interaction for later review, both template and input are saved with the result for text-only interactions.

const TEXT_TEMPLATE =
    'Write tasting notes for $INPUT_PLACEHOLDER. The tasting notes should be '
    'in the style of a wine critic and should mention the wine style, taste, '
    'and production process. Keep the result to one paragraph.';

const INPUT_PLACEHOLDER = '\${input}';

class TextQuery extends BaseQuery {
  final String? input;
  final String scaffold;

  const TextQuery({this.input, this.scaffold = TEXT_TEMPLATE});

  // Drop the user input into the template and wrap it for the SDK.
  Content toContent() {
    var finalText = scaffold.replaceAll(INPUT_PLACEHOLDER, input ?? '');
    return Content.text(finalText);
  }
}
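
Just to illustrate how the pieces fit together, a hypothetical call site (using the example wine from above) produces the final prompt like this:

// Hypothetical call site: the user input is dropped into the default
// template, producing the final prompt that is sent to the model.
final query = TextQuery(input: 'Joostenberg Bakermat 2020');
final content = query.toContent();
// content now wraps:
// "Write tasting notes for Joostenberg Bakermat 2020. The tasting notes ..."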

Incidentally, I am also making some changes to my naming conventions in this step. From now on, the prompt, i.e. the input for my interaction with the model, is called content to better match the AI Dart SDK.

Step 2: Split text-only and multi-modal flows

Here I am choosing a lazy approach at the price of some code duplication. The oracle route with its target OracleScreen and the two controllers OracleController and QueryController are all copied and renamed xxxText and xxxMultimodal. Ugly, but good enough for a demo.

StatefulShellBranch(
    navigatorKey: _oracleTextNavigatorKey,
    routes: [
      GoRoute(
          path: '/oracle_text',
          name: AppRoute.oracletext.name,
          pageBuilder: (context, state) => NoTransitionPage(
              key: state.pageKey, child: const OracleTextScreen()))
    ]),
StatefulShellBranch(
    navigatorKey: _oracleMultimodalNavigatorKey,
    routes: [
      GoRoute(
          path: '/oracle_multimodal',
          name: AppRoute.oraclemultimodal.name,
          pageBuilder: (context, state) => NoTransitionPage(
              key: state.pageKey,
              child: const OracleMultimodalScreen()))
    ]),

Step 3: Capture the image upload

With the help of the popular image_picker package, the technical implementation of this step is simple. ImagePicker().pickMultiImage() lets the browser take over and guide the user through navigating the file system and picking images.

@riverpod
class ImagesController extends _$ImagesController {
  @override
  FutureOr<List<XFile>> build() {
    return <XFile>[];
  }

  Future<void> pickImages() async {
    state = const AsyncValue.loading();
    try {
      state = await AsyncValue.guard(() async {
        final images = await ImagePicker()
            .pickMultiImage(maxHeight: 800, maxWidth: 800, imageQuality: 50);
        ref
            .read(queryMultimodalControllerProvider.notifier)
            .updateImages(images: images);
        return images;
      });
    } catch (e, stacktrace) {
      state = AsyncError(e, stacktrace);
    }
  }
}

What is left here is to solve the UX challenge. Designing a delightful image input form with form validation and error messaging is left as an exercise for the reader 😅.
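
For the bare minimum, though, a single button is enough to trigger the picker. Here is a sketch; imagesControllerProvider is the provider Riverpod generates for the controller above, while ImagePickerButton is just a name I made up:

import 'package:flutter/material.dart';
import 'package:flutter_riverpod/flutter_riverpod.dart';

// Bare-minimum trigger for the picker. imagesControllerProvider is the
// provider riverpod generates for the ImagesController shown above.
class ImagePickerButton extends ConsumerWidget {
  const ImagePickerButton({super.key});

  @override
  Widget build(BuildContext context, WidgetRef ref) {
    return ElevatedButton.icon(
      icon: const Icon(Icons.add_a_photo),
      label: const Text('Pick images'),
      onPressed: () =>
          ref.read(imagesControllerProvider.notifier).pickImages(),
    );
  }
}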

Side note: During the testing phase I ran into this documented issue. My image_picker call worked fine on localhost, but in production the file selector never popped up. In the end, cleaning and rebuilding my project solved the issue, just as the original poster had reported.

Step 4: Build the multimodal prompt

At this point, all pieces to capture the user input are in place. What’s left is to combine the text and the images into a prompt and to query the model.

For a multimodal request, there is no need to drop the user input into a larger prompt template to add context. Instead, the text portion of the prompt is initialized with a predefined constant that most users will be able to use “as is”.

const MULTIMODAL_TEXT =
    'Identify all wine bottles in the pictures. For each wine, provide details '
    'such as name, vineyard, vintage, grapes and process. '
    'For each wine, then generate tasting notes in the style of a wine critic. '
    'The tasting notes should mention the style '
    'of the wine, the tasting profile, and the production process. '
    'Keep the results to one paragraph per wine.';

The user images were stored in memory in step 3 as XFile objects. We now have to convert each XFile image to Uint8List bytes for the AI Dart SDK. This can be done with an asynchronous call to image.readAsBytes().

Future<Content> toContent() async {
  final List<(String, Uint8List)> imageTuples = [];
  for (final i in images) {
    imageTuples.add((i.mimeType ?? '', await i.readAsBytes()));
  }

  final List<Part> parts = [
    // Gemini docs recommend keeping the text component last
    ...imageTuples.map((tuple) => DataPart(tuple.$1, tuple.$2)),
    TextPart(text),
  ];

  return Content.multi(parts);
}

At this point, we can call Content.multi() from the AI Dart SDK to combine all input pieces into a single prompt that can be sent off to the model via model.generateContent(Iterable<Content> prompt, …).

Future<List<String>> fetchResults(Iterable<Content> content) async {
  try {
    final response = await model.generateContent(content);
    return [response.text ?? 'no result'];
  } catch (error) {
    throw Exception('Error on model.generateContent: $error');
  }
}
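
For context, fetchResults() assumes a configured GenerativeModel instance. A minimal setup might look like the snippet below; the model name and the way the API key is supplied are assumptions, not necessarily how the WineSnob is configured:

import 'package:google_generative_ai/google_generative_ai.dart';

// Minimal model setup assumed by fetchResults(). 'gemini-pro-vision' was
// the multimodal Gemini model at the time of writing; the API key handling
// here is just a placeholder.
final model = GenerativeModel(
  model: 'gemini-pro-vision',
  apiKey: const String.fromEnvironment('GEMINI_API_KEY'),
);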

By the way, the AI Dart SDK’s generateContent() takes a prompt as input. This prompt parameter is of type Content. The reason is probably that the prompt is in fact a message with a “message content”, but the naming still trips me up. We make a call to generateContent() and give it a … Content? 🤔

But this naming convention seems to be established, so I am going along with the scheme and calling toContent() functions in my queryXXX classes when I want a prompt.

Step 5: Take it for a spin

A comprehensive evaluation of the multimodal interactions is outside the scope of this experiment, but I can confirm that the feature works as expected.

For testing, I mostly used random photos of wines I found on my camera. All photos show the labels, but some contain more than one bottle.

I also threw in a few curveballs: an outdoor sculpture in the form of a lemonade bottle (you know it if you have been to New Zealand), a label with Greek lettering, a fuzzy shot of some obscure Cape Town craft beers, and a picture of a picture with a winemaker's mascot.

Each time I asked the model to first identify all wine bottles in the shot and then generate tasting notes for each. Most of the time the model correctly identified all wines. It often threw in details that weren’t visible in the picture so I assume this wasn’t a case of simple OCR. The model also correctly identified the L&P bottle-shaped statue as “not a wine” and deciphered the Greek label. Well done, Gemini.

The second part of the challenge, generating the actual tasting notes, was more hit-and-miss. A few times the model correctly identified the wine but then diverged in the discussion. In one case, a French red wine from the Rhône Valley was described as consisting of 100% Viognier grapes, a white variety. In the second and third paragraphs, the model switched to entirely different wines, ignoring my instructions.

The specs for the same wine entered into a text-only prompt result in much more realistic notes.

The obscure cases produced some highlights. When confronted with the beer bottles, the model reinterpreted my instructions and became a beer critic. It even threw in the ABV. Nice.

Assessment (and disclaimer)

In my initial experiments with v1 of the WineSnob, I was very impressed with the quality of the generated tasting notes. While the output contained some lies and half-truths, most tasting notes appeared well-written and legit to the casual eye.

This time around, I caught a lot more bloopers, especially when using multimodal input. At times the model mislabeled the image or ignored my instructions. Sometimes it got important details like the wine’s color wrong. However, most surprising were the cases of clear disconnect between phase 1 (identify the wines in the picture) and phase 2 (generate tasting notes for each wine) during multimodal interactions. To me, these served as a stark reminder that I was talking to a machine model, not a human.

Of course, it would be unfair to judge the quality of the model or the effectiveness of multimodal input based on my mostly anecdotal evidence:

  1. All my examples are hand-picked from ad-hoc queries and do not constitute a proper evaluation.
  2. More generally, my WineSnob experiments are hardly fair. LLMs like Gemini are designed to excel at certain types of challenges. Most experts would argue that generating a factually accurate description of a sensory experience from a few words of input falls outside the scope of these models.
  3. Most likely the quality of both text-only and multimodal results could be noticeably improved with prompt engineering. For example, experts often advise breaking the path to the result into several steps and including a sample output with the prompt.
  4. Some of the bloopers or hallucinations might have been avoided by turning down the temperature of the model, i.e., making the responses more predictable and less creative (see the sketch after this list).
  5. My goal with all three versions of the WineSnob was to demonstrate technical feasibility, not to evaluate a model.
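
On point 4, the AI Dart SDK exposes the sampling parameters through GenerationConfig when the model is constructed. A sketch, with an arbitrary illustrative value rather than a tuned setting:

// A lower temperature makes the sampling more conservative and the
// responses less creative; 0.2 is only an illustrative value.
final calmModel = GenerativeModel(
  model: 'gemini-pro-vision',
  apiKey: const String.fromEnvironment('GEMINI_API_KEY'),
  generationConfig: GenerationConfig(temperature: 0.2),
);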

Conclusions

I am happy to report that multimodal input is easy to set up in Flutter and that it can really help simplify the UX of an app. But don’t expect a model to read your mind, and be prepared to spend some effort on prompt design. A picture might be worth a thousand words, but this chatty input can confuse the model.

Source Code

The WineSnob repo is public and you can find most of the changes discussed in this article in this commit.

You can play with the latest iteration of the WineSnob app here.
