{ "cells": [ { "cell_type": "code", "execution_count": 6, "id": "0b9ffe5f", "metadata": {}, "outputs": [], "source": [ "from pydantic import BaseModel,Field\n", "from typing import Literal,List\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "cd7bb64d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from dotenv import load_dotenv\n", "load_dotenv()" ] }, { "cell_type": "code", "execution_count": 7, "id": "dd8207ef", "metadata": {}, "outputs": [], "source": [ "class ImageSpec(BaseModel):\n", " placeholder:str=Field(...,description=\"e.g. [[IMAGE_1]]\")\n", " filename:str=Field(...,description=\"Save under images/, e.g. qkv_flow.png\")\n", " prompt:str=Field(...,description=\"Prompt to send to the image model\")\n", " size:Literal[\"1024x1024\",\"1024x1536\",\"1536x1024\"]=\"1025x1024\"\n", " quality: Literal[\"low\", \"medium\", \"high\"] = \"medium\"\n", "\n", "\n", "class GlobalImagePlan(BaseModel):\n", " md_with_placeholders:str\n", " images:List[ImageSpec]=Field(default_factory=list)" ] }, { "cell_type": "code", "execution_count": 8, "id": "63f25031", "metadata": {}, "outputs": [], "source": [ "from langchain_aws import ChatBedrockConverse\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "255a2613", "metadata": {}, "outputs": [], "source": [ "LLM_MODEL_ID = \"us.meta.llama3-3-70b-instruct-v1:0\"\n", "LLM_REGION = \"us-east-1\"\n", "llm = ChatBedrockConverse(\n", " model_id=LLM_MODEL_ID,\n", " region_name=LLM_REGION\n", ")" ] }, { "cell_type": "code", "execution_count": 11, "id": "849c528a", "metadata": {}, "outputs": [], "source": [ "placehonder=\"\"\"You are an expert technical blog image planning assistant.\n", "\n", "Your job is to analyze a Markdown blog post and generate a structured image plan.\n", "\n", "You MUST return output strictly matching the Pydantic model `GlobalImagePlan`.\n", "\n", 
"-----------------------------------------\n", "YOUR TASK\n", "-----------------------------------------\n", "\n", "You will receive a Markdown blog as input.\n", "\n", "You must:\n", "\n", "1. Keep the Markdown EXACTLY the same.\n", "2. DO NOT rewrite, summarize, improve, or modify any text.\n", "3. DO NOT remove or change any formatting.\n", "4. Only insert image placeholders where images would improve clarity.\n", "\n", "-----------------------------------------\n", "WHERE TO INSERT IMAGES\n", "-----------------------------------------\n", "\n", "Insert placeholders only:\n", "- After major section headings (## or ###)\n", "- After complex explanations\n", "- After architecture descriptions\n", "- After workflows\n", "- After comparisons\n", "- Where diagrams would help understanding\n", "- Where visual examples would add clarity\n", "\n", "DO NOT:\n", "- Add images randomly\n", "- Add too many images\n", "- Break code blocks\n", "- Insert placeholders inside code blocks\n", "- Modify existing content\n", "\n", "-----------------------------------------\n", "PLACEHOLDER FORMAT\n", "-----------------------------------------\n", "\n", "Use this exact format:\n", "\n", "[[IMAGE_1]]\n", "[[IMAGE_2]]\n", "[[IMAGE_3]]\n", "\n", "Number them sequentially.\n", "\n", "-----------------------------------------\n", "IMAGE SPEC RULES\n", "-----------------------------------------\n", "\n", "For each placeholder generate an ImageSpec with:\n", "\n", "- placeholder: exact placeholder string (e.g. 
[[IMAGE_1]])\n", "- filename: save under images/ directory (example: images/attention_flow.png)\n", "- prompt: highly detailed image generation prompt describing what the image should show\n", "- size: choose one of:\n", " - 1024x1024 (for square diagrams)\n", " - 1536x1024 (for wide architecture diagrams)\n", " - 1024x1536 (for vertical infographics)\n", "- quality: \"medium\" unless diagram is complex → use \"high\"\n", "\n", "The prompt must:\n", "- Be descriptive\n", "- Mention diagram style\n", "- Mention labels\n", "- Mention arrows and flow\n", "- Mention clean white background\n", "- Mention professional technical illustration style\n", "\n", "-----------------------------------------\n", "IMPORTANT OUTPUT RULES\n", "-----------------------------------------\n", "\n", "You MUST return ONLY a valid GlobalImagePlan JSON object.\n", "\n", "Do NOT include:\n", "- Explanations\n", "- Extra text\n", "- Markdown fences\n", "- Comments\n", "- Any text before or after the JSON\n", "\n", "-----------------------------------------\n", "OUTPUT FORMAT\n", "-----------------------------------------\n", "\n", "{\n", " \"md_with_placeholders\": \"...full markdown with inserted placeholders...\",\n", " \"images\": [\n", " {\n", " \"placeholder\": \"[[IMAGE_1]]\",\n", " \"filename\": \"images/example.png\",\n", " \"prompt\": \"Detailed image generation prompt...\",\n", " \"size\": \"1536x1024\",\n", " \"quality\": \"medium\"\n", " }\n", " ]\n", "}\"\"\"" ] }, { "cell_type": "code", "execution_count": 14, "id": "332e03d8", "metadata": {}, "outputs": [], "source": [ "from langchain.messages import SystemMessage,HumanMessage" ] }, { "cell_type": "code", "execution_count": 16, "id": "1a7a4167", "metadata": {}, "outputs": [], "source": [ "markdown=\"\"\"\n", "# State of Multimodal LLMs in 2026\n", "\n", "## Introduction to Multimodal LLMs\n", "Recent developments in multimodal LLMs have shown significant progress, with models now capable of processing and generating multiple 
forms of data, such as text, images, and audio [Not found in provided sources]. \n", "* Multimodal LLMs have been applied to various tasks, including visual question answering, image captioning, and text-to-image synthesis.\n", "* The impact of multimodal LLMs can be seen in industries like healthcare, education, and entertainment, where they are used for applications such as medical image analysis, interactive learning systems, and content creation [Not found in provided sources].\n", "* Despite the advancements, key challenges in multimodal LLM research remain, including the need for large-scale datasets, improved model architectures, and better evaluation metrics [Not found in provided sources].\n", "\n", "## Recent Advances in Multimodal LLMs\n", "Recent breakthroughs in multimodal LLM architecture have led to significant improvements in the field. \n", "* Multimodal transformers, which combine visual and textual features, have shown promising results in tasks such as visual question answering and image-text retrieval [Not found in provided sources].\n", "* The use of multimodal attention mechanisms has also been explored, allowing models to focus on specific parts of the input data [Not found in provided sources].\n", "\n", "Multimodal LLMs play a crucial role in both computer vision and natural language processing. \n", "They can be used to analyze and understand visual data, such as images and videos, and generate text-based descriptions or summaries.\n", "In natural language processing, multimodal LLMs can be used to improve language understanding and generation tasks, such as machine translation and text summarization.\n", "\n", "The potential applications of multimodal LLMs in healthcare are vast. 
\n", "They can be used to analyze medical images, such as X-rays and MRIs, and generate text-based diagnoses or recommendations.\n", "Additionally, multimodal LLMs can be used to develop personalized treatment plans and improve patient outcomes [Not found in provided sources].\n", "Overall, the latest advancements in multimodal LLMs have the potential to revolutionize various fields, including healthcare, and improve the way we interact with and understand visual and textual data.\n", "\n", "## Challenges and Limitations\n", "The development of multimodal LLMs has made significant progress, but there are still several challenges and limitations that need to be addressed. \n", "* The limitations of current multimodal LLM models include their inability to fully understand the nuances of human communication, such as sarcasm, idioms, and figurative language [Not found in provided sources].\n", "* Training and deploying multimodal LLMs pose significant challenges, including the need for large amounts of diverse and high-quality training data, as well as the requirement for significant computational resources [Not found in provided sources].\n", "* Further research is needed to improve the performance and robustness of multimodal LLMs, particularly in areas such as common sense reasoning, emotional intelligence, and adaptability to new contexts and domains [Not found in provided sources]. \n", "Overall, addressing these challenges and limitations will be crucial to unlocking the full potential of multimodal LLMs and achieving more effective and engaging human-computer interactions.\n", "\n", "## Future Directions\n", "The future of multimodal LLMs holds great promise, with potential applications in areas such as [virtual assistants](Not found in provided sources) and [human-computer interaction](Not found in provided sources). 
\n", "* Multimodal LLMs may be used to improve accessibility and user experience in various domains.\n", "* The role of multimodal LLMs in shaping the future of AI is significant, as they can enable more natural and intuitive interactions between humans and machines.\n", "* Continued research in multimodal LLMs is crucial to overcome current limitations and unlock their full potential, driving innovation and progress in the field of AI [Not found in provided sources].\n", "\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 18, "id": "796739f7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GlobalImagePlan(md_with_placeholders='# State of Multimodal LLMs in 2026\\n## Introduction to Multimodal LLMs\\nRecent developments in multimodal LLMs have shown significant progress, with models now capable of processing and generating multiple forms of data, such as text, images, and audio [Not found in provided sources]. \\n* Multimodal LLMs have been applied to various tasks, including visual question answering, image captioning, and text-to-image synthesis.\\n* The impact of multimodal LLMs can be seen in industries like healthcare, education, and entertainment, where they are used for applications such as medical image analysis, interactive learning systems, and content creation [Not found in provided sources].\\n* Despite the advancements, key challenges in multimodal LLM research remain, including the need for large-scale datasets, improved model architectures, and better evaluation metrics [Not found in provided sources].\\n[[IMAGE_1]]\\n## Recent Advances in Multimodal LLMs\\nRecent breakthroughs in multimodal LLM architecture have led to significant improvements in the field. 
\\n* Multimodal transformers, which combine visual and textual features, have shown promising results in tasks such as visual question answering and image-text retrieval [Not found in provided sources].\\n* The use of multimodal attention mechanisms has also been explored, allowing models to focus on specific parts of the input data [Not found in provided sources].\\n[[IMAGE_2]]\\nMultimodal LLMs play a crucial role in both computer vision and natural language processing. \\nThey can be used to analyze and understand visual data, such as images and videos, and generate text-based descriptions or summaries.\\nIn natural language processing, multimodal LLMs can be used to improve language understanding and generation tasks, such as machine translation and text summarization.\\n[[IMAGE_3]]\\nThe potential applications of multimodal LLMs in healthcare are vast. \\nThey can be used to analyze medical images, such as X-rays and MRIs, and generate text-based diagnoses or recommendations.\\nAdditionally, multimodal LLMs can be used to develop personalized treatment plans and improve patient outcomes [Not found in provided sources].\\nOverall, the latest advancements in multimodal LLMs have the potential to revolutionize various fields, including healthcare, and improve the way we interact with and understand visual and textual data.\\n[[IMAGE_4]]\\n## Challenges and Limitations\\nThe development of multimodal LLMs has made significant progress, but there are still several challenges and limitations that need to be addressed. 
\\n* The limitations of current multimodal LLM models include their inability to fully understand the nuances of human communication, such as sarcasm, idioms, and figurative language [Not found in provided sources].\\n* Training and deploying multimodal LLMs pose significant challenges, including the need for large amounts of diverse and high-quality training data, as well as the requirement for significant computational resources [Not found in provided sources].\\n* Further research is needed to improve the performance and robustness of multimodal LLMs, particularly in areas such as common sense reasoning, emotional intelligence, and adaptability to new contexts and domains [Not found in provided sources]. \\nOverall, addressing these challenges and limitations will be crucial to unlocking the full potential of multimodal LLMs and achieving more effective and engaging human-computer interactions.\\n[[IMAGE_5]]\\n## Future Directions\\nThe future of multimodal LLMs holds great promise, with potential applications in areas such as [virtual assistants](Not found in provided sources) and [human-computer interaction](Not found in provided sources). 
\\n* Multimodal LLMs may be used to improve accessibility and user experience in various domains.\\n* The role of multimodal LLMs in shaping the future of AI is significant, as they can enable more natural and intuitive interactions between humans and machines.\\n* Continued research in multimodal LLMs is crucial to overcome current limitations and unlock their full potential, driving innovation and progress in the field of AI [Not found in provided sources].\\n[[IMAGE_6]]', images=[ImageSpec(placeholder='[[IMAGE_1]]', filename='images/multimodal_llm_architecture.png', prompt='A diagram showing the architecture of a multimodal LLM, with visual and textual features combined, and labels and arrows indicating the flow of data, on a clean white background, in a professional technical illustration style', size='1536x1024', quality='medium'), ImageSpec(placeholder='[[IMAGE_2]]', filename='images/multimodal_transformers.png', prompt='An illustration of multimodal transformers, with visual and textual features combined, and labels and arrows indicating the flow of data, on a clean white background, in a professional technical illustration style', size='1024x1024', quality='medium'), ImageSpec(placeholder='[[IMAGE_3]]', filename='images/multimodal_llm_applications.png', prompt='A diagram showing the various applications of multimodal LLMs, including computer vision and natural language processing, with labels and arrows indicating the relationships between the different applications, on a clean white background, in a professional technical illustration style', size='1024x1536', quality='medium'), ImageSpec(placeholder='[[IMAGE_4]]', filename='images/multimodal_llm_healthcare.png', prompt='An illustration of the potential applications of multimodal LLMs in healthcare, including medical image analysis and personalized treatment plans, with labels and arrows indicating the flow of data, on a clean white background, in a professional technical illustration style', 
size='1536x1024', quality='medium'), ImageSpec(placeholder='[[IMAGE_5]]', filename='images/multimodal_llm_challenges.png', prompt='A diagram showing the challenges and limitations of multimodal LLMs, including the need for large-scale datasets and improved model architectures, with labels and arrows indicating the relationships between the different challenges, on a clean white background, in a professional technical illustration style', size='1024x1024', quality='medium'), ImageSpec(placeholder='[[IMAGE_6]]', filename='images/multimodal_llm_future.png', prompt='An illustration of the future directions of multimodal LLMs, including potential applications in virtual assistants and human-computer interaction, with labels and arrows indicating the flow of data, on a clean white background, in a professional technical illustration style', size='1024x1536', quality='medium')])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "structured_llm = llm.with_structured_output(GlobalImagePlan)\n", "output = structured_llm.invoke(\n", "    [\n", "        SystemMessage(content=placehonder),\n", "        HumanMessage(content=markdown),\n", "    ]\n", ")\n", "\n", "output" ] }, { "cell_type": "code", "execution_count": 20, "id": "0e44ffd5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'# State of Multimodal LLMs in 2026\\n## Introduction to Multimodal LLMs\\nRecent developments in multimodal LLMs have shown significant progress, with models now capable of processing and generating multiple forms of data, such as text, images, and audio [Not found in provided sources]. 
\\n* Multimodal LLMs have been applied to various tasks, including visual question answering, image captioning, and text-to-image synthesis.\\n* The impact of multimodal LLMs can be seen in industries like healthcare, education, and entertainment, where they are used for applications such as medical image analysis, interactive learning systems, and content creation [Not found in provided sources].\\n* Despite the advancements, key challenges in multimodal LLM research remain, including the need for large-scale datasets, improved model architectures, and better evaluation metrics [Not found in provided sources].\\n[[IMAGE_1]]\\n## Recent Advances in Multimodal LLMs\\nRecent breakthroughs in multimodal LLM architecture have led to significant improvements in the field. \\n* Multimodal transformers, which combine visual and textual features, have shown promising results in tasks such as visual question answering and image-text retrieval [Not found in provided sources].\\n* The use of multimodal attention mechanisms has also been explored, allowing models to focus on specific parts of the input data [Not found in provided sources].\\n[[IMAGE_2]]\\nMultimodal LLMs play a crucial role in both computer vision and natural language processing. \\nThey can be used to analyze and understand visual data, such as images and videos, and generate text-based descriptions or summaries.\\nIn natural language processing, multimodal LLMs can be used to improve language understanding and generation tasks, such as machine translation and text summarization.\\n[[IMAGE_3]]\\nThe potential applications of multimodal LLMs in healthcare are vast. 
\\nThey can be used to analyze medical images, such as X-rays and MRIs, and generate text-based diagnoses or recommendations.\\nAdditionally, multimodal LLMs can be used to develop personalized treatment plans and improve patient outcomes [Not found in provided sources].\\nOverall, the latest advancements in multimodal LLMs have the potential to revolutionize various fields, including healthcare, and improve the way we interact with and understand visual and textual data.\\n[[IMAGE_4]]\\n## Challenges and Limitations\\nThe development of multimodal LLMs has made significant progress, but there are still several challenges and limitations that need to be addressed. \\n* The limitations of current multimodal LLM models include their inability to fully understand the nuances of human communication, such as sarcasm, idioms, and figurative language [Not found in provided sources].\\n* Training and deploying multimodal LLMs pose significant challenges, including the need for large amounts of diverse and high-quality training data, as well as the requirement for significant computational resources [Not found in provided sources].\\n* Further research is needed to improve the performance and robustness of multimodal LLMs, particularly in areas such as common sense reasoning, emotional intelligence, and adaptability to new contexts and domains [Not found in provided sources]. \\nOverall, addressing these challenges and limitations will be crucial to unlocking the full potential of multimodal LLMs and achieving more effective and engaging human-computer interactions.\\n[[IMAGE_5]]\\n## Future Directions\\nThe future of multimodal LLMs holds great promise, with potential applications in areas such as [virtual assistants](Not found in provided sources) and [human-computer interaction](Not found in provided sources). 
\\n* Multimodal LLMs may be used to improve accessibility and user experience in various domains.\\n* The role of multimodal LLMs in shaping the future of AI is significant, as they can enable more natural and intuitive interactions between humans and machines.\\n* Continued research in multimodal LLMs is crucial to overcome current limitations and unlock their full potential, driving innovation and progress in the field of AI [Not found in provided sources].\\n[[IMAGE_6]]'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "output.md_with_placeholders" ] }, { "cell_type": "code", "execution_count": 21, "id": "00892f27", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[ImageSpec(placeholder='[[IMAGE_1]]', filename='images/multimodal_llm_architecture.png', prompt='A diagram showing the architecture of a multimodal LLM, with visual and textual features combined, and labels and arrows indicating the flow of data, on a clean white background, in a professional technical illustration style', size='1536x1024', quality='medium'),\n", " ImageSpec(placeholder='[[IMAGE_2]]', filename='images/multimodal_transformers.png', prompt='An illustration of multimodal transformers, with visual and textual features combined, and labels and arrows indicating the flow of data, on a clean white background, in a professional technical illustration style', size='1024x1024', quality='medium'),\n", " ImageSpec(placeholder='[[IMAGE_3]]', filename='images/multimodal_llm_applications.png', prompt='A diagram showing the various applications of multimodal LLMs, including computer vision and natural language processing, with labels and arrows indicating the relationships between the different applications, on a clean white background, in a professional technical illustration style', size='1024x1536', quality='medium'),\n", " ImageSpec(placeholder='[[IMAGE_4]]', filename='images/multimodal_llm_healthcare.png', prompt='An illustration of the 
potential applications of multimodal LLMs in healthcare, including medical image analysis and personalized treatment plans, with labels and arrows indicating the flow of data, on a clean white background, in a professional technical illustration style', size='1536x1024', quality='medium'),\n", " ImageSpec(placeholder='[[IMAGE_5]]', filename='images/multimodal_llm_challenges.png', prompt='A diagram showing the challenges and limitations of multimodal LLMs, including the need for large-scale datasets and improved model architectures, with labels and arrows indicating the relationships between the different challenges, on a clean white background, in a professional technical illustration style', size='1024x1024', quality='medium'),\n", " ImageSpec(placeholder='[[IMAGE_6]]', filename='images/multimodal_llm_future.png', prompt='An illustration of the future directions of multimodal LLMs, including potential applications in virtual assistants and human-computer interaction, with labels and arrows indicating the flow of data, on a clean white background, in a professional technical illustration style', size='1024x1536', quality='medium')]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "output.images" ] }, { "cell_type": "code", "execution_count": 23, "id": "0b4e77e2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'A diagram showing the architecture of a multimodal LLM, with visual and textual features combined, and labels and arrows indicating the flow of data, on a clean white background, in a professional technical illustration style'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "output.images[0].prompt" ] }, { "cell_type": "code", "execution_count": null, "id": "8666fa58", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "bloggig-Agent (3.12.12)", "language": "python", "name": "python3" }, "language_info": { 
"codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 5 }