
Research Proposal for Doctoral Thesis
Generating Natural-Language Image Descriptions
ISMAIL KAYALI
January 2018
M.Sc. Big Data Systems, HSE University, Moscow, Russia.
B.Sc. Informatics Engineering, Aleppo University, Syria.

1 ABSTRACT
Image captioning has become one of the leading research topics at the intersection of natural language processing and computer vision. Image description generation is a challenging problem that has recently received a large amount of interest from the computer vision and natural language processing communities. Generating natural-language image descriptions combines image recognition with natural language processing and can help people who are visually impaired obtain an accurate description of an image. It also has applications for people who need information about an image but cannot look at it, for example while driving. While a computer might accurately describe a scene as "a group of people sitting next to each other," a person might say that it is "a group of people having a good time." The challenge is to help the technology understand what a person would consider most important, and worth saying, about the picture: there is a gap between what is in an image and what we say about the image.


The research aims to describe a series of images in the same kind of way that a human would, by focusing not just on the items in the picture but also on what is happening and how it might make a person feel. Captioning is about taking concrete objects and putting them together in a literal description, using computer vision and natural language processing to describe a person's surroundings, read text, answer questions and even identify emotions on people's faces; Seeing AI, for instance, can be used as a cell-phone app or via smart glasses. The approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. A long-standing goal in this field is to develop agents that can perceive and understand the rich visual world around us and that can communicate with us about it in natural language.

2 Details

2.1 Introductory Background
Detecting and identifying objects of interest in images and video footage has attracted great attention over the past three decades from different research areas. This is due to the urgent need for reliable machine vision systems in several application domains.

The fields of computer vision and natural language processing have made significant advances in the past few years, thanks in part to the more widespread use of a machine learning methodology called deep neural networks. These methods have helped researchers obtain much more accurate results on pattern recognition tasks such as speech recognition and identifying objects in photos. To build a visual storytelling system, researchers used deep neural networks to create a "sequence to sequence" machine learning system similar to the kind other computer scientists have used for automated language translation. In this case, however, instead of translating from, say, French to English, the system is trained to translate from images to sentences. For a machine learning system to work, it needs a training set of data that it can learn from. To build the visual storytelling system's training set, the researchers hired crowdsourced workers to write sentences describing various scenes. A minimal sketch of such an encoder-decoder captioning model is given below.
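The following is a rough, self-contained sketch of this image-to-sentence idea, assuming PyTorch and torchvision are available; the class name, layer sizes, and the choice of a ResNet-18 encoder with an LSTM decoder are illustrative assumptions, not the exact system described above.

import torch
import torch.nn as nn
import torchvision.models as models

class CaptioningSketch(nn.Module):
    """CNN image encoder followed by an LSTM sentence decoder (sizes are illustrative)."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18()                                   # pretrained weights could be loaded in practice
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop the classification head
        self.img_proj = nn.Linear(cnn.fc.in_features, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)          # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)              # per-step word scores

    def forward(self, images, captions):
        # images: (B, 3, 224, 224); captions: (B, T) integer token ids
        feats = self.encoder(images).flatten(1)                   # (B, 512) image features
        first = self.img_proj(feats).unsqueeze(1)                 # image fed as the first "word"
        words = self.embed(captions[:, :-1])                      # teacher forcing on ground-truth words
        states, _ = self.lstm(torch.cat([first, words], dim=1))
        return self.out(states)                                   # (B, T, vocab_size) word scores

Training such a model would minimize cross-entropy between the predicted word scores and the reference captions; at test time the decoder emits one word at a time, feeding each prediction back in.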

To account for variations in how people described the scenes, the tool was trained to prefer language on which there was consensus and to create sentences based on that common ground. Image understanding has been the central goal of computer vision. Whereas the majority of work on image understanding focuses on class-based annotation, describing an image in natural language is still the best way to demonstrate understanding. The task of generating descriptions for images has received increasing attention from both the computer vision and natural language processing communities. This is an important problem, as an effective solution can enable many exciting real-world applications, such as human-robot interaction, image/video synopsis, and automatic caption generation. The first existing methods mostly rely on predefined templates, which often result in tedious descriptions. Another line of work solves the description generation problem via retrieval, where the description for an image is borrowed from the semantically most similar image in the training set; a rough sketch of this retrieval setting follows.
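The snippet below only illustrates the retrieval setting under the assumption that fixed-length image features (for example from a CNN) have already been extracted; the function and variable names are hypothetical.

import numpy as np

def retrieve_caption(query_feat, train_feats, train_captions):
    # query_feat: (D,) feature of the new image; train_feats: (N, D) training features;
    # train_captions: list of N human-written sentences.
    q = query_feat / np.linalg.norm(query_feat)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = t @ q                                   # cosine similarity to every training image
    return train_captions[int(np.argmax(sims))]    # reuse the nearest image's caption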

This retrieval setting is, however, less applicable to complex scenes composed of a large set of objects in diverse configurations, such as indoor environments. Recently, the field has witnessed a boom in generating image descriptions via deep neural networks, which are able both to learn a weak language model and to generalize descriptions to unseen images. The sequence of sentences needs to be as natural as possible, mimicking how humans describe the scene. This is particularly important, for example, in the context of social robotics to enable realistic communication. Towards this goal, the research will develop a framework with three major components:
(1) a holistic visual parser that couples the inference of objects, attributes, and relations to produce a semantic representation of a 3D scene;
(2) a generative grammar automatically learned from training text;

(3) a text generation algorithm that takes into account subtle dependencies across sentences, such as logical order, diversity, saliency of objects, and coreference resolution.

2.2 Research Questions
Introduction to generating natural-language image descriptions:
- What is generating natural-language image descriptions?
- Why would you want to describe images?
- What kind of datasets are available?
- What kinds of tasks have been proposed?
- How do we evaluate image description systems?
Digging deeper and going further / the underlying semantic task:
- Image feature representation.
- Sentence representation.
- Mapping images and sentences to an explicit semantic space.

- Challenges for explicit semantic mappings.
- Image description as retrieval.
- Advantages of retrieval.
- Challenges for retrieval.
- Advantages of generation.
- Issues for generation.

- Image description as cross-modal ranking.
- Cross-modal image annotation.
- Cross-modal image search.
- Advantages of cross-modal ranking.
- Issues for cross-modal ranking.
- Collecting human relevance judgments.
- Using human relevance judgments.

- Can evaluation be automated?

2.3 Aims/Objectives of the Research
The pictures you get from your camera or phone have no text associated with them. I would like to be able to associate text queries directly with images. Sentence-based image description should improve:
- Image search for everybody.
- Accessibility to image collections for the visually impaired.
Image descriptions:
- should describe the depicted entities, events and scenes: who did what to whom, when and where;
- should only describe what is in the image: no background information that cannot be seen;
- may differ in the amount of detail they provide.
Each image has many correct descriptions. Each sentence may describe many different images.
Perceptual image descriptions: What kind of image is it (photo vs. drawing, macro, panorama)? Colors, textures, shapes. Non-visual image descriptions: additional context (last Sunday's game); metadata (Nikon D90, f2.8, GPS coordinates). Conceptual image descriptions: Who did what where to whom? What events, scenes and entities are depicted? Generic: kids playing football. Specific: Jake tackling Kevin. Abstract: childhood; competition. Conceptual descriptions are most appropriate for image search and for image description as a test for language understanding.
Definition of sentence-based image description: sentence-based image description is the task of associating images with natural language sentences that describe what entities, events and scenes are depicted in them.
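One of the research questions above asks whether evaluation can be automated, and a key complication is that each image has many correct descriptions. The snippet below is a small illustration of one common automated measure, sentence-level BLEU computed against multiple human references, assuming the NLTK library is installed; the example sentences are invented for illustration only.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Several correct human descriptions of the same (hypothetical) image.
references = [
    "a group of people sitting next to each other".split(),
    "several friends having a good time together".split(),
]
candidate = "a group of friends sitting together".split()   # system output

score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU against {len(references)} references: {score:.3f}")

Automated scores such as BLEU only approximate human judgments, which is why the questions above also cover collecting and using human relevance judgments.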

Applications of sentence-based image description:
- Searching online or personal image collections.
- A testbed for image understanding.
- A testbed for grounded language understanding.
To develop and evaluate image description systems, we need corpora of images paired with appropriate captions.
- What datasets are available?
- What strengths and weaknesses do they have?
- What other data could be leveraged for this task?
Using captioned images from the web (news, photo-sharing sites): advantage: size and 'natural' captions; disadvantage: online captions may not describe the images. Using images with purposely created captions: advantage: the sentences describe the images; disadvantage: smaller size and 'unnatural' captions.

2.4 Theoretical framework and methods
The framework for generating image descriptions is based on a key rationale: images and their corresponding descriptions are two different ways to express the underlying common semantics shared by both.

Given an image, the framework first recovers the semantics through holistic visual analysis, which results in a scene graph that captures detected objects and the spatial relations between them (e.g., on-top-of and near). The semantics embodied by a visual scene usually have multiple aspects. When describing such a complex scene, humans often use a paragraph comprising multiple sentences, each focusing on a specific aspect. To imitate this behavior, the framework transforms the scene graph into a sequence of semantic trees and yields multiple sentences, one from each semantic tree. To make the results as natural as possible, two strategies are adopted:
(1) Instead of prescribing templates in advance, the grammar is learned from a training set: a set of RGB-D scenes with descriptions provided by humans.
(2) Dependencies among sentences are taken into account, including logical order, saliency, coreference and diversity.
Given an RGB-D image, semantics are extracted via holistic visual parsing: first, the image is parsed to obtain the objects of interest, their attributes, and their physical relations; then a scene graph is constructed, which provides a coherent summary of these aspects. A toy illustration of such a scene-graph representation is sketched below.
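The following toy sketch is purely illustrative of how a scene graph could be represented and turned into one sentence per relation; the class names, relation tuples and template-like wording are assumptions that stand in for the holistic parser and the grammar learned from human descriptions.

from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    attributes: list = field(default_factory=list)

@dataclass
class SceneGraph:
    objects: list = field(default_factory=list)    # SceneObject instances
    relations: list = field(default_factory=list)  # (subject_idx, relation, object_idx)

def describe(graph):
    # Emit one sentence per relation, mimicking "one semantic tree per sentence".
    sentences = []
    for s, rel, o in graph.relations:
        subj, obj = graph.objects[s], graph.objects[o]
        sentences.append(
            f"The {' '.join(subj.attributes + [subj.name])} is "
            f"{rel.replace('-', ' ')} the {' '.join(obj.attributes + [obj.name])}."
        )
    return sentences

# e.g. a parsed RGB-D scene with a red mug on a wooden table
graph = SceneGraph(
    objects=[SceneObject("mug", ["red"]), SceneObject("table", ["wooden"])],
    relations=[(0, "on-top-of", 1)],
)
print(describe(graph))  # ['The red mug is on top of the wooden table.']

A real system would additionally order the sentences, vary their structure and resolve coreference across them, as described above.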

A test for grounded language understanding: image description requires the ability to associate sentences with images that depict the events, entities and scenes they describe. A test for image understanding/vision: image description requires the ability to detect events, entities, and scenes in images.

3 Related Studies
Some related studies on generating natural-language image descriptions:
1- Natural Language Descriptions for Semantic Representations of Human Brain Activity.
2- Language representation estimated from brain activity.
3- Caption generation from images.
4- Natural-Language Video Descriptions Using Text-Mined Knowledge.
5- Generating Image Descriptions From Computer Vision Detections.
6- Deep Visual-Semantic Alignments for Generating Image Descriptions.

4 Research Plan and Timeline
First Year:
- Approved written thesis proposal to be signed by the Student, Advisor and Advisory Committee.
- Submission of the signed and approved thesis proposal to the Department Head.
- Continuation of courses and passing exams.
- Literature review; begin data collection and analysis.
- Implementation of algorithms.
- Conference article /1/.
- Journal article /1/.

- Meet with Advisory Committee and complete progress report.
Second Year:
- Continuation of courses and passing exams.
- Literature review and continuation of data collection and analysis.
- Implementation of algorithms.
- Conference article /2/.

- Journal article /2/.
- Meet with Advisory Committee and complete progress report.
Third and Fourth Year:
- Continuation of courses and passing exams.
- Meet with Advisory Committee to determine whether research goals have been met, establish consensus on remaining goals, and complete the yearly progress report.
- Completion of data collection and analysis; implementation of algorithms.
- Conference article /3/.
- Journal article /3/.
- Meet with Advisory Committee to obtain approval to write the thesis.
- Writing up of the thesis (and research manuscripts); have the thesis vetted by the Advisor.

- Submit the Ph.D. Thesis Title / Appointment of Examiners form to the Faculty.
- Make required revisions to the thesis.
- Give a copy of the signed final thesis to the Faculty.

6 References
1- CMU's HERB robotic platform, http://www.cmu.edu/herb-robot/.
2- Microsoft's Tay, https://twitter.com/tayandyou.
3- Bo Dai, Dahua Lin, Raquel Urtasun, and Sanja Fidler. Towards diverse and natural image descriptions via a conditional GAN. arXiv:1703.06029, 2017.
4- A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
5- Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. CoRR, abs/1411.2539, 2014.
6- Jacqueline Kory Westlund, Jin Joo Lee, Luke Plummer, Fardad Faridi, Jesse Gray, Matt Berlin, Harald Quintus-Bosz, Robert Hartmann, Mike Hess, Stacy Dyer, Kristopher dos Santos, Sigurdhur Orn Adhalgeirsson, Goren Gordon, Samuel Spaulding, Marayna Martinez, Madhurima Das, Maryam Archie, Sooyeon Jeong, and Cynthia Breazeal. Tega: A social robot. In International Conference on Human-Robot Interaction, 2016.
7- Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
8- Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
9- Yanchao Yu, Arash Eshghi, Gregory Mills, and Oliver Lemon. The BURCHAK corpus: A challenge data set for interactive learning of visually grounded word meanings. In Workshop on Vision and Language, 2017.
10- Jason Weston. Dialog-based language learning. arXiv, 2016.
11- A. Thomaz and C. Breazeal. Reinforcement learning with human teachers: Evidence of feedback and guidance. In AAAI, 2006.
12- Bradski, Gary, and Adrian Kaehler. Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly Media, Inc., 2008.
13- Sonka, Milan, Vaclav Hlavac, and Roger Boyle. Image Processing, Analysis, and Machine Vision. Cengage Learning, 2014.
14- Agarwal, Shivani, Aatif Awan, and Dan Roth. "Learning to detect objects in images via a sparse, part-based representation." IEEE Transactions on Pattern Analysis and Machine Intelligence 26.11 (2004).
15- Nowozin, Sebastian, and Christoph H. Lampert. "Structured learning and prediction in computer vision." Foundations and Trends in Computer Graphics and Vision 6.3-4 (2011).
16- Andrieu, Christophe, et al. "An introduction to MCMC for machine learning." Machine Learning 50.1-2 (2003).
17- Rautaray, Siddharth S., and Anupam Agrawal. "Vision based hand gesture recognition for human computer interaction: a survey." Artificial Intelligence Review 43.1 (2015).
18- Wing, Jeannette M. "Computational thinking." Communications of the ACM 49.3 (2006).
19- Ma, Yunqian, and Guodong Guo, eds. Support Vector Machines Applications. Springer Science+Business Media, 2014.
20- Collobert, Ronan, and Jason Weston. "A unified architecture for natural language processing: Deep neural networks with multitask learning." Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.
21- Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., and Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015).
22- Farhadi, A., Hejrati, M., Sadeghi, M. A., Young, P., Rashtchian, C., Hockenmaier, J., and Forsyth, D. Every picture tells a story: Generating sentences from images. In European Conference on Computer Vision (2010).
23- He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016).
24- Johnson, J., Krishna, R., Stark, M., Li, L.-J., Shamma, D. A., Bernstein, M. S., and Fei-Fei, L. Image retrieval using scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), IEEE.
25- Karpathy, A., and Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015).
26- Kulkarni, G., Premraj, V., Ordonez, V., Dhar, S., Li, S., Choi, Y., Berg, A. C., and Berg, T. L. BabyTalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (2013).
27- Li, S., Kulkarni, G., Berg, T. L., Berg, A. C., and Choi, Y. Composing simple image descriptions using web-scale n-grams. In Proceedings of the Conference on Computational Natural Language Learning (2011).
28- Rohrbach, A., Rohrbach, M., and Schiele, B. The long-short story of movie description. In Proceedings of the German Conference on Pattern Recognition (2015), Springer.
29- Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) (2015).
30- Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., and Saenko, K. Sequence to sequence - video to text. In Proceedings of the IEEE International Conference on Computer Vision (2015).
