CODEX

Using image recognition to send an email, GUI automation series II.

William W
Published in CodeX
9 min read · Feb 27, 2021

Today we’ll be using PyAutoGUI to open a new Chrome tab, log in to an email account, write a draft, then send it. This is the second part of the GUI automation tutorial series, so check out the first tutorial if you don’t already have experience with Python and PyAutoGUI. We’ll use Gmail as the example, but you should be able to do this with any email account when we’re done. There’s a TL;DR at the end if you want to skip to the code. You can also visit my GitHub for the code explained here.

Let’s start with our list of steps:

  1. Hit the Windows key, search for Chrome, open Chrome (covered in first tutorial)
  2. Open new tab (covered in first tutorial)
  3. Type gmail.com (or mail.yahoo.com or another domain if you’re inclined)
  4. Place the cursor in the username box and enter the username. Enter the password. Log in
  5. Verify successful login through dialog boxes
  6. Locate the “compose” button
  7. Enter the recipient, the subject, and an email. Send the email

In our last tutorial, we went over some basic PyAutoGUI functions (write, hotkey, press, screenshot). This time, we want to use pyautogui.locateCenterOnScreen("image.png", confidence=c), where c is a number with 0 < c < 1. Note that the confidence argument only works if OpenCV is installed (pip install opencv-python).

This function is pretty cool. You can take a screenshot of any image (like a wanted poster), then ask PyAutoGUI to find that image again “in the wild.”

Totally artistic illustration made on Google Slides of how PyScreeze finds images.

PyAutoGUI does this by using PyScreeze, another Python module automatically installed with PyAutoGUI. It scans your screen from the top left, moving right and then down, until it finds a region matching the image file you pass into the locateCenterOnScreen function. The confidence value is a threshold, not a statistical confidence interval: it allows a margin of difference between the image you ask for and what is found on the screen. Roughly speaking, a confidence of 0.9 means the region on screen only needs to be about a 90% match, allowing for small differences.¹

The locateCenterOnScreen() function also returns the x and y coordinates of the center of the image on your screen, allowing us to target automated mouse-clicks.
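A minimal sketch of calling it, assuming OpenCV is installed. The helper name `find_center` and its `locate` parameter are hypothetical, not part of PyAutoGUI. The wrapper is there because, depending on your PyAutoGUI version, a miss either returns None or raises an ImageNotFoundException, and it is convenient to treat both the same way.

```python
def find_center(image_path, confidence=0.9, locate=None):
    """Return (x, y) of the image's center on screen, or None if not found.

    `locate` is a hypothetical hook for swapping in a fake during testing;
    by default the real pyautogui.locateCenterOnScreen is used.
    """
    if locate is None:
        import pyautogui  # imported lazily because it needs a display
        locate = pyautogui.locateCenterOnScreen
    try:
        point = locate(image_path, confidence=confidence)
    except Exception as err:
        # Some PyAutoGUI versions raise ImageNotFoundException on a miss;
        # anything else (for example a missing file) is a real error.
        if type(err).__name__ != "ImageNotFoundException":
            raise
        return None
    if point is None:
        # Other versions simply return None on a miss
        return None
    x, y = point
    return int(x), int(y)
```

With that in place, something like `find_center("signInButton.png", confidence=0.9)` gives you either a pair of coordinates to click or None to signal "not on screen yet."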

The documentation about how this neat function works is here.

Designing the process:

  1. Open up a new Python file called autoEmailer.py OR open up the autoSearcher.py file and delete code from the try/except blocks until we’re left with just steps 1 and 2 (opening a Chrome tab). Then move that code into its own function called “openChrome”. You can also copy this code exactly from my GitHub.

2. Okay, now this is when things get interesting. You’re going to need to take some screenshots of what’s on your desktop. I like the Snipping tool (or Snip & Sketch) because we can capture the exact area we want from the beginning, but full screenshots followed by cropping works just as well. We’re taking these images and telling our code that these are the buttons it should look out for.

When you’re on the Gmail homepage, take note of what you see:

What you see if you’re logged out
What you see if you’re logged in

If you’re logged out, you see something like the top. If you’re logged in, you see something like the bottom

3. Now we ask ourselves, what‘s next? Well, we design our code so it handles both situations through a main loop. We’ll discuss making a main loop in the “Writing the Code” section, but for now, let’s just collect data and assume you need to log in. Let’s pretend we’re the mouse cursor, and take note of its journey:

Screenshot or snip the Sign in text, save it as “signInButton.png”

4. We’ll assume you have a Gmail account, so we can just click the Sign in button or text. Screenshot the “Sign in” text and save it as “signInButton.png” (later, we’ll pass signInButton.png into a function to autoclick it).

5. Now click on “Sign in” and see what’s on the next screen

6. What would you click next? The “Email or phone” box of course. Screenshot that and name it “signInBoxBlank.png”

signInBoxBlank.png

7. There is also a chance you got a screen like this.

8. It’s a good idea to discover as many branches as possible so our code can handle different screens (although sometimes this is impossible, since some screens cannot be “summoned”, such as 2FA reminders). For now, let’s keep it simple and program for the 90% of cases. We’ll assume you’re the only user and want to log back in. Screenshot either your icon or the “Signed out” text. Note that you’ll want to select your icon or even whole email if you want to implement this for multiple logins.

My icon. I named it “smallSignInIcon.png”
The “Signed out” text. If you’re copying from my GitHub, this is called “chooseAccount.png”

9. Now click to start the login process for your account. The next page is most likely one of these two situations:

A.) The password field is already highlighted. In this case, the blue text box shows and we just need to have our program type the password and hit “enter.” Notice that this is a time when we identify an image, but do something other than click on it once we see it.

The blue text shows when the password field is already highlighted. I called it “cursorAlreadySet.png”

B.) The password field is not highlighted. In this case, the gray text shows, and we want to click into it to type our password.

This text shows when the cursor has not auto selected the field. I called it “cursorNotSetforPass.png”

10. Here’s our logic for the code: Type the password and hit enter if cursorAlreadySet.png is visible. Otherwise, click on the field “cursorNotSetforPass.png”, then type the password and hit enter.

What you see if you’re logged in

11. Here, we’re assuming we’re logged in (we could have also gotten here directly from step 1 if we were already logged in)

12. Screenshot the Compose button, then click on it.

I called this “composeButton.png”
The window that pops up when composing an email in Gmail. Notice that the “To” field is automatically selected.

13. After clicking, a new mini-window shows up. Since the “To” field is automatically selected, our code can begin typing recipient(s) with the write() function (in my code, I have it hit tab after each recipient in a loop until all recipients are added).

14. When finished with recipients, just hit tab again to enter a subject with write().

15. Hit tab again to enter the main body. Now that you’re in the main body, type the message with write().

16. Now hit ctrl+Enter with hotkey() to send.
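Steps 10 and 13 through 16 above can be sketched as two small functions. This is an illustration under assumptions, not the code from the GitHub repo: the function names and the `gui` parameter are hypothetical, and `gui` defaults to the real pyautogui module so a stand-in object can be passed for a dry run.

```python
def _load_gui():
    import pyautogui  # imported lazily because it needs a display
    return pyautogui


def _find(gui, image, confidence=0.9):
    """Center of `image` on screen, or None when it is not visible."""
    try:
        return gui.locateCenterOnScreen(image, confidence=confidence)
    except Exception as err:
        # Some PyAutoGUI versions raise instead of returning None on a miss
        if type(err).__name__ != "ImageNotFoundException":
            raise
        return None


def enter_password(password, gui=None):
    """Step 10: type the password, clicking into the field first if needed."""
    if gui is None:
        gui = _load_gui()
    if _find(gui, "cursorAlreadySet.png") is None:
        field = _find(gui, "cursorNotSetforPass.png")
        if field is None:
            return False  # neither image is visible yet; try again later
        x, y = field
        gui.click(x, y)
    gui.write(password)
    gui.press("enter")
    return True


def compose_and_send(recipients, subject, body, gui=None):
    """Steps 13-16: assumes the compose window is open with "To" selected."""
    if gui is None:
        gui = _load_gui()
    for recipient in recipients:
        gui.write(recipient)
        gui.press("tab")         # tab after each recipient
    gui.press("tab")             # tab again to reach the subject
    gui.write(subject)
    gui.press("tab")             # tab again to reach the main body
    gui.write(body)
    gui.hotkey("ctrl", "enter")  # ctrl+Enter sends the email
```

Notice that `enter_password` identifies cursorAlreadySet.png without ever clicking on it: seeing the image is only the signal that it is safe to start typing.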

Writing the code

Okay, now that we’re done with the data collection, we need to turn this design into code. We can do this in two steps.

  1. Creating a “main loop” or “game loop” structure
  2. Creating stages or states to jump between within the main loop

In our last tutorial, we didn’t really do either. We just brushed our way through 5 steps. But this task has a few more steps with branched possibilities and simply brushing through is likely to cause errors. We can easily start with our main loop, then we can spend another tutorial improving this by adding defined states.

Loop design:

  1. A main loop continues to execute until a stop or break condition occurs. Like a game, the computer constantly checks its environment for changes, then updates or refreshes certain variables in the code, which changes other internal variables or what appears on screen.
  2. Every step or action that could exist is placed into the main loop, only tripped by conditions being met (in our case, an image, text, or button being seen). A typical game loop runs fast, re-evaluating each condition on every pass, on the millisecond or microsecond scale. Ours will be slower, since each image search can take a second or two (see footnote 1).
  3. At any point, we can put in breaks, or even states (future tutorial). For now, making a loop like this allows us to send emails whether we are logged in or not, without having to think about branching!
General block diagram of how a game loop design works

Here’s an example of a game loop made for a random “idle” game like cookie clicker, that rewards the user for pressing the “s” key:

Pseudocode loop for an idle game
Actual code for the “game” described above. Don’t worry about running this. The point is to get a look at a real example of this on something simple
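The code for this example was published as an image, so here is a minimal sketch of a loop like the one described. It is not the original code. To keep it runnable anywhere, the hypothetical `key_events` argument stands in for real keyboard polling, and the `goal` and the "q" quit key are assumptions added so the loop has clear stop conditions.

```python
def run_idle_game(key_events, goal=5):
    """Toy game loop: each "s" keypress earns a point.

    `key_events` supplies one entry per pass through the loop, with None
    meaning "no key was pressed this time around".
    """
    score = 0
    for key in key_events:      # one pass through the loop = one "frame"
        if key == "q":          # break condition: the player quits
            break
        if key == "s":          # condition met: update the internal state
            score += 1
            print(f"Score: {score}")
        if score >= goal:       # stop condition: the game is won
            print("You win!")
            break
    return score
```

Every pass checks the same conditions in the same order, and most passes do nothing at all. That is the pattern we reuse for the emailer, with "do I see this image?" in place of "was the s key pressed?"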

Now back to designing for our emailer, each automated decision process can be summed up like this:

  1. Do I see this image? If not, hold tight until I do (do another run of the loop)
  2. Okay, I see this image, now I need to do this action (if it’s a button, click on it. If it’s a field, write in it.)
  3. … and that’s it. As it turns out, this leads to pretty simple looking code. Let’s try an example with the Sign in button:
When you want to click on this image
Use this code
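The snippet for this step was also published as an image. A sketch of what such a check can look like follows, assuming OpenCV is installed for the confidence argument. The function name `click_when_visible`, the `gui` parameter, and the attempt limit are hypothetical choices for this illustration.

```python
import time


def click_when_visible(image, gui=None, confidence=0.9, attempts=20, pause=0.5):
    """Look for `image` on each pass; click its center as soon as it is seen."""
    if gui is None:
        import pyautogui  # imported lazily because it needs a display
        gui = pyautogui
    for _ in range(attempts):
        try:
            spot = gui.locateCenterOnScreen(image, confidence=confidence)
        except Exception as err:
            # Some PyAutoGUI versions raise instead of returning None on a miss
            if type(err).__name__ != "ImageNotFoundException":
                raise
            spot = None
        if spot is not None:    # "Okay, I see this image"
            x, y = spot
            gui.click(x, y)
            return True
        time.sleep(pause)       # "hold tight" until the next pass
    return False                # gave up after `attempts` passes
```

For the Sign in button this would be called as `click_when_visible("signInButton.png")`.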

Now just repeat for every other image and add in the write functions.
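One way to do that repetition without copy-pasting is a table of images and actions. This is a sketch of the idea rather than the structure used in autoEmailer.py: `run_main_loop`, `handlers`, `locate`, and `final_image` are hypothetical names, and `locate` is assumed to be any function that returns a point when the image is on screen and None otherwise.

```python
import time


def run_main_loop(handlers, locate, final_image, max_passes=100, pause=0.5):
    """Check every known image on each pass and run the matching action.

    handlers:    dict mapping an image filename to a function taking the
                 point where that image was found
    final_image: the image whose action completes the whole task
    Returns True once the final action has run, False if we give up.
    """
    for _ in range(max_passes):
        for image, action in handlers.items():
            point = locate(image)
            if point is None:
                continue         # not on screen; check the next image
            action(point)
            if image == final_image:
                return True      # stop condition: the task is done
            break                # the screen has likely changed; start over
        else:
            time.sleep(pause)    # nothing recognized on this pass
    return False
```

Because every image is checked on every pass, the loop does not care whether you started logged in or logged out. It simply reacts to whichever screen shows up.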

Try it out for yourself with my files after you change the hardcoded fields. In a future tutorial, we might discuss these types of improvements:

  1. Creating event listeners and threads instead of arbitrary sleeps
  2. Creating fixed frame rates and using PyGame for better game loops
  3. Creating a config file so we don’t hardcode any variables
  4. Using OCR and/or machine learning trained image recognition AIs so we don’t need to collect reference images
  5. Introducing other APIs and packages

TL;DR:

  1. Complete tutorial 1 if you’re not already familiar with Python and PyAutoGUI.
  2. Go to my GitHub and download the Python file titled “autoEmailer.py” and the zipped file of images called “autoEmailerImages.zip” (unzip the images and make sure they’re in the same folder as autoEmailer.py)
  3. Edit the hardcoded fields, e.g. “username”, “password” and recipientList variable in the writeRecipients() function. (Alternatively, create a config file). Run the autoEmailer.py file
Change the text in these fields
Also change the highlighted values
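If you go the config file route instead, a minimal sketch using Python's built-in configparser could look like this. The section names, key names, and the example values are all assumptions made for illustration; autoEmailer.py does not ship with a config file.

```python
import configparser


def load_settings(text):
    """Parse INI-style text into the values the script would otherwise hardcode."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    raw_recipients = parser["email"]["recipients"]
    return {
        "username": parser["account"]["username"],
        "password": parser["account"]["password"],
        # Comma-separated recipients become a clean list
        "recipients": [r.strip() for r in raw_recipients.split(",") if r.strip()],
    }


# Hypothetical file contents; every value here is a placeholder
EXAMPLE = """
[account]
username = someone@example.com
password = not-a-real-password

[email]
recipients = first@example.com, second@example.com
"""

settings = load_settings(EXAMPLE)
print(settings["recipients"])
```

In practice you would read the text from a file kept next to the script and outside of version control, since it holds your password in plain text.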

Footnotes:

  1. From a math perspective, this is a ton of calculations. The PyAutoGUI documentation says it takes 1–2 seconds to locate an image on a 1920 x 1080 screen. The computer shifts a window pixel by pixel and compares what it sees to the reference image. If it were looking for a 40 x 40 image, it would start at the rectangle defined by [top left, bottom right] = [(0, 0), (40, -40)], then move to [(1, 0), (41, -40)], [(2, 0), (42, -40)] … [(1879, 0), (1919, -40)], [(1880, 0), (1920, -40)], then drop down a row to [(0, -1), (40, -41)], [(1, -1), (41, -41)] … all the way to [(1880, -1040), (1920, -1080)]. (Here y is written as negative going down; PyAutoGUI itself counts y as positive downward from the top-left corner.) For you algorithm nerds, this is a worst case of (a - c + 1)*(b - d + 1) window positions, where the screen size is a x b pixels and the image we want is c x d pixels. For a 40 x 40 image, that is 1881 x 1041, or over 1.9 million window positions to check if the image isn’t found until the bottom right, and a naive check compares up to c*d = 1,600 pixels at each position.
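To make the arithmetic concrete, here is a small pure-Python sketch of the sliding-window idea. It is an exact-match toy on nested lists, written only to illustrate the counting; it is not how PyScreeze is implemented, and both function names are made up for this example.

```python
def count_window_positions(screen_w, screen_h, image_w, image_h):
    """Number of places an image_w x image_h window fits on the screen."""
    return (screen_w - image_w + 1) * (screen_h - image_h + 1)


def naive_locate(screen, needle):
    """Return (x, y) of the top-left corner of the first exact match, or None.

    `screen` and `needle` are lists of rows, each row a list of pixel values.
    """
    screen_h, screen_w = len(screen), len(screen[0])
    needle_h, needle_w = len(needle), len(needle[0])
    for top in range(screen_h - needle_h + 1):         # move down row by row
        for left in range(screen_w - needle_w + 1):    # slide left to right
            if all(
                screen[top + row][left:left + needle_w] == needle[row]
                for row in range(needle_h)
            ):
                return left, top
    return None


print(count_window_positions(1920, 1080, 40, 40))
```

For a 1920 x 1080 screen and a 40 x 40 image, the count printed is 1958121, which is where the "over 1.9 million" figure comes from.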


William W
CodeX

Electrical Engineering. Software developer and cofounder of a startup.