JavaScript Variable With HTML Code Regex Email Matching
This Python script fails to output the email address example@email.com in this case. This was my previous post. How can I use BeautifulSoup or Slimit on a site to output
Solution 1:
Here is a rather interesting (I think) approach.
Instead of parsing this JavaScript code, execute it!
Get the ptr value, load it via BeautifulSoup and get the href attribute value from the a tag. Example using the V8 engine (through the PyV8 bindings):
from bs4 import BeautifulSoup
from pyv8 import PyV8
data = """
<script LANGUAGE="JavaScript">
function something()
{
var ptr;
ptr = "";
ptr += "<table><td class=france></td></table>";
ptr += "<table><td class=france><a href=mail";
ptr += "to:example@email.com>email</a></td></table>";
document.all.something.innerHTML = ptr;
}
</script>
"""
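# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(data, "html.parser")
# prepare the function to return a value and add a function call
js_code = soup.script.text.strip().replace('document.all.something.innerHTML = ptr;', 'return ptr;') + "; something()"
# run the modified script in a V8 context; eval() returns the value of ptr
ctxt = PyV8.JSContext()
ctxt.enter()
soup = BeautifulSoup(ctxt.eval(str(js_code)), "html.parser")
# the href is "mailto:example@email.com"; keep the part after the scheme
print(soup.a['href'].split('mailto:')[1])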
Prints:
example@email.com
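The last line above raises an error if the evaluated HTML has no a tag or the href is not a mailto link. As a rough sketch of that final step only, here is a more defensive version using just the standard library (Python 3). It assumes the JavaScript has already been executed and you have the resulting string. MailtoCollector is a made-up helper name, not part of any library:

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Hypothetical helper: collects addresses from <a href=mailto:...> tags."""

    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value and value.lower().startswith("mailto:"):
                self.addresses.append(value[len("mailto:"):])


# Assumption: this is the string the executed JavaScript function returned
evaluated_html = (
    "<table><td class=france></td></table>"
    "<table><td class=france><a href=mailto:example@email.com>email</a></td></table>"
)

collector = MailtoCollector()
collector.feed(evaluated_html)
collector.close()
print(collector.addresses)
```

An empty list then simply means no mailto link was found, instead of an exception.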
Solution 2:
Your problem is that you can't find "mailto" in your text because the first half, "mail", is not on the same line as the second half, "to". To solve your problem properly, you only have to know the value of ptr at the end of the function.
I know that this is a bad way to do it, but if you are sure that the structure is always like this:
soup = """
<script LANGUAGE="JavaScript"> function ...() 
{ var ptr; 
ptr = ""; 
ptr += "..."; 
ptr += "..."; 
ptr += "...";
document.all.something.innerHTML = ptr; 
}
</script> 
"""
You can use this:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(soup, "html.parser")
for script in soup.find_all('script'):
    #This matches everything between "{ var ptr;"
    #and "document"
    regex = "{ var ptr;(.*)document"
    code = re.search(regex, script.text, flags=re.DOTALL).groups()[0]
    #This is actually dangerous because anything
    #in the code will be executed here as Python.
    #It only works because these JavaScript
    #assignments happen to be valid Python too.
    #If it's like your example everything will
    #work fine and you can access the value of ptr
    exec(code)
    print(ptr)
Now you can use either BeautifulSoup or re to parse ptr. If you don't know how it's structured, you can use this:
    mail = re.search("<a href=mailto:(.*?)>", ptr).groups()[0]
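Since exec runs whatever the page contains, a more cautious variant of the same idea is to collect the string literals with a regular expression and join them, without executing anything. This is only a minimal Python 3 sketch. It assumes ptr is built exclusively from double-quoted literals via ptr = "..." and ptr += "...", and it does not decode JavaScript escape sequences. The names rebuild_ptr and extract_email are made up for illustration:

```python
import re

# Sample input: the script body from the question
script_text = '''
function something()
{
var ptr;
ptr = "";
ptr += "<table><td class=france></td></table>";
ptr += "<table><td class=france><a href=mail";
ptr += "to:example@email.com>email</a></td></table>";
document.all.something.innerHTML = ptr;
}
'''


def rebuild_ptr(js_source):
    # Grab the double-quoted literal from every `ptr = "..."` or `ptr += "..."`
    # statement and join the pieces in source order. Nothing is executed.
    pieces = re.findall(r'ptr\s*\+?=\s*"((?:[^"\\]|\\.)*)"', js_source)
    return "".join(pieces)


def extract_email(html):
    # Accept quoted or unquoted href values; return None if there is no match.
    match = re.search(r'href=["\']?mailto:([^"\'>\s]+)', html)
    return match.group(1) if match else None


ptr = rebuild_ptr(script_text)
email = extract_email(ptr)
print(email)
```

Because the pieces are joined before searching, it no longer matters that "mail" and "to" sit on different lines of the script.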